CPU emulated on embedded board

DrBunsen
Joined: Dec 20 2003
Posts: 946

Ok, if you have to ask "Why?" you'll never get it, but...

After somewhat insane discussions on PPCMLA and 68kMLA, the germ of an insane-but-just-barely-possible hack seems to be emerging.

Take a fast embedded microcontroller or system-on-chip, pin-map it to an m68k CPU socket (insert inbetweener gubbins), and emulate a 68k CPU in software (using open-source emulators under an RTOS/ucLinux) or in assembly (insert code to bitbang the pins). Plug it in: it looks like a 68k to the system, only faster (we hope).

Possibles include ColdFire and Gumstix.

Quotes from the above discussions to be ported here as needed.

Comments from people who actually understand this stuff welcome.

__________________

Damn the Torx screws, full speed ahead!
Apple and Wireless FAQ

DrBunsen
Joined: Dec 20 2003
Posts: 946

"Bunsen" wrote:

ie. a 600MHz Gumstix board in a 33MHz Quadra = 18 clock cycles

Cycle   Action
==================
0       Read pins
1-16    [Do stuff]
17      Write pins
==================
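
As a quick sanity check on that budget, here is a minimal Python sketch of the arithmetic. The function name and the one-cycle allowances for reading and writing the pins are assumptions made for illustration only; real I/O will almost certainly cost more than one host cycle each way.

```python
def cycle_budget(host_hz, target_hz, read_cycles=1, write_cycles=1):
    """Host cycles available per target bus clock, split as in the table.

    Assumes (optimistically) that reading and writing the pins each cost
    a fixed number of host cycles.
    """
    total = host_hz // target_hz               # whole host cycles per target clock
    work = total - read_cycles - write_cycles  # cycles left over to "do stuff"
    return {"total": total, "read": read_cycles, "work": work, "write": write_cycles}


budget = cycle_budget(600_000_000, 33_000_000)
print(budget)  # {'total': 18, 'read': 1, 'work': 16, 'write': 1}
```

Changing `read_cycles` and `write_cycles` to more realistic values shows how quickly the "do stuff" window shrinks.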


catmistake
Joined: Dec 20 2003
Posts: 1098
sorry to barge in like this...

Why not emulate the whole thing? Chip, system... and to make it really really fast... the PPCMLA & 68kMLA, too?
(the next big thing... emulated user forums!)

DrBunsen
Joined: Dec 20 2003
Posts: 946

Some of us just want faster SE/30s/Blackbirds. If you have to ask...

Although I must admit I've considered gutting a PowerBook 540 and sticking a Mini-ITX board in there, this isn't the case at hand. It's not really a serious proposal (at this stage :P), more just a "what if".


catmistake
Joined: Dec 20 2003
Posts: 1098
Re: Some of us

Processing power is getting so ridiculous... and this will tangent even more in the next few years...

You could run Basilisk II on Linux, in Parallels, on a newer Mac, and it would still be sickeningly faster than those boxes. My point is... I don't think it's about speed. We just love the old boxes. I'd really love to see what unabashed power Apple, today, could fit inside an SE/30 (my fav box): a half-decent cluster of minis with TB of space (disk & RAM) at the very least (LCD for CRT, though).

Eudimorphodon
Joined: Dec 21 2003
Posts: 1204
Hrm...

DrBunsen wrote:

ie. a 600MHz Gumstix board in a 33MHz Quadra = 18 clock cycles

Cycle   Action
==================
0       Read pins
1-16    [Do stuff]
17      Write pins
==================

Unless you have some fairly sophisticated glue hardware between your CPU socket and the other processor to buffer the states of the CPU pins, you'll be wasting a *lot* of cycles blocking on I/O. (You'll already need glue to convert between voltage levels, etc, of course.)

Let's say, ball-park, you'll need to sample pin state at 33MHz, and that you have to read, say, 90 pins. (That's a wild guess, based on figuring that half of the 179-pin PGA package is grounds. That's probably an overestimate; I don't feel like counting signal pins.) That's three or four 32-bit I/O ports' worth of bits to read. Unless you have *very* sophisticated glue able to stuff all those bit states directly into RAM via DMA, figure you're blowing three or four I/O instructions just to read the pins. Even assuming you could read those states out of latching buffers in one cycle each at the full 200MHz bus speed of your StrongARM CPU (which is unlikely), you'll be spending almost half the time until your next clock tick getting that data into your emulator.

Which leaves you pretty boned, since you'll spend the *other* half of your time *writing* pin states into your I/O latches. (Note, of course, that said output latches will have to be designed to only *actually* change state at appropriate points in the CPU socket's clock cycle. This is starting to look really ugly, and possibly impossible to do without DMA unloading some of your I/O overhead.)
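
To put rough numbers on that, here is a small Python sketch of the same back-of-envelope estimate. The figures (90 signal pins, 32-bit ports, a 200MHz I/O bus, a 33MHz socket clock, one bus cycle per port read) are the guesses from the paragraph above, and the function name is made up for illustration.

```python
import math


def io_budget(signal_pins, port_width_bits, bus_hz, target_hz):
    """Estimate how much of each target clock tick goes to reading pins.

    Assumes one bus cycle per port read, which is the optimistic case.
    """
    reads = math.ceil(signal_pins / port_width_bits)  # port reads per sample
    bus_cycles_per_tick = bus_hz / target_hz          # bus cycles per target clock
    read_fraction = reads / bus_cycles_per_tick       # share of the tick spent reading
    return reads, bus_cycles_per_tick, read_fraction


reads, cycles, frac = io_budget(90, 32, 200_000_000, 33_000_000)
print(f"{reads} port reads out of about {cycles:.1f} bus cycles per tick")
print(f"fraction of each tick spent reading: {frac:.3f}")
```

With those guesses, three reads eat roughly half of the six or so bus cycles in each tick, and writing the outputs back costs about the same again.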

Basically what you're proposing doing is making an In-Circuit-Emulator (an *extremely expensive* engineering test tool involving a wad of circuitry which watches a CPU bus, logs data to a fast built-in memory buffer, and can halt the included CPU at specific "breakpoints" determined by counting cycles), subtracting the actual CPU being traced, and streaming the collected data into a host CPU fast enough for it to respond at wire speed. My gut suspicion is that it probably is "doable", but you'll need a *lot* of very sophisticated hardware glue to leave the host CPU enough clock cycles to run an emulator fast enough to even match the real hardware.

Probably the only chance you'd have to do it at anything approaching a reasonable size/power budget would be to use one of the FPGA/CPU core hybrid chips. (Something like a Xilinx Virtex-II Pro, which has an embedded PowerPC core.) The embedded core on those devices isn't exactly greased lightning, but with enough work you *might* match the performance of the original CPU. Any improvement is going to require an FPGA or two *plus* a huge hairy-chested CPU. Good luck sticking that in a Duo. ;^)

--Peace

DrBunsen
Joined: Dec 20 2003
Posts: 946
Re: Hrm...

Wow, Eudi, thanks for the reply. Clearly you're the man to come to when I come up with something realistically doable. You even managed to explain it in terms I understand.

I realise this is simply an idle brainfart. I'm learning a lot about analogue and digital design at the moment, and I'm interested in discussing what's feasible, and what's not - at least in part so I don't spend months chasing wild geese.

Thanks.

Eudimorphodon wrote:

and possibly impossible to do without DMA unloading some of your I/O overhead.)

So would adding a DMA controller move it towards possible?

Quote:

Basically what you're proposing doing is making an In-Circuit-Emulator

Yeah, I've heard of them. I realised that was basically what I was proposing, but on a SOC rather than piped to outboard analysis gear.

Quote:

Probably the only chance you'd have to do it ... would be to use one of the FPGA/CPU core hybrid chips. (Something like a Xilinx Virtex-II Pro, which has an embedded PowerPC core.)

Oh no, now look what you've gone and done...


DrBunsen
Joined: Dec 20 2003
Posts: 946
an FPGA or two *plus* a huge hairy-chested CPU.

"Xilinx" wrote:

Virtex-4 Multi-Platform FPGA

Up to two 450MHz embedded IBM PowerPC 405

SelectIO™
Ultimate Parallel Connectivity
* 1+ Gbps differential I/O
* 600 Mbps single-ended I/O
* ChipSync™ source-synchronous interface
* 16 I/O banks

RocketIO™
* connect or bridge "anything to anything" with up to 24 serial transceivers, 622 Mbps to 6.5 Gbps, full duplex

Integrated 10/100/1000 Ethernet

Yers, that should do nicely...


Eudimorphodon
Joined: Dec 21 2003
Posts: 1204
Re: an FPGA or two *plus* a huge hairy-chested CPU.

DrBunsen wrote:

Yes, that should do nicely...

It's probably capable of pulling off something like this, yeah. Do keep in mind that its primary application is in high speed routers and firewalls (we're talking heavy-iron $10,000 Cisco interface cards, not NetGear paperweights) so you're still not putting it in a laptop. Low power consumption it *ain't*.

For reference, here's a block diagram of the 68040:

At the very minimum you're going to have to become *very familiar* with the workings of the box labeled "Bus Controller" and duplicate that in the FPGA. Ideally you'd also provide some hardware to streamline the instruction fetch and write-back stages so your emulator doesn't need to block on polling I/O, and can be presented with an instruction/data stream rather than an array of pin states. The *really* hard part of this is going to be emulating the instruction and data cache units. Said units are optimized to squeeze data as fast as possible through the 68040's bus and operate semi-independently of the ALU portion of the chip. If you end up running your cache in software that's going to seriously impact the cycles left over for executing code.
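
To give a feel for what "running your cache in software" costs, here is a toy Python model of a single cache lookup. The geometry (4 KB, 4-way set-associative, 16-byte lines) is what I understand the 68040's on-chip caches to be; the class name and the LRU replacement are simplifications for illustration, not a claim about how the real chip replaces lines.

```python
LINE_BYTES = 16   # bytes per cache line
SETS = 64         # 64 sets x 4 ways x 16 bytes = 4 KB
WAYS = 4


class SoftCache:
    """Tag-only model of a set-associative cache (no data storage)."""

    def __init__(self):
        # Each set holds up to WAYS tags, least recently used first.
        self.sets = [[] for _ in range(SETS)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // LINE_BYTES
        index = line % SETS
        tag = line // SETS
        ways = self.sets[index]
        if tag in ways:
            ways.remove(tag)       # move to most-recently-used position
            ways.append(tag)
            self.hits += 1
            return True
        if len(ways) == WAYS:
            ways.pop(0)            # evict least recently used (assumed policy)
        ways.append(tag)
        self.misses += 1
        return False


cache = SoftCache()
for addr in range(0, 64, 2):       # 32 sequential 16-bit fetches
    cache.access(addr)
print(cache.hits, cache.misses)    # 28 4
```

Every one of those lookups is a handful of divides, compares and list operations on the host, which is exactly the bookkeeping that would be eating into the cycles left for executing code.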

The ironic part of this is that by swapping the integer/floating point units for another CPU you'll basically be exchanging a lot of specialized CISC execution hardware for what's basically a "microencoded" CPU engine. (The emulation software takes the place of a microcode ROM.) In principle this lashup is *vaguely* similar to how "sixth generation" x86 CPUs operate, in that the ALU core is generally designed as a simple RISC or VLIW engine which processes "micro-ops" derived from CISC code via the very sophisticated decoder stages in those processors. The difference, of course, is that you're trying to substitute software on a general-purpose CPU for that specialized decoding/reordering hardware. Getting good performance out of this is going to be *hard*. I could almost see making use of *two* PowerPC cores each running half the emulator, with one doing the instruction decoding, constantly "just in time compiling" incoming data to a native PowerPC instruction stream, and the other executing said instructions and feeding results back out to the bus controller...
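
A toy Python sketch of that decode/execute split, purely to illustrate the idea. It handles only three 68k opcodes (NOP, MOVEQ and ADDQ.L to a data register), ignores condition codes entirely, runs both halves in one process rather than on two cores, and caches translations by opcode value instead of by address. The function and variable names are invented for this example.

```python
def decode(opcode):
    """Translate one 16-bit 68k opcode into a host-side callable (toy subset)."""
    if opcode == 0x4E71:                      # NOP
        return lambda regs: None
    if opcode & 0xF100 == 0x7000:             # MOVEQ #imm8,Dn
        reg = (opcode >> 9) & 7
        imm = opcode & 0xFF
        if imm & 0x80:                        # sign-extend to 32 bits
            imm |= 0xFFFFFF00

        def moveq(regs, reg=reg, imm=imm):
            regs[reg] = imm
        return moveq
    if opcode & 0xF1F8 == 0x5080:             # ADDQ.L #data,Dn
        data = ((opcode >> 9) & 7) or 8       # an encoded 0 means 8
        reg = opcode & 7

        def addq(regs, reg=reg, data=data):
            regs[reg] = (regs[reg] + data) & 0xFFFFFFFF
        return addq
    raise NotImplementedError(f"opcode {opcode:#06x} not in toy subset")


translation_cache = {}                        # decoded ops, reused on later hits


def run(program, regs):
    """'Execute' stage: fetch a translated op (decoding on a miss) and call it."""
    for opcode in program:
        op = translation_cache.get(opcode)
        if op is None:
            op = translation_cache[opcode] = decode(opcode)
        op(regs)
    return regs


# MOVEQ #5,D0 ; ADDQ.L #1,D0 ; ADDQ.L #1,D0 ; NOP
regs = run([0x7005, 0x5280, 0x5280, 0x4E71], [0] * 8)
print(regs[0])  # 7
```

The second ADDQ.L is served from the translation cache, which is the whole point of decoding ahead of time: the expensive bit-picking happens once, and the executing side only ever sees ready-made host operations.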

You know, it's starting to look a lot more economical to just piggyback a PowerPC G3/G4 onto the socket alongside the 68040 in the manner of those old Quadra 605 PPC 601 CPU upgrades and run it natively. Apple already wrote the software for it (just take a bit of hacking), and you get native-mode execution to boot. :^b

--Peace

DrBunsen
Joined: Dec 20 2003
Posts: 946
Re: an FPGA or two *plus* a huge hairy-chested CPU.

Eudimorphodon wrote:

You know, it's starting to look a lot more economical to just piggyback a PowerPC G3/G4 onto the socket alongside the 68040 in the manner of those old Quadra 605 PPC 601 CPU upgrades and run it natively. Apple already wrote the software for it (just take a bit of hacking), and you get native-mode execution to boot. :^b

That's much more what I was thinking as a possible use for those Xilinx boards - replicating the non-CPU parts of Apple's 601 upgrade cards in the FPGA, hacked to support the dual 405 CPUs.


Eudimorphodon
Joined: Dec 21 2003
Posts: 1204
Forget the integrated cores

DrBunsen wrote:

That's much more what I was thinking as a possible use for those Xilinx boards - replicating the non-CPU parts of Apple's 601 upgrade cards in the FPGA, hacked to support the dual 405 CPUs.

You'd be better off using a stand-alone FPGA and a G3/G4 CPU. (Preferably one with on-die cache if you can make it go.) The 405 core lacks an FPU, has a somewhat different MMU from the 60x/7xx series, and has some instruction/hardware extensions, all of which could pose difficulties for running unmodified Macintosh binary code on it. The lack of an FPU alone means you'd have to write a floating-point emulator that works with the OS. Undoubtedly you could borrow code from Linux or the like, as there are embedded variants that run on the 405, but since you don't have the source code to MacOS, integrating it is going to be a problem.

I'm assuming this is all a paper hack? Duplicating the Apple PPC upgrade card is probably very doable, but figure on spending several thousand dollars on hardware (you'll need high-quality PCBs for this, not breadboards) and a lot of reverse-engineering time and expertise. Having the complete specs for the ASICs in the original cards would be a huge help.

Somehow I don't see you recouping your investment.

--Peace

DrBunsen
Joined: Dec 20 2003
Posts: 946
Coldfire

Eudimorphodon wrote:

I'm assuming this is all a paper hack?

Oh absolutely.

I chased up some Coldfire links as well. Seems like there's a couple of vague possibilities that go something like this:

Either:

Mac -> FPGA -> Coldfire SBC -> ucLinux/Debian/BSD -> JIT 68k emulator for Coldfire

Or:

hw/~ Mac PDS -> FPGA -> Coldfire CPU and support chipset
+
sw/~ Open Source OS (say NetBSD/mac68k) recompiled for Coldfire/homebrew PDS support, with CF68kLib (see below). Run JIT version of Basilisk II for MacOS support.

From MLA thread

"Bunsen" wrote:
"wikipedia" wrote:

The Freescale ColdFire is a 68k / for embedded systems / not entirely object code compatible with the 68000. /

Newer models of ColdFire are compatible enough with 68k processors that it is now possible to create binary compatible Amiga clones. The Debian project is currently working {sic} on making its m68k port compatible with the ColdFires / They can be clocked as high as 300MHz / without overclocking.


Quote:

CF68KLib: 68K Emulation for ColdFire

Key features /

* Emulation library / to implement 680x0 / instructions and addressing modes missing from the ColdFire architecture. /
* Runs code written in any language, typically with no modifications /
* / specify which 680x0 family processor you wish to emulate /. The utility then generates ColdFire assembly-language source code /

CF68KLib is available for download free of charge!

/
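
One plausible way for a library like that to fill in missing instructions is trap-and-emulate: let the CPU run what it supports natively, and catch the illegal-instruction exception for everything else. The Python sketch below only illustrates that general pattern. It is not CF68KLib's actual code, and the op names are illustrative strings rather than a claim about which instructions a given ColdFire core lacks.

```python
class IllegalInstruction(Exception):
    """Stand-in for the CPU's illegal-instruction exception."""


def native_add(regs, dst, src):
    regs[dst] = (regs[dst] + regs[src]) & 0xFFFFFFFF


NATIVE = {"add": native_add}       # ops the (hypothetical) core runs itself


def native_execute(regs, op, *args):
    """Stand-in for the hardware: runs supported ops, traps on the rest."""
    handler = NATIVE.get(op)
    if handler is None:
        raise IllegalInstruction(op)
    handler(regs, *args)


def emulate_rol(regs, dst, count):
    """Software 32-bit rotate-left for cores assumed to lack one."""
    value = regs[dst] & 0xFFFFFFFF
    count %= 32
    regs[dst] = ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


EMULATED = {"rol": emulate_rol}    # trap handlers for the missing ops


def execute(regs, op, *args):
    try:
        native_execute(regs, op, *args)
    except IllegalInstruction:
        EMULATED[op](regs, *args)  # slow path: emulate, then carry on


regs = {"d0": 0x80000001, "d1": 1}
execute(regs, "add", "d1", "d1")   # runs "natively"
execute(regs, "rol", "d0", 4)      # traps and is emulated
print(hex(regs["d0"]), regs["d1"])  # 0x18 2
```

The appeal of this approach is that code sticking to the supported subset runs at full speed, and only the rare unsupported instruction pays the cost of the trap.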

You might also be interested in this PPCMLA discussion

CF68KLib FAQ
CoLiBREE - Coldfire Linux Brisk Embedded Engine - a ucLinux port


DrBunsen
Joined: Dec 20 2003
Posts: 946
Re: ColdFire

Some variant of the ColdFire looks a likely suspect: take your pick of up to 266MHz, 142 I/O pins @ 3.3V or 5V, external SDRAM/DDR support +/-DMA, Ethernet, USB Host or OTG, SVGA LCD driver and all the usual serial board-level I/O interfaces.


DrBunsen
Joined: Dec 20 2003
Posts: 946
Re: an FPGA or two *plus* a huge hairy-chested CPU.

Eudimorphodon wrote:

For reference, here's a block diagram of the 68040:

At the very minimum / "Bus Controller" / duplicate that in the FPGA. / also provide some hardware to streamline the instruction fetch and write-back stages / The *really* hard part of this is going to be emulating the instruction and data cache units. / If you end up running your cache in software that's going to seriously impact the cycles left over for executing code.

So would it be possible to combine the "streamline" fetch/write hardware you mention and the pseudo-cache? I mean what you're talking about is essentially a real cache before the pseudo cache if I'm reading this right.

So emu the cache stages on the FPGA, or use physical RAM (DDR perhaps) with a DMA chip as cache, and emulate only the ALU and FPU in software?

There seems to be a lot of documentation available for the 68030; I haven't looked for much on the '040 yet. As the cult machines people want to overdrive (eg SE/30) are often '030 based, maybe this is a better target than the '040.


trag
Joined: Dec 20 2003
Posts: 57
Re: an FPGA or two *plus* a huge hairy-chested CPU.

DrBunsen wrote:

Yers, that should do nicely...

The Virtex-4 starts at about $200 per chip and has very little additional logic available (relatively speaking) to tie things together. Because the PPC chips are doing most of the work already, one might not need many gates to tie things together, but the chip cost alone makes them impractical.

But in addition to the chip cost, the only packages available are BGA packages, which means that you'd have to pay another large sum to have them soldered to boards in small quantities. If you could sell 10,000 of them, the cost of soldering would be low, but in 100 or 200 lots, the soldering cost would probably be close to $100 each.

Commenting on a later post, by a different author, I think, the Turbo601 upgrade from Daystar actually had a set of 6100 ROMs on board. This causes me to suspect that any upgrade which uses a PPC chip at its heart is going to need something similar.

I couldn't find specs on the GumStix website, but from the description, I bet that's something like an 8 bit microcontroller. No matter how fast the clock speed, trying to emulate a 32 bit machine in an 8 bit processor is going to be a losing proposition.

DrBunsen
Joined: Dec 20 2003
Posts: 946
Re: an FPGA or two *plus* a huge hairy-chested CPU.

trag wrote:

the Turbo601 upgrade from Daystar actually had a set of 6100 ROMs on board.

There are G3 upgrades available cheaply for the 6100 ... so what about a Turbo601 clone or adapter with a 6100 PDS slot on it?

Quote:

I couldn't find specs on the GumStix website, but from the description, I bet that's something like an 8 bit microcontroller.

It's a 32 bit ARM running ucLinux


Joined: Jan 10 2006
Posts: 22
Re: an FPGA or two *plus* a huge hairy-chested CPU.

trag wrote:

Commenting on a later post, by a different author, I think, the Turbo601 upgrade from Daystar actually had a set of 6100 ROMs on board. This causes me to suspect that any upgrade which uses a PPC chip at its heart is going to need something similar.

Apple did in fact license the ROMs to Daystar for the Turbo 601 PPC upgrade card, which they also sold under their name.
Sequoia Mac Users group also reports the licensing of the ROMs.
Apparently, Daystar's planned 68060 accelerator needed to have Apple ROMs to function and Apple did not want this to compete with the early PPCs, so denied them a ROM license for the 68060 accelerator.

I think we need to transfer this to one of the Japanese Old Mac discussion boards. If there's anyone who's going to take the time (and has the tools) to do this, it's them.