??? 01/01/08 19:47 | Modified: 01/01/08 19:48
#148891 - The key, IMHO, is in the number of data paths | Responding to: ???'s previous message
Russell Bull said:
I found the TTL and bit-slice parts too slow, though I liked the Motorola MC10800 series of ECL bit slice.

Russ, historically the ALU was treated as a functional block: you had the two inputs, the output, and the function-selection bits, plus the status bits. The 74181 and 74182 parts were a popular choice in the early 70's, then along came the 29xx series of bit-slice parts, so the choice of ALU functionality was sorta set in concrete (silicon??).

I can think of two methods of implementing an ALU in an FPGA: implement it as bare logic, as you would with TTL gates, or use a lookup-table-driven approach. The first method is usually not the most efficient in FPGAs, so the solution might lie in using both approaches. That choice may be made by Verilog; I haven't any experience with this. Nevertheless, the status bits can be expressed as a function of the inputs, whether you choose to intercept the carry at the adder-to-adder level as you would in a TTL implementation, or let Verilog decide by giving it the equation. If it were me, I'd try a few solutions and see which generated the best result; lowest LUT usage is usually a good gauge, or else speed.
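Here's a minimal sketch of what I mean by the equation-driven style, in Verilog (the module and signal names are mine, invented for illustration, not anything from Russ's design): you give the synthesizer the equations and let it choose the carry structure, and the status bits fall out as functions of the result.

// Minimal behavioural ALU sketch: the synthesizer, not the designer,
// decides how to build the carry chain from these equations.
module alu #(parameter W = 8) (
    input  wire [W-1:0] a,
    input  wire [W-1:0] b,
    input  wire [1:0]   op,     // 00 add, 01 sub, 10 and, 11 or
    output reg  [W-1:0] y,
    output reg          carry,  // meaningful for add/sub only
    output wire         zero
);
    always @* begin
        carry = 1'b0;                      // default: no carry for logic ops
        case (op)
            2'b00: {carry, y} = a + b;     // tool infers the adder and carry-out
            2'b01: {carry, y} = a - b;     // borrow comes out in 'carry'
            2'b10: y = a & b;
            2'b11: y = a | b;
        endcase
    end
    assign zero = (y == {W{1'b0}});        // status bit as a function of the result
endmodule

Comparing the LUT count of this against a hand-built, 74181-style gate-level version is exactly the kind of experiment I mean.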
I can understand the point Richard is making. I was taught CPU design in the early 80's using TTL and bitslice. The general push was to get as much performance out of the least silicon, and it made good sense back then. However, with FPGAs you can easily choose between having the address calculation done by its own logic or through the common ALU; the tradeoff is LUT usage. Depending on the design, it may make sense to feed it through the common ALU, since the data paths are already there and there may be no penalty in performance. I've learnt over the years that applying old design techniques to FPGAs is rarely the most efficient solution.

I think if Richard were to point out the flaws in your process, he'd end up designing the whole thing for you, which is not what you want and not what he'd want to do (otherwise he would have done it already). So soldier on, and I dare say once you've achieved the result you'll understand where you could have improved things. It's all good learning. BTW, the software CPU emulations I've looked at use arrays to solve the arithmetic operations; it's the fastest and most portable way of doing it.

The design flows from the way of thinking. If one thinks in terms of software, one easily falls into the trap of using a different path for each operation. That has little consequence in software, which isn't impacted by separate paths, even if one path takes a millisecond and another a microsecond. In hardware, where things happen concurrently, it makes a great deal of difference, and hence requires analysis for each path, multiplied by the number of paths. If all input paths are similar, and all output paths are similar, then only the variations in path length through the ALU are of consequence. If those are all similar, then the required analysis becomes reasonable and can be completed within a lifetime. If the path delays from source to ALU and from ALU to destination can be balanced against the delay through the ALU itself, then pipelining can make the latency three clocks while raising the throughput to one result per clock (see the sketch at the end of this post).

Now, as Andy has implied, it makes little sense to put an 805x core in an FPGA. That's borne out by the price comparison. You can buy a pretty fast 805x, with a lot of on-chip peripherals you generally don't need, for less than $10US in 100-piece quantities. A moderate-size FPGA, e.g. the XC3S500E on my Spartan 3E eval board, runs about $30US. That FPGA has only 360k bits of block RAM and 73k bits of distributed RAM, about 433k bits in all, which is not enough for anything beyond about 54KB of memory, whether code or data, and that assumes no block RAM is spent as a ROM lookup table for decoding opcodes (also sketched below). Further, it's not likely one could use all the distributed RAM constructively, since that's LUT memory normally used for logic.

One conclusion one might reach is that implementing a commercially available MCU core in an FPGA is justifiable only where (a) it is the object of a learning exercise, as it is in Russ's project, or (b) one has very specific requirements of size, power, and performance not attainable in any other way, in combination with very unusual hardware only practically implementable in an FPGA.

RE
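P.S. To make the pipelining point concrete, here's a sketch (names invented, reusing the alu module from the sketch earlier in the thread, and assuming the three path delays really are roughly balanced): register the operands on the way in, register the ALU result, register the write-back. Latency is three clocks, but a new result comes out every clock.

// Three-stage pipeline wrapped around the earlier ALU sketch:
// stage 1 registers the operands, stage 2 registers the ALU output,
// stage 3 registers the write-back. Latency = 3 clocks, rate = 1/clock.
module alu_pipe #(parameter W = 8) (
    input  wire         clk,
    input  wire [W-1:0] a_in,
    input  wire [W-1:0] b_in,
    input  wire [1:0]   op_in,
    output reg  [W-1:0] result
);
    reg  [W-1:0] a_r, b_r;      // stage-1 registers
    reg  [1:0]   op_r;
    reg  [W-1:0] y_r;           // stage-2 register
    wire [W-1:0] y;

    alu #(.W(W)) u_alu (
        .a(a_r), .b(b_r), .op(op_r),
        .y(y), .carry(), .zero()    // flags unused in this sketch
    );

    always @(posedge clk) begin
        a_r  <= a_in;  b_r <= b_in;  op_r <= op_in;  // stage 1: source -> ALU
        y_r    <= y;                                 // stage 2: through the ALU
        result <= y_r;                               // stage 3: ALU -> destination
    end
endmodule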
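And on the decode point, here's roughly what an opcode-decode ROM looks like when inferred into block RAM (the 16-bit control word and the decode.hex contents file are invented placeholders): a synchronous read of an initialized array, which the tools map onto a block RAM. Even this small table, 256 x 16 = 4k bits, comes straight out of the 360k-bit budget that would otherwise hold code or data.

// Opcode-decode ROM inferred into block RAM: a registered read of an
// initialized array is the usual inference pattern.
module decode_rom (
    input  wire       clk,
    input  wire [7:0] opcode,
    output reg [15:0] ctrl        // control word; width and fields invented
);
    reg [15:0] rom [0:255];
    initial $readmemh("decode.hex", rom);  // table contents supplied elsewhere

    always @(posedge clk)
        ctrl <= rom[opcode];               // synchronous read => block RAM
endmodule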