02/13/03 13:30
#38966 - RE: Redundant controller system
Menno:
Back in the early 1980s I worked on a fault-tolerant computer system concept using 8086 processors from Intel (back when these processors had just been introduced!). We studied the bus comparison concept using external hardware and determined that the huge mound of extra hardware involved lowered the calculated system MTBF numbers to unacceptable levels. So we decided to use a software technique instead, aided by a small amount of hardware. The company eventually patented the concept (the patent expired a few years ago).

Let me try to describe how the system worked. It was cool because the scheme did not add huge expense to each processor module, and it allowed the processors to be identical or to be implemented in completely different technologies. Before I describe how it worked, I would comment that the effort needed to support the concept in software was not huge, but the implementation did have to be incorporated into the software from the start. It would not have been practical to take an existing system and software design of arbitrary configuration and say "let's build a fault-tolerant version of this". So to do this right you need to plan it from the beginning.

The concept required each processor module to have its processing organized into a timed task architecture. In other words, the main task/loop in the software had to be structured so that it completed in a fixed period of time (we used a period of 100 msec in the designs we did). One special task in this loop had the job of computing a 16-bit number derived from the system algorithms. In other words, this number was derived from the computational results of each of the individual tasks in the main loop and was captured once every 100 msec. It may sound complicated, but deriving this number was actually relatively easy.
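As a rough illustration of the timed frame described above, here is a minimal C sketch, not the original 8086 implementation. The task names, their return values, and `run_frame` are hypothetical; the wrapping 16-bit sum is just one possible way to fold the task results into a check word.

```c
#include <stddef.h>
#include <stdint.h>

/* A task does its share of the frame's work and returns a 16-bit result. */
typedef uint16_t (*task_fn)(void);

/* Hypothetical example tasks. Real tasks would return values derived from
 * whatever they actually computed during this 100 msec frame. */
static uint16_t task_inputs(void)  { return 0x1234u; }
static uint16_t task_control(void) { return 0x00FFu; }
static uint16_t task_outputs(void) { return 0xF000u; }

const task_fn frame_tasks[] = { task_inputs, task_control, task_outputs };
#define NUM_TASKS (sizeof frame_tasks / sizeof frame_tasks[0])

/* Run every task once and fold the results into one 16-bit check word.
 * Assumption: a wrapping sum is used here purely for illustration. */
uint16_t run_frame(const task_fn tasks[], size_t n)
{
    uint16_t check = 0;
    for (size_t i = 0; i < n; i++) {
        check = (uint16_t)(check + tasks[i]());
    }
    return check;
}
```

In a real system, `run_frame(frame_tasks, NUM_TASKS)` would be called once per 100 msec timing tick and its return value handed to the comparison hardware.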
Each task or process in the main loop could simply maintain a variable whose value depended on what that task was doing, with that value synchronized and updated at the 100 msec rate. The special main loop task I described first had the job of combining the results of the individual tasks into a final 16-bit number. The algorithm could be anything you wanted to come up with: the sum of the individual task results, a CRC of the individual task results, or even a set of bit fields in the final word with a certain number of bits contributed by each task.

The special top-level task would then send this 16-bit number out to a special hardware module once every 100 msec. The hardware compared the words from the three processors in the system and determined who was right and who was wrong. It then raised an alert and showed what went wrong via a scheme appropriate to the system at hand. The circuit would assert the reset pin of the faulty processor and hold it in reset while letting the other two continue to run. If the other two ever failed to compare, the whole system was stopped.

On those 8086 systems the processors were parallel-bus boards hosted in Intel Multibus, so the 16-bit check words were transmitted in parallel. These days many designs use embedded processors with on-board flash and no external bus at all, but the scheme can still be used. I would devise an FPGA to do the hardware support logic and have it accept the 16-bit word from each processor serially. This could be implemented however suits your design. The simplest approach is just two pins: a data line carrying NRZ output data and a clock line whose 16 pulses clock out the data.
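The comparison logic lived in hardware, but its behavior can be modeled in a few lines of C. The sketch below is only a software model under stated assumptions: `voter_t` and `voter_step` are hypothetical names, and stopping the system when all three words differ is an assumption, since the description above only covers the case where the two remaining processors disagree.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 3

typedef struct {
    bool in_reset[NUM_CPUS]; /* true once a processor is held in reset */
    bool halted;             /* true once the whole system is stopped  */
} voter_t;

/* Compare the check words received for one 100 msec frame. */
void voter_step(voter_t *v, const uint16_t word[NUM_CPUS])
{
    if (v->halted) {
        return;
    }

    /* Collect the indices of the processors that are still running. */
    int run[NUM_CPUS];
    int n = 0;
    for (int i = 0; i < NUM_CPUS; i++) {
        if (!v->in_reset[i]) {
            run[n++] = i;
        }
    }

    if (n == 3) {
        bool ab = (word[0] == word[1]);
        bool ac = (word[0] == word[2]);
        bool bc = (word[1] == word[2]);
        if (ab && ac) {
            return;                    /* all three agree */
        }
        if (ab) {
            v->in_reset[2] = true;     /* processor 2 was outvoted */
        } else if (ac) {
            v->in_reset[1] = true;     /* processor 1 was outvoted */
        } else if (bc) {
            v->in_reset[0] = true;     /* processor 0 was outvoted */
        } else {
            v->halted = true;          /* assumption: no majority, stop */
        }
    } else if (n == 2) {
        /* Only two left: any mismatch stops the whole system. */
        if (word[run[0]] != word[run[1]]) {
            v->halted = true;
        }
    }
}
```

An FPGA version would do the same comparisons on the three shifted-in words and drive the reset pins directly from the equivalent of `in_reset`.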
Another important aspect of the design concept was that the special hardware circuitry controlled the resets of all the processors and provided the 100 msec timing signal to each one. So at power-on (or system restart) the resets of all the processors were released together, and each processor's software then synchronized itself to the 100 msec timing signal.

There are many other details to consider in such a system that are too involved to discuss here, so I'll leave them for an implementer to work out. The hardest of these by far is figuring out how to bring a newly hot-plugged processor up to the state of the two already running, so that very soon after start-up the 16-bit numbers generated by the new processor come into alignment with the values generated by the other two. The solution to this takes some thinking. The algorithms for generating the 16-bit number need to be based largely on system state and then on "short-term" computational results. That way the new processor can "come up to speed" in a few seconds or so. One does not want the 16-bit number to be a totally cumulative result that depends on how long the software has been running since reset.

Have Fun !!
Michael Karas
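The point about hot-plugged processors can be illustrated with a small C sketch. Everything here is hypothetical (the `system_state_t` fields, the function names, and the use of a plain sum); it only contrasts a check word computed from current state with one that accumulates over the whole run time.

```c
#include <stdint.h>

/* Hypothetical snapshot of what a processor knows in the current frame. */
typedef struct {
    uint16_t mode;        /* long-lived system state              */
    uint16_t setpoint;    /* long-lived system state              */
    uint16_t last_output; /* short-term computational result      */
} system_state_t;

/* State-based check word: depends only on the current state, so a newly
 * started processor matches the others as soon as its state matches. */
uint16_t check_word_from_state(const system_state_t *s)
{
    return (uint16_t)(s->mode + s->setpoint + s->last_output);
}

/* Cumulative check word (the kind to avoid): folds in every past frame,
 * so its value depends on how long the software has been running. */
uint16_t check_word_cumulative(uint16_t previous, const system_state_t *s)
{
    return (uint16_t)(previous + check_word_from_state(s));
}
```

With the cumulative version, a processor that has run ten frames and one that has just started produce different words even when their state is identical. How the new processor actually obtains the current state is one of the details left to the implementer.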



