

SPACE SHUTTLE FAULT TOLERANCE: ANALOG AND DIGITAL TEAMWORK

Hugh Blair-Smith, Down to the Metal, Dennis, MA
(formerly of C. S. Draper Lab, Cambridge, MA)

Abstract

The Space Shuttle control system (including the avionics suite) was developed during the 1970s to meet stringent survivability requirements that were then extraordinary but today may serve as a standard against which modern avionics can be measured. In 30 years of service, only two major malfunctions have occurred, both due to failures far beyond the reach of fault tolerance technology: the explosion of an external fuel tank, and the destruction of a launch-damaged wing by re-entry friction.

The Space Shuttle is among the earliest systems (if not the earliest) designed to a “FO-FO-FS” criterion, meaning that it had to Fail (fully) Operational after any one failure, then Fail Operational after any second failure (even of the same kind of unit), then Fail Safe after most kinds of third failure. The computer system had to meet this criterion using a Redundant Set of 4 computers plus a backup of the same type, which was (ostensibly!) a COTS type. Quadruple redundancy was also employed in the hydraulic actuators for elevons and rudder. Sensors were installed with quadruple, triple, or dual redundancy. For still greater fault tolerance, these three redundancies (sensors, computers, actuators) were made independent of each other so that the reliability criterion applies to each category separately. The mission rule for Shuttle flights, as distinct from the design criterion, became “FO-FS,” so that a mission continues intact after any one failure, but is terminated with a safe return after any second failure of the same type.

To avoid an unrecoverable flat spin during the most dynamic flight phases, the overall system had to continue safe operation within 400 msec of any failure, but the decision to shut down a computer had to be made by the crew. Among the interesting problems to be solved were “control slivering” and “sync holes.” The first flight test (Approach and Landing only) was the proof of the pudding: when a key wire harness solder joint was jarred loose by the Shuttle’s being popped off the back of its 747 mother ship, one of the computers “went bananas” (actual quote from an IBM expert).

MIT Roles In Apollo And Space Shuttle

I was never on NASA’s payroll, but worked as a staff member (1959-1981) at the Charles Stark Draper Laboratory, formerly the MIT Instrumentation Lab in the Apollo years when it was the prime contractor for Guidance, Navigation & Control. CSDL, spun off as an independent corporation in 1973 but still a part of the MIT community, was in a consulting role to NASA and IBM Federal Systems in the Shuttle program. We who had designed the Apollo Guidance Computer knew that NASA wasn’t going that way again, so we re-invented ourselves as fault tolerance experts.

Fault tolerance for hard failures in Apollo was relatively simple: if the crew thought the Primary GN&C System (PGNCS) had stopped working, they would switch to a totally different backup system. We took that as a requirement that the PGNCS shall suffer no such failures, ever, and moved heaven and earth to achieve that goal—and succeeded! Part of that effort was a not-so-simple approach to computer recovery from transient faults: readiness to stifle low-priority tasks in overload conditions, and saving the status quo frequently and effectively rebooting (very quickly) when things got too weird. That combination was what saved the Apollo 11 lunar landing when a radar interface went haywire; don’t let Richard Nixon [1] or anyone else tell you that the computer “failed” and a heroic human took over and saved the day!

NASA was under pressure to maximize use of Commercial Off-The-Shelf (COTS) subsystems in the Space Shuttle, theoretically for economy. They had less confidence in such units, as did we, and so designed in massive redundancy and advanced techniques to manage it. Thus it was that our role focused on working with NASA on the architecture of the whole vehicle to make the analog and digital parts cooperate in fault tolerance, and specifically on debating with IBM Federal Systems how to give the spacecraft computers “instantaneous” fault tolerance.

Aside from confidence in suppliers, another issue that excited the crews’ interest in high reliability was that the Shuttle would be one of the first aircraft to implement control in the “fly-by-wire” mode. It was all very well for Apollo-style spacecraft to be fly-by-wire because pilots hadn’t known any other type, but if it looked and flew more like an airplane, the astronauts tended to remember how good it felt to know that there were strong steel cables between their hand controls and the aero-surfaces, albeit with a power boost from hydraulics. It was our very good fortune to have Bob Crippen as the astronaut expert in this area, with his broad and deep understanding of what we were doing.

Once we were satisfied that the Orbiter architecture would support analog-digital teamwork (as in Apollo, we weren’t involved with the booster), we focused primarily on the Flight Computer Operating System (FCOS), which in turn was focused on task management, including redundancy management, among the General Purpose Computers (GPCs) and the sensors and actuators. It wasn’t any sort of a DOS (since there were no disk drives), and it didn’t have to handle directly the crew displays and inputs, as there are separate Display Electronics Units (DEUs) for that. The focus on task management was well suited to our experience in the same area in Apollo, whose multi-tasking “executive” was much too simple to be called an Operating System. As we’ll see, the exact architecture of the priority-oriented task management was crucial to our approach to redundancy management.

Reliability Requirements And Degrees

Perhaps the biggest difference between Apollo and the Shuttle was the degree to which the latter, though a glider, had to emulate a regular powered aircraft. One requirement is for a cross-range capability of 1500 miles, meaning that the airframe has to be highly maneuverable—the downside of that is a capability (if not properly controlled) of going into an unrecoverable flat spin during descent. It was soon determined that “properly controlled” meant no major defects in control could last longer than 400 msec. Obviously, the crew couldn’t switch to a backup system under that constraint, so fully automatic recovery from any single failure became a requirement. Such a recovery can be categorized as either Fail Safe (FS) or Fail Operational (FO), depending on whether the mission is aborted or allowed to continue undiminished after that failure. Since Shuttle missions can be quite long (weeks), treating the first recovery as FO requires an equally fully automatic recovery from a second failure, even of the same type of unit. The only relief in that requirement is an assumption that failures do not happen simultaneously; that there’s time to clean up the configuration after a first failure before a second failure of the same type takes place. The combination of two similar failures can then be called either FO-FS or FO-FO, depending on whether an abort is called on the second failure. The presence of an independent backup system, which the crew can switch to after a third failure if it doesn’t happen in one of those control-critical moments, suggests that the Shuttle avionics architecture could be called FO-FO-FS—Fail Operational-Fail Operational-Fail Safe—well, mostly safe anyway. NASA’s mission rules take the conservative approach of the FO-FS level, since two failures of one type of unit may not be a coincidence.

Reliability Approach Throughout

The deliberations on vehicle-wide fault tolerance architecture settled on quadruple modular redundancy (QMR) with voting for critical units, with a sprinkling of triple modular redundancy (TMR) with voting for units that were the best but not the only way to do their jobs, and a few dual-redundant, less critical, units. To obtain the benefits of analog-digital teamwork, this approach extended well beyond GPCs: to data buses, peripheral electronics, and even hydraulics, where servovalves driving secondary actuators perform QMR voting of their own.

The advantage of modular redundancy with voting is that reliability does not depend on built-in tests, which can always be shown to have incomplete coverage; the challenge is that voting requires strict rules for which signals or actions are enough alike to be considered the same (and thus qualified to be part of the majority). Units with significant analog parts have to be forgiven differences within a tolerance, while all-digital units must produce identical results. Given this, and given the stipulation that failures happen one at a time, QMR voting is 4-out-of-4 initially, then 3-out-of-4 when a failure occurs; then it becomes TMR by ceasing to try working with the failed unit. Similarly, TMR votes 3-out-of-3 until one of those 3 fails, then 2-out-of-3. As you can’t make a majority out of an electorate of 2, the only assured recovery from a third failure is a resort to the independent backup.
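
As a rough illustration of this voting-with-degradation idea (a sketch in C, not the flight code, which was written in HAL/S and assembly; the names and the tolerance value are mine), the routine below compares redundant analog-style values within a tolerance, declares any unit outside the majority failed, and drops it from further voting, so QMR degrades to TMR and TMR to 2-out-of-3:

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

#define UNITS 4              /* start at QMR: quadruple modular redundancy   */
#define TOLERANCE 0.05       /* analog units are forgiven small differences  */

static bool healthy[UNITS] = { true, true, true, true };

/* Vote among the still-healthy units: a unit is "in the majority" if it
 * agrees, within tolerance, with more than half of the healthy electorate.
 * A minority unit is excluded from all future votes.  Returns a value
 * belonging to the majority.                                                */
static double vote(const double value[UNITS])
{
    int healthy_count = 0;
    for (int i = 0; i < UNITS; i++)
        if (healthy[i]) healthy_count++;

    double majority_value = 0.0;
    for (int i = 0; i < UNITS; i++) {
        if (!healthy[i]) continue;
        int agree = 0;
        for (int j = 0; j < UNITS; j++)
            if (healthy[j] && fabs(value[i] - value[j]) <= TOLERANCE)
                agree++;                      /* includes agreement with self */
        if (agree * 2 > healthy_count)
            majority_value = value[i];        /* any member of the majority   */
        else {
            healthy[i] = false;               /* minority unit: stop using it */
            printf("unit %d voted out\n", i + 1);
        }
    }
    return majority_value;
}

int main(void)
{
    double first_pass[UNITS]  = { 1.00, 1.01, 0.99, 1.73 };  /* unit 4 fails  */
    double second_pass[UNITS] = { 2.00, 2.01, 1.99, 0.00 };  /* 4 now ignored */
    printf("pass 1: %.2f\n", vote(first_pass));
    printf("pass 2: %.2f\n", vote(second_pass));
    return 0;
}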

Voting With Multi-Port Hydraulics

Before getting into the computer system, let’s look at how the secondary hydraulic actuators achieve the FO-FO-FS standard when moving the aero control surfaces: elevons, rudder, and body flap. Each of these surfaces is driven by a large powerful primary actuator which in itself must never fail, just as each wing must never fail. Hydraulic pressure is fed into the secondary actuator by 4 servovalves, any one of which can feed in the full pressure. If 3 of them call for deflection up and 1 calls for deflection down, the majority overpower the minority in feeding pressure to the primary actuator. When the minority servovalve is closed and locked by disabling its servoamplifier, there is still 2-out-of-3 hydraulic voting (Figure 1).

Figure 1. Hydraulic Actuator Voting

Teamwork With QMR Actuator Servos

Since the hydraulic actuators are used in the critical phases of descent, approach, and landing, their relationship to GPCs and data buses is the key to the digital fault tolerance. During critical phases, the 4 GPCs running the Primary Avionics Software System (PASS—a fifth GPC is running the Backup Flight System) are configured as a Redundant Set (RS) with each GPC driving one of the Flight Critical data buses, and each of those buses controls one servoamplifier at each secondary actuator. Thus every primary/secondary actuator set is driven by the entire RS with a FO-FO level of fault tolerance.

As suggested above, Shuttle avionics is a system of distributed computers, with many local processors in addition to the GPCs. Most of these are designated by the functional name of Multiplexer/Demultiplexer (MDM) because of their dependence on direction from the data buses, but they are in fact embedded processors capable of interpreting polling queries and actuator commands from a GPC, controlling whatever subsystem they belong to, and returning sensor data to the entire RS. The fact that they communicate with the GPCs by whole coherent messages is important because it allows the GPCs to be synchronized at a fairly coarse granularity, as opposed to, say, every instruction execution.

Figure 2 shows all the data paths involved when GPCs in the RS send command messages to a typical actuator (shown schematically in the upper right corner). A command message is a coherent semantic unit with a content such as “Go to +5º deflection for the next 40 msec.” The horizontal lines between GPCs and MDMs are the 4 aft-subsystem flight critical data buses, each controlled by one GPC. Each of the Flight Critical/Aft-subsystem MDMs (FA1-4) controls one ServoAmplifier (ASA). Note how the ΔP signals from the servovalves feed back into the ASAs and the Monitors, from which they are passed to the MDM. That’s how the GPCs can poll these MDMs to see whether any ΔP signal is so far out of line as to be adjudged a servo failure.
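
To make the message traffic concrete, here is a rough sketch (illustrative C types and thresholds of my own, not the actual bus formats) of a command message of the “go to +5º for the next 40 msec” kind, and of the ΔP comparison a GPC might apply to the polled servo feedback:

#include <math.h>
#include <stdio.h>

/* Illustrative shape of a command message from one GPC to an FA MDM/ASA:
 * a coherent semantic unit, e.g. "go to +5 degrees for the next 40 msec". */
struct surface_command {
    int    surface_id;        /* which elevon / rudder / body flap           */
    double deflection_deg;    /* commanded deflection                        */
    int    hold_msec;         /* how long the command remains valid          */
};

/* Illustrative delta-P monitor: each of the 4 servovalves reports its
 * pressure differential back through the ASA and MDM; a channel far out
 * of line with the other three is adjudged a servo failure and bypassed.  */
static int check_delta_p(const double dp[4], double limit)
{
    for (int i = 0; i < 4; i++) {
        int out_of_line = 0;
        for (int j = 0; j < 4; j++)
            if (j != i && fabs(dp[i] - dp[j]) > limit)
                out_of_line++;
        if (out_of_line == 3)
            return i;                      /* this channel disagrees with all */
    }
    return -1;                             /* no servo failure detected       */
}

int main(void)
{
    struct surface_command cmd = { 1, +5.0, 40 };
    double dp[4] = { 120.0, 118.0, 121.0, -310.0 };   /* channel 4 force-fights */
    printf("command: surface %d to %+.1f deg for %d msec\n",
           cmd.surface_id, cmd.deflection_deg, cmd.hold_msec);
    int bad = check_delta_p(dp, 50.0);
    if (bad >= 0)
        printf("servo channel %d out of line; bypass its servoamplifier\n", bad + 1);
    return 0;
}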

Figure 2. GPCs To MDMs To ASAs To Actuators

Teamwork With Redundant Sensors

For input data to the GPCs, the fault tolerance design requires that each GPC in the RS receive exactly the same data from the set of redundant copies of any sensor. The software examines the multiple instances of input data and applies a filtering rule appropriate to the unit and its degree of redundancy, e.g. average, intermediate value, etc. If all the digital gear is working, this will produce the exact same filtered data in all GPCs in the RS, plus the exact same opinion of which sensor data is out of line, if any. That’s important to keep the calculations identical throughout the RS. (Cases where not all the digital gear is working are covered further on.)
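
A rough sketch of such a filtering rule (my illustration, not the PASS code; the particular rule per redundancy level is assumed): with quadruple data, average the two intermediate values; with triple data, take the middle value; with dual data, average. Run on identical inputs, it yields a bit-for-bit identical result, and the same opinion of any outlier, in every GPC of the RS.

#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for doubles */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Illustrative selection filter, keyed to the degree of redundancy:
 *   4 copies -> average of the two intermediate values
 *   3 copies -> middle (median) value
 *   2 copies -> average of the two                                        */
static double select_filter(const double *samples, int n)
{
    double s[4];
    for (int i = 0; i < n; i++) s[i] = samples[i];
    qsort(s, (size_t)n, sizeof s[0], cmp_double);
    switch (n) {
    case 4:  return (s[1] + s[2]) / 2.0;
    case 3:  return s[1];
    default: return (s[0] + s[n - 1]) / 2.0;
    }
}

int main(void)
{
    double rate_gyro[4] = { 0.52, 0.50, 0.51, 3.90 };   /* one copy out of line */
    printf("filtered body rate: %.3f deg/sec\n", select_filter(rate_gyro, 4));
    return 0;
}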

The MDMs are omitted from Figure 3, but should be understood to be in the long horizontal lines, one per Rate Gyro. The cross-coupling shown is actually performed by the data buses, since each MDM sends its data on all 4 buses. This implies that the 4 MDMs cannot all transmit at once, which seems contrary to the principle of synchronized data gathering, but they do sample and hold the data as simultaneously as their respective controlling GPCs sent polling commands; then they take turns on the bus set using normal contention resolution.

Figure 3. GPCs In RS Listening To All Rate Gyros

The Need For Exact Data Uniformity

If TMR had been a fully satisfactory fault tolerance approach, it would have been good enough to make the 3 GPCs in a TMR-level RS acquire input data that was substantially though not bit-for-bit equal, and issue commands that are substantially though not bit-for-bit equal. Going to quad redundancy and achieving graceful degradation of many types of units poses a number of complications, but above all, a sort of “smoking gun” issue that establishes the need for identical data in all GPCs: “Control Slivering.” With 4 GPCs calculating maneuvers that are almost the same, the system could put actuators into a 2-against-2 force fight, resulting in no maneuver. For Shuttle purposes, we made this argument about a 180º roll maneuver (common enough for spacecraft), but in a conference with a commercial-aircraft orientation, perhaps we should talk about a 180º turn instead! This is not any kind of failure case. With all digital and analog units working within their design tolerances, 2 GPCs could call for a maneuver of +179.99º and the 2 others could call for a maneuver of -179.99º.
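
A toy numeric version of the sliver (my illustration, with made-up attitude estimates): each GPC computes a shortest-path rotation to the commanded attitude, and inputs that differ only in their last few hundredths of a degree can still land on opposite sides of the ±180º wrap.

#include <math.h>
#include <stdio.h>

/* Shortest-path rotation from current to target heading, wrapped to
 * (-180, +180].  Near a commanded 180-degree maneuver, inputs that agree
 * within design tolerance can fall on opposite sides of the wrap.        */
static double shortest_rotation(double target_deg, double current_deg)
{
    double d = fmod(target_deg - current_deg, 360.0);
    if (d > 180.0)   d -= 360.0;
    if (d <= -180.0) d += 360.0;
    return d;
}

int main(void)
{
    /* Four healthy GPCs whose current attitude estimates differ by a few
     * hundredths of a degree -- all within tolerance.                      */
    double current[4] = { -0.02, -0.01, +0.01, +0.02 };
    for (int i = 0; i < 4; i++)
        printf("GPC %d commands %+8.2f deg\n", i + 1,
               shortest_rotation(180.0, current[i]));
    /* GPCs 1 and 2 command about -180 deg, GPCs 3 and 4 about +180 deg:    */
    /* a 2-against-2 force fight, so the vehicle doesn't roll at all.       */
    return 0;
}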

Method Of GPC Synchronization

Having dictated exact uniformity of data in all GPCs in the RS, we had to define task management to support it. Part of the problem was that the GPCs weren’t really designed for strictly synchronized operation, so they were supplied with a set of interconnecting 28V discretes, such that 3 discretes (“sync lines”) from each GPC could be read as a 3-bit “sync code” by each of the other GPCs. What task management in the FCOS has to do, without violating the requirement for priority-driven task execution, is to establish enough levels of “sync points” to make all changes of priority level occur at the same point of logical flow in all GPCs. As a practical matter, there are only three occasions to switch to a different task: the normal end of a task, the expiration of a timer interval, and the completion of an input/output (I/O) operation. The timer and I/O occasions are the fundamental causes of program interrupts, but in computers as loosely coupled as these, there’s no way to force each interrupt to appear at exactly the same point in the logic flow in all GPCs.

Our solution was to aggregate effective logic flow into much larger blocks than instruction executions, and to require a sync point when ready to pass from one to the next. Interrupts butt in wherever they can, as in any computer system, but they are not allowed to affect mission logic flow except under control of a sync point. To assure timely response to such tamed interrupts, we made a rule that every task has to invoke a “programmed sync point” at least every millisecond. (This would never work in an ordinary mainframe environment where many unrefined programs are running around without regard to rules and freezing the works with infinite loops and deadly embraces, but Shuttle software is built in a much more disciplined environment.) The expiration of a timer interval flags the next sync point to become a “timer sync point,” and an I/O completion flags the next sync point to become an “IOC sync point.” These flags take the form of priorities, with programmed sync points having the lowest priority, timer sync points the next higher, and I/O completions the highest.

These priority values are exchanged among GPCs as sync codes, via the sync lines mentioned above. The effect is that no GPC proceeds to a new logic block, either the next one in the same task or a different one that responds to an interrupt-level sync point, until all GPCs have come to a common point in their logic flow and agree (under a timeout limit) on a priority level. That’s how synchronization with data uniformity is maintained throughout the RS in the absence of failures. Next we must consider how the synchronization scheme responds to failure conditions.
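
The sketch below (my reconstruction in C with illustrative names; the real sync routines were GPC assembly language, and the waiting is a timed polling loop rather than a single pass) shows the shape of a sync point: each GPC posts the highest-priority code it has pending, then waits, bounded by a timeout, for every member of its local Redundant Set to post an agreeing code; any no-show is dropped from the local RS and a crew alarm is lit.

#include <stdio.h>

/* Sync codes, ordered by priority (Table 1).                               */
enum sync_code {
    SYNC_NONE  = 0,   /* 000: no sync point requested                       */
    SYNC_PROG  = 1,   /* 001: programmed sync point                         */
    SYNC_TIMER = 2,   /* 010: timer interrupt pending                       */
    SYNC_IOC   = 3    /* 011: I/O completion pending                        */
};

#define GPCS 4

/* One GPC's view of a sync point: post my code, then check each other GPC
 * still in my local RS for an agreeing code.  In the real system this check
 * is a polling loop bounded by the clock-rate tolerance; a GPC that never
 * agrees within that time is removed from my local RS.                     */
static int rendezvous(int self, enum sync_code posted[GPCS],
                      int rs_mask[GPCS], enum sync_code my_code)
{
    posted[self] = my_code;            /* drive my sync lines                */
    int agreed = 1;                    /* I agree with myself                */
    for (int other = 0; other < GPCS; other++) {
        if (other == self || !rs_mask[other]) continue;
        if (posted[other] == my_code) {
            agreed++;
        } else {
            rs_mask[other] = 0;        /* no-show: out of my local RS        */
            printf("GPC %d: GPC %d failed to sync, alarm lit\n",
                   self + 1, other + 1);
        }
    }
    return agreed;                     /* size of my local RS after the vote */
}

int main(void)
{
    /* Scenario: GPC 2 hangs and never posts a code.                        */
    enum sync_code posted[GPCS] = { SYNC_NONE, SYNC_NONE, SYNC_PROG, SYNC_PROG };
    int rs_mask[GPCS] = { 1, 1, 1, 1 };
    int rs_size = rendezvous(0, posted, rs_mask, SYNC_PROG);
    printf("GPC 1 proceeds with a local RS of %d\n", rs_size);
    return 0;
}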

Case 1: GPC “Total” Failure

If a GPC fails, as happened on the first Approach and Landing Test flight, it will either be a no-show at the next programmed sync point, or (if it’s running at all) may enter a sync point prematurely and see all the others as no-shows. Either way, the timeout limit leads to the conclusion that a failure to sync has occurred, and every GPC reaching this conclusion turns on an alarm light to alert the crew, who alone are empowered to shut down a GPC. The Computer Annunciation Matrix, described in a later section, presents enough information for a confident decision.

Case 2: Sensor Failure

If a sensor (or the MDM that provides the only access to it) fails, it’s essential that all GPCs in the RS perceive this failure on the same I/O cycle, so that they can simultaneously exclude it from their data filter and resolve not to listen to that sensor again. Whenever a sensor failure is simply an error in numerical value and doesn’t trigger a low-level I/O error condition, it can be caught by any of several “old-fashioned” software techniques: parity and cyclic redundancy checks, reasonableness tests based on serial correlation, or a check on uniformity, within a tolerance, with corresponding data from redundant sensors.
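
A sketch of one such “old-fashioned” reasonableness test (illustrative limits of my own, not flight values): reject a sample that lies outside the sensor’s physical range, or that jumps farther from the previous accepted sample than the vehicle dynamics allow in one sampling interval.

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative reasonableness test on one sensor channel: a range check
 * plus a serial-correlation (maximum step) check against the previous
 * accepted sample.                                                        */
static bool reasonable(double sample, double previous,
                       double min, double max, double max_step)
{
    if (sample < min || sample > max)       return false;
    if (fabs(sample - previous) > max_step) return false;
    return true;
}

int main(void)
{
    double previous = 2.1;                     /* last accepted value        */
    double samples[3] = { 2.3, 57.0, 2.2 };    /* second sample is nonsense  */
    for (int i = 0; i < 3; i++) {
        bool ok = reasonable(samples[i], previous, -20.0, 20.0, 1.0);
        printf("sample %.1f: %s\n", samples[i], ok ? "accepted" : "rejected");
        if (ok) previous = samples[i];
    }
    return 0;
}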

Where there is a low-level I/O error, there is generally a fault in some digital subsystem, but it’s not always clear whether the fault lies in an MDM or bus (on the far side of the cross-coupling from the GPCs) or on the GPC side of the cross-coupling. That is, certain digital subsystem failures can present the appearance of a sensor/MDM failure to some GPCs but not to others, which leads into the next failure case.

Case 3: GPC Bus Receiver Failure

If a GPC becomes unable to receive data from a particular data bus, it can’t by itself distinguish that case from a bus failure, an MDM failure, or a sensor failure. The importance of the distinction is of course that if all GPCs in the RS see the same problem, the GPCs must be working correctly and the “string” (bus/MDM/sensor) must be at fault, while if only one GPC sees a problem, it must take the blame as the failed unit so that the other GPCs can continue normal use of that string and sensor. For this reason, the number of sync codes was increased to distinguish normal I/O completion from I/O completion with errors. That part of the design went through several iterations, of which the version presented here is the simplest to describe but doesn’t lack any of the final functionality.

Sync Code System (No I/O Errors)

This section goes into more detail about how GPC synchronization resolves interrupt priority races and detects faults in any GPC’s logic flow, without getting into I/O errors. For these purposes, four sync codes are enough for each GPC to issue as shown in Table 1.

Table 1. Basic Sync Codes (Binary)

Code   Interpretation
000    No sync point requested
001    Programmed sync point
010    Timer interrupt
011    I/O Completion interrupt

In the absence of any interrupts, the GPC with the fastest clock gets to a programmed sync point first, issues sync code 001, and waits for the others in the RS to do likewise, but only long enough for the GPC with the slowest clock to catch up. If all the others turn up with code 001 within the time limit, this GPC proceeds into the next block of the current task. Any other GPC that doesn’t catch up within the tolerance on clock rates is considered by this GPC to be out of sync, and it removes that GPC from its record of the RS and lights the appropriate crew alarm, but then proceeds into the next block as usual. If there’s only 1 no-show out of the original 4, the “local” RS is reduced to 3; but if all the other 3 are no-shows, the local RS is reduced to 1, in which case the lonely GPC does not attempt to enter any more sync points. Presumably, the lonely GPC is convicted of failure, but only the crew may turn it off.

When a timer interrupt occurs in any GPC, it has no immediate effect but lies in wait for a programmed sync point. When that occurs, this GPC issues sync code 010 to propose that course to the RS, and waits for the others to agree on 010 within the time limit. Assuming for now that no I/O Complete interrupt occurs while this is going on, the sequence is similar to the 001 case, except that all GPCs that agree on code 010 proceed to process the timer interrupt and perform whatever change of task that implies. Processing either type of interrupt is simply a matter of elevating the priority of whatever task is supposed to follow the interrupt event, and is so quick that it doesn’t need to do a programmed sync point of its own. Also, any GPCs that fail to sync may be showing 001 rather than being no-shows, but it’s a fail to sync just the same. If all but one of the RS agree on 010, the timer logic in the lonely GPC may have failed (stopped), but if only one winds up with 010 while the others time out agreeing on 001, its timer logic may be triggering falsely.

When an I/O Complete interrupt occurs in any GPC, it also has no immediate effect but lies in wait for a programmed sync point. (For now, assume no timer interrupt occurs around this time.) When the programmed sync point occurs, this GPC issues sync code 011 to propose that course to the RS, and waits for the others to agree on 011 within the time limit. The sequence is again similar, except that all GPCs that agree on code 011 proceed to process the I/O completion and perform whatever task change that implies. If all but one of the GPCs agree on 011, the I/O logic in the lonely GPC may have failed, but if only one GPC winds up with 011 while the others time out agreeing on 001, its I/O completion logic may be triggering falsely.

When both types of interrupt occur at approximately the same time, they may occur in the same or different orders in the GPCs, so an additional step is seen to be required whenever a timer sync code (010) has to yield to an I/O Completion code (011). The GPC observing this event maintains a record of it, and the next sync point to occur (most likely in the high-priority task invoked by the I/O completion) looks for it and issues 010, rather than 001, if it finds it. Thus, a timer interrupt that loses a close race with an I/O completion still gets processed within a millisecond—though whether the task it invokes has a higher or lower priority than the I/O-triggered task is another question. This combination case has more paths to enumerate (and to verify and validate), but the priority organization of both the sync codes and the tasks triggered by the interrupt events keeps the task management fairly straightforward.

Sync Code System (With I/O Errors)

This section covers how GPC synchronization at I/O completion uses extended sync codes to determine the correct response to input errors detected by a GPC’s hardware-level checking. Since the sensors (including feedback from actuators) are organized into 4 strings, and since we don’t have to consider multiple simultaneous failures, four extended sync codes are added to the original four as shown in Table 2.

Table 2. Full Set Of Sync Codes (Binary)

Code   Interpretation
000    No sync point requested
001    Programmed sync point
010    Timer interrupt
011    I/O Completion interrupt, no error
100    I/O Completion, error on string 1
101    I/O Completion, error on string 2
110    I/O Completion, error on string 3
111    I/O Completion, error on string 4

At I/O Completion sync points, the codes 011, 100, 101, 110, and 111 are all considered to be in “agreement” as far as synchronization goes, but the fault tolerance logic takes a second look at them as soon as synchronization is assured. If all GPCs in the RS reported code, say, 110, they all evict string 3 data (for the sensor currently being sampled) from their filter calculations and resolve never to look at that sensor again.

If just one GPC reports, say, 101, while all the rest of the RS report 011, the lonely GPC is judged to have failed to sync after all, with the usual consequences. This is required because that GPC is unable to perform the same input filtering calculations as the others, and the others are not “willing” to reduce their input data set by eliminating data that they all received correctly. Under our rule about one failure at a time, the only combinations that have to be considered are as shown in Table 3, where A, B, and C designate GPCs other than Self.

Table 3. I/O Sync Codes In A 4-GPC RS

Self   A     B     C     Interpretation & Action
011    011   011   011   No string fail; use all data
011    011   011   1xx   C failed sync; use all data
011    011   1xx   011   B failed sync; use all data
011    1xx   011   011   A failed sync; use all data
1xx    011   011   011   Self failed sync; leave RS
100    100   100   100   String 1 fail; use 2-3-4 data
101    101   101   101   String 2 fail; use 1-3-4 data
110    110   110   110   String 3 fail; use 1-2-4 data
111    111   111   111   String 4 fail; use 1-2-3 data
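
A sketch of the second look behind Table 3 (again my reconstruction, with illustrative names): if every GPC in the RS posts the same error code, evict that string’s data; if this GPC alone posts an error code while the others post 011, this GPC treats itself as failed-to-sync and leaves the RS.

#include <stdio.h>

#define GPCS 4
#define IOC_OK 3            /* 011: I/O completion, no error                 */
                            /* 4..7 (100..111): error on string 1..4         */

enum verdict { USE_ALL_DATA, EVICT_STRING, LEAVE_RS, OTHER_FAILED_SYNC };

/* Second look at the I/O completion sync codes, after synchronization is
 * assured: codes 011 and 1xx all count as "agreement" for sync purposes,
 * but the fault-tolerance logic compares them to localize the failure.     */
static enum verdict judge(int self, const int code[GPCS],
                          const int in_rs[GPCS], int *failed_string)
{
    int errors = 0, error_code = IOC_OK, members = 0;
    for (int i = 0; i < GPCS; i++) {
        if (!in_rs[i]) continue;
        members++;
        if (code[i] != IOC_OK) { errors++; error_code = code[i]; }
    }
    if (errors == 0)
        return USE_ALL_DATA;                      /* nothing wrong anywhere  */
    if (errors == members) {                      /* everyone saw the error  */
        *failed_string = error_code - IOC_OK;     /* 100..111 -> string 1..4 */
        return EVICT_STRING;                      /* drop that string's data */
    }
    if (code[self] != IOC_OK)
        return LEAVE_RS;            /* only I saw it: I'm the failed unit    */
    return OTHER_FAILED_SYNC;       /* someone else alone saw it: they leave */
}

int main(void)
{
    int in_rs[GPCS] = { 1, 1, 1, 1 };
    int all_see_string3[GPCS] = { 6, 6, 6, 6 };    /* 110 from every GPC     */
    int only_self_sees[GPCS]  = { 5, 3, 3, 3 };    /* 101 from GPC 1 only    */
    int string;
    if (judge(0, all_see_string3, in_rs, &string) == EVICT_STRING)
        printf("all agree: evict string %d data, never poll it again\n", string);
    if (judge(0, only_self_sees, in_rs, &string) == LEAVE_RS)
        printf("only this GPC saw an error: it fails to sync and leaves the RS\n");
    return 0;
}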

Although I’ve been using “sensor” in connection with the data being treated in this way, it’s important to realize that all the actuators include sensors to report back on their own performance, e.g. the ΔP feedback from the hydraulics. Similarly, subsystems that are primarily sensors often accept effector commands, e.g. alignment commands for IMUs. So most MDMs participate in two-way conversations.

As stated above, not all the local processors are MDMs. Figure 4 lists all the types and shows how they are connected through the 7 different types of data buses to all 5 GPCs—one of which is running the backup software (BFS). The BFS GPC normally listens to all buses so as to stay current with the state of the vehicle, and does not command any bus except its own Intercomputer bus; when backup mode is selected, however, it commands all buses. Whatever the type of local processor, the data bus protocols for message format, contention resolution, etc. are the same as described above for MDMs.

Figure 4. Shuttle Data Bus Architecture

Sync Point Routines And Sync Holes

The only GPC code we at MIT generated was for the sync routines. In contrast to all the flight-related code, which was written in HAL/S (a very high-level engineering language incorporating vectors, matrices, and differential equations), the sync routines were in assembly language, more like kernel programming with every nerdy slick trick allowed, and all the microseconds carefully counted out. The central loop of each sync routine made use of the fact that all 15 sync lines (3 from each of the 5 GPCs) could be read in by a single instruction and passed through a current RS mask to look for agreement, with about 10 μsec per iteration—remember, that was blazing fast in the 1970s. This implied that with random phasing between the loops in various GPCs, one could exit a sync point as much as 10 μsec ahead of another.
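
A sketch of the shape of that central loop, in C rather than the original assembly language (field widths, names, and the spin budget are illustrative): read all fifteen sync lines as one word, mask off GPCs not in the current RS, and spin until every RS member shows the expected code or the clock-tolerance budget runs out.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define GPCS 5                /* 5 GPCs, 3 sync lines each = 15 discretes    */
#define MAX_SPINS 200         /* stands in for the clock-tolerance timeout   */

/* In the flight code a single instruction read all 15 sync lines at once;
 * here a stub returns a packed word with 3 bits per GPC (GPC 1 in bits 0-2). */
static uint16_t read_sync_lines(void)
{
    /* GPCs 1, 3, 4 show 001 (programmed sync); GPC 2 shows 000; GPC 5 is BFS. */
    return (uint16_t)(01 | (00 << 3) | (01 << 6) | (01 << 9) | (00 << 12));
}

/* Spin until every GPC still in rs_mask shows `expected` on its sync lines,
 * or the spin budget is exhausted.  Each pass corresponds to roughly 10 usec
 * in the real routine.  Returns a bitmask of the GPCs that never agreed.    */
static unsigned wait_for_agreement(unsigned rs_mask, unsigned expected)
{
    unsigned no_shows = rs_mask;
    for (int spin = 0; spin < MAX_SPINS && no_shows; spin++) {
        uint16_t lines = read_sync_lines();
        for (int g = 0; g < GPCS; g++) {
            unsigned code = (lines >> (3 * g)) & 07u;
            if ((rs_mask & (1u << g)) && code == expected)
                no_shows &= ~(1u << g);          /* this GPC has agreed      */
        }
    }
    return no_shows;                             /* still-silent RS members  */
}

int main(void)
{
    unsigned rs_mask = 0x0F;                     /* GPCs 1-4 are the RS      */
    unsigned late = wait_for_agreement(rs_mask, 01);
    for (int g = 0; g < GPCS; g++)
        if (late & (1u << g))
            printf("GPC %d failed to sync; removed from local RS\n", g + 1);
    return 0;
}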

We spent vast amounts of effort searching for ways in which sync codes could change at such times that they could be seen differently by different GPCs, further complicated by the fact that each GPC had its own clock rate, and refined the code to minimize these “sync holes.” Fortunately, we were doing this at a time when the main-engine people and the thermal-tile people were struggling with their development problems and taking all the blame for delays, which gave us time to grind the sync holes down to very small probabilities. As far as I know, no Shuttle flight has experienced one.

Computer Annunciation Matrix

Figure 5 shows this dedicated display, which presents each GPC’s vote as to whether it or some other GPC has failed to sync. It treats all 5 GPCs equally, relying on the crew to know which computer is running BFS and which PASS computers are the RS. As the labeling shows, the top row is GPC 1’s votes regarding each of the 5 GPCs, the second row is GPC 2’s votes, and so on. Correspondingly, each column displays the “peer group’s” opinion of one GPC. As long as there is no fail to sync, the whole matrix is dark. Any GPC’s vote against another GPC turns on a white light; when a GPC perceives that it has fallen out of the RS, it turns on the yellow light in the main diagonal. If the crew turns off a GPC, the entire corresponding row and column go dark so as not to clutter the unit’s possible later displays.

Figure 5. Computer Annunciation Matrix: CAM

Various patterns of lights are quite informative, and help the crew decide whether a GPC that failed to sync might be worth trying again in a later mission phase. For example, suppose that, in an RS of GPCs 1-4, GPC 2 finds that it’s the only one unable to read one of the buses, but is otherwise running normally.

The pattern of Figure 6 suggests that GPC 2 might be usable again if some configurational change were made, e.g. giving it command of a different bus.

Figure 6. CAM After GPC 2 Failed To Sync

The Proof Of The Pudding

The first free-flight test of the prototype Space Shuttle, Enterprise, was in 1977. It was carried aloft by a specially adapted Boeing 747 (still in use to ferry Shuttles from Edwards AFB to Kennedy Space Center) and was popped off the mother ship’s back to perform the Approach and Landing phases. At the instant of separation, the CAM pattern of Figure 6 appeared for a split second, then that of Figure 7, making it clear that GPC 2 was very sick indeed. The crew moded the GPC through STANDBY to OFF and the CAM went dark while the remaining RS of GPCs 1, 3, and 4 controlled the vehicle to a smooth flawless landing.

Figure 7. CAM After GPC 2 Failed Completely

The immediate reaction by some of the NASA and IBM people was that GPC 2’s troubles must have been caused by all that complicated fault tolerance logic dreamed up by those MIT “scientists” (a sort of cuss word in the mouths of some who had wearied of our harangues on making the scheme complete). Several MIT and IBM engineers pored over the post-flight hex dumps of GPC 2’s memory, looking for very small-needle clues in a very large haystack. One of the best IBMers, Lynn Killingbeck, concluded that it looked like the computer had “gone bananas” and that turned out to be exactly right. When the GPC was returned to IBM’s hardware labs, they found that the wire harness connecting its I/O Processor (IOP, see Figure 4) to its CPU had one cold-solder joint that had been jolted loose by the release from the 747. As luck would have it, the wire in question triggered an ultra-priority interrupt as it rattled against its connector pin. Nothing that the software was trying to do was going to get anywhere with that going on—bananas, indeed!

One of Lynn’s colleagues, in honor of his swift and accurate analysis, presented him with a varnished mahogany trophy featuring two bananas fresh from the grocery (Figure 8).

Figure 8. Lynn Killingbeck’s Desk Trophy

In those days, IBM was a strict white-shirt and sober-tie kind of organization, so it was no great surprise to learn that Lynn’s manager came along after a couple of days and observed that although the trophy was appropriate for the moment, it wasn’t the sort of thing that upheld IBM’s corporate image. “Lynn, we’d like to suggest that you eat the evidence,” he said in his resonant bass voice.

A Heretical Observation

It has long been orthodox, in both mainframe and control computers, to emphasize the feature of program interrupts to support the time-sharing that is essential (in very different ways) to both application areas. While the Shuttle GPCs had this feature and used it, the GPC synchronization scheme described here goes a long way toward emulating a system in which no interrupts exist. In effect, each interrupt had to be encapsulated and prevented from altering the logic flow until the higher-level software was prepared to consider such alterations. I believe that verification of the Shuttle PASS software was made much easier and quicker by thus reducing the variety of logic flow paths. Not many people remember that even in the Apollo Guidance Computer, a primitive precursor of this approach existed: jobs that had to be restartable after transient disturbances protected themselves by sampling a variable NEWJOB at convenient times.

As stated above, mainframes are generally time-shared between production programs and new programs under development. The latter must be expected to violate standards of good behavior (e.g. infinite loops), so that it’s critical for interrupts to seize control of the machine without any cooperation from them.

In real-time control systems, by contrast, all software elements are necessarily production code, which has many reasons to comply with strict behavior protocols: jitter, transport lag, optimal disposition of cycle overruns, etc. From this base, it was not difficult to add a protocol for frequent sampling of the conditions that can alter logic flow by elevating the priority of one task over another—that is, to perceive the events that trigger interrupts in a way that actually used the interrupts but could easily have been done without them.
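
A minimal sketch of the heresy in practice (illustrative only, not any particular flight computer): hardware events merely set pending flags, and the compute loop polls those flags at well-defined safe points, so the logic flow changes only where the software expects it to.

#include <stdbool.h>
#include <stdio.h>

/* Pending-event flags, set by (or on behalf of) the hardware; nothing else
 * happens at event time.  Compare the Apollo practice of sampling NEWJOB.  */
static volatile bool timer_pending = false;
static volatile bool io_pending    = false;

/* The one place where logic flow is allowed to change: poll the flags at a
 * safe point and dispatch the highest-priority pending work.               */
static void safe_point(void)
{
    if (io_pending)         { io_pending = false;    printf("run I/O-completion task\n"); }
    else if (timer_pending) { timer_pending = false; printf("run timer task\n"); }
}

int main(void)
{
    for (int cycle = 0; cycle < 3; cycle++) {
        printf("cycle %d: control-law work\n", cycle);   /* uninterruptible block */
        if (cycle == 1) { timer_pending = true; io_pending = true; }  /* events */
        safe_point();               /* invoked at least once per millisecond */
    }
    /* Note that the timer event which lost the race on cycle 1 still gets
     * handled at the very next safe point, on cycle 2.                      */
    return 0;
}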

So that’s the heresy: I suggest that real-time control computers may be better off without program interrupts than with them, emulating their effects without over-complicating the logic flows.

References

Almost all of this material is from my memory of the work we did at the time, supported by a few looks at “Digital Development Memos” produced by the CSDL group named in the acknowledgments below, plus illustrations (copied for use in slide shows) from documents by NASA, IBM, and Rockwell International’s Space Division. Feeling that references in a technical paper should cite resources that interested readers can access, I must say with regret that aside from a few memos in my basement, even I can’t lay hands on the primary documents.

Nevertheless, I cannot forbear to include one reference, even if it is to a side issue:

[1] Mindell, David A., 2008, Digital Apollo, Cambridge, MA, MIT Press, p. 232.

Acknowledgements

I must acknowledge my colleagues at the Charles Stark Draper Laboratory, Inc. with whom I worked on fault tolerance during the Space Shuttle Avionics development years, basically the 1970s: Division Head Eldon C. Hall, Group Leader Alan I. Green, Albert L. Hopkins, Ramon Alonso, Malcolm W. Johnston, Gary Schwartz, Frank Gauntt, Bill Weinstein, and others.

NASA people who contributed to development include Bob Crippen, Cline Frasier, Bill Tindall, Phil Shaffer, Ken Cox, and many others.

From IBM Federal Systems Division: Lynn Killingbeck, Steve Palka, Tony Macina, and more.

From Rockwell International Space Division: Vic Harrison, Bob D’Evelyn, Ron Loeliger, Bob Peller, Gene O’Hern, Sy Rubenstein, Ken McQuade, and more.

Rich Katz, of the Office of Logic Design at NASA’s Goddard Space Flight Center, encouraged me to present this material in a Lessons Learned session at the 2006 MAPLD (Military-Aerospace Programmable Logic Devices) conference. While it does not address such devices, it does cover problems that participants needed to understand, bringing forward solutions that had been developed a generation before.

28th Digital Avionics Systems Conference

October 25-29, 2009