
JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE
J. Softw. Maint. Evol.: Res. Pract. (2010)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/smr.500

Software disasters—understanding the past, to improve the future

Patricia A. McQuaid∗,†,‡,§,¶

Orfalea College of Business, California Polytechnic State University, San Luis Obispo, CA 93407, U.S.A.

SUMMARY

Over the years, there have been several major software disasters, resulting from poor software project management, poor risk assessment, and poor development and testing practices. The results of the disasters range from project delays and cancelations, to the loss of millions of dollars of equipment, to human fatalities. It is important to study software disasters, to alert developers and testers to be ever vigilant, and to understand that huge catastrophes can arise from what seem like small problems. This paper examines such failures as the Mars Polar Lander, the Patriot missile, and the Therac-25 radiation deaths. The focus of the paper is on the factors that led to these problems, an analysis of the problems, and the lessons to be learned that relate to software engineering, safety engineering, government and corporate regulations, and oversight by users of the systems. A model named STAMP, Systems-Theoretic Accident Modeling and Process, will be introduced, as a model to analyze these types of accidents. This model is based on systems theory, where the focus is on systems taken as a whole, as opposed to traditional failure-event models where the parts are examined separately. It is by understanding the past that we can improve the future. Copyright © 2010 John Wiley & Sons, Ltd.

Received 14 May 2010; Accepted 14 May 2010

KEY WORDS: software disasters; mars polar lander; patriot missile; Therac-25; STAMP

1. INTRODUCTION

‘Those who cannot remember the past are condemned to repeat it’, said George Santayana, a Spanish-born philosopher who lived from 1863 to 1952 [1].

Software defects come in many forms, from those that cause a brief inconvenience to those that cause fatalities, with a wide range of consequences in between. This paper focuses on three cases: the Mars Polar Lander, the Patriot missile, and the Therac-25 radiation therapy machine. For each case, the background is provided, the factors that led to the problems are discussed, the problems are analyzed, and the lessons learned are presented, in the hope that we learn from them and do not repeat these mistakes.

2. MARS POLAR LANDER

The Mars Polar Lander was the second portion of NASA’s Mars Surveyor ‘98 Program.

∗Correspondence to: Patricia A. McQuaid, Orfalea College of Business, California Polytechnic State University, San Luis Obispo, CA 93407, U.S.A.

†E-mail: [email protected]
‡President of the American Software Testing Qualifications Board (ASTQB).
§Chairperson for the Americas, World Congress for Software Quality.
¶Professor of Information Systems.


Launched on 3 January 1999 from Cape Canaveral, it was to work in coordination with the Mars Climate Orbiter to study ‘Martian weather, climate and soil in search of water and evidence of long-term climate changes and other interesting weather effects’ [2].

The initial problems with the project started in September 1999 when the Climate Orbiter disintegrated while trying to fall into Martian orbit. It was later determined that this was due to a miscommunication between the Jet Propulsion Laboratory (JPL) and Lockheed Martin Astronautics over the use of metric (Newton) and English (pound) forces to measure the thruster firing strength [2]. Upon the discovery of the problem’s source, members at JPL went back and corrected the issues and ‘spent weeks reviewing all navigation instructions sent to the Lander to make sure it would not be afflicted by an error like the one that doomed the orbiter’ [3].

2.1. The problem

On 3 December 1999, the Polar Lander began its descent towards the Martian surface after a slight course adjustment for refinement purposes, and entered radio silence as it entered the Martian atmosphere [3]. Five minutes after the Polar Lander landed on the Martian surface, it was supposed to have begun to transmit radio signals to Earth, taking approximately 30 min to travel the 157 million mile distance between the planets [4]. Those signals never made it to Earth, and despite repeated attempts by the people at JPL through April 2000 [3], contact was never established with the Polar Lander, resulting in another thorough investigation to determine the errors that doomed this mission.

2.2. The causes and lessons learned

It was later determined that an error in one simple line of code resulted in the loss of the Polar Lander. The landing rockets ‘were supposed to continue firing until one of the craft’s landing legs touched the surface. Apparently the onboard software mistook the jolt of landing-leg deployment for ground contact and shut down the engines, causing [the Polar Lander] to fall from a presumed height of 40 m (130 ft)’ [5] to its demise on the Martian surface. This theory has received more support, as images more recently reanalyzed by scientists show what may be both the parachute and the remains of the Polar Lander. From this, scientists can conclude that the Polar Lander’s ‘descent proceeded more or less successfully through atmospheric entry and parachute jettison. It was only a few short moments before touchdown that disaster struck’ [5], and a $165 million investment was lost millions of miles away. Figure 1 depicts two views of the Mars Polar Lander.
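
To make the failure mode concrete, the following is a minimal, hypothetical sketch (not flight code) of how a transient touchdown-sensor indication produced during leg deployment can be latched and later trusted by the engine-cutoff logic; all names, thresholds, and altitudes here are illustrative assumptions, not values from the accident reports.

```python
# Simplified, hypothetical sketch of the failure mode described above:
# a transient touchdown-sensor pulse during leg deployment is latched,
# and the engine-cutoff logic later trusts that stale indication.

from dataclasses import dataclass

@dataclass
class LanderState:
    altitude_m: float
    touchdown_latched: bool = False
    engines_on: bool = True

def read_touchdown_sensor(altitude_m: float, legs_deploying: bool) -> bool:
    # Leg deployment produces a brief mechanical jolt that the sensor
    # reports as ground contact, even though the craft is still descending.
    return legs_deploying or altitude_m <= 0.0

def descent_step(state: LanderState, legs_deploying: bool) -> None:
    # Flawed logic: latch any touchdown indication permanently...
    if read_touchdown_sensor(state.altitude_m, legs_deploying):
        state.touchdown_latched = True
    # ...and once below the altitude at which cutoff checks are enabled,
    # shut the engines down if the latch is set, without re-sampling.
    if state.altitude_m < 40.0 and state.touchdown_latched:
        state.engines_on = False

state = LanderState(altitude_m=1500.0)
descent_step(state, legs_deploying=True)    # jolt high above the surface sets the latch
state.altitude_m = 39.0
descent_step(state, legs_deploying=False)   # cutoff check now armed: engines stop
print(state.engines_on)                     # False -- premature shutdown around 40 m
```

Under this reading of the accident, re-sampling the sensor, or ignoring any indication latched before the cutoff check is armed, would have prevented the premature shutdown.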

According to software safety experts at the Massachusetts Institute of Technology (MIT), although the cause of Mars Polar Lander’s destruction was based on an error that was traced to a single bad line of software code, ‘that trouble spot is just a symptom of a much larger problem—software systems are getting so complex, they are becoming unmanageable’ [6].

Figure 1. Two views of the Mars Polar Lander.


Feature creep, where features are added yet not tested rigorously due to budget constraints, compounds the problems. As the Polar Lander software developed more and more add-ons and features while being tested less and less, it became doomed before it was even loaded onto the Polar Lander itself and launched towards Mars [6].

Amidst the coding and human errors, there were also several engineering design and testing flaws discovered in the Polar Lander in post-wreck reports. There were reports that NASA altered tests, but these allegations have not been proven. One allegation dealt with the use of unstable fuel and the other with improper testing of the landing legs.

One review, led by a former Lockheed Martin executive, found the Mars exploration program to be lacking in experienced managers, a test and verification program, and adequate safety margins. NASA’s Mars Polar Lander Failure Review Board, chaired by a former Jet Propulsion Laboratory flight operations chief, also released its report. The report said the Polar Lander probably failed due to a premature shutdown of its descent engine, causing the $165 million spacecraft to smash into the surface of Mars. It concluded that more training, more management, and better oversight could have caught the problem [7].

3. THE PATRIOT MISSILE

The Patriot was an army surface-to-air, mobile, air defense system [8]. Originating in the mid-1960s, it was designed to operate in Europe against Soviet medium-to-high-altitude aircraft and cruise missiles traveling at speeds up to about Mach 2. It was then modified in the 1980s to serve as a defense against incoming short-range ballistic missiles, such as the Scud missile used by Iraq during the first Gulf War. It has had several upgrades since then, including one in 2002 in which the missile itself was upgraded to include onboard radar. The system was designed to be mobile and to operate for only a few hours at a time, in order to avoid detection. It got its first real combat test in the first Gulf War and was deployed by US forces during Operation Iraqi Freedom.

3.1. Specifications and operations

The missile was 7.4 ft long and was powered by a single-stage solid-propellant rocket motor that drove it to speeds of Mach 3. The missile itself weighed 2200 pounds, with a range of 43 miles. The Patriot was armed with a 200 pound high-explosive warhead detonated by a proximity fuse, which caused shrapnel to destroy the intended target [9]. The plan was for the Patriot missile to fly straight toward the incoming missile and then explode at the point of nearest approach. The explosion would either destroy the incoming missile with shrapnel, or knock the incoming missile off course so that it missed its target.

The Engagement Control Station (ECS) was the only manned station in a Patriot battery. The ECS communicated with the launcher, with other Patriot batteries, and with higher command headquarters. It controlled all the launchers in the battery. The ECS was manned by three operators, who had two consoles and a communications station with three radio relay terminals [10]. Thus, the weapon control computer was linked directly to the launchers as well as the radar.

The phased array radar carried out search, target detection, track and identification, missile tracking and guidance, and electronic counter-countermeasures (ECCM) functions. The radar was mounted on a trailer and automatically controlled by the digital weapons control computer in the ECS, via a cable link. The radar system had a range of up to 100 km, the capacity to track up to 100 targets, and could provide missile guidance data for up to nine missiles. The radar antenna had a 63-mile (100-km) range [10]. Pictures of the Patriot Missile launcher and the results of its failure are depicted below in Figure 2.

The system had three modes: automatic, semi-automatic, and manual. It relied heavily on its automatic mode. An incoming missile flew at approximately one mile every second (Mach 5) and could be 50 miles (80.5 km) away when the Patriot’s radar locked onto it. Automatic detection and launching became a crucial feature, because there was not a lot of time to react and respond.


Figure 2. Patriot Missile launcher and the destroyed Army barracks.

Once the missile was detected, a human being could not possibly see it or identify it at that distance. The system therefore depended on its radar and the weapon control computer.

The process of finding a target, launching a missile, and destroying a target was straightforward. First, the Patriot missile system used its ground-based radar to find, identify, and track the targets. Once it found a target, it scanned it more intensely and communicated with the ECS. When the operator or computer decided that it had an incoming foe, the ECS calculated an initial heading for the Patriot missile. It chose the Patriot missile it would launch, downloaded the initial guidance information to that missile, and launched it. Within 3 seconds the missile was traveling at Mach 5 and was headed in the general direction of the target. After launch, the Patriot missile was acquired by the radar. Then, the Patriot’s computer guided the missile toward the incoming target [11]. As briefly mentioned earlier, when it got close to the target its proximity fuse detonated the high-explosive warhead and the target was either destroyed by the shrapnel or knocked off course. After firing its missiles, a re-supply truck with a crane pulled up next to the launcher to load it with new missiles.

The newer PAC-3 missile, first used in 2002, contained its own radar transmitter and computer, allowing it to guide itself. Once launched, it turned on its radar, found the target, and aimed for a direct hit. Unlike the PAC-2, which exploded and relied on the shrapnel to destroy the target or knock it off course, these PAC-3 missiles were designed to actually hit the incoming target and explode so that the incoming missile was completely destroyed. This feature made it more effective against chemical and biological warheads because they were destroyed well away from the target. Described as a bullet hitting a bullet, these ‘bullets’ closed in on each other at speeds up to Mach 10. At that speed there is no room for error—if the missile miscalculates by even 1/100th of a second, it will be off by more than 100 ft (30.5 m) [11].
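
A quick back-of-the-envelope check of that closing-speed claim, assuming Mach 10 is roughly 3430 m/s (a sea-level figure used here purely for illustration):

```python
# Rough check of the quoted figure: at a closing speed near Mach 10,
# a 1/100th-second timing error translates into tens of meters of miss distance.
closing_speed_m_s = 10 * 343.0          # ~Mach 10, assuming ~343 m/s per Mach
timing_error_s = 1.0 / 100.0            # 1/100th of a second
miss_distance_m = closing_speed_m_s * timing_error_s
print(f"{miss_distance_m:.1f} m = {miss_distance_m * 3.28084:.0f} ft")  # ~34.3 m, ~113 ft
```

which is consistent with the ‘more than 100 ft (30.5 m)’ figure quoted above.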

3.2. Disaster in Dhahran

On 25 February 1991 a Patriot missile defense system operating at Dhahran, Saudi Arabia, during Operation Desert Storm failed to track and intercept an incoming Scud. In fact, no Patriot missile was launched to intercept the Scud that day, and it subsequently hit an Army barracks, killing 28 people and injuring 97 [8].

The Patriot problems likely stemmed from one fundamental aspect of its design: the Patriot was originally designed as an anti-aircraft, and not an anti-missile, defense system. With this limited purpose in mind, Raytheon designed the system with certain constraints. One such constraint was that the designers did not expect the Patriot system to operate for more than a few hours at a time—it was expected to be used in a mobile unit rather than at a fixed location. At the time of the Scud attack on Dhahran, the Patriot battery had been running continuously for four days—more than 100 hours. It was calculated that after only 8 hours of continuous operation, the Patriot’s stored clock value would be off by 0.0275 seconds, causing an error in the range gate calculation of approximately 55 meters.


At the time of the Dhahran attack, the Patriot battery in that area had been operating continuously for more than 100 hours—its stored clock value was 0.3433 seconds off, causing the range gate to be shifted 687 meters, a large enough distance that the Patriot was looking for the target in the wrong place. Consequently, the target did not appear where the Patriot incorrectly calculated that it should. Therefore the Patriot classified the incoming Scud as a false alarm and ignored it—with disastrous results [12].
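
The arithmetic behind those figures is easy to reproduce. The widely reported analysis (see the GAO report [8] and Marshall [12]) is that the system counted time in tenths of a second and converted it to seconds using a truncated fixed-point approximation of 0.1, losing a small, fixed amount per tick. The sketch below treats the per-tick error and the Scud closing speed as assumed round numbers, and recovers the drift and range-gate figures quoted above.

```python
# A minimal sketch of the Patriot clock-drift arithmetic, under the commonly
# reported assumptions: time counted in tenths of a second, a truncated
# fixed-point 0.1 losing ~9.5e-8 s per tick, and a Scud closing at ~2000 m/s.
TICKS_PER_SECOND = 10          # the clock counted tenths of a second
ERROR_PER_TICK_S = 9.54e-8     # assumed truncation error of the stored 0.1
SCUD_SPEED_M_S = 2000.0        # assumed closing speed, for the range-gate shift

def drift_after(hours: float) -> float:
    """Accumulated timing error after continuous operation for `hours`."""
    return hours * 3600 * TICKS_PER_SECOND * ERROR_PER_TICK_S

for hours in (8, 100):
    drift = drift_after(hours)
    print(f"{hours:>3} h: drift = {drift:.4f} s, "
          f"range gate shift = {drift * SCUD_SPEED_M_S:.0f} m")
# Prints roughly 0.0275 s / 55 m after 8 h and 0.34 s / 687 m after 100 h,
# matching the figures cited in the text.
```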

On 11 February 1991, after determining the effect of the error over time, the Israelis notified the U.S. Patriot project office of the problem. Once they were notified, the programming team set to work solving the problem. Within a few days, the Patriot project office made a software fix correcting the timing error, and sent it out to the troops on 16 February 1991. Sadly, at the time of the Dhahran attack, the software update had yet to arrive in Dhahran. That update, which arrived in Dhahran the day after the attack, might have saved the lives of those in the barracks [12].

This problem of unsuccessfully intercepting incoming missiles was widespread. The Patriot missile was designed in the late 1970s as an anti-aircraft weapon. However, it was modified in the 1980s to serve as a defense against incoming short-range ballistic missiles. Until the Gulf War, the Patriot had not been tested in combat. The Army determined that the Patriot succeeded in intercepting Scud missiles in perhaps only 10–24 of more than 80 attempts. Determining the true success rate of the Patriot is difficult. First, ‘success’ is defined in several ways: destruction, damage to, and deflection of a Scud missile may variously be interpreted as successes depending on who makes the assessment. Secondly, the criteria used for ‘proof’ of a ‘kill’ varied—in some cases Army soldiers made little or no investigation and assumed a kill, in other cases they observed hard evidence of a Scud’s destruction [12].

3.3. Lessons learned

A 10-month investigation by the House Government Operations subcommittee on Legislation and National Security concluded that there was little evidence to prove that the Patriot hit more than a few Scuds. Testimony before the House Committee on Government Operations raised serious doubts about the Patriot’s performance [9]. One significant lesson learned from this disaster is that robust testing is needed for safety-critical software. We need to test the product for the environment it will be used in, and under varying conditions. When redesigning systems for a new use, we need to be very careful to ensure the new design is safe and effective. Also, clear communication among the designers, developers, and operators is needed to ensure safe operation. When software needs to be fixed, it must be done quickly and deployed to the site immediately.

4. THERAC-25 MEDICAL ACCELERATOR

The Therac-25 was a computerized radiation therapy machine that dispensed radiation to patients. The Therac-25 is one of the most devastating computer-related engineering disasters to date. It was developed to treat cancer, but due to poor engineering, it led to the death or serious injury of six people. In 1986, two cancer patients in Texas received fatal radiation overdoses from the Therac-25.

In an attempt to improve the software functionality, Atomic Energy of Canada Limited (AECL) and a French company called CGR developed the Therac-25. There were earlier versions of the machine, and the Therac-25 reused some of the design features of the Therac-6 and the Therac-20. The Therac-25 was ‘notably more compact, more versatile, and arguably easier to use’ than the earlier versions [13]. However, the Therac-25 software also had more responsibility for maintaining safety than the software in the previous machines. While this new machine was supposed to provide advantages, the presence of numerous flaws in the software led to massive radiation overdoses, resulting in the deaths of three people [13].

Medical linear accelerators ‘accelerate electrons to create high-energy beams that can destroy tumors with minimal impact on the surrounding healthy tissue’ [13]. As a dual-mode machine, the Therac-25 was able to deliver both electron and photon treatments.


Figure 3. The layout of the Therac-25 machine and the treatment room.

The electron treatment was used to radiate surface areas of the body to kill cancer cells, whereas the photon treatment, also known as X-ray treatment, delivered cancer-killing radiation deeper in the body.

The Therac-25 incorporated the most recent computer control equipment, which was to have several important benefits, one of which was to use a double-pass accelerator, allowing a more powerful accelerator to be fitted into a small space, at less cost. The operator setup time was shorter, giving operators more time to speak with patients and allowing more patients to be treated in a day. Another ‘benefit’ of the computerized controls was to monitor the machine for safety. With this extensive use of computer control, the hardware-based safety mechanisms that were on the predecessors of the Therac-25 were eliminated and their function transferred completely to the software, which, as will be seen, was not a sound idea [14].

The layout of the Therac-25 machine and the treatment room is depicted in Figure 3.

The Therac-25 X-rays were generated by smashing high-power electrons into a metal target positioned between the electron gun and the patient. The older Therac-20 electromechanical safety interlocks were replaced with software control, because software was perceived to be more reliable [15].

What the engineers did not know was that the programmer who developed the operating system used by both the Therac-20 and the Therac-25 had no formal training. Because of a subtle bug called a ‘race condition’, a fast typist could accidentally configure the Therac-25 so that the electron beam would fire in high-power mode, but with the metal X-ray target out of position [15].

4.1. The accidents

The first accident occurred in June 1985 and involved a woman who was receiving follow-up treatment on a malignant breast tumor. During the treatment, she felt an incredible force of heat, and the following week the area that was treated began to break down and lose layers of skin. She also had a matching burn on her back, and her shoulder had become immobile. Physicists concluded that she received one or two doses in the 20 000 rad (radiation absorbed dose) range, which was well over the prescribed 200 rad dosage. When AECL was contacted, they denied the possibility of an overdose occurring. This accident was not reported to the FDA until after the accidents in 1986.

The second accident occurred in July 1985 in Ontario, Canada. After the Therac-25 was activated, it shut down after 5 minutes and showed an error that said ‘no dose’. The operator repeated the process four more times. The operators had become accustomed to frequent malfunctions that had no problematic consequences for the patient. While the operator thought no dosage was given, in reality several doses were applied, and the patient was hospitalized three days later for radiation overexposure. The FDA and the Canadian Radiation Protection Board were notified and a voluntary recall was issued, while the FDA audited the modifications made to the machine. A switch believed to have caused the failure was redesigned, and the machine was declared to be 10 000 times safer after the redesign.

The third accident occurred in December 1985 in Washington, where the Therac-25 had already been redesigned in response to the previous accident.


After several treatments, the woman’s skin began to redden. The hospital called AECL and was told that ‘after careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error’ [13]. However, the staff found evidence of radiation overexposure, given her symptoms of a chronic skin ulcer and dead tissue.

The fourth accident occurred in March 1986 in Texas, where a man died due to complications from the radiation overdose. In this case, the message ‘Malfunction 54’ kept appearing, indicating only 6 rads were given, so the operator proceeded with the treatment. That day the video monitor happened to be unplugged and the audio monitor was broken, so the operator, who was at the controls in another room, had no way of knowing what was happening inside. After the first burn, while he was trying to get up off the table, he received another dose—in the wrong location, since he was moving. He then pounded on the door to get the operator’s attention. Engineers were called in, but they could not reproduce the problem, so the machine was put back into use in April. The man’s condition included vocal cord paralysis, paralysis of his left arm and both legs, and a lesion on his left lung, which eventually caused his death.

The fifth accident occurred in April 1986 at the same location as the previous one and produced the same ‘Malfunction 54’ error. The same technician who treated the patient in the fourth accident prepared this patient for treatment. This technician was very experienced at this procedure and was a very fast typist. Thus, as with the former patient, when she typed something incorrectly, she quickly corrected the error. The same ‘Malfunction 54’ error showed up and she knew there was trouble. She immediately contacted the hospital’s physicist, and he took the machine out of service. After much perseverance, they determined that the malfunction occurred only if the Therac-25 operator rapidly corrected a mistake. AECL filed a report with the FDA and began work on fixing the software bug. The FDA also required AECL to change the machine to clarify the meaning of malfunction error messages and to shut down the treatment after a large radiation pulse. The patient fell into a coma, suffered neurological damage, and died.

The sixth and final accident occurred in January 1987, again in Washington. AECL engineers estimated that the patient received between 8000 and 10 000 rads instead of the prescribed 86 rads after the system shut down and the operator continued with the treatment. The patient died due to complications from radiation overdose.

4.2. Contributing factors

Numerous factors were responsible for the failure of users and AECL to discover and correct the problem, and for the ultimate failure of the system. This is partly what makes this case so interesting; there were problems that spanned a wide range of causes.

Previous models of the machine were mostly hardware based. Before this, computer control was not widely in use and hardware mechanisms were in place in order to prevent catastrophic failures from occurring. With the Therac-25, the hardware controls and interlocks which had previously been used to prevent failure were removed. In the Therac-25, software control was almost solely responsible for mitigating errors and ensuring safe operation. Moreover, the same pieces of code which had controlled the earlier versions were modified and adapted to control the Therac-25. The controlling software was modified to incorporate safety mechanisms, presumably to replace more expensive hardware controls that were still in the Therac-20. To this day, not much is known about the sole programmer who ultimately created the software, other than that he had minimal formal training in writing software.

Another factor was AECL’s inability or unwillingness to resolve the problems when they occurred, even in the most serious instances involving patient deaths. It was a common practice for their engineering and other departments to dismiss claims of machine malfunction as user error, medical problems with the patient beyond AECL’s control, and other circumstances wherein the blame would not fall on AECL. This caused users to try to locate other problems which could have caused the unexplained accidents.

Next, AECL’s software quality practices were terrible, as demonstrated by their numerous Corrective Action Plan (CAP) submissions. When the FDA finally was made aware of the problem, they demanded that AECL fix the numerous problems.


Whenever AECL tried to provide a solution for a software problem, it either failed to fix the ultimate problem, or changed something very simple in the code, which ultimately could introduce other problems. They could not provide adequate testing plans, and could barely provide any documentation to support the software they created [14]. For instance, the 64 ‘Malfunction’ codes were referenced only by their number, with no meaningful description of the error provided to the console.

FDA interaction at first was poor, mainly due to the limited reporting requirements imposed on users. While medical accelerator manufacturers were required to report known issues or defects with their products, users were not, resulting in the governmental agency failing to get involved at the most crucial point following the first accident. Had users been required to report suspected malfunctions, the failures may well have been prevented. AECL did not deem the accidents to be any fault on its part, and thus did not notify the FDA.

Finally, the defects in the software code were what ultimately contributed to the failures themselves. Poor user interface controls caused prescription and dose rate information to be entered improperly [14]. Other poor coding practices caused failures to materialize, such as the turntable not being in the proper position when the beam was turned on. For one patient, this factor ultimately caused the radiation burns, since the beam was applied in full force to the victim’s body, without first being deflected and diffused to deliver a much lower dosage.

Another major problem was due to race conditions. Race conditions are brought on by shared variables, when two threads or processes try to access or set a variable at the same time without some sort of intervening synchronization. In the case of the Therac-25, if an operator entered all information regarding dosage at the console and arrived at the bottom of the screen, some of the software routines would automatically start even though the operator did not issue the command to accept those variables. If an operator then went back to the fields to fix an input error, the correction would not be sensed by the machine and would therefore not be used; the erroneous value was used instead. This contributed to abnormally high dosage rates due to software malfunction.
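
The following is a minimal, illustrative sketch of this kind of data-entry race. It is not the actual Therac-25 code, which ran as cooperating tasks on a PDP-11; the names and timings here are assumptions. A setup routine starts consuming shared parameters as soon as entry reaches the end of the screen, while a fast operator's correction arrives too late to be seen.

```python
# Illustrative race: a treatment-setup thread snapshots shared parameters as
# soon as data entry is flagged complete, before the operator's correction.
import threading
import time

shared = {"mode": "x-ray", "entry_complete": False}   # shared, unsynchronized

def setup_thread():
    # Starts as soon as the operator reaches the bottom of the screen and
    # reads the parameters without waiting for possible corrections.
    while not shared["entry_complete"]:
        time.sleep(0.001)
    mode_used = shared["mode"]            # may read the not-yet-corrected value
    print("beam configured for:", mode_used)

def operator_thread():
    shared["entry_complete"] = True       # cursor reaches bottom of screen
    time.sleep(0.01)                      # a fast typist then corrects the entry...
    shared["mode"] = "electron"           # ...but setup may already have run

t1 = threading.Thread(target=setup_thread)
t2 = threading.Thread(target=operator_thread)
t1.start(); t2.start(); t1.join(); t2.join()
# Depending on timing, the beam is configured with the stale "x-ray" value.
```

A lock around the shared parameters, together with a rule that any edit invalidates a pending setup, closes this particular window.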

An overflow occurs when a variable reaches the maximum value that its memory space can store. For the Therac-25, a 1-byte variable named Class3 was used and incremented during the software checking phase. Since a 1-byte variable can hold only 256 distinct values, on every 256th pass through the code the value would roll over to zero. A function checked this variable, and if it was set to 0, the function would not check for a collimator error condition, a very serious problem. Thus, on every pass through the program there was a 1 in 256 chance that the collimator would not be checked, resulting in a wider electron beam, which caused the severe radiation burns experienced by a victim at one of the treatment sites [13].
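
A small sketch of that wraparound, using an 8-bit counter (the variable name Class3 comes from the accident reports; everything else here is illustrative):

```python
# Sketch of the Class3 overflow: every 256th pass the increment wraps to zero,
# and the collimator check guarded by "Class3 != 0" is silently skipped.

def check_collimator():
    """Stand-in for the safety check that verifies collimator position."""
    return True

class3 = 0
skipped_checks = 0

for treatment_pass in range(1, 1025):
    class3 = (class3 + 1) & 0xFF      # 1-byte variable: wraps around at 256
    if class3 != 0:
        check_collimator()
    else:
        skipped_checks += 1           # check silently skipped on wraparound

print(skipped_checks)                 # 4 skipped checks in 1024 passes (1 in 256)
```

The fix described in the accident analysis was correspondingly small: set the flag to a fixed non-zero value rather than incrementing it [13].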

4.3. Lessons learned

The Therac-25 accidents were tragic and provide a learning tool to prevent any future disasters of this magnitude. All of the human, technical, and organizational factors must be considered when looking to find the cause of a problem. The main contributing factors of the Therac-25 accidents included:

• management inadequacies and lack of procedures for following through on all reported incidents
• overconfidence in the software and removal of hardware interlocks
• presumably less-than-acceptable software-engineering practices
• unrealistic risk assessments and overconfidence in the results of these assessments [13].

When an accident arises, a thorough investigation must be conducted to see what sparked it. We cannot assume that the problems were caused by one aspect alone, because the parts of a system are interrelated.

Another lesson is that companies should have audit trails and incident-analysis procedures that are applied whenever it appears that a problem may be surfacing. Hazards and problems should be logged and recorded as a part of quality control. A company also should not over-rely on the numerical outputs of the safety analyses. Management must remain skeptical when making decisions.


The Therac-25 accidents also reemphasize the basics of software engineering, which include complete documentation, established software quality assurance practices and standards, clean designs, and extensive testing and formal analysis at the module and software level. Any design changes must also be documented so that people do not reverse them in the future.

Manufacturers must not assume that reusing software is 100% safe; parts of the code in the Therac-20 that were reused were found later to have had defects the entire time. But since there were still hardware safety controls and interlocks in place on that model, the defects went undetected. The software is a part of the whole system and cannot be tested on its own. Regression tests and system tests must be conducted to ensure safety. Designers also need to take time in designing user interfaces with safety in mind.

According to Leveson [13], ‘Most accidents involving complex technology are caused by a combination of organizational, managerial, technical, and, sometimes, sociological or political factors. Preventing accidents requires paying attention to all the root causes, and not just the precipitating event in a particular circumstance. Fixing each individual software flaw as it was found did not solve the device’s safety problems. Virtually all complex software will behave in an unexpected or undesired fashion under some conditions—there will always be another bug. Instead, accidents must be understood with respect to the complex factors involved. In addition, changes need to be made to eliminate or reduce the underlying causes and contributing factors that increase the likelihood of accidents or loss resulting from them’.

5. STAMP—SYSTEMS-THEORETIC ACCIDENT MODELING AND PROCESS

STAMP is a model to analyze these types of accidents. The model is based on systems theory, where the focus is on systems taken as a whole, as opposed to traditional failure-event models where the parts are examined separately. The creator of this model is Dr Nancy Leveson, who was also the principal investigator and author of the authoritative papers on the Therac-25 disasters. With STAMP, she employs a hazard analysis approach, whereby she examines the controls of potential hazards, early in and throughout the processes of design, development, testing, and so on. Leveson calls it ‘investigating an accident before it occurs’ [16].

When using a systems-theoretic accident model, accidents are viewed as the result of flawed processes that involve interactions among system components. These components include the following: people, societal and organizational structures, engineering activities, and the physical system. To put this in context, industrial (occupational) safety models focus on unsafe acts or conditions. Reliability engineering emphasizes failure events and the direct relationships between these events. A systems approach to safety takes a broader view by focusing on what was wrong with the system’s design or its operations that allowed the accident to take place [17]. One needs to examine the larger system and processes that produced the events, to understand what went wrong. She states that ‘traditional models do a poor job of handling systems containing software and complex human decision making; dealing with the organizational and managerial aspects of systems, e.g., the safety culture and management decisions; and the adaptation of systems over time (migration toward hazardous states)’ [16].

Leveson separates accidents into two types: those caused by failures of individual components, and those caused by dysfunctional interactions between non-failed components [17]. We need different models and conceptions of how accidents occur, to more accurately and completely reflect the types of accidents we are experiencing today. Simply building more tools based on the chain-of-events model will not result in significant gains.

Most hazard analysis techniques focus on failure events, such as the popular techniques of Fault Tree Analysis (FTA) and Failure Modes, Effects, and Criticality Analysis (FMECA). A shortcoming of these techniques is that they depend on failure events. What STAMP proposes is to examine dysfunctional interactions among operating components, rather than the failure of individual components.


It purports that the cause of an accident, instead of being understood in terms of a series of failure events, is viewed as the result of a lack of constraints imposed on the system design and operations. Thus in this model, the role of the system engineer or safety engineer is to identify the design constraints necessary to maintain safety and to ensure that the system design and operation enforce these constraints [16]. ‘The analysis starts with identifying the constraints required to maintain safety and then goes on to assist in providing the information and documentation necessary for system engineers and system safety engineers to ensure that the constraints are enforced in system design, development, manufacturing, and operations. A structured method for handling hazards during development, i.e., designing for safety, and during operations is presented’ [16].
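
As a very loose illustration of that constraint-first orientation (this is not Leveson's notation or tooling; the controllers and constraints below are hypothetical, loosely drawn from the Therac-25 discussion earlier), one can think of the analysis as walking a model of the safety control structure and flagging any safety constraint that no controller actually enforces:

```python
# Minimal, illustrative sketch: start from safety constraints and ask which
# controller in the control structure is responsible for enforcing each one.
from dataclasses import dataclass, field

@dataclass
class Controller:
    name: str
    enforced_constraints: set = field(default_factory=set)

@dataclass
class SafetyConstraint:
    description: str
    responsible_controller: str

controllers = [
    Controller("treatment software", {"beam fires only with target in position"}),
    Controller("operator procedures", set()),           # nothing enforced here
]

constraints = [
    SafetyConstraint("beam fires only with target in position", "treatment software"),
    SafetyConstraint("overdose indications are investigated before reuse",
                     "operator procedures"),
]

# Hazard analysis step: flag constraints that no controller actually enforces.
for c in constraints:
    owner = next((k for k in controllers if k.name == c.responsible_controller), None)
    enforced = owner is not None and c.description in owner.enforced_constraints
    if not enforced:
        print("unenforced safety constraint:", c.description)
```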

Leveson asserts that usually a root cause selected from the chain of events has one or more of the following characteristics: it represents a type of event that is familiar and thus easily acceptable as an explanation for the accident; it is a deviation from a standard; it is the first event in the backward chain for which a ‘cure’ is known; or, it is politically acceptable as the identified cause. She believes that there are two basic reasons for conducting an accident investigation: to assign blame for the accident, or to understand why it happened so that future accidents can be prevented. ‘Blame is not an engineering concept; it is a legal or moral one’ [17].

All attempts to engineer safer systems rest upon the underlying causal models of how accidents occur, although engineers may not be consciously aware of their use of such a model. An underlying assumption of these accident models is that there are common patterns in accidents and that accidents are not simply random events. ‘By defining those assumed patterns, accident models may act as a filter and bias toward considering only certain events and conditions or they may expand consideration of factors often omitted. The completeness and accuracy of the model for the type of system being considered will be critical in how effective are the engineering approaches based on it’. Further, ‘at the foundation of almost all causal analysis for engineered systems today is a model of accidents that assumes they result from a chain (or tree) of failure events and human errors. The causal relationships between the events are direct and linear, representing the notion that the preceding event or condition must have been present for the subsequent event to occur, i.e., if event X had not occurred, then the following event Y would not have occurred. As such, event chain models encourage limited notions of linear causality, and they cannot account for indirect and nonlinear relationships’. ‘When the goal is to assign blame, the backward chain of events considered often stops when someone or something appropriate to blame is found’ [17].

In summary, the difference between the systems-theoretic accident model and the chain-of-events model is that the systems-theoretic accident model does not identify a root cause of an accident. The entire safety control structure is examined, as well as the accident process, to determine what role each part of the process played in the loss. In the chain-of-events model, the root cause of an accident is identified. ‘Perhaps less satisfying in terms of assigning blame, the systems theoretic analysis provides more information in terms of how to prevent future accidents’ [17].

6. CONCLUSION

Software disasters come in many shapes and sizes, with consequences ranging from breaks in service, to loss of very expensive equipment and technology, to the loss of human life. It is important to remember these well-documented software disasters, learn from them, and not repeat the mistakes. We need to use systems-based models to analyze the reasons for the failures, to better understand what happened, and to prevent problems from occurring.

REFERENCES

Introduction

1. Santayana G. The Life of Reason, 1905, vol. 1. Available at: http://www.quotationspage.com/quote/27300.html [January 2010].


Mars Polar Lander Sources [All Accessed January 2010]

2. Exploring Mars, Spacetoday.org. Available at: http://www.spacetoday.org/SolSys/Mars/MarsExploration/MarsSurveyor98.html.
3. Wilford JN. Probe to Mars Becomes Silent, Its Fate Unclear, 4 December 1999. Available at: http://partners.nytimes.com/library/national/science/120499sci-nasa-mars.html.
4. Wade M. Mars Polar Lander. Available at: http://www.astronautix.com/craft/marander.htm.
5. Editors of Sky & Telescope. Mars Polar Lander Found at Last, 6 May 2005. Available at: http://skyandtelescope.com/news/article 1509 1.asp.
6. Clark G. Fatal Error: Buggy Software May Have Crashed Mars Polar Lander, 31 March 2000. Available at: http://www.space.com/businesstechnology/technology/mpl software crash 000331.html.
7. SPACE.com Staff Writers. Scathing Reports Take NASA to Task Over Mars Missions, 28 March 2000. Available at: http://www.space.com/scienceastronomy/solarsystem/nasa report synopsis 000328.html.

Patriot Missile Disaster Sources [All Accessed January 2010]

8. General Accounting Office, 1992. Available at: http://www.fas.org/spp/starwars/gao/im92026.htm.
9. Simon A. The Patriot Missile. Performance in the Gulf War Reviewed, 1996. Available at: http://www.cdi.org/issues/bmd/Patriot.html.
10. Patriot Missile Air Defence System, USA. Available at: http://www.army-technology.com/projects/patriot/.
11. Brain M. How Patriot Missiles Work, 28 March 2003. Available at: http://science.howstuffworks.com/patriot-missile.htm.
12. Marshall E. Fatal error: How Patriot overlooked a Scud. Science 1992; 255(5050):1347.

Therac-25 Medical Accelerator Sources [All Accessed January 2010]

13. Leveson N, Turner C. An investigation of the Therac-25 accidents. IEEE Computer 1993; 26(7):18–41. Available at: http://courses.cs.vt.edu/cs3604/lib/Therac 25/Therac 1.html.
14. http://www.computingcases.org/case materials/therac/case history/Case%20History.html.
15. A History of the Introduction and Shut Down of Therac-25. Available at: http://wired.com/news/technology/bugs/0,2924,69355,00.html?tw=wn tophead 1.

STAMP [All Accessed January 2010]

16. Leveson N. A new approach to hazard analysis for complex systems. Conference of the System Safety Society, Ottawa, 2003. Available at: http://sunnyday.mit.edu/papers.html.
17. Leveson N. A systems-theoretic approach to safety in software-intensive systems. IEEE Transactions on Dependable and Secure Computing, vol. 1. Available at: http://sunnyday.mit.edu/accidents/external2.pdf.

AUTHOR’S BIOGRAPHY

Patricia A. McQuaid, PhD, is a Professor of Information Systems at California Polytechnic State University. She has a doctorate in Computer Science and Engineering, an MBA, an undergraduate degree in Accounting, and is a Certified Information Systems Auditor (CISA). She has taught in both the Colleges of Business and Engineering throughout her career, and has worked in the banking and manufacturing industries. Her research interests include software testing, software process improvement, software quality, and software project management.

Patricia is a member of IEEE, and a Senior Member of the American Society for Quality (ASQ). She is an Associate Editor for the Software Quality Professional journal, and also participates on ASQ’s Software Division Council. She has published both internationally and nationally. She was a contributing author to both Volumes I and II of the Fundamental Concepts for the Software Quality Engineer (ASQ Quality Press) books. She is a frequent speaker at both national and international conferences, often a keynote speaker.

She is the co-founder and President of the American Software Testing Qualifications Board (ASTQB), the American presence in the International Software Testing Qualifications Board (ISTQB).


The ISTQB is the global certification body of software testers. Currently, there are nearly fifty countries involved in the certification consortium. She is a Certified Tester–Foundation Level (CTFL), through the ISTQB.

She has been the person in charge for the Americas for the Second and Third World Congresses for Software Quality, held in Japan in 2000 and Germany in 2005, and was the Associate Director and Program Chair of the Fourth World Congress for Software Quality, held in Washington, D.C. in 2008. She will hold a similar position for the Fifth World Congress for Software Quality (5WCSQ), to be held in Beijing, China in the Fall of 2011.
