Software Reliability

Software Reliability 25 September 2006

Page 1: Software Reliability

Software Reliability
25 September 2006

Page 2: Software Reliability

About the Evening Lectures

Viewing is required
All lectures will be recorded and shown during a regular class period
Working on getting them posted on the web so that you can download them at other times as well
Sign-in sheet at lecture
Assignment: two-paragraph summary of what you learned
Dinner lottery

Page 3: Software Reliability

About the Midterm

Use of Blackboard
  http://help.unc.edu/?id=4735&trail=4781
Installing SecureExam (see Guidelines on home page)
Later this week, I will post a dummy exam that you are all to take BEFORE the midterm to ensure that everything is working properly

Page 4: Software Reliability

Simplified Model of a Computer

[Diagram: a processor connected to MEMORY]
Processor
  Control Unit: retrieves the instruction, directs data movement
  Arithmetic Logic Unit: performs the operations
MEMORY
  instructions: defines an algorithm
  data: the information that it works on

Page 5: Software Reliability

Points to Remember

Computers access information by location and do not know the value stored there
Computers store numbers in fixed-size packets, which means that numbers cannot grow indefinitely
Computers do not distinguish between different types of data (e.g., instructions, text, or numbers)
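The last two points can be seen in a few lines of Python. This is only an illustrative sketch: the helper `store_uint8` and the sample bytes are made up for the example, and real hardware does the equivalent in circuitry rather than in code.

```python
import struct

def store_uint8(value: int) -> int:
    """Keep only the low 8 bits, as a one-byte memory cell would (hypothetical helper)."""
    return value & 0xFF

# A fixed-size cell cannot grow: 255 + 1 does not fit in 8 bits and wraps to 0.
wrapped = store_uint8(255 + 1)

# The same four bytes in memory, read three different ways.
raw = b"Bug!"
as_text = raw.decode("ascii")              # read as characters
as_integer = struct.unpack("<I", raw)[0]   # read as one 32-bit unsigned integer
as_bytes = list(raw)                       # read as four small numbers

print(wrapped, as_text, as_integer, as_bytes)
```

Nothing in the bytes themselves says which reading is "right". The program that reads them decides, which is why a stray write to the wrong location can turn text into a number or data into an instruction.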

Page 6: Software Reliability

Review: Computerized Systems

Finance: banking; stock market; commerce
Medical: diagnostics; life support; medical devices
Communications: television; radio; news; networks
Transportation: traffic signals; air traffic control; aircraft; spacecraft; trains; cars
Military: weapons systems; intelligence gathering
Energy: power plants; toxic chemical plants; oil & gas
Water: sewer
Buildings: HVAC; security; lights
Personal & household items

Page 7: Software Reliability

What is a Bug?

Page 8: Software Reliability

Bug

Problems in code that cause it to behave in an unintended, unanticipated, or unpredictable manner
Origin
  Grace Hopper (1906-1992), 1947: moth in a relay
    "First actual case of bug being found."
  Thomas Edison used the term in 1878
    "Bugs" - as such little faults and difficulties are called - ...

Page 9: Software Reliability

First Computer Bug

Page 10: Software Reliability

Why are bugs hard to find?

The error can appear in another program
  Device drivers, memory management
The error may only occur occasionally
  May require multiple conditions to occur

Page 11: Software Reliability

Classes of Problems

Poorly designed software
Poorly understood requirements
Poorly designed user interfaces
Improper use
Data entry problems
Simple coding errors

Page 12: Software Reliability

80% of software projects fail

50% challenged
  2x budget
  2x completion time
  2/3 planned function
30% impaired
  Scrapped

Standish Group, 1995

Page 13: Software Reliability

Sources of Risk

1. Lack of top management commitment
2. Lack of user commitment
3. Misunderstood requirements
4. Inadequate user involvement
5. Mismanaged user expectations
6. Scope creep
7. Lack of knowledge or skill

Keil et al., “A Framework for Identifying Software Project Risks,” CACM 41:11, November 1998.

Page 14: Software Reliability

Can’t We Test Out the Problems?

To establish that the probability of failure of software is less than 10^-9 in 10 hours, the testing required with one computer is greater than 1 million years
  Butler and Finelli, “The Infeasibility of Experimental Quantification of Life-Critical Software Reliability”
NIST estimates the cost to the US economy from inadequate software testing at more than $59 billion/yr
  NIST Planning Report 02-3
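The "1 million years" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on a simplifying assumption that is not the paper's full statistical argument: to see evidence of a failure rate, you need failure-free test time on the order of the mean time between failures. The function name and parameters are invented for the example.

```python
HOURS_PER_YEAR = 24 * 365.25

def years_of_testing(failure_probability: float, mission_hours: float,
                     computers: int = 1) -> float:
    """Rough test time needed, assuming test time ~ mean time to failure."""
    failures_per_hour = failure_probability / mission_hours   # 1e-9 / 10 = 1e-10
    mean_hours_to_failure = 1 / failures_per_hour              # about 1e10 hours
    return mean_hours_to_failure / HOURS_PER_YEAR / computers

years = years_of_testing(1e-9, 10)
print(f"{years:,.0f} years on one computer")
print(f"{years_of_testing(1e-9, 10, computers=1000):,.0f} years on 1000 computers")
```

Even with a thousand computers testing in parallel, this estimate stays above a thousand years, so testing alone cannot demonstrate that level of reliability.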

Page 15: Software Reliability

Simple Problems

Tampa couple was billed $4,062,599.57 for a month’s electricity
  Correct bill was $146.76
  Input error: the check for reasonable values was clearly not good enough
High school freshman banned from football because of drug use in middle school
  Actual offense was chewing gum and being tardy
  Different codes not properly translated: systems are only as good as their weakest links
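A check for reasonable values does not need to be elaborate. The following is a minimal sketch, not the utility's actual logic: the function `is_reasonable_bill`, the factor of 10, and the customer's "typical" bill are all assumptions made up for illustration.

```python
def is_reasonable_bill(amount: float, typical: float, max_ratio: float = 10.0) -> bool:
    """True if the bill is positive and at most max_ratio times the typical bill."""
    return 0 < amount <= typical * max_ratio

typical_monthly_bill = 150.00  # hypothetical figure from the customer's history

for amount in (146.76, 4_062_599.57):
    status = "ok" if is_reasonable_bill(amount, typical_monthly_bill) else "HOLD FOR REVIEW"
    print(f"${amount:,.2f}: {status}")
```

The point is not the exact threshold. Any bound tied to past usage would have stopped a four-million-dollar household bill before it was mailed.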

Page 16: Software Reliability

User Interface Bug (Usability Issue)

Afghanistan War (December 2001)
  Friendly fire killed 3 and injured 20 when a satellite-guided bomb landed on a battalion command post
  A GPS receiver was used to determine coordinates
  The battery was changed: what should come up?

www.washingtonpost.com/ac2/wp-dyn/A8853-2002Mar23

Page 17: Software Reliability

Denver Airport Baggage System (1995)

4 years in development at a cost of $193M
The promise: baggage delivered in < 10 minutes to any part of the airport!
Massively complex system
  4000 cars, 21 miles of track, scanners, photocells, 300 computers
What happened
  Cars misrouted and crashed, baggage lost and damaged
  Delayed opening cost $1.1M/day
  When the airport opened a year late, only one airline used the system

www.cis.gsu.edu/~mmoore/CIS3300/handouts/SciAmSept1994.html

Page 19: Software Reliability

Denver Airport System

Examples of bugs:
  Photocell could not detect bags on the belt and therefore didn’t stop the system
  System lost track of the state of carts during jams
  Timing between conveyor belts and carts not properly synchronized
Overall
  Not just software glitches: a very complex, poorly engineered system

Page 20: Software Reliability

Ariane 5 (1996)

Integer overflow
Software error

Page 21: Software Reliability

External view

Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded

Page 22: Software Reliability

External view

Page 23: Software Reliability

Cost

Development cost: $7 billion
Delay of more than one year
One set of four identical, uninsured scientific satellites + one rocket: $500,000,000

Page 24: Software Reliability

What Happened?

Overflow: tried to put too big a number into too small a space
Even worse: the feature that caused the problem wasn’t needed during flight! It was only needed to set up the launch.

archive.eiffel.com/doc/manuals/technology/contract/ariane/page.html
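"Too big a number into too small a space" can be made concrete. According to the inquiry board's report, the failing operation converted a 64-bit floating-point value to a 16-bit signed integer. The sketch below is not the flight code (which was written in Ada and failed with an unhandled exception rather than a silent wraparound); the function names and the sample value 40000.0 are invented to show the two ways such a conversion can go.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16_unchecked(value: float) -> int:
    """Truncate, keep the low 16 bits, and reinterpret as signed (silent wraparound)."""
    bits = int(value) & 0xFFFF
    return bits - 0x10000 if bits > INT16_MAX else bits

def to_int16_checked(value: float) -> int:
    """Convert only when the value fits; otherwise raise instead of storing garbage."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return int(value)

reading = 40000.0  # hypothetical sensor-derived value, larger than 32767
print(to_int16_unchecked(reading))   # a plausible-looking but wrong number
try:
    to_int16_checked(reading)
except OverflowError as err:
    print("rejected:", err)          # the caller still has to handle this sensibly
```

Detecting the overflow is only half the job. The software also needs a sensible response when the check fails, and shutting down a system the vehicle depends on is not one.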

Page 25: Software Reliability

Bank of New York
November 20, 1985

BoNY: nation’s largest clearer of government securities.
Software to track Federal securities transactions wrote new information on top of old.
Fed debited the bank for each transaction, but the bank did not know who owed it how much.
90 minutes => $32 billion overdraft!

Page 26: Software Reliability

Cost of Bug

Bank had to borrow $24 billion from the Federal Reserve.
Interest paid: ~$5 million for 1 day. (Annual earnings of the bank: ~$120 million)
BoNY share prices dropped by 25¢
Federal funds rate dropped from 8.4% to 5.5%
System down for 28 hours.
Fear of financial crisis caused an increase in the price of platinum!

Page 27: Software Reliability

Cause of bug

Message buffer counter in the BoNY system was 16 bits long.
Counters at the Fed (and other banks) were 32 bits long.
More than 32,000 transactions that morning!
  => Counter overflow
  Securities database corrupted.
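The counter mismatch is easy to reproduce. This sketch assumes the 16-bit counter was signed (the "32,000" figure suggests a limit of 32,767); the starting value and message count are made up, and `count_messages` is a hypothetical helper, not anything from the bank's system.

```python
import ctypes

def count_messages(counter, start: int, n: int) -> int:
    """Increment a fixed-width counter n times, starting from start."""
    counter.value = start
    for _ in range(n):
        counter.value += 1   # ctypes integers wrap silently when the value overflows
    return counter.value

narrow = count_messages(ctypes.c_int16(), 32_000, 1_000)  # 16-bit counter
wide = count_messages(ctypes.c_int32(), 32_000, 1_000)    # 32-bit counter

print("16-bit counter:", narrow)
print("32-bit counter:", wide)
```

After the same 1,000 messages, the 32-bit counter reads 33000 while the 16-bit one has wrapped to a negative number, so two systems that are supposed to agree on message numbers no longer do.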

Page 28: Software Reliability

The Drama continues…

Trying to correct it, they copied corrupted data over the backup.
Lost a few hours because of this.

Reference: Wiener, Digital Woes, 1993

Page 29: Software Reliability

Therac-25

Landmark case of how things can go terribly wrong
Medical linear accelerator: radiation therapy for cancer patients
Used to zap tumors with high-energy beams
  Electron beams for shallow tissue
  X-ray photons for deeper tissue
Eleven Therac-25s were installed:
  Six in Canada
  Five in the United States
Developed by Atomic Energy of Canada Limited (AECL).

Page 30: Software Reliability

Therac-25

Improvements over Therac-20:
  Uses new “double pass” technique to accelerate electrons.
  Machine itself takes up less space.
Other differences from the Therac-20:
  Software now coupled to the rest of the system and responsible for safety checks.
  Hardware safety interlocks removed.
  “Easier to use.”

Page 31: Software Reliability

Therac-25 Turntable

[Diagram of the turntable; labeled parts: Counterweight, Field Light Mirror, Beam Flattener (X-ray Mode), Scan Magnet (Electron Mode)]

Page 32: Software Reliability

1985-1987: Six known accidents

June 1985: Patient in Marietta, GA received an overdose
July 1985: Hamilton, Ontario: patient severely burned, died that November
December 1985: Patient in Yakima, WA received an overdose

Page 33: Software Reliability

Vernon Kidd

Early March 1986, Tyler, TX: received a dose > 100 times too high
Complained he felt burned...
  Engineer: It’s not possible for Therac-25 to give an overdose.
  Engineering firm: Machine does not appear capable of giving a patient an electrical shock...
Machine put back in use in late March
Patient died 5 months later

Page 34: Software Reliability

What Went Wrong?

User Interface
  Operator entered code for high energy rather than low energy
  “Malfunction message”
  Operator entered “Proceed” because the system was known to give quirky errors
Result
  Turntable was in the wrong position

Page 35: Software Reliability

3 Weeks Later: Ray Cox

Second accident in Tyler, TX
Same operator
Patient died 1 month later
This time they were able to reproduce the error

Page 36: Software Reliability

What would cause that to happen?

Race conditions.
  Several different race condition bugs.
Overflow error.
  The turntable position was not checked every 256th time the “Class3” variable was incremented.
No hardware safety interlocks.
Wrong information on the console.
Non-descriptive error messages.
  “Malfunction 54”
  “H-tilt”
User-overridable error modes.
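The "every 256th time" overflow follows directly from using a one-byte variable as a flag. The sketch below is a simplified model, not the Therac-25 source: it assumes, following the description above, that a nonzero “Class3” value means "perform the position check" and that the software incremented the variable on each pass instead of setting it to a fixed nonzero value. The function names are hypothetical.

```python
def passes_skipping_check_buggy(total_passes: int) -> list:
    """Flag is incremented each pass; a one-byte value wraps to 0 every 256 passes."""
    class3 = 0
    skipped = []
    for pass_number in range(1, total_passes + 1):
        class3 = (class3 + 1) & 0xFF      # one-byte increment: 255 + 1 becomes 0
        if class3 == 0:                   # 0 is read as "no check needed"
            skipped.append(pass_number)
    return skipped

def passes_skipping_check_fixed(total_passes: int) -> list:
    """Flag is set to a fixed nonzero value, so it can never wrap to 0."""
    skipped = []
    for pass_number in range(1, total_passes + 1):
        class3 = 1                        # assignment instead of increment
        if class3 == 0:
            skipped.append(pass_number)
    return skipped

print(passes_skipping_check_buggy(600))   # [256, 512]
print(passes_skipping_check_fixed(600))   # []
```

A check that silently disappears once in 256 passes is exactly the kind of bug that occurs only occasionally and only when several conditions line up, which is why it survived so long in the field.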

Page 37: Software Reliability

Source of the Bug

Incompetent engineering.
Safety analysis excluded the software!
No usability testing.

Page 38: Software Reliability

Sources

Leveson, N., Turner, C. S., An Investigation of the Therac-25 Accidents. IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41.
http://courses.cs.vt.edu/~cs3604/lib/Therac_25/Therac_1.html
Information for this article was largely obtained from primary sources including official FDA documents and internal memos, lawsuit depositions, letters, and various other sources that are not publicly available.

The authors: Nancy Leveson, Clark S. Turner

Page 39: Software Reliability

Lots more stories

Links will be added to the references section of the web site
http://www5.in.tum.de/~huckle/bugse.html
http://www.baddesigns.com/

Page 40: Software Reliability

Final Discussion

Should Microsoft be held responsible for the business problems and viruses caused by security holes in their software?