2007 MIT BAE Systems Fall Conference: October 30-31

Software Reliability Methods and Experience

Dave Dwyer
USA – E&[email protected]


Overview and outline

• Definitions
• Similarities and differences: hardware and software reliability
• Foundations of Musa’s models reviewed

– Trachtenberg (Trachtenberg, Martin. “The Linear Software Reliability Model and Uniform Testing,” IEEE Transactions on Reliability, 1985, pp 8-16)

– Downs (Downs, Thomas. “An Approach to the Modeling of Software Testing with Some Applications,” IEEE Transactions on Software Engineering, Vol. SE-11, No. 4, April 1985, pp 375-386)

• Instantaneous failure rate, a.k.a. failure intensity
– Hardware: Duane, Codier
– Software: analogous derivation

• Testing results
• SW reliability calculator


SW reliability defined

• Software reliability defined:
– The probability of failure-free operation for a specified time in a specified environment for a specified purpose (“Software Engineering”, 5th edition, I. Sommerville, Addison-Wesley, 1995)

– The probability of failure-free operation of a computer program for a specified time in a specified environment (“Software Reliability”, Musa, Iannino, Okumoto, McGraw-Hill, 1987)

– We will use MTBF or its reciprocal, λ


HW vs. SW reliability

• The hardware reliability discipline gave the impetus to design in safety margins against both mechanical and electrical stresses

• But margins of safety don’t mean much in software because it doesn’t wear out

• Software has ‘x’ failures per million unique executions [if ‘y’ executions/hour, then ‘xy’ failures/million hours]

• Once a process has been successfully executed, that identical process is not going to fail in the future
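The unit conversion in the bullet above ('x' failures per million executions at 'y' executions/hour gives 'xy' failures per million hours) can be sketched as follows. The function names and the sample values are illustrative assumptions, not figures from the talk.

```python
def failures_per_million_hours(x_per_million_exec: float, y_exec_per_hour: float) -> float:
    """Convert 'x' failures per million executions into failures per million hours.

    x / 1e6 failures per execution, times y executions per hour, is
    x * y / 1e6 failures per hour, i.e. x * y failures per million hours.
    """
    return x_per_million_exec * y_exec_per_hour


def mtbf_hours(x_per_million_exec: float, y_exec_per_hour: float) -> float:
    """MTBF is the reciprocal of the failure rate (lambda), expressed in hours."""
    rate_per_hour = failures_per_million_hours(x_per_million_exec, y_exec_per_hour) / 1e6
    return 1.0 / rate_per_hour


# Hypothetical numbers: 2 failures per million executions, 50 executions per hour.
example_rate = failures_per_million_hours(2.0, 50.0)
example_mtbf = mtbf_hours(2.0, 50.0)
```

This treats every execution as equally likely to fail, which is the simplification the slide itself makes.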


Martin Trachtenberg (1985):

• Simulation testing showed that:
– Testing the functions of the software system in a random or round-robin order and fixing the failures gives linearly decaying system error rates

– Testing and fixing each function exhaustively one at a time gives flat system-error rates

– Testing and fixing different functions at widely different frequencies gives exponentially decaying system error rates [operational profile testing], and

– Testing strategies that result in linearly decaying error rates tend to require the fewest tests to detect a given number of errors

– Testing to the operational profile gives the lowest time to reach an operational MTBF


Downs’ ‘Pure’ approach reflected the nature of software (1985)

• The execution of a sequence of M paths

• The actual number of paths affected by a fault is treated as a random variable ‘c’

• Not all paths are equally likely to be executed

$\lambda_j = (N - j)\,\phi$, where:

N = the total number of faults,

j = the number of corrected faults,

$\phi = -r \log(1 - c/M)$,

r = the number of paths executed/unit time
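A minimal sketch of the failure-rate expression above, assuming the per-fault constant is written as phi = -r log(1 - c/M) and that 'c' is taken as a single fixed value rather than a random variable. The function names and the sample values are illustrative assumptions.

```python
import math


def per_fault_rate(r: float, c: float, M: float) -> float:
    """phi = -r * log(1 - c/M): failure rate contributed by one remaining fault.

    r: paths executed per unit time, c: paths affected by a fault, M: total paths.
    """
    if not 0 <= c < M:
        raise ValueError("need 0 <= c < M")
    return -r * math.log(1.0 - c / M)


def failure_rate(N: int, j: int, r: float, c: float, M: float) -> float:
    """lambda_j = (N - j) * phi: failure rate after j of N faults are corrected."""
    if not 0 <= j <= N:
        raise ValueError("need 0 <= j <= N")
    return (N - j) * per_fault_rate(r, c, M)


# Hypothetical values: 1000 paths/hour, each fault touches 1 of 1,000,000 paths.
phi = per_fault_rate(1000.0, 1.0, 1_000_000.0)
initial_rate = failure_rate(66, 0, 1000.0, 1.0, 1_000_000.0)
final_rate = failure_rate(66, 66, 1000.0, 1.0, 1_000_000.0)
```

When c/M is small, -log(1 - c/M) is close to c/M, so phi is close to r(c/M) and the failure rate falls linearly as faults are corrected.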


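Downs’ execution path parameters

[Figure: execution-path diagram. Paths 1, 2, 3, …, M branch from Start; faults x1, x2, …, xN lie on the paths.]

• 2 paths affected by x1
• 1 path affected by x2
• ‘N’ total faults initially
• ‘M’ total paths
• ‘c’ paths affected by an arbitrary fault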


Our data analysis approach

• Cumulative 8-hour test shifts are recorded
• Failures plotted:
– All
– First instance
• The last data point will be put at the end of the test time
• Only integration and system test data


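Failure rate is proportional to failure number, Downs: $\lambda_j \approx (N - j)\,r\,(c/M)$

Given:
N = total initial number of faults
$\lambda(0)$ = initial failure rate => 0 errors detected/corrected (start of testing)
$\lambda_j$ = cumulative failure rate after some number of faults, ‘j’, is detected
j = the number of faults removed over time
$\lambda_i$ = instantaneous failure rate (failure intensity)
T = time

[Figure: j plotted against the failure rate $\lambda_j = j/T$; N is marked on the j axis and the initial failure rate $\lambda(0)$ on the failure-rate axis]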


Failure rate plots against failure number for a range of non-uniform testing profiles, M1, M2 paths and N1, N2 initial faults in those paths

‘Concave’ or logarithmic plots


Instantaneous failure intensity derivation ~ Duane’s for hardware

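Instantaneous for HW:

$$\lambda_c = F/T = kT^{-m}$$
$$F = kT^{(1-m)}$$
$$\lambda_i = \partial F/\partial T = (1-m)\,kT^{-m} = (1-m)\,\lambda_c$$

Instantaneous for SW, with $\phi = -r \log(1 - c/M)$ the per-fault constant in Downs’ failure rate:

$$\lambda_j = j/T = (N - j)\,\phi$$
$$j = (N - j)\,\phi T = N\phi T/(1 + \phi T)$$
$$\lambda_i = \partial j/\partial T = N\phi/(1 + \phi T)^2 = \lambda_j/(1 + \phi T)$$

Same Approach

Similar Result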
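Both results can be checked numerically: differentiate the cumulative failure count by finite differences and compare it with the closed form. This is a sketch under assumptions. The hardware curve is taken as F = k T^(1-m), the software curve as j = N phi T / (1 + phi T), and all parameter values are illustrative.

```python
def hw_cumulative_failures(T: float, k: float, m: float) -> float:
    """Duane-style hardware growth curve: F = k * T**(1 - m)."""
    return k * T ** (1.0 - m)


def hw_instantaneous_rate(T: float, k: float, m: float) -> float:
    """lambda_i = (1 - m) * lambda_c, with cumulative rate lambda_c = k * T**(-m)."""
    return (1.0 - m) * k * T ** (-m)


def sw_cumulative_failures(T: float, N: float, phi: float) -> float:
    """Solve j/T = (N - j) * phi for j: j = N*phi*T / (1 + phi*T)."""
    return N * phi * T / (1.0 + phi * T)


def sw_instantaneous_rate(T: float, N: float, phi: float) -> float:
    """lambda_i = lambda_j / (1 + phi*T), with cumulative rate lambda_j = j/T."""
    lambda_j = sw_cumulative_failures(T, N, phi) / T
    return lambda_j / (1.0 + phi * T)


def central_difference(f, T: float, h: float = 1e-3) -> float:
    """Numerical derivative of f at T."""
    return (f(T + h) - f(T - h)) / (2.0 * h)


# Illustrative parameters only (k, m, N, phi and the times are assumptions).
hw_numeric = central_difference(lambda t: hw_cumulative_failures(t, 2.0, 0.4), 500.0)
hw_closed = hw_instantaneous_rate(500.0, 2.0, 0.4)
sw_numeric = central_difference(lambda t: sw_cumulative_failures(t, 66.0, 1.0 / 431.0), 880.0)
sw_closed = sw_instantaneous_rate(880.0, 66.0, 1.0 / 431.0)
```

In both cases the instantaneous rate is the cumulative rate scaled by a factor below one: (1 - m) for hardware and 1/(1 + phi T) for software.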


Background – test example

• Console operation and operating profile

• Necessity of distinguishing failure priorities:

– Priority 1: “Prevents mission essential capability”

– Priority 2: “Adversely affects mission essential capability with no alternative workaround”

– Priority 3: “Adversely affects mission essential capability with alternative workaround”

• Work shifts varied over test duration: 1-3/day

• Calculation of failure intensity


Corrective action for Priority 2 failures suspended while Priority 1 failures corrected

[Figure: Sum Failures plotted against Failures/8 Hours for Series1, Series2 and Series3, with linear trend lines for Series2 and Series3. Trend-line equations shown: y = -179.88x + 288.61 and y = -176.83x + 349.85]


Codier, Duane 1964 RAMS HW reliability growth

• Ref. Appendix B, Notes on Plotting (Codier, Ernest O., “Reliability Growth in Real Life”, Proceedings, 1968 Annual Symposium on Reliability, New York, IEEE, January 1968, pp 458-469)

– 1. “The latter points, having more information content, must be given more weight than earlier points” (Trachtenberg, too)

– 2. The normal curve-fitting procedures of drawing the line through the “center of gravity” of all the points should not be used

– 3. “Start the line on the last data point and seek the region of highest density of points to the left [right for Musa plots] of it”


How do I draw a growth line through the points on a reliability growth plot?

• Is there one point that is most important?
– Yes, the last point represents the cumulative MTBF to date; it has the most degrees of freedom

• Should the trend line go through that point?
– Yes, it has the best measure of cumulative MTBF

• Would an Excel trend line go through that point?
– No, it’s just a least squares fit with all points weighted the same

• What is the least important point?
– The first; it has the fewest degrees of freedom


Questions: Drawing a line through the points (cont.)

• If the line goes through the last point, what else should it go through?
– The center of density of the other points (ref. back to Duane, Codier)

• What is the center of density?
– The center of density is where the center of mass would be if “The latter points …[are]… given more weight than earlier points”
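One way to turn these plotting rules into a calculation is a weighted least-squares fit that is forced through the last data point. This is only a sketch of that idea, not the procedure used in the talk: the weighting scheme (weight equal to the point's position in the sequence) and the function name are assumptions, and the sample points are made up.

```python
def fit_through_last_point(x, y):
    """Fit y = slope * x + intercept, constrained to pass through the last point.

    Earlier points pull on the slope with weight 1, 2, 3, ... in time order,
    so later points count for more (an assumed weighting, chosen for illustration).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two (x, y) points of equal length")
    x_last, y_last = x[-1], y[-1]
    num = 0.0
    den = 0.0
    for k in range(len(x) - 1):
        weight = k + 1
        dx = x[k] - x_last
        dy = y[k] - y_last
        num += weight * dx * dy
        den += weight * dx * dx
    if den == 0.0:
        raise ValueError("all x values equal the last x; slope is undefined")
    slope = num / den
    intercept = y_last - slope * x_last
    return slope, intercept


# Made-up points lying on j = -400 * rate + 66 (failure rate on x, cumulative failures on y).
rates = [0.10, 0.08, 0.06, 0.05]
failures = [26.0, 34.0, 42.0, 46.0]
slope, intercept = fit_through_last_point(rates, failures)
```

On a plot of cumulative failures against failure rate, the intercept at zero failure rate is the estimate of the total number of faults, so the choice of line directly affects that estimate.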


Example - Priority 1 data plotted

[Figure: Sum Failures (n) plotted against Failures/8 Hours, with linear trend line y = -43.964x + 38.803]


Point estimates vs. instantaneous


The formula for calculation of $\lambda_i$ correlates with interval estimates of failure intensity

From the previous graph: $j = -431\lambda_c + 66$

j        $\lambda_c$    T = j/$\lambda_c$
44.00    0.050    880
41.84    0.055    761
46.16    0.045    1026

Interval estimate:

$\lambda_i = (46.16 - 41.84)/(1{,}026 - 761) = 4.32/265 = 0.016$

From the formula for instantaneous failure intensity:

$\lambda_i = \lambda_c/(1 + \phi T)$, with $\phi = 1/431$ and $T = 880$

$\lambda_i = 0.050/(1 + 880/431) = 0.050/(1 + 2.04) = 0.050/3.04 = 0.016$
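The arithmetic on this slide can be replayed in a few lines. The (j, cumulative rate) pairs and the constant 1/431 are the slide's own values; the variable names are assumptions made for this sketch.

```python
# (j, cumulative failure rate) pairs as listed on the slide.
center = (44.00, 0.050)
earlier = (41.84, 0.055)
later = (46.16, 0.045)


def elapsed_time(j: float, cumulative_rate: float) -> float:
    """T = j / lambda_c, since the cumulative rate is failures divided by time."""
    return j / cumulative_rate


t_center = elapsed_time(*center)
t_earlier = elapsed_time(*earlier)
t_later = elapsed_time(*later)

# Interval estimate: change in failure count over change in time around T = 880.
interval_estimate = (later[0] - earlier[0]) / (t_later - t_earlier)

# Formula: lambda_i = lambda_c / (1 + phi * T), with phi = 1/431 from the fitted line.
phi = 1.0 / 431.0
formula_estimate = center[1] / (1.0 + phi * t_center)
```

Both estimates agree to three decimal places, which is the correlation the slide points out.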


Most recent data plot

[Figure: Failure count - first instance plotted against Failure rate, Lambda]


A calculator has been developed for BAE Systems SW reliability practice 8349714


Priority 1 data graph


Questions?

• Anybody want a grad course in SW Reliability? I need 5 more students

• Rivier College can do that through teleconference (e-mail: [email protected])

• You will solve a real problem @ no charge to your department (except tuition)
