2007 MIT BAE Systems Fall Conference: October 30-31
Software Reliability Methods and Experience
Dave Dwyer
USA – E&[email protected]
Overview and outline
• Definitions
• Similarities and differences: hardware and software reliability
• Foundations of Musa’s models reviewed
– Trachtenberg (Trachtenberg, Martin. “The Linear Software Reliability Model and Uniform Testing,” IEEE Transactions on Reliability, 1985, pp 8-16)
– Downs (Downs, Thomas. “An Approach to the Modeling of Software Testing with Some Applications,” IEEE Transactions on Software Engineering, Vol. SE-11, No. 4, April 1985, pp 375-386)
• Instantaneous Failure Rate, a.k.a. failure intensity
– Hardware - Duane, Codier
– Software - analogous derivation
• Testing results
• SW reliability calculator
SW reliability defined
• Software reliability defined:
– The probability of failure-free operation for a specified time in a specified environment for a specified purpose (“Software Engineering”, 5th edition, I. Sommerville, Addison-Wesley, 1995)
– The probability of failure-free operation of a computer program for a specified time in a specified environment (“Software Reliability”, Musa, Iannino, Okumoto, McGraw-Hill, 1987)
– We will use MTBF or its reciprocal, λ
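As a small illustration of the MTBF/λ relationship, the sketch below converts between the two and evaluates a probability of failure-free operation. The function names are hypothetical, and the survival formula R(t) = exp(-λt) is an added assumption (constant failure rate); the slides themselves only state that λ is the reciprocal of MTBF.

```python
import math


def failure_rate_from_mtbf(mtbf_hours: float) -> float:
    """Return lambda (failures/hour) as the reciprocal of MTBF."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of failure-free operation for mission_hours.

    Assumption (not from the slides): constant failure rate,
    so R(t) = exp(-lambda * t).
    """
    lam = failure_rate_from_mtbf(mtbf_hours)
    return math.exp(-lam * mission_hours)


if __name__ == "__main__":
    # Hypothetical numbers: 200-hour MTBF, one 8-hour test shift.
    print(failure_rate_from_mtbf(200.0))  # 0.005 failures/hour
    print(reliability(200.0, 8.0))
```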
HW vs. SW reliability
• The hardware reliability discipline gave the impetus to provide safety margins against stresses, both mechanical and electrical
• But margins of safety don’t mean much in software because it doesn’t wear out
• Software has ‘x’ failures per million unique executions [if ‘y’ executions/hour, then ‘xy’ failures/million hours]
• Once a process has been successfully executed, that identical process is not going to fail in the future
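The ‘x’ and ‘y’ arithmetic in the third bullet can be sketched as follows. The function names and the sample numbers are hypothetical, and the conversion assumes, as the slide does, that every execution counted in an hour is a unique one.

```python
def failures_per_million_hours(x_failures_per_million_executions: float,
                               y_executions_per_hour: float) -> float:
    """Convert 'x' failures per million unique executions into failures
    per million hours, given 'y' executions per hour (the slide's 'xy')."""
    return x_failures_per_million_executions * y_executions_per_hour


def mtbf_hours(x_failures_per_million_executions: float,
               y_executions_per_hour: float) -> float:
    """MTBF in hours implied by the rate above (reciprocal of lambda)."""
    rate = failures_per_million_hours(x_failures_per_million_executions,
                                      y_executions_per_hour)
    if rate <= 0:
        raise ValueError("failure rate must be positive")
    return 1_000_000 / rate


if __name__ == "__main__":
    # Hypothetical numbers: 2 failures per million executions, 50 executions/hour.
    print(failures_per_million_hours(2, 50))  # 100 failures per million hours
    print(mtbf_hours(2, 50))                  # 10000.0 hours
```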
Martin Trachtenberg (1985):
• Simulation testing showed that:
– Testing the functions of the software system in a random or round-robin order and fixing the failures gives linearly decaying system error rates
– Testing and fixing each function exhaustively one at a time gives flat system-error rates
– Testing and fixing different functions at widely different frequencies gives exponentially decaying system error rates [operational profile testing]
– Testing strategies that result in linearly decaying error rates tend to require the fewest tests to detect a given number of errors
– Testing to the operational profile gives the shortest time to reach an operational MTBF
Downs’ ‘Pure’ approach reflected the nature of software (1985)
• The execution of a sequence of M paths
• The actual number of paths affected by a fault is treated as a random variable ‘c’
• Not all paths are equally likely to be executed
$\lambda_j = (N - j)\,\phi$, where:
N = the total number of faults,
j = the number of corrected faults,
$\phi = -r \log(1 - c/M)$,
r = the number of paths executed/unit time
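A minimal sketch of the failure-rate expression on this slide, writing the per-fault rate -r log(1 - c/M) as `phi`. The function names and the sample values of N, r, c and M are hypothetical and chosen only to show that, when c/M is small, `phi` is close to r(c/M).

```python
import math


def downs_phi(r: float, c: float, M: float) -> float:
    """Per-fault rate: phi = -r * log(1 - c/M)."""
    if not 0 <= c < M:
        raise ValueError("need 0 <= c < M")
    return -r * math.log(1.0 - c / M)


def downs_failure_rate(N: int, j: int, r: float, c: float, M: float) -> float:
    """Failure rate after j of the N initial faults are corrected:
    lambda_j = (N - j) * phi."""
    if not 0 <= j <= N:
        raise ValueError("need 0 <= j <= N")
    return (N - j) * downs_phi(r, c, M)


if __name__ == "__main__":
    # Hypothetical values: 66 faults, 100 paths/hour, 2 of 10,000 paths per fault.
    N, r, c, M = 66, 100.0, 2.0, 10_000.0
    print(downs_phi(r, c, M), r * c / M)  # nearly equal for small c/M
    for j in (0, 33, 66):
        print(j, downs_failure_rate(N, j, r, c, M))
```

The rate falls linearly with the number of corrected faults and reaches zero when j = N.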
Downs’ execution path parameters
[Figure: execution paths 1, 2, 3, …, M branching from Start, with faults x1, x2, …, xN marked on the paths; 2 paths are affected by x1 and 1 path is affected by x2.]
‘N’ total faults initially
‘M’ total paths
‘c’ paths affected by an arbitrary fault
Our data analysis approach
• Cumulative 8-hour test shifts are recorded
• Failures plotted:
– All
– First instance
• The last data point will be put at the end of the test time
• Only integration and system test data
Failure rate is proportional to failure number, Downs: $\lambda_j \approx (N - j)\,r\,(c/M)$
Given:
N = total initial number of faults
$\lambda(0)$ = initial failure rate => 0 errors detected/corrected (start of testing)
$\lambda_j$ = cumulative failure rate after some number of faults, ‘j’, is detected; $\lambda_j = j/T$
j = the number of faults removed over time
$\lambda_i$ = instantaneous failure rate (failure intensity)
T = time
[Figure: plot of j against $\lambda_j = j/T$, with the axis intercepts labeled N and $\lambda_0$.]
Failure rate plots against failure number for a range of non-uniform testing profiles, M1, M2 paths and N1, N2 initial faults in those paths
‘Concave’ or logarithmic plots
Instantaneous failure intensity derivation ~ Duane’s for hardware
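Instantaneous for HW (Duane):
$\lambda_c = F/T = k\,T^{-m}$
$F = k\,T^{(1-m)}$
$\lambda_i = \partial F/\partial T = (1-m)\,k\,T^{-m}$
$\lambda_i = (1-m)\,\lambda_c$

Instantaneous for SW, where $\phi$ is the constant of proportionality between $\lambda_j$ and $(N - j)$:
$\lambda_j = j/T = \phi\,(N - j)$
$j = \phi\,(N - j)\,T$
$\lambda_i = \partial j/\partial T = \phi\,(N - j) - \phi\,T\,(\partial j/\partial T)$
$\lambda_i = \lambda_j - \phi\,T\,\lambda_i$
$\lambda_i = \lambda_j/(1 + \phi\,T)$

Same approach, similar result.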
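The two results compared on this slide, Duane's $\lambda_i = (1-m)\,\lambda_c$ for hardware and $\lambda_i = \lambda_j/(1 + \phi T)$ for software, can be checked numerically by differentiating the cumulative failure count. The sketch below is only a sanity check under stated assumptions: the function names, the symbol `phi` for the slope constant in $\lambda_j = \phi(N - j)$, and the hardware constants k and m are hypothetical choices for illustration.

```python
def duane_cumulative_failures(T: float, k: float, m: float) -> float:
    """Hardware (Duane): F(T) = k * T**(1 - m), so F/T = k * T**(-m)."""
    return k * T ** (1.0 - m)


def sw_cumulative_failures(T: float, N: float, phi: float) -> float:
    """Software: j(T) that solves j/T = phi * (N - j)."""
    return phi * N * T / (1.0 + phi * T)


def numerical_slope(f, T: float, h: float = 1e-3) -> float:
    """Central-difference estimate of df/dT at T."""
    return (f(T + h) - f(T - h)) / (2.0 * h)


if __name__ == "__main__":
    T = 880.0
    k, m = 2.0, 0.4              # hypothetical Duane constants
    N, phi = 66.0, 1.0 / 431.0   # illustrative software constants

    hw_cum = duane_cumulative_failures(T, k, m) / T
    hw_inst = numerical_slope(lambda t: duane_cumulative_failures(t, k, m), T)
    print(hw_inst, (1.0 - m) * hw_cum)        # the two values should agree

    sw_cum = sw_cumulative_failures(T, N, phi) / T
    sw_inst = numerical_slope(lambda t: sw_cumulative_failures(t, N, phi), T)
    print(sw_inst, sw_cum / (1.0 + phi * T))  # the two values should agree
```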
Background – test example
• Console operation and operating profile
• Necessity of distinguishing failure priorities:
– Priority 1: “Prevents mission essential capability”
– Priority 2: “Adversely affects mission essential capability with no alternative workaround”
– Priority 3: “Adversely affects mission essential capability with alternative workaround”
• Work shifts varied over test duration: 1-3/day
• Calculation of failure intensity
Corrective action for Priority 2 failures suspended while Priority 1 failures corrected
[Figure: sum of failures (y-axis, 0 to 400) against failures per 8 hours (x-axis, 0 to 1.2) for Series1, Series2 and Series3, with linear trend lines for Series2 and Series3: y = -179.88x + 288.61 and y = -176.83x + 349.85.]
Codier, Duane 1964 RAMS HW reliability growth
• Ref. Appendix B, Notes on Plotting (Codier, Ernest O., “Reliability Growth in Real Life”, Proceedings, 1968 Annual Symposium on Reliability, New York, IEEE, January 1968, pp 458-469)
– 1. “The latter points, having more information content, must be given more weight than earlier points” (Trachtenberg, too)
– 2. “The normal curve-fitting procedures of drawing the line through the ‘center of gravity’ of all the points should not be used”
– 3. “Start the line on the last data point and seek the region of highest density of points to the left [right for Musa plots] of it”
How do I draw a growth line through the points on a reliability growth plot?
• Is there one point that is most important?
– Yes, the last point represents the cumulative MTBF to date; it has the most degrees of freedom
• Should the trend line go through that point?
– Yes, it has the best measure of cumulative MTBF
• Would an Excel trend line go through that point?
– No, it’s just a least squares fit with all points weighted the same
• What is the least important point?
– The first; it has the fewest degrees of freedom
Questions: Drawing a line through the points (cont.)
• If the line goes through the last point, what else should it go through?
– The center of density of the other points (ref. back to Duane, Codier)
• What is the center of density?
– The center of density is where the center of mass would be if “The latter points …[are]… given more weight than earlier points”
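A minimal sketch of one way to turn this guidance into a calculation: force the line through the last data point and through a weighted centroid of the earlier points. The linearly increasing weights (1, 2, 3, ...) and the function name are assumptions made for illustration; Codier and the slides call for weighting later points more heavily but do not give a formula.

```python
from typing import Sequence, Tuple


def growth_line_through_last_point(xs: Sequence[float],
                                   ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the line through the last point and
    the weighted centroid of all earlier points.

    Points must be in chronological order. Assumption: the k-th earlier
    point gets weight k, so later points count for more.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points of equal length")
    x_last, y_last = xs[-1], ys[-1]
    weights = range(1, len(xs))  # 1, 2, ..., n-1 for the earlier points
    total = sum(weights)
    cx = sum(w * x for w, x in zip(weights, xs[:-1])) / total
    cy = sum(w * y for w, y in zip(weights, ys[:-1])) / total
    if cx == x_last:
        raise ValueError("centroid and last point share the same x value")
    slope = (y_last - cy) / (x_last - cx)
    intercept = y_last - slope * x_last
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data lying exactly on y = -400x + 60.
    xs = [0.10, 0.08, 0.06, 0.05]
    ys = [20.0, 28.0, 36.0, 40.0]
    print(growth_line_through_last_point(xs, ys))
```

Unlike an ordinary least squares fit, this line always passes through the last point, whatever the scatter of the earlier ones.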
Example - Priority 1 data plotted
[Figure: sum of failures (n) on the y-axis (0 to 45) against failures per 8 hours on the x-axis (0 to 0.6), with linear trend line y = -43.964x + 38.803.]
Point estimates vs. instantaneous
The formula for calculation of $\lambda_i$ correlates with interval estimates of failure intensity
From the previous graph: $j = -431\,\lambda_c + 66$

j       λ_c     T = j/λ_c
44.00   0.050   880
41.84   0.055   761
46.16   0.045   1026

Interval estimate:
$\lambda_i = (46.16 - 41.84)/(1{,}026 - 761) = 4.32/265 = 0.016$

From the formula for instantaneous failure intensity:
$\lambda_i = \lambda_c/(1 + \phi T)$, with $\phi = 1/431$ and $T = 880$
$\lambda_i = 0.050/(1 + 880/431) = 0.050/(1 + 2.04) = 0.050/3.04 = 0.016$
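The arithmetic on this slide can be reproduced with a short script. The numbers (the three j and cumulative failure rate pairs, and the constant 1/431) are taken from the slide; the function names and the symbol `PHI` for that constant are hypothetical labels chosen for this sketch.

```python
PHI = 1.0 / 431.0  # reciprocal of the fitted slope magnitude, from the slide

# (j, cumulative failure rate) pairs from the slide's table
TABLE = [(44.00, 0.050), (41.84, 0.055), (46.16, 0.045)]


def elapsed_time(j: float, lam_c: float) -> float:
    """T = j / lambda_c."""
    return j / lam_c


def interval_intensity(j_early: float, t_early: float,
                       j_late: float, t_late: float) -> float:
    """Interval estimate: change in failures over change in time."""
    return (j_late - j_early) / (t_late - t_early)


def instantaneous_intensity(lam_c: float, t: float, phi: float = PHI) -> float:
    """lambda_i = lambda_c / (1 + phi * T)."""
    return lam_c / (1.0 + phi * t)


if __name__ == "__main__":
    times = [elapsed_time(j, lam) for j, lam in TABLE]
    print([round(t) for t in times])  # [880, 761, 1026]

    interval = interval_intensity(41.84, times[1], 46.16, times[2])
    instant = instantaneous_intensity(0.050, times[0])
    print(round(interval, 3), round(instant, 3))  # 0.016 0.016
```

Both estimates agree to three decimal places, which is the correlation the slide points to.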
Most recent data plot
[Figure: failure count, first instance (y-axis, 0 to 70) against failure rate, lambda (x-axis, 0 to 0.1).]
A calculator has been developed for BAE Systems SW reliability practice 8349714
Priority 1 data graph
Questions?
• Anybody want a grad course in SW Reliability? I need 5 more students
• Rivier College can do that through teleconference (e-mail: [email protected])
• You will solve a real problem @ no charge to your department (except tuition)