Dependability of Computing Systems
Transcript of Dependability of Computing Systems
Dependability : ability to deliver service that can justifiably be trusted
Service delivered by a system: its behavior as it is perceived by its user(s)
User: another system that interacts with the former
Function of a system: what the system is intended to do
(Functional) Specification: description of the system function
Correct service: when the delivered service implements the system function
System failure: event that occurs when the delivered service deviates from correct service, either because the system does not comply with the specification, or because the specification did not adequately describe its function
Failure modes: the ways in which a system can fail, ranked according to failure severities
Dependability: ability to avoid failures that are more frequent or more severe than is acceptable to the user(s)
When failures are more frequent or more severe than acceptable: dependability failure
Attributes of dependability:
Availability: readiness for usage
Reliability: continuity of service
Safety: absence of catastrophic consequences on the user(s) and the environment
Confidentiality: absence of unauthorized disclosure of information
Integrity: absence of improper system alterations
Maintainability: ability to undergo repairs and evolutions
Security: absence of unauthorized access to, or handling of, system state; the combination of confidentiality, integrity and availability with respect to authorized actions
Means to attain dependability: fault prevention, fault tolerance, fault removal, fault forecasting
The fundamental chain of threats:
… → Fault → (activation) → Error → (propagation) → Failure → (causation) → Fault → …

Failure: deviation of the delivered service from correct service, i.e., from implementing the system function; either the system does not comply with the specification, or the specification does not adequately describe the function.
Error: part of the system state that may cause a subsequent failure.
Fault: adjudged or hypothesized cause of an error.

A dormant internal fault, once activated, produces an error; the error propagates through the system until it reaches the service interface, where the failure occurs: the transition from correct service to incorrect service (outage). External faults crossing the system boundary can likewise propagate errors into the system.
Means for Dependability
Fault prevention: preventing the occurrence or introduction of faults
Fault tolerance: delivering correct service in spite of faults
Fault removal: reducing the presence of faults
Fault forecasting: estimating the present number, the future incidence, and the likely consequences of faults
The dependability tree:
Attributes: availability, reliability, safety, confidentiality, integrity, maintainability (security groups confidentiality, integrity and availability)
Means: fault prevention, fault tolerance, fault removal, fault forecasting
Threats: faults, errors, failures
Breakdown of the means:
Fault forecasting: ordinal or qualitative evaluation; probabilistic or quantitative evaluation (modeling, operational testing)
Fault tolerance: error detection; system recovery (error handling, fault handling)
Fault removal: during development: verification (static verification, dynamic verification), diagnosis, correction, non-regression verification; during operational life: corrective or preventive maintenance
Fault prevention
Dependability:
Attributes: availability/reliability, safety, confidentiality, integrity, maintainability
Threats: faults (development, physical, interaction), errors, failures
Examples of failures, classifiable by fault class (physical, development, interaction), attribute affected (availability/reliability, safety, confidentiality) and extent (localized, distributed):
June 1980: false alerts at the North American Air Defense (NORAD)
April 1981: first launch of the Space Shuttle postponed
June 1985 – January 1987: excessive radiotherapy doses (Therac-25)
August 1986 – 1987: the "wily hacker" penetrates several tens of sensitive computing facilities
November 1988: Internet worm
15 January 1990: 9-hour outage of the long-distance phone network in the USA
February 1991: Scud missed by a Patriot (Dhahran, Gulf War)
November 1992: crash of the communication system of the London ambulance service
26–27 June 1993: authorization denial of credit card operations in France
4 June 1996: failure of Ariane 5 maiden flight
13 April 1998: crash of the AT&T data network
February 2000: distributed denials of service on large Web sites
May 2000: "I love you" virus
July 2001: Code Red worm
August 2003: propagation of the electricity blackout in the USA and Canada
October 2006: 83,000 email addresses, credit card numbers and online transaction files stolen in the UK
Orders of magnitude of costs:
Maintenance costs: Space Shuttle on-board software: 100 M$ / year
Yearly cost of failures (estimates of insurance companies), accidental / malicious faults:
  UK: 1.25 G£
  France (private sector): 0.9 G€ / 1 G€
  USA: 4 G$
  Global estimate: USA: 80 G$; EU: 60 G€
Cost of software project cancellation (failure of the development process):
  FAA AAS: estimate 1983: 1 G$; estimate 1988 (contract passed): 4 G$; estimate 1994: 7 G$; timing shift (estimate 1994): 6–8 years
  USA [Standish Group, 2002, 13,522 projects]: success 34%, challenged 51%, cancelled 15%; loss ~38 G$ out of a total of 225 G$
Average revenue lost per outage hour by industrial sector (energy, manufacturing, financial institutions, insurance, retail, banking): between 1 and 2.8 millions of euros per hour.
System: entity interacting with other entities, thus with other systems ("black box" view); set of components interconnected in order to interact ("white box" view)
Other systems: the environment of the considered system
Component: another system
End of the decomposition: atomic system
Function: what the system is intended for
Behavior: what the system does to implement its function
Delivered service: the system behavior as perceived by the other system(s) interacting with it, its user(s) (physical or human)
Structure: what enables the system to generate its behavior
Each level of the decomposition tree gives a more or less detailed list of the system components; components are themselves systems, down to atomic systems.
Example fault pathologies:
Error of a programmer → fault: impaired instructions or data → activation (faulty component exercised by the inputs) → error → propagation: the delivered service deviates (in value or timing) from implementing the function → failure
Short-circuit in an integrated circuit (a failure of that circuit) → fault: stuck-at connection, modification of the circuit function → activation → error → propagation → failure
Operator error (inappropriate human-system interaction) → fault → error → propagation → failure
Electromagnetic perturbation → fault: impaired memory data → activation → error → propagation → failure
In each case, a dormant internal fault is activated into an error that propagates up to the service interface, where correct service turns into incorrect service: a failure.
Failure modes:
Domain:
  Value failures [the value of the delivered service deviates from implementing the system function]
  Timing failures [the timing of service delivery (instant or duration) deviates from implementing the system function]
Detectability:
  Signalled failures [delivered service is detected as incorrect, and signalled as such]
  Unsignalled failures [incorrect service delivery is not detected]
Consistency:
  Consistent failures [the incorrect service is identically perceived by all users]
  Inconsistent, or Byzantine, failures [some or all users perceive the incorrect service differently]
Consequences:
  Minor failures [harmful consequences are of similar cost to the benefits provided by correct service delivery]
  •••
  Catastrophic failures [the cost of harmful consequences is orders of magnitude, or even incommensurably, higher than the benefits provided by correct service delivery]

Combining value and timing:
  Value failures (incorrect value, correct timing)
  Early timing failures (correct value, service delivered too early)
  Late timing failures (correct value, service delivered too late)
  Halt failures (suspended service)
  Erratic failures (erratic service)

Failures of the detecting mechanisms themselves: non-signalling of incorrect service: unsignalled failure; signalling of incorrect service in the absence of failure: false alarm.
A system whose failures are all, to an acceptable extent, halt failures: fail-halt system; minor failures: fail-safe system.
Fault classes:
Phase of creation or occurrence:
  Development faults [occur during (a) system development, (b) maintenance during the use phase, and (c) generation of procedures to operate or to maintain the system]
  Operational faults [occur during service delivery of the use phase]
System boundaries:
  Internal faults [originate inside the system boundary]
  External faults [originate outside the system boundary and propagate errors into the system by interaction or interference]
Phenomenological cause:
  Natural faults [caused by natural phenomena without human participation]
  Human-made faults [result from human actions]
Dimension:
  Hardware faults [originate in, or affect, hardware]
  Software faults [affect software, i.e., programs or data]
Intent:
  Malicious faults [introduced by a human with the malicious objective of causing harm to the system]
  Non-malicious faults [introduced without a malicious objective]
Capability:
  Deliberate faults [result of a decision]
  Accidental faults [introduced inadvertently]
  Incompetence faults [result from lack of professional competence by the authorized human(s), or from inadequacy of the development organization]
Persistence:
  Permanent faults [presence is assumed to be continuous in time]
  Transient faults [presence is bounded in time]
Combining these viewpoints yields the three major, partly overlapping fault classes — development faults, physical faults, interaction faults — covering, among others: software flaws, logic bombs, hardware errata, production defects, physical deterioration, physical interference, intrusion attempts, viruses and worms, input mistakes; each combination has a persistence (permanent or transient), an intent, a dimension, a system boundary, a phase of creation, a capability and a phenomenological cause.
Human-made faults:
Intent: non-malicious or malicious
Capability: accidental (mistake), deliberate (bad decision), or incompetence — the latter adjudged by independent professional judgement, by a board of enquiry, or by legal proceedings in a court of law
Sources: interaction (operators, maintainers) and development (designers); individuals and organizations
Malicious faults: malicious logic faults (logic bombs, Trojan horses, trapdoors, viruses, worms, zombies) and intrusion attempts
Failure pathology
Activation of a dormant fault by the computation process, or arrival of an external fault, makes the fault active; an active fault produces error(s). An error is latent or detected; an error can disappear before being detected. Error propagation creates other errors. When an error affects the delivered service, the component fails. A component failure is a fault as viewed by the components interacting with the failed component, and by the system containing it:
… → Fault → (activation or occurrence) → Error → (propagation) → Failure → (causation: interaction, composition) → Fault → …
Activation reproducibility: solid faults vs. elusive faults. Elusive permanent faults and transient faults manifest similarly, as intermittent faults.
Interaction faults presuppose the prior presence of a vulnerability: an internal fault that enables an external fault to harm the system.
The notion of fault provides a facility for stopping the causal recursion; where to stop is context dependent.
Transient operational physical faults: their proportion increases with integration. Reduced geometric dimensions and lower and lower energy levels (limitation of electrical fields, lower supply voltage, reduced noise immunity) mean that environmental perturbations become faults more easily — directly (alpha particles, outer-space particles) or indirectly.
Cosmic rays: low-energy particles are deflected; primaries disappear at ~25 km and cascade to sea level as showers of secondary particles (muons, photons, pions, electrons, neutrons, protons). Typical fluxes: ~100/cm²·s at 12,000 m; ~1/cm²·s at sea level (~1600/m²·s of cascade particles reaching sea level).
Andrew network of CMU, 13 SUN stations, 21 station-years:

                      Mean time to manifestation (h) | Number of manifestations
System failures:      689                            | 298
Permanent faults:     6552                           | 29
Intermittent faults:  183                            | 1056
Tandem experimental data: examination of anomaly log files (several tens of systems, 6 months): 132 recorded software faults: 1 solid fault ("Bohrbug") and 131 elusive faults ("Heisenbugs").
Permanent faults and production yield for Intel microprocessors, 1972–2000 (in FIT and DPM; 1 FIT = 10^-9/h; DPM: defects per million).
Mean time to system crash due to hardware failure for high-end IBM servers, 1985–2000 (ECL-TCM 308X/3090 and 9020, then CMOS G1–G5): MTTF in the range of 10 to 30 years.
Driving forces: complexity and economic pressure.
From J. Gray, "Dependability in the Internet era": availability (from 90% to 99.9999%) of computing systems, phone systems, cellular phones and the Internet, 1950–2000.
Availability and outage duration per year:
0.9        36 d 12 h
0.99       3 d 16 h
0.999      8 h 46 min
0.9999     52 min 34 s
0.99999    5 min 15 s
0.999999   32 s

Availability for 100 h of MTBF, as a function of MTTR:
MTTR ≈ 1.5 h → 98.5%; 1 h → 99.0%; 10 min → 99.83%; 1 min → 99.98%
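The two relations behind these tables — yearly outage duration from availability, and steady-state availability A = MTBF / (MTBF + MTTR) — can be sketched as follows (a minimal illustration, not from the slides; the function names are ours):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def yearly_outage_seconds(availability):
    # Expected outage duration per year for a given steady-state availability.
    return (1.0 - availability) * SECONDS_PER_YEAR

def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    # Steady-state availability: A = MTBF / (MTBF + MTTR).
    return mtbf_hours / (mtbf_hours + mttr_hours)

# 0.999 availability -> about 8 h 46 min of outage per year, as in the table.
```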
Website uptime statistics (Netcraft): uptime (hours) of the top 50 most requested sites and of the top 50 longest running sites (July 2006), and evolution over time (Dec. 2003, Nov. 2004, July 2005, March 2006, July 2006) of the average and maximum uptimes (roughly 200 to 2000 hours).
Failure causes, Windows 2000: system configuration 34%, hardware 22%, core software 13%, drivers 27%, anti-virus 4%.
Failure sources of Web sites (3 sites, 6 months, 500–5000 servers per site): hardware 15%, software 34%, operator errors 51%.
Fault removal: reducing the presence of faults: verification, diagnosis, correction, non-regression verification.
Verification: checking whether the system satisfies verification conditions (general or specific).
Static verification (no execution): reviews and inspections and static analysis (on the system itself), theorem proving (on a system model), model checking (on a system behavior model).
Dynamic verification (effective execution): symbolic execution (symbolic inputs, on a system model), test (valued inputs, on the system itself).
Verification applies to the specification, the design, and the implementation.
❉ Verification of the specification, and of system conformance to the specification: validation.
Reviews and inspections ↔ testing.
Schneider Electric data (nuclear control software): verification effort (man·days, up to ~300) and number of uncovered faults, for component static analysis, unit tests, integration tests, and qualification tests.
HP data, uncovered faults per hour: operational testing 0.21; functional testing 0.28; structural testing 0.32; inspections 1.06.
Space Shuttle on-board software (500,000 lines of code): faults per KSLOC across versions (5/83 to 12/85 to 6/91, range 0–30), split between product (operation), process (testing), and early detection (pre-build).
Cost of software verification (average figures; for critical systems, ≥ 70%):
Life-cycle cost including maintenance: maintenance 67% — corrective 20% [D,I,V], adaptive 25% [S,D,I,V], perfective 55% [R,S,D,I,V] — verification 15%, implementation 7%, design 5%, specification 3%, requirements 3%.
Development only: verification 45%, implementation 22%, design 17%, specification 8%, requirements 7%.
Total share of verification: 46%.
Relative cost of fault removal:

                               1976 (Boehm) | 1995 (Baziuk)
Requirements & specification   1            | 1
Design                         5 – 7.5      |
Implementation                 8 – 15       |
Integration                    10 – 25      | 90 – 440
Qualification                  25 – 75      | 440
Operation/Maintenance          50 – 500     | 470 – 880

The cost of removing a fault is all the higher as removal comes late.
IBM data: percentage of faults introduced by phase, faults corrected by phase, and cost of fault correction by phase (implementation, unit testing, integration testing, qualification testing, operational life).
Percentage of erroneous corrections with respect to the number of removed faults: functional specs review 1%, detailed specs review 2%, inspection of component logic 2%, inspection of component code 3%, unit testing 4%, functional testing 6%, system testing 10%.
Typical distribution of software faults uncovered during development (percentage of uncovered faults, 0–40%):
Incorrect or poorly expressed requirements
Incorrect or poorly expressed specification
Design-implementation faults affecting several components
Design-implementation faults affecting one component
Typographical faults
Regression faults
Others
Persistent software faults classified by manifestation frequency (200 faults examined, 100 per project):

Category                                                   Project A (500 K ins.) | Project B (100 K ins.) | Total
1. Omitted logic (existing code too simple)                36 | 24 | 60
2. Non re-initialization of data                           17 |  6 | 23
3. Regression fault                                         5 | 12 | 17
4. Documentation fault (correct software)                  10 |  6 | 16
5. Inadequate specification                                10 |  1 | 11
6. Faulty binary correction                                 0 | 11 | 11
7. Erroneous comment                                        0 | 11 | 11
8. IF instruction too simple                                9 |  2 | 11
9. Faulty data reference                                    6 |  4 | 10
10. Data alignment fault (left bits vs. right bits, etc.)   4 |  3 |  7
11. Timing fault causing data loss                          3 |  3 |  6
12. Non-initialization of data                              4 |  1 |  5
13. Other categories of lesser importance (each ≤ 4)
Fault density:
Created faults: 10 to 300 / KLOC
Residual faults: 0.01 to 10 / KLOC
Example: increments of AT&T ESS-5:

Size of     Density of faults uncovered   Density of faults uncovered   Learning factor   Type of software
increment   during development (x)        in the field (y)              (y/x)
42069       28.5                          7.2                           0.25              Preventive maintenance
5422        67.3                          21.0                          0.31              Billing
9313        79.3                          27.7                          0.35              On-line upgrading
14467       26.5                          7.2                           0.27              System growth
165042      101.6                         5.3                           0.05              Hardware fault recovery
16504       84.1                          2                             0.02              Software fault recovery
38737       149.4                         5.8                           0.04              System integrity
Design faults ("errata") of Intel processors (total vs. still present in Jan. 99; up to ~140 per processor): Pentium [Mar. 95], Pentium Pro [Nov. 95], Pentium II [May 97], Mobile P. II [Aug. 98], Celeron [Apr. 98], Xeon [May 98].
FAULT FORECASTING: estimation of the presence, creation, and consequences of faults.
Ordinal or qualitative evaluation: identification, analysis of consequences, and classification of failures → dependability attributes.
Probabilistic or quantitative evaluation: probabilistic evaluation of the extent to which some dependability attributes are satisfied → dependability measures:
  Modelling: behavior model of the system with respect to failures, maintenance actions, and solicitations.
  Operational testing: evaluation test according to the operational input profile.
ORDINAL EVALUATION: FME(C)A.
PROBABILISTIC EVALUATION: reliability diagrams, fault trees, state diagrams, Markov chains, Petri nets.
Failure intensity over time — three regimes:
Stable reliability: times to failure are stochastically identical; the ability to deliver correct service is preserved (stationary process).
Reliability growth: times to failure are stochastically increasing; the ability to deliver correct service is improved (non-stationary process).
Reliability decrease: times to failure are stochastically decreasing; the ability to deliver correct service is degraded (non-stationary process).
IBM data: average number of unique faults per 3 months, by relative month where the failure was first reported (months 1–35), for successive versions (before version X through X+4).
AT&T field data: hardware replacement rate trend over three years (packs per 1k lines per month).
Operational testing
Required effort: an increasing function of the number of observed failures.
Zero-failure case:
  Discrete time: N ≥ ln(α) / ln(1 − p0) executions
  Continuous time: T ≥ − ln(α) / λ0 hours
where p is the failure probability per execution, p0 the reliability objective, and 1 − α the confidence level: N is the number of executions such that P{p ≤ p0} ≥ 1 − α.
With f observed failures, N must satisfy Σ_{j=0..f} C(N,j) p0^j (1 − p0)^(N−j) ≤ α.

Zero failure, discrete time — number of program executions N:

p0      α = 10^-1   10^-2     10^-3     10^-4
10^-1   23          46        69        92
10^-2   230         461       691       921
10^-3   2303        4605      6908      9210
10^-4   23026       46052     69078     92103
10^-5   230259      460517    690776    921034
10^-6   2302585     4605170   6907755   9210340

Zero failure, continuous time — program execution duration T:

λ0 (h^-1)   α = 10^-1    10^-2        10^-3        10^-4
10^-1       1 day        2 days       3 days       4 days
10^-2       10 days      20 days      1 month      1.3 months
10^-3       3.2 months   6.4 months   9.6 months   1 year
10^-4       2.6 years    5.3 years    7.9 years    10.5 years
10^-5       26.2 years   52.3 years   78.9 years   105.1 years
10^-6       262.8 years  525.7 years  788.6 years  1051.4 years

With f failures (p0 = 10^-3, α = 10^-2):

f   1      2      3       4       5       6       7       8       9
N   6636   8403   10042   11601   13105   14567   15996   17398   18779
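The table sizes above can be recomputed directly from the binomial condition; a small sketch (function names are ours, standard library only):

```python
import math

def zero_failure_executions(p0, alpha):
    # Smallest N with (1 - p0)^N <= alpha, i.e. N >= ln(alpha) / ln(1 - p0).
    return math.ceil(math.log(alpha) / math.log(1.0 - p0))

def required_executions(f, p0, alpha):
    # Smallest N such that the probability of observing at most f failures,
    # if the true per-execution failure probability were p0, stays below alpha:
    #   sum_{j=0..f} C(N,j) p0^j (1-p0)^(N-j) <= alpha
    n = zero_failure_executions(p0, alpha)
    while True:
        cdf = sum(math.comb(n, j) * p0**j * (1.0 - p0)**(n - j)
                  for j in range(f + 1))
        if cdf <= alpha:
            return n
        n += 1
```

For p0 = 10^-3 and α = 10^-2 this gives N = 4603 for zero failures (the table's 4605 comes from the approximation ln(1 − p0) ≈ −p0) and N = 6636 for one failure, matching the f-failures table.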
Study of the FAA Airworthiness Directives (ADs), 1 Jan. 1980 – 21 Sept. 1994.
Confirmed avionics ADs: 33 (hardware: 20, software: 13). Equipments: Rockwell/Collins, Bendix/King, Honeywell/Sperry, Tracor Aerospace.
Estimation of software reliability: failure rates (h^-1) in the range 10^-6 to 10^-10 for DME (88), DME (94), TCAS II (91), TCAS (94), Omega, ATC Transponder, plus the average; for some equipments no software problem was ever reported.
FAULT TOLERANCE: delivering service implementing the system function in spite of faults.
Error detection: identification of the presence of an error:
  Concurrent detection, during service delivery.
  Preemptive detection: service delivery suspended; search for latent errors and dormant faults.
System recovery: transformation of the erroneous state into a state free from the detected error and from faults that could be activated again.
Error handling: removal of errors from the system state, if possible before a failure occurs:
  Rollback: brings the system back to a state saved prior to the error occurrence; the saved state is a recovery point.
  Rollforward: a new state, free from detected errors, is found.
  Compensation: the erroneous state contains enough redundancy to enable error masking.
Fault handling: prevents fault(s) from being activated again:
  Diagnosis: identifies and records the cause(s) of the error, in terms of localization and category.
  Isolation: physical or logical exclusion of the faulty component(s) from further contribution to service delivery, i.e., makes the fault(s) dormant.
  Reconfiguration: either switches in spare components or reassigns tasks among non-failed components.
  Reinitialization: checks, updates and records the new configuration, and updates system tables and records.
Adding error detection mechanisms to a component yields a self-checking component.
Intermittent faults: isolation and reconfiguration are not necessary; error handling together with non-recurrence of the error, and fault diagnosis showing the absence of a fault, identify the fault as intermittent.
Usual strategies: error detection and system recovery ("detection–recovery"), or fault masking and system recovery ("masking"); compensation can be applied systematically, even in the absence of errors.
Effectiveness of error processing: the coverage factor
c = P{service is delivered | failed component}
A component failure is covered if the resulting system error is successfully processed.

Duplex system, active redundancy; component failure rate λ, repair rate µ, c̄ = 1 − c:
Reliability: R(t) ≈ exp{−2 (c̄ + λ/µ) λ t}
MTTF ≈ 1 / (2 (c̄ + λ/µ) λ)
Unavailability: Ā = 1 − A ≈ 2 (c̄ + λ/µ) (λ/µ); non-redundant system: Ā_NR = λ/µ
Equivalent failure rate: λ_eq = 2 (c̄ + λ/µ) λ = 2 c̄ λ [non-covered failure] + 2 (λ/µ) λ [two successive failures, the first one covered]
The coverage factor becomes influential as soon as c̄ is comparable to λ/µ.

Plots of λ_eq (vs. λ/µ from 10^-4 to 10^-2) and of R(t) (vs. λt, for λ/µ = 10^-3) for c = 0.9, 0.95, 0.99, 0.995, 0.999, 1.
Unavailability and outage duration per year:
  10^-2: 3.6 d/yr; 10^-3: 8.8 h/yr; 10^-4: 53 min/yr; 10^-5: 5 min/yr; 10^-6: 30 s/yr; 10^-7: 3 s/yr
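The duplex-system formulas above translate into a small calculator (a sketch under the same approximations; the function names are ours):

```python
import math

def duplex_metrics(lam, mu, c):
    # Duplex system with active redundancy: component failure rate lam,
    # repair rate mu, coverage factor c (cbar = 1 - c).
    cbar = 1.0 - c
    lam_eq = 2.0 * (cbar + lam / mu) * lam            # equivalent failure rate
    mttf = 1.0 / lam_eq                               # MTTF ~ 1 / lam_eq
    unavailability = 2.0 * (cbar + lam / mu) * (lam / mu)
    return lam_eq, mttf, unavailability

def duplex_reliability(lam, mu, c, t):
    # R(t) ~ exp{-2 (cbar + lam/mu) lam t}
    return math.exp(-2.0 * ((1.0 - c) + lam / mu) * lam * t)

# With lam/mu = 1e-3, dropping coverage from 1 to 0.99 multiplies the
# equivalent failure rate by 11: coverage dominates once 1 - c ~ lam/mu.
```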
Tandem architecture: duplicated Dynabus, processor modules (CPU, memory, I/O channels, Dynabus interface), duplicated disk, tape and VDU controllers.
Experimental data: systems: 9,000; processors: 25,500; disks: 74,000; system MTBF: 21 years.
MTBF (years, up to ~450) by failure cause: hardware, software, operations, maintenance, environment; total.
Number of reported failures by length of the fault chain having led to the failure (lengths 1 to 8): 71, 183, 53, 21, 1, 1, 1.
Distributed system: a set of processors (processing and storage) communicating by messages via a communication network.
Hierarchy of fault-tolerant protocols, according to the "uses the service of" relation: group membership, atomic multicast, clock synchronization. Common mechanism: consensus.
Fail-silent processors (implies self-checking processors). Consequences:
1) Only correct-value messages are sent ➠ minimum number of processors for consensus in the presence of t faults: n ≥ t+2
2) Error detection: halt detection by interrogating and waiting
3) Resource replication for tolerating t faults ("halts"): ≥ t+1
4) Communication network saturation by "babbling" is impossible
5) Network architecture

Arbitrary-failure processors. Induced complications:
1) Possibility of inconsistent ("Byzantine") behavior ➠ minimum number of processors for consensus in the presence of t faults: n ≥ 3t+1, with t+1 rounds of message exchanges
2) Error detection ≠ halt detection
3) Resource replication for tolerating t faults: ≥ 2t+1
4) Communication network saturation by "babbling" is possible
5) Possibility of "lying" about the source address
6) Network architecture
Example: interactive consistency with 4 processors, P4 faulty. Each processor first broadcasts its private value (P1: 1, P2: 2, P3: 3); the faulty P4 sends inconsistent values (9 to P1 and P3, 4 to P2). In a second exchange, every processor rebroadcasts the vector of values it received from the other processors (the faulty P4 again relaying arbitrary values). Each correct processor then takes, for every position, a majority vote over the reports it holds: P1, P2 and P3 all decide the same vector (1, 2, 3, 9) — they agree on the correct values and on a common value for the faulty processor.
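The two-round exchange and majority vote can be simulated; a simplified sketch (the names are ours, and the faulty processor lies as a sender but, unlike the worst case, relays honestly):

```python
from collections import Counter

def interactive_consistency(private, faulty, lies):
    # private[k]: value of processor k; faulty: index of the faulty processor;
    # lies[i]: value the faulty processor sends to processor i in round 1.
    n = len(private)
    # Round 1: recv[i][k] = value processor i received from processor k.
    recv = [[None] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            if i != k:
                recv[i][k] = lies[i] if k == faulty else private[k]
    # Round 2: each processor j relays recv[j][k]; processor i then takes,
    # for every source k, a majority vote over its direct reception and
    # the relayed reports.
    decided = []
    for i in range(n):
        vector = []
        for k in range(n):
            if k == i:
                vector.append(private[i])
                continue
            claims = [recv[i][k]] + [recv[j][k] for j in range(n) if j not in (i, k)]
            vector.append(Counter(claims).most_common(1)[0][0])
        decided.append(vector)
    return decided

# P1..P3 hold 1, 2, 3; the faulty P4 sends 9, 4, 9 to P1, P2, P3:
# every correct processor decides the same vector (1, 2, 3, 9).
```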
Practical aspects
Performance overhead due to the volume of exchanged data: t=1: 9 values received by each processor; t=2: 156 values received by each processor.
Use of signatures: signatures attached to the exchanged data such that the probability of an undetected corruption is negligible:
➜ 2t+1 processors and communication channels
➜ t+1 exchanges
➜ reduction of the volume of data exchanged: t=1: 4 values received by each processor; t=2: 40 values received by each processor
Another use, for data with likely variations (e.g., sensor readings): median vote. Each processor takes the median of the values it holds: with readings (176, 178, 233), every correct processor selects 178; even when a faulty processor distributes inconsistent readings, the medians chosen by the correct processors (e.g., 176, 177, 178) stay within the range of the correct values.
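A minimal sketch of the median vote (the function name is ours):

```python
import statistics

def median_vote(readings):
    # With 2t+1 sources and at most t faulty, the median always lies
    # between correct values (here t = 1, three sources).
    return statistics.median(readings)

# The slide's readings: (176, 178, 233) -> 178; (176, 177, 178) -> 177.
```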
Development fault tolerance
Prevention of error propagation — fail-fast: error detection (defensive programming) and exception handling.
Service continuity: error detection and recovery points (soft faults); design diversity (solid faults):
  Recovery blocks: two-fold diversity + acceptance test
  N-version programming: three-fold diversity + vote
  N-self-checking programming: four-fold diversity + switching
  Double programming: two-fold diversity + comparison
Example of recovery points: Tandem process pairs.
Statistics on 169 software failures: 1 processor halted: 138; several processors halted: 31. Efficiency: 82% (138/169) ❉ does not account for under-reporting.
Causes of multiple processor halts: second halt on the processor executing the backup of the failed primary: 24 (same fault in primary and backup: 17; different fault: 4; unclear: 3); not process-pair related: 4; unclear: 3.
Design diversity
Aim: fault independence. Obstacles: common specification, common difficulties, inter-variant synchronizations and decisions.
Operational use: civil avionics (generalized, at differing levels), railway signalling (partial), nuclear control (partial).
Dependability improvement: a gain, although smaller than for physical fault tolerance; contribution to the verification of the specification.
CRA–NCSU–RTI–UCLA/UCSB–UIUC–UVA experiment: sensor management in a redundant inertial platform; 20 programs (1,600 to 5,000 lines); 920,746 tests on a flight simulator. Average failure probability of a single version: 5.3×10^-3; of 3-version systems: 4.2×10^-4. Reliability improvement: ×13.
Architecture of Airbus computers: each computer comprises a control channel and a monitoring channel, each with its own processor, memory, I/O and power supply (28 V), with protection against lightning, EMI, and over/under voltage.
Airbus A-320 flight control computers: ELAC1/ELAC2, SEC1/SEC2/SEC3 (spoilers), FAC1/FAC2 (rudder), SFCC1/SFCC2 (slats and flaps), controlling the ailerons, elevators, horizontal plane, spoilers, rudder, slats and flaps.
Boeing 777 primary flight control: three PFCs (Primary Flight Computers) — left, center, right — each built from three dissimilar lanes (Motorola 68040, Intel 80486, AMD 29050). The PFCs, the ACEs (Actuator Control Electronics), the pilots' controls and other systems communicate over three ARINC 629 data buses (L, C, R); four ACEs drive the flight surface actuators (ailerons, flaps, spoilers, elevator, rudder).
Verification and evaluation of fault tolerance: faults (deficiencies) may affect the algorithms and mechanisms of fault tolerance themselves.
Fault tolerance coverage: evaluation and influence (fault forecasting: modelling, test) and improvement (fault removal: test, dynamic verification, static verification, model checking).
Fault injection: applying inputs and faults to a target system, and observing the outputs (correct/incorrect) and the activity of error detection and of error and fault handling.
Techniques:
  Physical injection, by hardware (radiations, interferences, pins), on a prototype or the actual system.
  Informational injection, by software (memory, executive, processor), on a prototype or actual system, or in simulation (on a simulation model).
Key issue: representativity of the injected faults.
Delta-4 system (LAAS): fault injection results along the chain faults → errors → detected errors → tolerated errors → failures: 94%, 85%, 99%, versus 3%, 1%, 12%.
Robustness testing of POSIX calls (CMU): failure probability (0 to 0.25) of ten operating systems: Digital Unix 4.0, FreeBSD 2.2.5, HP-UX B.10.20, AIX 4.1, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, QNX 4.2.4, SunOS 5.5.
Test of Chorus ClassiX r3 (LAAS): percentage of responses (error status, no observation, kernel debug, system hang, exception, application hang, incorrect results; 0–90%), for internal and external faults (average and min–max).
Why dependability: integration, interconnection, performance; vanishing substitutes for computers; decreasing natural robustness; unavoidability of faults; ill-mastered complexity; dependency ("funnel factor").

Construction of reliable systems: an all-time concern.
[Fiabilité — from fiable, fiableté, 12th century — the character of what is worthy of trust]
Traditional, pre-computer approach: functional design + over-dimensioning.
Reliability approach for computing systems: the discrete and abstract character of digital computers makes over-dimensioning not (currently?) foreseeable; awareness that all faults are development faults leads to a dependability-explicit development approach.