Fault Tolerant Computing Basics
Dan Siewiorek
Carnegie Mellon University
June 2012
Preview
• Many terms have multiple usages that can lead to confusion when used out of context
• Sources of error
• Faults go through at least ten stages from inception to repair, so the designer had better plan for all ten stages
• Relationship between the sequence of events in handling a fault and mathematical measures
Outline
• Introduction
• Definitions
• Sources of Errors
Introduction
WHY RELIABILITY?
Three of the driving factors:
• Critical applications
– Computer outage or error can cause loss of money, time, life
– No longer just in aerospace, but in more mundane applications (customer expectations)
• Increasing system complexity
– More components, more likelihood of failure (counter: increased reliability of VLSI)
– Lower signal/noise ratios at higher VLSI speeds, more likelihood of transient errors
– Diagnosis more difficult, downtime is longer, repair costs rise, increased inventory costs too
• Relative cost is less
AVAILABILITY EXAMPLE
• 90 MINUTES DOWNTIME PER WEEK
• AVAILABILITY 0.991
• RESERVATION SYSTEM -- $36,000/MINUTE DOWN
• $3.24 MILLION PER WEEK
• 0.1% AVAILABILITY = 10 MINUTES = $360,000.00
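The arithmetic on this slide can be checked with a short Python sketch. The variable names are illustrative, and a week of continuous operation (7 days of 24 hours) is assumed.

```python
# Assumption: a week of continuous operation (7 days x 24 hours x 60 minutes).
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

downtime_minutes = 90           # from the slide
cost_per_minute = 36_000        # dollars per minute down, from the slide

availability = 1 - downtime_minutes / MINUTES_PER_WEEK
weekly_cost = downtime_minutes * cost_per_minute

# Downtime that corresponds to 0.1% of a week, and what it costs.
tenth_percent_minutes = 0.001 * MINUTES_PER_WEEK
tenth_percent_cost = round(tenth_percent_minutes) * cost_per_minute

print(f"availability       = {availability:.3f}")
print(f"weekly outage cost = ${weekly_cost:,}")
print(f"0.1% of a week     = {tenth_percent_minutes:.2f} minutes")
print(f"cost of 0.1%       = ${tenth_percent_cost:,}")
```

The point of the last two lines is that each 0.1% of availability gained or lost is worth roughly ten minutes of service per week.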
Univac I Checkers
Parity
• Memory
• Input to function table
• Output from function table, odd number of selected gates. Dummy lines preserve parity
• Unitypes
1-of-n
• Intermediate line function table
• Memory bank select
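As a rough illustration of the two check types named above, the sketch below computes an odd-parity bit over a data word and validates a 1-of-n (one-hot) code. This is generic Python, not a model of the Univac I circuitry; the function names and bit widths are assumptions.

```python
def odd_parity_bit(word: int) -> int:
    """Return the check bit that makes the total number of 1s odd."""
    ones = bin(word).count("1")
    return 0 if ones % 2 == 1 else 1


def odd_parity_ok(word: int, check_bit: int) -> bool:
    """True when data bits plus check bit contain an odd number of 1s."""
    return (bin(word).count("1") + check_bit) % 2 == 1


def one_of_n_ok(lines: int) -> bool:
    """True when exactly one select line is active (1-of-n code)."""
    return lines != 0 and (lines & (lines - 1)) == 0


word = 0b1011001                       # four 1s, so the check bit must be 1
check = odd_parity_bit(word)
print(check, odd_parity_ok(word, check))
print(odd_parity_ok(word ^ 0b0000100, check))    # a single flipped bit is caught
print(one_of_n_ok(0b0100), one_of_n_ok(0b0110))  # one line active vs. two
```

Parity catches any odd number of flipped bits, while the 1-of-n check catches both "no line selected" and "more than one line selected".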
Univac I Checkers (cont’d)
Duplication
• Registers
• Adder
• Comparator
• Multiplier-quotient coupler
• Bus amplifier
• Bus interface
Automatic voltage monitoring system tests every DC voltage at a rate of one per minute
“720 checker” counts 720 characters per I/O block
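Duplication detects an error by running the same operation on two copies of a unit and comparing the results. A minimal sketch of that idea, together with a block-length check in the spirit of the “720 checker”, might look as follows; the function names and the injected fault are hypothetical.

```python
class MismatchError(Exception):
    """Raised when the two copies of a duplicated unit disagree."""


def duplicated(unit_a, unit_b, *args):
    """Run both copies and compare; a mismatch signals a fault in one of them."""
    a, b = unit_a(*args), unit_b(*args)
    if a != b:
        raise MismatchError(f"copies disagree: {a!r} != {b!r}")
    return a


def block_length_ok(block: str, expected: int = 720) -> bool:
    """Check that an I/O block holds the expected number of characters."""
    return len(block) == expected


def good_adder(x, y):
    return x + y


def faulty_adder(x, y):
    return (x + y) ^ 1   # hypothetical fault: low-order bit flipped


print(duplicated(good_adder, good_adder, 2, 3))
try:
    duplicated(good_adder, faulty_adder, 2, 3)
except MismatchError as err:
    print("detected:", err)
print(block_length_ok("x" * 720), block_length_ok("x" * 719))
```

Comparison alone says that one copy is wrong, not which one, so duplication detects faults but does not mask them.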
Modern Microprocessor checkers
DEFINITIONS & THE LIFE OF A FAULT
Definitions
RELIABILITY: SURVIVAL PROBABILITY
• When repair is costly or function is critical
AVAILABILITY: THE FRACTION OF TIME A SYSTEM MEETS ITS SPECIFICATION
• When service can be delayed or denied
REDUNDANCY: EXTRA HARDWARE, SOFTWARE, TIME
Stages in the development of a system
| STAGE | ERROR SOURCES | ERROR DETECTION |
|---|---|---|
| Specification & design | Algorithm design; Formal specification | Simulation; Consistency checks, model checking |
| Prototype | Algorithm design; Wiring & assembly; Timing; Component failure | Stimulus/response testing |
| Manufacture | Wiring & assembly; Component failure | System testing; Diagnostics |
| Installation | Assembly; Component failure | System testing; Diagnostics |
| Field Operation | Component failure; Operator errors; Environmental factors | Diagnostics |
Cause-effect sequence
FAILURE: component does not provide service
FAULT: deviation of logic function from design value
• Hard, Transient
ERROR: manifestation of a fault by incorrect value
Fault Classification
DURATION:
• Transient - design errors, environment
• Intermittent - repair by replacement
• Permanent - repair by replacement
EXTENT:
• Local (independent)
• Distributed (related)
VALUE:
• Determinate (stuck at X)
• Indeterminate (variable)
Basic Steps in Fault Handling
• Fault Confinement -- contain it before it can spread
• Fault Detection -- find out about it to prevent acting on bad data
• Fault Masking -- mask effects
• Retry -- since most problems are transient, just try again
• Diagnosis -- figure out what went wrong as prelude to correction
• Reconfiguration -- work around a defective component
• Recovery -- resume operation after reconfiguration in degraded mode
• Restart -- re-initialize (warm restart; cold restart)
• Repair -- repair defective component
• Reintegration -- after repair, go from degraded to full operation
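A few of these steps can be sketched in code. The example below, a simplified illustration rather than a complete fault-handling scheme, detects a failed operation, retries it on the assumption that the fault is transient, and reconfigures to a spare if the retries are exhausted. All names (`run_with_fault_handling`, `make_flaky`, `spare`, the retry count) are hypothetical.

```python
def run_with_fault_handling(primary, spare, retries=2):
    """Detect -> retry -> reconfigure. Returns (result, log of steps taken)."""
    log = []
    for attempt in range(1 + retries):
        try:
            return primary(), log            # normal operation
        except RuntimeError as fault:        # detection
            log.append(f"detected on attempt {attempt + 1}: {fault}")
            if attempt < retries:
                log.append("retry")          # most faults are transient
    log.append("reconfigure to spare")       # work around the defective unit
    return spare(), log                      # recovery, possibly degraded


def make_flaky(failures):
    """Hypothetical unit that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def unit():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("bad result")
        return "primary ok"

    return unit


def spare():
    return "spare ok"


print(run_with_fault_handling(make_flaky(1), spare))   # transient: retry succeeds
print(run_with_fault_handling(make_flaky(99), spare))  # hard fault: spare takes over
```

Diagnosis, repair, and reintegration are left out here; in a real system they would run after the spare has taken over.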
MTBF -- MTTD -- MTTR
$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$
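As a quick numerical check of the formula, the sketch below evaluates availability for a hypothetical system. The MTTF and MTTR values are made up for illustration, not taken from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical values: one failure every 1,000 hours, 2 hours to repair.
a = availability(1000.0, 2.0)
print(f"availability = {a:.4f}")

# Shortening repair time raises availability without changing the failure rate.
print(f"with MTTR = 1 h: {availability(1000.0, 1.0):.4f}")
```

Because MTTR appears only in the denominator, faster detection and repair improve availability even when the components themselves are no more reliable.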
Error Containment Levels
For distributed systems there are additional levels
• Containment to a single node or FTU
• Containment to a single bus or subsystem
• Containment to a single vehicle/piece of equipment in a national infrastructure
Sources of Errors
“Mainframe” Outage Sources
(* the sum of these sources was 0.75)
Summary of Tandem Reported System Outage Data
| | 1985 | 1987 | 1989 |
|---|---|---|---|
| Customers | 1000 | 1300 | 2000 |
| Outage Customers | 176 | 205 | 164 |
| Systems | 2400 | 6000 | 9000 |
| Processors | 7000 | 15,000 | 25,500 |
| Discs | 16,000 | 46,000 | 74,000 |
| Reported Outages | 285 | 294 | 438 |
| System MTBF | 8 years | 20 years | 21 years |
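The System MTBF row is consistent with dividing the number of systems by the number of reported outages, if the outage counts are read as per-year figures. That reading is an assumption, not something the slide states; the sketch below simply checks the arithmetic.

```python
# Figures from the table above; treating outages as annual counts is an assumption.
data = {
    1985: {"systems": 2400, "outages": 285},
    1987: {"systems": 6000, "outages": 294},
    1989: {"systems": 9000, "outages": 438},
}

# System-years of operation per reported outage, for each reporting year.
mtbf = {year: row["systems"] / row["outages"] for year, row in data.items()}

for year, years_between_outages in mtbf.items():
    print(f"{year}: about {years_between_outages:.1f} years between outages per system")
```

Rounded to whole years, these values match the 8, 20, and 21 years reported on the slide.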
Tandem Causes of System Failures
(Up is good; down is bad)
Tandem Hardware Causes of Outage
• Disks 49%
• Communications 24%
• Processors 18%
• Timing 9%
• Spares 1%
Tandem Operations Causes of Outage
• Procedures 42%
• Configurations 39%
• Move 13%
• Overflow 4%
• Upgrade 1%
Tandem Maintenance Causes of Outage
• Disk 67%
• Communication 20%
• Processor 13%
Tandem Environmental Outages
• Extended Power Loss 80%
• Earthquake 5%
• Flood 4%
• Fire 3%
• Lightning 3%
• Halon Activation 2%
• Air Conditioning 2%
Total MTBF about 20 years
MTBAoG* about 100 years
• Roadside highway equipment will be more exposed than this
* (AoG = “Act Of God”)
CMU Andrew File Server Study
Configuration
• 13 SUN II Workstations with 68010 processor
• 4 Fujitsu Eagle Disk Drives
Observations
• 21 Workstation Years
Frequency of events
• Permanent Failures 29
• Intermittent Faults 610
• Transient Faults 446
• System Crashes 298
Mean Time To
• Permanent Failures 6552 hours
• Intermittent Faults 58 hours
• Transient Faults 354 hours
• System Crash 689 hours
Some Interesting Ratios
Permanent Outages/Total Crashes = 0.1
Intermittent Faults/Permanent Failures = 21
• Thus first symptom appears over 1200 hours prior to repair
(Crashes - Permanent)/Total Faults = 0.255
14/29 failures had three or fewer error log entries
• 8/29 had no error log entries
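These ratios follow from the event counts and mean times reported for the CMU Andrew file server study. The sketch below reproduces them; reading the 1200-hour figure as about 21 intermittent faults at roughly 58 hours apiece is an interpretation, not something the slide spells out.

```python
# Event counts and mean time from the CMU Andrew file server study slide.
permanent, intermittent, transient, crashes = 29, 610, 446, 298
mean_hours_to_intermittent = 58

perm_per_crash = permanent / crashes
intermittent_per_permanent = intermittent / permanent
other_crashes_per_fault = (crashes - permanent) / (intermittent + transient)

# Assumed reading: about 21 intermittent faults precede each permanent
# failure, spaced roughly 58 hours apart.
warning_hours = round(intermittent_per_permanent) * mean_hours_to_intermittent

print(f"permanent / crashes         = {perm_per_crash:.1f}")
print(f"intermittent / permanent    = {intermittent_per_permanent:.0f}")
print(f"(crashes - perm) / faults   = {other_crashes_per_fault:.3f}")
print(f"warning time before repair  = {warning_hours} hours")
```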