1
Fault-Tolerant Computing Systems
#2 Hardware Fault Tolerance
Pattara Leelaprute
Computer Engineering Department
Kasetsart [email protected]
2
Hardware Fault Tolerance
Triple Modular Redundancy (TMR)*
• Can mask the failure of one hardware unit
• No explicit action (error detection, recovery, etc.) needs to be performed when a fault occurs
[Figure: the Input feeds three Module replicas; a Voting Element takes the majority of their outputs to produce the Output]
*Proposed by Von Neumann
3
Triple Modular Redundancy (1)
Triple Modular Redundancy (TMR)
• Suitable for transient faults
• The voting element does not remove the faulty unit after an error occurs
• Once a failure has occurred, the reliability of the TMR system becomes lower than that of a simplex system.
• Ex. majority votes: (0,1,1) = 1, (1,0,0) = 0, (1,0) = ???
[Figure: Input -> three Module replicas -> Voting Element (majority vote) -> Output]
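The voting examples above can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions, not part of the slides.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of the modules to agree.
    return value if count > len(outputs) / 2 else None

print(majority_vote([0, 1, 1]))  # 1
print(majority_vote([1, 0, 0]))  # 0
print(majority_vote([1, 0]))     # None: with one unit gone and a disagreement, no majority is left
```

The last call corresponds to the "(1,0) = ???" case: with only two outputs that disagree, the voter has no basis for choosing one.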
4
Triple Modular Redundancy (2)
Bit-wise voting
• Take the majority for each bit
• The voting element has to be a simple and highly reliable unit
• Tight synchronization is required
  • Single clock
• The generalization of TMR is N-modular redundancy (NMR)
[Figure: three Modules feeding three Voting Elements; each Voting Element (Voter) is drawn as three AND gates (&) combined by an OR gate (+)]
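A bit-wise voter can be sketched with the usual two-out-of-three expression (three ANDs combined by an OR). The function name `bitwise_vote` and the 8-bit example words below are made up for illustration.

```python
def bitwise_vote(a: int, b: int, c: int) -> int:
    """Majority of three words, taken independently for each bit position."""
    return (a & b) | (b & c) | (a & c)

good = 0b10110010
module_a = good
module_b = good ^ 0b00001000  # one bit flipped in module B
module_c = good ^ 0b00000001  # a different bit flipped in module C

print(format(bitwise_vote(module_a, module_b, module_c), "08b"))  # 10110010
```

Because the vote is taken per bit, errors in two modules are still masked as long as they hit different bit positions.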
5
Static Redundancy
• The effect of a faulty element (component, circuit, system) is immediately masked by permanently connected and continually operating replicas of the element.
• TMR and NMR are static redundancy schemes
[Figure: Module replicas feeding replicated Voting Elements; a fault in one replica is masked]
KEY Points
• Permanently connected
• Continually operating
6
Dynamic Redundancy
• When a fault is detected, that fault or its effect is subsequently corrected => Reconfiguration
• Consists of several units, but with only one operating at a time. The other units are just “spares”
[Figure: Reconfiguration: when the operating unit fails, a spare Module takes over as the operating unit]
7
Dynamic Redundancy
Cold-Standby system
• Only one unit is powered up and operational
• Spares are not powered on -> they are still cold!
• A faulty unit is replaced by turning off its power and powering up a spare
Hot-Standby
• All units are operating simultaneously
• Their outputs are then matched
  • If they are the same, one is selected arbitrarily
  • If not, the faulty unit is detected and the system is reconfigured
Dual System
• A matching circuit continuously compares the results of two units
[Figure: two Modules whose outputs feed a compare circuit]
How can the failed unit be detected?
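A minimal sketch of the matching circuit in a dual system, assuming a hypothetical function `dual_compare` that returns the agreed output together with an error flag:

```python
def dual_compare(output_a, output_b):
    """Compare the outputs of two units; return (output, error_detected)."""
    if output_a == output_b:
        return output_a, False
    # The units disagree: an error is detected, but the comparison alone
    # does not say which of the two units is the faulty one.
    return None, True

print(dual_compare(42, 42))  # (42, False)
print(dual_compare(42, 17))  # (None, True)
```

The second call illustrates the question on the slide: comparison detects the error, but locating the failed unit needs some additional mechanism.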
8
Coding
Code
• One of the most important techniques for supporting fault-tolerant hardware
• Codeword, Non-codeword
  Ex. 0001 = a
      0010 = b
      0011 = c
Single Parity Check Code
• Even Parity
• Odd Parity
9
Single Parity Check
• Add a check bit (“parity bit”) to the information bits
• The total number of 1s in the codeword is always even or always odd
Odd parity check
• The parity bit is 1 iff the number of 1s in the data bits is even
Even parity check
• The parity bit is 1 iff the number of 1s in the data bits is odd
0 1 0 1 0 1   codeword (parity bit, data bits): the # of 1s in the codeword is odd
1 1 0 1 0 1   codeword (parity bit, data bits): the # of 1s in the codeword is even
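The two rules can be written down directly. In this sketch the parity bit is placed in front of the data bits; that placement, and the names `parity_bit` and `encode`, are assumptions made for illustration.

```python
def parity_bit(data_bits, odd=True):
    """Parity bit for the given data bits under odd or even parity."""
    ones = sum(data_bits)
    if odd:
        # Odd parity: the parity bit is 1 iff the number of 1s in the data is even.
        return 1 if ones % 2 == 0 else 0
    # Even parity: the parity bit is 1 iff the number of 1s in the data is odd.
    return 1 if ones % 2 == 1 else 0

def encode(data_bits, odd=True):
    """Build the codeword as [parity bit] + data bits (assumed layout)."""
    return [parity_bit(data_bits, odd)] + list(data_bits)

data = [1, 0, 1, 0, 1]
print(encode(data, odd=True))   # [0, 1, 0, 1, 0, 1]
print(encode(data, odd=False))  # [1, 1, 0, 1, 0, 1]
```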
10
Parity Check
How does it work? Ex. of odd parity check
Sending side:    0 1 0 1 0 1   (parity bit, data bits)
  The parity bit is 1 iff the number of 1s in the data bits is even
Receiving side: check whether the # of 1s in the codeword is odd or not.
Receiving ex.1:  0 1 0 0 0 1   x   (the # of 1s is even, so an error is detected)
Receiving ex.2:  0 1 1 0 0 1   ok? (the # of 1s is odd, although two bits differ from the sent codeword)
• The occurrence of a one-bit error can be detected.
• An error cannot be corrected (there is no way to identify its position).
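The receiving-side check for the odd parity example can be sketched as follows; the helper names `check_odd_parity` and `differing_bits` are hypothetical.

```python
def check_odd_parity(codeword):
    """Accept the codeword iff it contains an odd number of 1s."""
    return sum(codeword) % 2 == 1

def differing_bits(a, b):
    """Number of positions in which two equally long words differ."""
    return sum(x != y for x, y in zip(a, b))

sent = [0, 1, 0, 1, 0, 1]
rx1 = [0, 1, 0, 0, 0, 1]  # receiving ex.1
rx2 = [0, 1, 1, 0, 0, 1]  # receiving ex.2

print(check_odd_parity(rx1), differing_bits(sent, rx1))  # False 1
print(check_odd_parity(rx2), differing_bits(sent, rx2))  # True 2
```

The second example passes the check even though two bits were flipped, which is why a single parity bit only guarantees detection of one-bit errors.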
11
Parity Check (Advanced)
[Figure: a sending side and receiving sides, worked through in two steps (Step 1, Step 2); question posed: odd parity or even parity?]
12
Coding (Hamming Distance)
Minimum Distance
• The minimum Hamming distance between any pair of two different codewords
• Ex. Single Parity Check Code: minimum distance = 2, so a 1-bit error can be detected
[Figure: codewords 11110000, 00010010…, 11010111…, 01010110… separated by distance d]
Detection:  Td = d - 1
Correction: Tc = floor((d - 1)/2)
Td = number of bit errors that can be detected
Tc = number of bit errors that can be corrected
d = minimum distance
Why?
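The relation between minimum distance and detection/correction capability can be checked on a small code. The sketch below builds an even-parity code over 3 data bits (an arbitrary size chosen for illustration) and computes d, Td and Tc; the function names are hypothetical.

```python
from itertools import combinations, product

def hamming_distance(a, b):
    """Number of positions in which two equally long words differ."""
    return sum(x != y for x, y in zip(a, b))

def minimum_distance(codewords):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

# Even-parity code: parity bit followed by 3 data bits (assumed layout).
code = [(sum(bits) % 2,) + bits for bits in product((0, 1), repeat=3)]

d = minimum_distance(code)
print(d, d - 1, (d - 1) // 2)  # 2 1 0
```

As for the "Why?": up to d - 1 bit errors can never turn one codeword into another, so they are detectable; correction needs the erroneous word to stay closer to the original codeword than to any other, which holds for at most floor((d - 1)/2) errors.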
13
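Self-Checking
• Can detect faults by itself
• Ex. Self-Checking Parity Checker
[Figure: Inputs x feed a Functional Circuit whose outputs z go to a Checker that produces the Error Indication. In the parity checker example, inputs x0, x2, x4, x6 and x1, x3, x5, x7, x8 feed two blocks A and B, which produce the outputs z2 and z1]
If using odd parity:
• Codewords (0, 1), (1, 0): error free
• Noncodewords (0, 0), (1, 1): error (in A or B)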
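A rough software model of such a parity checker: the 9 input bits are split into two groups, each group is XOR-ed, and the pair of results is the checker output. The exact grouping (in particular where x8 goes) and the output order are assumptions read off the garbled figure.

```python
from functools import reduce
from operator import xor

def two_rail_parity_check(word):
    """word: list of 9 bits x0..x8 that should have odd parity. Returns (z1, z2)."""
    group_a = word[0:8:2]              # x0, x2, x4, x6
    group_b = word[1:8:2] + word[8:9]  # x1, x3, x5, x7, x8 (assumed grouping)
    z1 = reduce(xor, group_a)
    z2 = reduce(xor, group_b)
    return z1, z2

valid = [1, 0, 0, 0, 0, 0, 0, 0, 0]    # one 1: odd parity
corrupt = [1, 0, 0, 1, 0, 0, 0, 0, 0]  # x3 flipped: even number of 1s

print(two_rail_parity_check(valid))    # (1, 0) -> codeword, error free
print(two_rail_parity_check(corrupt))  # (1, 1) -> non-codeword, error
```

For an odd-parity input the two group parities always differ, so the output is (0, 1) or (1, 0); a single bit error makes them equal.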
14
Self-Checking
[Figure: input -> Circuit -> output that is a codeword or a non-codeword; a non-codeword means a fault]
F = set of faults
Fault-Secure
• Even if a fault f ∈ F occurs, an incorrect codeword will not be produced
Self-Testing
• When a fault f ∈ F occurs, there is an input that leads to a non-codeword output (which means the fault is detected)
Totally Self-Checking
• Fault-Secure + Self-Testing
15
2-Rail Logic
• Don’t use Not gate
[Figure: each signal x is carried on two wires (x1, x0); the gate diagrams take inputs (x1, x0) and (y1, y0) and produce outputs (z1, z0)]
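One common way to realize 2-rail gates without inverters is sketched below. The encoding (x1, x0) = (x, NOT x) and the specific gate constructions are assumptions, since the original gate diagrams did not survive extraction.

```python
from itertools import product

def encode(bit):
    """2-rail encoding: (x1, x0) = (x, NOT x)."""
    return (bit, 1 - bit)

def rail_and(x, y):
    # z1 = x1 AND y1, z0 = x0 OR y0
    return (x[0] & y[0], x[1] | y[1])

def rail_or(x, y):
    # z1 = x1 OR y1, z0 = x0 AND y0
    return (x[0] | y[0], x[1] & y[1])

def rail_not(x):
    # Negation is just a swap of the two wires: no Not gate is needed.
    return (x[1], x[0])

for a, b in product((0, 1), repeat=2):
    assert rail_and(encode(a), encode(b)) == encode(a & b)
    assert rail_or(encode(a), encode(b)) == encode(a | b)
    assert rail_not(encode(a)) == encode(1 - a)
print("2-rail gates agree with AND, OR and NOT on all inputs")
```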
16
2-Rail Logic and Unidirectional Error
The effect of a fault on the output
Unidirectional Error (definition)
• All erroneous signals are of only one kind:
  • either errors in which 1 -> 0 occurs,
  • or errors in which 0 -> 1 occurs
2-Rail Logic
• An incorrect codeword will never be produced
  ex. (0,1) -> (1,0) never occurs
• However, a non-codeword may be produced
=> Fault-Secure
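The fault-secure claim can be checked exhaustively on a small model. The sketch below assumes a 2-rail AND gate built as z1 = x1 AND y1, z0 = x0 OR y0 and injects a single stuck-at fault on each input wire; both the gate construction and the fault model are illustrative assumptions.

```python
from itertools import product

def encode(bit):
    return (bit, 1 - bit)  # (x1, x0) = (x, NOT x)

def is_codeword(pair):
    return pair[0] != pair[1]  # (0,1) and (1,0) are the codewords

def rail_and(x, y):
    return (x[0] & y[0], x[1] | y[1])

def check_fault_secure():
    """True iff no single stuck-at input fault yields an incorrect codeword."""
    for a, b in product((0, 1), repeat=2):
        correct = encode(a & b)
        wires = list(encode(a) + encode(b))  # [x1, x0, y1, y0]
        for wire, stuck in product(range(4), (0, 1)):
            faulty = wires.copy()
            faulty[wire] = stuck  # force one wire to 0 or 1
            out = rail_and(tuple(faulty[:2]), tuple(faulty[2:]))
            if out != correct and is_codeword(out):
                return False  # a wrong codeword slipped through
    return True

print(check_fault_secure())  # True
```

Each input wire influences only one output rail, so a single fault can change at most one rail of the output, which turns a codeword into a non-codeword rather than into the opposite codeword.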
17
Disk Shadowing
• Maintaining a set of identical disk images on several separate disk devices.
Disk Mirroring
• 2 disks with 2 disk controllers
• Write to both disks, read from either disk
Tandem System
• The first commercial fault-tolerant system
[Figure: two Hosts connected through two Disk Controllers to two Disks]
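The "write to both, read from either" rule can be modelled with a toy in-memory class. `MirroredDisk` and its methods are hypothetical and ignore controllers, resynchronization after repair, and real I/O.

```python
class MirroredDisk:
    """Toy model of disk mirroring with two in-memory 'disks'."""

    def __init__(self):
        self.disks = [{}, {}]
        self.failed = [False, False]

    def write(self, block, data):
        # Write to every disk that is still working.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                disk[block] = data

    def read(self, block):
        # Read from the first working disk; either copy is valid.
        for i, disk in enumerate(self.disks):
            if not self.failed[i]:
                return disk[block]
        raise OSError("both disks have failed")

mirror = MirroredDisk()
mirror.write(0, b"payload")
mirror.failed[0] = True  # simulate the failure of one disk
print(mirror.read(0))    # b'payload'
```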
18
RAID (Redundant Array of Inexpensive Disks)
Striping
• Divide the storage area into several parts called stripes, then distribute those stripes across several disks
• Load balancing between disks to maximize throughput
Fault tolerance can be implemented at low cost
[Figure: a Controller striping blocks D0 to D5 across three disks]
19
RAID-0
Striping
Advantage
• Good performance due to high data throughput
Disadvantage
• No fault tolerance
Usable Storage Capacity Percentage = 100%
[Figure: a Controller striping blocks D0 to D5 across three disks]
Only striping, no redundancy
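A simple way to picture striping is a round-robin mapping from block number to disk. The mapping below (block i goes to disk i mod N) is an assumed layout for illustration; real controllers may use other stripe sizes and orders.

```python
def locate_block(block, num_disks):
    """Map a logical block number to (disk index, stripe index) round-robin."""
    return block % num_disks, block // num_disks

for block in range(6):
    disk, stripe = locate_block(block, 3)
    print(f"D{block} -> disk {disk}, stripe {stripe}")
```

With three disks, D0, D1 and D2 land on different disks and can be accessed in parallel, which is where the throughput gain comes from; losing any one disk loses part of every file, since nothing is stored twice.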
20
RAID-1
Mirroring
[Figure: a Controller writing the same blocks D0 and D1 to each of two disks]
• Writing all data to N disks
Advantage
• High fault tolerance (tolerates/masks the failure of N-1 disks)
• Faster on reads (compared to a single drive)
Disadvantage
• Slower on writes (compared to a single drive)
• Low utilization efficiency
Usable Storage Capacity Percentage = 100/N %
21
RAID-4
[Figure: a Controller and N disks; data blocks D0 to D5 are striped across the data disks, and one dedicated parity disk holds P0~2 and P3~5]
• Add one redundant parity disk
Advantage
• Very good for reads (the same as RAID-0)
• High utilization efficiency
• Tolerates/masks the failure of 1 disk
Disadvantage
• Slow on writes (typically, small random writes)
  *due to the concentration of accesses on the parity disk
Usable Storage Capacity Percentage = 100*(N-1)/N %
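RAID parity is conventionally the bytewise XOR of the data blocks in a stripe, which is what lets one lost block be rebuilt from the others. The sketch below uses made-up 4-byte blocks and a hypothetical helper `xor_blocks`.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

stripe = [b"D0D0", b"D1D1", b"D2D2"]  # data blocks D0..D2 of one stripe
parity = xor_blocks(stripe)           # P0~2, stored on the parity disk

# Suppose the disk holding D1 fails: rebuild it from the survivors and the parity.
recovered = xor_blocks([stripe[0], stripe[2], parity])
print(recovered)  # b'D1D1'
```

Every small write also has to update the parity block, so all writes touch the single parity disk; that is the bottleneck named above.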
22
RAID-5
[Figure: a Controller and N disks; data blocks D0 to D5 and parity blocks P0~2 and P3~5 are distributed across all drives]
• Similar to RAID-4, but distributes the parity among the drives
Advantage
• Very good for reads/writes (even small random writes)
  *the parity disk is no longer a bottleneck
• High utilization efficiency
• Tolerates/masks the failure of 1 disk
Disadvantage
• Slower than RAID-4 on reads
  *parity data must be skipped on each drive during reads
Usable Storage Capacity Percentage = 100*(N-1)/N %
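The capacity formulas quoted for the RAID levels in these slides can be collected in one small helper. The function name `usable_capacity_percent` is hypothetical, and it assumes N equally sized disks.

```python
def usable_capacity_percent(level, num_disks):
    """Usable storage capacity in percent for N equally sized disks."""
    if level == 0:
        return 100.0                                # striping only, no redundancy
    if level == 1:
        return 100.0 / num_disks                    # every disk holds a full copy
    if level in (4, 5):
        return 100.0 * (num_disks - 1) / num_disks  # one disk's worth of parity
    raise ValueError(f"RAID level {level} is not covered here")

print(usable_capacity_percent(0, 4))  # 100.0
print(usable_capacity_percent(1, 2))  # 50.0
print(usable_capacity_percent(4, 4))  # 75.0
print(usable_capacity_percent(5, 4))  # 75.0
```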