Fault-Tolerant Computing Systems
#2 Hardware Fault Tolerance

Pattara Leelaprute
Computer Engineering Department
Kasetsart University
pattara.l@ku.ac.th


Hardware Fault Tolerance

Triple Modular Redundancy (TMR)*
• Can mask the failure of one hardware unit
• No explicit action (error detection, recovery, etc.) is needed when a fault occurs

[Figure: the input is fed to three modules (replicas); a voting element takes the majority of their outputs and produces the output.]

*Proposed by von Neumann


Triple Modular Redundancy (1)

Triple Modular Redundancy (TMR)
• Suitable for transient faults
• The voting element does not remove the faulty unit after an error occurs
• Once one module has failed permanently, the reliability of the TMR becomes lower than that of a simplex system, because both remaining modules must keep working for the vote to stay correct

Ex. (0,1,1) = 1, (1,0,0) = 0, (1,0) = ???
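The voting examples above can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote` and the choice to return `None` when no strict majority exists are assumptions, not part of the slides.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by more than half of the modules.

    Returns None when no value has a strict majority (an assumption made
    for this sketch; the slides leave that case as an open question).
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


print(majority_vote([0, 1, 1]))  # 1
print(majority_vote([1, 0, 0]))  # 0
print(majority_vote([1, 0]))     # None: two disagreeing units have no majority
```

The last call shows why a TMR that has lost one module is fragile: a single further disagreement can no longer be masked.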



Triple Modular Redundancy (2)

Bit-wise voting
• Take the majority for each bit
• The voting element has to be a simple and highly reliable unit
• Tight synchronization is required (single clock)

The generalization of TMR is N-modular redundancy (NMR).

[Figure: three modules each feed three voting elements. Inside a voting element (voter), three AND gates (&) feed one OR gate (+).]
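A minimal sketch of bit-wise voting, assuming each module output is an integer word. The function name `bitwise_vote` and the sample values are illustrative only; the gate structure (three ANDs feeding one OR) follows the voter shown on the slide.

```python
def bitwise_vote(a: int, b: int, c: int) -> int:
    """Per-bit majority of three words: (a AND b) OR (b AND c) OR (a AND c)."""
    return (a & b) | (b & c) | (a & c)


good = 0b10110010
one_bad = good ^ 0b00001000      # a single flipped bit in one module
other_bad = good ^ 0b00000001    # a different flipped bit in another module

print(format(bitwise_vote(good, one_bad, good), "08b"))       # 10110010
print(format(bitwise_vote(one_bad, other_bad, good), "08b"))  # 10110010
```

Because the majority is taken per bit, the second call still yields the correct word even though two modules are wrong, as long as they are wrong in different bit positions.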


Static Redundancy
• The effect of a faulty element (component, circuit, system) is immediately masked by permanently connected and continually operating replicas of the element
• TMR and NMR are static redundancy schemes

[Figure: three replica modules feed three voting elements; a fault in one module is masked.]

Key points
• Permanently connected
• Continually operating


Dynamic Redundancy
• When a fault is detected, that fault or its effect is subsequently corrected => reconfiguration
• Consists of several units, but with only one operating at a time; the other units are just "spares"

[Figure: a set of modules with one operating unit and the others as spares; reconfiguration switches the operating unit to a spare.]
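The reconfiguration idea can be sketched as follows. Everything here is hypothetical: the class name `StandbySystem`, the use of `RuntimeError` to stand in for "a fault was detected", and the callables used as units are assumptions chosen only to show one unit operating while the spares wait.

```python
class StandbySystem:
    """One operating unit plus spares (illustrative sketch only)."""

    def __init__(self, units):
        self.units = list(units)  # units[0] is the operating unit

    def run(self, x):
        while self.units:
            operating = self.units[0]
            try:
                return operating(x)
            except RuntimeError:   # stands in for "fault detected"
                self.units.pop(0)  # reconfiguration: switch to the next spare
        raise RuntimeError("no spare left")


def faulty_unit(x):
    raise RuntimeError("fault")


def healthy_unit(x):
    return x + 1


system = StandbySystem([faulty_unit, healthy_unit])
print(system.run(1))      # 2, produced by the spare
print(len(system.units))  # 1, the faulty unit has been removed
```

Unlike static redundancy, nothing is masked here: the fault must first be detected, and only then is the spare brought in.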


Dynamic Redundancy

Cold-Standby system
• Only one unit is powered up and operational
• Spares are not powered on -> they are still cold!
• A faulty unit is replaced by turning off its power and powering up a spare

Hot-Standby
• All units operate simultaneously
• Their outputs are then compared
  - If they are the same, one is selected arbitrarily
  - If not, the faulty unit is detected and the system is reconfigured

Dual System
• A matching circuit continuously compares the results of two units

[Figure: two modules feed a compare element.]

How to detect the failed unit??
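A small sketch of the dual system may help frame the question above. The function name `dual_system` and the sample units are assumptions; the point is only that a comparison reveals that a fault exists, not which unit is faulty.

```python
def dual_system(unit_a, unit_b, x):
    """Run both units on the same input and compare their results.

    Returns (output, mismatch). On a mismatch no output is chosen, because
    the comparison alone cannot tell which unit failed.
    """
    out_a, out_b = unit_a(x), unit_b(x)
    if out_a == out_b:
        return out_a, False
    return None, True


def square(x):
    return x * x


def stuck_at_zero(x):  # hypothetical faulty unit
    return 0


print(dual_system(square, square, 3))         # (9, False)
print(dual_system(square, stuck_at_zero, 3))  # (None, True)
```

Identifying the failed unit needs extra information, for example a self-test of each unit or a third replica to vote with.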


Coding

Code
• One of the most important techniques for supporting fault tolerance in hardware
• Codeword, non-codeword
  Ex. 0001 = a
      0010 = b
      0011 = c

Single Parity Check Code
• Even parity
• Odd parity


Single Parity Check
• Add a check bit (the "parity bit") to the information bits
• The total number of 1s in the codeword is always even or always odd

Odd parity check
• The parity bit is 1 iff the number of 1s in the data bits is even
• Ex. codeword 0 1 0 1 0 1 (parity bit 0, then data bits 1 0 1 0 1): the # of 1s in the codeword is odd

Even parity check
• The parity bit is 1 iff the number of 1s in the data bits is odd
• Ex. codeword 1 1 0 1 0 1 (parity bit 1, then data bits 1 0 1 0 1): the # of 1s in the codeword is even
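The two rules can be written as a short encoder. This is a sketch under assumptions: the names `parity_bit` and `encode` are made up here, and the parity bit is placed in front of the data bits to match the examples above.

```python
def parity_bit(data_bits, odd=True):
    """Return the check bit for the given data bits.

    Odd parity: the bit is 1 iff the number of 1s in the data is even.
    Even parity: the bit is 1 iff the number of 1s in the data is odd.
    """
    ones = sum(data_bits)
    if odd:
        return 1 if ones % 2 == 0 else 0
    return 1 if ones % 2 == 1 else 0


def encode(data_bits, odd=True):
    """Build a codeword with the parity bit first (assumed position)."""
    return [parity_bit(data_bits, odd)] + list(data_bits)


data = [1, 0, 1, 0, 1]
print(encode(data, odd=True))   # [0, 1, 0, 1, 0, 1]
print(encode(data, odd=False))  # [1, 1, 0, 1, 0, 1]
```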


Parity Check: How Does It Work?

Ex. of odd parity check
• Sending side: the parity bit is 1 iff the number of 1s in the data bits is even
    0 1 0 1 0 1 (parity bit, then data bits)
• Receiving side: check whether the # of 1s in the codeword is odd or not
    Receiving ex. 1: 0 1 0 0 0 1 -> the # of 1s is even: error detected (x)
    Receiving ex. 2: 0 1 1 0 0 1 -> the # of 1s is odd: ok? (two bits were flipped, yet the check passes)

• The occurrence of a one-bit error can be detected
• An error cannot be corrected (there is no way to locate the erroneous bit)
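The receiving-side check for the example above can be sketched as follows. The function name `passes_odd_parity` is an assumption; the three words are the ones from the slide.

```python
def passes_odd_parity(codeword):
    """Receiver check: accept iff the number of 1s in the codeword is odd."""
    return sum(codeword) % 2 == 1


sent = [0, 1, 0, 1, 0, 1]
received = {
    "ex.1": [0, 1, 0, 0, 0, 1],
    "ex.2": [0, 1, 1, 0, 0, 1],
}

for name, word in received.items():
    flipped = sum(s != r for s, r in zip(sent, word))
    print(name, "accepted" if passes_odd_parity(word) else "rejected",
          f"({flipped} bit(s) flipped)")
# ex.1 rejected (1 bit(s) flipped)
# ex.2 accepted (2 bit(s) flipped)
```

The second case is the weak spot of a single parity bit: an even number of bit errors leaves the parity unchanged and goes undetected.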


Parity Check (Advanced)

[Figure: exercise with a sending side and receiving sides, worked through in Step 1 and Step 2. Question: odd parity or even parity?]


Coding (Hamming Distance)

Minimum distance
• The minimum Hamming distance between any pair of two different codewords

Ex. Single Parity Check Code
• Minimum distance = 2
• A 1-bit error can be detected

[Figure: example codewords 11110000, 00010010…, 11010111…, 01010110… separated by distance d.]

Td = d - 1 (detection)
Tc = ⌊(d - 1)/2⌋ (correction)

where
Td = number of bit errors that can be detected
Tc = number of bit errors that can be corrected
d = minimum distance

Why?
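A short sketch can check the single-parity example numerically. The helper names and the choice of a 3-bit data word with an appended even-parity bit are assumptions made for illustration.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


# All 3-bit data words with an even-parity bit appended (assumed example code).
code = []
for n in range(8):
    data = format(n, "03b")
    code.append(data + str(data.count("1") % 2))

d = minimum_distance(code)
print("d  =", d)             # d  = 2
print("Td =", d - 1)         # Td = 1
print("Tc =", (d - 1) // 2)  # Tc = 0
```

Intuitively, fewer than d bit flips can never turn one codeword into another, so up to d - 1 errors are detectable; correction needs the received word to remain closest to the original codeword, which holds only for up to ⌊(d - 1)/2⌋ errors.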


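Self-Checking
• Can detect faults by itself

Ex. Self-Checking Parity Checker

[Figure: a checker with inputs x0 x2 x4 x6 and x1 x3 x5 x7 x8 and outputs z2, z1. A second diagram shows inputs x entering a functional circuit whose output z goes to a checker that produces the error indication; the labels A and B appear in the diagram.]

Error indication, if using odd parity:
• Codewords (0, 1), (1, 0): error free
• Non-codewords (0, 0), (1, 1): error (in A or B)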
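The two-output error indication can be sketched as below. The grouping of the nine inputs into two XOR trees is an assumption read off the order in which the inputs are listed; the function names are hypothetical. The sketch only models how the output pair reacts to input errors, not the behaviour of the checker under its own internal faults.

```python
from functools import reduce
from operator import xor

GROUP_1 = (0, 2, 4, 6)     # assumed grouping of inputs x0, x2, x4, x6
GROUP_2 = (1, 3, 5, 7, 8)  # assumed grouping of inputs x1, x3, x5, x7, x8


def checker(x):
    """Return the output pair: one XOR (parity) per input group."""
    first = reduce(xor, (x[i] for i in GROUP_1))
    second = reduce(xor, (x[i] for i in GROUP_2))
    return first, second


def error_indicated(pair):
    """(0, 1) and (1, 0) are codewords; (0, 0) and (1, 1) signal an error."""
    return pair[0] == pair[1]


word = [1, 0, 1, 1, 0, 0, 0, 0, 0]  # odd number of 1s: a valid odd-parity word
print(checker(word), error_indicated(checker(word)))  # (0, 1) False

bad = list(word)
bad[4] ^= 1                          # flip a single input bit
print(checker(bad), error_indicated(checker(bad)))    # (1, 1) True
```

For a valid odd-parity input the XOR of all nine bits is 1, so the two group parities must differ, which is why the error-free outputs are exactly (0, 1) and (1, 0).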


Self-Checking

[Figure: an input enters a circuit whose output is either a codeword or a non-codeword; a non-codeword means a fault.]

F = set of faults

Fault-Secure
• Even if a fault f ∈ F occurs, an incorrect codeword will not be produced

Self-Testing
• When a fault f ∈ F occurs, there is an input that leads to a non-codeword output (which means the fault is detected)

Totally Self-Checking
• Fault-Secure + Self-Testing


2-Rail Logic
• Don't use NOT gates

[Figure: each signal x is carried on two rails (x1, x0); gates take the rail pairs (x1, x0) and (y1, y0) as inputs and produce the rail pair (z1, z0).]


2-Rail Logic and Unidirectional Error

The effect of a fault on the output

Unidirectional error (definition)
• All erroneous signals are of only one kind:
  - errors in which 1 → 0 occurs, or
  - errors in which 0 → 1 occurs

2-Rail Logic
• An incorrect codeword will never be produced
  ex. (0,1) → (1,0) never occurs
• However, a non-codeword may be produced
=> Fault-Secure
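The fault-secure behaviour can be checked exhaustively for a small case. This sketch rests on assumptions not stated in the slides: a signal x is encoded as the pair (x, NOT x), NOT is a swap of the two rails, and AND/OR are built from one AND gate and one OR gate each. The error model forces any subset of the input rails from 1 to 0 (a unidirectional error).

```python
from itertools import product


def encode(x):
    """Assumed encoding: first rail carries x, second rail its complement."""
    return (x, 1 - x)


def is_codeword(pair):
    return pair in ((0, 1), (1, 0))


def rail_not(x):
    return (x[1], x[0])  # no NOT gate: just swap the rails


def rail_and(x, y):
    return (x[0] & y[0], x[1] | y[1])


def rail_or(x, y):
    return (x[0] | y[0], x[1] & y[1])


# Force every possible subset of the four input rails from 1 to 0.
for x, y in product((0, 1), repeat=2):
    correct = encode(x & y)
    for mask in product((0, 1), repeat=4):  # 1 = this rail is forced to 0
        rails = encode(x) + encode(y)
        faulty = tuple(0 if m else r for r, m in zip(rails, mask))
        z = rail_and(faulty[:2], faulty[2:])
        assert z == correct or not is_codeword(z)

print("no incorrect codeword was produced")
```

Because the gates used are monotone, rails that can only drop from 1 to 0 can only push the outputs towards (0, 0), a non-codeword, and never to the opposite codeword.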


Disk Shadowing
• Maintaining a set of identical disk images on several separate disk devices

Disk Mirroring
• 2 disks with 2 disk controllers
• Write to both disks, read from either disk

Tandem System
• The first commercial fault-tolerant system

[Figure: two hosts, two disk controllers, and two disks.]


RAID (Redundant Array of Inexpensive Disks)

Striping
• Divide the storage area into several parts called stripes, then distribute those stripes across several disks
• Load balancing between disks to maximize throughput

Fault tolerance can be implemented at low cost

[Figure: a controller with three disks holding the stripes D0 to D5.]


RAID-0
• Striping only, no redundancy

Advantage
• Good performance due to high data throughput

Disadvantage
• No fault tolerance

Usable storage capacity percentage = 100%

[Figure: a controller with three disks holding the stripes D0 to D5.]
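Striping can be sketched as a mapping from a logical stripe number to a disk and a row on that disk. The round-robin placement and the function name `locate` are assumptions; they are consistent with the D0 to D5 layout in the figure but are not spelled out in the slides.

```python
def locate(stripe: int, n_disks: int):
    """Map a logical stripe number to (disk index, row on that disk).

    Assumes simple round-robin placement across the disks.
    """
    return stripe % n_disks, stripe // n_disks


for stripe in range(6):
    disk, row = locate(stripe, 3)
    print(f"D{stripe}: disk {disk}, row {row}")
# D0: disk 0, row 0
# D1: disk 1, row 0
# D2: disk 2, row 0
# D3: disk 0, row 1
# D4: disk 1, row 1
# D5: disk 2, row 1
```

Consecutive stripes land on different disks, so a large sequential access keeps all disks busy at once; losing any one disk, however, loses part of every large file.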


RAID-1
• Mirroring: write all data to N disks

Advantage
• High fault tolerance (tolerates/masks the failure of N-1 disks)
• Faster on reads (compared to a single drive)

Disadvantage
• Slower on writes (compared to a single drive)
• Low utilization efficiency

Usable storage capacity percentage = 100/N %

[Figure: a controller with two disks, each holding D0 and D1.]


RAID-4
• Add one redundant parity disk (N disks in total)

Advantage
• Very good for reads (the same as RAID-0)
• High utilization efficiency
• Tolerates/masks the failure of 1 disk

Disadvantage
• Slow on writes (typically, small random writes)
  *due to the concentration of accesses on the parity disk

Usable storage capacity percentage = 100*(N-1)/N %

[Figure: a controller with data disks holding D0 to D5, plus a parity disk holding P0~2 and P3~5.]
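How one parity disk lets the array survive a single disk failure can be sketched with byte-wise XOR. The slides do not state how the parity is computed; XOR parity is the usual choice and is assumed here, as are the function name `xor_blocks` and the sample block contents.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


d0, d1, d2 = b"\x0f\xa0", b"\x33\x55", b"\xf0\x01"
p = xor_blocks([d0, d1, d2])        # parity block for the row D0..D2

# Suppose the disk holding d1 fails: rebuild it from the survivors and parity.
rebuilt = xor_blocks([d0, d2, p])
print(rebuilt == d1)                # True
```

The same property explains the write penalty: every small write must also update the parity block, and in RAID-4 all of those updates go to the same disk.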


RAID-5
• Similar to RAID-4, but distributes parity among the drives

Advantage
• Very good for reads/writes (even small random writes)
  *the parity disk no longer becomes a bottleneck
• High utilization efficiency
• Tolerates/masks the failure of 1 disk

Disadvantage
• Slower than RAID-4 on reads
  *parity data must be skipped on each drive during reads

Usable storage capacity percentage = 100*(N-1)/N %

[Figure: a controller with several disks holding D0 to D5; the parity blocks P0~2 and P3~5 are placed on different disks.]
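The capacity figures quoted for each level can be collected in one small helper for comparison. The function name `usable_capacity_percent` is an assumption; the formulas are the ones given on the slides, with N the total number of disks.

```python
def usable_capacity_percent(level: int, n_disks: int) -> float:
    """Usable storage capacity percentage for the RAID levels covered here."""
    if level == 0:
        return 100.0                            # striping only
    if level == 1:
        return 100.0 / n_disks                  # every disk holds a full copy
    if level in (4, 5):
        return 100.0 * (n_disks - 1) / n_disks  # one disk's worth of parity
    raise ValueError("level not covered in this sketch")


for level in (0, 1, 4, 5):
    print(f"RAID-{level} with 4 disks: {usable_capacity_percent(level, 4)}%")
# RAID-0 with 4 disks: 100.0%
# RAID-1 with 4 disks: 25.0%
# RAID-4 with 4 disks: 75.0%
# RAID-5 with 4 disks: 75.0%
```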