ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
System Diagnosis
ECE 753 Fault Tolerant Computing 2
Overview
• Introduction
• System Model
• Diagnosis Problem - PMC model
• Other Models and Comments
• Sequential Diagnosability
• Other Formulations, Algorithms, and Problems
• Summary
Introduction
• Reference
  – [prad:96] Chapter 8; original paper in IEEE TC (Dec 1967)
• Diagnosis: an important part of recovery, maintenance, and reconfiguration
• What is system-level diagnosis: diagnosing failed components in a large, possibly multiprocessor, system
• Underlying needs: failures are inevitable, and units are intelligent enough to test other units; hence a different model and a corresponding theory are needed
System Model
• Model and Assumptions
  – Graph model
    • Processors/processes expressed as nodes
    • Interconnects as links between nodes
  – Each processor is sufficiently powerful to test other processors comprehensively
  – An example model with four nodes
  – Test model: if node Vi tests Vj, then draw a directed link from Vi to Vj
Diagnosis - PMC model (contd.)
• Assumptions
  – System with n units
  – Tests are comprehensive
  – Test results are binary: good (0) / faulty (1)
  – Test outcomes reported by faulty units cannot be trusted (denoted x, meaning the outcome can be 0 or 1)
  – The total number of faulty units in the system is bounded above by t
  – Example: system with four nodes and one fault
Diagnosis - PMC model (contd.)
• Example – Test outcomes
• Assume V2 is faulty
[Figure: four-node test graph (v1 to v4) with test outcomes 1, 0, 0, 0, x, x on its links]
Diagnosis - PMC model (contd.)
• One-step diagnosis
  – Analysis problem – given a system with n units, all the interconnects, and the test outcomes, identify the faulty units, subject to the constraint that no more than t units in the system are faulty.
  – Design problem – design a system using the fewest possible test links such that all the faulty units can be correctly identified in one step, knowing the outcomes of the tests.
Diagnosis - PMC model (contd.)
• One-step diagnosis - Example
  – Consider all possible outcomes:

fault       a12  a23  a24  a31  a41  a43
none         0    0    0    0    0    0
V1 faulty    x    0    0    1    1    0
V2 faulty    1    x    x    0    0    0
V3 faulty    0    1    0    x    0    1
V4 faulty    0    0    1    0    x    x

Each row is called the syndrome of the fault.
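The table can be reproduced mechanically. The following is a minimal Python sketch, not part of the original slides; it assumes that aij denotes the outcome of Vi testing Vj, and the names `TESTS` and `syndromes` are hypothetical. Each x entry is expanded into both of its possible values.

```python
from itertools import product

# Test links of the four-node example; (i, j) is read as "Vi tests Vj".
# This reading of the column names a12, a23, ... is an assumption.
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]


def syndromes(faulty, tests=TESTS):
    """Return every syndrome that the fault set `faulty` can produce (PMC model)."""
    # A fault-free tester reports 1 exactly when the tested unit is faulty.
    # A faulty tester may report either value (the x entries of the table).
    choices = [
        (0, 1) if i in faulty else ((1,) if j in faulty else (0,))
        for i, j in tests
    ]
    return set(product(*choices))


for label, faulty in [("none", set()), ("V1", {1}), ("V2", {2}),
                      ("V3", {3}), ("V4", {4})]:
    print(label, sorted(syndromes(faulty)))
```

A row with k entries equal to x expands into 2^k concrete syndromes, so the V1 row yields two syndromes and the V2 row yields four.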
Diagnosis - PMC model (contd.)
• Observations
  1. Two possible syndromes are associated with the fault V1: 0 0 0 1 1 0 and 1 0 0 1 1 0
  2. No two faults have overlapping syndromes
• Hence we can correctly identify (diagnose) the faulty unit
Diagnosis - PMC model (contd.)
• Consider two faulty units – say V1 and V2
  – Possible syndrome: x x x 1 1 0
  – This implies that 0 0 0 1 1 0 is a possible outcome, which is also a syndrome of V1 alone
• Therefore we cannot determine whether V1 alone or both V1 and V2 are faulty. Thus two faults in this system cannot be diagnosed in one step.
Diagnosis - PMC model (contd.)
• Result: A system is one-step t-fault diagnosable provided the syndromes of all fault sets (0 faults, 1 fault, 2 faults, …, t faults) are distinct (non-overlapping/non-intersecting)
• More results follow, but first one more assumption: no two units test each other
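The distinct-syndromes condition can be checked by brute force on small systems. The sketch below is illustrative only and is exponential in both the number of units and the number of tests; the function names are hypothetical, and the test links are those of the four-node example, read as "(i, j) means Vi tests Vj".

```python
from itertools import combinations, product


def syndromes(faulty, tests):
    """All syndromes the fault set `faulty` can produce under the PMC model."""
    choices = [
        (0, 1) if i in faulty else ((1,) if j in faulty else (0,))
        for i, j in tests
    ]
    return set(product(*choices))


def is_one_step_diagnosable(units, tests, t):
    """True if no two fault sets of size at most t share a syndrome."""
    fault_sets = [frozenset(c) for k in range(t + 1)
                  for c in combinations(units, k)]
    table = {f: syndromes(f, tests) for f in fault_sets}
    return all(table[f1].isdisjoint(table[f2])
               for f1, f2 in combinations(fault_sets, 2))


UNITS = (1, 2, 3, 4)
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]
print(is_one_step_diagnosable(UNITS, TESTS, 1))
print(is_one_step_diagnosable(UNITS, TESTS, 2))
```

For this example the check succeeds for t = 1 and fails for t = 2, because the fault sets {V1} and {V1, V2} share the syndrome 0 0 0 1 1 0.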
Diagnosis - PMC model (contd.)
• Result 1: For a system to be one-step t-fault diagnosable, n ≥ 2t + 1
• Result 2: For a system to be one-step t-fault diagnosable, each unit must be tested by at least t other units
• Theorem: A system of n units in which no two units test each other is one-step t-fault diagnosable if and only if each unit is tested by at least t other units.
Diagnosis - PMC model (contd.)
• Design Problem – one-step t-fault diagnosable system
• Example – n = 7, t = 3
[Figure: test graph with seven nodes numbered 0 to 6]
Diagnosis - PMC model (contd.)
• Design Problem: Algorithm for a simple one-step t-fault diagnosable system with n ≥ 2t + 1
  1. Number the nodes from 0 to n-1.
  2. Draw a link from node i to nodes i+1 (mod n), i+2 (mod n), … , i+t (mod n).
  3. The system so designed is one-step t-fault diagnosable.
Diagnosis - PMC model (contd.)
• Systems in which some units test each other
• One-step t-fault diagnosability conditions are somewhat complex – see [prad:96]
• How does one check whether a given system is one-step t-fault diagnosable?
  – Simple if no two units test each other
  – Somewhat complex if units test each other
  – There is a body of literature dealing with diagnosis algorithms
Other Models and Comments
Consider the possible test outcomes when a unit Vi tests unit Vj; see the listing below.

Vi Vj   outcomes
G  G      0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
G  F      0  0  0  0  1  1  1  1  0  0  0  0  1  1  1  1
F  G      0  0  1  1  0  0  1  1  0  0  1  1  0  0  1  1
F  F      0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
column    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Other Models/Comments (contd.)
– 4, 5, 6, 7: PMC model
– 8, 9, 10, 11: PMC with complement encoding
– 0, 15: of little value
– etc.
– Some subsets of the PMC model are more interesting, for example 5 and 7: here a unit being tested is always correctly identified, if faulty, independent of the status of the testing unit. Many such variations have been studied.
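Each column of the listing is the 4-bit binary form of its column number, with the (G, G) row as the most significant bit. The Python sketch below uses that observation to decode a column; the helper names are hypothetical and the code is an illustration, not part of the original material.

```python
# Status pairs (tester Vi, tested Vj) in the row order of the listing.
CASES = [("G", "G"), ("G", "F"), ("F", "G"), ("F", "F")]


def model_outcomes(column):
    """Map each (Vi, Vj) status pair to the test outcome in the given column."""
    if not 0 <= column <= 15:
        raise ValueError("column must be between 0 and 15")
    bits = format(column, "04b")  # first bit is the (G, G) row
    return {case: int(bit) for case, bit in zip(CASES, bits)}


def is_pmc(column):
    """PMC columns: a good tester reports 0 for a good unit and 1 for a faulty one."""
    m = model_outcomes(column)
    return m[("G", "G")] == 0 and m[("G", "F")] == 1


print([c for c in range(16) if is_pmc(c)])
print(model_outcomes(5))
print(model_outcomes(7))
```

In columns 5 and 7 both the (G, F) and (F, F) entries are 1, which is why a faulty tested unit is always reported as faulty, whatever the status of the tester.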
Other Models/Comments (contd.)
– Comparison-based testing and diagnosis
  • A paper is in the IEEE Transactions on Computers, February 2009 issue
– Basically, the model is built on the PMC model
Sequential Diagnosability
• Consider the following repair strategy:
  – identify one or more faulty units
  – repair them
  – test the system again, and continue until we know that there are no more faulty units
• This is called sequential diagnosis
Sequential Diagnosability (contd.)
• Assumptions
  – Same as before:
    • System with n units
    • Tests are comprehensive
    • Test results are binary: good (0) / faulty (1)
    • Test outcomes reported by faulty units cannot be trusted (denoted x, meaning the outcome can be 0 or 1)
    • The total number of faulty units in the system is bounded above by t
Sequential Diagnosability (contd.)
• Result 1:
For a system to be sequentially t-fault diagnosable,
n ≥ 2t + 1
It is not necessary for every unit to be tested by t units
Sequential Diagnosability (contd.)
• Example – n = 7, t = 3
[Figure: test graph with seven nodes numbered 0 to 6]
Sequential Diagnosability (contd.)
• It is easy to show that the example system is sequentially 3-fault diagnosable
• The above construction requires n+2t-1 links
• A better solution: a system with n+2t-2 links can be designed that is sequentially t-fault diagnosable
Sequential Diagnosability (contd.)
• Proof:
  – First construct the system: n nodes form a single loop, thus containing n links
  – Next choose some 2t-2 units, other than the unit that already tests V0 in the loop, and let these units also test V0; V0 is then tested by 2t-1 units
  – Now show that this system is sequentially t-fault diagnosable using the following three cases. Let n1 denote the number of units that find V0 faulty, and n0 the number of units that find V0 not faulty. Clearly n1 + n0 = 2t-1
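The construction in this proof can be written out as a short Python sketch. The loop direction (node i tests node i+1) and the choice of nodes 2 to 2t-1 as the extra testers are assumptions made only for illustration; the proof allows any 2t-2 units that do not already test V0.

```python
def loop_with_extra_tests(n, t):
    """Return test links (i, j), meaning node i tests node j."""
    if n < 2 * t + 1:
        raise ValueError("need n >= 2t + 1")
    # Single loop: node i tests node i+1 (mod n), giving n links.
    loop = [(i, (i + 1) % n) for i in range(n)]
    # Extra testers of node 0: nodes 2 .. 2t-1 (an arbitrary choice that
    # avoids node n-1, which already tests node 0 in the loop).
    extra = [(i, 0) for i in range(2, 2 * t)]
    return loop + extra


links = loop_with_extra_tests(7, 3)
testers_of_v0 = sorted(i for i, j in links if j == 0)
print(len(links))       # n + 2t - 2 links in total
print(testers_of_v0)    # 2t - 1 units test V0
```

For n = 7 and t = 3 this gives 11 links and five testers of V0, matching n+2t-2 and 2t-1.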
Sequential Diagnosability (contd.)
• Proof:
  – Case 1: n1 > t, so V0 is faulty
  – Case 2: n1 < t, so V0 is not faulty
  – Case 3: n1 = t, so all t faults lie among V0 and its testers, and a fault-free unit exists that is not involved in testing V0
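The three cases follow from counting which values of n1 are possible for each status of V0. The Python sketch below enumerates those values under the assumptions that V0 has 2t-1 testers and the system has at most t faults in total; the function name is hypothetical.

```python
def possible_n1(v0_faulty, t):
    """Values of n1 (testers reporting V0 faulty) that can occur."""
    testers = 2 * t - 1
    # If V0 itself is faulty, only t-1 faults remain for its testers.
    budget = t - 1 if v0_faulty else t
    values = set()
    for k in range(min(budget, testers) + 1):   # k faulty testers
        good = testers - k
        # Fault-free testers all report the true status of V0.
        base = good if v0_faulty else 0
        # The k faulty testers can add anywhere from 0 to k further 1s.
        values.update(range(base, base + k + 1))
    return values


t = 3
print(sorted(possible_n1(False, t)))    # V0 fault-free
print(sorted(possible_n1(True, t)))     # V0 faulty
print(sorted(possible_n1(False, t) & possible_n1(True, t)))
```

The two ranges overlap only at n1 = t, which is the one case where the status of V0 cannot be decided directly from its test results.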
Sequential Diagnosability (contd.)
• Sequential diagnosis – single loop system
  – Example: single loop system with n = 5
  – This is sequentially 2-fault diagnosable, which can be demonstrated by constructing syndromes for different fault conditions. However, a system with n = 9 is NOT sequentially 4-fault diagnosable
  – General result: A single loop system is sequentially t-fault diagnosable if and only if
      n ≥ t + t^2/4 + 2 for even t
      n ≥ t + (t-1)(t+1)/4 + 2 for odd t
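As a quick arithmetic check of the general result against the two examples, the bound can be evaluated directly. This is a small Python sketch with hypothetical function names; both divisions are exact, so integer division is safe.

```python
def min_loop_size(t):
    """Smallest n for which a single loop is sequentially t-fault diagnosable."""
    if t % 2 == 0:
        return t + t * t // 4 + 2              # even t
    return t + (t - 1) * (t + 1) // 4 + 2      # odd t


def loop_is_sequentially_diagnosable(n, t):
    """Apply the general result to a single loop of n units."""
    return n >= min_loop_size(t)


print(min_loop_size(2))                          # 5
print(min_loop_size(4))                          # 10
print(loop_is_sequentially_diagnosable(9, 4))    # False
```

The bound gives 5 for t = 2, so the n = 5 loop qualifies, and 10 for t = 4, so the n = 9 loop does not.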
Other Formulations, Algorithms, and Problems
• Generalization of sequential diagnosability
  – Diagnose s faulty units at a time, thus making a system t/s-sequentially diagnosable
• Allow replacing up to t units, but not all units that are replaced need be faulty. In other words, non-faulty units can be replaced as long as all the faulty units are among the replaced units (t/t fault diagnosability)
  – An example in [prad:96] shows a system with 13 units in which each unit is tested by 3 other units. Clearly such a system is only one-step 3-fault diagnosable, but it is shown to be 5/5 diagnosable.
• Still other formulations exist
Other Formulations, Algorithms, and Problems
• Diagnosis algorithms – given a syndrome and knowing that the system is t-diagnosable, determine the set of faulty units
  – Possible solutions
    • Dictionary approach – somewhat impractical for large systems
    • Algorithmic approach – based on graph models and using a solution to the maximum matching problem
  – Central vs. distributed algorithms
• Diagnosis and reconfiguration in homogeneous and heterogeneous multicore systems
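The dictionary approach can be illustrated on the four-node, single-fault example from the PMC slides. The Python sketch below precomputes a map from every possible syndrome to its fault set; the names are hypothetical, and the test links are read as "(i, j) means Vi tests Vj". It is a toy illustration and does not cover the matching-based algorithms mentioned above.

```python
from itertools import combinations, product

UNITS = (1, 2, 3, 4)
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]


def syndromes(faulty, tests):
    """All syndromes the fault set `faulty` can produce under the PMC model."""
    choices = [
        (0, 1) if i in faulty else ((1,) if j in faulty else (0,))
        for i, j in tests
    ]
    return set(product(*choices))


def build_dictionary(units, tests, t):
    """Map each syndrome to the unique fault set of size at most t behind it."""
    table = {}
    for k in range(t + 1):
        for combo in combinations(units, k):
            fault_set = frozenset(combo)
            for s in syndromes(fault_set, tests):
                if table.setdefault(s, fault_set) != fault_set:
                    raise ValueError("system is not one-step t-fault diagnosable")
    return table


DICTIONARY = build_dictionary(UNITS, TESTS, 1)
print(len(DICTIONARY))
print(DICTIONARY[(1, 0, 1, 0, 0, 0)])
```

Diagnosis is then a single lookup, but the number of entries grows with both the number of fault sets and the number of x outcomes per fault set, which is why the approach scales poorly to large systems.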