Page 1: FAULT-TOLERANT COMPUTING

FAULT-TOLERANT COMPUTING

Jenn-Wei Lin

Department of Computer Science and Information Engineering

Fu Jen Catholic University

Reliability Modeling and Analysis Lecture Set 3

Page 2: FAULT-TOLERANT COMPUTING

ECE 753 Fault Tolerant Computing


Overview

• Introduction

• Reliability Modeling

– reliability block diagram

– combinatorial model

– Markov model

• Other Parameters and analysis

• General remarks and Summary

Page 3: FAULT-TOLERANT COMPUTING


Introduction

• References

– [prad:96], [swew:99], [shooman:02]

– [triv:82]

– The three books in the first line contain sufficient material covering this part of the course

• Recap of definitions

• Importance of analysis and analytical model

• Mathematical formulation for quantitative analysis

Page 4: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Recap of definitions

– Reliability R(t)

– Availability A(t)

– Performability and Dependability

• Importance of analysis and analytical model

– to evaluate a design

– a metric to compare different designs

– to provide feedback to the designer during early design stages

– use a model for performance analysis

– used for quantitative and qualitative analysis

Page 5: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation for quantitative analysis

– consider a large experiment with N components, all operating correctly at time t0

– observation at time t

• N0(t) - number of correctly operating components

• Nf(t) - number of failed components

– Hence

• Reliability R(t) = N0(t)/N = 1 - Nf(t)/N

– probability that a component has survived the interval [t0, t]

• Unreliability Q(t) = 1 - R(t)

• Derivative of reliability: dR(t)/dt = -(1/N)(dNf(t)/dt)

• dNf(t)/dt is called the instantaneous failure rate of the component
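The counting definition above can be tried out on data. The following is a minimal sketch, not part of the original slides: the function name `empirical_reliability` and the simulated lifetimes (exponential, rate 0.001 per hour) are assumptions made only to have something to count.

```python
import random

def empirical_reliability(failure_times, t):
    """Return (R, Q) at time t using R = N0(t)/N and Q = Nf(t)/N."""
    n = len(failure_times)                                   # N
    n_failed = sum(1 for ft in failure_times if ft <= t)     # Nf(t)
    n_ok = n - n_failed                                      # N0(t)
    return n_ok / n, n_failed / n

# Hypothetical experiment: N = 10,000 components whose lifetimes are drawn
# from an exponential distribution (an assumption, used only to make data).
random.seed(0)
lifetimes = [random.expovariate(0.001) for _ in range(10_000)]
r, q = empirical_reliability(lifetimes, 500.0)
print(f"R(500) ~ {r:.3f}, Q(500) ~ {q:.3f}")
```

With enough components the estimate approaches the underlying survival probability, which is the sense in which R(t) is read as a probability.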

Page 6: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation (contd.)

– Also

• failure rate at time t

– (instantaneous failure rate at time t) / N0(t)

– (1/N0(t))(dNf(t)/dt) - called z(t)

– this and the previous expressions together reduce to

» z(t) = -(1/R(t))(dR(t)/dt)

» z(t) is called failure rate function, hazard function or hazard rate

– We can solve the above for R(t) provided we know the failure rate z(t)

– Bath tub curve for failure rate function

» implies constant failure rate during useful life

» infant mortality and wear out periods have variable failure rates
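The step from z(t) to R(t) can be written out explicitly. This is a short derivation sketch, assuming t0 = 0 and R(0) = 1 (every component works at the start of the experiment):

```latex
\begin{align*}
z(t) &= -\frac{1}{R(t)}\,\frac{dR(t)}{dt}
  \;\Longrightarrow\; \frac{dR(t)}{R(t)} = -z(t)\,dt \\
\ln R(t) &= -\int_0^t z(\tau)\,d\tau \qquad (\text{using } R(0)=1) \\
R(t) &= \exp\!\left(-\int_0^t z(\tau)\,d\tau\right)
\end{align*}
```

For a constant failure rate z(t) = λ, as in the flat part of the bath tub curve, the integral is λt and R(t) = exp(-λt).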


Page 8: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - constant failure rate

• dR(t)/dt = -z(t)R(t)

• solve the equation with z(t) = λ - exponential function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-λt)

– Reliability computation - time varying failure rate

• Weibull distribution z(t) = αλ(λt)**(α-1)

• solve the equation - exponential function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-(λt)**α)

– Failure rate computation - military standard

• function of - learning factor, quality factor, temperature factor, environmental factor, and # of pins on IC
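The two failure-rate cases above can be compared numerically. This is a small sketch under assumed parameter values: the function names and λ = 0.001 per hour are illustrative, not taken from the slides. Setting α = 1 should recover the constant-rate case.

```python
import math

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha), i.e. exp of minus the integral of z."""
    return math.exp(-((lam * t) ** alpha))

lam = 0.001  # hypothetical rate parameter, per hour
for alpha in (0.5, 1.0, 2.0):
    z = weibull_hazard(100.0, lam, alpha)
    r = weibull_reliability(100.0, lam, alpha)
    print(f"alpha={alpha}: z(100)={z:.6f}, R(100)={r:.6f}")
```

With α < 1 the hazard falls over time (infant-mortality region), with α > 1 it rises (wear-out region), and α = 1 gives the constant rate λ.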

Page 9: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - mean time to failure (MTTF)

• Definition: expected time that a system will operate before the first failure occurs

• Probability measure: S - sample space, E - event space

– for A in E, P(A) >= 0

– P(S) = 1

– P(A ∪ B) = P(A) + P(B), when A and B are non-intersecting

• Random Variable (RV) - X maps events of S to real-numbers

• Probability distribution function of a RV

• Probability density function (pdf) - derivative of the distribution function

Page 10: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - mean time to failure

• Probability density function - properties

– always >= 0

– integrates to 1 over its whole range

• Expectation

– integral of x f(x) dx in the continuous case

– Σ xi p(xi) in the discrete case

• Application in our case

– unreliability Q(t) is the probability distribution function of the time to failure - it is the cumulative probability that the system fails in the interval [0, t]

Page 11: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - MTTF and MTTR

• Application in our case (contd.)

– derivative of Q(t), written as f(t), is the pdf of failure - or failure density function

– Expected value can be computed using integration and is the Mean Time To Failure (MTTF)

– constant failure rate

» MTTF = 1/λ

• Mean time to repair - MTTR

– assume constant repair rate (μ) and arguments similar to those used for failure analysis and conclude MTTR = 1/μ
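The claim MTTF = 1/λ can be checked by evaluating the expectation integral numerically. The sketch below uses a hand-written trapezoidal rule; the function name `mttf_numeric`, the rate λ = 0.002 per hour and the truncation point 40/λ are assumptions chosen for illustration.

```python
import math

def mttf_numeric(pdf, t_max, steps=200_000):
    """Approximate the integral of t * f(t) over [0, t_max] (trapezoidal rule)."""
    h = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0   # half weight at both ends
        total += weight * t * pdf(t)
    return total * h

lam = 0.002  # hypothetical constant failure rate, per hour

def failure_pdf(t):
    """f(t) = dQ(t)/dt = lam * exp(-lam * t) for R(t) = exp(-lam * t)."""
    return lam * math.exp(-lam * t)

# The upper limit 40/lam stands in for infinity; the tail beyond it is negligible.
mttf = mttf_numeric(failure_pdf, t_max=40.0 / lam)
print(f"numeric MTTF ~ {mttf:.3f} h, 1/lam = {1.0 / lam:.3f} h")
```

The same routine would apply to a repair-time density with rate μ, giving MTTR close to 1/μ.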

Page 12: FAULT-TOLERANT COMPUTING


Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - mean time between failures (MTBF)

• Mean time between failures - MTBF

– use heuristic arguments to conclude

» MTBF = (total time T)/(average number of failures)

– can also argue MTBF = MTTF + MTTR

• Note: often λ << μ and hence MTTF >> MTTR, therefore the terms MTTF and MTBF are used interchangeably
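A quick calculation shows why the two terms are often treated as the same. The rates below are hypothetical (one failure per 10,000 hours, repairs taking 2 hours on average), and the helper name `mtbf_terms` is an assumption.

```python
def mtbf_terms(lam, mu):
    """Return (MTTF, MTTR, MTBF) for constant failure rate lam and repair rate mu."""
    mttf = 1.0 / lam
    mttr = 1.0 / mu
    return mttf, mttr, mttf + mttr

# Hypothetical rates: lam << mu, as in the note above.
mttf, mttr, total = mtbf_terms(lam=1e-4, mu=0.5)
gap = (total - mttf) / total
print(f"MTTF={mttf:.1f} h, MTTR={mttr:.1f} h, MTBF={total:.1f} h, relative gap={gap:.2e}")
```

When the repair rate is not much larger than the failure rate the gap is no longer negligible and the two quantities should be kept apart.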

Page 13: FAULT-TOLERANT COMPUTING


Reliability Modeling

• Application of the previous analysis to system models

– Assumptions

• system consists of modules

• each module assigned a probability of working R(t), a function of time

• once a module fails it is assumed to yield incorrect results

• module failures are independent

Page 14: FAULT-TOLERANT COMPUTING


Reliability Modeling

• Application of the previous analysis to system models

– Reliability block diagrams

• consider a system - microP, controller, mem, bus, …

• the system will fail if any of the components fails

• Rsys = P(all subsystems work correctly) = P(bus correct).P(mem correct)… etc. (follows from the assumption that component failures are independent)

• Rsys = Rbus.Rmem.Rmicro.Rcont

Page 15: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Reliability block diagrams - Series Systems

• Assume system has n components

• All components should survive for system to operate

• Reliability of system

– Rsys(t) = Π_i Ri(t)

• For exponential distributions of each component

– Rsys(t) = Π_i exp(-λi t) = exp(-(λ1 + λ2 + … + λn)t) = exp(-Σ_i λi t)

– Effect is that the system failure rate is the summation of failure rates of components

• Note these are nonredundant systems

[Figure: series block diagram R1, R2, …, Rn]
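The series rule is easy to verify numerically. In this sketch the three failure rates are hypothetical and `series_reliability` is an illustrative name; the product of the component reliabilities should match a single exponential whose rate is the sum of the component rates.

```python
import math

def series_reliability(rates, t):
    """Rsys(t) = product over components of exp(-lam_i * t)."""
    r = 1.0
    for lam in rates:
        r *= math.exp(-lam * t)
    return r

rates = [1e-4, 2e-4, 5e-5]   # hypothetical per-hour failure rates
t = 1000.0
r_product = series_reliability(rates, t)
r_summed = math.exp(-sum(rates) * t)   # single exponential with the summed rate
print(f"product form: {r_product:.6f}, summed-rate form: {r_summed:.6f}")
```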

Page 16: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Reliability block diagrams - Parallel Systems

• Assume system with spares

• a faulty component is replaced by a spare when a fault occurs

• only one component needs to survive for the system to operate

• Model is to represent all components connected in parallel

• P(sys fail) = P(M1 fails).P(M2 fails)…P(Mn fails)

• Rsys = 1 - P(sys fail) = 1 - (1-R1)(1-R2)…(1-Rn)
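The parallel formula translates directly into code. This is a minimal sketch; `parallel_reliability` is an illustrative name and the module reliabilities are made-up values. It relies on the independence assumption stated earlier.

```python
def parallel_reliability(module_reliabilities):
    """Rsys = 1 - product of (1 - Ri): the system fails only if every module fails."""
    p_all_fail = 1.0
    for r in module_reliabilities:
        p_all_fail *= (1.0 - r)
    return 1.0 - p_all_fail

# Hypothetical modules, each with reliability 0.9 at the time of interest.
for n in (1, 2, 3):
    print(n, parallel_reliability([0.9] * n))
```

Each extra spare multiplies the failure probability by another factor (1 - R), so the gain per added module shrinks quickly.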


Page 18: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Reliability block diagrams - Series-Parallel Systems

• straightforward: apply the series and parallel formulas repeatedly

– Reliability block diagrams - MTTF of system

• 1/(system failure rate), when the system failure rate is constant

• Series systems - 1/(sum of individual failure rates)

• Parallel systems and series parallel systems – work out by integration from the reliability or unreliability equations
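As an illustration of the integration route, consider a parallel system of two identical modules, each with constant failure rate λ (an assumed example, not one worked on the slides). It uses the identity MTTF = ∫ R(t) dt over [0, ∞), which follows from integrating t f(t) by parts:

```latex
\begin{align*}
R_{sys}(t) &= 1 - \bigl(1 - e^{-\lambda t}\bigr)^2 = 2e^{-\lambda t} - e^{-2\lambda t} \\
\mathrm{MTTF}_{sys} &= \int_0^\infty R_{sys}(t)\,dt
  = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda}
\end{align*}
```

The second module adds only half of a single module's MTTF, and the parallel system no longer has a constant failure rate, which is why the 1/(failure rate) shortcut does not apply to it.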

Page 19: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Reliability block diagrams - Non series-parallel systems

• Bayes rule (here in the form of the total probability theorem): consider a sample space S. Partition it into B and B' (the complement of B). Now consider an event A that falls partly in B and partly in B'. We can write:

A = (A ∩ B) ∪ (A ∩ B')

P(A) = P[(A ∩ B) ∪ (A ∩ B')]

= P(A ∩ B) + P(A ∩ B')

= P(A/B)P(B) + P(A/B')P(B')

where P(A/B) denotes the conditional probability of A given B

• In general the set S can be partitioned into (B1, B2, … ,Bn)

P(A) = Σ P(A/Bi)P(Bi)

This can be viewed graphically also (draw a tree)

Page 20: FAULT-TOLERANT COMPUTING


Reliability Modeling

• Reliability block diagrams - Non series-parallel systems

– Example - consider the following non series parallel system

– list all paths for system to survive, namely c1c4, c2c4, c2c5, c3c5

– These paths are not disjoint, so the sum of the reliabilities of all paths gives only an upper bound on the system reliability

– Exact computation is possible using Bayes rule – complete in class

[Figure: non series-parallel block diagram with components C1, C2, C3, C4, C5]
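One possible way to carry out the exact computation is sketched below. It assumes the system works exactly when at least one of the four listed paths (c1c4, c2c4, c2c5, c3c5) is fully working, and it conditions on C2 because C2 appears in two of the paths; both choices are assumptions, as are the function names and the component reliabilities of 0.9. A brute-force enumeration of all 32 component states serves as a cross-check.

```python
from itertools import product

PATHS = [(1, 4), (2, 4), (2, 5), (3, 5)]   # success paths listed above

def works(state):
    """state maps component index -> True (working) or False (failed)."""
    return any(all(state[c] for c in path) for path in PATHS)

def exact_by_enumeration(r):
    """Add up the probability of every component state in which the system works."""
    comps = sorted(r)
    total = 0.0
    for bits in product([True, False], repeat=len(comps)):
        state = dict(zip(comps, bits))
        if works(state):
            p = 1.0
            for c in comps:
                p *= r[c] if state[c] else 1.0 - r[c]
            total += p
    return total

def exact_by_conditioning(r):
    """P(sys) = P(sys / C2 works) R2 + P(sys / C2 failed) (1 - R2)."""
    # C2 working: C4 or C5 alone completes a path.
    given_c2_ok = 1.0 - (1.0 - r[4]) * (1.0 - r[5])
    # C2 failed: only c1c4 and c3c5 remain, and they share no component.
    given_c2_failed = 1.0 - (1.0 - r[1] * r[4]) * (1.0 - r[3] * r[5])
    return given_c2_ok * r[2] + given_c2_failed * (1.0 - r[2])

r = {c: 0.9 for c in range(1, 6)}          # hypothetical reliabilities
path_sum = sum(r[a] * r[b] for a, b in PATHS)
print(exact_by_enumeration(r), exact_by_conditioning(r), path_sum)
```

With these values the sum over paths exceeds 1, so it is a valid but uninformative bound, while the conditioned result matches the enumeration.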

Page 21: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Combinatorial model

• Consider an NMR system

• Assume voter reliability to be 1

• Partition the success events into disjoint events

• Compute probability of each event and add them

• Example – TMR system

• Can be used to compute MTTF

• Can also analyze other systems such as an m-of-n system
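The combinatorial recipe can be sketched for a general m-of-n system of identical, independent modules with a perfect voter. The disjoint success events are "exactly k modules work" for k = m, ..., n. The function names and the module reliability 0.9 are illustrative assumptions; the TMR closed form 3R² - 2R³ is the 2-of-3 special case.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """Sum of P(exactly k of n modules work) for k = m..n (disjoint events)."""
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))

def tmr_reliability(r):
    """2-of-3 case written out: 3 R^2 (1 - R) + R^3 = 3 R^2 - 2 R^3."""
    return 3 * r**2 - 2 * r**3

r = 0.9  # hypothetical module reliability
print(m_of_n_reliability(2, 3, r), tmr_reliability(r))
```

At R = 0.5 a TMR system is no better than a single module, and below that it is worse, which the same function makes easy to explore.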


Page 25: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Markov model

• Difficulty with the previous models

– incorporating repairs in the model and analysis

– incorporating a coverage factor, e.g. in a duplicated system we may be less than 100% certain that only the faulty unit will be eliminated when the system is reconfigured

• Markov modeling - basic

– Define the concept of state using TMR system example (8 states)

– Transitions between states occur with certain probabilities

• Markov model – assumption

– Probability of transition from a state si to sj is independent of the method of arrival into state si

• Example – develop a Markov model for a TMR in class

Page 26: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Markov model

• Markov model for a TMR – all details not shown

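[Figure: Markov state diagram for the TMR system with the eight states 111, 110, 101, 011, 100, 010, 001, 000; each single-module failure out of state 111 has transition probability λΔt, and the system stays in 111 with probability 1-3λΔt]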


Page 28: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Markov model - Reduced

• Reduced Markov model for a TMR system

• Previous eight state model can be reduced to a three state model by merging states and re-computing the transition probabilities

– Markov model - accounting for repairs

• We can include links between states knowing the repair rates of components
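A reduced three-state model can be stepped forward in time with the λΔt transition probabilities. The sketch below assumes one particular merging, namely states "3 modules good", "2 modules good" and "failed", with identical module failure rate λ, a perfect voter and no repair; the function names and numeric values are hypothetical. The closed form 3exp(-2λt) - 2exp(-3λt) comes from substituting R = exp(-λt) into the TMR expression 3R² - 2R³.

```python
import math

def tmr_markov_reliability(lam, t_end, dt):
    """Step the state probabilities (p3, p2, pf) forward in increments of dt."""
    p3, p2, pf = 1.0, 0.0, 0.0          # start with all three modules good
    for _ in range(int(round(t_end / dt))):
        p3, p2, pf = (
            p3 * (1.0 - 3.0 * lam * dt),                        # stay in "3 good"
            p3 * 3.0 * lam * dt + p2 * (1.0 - 2.0 * lam * dt),  # enter or stay in "2 good"
            pf + p2 * 2.0 * lam * dt,                           # absorbing failed state
        )
    return p3 + p2                       # system works in either non-failed state

def tmr_closed_form(lam, t):
    """3 R^2 - 2 R^3 with R = exp(-lam * t)."""
    return 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)

lam, t_end = 0.001, 1000.0               # hypothetical rate (per hour) and mission time
print(tmr_markov_reliability(lam, t_end, dt=0.1), tmr_closed_form(lam, t_end))
```

A repair rate could be added under the same scheme as an extra transition from "2 good" back to "3 good"; the step size Δt must stay small enough that every transition probability remains well below 1.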

Page 29: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Markov model - analyzing systems

• Consider a duplicate compare system – no repairs

• Develop Markov model with 3 states

• Develop a difference equation for computing probabilities for being in different states of the system

• Develop a differential equation model

• Solution methods

– Numerical approach

– Solving differential equation

» direct approach

» Using Laplace transforms
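The sequence of steps (difference equation, differential equation, Laplace solution) can be illustrated on the simplest possible case. The sketch below uses a single module with two states, working and failed, and failure rate λ, rather than the three-state duplicate compare model; this simplification and the symbol P_w(t) for the probability of being in the working state are assumptions made for illustration.

```latex
\begin{align*}
P_w(t+\Delta t) &= (1-\lambda\,\Delta t)\,P_w(t)
  && \text{(difference equation)} \\
\frac{P_w(t+\Delta t)-P_w(t)}{\Delta t} &= -\lambda\,P_w(t)
  \;\xrightarrow{\;\Delta t \to 0\;}\; \frac{dP_w(t)}{dt} = -\lambda\,P_w(t)
  && \text{(differential equation)} \\
s\,\tilde{P}_w(s) - P_w(0) &= -\lambda\,\tilde{P}_w(s),\quad P_w(0)=1
  \;\Longrightarrow\; \tilde{P}_w(s) = \frac{1}{s+\lambda}
  && \text{(Laplace transform)} \\
P_w(t) &= e^{-\lambda t}
  && \text{(inverse transform)}
\end{align*}
```

A model with more states follows the same pattern with one equation per state, giving a set of linear equations in the transformed probabilities.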

Page 30: FAULT-TOLERANT COMPUTING


Reliability Modeling

– Markov model - analyzing systems

• Consider a duplicate compare system – with repairs

• Develop Markov model with 3 states

• Develop a differential equation model

• Solve using Laplace transforms

– Yet one more example

• duplicate compare system – with imperfect coverage

• Develop Markov model with 5 states

• Reduce model for different scenarios

Page 31: FAULT-TOLERANT COMPUTING


Summary

• Introduction of mathematical models

• Solving models to carry out analysis

– Example systems

• Duplicate

• Duplicate with repair

• Simplex with repair for availability
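For the last example system, a simplex (single module) with repair, availability can be sketched with a two-state Markov model: "up" with failure rate λ and "down" with repair rate μ. The model below, its function names and the numeric rates are assumptions for illustration; the slides do not spell the model out. The differential equation dP_up/dt = -λ P_up + μ (1 - P_up) is integrated with small Euler steps and compared with its known solution.

```python
import math

def availability_numeric(lam, mu, t_end, dt):
    """Euler integration of dP_up/dt = -lam * P_up + mu * (1 - P_up), P_up(0) = 1."""
    p_up = 1.0
    for _ in range(int(round(t_end / dt))):
        p_up += dt * (-lam * p_up + mu * (1.0 - p_up))
    return p_up

def availability_closed_form(lam, mu, t):
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) * t)."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 0.01, 0.1     # hypothetical failure and repair rates, per hour
a_num = availability_numeric(lam, mu, t_end=50.0, dt=0.01)
a_exact = availability_closed_form(lam, mu, 50.0)
print(f"A(50) numeric={a_num:.6f}, closed form={a_exact:.6f}, steady state={mu / (lam + mu):.6f}")
```

Unlike reliability, availability does not decay to zero: it settles at μ/(λ+μ), which equals MTTF/(MTTF + MTTR) under the constant-rate assumptions used earlier.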