FAULT-TOLERANT COMPUTING

Jenn-Wei Lin
Department of Computer Science and Information Engineering
Fu Jen Catholic University

Reliability Modeling and Analysis
Lecture Set 3

ECE 753 Fault Tolerant Computing

Overview

• Introduction

• Reliability Modeling

– reliability block diagram

– combinatorial model

– Markov model

• Other Parameters and analysis

• General remarks and Summary

Introduction

• References

– [prad:96], [swew:99], [shooman:02]

– [triv:82]

– The three books on the first line contain sufficient material covering this part of the course

• Recap of definitions

• Importance of analysis and analytical model

• Mathematical formulation for quantitative analysis

Introduction (contd.)

• Recap of definitions

– Reliability R(t)

– Availability A(t)

– Performability and Dependability

• Importance of analysis and analytical model

– to evaluate a design

– to provide a metric for comparing different designs

– to provide feedback to the designer during early design stages

– to use a model for performance analysis

– to support both quantitative and qualitative analysis

Introduction (contd.)

• Mathematical formulation for quantitative analysis

– consider a large experiment with N identical components, all operating correctly at time t0

– observation at time t

• N0(t) - number of components still operating correctly

• Nf(t) - number of failed components, so N = N0(t) + Nf(t)

– Hence

• Reliability R(t) = N0(t)/N = 1 - Nf(t)/N

– this is the probability that a component has survived the interval [t0, t]

• Unreliability Q(t) = 1 - R(t) = Nf(t)/N

• Derivative of reliability: dR(t)/dt = -(1/N)(dNf(t)/dt)

• dNf(t)/dt is called the instantaneous failure rate of the components
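The counting definition above can be tried on simulated data. The following is a minimal sketch, not part of the original lecture: the function name, the sample size, and the assumed exponential lifetimes with a rate of 0.001 failures per hour are all illustrative choices.

```python
import random

def empirical_reliability(failure_times, t):
    """Estimate (R(t), Q(t)) as (N0(t)/N, Nf(t)/N) from observed failure times."""
    n = len(failure_times)
    n_failed = sum(1 for ft in failure_times if ft <= t)  # Nf(t)
    n_ok = n - n_failed                                   # N0(t)
    return n_ok / n, n_failed / n

# Hypothetical experiment: N = 10,000 components, lifetimes drawn from an
# exponential distribution with an assumed rate of 0.001 failures per hour.
random.seed(0)
lifetimes = [random.expovariate(0.001) for _ in range(10_000)]
r, q = empirical_reliability(lifetimes, 500.0)
print(f"R(500) ~ {r:.3f}, Q(500) ~ {q:.3f}")
```

With a large N the estimate should approach the theoretical value exp(-0.5) for this assumed distribution.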

Introduction (contd.)

• Mathematical formulation (contd.)

– Failure rate at time t

• (instantaneous failure rate at time t) / N0(t)

• z(t) = (1/N0(t))(dNf(t)/dt)

• combined with dR(t)/dt = -(1/N)(dNf(t)/dt) and R(t) = N0(t)/N, this reduces to

» z(t) = -(1/R(t))(dR(t)/dt)

» z(t) is called the failure rate function, hazard function, or hazard rate

– We can solve the above for R(t) provided we know the failure rate function z(t)

– Bathtub curve for the failure rate function

» implies a constant failure rate during useful life

» infant mortality and wear-out periods have variable failure rates
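Solving z(t) = -(1/R(t))(dR(t)/dt) with R(0) = 1 gives R(t) = exp(-∫ z(u) du), integrated from 0 to t. The sketch below evaluates that integral numerically with the midpoint rule; the function name and the two example hazard functions are assumptions made for illustration.

```python
import math

def reliability_from_hazard(z, t, steps=10_000):
    """R(t) = exp(-integral of z(u) du from 0 to t), midpoint rule."""
    h = t / steps
    integral = sum(z((i + 0.5) * h) for i in range(steps)) * h
    return math.exp(-integral)

lam = 1e-3  # assumed constant failure rate (useful-life region), per hour
r_const = reliability_from_hazard(lambda u: lam, 1000.0)

# Assumed linearly increasing hazard, a crude stand-in for a wear-out period.
r_wear = reliability_from_hazard(lambda u: 2e-6 * u, 1000.0)
print(r_const, r_wear)
```

For the constant hazard the result should match exp(-λt), which is the closed form used for the useful-life region of the bathtub curve.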

Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - constant failure rate

• dR(t)/dt = -z(t)R(t), with z(t) = λ

• solving gives exponential functions for reliability and unreliability: R(t) = 1 - Q(t) = exp(-λt)

– Reliability computation - time-varying failure rate

• Weibull distribution: z(t) = αλ(λt)**(α-1)

• solving the same equation gives R(t) = exp(-(λt)**α) and Q(t) = 1 - R(t)

– Failure rate computation - military standard

• a function of the learning factor, quality factor, temperature factor, environmental factor, and number of pins on the IC
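For the Weibull hazard, integrating z(t) = αλ(λt)**(α-1) from 0 to t gives (λt)**α, so R(t) = exp(-(λt)**α). A short sketch of both functions follows; the function names and parameter values are illustrative, not from the lecture.

```python
import math

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

# alpha = 1 reduces to the constant-failure-rate (exponential) case.
print(weibull_reliability(1000.0, 1e-3, 1.0))  # same as exp(-1)
# alpha > 1 models an increasing failure rate (wear-out).
print(weibull_hazard(500.0, 1e-3, 2.0), weibull_reliability(500.0, 1e-3, 2.0))
```

Choosing α < 1 gives a decreasing failure rate, which is one way to approximate the infant mortality period.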

Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - mean time to failure (MTTF)

• Definition: expected time that a system will operate before the first failure occurs

• Probability measure: S - sample space, E - event space

– for A in E, P(A) >= 0

– P(S) = 1

– P(A ∪ B) = P(A) + P(B), when A and B are non-intersecting

• Random variable (RV) - X maps outcomes in S to real numbers

• Probability distribution function of an RV: F(x) = P(X <= x)

• Probability density function (pdf) - derivative of the distribution function, f(x) = dF(x)/dx

Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - mean time to failure

• Probability density function - properties

– always >= 0

– integrates to 1 over the whole range of the RV

• Expectation

– E[X] = ∫ x f(x) dx in the continuous case

– E[X] = Σ xi p(xi) in the discrete case

• Application in our case

– unreliability Q(t) is the probability distribution function of the time to failure: the cumulative probability that the system fails in the interval [0, t]

Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - MTTF and MTTR

• Application in our case (contd.)

– the derivative of Q(t), written f(t), is the pdf of failure, or failure density function

– its expected value, computed by integration, is the mean time to failure: MTTF = ∫ t f(t) dt = ∫ R(t) dt, both over [0, ∞)

– constant failure rate

» MTTF = 1/λ

• Mean time to repair - MTTR

– assume a constant repair rate μ; arguments similar to those used for failure analysis give MTTR = 1/μ
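The MTTF can also be obtained as the area under R(t), which is convenient when R(t) is known but f(t) is awkward to write down. The sketch below approximates that integral with the trapezoid rule, truncated at a finite upper limit; the function name, the truncation point, and the value of λ are assumptions for illustration.

```python
import math

def mttf(reliability, t_max, steps=200_000):
    """Approximate MTTF = integral of R(t) dt over [0, t_max] (trapezoid rule).

    t_max must be large enough that R(t_max) is negligible.
    """
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h

lam = 1e-3  # assumed constant failure rate, per hour
estimate = mttf(lambda t: math.exp(-lam * t), t_max=30 / lam)
print(estimate, 1 / lam)  # the two values should agree closely
```

The same helper works for any reliability function, for example the Weibull or parallel-system expressions, as long as the tail beyond t_max is negligible.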

Introduction (contd.)

• Mathematical formulation (contd.)

– Reliability computation - mean time between failures (MTBF)

• Mean time between failures - MTBF

– use heuristic arguments to conclude

» MTBF = (total time T)/(average number of failures)

– can also argue MTBF = MTTF + MTTR

• Note: often λ << μ and hence MTTF >> MTTR; therefore the terms MTTF and MTBF are often used interchangeably
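A quick numeric check shows why the two terms are often treated as the same quantity. The rates below are assumed values chosen only for illustration (one failure per 10,000 hours, a two-hour average repair).

```python
def mtbf(lam, mu):
    """MTBF = MTTF + MTTR for constant failure rate lam and repair rate mu."""
    mttf = 1.0 / lam
    mttr = 1.0 / mu
    return mttf + mttr

lam, mu = 1e-4, 0.5  # assumed rates, per hour
relative_gap = (mtbf(lam, mu) - 1.0 / lam) / (1.0 / lam)
print(mtbf(lam, mu), relative_gap)
```

With these assumed rates the MTBF exceeds the MTTF by about 0.02%, which is usually far below the uncertainty in λ itself.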

Reliability Modeling

• Application of the previous analysis to system models

– Assumptions

• system consists of modules

• each module assigned a probability of working R(t), a function of time

• once a module fails it is assumed to yield incorrect results

• module failures are independent

Reliability Modeling

• Application of the previous analysis to system models

– Reliability block diagrams

• consider a system: microprocessor, controller, memory, bus, …

• the system will fail if any of the components fails

• Rsys = P(all subsystems work correctly) = P(bus correct) · P(mem correct) · … (follows from the assumption that component failures are independent)

• Rsys = Rbus · Rmem · Rmicro · Rcont

Reliability Modeling

– Reliability block diagrams - Series Systems

• Assume system has n components

• All components must survive for the system to operate

• Reliability of system

– Rsys(t) = Π Ri(t) (product over i = 1, …, n)

• For exponential distributions of each component

– Rsys(t) = Π exp(-λi t) = exp(-(λ1 + λ2 + … + λn)t) = exp(-Σ λi t)

– Effect is that the system failure rate is the sum of the failure rates of the components

• Note these are nonredundant systems

[Figure: series block diagram R1, R2, …, Rn]
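The series formula is a product over components, which for constant failure rates collapses to a single exponential with the summed rate. A minimal sketch follows, assuming independent components; the function name and the example rates are made up for illustration.

```python
import math

def series_reliability(rates, t):
    """Rsys(t) for independent components in series with constant failure rates."""
    return math.prod(math.exp(-lam * t) for lam in rates)

# Hypothetical failure rates (per hour) for four components in series.
rates = [1e-4, 2e-4, 5e-4, 1e-5]
system_rate = sum(rates)  # system failure rate is the sum of component rates
print(series_reliability(rates, 1000.0), math.exp(-system_rate * 1000.0))
```

Both printed values should agree, and the system reliability is always below that of the weakest component.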

Reliability Modeling

– Reliability block diagrams - Parallel Systems

• Assume system with spares

• a faulty component is replaced by a spare as soon as a fault occurs

• only one component needs to survive for the system to operate

• Model is to represent all components connected in parallel

• P(sys fail) = P(M1 fails) · P(M2 fails) · … · P(Mn fails)

• Rsys = 1 - P(sys fail) = 1 - (1-R1)(1-R2) … (1-Rn)
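The parallel formula can be sketched in the same way. The function name and module reliabilities below are illustrative, and the sketch keeps the slide's assumptions of independent failures and ideal switching to spares.

```python
import math

def parallel_reliability(reliabilities):
    """Rsys = 1 - product of (1 - Ri) for independent modules in parallel."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

print(parallel_reliability([0.9, 0.9]))       # two modules of reliability 0.9
print(parallel_reliability([0.9, 0.8, 0.7]))  # three unequal modules
```

Each added module multiplies the failure probability by (1 - Ri), so the gain from every extra spare shrinks as the system reliability approaches 1.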

Reliability Modeling

– Reliability block diagrams - Series-Parallel Systems

• straightforward: repeatedly reduce series groups and parallel groups with the series and parallel formulas until a single block remains

– Reliability block diagrams - MTTF of system

• 1/(system failure rate), when the system failure rate is constant

• Series systems - 1/(sum of individual failure rates)

• Parallel systems and series-parallel systems - work out by integration from the reliability or unreliability equations
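As a worked instance of the integration route, consider a parallel system of two identical, independent modules, each with an assumed constant failure rate λ (this specific example is not in the original slides):

```latex
\begin{align*}
R_{sys}(t) &= 1 - \left(1 - e^{-\lambda t}\right)^2
            = 2e^{-\lambda t} - e^{-2\lambda t} \\
\mathrm{MTTF}_{sys} &= \int_0^\infty R_{sys}(t)\,dt
            = \frac{2}{\lambda} - \frac{1}{2\lambda}
            = \frac{3}{2\lambda}
\end{align*}
```

Duplicating the module therefore raises the MTTF by a factor of 1.5 rather than 2, and the parallel system no longer has a constant failure rate, which is why the 1/(failure rate) shortcut does not apply.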

Reliability Modeling

– Reliability block diagrams - Non-series-parallel systems

• Bayes rule (total probability): consider a sample space S. Partition it into a set B and its complement B'. Now consider an event A that falls partly in B and partly in B'. We can write:

A = (A ∩ B) ∪ (A ∩ B')

P(A) = P[(A ∩ B) ∪ (A ∩ B')]

= P(A ∩ B) + P(A ∩ B')

= P(A|B)P(B) + P(A|B')P(B')

• In general the set S can be partitioned into (B1, B2, … , Bn)

P(A) = Σ P(A|Bi)P(Bi)

This can also be viewed graphically (draw a tree)

Reliability Modeling

• Reliability block diagrams - Non-series-parallel systems

– Example - consider the following non-series-parallel system

– list all paths for the system to survive, namely c1c4, c2c4, c2c5, c3c5

– These paths are not disjoint, so the sum of the reliabilities of all paths gives only an upper bound on the system reliability

– Exact computation is possible using Bayes rule (completed in class)

[Figure: block diagram of the non-series-parallel system with components C1, C2, C3, C4, C5]
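The exact computation can be sketched from the four success paths listed above. The choice to condition on C2 is an assumption made here (it is the component shared by two paths), not necessarily the route taken in class; the function names and component reliabilities are also illustrative. A brute-force enumeration over all 32 component states serves as a cross-check.

```python
from itertools import product

PATHS = [(1, 4), (2, 4), (2, 5), (3, 5)]  # success paths listed on the slide

def exact_by_enumeration(r):
    """Sum the probabilities of all component states in which some path is intact."""
    comps = sorted(r)
    total = 0.0
    for state in product([True, False], repeat=len(comps)):
        up = dict(zip(comps, state))
        if any(all(up[c] for c in path) for path in PATHS):
            p = 1.0
            for c in comps:
                p *= r[c] if up[c] else 1.0 - r[c]
            total += p
    return total

def exact_by_conditioning(r):
    """P(A) = P(A | C2 works) P(C2 works) + P(A | C2 fails) P(C2 fails)."""
    given_up = 1.0 - (1.0 - r[4]) * (1.0 - r[5])                  # C4 or C5 suffices
    given_down = 1.0 - (1.0 - r[1] * r[4]) * (1.0 - r[3] * r[5])  # c1c4 or c3c5
    return r[2] * given_up + (1.0 - r[2]) * given_down

def path_sum_upper_bound(r):
    """Sum of path reliabilities; can exceed 1 because the paths overlap."""
    return sum(r[a] * r[b] for a, b in PATHS)

r = {c: 0.9 for c in range(1, 6)}  # assumed component reliabilities
print(exact_by_enumeration(r), exact_by_conditioning(r), path_sum_upper_bound(r))
```

The two exact methods should agree, while the path sum overshoots badly for reliable components, which illustrates why it is only an upper bound.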

Reliability Modeling

– Combinatorial model

• Consider an NMR (N-modular redundancy) system

• Assume voter reliability to be 1

• Divide all events for success into disjoint events

• Compute probability of each event and add them

• Example – TMR system

• Can be used to compute MTTF

• Can also analyze other systems such as an m-of-n system
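For identical, independent modules of reliability R and a perfect voter, the disjoint success events are "exactly k modules work" for k = m, …, n, each with a binomial probability. The sketch below assumes that setting; the function names are illustrative.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """P(at least m of n identical independent modules work), perfect voter assumed."""
    return sum(comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(m, n + 1))

def tmr_reliability(r):
    """TMR is 2-of-3: 3R^2(1 - R) + R^3 = 3R^2 - 2R^3."""
    return 3 * r**2 - 2 * r**3

print(m_of_n_reliability(2, 3, 0.9), tmr_reliability(0.9))
print(tmr_reliability(0.4))  # below R = 0.5, TMR is worse than a single module
```

Substituting R = exp(-λt) and integrating gives a TMR MTTF of 3/(2λ) - 2/(3λ) = 5/(6λ), which is lower than the 1/λ of a single module even though TMR is more reliable for short missions.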

Reliability Modeling

– Markov model

• Difficulty with the previous models

– incorporating repairs in the model and analysis

– incorporating a coverage factor: for example, in a duplicated system we may be less than 100% certain that only the faulty unit will be eliminated when the system is reconfigured

• Markov modeling - basics

– Define the concept of state using the TMR system example (8 states)

– Transitions between states occur with certain probabilities

• Markov model - assumption

– The probability of a transition from state si to state sj is independent of how the system arrived in state si

• Example: develop a Markov model for a TMR system in class

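Reliability Modeling

– Markov model

• Markov model for a TMR system - not all details shown

[Figure: state diagram with the eight states 111, 110, 101, 011, 100, 010, 001, 000; recovered edge labels: λΔt (three transitions) and 1 - 3λΔt]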

Reliability Modeling

– Markov model - Reduced

• Reduced Markov model for a TMR system

• The previous eight-state model can be reduced to a three-state model by merging states and recomputing the transition probabilities

– Markov model - accounting for repairs

• We can include links between states knowing the repair rates of components
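One way to picture the reduced model is as three states: all three modules working, exactly two working, and system failed. The sketch below steps the corresponding difference equations forward in time. It assumes identical modules with constant failure rate λ, a perfect voter, and no repair; the transition probabilities 3λΔt and 2λΔt follow from merging the eight-state model under those assumptions, and the function name and numeric values are illustrative.

```python
import math

def tmr_markov(lam, t, dt=1e-3):
    """Return (p3, p2, pf) at time t for the reduced three-state TMR model."""
    p3, p2, pf = 1.0, 0.0, 0.0  # start with all three modules working
    for _ in range(round(t / dt)):
        p3, p2, pf = (
            (1 - 3 * lam * dt) * p3,                     # stay in "3 working"
            3 * lam * dt * p3 + (1 - 2 * lam * dt) * p2,  # enter or stay in "2 working"
            2 * lam * dt * p2 + pf,                       # absorbing failed state
        )
    return p3, p2, pf

lam, t = 1.0, 1.0  # assumed failure rate and mission time
p3, p2, pf = tmr_markov(lam, t)
closed_form = 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)
print(p3 + p2, closed_form)  # Markov result vs combinatorial TMR reliability
```

The reliability p3 + p2 should approach the combinatorial result 3R² - 2R³ with R = exp(-λt) as Δt shrinks. Adding a repair rate would amount to one more term moving probability from the two-working state back to the three-working state.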

Reliability Modeling

– Markov model - analyzing systems

• Consider a duplicate-and-compare system with no repairs

• Develop a Markov model with 3 states

• Develop a difference equation for computing the probabilities of being in the different states of the system

• Develop a differential equation model

• Solution methods

– Numerical approach

– Solving the differential equations

» direct approach

» using Laplace transforms

Reliability Modeling

– Markov model - analyzing systems

• Consider a duplicate-and-compare system with repairs

• Develop a Markov model with 3 states

• Develop a differential equation model

• Solve using Laplace transforms

– Yet one more example

• duplicate-and-compare system with imperfect coverage

• Develop a Markov model with 5 states

• Reduce the model for different scenarios
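The numerical approach can be sketched for a duplicated system with and without repair. The three-state model used here is an assumption, since the slides leave the model to be developed in class: state 2 (both modules good), state 1 (one good, one failed), and an absorbing failed state, with perfect coverage so that the surviving module carries on alone. Failures occur at rate λ per module and a failed module is repaired at rate μ. The function name and the rates are illustrative.

```python
import math

def duplex_reliability(lam, mu, t, dt=1e-3):
    """Euler integration of an assumed 3-state duplex model; returns R(t) = p2 + p1."""
    p2, p1, pf = 1.0, 0.0, 0.0
    for _ in range(round(t / dt)):
        d2 = -2 * lam * p2 + mu * p1          # leave on first failure, return on repair
        d1 = 2 * lam * p2 - (lam + mu) * p1   # enter on first failure, leave on repair or second failure
        df = lam * p1                         # second failure before repair completes
        p2, p1, pf = p2 + d2 * dt, p1 + d1 * dt, pf + df * dt
    return p2 + p1

lam, t = 1.0, 1.0  # assumed failure rate and mission time
no_repair = duplex_reliability(lam, 0.0, t)
with_repair = duplex_reliability(lam, 10.0, t)
print(no_repair, 2 * math.exp(-lam * t) - math.exp(-2 * lam * t))  # vs closed form
print(with_repair)
```

With μ = 0 the result should approach the closed-form parallel reliability 2exp(-λt) - exp(-2λt), and any μ > 0 should give a higher value. An imperfect-coverage variant would split the first-failure transition into a covered part and an uncovered part that leads directly to the failed state.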

Summary

• Introduction of mathematical models

• Solving models to carry out analysis

– Example systems

• Duplicate

• Duplicate with repair

• Simplex with repair for availability