ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering...


ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja

Department of Electrical and Computer Engineering

Motivation and Introduction

Lecture Set 1


Overview

• Motivation

• About the Course and the Instructor

– Conduct, Outline, Coursepack

• Introduction

• Terminology and definitions

– Sources, Overview and Comments

– System defined

• Dependability/Security and their attributes

• Threats to dependability and modeling the fault-error-failure (FEF) chain

• Means to attain dependability

• Fundamental Principles


Motivation

• Informal Definition

• Key Attributes

• Who, What and Why Study

• Examples


Motivation

• What is Fault-Tolerance?

A “fault-tolerant system” is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system.


Motivation (contd.)

• Key attributes

Fault - Error - Failure

Performance - Availability - Reliability

More recently, the concept of “survivability”

Including these constraints at the design stage is likely to be more cost-effective.


Motivation (contd.)

• Who is concerned about fault-tolerance?

– System Users – irrespective of the application, but some are a lot more concerned than others

• Who is concerned at design stages?

– Universities

• R, d, and a (Research, development, applications)

– Industry

• r, D, and A (research, Development, Applications)

• Issues

– Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation


Motivation (contd.)

Examples

• General Purpose Systems

– PCs: RAMs with parity checks and possibly ECC (re-execution on failure detection is being investigated)

– Workstations/Servers: error detection (HW), occasional corrective action (SW), even ECC (HW), keeping a log (SW)


Motivation (contd.)

Examples

• Reliable Systems

– Telephone systems

– Banking systems, e.g. ATM

– Stock market

– CAE - exams/projects

– Football games - display/ticketing


Motivation (contd.)

Examples

• Critical and Life Critical Systems

– Manned and unmanned space borne systems

– Aircraft control systems

– Nuclear reactor control systems

– Life support systems


Motivation (contd.)

Examples

• Reliable -> Critical Systems

– 911 telephone switching system

– Traffic light control system

– Automotive control systems (ABS, fuel injection system)


About the Course and the Instructor

• Conduct

– homeworks, exam, project, grading

• Outline

• Coursepack

– references and reading list


Introduction

– Historical perspective and major push

– New initiatives

– Goals of fault-tolerance

– Applications of fault-tolerance


Introduction (contd.)

• Historical Perspective

– not a new concept

– first use by J. von Neumann, 1956

• “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” Annals of Mathematics Studies, Princeton University Press

• Major push

– Space program

– HW Fault tolerance - then

– SW Fault tolerance later

– Merge the two


Introduction (contd.)

• New initiatives

Density of devices: more failures likely

Power issue – scheduler, on-chip sensors

Failures due to soft errors, lifetime degradation

- hardening, re-execution

- on-chip ECC

- reconfiguration

- microarchitectural solutions

- architectural solutions


Introduction (contd.)

• New initiatives (contd.)

Deep submicron technology and time-to-market pressure: designs not fully verified

Implementation of numerous functionalities on chip/board/system: possibility of system hang-up

Speculative execution: results may need to be re-checked

Low cost of HW and SW: affordable/economical

• Hot issues: Soft errors, Life-time failures, Power and Thermal Management


Introduction (contd.)

• Goals - different goals for different applications

The key word is “reliability” – it has a different meaning for different users and applications

• Intuitive explanations

– Dependability

– Service

– Specification


Introduction (contd.)

• Intuitive concepts

– Reliability – continues to work

– Availability – works when I need it

– Safety – does not put me in jeopardy

– Performability

– Maintainability

– Testability

– Survivability – will the system survive catastrophic events?

– Security


Introduction (contd.)

• Applications

– Space borne system

• long life system

– Airplane control system

• critical system

– Transaction processing system

• high availability system

– Switching system

• high availability over certain level of performance


Terminology and definitions

• Reliability and concept of probability

– R(t): conditional probability that a system provides continuous proper service in the interval [0,t], given that it provided the desired service at time 0.

• Availability

• Performability – An Example

• Dependability

• Security
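
As a rough numerical illustration of R(t) and availability, the sketch below assumes a constant failure rate (the exponential failure law) and a simple steady-state repair model. Neither assumption appears on the slide, and the names `failure_rate`, `mttf`, and `mttr` are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) under an ASSUMED constant failure rate: exp(-failure_rate * t)."""
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical numbers: one failure per 1000 hours on average.
    print(reliability(t=100.0, failure_rate=0.001))
    print(steady_state_availability(mttf=999.0, mttr=1.0))
```

The contrast is the point: reliability asks for uninterrupted service over the whole interval, while availability only asks whether the system is up when needed, so a system that fails often but is repaired quickly can have low R(t) and still high availability.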


Sources, Overview and Comments (1/4)

Key reference:

• Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.

Other references:

• Israel Koren and C. Mani Krishna, Fault Tolerant Systems, Elsevier, 2007.

• D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

• B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, First edition, 1989.

• My course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.html


Sources, Overview and Comments (2/4)

• What does the paper cover?

– Very basic definitions of the terminologies used in dependable computing

– It categorizes definitions in three groups

• System, attributes of dependability, threats to dependability

– Covers very briefly methods to attain dependability


Sources, Overview and Comments (3/4)

• How to read the paper?

– It is easy to read – scan it first and then read it

– I have organized the material differently – you may find it helpful

• What is not covered?

– One attribute almost missing - survivability

– Basic methods of Fault Tolerance and their characterization


Sources, Overview and Comments (4/4)

• Chronology of Developments

– Need for fault-tolerance - inception of the space program (recall that “Voyager”, launched in 1977, is still sending signals)

– First standard glossary in 1985

– Integration of performance etc. into fault tolerance – and hence the term “Dependability” – book published in 1992

– Recognition of “Security” as a basic attribute of dependability – this paper in 2004


System Defined (1/4)

• “. . . an entity that interacts with other entities”

– First entity (system) – limited to be “electronic (mostly digital)” or “computer based”

– Second entity

• Hardware, software, human, other systems, .. (can also be called “environment”)

• Characterization and fundamental properties

– Functionality

– Performance

– Dependability and security

– Cost

(usability, manageability, adaptability: not directly included in the paper)


System Defined (2/4)

• Function – “what the system is intended to do”

– functional specifications: describe it in terms of functionality and performance

– behavior – described as a sequence of states to implement the functionality

– Total states – set of states as system evolves

• Internal states

• External states – as viewed by the environment and users

• Structure – “What enables system behavior (function)”

– Interconnected components – recursively defined to “atomic” level


System Defined (3/4)

• System Life Cycle

– Development phase

– Use phase

• Service – what is delivered by the system to its “environment” (user)

– Environment sees only the “external states”

– Development Phase – activities from concept to decision that system is ready for “use phase”

– Use Phase - More meaningful and includes service delivery, service outage, service shutdown, maintenance


System Defined (4/4)

• Development phase environment

– Physical world

– Human developers

– Development tools

– Production and test facilities

• Use phase environment

– Physical world

– Administrators – maintainers

– Users and intruders

– Providers and infrastructure


Dependability/Security Attributes (1/6)

• Original definition: “ability to deliver service that can justifiably be trusted”

• Encompassing the following attributes

– Availability

– Reliability

– Safety

– Integrity

– Maintainability


Dependability/Security Attributes (2/6)

• New definition: “ability to avoid service failures that are more frequent or more severe than is acceptable” - deliver service that can justifiably be trusted

• Reason for modification

– Security related issues

– This recognizes that a system can fail, and usually does, yet can still be called dependable

– This definition also enables a connection with “development failures”


Dependability/Security Attributes (3/6)

Dependability

• availability: readiness for correct service.

• reliability: continuity of correct service.

• safety: absence of catastrophic consequences on the user(s) and the environment.

• integrity: absence of improper system alterations.

• maintainability: ability to undergo modification and repairs.

When addressing security, there is an additional attribute, confidentiality: the absence of unauthorized disclosure.


Dependability/Security Attributes (4/6)

Security is the concurrent existence of a composite of the attributes

1) availability (for authorized actions only),

2) confidentiality, and

3) integrity (with “improper” meaning “unauthorized”)



Dependability/Security Attributes (5/6)


Dependability/Security Attributes (6/6)

• Other related concepts – summarized in table (Fig 15) - these are

– Dependability

– High confidence

– Survivability

– Trustworthiness

• Example: all these have similar goals such as 1) ability to deliver service, 2) predictable service, 3) fulfill mission, 4) assurance of expected service delivery


Threats and modeling threats (1/12)

• Different phases are open to different types of threats – generally termed as “faults”

• Faults lead to “errors” – a total state of the system different from the “true total state”

• Errors can lead to “failure” – the service deviates from the desired service

• This creates a FEF chain – a hierarchical phenomenon (see next and more later)


Threats and modeling threats (2/12)

fault -> error -> failure

Fault activation – Error manifestation – Failure

Fault – active or dormant

Error – masked or latent

Failure – incorrect response


Threats and modeling threats (3/12)

FEF chain in a hierarchy


Threats and modeling threats (4/12)

Fault classes

• Groups (not exclusive)

– Development, Physical – (that affect hardware - I disagree with this definition), Interaction

• Viewpoints:

– phase, system boundary, cause, dimension, objective, intent, capability, persistence


Threats and modeling threats (5/12)

Fault Taxonomy and Examples

Production defect: physical, hardware, natural

Bug: physical, software, natural

Omission (absence of an action): Human made, system generated

Malicious (meant to cause harm): Human made, Hardware or software

Notes:

1. Paper has a classification – Fig 4 and 5

2. Examples and definition of many other faults given. Some listed on next slide


Threats and modeling threats (6/12)

Fault Taxonomy (contd.)

Permanent faults

Intermittent faults – repeat at some interval

Transient faults – no specific interval

Malicious logic faults – caused by natural faults

Intrusion attempts – caused by humans

Interaction faults – may be development phase or use phase

Configuration faults – incorrect setting of parameters


Threats and modeling threats (7/12)

Error classes

• Detected

• Latent

An example

– An adder gives an incorrect sum for certain operands

– Fault is active when those operands appear, otherwise it is dormant

– Incorrect sum is latent unless used or checked for correctness
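
The adder example can be made concrete with a small sketch. The specific fault (a wrong sum only when both operands are odd) and the checker (subtracting one operand back out, assumed fault-free) are hypothetical choices made for illustration; the slide does not specify either.

```python
def faulty_add(a: int, b: int) -> int:
    """Adder with a hypothetical fault that is dormant for most operands."""
    result = a + b
    if a % 2 == 1 and b % 2 == 1:
        # Fault activation: only these operand pairs exercise the defect.
        result ^= 0b100  # flips bit 2 of the sum, producing an error
    return result


def checked_add(a: int, b: int) -> tuple[int, bool]:
    """Return (sum, error_detected) using an inverse-operation check.

    The subtraction used by the checker is ASSUMED to be fault-free.
    """
    s = faulty_add(a, b)
    error_detected = (s - b) != a
    return s, error_detected


if __name__ == "__main__":
    print(checked_add(2, 3))  # fault dormant: correct sum, nothing detected
    print(checked_add(3, 5))  # fault active: wrong sum, detected by the check
```

Without the check in `checked_add`, the wrong value returned by `faulty_add(3, 5)` would remain a latent error until some later computation used it.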


Threats and modeling threats (8/12)

Failure classes

• Development failures

• Service failures

• Security failures


Threats and modeling threats (9/12)

• Development failures – introduced during the development phase

– Human developers

– Tools

– Production facility

– Budgetary reasons

– Scheduling issue (time to market)

(basically the system delivered is a downgraded system)


Threats and modeling threats (10/12)

• Service failures - delivery of incorrect service

– Four viewpoints

1. Failure domain

– Content failure

– Timing failure – early or late delivery of the service(s)

• Special case: silent failure, halt failure, crash failure

• Erratic failure (like Byzantine failure)


Threats and modeling threats (11/12)

2. Failure detectability

– Signal provided by some checking mechanism

• Signaled failure

• Unsignaled failure

• False alarm

3. Consistency

– Consistent failure – all services see the same data

– Inconsistent – different services see different data (like Byzantine failure)


Threats and modeling threats (12/12)

4. Consequence of failure

– Need to rate the failure and hence develop criteria – examples:

• Duration of outage (availability related)

• Lives being endangered (safety related)

• Extent of corrupted service (integrity related)

• Amount of information disclosed (confidentiality related)
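
The detectability viewpoint (item 2 above) amounts to comparing what the service actually did with what the checking mechanism reported. A minimal sketch of that comparison follows; the `Outcome` enum and the `classify` function are hypothetical names introduced only for this illustration.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct service, no signal"
    SIGNALED_FAILURE = "signaled failure"
    UNSIGNALED_FAILURE = "unsignaled failure"
    FALSE_ALARM = "false alarm"


def classify(service_correct: bool, checker_signaled: bool) -> Outcome:
    """Map (actual service, checker signal) to a detectability class."""
    if service_correct:
        # A signal without a real failure is a false alarm.
        return Outcome.FALSE_ALARM if checker_signaled else Outcome.CORRECT
    # The service failed; what matters is whether the checker noticed.
    if checker_signaled:
        return Outcome.SIGNALED_FAILURE
    return Outcome.UNSIGNALED_FAILURE


if __name__ == "__main__":
    print(classify(service_correct=False, checker_signaled=False).value)
```

The unsignaled case is usually the most dangerous one, because the environment keeps using incorrect service without any warning.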


Means to attain dependability (1/6)

• Fault Prevention or Fault Avoidance

– Improvement of development process

– Elimination of causes that can induce faults

• Fault Tolerance

– Techniques and implementations

(more later)


Means to attain dependability (2/6)

• Fault Removal

• Remove faults during development phase

– extensive simulation and validation

• Testing

– Deterministic testing

– Random and statistical testing

– Back to back testing

Test/validation quality: fault injection, design for test/verification


Means to attain dependability (3/6)

• Fault Forecasting – evaluate the system behavior and then use one or more methods previously discussed to improve dependability

– Qualitative evaluation

– Quantitative evaluation

– Use of benchmarks

– Use of simulators

Examples: 1) Error and failure logs

2) when and where commissioned


Means to attain dependability (4/6)

• Fault Tolerance Techniques

• Error detection - needs redundancy

– Duplicate execution

– Use of parity

– Checker programs and/or hardware

– More later
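
Two of the detection techniques listed above, parity and duplicate execution, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, even parity is an assumed convention, and data words are modeled as non-negative Python integers.

```python
def parity_bit(data: int) -> int:
    """Even-parity bit: 1 when the data word has an odd number of 1 bits."""
    if data < 0:
        raise ValueError("data must be a non-negative integer")
    return bin(data).count("1") % 2


def store(data: int) -> tuple[int, int]:
    """Attach one redundant parity bit to the data word."""
    return data, parity_bit(data)


def parity_ok(word: tuple[int, int]) -> bool:
    """Recompute parity on read and compare with the stored bit."""
    data, stored_parity = word
    return parity_bit(data) == stored_parity


def duplicate_execute(func, *args):
    """Run func twice and report whether the two results agree."""
    first = func(*args)
    second = func(*args)
    return first, first == second


if __name__ == "__main__":
    data, p = store(0b1011)
    print(parity_ok((data ^ 0b0001, p)))   # single-bit flip: check fails
    print(parity_ok((data ^ 0b0011, p)))   # double-bit flip: check passes
    print(duplicate_execute(sum, [1, 2, 3]))
```

Both checks only detect. A single parity bit catches any odd number of flipped bits but misses an even number, and repeating a computation on the same hardware tends to expose transient faults rather than permanent ones.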


Means to attain dependability (5/6)

• Recovery - Key is redundancy

• Error handling

– Masking and compensation

– Rollback

– Rollforward

• Fault handling

– Diagnosis

– Isolation

– Reconfiguration

– Initialization
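
Rollback, one of the error-handling options above, can be sketched as checkpoint-and-retry. The class and function names below are hypothetical, the state is a plain in-memory object, and the validity check stands in for whatever error detection the system has.

```python
import copy


class CheckpointedState:
    """Holds a working state plus a saved copy to roll back to."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        self.state = copy.deepcopy(self._checkpoint)


def run_with_rollback(cs, step, is_valid, max_retries: int = 3) -> int:
    """Apply step(state) in place; on a detected error, roll back and retry.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        step(cs.state)
        if is_valid(cs.state):
            cs.checkpoint()      # commit the new error-free state
            return attempt
        cs.rollback()            # discard the erroneous state
    raise RuntimeError("error persisted after retries; fault handling needed")
```

Retrying after rollback helps with transient faults. If the error persists across retries, the fault is probably permanent, which is where the fault-handling steps (diagnosis, isolation, reconfiguration) take over.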


Means to attain dependability (6/6)

• Key to fault tolerance

– Break the FEF chain

– Use “redundancy” to improve “use phase” dependability and security

• See next “fundamental principles”


Fundamental Principles

• Hardware redundancy

– Low level

– High level

• Software Redundancy

• Time Redundancy

• Information Redundancy


Fundamental Principles (contd.)

• Hardware Redundancy - Low level

– logic level

• Example 1 - Self checking circuits

• Example 2 - Arithmetic code: a modular adder using the mathematical principle

(A+B) mod k = ((A mod k) + (B mod k)) mod k

• Hardware Redundancy - High level

– Triplicate or 5 copies, as in the space shuttle
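
Both ideas on this slide can be modeled in software. The sketch below checks an adder's output against the residue identity above and takes a majority vote over replicated outputs. It is an illustration, not a hardware design: the function names, the default k = 3, and the injectable `adder` parameter are all assumptions.

```python
from collections import Counter


def residue_checked_add(a: int, b: int, k: int = 3, adder=lambda x, y: x + y):
    """Add with `adder`, then verify (A+B) mod k == ((A mod k) + (B mod k)) mod k.

    Returns (sum, check_passed). The small mod-k check path is assumed
    to be fault-free.
    """
    s = adder(a, b)
    check_passed = s % k == ((a % k) + (b % k)) % k
    return s, check_passed


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


if __name__ == "__main__":
    print(residue_checked_add(7, 8))
    print(residue_checked_add(7, 8, adder=lambda x, y: x + y + 1))
    print(majority_vote([5, 5, 9]))        # triplication masks one bad copy
    print(majority_vote([1, 2, 1, 1, 3]))  # five copies mask two bad copies
```

The residue check is detection only, and an error that shifts the sum by a multiple of k slips through. Voting over three or five copies masks the error outright, at the cost of replicating the whole unit.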


Fundamental Principles (contd.)

• Software Redundancy

– Use two different programs/algorithms

• Time Redundancy

– Re-compute or redo the task and compare the results

– May or may not use the same hardware/software

• Information Redundancy

– backup information

– Use of ECC

• Question - What kind of FT is achieved?
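
One way to approach the question is to try software redundancy with two versions. In the sketch below, two independently written routines compute the same function and their results are compared. The routines and the `run_diverse` helper are hypothetical examples, not taken from the course material.

```python
def sum_to_n_loop(n: int) -> int:
    """Version A: iterate and accumulate."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n: int) -> int:
    """Version B: closed form n(n+1)/2."""
    return n * (n + 1) // 2


def run_diverse(versions, *args):
    """Run every version on the same input and compare the results.

    Returns (first_result, all_agree).
    """
    results = [version(*args) for version in versions]
    return results[0], all(r == results[0] for r in results)


if __name__ == "__main__":
    print(run_diverse([sum_to_n_loop, sum_to_n_formula], 100))
    buggy = lambda n: n * n // 2  # hypothetical faulty second version
    print(run_diverse([sum_to_n_loop, buggy], 100))
```

With only two versions, a disagreement detects an error but cannot say which version is wrong, so what is achieved is detection rather than masking. At least three versions and a vote would be needed to mask a single faulty version.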


Fault-Error-Failure

• Intuitive definitions

• Origins of faults

• Methods to break FEF chain

• Attribute of faults


Fault-Error-Failure concept (contd.)

Intuitive definitions

• Fault -

– An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …

– Causes

• Error - Effect of activation of a fault

• Failure - overall system effect of an error

Fault -> Error -> Failure


Fault-Error-Failure concept (contd.)

Origins of faults

• Physical device level (HW)

• Logic level (HW)

• Chip level (HW)

• System level (HW/SW)

– interfacing, specifications, …

• Why systems fail


Fault-Error-Failure concept (contd.)

Methods to break FEF chain

• Flow FEF

• Barriers

– Fault avoidance

– Fault masking

– Fault removal

– Fault forecasting


Fault-Error-Failure concept (contd.)

Attribute of faults

• Cause

• Nature

• Duration

• Extent

• Value