Reliable Architectures - MIT OpenCourseWare · Joel Emer December 7, 2005 6.823, L24-1 Reliable...

40
Joel Emer December 7, 2005 6.823, L24-1 Reliable Architectures Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Transcript of Reliable Architectures - MIT OpenCourseWare · Joel Emer December 7, 2005 6.823, L24-1 Reliable...

Joel Emer December 7, 2005

6.823, L24-1

Reliable Architectures

Joel Emer Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

0

6.823, L24-2

Strike Changes State of a Single Bit

1

Joel Emer December 7, 2005

6.823, L24-3

Impact of Neutron Strike on a Si Device

drain to alter the state of the device

+ - ++ +-- -

Transistor Device

source drain

Joel Emer December 7, 2005

Secondary source of upsets: alpha particles from packaging

Strikes release electron & hole pairs that can be absorbed by source &

neutron strike

n

Joel Emer December 7, 2005

6.823, L24-4

Cosmic Rays Come From Deep Space

p n p

p

n

n

p

p n

n

Earth’s Surface

• Neutron flux is higher in higher altitudes 3x - 5x increase in Denver at 5,000 feet

100x increase in airplanes at 30,000+ feet

Joel Emer December 7, 2005

6.823, L24-5

Physical Solutions are hard

• Shielding? – No practical absorbent (e.g., approximately > 10 ft of concrete) – unlike Alpha particles

• Technology solution: SOI? – Partially-depleted SOI of some help, effect on logic unclear – Fully-depleted SOI may help, but is challenging to manufacture

• Circuit level solution? – Radiation hardened circuits can provide 10x improvement with

significant penalty in performance, area, cost – 2-4x improvement may be possible with less penalty

6.823, L24-6Triple Modular Redundancy (Von Neumann, 1956)

V does a majority vote on the results

M

M

M

V Result

Joel Emer December 7, 2005

6.823, L24-7Dual Modular Redundancy (e.g., Binac, Stratus)

• •

restore state to other

M

M

C Mismatch?

Error?

Error?

Joel Emer December 7, 2005

Processing stops on mismatch Error signal used to decide which processor be used to

6.823, L24-8Pair and Spare Lockstep

• •

M

M

C Mismatch?

Primary

M

M

C Mismatch?

Backup

Joel Emer December 7, 2005

(e.g., Tandem, 1975)

Primary creates periodic checkpoints Backup restarts from checkpoint on mismatch

6.823, L24-9Redundant Multithreading (e.g., Reinhardt, Mukherjee, 2000)

X W X X W X X W

X W X X W X X W

C Fault?

Leading Thread

Trailing Thread

C Fault? C Fault?

Joel Emer December 7, 2005

Writes are checked

6.823, L24-10Component Protection

Error?

ECC

1 1 0

Parity

Parity

1 1 0

ECC

0

1 1

… …

Joel Emer December 7, 2005

• Fujitsu SPARC in 130 nm technology (ISSCC 2003) – 80% of 200k latches protected with parity – versus very few latches protected in commodity microprocessors

yes

6.823, L24-11Strike on a bit (e.g., in register file)

Bit

Bit has error protection?

yes no

detection & correction

no no error

benign fault no error

detection only

outcome?

True DUE False DUE

noyes no

outcome?

benign fault no error SDC

yes no

Joel Emer December 7, 2005

Read?

affects program affects program

SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error

Joel Emer December 7, 2005

Metrics 6.823, L24-12

• Interval-based – MTTF = Mean Time to Failure – MTTR = Mean Time to Repair – MTBF = Mean Time Between Failures = MTTF + MTTR – Availability = MTTF / MTBF

• Rate-based – FIT = Failure in Time = 1 failure in a billion hours

– 1 year MTTF = 109 / (24 * 365) FIT = 114,155 FIT

– SER FIT = SDC FIT + DUE FIT

Hypothetical Example Cache: 0 FIT

Image removed due to + IQ: 100K FITcopyright restrictions. + FU: 58K FIT

Total of 158K FIT

Joel Emer December 7, 2005

6.823, L24-13

Cosmic Ray Strikes: Evidence & Reaction

• Publicly disclosed incidence

– Error logs in large servers, E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl Sci, Vol. 43, No. 6, Dec 1996.

– Sun Microsystems found cosmic ray strikes on L2 cache with defective error protection caused Sun’s flagship servers to crash, R. Baumann, IRPS Tutorial on SER, 2000.

– Cypress Semiconductor reported in 2004 a single soft error brought a billion-dollar automotive factory to a halt once a month, Zielger & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004.

6.823, L24-14

# Vulnerable Bits Growing with Moore’s Law

1

10

100

10000

l

l

12x GAP

Typical DUE goal: 10-25 year MTBF

Joel Emer December 7, 2005

1000

200 3

200 4

200 5

200 6

200 7

200 8

200 9

201 0

201 1

201 2

Year

100% Vu nerable

20% Vu nerable

1000 year MTBF Goal

Typical SDC goal: 1000 year MTBF

Joel Emer December 7, 2005

6.823, L24-15

Architectural Vulnerability Factor (AVF)

AVFbit = Probability Bit Matters

# of Visible Errors =# of Bit Flips from Particle Strikes

FITbit= intrinsic FITbit * AVFbit

Joel Emer December 7, 2005

6.823, L24-16

Architectural Vulnerability FactorDoes a bit matter?

• Branch Predictor – Doesn’t matter at all (AVF = 0%)

• Program Counter – Almost always matters (AVF ~ 100%)

Joel Emer December 7, 2005

6.823, L24-17Statistical Fault Injection (SFI) with RTL

LogicLogic

1

0

Simulate Strike on Latch

0

output

Does Fault Propagate to Architectural State

+ Naturally characterizes all logical structures

Joel Emer December 7, 2005

6.823, L24-18

Architecturally Correct Execution (ACE)

Program Input

Program Outputs

• ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)

• Anything else (un-ACE path) can be derated away

Joel Emer December 7, 2005

6.823, L24-19

Example of un-ACE instruction: Dynamically Dead Instruction

Dynamically Dead Instruction

Most bits of an un-ACE instruction do not affect program output

Joel Emer December 7, 2005

6.823, L24-20

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

T = 1 ACE% = 2/4

Joel Emer December 7, 2005

6.823, L24-21

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

T = 2 ACE% = 1/4

Joel Emer December 7, 2005

6.823, L24-22

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

T = 3 ACE% = 0/4

Joel Emer December 7, 2005

6.823, L24-23

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

T = 4 ACE% = 3/4

Joel Emer December 7, 2005

6.823, L24-24

Vulnerability of a structure

AVF = fraction of cycles a bit contains ACE state

( 2 + 1 + 0 + 3 ) / 4( 2 + 1 + 0 + 3 ) / 4= 44

Average number of ACE bits in a cycleAverage number of ACE bits in a cycle= Total number of bits in the structureTotal number of bits in the structure

Joel Emer December 7, 2005

6.823, L24-25

Little’s Law for ACEs

Nace = Tace × Lace

NaceAVF = Ntotal

Joel Emer December 7, 2005

Computing AVF 6.823, L24-26

• Approach is conservative – Assume every bit is ACE unless proven otherwise

• Data Analysis using a Performance Model – Prove that data held in a structure is un-ACE

• Timing Analysis using a Performance Model – Tracks the time this data spent in the structure

Joel Emer December 7, 2005

6.823, L24-27

Dynamic Instruction Breakdown

20%

1%

26%

ACE 46%

7%

Average across Spec2K slices

DYNAMICALLY DEAD

PERFORMANCE INST

NOP

PREDICATED FALSE

ACEInst

6.823, L24-28

Mapping ACE & un-ACE Instructions to the Instruction Queue

Architectural un-ACE Micro-architectural un-ACE

Wrong-Path Inst

IdleNOP Prefetch ACE Inst

Ex-ACE Inst

Joel Emer December 7, 2005

6.823, L24-29ACE Lifetime Analysis (1) (e.g., write-through data cache)

Idle IdleValidValidValid

Fill Read Read Evict

Joel Emer December 7, 2005

• Idle is unACE

• Assuming all time intervals are equal • For 3/5 of the lifetime the bit is valid • Gives a measure of the structure’s utilization

– Number of useful bits – Amount of time useful bits are resident in structure – Valid for a particular trace

6.823, L24-30

Idle Idle

Fill Read Read Evict

Write-through Data Cache

ACE Lifetime Analysis (2) (e.g., write-through data cache)

Joel Emer December 7, 2005

• Valid is not necessarily ACE

• ACE % = AVF = 2/5 = 40% • Example Lifetime Components

– ACE: fill-to-read, read-to-read – unACE: idle, read-to-evict, write-to-evict

6.823, L24-31

Idle Idle

Fill Read Read Evict

Write-through Data Cache

ACE Lifetime Analysis (3) (e.g., write-through data cache)

Joel Emer December 7, 2005

• Data ACEness is a function of instruction ACEness

• Second Read is by an unACE instruction

• AVF = 1/5 = 20%

Joel Emer

6.823, L24-32Instruction Queue December 7, 2005

15%

ACE 29%

31%

10%

3%

8%

3%

1%

NOP

IDLE

Ex-ACE

WRONG PATH

DYNAMICALLY DEAD

PREDICATED FALSE

PERFORMANCE INST

ACE percentage = AVF = 29%

yes

6.823, L24-33

Strike on a bit (e.g., in register file)

Bit

Bit has error protection?

yes no

detection & correction

no no error

benign fault no error

detection only

outcome?

True DUE False DUE

noyes no

outcome?

benign fault no error SDC

yes no

Joel Emer December 7, 2005

Read?

affects program affects program

SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error

Joel EmerDecember 7, 2005

6.823, L24-34

29%

6%

l 16%

11%

Idl i38%

DUE AVF of Instruction Queue with Parity

False DUE AVF 33%

Asim

True DUE AVF

Uncommitted

NeutraDynamically Dead

e & Msc

CPU2000

Simpoint Itanium®2-like

Joel Emer December 7, 2005

6.823, L24-35Sources of False DUE in an Instruction Queue

• Instructions with uncommitted results– e.g., wrong-path, predicated-false – solution: π (possibly incorrect) bit till commit

• Instruction types neutral to errors – e.g., no-ops, prefetches, branch predict hints – solution: anti- π bit

• Dynamically dead instructions – instructions whose results will not be used in future – solution: π bit beyond commit

6.823, L24-36

Coping with Wrong-Path Instructions (assume parity-protected instruction queue)

DECLARE ERROR

ON ISSUE

IQFetch Execute Commit

Instruction Cache (IC)

Data Cache

RR instX

Joel Emer December 7, 2005

• Problem: not enough information at issue

Decode

Joel Emer December 7, 2005

6.823, L24-37The π (Possibly Incorrect) Bit (assume parity-protected instruction queue)

At commit point, declare error only if not wrong-path instruction and ππ bit is setbit is set

IQFetch Decode Execute Commit

Instruction Cache (IC)

Data Cache

RR inst inst

POST ERROR IN ππ BIT ON

ISSUE

inst (ππ)) inst (ππ)) inst (ππ)) inst (ππ))

Joel Emer December 7, 2005

6.823, L24-38

Anti-π bit: coping with No-ops (assume parity-protected instruction queue)

On issue, if the anti-ππ bit is set, then do not set thebit is set, then do not set the ππ bitbit

IQFetch Decode Execute Commit

Instruction Cache (IC)

Data Cache

RR inst inst

(anti-ππ))inst

(anti-ππ))inst inst inst

anti-ππ bitbit neutralizesneutralizes

thethe ππ bitbit

6.823, L24-39π bit: avoiding False DUE on Dynamically Dead Instructions

IQFetch Execute Commit

Instruction Cache (IC)

Data Cache

RR

write R1 write R1 write R1(π) write R1(π) write R1(π) write R1(π) read R1 read R1 read R1 read R1 (π)

π bit is set

• π bit can be used in caches & main memory …

Inst i: Inst i+n:

Joel Emer December 7, 2005

Decode

• Declare the error on reading R1, if • If R1 isn’t read (i.e., dynamically dead), then no False DUE

Joel Emer December 7, 2005

% False DUE AVF Eliminated 6.823, L24-40

(PI = π) PI bit till I/O

commit PI bit till register12% commit

18%PI bit till store

commit8%

PI bit till registerread14%

CPU2000 Asim anti-PI bit

Simpoint 48%

Itanium®2-like

Practical to eliminate most of the False DUE AVF