SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo,
Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu,
Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
2
Motivation
• Hardware failures will happen in the field
– Aging, soft errors, inadequate burn-in, design defects, …
⇒ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy (e.g., nMR) too expensive
– Piecemeal solutions for specific fault model too expensive
– Must incur low area, performance, power overhead
⇒ Today: low-cost solution for multiple failure sources
3
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
– Hardware fault detection ~ software bug detection
– Zero to low overhead “always-on” monitors
⇒ Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
4
SWAT Framework Components
• Detection: Symptoms of S/W misbehavior, minimal backup H/W
• Recovery: Hardware/Software checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoints taken along the way]
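The interaction of these components — periodic checkpoints, symptom-triggered rollback, and replay — can be sketched as a toy model. This is an illustrative Python sketch only; the state, program, and detector interfaces are assumptions, not SWAT's firmware interfaces.

```python
import copy

def run_with_recovery(program, symptom, init_state, checkpoint_every=2):
    """Toy recovery loop: `program` is a list of functions that mutate the
    state dict in order; `symptom(state)` plays the role of SWAT's anomaly
    detectors. State is checkpointed periodically; when a symptom fires,
    we roll back to the last checkpoint and replay the interval."""
    state = copy.deepcopy(init_state)
    ckpt_i, ckpt_state = 0, copy.deepcopy(state)
    i, replaying = 0, False
    while i < len(program):
        if i % checkpoint_every == 0 and not replaying:
            ckpt_i, ckpt_state = i, copy.deepcopy(state)  # take a checkpoint
        program[i](state)
        if symptom(state):
            # rollback: restore the checkpointed state and replay the interval
            i, state = ckpt_i, copy.deepcopy(ckpt_state)
            replaying = True
            continue
        replaying = False
        i += 1
    return state
```

In this sketch a transient fault whose symptom fires once is absorbed by rollback and replay; in full SWAT a symptom that recurs on replay is handed to diagnosis rather than replayed indefinitely.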
5
SWAT
1. Detectors w/ Hardware support [ASPLOS '08]
2. Detectors w/ Software support [Sahoo et al., DSN '08]
3. Trace-Based Fault Diagnosis [Li et al., DSN '08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoints taken along the way]
6
Hardware-Only Symptom-Based Detection
• Observe anomalous symptoms for fault detection
– Incur low overheads for “always-on” detectors
– Minimal support from hardware
• Fatal traps generated by hardware
– Division by Zero, RED State, etc.
• Hangs detected using simple hardware hang detector
• High OS activity detected with performance counter
– Typical OS invocations take 10s or 100s of instructions
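The three always-on detectors above can be sketched as a toy monitor over a retired-instruction stream. The thresholds and trace format below are illustrative assumptions, not the paper's values.

```python
# Assumed thresholds: typical OS invocations take 10s-100s of instructions,
# so a much longer OS-mode run is anomalous; similarly, no new PC for a long
# stretch is treated as a hang.
OS_ACTIVITY_THRESHOLD = 30_000
HANG_WINDOW = 100_000

def detect_symptom(trace):
    """trace: iterable of (pc, in_os_mode, fatal_trap) per retired instr.
    Returns the first symptom observed, or None."""
    os_run = 0            # length of the current run of OS-mode instructions
    seen_pcs = set()
    since_new_pc = 0      # instructions since a never-before-seen PC retired
    for pc, in_os, fatal in trace:
        if fatal:
            return "fatal-trap"          # e.g. division by zero, RED state
        os_run = os_run + 1 if in_os else 0
        if os_run > OS_ACTIVITY_THRESHOLD:
            return "high-os-activity"
        if pc in seen_pcs:
            since_new_pc += 1
            if since_new_pc > HANG_WINDOW:
                return "hang"            # no forward progress
        else:
            seen_pcs.add(pc)
            since_new_pc = 0
    return None
```

The real detectors are tiny hardware monitors (a hang detector and a performance counter), not software loops; the sketch only shows the decision logic.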
7
Experimental Methodology
• Microarchitecture-level fault injection
– GEMS timing models + Simics full-system simulation
– SPEC workloads on Solaris-9 OS
• Permanent fault models
– Stuck-at, bridging faults in latches of 8 arch structures
– 12,800 faults, <0.3% error @ 95% confidence
• Simulate impact of fault in detail for 10M instructions
[Timeline: fault injected, then detailed timing simulation for 10M instr; if no symptom appears in 10M instr, functional simulation runs to completion → app masks the fault, symptom appears after 10M instr, or silent data corruption (SDC)]
8
Efficacy of Hardware-only Detectors
• Coverage: Percentage of unmasked faults detected
– 98% of faults detected, 0.4% give SDC (w/o FPU)
⇒ Additional support required for FPU-like units
– 66% of detected faults corrupt OS state, need recovery
– Despite low OS activity in fault-free execution
• Latency: Number of instr between activation and detection
– HW recovery for up to 100k instr, SW longer latencies
– App in 87% of detections recoverable using HW
– OS recoverable in virtually all detections using HW
⇒ OS recovery using SW is hard
9
Improving SWAT Detection Coverage
Can we improve coverage, SDC rate further?
• SDC faults primarily corrupt data values
– Illegal control/address values caught by other symptoms
– Need detectors to capture “semantic” information
• Software-level invariants capture program semantics
– Use when higher coverage desired
– Sound program invariants require expensive static analysis
– We use likely program invariants
10
Likely Program Invariants
• Likely program invariants
– Hold on all observed inputs, expected to hold on others
– But suffer from false positives
– Use SWAT diagnosis to detect false positives on-line
• iSWAT - Compiler-assisted symptom detectors
– Range-based value invariants [Sahoo et al. DSN ‘08]
– Check MIN ≤ value ≤ MAX on data values
– Disable invariant when diagnosis identifies a false positive
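A minimal sketch of such a range-based likely invariant — trained on observed inputs, checked at run time, and disabled when diagnosis flags a false positive. The class and method names are assumptions for illustration, not iSWAT's generated code.

```python
class RangeInvariant:
    """Likely invariant on one monitored value: holds on all training
    inputs, is expected (but not guaranteed) to hold on others."""

    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")
        self.enabled = True

    def train(self, v):
        # Training phase: widen [MIN, MAX] to cover each observed value.
        self.lo, self.hi = min(self.lo, v), max(self.hi, v)

    def check(self, v):
        # Detection phase: a value outside the range raises a symptom.
        return self.enabled and not (self.lo <= v <= self.hi)

    def disable(self):
        # SWAT diagnosis found the violation was a false positive.
        self.enabled = False
```

A fresh input can legitimately fall outside the trained range — that is exactly the false-positive case the on-line diagnosis filters out.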
11
iSWAT Implementation
Training Phase
[Diagram: the application plus test, train, and external inputs are fed to a compiler pass in LLVM that inserts invariant monitoring code; per-input value ranges (i/p #1 … i/p #n) are merged into the invariant ranges]
12
iSWAT Implementation
Training Phase
[Diagram: an LLVM compiler pass inserts invariant monitoring code; ranges gathered over test, train, and external inputs (i/p #1 … i/p #n) are merged into the invariant ranges]
Fault Detection Phase
[Diagram: an LLVM compiler pass inserts invariant checking code; full-system simulation on the ref input injects faults; an invariant violation invokes SWAT diagnosis, which either confirms a fault detection or disables the invariant as a false positive]
13
iSWAT Results
• Explored iSWAT with 5 apps on the previous methodology
• Undetected faults reduced by 30%
• Invariants reduce SDCs by 73% (33 to 9)
• Overheads: 5% on x86, 14% on UltraSparc IIIi
– Reasonably low overheads on some machines
– Un-optimized invariants used, can be further reduced
• Exploring more sophistication for coverage, overheads
14
Fault Diagnosis
• Symptom-based detection is cheap but
– High latency from fault activation to detection
– Difficult to diagnose root cause of fault
– How to diagnose SW bug vs. transient vs. permanent fault?
• For permanent fault within core
– Disable entire core? Wasteful!
– Disable/reconfigure µarch-level unit?
– How to diagnose faults to µarch unit granularity?
• Key ideas
– Single core fault model, multicore fault-free core available
– Checkpoint/replay for recovery ⇒ replay on good core, compare
– Synthesizing DMR, but only for diagnosis
15
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Flowchart: symptom detected on the faulty core ⇒ rollback/replay on the same (faulty) core. No symptom ⇒ transient fault or non-deterministic s/w bug ⇒ continue execution. Symptom ⇒ rollback/replay on a good core. Symptom again ⇒ false positive (iSWAT) or deterministic s/w bug, send to s/w layer. No symptom ⇒ permanent h/w fault, needs repair!]
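The rollback/replay decision on this slide can be sketched directly. The replay callbacks below are assumed stand-ins for actual rollback/replay runs that report whether the symptom reappears.

```python
def diagnose(replay_same_core, replay_good_core):
    """Classify a detected symptom by where it reappears on replay.
    Each argument is a zero-arg callable returning True if the symptom
    recurs during that replay (a stand-in for a real rollback/replay)."""
    if not replay_same_core():
        # Symptom gone on the same core: nothing reproducible to repair.
        return "transient or non-deterministic s/w bug"
    if replay_good_core():
        # Reproduces even on known-good hardware: the software is at fault.
        return "false positive (iSWAT) or deterministic s/w bug"
    # Reproduces only on the suspect core: permanent hardware fault.
    return "permanent h/w fault, needs repair"
```

Note the good-core replay is only reached when the symptom is reproducible, which is what makes the classification sound.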
16
Diagnosis Framework
[Diagram: symptom detected ⇒ diagnosis ⇒ software bug | transient fault | permanent fault; a permanent fault triggers microarchitecture-level diagnosis, which reports “Unit X is faulty”]
17
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected ⇒ invoke TBFD; the diagnosis algorithm compares the faulty-core execution against a fault-free core's execution]
18
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected ⇒ invoke TBFD ⇒ rollback faulty core to checkpoint ⇒ replay execution, collecting info for the diagnosis algorithm to compare against the fault-free core's execution]
19
Trace-Based Fault Diagnosis (TBFD)
[Diagram: the checkpoint is also loaded on a fault-free core, which executes the same instructions fault-free; the diagnosis algorithm compares the two runs. Questions raised: What info to collect? What info to compare? What to do on divergence?]
20
Can a Divergent Instruction Lead to Diagnosis?
• Simpler case: ALU fault
[Diagram: faulty and fault-free cores execute “add r1,r3,r5” and “sub r6,r1,r2”; the faulty core's results diverge from the fault-free results, and both divergent instructions executed on the same ALU]
⇒ Both divergent instructions used the same ALU ⇒ ALU1 faulty
21
Can a Divergent Instruction Lead to Diagnosis?
• Complex example: Fault in register alias table (RAT) entry
[Diagram: a faulty RAT entry makes instruction IA (r3 ← r2 + r2) write the wrong physical register; a later instruction IB (r1 ← r5 * r2) then reads a stale value and diverges from the fault-free result (r1 = 12), even though IB itself does not use the faulty hardware]
• Divergent instructions do not directly lead to faulty unit
• Instead, look backward/forward in instruction stream
– Need to collect and analyze instruction trace
22
Diagnosing Permanent Fault to µarch Granularity
• Trace-based fault diagnosis (TBFD)
– Compare instruction trace of faulty vs. good execution
– Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of processor
– Front-end, meta-datapath, datapath faults
– Diagnosis in out-of-order logic (meta-datapath) complex
• Results
– 98% of the faults detected by SWAT successfully diagnosed
– TBFD flexible for other detectors/granularity of repair
23
SWAT
1. Detectors w/ Hardware support [ASPLOS '08]
2. Detectors w/ Software support [Sahoo et al., DSN '08]
3. Trace-Based Fault Diagnosis [Li et al., DSN '08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoints taken along the way]
24
SWATSim: Fast and Accurate Fault Models
• Need accurate µarch-level fault models
– Gate level injections accurate but too slow
– µarch (latch) level injections fast but inaccurate
• Can we achieve µarch-level speed at gate-level accuracy?
• Mixed-mode (hierarchical) simulation
– µarch-level + gate-level simulation
– Simulate only the faulty component at gate level, on demand
– Invoke gate-level sim online for permanent faults
⇒ Simulates fault effects with real-world vectors
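The mixed-mode idea can be sketched as follows. The stuck-at ALU below is an assumed stand-in for a real gate-level netlist simulation; all names are illustrative.

```python
def gate_level_alu_stuck_at(a, b, op, stuck_bit=0, stuck_val=1):
    """Stand-in for slow, accurate gate-level fault simulation: compute the
    ALU result, then force one output bit to the stuck value."""
    out = {"add": a + b, "sub": a - b}[op]
    if stuck_val:
        return out | (1 << stuck_bit)    # bit stuck at 1
    return out & ~(1 << stuck_bit)       # bit stuck at 0

def swat_sim_step(a, b, op, faulty_unit_used):
    """One µarch-simulation step: take the fast functional path unless the
    faulty unit is exercised, in which case drop to the gate-level model
    on demand and feed its response back into the µarch simulation."""
    if faulty_unit_used:
        return gate_level_alu_stuck_at(a, b, op)
    return {"add": a + b, "sub": a - b}[op]
```

The speedup comes from the gate-level model being invoked only for the one faulty component, and only when it is actually used.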
25
SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Diagram: during µarch simulation of “r3 ← r1 op r2”, if the faulty unit is not used, µarch-level simulation simply continues; if it is used, the inputs are passed as stimuli to a gate-level fault simulation of that unit, and its response — the fault effect propagated to the output — is fed back into the µarch simulation as r3]
26
Results from SWAT-Sim
• SWAT-sim implemented within full-system simulation
– NCVerilog + VPI for gate-level sim of ALU/AGEN modules
• SWAT-Sim: High accuracy at low overheads
– 100,000x faster than gate-level, same modeling fidelity
– 2x slowdown over µarch-level, at higher accuracy
• Accuracy of µarch models using SWAT coverage/latency
– µarch stuck-at models generally inaccurate
– Differences in activation rate, multi-bit flips
• Complex manifestations ⇒ hard to derive better models
– Need SWAT-Sim, at least for now
27
SWAT Summary
• SWAT: SoftWare Anomaly Treatment
– Handle all and only faults that matter
– Low, amortized overheads
– Holistic systems view enables novel solutions
– Customizable and flexible
• Prior results:
– Low-cost h/w detectors gave high coverage, low SDC rate
• This talk:
– iSWAT: Higher coverage w/ software-assisted detectors
– TBFD: µarch level fault diagnosis by synthesizing DMR
– SWAT-Sim: Gate-level fault accuracy at µarch level speed
28
Future Work
• Recovery: hybrid, application-specific
• Aggressive use of software reliability techniques
– Leverage diagnosis mechanism
• Multithreaded software
• Off-core faults
• Post-silicon debug and test
– Use faulty trace as fault-model oblivious test vector
• Validation on FPGA (w/ Michigan)
• Hardware assertions to complement software symptoms
BACKUP SLIDES
30
Breakup of Detections by SW Symptoms
[Chart: breakdown of injection outcomes (Arch-Mask, App-Mask, FatalTrap-App/OS, Hang-App/OS, High-OS, Symp>10M, SDC) for each structure — Decoder, Int ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU — with 95–100% coverage for all structures except FP ALU (27%)]
• 98% of unmasked faults detected within 10M instr (w/o FPU)
– Need HW support or SW monitoring for FPU
31
SW Components Corrupted
• 66% of faults corrupt system state before detection
– Need to recover system state
[Chart: for each structure (Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU), the percentage of injections corrupting no software state, app state only, or OS (and possibly app) state]
32
Latency from Application mismatch
[Chart: detection latency from application-state mismatch, 1 to 10M instructions on a log scale, per structure]
• 86% of faults detected under 100k instr
– 42% detected under 10k instr
33
Latency from OS Mismatch
[Chart: detection latency from OS-state mismatch, 1 to 10M instructions on a log scale, per structure]
• 99% of faults detected under 100k instr
34
iSWAT Implementation
Training Phase
[Diagram: an LLVM compiler pass inserts invariant monitoring code; ranges gathered over test, train, and external inputs (i/p #1 … i/p #n) are merged into the invariant ranges]
Fault Detection Phase
[Diagram: an LLVM compiler pass inserts invariant checking code; full-system simulation on the ref input injects faults; an invariant violation invokes SWAT diagnosis, which either confirms a fault detection or disables the invariant as a false positive]
35
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected ⇒ invoke diagnosis ⇒ rollback faulty core to checkpoint and load the checkpoint on a fault-free core ⇒ replay execution, collecting µarch info ⇒ TBFD compares the faulty trace against the fault-free test trace and classifies front-end, meta-datapath, and datapath faults]
36
Fault Diagnosability
[Chart: percentage of detected faults per structure (Decoder, INT ALU, Reg Dbus, Int Reg, ROB, RAT, AGEN, Overall) diagnosed to a unique unit or array entry (D-Unique), to another candidate (D-Other), yielding no mismatch, or diagnosed incorrectly]
• 98% of detected faults are diagnosed
– 89% diagnosed to unique unit/array entry
– Meta-datapath faults in out-of-order exec mislead TBFD
37
Accuracy of Existing Fault Models
• SWAT-sim implemented within full-system simulator
– NCVerilog + VPI to simulate gate-level ALU and AGEN
[Charts: injection outcomes (Uarch-Mask, Arch-Mask, App-Mask, Detected, Detected>10M, SDC) for AGEN and Integer ALU under µarch stuck-at-1/0, gate-level stuck-at-1/0, and gate-level delay faults; per-configuration coverage spans 94.0–97.1% for AGEN and 89.4–100% for Integer ALU]
• Existing µarch-level fault models inaccurate
– Differences in activation rate, multi-bit flips
• Accurate models hard to derive ⇒ need SWAT-Sim!
38
Summary: SWAT Advantages
• Handles all faults that matter
– Oblivious to low-level failure modes & masked faults
• Low, amortized overheads
– Optimize for common case, exploit s/w reliability solutions
• Holistic systems view enables novel solutions
– Invariant detectors use diagnosis mechanisms
– Diagnosis uses recovery mechanisms
• Customizable and flexible
– Firmware based control affords hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
– SWAT treats hardware faults as software bugs
⇒ Long-term goal: unified system (hw + sw) reliability at lowest cost
– Potential applications to post-silicon test and debug
39
Transients Results
• 6400 transient faults injected across 8 structures
• 83% of unmasked faults detected within 10M instr
• Only 0.4% of injected faults result in SDCs