SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo,
Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu,
Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
2
Motivation
• Hardware failures will happen in the field
– Aging, soft errors, inadequate burn-in, design defects, …
⇒ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy (e.g., nMR) too expensive
– Piecemeal solutions for specific fault model too expensive
– Must incur low area, performance, power overhead
⇒ Today: low-cost solution for multiple failure sources
3
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
– Hardware fault detection ~ software bug detection
– Zero to low overhead “always-on” monitors
⇒ Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
4
SWAT Framework Components
• Detection: Symptoms of S/W misbehavior, minimal backup H/W
• Recovery: Hardware/Software checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoints taken along the way]
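The interaction of these components — periodic checkpoints, symptom-triggered rollback, and replay — can be sketched as a toy model. This is an illustrative Python sketch only; the state, program, and detector interfaces are assumptions, not SWAT's firmware interfaces.

```python
import copy

def run_with_recovery(program, symptom, init_state, checkpoint_every=2):
    """Toy recovery loop: `program` is a list of functions that mutate the
    state dict in order; `symptom(state)` plays the role of SWAT's anomaly
    detectors. State is checkpointed periodically; when a symptom fires,
    we roll back to the last checkpoint and replay the interval."""
    state = copy.deepcopy(init_state)
    ckpt_i, ckpt_state = 0, copy.deepcopy(state)
    i, replaying = 0, False
    while i < len(program):
        if i % checkpoint_every == 0 and not replaying:
            ckpt_i, ckpt_state = i, copy.deepcopy(state)  # take a checkpoint
        program[i](state)
        if symptom(state):
            # rollback: restore the checkpointed state and replay the interval
            i, state = ckpt_i, copy.deepcopy(ckpt_state)
            replaying = True
            continue
        replaying = False
        i += 1
    return state
```

In this sketch a transient fault whose symptom fires once is absorbed by rollback and replay; in full SWAT a symptom that recurs on replay is handed to diagnosis rather than replayed indefinitely.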
5
SWAT
1. Detectors w/ Hardware support [ASPLOS '08]
2. Detectors w/ Software support [Sahoo et al., DSN '08]
3. Trace-Based Fault Diagnosis [Li et al., DSN '08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoints taken along the way]
6
Hardware-Only Symptom-Based Detection
• Observe anomalous symptoms for fault detection
– Incur low overheads for “always-on” detectors
– Minimal support from hardware
• Fatal traps generated by hardware
– Division by Zero, RED State, etc.
• Hangs detected using simple hardware hang detector
• High OS activity detected with performance counter
– Typical OS invocations take 10s or 100s of instructions
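The three always-on detectors above can be sketched as a toy monitor over a retired-instruction stream. The thresholds and trace format below are illustrative assumptions, not the paper's values.

```python
# Assumed thresholds: typical OS invocations take 10s-100s of instructions,
# so a much longer OS-mode run is anomalous; similarly, no new PC for a long
# stretch is treated as a hang.
OS_ACTIVITY_THRESHOLD = 30_000
HANG_WINDOW = 100_000

def detect_symptom(trace):
    """trace: iterable of (pc, in_os_mode, fatal_trap) per retired instr.
    Returns the first symptom observed, or None."""
    os_run = 0            # length of the current run of OS-mode instructions
    seen_pcs = set()
    since_new_pc = 0      # instructions since a never-before-seen PC retired
    for pc, in_os, fatal in trace:
        if fatal:
            return "fatal-trap"          # e.g. division by zero, RED state
        os_run = os_run + 1 if in_os else 0
        if os_run > OS_ACTIVITY_THRESHOLD:
            return "high-os-activity"
        if pc in seen_pcs:
            since_new_pc += 1
            if since_new_pc > HANG_WINDOW:
                return "hang"            # no forward progress
        else:
            seen_pcs.add(pc)
            since_new_pc = 0
    return None
```

The real detectors are tiny hardware monitors (a hang detector and a performance counter), not software loops; the sketch only shows the decision logic.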
7
Experimental Methodology
• Microarchitecture-level fault injection
– GEMS timing models + Simics full-system simulation
– SPEC workloads on Solaris-9 OS
• Permanent fault models
– Stuck-at, bridging faults in latches of 8 arch structures
– 12,800 faults, <0.3% error @ 95% confidence
• Simulate impact of fault in detail for 10M instructions
[Timeline: fault injected, then detailed timing simulation for 10M instr; if no symptom appears in 10M instr, functional simulation runs to completion → app masks the fault, symptom appears after 10M instr, or silent data corruption (SDC)]
8
Efficacy of Hardware-only Detectors
• Coverage: Percentage of unmasked faults detected
– 98% of faults detected, 0.4% give SDC (w/o FPU)
⇒ Additional support required for FPU-like units
– 66% of detected faults corrupt OS state, need recovery
– Despite low OS activity in fault-free execution
• Latency: Number of instr between activation and detection
– HW recovery for up to 100k instr, SW longer latencies
– App in 87% of detections recoverable using HW
– OS recoverable in virtually all detections using HW
⇒ OS recovery using SW is hard
9
Improving SWAT Detection Coverage
Can we improve coverage, SDC rate further?
• SDC faults primarily corrupt data values
– Illegal control/address values caught by other symptoms
– Need detectors to capture “semantic” information
• Software-level invariants capture program semantics
– Use when higher coverage desired
– Sound program invariants require expensive static analysis
– We use likely program invariants
10
Likely Program Invariants
• Likely program invariants
– Hold on all observed inputs, expected to hold on others
– But suffer from false positives
– Use SWAT diagnosis to detect false positives on-line
• iSWAT - Compiler-assisted symptom detectors
– Range-based value invariants [Sahoo et al. DSN ‘08]
– Check MIN ≤ value ≤ MAX on data values
– Disable invariant when diagnosis identifies a false positive
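A minimal sketch of such a range-based likely invariant — trained on observed inputs, checked at run time, and disabled when diagnosis flags a false positive. The class and method names are assumptions for illustration, not iSWAT's generated code.

```python
class RangeInvariant:
    """Likely invariant on one monitored value: holds on all training
    inputs, is expected (but not guaranteed) to hold on others."""

    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")
        self.enabled = True

    def train(self, v):
        # Training phase: widen [MIN, MAX] to cover each observed value.
        self.lo, self.hi = min(self.lo, v), max(self.hi, v)

    def check(self, v):
        # Detection phase: a value outside the range raises a symptom.
        return self.enabled and not (self.lo <= v <= self.hi)

    def disable(self):
        # SWAT diagnosis found the violation was a false positive.
        self.enabled = False
```

A fresh input can legitimately fall outside the trained range — that is exactly the false-positive case the on-line diagnosis filters out.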
11
iSWAT Implementation
Training Phase
[Diagram: the application plus test, train, and external inputs are fed to a compiler pass in LLVM that inserts invariant monitoring code; per-input value ranges (i/p #1 … i/p #n) are merged into the invariant ranges]
12
iSWAT Implementation
Training Phase
[Diagram: an LLVM compiler pass inserts invariant monitoring code; ranges gathered over test, train, and external inputs (i/p #1 … i/p #n) are merged into the invariant ranges]
Fault Detection Phase
[Diagram: an LLVM compiler pass inserts invariant checking code; full-system simulation on the ref input injects faults; an invariant violation invokes SWAT diagnosis, which either confirms a fault detection or disables the invariant as a false positive]
13
iSWAT Results
• Explored iSWAT with 5 apps on the previous methodology
• Undetected faults reduced by 30%
• Invariants reduce SDCs by 73% (33 to 9)
• Overheads: 5% on x86, 14% on UltraSparc IIIi
– Reasonably low overheads on some machines
– Un-optimized invariants used, can be further reduced
• Exploring more sophistication for coverage, overheads
14
Fault Diagnosis
• Symptom-based detection is cheap but
– High latency from fault activation to detection
– Difficult to diagnose root cause of fault
– How to diagnose SW bug vs. transient vs. permanent fault?
• For permanent fault within core
– Disable entire core? Wasteful!
– Disable/reconfigure µarch-level unit?
– How to diagnose faults to µarch unit granularity?
• Key ideas
– Single core fault model, multicore fault-free core available
– Checkpoint/replay for recovery ⇒ replay on good core, compare
– Synthesizing DMR, but only for diagnosis
15
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Flowchart: symptom detected on the faulty core ⇒ rollback/replay on the same (faulty) core. No symptom ⇒ transient fault or non-deterministic s/w bug ⇒ continue execution. Symptom ⇒ rollback/replay on a good core. Symptom again ⇒ false positive (iSWAT) or deterministic s/w bug, send to s/w layer. No symptom ⇒ permanent h/w fault, needs repair!]
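The rollback/replay decision on this slide can be sketched directly. The replay callbacks below are assumed stand-ins for actual rollback/replay runs that report whether the symptom reappears.

```python
def diagnose(replay_same_core, replay_good_core):
    """Classify a detected symptom by where it reappears on replay.
    Each argument is a zero-arg callable returning True if the symptom
    recurs during that replay (a stand-in for a real rollback/replay)."""
    if not replay_same_core():
        # Symptom gone on the same core: nothing reproducible to repair.
        return "transient or non-deterministic s/w bug"
    if replay_good_core():
        # Reproduces even on known-good hardware: the software is at fault.
        return "false positive (iSWAT) or deterministic s/w bug"
    # Reproduces only on the suspect core: permanent hardware fault.
    return "permanent h/w fault, needs repair"
```

Note the good-core replay is only reached when the symptom is reproducible, which is what makes the classification sound.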
16
Diagnosis Framework
[Diagram: symptom detected ⇒ diagnosis ⇒ software bug | transient fault | permanent fault; a permanent fault triggers microarchitecture-level diagnosis, which reports “Unit X is faulty”]
17
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected ⇒ invoke TBFD; the diagnosis algorithm compares the faulty-core execution against a fault-free core's execution]
18
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected ⇒ invoke TBFD ⇒ rollback faulty core to checkpoint ⇒ replay execution, collecting info for the diagnosis algorithm to compare against the fault-free core's execution]
19
Trace-Based Fault Diagnosis (TBFD)
[Diagram: the checkpoint is also loaded on a fault-free core, which executes the same instructions fault-free; the diagnosis algorithm compares the two runs. Questions raised: What info to collect? What info to compare? What to do on divergence?]
20
Can a Divergent Instruction Lead to Diagnosis?
• Simpler case: ALU fault
[Diagram: faulty and fault-free cores execute “add r1,r3,r5” and “sub r6,r1,r2”; the faulty core's results diverge from the fault-free results, and both divergent instructions executed on the same ALU]
⇒ Both divergent instructions used the same ALU ⇒ ALU1 faulty
21
Can a Divergent Instruction Lead to Diagnosis?
• Complex example: Fault in register alias table (RAT) entry
[Diagram: a faulty RAT entry makes instruction IA (r3 ← r2 + r2) write the wrong physical register; a later instruction IB (r1 ← r5 * r2) then reads a stale value and diverges from the fault-free result (r1 = 12), even though IB itself does not use the faulty hardware]
• Divergent instructions do not directly lead to faulty unit
• Instead, look backward/forward in instruction stream
– Need to collect and analyze instruction trace
22
Diagnosing Permanent Fault to µarch Granularity
• Trace-based fault diagnosis (TBFD)
– Compare instruction trace of faulty vs. good execution
– Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of processor
– Front-end, meta-datapath, datapath faults
– Diagnosis in out-of-order logic (meta-datapath) complex
• Results
– 98% of the faults detected by SWAT successfully diagnosed
– TBFD flexible for other detectors/granularity of repair
23
SWAT
1. Detectors w/ Hardware support [ASPLOS '08]
2. Detectors w/ Software support [Sahoo et al., DSN '08]
3. Trace-Based Fault Diagnosis [Li et al., DSN '08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Recovery → Diagnosis → Repair, with checkpoints taken along the way]
24
SWATSim: Fast and Accurate Fault Models
• Need accurate µarch-level fault models
– Gate level injections accurate but too slow
– µarch (latch) level injections fast but inaccurate
• Can we achieve µarch-level speed at gate-level accuracy?
• Mixed-mode (hierarchical) simulation
– µarch-level + gate-level simulation
– Simulate only the faulty component at gate level, on demand
– Invoke gate-level sim online for permanent faults
⇒ Simulates fault effects with real-world vectors
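The mixed-mode idea can be sketched as follows. The stuck-at ALU below is an assumed stand-in for a real gate-level netlist simulation; all names are illustrative.

```python
def gate_level_alu_stuck_at(a, b, op, stuck_bit=0, stuck_val=1):
    """Stand-in for slow, accurate gate-level fault simulation: compute the
    ALU result, then force one output bit to the stuck value."""
    out = {"add": a + b, "sub": a - b}[op]
    if stuck_val:
        return out | (1 << stuck_bit)    # bit stuck at 1
    return out & ~(1 << stuck_bit)       # bit stuck at 0

def swat_sim_step(a, b, op, faulty_unit_used):
    """One µarch-simulation step: take the fast functional path unless the
    faulty unit is exercised, in which case drop to the gate-level model
    on demand and feed its response back into the µarch simulation."""
    if faulty_unit_used:
        return gate_level_alu_stuck_at(a, b, op)
    return {"add": a + b, "sub": a - b}[op]
```

The speedup comes from the gate-level model being invoked only for the one faulty component, and only when it is actually used.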
25
SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Diagram: during µarch simulation of “r3 ← r1 op r2”, if the faulty unit is not used, µarch-level simulation simply continues; if it is used, the inputs are passed as stimuli to a gate-level fault simulation of that unit, and its response — the fault effect propagated to the output — is fed back into the µarch simulation as r3]
26
Results from SWAT-Sim
• SWAT-sim implemented within full-system simulation
– NCVerilog + VPI for gate-level sim of ALU/AGEN modules
• SWAT-Sim: High accuracy at low overheads
– 100,000x faster than gate-level, same modeling fidelity
– 2x slowdown over µarch-level, at higher accuracy
• Accuracy of µarch models using SWAT coverage/latency
– µarch stuck-at models generally inaccurate
– Differences in activation rate, multi-bit flips
• Complex manifestations ⇒ hard to derive better models
– Need SWAT-Sim, at least for now
27
SWAT Summary
• SWAT: SoftWare Anomaly Treatment
– Handle all and only faults that matter
– Low, amortized overheads
– Holistic systems view enables novel solutions
– Customizable and flexible
• Prior results:
– Low-cost h/w detectors gave high coverage, low SDC rate
• This talk:
– iSWAT: Higher coverage w/ software-assisted detectors
– TBFD: µarch level fault diagnosis by synthesizing DMR
– SWAT-Sim: Gate-level fault accuracy at µarch level speed
28
Future Work
• Recovery: hybrid, application-specific
• Aggressive use of software reliability techniques
– Leverage diagnosis mechanism
• Multithreaded software
• Off-core faults
• Post-silicon debug and test
– Use faulty trace as fault-model oblivious test vector
• Validation on FPGA (w/ Michigan)
• Hardware assertions to complement software symptoms
BACKUP SLIDES
30
Breakup of Detections by SW Symptoms
[Chart: breakdown of injection outcomes (Arch-Mask, App-Mask, FatalTrap-App/OS, Hang-App/OS, High-OS, Symp>10M, SDC) for each structure — Decoder, Int ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU — with 95–100% coverage for all structures except FP ALU (27%)]
• 98% of unmasked faults detected within 10M instr (w/o FPU)
– Need HW support or SW monitoring for FPU
31
SW Components Corrupted
• 66% of faults corrupt system state before detection
– Need to recover system state
[Chart: for each structure (Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU), the percentage of injections corrupting no software state, app state only, or OS (and possibly app) state]
32
Latency from Application mismatch
[Chart: detection latency from application-state mismatch, 1 to 10M instructions on a log scale, per structure]
• 86% of faults detected under 100k instr
– 42% detected under 10k instr
33
Latency from OS Mismatch
[Chart: detection latency from OS-state mismatch, 1 to 10M instructions on a log scale, per structure]
• 99% of faults detected under 100k instr
34
iSWAT Implementation
Training Phase
[Diagram: an LLVM compiler pass inserts invariant monitoring code; ranges gathered over test, train, and external inputs (i/p #1 … i/p #n) are merged into the invariant ranges]
Fault Detection Phase
[Diagram: an LLVM compiler pass inserts invariant checking code; full-system simulation on the ref input injects faults; an invariant violation invokes SWAT diagnosis, which either confirms a fault detection or disables the invariant as a false positive]
35
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected ⇒ invoke diagnosis ⇒ rollback faulty core to checkpoint and load the checkpoint on a fault-free core ⇒ replay execution, collecting µarch info ⇒ TBFD compares the faulty trace against the fault-free test trace and classifies front-end, meta-datapath, and datapath faults]
36
Fault Diagnosability
[Chart: percentage of detected faults per structure (Decoder, INT ALU, Reg Dbus, Int Reg, ROB, RAT, AGEN, Overall) diagnosed to a unique unit or array entry (D-Unique), to another candidate (D-Other), yielding no mismatch, or diagnosed incorrectly]
• 98% of detected faults are diagnosed
– 89% diagnosed to unique unit/array entry
– Meta-datapath faults in out-of-order exec mislead TBFD
37
Accuracy of Existing Fault Models
• SWAT-sim implemented within full-system simulator
– NCVerilog + VPI to simulate gate-level ALU and AGEN
[Charts: injection outcomes (Uarch-Mask, Arch-Mask, App-Mask, Detected, Detected>10M, SDC) for AGEN and Integer ALU under µarch stuck-at-1/0, gate-level stuck-at-1/0, and gate-level delay faults; per-configuration coverage spans 94.0–97.1% for AGEN and 89.4–100% for Integer ALU]
• Existing µarch-level fault models inaccurate
– Differences in activation rate, multi-bit flips
• Accurate models hard to derive ⇒ need SWAT-Sim!
38
Summary: SWAT Advantages
• Handles all faults that matter
– Oblivious to low-level failure modes & masked faults
• Low, amortized overheads
– Optimize for common case, exploit s/w reliability solutions
• Holistic systems view enables novel solutions
– Invariant detectors use diagnosis mechanisms
– Diagnosis uses recovery mechanisms
• Customizable and flexible
– Firmware based control affords hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
– SWAT treats hardware faults as software bugs
⇒ Long-term goal: unified system (hw + sw) reliability at lowest cost
– Potential applications to post-silicon test and debug
39
Transients Results
• 6400 transient faults injected across 8 structures
• 83% of unmasked faults detected within 10M instr
• Only 0.4% of injected faults result in SDCs