IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults
Songjun Pan (1,2), Yu Hu (1), and Xiaowei Li (1)
(1) Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
(2) Graduate University of Chinese Academy of Sciences
Outline
• Background and Related Work
• IVF Computing Methodology
• Experimental Results
• Conclusions
Background
Intermittent faults are emerging as a major source of failures in microprocessors [DSN’02]
[Figure: failure rate versus lifetime, divided into the infant mortality stage, the useful life stage, and the wear-out stage. Annotations for the deep submicron era: defect escape, soft errors, faster aging, and intermittent faults.]
Intermittent Faults
• Description
• Occur frequently and irregularly for a period of time
• Caused by loose connection, manufacturing residuals, process variation, or in-progress wear-out, combined with voltage and temperature fluctuations
• Characteristics
• Occur in bursts at the same location
• Removed by replacing the offending circuit
• Activated or deactivated by PVT (process, voltage, and temperature) variations
Protecting the Microprocessor
• Information redundancy techniques
• Parity and error-correcting codes
– High area overhead
– High power consumption
• Hardware redundancy techniques
• Dual modular redundancy / triple modular redundancy
– 100%~200% area overhead
• Software redundancy techniques
• Redundant multi-threading
– 10%~30% performance overhead
• Conventional protection methods ensure high reliability but also cause high overhead
Trading Off Reliability and Overhead
• Key observation
• Not all faults lead to external program failures
• A fault in the branch predictor: doesn’t matter at all
• A fault in the program counter: almost always matters
• Which bits matter?
• ACE bit / un-ACE bit: Architecturally Correct Execution (ACE) bit [MICRO’03]
• ACE bit: a bit whose corruption leads to an externally visible error
• Reliability evaluation
• Protect the most vulnerable structures
Related Metrics
• Mean Time To Failure (MTTF) / Mean Time Between Repair (MTBR)
• Masking effect
• Structure utilization
• Soft Error Vulnerability Analysis
• Architectural Vulnerability Factor (AVF) [MICRO’03]
• Program Vulnerability Factor (PVF) [HPCA’09]
• Hard Fault Vulnerability Analysis
• Hard-Faults AVF (H-AVF) [SIGMETRICS’06]
• Vulnerability to intermittent faults is rarely considered because of their varied causes and behaviors
Our Contributions
• Propose a metric, the Intermittent Vulnerability Factor (IVF), to characterize the vulnerability to intermittent faults
• IVF definition: a structure’s IVF is the probability that an intermittent fault in that structure causes an externally visible error
• Present IVF computing algorithms for the reorder buffer and register file
• Compute IVF under different fault configurations
Intermittent Fault Models
[Figure: diagram relating causes and mechanisms to fault models at the logic level. The labels recoverable from the figure are listed below.]
• Causes and mechanisms: manufacturing residues, timing violations, oxide breakdown, inductive noise, electromigration, crosstalk, soft breakdown, variation of metal R&C, fluctuation of leakage current, intermittent contacts
• Locations: cell, solder joint, memory, buses, interconnection lines, power supply
• Fault models at the logic level: intermittent stuck-at, intermittent short, intermittent open, intermittent pulse, intermittent delay, intermittent indetermination
Intermittent Stuck-at Faults
• Intermittent stuck-at faults
• Change the correct value intermittently to logic one or logic zero
• Vulnerable structures: storage structures such as memory and register file
• Key parameters
• Burst length / active time / inactive time
• Faults have an adverse effect during the active time
[Figure: timeline of an intermittent fault along the time axis, marking the burst length, the active time, and the inactive time.]
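A minimal Python sketch of the intermittent stuck-at model may make the three parameters concrete. It assumes that a burst starts with an active phase and then alternates active and inactive phases until the burst length is used up; the slides do not spell out this ordering. The class `IntermittentFault`, its `start` field, and the helper `apply_stuck_at` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntermittentFault:
    """Hypothetical container for the parameters named on the slide."""
    start: int          # cycle at which the burst begins (assumed parameter)
    burst_length: int   # total duration of the burst, in cycles
    active_time: int    # cycles per period during which the fault is asserted
    inactive_time: int  # cycles per period during which the fault is dormant

    def is_active(self, cycle: int) -> bool:
        """Return True if the fault is asserted at the given cycle."""
        offset = cycle - self.start
        if offset < 0 or offset >= self.burst_length:
            return False  # outside the burst, the fault has no effect
        period = self.active_time + self.inactive_time
        # Assumption: each period begins with the active phase.
        return offset % period < self.active_time


def apply_stuck_at(value: int, bit: int, stuck_value: int,
                   fault: IntermittentFault, cycle: int) -> int:
    """Force one bit of `value` to `stuck_value` while the fault is active."""
    if not fault.is_active(cycle):
        return value
    if stuck_value:
        return value | (1 << bit)   # stuck-at-1
    return value & ~(1 << bit)      # stuck-at-0


fault = IntermittentFault(start=100, burst_length=12,
                          active_time=2, inactive_time=2)
active_cycles = [c for c in range(95, 120) if fault.is_active(c)]
```

With these example values the fault is asserted in cycles 100, 101, 104, 105, 108, and 109; a stored value read in any other cycle is returned unchanged.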
IVF Computing
• Determine whether an intermittent fault affects program execution or not
• Analyze ACE bit / critical time
• Set the three key parameters: burst length, active time, and inactive time
• Burst length: randomly generated from [10T, 30T]
• Duty cycle: 50%
• Start time: randomly generated
• Compute IVFs for reorder buffer and register file
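The parameter choices above can be sketched as a small sampling routine. This is an illustration under stated assumptions, not the authors' tool: T is taken to be one clock cycle, the duty cycle is read as active time divided by (active time + inactive time), and the function name `sample_fault_config`, its arguments, and the default `active_time=2` are hypothetical.

```python
import random


def sample_fault_config(rng: random.Random, sim_cycles: int,
                        min_burst: int = 10, max_burst: int = 30,
                        active_time: int = 2,
                        duty_cycle: float = 0.5) -> dict:
    """Draw one fault configuration (all names here are illustrative)."""
    # Burst length: uniform over [10T, 30T], assuming T is one cycle.
    burst_length = rng.randint(min_burst, max_burst)
    # Assumed definition: duty_cycle = active / (active + inactive),
    # so a 50% duty cycle gives equal active and inactive times.
    inactive_time = round(active_time * (1.0 - duty_cycle) / duty_cycle)
    # Start time: uniform, chosen so the whole burst fits in the simulation
    # (requires sim_cycles >= burst_length).
    start = rng.randrange(0, sim_cycles - burst_length + 1)
    return {
        "start": start,
        "burst_length": burst_length,
        "active_time": active_time,
        "inactive_time": inactive_time,
    }


rng = random.Random(0)  # fixed seed so that runs are repeatable
configs = [sample_fault_config(rng, sim_cycles=1000) for _ in range(200)]
```

Seeding the generator keeps an experiment reproducible, which matters when IVF values for different configurations are compared against each other.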
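IVF Computing – Reorder Buffer
• ACE bit analysis
[Figure: three-dimensional view of the reorder buffer with axes X, Y, and Z labeled entry, bit, and cycle, with ACE regions marked; alongside it, an example of an intermittent fault over time showing its active time and inactive time.]
$$IVF_{rob} = \frac{\sum_{s=1}^{B} D_{ACE}(U_s)}{B}$$
[Figure: planar representation with bursts B1, B2, and B3. In the example, IVF_rob = 2/6 = 1/3.]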
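The reorder buffer computation can be sketched in Python as follows. The sketch rests on an interpretation that the slides do not state explicitly: B is taken as the number of faults considered, U_s as the s-th fault, and D_ACE(U_s) as an indicator that is 1 when an active window of the fault overlaps an ACE interval of the faulty bit. The data layout (a dictionary from (entry, bit) to ACE cycle intervals) and all function names are hypothetical.

```python
from fractions import Fraction


def active_windows(start, burst_length, active_time, inactive_time):
    """Yield half-open [begin, end) cycle ranges where the fault is asserted."""
    end_of_burst = start + burst_length
    period = active_time + inactive_time
    t = start
    while t < end_of_burst:
        yield t, min(t + active_time, end_of_burst)
        t += period


def d_ace(fault, ace_intervals):
    """Assumed indicator: 1 if an active window overlaps an ACE interval."""
    location, start, burst_length, active_time, inactive_time = fault
    for lo, hi in active_windows(start, burst_length,
                                 active_time, inactive_time):
        for a_lo, a_hi in ace_intervals.get(location, ()):
            if lo < a_hi and a_lo < hi:  # half-open intervals overlap
                return 1
    return 0


def ivf(faults, ace_intervals):
    """Fraction of the considered faults that hit ACE state."""
    return Fraction(sum(d_ace(f, ace_intervals) for f in faults), len(faults))


# Hypothetical ACE intervals, keyed by (entry, bit), as [begin, end) cycles.
ace = {(0, 0): [(10, 20)], (1, 0): [(30, 40)], (2, 0): [(28, 30)]}
# Each fault: (location, start, burst_length, active_time, inactive_time).
faults = [
    ((0, 0), 12, 4, 2, 2),   # active in [12, 14): inside an ACE interval
    ((0, 0), 20, 4, 2, 2),   # starts just after the ACE interval ends
    ((0, 0), 0, 8, 2, 2),    # ends before the ACE interval begins
    ((1, 0), 26, 8, 2, 2),   # second active window [30, 32) hits ACE
    ((2, 0), 26, 4, 2, 2),   # only the inactive phase covers [28, 30)
    ((3, 0), 5, 10, 2, 2),   # location never holds ACE state
]
rob_ivf = ivf(faults, ace)
```

The six example faults are made up so that two of them hit ACE state, which gives the same arithmetic as the slide's example, 2/6 = 1/3. The fifth fault shows why the active and inactive times matter: its burst spans an ACE interval, yet it is masked because the fault is dormant during those cycles.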
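IVF Computing – Register File
• Critical time analysis
[Figure: lifetime of register version n along the time axis, with the events Allocation, W, R1, R2, …, Rlast, and Deallocation, shown between versions n-1 and n+1. One span is labeled critical time and two spans are labeled non-critical; example faults F1, F2, and F3 are marked.]
$$IVF_{reg} = \frac{\sum_{e=1}^{E} D_{CT}(U_e)}{E}$$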
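A small Python sketch of critical time analysis for one register version is given below. It assumes that the critical time runs from the write W to the last read Rlast, and that the spans from allocation to W and from Rlast to deallocation are non-critical; this matches the usual register lifetime reading of the figure but is not stated in the text. It also treats D_CT as a 0/1 indicator and E as the number of faults considered. The type `RegisterVersion` and the function names are hypothetical.

```python
from fractions import Fraction
from typing import NamedTuple, Optional, Tuple


class RegisterVersion(NamedTuple):
    """Hypothetical record of one register version's lifetime, in cycles."""
    allocation: int
    write: int
    reads: Tuple[int, ...]
    deallocation: int


def critical_interval(v: RegisterVersion) -> Optional[Tuple[int, int]]:
    """Assumed critical time: from the write to the last read."""
    if not v.reads:
        return None  # a value that is never read has no critical time
    return v.write, max(v.reads)


def d_ct(active_window: Tuple[int, int], v: RegisterVersion) -> int:
    """Assumed indicator: 1 if the active window overlaps the critical time."""
    crit = critical_interval(v)
    if crit is None:
        return 0
    lo, hi = active_window
    c_lo, c_hi = crit
    return int(lo < c_hi and c_lo < hi)


version = RegisterVersion(allocation=0, write=10,
                          reads=(15, 22, 40), deallocation=60)
# Three made-up active windows, given as [begin, end) cycle ranges.
before_write = (2, 6)
during_critical = (20, 24)
after_last_read = (45, 50)
windows = [before_write, during_critical, after_last_read]
reg_ivf = Fraction(sum(d_ct(w, version) for w in windows), len(windows))
```

Only the window that falls between the write and the last read is counted, so this toy example yields 1/3. The three windows are illustrative and are not meant to reproduce F1, F2, and F3 from the figure.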
Experimental Setup
• Simulated processor configurations
• Execution-driven simulator Sim-Alpha
• Reorder buffer / register file: 80/80 entries
• 4 integer ALUs, 2 integer multipliers, 2 floating-point ALUs
• Hybrid branch predictor: 4K global + 2-level 1K local + 4K choice
• 64KB 2-way L1 data cache, 2MB direct-mapped L2 cache
• Workload
• SPEC2000 integer benchmark suite
• Simulate 100M instructions with SimPoint
IVF vs AVF
[Figure: reorder buffer results. Chart with the series AVF, BL 10, BL 20, and BL 30; vertical axis from 0 to 80.]
• IVF varies significantly across benchmarks
• Longer burst length, higher IVF
• IVF is much higher than AVF
Different Fault Configurations
[Figure: reorder buffer results. Chart with the series Config1_4, Config1_2, Config2_4, and Config2_2; vertical axis from 0 to 70.]
• IVF varies little across burst length configuration files
• IVF varies significantly with the active time
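IVF at Entry Level
[Figure: register file results. Per-entry chart with horizontal axis ticks from 1 to 71 and vertical axis from 0 to 100; series Active Time 1, Active Time 2, and Active Time 4; the entries are divided into architecture registers and renaming registers.]
• IVF varies across different entries
• Architecture registers are more vulnerable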
Implications
• Quantitatively guide reliability design at an early design stage and evaluate system reliability
• Harden partial structures/entries for high reliability while minimizing the overhead
• Razor [MICRO’03]
• Parshield [DSN’07]
• Easily extended to analyze other structures (issue queue, load/store queue, and cache)
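One way to read the partial hardening bullet is as a selection problem: given per-entry IVF values and a limit on how many entries can be protected, pick the most vulnerable ones. The Python sketch below uses a simple top-k ranking, which is an assumption for illustration and not a method taken from the slides; the function name, the `budget` parameter, and the IVF values are all made up.

```python
def select_entries_to_harden(entry_ivf, budget):
    """Return the indices of the `budget` entries with the highest IVF."""
    # Rank entry indices from most to least vulnerable.
    ranked = sorted(range(len(entry_ivf)),
                    key=lambda i: entry_ivf[i], reverse=True)
    return sorted(ranked[:budget])


# Made-up per-entry IVF values for a five-entry structure.
example_ivf = [0.9, 0.8, 0.2, 0.1, 0.4]
hardened = select_entries_to_harden(example_ivf, budget=2)
# Sum of the IVF values of the entries that remain unprotected.
residual = sum(v for i, v in enumerate(example_ivf) if i not in hardened)
```

Here entries 0 and 1 are selected, so two of the five entries are hardened while the unprotected entries carry the smaller IVF values.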
Conclusions
• Propose a methodology to characterize the vulnerability of microprocessor structures to intermittent faults
• Compute IVF for the reorder buffer and register file
• IVF varies significantly both across and within structures, which motivates protecting the most vulnerable structures to improve system reliability
• Thank you for your attention
• Questions?