Sidestepping performance bottlenecks and design crises with
Better-Than-Worst-Case design
Munich - March 7th, 2005
ARM, Ltd. Cambridge, UK
Dept. of EECS, University of Michigan, Ann Arbor, MI – USA
Krisztian [email protected]
Valeria [email protected]
Todd [email protected]
2
Introduction
Limitations of traditional design approaches in light of current technology trends
Pressing Design Challenges in the Nanometer Regime
Design complexity
• Billions of transistors lead to untenable designs…
Uncertainty in design parameters
• Process and temperature variation, supply noise…
Soft errors upset logic and memory
• Cosmic rays, alpha particles, neutrons, etc…
Power demands
• Bounding performance, area, battery life
4
Design Complexity Trends
[Chart: transistor count (log scale, 1.E+04 to 1.E+09) versus year (1980 to 2005) for CPUs and GPUs, including i286, i386, i486, Pentium, PentiumPro, PentiumII, PentiumIV, Prescott, K7, Crusoe, PowerIV, Riva TNT2, GeforceII, GeForceIV, GeForceFX, Nvidia NV2, and Radeon97]
5
The Burden of Verification
Immense test space
• Impossible to fully test the system
• Example: 32 regs, 8k caches, 300 pins = 2^132396 states
Done with respect to ill-defined reference
• What is correct? Often defined by old designs + gurus
Expensive
• Large fraction of design team dedicated to verification
• Increases time-to-market, often as much as 1-2 years
High-risk
• Typically only one chance to “get it right”
• Failures can be costly: replacement parts, bad PR, lawsuits, fatalities
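The exponent in the test-space example can be reproduced by counting state bits. The sketch below assumes 32-bit registers and two 8 KB caches (one instruction, one data); the slide gives only "32 regs, 8k caches, 300 pins", so these widths are assumptions chosen because they reproduce the quoted figure.

```python
import math

# Assumed (not stated on the slide): 32-bit registers and two 8 KB caches.
REGISTERS = 32
REGISTER_BITS = 32        # assumption
CACHES = 2                # assumption: one I-cache and one D-cache
CACHE_BYTES = 8 * 1024
PINS = 300

def state_bits():
    """Total number of bits that define the machine state."""
    return REGISTERS * REGISTER_BITS + CACHES * CACHE_BYTES * 8 + PINS

bits = state_bits()
# Number of decimal digits of 2**bits, computed without building the huge integer.
digits = math.floor(bits * math.log10(2)) + 1
print(f"{bits} state bits, so 2^{bits} states (a number with about {digits} decimal digits)")
```

Under these assumptions the caches dominate the count, which is why exhaustive testing is out of reach long before the core logic is even considered.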
6
Extreme Variations
[Charts:
• Heat flux (W/cm2), 0 to 250: results in Vcc variation
• Temperature variation (°C), 40 to 110: results in hot spots
• Random dopant fluctuations: mean number of dopant atoms (log scale, 10 to 10000) versus technology node (1000, 500, 250, 130, 65, 32 nm)]
7
Uncertainty in Design Parameters
Uncertainty leads to performance and power overheads
• Increasing uncertainty with design scaling
• Intra-die process/temperature variations, inductive noise, deep…
Key Observation: worst-case scenario is highly improbable
• Significant gain for circuits optimized for common case
• Efficient mechanisms needed to tolerate worst-case scenarios
[Figure: temperature, Si variation, noise, and model uncertainty combine (+ + + =); quantities labeled f, Yield, MTTF, and Vdd]
8
Impact of Neutron Strike on a Si Device
Secondary source of upsets: alpha particles from packaging
Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
[Figure: neutron strike on a transistor device, releasing + and - charges between source and drain]
9
Soft-Error Trends
SER per chip of logic circuits
• Nine orders of magnitude increase from 600 nm to 50 nm
• Dominant source of soft errors after 50 nm
[Chart: soft error rate (FIT/chip, log scale, 1.0E-07 to 1.0E+04) versus technology generation: 1992 600nm, 1994 350nm, 1997 250nm, 1999 180nm, 2002 130nm, 2005 100nm, 2008 70nm, 2011 50nm; series labeled SRAM, latch 6 FO4s, logic 6 FO4s]
[P. Shivkumar et al., DSN 2002]
10
Source: The New York Times, 25 June 2002
Fried Egg a la Athlon XP1500+
11
Power Density Trends
Power doubles every 4 years
5-year projection: 200W total, 125 W/cm2!
P=VI: 75W @ 1.5V = 50 A!
[Chart: power density (Watts/cm2, log scale, 1 to 1000) versus process generation (1.5μ, 1μ, 0.7μ, 0.5μ, 0.35μ, 0.25μ, 0.18μ, 0.13μ, 0.1μ, 0.07μ) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for hot plate, nuclear reactor, and rocket nozzle]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote - 1999. Courtesy Avi Mendelson, Intel.
12
Better-Than-Worst-Case (BTWC) design
Traditional worst-case design works to avoid errors/faults by assuming worst-case conditions for design validation
Better-than-worst-case design couples a complex design with a checker component that validates correctness during operation
Reduces design effort and enables typical-case optimizations
13
What is this tutorial about
BTWC design
• Basic Concepts
• DIVA Checker
• Razor Logic
• Other BTWC solutions
CAD challenges and opportunities
• Typical-Case design Optimization (TCO)
• Circuit-level observability and system-level performance
Open discussion
Conclusion
14
Goals of this tutorial
1. Introduce and motivate the concept of Better Than Worst-Case design
2. Familiarize the attendees with a number of BTWC designs (ours and others)
3. Introduce efforts (circuit-aware architectural simulation, typical-case optimization) that highlight the challenges and opportunities that BTWC poses to CAD
4. Facilitate an open discussion on the implications of BTWC design on CAD
15
Better-Than-Worst-Case design
16
Traditional Worst-Case Design
[Figure: design-time verification and optimization; Time-to-Market and Performance gauges, each on a low (L) to high (H) scale]
17
Better-Than-Worst-Case (BTWC) design
[Figure: run-time verification by online checker hardware, combined with typical-case optimization; Time-to-Market and Performance gauges, each on a low (L) to high (H) scale]
18
Outline
• DIVA Checker
• Razor Logic
• Other BTWC designs
19
Motivating Observations
Online functional verification covers most faults
• Single-event upsets and noise-related faults
• Design faults and incomplete implementation
• Untestable silicon defects and in-field circuit failures
• Utilize N(2)-version hardware to detect and correct faults
Increasing speculation reduces exposure to faults
• Predictors need not be correct, functionally or electrically
• Approach leverages a maximally speculative architecture
While complex, processors have simple semantics
• Need not validate all internals, only exposed semantics
• Only check instruction semantics for low overheads
20
Example BTWC Design: DIVA Checker [Austin’99]
All core function is validated by checker
• Simple checker detects and corrects faulty results, restarts core
Checker relaxes burden of correctness on core processor
• Tolerates design errors, electrical faults, defects, and failures
• Core has burden of accurate prediction, as checker is 15x slower
Core does heavy lifting, removes hazards that slow checker
[Figure: the core (performance; IF, ID, REN, REG, SCHEDULER, EX/MEM) passes speculative instructions, in-order with PC, inst, inputs, and addr, to the checker (correctness; CHK, CT), which serves as the online checker hardware]
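The division of labor between core and checker can be illustrated with a toy model. This is a minimal sketch, not the DIVA implementation: the three-operation ISA, the register names, and the function names are all hypothetical, and the model checks only computed results (it does not model checking of PC, instruction, or operand predictions).

```python
# Toy model of a DIVA-style checker: the core only *predicts* results; the
# checker recomputes each instruction in order and commits its own value.
# The three-operation ISA and all names here are hypothetical.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
}

def run_checked(program, regs, core_execute):
    """program: list of (op, dst, src1, src2); regs: architectural registers.

    core_execute(op, a, b) may return a wrong value. Returns the number of
    recoveries, that is, how often the checker had to override the core."""
    recoveries = 0
    for op, dst, src1, src2 in program:
        a, b = regs[src1], regs[src2]        # checker reads committed state
        predicted = core_execute(op, a, b)   # core's speculative result
        correct = OPS[op](a, b)              # checker's simple recomputation
        if predicted != correct:
            recoveries += 1                  # a real design would flush and restart the core
        regs[dst] = correct                  # only checked values are committed
    return recoveries

def faulty_core(op, a, b):
    """A core with a design bug: 'sub' is off by one."""
    result = OPS[op](a, b)
    return result + 1 if op == "sub" else result

program = [("add", "r3", "r1", "r2"), ("sub", "r4", "r3", "r1"), ("and", "r5", "r4", "r2")]
regs = {"r1": 5, "r2": 7, "r3": 0, "r4": 0, "r5": 0}
recoveries = run_checked(program, regs, faulty_core)
print(recoveries, regs)
```

In this model the bug in the core costs one recovery but never reaches architectural state, which is the sense in which the checker relaxes the burden of correctness on the core.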
21
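Checker Processor Architecture
[Figure: the core processor prediction stream (core PC, core inst, core regs, core res/addr/nextPC) feeds the checker stages IF, ID, EX, and MEM, which access the I-cache, RF, and D-cache; each stage compares (=) its own PC, inst, regs, and res/addr against the core's values and reports OK to CT; a WT block is also shown]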
22
Check Mode
[Figure: the checker pipeline in check mode; stages IF, ID, EX, and MEM compare (=) the core's PC, inst, regs, and res/addr/nextPC predictions against values from the I-cache, RF, and D-cache and report OK to CT]
23
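Recovery Mode
[Figure: the checker pipeline in recovery mode; stages IF, ID, EX, and MEM pass PC, inst, regs, res/addr, and result from one stage to the next using the I-cache, RF, and D-cache, with no core prediction inputs or comparators shown]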
24
How Can the Simple Checker Keep Up?
Slipstream
Slipstream reduces power requirements of trailing car
Checker processor executes inside core processor’s slipstream
• fast moving air ⇒ branch predictions and cache prefetches
• Core processor slipstream reduces complexity requirements of checker
• Checker rarely sees branch mispredictions, data hazards, or cache misses
27
Checker Performance Impacts
Checker throughput bounds core IPC
• Only cache misses stall checker pipeline
• Core warms cache, leaving few stalls
Checker latency stalls retirement
• Stalls decode when speculative state buffers fill (LSQ, ROB)
• Stalled instructions mostly nuked!
Storage hazards stall core progress
• Checker may stall core if it lacks resources
Faults flush core to recover state
• Small impact if faults are infrequent
[Chart: relative CPI (0.97 to 1.05) for Uber-Checker, Pico-Checker, 12-cycle Checker, 1/4 Cache Size, and 1k Faults]
28
REMORA: Physical Checker Design
Physical checker design effort underway
• Alpha integer ISA subset
• 4-wide checker, 0.5k I-cache, 4k D-cache
• Synthesized design (using Synopsys)
Physical design estimates
• 950 MHz clock speed (degree-8 pipe)
• 12 mm2 total area in 0.25um technology
• 941 mW worst-case power
Design also includes:
• Pipelined checker design, simple core
• Clock/voltage tuning infrastructure
• Extensive BIST support
[Figure: area comparison of the Alpha 21264 (205 mm2 in 0.25um) and the REMORA checker (12 mm2 in 0.25um; data cache, inst cache, pipeline, BIST)]
29
Verifying the Checker Processor
Simple checker permits complete functional verification
• In-order blocking pipelines (trivial scheduler, no rename/reorder/commit)
• No “internal” non-architected state
Fully verified design using Sakallah’s GRASP SAT-solver
• For Alpha integer ISA without exceptions
• With small register file and memory, and small data types
[Figure: the checker model, driven by unspecified core predictions, and the reference model (ISA sim) start from identical state; their outputs are compared (==), and the property ϕ is always true if uArch model == Ref model]
31
Outline
• DIVA Checker
• Razor Logic
• Other BTWC designs
32
Motivating Study: Voltage vs. Circuit Error Rate
33
Circuit Under Test
[Figure: 48-bit LFSRs drive three 18x18 multipliers: a fast pipeline clocked at clk and two slow pipelines (A and B) clocked at clk/2; 18-bit operands, 36-bit products; after a stabilize stage the outputs are compared (!=), and mismatches increment 40-bit error counters]
34
Error Rate Studies – Empirical Results
[Chart: error rate (log scale, up to 100%) versus supply voltage (1.14 V to 1.78 V) for an 18x18-bit multiplier block at 90 MHz and 27 C, random inputs. Marked operating points: zero-margin @ 1.54 V, safety-margin @ 1.63 V, environmental-margin @ 1.69 V. Annotations: 35% energy savings with 1.3% error; 22% saving; once every 20 seconds!]
35
Error Rate Studies – SPICE-Level Simulations
Based on SPICE-level simulations of a Kogge-Stone adder
[Chart: error rate (log scale, 0.01% to 100%) versus supply voltage (0.6 V to 2 V) for a Kogge-Stone adder at 870 MHz and 27 C; series: random, bzip, ammp; annotation: 200 mV]
36
Another BTWC Design: Razor Logic [Ernst’03]
Double-sampling metastability-tolerant latches detect timing errors
• Second sample is correct-by-design
Microarchitectural support restores state
• Timing errors treated like branch mispredictions
[Figure: pipeline stage with main flip-flops clocked by clk and a shadow latch clocked by the delayed clock clk_del, serving as the online checker hardware]
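The double-sampling idea can be expressed as a small behavioral model. This is a sketch under simplifying assumptions, not the circuit from the slides: data is modeled as settling at a single arrival time, metastability is ignored, and all names and timing numbers are hypothetical.

```python
def razor_sample(arrival_ns, new_value, old_value, t_clk_ns, t_delay_ns):
    """Return (main, shadow, error) for data that settles at arrival_ns.

    The main flip-flop samples at the clock edge t_clk_ns. The shadow latch
    samples at the delayed clock t_clk_ns + t_delay_ns and is assumed to
    always see settled data (the "correct-by-design" second sample)."""
    if arrival_ns > t_clk_ns + t_delay_ns:
        raise ValueError("data misses even the delayed clock; Razor cannot recover")
    main = new_value if arrival_ns <= t_clk_ns else old_value  # late data: stale value latched
    shadow = new_value
    return main, shadow, main != shadow

# Data arrives in time: both samples agree, no error.
print(razor_sample(0.8, 1, 0, t_clk_ns=1.0, t_delay_ns=0.5))
# Data arrives late (for example at reduced voltage): the samples differ, error flagged.
print(razor_sample(1.2, 1, 0, t_clk_ns=1.0, t_delay_ns=0.5))
```

When the error flag is raised, the shadow value is the one to restore, in the same spirit as recovering from a branch misprediction.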
37
Razor Short Path Constraint
Short-path timing constraint prevents shadow latch corruption
[Figure: main flip-flops clocked by clk and shadow latch clocked by clk_del; hold constraint of about 1/2 cycle]
38
Razor Flip-Flop
[Figure: Razor flip-flop circuit schematic with input D_in, output Q, clocks Clk_p and Clk_n, Restore, Restore_p, Restore_n, and Rbar_latched signals, P-skewed and N-skewed devices, and an Error driver]
39
Centralized Pipeline Recovery Control
One-cycle penalty for timing failure
Global synchronization may be difficult for fast, complex designs
[Figure: pipeline stages IF, ID, EX, MEM, and WB (reg/mem) separated by Razor FFs; per-stage error signals are combined into a recover signal that goes to every stage and the PC; timing diagram for inst1 to inst6 over cycles 0 to 6]
40
Distributed Pipeline Recovery
Builds on existing branch prediction framework
Multiple-cycle penalty for timing failure
Scalable design as all communication is local
[Figure: pipeline stages IF, ID, EX, MEM (read-only), and WB (reg/mem) separated by Razor FFs, with a stabilizer FF before WB; each stage passes an error bubble forward and raises flushID to the flush control, which drives recover to the PC; timing diagram for inst1 to inst8 over cycles 0 to 9, with inst2, inst3, and inst4 issued again after the error]
41
Shaving Voltage Margins with Razor
Goal: reduce voltage margins with in-situ error detection and correction
Approach:
• Tune processor voltage based on error rate
• Eliminate margins, run below critical voltage
- Trade-off: power savings vs. overhead of correction
[Chart: percentage errors (0 to 60) versus supply voltage (0.8 to 2.0); regions marked traditional DVS, zero margin, and sub-critical]
42
Razor Opportunity: Typical-Case Energy Reduction
Energy reduction can be realized with a simple proportional control function
• Control algorithm implemented in software
[Figure: control loop. The pipeline's error signals are summed (Σ) into the sampled error rate Esample; Ediff = Eref - Esample feeds the voltage control function, which sets Vdd through the voltage regulator; a reset input is also shown]
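A proportional controller of the kind named on this slide can be sketched in a few lines. The slide defines only Ediff = Eref - Esample; the sign convention (lower the voltage when errors are below target), the gain, the clamping limits, and the function name below are assumptions made for illustration.

```python
def next_vdd(vdd, e_sample, e_ref, gain, v_min, v_max):
    """One step of a proportional controller for the supply voltage.

    e_diff = e_ref - e_sample, as on the slide. A positive e_diff means the
    pipeline sees fewer errors than targeted, so the voltage is lowered; a
    negative e_diff raises it. The result is clamped to [v_min, v_max]."""
    e_diff = e_ref - e_sample
    return min(v_max, max(v_min, vdd - gain * e_diff))

# Hypothetical run: the sampled error rates are made up, not measured data.
vdd = 1.80
for e_sample in (0.0, 0.0, 0.002, 0.02):
    vdd = next_vdd(vdd, e_sample, e_ref=0.01, gain=2.0, v_min=1.2, v_max=1.8)
    print(f"error rate {e_sample:.3f} -> Vdd {vdd:.3f} V")
```

The voltage drifts down while the sampled error rate is below the reference and backs off once it overshoots, which is the behavior a software control loop would need to track the typical-case operating point.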
43
Energy/Performance Characteristics
[Chart: energy and pipeline throughput (IPC) versus decreasing supply voltage. Curves: energy of processor operations, Eproc; energy of pipeline recovery, Erecovery; total energy, Etotal = Eproc + Erecovery, with the optimal Etotal marked; energy of processor w/o Razor support. Annotations: 50%, 1%]
44
Simulation Results: Optimal Voltage Sweep
BZIP: 0.31% Error Rate, 58% Energy Savings
Recovery cost includes energy to recover entire pipeline (18x an add)
[Chart: relative IPC and energy (0.3 to 1.5) versus voltage (0.6 to 1.8); series: Rel Energy, Rel Performance]
45
Voltage Controller Performance
[Chart: percentage error rate (0 to 16) and voltage output of controller (1.35 to 1.80) versus runtime samples (0 to 600), at 120MHz and 27C]
46
Simulation Results: Razor DVS Performance
[Chart: relative energy (%), 0 to 100, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and the average; series: Total Energy, DVS Energy, IPC, DVS IPC]
47
Razor Prototype Silicon
4 stage 64-bit Alpha pipeline
• 120 - 160MHz operation
• 0.18μm technology, 1.8V
Razor overhead:
• Total of 192 Razor flip-flops out of 2408 total (9%)
• Error-free power overhead:
- Razor flip-flops: < 1%
- Short path buffer: 2.1%
• Recovery power overhead:
- Razor latch power overhead: 2% at 10% error rate
- Additional power overhead due to re-execution of instructions
[Die photo: 3.3 mm x 3 mm, showing I-Cache, D-Cache, Register File, and stages IF, ID, EX, MEM, WB]
48
Razor Prototype Testbed
51
Error Rate and Normalized Energy Savings
[Chart: percentage error rate (log scale, 1E-8 to 10) and normalized energy (0.70 to 1.20) versus voltage (1.52 to 1.76 V), at 120MHz and 140MHz]
52
Point of 0.1% Error Rate and Point of First Failure
[Charts: voltage at 0.1% error rate versus voltage at first failure (1.4 to 1.8 V) for the measured chips, in two panels labeled 120MHz and 140MHz, with linear fits y = 0.78685x + 0.22117 and y = 0.55976x + 0.61752]
54
Temperature Margins
[Chart: percentage error rate (log scale, 1E-9 to 0.1) versus voltage (1.40 to 1.52 V) at 120MHz for 45C, 65C, and 95C]
55
Razor Energy Savings @ 120MHz, 45C
[Chart: measured power (in mW) for chip1 and chip2
• Measured power with supply, temperature and process margins: chip1 160.5mW, chip2 162.8mW
• Power with Razor DVS when operating at point of first failure: chip1 104.5mW, chip2 119.4mW
• Power with Razor DVS when operating at point of 0.1% error rate: chip1 89.7mW, chip2 99.6mW
• Margin breakdown, chip1: process 17.3mW (130mV), temperature 11.3mW (80mV), power supply integrity 27.3mW (180mV)
• Margin breakdown, chip2: process 4.2mW (30mV), temperature 11.5mW (80mV), power supply integrity 27.7mW (180mV)]
57
Outline
• DIVA Checker
• Razor Logic
• Other BTWC designs
58
Other Better Than Worst-Case designs
Algorithmic-Noise Tolerance, Shanbhag et al.
• Converting circuit faults to S/N component
Approximate Circuits, Lu et al.
• Architecture-level speculation on computation
TEAtime Adaptive Clock, Uht et al.
• Adaptive clock control
On-Chip Self-Calibrating Busses, Worm et al.
• Error recovery logic for on-chip busses
Self-Tuning Circuits, Kehl et al.
• Early work on dynamic timing error avoidance
Time Based Transient Fault Detection, Anghel et al.
• Double sampling latches for speed testing
March 2004
59
Algorithmic Noise Tolerance
[Shanbhag ’04]
60
Approximate Circuits
[Lu ’04]
61
TEAtime Adaptive Clock
[Uht ’04]
62
On-Chip Self-Calibrating Busses
[Worm ’04]
[Figure: an encoder and a decoder connected through FIFOs over a channel, with a controller that receives errors and Ack signals; labels: ddv, chv, chF]
63
Self-Tuning Circuits
[Kehl ’93]
64
Time Redundancy BasedTransient Fault Detection
[Anghel ’00]
65
CAD opportunities for BTWC
66
Key observation
In a BTWC context, infrequent faults in the core design are tolerable.
• A fault is infrequent if:
- Environmental conditions trigger it rarely
- Normal system operation activates a faulty configuration rarely
67
CAD opportunities
Synthesis
• Optimize performance/power for the most common scenarios (typical-case optimization)
• Flexible synthesis tools - Optimization constraints can be relaxed
• Finer granularity of synthesis objectives – probability density curves
Verification
• Accurate evaluation - Statistical analysis of execution scenarios
• Verification focuses on “frequent” transactions/paths of execution
• Verification focuses on critical components
• Safety mechanisms detect “rare” problems at a performance cost
- Performance and verification are intertwined
68
Outline
• Synthesis - Typical-Case Optimized Adders
• Performance/verification - Circuit-Aware simulation
• Verification - Beta-release processors
69
Kogge-Stone Adder
[Figure: 16-bit Kogge-Stone parallel prefix computation; generate/propagate inputs G0P0 through G15P15 pass through the prefix network to produce carries C1 through C16]
70
BTWC Opportunity: Typical-Case Optimized Adder
[Figure: Kogge-Stone adder with generate/propagate inputs G0P0 through G15P15 … and carry-in Cin]
71
Carry Propagations for Random Data
[3-D chart: probability (0 to 0.01) versus bit position (carry start) and carry distance (carry propagation)]
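The observation behind this chart, that long carry chains are rare, can be checked with a short experiment. The sketch below measures a related but simpler statistic than the slide's per-bit distribution: the longest distance any carry travels in one addition of two uniformly random 64-bit operands. The function names, sample count, and seed are illustrative choices, not taken from the tutorial.

```python
import random
from collections import Counter

def longest_carry_chain(a, b, width=64):
    """Longest distance (in bit positions) any carry travels when adding a + b.

    A carry is generated where both operand bits are 1 and keeps moving
    through positions where exactly one operand bit is 1."""
    generate = a & b
    propagate = a ^ b
    longest = run = 0                 # run: distance covered by the carry in flight
    for i in range(width):
        if (generate >> i) & 1:
            run = 1                   # a new carry leaves bit i
        elif (propagate >> i) & 1 and run:
            run += 1                  # the incoming carry passes through bit i
        else:
            run = 0                   # no carry here, or it is absorbed
        longest = max(longest, run)
    return longest

def chain_histogram(samples=10_000, width=64, seed=0):
    """Histogram of longest carry chains for uniformly random operands."""
    rng = random.Random(seed)
    return Counter(
        longest_carry_chain(rng.getrandbits(width), rng.getrandbits(width), width)
        for _ in range(samples)
    )

hist = chain_histogram()
mean = sum(length * count for length, count in hist.items()) / sum(hist.values())
print(f"mean longest carry chain over random 64-bit operands: {mean:.2f}")
```

Running the same measurement over operand traces from real programs, rather than random data, is what would distinguish the "typical data" case.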
72
Carry Propagations for Typical Data
[3-D chart: probability (0 to 0.16) versus bit position (carry start) and carry distance (carry propagation)]
73
Typical Case Optimized Adder
[Figure: adder with generate/propagate inputs G0P0 through G15P15 … and carry-in Cin, combining a ripple carry circuit with a carry-lookahead circuit]
74
Benefits of Typical Case Optimization
Typical-case performance much better than worst case, relevant in a TCO context

                   Latency (in gate delays)
Adder Topology     Worst-Case   Typical-Case   Random
Kogge-Stone        8            5.08           7.09
TCO Adder          128          3.03           3.69
75
Outline
• Synthesis - Typical-Case Optimized Adders
• Performance/verification - Circuit-Aware simulation
• Verification - Beta-release processors
76
Simulation for BTWC
Electrical, transient phenomena affect the performance of BTWC designs
Simulation tools need to be “electrically accurate”
Functional correctness is re-defined in a statistical context
Need ability to gather statistical simulation data
[Figure: example, Razor latches: main flip-flops clocked by clk and a shadow latch clocked by clk_del]
77
Key Challenges: Speed and Fidelity
GOAL: Balance between accuracy and speed of simulation
[Chart: fidelity versus speed for SPICE, analytical circuit model, and circuit-aware simulation; annotations: ~4 hr/cycle, 30min/prog, 5 hrs/prog]
78
Circuit-Aware is not only for BTWC designs
There is a recent trend in computer architecture design toward systems that can adapt to circuit-level phenomena
• e.g., di/dt, thermal throttling
These novel circuit-aware architectural optimizations share a requirement for detailed circuit modeling
• There needs to be interaction between architectural state and circuit behavior (e.g., device switching activity, detailed timing information of pipeline states)
Analytical circuit modeling has been widely used
• Simple and fast
• At the cost of accuracy
79
Circuit-Aware Architectural Simulation Platform Overview
[Figure: the architectural simulator (inputs: App, Arch Config; output: Arch Metrics; provides speed and scope) models the pipeline IF, ID, EX, MEM, WB and sends inputs, voltage, and constraints to the circuit simulator (inputs: module circuit models, tech models; output: circuit metrics; provides fidelity and observability), which returns delay, power, and switching]
80
Architectural Simulator Structure
[Figure: layered simulator structure. User programs: SimpleScalar program binary. Prog/Sim interface: SimpleScalar ISA, POSIX system calls. Functional core: machine definition, proxy syscall handler. Performance core: simulator core with BPred, Cache, Memory, Regs, Loader, Resource, Stats, and Dlite!]
81
Standard Circuit Simulation Approach
[Figure: gate-level netlist (gates U7 to U11; nets a, b, c, d, n45, n46, n47, f, g) annotated with extracted wire cap: a: 0.00336631, b: 0.00262017, c: 0.00143158, d: 0.0046271, n45: 0.0389713, n46: 0.0239953, n47: 0.0569968, f: 0.0233215, g: 0.0234223. Gate input cap comes from typical.lib. Each SPICE-characterized standard cell maps input slew, output capacitance, and voltage to slew, delay, and power]
82
Event Driven Implementation
• Event driven simulation can capture glitch behavior
• Validation against a set of SPICE simulations
• Error rates are consistently less than 11%, with most less than 3%
• The initial speed of simulation without optimization is 150 insts / sec
[Figure: transition events propagating through gates U7 to U11 and nets n45, n46, n47 to outputs f and g, producing a glitch]
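To show how an event-driven simulator captures a glitch, here is a minimal sketch with a two-gate netlist. The netlist, gate delays, and net names are hypothetical and far simpler than the circuit in the figure; delays are fixed per gate, whereas the simulator described in the tutorial derives them from characterized cells.

```python
import heapq

# Hypothetical netlist: output net -> (function, input nets, delay in ns).
GATES = {
    "n1": (lambda a: 1 - a, ("a",), 1.0),            # inverter
    "f": (lambda a, n1: a & n1, ("a", "n1"), 0.5),   # AND gate
}

def simulate(initial, stimuli):
    """Process (time, net, value) events in time order; return the transitions seen."""
    values = dict(initial)
    queue = list(stimuli)
    heapq.heapify(queue)
    trace = []
    while queue:
        time, net, value = heapq.heappop(queue)
        if values[net] == value:
            continue                      # no transition, nothing to propagate
        values[net] = value
        trace.append((time, net, value))
        for out, (fn, inputs, delay) in GATES.items():
            if net in inputs:             # re-evaluate every gate driven by this net
                new_value = fn(*(values[i] for i in inputs))
                heapq.heappush(queue, (time + delay, out, new_value))
    return trace

# Input a rises at t=0. The AND gate briefly sees a=1 and n1=1 before the
# inverter output falls, so f pulses high and then returns low: a glitch.
trace = simulate({"a": 0, "n1": 1, "f": 0}, [(0.0, "a", 1)])
for event in trace:
    print(event)
```

A purely functional (zero-delay) evaluation of the same netlist would report f as constant 0 and miss the pulse, which matters when switching activity or timing is what is being measured.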
83
Some serious performance boosters
Constraint-based pruning
• Static pruning
• Dynamic pruning
Circuit timing memoization (a.k.a. caching of electrical simulation results)
SimPoints
84
Constraint-Based Pruning – An overview
Some analyses have a “don’t care” scenario, e.g., “less than X switches”, “less than tclk”
Architecture does not react to “don’t care” sets
[Figure: pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB with the logic under simulation between them; check constraints: if (delay < Tclk) stop, else simulate; constraints not violated]
85
Static Constraint Pruning
At each new voltage and temperature, domain-specific STA computes worst-case values of the constraint measure; logic can be pruned where the constraints cannot be violated
Whenever the supply voltage is changed, constraint-based pruning is re-evaluated before simulation continues
[Figure: domain-specific STA on a circuit with inputs a to e; delays of 1.32 ns, 0.65 ns, and 0.83 ns; clock cycle time: 1 ns @ 1.8V; logic below the clock threshold is statically pruned; t_max_req 0.87]
86
Dynamic Pruning
During simulation, an event can be dropped if a particular input vector causes a transition such that: tevent + Tmax2output < Tconstraint
• Guarantees that simulation will reach output net without a timing violation
• Must still perform logic simulation to compute circuit state
[Figure: example with clock cycle time 5 ns; events arrive @ 1.23 ns and @ 2.23 ns and reach later nets @ 2.93 ns, @ 3.23 ns, and @ 4.23 ns; Tmax2output timing budgets of 3.2ns, 2.5ns, 2ns, and 1ns; fast logic sim]
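The static and dynamic pruning tests reduce to two comparisons, sketched below. The function names are hypothetical. The static example treats the 1.32 ns, 0.65 ns, and 0.83 ns values from the static-pruning figure as worst-case path delays against the 1 ns clock, which is an interpretation of that figure; the dynamic example pairs an arrival time with a Tmax2output budget purely for illustration.

```python
def statically_prunable(worst_case_delay_ns, t_clk_ns):
    """Static pruning: logic whose STA worst-case delay is below the clock
    period can never violate timing at this voltage, so it need not be
    timing-simulated."""
    return worst_case_delay_ns < t_clk_ns

def can_drop_event(t_event_ns, t_max_to_output_ns, t_constraint_ns):
    """Dynamic pruning test from the slide: tevent + Tmax2output < Tconstraint."""
    return t_event_ns + t_max_to_output_ns < t_constraint_ns

# Static: values read from the static-pruning figure, clock cycle time 1 ns.
static_result = [statically_prunable(d, 1.0) for d in (1.32, 0.65, 0.83)]
print(static_result)

# Dynamic: clock cycle time 5 ns; the pairing of arrival and budget is illustrative.
print(can_drop_event(2.23, 2.5, 5.0))   # 4.73 ns worst case, safe to drop timing simulation
print(can_drop_event(2.23, 3.2, 5.0))   # 5.43 ns worst case, must keep simulating
```

Dropping an event here only skips the timing evaluation; as the slide notes, logic values still have to be computed so that circuit state stays correct.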
87
Constraint-Based Circuit Pruning
In our case study of a 200 MHz Razor system
• At 1.8V nominal voltage, pruning eliminated 64% of the primary inputs of the circuit
• At the most highly constrained voltage, 1.4V, 24% of the primary inputs of the circuit are eliminated
Static and dynamic pruning achieve 445 instructions per second
88
Circuit Timing Memoization
Remember previous circuit evaluations, reuse results if they recur
Leverage value locality by recording switching history
• Previous vector encodes the internal node values of circuit, input vector indicates new input transition
[Figure: 16-bit adder (opA, opB, Sum). The previous vector (0x00AE, 0x003B) and the new input vector (0x01FC, 0x0012) are checked against a hash table; on a miss, full circuit simulation runs and its result (2.5ns, 250pJ) is stored]
89
Circuit Timing Memoization
Size of hash table is limited to 256MB
Dynamic reordering of hash bucket chains
• Bringing the most recently used (MRU) element to the head of the chain reduces the average number of hops
• At most 50% hit rate on average
Per-opcode input vector filtering mechanism
• Observation:
- load/store instructions ignore “operand B”
• Each instruction opcode indicates with a mask which inputs do not influence stage logic evaluation
• 70% hit rate on average
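The memoization scheme can be sketched as a bounded cache keyed by the previous and new input vectors, with a per-opcode mask that hides inputs the opcode ignores. This is not the tutorial's data structure: it swaps the hash bucket chains with MRU reordering for Python's OrderedDict with least-recently-used eviction, and the class name, the 32-bit vector packing (opA in the upper 16 bits, opB in the lower 16), and the stand-in simulator are assumptions.

```python
from collections import OrderedDict

class TimingMemo:
    """Cache of circuit evaluations keyed by (opcode, previous vector, new vector)."""

    def __init__(self, simulate, capacity=1024, masks=None):
        self.simulate = simulate          # slow circuit simulation to fall back on
        self.capacity = capacity          # entry limit (the slide limits by memory size)
        self.masks = masks or {}          # opcode -> bit mask of inputs that matter
        self.table = OrderedDict()
        self.hits = self.misses = 0

    def evaluate(self, opcode, prev_vec, new_vec):
        mask = self.masks.get(opcode, ~0)             # default: every bit matters
        key = (opcode, prev_vec & mask, new_vec & mask)
        if key in self.table:
            self.hits += 1
            self.table.move_to_end(key)               # keep recently used entries
            return self.table[key]
        self.misses += 1
        result = self.simulate(opcode, prev_vec & mask, new_vec & mask)
        self.table[key] = result
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)            # evict the least recently used
        return result

def fake_circuit_sim(opcode, prev_vec, new_vec):
    """Stand-in for gate-level simulation; returns (delay_ns, energy_pJ)."""
    return (2.5, 250.0)

# Loads ignore operand B, so only the upper 16 bits (opA) form the key.
memo = TimingMemo(fake_circuit_sim, capacity=4, masks={"ld": 0xFFFF0000})
memo.evaluate("add", 0x00AE003B, 0x01FC0012)   # miss: first time seen
memo.evaluate("add", 0x00AE003B, 0x01FC0012)   # hit: same transition recurs
memo.evaluate("ld", 0x00AE003B, 0x01FC0012)    # miss
memo.evaluate("ld", 0x00AE1111, 0x01FC2222)    # hit: operand B is masked out
print(memo.hits, memo.misses)
```

The last call shows why filtering raises the hit rate: two loads with different operand B values map to the same key.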
90
SimPoint Analysis
After marshalling all optimization techniques, we achieve approximately 1000 instructions per second
We deploy a recently developed simulation sampling technique, SimPoint
• Uses Basic Block Distribution Analysis to extract representative samples (10 million insts) of the original benchmark (1 billion insts)
• Drastically reduces the number of instructions observed to characterize the program’s performance
• Error analysis indicates an error of less than 10% (typically less than 3%) for a wide variety of benchmarks
A full benchmark execution can be completed within 5 hours
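A back-of-the-envelope calculation shows why sampling matters at this simulation speed. The sketch uses the figures quoted on the slide (about 1000 instructions per second, 10 million instruction samples, 1 billion instruction benchmarks); the slide does not say how many samples are simulated per benchmark, so the single-sample case below is only an illustration.

```python
INSTS_PER_SECOND = 1000            # simulation speed quoted on the slide
FULL_BENCHMARK = 1_000_000_000     # instructions in the original benchmark
SIMPOINT_SAMPLE = 10_000_000       # instructions in one representative sample

def hours(instructions, rate=INSTS_PER_SECOND):
    """Wall-clock hours needed to simulate the given number of instructions."""
    return instructions / rate / 3600

print(f"full benchmark: {hours(FULL_BENCHMARK):.0f} hours "
      f"(about {hours(FULL_BENCHMARK) / 24:.1f} days)")
print(f"one sample:     {hours(SIMPOINT_SAMPLE):.1f} hours")
```

Simulating every instruction would take well over a week per benchmark, while one sample fits comfortably inside the 5-hour figure quoted above.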
91
Circuit-aware simulation to evaluate Razor
Initial simulators utilized a hand-generated EX-stage circuit model
• insufficient performance
Challenge: instruction latency (in cycles) depends on circuit evaluation latency
• Cycle count may vary with input, voltage, temperature, process variation
Circuit-Aware Architectural Simulation combines architectural and circuit simulation
• SimpleScalar architectural-level simulation
• Gate-level timing simulation of per-stage logic blocks
92
Case Study: Razor Timing Speculation
[Chart: benchmark GCC; relative performance and energy (0 to 1.4) versus voltage (1.8 down to 1.2); series: REL_ENERGY, REL_PERF]
93
Case Study: Razor Timing Speculation
[Chart: benchmark GCC; voltage (0.2 to 2) and error rate (0.0% to 60.0%) versus cycles]
94
Circuit-Aware simulation in summary
Challenge is integration between architectural simulation and circuit simulation
• Must balance fidelity and speed of simulation
Three optimizations were utilized to enhance speed of simulation
• Constraint-based Pruning
• Circuit Simulation Memoization
• SimPoint Simulation Sampling
Demonstrated with Razor pipeline simulations
• Razor architecture reacts cycle-by-cycle to circuit-level phenomena
95
Outline
• Synthesis - Typical-Case Optimized Adders
• Performance/verification - Circuit-Aware simulation
• Verification - Beta-release designs
96
Beta-Release Designs
Traditional verification stalls launch until debug complete
Checked processor verification could overlap with launch
• Beta-release when checker works
• Launch when performance stable
• Step as needed without recalls
[Charts: design errors (0 to 20) and performance (0 to 100) over time for traditional verification (tape out, then launch) and for checked processor verification (tape out, beta, launch, step)]
97
Additional CAD Opportunities
For synthesis:
• Typical-case library characterization (e.g., pdf of delay)
• Synthesize design for target performance, power, etc…
• TCO-style optimizations possible for macro-modules
For verification:
• Full formal verification for checker components
• Profile-directed simulation-based verification for core
For testing:
• Checker component can facilitate software-based manufacturing test of core components
98
Open discussion
99
Conclusion
100
Conclusions
Better than worst-case design abandons traditional worst-case design constraints
Couples complex designs with checkers
Enables CAD opportunities for typical-case optimization
Requires tool support for observability, synthesis and verification
For more information: http://www.eecs.umich.edu/razor
101
References
• Todd Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” ACM/IEEE 32nd Annual Symposium on Microarchitecture (MICRO-32), November 1999.
• D. Ernst, N. S. Kim, S. Das, S. Pant, T. Pham, R. Rao, C. Ziesler, D. Blaauw, T. Austin, T. Mudge, and K. Flautner, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” in the 36th Annual Int’l Symposium on Microarchitecture (MICRO-36), December 2003.
• Shanbhag, N.R., “Reliable and efficient system-on-chip design,” Computer, Vol.37, Iss.3, Mar 2004.
• Uht, A.K., “Going beyond worst-case specs with TEAtime,” Computer, Vol.37, Iss.3, Mar 2004.
• Austin, T.; Blaauw, D.; Mudge, T.; Flautner, K., “Making typical silicon matter with Razor,” Computer, Vol.37, Iss.3, Mar 2004.
• S.-L. Lu, “Speeding up processing with approximation circuits,” Computer, Vol.37, Iss.3, Mar 2004.
• Worm, F.; Ienne, P.; Thiran, P.; DeMicheli, G., “A Robust Self-Calibrating Transmission Scheme for On-Chip Networks,” IEEE Trans. on VLSI Systems, Vol. 13, Iss. 1, January 2005.
• T. Kehl, “Hardware self-tuning and circuit performance monitoring,” in Proceedings of International Conference on Computer Design, 1993.
• L. Anghel and M. Nicolaidis, “Cost reduction and evaluation of temporary faults detecting technique,” in Proceedings of the conference on Design, automation and test in Europe (DATE-2000), March 2000.
102
Supplemental Materials
103
More Details on Meta-Stability
Sub-critical operation invites meta-stability
• Meta-stability detector itself can become meta-stable
• Double latch error signal to obtain sufficiently small probability
• Flush entire pipe
• No forward progress
• Reduce frequency
[Figure: Razor flip-flop (D, Q, clk, clk_b, clk_del, clk_del_b) with a meta-stability detector; pos and neg detector outputs feed a dynamic OR / latch that produces the error and fail signals; outputs labeled restore, bubble, flush]
104
Razor Short Path Constraint
Double-sampling metastability-tolerant latches detect timing errors
• Second sample correct-by-design, use guarantees forward progress
Microarchitectural support restores correct program state
• Timing errors treated in the same way as branch mispredictions
[Figure: main flip-flops clocked by clk and shadow latch clocked by clk_del; hold constraint of about 1/2 cycle]
105
Overcoming Short Path Constraints
Delayed clock imposes a short-path constraint
• Razor necessary only for latches on slow paths
• Pad fast path for latches with mixed path delays
• Trade-off between DVS headroom and short path constraints
Min. Path Delay > tdelay + thold
[Figure: long paths end in a Razor_ff and short paths in an ordinary ff; short paths are padded with extra delay; timing diagram of clock and clock_del showing tdelay, thold, the min. path delay, the intended path, and the short path]
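The short-path constraint translates directly into a padding calculation. The sketch below works in integer picoseconds to avoid floating-point comparisons; the function name and the example numbers are hypothetical.

```python
def short_path_padding_ps(min_path_delay_ps, t_delay_ps, t_hold_ps):
    """Extra delay (in ps) to insert so that min path delay > tdelay + thold.

    Returns 0 when the path already satisfies the constraint. With integer
    picoseconds, the strict inequality needs one more picosecond than the sum."""
    required = t_delay_ps + t_hold_ps
    return max(0, required - min_path_delay_ps + 1)

# Hypothetical numbers: 500 ps clock delay, 50 ps hold time.
print(short_path_padding_ps(300, 500, 50))   # fast path: must be padded
print(short_path_padding_ps(900, 500, 50))   # slow path: no padding needed
```

A larger tdelay gives more room to detect late data at reduced voltage but increases the padding required on short paths, which is the trade-off between DVS headroom and the short-path constraint.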
106
Power Overhead of the Razor Flip-Flop
38% error-free latch overhead (assuming 20% switching activity)
42% latch overhead with errors (20% switching, 1% error rate)
Overhead mitigated by latch-frugal architecture

Error Free Operation
  Standard Flip-Flop Energy (static/switching)   49fJ / 125fJ
  RFF Energy (static/switching)                  60fJ / 203fJ
Error Detection and Recovery
  Energy of RFF per error event                  260fJ
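The 38% and 42% overheads can be reproduced from the per-flip-flop energies. The weighting below is an assumption, not stated on the slide: a flip-flop is taken to spend the "switching" energy in 20% of cycles and the "static" energy in the remaining 80%, with the per-error energy added at the 1% error rate. It is shown because it matches the quoted percentages.

```python
# Energies in fJ, read from the slide.
STD_FF = {"static": 49.0, "switching": 125.0}
RAZOR_FF = {"static": 60.0, "switching": 203.0}
ERROR_EVENT_FJ = 260.0

def energy_per_cycle(ff, activity, error_rate=0.0, error_energy=0.0):
    """Average energy per cycle under the assumed weighting:
    non-switching cycles cost ff['static'], switching cycles cost
    ff['switching'], and each error event adds error_energy."""
    return ((1 - activity) * ff["static"]
            + activity * ff["switching"]
            + error_rate * error_energy)

base = energy_per_cycle(STD_FF, 0.20)
error_free = energy_per_cycle(RAZOR_FF, 0.20) / base - 1
with_errors = energy_per_cycle(RAZOR_FF, 0.20, 0.01, ERROR_EVENT_FJ) / base - 1
print(f"error-free overhead:  {error_free:.0%}")
print(f"overhead with errors: {with_errors:.0%}")
```

These are overheads per flip-flop; since only a fraction of the flip-flops in a design need to be Razor flip-flops, the pipeline-level overhead is much smaller.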
107
Simultaneous Events
Cancel a pair of close events that may cause a software static glitch, which cannot occur in real circuits; source of inaccuracy
[Figure: NOR gate with inputs A and B, output F, transistor widths WP and WN, load CL=10pF, between Vdd and GND; simultaneous events on A and B produce a software glitch on F]
108
Accuracy of Simulators
Accuracy of simulation
• Validation against a set of SPICE simulations with a number of circuit topologies at varied voltages and input slew rates
• Error rates are consistently less than 11%, with most less than 3%
The initial speed of simulation without optimization is 150 instructions per second (comparable to VCS)
VCS stands for Verilog Compiler Simulator
109
[Figure: the same instruction sequence shown under Core Processor Execution and Checker Execution]
ld f1,(X)
f4 = f1 * f2 + f3
br f4 < 0, skip
r8 = r8 + 1
skip: ...
Core Processor Execution: ld (cache miss), * (long operation), +, br (misprediction), +
Checker Execution: ld, *, +, br, + (each marked ok)
Speeding the Checker with Core Computation
Checker executes in wake of core
• Leverages non-binding predictions & prefetches
Virtually no stalls remain to slow checker
• Control hazards resolved during core execution
• Data hazards eliminated by prefetches and input value predictions
Complex microarchitectural structures only necessary in core
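The checking idea above can be sketched in software as an in-order pass over the results the core has already produced. This is a simplified illustration, not the checker design from the slides: the trace format, the two-operation instruction set, and the function name are hypothetical, and the inputs are taken directly from the core's trace rather than from checker state.

```python
import operator
from typing import Callable, Dict, List, Tuple

OPS: Dict[str, Callable[[float, float], float]] = {"+": operator.add, "*": operator.mul}

# One entry per core instruction: (op, input a, input b, result claimed by the core)
CoreResult = Tuple[str, float, float, float]


def check_in_order(trace: List[CoreResult]) -> Tuple[List[float], int]:
    """Recompute each core result in program order; return committed values and fault count."""
    committed: List[float] = []
    faults = 0
    for op, a, b, core_value in trace:
        checked = OPS[op](a, b)   # simple recomputation, inputs already available
        if checked != core_value:
            faults += 1           # core was wrong: the checker's value is committed
        committed.append(checked)
    return committed, faults


# The second core result (8.0) is faulty; the checker commits 7.0 instead.
print(check_in_order([("*", 2.0, 3.0, 6.0), ("+", 6.0, 1.0, 8.0)]))
```

Because every input is already present when the checker runs, the loop never waits on a miss or a misprediction, which is why the checker itself can stay simple.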
110
Motivating Observations
Speculative execution is fault-tolerant
• Design errors, timing errors, and electrical faults only manifest as performance divots
• Correct checking mechanism will fix errors
What if all computation, communication, control, and progress were speculative?
• Any incorrect computation fixed - maximally speculative
• Any core fault fixed - minimally correct
[Figure: branch predictor array indexed by PC with a stuck-at fault (X) that yields an "always not taken" prediction]
111
Optimized System Architecture
Performance impacts eliminated
• Checker RF allows core commit
• No storage hazards
• Few checker cache misses
• Less expensive core storage architecture (same as baseline)
Core cache failures affect checker
Slowdowns: -0.4% (Best: -3.2%, Worst: 0.2%)
[Figure: Core and Checker pipelines (IF, ID, REN, REG, SCHEDULER, EX/MEM, CHK, CT) with two register files (RF, 8 ports each), L0 Data (4 KB, 4 ports), L0 Inst (0.5 KB, 2 ports), L1 D-cache (64 KB, 2 ports), L1 I-cache (64 KB, 1 port), and L2 Unified Cache (256 KB, 1 port)]
112
Fully Decoupled System Architecture
Checker fully decoupled
• Core L1 caches may fail
• All L2 writebacks from checker
• Core caches flushed on fault
• Core accesses and misses warm up checker caches
Eliminates common mode core cache failures
• But, generates more L2 traffic
• Further optimizations possible
Slowdowns: 1.2% (Best: 0%, Worst: 6.7%)
[Figure: Core and Checker pipelines (IF, ID, REN, REG, SCHEDULER, EX/MEM, CHK, CT) with two register files (RF, 8 ports each), L1 Data (4 KB, 4 ports), L1 Inst (0.5 KB, 2 ports), L1 D-cache (64 KB, 2 ports), L1 I-cache (64 KB, 1 port), L2 Unified Cache (256 KB, 1 port), and a prefetch stream]
113
Intelligent Energy Management
Slack detector
• Automatic tuning mechanism
- ARM’s Intelligent Energy Manager (IEM)
- Processor voltage automatically tuned to external ambient conditions
- Inverter chain designed to track most restrictive critical path, margin still required
[Figure: chip floorplan with blocks labeled L2 Cache (two blocks), control, Floating point and graphics, Data cache, Cache control, L2 tags, Ex Unit, Control Unit, IO Unit, Mem Control]
114
EX-Stage Analysis – Optimal Voltage Sweep
[Figure: GCC; relative IPC and energy (0.3 to 1.5) versus voltage (0.6 to 1.8); series: Rel Energy, Rel Performance; annotation: 1.62% Error Rate, 24% Energy Savings]
115
Simulation Results: Energy-Optimal Voltage
[Figure: relative energy (%), 0 to 100, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr, and Average; series: Total Energy, IPC]
116
Low-Cost SER and Noise Protection
Only need to address transients in checker
• Checker detects and corrects noise-related faults in core
• Core processor designed without regard to strikes (e.g., no ECC…)
Recycle checker inputs on suspected core fault
• If no error on third execution, transient strike in checker processor
• If error on third execution, core processor fault occurred (e.g., SER, design error)
Protect critical checker control with triple-modular redundant (TMR) logic
• TMR on simple control results in only 1.3% larger checker (synthesized design)
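The recycle rule above amounts to a three-way decision. The sketch below is a software illustration under stated assumptions: `recompute` stands in for one pass through the checker with the same inputs, and the function name and return labels are hypothetical.

```python
from typing import Callable


def classify_fault(core_result: int, recompute: Callable[[], int]) -> str:
    """Compare the core result with the checker, recycling inputs once on a mismatch."""
    if recompute() == core_result:
        return "ok"
    # Suspected core fault: recycle the checker inputs for a third execution.
    if recompute() == core_result:
        return "checker transient"   # no error on third execution
    return "core fault"              # error again on third execution


# A checker that is upset once (returns 5), then computes the right value (4).
checker_outputs = iter([5, 4])
print(classify_fault(4, lambda: next(checker_outputs)))  # prints "checker transient"
print(classify_fault(9, lambda: 4))                      # prints "core fault"
```

The rule relies on a transient strike being unlikely to hit the checker twice in a row in the same way, so a repeated mismatch is attributed to the core.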
[Figure: core pipeline (IF, ID, REN, REG, SCHEDULER, EX/MEM) followed by checker stages (CHK IF, CHK ID/REG, CHK EX, CHK MEM, CT), with three CTL blocks and a "3rd opinion" path]
117
Fully Testable Microprocessor Designs
Checker structure facilitates manufacturing tests
• All checker inputs exposed to built-in-self-test logic
• Checker provides built-in test signature compression
Checker can be fully tested with small BIST module
• Less than 0.5% area increase
Reduces burden of testing on core
• Missed core defects corrected
• Checker acts as core tester
[Figure: checker stages IF, ID, EX, MEM, CT, and WT with comparators (=inst, =regs, =res/addr) on PC, inst, regs, addr, and result, connected to I-cache, D-cache, and RF; a "BIST ROM and Control" block and OK/result outputs answering "Defect Free?"]
118
Error Rate and Normalized Energy Savings for Chip1
[Figure: percentage error rate (log scale, 1E-8 to 10) and normalized energy (0.80 to 1.20) versus voltage (1.48 to 1.64 V), with curves for 120MHz and 140MHz]
119
Measured Power and Energy (120MHz, 27C)
Energy per Instruction (pJ) = Power/IPC/Freq
Chip | Point of First Failure: Power (mW), Energy per Instruction (pJ) | Point of 0.1% Error Rate: Power (mW), Energy per Instruction (pJ)
Chip1 | 104.5, 870 | 89.7, 740
Chip2 | 119.4, 990 | 99.6, 830
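The energy-per-instruction formula (Power/IPC/Freq) can be written out with explicit unit conversions. This is a minimal sketch; the function name is hypothetical, and the IPC value of 1.0 used in the example is an assumption, since the slide does not state the measured IPC.

```python
def energy_per_instruction_pj(power_mw: float, ipc: float, freq_mhz: float) -> float:
    """Energy per instruction in pJ, computed as Power / IPC / Freq."""
    joules = (power_mw * 1e-3) / ipc / (freq_mhz * 1e6)
    return joules * 1e12


# Assumed IPC of 1.0 at the slide's 120 MHz operating point.
print(energy_per_instruction_pj(99.6, 1.0, 120.0))  # about 830 pJ
```

With that assumed IPC, the other measured power values also land within roughly 1% of the tabulated energies.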