Post on 06-Oct-2020
Better Than Worst-Case Design
Todd Austin
University of Michigan
austin@umich.edu
Acknowledgements: Valeria Bertacco, David Blaauw, Shidhartha Das, Dan Ernst, Nam Sung Kim, Seokwoo Lee, Trevor Mudge, Chris Weaver, Kris Flautner & ARM Ltd, Robert Colwell
Challenges in the Nanometer Regime
Design complexity
  Billions and billions of transistors lead to untenable designs…
Device-level faults in logic and memory
  Cosmic rays, alpha particles, gate wear-out, silicon defects, etc…
Uncertainty in design parameters
  Process and temperature variation, supply noise…
Power/performance demands
  Bounding performance, area, and battery life
Traditional Worst-Case Design
[Figure: design-time verification and optimization, with L-to-H scales for time-to-market and performance.]
Better Than Worst-Case Design
[Figure: run-time verification by online checker hardware combined with typical-case optimization, with L-to-H scales for time-to-market and performance.]
Presentation Agenda
BTWC Design Examples
  DIVA Checker
  Razor Logic
BTWC Design Opportunities and Challenges
  Typical-Case design Optimization (TCO)
Conclusion
Example BTWC Design: DIVA Checker [MICRO ‘99]
All core function is validated by checker
  Simple checker detects and corrects faulty results, restarts core
Checker relaxes burden of correctness on core processor
  Tolerates design errors, electrical faults, defects, and failures
  Core has burden of accurate prediction, as checker is 15x slower
Core does heavy lifting, removes hazards that slow checker
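[Figure: the core pipeline (IF, ID, REN, REG, SCHEDULER, EX/MEM) passes speculative instructions, in order with PC, inst, inputs, and addr, to the checker stages (CHK, CT). The core is labeled Performance and the checker Correctness; the checker is the online checker hardware.]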
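The check-and-correct idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DIVA implementation: the two-instruction toy ISA, the `CoreResult` record, and the function names are hypothetical, and the sketch checks only results, whereas the real checker also verifies the PC, instruction, and operands.

```python
from dataclasses import dataclass


@dataclass
class CoreResult:
    """One entry of the core's prediction stream (hypothetical layout)."""
    pc: int
    op: str       # "add" or "sub" in this toy ISA
    a: int        # input operand values supplied by the core
    b: int
    result: int   # result the core claims to have computed


def checker_execute(op: str, a: int, b: int) -> int:
    """The checker's own simple ALU for the toy ISA (32-bit wraparound)."""
    if op == "add":
        return (a + b) & 0xFFFFFFFF
    if op == "sub":
        return (a - b) & 0xFFFFFFFF
    raise ValueError(f"unknown op {op!r}")


def check_and_commit(stream):
    """Re-execute each core result in order and commit the checker's value.

    Returns (committed_results, recoveries). On a mismatch the checker's
    value is committed; a real design would also flush and restart the core.
    """
    committed = []
    recoveries = 0
    for entry in stream:
        good = checker_execute(entry.op, entry.a, entry.b)
        if good != entry.result:
            recoveries += 1  # core restart would be triggered here
        committed.append(good)
    return committed, recoveries


stream = [
    CoreResult(pc=0, op="add", a=2, b=3, result=5),
    CoreResult(pc=4, op="sub", a=9, b=4, result=6),  # faulty core result
]
committed, recoveries = check_and_commit(stream)
```

Because only the checker's values are committed, a fault in the core costs a recovery but never corrupts architected state, which is why the burden of correctness can move from the core to the checker.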
Checker Processor Architecture
[Figure: checker pipeline with stages IF, ID, EX, MEM, and CT, fed by the core processor prediction stream. Each stage compares (=) the core's PC, inst, regs, and res/addr/nextPC against values obtained from the I-cache, RF, and D-cache; other labels include OK, WT, and result.]
Check Mode
[Figure: the same checker pipeline in check mode. IF, ID, EX, and MEM compare (=) the core's PC, inst, regs, and res/addr/nextPC from the prediction stream against values obtained from the I-cache, RF, and D-cache; other labels include OK, WT, and result.]
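Recovery Mode
[Figure: the checker pipeline in recovery mode. IF, ID, EX, MEM, and CT pass PC, inst, regs, res/addr, and result from stage to stage using the I-cache, RF, and D-cache, with no core prediction stream or comparators shown.]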
How Can the Simple Checker Keep Up?
Slipstream
Slipstream reduces power requirements of trailing car
Checker processor executes inside core processor’s slipstream
  fast moving air ⇒ branch predictions and cache prefetches
Core processor slipstream reduces complexity requirements of checker
  Checker rarely sees branch mispredictions, data hazards, or cache misses
REMORA: Physical Checker Design
Physical checker design
  Alpha integer ISA subset
  4-wide checker, 0.5k I-cache, 4k D-cache
Less than 3% slowdown for Alpha core
Only a 6% area overhead incurred
Design also includes:
  Pipelined checker design, simple core
  Clock/voltage tuning infrastructure
  Extensive BIST support
[Figure: area comparison of the Alpha 21264 (205 mm² in 0.25um) and the REMORA checker (12 mm² in 0.25um), with the checker divided into data cache, inst cache, pipeline, and BIST.]
Verifying the Checker Processor
Simple checker permits complete functional verification
  In-order blocking pipelines (trivial scheduler, no rename/reorder/commit)
  No “internal” non-architected state
Fully verified design using Sakallah’s GRASP SAT-solver
  For Alpha integer ISA without exceptions
  With small register file and memory, and small data types
[Figure: a checker model, driven by unspecified core predictions, and a reference model (ISA sim) each produce an output; the outputs are compared (==) for identical state. The formula ϕ is always true if the uArch model == the reference model.]
Motivating Study: Voltage vs. Circuit Error Rate
Circuit Under Test
[Figure: 48-bit LFSRs feed 18x18 multipliers (18-bit inputs, 36-bit outputs) arranged as Slow Pipeline A and Slow Pipeline B, clocked at clk/2, and a Fast Pipeline clocked at clk with a stabilize stage. The outputs are compared (!=) and mismatches are counted in 40-bit error counters.]
Error Rate Studies – Empirical Results
[Figure: error rate (log scale, 0.0000001% to 100%) versus supply voltage (1.14 V to 1.78 V) for an 18x18-bit multiplier block at 90 MHz and 27 C with random data. Marked points: zero margin at 1.54 V, safety margin at 1.63 V, environmental margin at 1.69 V. Annotations: "35% energy savings with 1.3% error", "22% saving", "once every 20 seconds!".]
Error Rate Studies – SPICE-Level Simulations
Based on SPICE-level simulations of a Kogge-Stone adder
[Figure: error rate (log scale, 0.01% to 100%) versus supply voltage (0.6 V to 2 V) for a Kogge-Stone adder at 870 MHz and 27 C, with curves for random, bzip, and ammp data and a 200 mV annotation.]
Another BTWC Design: Razor Logic
Double-sampling latches detect timing errors
  Second sample is correct-by-design
Microarchitectural support restores state
  Timing errors treated like branch mispredictions
Research challenges: metastability and short-path constraints
[Figure: a Razor flip-flop between pipeline stages. The main FF is clocked by clk and the shadow latch by the delayed clock clk_del; the Razor logic is the online checker hardware.]
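A behavioral sketch of the double-sampling idea, written in Python for illustration. It assumes an idealized model in which the delayed (shadow) sample always sees settled data, and it ignores the metastability and short-path constraints the slide lists as research challenges; the function name, parameters, and time units are hypothetical.

```python
def razor_sample(arrival_time: float, clock_period: float,
                 shadow_delay: float, old_value: int, new_value: int):
    """Model one clock edge of a double-sampling (Razor-style) flip-flop.

    arrival_time is when the new data value becomes stable, measured from
    the start of the cycle. Returns (main_value, shadow_value, error).
    """
    if arrival_time > clock_period + shadow_delay:
        raise ValueError("data also misses the shadow latch; not correctable")
    # The main flip-flop samples at the clock edge; late data leaves it stale.
    main = new_value if arrival_time <= clock_period else old_value
    # The shadow latch samples on the delayed clock and is taken as correct.
    shadow = new_value
    # A timing error is flagged whenever the two samples disagree.
    error = main != shadow
    return main, shadow, error


on_time = razor_sample(0.8, 1.0, 0.5, old_value=0, new_value=1)
late = razor_sample(1.2, 1.0, 0.5, old_value=0, new_value=1)
```

In the late case the flagged error is what would trigger the microarchitectural recovery, with the shadow value supplying the correct data.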
Distributed Pipeline Recovery
[Figure: pipeline stages IF, ID, EX, MEM (read-only), and WB (reg/mem) separated by Razor FFs, with a stabilizer FF before WB. Each stage exchanges error, bubble, and recover signals with its neighbors; a Flush Control block drives flushID and restores the PC. A cycle-by-cycle timeline shows inst1 through inst8 moving through the pipeline during recovery.]
Builds on existing branch prediction framework
Multiple cycle penalty for timing failure
Scalable design as all communication is local
Razor Prototype
[Figure: prototype chip, 3.3 mm by 3.0 mm, with labeled Icache, Dcache, RF, and the IF, ID, EX, MEM, and WB pipeline stages.]
Razor Opportunity: Typical-Case Energy Reduction
[Figure: control loop. The pipeline's error signals are summed (Σ) to give Esample, which is subtracted from the reference Eref to give Ediff = Eref - Esample. Ediff drives the voltage control function, which sets the voltage regulator supplying Vdd to the pipeline; a reset input is also shown.]
Energy reduction can be realized with a simple proportional control function
Control algorithm implemented in software
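A minimal sketch of such a proportional control step, using Ediff = Eref - Esample from the slide. The gain, the voltage limits, and the choice of percent as the error-rate unit are assumptions made for illustration; the slides do not give the controller's actual parameters.

```python
def next_vdd(vdd: float, e_ref: float, e_sample: float,
             gain: float = 0.01, v_min: float = 1.4, v_max: float = 1.8) -> float:
    """One proportional control step for the supply voltage.

    e_ref and e_sample are error rates in percent; gain is in volts per
    percent. gain, v_min, and v_max are hypothetical values.
    """
    e_diff = e_ref - e_sample
    # Fewer errors than the target (e_diff > 0) lowers the voltage;
    # more errors than the target (e_diff < 0) raises it.
    vdd = vdd - gain * e_diff
    # Keep the request inside the regulator's assumed range.
    return min(v_max, max(v_min, vdd))


raised = next_vdd(1.60, e_ref=1.0, e_sample=3.0)
lowered = next_vdd(1.60, e_ref=1.0, e_sample=0.0)
```

Calling `next_vdd` once per sampling interval gives a loop that settles near the voltage where the measured error rate matches the reference.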
Voltage Controller Response
Two minute snapshot of a 15 min run
[Figure: controller output voltage (V, 1.48 to 1.80) and percentage error rate (0 to 10) versus time (seconds, 20 to 140).]
Energy/Performance Characteristics
[Figure: energy and pipeline throughput (IPC) versus decreasing supply voltage. Curves: energy of processor operations, Eproc; energy of pipeline recovery, Erecovery; total energy, Etotal = Eproc + Erecovery, with the optimal Etotal marked; and energy of the processor without Razor support. Annotations: 50% and 1%.]
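The trade-off Etotal = Eproc + Erecovery can be explored with a small sweep. Everything numeric in this sketch is an illustrative assumption rather than data from the slides: Eproc is normalized to V², following the usual CMOS dynamic-energy scaling, and the error-rate curve, its steepness, and the recovery cost are hypothetical.

```python
import math


def error_rate(v: float, v_first_fail: float = 1.54,
               steepness: float = 40.0) -> float:
    """Hypothetical error-rate curve: zero at or above the point of first
    failure, rising exponentially as the voltage drops below it."""
    if v >= v_first_fail:
        return 0.0
    return min(1.0, 1e-7 * math.exp(steepness * (v_first_fail - v)))


def total_energy(v: float, recovery_cycles: int = 10) -> float:
    """Etotal = Eproc + Erecovery, in normalized units."""
    e_proc = v * v  # dynamic energy per operation scales with V^2
    # Each error is assumed to cost recovery_cycles operations' worth of energy.
    e_recovery = error_rate(v) * recovery_cycles * e_proc
    return e_proc + e_recovery


# Sweep candidate supply voltages from 1.10 V to 1.80 V in 10 mV steps.
voltages = [round(1.10 + 0.01 * i, 2) for i in range(71)]
best_v = min(voltages, key=total_energy)
```

Under these assumptions the minimum falls below the point of first failure: lowering the voltage keeps saving energy until recovery energy starts to dominate.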
Measured Results
[Figure: Point of 0.1% Error Rate vs. Point of First Failure. Voltage at 0.1% error rate (1.4 V to 1.8 V) plotted against voltage at first failure (1.4 V to 1.8 V) for individual chips, with linear fit y = 0.78685x + 0.22117.]
[Figure: Normalized Energy Savings over First Failure Point at 0.1% Error Rate. Number of chips (0 to 16) versus percentage savings (0 to 30), for Lot0 and Lot1.]
Other Better Than Worst-Case Designs
Algorithmic-Noise Tolerance, Shanbhag et al.
  Converting circuit faults to S/N component
Approximate Circuits, Lu et al.
  Architecture-level speculation on computation
TEAtime Adaptive Clock, Uht et al.
  Adaptive clock control
On-Chip Self-Calibrating Busses, Worm et al.
  Error recovery logic for on-chip busses
Self-Tuning Circuits, Kehl et al.
  Early work on dynamic timing error avoidance
Time Based Transient Fault Detection, Anghel et al.
  Double sampling latches for speed testing
March 2004
BTWC Design Opportunities
Key observation:
  Infrequent faults in the core design are tolerable.
Opportunities:
  Focus only on the critical components, no need to verify ad infinitum
  Optimize performance/power/implementation for the most common scenarios (typical-case optimization)
BTWC Design Opportunity: Typical-Case Optimized Adder
[Figure: Kogge-Stone adder with generate/propagate inputs G0P0 through G15P15 and Cin.]
Carry Propagations for Random Data
[Figure: probability (axis up to 0.01) of carry start and carry propagation, plotted over bit position (0 to 56) and carry distance (0 to 64).]
Carry Propagations for Typical Data
[Figure: probability (axis up to 0.16) of carry start and carry propagation, plotted over bit position (0 to 56) and carry distance (0 to 64).]
Typical Case Optimized Adder
[Figure: adder with generate/propagate inputs G0P0 through G15P15 and Cin, combining a ripple carry circuit with a carry-lookahead circuit.]
Benefits of Typical Case Optimization
Latency (in gate delays)
Adder Topology    Worst-Case    Typical-Case
Kogge-Stone       8             5.08
TCO Adder         16            3.03
Typical-case performance much better than worst case
  Especially for typical-case optimized design
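The gap between typical-case and worst-case latency comes from how far carries actually travel. The sketch below measures that for random operands; defining a chain as a generate bit followed by consecutive propagate bits is an assumption of this sketch, and its results are not the data behind the slides' plots or table.

```python
import random


def longest_carry_chain(a: int, b: int, width: int = 64) -> int:
    """Length of the longest carry propagation chain in a + b.

    A chain starts at a bit that generates a carry (a_i AND b_i) and
    extends through consecutive higher bits that propagate it (a_i XOR b_i).
    """
    longest = 0
    run = 0  # current chain length; 0 means no live carry
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:                # generate: a new chain starts here
            run = 1
        elif (ai ^ bi) and run:    # propagate a live carry one more bit
            run += 1
        else:                      # kill, or nothing to propagate
            run = 0
        longest = max(longest, run)
    return longest


rng = random.Random(0)  # fixed seed so the experiment is repeatable
samples = [longest_carry_chain(rng.getrandbits(64), rng.getrandbits(64))
           for _ in range(10_000)]
average = sum(samples) / len(samples)
```

For random 64-bit operands the average longest chain is far shorter than the 64-bit worst case, which is what allows a simple carry circuit to finish early in the common case.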
Presentation Agenda
BTWC Design Examples
  DIVA Checker
  Razor Logic
BTWC Design Opportunities and Challenges
  Typical-Case design Optimization (TCO)
  Circuit-level observability and system-level performance
Conclusion
BTWC Design Challenge: Observability of Circuit-Level Characteristics
[Figure: an architectural simulator takes the App and Arch Config and produces Output and Arch Metrics, modeling the IF, ID, EX, MEM, WB pipeline with speed and scope. A circuit simulator takes module circuit models and tech models and produces circuit metrics, offering fidelity and observability. The architectural simulator passes inputs, voltage, and constraints to the circuit simulator and receives delay, power, and switching.]
Circuit-Aware Architectural Simulator efficiently melds circuit simulation with architectural simulation
Conclusion
Better than worst-case design abandons traditional worst-case design constraints
Couples complex designs with checkers
  DIVA Checker verifies program computation
  Razor Logic verifies circuit timing
Enables CAD opportunities for typical-case optimization