Blade – A Path Towards Average-Case Silicon via Bundled-Data Resilient Design Peter A. Beerel CHIP...
-
Upload
stephanie-gordon -
Category
Documents
-
view
214 -
download
0
Transcript of Blade – A Path Towards Average-Case Silicon via Bundled-Data Resilient Design Peter A. Beerel CHIP...
Blade – A Path Towards Average-Case Silicon via Bundled-Data Resilient Design
Peter A. Beerel
CHIP in Bahia, Salvador Brazil
Sept 4th, 2015
Collaborations and Acknowledgements
MOTIVATION | 2
PUCRS (Brazil): Ney Calazans, Matheus Moreira, Guilherme Heck, Leandro Heck, Matheus Gibiluka
USCS (Brazil): Frederico Butzke
Tsinghua (China): Benmao Cheng
BITS (India): Ajay Singhvi
USC (USA): Dylan Hand, Fei Huang, Ramy Tadros, Alan Huang, Yang Zhang, Mel Breuer, Danlei Chen, Weizhe Hua, Huimei Cheng, Austin Katzin, Yuqi Song, Zhe Liu, Jun He
Motivation: Delay Overheads
Traditional synchronous design suffers from increased margins• Worse at low and near-threshold regions
MOTIVATION | 2
PVT margin
Clock margin
Flip-flop alignment
Cycle time ofclocked logic
Logic gates
Logic Time
[Dreslinksi et al., IEEE Proc. 2010]
Data delay in Plasma CPU
Motivation: Data Dependent Delays
Delay
Num
ber
of O
pera
tions
Slower
Logic gates
Logic Time
Worst-CaseData
Worst-Case DelayAverage DelayCycle time ofclocked logic
MOTIVATION | 3
Delay variation due to data is rarely exploited in traditional designs
Outline
Resiliency Design and its Pitfalls
Blade Design, Operation, and Special Cells
Blade Analysis and Optimization
Blade CAD Flow and Case Study
Conclusions and On-Going Work
OUTLINE | 5
Resilient Design
Allow and detect timing errors in datapath • Correct via architectural replay or gating/pausing clock
Effective approaches have been elusive!
RESILIENT DESIGN | 5
ShadowSequential
LogicNext PipelineStage
To Control LogicErrorDetection
SequentialGate
CLKCLK = 0
0
CLK = 1
1
The Problem [Beer et al, 2014]
• Flop metastability can propagate to ERR signal
• Metastability in control path can cause system failure
RESILIENT DESIGN | 6
[Bubble Razor, Fojtik, 2013]
Latc
hS
hado
w
FF
Logic Next Pipeline Stage
CLK
X To Control LogicX
Metastability Issue
Handling Metastability
Transition detector may exhibit metastability
• Error signal must go through synchronizer, increasing delay
• Uses architectural replay to recover from errors
Latch
Transition Detector
Logic
To Control Logic
Next Pipeline Stage
CLK
FF FF
Synchronizer[RazorII, Das, 2009]
X 1 or 0
RESILIENT DESIGN | 7
Hold Time Concerns
Relies on latch for error correction
Hold times are problematic
RESILIENT DESIGN | 8
Data In
Latch
MSDetector
FF
STOP
In
OutInternal Data
Comb.Logic
Ring Oscillator Clock Generator
[SafeRazor, Cannizzaro, 2014]
CLKCLK = 1
CLK = 0
Design Template
Sync / Async
MTBF Safe
Avoids Replay Logic
Hold Time
Robust
Low Error Penalty
Bubble Razor
Sync No Yes Yes Yes
Razor II Sync Yes No No No
SafeRazor Async Yes* Yes No Yes
Blade combines the best features of past resiliency schemes
RESILIENT DESIGN | 9
Design Template
Sync / Async
MTBF Safe
Avoids Replay Logic
Hold Time
Robust
Low Error Penalty
Bubble Razor
Sync No Yes Yes Yes
Razor II Sync Yes No No No
SafeRazor Async Yes* Yes No Yes
Blade Async Yes Yes Yes Yes
Resiliency Landscape
Outline
Resiliency Design and its Pitfalls
Blade Design, Operation, and Special Cells
Blade Analysis and Optimization
Blade CAD Flow and Case Study
Conclusions and On-Going Work
OUTLINE | 11
EDL R.dataCombinational
LogicL.data
δ Err
L.ackL.req
LE.reqLE.ack
R.ack
RE.ack
R.req
RE.req
Sam
ple
CLK
Blade StageError Detection Logic
2
BladeController
Δ
Asynchronous Controller
Error Detecting Latch
Single Rail Datapath
Reconfigurable Delay Lines
BLADE TEMPLATE | 11
Blade Template
Blade Template Operation
Send request speculatively before data is guaranteed stable
Timing errors delay handshaking signals
EDLCombinational
Logic
δ
Error Detection Logic
ControllerB
Δ
Error Detection Logic
ControllerA
Δ
EDL
BLADE TEMPLATE | 12
CLK Err
Sam
ple
CLK Err
Sam
ple
EDLCombinational
Logic
δ
Error Detection Logic
ControllerB
Δ
Error Detection Logic
ControllerA
Δ
EDL
Blade Template Operation
Send request speculatively before data is guaranteed stable
Timing errors delay handshaking signals
BLADE TEMPLATE | 12
CLK Err
Sam
ple
CLK Err
Sam
ple
Err = 0
EDLCombinational
Logic
δ
Error Detection Logic
ControllerB
Δ
Error Detection Logic
ControllerA
Δ
EDL
Blade Template Operation
Send request speculatively before data is guaranteed stable
Timing errors delay handshaking signals
BLADE TEMPLATE | 12
CLK Err
Sam
ple
CLK Err
Sam
ple
Err = 1
EDLCombinational
Logic
δ
Error Detection Logic
ControllerB
Δ
Error Detection Logic
ControllerA
Δ
EDL
Positive Hold Margins
BLADE TEMPLATE | 13
CLK Err
Sam
ple
CLK Err
Sam
ple
Err = 0
EDLCombinational
Logic
δ
Error Detection Logic
ControllerB
Δ
Error Detection Logic
ControllerA
Δ
EDL
Positive Hold Margins
BLADE TEMPLATE | 13
CLK Err
Sam
ple
CLK Err
Sam
ple
Handshaking delays create positive hold margin!
Error Detection Logic
C-element stores error signal, which is sampled by Q-Flop
BLADE TEMPLATE | 14
OutIn
Error Detection Logic
Blade Controller
C+
D QLatch
delayX
EDL
Q-Flop
{ {From other Q-FlopstTD
CLK
Err1
Err0
CLK = 0
CE = 0
CLK = 1
CE = 1
Err1
= 0
Err0
= 0
Sam
ple
Sam
ple
= 0
Sam
ple
= 1
Err1
= 1
Metastable-Safe Operation
Q-Flop prevents metastability propagation to control path
BLADE TEMPLATE | 15
OutIn
Error Detection Logic
Blade Controller
C+
D QLatch
delayX
EDL
Q-Flop
{ {From other Q-FlopstTD
CLK
Err1
Err0
CLK = 0
CE = 0
CLK = 1
CE = X
Err1
= 0
Err0
= 0
Sam
ple
Sam
ple
= 0
Sam
ple
= 1
Err0
= 1
Some time later…
Out
CLK
In
Sample
Error Detection Logic
Blade Controller
C+
D Q
delay
Err0
Latch
delayX
EDL
Q-Flop
Err1
From other C-elements
{
{ {
From other Q-Flops
tcomp
tTD
Overhead Reduction
C-element and OR gates amortize overhead over many EDLs
BLADE TEMPLATE | 16
4-Input
C-element
collects output
from 3 EDLs
4-Input
OR gate collects
output from 4
C-elements
Each Q-Flop
covers 12
EDLs
Controller Implementation
BLADE TEMPLATE | 17
Bottom
Left
goL
goR
goD
L.reqL.ack
LE.reqLE.ack
R.reqR.ackRE.reqRE.ack
Err[1] Err[0]Sample CLK
δ
Δ
Δ edo
edi
delayBlade Controller
Right
Three part burst-mode state machine• Implemented using 3D [Yun, 1992]
Bottom
Left
goL
goR
goD
L.reqL.ack
LE.reqLE.ack
R.reqR.ackRE.reqRE.ack
Err[1] Err[0]Sample CLK
δ
Δ
Δ edo
edi
delayBlade Controller
Right
Three part burst-mode state machine• Implemented using 3D [Yun, 1992]
Controller Implementation
BLADE TEMPLATE | 17
L.req+ / LE.req+
LE.ack+ / goR+
goL+ / Lack+ L.req- / LE.req-
LE.ack- / goR-
goL- / L.ack-RE.req+ Err[1]+ /
edo+
edi+ /RE.ack+ goD- edo-
Err[1]- edi- / Err[0]- /
RE.req+ Err[0]+ /RE.ack+ goD-
RE.req- Err[1]+ /edo+
Err[1]- edi- / Err[0]- /
RE.req- Err[0]+ /RE.ack- goD-
edi+ /RE.ack- goD- edo-
goR+ R.ack- / goD+ R.req+
goR- R.ack+ / goD+ R.req-
goD+ / CLK+
delay+ /goL+ Sample+ CLK-
goD- delay- /Sample-
goD+ / CLK+
delay+ /goL- Sample+ CLK-
goD- delay- /Sample-
DELAY ELEMENTS | 18
Delay Elements: MUX-DE
Most popular programmable Delay Elements (DEs)
Delay controlled by number of buffers in the signal path
Binary codewords N. Mahapatra et al., “Comparison and Analysis of Delay Elements,” in MWSCAS, 2002, pp. 473–476.
DELAY ELEMENTS | 19
Delay Elements: One-Hot DCCS
Lengths of the current source transistors increase linearly
• (1L, 2L, 3L, ..., nL)
Replicate inverting stages to balance rise-fall delays
For a particular codeword, only one of the current source transistors is ON [Singhvi et al., ISVLSI 2015]
Delay Elements: Delay Shift Inverter
Simple inverter structure
3 flavors for back-biasing
• For RBB, PBIAS > 1V
and NBIAS < 0V
• Use flavor (b)
Amount of delay shift
controlled by two factors:
• Inverter sizing
• Magnitude of back-bias
voltage
DELAY ELEMENTS | 20
[Singhvi et al., ISVLSI 2015]
Delay Lines and Voltage Scaling
*Delays range: 300ps to 1.4ns *Sized to match rise/fall delays
Voltage Scaled Delay Ratio
• Delay at nominal voltage over delay at near threshold voltage
The problem
• VSDR changes with codeword
• Forces needed margin
Solution
• Replicate and size designs to reduce margins
DELAY ELEMENTS | 23
[Singhvi et al, VLSID 2015]
DELAY ELEMENTS | 21
Comparison of Characteristics
Energy vs Quantization Error Trade-offs Exist Can be Exploited
*DQE: Delay Quantization Error*EPT: Energy Per Transition
*
[Singhvi et al., ISVLSI 2015]
Delay Line Quantification Effects
Quantification effects reduced due to inherenttradeoff between nominal delayδ and error penalty p * Δ
Del
ay
δ p * Δ
𝑑=𝛿+ p∗ ΔRecall:
DELAY LINE QUANTIZATION | 25
Delay Line Quantification in BD
Linear relationship between delay line quantizationand average stage delay in Bundled Data
Del
ay
δ
𝑑=𝛿Bundled Data:
DELAY LINE QUANTIZATION | 26
Bundled Data
Error Detecting Latch: TBTD
Formed by a transition detector and an error latch (an asymmetric C-Element)
TD EL
Transition Detector (TD) and Error Latch (EL)
[Bowman et al, ISSCC 2007]
ERROR DETECTING LATCH | 27
Increased sensitivity Enables safe operation even in the presence of glitches
Transistor size XOR• OX-TD+EL
Use static C-element• SOX-TD+EL
Analyze for glitch sensitivity
[Moreira et al., ISQED 2015]
ERROR DETECTING LATCH | 28
Error Detecting Latch: Glitch Sensitivity
twin
Outline
Resiliency Design and its Pitfalls
Blade Design, Operation, and Special Cells
Blade Analysis and Optimization
Blade CAD Flow and Case Study
Conclusions and On-Going Work
OUTLINE | 29
Resiliency Performance Benefit
ANALYSIS | 30
Data delay in Plasma CPU
Delay
Num
ber
of O
pera
tions
Worst-Case DelayNominal Delay
Key Question: How do we set δ to optimize performance
δ δ + Δ
δ Δ
Delay Models
Our approach
• Analyze the performance of Blade for a variety of delay models
NormalDistribution
Logic Delay
Pro
babi
lity
Logic Delay
Pro
babi
lity
Log-NormalDistribution
Logic Delay
Fre
quen
cy Real WorldDistribution
DELAY MODELS | 31
Definitions• C : Clock Period / Cycle Time• EC : Effective Clock Period• p : Probability of error• : Average delay of Blade stage
Optimal performance achieved by minimizing
Optimal Average-Case Performance
DELAY MODELS | 32
Pro
babi
lity
δ δ + Δ
PR( d ≤ δ )
Pro
babi
lity
δ δ + Δ
1-PR( d ≤ δ )
Probability oferror (p)
𝐶=𝛿+ Δ
𝑑=𝛿+ p∗ Δ
σ
μ
*Assumes backward latency is hidden via latch retiming
popt observations
• Varies between 20%
and 35% for log-
normal distributions
• Significantly higher
than in sync resiliency
• Constant for normal
distributions!
Optimal Probability of Error - popt
DELAY MODELS | 33
Higher Variance
Proof of constant popt
DELAY MODELS | 34
Implication: Tuning of delay line may target fixed probability!
𝐾=𝜇+𝑚∗𝜎𝐾=𝛿+ΔAssume worst case delay per stage is constant
Worst case delay is set by mean, variance, and SER
Systematic Error Rate (ξ) sets the worst-case delay per stage, K
𝑚= 𝑓 (𝜉 )𝜉=1− [𝑃𝑅 {𝑑≤𝐶 } ]𝑁
Recall: 𝑑=𝛿+𝑝∗ Δ¿ (1 −𝑝 )∗𝛿+𝑝∗𝐾For Normal distribution:
Taking inverse error function of both sides:
erf −1 (1− 2𝑝 )=𝛿−𝜇√2𝜎
𝛿=√2𝜎 [erf − 1 (1−2𝑝 )]+𝜇
Rewrite:𝑑=(1−𝑝 )[√2𝜎 [erf −1 (1−2𝑝 )]+𝜇]+ p∗K
𝜕𝑑𝜕𝑝
=(1+𝑦 )[√2𝜕erf −1 (𝑦 )
𝜕 𝑦 ]−√2erf −1 𝑦+𝑚=0
𝑦=1 −2𝑝
Minimize by taking derivative and setting it equal to zero :
Note y and p are independent of σ and μ! m depends only on
Optimal Size of Resiliency Window - Δ
Blade supports maximum Δ of 50% of clock cycleOptimal Δ is larger for designs with high-variance!
OPTIMIZATION | 35
(%)
Comparison to Sync Resiliency
Synchronous• EC set by systematic
error rate
Bubble Razor [Zhang,2014]
Blade
COMPARISON | 36
FF
FF
M S M S
ED
L
ED
L
ED
L
ED
L
N-Stage Rings
Comparison to Sync Resiliency
Synchronous• EC set by systematic
error rate
Bubble Razor [Zhang,2014]
Blade
COMPARISON | 36
Normal Distribution
Synchronous BR
Blade
35% 23%
Comparison to Sync Resiliency
Synchronous• EC set by systematic
error rate
Bubble Razor [Zhang,2014]
Blade
COMPARISON | 36
Normal Distribution
Outline
Resiliency Design and its Pitfalls
Blade Design, Operation, and Special Circuits
Blade Analysis and Optimization
Blade CAD Flow and Case Study
Conclusions and On-Going Work
OUTLINE | 37
Automatic Conversion Flow
Convert single-clock sync RTL design to Blade
Re-uses synchronous EDA tools and libraries
Seamless integration into existing flows
CASE STUDY | 38
Simulation
Blade Design Flow
SyncSynthesis
Latch Retiming
FF to Latch ConversionR
TL
Spe
c
Asy
ncN
etlis
t
Add EDL + Async Control
Synthesize sync RTL design using standard EDA tools
CASE STUDY | 39
Synthesis
FF FF
CLK
SimulationSyncSynthesis
Latch Retiming
Add EDL + Async Control
FF to Latch Conversion
Blade Design Flow
RT
L S
pec
Asy
ncN
etlis
t
SimulationSyncSynthesis
Latch Retiming
FF to Latch Conversion
Add EDL + Async Control
CASE STUDY | 40
Latches
M S M S
Replace flip flops with master-slave latches
Two-phase non-overlapping clocking
CLK1CLK2
SimulationSyncSynthesis
Latch Retiming
Add EDL + Async Control
FF to Latch Conversion
CASE STUDY | 41
Latch Retiming
M M
CLK1
S
Retime latches to spread logic delay across stages
Allow time borrowing to reduce area overhead
CLK2
S
SimulationSyncSynthesis
Latch Retiming
Add EDL + Async Control
FF to Latch Conversion
CASE STUDY | 42
EDL Insertion
M
CLK1
S
Replace non-TB latches with error detecting latches
Not all latches need be error detecting
CLK2
SEDLMEDL TB TB
SimulationSyncSynthesis
Latch Retiming
Add EDL + Async Control
FF to Latch Conversion
CASE STUDY | 43
Async Control
TB
Remove synchronous clock trees
Add Blade controllers and delay lines
TBEDLEDL
BladeController
BladeController
BladeController
BladeController
SimulationSyncSynthesis
Latch Retiming
Add EDL + Async Control
FF to Latch Conversion
CASE STUDY | 44
Simulation
TB
Back annotated SDF simulation using final netlist
TBEDLEDL
BladeController
BladeController
BladeController
BladeController
SimulationSyncSynthesis
Latch Retiming
Add EDL + Async Control
FF to Latch Conversion
Resynthesis
CASE STUDY | 45
EDL
A
LogicEDL
B
EDL
C
EDL
D
EDL
E
EDL
F
Critical Path
Delay > δ
Correlated
EDLDelay > δ
Latch
F
Add set_max_delay from A to F = δ
Latch
E
Delay < δ
Delay < δ
Brute Force Resynthesis
Evaluated hundreds of resynthesis runs• Each run sets a max delay constraint to a single latch
Chose result that led to largest reduction in area and error rate
CASE STUDY | 46
Best run reduced error rate by 39% and EDLs by 27% (-1.9% area)
Best Run122 Correlated EDLs!
MIPS OpenCore
3-stage pipeline
28nm FDSOI @ 666MHz
(w/ ideal clock and Vdd)Type Count
Combinational 11,740
Buf/Inv 1,683
Seq. (Non-RF) 531
RF 2,048
Total 14,319
http://opencores.org/project,plasma
CASE STUDY | 47
Case Study: Plasma
Case Study: Area and Performance
Overall area overhead is 8.4%
CASE STUDY | 48
2%
24%
6%
32%
12%
24%
FF to Latch + EDL
Q-Flops
C-elements
AND/OR Trees
DelayElements
Controllers
Performance increases• Average: 19%• Peak: 42%
Plasma running CoreMark
Performance Comparison with Margins
Must add margins for PVT variation, clock skew / jitter, and/or aging
• Synchronous frequency degraded to accommodate
Margin in Blade design is only imposed when an error occurs
• A 30% error rate reduces impact of margin by ~70%
CASE STUDY | 49
Margin Synchronous Blade (30% ER) % Advantage
0% 666MHz 800MHz 20%
15% 566MHz 764MHz 35%
30% 466Mhz 728MHz 56%
Outline
Resiliency Design and its Pitfalls
Blade Design, Operation, and Special Circuits
Blade Analysis and Optimization
Blade CAD Flow and Case Study
Conclusions and On-Going Work
OUTLINE | 50
Other Related Work
Canary Circuits [Sato, 2007]• Removes some PVT margins
• But cannot take advantage of data dependency
Bundled Data Designs [Sutherland’89, Nowick’97]• Speculative completion sensing exploits some data dependency
• Margins impact performance on every cycle
• No observability of errors
Soft Mousetrap [Liu, 2013]• Hold time constraints remain difficult to meet
CONCLUSIONS | 51
ConclusionsBlade Template
• Achieves higher performance by exploiting data dependency
• Benefits from average vs worst-case MS resolution times
• Reduces impact of margins for PVT variations
• Enables voltage scaling for power savings
• Supported by design flow w/ commercial EDA tools
Plasma Case Study
• Highlights design and CAD techniques for area efficiency
• Achieves 19% increase in performance with 8.4% area overhead
CONCLUSIONS | 52
On-Going WorkDesign
• Further design and analysis of efficient DEs and EDLs
Automated Flow• Support designs specified in System Verilog CSP
• Develop average-case-driven re-synthesis tools
• Automated resilient-aware PnR Flow
Testing Strategy• On-line tuning of delay lines
• Screen chips with variations too large to correct
Applications• Encryption Designs (Triple DES and Light Weight Crypto)
ON-GOING WORK | 53
D. Hand, M. Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li, M. Gibiluka, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Blade - A Timing Violation Resilient Asynchronous Template. ASYNC 2015: 21-28
D. Hand, H.-H. Huang, B. Cheng, Y. Zhang, M. Moreira, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Performance Optimization and Analysis of Blade Designs under Delay Variability. ASYNC 2015: 61-68
Y. Zhang, L. S. Heck, . M. Moreira, D. Zar, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Design and Analysis of Testable Mutual Exclusion Elements. ASYNC 2015: 124-131
M. Moreira, D. Hand, P. A. Beerel, N. Calazans: TDTB error detecting latches: Timing violation sensitivity analysis and optimization. ISQED 2015: 379-383
G. Heck, L. S. Heck, A. Singhvi, M. Moreira, P. A. Beerel, N. L. V. Calazans: Analysis and Optimization of Programmable Delay Elements for 2-Phase Bundled-Data Circuits. VLSI Design 2015: 321-326.
A. Singhvi, M. T. Moreira, R. N. Tadros, N. L. V. Calazans, P. A. Beerel: A Fine-Grained, Uniform, Energy-Efficient Delay Element for FD-SOI Technologies, ISLVSI 2015
P. A. Beerel, N. Calazans: A Path Towards Average Case Silicon using Resilient Bundled Data Design, ECCTD 2015 (Invited)
ON-GOING WORK | 55
Blade Publications (2015)
The Blade Team++
Asynchronous Circuit in Industry
3D Network on chips (STMicroelectronics)
Ethernet Switches (Intel)
Ultra high-speed FPGAs (Achronix)
Low-power smart cards (Tiempo)
Neuromorphic Computing (IBM)
Internet of Things (???)IBM True North Multi-Chip
Neuromorphic System
Tiempo TAM16 - Clockless 16-bit Microcontroller
STMicroelectronics WIOMING 3D-IC
Achronix FPGA. 1.7 M LUTs. 2.1 Gbps IO
Fulcrum/Intel Ethernet Switch Chip
Obrigado!
Do not miss....
22nd IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC)
MAY 8 - 11, 2016 PORTO ALEGRE, BRAZIL
Homepage: http://www.inf.pucrs.br/async2016
We hope to see you there!