A Distributed Stallable Architecture to Handle Delay Variations
Dr. Alberto A. Del Barrio, Complutense University of Madrid
UCLA Stay
• VLSI, Architecture, Synthesis and Technology (VAST) Laboratory
  – http://cadlab.cs.ucla.edu/beta/cadlab/news
• Led by Prof. Jason Cong
• Around 20 students (postdocs, predocs, masters, undergrads, visitors)
• More than 400 papers
• Tool releases, startups
  – xPilot → AutoESL → Vivado HLS (Xilinx)
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
  – Simulator & HW implementation
  – Binding algorithm
How to model delay while considering process variations
[Jung and Kim, ICCAD'07]
Comparisons
• State of the art: Worst Case
  – Overpessimistic
• CODES'09: BTW + Centralized Stallable Arch.
  – Every failure in execution time will stall the whole datapath
    • Many operations finishing their execution at the same time will incur an extra cycle penalty
    • Dynamic behavior escaping the static analysis will stall the datapath
  – Can only recover failures of up to 1 cycle
  – Worse behavior when sharing resources
• Proposal: Distributed Stallable Arch.
An example: Differential Equation Solver (DES)
[DFG of the DES benchmark: ×1, ×2, +5 | ×6, ×3, <8 | −9, ×7, ×4 | −11, +10]
Worst Case vs Best Case
[Schedules: Worst Case vs Better Than Worse Case]
Best Case: 5 cycles of difference. The Distributed Arch. will be close to BC.
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
  – Simulator & HW implementation
  – Binding algorithm
Razor Register
[Diagram: main and shadow registers sampling din, clocked by clk and the delayed clock dclk; a comparator (Comp) raises hit on mismatch; a mux (1/0) selects between them.]
Comparison is performed between the two registers.
Razor Register Chronogram
[Waveforms: D, dclk]
If the inputs change while the dclk shift is happening, the value stored in the shadow register could be dirty.
If a FU is shared, the worst-case delay is allocated for its operations.
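The Razor mechanism above can be illustrated with a cycle-level sketch (my illustration, not the RTL from the slides): the main register samples the combinational output at the clock edge, the shadow register samples the same input a short delay (dclk) later, and a comparator raises `hit` when they differ, in which case the shadow holds the correct late value.

```python
# Cycle-level sketch of a Razor register (hedged illustration).
# The main flop samples din at the clk edge; the shadow flop samples
# the same input at the delayed edge (dclk), after a late-arriving
# value has settled. A mismatch raises `hit`, flagging a timing
# failure. (Caveat from the slides: if din changes again during the
# dclk shift window, the shadow value could itself be dirty.)

class RazorRegister:
    def __init__(self):
        self.main = 0
        self.shadow = 0
        self.hit = False

    def clock(self, din_at_clk, din_at_dclk):
        # din_at_clk: value of din at the main clock edge
        # din_at_dclk: value of the same input at the delayed edge
        self.main = din_at_clk
        self.shadow = din_at_dclk
        # Comparison is performed between the two registers, so a
        # combinational glitch on din after dclk cannot affect it.
        self.hit = (self.main != self.shadow)
        if self.hit:
            # Recover: the shadow holds the correct (late) value.
            self.main = self.shadow
        return self.hit

r = RazorRegister()
assert r.clock(5, 5) is False   # data met timing: no failure
assert r.clock(3, 7) is True    # late arrival: hit raised
assert r.main == 7              # main corrected from the shadow
```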
Literature: Centralized Stallable Architecture
[Block diagram: Buffer, Rz. Input Register, Combinational Logic, Rz. Output Register; FSM Combinational Logic with Rz. State Register and Rz. Stab. Register]
By Cong et al., CODES'09.
Problems: FU sharing restricts the possibilities of the design; the worst-case timing must be allocated when it happens.
Deals with process variations.
Better Than Worse Case (unconstrained)
Additional slack guarantees ops. 4 and 7 to be correct.
Better Than Worse Case (unconstrained)
If only operations 4 and 7 have problems, it's OK.
Better Than Worse Case (unconstrained)
The failure of operation 5 was not considered in the static analysis.
Every failure is translated into an extra cycle.
The 1st iteration finishes after 11 cycles (1 failure).
My approach: Distributed Stallable Architecture
[Block diagram: Buffer, Raz. Input Register, Combinational Logic, Rz. Output Register; N local controllers (FSM CL 1 … FSM CL N), each with its own state register (St Reg 1 … St Reg N) and Rz. Stab. Register, coordinated by a Commit Signals Logic Unit]
Distributed Architecture, by Del Barrio et al., DATE'10 and TCAD (March 2011).
The controller is split into several local controllers, plus a coordinator responsible for checking hazards dynamically.
But how to integrate it with Razor Registers?
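The difference between centralized and distributed stalling can be sketched at the cycle level (a simplified model of my own, not the slides' implementation): each FU has a local FSM that repeats its current state on a timing failure, while in the centralized scheme a single failure stalls every FU.

```python
# Hedged sketch of distributed vs. centralized stalling. Each FU has a
# local FSM (a program counter over its state sequence); on a Razor hit
# only that FSM repeats its current state, while the others advance.
# A centralized controller would instead stall the whole datapath.
# (This model ignores data hazards, which the coordinator checks.)

def run(schedules, hits, distributed=True):
    """schedules: per-FU list of states; hits: set of (fu, state)
    pairs that suffer a timing failure once. Returns the number of
    cycles needed for all FUs to drain their schedules."""
    pc = [0] * len(schedules)
    pending = set(hits)
    cycles = 0
    while any(pc[i] < len(s) for i, s in enumerate(schedules)):
        cycles += 1
        failed = [i for i, s in enumerate(schedules)
                  if pc[i] < len(s) and (i, s[pc[i]]) in pending]
        for i, s in enumerate(schedules):
            if pc[i] >= len(s):
                continue                        # this FU is done
            if (i, s[pc[i]]) in pending:
                pending.discard((i, s[pc[i]]))  # retry next cycle
            elif distributed or not failed:
                pc[i] += 1                      # advance locally
            # centralized: everyone stalls whenever anyone fails
    return cycles

# One failure on the multiplier, while the adder has a longer chain:
scheds = [["m1", "m2"], ["a1", "a2", "a3"]]
hits = {(0, "m2")}
print(run(scheds, hits, distributed=True))   # → 3 (only FU 0 stalls)
print(run(scheds, hits, distributed=False))  # → 4 (everyone stalls)
```

The distributed version hides the failure inside the slack of the failing FU, which is exactly the latency advantage the slides claim over the CODES'09 centralized scheme.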
Usefulness of a Distributed Architecture
[Schedules over csteps 1–4: BTW static schedule (unconstrained) vs priority-list static schedule (unconstrained), over the DES nodes ×1, ×2, +5, ×6, ×3, <8, −9, ×7, ×4, −11, +10]
Operations 7 and 4 are covered: if a failure happens, it will have no impact on the latency.
Usefulness of a Distributed Architecture
[Execution traces over cycles 1–4 with Razor failures (R): BTW execution example vs distributed execution example, where each failing FU stalls individually (1-stall … 4-stall)]
But what if more failures happen? BTW could not be enough.
Distributed Architecture: Best Case Static Scheduling (unconstrained)
Distributed Architecture
We schedule considering the Best Case, but the datapath is able to reschedule on the fly; besides, some failures can be hidden.
If there are no more failures, the 1st iteration will finish in 10 cycles, besides hiding 2 failures.
Simulator Results

Unconstrained (average latency):
Benchmark    Codes BTW   DisM Barrier   DisM
DiffEq       10.5        10.394         6.31
ARF [13]     18.79       18.379         14.103
FFT [12]     11.57       12.218         11.844
FIR16 [12]   20.65       20.342         16.074
EWF [13]     12.892      12.829         12.316

RC-constrained (4+, 4*), average latency:
Benchmark    Codes BTW   DisM Barrier   DisM
DiffEq       10.66       10.374         8.305
ARF [13]     23.32       20.318         17.825
FFT [12]     16.51       14.92          14.38
FIR16 [12]   24.27       21.22          16.6
EWF [13]     16.1        12.79          12.47

Unconstrained: Codes and DisM Barrier have similar results; DisM reduces latency by 17%.
RC-constrained: DisM Barrier and DisM reduce latency by 12% and 23%, respectively.
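The 17% figure can be sanity-checked against the unconstrained table, assuming (my reading, not stated on the slide) it is the arithmetic mean of the per-benchmark latency reductions of DisM over Codes'09:

```python
# Sanity check of "DisM reduces 17% latency" from the unconstrained
# table, assuming the figure is the mean of per-benchmark reductions
# vs. Codes'09 BTW (an assumption on my part).
codes = {"DiffEq": 10.5, "ARF": 18.79, "FFT": 11.57,
         "FIR16": 20.65, "EWF": 12.892}
dism = {"DiffEq": 6.31, "ARF": 14.103, "FFT": 11.844,
        "FIR16": 16.074, "EWF": 12.316}
reductions = [1 - dism[b] / codes[b] for b in codes]
mean = sum(reductions) / len(reductions)
print(f"mean reduction: {mean:.1%}")   # ≈ 17.8%, close to the slide's 17%
```

Note that FFT is actually slightly slower under DisM, so the average hides one benchmark where the distributed scheme does not pay off.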
Implementation Results (P & R)
The Distr. Arch. implements modulo scheduling dynamically, as it executes operations when ready.
Comparison is performed between two registers, so there is no problem with FU sharing: din (combinational) will not influence the comparison result.

                                              Time                                    Area
Toolchain                     Resources  Latency  Cycle Time (ns)  Ex. Time (ns)  DSPs  LUTs  Regs  %DSPs  %LUTs  %Regs
10 ns Vivado+NP*+Xilinx 14.4  (2*,2+)    4        8.165            32.66          2     135   71    0.26   0.04   0.05
5 ns Vivado+NP*+Xilinx 14.4   (2*,2+)    5        9.003            45.015         2     157   73    0.26   0.05   0.05
Alb+Xilinx 14.4               (2*,2+)    3        8.7              26.1           2     246   118   0.26   0.08   0.08
Alb tuned+Xilinx 14.4         (2*,2+)    3        9.535            28.605         2     204   86    0.26   0.07   0.06
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
  – Simulator & HW implementation
  – Binding algorithm
Binding Problem: bad binding
[DFG scheduled over csteps 1–4: ×1, ×2, +5 | ×6, ×3, <8 | −9, ×7, ×4 | −11, +10]
The hazard between 8 and 9 stalls several components of the graph.
[Per-FU state tables (M1, M2, A1, A2) over clock cycles, and the issued/committed execution trace for the bad binding]
Binding Problem: good binding
[DFG scheduled over csteps 1–4: ×1, ×2, +5 | ×6, ×3, <8 | −9, ×7, ×4 | −11, +10]
The hazard between 8 and 10 is less damaging than the one between 6 and 4, because of the extra cstep. The cost function depends on the two bound operations.
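The pairwise cost idea can be sketched as a greedy binding pass (all names and the exact cost shape are my illustration, not the thesis algorithm): two operations bound to the same FU are more dangerous the closer their csteps are, since a Razor failure on the first directly delays the second, so the cost weights pairs by cstep distance.

```python
# Hedged sketch of a cost-driven greedy binding pass. Two operations
# bound to the same FU are penalized more the closer their csteps are
# (the 6/4 hazard on the slide vs. the milder 8/10 one, which has an
# extra cstep of slack in between).

def pair_cost(cstep_a, cstep_b):
    gap = abs(cstep_a - cstep_b)
    return 0 if gap == 0 else 1.0 / gap   # closer csteps cost more

def greedy_bind(ops, n_fus):
    """ops: dict op -> cstep. Returns dict op -> FU index."""
    binding = {}
    for op, cs in sorted(ops.items(), key=lambda kv: kv[1]):
        best_fu, best_cost = None, None
        for fu in range(n_fus):
            # Cost of adding op to fu = sum of hazard costs against
            # ops already bound there (same-cstep sharing infeasible).
            mates = [ops[o] for o, f in binding.items() if f == fu]
            if cs in mates:
                continue
            cost = sum(pair_cost(cs, m) for m in mates)
            if best_cost is None or cost < best_cost:
                best_fu, best_cost = fu, cost
        binding[op] = best_fu
    return binding

# Four multiplications on two multipliers (csteps as on the slide).
b = greedy_bind({"x1": 1, "x2": 1, "x6": 2, "x3": 2}, 2)
# Same-cstep ops never share a FU; cross-cstep sharing picks the
# pairing that the cost function rates cheapest.
assert b["x1"] != b["x2"] and b["x6"] != b["x3"]
```

An ILP or network-flow formulation (mentioned later in the deck) would optimize this same pairwise cost globally instead of greedily.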
[Per-FU state tables (M1, M2, A1, A2) over clock cycles, and the issued/committed execution trace for the good binding]
State of the Research
• Simulator: OK
• Implementation:
  – Sharing problem not solved yet
• Binding algorithm:
  – Greedy version: OK
  – ILP formulation: difficult to model, not working
  – Network flow formulation: possible target
• Study of controller granularity:
  – 1 FSM per FU
  – 1 FSM per operation cluster … but how to define an operation cluster?