A Distributed Stallable Architecture to Handle Delay Variations
Dr. Alberto A. Del Barrio, Complutense University of Madrid
UCLA Stay
• VLSI, Architecture, Synthesis and Technology (VAST) Laboratory
  – http://cadlab.cs.ucla.edu/beta/cadlab/news
• Led by Prof. Jason Cong
• Around 20 students (postdocs, predocs, masters, undergrads, visitors)
• More than 400 papers
• Tool releases, startups
  – xPilot → AutoESL → Vivado HLS (Xilinx)
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
  – Simulator & HW implementation
  – Binding algorithm
How to model delay while considering process variations
[Jung and Kim, ICCAD'07]
Comparisons
• State of the art: Worst Case
  – Overpessimistic
• CODES'09: BTW + Centralized Stallable Arch.
  – Every failure in execution time will stall the whole datapath
    • Many operations finishing their execution at the same time will incur an extra cycle penalty
    • Dynamic behavior escaping the static analysis will stall the datapath
  – Can only recover failures of up to 1 cycle
  – Worse behavior when sharing resources
• Proposal: Distributed Stallable Arch.
An example: Differential Equation Solver (DES)
[DFG of the DES benchmark: ×1, ×2, +5 | ×6, ×3, <8 | −9, ×7, ×4 | −11, +10]
Worst Case vs Best Case
[Schedules: Worst Case vs Better Than Worse Case]
Best Case: 5 cycles of difference. The Distributed Arch. will be close to BC.
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
  – Simulator & HW implementation
  – Binding algorithm
Razor Register
[Diagram: main and shadow registers sampling din, clocked by clk and the delayed clock dclk; a comparator (Comp) raises hit on mismatch; a mux (1/0) selects between them.]
Comparison is performed between the two registers.
Razor Register Chronogram
[Waveforms: D, dclk]
If the inputs change while the dclk shift is happening, the value stored in the shadow register could be dirty.
If a FU is shared, the worst-case delay is allocated for its operations.
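The Razor mechanism above can be illustrated with a cycle-level sketch (my illustration, not the RTL from the slides): the main register samples the combinational output at the clock edge, the shadow register samples the same input a short delay (dclk) later, and a comparator raises `hit` when they differ, in which case the shadow holds the correct late value.

```python
# Cycle-level sketch of a Razor register (hedged illustration).
# The main flop samples din at the clk edge; the shadow flop samples
# the same input at the delayed edge (dclk), after a late-arriving
# value has settled. A mismatch raises `hit`, flagging a timing
# failure. (Caveat from the slides: if din changes again during the
# dclk shift window, the shadow value could itself be dirty.)

class RazorRegister:
    def __init__(self):
        self.main = 0
        self.shadow = 0
        self.hit = False

    def clock(self, din_at_clk, din_at_dclk):
        # din_at_clk: value of din at the main clock edge
        # din_at_dclk: value of the same input at the delayed edge
        self.main = din_at_clk
        self.shadow = din_at_dclk
        # Comparison is performed between the two registers, so a
        # combinational glitch on din after dclk cannot affect it.
        self.hit = (self.main != self.shadow)
        if self.hit:
            # Recover: the shadow holds the correct (late) value.
            self.main = self.shadow
        return self.hit

r = RazorRegister()
assert r.clock(5, 5) is False   # data met timing: no failure
assert r.clock(3, 7) is True    # late arrival: hit raised
assert r.main == 7              # main corrected from the shadow
```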
Literature: Centralized Stallable Architecture
[Block diagram: Buffer, Rz. Input Register, Combinational Logic, Rz. Output Register; FSM Combinational Logic with Rz. State Register and Rz. Stab. Register]
By Cong et al., CODES'09.
Problems: FU sharing restricts the possibilities of the design; the worst-case timing must be allocated when it happens.
Deals with process variations.
Better Than Worse Case (unconstrained)
Additional slack guarantees ops. 4 and 7 to be correct.
Better Than Worse Case (unconstrained)
If only operations 4 and 7 have problems, it's OK.
Better Than Worse Case (unconstrained)
The failure of operation 5 was not considered in the static analysis.
Every failure is translated into an extra cycle.
The 1st iteration finishes after 11 cycles (1 failure).
My approach: Distributed Stallable Architecture
[Block diagram: Buffer, Raz. Input Register, Combinational Logic, Rz. Output Register; N local controllers (FSM CL 1 … FSM CL N), each with its own state register (St Reg 1 … St Reg N) and Rz. Stab. Register, coordinated by a Commit Signals Logic Unit]
Distributed Architecture, by Del Barrio et al., DATE'10 and TCAD (March 2011).
The controller is split into several local controllers, plus a coordinator responsible for checking hazards dynamically.
But how to integrate it with Razor Registers?
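The difference between centralized and distributed stalling can be sketched at the cycle level (a simplified model of my own, not the slides' implementation): each FU has a local FSM that repeats its current state on a timing failure, while in the centralized scheme a single failure stalls every FU.

```python
# Hedged sketch of distributed vs. centralized stalling. Each FU has a
# local FSM (a program counter over its state sequence); on a Razor hit
# only that FSM repeats its current state, while the others advance.
# A centralized controller would instead stall the whole datapath.
# (This model ignores data hazards, which the coordinator checks.)

def run(schedules, hits, distributed=True):
    """schedules: per-FU list of states; hits: set of (fu, state)
    pairs that suffer a timing failure once. Returns the number of
    cycles needed for all FUs to drain their schedules."""
    pc = [0] * len(schedules)
    pending = set(hits)
    cycles = 0
    while any(pc[i] < len(s) for i, s in enumerate(schedules)):
        cycles += 1
        failed = [i for i, s in enumerate(schedules)
                  if pc[i] < len(s) and (i, s[pc[i]]) in pending]
        for i, s in enumerate(schedules):
            if pc[i] >= len(s):
                continue                        # this FU is done
            if (i, s[pc[i]]) in pending:
                pending.discard((i, s[pc[i]]))  # retry next cycle
            elif distributed or not failed:
                pc[i] += 1                      # advance locally
            # centralized: everyone stalls whenever anyone fails
    return cycles

# One failure on the multiplier, while the adder has a longer chain:
scheds = [["m1", "m2"], ["a1", "a2", "a3"]]
hits = {(0, "m2")}
print(run(scheds, hits, distributed=True))   # → 3 (only FU 0 stalls)
print(run(scheds, hits, distributed=False))  # → 4 (everyone stalls)
```

The distributed version hides the failure inside the slack of the failing FU, which is exactly the latency advantage the slides claim over the CODES'09 centralized scheme.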
Usefulness of a Distributed Architecture
[Schedules over csteps 1–4: BTW static schedule (unconstrained) vs priority-list static schedule (unconstrained), over the DES nodes ×1, ×2, +5, ×6, ×3, <8, −9, ×7, ×4, −11, +10]
Operations 7 and 4 are covered: if a failure happens, it will have no impact on the latency.
Usefulness of a Distributed Architecture
[Execution traces over cycles 1–4 with Razor failures (R): BTW execution example vs distributed execution example, where each failing FU stalls individually (1-stall … 4-stall)]
But what if more failures happen? BTW could not be enough.
Distributed Architecture: Best Case Static Scheduling (unconstrained)
Distributed Architecture
We schedule considering the Best Case, but the datapath is able to reschedule on the fly; besides, some failures can be hidden.
If there are no more failures, the 1st iteration will finish in 10 cycles, besides hiding 2 failures.
Simulator Results

Unconstrained (average latency):
Benchmark    Codes BTW   DisM Barrier   DisM
DiffEq       10.5        10.394         6.31
ARF [13]     18.79       18.379         14.103
FFT [12]     11.57       12.218         11.844
FIR16 [12]   20.65       20.342         16.074
EWF [13]     12.892      12.829         12.316

RC-constrained (4+, 4*), average latency:
Benchmark    Codes BTW   DisM Barrier   DisM
DiffEq       10.66       10.374         8.305
ARF [13]     23.32       20.318         17.825
FFT [12]     16.51       14.92          14.38
FIR16 [12]   24.27       21.22          16.6
EWF [13]     16.1        12.79          12.47

Unconstrained: Codes and DisM Barrier have similar results; DisM reduces latency by 17%.
RC-constrained: DisM Barrier and DisM reduce latency by 12% and 23%, respectively.
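The 17% figure can be sanity-checked against the unconstrained table, assuming (my reading, not stated on the slide) it is the arithmetic mean of the per-benchmark latency reductions of DisM over Codes'09:

```python
# Sanity check of "DisM reduces 17% latency" from the unconstrained
# table, assuming the figure is the mean of per-benchmark reductions
# vs. Codes'09 BTW (an assumption on my part).
codes = {"DiffEq": 10.5, "ARF": 18.79, "FFT": 11.57,
         "FIR16": 20.65, "EWF": 12.892}
dism = {"DiffEq": 6.31, "ARF": 14.103, "FFT": 11.844,
        "FIR16": 16.074, "EWF": 12.316}
reductions = [1 - dism[b] / codes[b] for b in codes]
mean = sum(reductions) / len(reductions)
print(f"mean reduction: {mean:.1%}")   # ≈ 17.8%, close to the slide's 17%
```

Note that FFT is actually slightly slower under DisM, so the average hides one benchmark where the distributed scheme does not pay off.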
Implementation Results (P & R)
The Distr. Arch. implements modulo scheduling dynamically, as it executes operations when ready.
Comparison is performed between two registers, so there is no problem with FU sharing: din (combinational) will not influence the comparison result.

                                              Time                                    Area
Toolchain                     Resources  Latency  Cycle Time (ns)  Ex. Time (ns)  DSPs  LUTs  Regs  %DSPs  %LUTs  %Regs
10 ns Vivado+NP*+Xilinx 14.4  (2*,2+)    4        8.165            32.66          2     135   71    0.26   0.04   0.05
5 ns Vivado+NP*+Xilinx 14.4   (2*,2+)    5        9.003            45.015         2     157   73    0.26   0.05   0.05
Alb+Xilinx 14.4               (2*,2+)    3        8.7              26.1           2     246   118   0.26   0.08   0.08
Alb tuned+Xilinx 14.4         (2*,2+)    3        9.535            28.605         2     204   86    0.26   0.07   0.06
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
  – Simulator & HW implementation
  – Binding algorithm
Binding Problem: bad binding
[DFG scheduled over csteps 1–4: ×1, ×2, +5 | ×6, ×3, <8 | −9, ×7, ×4 | −11, +10]
The hazard between 8 and 9 stalls several components of the graph.
[Per-FU state tables (M1, M2, A1, A2) over clock cycles, and the issued/committed execution trace for the bad binding]
Binding Problem: good binding
[DFG scheduled over csteps 1–4: ×1, ×2, +5 | ×6, ×3, <8 | −9, ×7, ×4 | −11, +10]
The hazard between 8 and 10 is less damaging than the one between 6 and 4, because of the extra cstep. The cost function depends on the two bound operations.
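The pairwise cost idea can be sketched as a greedy binding pass (all names and the exact cost shape are my illustration, not the thesis algorithm): two operations bound to the same FU are more dangerous the closer their csteps are, since a Razor failure on the first directly delays the second, so the cost weights pairs by cstep distance.

```python
# Hedged sketch of a cost-driven greedy binding pass. Two operations
# bound to the same FU are penalized more the closer their csteps are
# (the 6/4 hazard on the slide vs. the milder 8/10 one, which has an
# extra cstep of slack in between).

def pair_cost(cstep_a, cstep_b):
    gap = abs(cstep_a - cstep_b)
    return 0 if gap == 0 else 1.0 / gap   # closer csteps cost more

def greedy_bind(ops, n_fus):
    """ops: dict op -> cstep. Returns dict op -> FU index."""
    binding = {}
    for op, cs in sorted(ops.items(), key=lambda kv: kv[1]):
        best_fu, best_cost = None, None
        for fu in range(n_fus):
            # Cost of adding op to fu = sum of hazard costs against
            # ops already bound there (same-cstep sharing infeasible).
            mates = [ops[o] for o, f in binding.items() if f == fu]
            if cs in mates:
                continue
            cost = sum(pair_cost(cs, m) for m in mates)
            if best_cost is None or cost < best_cost:
                best_fu, best_cost = fu, cost
        binding[op] = best_fu
    return binding

# Four multiplications on two multipliers (csteps as on the slide).
b = greedy_bind({"x1": 1, "x2": 1, "x6": 2, "x3": 2}, 2)
# Same-cstep ops never share a FU; cross-cstep sharing picks the
# pairing that the cost function rates cheapest.
assert b["x1"] != b["x2"] and b["x6"] != b["x3"]
```

An ILP or network-flow formulation (mentioned later in the deck) would optimize this same pairwise cost globally instead of greedily.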
[Per-FU state tables (M1, M2, A1, A2) over clock cycles, and the issued/committed execution trace for the good binding]
State of the Research
• Simulator: OK
• Implementation:
  – Sharing problem not solved yet
• Binding algorithm:
  – Greedy version: OK
  – ILP formulation: difficult to model, not working
  – Network flow formulation: possible target
• Study of controller granularity:
  – 1 FSM per FU
  – 1 FSM per operation cluster … but how to define an operation cluster?