High Level Synthesis · High Level Synthesis (HLS) • The process of converting a high-level...

Post on 19-Mar-2020

21 views 0 download

Transcript of High Level Synthesis · High Level Synthesis (HLS) • The process of converting a high-level...

1

Shankar BalachandranAssistant Professor, Dept. of CSE

IIT Madrasshankar@cse.iitm.ac.in

High Level SynthesisHigh Level Synthesis

2

References and Copyright• Textbooks referred (none required)

[Mic94] G. De Micheli“Synthesis and Optimization of Digital Circuits”McGraw-Hill, 1994.

• Slides used:Giovanni de Micheli’s Slides on SynthesisKia Bazargan’s Material on High Level Synthesis[©Gupta] © Rajesh Gupta

UC-Irvinehttp://www.ics.uci.edu/~rgupta/ics280.html

3

High Level Synthesis (HLS)• The process of converting a high-level description

of a design to a netlistInput:

o High-level languages (e.g., C)o Behavioral hardware description languages (e.g., VHDL)o Structural HDLs (e.g., VHDL)o State diagrams / logic networks

Tools:o Parsero Library of modules

Constraints:o Area constraints (e.g., # modules of a certain type)o Delay constraints (e.g., set of operations should finish in λ

clock cycles)Output:

o Operation scheduling (time) and binding (resource)o Control generation and detailed interconnections

4

Architectural-level synthesis motivation• Raise input abstraction level

Reduce specification of detailsExtend designer baseSelf-documenting design specificationsEase modifications and extensions

• Reduce design time• Explore and optimize macroscopic structure:

Series/parallel execution of operations

5

Synthesis

• Transform behavioral into structural view• Architectural-level synthesis:

Architectural abstraction levelDetermine macroscopic structureExample: major building blocks

• Logic-level synthesis:Logic abstraction levelDetermine microscopic structureExample: logic gate interconnection

6

High-Level Synthesis Compilation Flow

Lex

Parse

Compilationfront-end

BehavioralOptimization

Intermediateform

Arch synthLogic synth

Lib Binding HLS backend

x = a + ( b × c )+ d

++

×

a b c d

+

+ ×

a d b c

7

Example

diffeq {read (x; y; u; dx; a);repeat

xl = x+dx;ul = u –(3 . x . u . dx) – (3 . y . dx)yl = y + u . dx ;c = xl < a;X = xl; u = ul; y = yl;

until ( c )write (y);

}

8

Compilation and behavioral optimization

• Software compilation:Compile program into intermediate formOptimize intermediate formGenerate target code for an architecture

• Hardware compilation:Compile programs/HDL into sequencing graphOptimize sequencing graphGenerate gate-level interconnection for a cell library

9

Behavioral-level optimization

• Semantic-preserving transformations aiming at simplifying the model

• Applied to parse-trees or during their generation

• Taxonomy:Data-flow based transformationsControl-flow based transformations

10

Architectural synthesis and optimization• Synthesize macroscopic structure in terms of

building-blocks• Explore area/performance trade-off:

maximize performance of implementations subject to area constraintsminimize area implementations subject to performance constraints

• Determine an optimal implementation• Create logic model for data-path and control

11

Design space and objectives• Design space:

Set of all feasible implementations

• Implementation parameters:AreaPerformance:

o Cycle-timeo Latencyo Throughput (for pipelined implementations)

Power consumption

12

Design evaluation space

Area

Area

Area

Latency

Latency

Latency

Latency Max

Area Max

Cycle-time

13

Dependency Graph (Sequencing Graph)

diffeq {read (x; y; u; dx; a);repeat

xl = x+dx;ul = u –(3 . x . u . dx) – (3 . y . dx)yl = y + u . dx ;c = xl < a;X = xl; u = ul; y = yl;

until ( c )write (y);

}

+××××

× × + <-

-

x dxayu 3

ul vl c xl

14

Hardware modeling

• Circuit behavior:Sequencing graphs

• Building blocks:Resources

• Constraints:Timing and resource usage

15

Resources• Functional resources:

Perform operations on dataExample: arithmetic and logic blocks

• Storage resources:Store dataExample: memory and registers

• Interface resources:Example: busses and ports

16

Resources and circuit families

• Resource-dominated circuits.Area and performance depend on few, well-characterized blocksThe most common in DSP Circuits

• Non resource-dominated circuitsArea and performance are strongly influenced by sparse logic, control and wiringExample: some ASIC circuits

17

Implementation constraints• Timing constraints:

Cycle-timeLatency of a set of operationsTime spacing between operation pairs

• Resource constraints:Resource usage (or allocation)Partial binding

18

Sequence Graph• Remove all nodes

corresponding to constants (with respect to the loop)

• Remove edges from those constants as well

• Remove nodes that are being written in the loop and the corresponding edges

• Add NOP nodes at both ends to represent reads and writes of the loop

• Such a graph is much simpler to operate on

+××××

× × + <-

-

x dxayu 3

ul vl c xlNOP

NOP

19

Synthesis in the temporal domain

• Scheduling:Associate a start-time with each operationDetermine latency and parallelism of the implementationThe schedule is called φ

• Scheduled sequencing graph:Sequencing graph with start-time annotation

20

Synthesis in Temporal Domain

• Schedule:Mapping of operations to time slots (cycles)A scheduled sequencing graph is a labeled graph

+

NOP

××××

× × + <-

-NOP

1

23

4

+

NOP

×

×

××

×

×

+

<-

-

NOP

1

23

4Schedule 1 Schedule 2

21

Operation Types• For each operation, define its type.• For each resource, define a resource type,

and a delay (in terms of # cycles)• T is a relation that maps an operation to a

resource type that can implement itT : V {1, 2, ..., nres}.

• More general case:A resource type may implement more than one operation type (e.g., ALU)

• Resource binding:Map each operation to a resource with the same typeMight have multiple options

22

Synthesis in the spatial domain• Binding:

Associate a resource with each operation with the same typeDetermine the area of the implementation

• Sharing:Bind a resource to more than one operationOperations must not execute concurrently

• Bound sequencing graph:Sequencing graph with resource annotation

23

Schedule in Spatial Domain• Resource sharing

More than one operation bound to same resourceOperations have to be serializedCan be represented using hyperedges (define vertex partition)

+

NOP

××××

× × + <

-

-

NOP

1

2

3

4

24

Binding Should Change with Schedules

+

NOP

×

×

××

×

×

+

<-

-

NOP

1

23

4

+

NOP

××××

× × + <-

-NOP

1

23

4

4 Multipliers2 Adders

2 Multipliers2 Adders

25

Scheduling and Binding

Resource dominated

Control dominated

• Resource constraints:Number of resource instances of each type{ak : k=1, 2, ..., nres}.

• Scheduling:Labeled vertices φ (v3)=1.

• Binding:Hyperedges (or vertex partitions) β (v2)=adder1.

• Cost:Number of resources ≈ areaRegisters, steering logic (Muxes, busses), wiring, control unit

• Delay:Start time of the “sink” nodeMight be affected by steering logic and schedule (control logic) – resource-dominated vs. ctrl-dominated

26

Architectural Optimization• Optimization in view of design space flexibility• A multi-criteria optimization problem:

Determine schedule φ and binding β.Under area Α, latency λ and cycle time τ objectives

• Find non-dominated points in solution space• Solution space tradeoff curves:

Non-linear, discontinuousArea / latency / cycle time (more?)

27

Area/latency trade-off

1 2 3 4 5 6 7 8

5

10

15

78

1213

(3,2)

(2,1)

(3,1)

Area

Latency

20

1817

30

40

(2,2)(2,1)

(1,2)

(1,1)

Cycle-time

X

28

Scheduling and Binding• Cost λ and Α determined by both φ and β.

Also affected by floorplan and detailed routing

• β affected by φ:Resources cannot be shared among concurrent ops

• φ affected by β:Resources cannot be shared among concurrent opsWhen register and steering logic delays added to execution delays, might violate cycle time.

• Order?Apply either one (scheduling, binding) first

29

How Is the Datapath Implemented?• In the following schedule and binding, every

operation has two inputs. If an input is not shown explicitly, it comes from a unique register

+

×

×

××

×

×

+

<

-

-

1

2

3

4

30

Operation Scheduling• Input:

Sequencing graph G(V, E), with n verticesCycle time τ.Operation delays D = {di: i=0..n}.

• Output:Schedule φ determines start time ti of operation vi.Latency λ = tn – t0.

• Goal: determine area / latency tradeoff• Classes:

Non-hierarchical and unconstrainedLatency constrainedResource constrainedHierarchical

31

Min Latency Unconstrained Scheduling• Simplest case: no constraints, find min latency• Given set of vertices V, delays D and a partial

order > on operations E, find an integer labeling of operations φ: V Z+ Such that:

ti = φ(vi).ti ≥ tj + dj ∀ (vj, vi) ∈ E.λ = tn – t0 is minimum.

• Solvable in polynomial time• Bounds on latency for resource constrained

problems• ASAP algorithm used: topological order

32

ASAP SchedulesSchedule v0 at t0=0.While (vn not scheduled)

o Select vi with all scheduled predecessorso Schedule vi at ti = max {tj+dj}, vj being a predecessor of vi.

Return tn.

+

NOP

××××

× × + <-

-NOP

1

23

4

33

ALAP SchedulesSchedule vn at tn=λ.While (v0 not scheduled)

o Select vi with all scheduled successorso Schedule vi at ti = min {tj-dj}, vj being a succecessor of vi.

+

NOP

×

×××

×

×

+ <-

-NOP

1

23

4

34

Remarks• ALAP solves a latency-constrained problem• Latency bound can be set to latency computed by

ASAP algorithm• Mobility:

Defined for each operationDifference between ALAP and ASAP schedule

• Slack on the start time

35

Example

• Operations with zero mobility:{ v1, v2, v3, v4, v5 }Critical path

• Operations with mobility one: { v6, v7 }• Operations with mobility two: { v8, v9, v10, v11 }

* * + <

-

-

* * * * +

NOP

NOP

0

1 2

3

4

5

6

7

8

9

10

11

n

TIME 1

TIME 2

TIME 3

TIME 4

*

*

+ <

-

-

* *

*

* +

NOP

NOP

0

1 2

3

4

5

6

7 8

9

10

11

n

36

Scheduling under Relative Timing constraints• Motivation:

Deadlines and Release times are absoluteAlso makes sense to have relative constraintso Eg: Memory fetch must be done within 6 cycles and takes a

minimum of 2 cycles

• Constraints:Upper/lower bounds on start-time difference of any operation pairA minimum timing constraint lij ≥ 0 for specified i,jpairsA maximum timing constraint uij ≥ 0 for specified i,jpairs

• Feasibility of a solution

37

Constraint graph model• Start from sequencing graph

Model delays as weights on edges

• Add forward edges for minimum constraints:Edge ( vi , vj ) with weight lij → tj ≥ ti + lij

• Add backward edges for maximum constraints:That is, for constraint from vi to vjadd backward edge ( vj , vi ) with weight: -uij

o because tj ≤ ti + uij→ ti ≥ tj - uij

38

Example

NOP

NOP

* *

+ +

0

1 3

2 4

n

NOP

NOP

* *

+ +

0

1 3

2 4

n

MAX TIME

3

MIN TIME

4

-3

4

0 0

222

1 1

6vn

5v4

1v3

3v2

1v1

1v0

Start timeVertex1. Ensure that there are no positive cycles in the graph

2. Find longest paths in the graph between 2 nodes i and j and use as delay separations

39

Resource Constraint Scheduling• Constrained scheduling

General case NP-completeMinimize latency given constraints on area orthe resources (ML-RCS)Minimize resources subject to bound on latency (MR-LCS)

• Exact solution methodsILP: Integer Linear ProgrammingHu’s heuristic algorithm for identical processors

• HeuristicsList schedulingForce-directed scheduling

40

• Use binary decision variablesi = 0, 1, ..., nl = 1, 2, ..., λ’+1 λ’ given upper-bound on latency xil = 1 if operation i starts at step l, 0 otherwise.

• Set of linear inequalities (constraints),and an objective function (min latency)

• Observations

ti = start time of op i.

is op vi (still) executing at step l?

ILP Formulation of ML-RCS

[Mic94] p.198

))(),((

0

iLii

Si

Li

Siil

vALAPtvASAPt

tlandtlforx

==

><=

ill

i xlt ∑= .

⇒=∑+−=

11

l

dlmim

i

x ?

41

Constraints• Operations start only once

Σ xil = 1 i = 1, 2,…, n

• Sequencing relations must be satisfiedti ≥ tj + dj ti - tj - dj ≥ 0 for all (vj, vi) є E

• Resource bounds must be satisfiedSimple case (unit delay)Σ l xil ≤ ak k = 1,2,…nres ; for all l

• Equation 2 can be rewritten as Σ l • xil – Σ l • xjl – dj ≥ 0 for all (vj, vi) є E

i:T(vi)=k

42

Start Time vs. Execution Time• For each operation vi , only one start time• If di=1, then the following questions are the

same:Does operation vi start at step l?Is operation vi running at step l?

• But if di>1, then the two questions should be formulated as:

Does operation vi start at step l?o Does xil = 1 hold?

Is operation vi running at step l?o Does the following hold?

11

=∑+−=

l

dlmim

i

x ?

43

Operation vi Still Running at Step l ?• Is v9 running at step 6?

Is x9,6 + x9,5 + x9,4 = 1 ?

• Note:Only one (if any) of the above three cases can happenTo meet resource constraints, we have to ask the same question for ALL steps, and ALL operations of that type

v9

456

x9,4=1

v9

456

x9,5=1

v9

456

x9,6=1

44

Operation vi Still Running at Step l ?• Is vi running at step l ?

Is xi,l + xi,l-1 + ... + xi,l-di+1 = 1 ?

vi

l

l-1

l-di+1

...

xi,l-di+1=1

vil

l-1

l-di+1

...

xi,l-1=1

vil

l-1

l-di+1

...

xi,l=1

. . .

45

• Constraints:Unique start times:

Sequencing (dependency) relations must be satisfied

Resource constraints

• Objective: min cTt.t =start times vector, c =cost weight (e.g., [0 0 ... 1])When c =[0 0 ... 1], cTt =

ILP Formulation of ML-RCS (cont.)

∑ ==l

il nix ,...,1,0,1

jl

jll

ilijjji dxlxlEvvdtt ∑∑ +≥⇒∈∀+≥ ..),(

1,...,1,,...,1,)(: 1

+==≤∑ ∑= +−=

λlnkax reskkvTi

l

dlmim

i i

nll

xl∑ .

46

ILP Example• Assume λ = 4• First, perform ASAP and ALAP

(we can write the ILP without ASAP and ALAP, but using ASAP and ALAP will simplify the inequalities)

+

NOP

××××

× × + <-

-NOP

1

23

4

+

NOP

×

×××

×

×

+ <-

-NOP

1

23

4

v2v1

v3

v4

v5

vn

v6

v7

v8

v9

v10

v11

v2v1

v3

v4

v5

vn

v6

v7 v8

v9

v10

v11

47

ILP Example: Unique Start Times Constraint• Without using ASAP and

ALAP values:• Using ASAP and ALAP:

1.........

11

4,113,112,111,11

4,23,22,21,2

4,13,12,11,1

=+++

=+++

=+++

xxxx

xxxxxxxx

....11

11

11111

4,93,92,9

3,82,81,8

3,72,7

2,61,6

4,5

3,4

2,3

1,2

1,1

=++

=++

=+

=+

=

=

=

=

=

xxxxxx

xxxx

xxxxx

* * + <

-

-

* * * * +

NOP

NOP

0

1 2

3

4

5

6

7

8

9

10

11

n

48

ILP Example: Dependency Constraints• Using ASAP and ALAP, the non-trivial inequalities

are: (assuming unit delay for + and *)

01.4.3.2.501.4.3.2.501.3.2.401.3.2.4.3.201.3.2.4.3.201.2.3.2

4,113,112,115,

4,93,92,95,

3,72,74,5

3,102,101,104,113,112,11

3,82,81,84,93,92,9

2,61,63,72,7

≥−−−−

≥−−−−

≥−−−

≥−−−−++

≥−−−−++

≥−−−+

xxxxxxxxxxx

xxxxxxxxxxxxxxxx

n

n* * + <

-

-

* * * * +

NOP

NOP

0

1 2

3

4

5

6

7

8

9

10

11

n

49

ILP Example: Resource Constraints• Resource constraints (assuming 2 adders and 2

multipliers)

• Objective:Since λ=4 and sink has no mobility, any feasible solution is optimum, but we can use the following anyway:

2222222

4,114,94,5

3,113,103,93,4

2,112,102,9

1,10

3,83,7

2,82,72,62,3

1,81,61,21,1

≤++

≤+++

≤++

≤+

≤+++

≤+++

xxxxxxxxxxxxxxxxxxxxx

4,3,2,1, .4.3.2 nnnn xxxxMin +++

50

Example

*

*

+

<

-

-

* *

*

*

+

NOP

NOP

0

1 2

3

4

5

6

78

9

10

11

n

TIME 1

TIME 2

TIME 3

TIME 4

Note that the schedule is different from both ALAP and ASAP schedules.

51

ILP Formulation of MR-LCS• Dual problem to ML-RCS• Objective:

Goal is to optimize total resource usage, a.Objective function is cTa , where entries in c are respective area costs of resources

• Constraints:Same as ML-RCS constraints, plus:Latency constraint added:

Note: unknown ak appears in constraints.• Resource usage is unknown in the constraints• Resource usage is the objective to minimize

1. +≤∑ λnll

xl

52

ILP Solution• Use standard ILP packages• Transform into LP problem • Advantages:

Exact methodOthers constraints can be incorporated

• Disadvantages:Works well only up to few thousand variables

53

Hu’s Algorithm• Simple case of the scheduling problem

Operations of unit delayOperations (and resources) of the same type

• Hu’s algorithmGreedyPolynomial AND optimalComputes lower bound on number of resources for a given latencyOR: computes lower bound on latency subject to resource constraints

• Basic idea:Label operations based on their distances from the sinkTry to schedule nodes with higher labels first(i.e., most “critical” operations have priority)

54

Hu’s algorithm• Assumptions:

Graph is a forestAll operations have unit delayAll operations have the same type

• Algorithm:Greedy strategyExact solution

55

Example

• Assumptions:One resource type onlyAll operations have unit delay

• Labels:Distance to sink

3 2 1 1

2

1

4 4 3 2 2

0

1 2

3

4

5

6

7

8

9

10

11

n

56

Hu’s AlgorithmHU (G(V,E), a) {

Label the vertices // label = length of longest pathpassing through the vertex

l = 1repeat {

U = unscheduled vertices in V whosepredecessors have been scheduled(or have no predecessors)

Select S ⊆ U such that |S| ≤ a and labels in Sare maximal

Schedule the S operations at step l by settingti=l, i: vi ∈ S.

l = l + 1} until vn is scheduled.

}

57

3 11

Example

Step 1: Op 1,2,6

Step 2: Op 3,7,8

Step 3: Op 4,9,10

Step 4: Op 5,11

2 1

2

4 4 3 2 2

0

1 2

3

4

5

6

7

8

9

10

11

n

_a = 3

4 4 3 2

23

2

1

2

11

1

58

Hu’s Algorithm: Example (a=3)

59

List Scheduling• Greedy algorithm for ML-RCS and MR-LCS

Does NOT guarantee optimum solution

• Similar to Hu’s algorithmOperation selection decided by criticalityO(n) time complexity

• More general inputResource constraints on different resource types

60

List Scheduling Algorithm: ML-RCS

LIST_L (G(V,E), a) {l = 1repeat {

for each resource type k {Ul,k = available vertices in V.Tl,k = operations in progress.

Select Sk ⊆ Ul,k such that |Sk| + |Tl,k| ≤ akSchedule the Sk operations at step l

}l = l + 1

} until vn is scheduled.}

61

Example

* *

+

<

-

-

* * *

*

+

NOP

NOP

0

1 2

3

4

5

6

7 8

9

10

11

n

TIME 1

TIME 2

TIME 3

TIME 4

TIME 5

TIME 6

TIME 7

Resource bounds:3 multipliers with delay 2

1 ALU with delay 1

* * + <

-

-

* * * * +

NOP

NOP

0

1 2

3

4

5

6

7

8

9

10

11

n

62

List Scheduling Algorithm: MR-LCSLIST_R (G(V,E), λ’) {

a = 1, l = 1Compute the ALAP times tL.if t0

L < 0return (not feasible)

repeat {for each resource type k {

Ul,k = available vertices in V.Compute the slacks { si = ti

L - l, ∀ vi∈ Ul,k }.Schedule operations with zero slack, update aSchedule additional Sk ⊆ Ul,k under a constraints

}l = l + 1

} until vn is scheduled.}

63

Example

TIME 1

TIME 2

TIME 3

TIME 4

*

*

+

<

-

-

* *

*

*

+

NOP

NOP

0

1 2

3

4

5

6

7 8

9

10

11

n

* * + <

-

-

* * * * +

NOP

NOP

0

1 2

3

4

5

6

7

8

9

10

11

n

AssumptionsUnit-delay resourcesMaximum latency = 4

Start with :a1 = 1 multipliera2 = 1 ALUs

Step 1Two multiplications on CPSet a1 = 2 Schedule Mult 1,2 Schedule ALU 10

Step 2Schedule Mult 3, 6Schedule ALU 11

Step 3Schedule Mult 7,8Schedule ALU 4

Step 4Set a2=2Schedule ALU 5, 9

64

Summary• Scheduling algorithms are used by tools

Compilers use them when you write code for DSP processorsTools like Xilinx ISE, Synopsys DC etc. use them when you compiler HDL models

• Good understanding of “under the hood”operations of tools is useful

• The constraint solving techniques can be used directly for your custom designs

Eg: In DSP software, if you know the resources, you write assembly code to minimize latency

65

Force-Directed Scheduling• Similar to list scheduling

Can handle ML-RCS and MR-LCSFor ML-RCS, schedules step-by-stepBUT, selection of the operations tries to find the globally best set of operations

• Idea:Find the mobility μi = ti

L – tiS of operations

Look at the operation type probability distributionsTry to flatten the operation type distributions

• Definition: operation probability densitypi ( l ) = Pr { vi starts at step l }.Assume uniform distribution:

],[1

1)( Li

Si

ii ttlforlp ∈

+=

μ

66

Force-Directed Scheduling: Definitions• Operation-type distribution (NOT normalized to 1)

• Operation probabilities over control steps:

• Distribution graph of type k over all steps:

qk ( l ) can be thought of as expected operator costfor implementing operations of type k at step l.

∑=

=kvTi

iki

lplq)(:

)()(

)}(,),1(),0({ npppp iiii Κ=

)}(,),1(),0({ nqqq kkk Κ

67

Example

+

NOP

××××

× × + <-

-NOP

1

23

4

0)4(

83.031

21)3(

33.231

21

211)2(

83.231

2111)1(

=

=+=

=+++=

=+++=

mult

mult

mult

mult

q

q

q

q

2.83

2.33

.83

66.131

311)4(

231

31

311)3(

131

31

31)2(

33.031)1(

=++=

=+++=

=++=

==

add

add

add

add

q

q

q

q

0

1

2

1.66

0.33

68

Force-Directed Scheduling Algorithm: Idea• Very similar to LIST_L(G(V,E), a)

Compute mobility of operations using ASAP and ALAPComputer operation probabilities and type distributionsSelect and schedule operationsUpdate operation probabilities and type distributionsGo to next control step

• Difference with list sched in selecting operationsSelect operations with least forceConsider the effect on the type distributionConsider the effect on successor nodes and their type distributions