ECE 565: High-Level Synthesis
An Introduction
Shantanu Dutt, ECE Dept., UIC
HLS Flow
Code/Algorithm → Architecture (interconnected functional units (FUs) and memory units (MUs), connected via muxes, demuxes, tristate buffers, buses, and dedicated interconnects)
Classically, these 3 stages were performed sequentially, but they are currently performed together (which leads to better optimization)
HLS Flow (contd)
HLS Flow (contd)
Allocation: Simple counting of FUs after the above 2 stages (Binding)
Simple HLS Examples
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+), w/ X delay of 2 ccs and + delay of 1 cc
Note: A register is loaded at the +ve/-ve edge (in a +ve/-ve edge-triggered system) of the cc after the one in which its load signal is asserted.
(a) Scheduling: [x ← a × b] (c1), [y ← c+d] (c2), [z ← x+y] (c3)
(b) Arch. Synthesis
(c) Controller FSM Synthesis
Controller FSM (entered from Reset; cycle labels on the figure: cc 3i, cc 3(i+2)). Control signals shown on the FSM, in slide order:
lda=1, ldb=1, ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
mux1=0, mux2=0, demux=0, ldy=1
ldx=1
Note: Unspecified control signals have either an inactive value or, if no such concept exists for the control signal, the don't-care value.
Simple HLS Examples (contd)
2) Mapping to h/w w/ constraints: use only 1 multiplier (X) and 1 adder (+) (contd)
ii) Overlapped pipelined scheduling
(a) Scheduling: X row: c1(1), c1(2); + row: c2(1), c3(1), c2(2), c3(2)
(b) Arch. Synthesis
(c) Controller FSM Synthesis
Controller FSM (entered from Reset; cycle labels on the figure: cc 3i, cc 3(i+1)).
States: [y ← c+d, x ← a × b] (c1, c2); [z ← x+y] (c3)
Control signals shown on the FSM, in slide order:
lda=1, ldb=1, mux1=0, mux2=0, demux=0, ldy=1, ldx=1
ldc=1, ldd=1, mux1=1, mux2=1, demux=1, ldz=1
For 4 iterations, the overlapped schedule takes 9 ccs versus 12 ccs for the non-overlapped schedule.
Overlapped schedule: time for n iterations = 2n+1 ccs; throughput = n/(2n+1) ≈ 0.5 outputs/cc
Non-overlapped schedule: time for n iterations = 3n ccs; throughput = n/3n ≈ 0.33 outputs/cc
For large n, the overlapped schedule therefore takes about 33% less time per output; equivalently, its throughput is about 1.5 times that of the non-overlapped schedule (≈ 50% improvement).
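The cycle counts above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic: the function names are made up for illustration, and the formulas 3n and 2n+1 are taken directly from the slide.

```python
from fractions import Fraction


def nonoverlapped_ccs(n: int) -> int:
    """Clock cycles for n iterations of the non-overlapped schedule (3 ccs each)."""
    return 3 * n


def overlapped_ccs(n: int) -> int:
    """Clock cycles for n iterations of the overlapped schedule (2n + 1)."""
    return 2 * n + 1


def throughput(n: int, ccs: int) -> Fraction:
    """Outputs per clock cycle when n iterations finish in ccs cycles."""
    return Fraction(n, ccs)


def speedup(n: int) -> Fraction:
    """Ratio of non-overlapped time to overlapped time for n iterations."""
    return Fraction(nonoverlapped_ccs(n), overlapped_ccs(n))


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, overlapped_ccs(n), nonoverlapped_ccs(n), float(speedup(n)))
```

Exact fractions are used so that the limit behaviour (throughput tending to 1/2 versus 1/3, speedup tending to 3/2) is not obscured by rounding.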
Simple HLS Examples (contd)
Conditional code:
If (a > b) then c ← a-b;
Else c ← b-a;
Possible DFGs corresponding to the above conditional code:
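The DFG figures themselves are not reproduced in this text, so the sketch below only illustrates two common ways such a conditional can be turned into dataflow; it is an assumption about what the DFGs might look like, not a transcription of the slide. One variant computes both differences and selects the result with a mux; the other muxes the operands so that a single subtractor suffices. All function names are hypothetical.

```python
def abs_diff_branch(a: int, b: int) -> int:
    """Reference behaviour: the conditional code exactly as written."""
    if a > b:
        c = a - b
    else:
        c = b - a
    return c


def abs_diff_mux_on_outputs(a: int, b: int) -> int:
    """Two subtractors run in parallel; a comparator drives an output mux."""
    sel = a > b          # comparator
    d1 = a - b           # subtractor 1
    d2 = b - a           # subtractor 2
    return d1 if sel else d2   # 2-to-1 mux on the results


def abs_diff_mux_on_inputs(a: int, b: int) -> int:
    """One subtractor; the comparator drives muxes on its two operands."""
    sel = a > b                  # comparator
    left = a if sel else b       # mux on the first operand
    right = b if sel else a      # mux on the second operand
    return left - right          # the single subtractor
```

The second variant trades an extra mux and a dependence on the comparison for one fewer subtractor, which is the kind of choice a resource constraint such as "only 1 adder/sub" forces.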
Simple HLS Examples (contd)
(a) Scheduling (using only 1 adder/sub)
(b) Arch. Synthesis
Delay Nodes in DFGs
A delay node is generally implemented as a register; a delay node thus becomes a state variable.
Delay Nodes in DFGs (contd)
Figure labels: register; transformation in the DFG; mapping to the architecture
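To make the "delay node becomes a state variable" point concrete, here is a minimal sketch using a hypothetical DFG that is not on the slide: an accumulator y[n] = x[n] + y[n-1], where the delay node holding y[n-1] is modelled as a register that is updated once per clock cycle.

```python
from typing import Iterable, List


def run_accumulator(xs: Iterable[int], reset_value: int = 0) -> List[int]:
    """Simulate y[n] = x[n] + y[n-1], one input sample per clock cycle.

    `reg` plays the role of the delay node: it is the only state carried
    from one clock cycle to the next (hypothetical example DFG).
    """
    reg = reset_value        # register contents after reset
    ys = []
    for x in xs:
        y = x + reg          # combinational adder reads the register
        ys.append(y)
        reg = y              # register loads the new value at the clock edge
    return ys
```

For example, `run_accumulator([1, 2, 3])` yields the running sums of the inputs; the register is what lets iteration n see the result of iteration n-1.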
Detailed HLS Example
Detailed HLS Example (contd)
(a) Scheduling w/ one X (2 ccs) & one + (1 cc); goal: min. latency
Different paths (i/p → o/p) in the DFG
(b) Reg. alloc. for o/p of operations
Note: It is not clear how register allocation has been done. It is sub-optimal (4 non-primary i/p regs. needed).
(c) Arch. synthesis
For WAR constraint
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties in favor of those operations u whose sibling o/ps (o/ps to the same children) that are available, or will be available at u's earliest finish, will have the largest lifetime at that point.
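The primary rule of the scheduling heuristic (prefer the ready operation with the longest delay to the output) can be sketched as a small resource-constrained list scheduler. This is a simplified illustration, not the slide's exact procedure: ties are broken by operation name rather than by sibling lifetimes, FUs are assumed non-pipelined, every FU type used is assumed to have at least one unit, and the example DFG is the small x = a × b, y = c + d, z = x + y graph from the earlier slides rather than the DFG of this example.

```python
from typing import Dict, List, Tuple

# op name -> (FU type, delay in ccs, predecessor operations)
Ops = Dict[str, Tuple[str, int, List[str]]]


def list_schedule(ops: Ops, fu_count: Dict[str, int]):
    """Return (start, finish) cycle maps; finish[n] is the first cc the result is usable."""
    succs = {name: [] for name in ops}
    for name, (_, _, preds) in ops.items():
        for p in preds:
            succs[p].append(name)

    memo: Dict[str, int] = {}

    def delay_to_output(name: str) -> int:
        # Longest delay from the start of this operation to any output.
        if name not in memo:
            own = ops[name][1]
            memo[name] = own + max((delay_to_output(s) for s in succs[name]), default=0)
        return memo[name]

    start: Dict[str, int] = {}
    finish: Dict[str, int] = {}
    busy_until = {fu: [0] * count for fu, count in fu_count.items()}
    cc = 0
    while len(start) < len(ops):
        ready = [n for n in ops if n not in start
                 and all(p in finish and finish[p] <= cc for p in ops[n][2])]
        # Highest delay-to-output first; name order is an arbitrary tie-break.
        ready.sort(key=lambda n: (-delay_to_output(n), n))
        for n in ready:
            fu_type, delay, _ = ops[n]
            units = busy_until[fu_type]
            for i, free_at in enumerate(units):
                if free_at <= cc:          # this unit is idle in cycle cc
                    units[i] = cc + delay
                    start[n], finish[n] = cc, cc + delay
                    break
        cc += 1
    return start, finish


example_ops: Ops = {
    "x": ("mul", 2, []),          # x = a * b
    "y": ("add", 1, []),          # y = c + d
    "z": ("add", 1, ["x", "y"]),  # z = x + y
}
example_start, example_finish = list_schedule(example_ops, {"mul": 1, "add": 1})
```

Replacing the name-based tie-break with a lifetime-based one would require tracking when each operand is produced and last consumed, which is the information the register-allocation slides work with.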
Detailed HLS Example (contd)
Detailed HLS Example: Register Allocation
Detailed HLS Example: Register Allocation (contd)
In the conflict graph (one per FU), there is an edge between 2 variable nodes if their lifetimes overlap (indicating that different registers need to be allocated to them)
Graph coloring (using the min. # of colors to color the nodes s.t. connected node pairs have different colors) is in general NP-hard
The above type of conflict graph is called an interval graph (derived from the 1-dimensional intervals of the lifetimes)
Min. graph coloring can be solved optimally in linear time for interval graphs (using the left-edge algorithm that we will see later for channel routing)
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties in favor of those operations u whose sibling o/ps (o/ps to the same children) that are available, or will be available at u's earliest finish, will have the largest lifetime at that point.
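A minimal sketch of left-edge register allocation on lifetime intervals follows. The variable names and lifetimes are invented for illustration and are not the ones in the slide's example; lifetimes are assumed to be half-open intervals [start, end), so a variable that dies at cycle t can share a register with one born at t. The sketch sorts the intervals first, so as written it runs in O(n log n) plus the track-filling passes rather than strictly linear time.

```python
from typing import Dict, Tuple


def left_edge(lifetimes: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Assign each variable a register index so overlapping lifetimes never share one."""
    # Sort by left edge (start time), then by end time.
    remaining = sorted(lifetimes, key=lambda v: lifetimes[v])
    assignment: Dict[str, int] = {}
    reg = 0
    while remaining:
        last_end = None      # end of the last lifetime packed into this register
        leftover = []
        for v in remaining:
            start, end = lifetimes[v]
            if last_end is None or start >= last_end:
                assignment[v] = reg      # fits after the previous variable
                last_end = end
            else:
                leftover.append(v)       # conflicts; try the next register
        remaining = leftover
        reg += 1
    return assignment


# Hypothetical lifetimes, in clock cycles.
example_lifetimes = {
    "A": (0, 2),
    "B": (1, 3),
    "C": (2, 4),
    "D": (3, 5),
    "E": (0, 1),
}
example_assignment = left_edge(example_lifetimes)
```

Each pass of the outer loop fills one register greedily from left to right, which is why the number of registers used equals the maximum number of simultaneously live variables for an interval graph.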
Detailed HLS Example: Register Allocation (contd)
3 non-primary i/p regs. needed
Scheduling heuristic: Among the available operations, schedule on the available FUs those whose delay to the o/p is the highest, breaking ties arbitrarily: B's lifetime increases, but D's (dep. of B) decreases similarly; the heuristic should therefore be based on more global information