21/04/16 1
VLSI Programming 2016: Lecture 2
Course: 2IMN35
Teachers: Kees van Berkel [email protected] Rudolf Mak [email protected]
Lab: Kees van Berkel, Rudolf Mak, Alok Lele
www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/ Lecture 2 (representations, bounds)
transposition, pipelining, retiming, J-slow, unfolding
VLSI Programming (2IMN35): time table 2016
(Tue: h5-h8, MF.07; Thu: h1-h4, Gemini-Z3A-08/10/13; "in"/"out" = assignments handed in / handed out)

• 19-Apr: introduction, DSP graphs, bounds, …
• 21-Apr: pipelining, retiming, transposition, J-slow, unfolding — out: T1+T2
• 26-Apr: introductions to FPGA and Verilog; L1: audio filter simulation — in: tools installed; out: L1, L2
• 28-Apr: unfolding, look-ahead, strength reduction; L1 cntd — in: T1+T2; out: T3+T4
• 3-May: folding; L2: audio filter on XUP board
• 5-May: —
• 10-May: DSP processors; L2 cntd — in: T3+T4; out: L3
• 12-May: L3: sequential FIR + strength-reduced FIR
• 17-May: L3 cntd
• 19-May: L3 cntd — out: L4
• 24-May: systolic computation — out: T5
• 26-May: L4
• 31-May: L4: audio sample rate convertor — in: T5
• 2-Jun: L4 cntd — in: L3; out: L5
• 7-Jun: L5: 1024x audio sample rate convertor
• 9-Jun: L5 cntd — in: L4
• 14-Jun: —
• 16-Jun: deadline report L5 — in: L5
Preparation for Lab work
• Prepare your notebook for lab work
• See preparation link on 2IMN35 web-site
• Install the required tools and test them
• First Lab exercises: Tuesday April 26
• Find a partner (team size equals 2)
Note on course literature
Lectures VLSI programming are loosely based on:
• Keshab K. Parhi. VLSI Digital Signal Processing Systems, Design and Implementation. Wiley Inter-Science, 1999.
• This book is recommended, but not mandatory.

Accompanying slides can be found on:
• http://www.ece.umn.edu/users/parhi/slides.html
• http://www.win.tue.nl/~cberkel/2IN35/

Mandatory reading:
• Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proc. of the IEEE, Vol. 75, No. 9, Sept 1987, pp 1235-1245.
• Keshab K. Parhi. High-Level Algorithm and Architecture Transformations for DSP Synthesis. Journal of VLSI Signal Processing, 9, 121-143 (1995), Kluwer Academic Publishers.
Outline Lecture 2
Representations:
• block diagrams,
• data-flow graphs (DFGs) and signal-flow graphs (SFGs)
Bounds: loop bounds, iteration bounds, critical paths
Transformations of DFGs and SFGs:
• (commuting of an SFG) (lecture)
• transposition of an SFG (Parhi3.pdf)
• pipelining of a DFG (Parhi3.pdf)
• retiming of a DFG (Parhi4.pdf)
• J-slow transformation of a DFG (Parhi4.pdf)
• unfolding of a DFG (Parhi3.pdf, Parhi5.pdf)
Assignment: T1 and T2
Parhi
• Note: Many examples and ideas are taken from Parhi’s slides
DSP systems and programs
• infinite input stream (samples): x(0), x(1), x(2), …
• infinite output stream (samples): y(0), y(1), y(2), …
• (there may be multiple input and/or output streams)
• non-terminating program:
for n = 1 to ∞
    y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
end
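The non-terminating program above can be sketched as a Python generator (a minimal sketch assuming zero initial state; the name fir3 is just for illustration):

```python
# Streaming FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2),
# with x(-1) = x(-2) = 0 (zero initial state).
def fir3(xs, a, b, c):
    x1 = x2 = 0          # delay-line state: x(n-1), x(n-2)
    for x0 in xs:        # conceptually an infinite input stream
        yield a * x0 + b * x1 + c * x2
        x1, x2 = x0, x1  # shift the delay line

# Impulse response for a=1, b=2, c=3:
print(list(fir3([1, 0, 0, 0], 1, 2, 3)))  # [1, 2, 3, 0]
```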
DSP System
x(n) y(n)
DSP system: block diagram
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• delay element = memory element = register
• multiply with constant a
• adder: output value = sum of input values
× a × b × c
+ +
D D
y(n)
x(n) x(n-1) x(n-2)
D
× a
+
DSP system: data-flow graph (DFG)
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• D is (non-negative) number of delays
• multiplier: output value = (constant a) × input value
• adder: output value = sum of input values
a b c
y(n)
x(n)
a
D D
Signal-flow graph (representation method 3)
• A join-node denotes an adder
• A label a next to an edge denotes multiplication by the constant a
• z-k denotes k units of delay
• Signal-flow graphs are used to represent Linear Time-Invariant (LTI) systems.
• A signal-flow graph represents a so-called Z-transform (the discrete-time counterpart of the Laplace transform), the basis of a powerful LTI system theory (outside the scope of 2IMN35).
Iteration bound cntd
Example:
• TL1 = (10+2)/1 = 12
• TL2 = (2+3+5)/2 = 5
• TL3 = (10+2+3)/2 = 7.5
• Iteration bound = max (12, 5, 7.5) = 12
Notes:
• Delays are non-negative (negative delay would imply non-causality).
• If loop weight equals 0 (no delay elements in loop) then TL/0 = ∞ (deadlock).
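The computation above is easy to mechanize. A minimal sketch (representing each loop as a (time, delay-count) pair is an assumption, not the slides' notation):

```python
# Iteration bound T_inf = max over loops of (loop computation time t
# divided by the number of D-elements w in the loop).
def iteration_bound(loops):
    # loops: list of (t, w) pairs
    bounds = []
    for t, w in loops:
        if w == 0:
            return float("inf")   # zero-delay loop: deadlock
        bounds.append(t / w)
    return max(bounds)

# The three loops from the example: (10+2)/1, (2+3+5)/2, (10+2+3)/2
print(iteration_bound([(12, 1), (10, 2), (15, 2)]))  # 12.0
```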
Critical path cntd
Example (FIR filter): • Tm= 10 ns
• Ta= 4 ns
• No loops!
1. 1 path from input to state: 0 ns
2. 4 paths from state to output: 26, 22, 18, 14 ns
3. 1 path from input to output: 26 ns
4. 3 paths from state to state: 0, 0, 0 ns
The critical path is 26 ns. (can be reduced by pipelining and parallel processing.)
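The critical-path computation can be sketched as a longest-path search over the zero-delay edges. The node names m0..m4 and a1..a4 below are hypothetical labels for the 5-tap FIR of this example (multipliers feeding an adder chain):

```python
# Critical path = longest purely combinational (zero-delay) path.
def critical_path(times, edges):
    # times: {node: computation time}; edges: [(u, v, delay_count)]
    comb = [(u, v) for u, v, d in edges if d == 0]
    best = {}
    def longest_from(u):
        if u not in best:
            best[u] = times[u] + max(
                (longest_from(v) for x, v in comb if x == u), default=0)
        return best[u]
    return max(longest_from(u) for u in times)

# 5-tap FIR, Tm = 10 ns, Ta = 4 ns: multipliers m0..m4 feed the
# adder chain a1 -> a2 -> a3 -> a4.
times = {**{f"m{i}": 10 for i in range(5)},
         **{f"a{i}": 4 for i in range(1, 5)}}
edges = [("m0", "a1", 0), ("m1", "a1", 0), ("a1", "a2", 0), ("m2", "a2", 0),
         ("a2", "a3", 0), ("m3", "a3", 0), ("a3", "a4", 0), ("m4", "a4", 0)]
print(critical_path(times, edges))  # 26
```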
TRANSPOSITION OF LTI SYSTEMS
Commutativity of LTI systems
LTI System A
LTI System B
x(n) y(n) f(n)
LTI System B
LTI System A
x(n) y(n) g(n)
is equivalent to
Transposition of LTI systems
LTI System A
LTI System B
x(n) y(n) f(n)
LTI System A
LTI System B
y(n) x(n) g(n)
is equivalent to
Transposition of LTI systems
Consider an LTI system (represented as an SFG or DFG) with a single input and a single output
Transposition = invert all edges:
• Input becomes output, output becomes input
• Fork becomes adder, adder becomes fork
• Edge (delay, multiply by a constant) remains edge
Theorem: the transposed version of an LTI graph is also an LTI graph and defines the same DSP function
(in some cases also applicable to multi-input/output LTI graphs)
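The theorem can be checked numerically. Below is a minimal sketch of the direct and transposed FIR forms (zero initial state assumed; function names are illustrative); both compute the same outputs:

```python
# Direct form: delayed inputs are multiplied, then summed.
def fir_direct(xs, coeffs):
    state = [0] * (len(coeffs) - 1)      # x(n-1), x(n-2), ...
    out = []
    for x in xs:
        out.append(coeffs[0] * x + sum(c * s for c, s in zip(coeffs[1:], state)))
        state = [x] + state[:-1]
    return out

# Transposed form: products are accumulated through delayed partial sums.
def fir_transposed(xs, coeffs):
    state = [0] * (len(coeffs) - 1)      # delayed partial sums
    out = []
    for x in xs:
        out.append(coeffs[0] * x + state[0])
        state = [coeffs[i] * x + (state[i] if i < len(state) else 0)
                 for i in range(1, len(coeffs))]
    return out

xs = [1, 0, 0, 0, 2, -1]
print(fir_direct(xs, [1, 2, 3]) == fir_transposed(xs, [1, 2, 3]))  # True
```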
Transposition of LTI graphs, example
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• is an LTI system
• Assume add and multiply times: 2 and 5 nsec resp.
fsample ≤ 1 / (TM + 2·TA) = 1 / (5 + 2·2) nsec = 1 / (9 nsec) ≈ 111 MHz
× a × b × c
+ +
D D
y(n)
x(n) x(n-1) x(n-2)
Transposition of LTI graphs, example, cntd
× c × b × a
+ + D D y(n)
x(n)
× a × b × c
+ +
D D
y(n)
x(n) x(n-1) x(n-2)
Transposition of LTI graphs, example cntd
• Redraw (rotate by 180 degrees): “transposed FIR”
Notes:
• cycle time reduces from 9 to 7 nsec
• (throughput increases from 111 to 143 MHz)
• same HW resources
fsample ≤ 1 / (TM + TA) = 1 / (5 + 2) nsec = 1 / (7 nsec) ≈ 143 MHz
× c × b × a
+ + D D y(n)
x(n)
PIPELINING
Pipeline
An early example of a pipeline is a car assembly line.
[Henry Ford, 1908]. Some 1914 numbers:
• sample time (car parts in / car out): 3 minutes;
• hence, throughput: 20 cars/hour, or approximately 5.5 mHz.
• latency (time between first parts in, car out): 93 minutes;
• number of pipeline stages (stations): 93/3 = 31.
Notes:
• maximum processing time per stage is 3 minutes; some stages may be (much) faster.
• As many as 31 (partial) cars are under assembly.
Pipeline
• A pipeline is a chain (cascade) of data-processing elements (“stages”) connected in series (output from one connected to input of next) with a buffer (storage) inserted between these stages.
• Different pipeline stages can operate in parallel, during the same (clock) period.
• The overall throughput is independent of the length of the chain (the number of stages in series).
• The overall throughput is determined by the slowest stage.
Pipelining
• is a transformation that changes the number of stages in a pipeline with the objective to e.g. increase the overall throughput.
• can be applied to block diagrams, DFGs, and SFGs.
Pipelining
• Example (FIR): adding 2 D-elements at the red cut line reduces the critical path from Tm+2Ta to Tm+Ta.
• Note: the FIR function changes! Each output becomes available one iteration later.
output y(n) becomes y(n-1) after pipelining
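The changed function can be seen in a small sketch (a hypothetical direct implementation with pipeline registers p at the multiplier outputs): the pipelined filter produces 0 first, then y(0), y(1), …

```python
# 1-level pipelined 3-tap FIR: D-elements inserted at the multiplier
# outputs (a feed-forward cut), so each output appears one cycle later.
def fir3_pipelined(xs, a, b, c):
    x1 = x2 = 0                      # input delay line
    p = [0, 0, 0]                    # pipeline registers after the multipliers
    out = []
    for x in xs:
        out.append(sum(p))           # adders consume last cycle's products
        p = [a * x, b * x1, c * x2]  # products latched this cycle
        x1, x2 = x, x1
    return out

# Impulse response, delayed by one iteration:
print(fir3_pipelined([1, 0, 0, 0], 1, 2, 3))  # [0, 1, 2, 3]
```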
Pipelining
• A cut set is a set of edges of a graph such that removing these edges makes the graph disjoint.
• A feed-forward cut set is a cut set in which all edges are directed in parallel, from one side of the cut set to the other side.
• An M-level pipelined graph has M delay elements in every path from input to output.
Pipelining:
• Increase the pipeline level to M+1 by inserting a D-element in each edge of a feed-forward cut set.
• Decrease the pipeline level to M-1 (assuming M≥1) by removing a D-element from each edge of a feed-forward cut set.
Pipelining a simple FIR filter
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
• Assume add and multiply times: 2 and 5 nsec resp.
fsample ≤ 1 / (TM + 2·TA) = 1 / (5 + 2·2) nsec = 1 / (9 nsec) ≈ 111 MHz
× a × b × c
+ +
D D
y(n)
x(n) x(n-1) x(n-2)
Pipelining a simple FIR filter (cntd)
• y(n) = a*x(n) + b*x(n-1) + c*x(n-2)
fsample ≤ 1 / (TM + 0·TA) = 1 / (5 + 0) nsec = 1 / (5 nsec) = 200 MHz
+ +
D D
y(n-1)
x(n) x(n-1) x(n-2)
× a × b × c
D D D forward cut set
Pipelining/advantages
• Advantage of pipelining: by suitably choosing the feed-forward cut sets, the critical path can be reduced.
• In theory an arbitrary number of pipeline stages can be inserted, but in practice the added gains (reduced critical path) diminish.
• Note: pipelining can never be used to reduce loop bounds, since graph cycles cannot be partitioned by means of a feed-forward cut set.
• Accordingly, pipelining only pays off until the critical path drops to the iteration bound.
Pipelining/disadvantages
• Disadvantage 1: added D-elements cost additional hardware (latches).
• Disadvantage 2: increased latency. For each added level of pipelining, the output(s) become available one iteration (clock period) later.
• Wikipedia: latency is the time interval between stimulation and response or, more generally, the time delay between the cause and the effect of some physical change in the system being observed.
• The added latency [measured in clock cycles] equals the number of pipeline stages added.
Fine-grained pipelining
• In our course arithmetic nodes (multipliers, adders, …) are considered atomic, with given computation times.
• Implementations of arithmetic nodes are also graphs that can be retimed, down to gate-level (XOR, AND, ..) granularity.
• E.g. it is perfectly possible to find a forward cut set in the graph of a “multiply-by-a-constant” gate-level circuit, such that the circuit is partitioned into two parts with approximately half the computation times each.
• This so-called fine-grained pipelining is a common practice in processor and ASIC (Application Specific IC) design.
• The multipliers offered on the Spartan FPGAs are not pipelined.
RETIMING
Retiming
• Ideally all pipeline stages take approximately equal time, resulting in a balanced pipeline.
• Retiming is a graph transformation by which D-elements are relocated.
• The simplest retiming technique is to apply a sequence of node retiming steps.
• Node retiming: move a D-element from each input edge to each output edge (or vice versa). Example:
• To achieve an optimal pipeline, pipelining and retiming transformations can be combined.
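Node retiming can be expressed with the standard retiming equation w'(u→v) = w(u→v) + r(v) − r(u), where r(v) counts how many D-elements are moved from the inputs of node v to its outputs. A minimal sketch (the graph encoding is an assumption):

```python
# Apply a retiming r to the edge delays of a graph.
# A legal retiming must leave all resulting weights non-negative.
def retime(edges, r):
    # edges: {(u, v): delay count}; r: {node: retiming value}
    return {(u, v): w + r.get(v, 0) - r.get(u, 0)
            for (u, v), w in edges.items()}

# Move one D from the input edge of node B to its output edge:
edges = {("A", "B"): 1, ("B", "C"): 0}
print(retime(edges, {"B": -1}))  # {('A', 'B'): 0, ('B', 'C'): 1}
```

Note that the total number of D-elements in any loop is invariant under retiming, which is why retiming alone cannot beat the iteration bound.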
Retiming example (Parhi ’95, Fig 3a)
(figure: node computation times 2, 2, 2, 1, 1, 1, 1) Critical path is 10 time units long (transposed version: 8 time units)
Parhi ’95, Fig 3a / step 1:pipelining
Critical path is 10 time units long
Parhi ’95, Fig 3a / step 2: node retiming
Critical path is 10 time units long
Parhi ’95, Fig 3a / step 3: node retiming
Critical path is 7 time units long
Parhi ’95, Fig 3a / step 4: pipelining
Critical path is 7 time units long
Parhi ’95, Fig 3a / step 5: node retiming
Critical path is 4 time units long
Parhi ’95, Fig 3a / step 6: pipelining
Critical path is 4 time units long
Parhi ’95, Fig 3a / step 7: node retiming
Critical path is 3 time units long
Parhi ’95, Fig 3a / step 8: node retiming
Critical path is 3 time units long
Parhi ’95, Fig 3a / step 9: node retiming
Critical path is 2 time units long
Parhi ’95, Fig 3a after retiming and pipelining = Fig 3b
Critical path is 2 time units long
K-SLOW TRANSFORMATION
k-slow transformation
• Replace each D-element by k D-elements.
• This transformation changes the function of the graph.
• However, the original function is performed when inputs are offered at clock cycles k*i (i = 0, 1, 2, …) and outputs are consumed during these clock cycles.
• Hence, the hardware is utilized during only 1/k of the clock cycles.
k-slow transformation
• Example: 2-slow transformation
k-slow transformation
• Benefit 1: the hardware can offer the same function to independent streams during the intermediate clock cycles k*i+1, k*i+2, …, k*i+k-1.
• Example: a stereo audio filter must be applied to both the left and right audio channels. Design the filter for one channel, and apply the 2-slow transformation.
• Example: RGB processing for color video.
• Benefit 2: the k-slow transformation increases the number of D-elements in loops (k times), enabling a reduction of critical paths by means of retiming.
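Benefit 1 can be sketched with a first-order IIR (a hypothetical minimal example, not from the slides): after the 2-slow transformation, samples of two independent channels can be interleaved on the shared datapath.

```python
# First-order IIR y(n) = x(n) + a*y(n-1), made 2-slow by replacing
# its D with 2D: y(n) = x(n) + a*y(n-2). Interleaving two channels
# (e.g. left/right audio) on the shared datapath recovers Benefit 1.
def iir(xs, a):                 # original single-channel filter
    y, out = 0, []
    for x in xs:
        y = x + a * y
        out.append(y)
    return out

def iir_2slow(xs, a):           # 2-slow version: two D-elements
    y1 = y2 = 0
    out = []
    for x in xs:
        y = x + a * y2          # uses the value from 2 cycles ago
        out.append(y)
        y1, y2 = y, y1
    return out

left, right = [1, 2, 3], [4, 5, 6]
out = iir_2slow([1, 4, 2, 5, 3, 6], 2)   # interleaved input
print(out[0::2] == iir(left, 2), out[1::2] == iir(right, 2))  # True True
```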
A 2-slow lattice filter
• A 100-stage lattice filter with a critical path comprising 2 multiplications + 101 additions
Lattice filter cntd.
• Result 2-slow transformation after retiming:
• The critical path now equals 2 multiplications + 2 additions (was 2 multiplications + 101 additions).
• With Tadd = 2 nsec and Tmul = 5 nsec this corresponds to a reduction from 212 nsec to 14 nsec. (A good value, despite the 50% hardware utilization.)
PARALLEL PROCESSING
Parallel processing
• We follow Parhi in using the term parallel processing in a quite specific sense: computing L outputs simultaneously.
• This is also known as block processing (with block-size L), or MIMO processing (Multiple Input, Multiple Output), as opposed to SISO processing (Single Input, Single Output):
• Objective: improve throughput with a factor L.
Parallel processing: example
Example (3-tap FIR filter), SISO version: • y(n) = ax(n) + bx(n-1) + cx(n-2) MIMO version, block size L=3: • y(3k) = ax(3k ) + bx(3(k-1)+2)+ cx(3(k-1)+1) • y(3k+1) = ax(3k+1) + bx(3k ) + cx(3(k-1)+2) • y(3k+2) = ax(3k+2) + bx(3k+1) + cx(3k )
Notes: • L output values become available every clock cycle. • The number of multipliers and adders increases L-fold.
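The MIMO equations above can be checked against the SISO filter with a short sketch (zero initial state; function names are illustrative):

```python
# Block-processing (MIMO) 3-tap FIR, block size L = 3: each iteration
# consumes (x(3k), x(3k+1), x(3k+2)) and produces the three outputs.
def fir3_blocks(blocks, a, b, c):
    x1 = x2 = 0   # x(3(k-1)+2) and x(3(k-1)+1): the 2 D-elements
    out = []
    for x0, xp1, xp2 in blocks:
        out += [a * x0  + b * x1  + c * x2,    # y(3k)
                a * xp1 + b * x0  + c * x1,    # y(3k+1)
                a * xp2 + b * xp1 + c * x0]    # y(3k+2)
        x1, x2 = xp2, xp1
    return out

# Impulse response reappears across block boundaries:
print(fir3_blocks([(1, 0, 0), (0, 0, 0)], 1, 2, 3))  # [1, 2, 3, 0, 0, 0]
```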
Parallel processing: example cntd
x(3k) × a × b × c
+ + y(3k)
x(3k+1) × a × b × c
+ + y(3k+1)
x(3k+2) × a × b × c
+ + y(3k+2)
D
D
Only 2 D-elements!
Parallel processing
• A MIMO implementation (block size L) can be obtained from a SISO implementation by means of L-unfolding.
• Pipelining and unfolding are “dual” techniques (Parhi): if one can be applied (with benefit), so can the other.
• Costs and benefits of pipelining and unfolding differ; often the best result is obtained by applying a combination of both.
Unfolding, L=2
• Parhi’s paper, Fig 1/2, paper p123/124
• y(n) = ax(n) + bx(n-1) + cx(n-2)
• y(2k) = ax(2k) + bx(2k-1) + cx(2k-2) • y(2k+1) = ax(2k+1) + bx(2k) + cx(2k-1)
• Rewrite all indices in the equations to the form L(k - i) + j, with 0 ≤ j < L. The result is a graph that consumes a block of L samples per iteration, x(Lk) … x(Lk+L-1).
• y(2k) = ax(2k) + bx(2(k-1)+1) + cx(2(k-1)) • y(2k+1) = ax(2k+1) + bx(2k) + cx(2(k-1)+1) = Fig 2
Unfolding, L=3
• Same FIR
• y(3k) = ax(3k ) + bx(3k-1) + cx(3k-2) • y(3k+1) = ax(3k+1) + bx(3k ) + cx(3k-1) • y(3k+2) = ax(3k+2) + bx(3k+1) + cx(3k )
• Rewrite all indices in equations to the form (L(k - i) + j), with 0 ≤ j < L
• y(3k) = ax(3k ) + bx(3(k-1)+2)+ cx(3(k-1)+1) • y(3k+1) = ax(3k+1) + bx(3k ) + cx(3(k-1)+2) • y(3k+2) = ax(3k+2) + bx(3k+1) + cx(3k )
Parallel processing: example cntd
x(3k) × a × b × c
+ + y(3k)
x(3k+1) × a × b × c
+ + y(3k+1)
x(3k+2) × a × b × c
+ + y(3k+2)
D
D
Only 2 D-elements!
Conversion from samples to blocks and v.v.
• A serial-to-parallel converter
• A parallel-to-serial converter
• D-elements operate at sample period T/L
• A block shifts in/out each interval T;
• Switches must be operated accordingly
Impact of unfolding on delays
• delay element in input graph (before unfolding):
that implies a delay by 1 sample time.
• delay element in unfolded graph:
that implies a delay by L sample times.
(figure: a D-element maps x(n) to x(n-1); in the unfolded graph, a D-element maps x(Lk) to x(L(k-1)) = x(Lk-L))
Parhi 5, slide 3 (Fig 5.3, pp 123)
Original program: v(n) = u(n-37)
4-unfolded version: v(4k) = u(4k-37)
v(4k+1) = u(4k-36)
v(4k+2) = u(4k-35)
v(4k+3) = u(4k-34)
4-unfolded version: v(4k) = u(4(k-10) +3)
v(4k+1) = u(4(k-9))
v(4k+2) = u(4(k-9)+1)
v(4k+3) = u(4(k-9)+2)
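The index rewriting above is exactly floor division: for output line j, write j - 37 = 4*q + r with 0 ≤ r < 4, so that v(4k+j) = u(4(k+q)+r). A small sketch (the function name is illustrative):

```python
# For each output line j of the L-unfolded delay v(n) = u(n - D),
# return (i, r) such that v(Lk + j) = u(L(k - i) + r), 0 <= r < L.
def unfold_delay(D, L):
    # Python's divmod floors toward -infinity, which is what we need.
    return [(-q, r) for q, r in (divmod(j - D, L) for j in range(L))]

# D = 37, L = 4 reproduces the slide:
print(unfold_delay(37, 4))  # [(10, 3), (9, 0), (9, 1), (9, 2)]
```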
ASSIGNMENTS T1 AND T2
T1: FIR assignment
• Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-3)
• Assume add and multiply times: 2 and 5 nsec resp.
1) Draw DFG of FIR, calculate throughput.
2) Pipeline and retime FIR for maximal throughput.
3) Unfold FIR J=2; draw the unfolded DFG. Max throughput?
4) Pipeline and retime unfolded FIR; draw DFG. Max throughput?
5) Same for J=3 (draw DFG), and J=16 (no need to draw DFGs). Max throughputs?
• Return deadline: Thursday April 28, 13:45
T2: IIR assignment
• Consider IIR: y(n) = x(n) + a*y(n-2)
• Assume add and multiply times: 2 and 5 nsec resp.
1) Draw DFG of IIR, calculate throughput.
2) Pipeline and retime IIR for maximal throughput.
3) Unfold IIR J=2; draw the unfolded DFG. Max throughput?
4) Pipeline and retime unfolded IIR; draw DFG. Max throughput?
5) Same for J=3 (draw DFG), and J=16 (no need to draw DFGs). Max throughputs?
• Return deadline: Thursday April 28, 13:45
THANK YOU