BeiHang Short Course, Part 4 - Carnegie Mellon...
Transcript of BeiHang Short Course, Part 4 - Carnegie Mellon...
9/24/2014
BeiHang Short Course, Part 4: Spiral: Domain‐Specific Synthesis
James C. Hoe
Department of ECE
Carnegie Mellon University
J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L4‐s1
Collaborators: Peter Milder (SUNY Stony Brook), Markus Pueschel (ETH), Franz Franchetti
Peter A. Milder, et al., “Computer Generation of Hardware for Linear Digital Signal Processing Transforms,” ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 17, Issue 2, pp 15:1‐33, April 2012.
Conflict between High‐Level and Generality
• high‐level: tools know better than you
• RTL synthesis: general‐purpose, but special handling of structures like FSM, arith, etc.
• place‐and‐route: works the same no matter what design
The SPIRAL Project
• High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc.) are an important class of design problems
• Hand design and tuning is tricky and expensive
– needs both math and implementation knowledge
– time‐consuming and tedious
– needs to repeat effort for every new context
• SPIRAL research goal: A flexible push‐button design generator that produces SW & HW implementations comparable with expert hand design
Why we can do better than hand design
• SPIRAL is only focused on linear DSP transforms
• These transforms are highly structured, highly regular and very well understood mathematically
• Algorithmic implementations of a transform can be enumerated following a known set of rules
• For a given objective function and mapping target, a computer generates a solution at least as good as the best human effort by trying enough implementations
SPIRAL Framework
Input to SPIRAL: "I want a DFT of size 1024 on a {Xilinx, P4, Cell, ...}"
[Figure: SPIRAL design flow, annotated "automation starts here" (automating the problem) and "where most tools begin"]
Principle 1: Domain knowledge in the system
Principle 2: Optimization at a high level of abstraction
www.spiral.net/hardware/dftgen.html
Outline
• SPIRAL Formula Framework
• SPIRAL for HW FFT cores
• SPIRAL for HW FFT “un”‐core
Linear Transforms
• Linear transform is a matrix‐vector multiplication
– computing by definition takes $O(N^2)$ operations
– the matrix has structure
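• E.g. discrete Fourier transform: $y = DFT_N\, x$

$$
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_j \\ \vdots \\ y_{N-1} \end{bmatrix}
=
\Big[\, e^{-2\pi i\, jk/N} \,\Big]_{j,k \,=\, 0,\ldots,N-1}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_k \\ \vdots \\ x_{N-1} \end{bmatrix}
$$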
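As a point of reference for the $O(N^2)$ cost, the definition can be evaluated directly as a dense matrix-vector product. This is only a minimal sketch: the helper names `dft_matrix` and `matvec` are illustrative, and the sign convention $e^{-2\pi i\, jk/N}$ is assumed.

```python
import cmath

def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*j*k/n) (assumed sign convention)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def matvec(m, x):
    """Plain matrix-vector product: one multiply-add per matrix entry, so n*n in total."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

x = [1, 2, 3, 4]
y = matvec(dft_matrix(4), x)
print([complex(round(v.real, 9), round(v.imag, 9)) for v in y])
```

Every output element touches every input element, which is where the $N^2$ operation count comes from.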
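“Fast” Algorithms
• a “fast” algorithm factors the matrix into a sequence of structured, sparse matrices
  ⇒ cheaper sparse multiplies: $O(N \log N)$ operations
• E.g. Cooley‐Tukey Factorization of $DFT_4$

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}
=
\begin{bmatrix} 1 & & 1 & \\ & 1 & & 1 \\ 1 & & -1 & \\ & 1 & & -1 \end{bmatrix}
\begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & -i \end{bmatrix}
\begin{bmatrix} 1 & 1 & & \\ 1 & -1 & & \\ & & 1 & 1 \\ & & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & & & \\ & & 1 & \\ & 1 & & \\ & & & 1 \end{bmatrix}
$$

• Matrix formula representation

$$DFT_4 = (DFT_2 \otimes I_2)\, D^4_2\, (I_2 \otimes DFT_2)\, L^4_2$$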
Factorization Rules
E.g. Cooley‐Tukey
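$$DFT_{mn} = (DFT_m \otimes I_n)\, D^{mn}_n\, (I_m \otimes DFT_n)\, L^{mn}_m$$

– $DFT_2$ is $\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$
– D is a diagonal matrix of twiddle factors
– L is a stride permutation matrix
– $A \otimes B = [a_{j,k} B]$ is the tensor (or Kronecker) product, e.g.,

$$
A \otimes I_2 =
\begin{bmatrix} a_{0,0} & 0 & a_{0,1} & 0 \\ 0 & a_{0,0} & 0 & a_{0,1} \\ a_{1,0} & 0 & a_{1,1} & 0 \\ 0 & a_{1,0} & 0 & a_{1,1} \end{bmatrix},
\qquad
I_n \otimes B =
\begin{bmatrix} B & & 0 \\ & \ddots & \\ 0 & & B \end{bmatrix}
$$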
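The Cooley‐Tukey rule can be checked numerically by building each factor as a dense matrix and comparing the product against the DFT by definition. This is a sketch under stated assumptions: the index conventions chosen for the stride permutation (`stride_perm`) and the twiddle diagonal (`twiddle`) are one consistent choice that makes the identity hold, not necessarily the exact conventions used inside SPIRAL, and all helper names are illustrative.

```python
import cmath

def dft(n):
    """DFT_n by definition, entries exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)] for j in range(n)]

def identity(n):
    return [[1 if j == k else 0 for k in range(n)] for j in range(n)]

def kron(a, b):
    """Tensor (Kronecker) product: block matrix [a[j][k] * b]."""
    return [[a[j][k] * b[p][q] for k in range(len(a[0])) for q in range(len(b[0]))]
            for j in range(len(a)) for p in range(len(b))]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def stride_perm(size, stride):
    """Assumed convention: output position r*(size//stride)+q reads input q*stride+r."""
    rows = size // stride
    p = [[0] * size for _ in range(size)]
    for r in range(stride):
        for q in range(rows):
            p[r * rows + q][q * stride + r] = 1
    return p

def twiddle(m, n):
    """Assumed convention: diagonal entry at position r*n+k is w**(r*k), w = exp(-2*pi*i/(m*n))."""
    size = m * n
    d = [[0] * size for _ in range(size)]
    for r in range(m):
        for k in range(n):
            d[r * n + k][r * n + k] = cmath.exp(-2j * cmath.pi * r * k / size)
    return d

def cooley_tukey(m, n):
    """(DFT_m (x) I_n) D (I_m (x) DFT_n) L, multiplied out as one dense matrix."""
    f = matmul(kron(dft(m), identity(n)), twiddle(m, n))
    f = matmul(f, kron(identity(m), dft(n)))
    return matmul(f, stride_perm(m * n, m))

def max_error(m, n):
    a, b = cooley_tukey(m, n), dft(m * n)
    return max(abs(a[j][k] - b[j][k]) for j in range(m * n) for k in range(m * n))

print(max_error(2, 2), max_error(2, 4), max_error(4, 2))
```

The dense product is only for checking; the point of the factorization is that each factor is sparse and structured, so a real implementation never forms these matrices.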
“Fast” Fourier Transform Algorithms
• Recursively factorize by the Cooley‐Tukey rule until only leaf cases remain (e.g. DFTr for radix‐r)
• Exponential number of alternatives
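$$DFT_8 = (DFT_2 \otimes I_4)\, D^8_4\, (I_2 \otimes DFT_4)\, L^8_2$$
$$\phantom{DFT_8} = (DFT_2 \otimes I_4)\, D^8_4\, \big(I_2 \otimes (DFT_2 \otimes I_2)\, D^4_2\, (I_2 \otimes DFT_2)\, L^4_2\big)\, L^8_2$$

• Each ruletree corresponds to a different algorithm
• All cost $O(N \log N)$
[Figure: two ruletrees for $DFT_8$: $DFT_8 \to (DFT_2,\ DFT_4 \to (DFT_2, DFT_2))$ and $DFT_8 \to (DFT_4 \to (DFT_2, DFT_2),\ DFT_2)$]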
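To get a feel for how quickly the alternatives multiply, the sketch below enumerates ruletrees under a deliberately restricted assumption: only the Cooley‐Tukey rule is applied, sizes are powers of two, and $DFT_2$ is the only leaf. SPIRAL applies a much larger rule set, so its own counts are larger; the function name `ruletrees` and the tuple encoding are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ruletrees(n):
    """All ruletrees for DFT_n using only Cooley-Tukey splits n = m * (n/m), leaf DFT_2."""
    if n == 2:
        return ("DFT2",)
    trees = []
    m = 2
    while m < n:
        # Split DFT_n into a left child DFT_m and a right child DFT_(n/m).
        for left in ruletrees(m):
            for right in ruletrees(n // m):
                trees.append((f"DFT{n}", left, right))
        m *= 2
    return tuple(trees)

for size in (2, 4, 8, 16, 32, 64):
    print(size, len(ruletrees(size)))
```

Even with this single rule the number of trees follows the Catalan numbers, so it grows exponentially in $\log_2 N$; the two trees found for size 8 are the two ruletrees of the $DFT_8$ example.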
A System of Transforms and Rules
[Equations: a sample of breakdown rules in matrix‐formula notation, covering DCT (types I, II and IV), DST, DFT, WHT, DWT and filter/circulant transforms]
50+ transforms, 150+ rules
Algorithmic Design Space
size    # of DFT            # of DCT‐IV
2       1                   1
4       6                   10
8       40                  126
16      296                 31242
32      27744               1924443362
64      162570361280        7343815121631354242
128     ~1.01 × 10^27       ~1.07 × 10^38
256     ~2.31 × 10^61       ~2.30 × 10^76
512     ~2.86 × 10^133      ~1.06 × 10^153
Different characteristics: data flow, numerical stability, operation ordering, working set size, datapath regularity
Design Space: SW DCT 32 on P4
Histogram of 10,000 randomly selected algorithms
[Figure: histogram by runtime (P4, 3.2 GHz); histogram by numerical accuracy]
Outline
• SPIRAL Formula Framework
• SPIRAL for HW FFT cores
• SPIRAL for HW FFT “un”‐core
Formula to HW (Combinational)
• Given $y = F x$, where $F$ is:
– $A \cdot B$: apply $B$, then $A$
– a permutation $P$: permute
– $I_n \otimes A$: apply $A$, $n$ times in parallel
– a diagonal $D$: scale
[Figure: datapath blocks for each case, e.g. B followed by A, and parallel copies of A]
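The four cases above amount to a small interpreter over formula trees. The sketch below evaluates a formula on a vector in software, mirroring how each construct would become wiring or logic in a combinational datapath. The tuple encoding (`"compose"`, `"perm"`, `"diag"`, `"tensor_i"`, `"dft2"`) is hypothetical, and $DFT_2 \otimes I_2$ is expressed as $L\,(I_2 \otimes DFT_2)\,L$ so that only the listed cases are needed.

```python
def apply_formula(formula, x):
    """Evaluate a formula tree on vector x; factors of a product are applied right to left."""
    kind = formula[0]
    if kind == "compose":                 # A . B . C : apply C, then B, then A
        for factor in reversed(formula[1:]):
            x = apply_formula(factor, x)
        return x
    if kind == "perm":                    # permutation: output i takes input src[i] (pure wiring)
        return [x[s] for s in formula[1]]
    if kind == "diag":                    # diagonal: scale each element
        return [d * v for d, v in zip(formula[1], x)]
    if kind == "tensor_i":                # I_n (x) A : n copies of A on contiguous blocks
        _, n, a = formula
        size = len(x) // n
        out = []
        for blk in range(n):
            out.extend(apply_formula(a, x[blk * size:(blk + 1) * size]))
        return out
    if kind == "dft2":                    # leaf butterfly
        return [x[0] + x[1], x[0] - x[1]]
    raise ValueError(f"unknown formula node: {kind}")

L42 = ("perm", [0, 2, 1, 3])              # stride permutation L^4_2
BUTTERFLIES = ("tensor_i", 2, ("dft2",))  # I_2 (x) DFT_2
D42 = ("diag", [1, 1, 1, -1j])            # twiddle diagonal, assuming w = exp(-2*pi*i/4)

# DFT_4 = (DFT_2 (x) I_2) D (I_2 (x) DFT_2) L, with DFT_2 (x) I_2 = L (I_2 (x) DFT_2) L
DFT4 = ("compose", L42, BUTTERFLIES, L42, D42, BUTTERFLIES, L42)

y = apply_formula(DFT4, [1, 2, 3, 4])
print(y)
```

In hardware the `"perm"` case costs only wires, the `"diag"` case costs multipliers, and `"tensor_i"` replicates a block, which is why the formula already describes the structure of the datapath.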
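DFT8 Datapath Example
$$DFT_8 = (DFT_2 \otimes I_4)\, D^8_4\, \big(I_2 \otimes (DFT_2 \otimes I_2)\, D^4_2\, (I_2 \otimes DFT_2)\, L^4_2\big)\, L^8_2$$
(formula is applied from right to left)
[Figure: combinational datapath for $DFT_8$; multipliers are marked with x]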
Pease DFT8 Example
[Figure: Pease $DFT_8$ datapath with stage 1, stage 2 and stage 3; multipliers are marked with x]
How about good HW?
• Matrix formulas have a natural mapping to dataflow and hence combinational datapath
• However, real hardware designs must fit a given resource constraint
  ⇒ sequential datapaths that reuse available HW
– identify repeated kernels
– instantiate kernels under resource constraints
– schedule computation to reuse instantiated kernels
We want to do the analysis and mapping at formula level, with high‐level algorithm knowledge
Tensor as Streaming Parallelism
[Figure: the same tensor product implemented as fully parallel, partially streamed, and fully streamed]
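The trade‐off between parallel and streamed implementations of a tensor product can be mimicked in software: $I_n \otimes A$ needs $n$ applications of $A$, and the number of instantiated copies of $A$ decides how many cycles that takes. This is a rough model under assumptions of my own (one pass through a kernel per cycle, `units` dividing $n$, and a single kernel instance standing in for "fully streamed"); the names are illustrative.

```python
def butterfly(block):
    """DFT_2 kernel, used here as the repeated block A."""
    a, b = block
    return [a + b, a - b]

def streamed_tensor(kernel, block_size, x, units):
    """Compute (I_n (x) A) x using only `units` kernel instances.

    Each modeled cycle consumes units * block_size words. Returns (result, cycles).
    """
    n = len(x) // block_size
    if n % units != 0:
        raise ValueError("units must divide the number of blocks")
    out, cycles = [], 0
    for first in range(0, n, units):       # one cycle per group of blocks
        for u in range(units):             # these instances run side by side in hardware
            start = (first + u) * block_size
            out.extend(kernel(x[start:start + block_size]))
        cycles += 1
    return out, cycles

x = list(range(8))
for units in (4, 2, 1):                    # all 4 kernels, 2 kernels, 1 kernel
    y, cycles = streamed_tensor(butterfly, 2, x, units)
    print(units, cycles, y)
```

The result is identical in all three cases; only the cost (kernel instances) and the number of cycles change.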
Pease DFT Example: DFT8
[Figure: Pease $DFT_8$ datapath with stage 1, stage 2 and stage 3; multipliers are marked with x]
Pease DFT Example: DFT8 Streaming
[Figure: streaming Pease $DFT_8$ datapath with stage 1, stage 2 and stage 3; the permutations are implemented as streaming permutation blocks $f(L^8_2)$ and $f(R_8)$, and multipliers are marked with x]
Regular Structure for HW
• Simple regular structure embodied in Pease FFT
• Example: [figure]
Formally representing horizontal reuse
[Figure: three versions of the same datapath: not horizontally reused, partially horizontally reused, and horizontally reused (tagged "hr" in the formula)]
Iterative Reuse of Logic
[Figure: generated designs plotted by cost and latency]
Fine‐grained control over cost/latency tradeoff
Example: rewriting rules for streaming reuse
Applicability to other transforms?
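• DFT radix 2
$$DFT_{2^k} = \Big( \prod_{i=0}^{k-1} L^{2^k}_{2}\, (I_{2^{k-1}} \otimes DFT_2)\, T_i \Big)\, R_{2^k}$$
• DFT radix $2^r$
$$DFT_{2^k} = \Big( \prod_{i=0}^{k/r-1} L^{2^k}_{2^r}\, (I_{2^{k-r}} \otimes DFT_{2^r})\, T_i \Big)\, R^{2^k}_{2^r}$$
• 2‐D DFT $n \times n$
$$\prod_{i=0}^{1} L^{n^2}_{n}\, (I_n \otimes DFT_n)$$
• WHT
$$WHT_{2^k} = \prod_{i=0}^{k/r-1} L^{2^k}_{2^r}\, (I_{2^{k-r}} \otimes WHT_{2^r})$$
• DCT (type II): [formula garbled in the source; a product over stages built from permutations $P$, $L$ and sparse factors $A$, $D$]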
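The WHT case is the easiest of these to check in software, because it has no twiddle factors and no final bit reversal. The sketch below takes the radix‐2 case ($r = 1$), $WHT_{2^k} = \prod_{i=0}^{k-1} L^{2^k}_2 (I_{2^{k-1}} \otimes WHT_2)$, applies the same stage $k$ times, and compares against the WHT by definition. The helper names are illustrative, and the stride permutation is assumed to gather even‐indexed elements first.

```python
def wht_direct(x):
    """WHT by definition: matrix entry (j, k) is (-1) ** popcount(j & k)."""
    n = len(x)
    return [sum(v if bin(j & k).count("1") % 2 == 0 else -v
                for k, v in enumerate(x)) for j in range(n)]

def stage(x):
    """One stage L^N_2 (I_{N/2} (x) WHT_2): butterflies on pairs, then a stride-2 permutation."""
    y = []
    for b in range(0, len(x), 2):
        y += [x[b] + x[b + 1], x[b] - x[b + 1]]
    return y[0::2] + y[1::2]          # assumed convention: even positions first, then odd

def wht_iterative(x):
    """Apply the identical stage k times, for len(x) == 2**k."""
    k = len(x).bit_length() - 1
    for _ in range(k):
        x = stage(x)
    return x

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(wht_iterative(data))
print(wht_direct(data))
```

Because every stage is identical, a single hardware stage can be reused $k$ times, which is the regularity the Pease‐style formulas are meant to expose.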
FPGA: Area vs. Throughput
[Figure: area (slices) vs. throughput of generated designs, with the Pareto‐optimal points marked; annotation: "49x slices, 132x throughput"]
Outline
• SPIRAL Formula Framework
• SPIRAL for HW FFT cores
• SPIRAL for HW FFT “un”‐core
A 2D‐FFT Algorithm
• Row‐column algorithm:
$$\text{2D-}DFT_{n \times n} = \underbrace{(DFT_n \otimes I_n)}_{\text{column stage}}\ \underbrace{(I_n \otimes DFT_n)}_{\text{row stage}}$$
[Figure: dataset (logical abstraction of the 2D dataset)]
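A small software model of the row‐column algorithm: 1D DFTs over every row, then 1D DFTs over every column, checked against the 2D DFT by definition. This is a sketch with illustrative names and an assumed $e^{-2\pi i\,\cdot/n}$ sign convention; it uses the $O(n^2)$ 1D DFT for brevity where a real design would use an FFT kernel.

```python
import cmath

def dft(x):
    """1D DFT by definition (stand-in for the DFT_n kernel)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def dft2d_row_column(a):
    """Row stage (I_n (x) DFT_n), then column stage (DFT_n (x) I_n)."""
    rows = [dft(row) for row in a]                    # transform each row
    cols = [dft(list(col)) for col in zip(*rows)]     # transform each column
    return [list(r) for r in zip(*cols)]              # back to row-major order

def dft2d_direct(a):
    """2D DFT by definition, for checking only."""
    n = len(a)
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (u * r + v * c) / n)
                 for r in range(n) for c in range(n))
             for v in range(n)] for u in range(n)]

image = [[1, 2, 3, 4], [0, 1, 0, 1], [5, 0, 2, 1], [3, 3, 1, 0]]
fast = dft2d_row_column(image)
ref = dft2d_direct(image)
print(max(abs(fast[u][v] - ref[u][v]) for u in range(4) for v in range(4)))
```

The column stage is the one that reads the row‐major dataset at a stride of $n$.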
Off‐chip Data Sets
[Figure: off‐chip DRAM, $DFT_n$ kernel, on‐chip SRAM]
• Need to balance
– kernel processing bandwidth
– off‐chip memory bandwidth
– on‐chip storage capacity
Inefficient DRAM Access Patterns
• Row‐wise traversal ‐> Sequential accesses
• Column‐wise traversal ‐> Large strided accesses
[Figure: row‐major 2D array ($n \times n$) mapped onto linear memory space from 0 to $n^2$, with the DRAM row buffer size marked]
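The difference between the two traversals can be made concrete with a toy DRAM model that counts how often an access falls outside the currently open DRAM row. The model is a deliberate simplification (one bank, one open row, no timing), and the sizes `n = 256` and `row_buffer_words = 256` are hypothetical.

```python
def row_buffer_misses(addresses, row_buffer_words):
    """Count accesses that need a different DRAM row than the one currently open."""
    misses, open_row = 0, None
    for addr in addresses:
        row = addr // row_buffer_words
        if row != open_row:
            misses += 1
            open_row = row
    return misses

n, row_buffer_words = 256, 256     # hypothetical: one matrix row fills one DRAM row

# Addresses of a row-major n x n array, visited row-wise and column-wise.
row_wise = [r * n + c for r in range(n) for c in range(n)]
col_wise = [r * n + c for c in range(n) for r in range(n)]

print("row-wise misses:", row_buffer_misses(row_wise, row_buffer_words))
print("column-wise misses:", row_buffer_misses(col_wise, row_buffer_words))
```

With these sizes the row‐wise traversal opens each DRAM row once, while the column‐wise traversal opens a new DRAM row on every single access.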
How to Optimize the Access Patterns
[Figure: row‐major “blocked” layout: the $n \times n$ array is divided into $k \times k$ tiles, and each tile of $k^2$ elements is stored contiguously in linear memory space (0 to $n^2$) in row‐buffer‐sized chunks]
[Akin, et al., FCCM 2012]
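A blocked layout can be sketched as an address mapping plus the same toy row‐buffer model. The mapping below (tiles in row‐major order, elements row‐major inside each tile) and the sizes `n = 256`, `k = 16` are assumptions chosen so that one tile exactly fills one row buffer; they are not taken from the cited paper.

```python
def blocked_address(r, c, n, k):
    """Address of element (r, c) when an n x n array is stored as contiguous k x k tiles."""
    tiles_per_row = n // k
    tile = (r // k) * tiles_per_row + (c // k)       # tiles laid out in row-major order
    return tile * k * k + (r % k) * k + (c % k)      # row-major inside the tile

def row_buffer_misses(addresses, row_buffer_words):
    """Count accesses that need a different DRAM row than the one currently open."""
    misses, open_row = 0, None
    for addr in addresses:
        row = addr // row_buffer_words
        if row != open_row:
            misses += 1
            open_row = row
    return misses

n, k = 256, 16                      # hypothetical: k * k = 256 words = one row buffer

def tile_traversal(column_wise):
    """Visit whole tiles one at a time, in row-wise or column-wise tile order."""
    t = n // k
    order = [(tr, tc) for tc in range(t) for tr in range(t)] if column_wise else \
            [(tr, tc) for tr in range(t) for tc in range(t)]
    return [blocked_address(r, c, n, k)
            for tr, tc in order
            for r in range(tr * k, (tr + 1) * k)
            for c in range(tc * k, (tc + 1) * k)]

print("tiles row-wise:", row_buffer_misses(tile_traversal(False), k * k))
print("tiles column-wise:", row_buffer_misses(tile_traversal(True), k * k))
```

In this model both traversal directions open each DRAM row exactly once, at the price of needing on‐chip storage to reassemble rows or columns from tiles.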
Design Generator w/ Tensor Formalism
• row‐column algorithm: row stage, column stage
• symmetric algorithm
• symmetric algorithm with tiling: read tiles row‐wise, linearize on‐chip, FFT processing, transpose and re‐tile on‐chip, write tiles column‐wise
[Figure: tensor formula for each of the three algorithms]
[Akin, et al., FCCM 2012]
2D‐FFT (double) Raw Performance
[Figure: raw performance vs. problem size] [Akin, et al., FCCM 2012]
2D‐FFT (double) BW Efficiency
[Figure: bandwidth efficiency vs. problem size] [Akin, et al., FCCM 2012]
2D‐FFT (double) Power Efficiency
[Figure: power efficiency vs. problem size] [Akin, et al., FCCM 2012]
Recap the Last Hour
• Encapsulating domain knowledge in a domain specific tool for truly high‐level design automation
• SPIRAL
– mathematical approach to DSP transform implementation (cores and “un”‐core)
– generalizable to other linear DSP transforms
– as good as best expert designer
An Aside on Permutations
• Permutation: a fixed reordering of a given number of data points
• Example: P16
[Figure: 16 point input vector mapped to 16 point output vector]
Efficient streaming permutation is the secret to efficient high‐performance FFT in hardware
Permuting Streaming Data
• streaming permutation: data reordered in space & time
• Example: P16 with streaming width 4
[Figure: streaming data vector, 4 words per cycle for 4 cycles; inputs enter on ports 0 to 3 during cycles 0 to 3, outputs leave during cycles d to d+3]
Requirements:
• scale with permutation size n
• scale with streaming width w
• support full throughput: w words per cycle
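What "reordered in space & time" means can be tabulated for a concrete case. The slide does not say which permutation P16 is, so the sketch below assumes a stride permutation on 16 points as a stand‐in; `schedule` and `min_delay` are illustrative helpers. Word $i$ of the input is taken to arrive in cycle $i \operatorname{div} w$ on port $i \bmod w$, and output word $j$ to leave in cycle $d + (j \operatorname{div} w)$ on port $j \bmod w$.

```python
def schedule(src, w):
    """For out[j] = in[src[j]], list (in_cycle, in_port, out_cycle_offset, out_port) per output."""
    return [(s // w, s % w, j // w, j % w) for j, s in enumerate(src)]

def min_delay(src, w):
    """Smallest d such that each word has arrived by cycle d + out_cycle_offset."""
    return max(in_c - out_c for in_c, _, out_c, _ in schedule(src, w))

n, w = 16, 4
# Assumed example permutation: stride permutation, output j reads input (j % 4) * 4 + j // 4.
src = [(j % 4) * 4 + j // 4 for j in range(n)]

for j, (in_c, in_p, out_c, out_p) in enumerate(schedule(src, w)):
    print(f"out[{j:2}] <- in[{src[j]:2}]: cycle {in_c}, port {in_p} -> cycle d+{out_c}, port {out_p}")
print("minimum delay d:", min_delay(src, w))
```

For this permutation almost every word changes both its port (space) and its cycle (time), so wiring alone is not enough: words have to be buffered until their output cycle.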
Naive Implementation
• Perform streaming permutation with one many‐ported RAM
[Figure: w inputs per cycle into one many‐ported RAM, w outputs per cycle out]
• A 2w‐ported RAM is impractical for w > 1
Solve the above using w 2‐ported RAMs instead?
Using only 2‐ported RAMs
• For 2‐power vector and 2‐power streaming width
[Figure: three‐stage structure: reorder in space, reorder in time, reorder in space]
– solution for “bit” permutations requires 1x storage in w 2‐ported RAMs
– solution for arbitrary permutations requires 2x storage in 2w 2‐ported RAMs
See [Pueschel, JACM’2009] [Milder, DATE’2009]
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/CALCM