
9/24/2014


BeiHang Short Course, Part 4: Spiral: Domain‐Specific Synthesis

James C. Hoe

Department of ECE, Carnegie Mellon University

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L4‐s1

Collaborators: Peter Milder (SUNY Stonybrook), Markus Pueschel (ETH), Franz Franchetti

Peter A. Milder, et al., “Computer Generation of Hardware for Linear Digital Signal Processing Transforms,” ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 17, Issue 2, pp 15:1‐33, April 2012. 

Conflict between High‐Level and Generality

• high‐level: tools know better than you

• RTL synthesis: general‐purpose but special handling of structures like FSM, arith, etc.

• place‐and‐route: works the same no matter what design


The SPIRAL Project

• High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc.) are an important class of design problems

• Hand design and tuning is tricky and expensive

– needs both math and implementation knowledge

– time‐consuming and tedious 

– needs to repeat effort for every new context


• SPIRAL research goal: A flexible push‐button design generator that produces SW & HW implementations comparable with expert hand design

Why we can do better than hand design

• SPIRAL is only focused on linear DSP transforms

• These transforms are highly structured, highly regular and very well understood mathematically

• Algorithmic implementations of a transform can be enumerated following a known set of rules

• For a given objective function and mapping target, a computer generates a solution at least as good as the best human effort by trying enough implementations


SPIRAL Framework

[Figure: SPIRAL framework. Input request: “I want a DFT of size 1024 on a {Xilinx, P4, Cell, ...}”. Annotations: “automation starts here”, “where most tools begin”, “automating the problem”.]

Principle 1: Domain knowledge in the system

Principle 2: Optimization at a high level of abstraction

www.spiral.net/hardware/dftgen.html


Outline

• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐core


Linear Transforms

• Linear transform is a matrix‐vector multiplication

– computing by definition takes $O(N^2)$ operations

– the matrix has structure

• E.g. discrete Fourier transform: $y = \mathrm{DFT}_N\, x$

$$
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_j \\ \vdots \\ y_{N-1} \end{bmatrix}
=
\Big[\, e^{-2\pi i\, jk/N} \,\Big]_{j,k = 0 .. N-1}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_k \\ \vdots \\ x_{N-1} \end{bmatrix}
$$
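A minimal sketch of computing a DFT "by definition" as a dense matrix-vector product, which costs N multiply-adds per output and so O(N^2) in total. The function names are illustrative, and the negative exponent sign is an assumed convention.

```python
import cmath

def dft_matrix(n):
    """n x n DFT matrix with entries exp(-2*pi*i*j*k/n) (assumed sign convention)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def matvec(mat, x):
    """Dense matrix-vector product: len(x) multiply-adds per output element."""
    return [sum(a * b for a, b in zip(row, x)) for row in mat]

def dft_by_definition(x):
    """O(N^2) transform: no structure of the matrix is exploited."""
    return matvec(dft_matrix(len(x)), x)

impulse_response = dft_by_definition([1, 0, 0, 0])   # unit impulse input
constant_response = dft_by_definition([1, 1, 1, 1])  # constant input
```

A unit impulse transforms to a flat spectrum and a constant input concentrates in output 0, which gives a quick sanity check of the definition.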


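“Fast” Algorithms

• A “fast” algorithm factors the matrix into a sequence of structured, sparse matrices

– cheaper sparse multiplies → O(N log(N)) operations

• E.g. Cooley‐Tukey factorization of DFT4

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & -i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

• Matrix formula representation

$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, D^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$$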
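As a quick numerical check, the four sparse factors of the DFT4 example can be multiplied out and compared with the dense DFT4 matrix. This is a sketch that assumes the exp(-2*pi*i/4) = -i sign convention; the variable names are illustrative.

```python
def matmul(a, b):
    """Plain dense matrix product, enough for a 4 x 4 check."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# The four sparse factors, written left to right as in the formula.
DFT2_x_I2 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]
TWIDDLE = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1j]]
I2_x_DFT2 = [[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, 1], [0, 0, 1, -1]]
STRIDE_4_2 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

product = matmul(matmul(matmul(DFT2_x_I2, TWIDDLE), I2_x_DFT2), STRIDE_4_2)

# Dense DFT4 by definition: entry (j, k) is (-i)^(j*k).
DFT4 = [[(-1j) ** (j * k) for k in range(4)] for j in range(4)]

max_error = max(abs(product[r][c] - DFT4[r][c])
                for r in range(4) for c in range(4))
nonzeros_in_factors = sum(1 for m in (DFT2_x_I2, TWIDDLE, I2_x_DFT2, STRIDE_4_2)
                          for row in m for v in row if v != 0)
```

The factors together hold 24 nonzeros, most of them 1 or -1 that need no multiplier, which is where the savings over the dense matrix come from as N grows.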

Factorization Rules

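E.g. Cooley‐Tukey

$$\mathrm{DFT}_{nm} = (\mathrm{DFT}_n \otimes I_m)\, D^{nm}_n\, (I_n \otimes \mathrm{DFT}_m)\, L^{nm}_n$$

– $\mathrm{DFT}_2$ is $\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$

– $D$ is a diagonal matrix of twiddle factors

– $L$ is a stride permutation matrix

– $A \otimes B = [a_{j,k} B]$ is the tensor (or Kronecker) product

e.g., for a 2×2 matrix $A$:

$$
A \otimes I_n = \begin{bmatrix} a_{0,0} I_n & a_{0,1} I_n \\ a_{1,0} I_n & a_{1,1} I_n \end{bmatrix},
\qquad
I_n \otimes B = \begin{bmatrix} B & & \\ & \ddots & \\ & & B \end{bmatrix}
$$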
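The building blocks of the Cooley-Tukey rule can be sketched directly as small matrix constructors and checked against the dense DFT. The helper names (`kron`, `stride_perm`, `twiddle`) are hypothetical, and the index conventions below (stride permutation reads its input at stride n; twiddle entry at position j*m + k is w^(j*k)) are assumptions chosen so that the product works out.

```python
import cmath

def dft(n):
    w = cmath.exp(-2j * cmath.pi / n)          # assumed sign convention
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(a, b):
    """Tensor (Kronecker) product: block (j, k) of the result is a[j][k] * b."""
    rb, cb = len(b), len(b[0])
    return [[a[r // rb][c // cb] * b[r % rb][c % cb]
             for c in range(len(a[0]) * cb)] for r in range(len(a) * rb)]

def stride_perm(size, s):
    """Permutation matrix reading the input at stride s: x[0], x[s], ..., x[1], x[1+s], ..."""
    order = [i for start in range(s) for i in range(start, size, s)]
    return [[1 if c == order[r] else 0 for c in range(size)] for r in range(size)]

def twiddle(n, m):
    """Diagonal matrix whose entry at position j*m + k is w^(j*k), w = exp(-2*pi*i/(n*m))."""
    w = cmath.exp(-2j * cmath.pi / (n * m))
    d = [[0] * (n * m) for _ in range(n * m)]
    for j in range(n):
        for k in range(m):
            d[j * m + k][j * m + k] = w ** (j * k)
    return d

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cooley_tukey(n, m):
    """(DFT_n (x) I_m) * twiddles * (I_n (x) DFT_m) * stride permutation."""
    left = matmul(kron(dft(n), identity(m)), twiddle(n, m))
    right = matmul(kron(identity(n), dft(m)), stride_perm(n * m, n))
    return matmul(left, right)

def max_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

For example, `max_error(cooley_tukey(2, 4), dft(8))` should be at rounding-error level, and the same holds for non-power-of-two splits such as 3 × 5.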


“Fast” Fourier Transform Algorithms

• Recursively factorize by the Cooley‐Tukey rule until only leaf cases remain (e.g. DFTr for radix‐r)

• Exponential number of alternatives 

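$$
\begin{aligned}
\mathrm{DFT}_8 &= (\mathrm{DFT}_2 \otimes I_4)\, D^8_2\, (I_2 \otimes \mathrm{DFT}_4)\, L^8_2 \\
&= (\mathrm{DFT}_2 \otimes I_4)\, D^8_2\, \big(I_2 \otimes (\mathrm{DFT}_2 \otimes I_2)\, D^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2\big)\, L^8_2
\end{aligned}
$$

[Ruletrees: DFT8 → (DFT2, DFT4 → (DFT2, DFT2)) and DFT8 → (DFT4 → (DFT2, DFT2), DFT2)]

• Each ruletree corresponds to a different algorithm

• All cost O(N log(N))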
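To make the growth in alternatives concrete, here is a sketch that enumerates ruletrees for a DFT of size 2^k when only the Cooley-Tukey rule is used and DFT2 is the only leaf. That restriction is an assumption for illustration; SPIRAL's full rule set is larger, so its design space is larger than these counts.

```python
from functools import lru_cache

def ruletrees(k):
    """All Cooley-Tukey ruletrees for size 2**k with DFT2 leaves.
    A tree is the string 'DFT2' or a tuple (size, left_subtree, right_subtree)."""
    if k == 1:
        return ["DFT2"]
    trees = []
    for a in range(1, k):                      # split 2**k = 2**a * 2**(k - a)
        for left in ruletrees(a):
            for right in ruletrees(k - a):
                trees.append((2 ** k, left, right))
    return trees

@lru_cache(maxsize=None)
def count_ruletrees(k):
    """Same count without materializing the trees."""
    if k == 1:
        return 1
    return sum(count_ruletrees(a) * count_ruletrees(k - a) for a in range(1, k))

dft8_trees = ruletrees(3)
counts = [count_ruletrees(k) for k in range(1, 11)]
```

For DFT8 this yields exactly the two ruletrees (split as 2 × 4 or 4 × 2), and the count grows exponentially with k.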

A System of Transforms and Rules

[Formulas: example breakdown rules for DCT‐II, DCT‐IV, DST, DFT, WHT, DWT, and filter/circulant matrices]

• 50+ transforms

• 150+ rules


Algorithmic Design Space

size    # of DFT            # of DCT‐IV
2       1                   1
4       6                   10
8       40                  126
16      296                 31242
32      27744               1924443362
64      162570361280        7343815121631354242
128     ~1.01 × 10^27       ~1.07 × 10^38
256     ~2.31 × 10^61       ~2.30 × 10^76
512     ~2.86 × 10^133      ~1.06 × 10^153

Different characteristics: data flow, numerical stability, operation ordering, working set size, datapath regularity

Design Space: SW DCT 32 on P4

Histogram of 10,000 randomly selected algorithms

[Figure: two histograms: by runtime (P4, 3.2 GHz) and by numerical accuracy]


Outline

• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐core


Formula to HW (Combinational)

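• Given $y = F x$, where $F$ is:

– $A \cdot B$: apply $B$, then $A$

– a permutation → permute

– $I_n \otimes A$: apply $A$, $n$ times in parallel

– a diagonal → scale

[Figure: corresponding datapaths: block B feeding block A; permuted wires; parallel copies of A; constant multipliers (×4, ×2, ×7, ×8)]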
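The mapping from formula constructs to datapath actions can be sketched as a tiny recursive interpreter. The tuple encoding and tag names (`"compose"`, `"perm"`, `"tensor_I"`, `"diag"`, `"matrix"`) are hypothetical and only mirror the four cases listed above.

```python
def apply_formula(formula, x):
    """Evaluate a formula on vector x, mirroring the combinational datapath."""
    kind = formula[0]
    if kind == "matrix":        # leaf kernel given as a dense matrix
        return [sum(a * b for a, b in zip(row, x)) for row in formula[1]]
    if kind == "compose":       # A . B: apply B, then A
        _, a, b = formula
        return apply_formula(a, apply_formula(b, x))
    if kind == "perm":          # permutation: pure wiring, output[i] = x[order[i]]
        return [x[i] for i in formula[1]]
    if kind == "diag":          # diagonal: one constant multiplier per wire
        return [d * v for d, v in zip(formula[1], x)]
    if kind == "tensor_I":      # I_n (x) A: n copies of A on consecutive blocks
        _, n, a = formula
        size = len(x) // n
        out = []
        for blk in range(n):
            out.extend(apply_formula(a, x[blk * size:(blk + 1) * size]))
        return out
    raise ValueError("unknown formula node: %r" % (kind,))

DFT2 = ("matrix", [[1, 1], [1, -1]])
example = ("compose", ("diag", [1, 1, 1, 2]),
           ("compose", ("tensor_I", 2, DFT2), ("perm", [0, 2, 1, 3])))
result = apply_formula(example, [1, 2, 3, 4])
```

Reading `example` from right to left: the input is permuted to [1, 3, 2, 4], two DFT2 kernels produce [4, -2, 6, -2], and the diagonal scales the last wire to give [4, -2, 6, -4].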


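DFT8 Datapath Example

[Figure: combinational DFT8 datapath with multipliers (×)]

$$\mathrm{DFT}_8 = (\mathrm{DFT}_2 \otimes I_4)\, D^8_2\, \big(I_2 \otimes (\mathrm{DFT}_2 \otimes I_2)\, D^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2\big)\, L^8_2$$

(formula is applied from right to left)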

Pease DFT8 Example

[Figure: Pease DFT8 datapath in three stages (stage 1, stage 2, stage 3), with multipliers (×)]


How about good HW?

• Matrix formulas have a natural mapping to dataflow and hence combinational datapath

• However, real hardware designs must fit a given resource constraint

→ sequential datapaths that reuse available HW

– identify repeated kernels

– instantiate kernels under resource constraints

– schedule computation to reuse instantiated kernels

We want to do the analysis and mapping at the formula level, with high‐level algorithm knowledge

Tensor as Streaming Parallelism

[Figure: three datapaths for a tensor product: fully parallel, partially streamed, fully streamed]
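The trade between parallel and streamed implementations of a tensor product can be sketched in software: the same kernel is applied to every block, and the streaming width decides how many kernel instances exist and how many cycles are needed. The cycle model and function names below are assumptions for illustration only.

```python
def dft2(block):
    """2-point butterfly kernel."""
    a, b = block
    return [a + b, a - b]

def tensor_streamed(x, kernel, block, width):
    """Compute (I_n (x) A) x with width // block kernel instances.
    Each modeled cycle consumes `width` words; returns (result, cycles, instances)."""
    assert width % block == 0 and len(x) % width == 0
    out, cycles = [], 0
    for start in range(0, len(x), width):
        chunk = x[start:start + width]
        for k in range(0, width, block):     # these run side by side in hardware
            out.extend(kernel(chunk[k:k + block]))
        cycles += 1
    return out, cycles, width // block

x = list(range(8))
fully_parallel = tensor_streamed(x, dft2, block=2, width=8)   # 4 kernels, 1 cycle
partially_streamed = tensor_streamed(x, dft2, block=2, width=4)
fully_streamed = tensor_streamed(x, dft2, block=2, width=2)   # 1 kernel, 4 cycles
```

All three variants produce the same output vector; only the number of kernel instances (cost) and the number of cycles (throughput) change.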


Pease DFT Example:  DFT8

[Figure: Pease DFT8 datapath in three stages (stage 1, stage 2, stage 3), with multipliers (×)]

Pease DFT Example:  DFT8 Streaming

[Figure: streaming Pease DFT8 datapath in three stages (stage 1, stage 2, stage 3), with multipliers (×) and streaming permutation blocks labeled $f(L^8_2)$ and $f(R_8)$]


Regular Structure for HW

• Simple regular structure embodied in Pease FFT

• Example: [figure]

Formally representing horizontal reuse

[Figure: formula terms tagged “hr” and the corresponding datapaths: not horizontally reused, partially horizontally reused, horizontally reused]


Iterative Reuse of Logic

[Figure: design points plotted by cost vs. latency]

Fine‐grained control over cost/latency tradeoff

Example: rewriting rules for streaming reuse


Applicability to other transforms?

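• DFT radix 2

$$\mathrm{DFT}_{2^k} = R_{2^k} \prod_{i=0}^{k-1} T_i\, (I_{2^{k-1}} \otimes \mathrm{DFT}_2)\, L^{2^k}_{2^{k-1}}$$

• DFT radix $2^r$

$$\mathrm{DFT}_{2^k} = R^{2^k}_{2^r} \prod_{i=0}^{k/r-1} T_i\, (I_{2^{k-r}} \otimes \mathrm{DFT}_{2^r})\, L^{2^k}_{2^{k-r}}$$

• 2‐D $\mathrm{DFT}_{n \times n}$

$$\mathrm{DFT}_{n \times n} = \prod_{i=0}^{1} L^{n^2}_n\, (I_n \otimes \mathrm{DFT}_n)$$

• WHT

$$\mathrm{WHT}_{2^k} = \prod_{i=0}^{k/r-1} (I_{2^{k-r}} \otimes \mathrm{WHT}_{2^r})\, L^{2^k}_{2^{k-r}}$$

• DCT (type II)

[DCT‐II formula garbled in the source]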
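The appeal of these product forms is that every stage is identical, so one stage of hardware can be reused iteratively. A minimal software sketch for the WHT with r = 1 follows, assuming the stride permutation interleaves the two halves of its input (a perfect shuffle); the function names are illustrative.

```python
def shuffle(x):
    """Stride permutation (assumed convention): interleave the two halves of x."""
    half = len(x) // 2
    out = []
    for i in range(half):
        out += [x[i], x[i + half]]
    return out

def butterflies(x):
    """I_{N/2} (x) WHT_2: a 2-point butterfly on each adjacent pair."""
    out = []
    for i in range(0, len(x), 2):
        out += [x[i] + x[i + 1], x[i] - x[i + 1]]
    return out

def wht_iterative(x):
    """Apply the same stage (shuffle, then butterflies) log2(N) times."""
    n = len(x)
    k = n.bit_length() - 1
    assert 1 << k == n, "length must be a power of two"
    for _ in range(k):
        x = butterflies(shuffle(x))
    return x

def wht_by_definition(x):
    """Dense reference: entry (j, i) of the matrix is (-1)^popcount(j & i)."""
    return [sum(v * (-1) ** bin(j & i).count("1") for i, v in enumerate(x))
            for j in range(len(x))]

sample = [1, 2, 3, 4, 5, 6, 7, 8]
iterative_result = wht_iterative(sample)
reference_result = wht_by_definition(sample)
```

Because the loop body never changes, a hardware version can instantiate one stage and feed its output back k times, or unroll some stages, which is the cost/latency knob.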

FPGA: Area vs. Throughput

[Figure: FPGA area (slices) vs. throughput, with the Pareto‐optimal designs marked; annotation: 49x slices, 132x throughput]


Outline

• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐core


A 2D‐FFT Algorithm

• Row‐column algorithm:

[Figure: 2D‐DFT as a row stage and a column stage. Dataset: logical abstraction of the 2D dataset]
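A minimal sketch of the row-column algorithm: 1D DFTs over every row, then 1D DFTs over every column, compared against the 2D DFT computed by definition. The function names and the negative exponent sign are assumptions.

```python
import cmath

def dft(x):
    """1D DFT by definition (assumed sign convention)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dft2d_row_column(a):
    row_stage = [dft(row) for row in a]                        # DFT of each row
    column_stage = [dft(col) for col in transpose(row_stage)]  # DFT of each column
    return transpose(column_stage)

def dft2d_by_definition(a):
    n = len(a)
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (u * r + v * c) / n)
                 for r in range(n) for c in range(n))
             for v in range(n)] for u in range(n)]

data = [[r * 4 + c for c in range(4)] for r in range(4)]
fast = dft2d_row_column(data)
slow = dft2d_by_definition(data)
worst = max(abs(fast[u][v] - slow[u][v]) for u in range(4) for v in range(4))
```

The column stage touches the data with a stride of one full row, which is what makes the memory access pattern the main concern once the dataset lives off-chip.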


Off‐chip Data Sets

[Figure: off‐chip DRAM, on‐chip SRAM, and the DFTn kernel]

• Need to balance

– kernel processing bandwidth

– off‐chip memory bandwidth

– on‐chip storage capacity

Inefficient DRAM Access Patterns

• Row‐wise traversal → sequential accesses

• Column‐wise traversal → large strided accesses

[Figure: an n × n row‐major 2D array laid out in the linear memory space from 0 to n², with the row buffer size marked]
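The cost difference between the two traversals can be illustrated with a deliberately simple DRAM model: a single bank with one open row, where every access to a different DRAM row counts as a miss. The model, sizes, and function names are assumptions for illustration, not measurements.

```python
def row_buffer_misses(addresses, row_buffer_words):
    """Count accesses that land in a DRAM row other than the currently open one."""
    misses, open_row = 0, None
    for addr in addresses:
        row = addr // row_buffer_words
        if row != open_row:
            misses += 1
            open_row = row
    return misses

def row_wise(n):
    """Addresses of a row-major n x n array visited row by row."""
    return [r * n + c for r in range(n) for c in range(n)]

def column_wise(n):
    """Addresses of the same array visited column by column (stride n)."""
    return [r * n + c for c in range(n) for r in range(n)]

N = 64                 # array is N x N words
ROW_BUFFER_WORDS = 64  # assumed DRAM row size, in words

row_misses = row_buffer_misses(row_wise(N), ROW_BUFFER_WORDS)
column_misses = row_buffer_misses(column_wise(N), ROW_BUFFER_WORDS)
```

Under these assumptions the row-wise pass opens each DRAM row once (64 misses), while the column-wise pass misses on every one of its 4096 accesses.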


How to Optimize the Access Patterns 

[Figure: row‐major “blocked” layout: the n × n array is split into k × k tiles, and each tile of k² elements is stored in the linear memory space (0 to n²) in row‐buffer‐sized chunks]

[Akin, et al., FCCM 2012]
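A sketch of a blocked address mapping, assuming each k × k tile is stored contiguously and sized to fill one DRAM row buffer. The mapping function and the single-open-row miss model are illustrative assumptions, not the exact scheme of the cited work.

```python
def tiled_address(r, c, n, k):
    """Address of element (r, c): tiles in row-major order, row-major inside a tile."""
    tiles_per_row = n // k
    tile_index = (r // k) * tiles_per_row + (c // k)
    return tile_index * k * k + (r % k) * k + (c % k)

def tile_traversal(n, k, column_wise=False):
    """Visit whole tiles, walking the tile grid row-wise or column-wise."""
    t = n // k
    addrs = []
    for a in range(t):
        for b in range(t):
            tr, tc = (b, a) if column_wise else (a, b)
            for r in range(tr * k, (tr + 1) * k):
                for c in range(tc * k, (tc + 1) * k):
                    addrs.append(tiled_address(r, c, n, k))
    return addrs

def row_buffer_misses(addresses, row_buffer_words):
    """Single open-row model: a miss whenever the DRAM row changes."""
    misses, open_row = 0, None
    for addr in addresses:
        row = addr // row_buffer_words
        if row != open_row:
            misses += 1
            open_row = row
    return misses

N, K = 64, 8                       # 64 x 64 array, 8 x 8 tiles
ROW_BUFFER_WORDS = K * K           # one tile fills one row buffer (assumption)
tiles_row_wise = row_buffer_misses(tile_traversal(N, K), ROW_BUFFER_WORDS)
tiles_column_wise = row_buffer_misses(tile_traversal(N, K, column_wise=True),
                                      ROW_BUFFER_WORDS)
```

With tiles matched to the row buffer, both traversal directions open each DRAM row exactly once, so the row and column stages see the same memory efficiency.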

Design Generator w/ Tensor Formalism

• row‐column algorithm: row stage, column stage

• symmetric algorithm

• symmetric algorithm with tiling: read tiles row‐wise → linearize on‐chip → FFT processing → transpose and re‐tile on‐chip → write tiles column‐wise

[Akin, et al., FCCM 2012]


2D‐FFT (double) Raw Performance

[Plot: raw performance vs. problem size]

[Akin, et al., FCCM 2012]

2D‐FFT (double) BW Efficiency

[Plot: bandwidth efficiency vs. problem size]

[Akin, et al., FCCM 2012]


2D‐FFT (double) Power Efficiency

[Plot: power efficiency vs. problem size]

[Akin, et al., FCCM 2012]

Recap the Last Hour

• Encapsulating domain knowledge in a domain‐specific tool for truly high‐level design automation

• SPIRAL

– mathematical approach to DSP transform implementation (cores and “un”‐core)

– generalizable to other linear DSP transforms

– as good as best expert designer


An Aside on Permutations

• Permutation: a fixed reordering of a given number of data points 

• Example: P16

[Figure: a 16‐point input vector reordered into a 16‐point output vector]

Efficient streaming permutation is the secret to efficient high‐performance FFT in hardware

Permuting Streaming Data

• streaming permutation: data reordered in space & time

• streaming data vector: 4 words per cycle, 4 cycles

• Example: P16 with streaming width 4

[Figure: four ports (port 0 to port 3); inputs arrive in cycles 0 to 3 and outputs leave in cycles d to d+3]

Requirements:

• scale with permutation size n

• scale with streaming width w

• support full throughput: w words per cycle
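To see why a streaming permutation reorders in both space and time, the sketch below streams a 16-point vector 4 words per cycle and reports, for each output word, the input cycle and port it came from. The particular permutation (a stride-4 read) is a hypothetical stand-in for P16, and the helper names are illustrative.

```python
def stream(vector, w):
    """Split a vector into cycles of w words; word j of a cycle is on port j."""
    return [vector[i:i + w] for i in range(0, len(vector), w)]

N_POINTS, WIDTH = 16, 4
# Hypothetical P16: read the input at stride 4 (0, 4, 8, 12, 1, 5, ...).
ORDER = [i for start in range(4) for i in range(start, N_POINTS, 4)]

x = list(range(N_POINTS))
in_cycles = stream(x, WIDTH)
out_cycles = stream([x[i] for i in ORDER], WIDTH)

def source_of(out_cycle, out_port):
    """(input cycle, input port) that supplies a given output word."""
    idx = ORDER[out_cycle * WIDTH + out_port]
    return idx // WIDTH, idx % WIDTH

first_output_sources = [source_of(0, p) for p in range(WIDTH)]
```

The first output cycle needs one word from each of the four input cycles (a reorder in time, so buffering is unavoidable), and all four arrive on input port 0 but leave on four different ports (a reorder in space).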


Naive Implementation

• Perform streaming permutation with one many‐ported RAM

[Figure: w inputs per cycle → RAM → w outputs per cycle]

• A 2w‐ported RAM is impractical for w>1

Solve the above using w 2‐ported RAMs instead?

Using only 2‐ported RAMs

• For 2‐power vector and 2‐power streaming width

[Figure: reorder in space → reorder in time → reorder in space]

– solution for “bit” permutations requires 1x storage in w 2‐ported RAMs

– solution for arbitrary permutations requires 2x storage in 2w 2‐ported RAMs

See [Pueschel, JACM’2009] [Milder, DATE’2009]
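The difficulty with w 2-ported RAMs is bank conflicts: each RAM can accept one write and serve one read per cycle, so the w words of a cycle must fall in w different banks. The sketch below checks this for a hypothetical stride-4 P16 under two bank mappings; the skewed mapping is only an illustration of the idea, not the construction from the cited papers.

```python
def has_conflict(cycles, bank_of):
    """True if some cycle touches the same bank more than once."""
    return any(len({bank_of(a) for a in cycle}) < len(cycle) for cycle in cycles)

N_POINTS, WIDTH = 16, 4
ORDER = [i for start in range(4) for i in range(start, N_POINTS, 4)]  # hypothetical P16

# Input word indices written in each cycle, and indices read in each output cycle.
write_cycles = [list(range(c * WIDTH, (c + 1) * WIDTH)) for c in range(N_POINTS // WIDTH)]
read_cycles = [ORDER[c * WIDTH:(c + 1) * WIDTH] for c in range(N_POINTS // WIDTH)]

def naive_bank(a):
    """Port j always writes to bank j: no switching on the input side."""
    return a % WIDTH

def skewed_bank(a):
    """Rotate the bank assignment by the cycle number (needs a switch network)."""
    return (a + a // WIDTH) % WIDTH

naive_read_conflict = has_conflict(read_cycles, naive_bank)
skewed_ok = (not has_conflict(write_cycles, skewed_bank)
             and not has_conflict(read_cycles, skewed_bank))
```

With the naive mapping, every output cycle reads four words from the same bank; the skewed mapping avoids conflicts on both sides, but only because words are steered to different banks on the way in and back to their ports on the way out, which is the reorder in space around the reorder in time.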


Computer Architecture Lab (CALCM)

Carnegie Mellon University

http://www.ece.cmu.edu/CALCM