
9/24/2014


BeiHang Short Course, Part 4: Spiral: Domain‐Specific Synthesis

James C. Hoe

Department of ECE

J. C. Hoe, CMU/ECE/CALCM, © 2014, BHSC‐L4‐s1


Carnegie Mellon University

Collaborators: Peter Milder (SUNY Stonybrook), Markus Pueschel (ETH), Franz Franchetti

Peter A. Milder, et al., “Computer Generation of Hardware for Linear Digital Signal Processing Transforms,” ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 17, Issue 2, pp 15:1‐33, April 2012.

Conflict btw High‐Level vs Generality

• high‐level: tools know better than you

• RTL synthesis: general‐purpose, but special handling of structures like FSM, arith, etc.

• place‐and‐route: works the same no matter what design


The SPIRAL Project

• High performance implementations of linear DSP transforms (DFT, DCT, DWT, filters, etc.) are an important class of design problems

• Hand design and tuning is tricky and expensive

– needs both math and implementation knowledge

– time‐consuming and tedious

– needs to repeat effort for every new context



• SPIRAL research goal: A flexible push‐button design generator that produces SW & HW implementations comparable with expert hand design

Why we can do better than hand design

• SPIRAL is only focused on linear DSP transforms

• These transforms are highly structured, highly regular, and very well understood mathematically

• Algorithmic implementations of a transform can be enumerated following a known set of rules


• For a given objective function and mapping target, a computer generates a solution at least as good as the best human effort by trying enough implementations


SPIRAL Framework

Input to SPIRAL: "I want a DFT of size 1024 on a {Xilinx, P4, Cell, ...}"

[Figure: design flow with labels "automation starts here" (at the problem statement, automating the problem) and "where most tools begin" (further down the flow)]

Principle 1: Domain knowledge in the system

Principle 2: Optimization at a high level of abstraction

www.spiral.net/hardware/dftgen.html



Outline

• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐core


Linear Transforms

• Linear transform is a matrix‐vector multiplication

– computing by definition takes $O(N^2)$ operations

– the matrix has structure

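• E.g. discrete Fourier transform: $y = \mathrm{DFT}_N\, x$

$$
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_j \\ \vdots \\ y_{N-1} \end{bmatrix}
=
\Big[\, e^{-2\pi i\, jk/N} \,\Big]_{j,k \,=\, 0 \ldots N-1}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_k \\ \vdots \\ x_{N-1} \end{bmatrix}
$$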
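As a rough illustration of "computing by definition", the sketch below builds the dense DFT matrix and multiplies it by the input vector, which costs one multiply-add per matrix entry. The function names are hypothetical, and the negative sign in the exponent is an assumed convention.

```python
import cmath

def dft_matrix(n):
    """Dense n x n matrix with entry (j, k) = exp(-2*pi*i*j*k/n).

    The negative exponent is an assumed sign convention.
    """
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def matvec(matrix, x):
    """Plain matrix-vector product: one multiply-add per matrix entry."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def dft_by_definition(x):
    """Compute y = DFT_N x directly, using N*N multiply-adds."""
    return matvec(dft_matrix(len(x)), x)

# An impulse has a flat spectrum; a constant has only a DC component.
impulse_spectrum = dft_by_definition([1, 0, 0, 0])
constant_spectrum = dft_by_definition([1, 1, 1, 1])
```

Nothing here exploits the structure of the matrix, which is exactly the gap the fast algorithms fill.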


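“Fast” Algorithms

• a “fast” algorithm factors the matrix into a sequence of structured, sparse matrices

– cheaper sparse multiplies: O(N log(N)) operations

• E.g. Cooley‐Tukey Factorization of DFT4

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

• Matrix formula representation

$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, D^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$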

Factorization Rules

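E.g. Cooley‐Tukey

$$
\mathrm{DFT}_{nm} = (\mathrm{DFT}_n \otimes I_m)\, D^{nm}_n\, (I_n \otimes \mathrm{DFT}_m)\, L^{nm}_n
$$

– $\mathrm{DFT}_2$ is $\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$

– $D$ is a diagonal matrix of twiddle factors

– $L$ is a stride permutation matrix

– $A \otimes B = [a_{j,k} B]$ is the tensor (or Kronecker) product

e.g., for a 2×2 matrix $A$:

$$
A \otimes I_n = \begin{bmatrix} a_{0,0} I_n & a_{0,1} I_n \\ a_{1,0} I_n & a_{1,1} I_n \end{bmatrix},
\qquad
I_n \otimes B = \begin{bmatrix} B & & 0 \\ & \ddots & \\ 0 & & B \end{bmatrix}
$$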
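The Cooley‐Tukey rule can be checked numerically. The sketch below builds each factor as a dense matrix and compares the product against the DFT matrix. All function names are hypothetical. The twiddle diagonal (entry j·m+q equal to ω^(j·q), with ω = e^(−2πi/(nm))) and the stride permutation (output position j·m+i reads input position i·n+j) follow one common convention, which is an assumption here.

```python
import cmath

def dft(n):
    """Dense DFT_n with the assumed convention w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def identity(n):
    return [[1 if j == k else 0 for k in range(n)] for j in range(n)]

def kron(a, b):
    """Tensor (Kronecker) product: the block matrix [a[j][k] * b]."""
    return [[a[j][k] * b[p][q]
             for k in range(len(a[0])) for q in range(len(b[0]))]
            for j in range(len(a)) for p in range(len(b))]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def twiddle(n, m):
    """Diagonal D: entry at position j*m + q is w^(j*q), w = exp(-2*pi*i/(n*m))."""
    w = cmath.exp(-2j * cmath.pi / (n * m))
    d = [w ** (j * q) for j in range(n) for q in range(m)]
    size = n * m
    return [[d[r] if r == c else 0 for c in range(size)] for r in range(size)]

def stride_perm(n, m):
    """Stride permutation L: output position j*m + i reads input position i*n + j."""
    size = n * m
    p = [[0] * size for _ in range(size)]
    for j in range(n):
        for i in range(m):
            p[j * m + i][i * n + j] = 1
    return p

def cooley_tukey(n, m):
    """(DFT_n x I_m) D (I_n x DFT_m) L as one dense matrix."""
    f = matmul(kron(dft(n), identity(m)), twiddle(n, m))
    f = matmul(f, kron(identity(n), dft(m)))
    return matmul(f, stride_perm(n, m))

def max_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

error_dft8 = max_error(cooley_tukey(2, 4), dft(8))
```

Multiplying the factors back into a dense matrix defeats the purpose of a fast algorithm, so this only confirms the identity. A fast implementation applies the sparse factors to the vector one after another.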


“Fast” Fourier Transform Algorithms

• Recursively factorize by the Cooley‐Tukey rule until only leaf cases remain (e.g. DFTr for radix‐r)

• Exponential number of alternatives

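$$
\begin{aligned}
\mathrm{DFT}_8 &= (\mathrm{DFT}_2 \otimes I_4)\, D^8_2\, (I_2 \otimes \mathrm{DFT}_4)\, L^8_2 \\
&= (\mathrm{DFT}_2 \otimes I_4)\, D^8_2\, \big(I_2 \otimes (\mathrm{DFT}_2 \otimes I_2)\, D^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2\big)\, L^8_2
\end{aligned}
$$

[Figure: two ruletrees for DFT8. In the first, DFT8 splits into DFT2 and DFT4, and DFT4 splits into DFT2 and DFT2. In the second, DFT8 splits into DFT4 and DFT2, and DFT4 splits into DFT2 and DFT2.]

• Each ruletree corresponds to a different algorithm

• All cost O(N log(N))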
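To get a feel for how quickly the alternatives multiply, the sketch below counts and enumerates ruletrees for a DFT of size 2^k under a deliberately restricted assumption: only the Cooley‐Tukey rule is used, every split is binary, and DFT2 is the only leaf. The function names are hypothetical. SPIRAL's full rule set yields far more alternatives than this count.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_ruletrees(k):
    """Number of binary Cooley-Tukey ruletrees for DFT of size 2**k (leaf: DFT2)."""
    if k == 1:
        return 1
    # Split 2**k = 2**a * 2**(k - a) and recurse on both children.
    return sum(count_ruletrees(a) * count_ruletrees(k - a) for a in range(1, k))

def enumerate_ruletrees(k):
    """List the same ruletrees as nested strings, e.g. 'DFT4(DFT2, DFT2)'."""
    if k == 1:
        return ["DFT2"]
    trees = []
    for a in range(1, k):
        for left in enumerate_ruletrees(a):
            for right in enumerate_ruletrees(k - a):
                trees.append(f"DFT{2 ** k}({left}, {right})")
    return trees

counts = [count_ruletrees(k) for k in range(1, 11)]
dft8_trees = enumerate_ruletrees(3)
```

For k = 3 this yields the two DFT8 ruletrees, one splitting into DFT2 and DFT4 and the other into DFT4 and DFT2.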

A System of Transforms and Rules

[Figure: collage of breakdown rules in formula form for DCT (types II and IV), DST, DFT, WHT, DWT, and filter/circulant (Circ) transforms]

50+ transforms, 150+ rules


Algorithmic Design Space

size    # of DFT            # of DCT‐IV
2       1                   1
4       6                   10
8       40                  126
16      296                 31,242
32      27,744              1,924,443,362
64      162,570,361,280     7,343,815,121,631,354,242
128     ~1.01 × 10^27       ~1.07 × 10^38
256     ~2.31 × 10^61       ~2.30 × 10^76
512     ~2.86 × 10^133      ~1.06 × 10^153

Different characteristics: data flow, numerical stability, operation ordering, working set size, datapath regularity

Design Space: SW DCT 32 on P4

Histogram of 10,000 randomly selected algorithms

[Figure: two histograms, one by runtime (P4, 3.2 GHz) and one by numerical accuracy]


Outline





Formula to HW (Combinational)

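• Given $y = Fx$ where $F$ is:

– $A \cdot B$: apply $B$, then $A$

– $P$, a permutation: permute

– $I_n \otimes A$: apply $A$, $n$ times in parallel

– $D$, a diagonal: scale

[Figure: datapath building blocks for each case: block B feeding block A, a permutation network, parallel copies of A, and constant multipliers]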


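DFT8 Datapath Example

$$
\mathrm{DFT}_8 = (\mathrm{DFT}_2 \otimes I_4)\, D^8_2\, \big(I_2 \otimes (\mathrm{DFT}_2 \otimes I_2)\, D^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2\big)\, L^8_2
$$

(formula is applied from right to left)

[Figure: combinational DFT8 datapath derived from the formula]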

Pease DFT8 Example

[Figure: Pease DFT8 datapath with three stages (stage 1, stage 2, stage 3)]


How about good HW?

• Matrix formulas have a natural mapping to dataflow and hence combinational datapath

• However, real hardware designs must fit a given resource constraint, which calls for a sequential datapath that reuses the available HW

– identify repeated kernels

– instantiate kernels under resource constraints


– schedule computation to reuse instantiated kernels

We want to do the analysis and mapping at formula level, with high‐level algorithm knowledge

Tensor as Streaming Parallelism


[Figure: three datapaths for the same tensor product: fully parallel, partially streamed, and fully streamed]
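One way to read the three design points is as different numbers of physical kernel instances serving the same I_n ⊗ A. The sketch below is an idealized model under that assumption: `apply_tensor`, `dft2`, and the `copies` parameter are hypothetical names, one block per instance is consumed per modeled cycle, and pipeline latency is ignored.

```python
def dft2(block):
    """2-point butterfly, standing in for the kernel A."""
    a, b = block
    return [a + b, a - b]

def apply_tensor(kernel, kernel_size, x, copies):
    """Apply I_n (x) A to x using `copies` physical instances of the kernel A.

    In each modeled cycle every instance consumes one block of kernel_size
    words. Returns (output vector, number of cycles).
    """
    n_blocks, remainder = divmod(len(x), kernel_size)
    if remainder or n_blocks % copies:
        raise ValueError("vector, kernel, and instance counts must divide evenly")
    out, cycles = [], 0
    for first in range(0, n_blocks, copies):
        # One cycle: blocks first .. first+copies-1 are processed side by side.
        for b in range(first, first + copies):
            out.extend(kernel(x[b * kernel_size:(b + 1) * kernel_size]))
        cycles += 1
    return out, cycles

x = list(range(8))
fully_parallel = apply_tensor(dft2, 2, x, copies=4)      # 4 kernels, 1 cycle
partially_streamed = apply_tensor(dft2, 2, x, copies=2)  # 2 kernels, 2 cycles
fully_streamed = apply_tensor(dft2, 2, x, copies=1)      # 1 kernel, 4 cycles
```

All three produce the same output vector. Only the amount of hardware and the number of cycles change, which is the cost and throughput tradeoff the tensor structure exposes.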


Pease DFT Example: DFT8

[Figure: Pease DFT8 datapath, fully parallel]


Pease DFT Example: DFT8 Streaming

[Figure: two streaming Pease DFT8 datapaths in which the permutations appear as streaming blocks f(L^8_2) and f(R_8)]


Regular Structure for HW

• Simple regular structure embodied in Pease FFT

• Example:


Formally representing horizontal reuse

[Figure: formulas annotated with hr and the corresponding datapaths: not horizontally reused, partially horizontally reused, and horizontally reused]


Iterative Reuse of Logic


[Figure: design points plotted by cost and latency]

Fine-grained control over cost/latency tradeoff

Example: rewriting rules for streaming reuse



Applicability to other transforms?

• DFT radix 2

• DFT radix 2^r

• 2‐D DFT n×n

• WHT

• DCT (type II)

[Formulas: each transform is written as an iterative product over stages i. The DFT formulas are built from the factors L, I ⊗ DFT, T, and R; the 2‐D DFT from L and I ⊗ DFT; the WHT from L and I ⊗ WHT; the DCT from P, L, A, and D.]

FPGA: Area vs. Throughput

[Figure: area (slices) vs. throughput of generated designs with the Pareto‐optimal points marked; annotations: 49x slices, 132x throughput]


Outline





A 2D‐FFT Algorithm

• Row‐column algorithm: a row stage followed by a column stage

[Figure: the dataset shown as a logical abstraction of the 2D dataset, processed first by rows and then by columns]


Off‐chip Data Sets

[Figure: off‐chip DRAM, on‐chip SRAM, and the DFTn kernel]

• Need to balance

– kernel processing bandwidth

– off‐chip memory bandwidth

– on‐chip storage capacity

Inefficient DRAM Access Patterns

• Row‐wise traversal ‐> Sequential accesses

• Column‐wise traversal ‐> Large strided accesses

[Figure: n × n row‐major 2D array mapped onto the linear memory space (addresses 0 … n^2), with the DRAM row buffer size marked]


How to Optimize the Access Patterns

[Figure: n × n array in a row‐major “blocked” layout: k × k tiles, each holding k^2 elements (one row buffer size), stored in the linear memory space (addresses 0 … n^2) in row‐buffer‐sized chunks]

[Akin, et al., FCCM 2012]
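The effect of the layout on DRAM behavior can be estimated with a toy model. The sketch below maps array coordinates to linear addresses for a plain row‐major layout and for a row‐major “blocked” layout, then counts how often consecutive accesses fall in a different DRAM row. The function names, the 16 × 16 array, the 4 × 4 tiles, and the 16‐word row buffer are illustrative assumptions, not figures from the slides.

```python
def row_major_addr(r, c, n):
    """Linear address of element (r, c) in an n x n row-major array."""
    return r * n + c

def blocked_addr(r, c, n, k):
    """Linear address in a row-major 'blocked' layout.

    The array is cut into k x k tiles; tiles are stored contiguously in
    row-major tile order, and elements inside a tile are row-major too.
    """
    tiles_per_row = n // k
    tile = (r // k) * tiles_per_row + (c // k)
    return tile * k * k + (r % k) * k + (c % k)

def row_buffer_switches(addresses, row_buffer_words):
    """Count accesses that land in a different DRAM row than the previous one."""
    switches, current = 0, None
    for addr in addresses:
        dram_row = addr // row_buffer_words
        if dram_row != current:
            switches += 1
            current = dram_row
    return switches

n, k, row_buffer_words = 16, 4, 16  # one k x k tile fills one row buffer
row_wise = [(r, c) for r in range(n) for c in range(n)]
col_wise = [(r, c) for c in range(n) for r in range(n)]

rm_rows = row_buffer_switches([row_major_addr(r, c, n) for r, c in row_wise], row_buffer_words)
rm_cols = row_buffer_switches([row_major_addr(r, c, n) for r, c in col_wise], row_buffer_words)
bl_rows = row_buffer_switches([blocked_addr(r, c, n, k) for r, c in row_wise], row_buffer_words)
bl_cols = row_buffer_switches([blocked_addr(r, c, n, k) for r, c in col_wise], row_buffer_words)
```

In this model the plain row‐major layout is cheap row‐wise and worst‐case column‐wise, while the blocked layout costs the same in both directions. That symmetry is the point of tiling: neither the row stage nor the column stage is left with a pathological access pattern.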

Design Generator w/ Tensor Formalism

• row‐column algorithm: a row stage followed by a column stage

• symmetric algorithm

• symmetric algorithm with tiling: read tiles row‐wise, linearize on‐chip, FFT processing, transpose and re‐tile on‐chip, write tiles column‐wise

[Akin, et al., FCCM 2012]


2D‐FFT (double) Raw Performance


[Figure: raw performance vs. problem size] [Akin, et al., FCCM 2012]

2D‐FFT (double) BW Efficiency




2D‐FFT (double) Power Efficiency



Recap the Last Hour

• Encapsulating domain knowledge in a domain specific tool for truly high‐level design automation

• SPIRAL

– mathematical approach to DSP transform implementation (cores and “un”‐core)

– generalizable to other linear DSP transforms


– as good as best expert designer


An Aside on Permutations

• Permutation: a fixed reordering of a given number of data points

• Example: P16

[Figure: P16 mapping a 16 point input vector to a 16 point output vector]

Efficient streaming permutation is the secret to efficient high‐performance FFT in hardware

Permuting Streaming Data

streaming permutation: data reordered in space & time

Example: P16 with streaming width 4

[Figure: a streaming data vector enters on ports 0 to 3, 4 words per cycle for 4 cycles (cycles 0 to 3), and leaves permuted on cycles d to d+3]

Requirements:

• scale with permutation size n

• scale with streaming width w

• support full throughput: w words per cycle
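To see why a streaming permutation has to reorder in both space and time, the sketch below tabulates, for every output word, the cycle and port on which it leaves and the cycle and port on which its source word arrived. The slides do not say which permutation P16 is, so a 4 × 4 transpose (a stride permutation) is assumed for illustration, and all function names are hypothetical.

```python
def transpose_perm(w):
    """Assumed example permutation on w*w points: output j*w + i reads input i*w + j."""
    return [i * w + j for j in range(w) for i in range(w)]

def streaming_schedule(perm, w):
    """For each output index, pair (out_cycle, out_port) with (in_cycle, in_port).

    Word number idx of a stream travels in cycle idx // w on port idx % w.
    """
    return [((out // w, out % w), (src // w, src % w))
            for out, src in enumerate(perm)]

def min_delay(perm, w):
    """Smallest d such that every word has arrived by the cycle it must leave.

    Idealized: memory and routing latency are ignored.
    """
    return max(src // w - out // w for out, src in enumerate(perm))

w = 4
perm = transpose_perm(w)
schedule = streaming_schedule(perm, w)

# Input ports that feed the w words leaving in output cycle 0.
ports_for_first_cycle = {src_port for (out_cycle, _), (_, src_port) in schedule
                         if out_cycle == 0}
delay = min_delay(perm, w)
```

For this permutation all four words of the first output cycle arrived on input port 0 in four different cycles, so they must be buffered over time and then spread across ports. A structure that only rewires ports, or only delays them, cannot do both.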


Naive Implementation

• Perform streaming permutation with one many‐ported RAM

[Figure: a single RAM with w inputs per cycle and w outputs per cycle]

• A 2w‐ported RAM is impractical for w>1

Can the same be done using w 2‐ported RAMs instead?

Using only 2‐ported RAMs

[Figure: three stages: reorder in space, reorder in time, reorder in space]

• For power‐of‐2 vector size and power‐of‐2 streaming width

– solution for “bit” permutations requires 1x storage in w 2‐ported RAMs

– solution for arbitrary permutations requires 2x storage in 2w 2‐ported RAMs

See [Pueschel, JACM’2009], [Milder, DATE’2009]



Computer Architecture Lab (CALCM)

Carnegie Mellon University

http://www.ece.cmu.edu/CALCM
