Carnegie Mellon High Performance Linear Transform Program Generation for the Cell BE Vas Chellappa Franz Franchetti Markus Püschel Electrical & Computer Engineering Carnegie Mellon University Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.


Cell Broadband Engine

Multicore CPU (8 SPEs + 1 PPE)

SPEs: SIMD cores designed for numerical computing

256KB “local store” per SPE (scratchpad-like)

Programmer-driven DMA

204 Gflop/s peak
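The 204 Gflop/s figure is consistent with a back-of-the-envelope count. The inputs below are assumptions that do not appear on the slide: a 3.2 GHz SPE clock, 4-way single-precision SIMD, and one fused multiply-add (counted as 2 flops) per lane per cycle.

```latex
8\ \text{SPEs} \times 4\ \tfrac{\text{lanes}}{\text{SPE}} \times 2\ \tfrac{\text{flops}}{\text{lane} \cdot \text{cycle}} \times 3.2\ \text{GHz} = 204.8\ \text{Gflop/s}
```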

[Figure: Cell BE chip. Eight SPEs, each with its own local store (LS), are connected through the EIB to main memory.]

How do we harness the Cell’s impressive peak performance?


DFT on the Cell BE

[Figure: DFT performance on the Cell BE for Numerical Recipes, FFTW, FFTC, and Spiral-generated code (this paper); the annotated gap is 350x.]

Platform-tuned code is 350x faster. But hard to write!


Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005


“Fitting” Dataflow to Hardware

[Figure: DFT dataflow stages arranged three ways: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution on Core 0 and Core 1 (multicore).]

To “fit” a DFT to an architecture:
  Various traversals
  Various factorizations

How to map dataflow to architecture automatically?


“Fitting” Dataflow to Platform (contd.)

[Figure: numbered dataflow stages mapped onto Core 0 and Core 1.]

Intuition: rewrite formulas to obtain suitable dataflow


Program Generation in Spiral

Transform (user specified) → Fast algorithm in SPL (many choices) → ∑-SPL → C code

Iteration of this process to search for the fastest

But that’s not all …

Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, …


Common Abstraction: SPL

SPL: tensor-product representation

E.g., Cooley-Tukey fast Fourier transform (FFT):

$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$$

(The slide writes each factor out as an explicit 4 × 4 matrix.)

Tensor products in SPL represent loop structures


Mapping DFTs to the Cell

Objective: high-performance transform library for the Cell BE

[Figure: Cell BE chip with a DFT mapped onto the SPEs, their local stores (LS), the EIB, and main memory.]

Tags guide formula rewriting. Each tag corresponds to one of the Cell’s architectural paradigms:
  Vectorization: vectorize the DFT for the target vector length
  Parallelization: parallelize the DFT across p SPEs, using a specified DMA packet size
  Multibuffering: optimize the DFT for throughput (s DFTs required)


SPL to Parallel Code

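Natural parallel construct in SPL: $y = (I_p \otimes A)\,x$, shown here for $p = 4$

[Figure: block-diagonal matrix with four copies of A mapping x to y; each block is assigned to one of Processor 0 to Processor 3.]

Independent, load-balanced, communication-free operation

Parallelizing other constructs in SPL:

Permutations require message exchange (on-chip DMA communication)

[Figure: permutation mapping x to y across processor boundaries.]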

Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
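The contrast between the two kinds of construct can be sketched in plain C. In the block below, `tensor_parallel` applies a kernel A independently to each processor's contiguous chunk, and `remote_elements` counts how many elements a stride permutation moves between chunks owned by different processors. All names, the `kernel_fn` type, and the prefix-sum stand-in for A are hypothetical, and the "processors" are simulated by a serial loop rather than by SPE threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type standing in for the matrix A. */
typedef void (*kernel_fn)(size_t n, const float *in, float *out);

/* Example kernel: prefix sum, a linear map on one chunk. */
static void prefix_sum(size_t n, const float *in, float *out)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; k++) {
        acc += in[k];
        out[k] = acc;
    }
}

/* Work of processor pid in y = (I_p (x) A) x: it touches only its own chunk. */
static void apply_chunk(kernel_fn A, size_t n, size_t pid,
                        const float *x, float *y)
{
    A(n, x + pid * n, y + pid * n);
}

/* y = (I_p (x) A) x. On the Cell each iteration would run on its own SPE;
   here the p iterations run one after another. */
static void tensor_parallel(kernel_fn A, size_t p, size_t n,
                            const float *x, float *y)
{
    for (size_t pid = 0; pid < p; pid++)
        apply_chunk(A, n, pid, x, y);
}

/* For the stride permutation y[a*n + b] = x[b*m + a], count the elements
   whose source and destination lie in chunks owned by different processors
   (chunk size m*n/p). These are the elements that need a message exchange. */
static size_t remote_elements(size_t m, size_t n, size_t p)
{
    const size_t total = m * n;
    size_t remote = 0;
    assert(p > 0 && total % p == 0);
    const size_t chunk = total / p;
    for (size_t a = 0; a < m; a++) {
        for (size_t b = 0; b < n; b++) {
            size_t src = b * m + a;
            size_t dst = a * n + b;
            if (src / chunk != dst / chunk)
                remote++;
        }
    }
    return remote;
}
```

For a 4 × 4 stride permutation on 4 processors, only the 4 diagonal elements stay local, so 12 of the 16 elements cross a chunk boundary. The tensor construct, by contrast, needs no exchange at all.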


SPL to Streaming Code

Streaming: overlapping computation with communication
  On-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)

Idea: tensor loops become multi-buffered loops

Useful for:
  Throughput-optimized code
  Large, out-of-chip sizes

[Figure: block-diagonal matrix with copies of A mapping x to y. In the i-th iteration: write A_{i-1}, compute A_i, read A_{i+1}.]

(Trickier for other SPL constructs)

Idea: rewrite algorithm at SPL level to achieve largest DMA packets
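A multi-buffered tensor loop can be sketched as follows. This is a simulation under stated assumptions: `dma_get` and `dma_put` are hypothetical stand-ins implemented with a synchronous `memcpy`, whereas real SPE code would issue asynchronous MFC transfers and wait on their tags; `BLOCK`, `NBUF`, and the scaling kernel are likewise illustrative.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 4  /* elements per transfer; illustrative packet size */
#define NBUF 3   /* one buffer each for the write, compute, and read slots */

/* Stand-ins for asynchronous DMA between main memory and the local store. */
static void dma_get(float *ls, const float *mem)
{
    memcpy(ls, mem, BLOCK * sizeof(float));
}

static void dma_put(float *mem, const float *ls)
{
    memcpy(mem, ls, BLOCK * sizeof(float));
}

/* Kernel A, applied in place to one block in the local store. */
static void kernel(float *buf)
{
    for (int k = 0; k < BLOCK; k++)
        buf[k] *= 2.0f;
}

/* y = (I_iters (x) A) x, streamed through NBUF rotating local buffers.
   Iteration i writes block i-1, reads block i+1, and computes block i. */
static void stream_tensor(int iters, const float *x, float *y)
{
    float ls[NBUF][BLOCK];
    if (iters <= 0)
        return;

    dma_get(ls[0], x);  /* prologue: fetch the first block */
    for (int i = 0; i < iters; i++) {
        if (i > 0)           /* write A_{i-1} */
            dma_put(y + (i - 1) * BLOCK, ls[(i - 1) % NBUF]);
        if (i + 1 < iters)   /* read A_{i+1} */
            dma_get(ls[(i + 1) % NBUF], x + (i + 1) * BLOCK);
        kernel(ls[i % NBUF]);  /* compute A_i */
    }
    /* epilogue: write the last block */
    dma_put(y + (iters - 1) * BLOCK, ls[(iters - 1) % NBUF]);
}
```

Three buffers are used because, in a truly asynchronous version, the outgoing block, the block being computed, and the incoming block are all live at the same time. A buffer is reused for a new read only after its previous contents have been written back.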


Generating Cell Code

Transform (user specified) → Fast algorithm in SPL (tag guided) → Loop operations in ∑-SPL → Cell-specific optimized C code (intrinsics, DMA, etc.)

Rewriting yields:
  SIMD kernel optimized for memory hierarchy
  Streamed from memory for throughput
  All-to-all communication (on-chip)
  Load balanced across p SPEs


Generated Code Sample

(Slide annotations mark the DMA, parallelized, and vectorized parts of the code.)

/* Complex-to-complex DFT size 64 on 2 SPEs */
void dft_c2c_64(float *X, float *Y, int spuid)
{
    int i;

    // Block 1: (I x A) L
    for (i = 0; i <= 7; i++) {              // Rightmost gather
        DMA_GATHER(gath_func(X,i), gath_func(T1,i), 4);
    }                                       // uses spu_mfcdma()
    spu_mfcstat(MFC_TAG_UPDATE_ALL);        // Wait on gather
    // compute vectorized DFT kernel of size m
    for (i = 0; i <= 7; i++) {              // Scatter at interface
        DMA_SCATTER(scat_func(T1,i), scat_func(T2,i), 4);
    }
    all_to_all_synchronization_barrier();   // uses mailbox msgs

    // Block 2: (A x I)
    /* Gather is a no-op since the scatter above accounted for it */
    // compute vectorized DFT kernel of size n
    for (i = 0; i <= 7; i++) {              // Leftmost scatter
        DMA_SCATTER(scat_func(T1,i), scat_func(Y,i), 4);
    }
    all_to_all_synchronization_barrier();
}

DFT 216: 4,000+ lines of code!


Problem Space: Options

[Figure: one small diagram per option, showing DFTs placed on SPEs.]

Base (vectorized)
  Vectorization assumed
  (Only for small DFTs)

Main memory operations
  Latency optimized (default)
  Throughput, multibuffered

Parallelization
  Single DFT parallelized across multiple SPEs
  Multiple independent DFTs on multiple SPEs
  Multiple parallelized independent DFTs


Problem Space: Combinations

[Figure: DFT-to-SPE diagrams for latency-optimized and throughput-optimized usage scenarios.]

Scenarios shown:
  Parallel, multibuffered DFT
  Single DFT from main memory
  Independent DFTs multibuffered in parallel

Devise rewrite rules for tags. Nestings describe all scenarios


[Figure: DFT performance plot with curves for 1, 2, 4, and 8 SPEs.]


[Figure: DFT performance plot comparing Spiral (1 SPE), Spiral (8 SPEs), FFTC, and FFTW.]

4.5x faster than FFTW, 1.63x faster than FFTC


More Performance Results

Single-SPE DFT code
  Split/interleaved complex formats
  Non-2-power sizes
  Double precision (PowerXCell 8i)

[Figure: performance plots with series labeled IBM SDK, Spiral, Mercury, and Chow.]


Other Linear Transforms

Discrete sine and cosine transforms, DFT with real inputs (single SPE)

2-D DFTs
Out-of-core sizes

Limited to 2-D DFTs on 1 SPE (for now)

More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009


Conclusion

Automatic generation of transform libraries
  High performance
  Variety of scenarios, formats

High performance on Cell requires:
  Vectorization, multicore parallelization, streaming, DMA code
  Future processors likely to have similar paradigms, tradeoffs

Spiral approach:
  Common abstraction of transform, algorithm, architecture (SPL)
  Rewrite rules to go from transform to architecture

[Figure: algorithm space and architecture space.]