Carnegie Mellon

High Performance Linear Transform Program Generation for the Cell BE

Vas Chellappa, Franz Franchetti, Markus Püschel

Electrical & Computer Engineering, Carnegie Mellon University

Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.

Cell Broadband Engine

Multicore CPU (8 SPEs + 1 PPE)

SPEs: SIMD cores designed for numerical computing

256KB “local store” per SPE (scratchpad-like)

Programmer-driven DMA

204 Gflop/s peak
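The peak figure can be reproduced with simple arithmetic. This is a sketch under assumptions that are not stated on the slide: a 3.2 GHz SPE clock, and one 4-wide single-precision fused multiply-add (2 flops per lane) issued per SPE per cycle.

```latex
\[
\underbrace{8}_{\text{SPEs}} \times
\underbrace{3.2 \times 10^{9}}_{\text{cycles/s (assumed)}} \times
\underbrace{4}_{\text{SIMD lanes}} \times
\underbrace{2}_{\text{flops per FMA}}
= 204.8 \times 10^{9}\ \text{flop/s} \approx 204\ \text{Gflop/s}
\]
```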

[Figure: Cell BE chip. Eight SPEs, each with a local store (LS), connected through the EIB to main memory.]

How do we harness the Cell’s impressive peak performance?

DFT on the Cell BE

[Figure: DFT performance on the Cell BE for Numerical Recipes, FFTW, FFTC, and Spiral-generated code (this paper), with a 350x gap marked.]

Platform-tuned code is 350x faster. But hard to write!

Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.

“Fitting” Dataflow to Hardware

[Figure: the same DFT dataflow stages arranged three ways: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution on Core 0 and Core 1 (multicore).]

To “fit” a DFT to an architecture: various traversals, various factorizations

How to map dataflow to architecture automatically?

“Fitting” Dataflow to Platform (contd.)

[Figure: numbered dataflow stages reassigned to Core 0 and Core 1.]

Intuition: rewrite formulas to obtain suitable dataflow

Program Generation in Spiral

Transform (user specified)

Fast algorithm in SPL (many choices)

∑-SPL

C code

Iteration of this process to search for the fastest

But that’s not all …

Optimization at all abstraction levels: parallelization, vectorization; loop optimizations; constant folding, scheduling, …

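Common Abstraction: SPL

SPL: tensor-product representation

E.g., Cooley-Tukey fast Fourier transform (FFT):

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & 1 \\ 1 & \cdot & -1 & \cdot \\ \cdot & 1 & \cdot & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & \cdot & \cdot & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & \cdot & \cdot \\ 1 & -1 & \cdot & \cdot \\ \cdot & \cdot & 1 & 1 \\ \cdot & \cdot & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 \end{bmatrix}
$$

$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$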

Tensor products in SPL represent loop structures

Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks

Mapping DFTs to the Cell

Objective: High-performance transform library for Cell BE

[Figure: Cell BE chip (eight SPEs with local stores, EIB, main memory) with a DFT mapped onto it.]

Cell’s architectural paradigms (tags guide formula rewriting):

Vectorization: vectorize the DFT for the SIMD vector length

Parallelization: parallelize the DFT across p SPEs, using a given DMA packet size

Multibuffering: optimize the DFT for throughput (s DFTs required)

SPL to Parallel Code

Natural parallel construct in SPL:

[Figure: block-diagonal matrix with four copies of A (I_4 ⊗ A) mapping x to y; each block runs on one of Processors 0 to 3.]

Independent, load-balanced, communication-free operation

Parallelizing other constructs in SPL:

Permutations require message exchange (on-chip DMA communication)

[Figure: permutation mapping x to y across processors.]

Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
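The contrast between the two constructs can be sketched in portable C. This is an illustration, not Spiral's generated code: the loop over `proc` stands in for p processors running concurrently, and all names (`kernel_fn`, `dft2_kernel`, `apply_I_tensor_A`, `stride_perm`) are hypothetical. The stride-permutation indexing below is one common convention and is assumed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type: applies an n-point operation A to one block. */
typedef void (*kernel_fn)(size_t n, const float *in, float *out);

/* Example kernel A = DFT_2 on real data: a single butterfly. */
void dft2_kernel(size_t n, const float *in, float *out)
{
    assert(n == 2);
    out[0] = in[0] + in[1];
    out[1] = in[0] - in[1];
}

/* y = (I_p (x) A) x. Iteration `proc` reads and writes only block `proc`,
 * so the p iterations could run on p processors with no communication
 * and equal work per processor. */
void apply_I_tensor_A(size_t p, size_t n, kernel_fn A,
                      const float *x, float *y)
{
    assert(A != NULL);
    for (size_t proc = 0; proc < p; proc++)
        A(n, x + proc * n, y + proc * n);
}

/* Stride permutation y[i*n + j] = x[j*m + i] with the data split into p
 * contiguous chunks, one per processor. Returns how many elements move
 * between different owners, i.e. how much data needs message exchange. */
size_t stride_perm(size_t m, size_t n, size_t p, const float *x, float *y)
{
    assert(p > 0 && (m * n) % p == 0);
    size_t chunk = (m * n) / p;      /* elements owned by each processor */
    size_t remote = 0;
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            size_t dst = i * n + j, src = j * m + i;
            y[dst] = x[src];
            if (dst / chunk != src / chunk)
                remote++;            /* would be an on-chip DMA transfer */
        }
    return remote;
}
```

For m = n = 2 on two processors, `stride_perm` reports two elements crossing owners, while `apply_I_tensor_A` never touches another processor's block.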

SPL to Streaming Code

Streaming: overlapping computation with communication, both on-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)

Useful for: throughput-optimized code; large, out-of-chip sizes

Idea: tensor loops become multi-buffered loops

[Figure: repeated A blocks mapping x to y; the i-th iteration writes A_{i-1}, computes A_i, and reads A_{i+1}.]

(Trickier for other SPL constructs)

Idea: rewrite algorithm at SPL level to achieve largest DMA packets
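The multi-buffered loop schedule can be sketched in portable C. This is an illustration under stated assumptions, not the generated Cell code: `dma_get` and `dma_put` are hypothetical stand-ins that use `memcpy` and complete immediately, whereas on the Cell they would be asynchronous transfers followed by a wait on a tag. `BLOCK` and `compute_A` are placeholders as well.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 4   /* elements per block (hypothetical size) */

/* Stand-ins for asynchronous DMA transfers between main memory and the
 * local store; memcpy completes immediately, real DMA would not. */
static void dma_get(float *ls, const float *mem)
{
    memcpy(ls, mem, BLOCK * sizeof *ls);
}

static void dma_put(float *mem, const float *ls)
{
    memcpy(mem, ls, BLOCK * sizeof *ls);
}

/* Placeholder for the kernel A: scales one block by 2. */
static void compute_A(const float *in, float *out)
{
    for (int k = 0; k < BLOCK; k++)
        out[k] = 2.0f * in[k];
}

/* Double-buffered loop over nblocks blocks: iteration i writes back the
 * result of block i-1, starts fetching block i+1, and computes block i. */
void stream_blocks(int nblocks, const float *x, float *y)
{
    float in[2][BLOCK], out[2][BLOCK];
    assert(nblocks > 0);

    dma_get(in[0], x);                                  /* prologue: read block 0 */
    for (int i = 0; i < nblocks; i++) {
        int cur = i & 1, other = cur ^ 1;
        if (i > 0)
            dma_put(y + (i - 1) * BLOCK, out[other]);   /* write A_{i-1} */
        if (i + 1 < nblocks)
            dma_get(in[other], x + (i + 1) * BLOCK);    /* read A_{i+1} */
        compute_A(in[cur], out[cur]);                   /* compute A_i */
    }
    dma_put(y + (nblocks - 1) * BLOCK, out[(nblocks - 1) & 1]);  /* epilogue */
}
```

With real asynchronous DMA, each buffer would also need a completion wait before it is reused; that synchronization is omitted here because the stand-ins are synchronous.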

Generating Cell Code

Transform (user specified)

Fast algorithm in SPL (tag guided)

Rewriting

[Figure: rewritten formula with annotated parts: SIMD kernel optimized for memory hierarchy; streamed from memory for throughput; all-to-all communication (on-chip); load balanced across p SPEs.]

Loop operations in ∑-SPL

Cell-specific optimized C code (intrinsics, DMA, etc.)

Generated Code Sample

/* Complex-to-complex DFT size 64 on 2 SPEs */
void dft_c2c_64(float *X, float *Y, int spuid)
{
    // Block 1: (I x A) L
    for (int i = 0; i <= 7; i++)   // Rightmost gather
    { DMA_GATHER(gath_func(X,i), gath_func(T1,i), 4) }   // uses spu_mfcdma()
    spu_mfcstat(MFC_TAG_UPDATE_ALL);   // Wait on gather
    // compute vectorized DFT kernel of size m
    for (int i = 0; i <= 7; i++)   // Scatter at interface
    { DMA_SCATTER(scat_func(T1,i), scat_func(T2,i), 4) }
    all_to_all_synchronization_barrier();   // uses mailbox msgs

    // Block 2: (A x I)
    /* Gather is a no-op since the scatter above accounted for it */
    // compute vectorized DFT kernel of size n
    for (int i = 0; i <= 7; i++)   // Leftmost scatter
    { DMA_SCATTER(scat_func(T1,i), scat_func(Y,i), 4) }
    all_to_all_synchronization_barrier();
}

Slide annotations mark the DMA, parallelized, and vectorized portions of this code.

DFT 216: 4,000+ lines of code!

Problem Space: Options

[Figure: SPE/DFT diagrams, one per option named in the labels that follow.]

Main Memory Operations

Parallelization

Latency optimized (default)

Throughput, multibuffered

SPE

DFT

Base (Vectorized)

Vectorization assumed

(Only for small DFTs)

Single DFT parallelized across multiple SPEs

Multiple independent DFTs on multiple SPEs

Multiple parallelized independent DFTs

Problem Space: Combinations

Throughput-optimized usage scenarios

Latency-optimized usage scenarios

[Figure: SPE/DFT diagrams for the combined scenarios.]

Parallel, multibuffered DFT

Single DFT from main memory

Independent DFTs multibuffered in parallel

Devise rewrite rules for tags. Nestings describe all scenarios

Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks

[Figure: performance plot for a DFT on multiple SPEs, with series for 1 SPE, 2 SPEs, 4 SPEs, and 8 SPEs.]

[Figure: DFT performance plot comparing Spiral (1 SPE), Spiral (8 SPEs), FFTC, and FFTW.]

4.5x faster than FFTW, 1.63x faster than FFTC

More Performance Results

Single-SPE DFT code: split/interleaved complex formats, non-2-power sizes, double precision (PowerXCell 8i)

[Figure: performance plots comparing Spiral with IBM SDK, Mercury, and Chow.]

Other Linear Transforms

Discrete sine and cosine transforms, DFT with real inputs (single-SPE)

2-D DFTs, out-of-core sizes

Limited to 2D DFTs on 1-SPE (for now)

More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009.

Overview

Background, Spiral Overview

Generating DFTs for the Cell

Performance Results

Concluding Remarks

Conclusion

Automatic generation of transform libraries: high performance; variety of scenarios, formats

High performance on Cell requires vectorization, multicore parallelization, streaming, and DMA code. Future processors are likely to have similar paradigms and tradeoffs.

Spiral approach: common abstraction of transform, algorithm, and architecture (SPL); rewrite rules to go from transform to architecture

[Figure: algorithm space and architecture space.]
