Carnegie Mellon
High Performance Linear Transform Program Generation for the Cell BE
Vas Chellappa, Franz Franchetti, Markus Püschel
Electrical & Computer Engineering, Carnegie Mellon University
Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.
Cell Broadband Engine
Multicore CPU (8 SPEs + 1 PPE)
SPEs: SIMD cores designed for numerical computing
256KB “local store” per SPE (scratchpad-like)
Programmer-driven DMA
204 Gflop/s peak
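As a rough cross-check of the peak figure, here is the arithmetic under three assumptions that are not stated on the slide: a 3.2 GHz clock, 4-wide single-precision SIMD, and one fused multiply-add (counted as 2 flops) per lane per cycle on each SPE. Only the SPEs are counted.

```latex
\[
8\ \text{SPEs} \times 3.2\ \text{GHz} \times 4\ \text{lanes} \times 2\ \frac{\text{flops}}{\text{lane}\cdot\text{cycle}}
  = 204.8\ \text{Gflop/s}
\]
```

Under these assumptions the result agrees with the 204 Gflop/s quoted above, up to truncation.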
[Figure: Cell BE chip. Eight SPEs, each with a local store (LS), connected through the EIB to main memory.]
How do we harness the Cell’s impressive peak performance?
DFT on the Cell BE
[Figure: DFT performance on the Cell BE; series: Numerical Recipes, FFTW, FFTC, Spiral generated (this paper); annotated gap: 350x]
Platform-tuned code is 350x faster. But hard to write!
Overview
Background, Spiral Overview
Generating DFTs for the Cell
Performance Results
Concluding Remarks
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005
“Fitting” Dataflow to Hardware
[Figure: staged DFT dataflow diagrams for three variants: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution on Core 0 and Core 1 (multicore)]
To “fit” DFT to architecture: various traversals, various factorizations
How to map dataflow to architecture automatically?
“Fitting” Dataflow to Platform (contd.)
[Figure: numbered dataflow stages and their assignment to Core 0 and Core 1]
Intuition: rewrite formulas to obtain suitable dataflow
Program Generation in Spiral
[Flow diagram: Transform (user specified) → Fast algorithm in SPL (many choices) → ∑-SPL → C code]
Iteration of this process to search for the fastest
But that’s not all …
Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
Common Abstraction: SPL
SPL: tensor-product representation
E.g., Cooley-Tukey fast Fourier transform (FFT):
$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2$
[Figure: the same factorization written out as a product of sparse 4×4 matrices]
Tensor products in SPL represent loop structures
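To make the loop reading of tensor products concrete, the following is a minimal C sketch of a size-4 Cooley-Tukey step. It is an illustration, not Spiral output. Assumptions: the forward DFT uses the root of unity exp(-2πi/4) = -i, the twiddle matrix is diag(1, 1, 1, -i), and all function names are hypothetical.

```c
#include <complex.h>
#include <stddef.h>

typedef float complex cf;

/* DFT_2 butterfly on two elements that lie `stride` apart. */
static void dft2(const cf *x, cf *y, size_t stride)
{
    cf a = x[0], b = x[stride];
    y[0] = a + b;
    y[stride] = a - b;
}

/* y = (I_m (x) DFT_2) x: a loop over m contiguous pairs. */
static void apply_I_tensor_dft2(size_t m, const cf *x, cf *y)
{
    for (size_t i = 0; i < m; i++)
        dft2(x + 2 * i, y + 2 * i, 1);
}

/* y = (DFT_2 (x) I_n) x: a loop over n butterflies at stride n. */
static void apply_dft2_tensor_I(size_t n, const cf *x, cf *y)
{
    for (size_t i = 0; i < n; i++)
        dft2(x + i, y + i, n);
}

/* DFT_4 = (DFT_2 (x) I_2) T (I_2 (x) DFT_2) L, applied right to left. */
void dft4_cooley_tukey(const cf x[4], cf y[4])
{
    cf t[4], u[4];

    /* L: stride permutation, reads the input at stride 2. */
    t[0] = x[0]; t[1] = x[2]; t[2] = x[1]; t[3] = x[3];

    apply_I_tensor_dft2(2, t, u);

    /* T = diag(1, 1, 1, -i): only the last element is scaled (assumed sign convention). */
    u[3] *= -I;

    apply_dft2_tensor_I(2, u, y);
}
```

Calling `dft4_cooley_tukey` on the input (1, 2, 3, 4) yields (10, -2+2i, -2, -2-2i), which matches the DFT definition under the stated sign convention. The two helper loops differ only in stride, which is the sense in which each tensor product corresponds to a loop structure.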
Mapping DFTs to the Cell
Objective: High-performance transform library for Cell BE
[Figure: Cell BE chip (SPEs with local stores, EIB, main memory) with a DFT to be mapped onto it]
Cell’s architectural paradigms:
Vectorization: vectorize DFT for a given vector length
Parallelization: parallelize DFT across p SPEs, using a given DMA packet size
Multibuffering: optimize DFT for throughput (s DFTs required)
Tags guide formula rewriting
SPL to Parallel Code
Natural parallel construct in SPL:
[Figure: block-diagonal matrix with four A blocks mapping x to y, one block per processor (Processor 0 to Processor 3)]
Independent, load-balanced, communication-free operation
Parallelizing other constructs in SPL:
Permutations require message exchange (on-chip DMA communication)
Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
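The communication-free case can be sketched in C as follows, assuming the natural parallel construct is the block-diagonal form y = (I_p ⊗ A) x, with one copy of A per processor. The kernel type, the function names, and the prefix-sum stand-in for A are all hypothetical. The processors are simulated by a sequential loop rather than real SPE threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type: y = A x on n contiguous floats. */
typedef void (*kernel_fn)(const float *x, float *y, size_t n);

/* Work of processor `pid` in y = (I_p (x) A) x: one contiguous block,
   touching no data that belongs to any other processor. */
void tensor_block(size_t pid, size_t p, size_t n, kernel_fn A,
                  const float *x, float *y)
{
    assert(pid < p);
    A(x + pid * n, y + pid * n, n);
}

/* Stand-in kernel for A (illustration only): running sum of the block. */
void prefix_sum(const float *x, float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
        y[i] = acc;
    }
}

/* Sequential simulation: on the Cell, each iteration would run on its own SPE. */
void apply_I_tensor_A(size_t p, size_t n, kernel_fn A,
                      const float *x, float *y)
{
    for (size_t pid = 0; pid < p; pid++)
        tensor_block(pid, p, n, A, x, y);
}
```

Every block has the same size and reads and writes only its own slice, so the work is load balanced and needs no data exchange. A permutation, by contrast, moves elements between slices, which is why it needs on-chip DMA.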
SPL to Streaming Code
Streaming: overlapping computation with communication
On-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)
Idea: tensor loops become multi-buffered loops
[Figure: block-diagonal matrix of A blocks mapping x to y; in the i'th iteration: write A_{i-1}, compute A_i, read A_{i+1}]
(Trickier for other SPL constructs)
Useful for: throughput-optimized code; large, out-of-chip sizes
Idea: rewrite algorithm at SPL level to achieve largest DMA packets
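A multi-buffered tensor loop can be sketched in plain C as below. This shows only the buffer rotation (write A_{i-1}, compute A_i, read A_{i+1}). The `dma_get` and `dma_put` helpers are hypothetical stand-ins built on `memcpy`, so nothing actually overlaps here. On the Cell they would be asynchronous DMA transfers, with a wait before a buffer is reused. The chunk size and all names are assumptions.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per block of A (assumed for illustration) */

/* Hypothetical in-place kernel: buf = A buf. */
typedef void (*kernel_fn)(float *buf, size_t n);

/* Synchronous stand-ins for asynchronous DMA transfers. */
static void dma_get(float *ls, const float *mem, size_t n)
{
    memcpy(ls, mem, n * sizeof *ls);
}

static void dma_put(float *mem, const float *ls, size_t n)
{
    memcpy(mem, ls, n * sizeof *ls);
}

/* Stand-in kernel for A (illustration only). */
void scale_by_two(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* y = (I_iters (x) A) x with three rotating local buffers. */
void stream_tensor(size_t iters, kernel_fn A, const float *x, float *y)
{
    float buf[3][CHUNK];

    if (iters == 0)
        return;

    dma_get(buf[0], x, CHUNK);                      /* prologue: read block 0 */
    for (size_t i = 0; i < iters; i++) {
        if (i >= 1)                                 /* write A_{i-1} */
            dma_put(y + (i - 1) * CHUNK, buf[(i - 1) % 3], CHUNK);
        if (i + 1 < iters)                          /* read A_{i+1} */
            dma_get(buf[(i + 1) % 3], x + (i + 1) * CHUNK, CHUNK);
        A(buf[i % 3], CHUNK);                       /* compute A_i */
    }
    dma_put(y + (iters - 1) * CHUNK, buf[(iters - 1) % 3], CHUNK);  /* epilogue */
}
```

Three buffers are used so that the block being written back, the block being computed, and the block being fetched never share storage within one iteration.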
Generating Cell Code
[Flow diagram, linked by rewriting steps: Transform (user specified) → Fast algorithm in SPL (tag guided) → Loop operations in ∑-SPL → Cell-specific optimized C code (intrinsics, DMA, etc.)]
Properties of the generated code:
SIMD kernel optimized for memory hierarchy
Streamed from memory for throughput
All-to-all communication (on-chip)
Load balanced across p SPEs
Generated Code Sample
(Callouts on the slide mark the DMA, parallelized, and vectorized parts of the code.)

/* Complex-to-complex DFT size 64 on 2 SPEs */
void dft_c2c_64(float *X, float *Y, int spuid)
{
    int i;

    // Block 1: (I x A) L
    for (i = 0; i <= 7; i++)    // Rightmost gather
    {
        DMA_GATHER(gath_func(X, i), gath_func(T1, i), 4);    // uses spu_mfcdma()
    }
    spu_mfcstat(MFC_TAG_UPDATE_ALL);    // Wait on gather

    // compute vectorized DFT kernel of size m

    for (i = 0; i <= 7; i++)    // Scatter at interface
    {
        DMA_SCATTER(scat_func(T1, i), scat_func(T2, i), 4);
    }
    all_to_all_synchronization_barrier();    // uses mailbox msgs

    // Block 2: (A x I)
    /* Gather is a no-op since the scatter above accounted for it */

    // compute vectorized DFT kernel of size n

    for (i = 0; i <= 7; i++)    // Leftmost scatter
    {
        DMA_SCATTER(scat_func(T1, i), scat_func(Y, i), 4);
    }
    all_to_all_synchronization_barrier();
}
DFT 2^16: 4,000+ lines of code!
Problem Space: Options
[Figure: diagrams showing, for each option, how DFTs are placed on one or more SPEs]
Base (vectorized): vectorization assumed
(Only for small DFTs)
Main memory operations: latency optimized (default); throughput, multibuffered
Parallelization: single DFT parallelized across multiple SPEs; multiple independent DFTs on multiple SPEs; multiple parallelized independent DFTs
Problem Space: Combinations
[Figure: SPE/DFT placement diagrams, grouped into throughput-optimized and latency-optimized usage scenarios]
Scenarios shown: parallel, multibuffered DFT; single DFT from main memory; independent DFTs multibuffered in parallel
Devise rewrite rules for tags. Nestings describe all scenarios.
[Figure: performance plot for a DFT parallelized across SPEs; series: 1 SPE, 2 SPEs, 4 SPEs, 8 SPEs]
[Figure: performance plot for a DFT parallelized across SPEs; series: Spiral (1 SPE), Spiral (8 SPEs), FFTC, FFTW]
4.5x faster than FFTW, 1.63x faster than FFTC
More Performance Results
Single-SPE DFT code: split/interleaved complex formats; non-2-power sizes; double precision (PowerXCell 8i)
[Figure: performance plots; series: IBM SDK, Spiral, Mercury, Chow]
Other Linear Transforms
Discrete sine and cosine transforms, DFT with real inputs (single-SPE)
2-D DFTs, out-of-core sizes
Limited to 2-D DFTs on 1 SPE (for now)
More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009
Conclusion
Automatic generation of transform libraries: high performance; variety of scenarios, formats
High performance on Cell requires: vectorization, multicore parallelization, streaming, DMA code
Future processors likely to have similar paradigms, tradeoffs
Spiral approach: common abstraction of transform, algorithm, architecture (SPL); rewrite rules to go from transform to architecture
[Figure: diagram relating algorithm space and architecture space]