Parallelism in Spiral (cscads.rice.edu/Franchetti.pdf)
![Page 1: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/1.jpg)
Carnegie Mellon
Parallelism in Spiral
This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel
Franz Franchetti
Electrical and Computer Engineering
Carnegie Mellon University
Joint work with Yevgen Voronenko and Markus Püschel
… and the Spiral team (only part shown)
![Page 2: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/2.jpg)
The Problem
[Figure: Discrete Fourier Transform (single precision) on 2 x Core2 Extreme 3 GHz. x-axis: problem size (16 to 262144); y-axis: performance [Gflop/s] (0 to 26). Curves: best code, Numerical Recipes. Annotation: 30x.]
What’s going on?
![Page 3: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/3.jpg)
Automatic Performance Tuning
Current vicious circle: Whenever a new platform comes out, the same functionality needs to be rewritten and reoptimized
Automatic Performance Tuning
BLAS: ATLAS
Linear algebra: Bebop, Spike, Flame
Sorting
Fourier transform: FFTW
Linear transforms: Spiral
… others
New compiler techniques
Proceedings of the IEEE special issue, Feb. 2005
But what about parallelism … ?
![Page 4: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/4.jpg)
Vision Behind Spiral
[Diagram: Current vs. Future path from numerical problem to computing platform. Current: algorithm selection and implementation are human effort and produce a C program; only compilation is automated. Future: algorithm selection, implementation, and compilation are all automated.]
C code a singularity: Compiler has no access to high level information
Challenge: conquer the high abstraction level for complete automation
![Page 5: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/5.jpg)
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
![Page 6: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/6.jpg)
Spiral
Library generator for linear transforms (DFT, DCT, DWT, filters, …) and recently more …
Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU
Research Goal: “Teach” computers to write fast libraries
Complete automation of implementation and optimization
Conquer the “high” algorithm level for automation
When a new platform comes out: Regenerate a retuned library
When a new platform paradigm comes out (e.g., vector or CMPs): Update the tool rather than rewriting the library
Intel has started to use Spiral to generate parts of their MKL library
![Page 7: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/7.jpg)
How Spiral Works
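[Diagram: Spiral pipeline. Problem specification (transform) → Algorithm Generation and Algorithm Optimization → algorithm → Implementation and Code Optimization → C code → Compilation and Compiler Optimizations → Fast executable. The measured performance feeds a Search block, which controls the algorithm and implementation stages.]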
Spiral: Complete automation of the implementation and optimization task
Basic idea: Declarative representation of algorithms
Rewriting systems to generate and optimize algorithms
![Page 8: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/8.jpg)
What is a (Linear) Transform?
Mathematically: Matrix-vector multiplication
y = T · x, with x the input vector (signal), y the output vector (signal), and the transform T = matrix
Example: Discrete Fourier transform (DFT)
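A minimal sketch of "transform = matrix": the Python below builds a DFT matrix and applies it as a plain matrix-vector product. The names `dft_matrix`, `matvec`, and the `sign` parameter are illustrative assumptions, not Spiral code. The default root of unity exp(-2πi/n) is the common textbook convention; the slide's own definition did not survive extraction, so the sign is left adjustable.

```python
import cmath

def dft_matrix(n, sign=-1):
    """Return the n x n matrix with entries w**(k*l), w = exp(sign*2*pi*i/n)."""
    w = cmath.exp(sign * 2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def matvec(matrix, x):
    """Plain O(n^2) matrix-vector product: y = matrix * x."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

x = [1, 2, 3, 4]             # input vector (signal)
y = matvec(dft_matrix(4), x)  # output vector (signal)
# y is [10, -2+2j, -2, -2-2j] up to floating-point rounding
```

This direct product costs O(n^2) operations, which is the baseline that fast algorithms improve on.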
![Page 9: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/9.jpg)
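Transform Algorithms: Example 4-point FFT
Cooley/Tukey fast Fourier transform (FFT):

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & j & -1 & -j \\ 1 & -1 & 1 & -1 \\ 1 & -j & -1 & j \end{bmatrix}
=
\begin{bmatrix} 1 & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & 1 \\ 1 & \cdot & -1 & \cdot \\ \cdot & 1 & \cdot & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & \cdot & \cdot & j \end{bmatrix}
\begin{bmatrix} 1 & 1 & \cdot & \cdot \\ 1 & -1 & \cdot & \cdot \\ \cdot & \cdot & 1 & 1 \\ \cdot & \cdot & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 1 & \cdot \\ \cdot & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 1 \end{bmatrix}
$$

$$
\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, T^4_2\, (I_2 \otimes \mathrm{DFT}_2)\, L^4_2
$$

Symbols: DFT = Fourier transform, I = identity, L = permutation, T = diagonal matrix (twiddles), ⊗ = Kronecker product
Algorithms reduce arithmetic cost O(n^2) → O(n log(n))
Product of structured sparse matrices
Mathematical notation exhibits structure: SPL (signal processing language)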
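The factorization on this slide can be checked numerically. The sketch below multiplies the four sparse factors of the 4-point example (Kronecker products with DFT_2, the twiddle diagonal, and the stride permutation) and compares the result with the dense 4-point matrix whose second row is 1, j, -1, -j. The helper names `kron`, `matmul`, `identity`, and `max_abs_diff` are assumptions made for this illustration and use only the Python standard library.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Dense matrix product a * b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two equally sized matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

DFT2 = [[1, 1], [1, -1]]
I2 = identity(2)
T = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1j]]  # twiddles
L = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]   # stride permutation

# (DFT_2 ⊗ I_2) * T * (I_2 ⊗ DFT_2) * L
product = matmul(matmul(matmul(kron(DFT2, I2), T), kron(I2, DFT2)), L)

# Dense 4-point matrix as printed on the slide: entry (k, l) is j**(k*l).
DFT4 = [[1j ** (k * l) for l in range(4)] for k in range(4)]

difference = max_abs_diff(product, DFT4)  # expected to be zero up to rounding
```

Each factor has at most two nonzeros per row, which is where the reduced operation count comes from.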
![Page 10: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/10.jpg)
Examples: Transforms
Spiral currently contains 55 transforms
![Page 11: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/11.jpg)
Examples: Breakdown Rules (currently ≈220)
Base case rules
“Teaches” Spiral about existing algorithm knowledge (~200 journal papers)
Includes many new ones (algebraic theory: Püschel, Moura, Voronenko)
![Page 12: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/12.jpg)
SPL to Sequential Code
Example: tensor product
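The slide's own code listing was lost in extraction. As a hedged stand-in, the sketch below shows how the two basic tensor-product constructs can map to loops: I_n ⊗ A becomes a loop over contiguous blocks, and A ⊗ I_n becomes a loop over strided subvectors. The function names and the convention that A is passed as a callable `apply_a` on length-m lists are assumptions for this example, not Spiral's generated C code.

```python
def apply_i_tensor_a(n, m, apply_a, x):
    """y = (I_n ⊗ A) x: apply A to n contiguous blocks of length m."""
    y = []
    for i in range(n):
        y.extend(apply_a(x[i * m:(i + 1) * m]))
    return y

def apply_a_tensor_i(m, n, apply_a, x):
    """y = (A ⊗ I_n) x: apply A to n subvectors of length m at stride n."""
    y = [0] * (m * n)
    for i in range(n):
        y[i::n] = apply_a(x[i::n])
    return y

def dft2(v):
    """2-point butterfly used as the kernel A (so m = 2)."""
    return [v[0] + v[1], v[0] - v[1]]

x = [1, 2, 3, 4]
blocks = apply_i_tensor_a(2, 2, dft2, x)   # [3, -1, 7, -1]
strided = apply_a_tensor_i(2, 2, dft2, x)  # [4, 6, -2, -2]
```

Both loops have independent iterations; they differ only in the access pattern (unit stride versus stride n).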
![Page 13: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/13.jpg)
Program Generation in Spiral (Sketched)
Transform (user specified)
Fast algorithm in SPL (many choices)
∑-SPL
C Code
Iteration of this process to search for the fastest
But that’s not all …
Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
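The "iterate and search for the fastest" step can be pictured as timing several equivalent implementations and keeping the winner. The sketch below does this exhaustively over two hand-written stand-ins for (I_n ⊗ DFT_2); the candidate functions, `best_time`, and `search_fastest` are hypothetical names for this illustration, and Spiral's actual search strategies are not described on this slide.

```python
import time

def butterflies_loop(x):
    """(I_n ⊗ DFT_2) x written as an explicit index loop."""
    y = [0] * len(x)
    for i in range(0, len(x), 2):
        y[i] = x[i] + x[i + 1]
        y[i + 1] = x[i] - x[i + 1]
    return y

def butterflies_sliced(x):
    """The same computation written with slices and zip."""
    y = [0] * len(x)
    y[0::2] = [a + b for a, b in zip(x[0::2], x[1::2])]
    y[1::2] = [a - b for a, b in zip(x[0::2], x[1::2])]
    return y

def best_time(fn, x, repeats=5):
    """Smallest wall-clock time of several runs of fn(x), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        best = min(best, time.perf_counter() - start)
    return best

def search_fastest(candidates, x):
    """Time every candidate on x and return (winner name, all timings)."""
    timings = {name: best_time(fn, x) for name, fn in candidates.items()}
    return min(timings, key=timings.get), timings

candidates = {"loop": butterflies_loop, "sliced": butterflies_sliced}
winner, timings = search_fastest(candidates, list(range(1024)))
```

Which candidate wins depends on the machine, which is the point of searching rather than fixing one variant.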
![Page 14: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/14.jpg)
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
![Page 15: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/15.jpg)
SPL to Shared Memory Code: Basic Idea
Governing construct: tensor product
Independent operation, load-balanced
[Diagram: y = (block-diagonal matrix with four copies of A) · x; the four blocks are assigned to Processor 0, Processor 1, Processor 2, and Processor 3.]
Problematic construct: permutations produce false sharing
[Diagram: permutation mapping x to y.]
Task: Rewrite formulas to extract tensor product + keep contiguous blocks
[SC 06]
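A small sketch of the basic idea, under stated assumptions: a block-diagonal operator with p copies of A splits into p independent tasks, each working on one contiguous block of the input. The names `parallel_i_tensor_a` and `blocks_cover_whole_lines` are hypothetical, and Python threads are used only to show the work decomposition (they do not give a real speedup for this kind of arithmetic). The cache-line check is simplified and assumes the vector starts on a cache-line boundary.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_i_tensor_a(p, apply_a, x):
    """y = (I_p ⊗ A) x with one contiguous block of x per worker."""
    if len(x) % p != 0:
        raise ValueError("vector length must be a multiple of p")
    m = len(x) // p
    blocks = [x[i * m:(i + 1) * m] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(apply_a, blocks))  # order is preserved
    return [value for block in results for value in block]

def blocks_cover_whole_lines(block_length, mu):
    """True if each worker's block is a whole number of cache lines (mu elements)."""
    return block_length % mu == 0

def dft2(v):
    return [v[0] + v[1], v[0] - v[1]]

y = parallel_i_tensor_a(4, dft2, list(range(8)))
# y == [1, -1, 5, -1, 9, -1, 13, -1]
```

When every block covers whole cache lines, no two workers write to the same line, which is the "keep contiguous blocks" requirement in loop form.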
![Page 16: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/16.jpg)
Step 1: Shared Memory Tags
Identify crucial hardware parameters
Number of processors: p
Cache line size: μ
Introduce them as tags in SPL
This means: formula A is to be optimized for p processors and cache line size μ
![Page 17: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/17.jpg)
Step 2: Identify “Good” Formulas
Load balanced, avoiding false sharing
Tagged operators (no further rewriting necessary)
Definition: A formula is fully optimized if it is one of the above or of the form
where A and B are fully optimized.
![Page 18: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/18.jpg)
Step 3: Identify Rewriting Rules
Goal: Transform formulas into fully optimized formulas
Formulas rewritten, tags propagated
There may be choices
![Page 19: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/19.jpg)
Simple Rewriting Example
fully optimized
Loop splitting + loop exchange
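The rewriting rule on this slide did not survive extraction, so the following is only a loop-level reading of the caption "loop splitting + loop exchange", not the slide's formula. It computes y = (A ⊗ I_n) x twice: once as the natural loop nest, and once with the inner loop over i split as i = c·(n/p) + k and the new loop over c moved outermost, so that each of p processors gets one iteration of c. All names are assumptions for this sketch.

```python
def a_tensor_i_reference(a, n, x):
    """y = (A ⊗ I_n) x as the natural loop nest: rows of A outside, i inside."""
    m = len(a)
    y = [0] * (m * n)
    for r in range(m):
        for i in range(n):
            y[r * n + i] = sum(a[r][s] * x[s * n + i] for s in range(m))
    return y

def a_tensor_i_split(a, n, p, x):
    """Same result after splitting i = c*(n//p) + k and hoisting the c-loop."""
    m = len(a)
    chunk = n // p                    # assumes p divides n
    y = [0] * (m * n)
    for c in range(p):                # parallel loop: one iteration per processor
        for r in range(m):
            for k in range(chunk):
                i = c * chunk + k
                y[r * n + i] = sum(a[r][s] * x[s * n + i] for s in range(m))
    return y

def written_indices(m, n, p, c):
    """Output indices written by processor c in the split version."""
    chunk = n // p
    return sorted(r * n + c * chunk + k for r in range(m) for k in range(chunk))

A = [[1, 1], [1, -1]]
x = list(range(8))
reference = a_tensor_i_reference(A, 4, x)
split = a_tensor_i_split(A, 4, 2, x)
```

Each processor writes runs of n/p contiguous elements; if n/p is a multiple of the cache line size μ (and the data is aligned), no cache line is shared between processors.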
![Page 20: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/20.jpg)
Parallelization by Rewriting
Fully optimized (load-balanced, no false sharing) in the sense of our definition
![Page 21: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/21.jpg)
Same Approach for Other Parallel Paradigms
Vectorization: [IPDPS 02, VecPar 06]
Message Passing (MPI): [ISPA 06] (with Bonelli, Lorenz, Ueberhuber, TU Vienna)
Cg/OpenGL for GPUs (with Shen, TU Denmark)
Verilog for FPGAs: [DAC 05] (with Milder, Hoe, CMU)
![Page 22: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/22.jpg)
Going Beyond Transforms
Transform = linear operator with one vector input and one vector output
Key ideas:
Generalize to (possibly nonlinear) operators with several inputs and several outputs
Generalize SPL (including tensor product) to OL (operator language)
Cooley-Tukey FFT in OL:
Viterbi in OL:
Mat-Mat-Mult:
![Page 23: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/23.jpg)
OL Rewriting Rules
SPL rules reused
Only few OL-specific rules required
New OL rules
![Page 24: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/24.jpg)
Example: Viterbi Decoder in OL
Viterbi decoder (forward part) as operator
Viterbi kernel (butterfly)
Viterbi algorithm as breakdown rule
First non-transform supported by Spiral
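The OL definitions on this slide were lost in extraction. For orientation only, here is a textbook add-compare-select (ACS) butterfly of the kind a Viterbi forward pass is built from; it is not Spiral's operator definition. It assumes the usual shift-register trellis in which, for N states, predecessors i and i + N/2 feed successors 2i and 2i + 1. The function names and the `branch[pred][succ]` cost layout are hypothetical.

```python
def acs(metric_a, branch_a, metric_b, branch_b):
    """Add-compare-select: return (survivor metric, decision bit)."""
    candidate_a = metric_a + branch_a
    candidate_b = metric_b + branch_b
    if candidate_a <= candidate_b:
        return candidate_a, 0
    return candidate_b, 1

def viterbi_butterfly(old_metrics, i, branch):
    """Update successor states 2i and 2i+1 from predecessors i and i + N/2.

    branch[pred][succ] is the branch cost from predecessor pred (0 = state i,
    1 = state i + N/2) to successor succ (0 = state 2i, 1 = state 2i+1).
    """
    half = len(old_metrics) // 2
    low, high = old_metrics[i], old_metrics[i + half]
    even = acs(low, branch[0][0], high, branch[1][0])
    odd = acs(low, branch[0][1], high, branch[1][1])
    return even, odd

old = [0, 3, 5, 1]            # path metrics of a 4-state machine
branch = [[1, 2], [2, 1]]     # made-up branch costs for illustration
first = viterbi_butterfly(old, 0, branch)   # ((1, 0), (2, 0))
second = viterbi_butterfly(old, 1, branch)  # ((3, 1), (2, 1))
```

The N/2 butterflies of one decoding step are independent of each other, which is the structure that tensor-product-style rewriting can exploit.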
![Page 25: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/25.jpg)
Viterbi: Vectorization Through Rewriting
Sufficient to vectorize one input
Vectorized kernel
In-register shuffle operation
![Page 26: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/26.jpg)
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
![Page 27: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/27.jpg)
Benchmarks
[Table: kernels versus platforms. Platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU. Kernels: DFT.]
All Spiral code shown is “push-button” generated from scratch (“click”)
![Page 28: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/28.jpg)
Benchmark: Vector and SMP
[Figure: DFT (single precision) on 3 GHz 2 x Core 2 Extreme. x-axis: input size (256 to 262144); y-axis: performance [Gflop/s] (0 to 30). Series: Spiral 5.0 SPMD, Spiral 5.0 sequential, Intel IPP 5.0, FFTW 3.2 alpha SMP, FFTW 3.2 alpha sequential. Annotations: 2 processors, 4 processors; memory footprint < L1$ of 1 processor; 25 Gflop/s!]
4-way vectorized + up to 4-threaded + adapted to the memory hierarchy
![Page 29: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/29.jpg)
Benchmark: Cell (1 processor = SPE)
[Figure: DFT (single precision) on 3.2 GHz Cell BE (single SPE). x-axis: input size (16 to 1024); y-axis: performance [Gflop/s] (0 to 16).]
Generated using the simulator; run at Mercury (thanks to Robert Cooper)
Joint work with Th. Peter (ETH Zurich), S. Chellappa, M. Telgarsky, J. Moura (CMU)
![Page 30: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/30.jpg)
DCT4, Multiples of 32: 4-way Vectorized
[Figure: DCT (single precision) on 2.66 GHz Core2 (4-way 32-bit SSE). x-axis: input size (32 to 256); y-axis: performance [Gflop/s] (0 to 10). Series: Spiral DCT4, FFTW 3.1.2 DCT4 (k11), IPP 5.1 DCT2 (?). Annotation: novel algorithm (algebraic algorithm theory).]
![Page 31: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/31.jpg)
DFT, 8-way Vectorized: All Sizes Up To 80
[Figure: DFT (16-bit integer) on 2.66 GHz Core2 Duo (8-way SSE2). x-axis: input size (2 to 77); y-axis: performance [Gflop/s] (0 to 18). Series: Spiral SSE2, Intel IPP 5.1.]
first 8-way DFTs for all sizes
arbitrary vector length / arbitrary DFT sizes in principle solved
![Page 32: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/32.jpg)
Benchmark: GPU
[Figure: WHT (single precision) on 3.6 GHz Pentium 4 with Nvidia 7900 GTX (CPU + GPU). x-axis: log2(input size) (8 to 24); y-axis: performance [Gflop/s] (0 to 6). Series: Spiral CPU, Spiral GPU.]
Joint work with H. Shen, TU Denmark
![Page 33: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/33.jpg)
Benchmark: FPGA
[Figure: DFT 256 on Xilinx Virtex 2 Pro FPGA. x-axis: area [slices] (0 to 25000); y-axis: inverse throughput (gap) [us] (0 to 8). Series: Xilinx Logicore 3.2, Spiral. An arrow marks the "better" direction.]
Joint work with P. Milder, J. Hoe (CMU)
competitive with professional designs
much larger set of performance/area trade-offs (Pareto-optimal designs)
![Page 34: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/34.jpg)
Benchmark: Hardware Accelerator (FPGA)Xilinx Virtex 2 Pro FPGA: 1M gates @ 100 MHz + 2 PowerPC 405 @ 300 MHz
Fixed set of accelerators speed up a whole library
[Figure: x-axis: problem size (16 to 8192); y-axis: performance [Mflop/s] (0 to 700). Series: Software only, Software + hardware. Annotations: 6.5x, native sizes; an arrow marks the "better" direction.]
Joint work with P. D’Alberto (Yahoo), A. Sandryhaila, P. Milder, J. Hoe, J. M. F. Moura (CMU), J. Johnson (Drexel)
![Page 35: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/35.jpg)
Benchmarks
[Table: kernels versus platforms. Platforms: vector, dual/quad core, GPU, FPGA, FPGA+CPU. Kernels: DFT, filter, GEMM, Viterbi.]
All Spiral code shown is “push-button” generated from scratch (“click”)
![Page 36: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/36.jpg)
Benchmark: Finite Impulse Response Filter
[Figure: FIR filter (double precision) on 2.33 GHz 2x Core 2 Quad (8 threads). x-axis: log2(input size) (6 to 20); y-axis: performance [Gflop/s] (0 to 45). Series: Spiral 8 taps, Spiral 32 taps, IPP 32 taps, IPP 8 taps. Annotation: theoretical peak performance: 74.66 Gflop/s.]
![Page 37: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/37.jpg)
Beyond Transforms: Viterbi Decoding
[Figure: Viterbi decoding (8-bit) on 2.66 GHz Core 2 Duo. x-axis: log2(size of state machine) (6 to 13); y-axis: performance [Gbutterflies/s] (0 to 2.5). Series: Spiral 16-way vectorized, Spiral scalar, Karn's Viterbi decoder (hand-tuned assembly). Annotation: 1 butterfly = ~22 ops.]
Vectorized using practically the same rules as for DFT
Joint work with S. Chellappa, CMU
Karn: http://www.ka9q.net/code/fec/
![Page 38: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/38.jpg)
First Results: Matrix-Matrix-Multiply
[Figure: DGEMM on 3 GHz Core 2 Duo (1 thread). x-axis: N (matrix size is N x N) (2 to 512); y-axis: performance [Gflop/s] (0 to 12). Series: Goto, Spiral, triple loop.]
Work with F. de Mesmay, CMU
![Page 39: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/39.jpg)
Organization
Spiral overview
Parallelization in Spiral
Results
Concluding remarks
![Page 40: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/40.jpg)
Conclusions
Automatic generation of very fast and fastest numerical kernels is possible and desirable
High level language and approach
Algorithm generation
Algorithm optimization
Same approach for loop optimization, different forms of parallelism, SW and HW implementations
![Page 41: Parallelism in Spiral](https://reader035.fdocuments.in/reader035/viewer/2022071410/6104acf14600c01f96141e3e/html5/thumbnails/41.jpg)
Spiral Web Interface @spiral.net (beta version)
1. Select platform
2. Select functionality
3. “click”