The Fastest Fourier Transformmb/Teaching/Week13/fftw-slides2.pdf · 2013-04-15 · The Fastest...
Transcript of The Fastest Fourier Transformmb/Teaching/Week13/fftw-slides2.pdf · 2013-04-15 · The Fastest...
FFTW:An Adaptive Software
Architecture for the FFT
Steven G. JohnsonCondensed Matter TheoryMIT Physics Department
Matteo FrigoSupercomputing Technologies GroupMIT Laboratory for Computer Science
The Fastest Fourier Transformin the West
Steven G. JohnsonCondensed Matter TheoryMIT Physics Department
Matteo FrigoSupercomputing Technologies GroupMIT Laboratory for Computer Science
FFTW
A C library for computing the FFTin one or more dimensionsincludes parallel transforms
Fastcompetitive with vendor-tuned librariesFFTW’s performance is portable
Here today (since 3/97)...with thousands of happy users
FFTW is:
http://theory.lcs.mit.edu/~fftw
FFTW
Always the fastest codebut is the fastest more often than any otherprogram
A new FFT algorithm per seit is a new way of implementing knownalgorithms
FFTW is Not:
B
B
B
B
B
B
B
B
BB B
B
B
B
BB
B B B B B B
J
J
J
J
J
J
J
J
J
J
JJ
J
J
J JJ J J J
J J
H
H
H
H
H
HH
HH
H
H
HH
H
H
HH H H H
H H
F
F
F
F
F
F
F
F
F
FF
F
F
F
F F F F F FF F
MM
M
M
M
M M
M
M
M
MM
MM
MM
M M M M M M
77 7 7 7
7 77 7
77
7 77 7 7 7 7 7
2 4 8
16 32 64
128
256
512
1024
2048
4096
8192
1638
4
3276
8
6553
6
1310
72
2621
44
5242
88
1048
576
2097
152
4194
304
0
50
100
150
200
250
Spe
ed in
“M
FLO
PS
” =
5 N
lg N
/ µs
Transform Size
B FFTW
J SUNPERF
H Ooura
F FFTPACK
Green
Sorensen
Singleton
Krukar
M Temperton
Bailey
NR
7 QFT
FFT Benchmark on a 167MHz UltraSPARC-I (xolas)
B
B
B
B
B
B
B
B
B
BB
B B
B
BB
B B B B BJ
J
J J
J
J
JJ
J
J
J JJ
J
J
JJ J J J J
H
H
H
H
HH
H
H
HH
HH
H
H H H H
F
F
F
F
FF
F
F F F
F FF F F
F F FF F
M
M
M
M
M
MM
M
M
M M M
M
M
M MM M M M M
2 4 8
16 32 64
128
256
512
1024
2048
4096
8192
1638
4
3276
8
6553
6
1310
72
2621
44
5242
88
1048
576
2097
152
0
50
100
150
200
250
Spe
ed in
“M
FLO
PS
” =
5 N
lg N
/ µs
Transform Size
B FFTW
J Ooura
H Green
F Bergland
GSL
Krukar
Mayer (simple)
Singleton (f2c)
M FFTPACK (f2c)
Temperton (f2c)
FFT Benchmark on a 300MHz Pentium II
FFTW
Over 40 FFT codes
19 different machines
Both 1D and 3D transforms
Transform sizes not a power of 2
We Also Have Results From...
FFTW
The Runtime Planneroptimizes FFTW for your CPU, your cachesize, etcetera
The Codeletscomposable blocks of optimized codecomputer generated
The Executorinterprets the plan to compose the codeletsincludes a few tricks of its own
But first, a review of the FFT...
Why is FFTW So Fast?
FFTW
Computes a DFT of size N = N1 * N2first, does N1 transforms of size N2then, multiplies by “twiddle factors”finally, does N2 transforms of size N1
Performance is O(N lg N)
Base cases of recursion are optimizedsmall transforms
The Cooley-Tukey FFT Algorithm
FFTW
For a given N, there are manyfactorizations
not clear a priori which is best
The planner tries them “all” and picksthe best one
uses actual runtime timing measurementsresult is encoded in a “plan”
Uses dynamic programming to reducenumber of possible plans
remembers optimal sub-plans for small sizes
The Planner
FFTW
Example Plans
N=32768 on Alpha and Pentium II
Alpha:RADIX 16: 32768 -> 16*2048RADIX 8: 2048 -> 8*256RADIX 8: 256 -> 8*32SOLVE 32
Pentium II:RADIX 64: 32768 -> 64*512RADIX 16: 512 -> 16*32SOLVE 32
FFTW
Generates highly-optimized transformsof small sizes (“codelets”)
with and without twiddle factorsform the base cases of the FFT recursion
Written in the Caml-Light dialect of ML
Manipulates abstract syntax tree whichis unparsed to C
knows about complex arithmetic, etcetera
The Codelet Generator
FFTW
Long, unrolled code takes advantage of:optimizing compilers
instruction scheduling, etceteralarge register sets
Applies tedious optimizations
Easy to experiment with differentalgorithms
prime factor, split-radix, Raderexpress the algorithm once, abstractly
various optimization hacks
You only have to get it right once
Advantages of Generating Codelets
FFTW
Here is a fragment that helps simplifymultiplications:
The Expression Simplifier
simplify_times = fun (Real a) (Real b) -> (Real (a *. b)) | a (Real b) -> simplify_times (Real b) a | (Uminus a) b -> simplify (Uminus (Times (a,b))) | (Real a) (Times ((Real b), c)) -> simplify (Times ((Real (a *. b)), c)) | (Real a) b -> if (almost_equal a 0.0) then (Real 0.0 else if (almost_equal a 1.0) then b else if (almost_equal a (-1.0))then simplify (Uminus b) else Times ((Real a), b) | ...
FFTW
Quiz: which of the following is faster?
Tricky Optimizing Rules
a = 0.5 * b;c = 0.5 * d;e = 1.0 + a;f = 1.0 - c;
a = 0.5 * b;c = -0.5 * d;e = 1.0 + a;f = 1.0 + c;
Answer: the fragment on the left.
The number of floating-point constantsshould be minimized.
FFTW
Executes the plan bycomposing codelets
Explicit recursiondivide-and-conquer uses all levelsof the memory hierarchy
Novel storage for the twiddlefactors
store them in the order they areused
The Executordoes not fit in cache
fits in cache
FFTW
Complexity is abstracted from the user:
FFTW is Easy to Use
COMPLEX A[n], B[n];fftw_plan plan;
/* create the plan: */plan = fftw_create_plan(n);
/* use the plan: */fftw(plan,A);
/* re-use the plan: */fftw(plan,B);
FFTW
All this for an FFT?
Modern architectures are invalidatingconventional wisdom about what is fast
no new wisdom is emerging
In the name of performance, designershave sacrificed:
predictabilityrepeatabilitycomposability
Hand-optimization of programs isbecoming impractical
Parting Thought: This is Ridiculous!