The Fastest Fourier Transformmb/Teaching/Week13/fftw-slides2.pdf · 2013-04-15 · The Fastest...

FFTW:An Adaptive Software

Architecture for the FFT

Steven G. JohnsonCondensed Matter TheoryMIT Physics Department

Matteo FrigoSupercomputing Technologies GroupMIT Laboratory for Computer Science

The Fastest Fourier Transformin the West

Steven G. JohnsonCondensed Matter TheoryMIT Physics Department

Matteo FrigoSupercomputing Technologies GroupMIT Laboratory for Computer Science

FFTW

A C library for computing the FFTin one or more dimensionsincludes parallel transforms

Fastcompetitive with vendor-tuned librariesFFTW’s performance is portable

Here today (since 3/97)...with thousands of happy users

FFTW is:

http://theory.lcs.mit.edu/~fftw

FFTW

Always the fastest codebut is the fastest more often than any otherprogram

A new FFT algorithm per seit is a new way of implementing knownalgorithms

FFTW is Not:

B

B

B

B

B

B

B

B

BB B

B

B

B

BB

B B B B B B

J

J

J

J

J

J

J

J

J

J

JJ

J

J

J JJ J J J

J J

H

H

H

H

H

HH

HH

H

H

HH

H

H

HH H H H

H H

F

F

F

F

F

F

F

F

F

FF

F

F

F

F F F F F FF F

MM

M

M

M

M M

M

M

M

MM

MM

MM

M M M M M M

77 7 7 7

7 77 7

77

7 77 7 7 7 7 7

2 4 8

16 32 64

128

256

512

1024

2048

4096

8192

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

2097

152

4194

304

0

50

100

150

200

250

Spe

ed in

“M

FLO

PS

” =

5 N

lg N

/ µs

Transform Size

B FFTW

J SUNPERF

H Ooura

F FFTPACK

Green

Sorensen

Singleton

Krukar

M Temperton

Bailey

NR

7 QFT

FFT Benchmark on a 167MHz UltraSPARC-I (xolas)

B

B

B

B

B

B

B

B

B

BB

B B

B

BB

B B B B BJ

J

J J

J

J

JJ

J

J

J JJ

J

J

JJ J J J J

H

H

H

H

HH

H

H

HH

HH

H

H H H H

F

F

F

F

FF

F

F F F

F FF F F

F F FF F

M

M

M

M

M

MM

M

M

M M M

M

M

M MM M M M M

2 4 8

16 32 64

128

256

512

1024

2048

4096

8192

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

2097

152

0

50

100

150

200

250

Spe

ed in

“M

FLO

PS

” =

5 N

lg N

/ µs

Transform Size

B FFTW

J Ooura

H Green

F Bergland

GSL

Krukar

Mayer (simple)

Singleton (f2c)

M FFTPACK (f2c)

Temperton (f2c)

FFT Benchmark on a 300MHz Pentium II

FFTW

Over 40 FFT codes

19 different machines

Both 1D and 3D transforms

Transform sizes not a power of 2

We Also Have Results From...

FFTW

The Runtime Planneroptimizes FFTW for your CPU, your cachesize, etcetera

The Codeletscomposable blocks of optimized codecomputer generated

The Executorinterprets the plan to compose the codeletsincludes a few tricks of its own

But first, a review of the FFT...

Why is FFTW So Fast?

FFTW

Computes a DFT of size N = N1 * N2first, does N1 transforms of size N2then, multiplies by “twiddle factors”finally, does N2 transforms of size N1

Performance is O(N lg N)

Base cases of recursion are optimizedsmall transforms

The Cooley-Tukey FFT Algorithm

FFTW

For a given N, there are manyfactorizations

not clear a priori which is best

The planner tries them “all” and picksthe best one

uses actual runtime timing measurementsresult is encoded in a “plan”

Uses dynamic programming to reducenumber of possible plans

remembers optimal sub-plans for small sizes

The Planner

FFTW

Example Plans

N=32768 on Alpha and Pentium II

Alpha:RADIX 16: 32768 -> 16*2048RADIX 8: 2048 -> 8*256RADIX 8: 256 -> 8*32SOLVE 32

Pentium II:RADIX 64: 32768 -> 64*512RADIX 16: 512 -> 16*32SOLVE 32

FFTW

Generates highly-optimized transformsof small sizes (“codelets”)

with and without twiddle factorsform the base cases of the FFT recursion

Written in the Caml-Light dialect of ML

Manipulates abstract syntax tree whichis unparsed to C

knows about complex arithmetic, etcetera

The Codelet Generator

FFTW

Long, unrolled code takes advantage of:optimizing compilers

instruction scheduling, etceteralarge register sets

Applies tedious optimizations

Easy to experiment with differentalgorithms

prime factor, split-radix, Raderexpress the algorithm once, abstractly

various optimization hacks

You only have to get it right once

Advantages of Generating Codelets

FFTW

Here is a fragment that helps simplifymultiplications:

The Expression Simplifier

simplify_times = fun (Real a) (Real b) -> (Real (a *. b)) | a (Real b) -> simplify_times (Real b) a | (Uminus a) b -> simplify (Uminus (Times (a,b))) | (Real a) (Times ((Real b), c)) -> simplify (Times ((Real (a *. b)), c)) | (Real a) b -> if (almost_equal a 0.0) then (Real 0.0 else if (almost_equal a 1.0) then b else if (almost_equal a (-1.0))then simplify (Uminus b) else Times ((Real a), b) | ...

FFTW

Quiz: which of the following is faster?

Tricky Optimizing Rules

a = 0.5 * b;c = 0.5 * d;e = 1.0 + a;f = 1.0 - c;

a = 0.5 * b;c = -0.5 * d;e = 1.0 + a;f = 1.0 + c;

Answer: the fragment on the left.

The number of floating-point constantsshould be minimized.

FFTW

Executes the plan bycomposing codelets

Explicit recursiondivide-and-conquer uses all levelsof the memory hierarchy

Novel storage for the twiddlefactors

store them in the order they areused

The Executordoes not fit in cache

fits in cache

FFTW

Complexity is abstracted from the user:

FFTW is Easy to Use

COMPLEX A[n], B[n];fftw_plan plan;

/* create the plan: */plan = fftw_create_plan(n);

/* use the plan: */fftw(plan,A);

/* re-use the plan: */fftw(plan,B);

FFTW

All this for an FFT?

Modern architectures are invalidatingconventional wisdom about what is fast

no new wisdom is emerging

In the name of performance, designershave sacrificed:

predictabilityrepeatabilitycomposability

Hand-optimization of programs isbecoming impractical

Parting Thought: This is Ridiculous!

The Fastest Fourier Transformmb/Teaching/Week13/fftw-slides2.pdf · 2013-04-15 · The Fastest...

Documents

Transcript of The Fastest Fourier Transformmb/Teaching/Week13/fftw-slides2.pdf · 2013-04-15 · The Fastest...