SPIRAL: Tuning DSP Transforms to
Transcript of SPIRAL: Tuning DSP Transforms to
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPIRAL: SPIRAL: Tuning DSP Transforms to Tuning DSP Transforms to Computing PlatformsComputing Platforms
• Jeremy Johnson (Drexel)• Robert Johnson (MathStar Inc.)• David Padua (UIUC)• Viktor Prasanna (USC)• Markus Püschel (CMU)• Manuela Veloso (CMU)
• Franz Franchetti (TU Vienna)• Gavin Haentjens (CMU)• Pinit Kumhom (Drexel)• Neungsoo Park (USC)• David Sepiashvili (CMU)• Bryan Singer (CMU)• Yevgen Voronenko (Drexel)• Jianxin Xiong (UIUC)
Faculty Students
http://www.ece.cmu.edu/~spiral
José M. F. Moura
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SponsorSponsor
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through grant managed by
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting.
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Moore’s Moore’s Law and Law and High(estHigh(est)) Performance Performance Scientific ComputingScientific Computing
arithmetic cost model not accurate for predicting runtime (one cache miss = 10 floating point ops)better performance models hard to getbest code is machine dependent (registers/caches size, structure)hand-tuned code becomes obsolete as fast as it is writtencompiler limitationsfull performance requires (in part) assembly coding
Moore’s Law: processor-memory bottleneckshort life cycles of computersvery complex architectures
• vendor specific• special instructions (MMX, SSE, FMA, …)• undocumented features
(single processor, off-the-shelf)
Effects on software/algorithms:
Portable performance requires automation
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
AutomaticAutomatic Performance Tuning: ResearchPerformance Tuning: Research
Linear Algebra:ATLAS (J. Dongarra et al.) LAPACKPhiPACK (J. Demmel et al.)
Signal Processing: FFTW (M. Frigo and S. Johnson)SPIRAL
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPIRALSPIRALAutomates
cuts development costscode less error-prone
takes advantage of architecture specific featuresporting without loss of performance
systematic exploration of alternatives both at algorithmic and code level
are performance critical
Implementation
Platform-Adaptation
Optimization
of DSP algorithms
A library generator for highly optimized signal processing algorithms
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPIRALSPIRAL ApproachApproach
DSP Transform(DFT, DCT, Wavelets etc.)
Computing Platform
given
given
PossibleImplementations
PerformanceEvaluation
Inte
llig
ent
Sea
rchPossible
Algorithms
SP
IRA
L S
earc
h S
pac
e
(Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … )
adaptedimplementation
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
OrganizationOrganization
Mathematical Framework
Formula Generator
SPL and SPL Compiler
Search Engine
SPIRAL system
Conclusions
Transforms, Rules, and Formulas
Transform → Algorithm
Algorithm → Implementation
How to find the best implementation
Everything taken together
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DSPDSP Algorithms: Example 4Algorithms: Example 4--point DFTpoint DFT
Cooley/Tukey FFT (size 4):
product of structured sparse matricesmathematical notation
−
−
−−
=
−−−−−−
1000
0010
0100
0001
1100
1100
0011
0011
000
0100
0010
0001
1010
0101
1010
0101
11
1111
11
1111
iii
ii
4222
42224 )()( LDFTITIDFTDFT ⋅⊗⋅⋅⊗=
Fourier transform
Identity Permutation
Diagonal matrix (twiddles)
Kronecker product
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DSPDSP Algorithms: TerminologyAlgorithms: Terminology
Transform
Rule
Formula
( ) ( ) PDFTIDIDFTDFT mnmnnm ⋅⊗⋅⋅⊗→
( ) ( )( ) PFIIDIFDFT ⋅⊗⊗⋅⋅⊗= 222428
parameterized matrix
• a breakdown strategy • product of sparse matrices
• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation
nDFT
Ruletree 8DFT
2DFT 4DFT
2DFT 2DFT
• few constructs and primitives• uniquely defines an algorithm• can be translated into code
DFT is symmetric ⇒ transpose the rule:
CT rule transposed
Commuting tensor product factorsA and B square size m and n
Commutation property ⇒ further variations
Different patterns for access, storage and flow of data
( )mn mn
n mB A L A B L⊗ = ⊗
( ) ( )( ) ( )
RS RS RS RS
N S S R R S R S R
RS RS
N R S S S S R
F L I F L T I F L
F F I T L F I
= ⊗ ⊗
= ⊗ ⊗
( ) ( )RS RS
RS S R S S R SF L I F T F I= ⊗ ⊗
( ) ( )0 0
1 12 2 2 2
2 2
3 3
1 0 1 0 1 1 0 00 1 0 1 1 1 0 01 0 1 0 0 0 1 10 1 0 1 0 0 1 1
x xx xxF I I Fx xx x
−−
− −
= =⊗ ⊗
MoreMore CooleyCooley--Tukey RulesTukey Rules
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DSPDSP TransformsTransforms
( )[ ]nkliDFT n /2exp π=
222 DFTDFTWHT k ⊗⊗=
( )( )[ ]nlkDCT nII /2/1cos)( π+=
( )( )[ ]nlkDCT nIV /)2/1(2/1cos)( π++=
( )[ ]nklDST nI /sin)( π=
( )[ ]nlnkMDCT nn /)2/1)(2/)1((cos2 π+++=×
Others: filtering, Haar, Hartley, …
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sineTransforms (16 types)
modified discrete cosine transform
TT ⊗two-dimensional transform
( ) ( )H
Wl
n
n 22
2
2222 1-n1-1-n1-nn ILIDTWTDTWT −⊗⊕=discrete wavelet transform
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
RulesRules = Breakdown Strategies= Breakdown Strategies
( ) ( )Qn
IVn
IIn
IIn FIDCTDCTPDCT 22/
)(2/
)(2/
)( ⊗⋅⊕⋅→
DDCTSDCT IIn
IVn ⋅⋅→ )()(
( ) 2)(
2 2/1,1 FdiagDCT II ⋅→
rIV
n MMDCT 1)( →
1 1 12 2 2 21
( )n n n n n ni i i t
n
i
WHT I WHT I+ + + +− +
=→ ⊗ ⊗∏ … …
nnn SinDFTjCosDFTDFT ⋅+→
)(4/2/
IInnn DCTCosDFTCosDFT →
)(4/2/
IInnn DCTSinDFTSinDFT →
( ) ( ) PDFTIDIDFTDFT mnmnnm ⋅⊗⋅⋅⊗→
CDSTDCTBDFT In
Inn ⋅⊕⋅→ )( )(
2/)(2/
PDCTSMDCT IVnnn ⋅⋅→×
)(2
base case
recursive
translation
iterative
recursive
recursive
recursive
recursive
recursive
translation
iterative/recursive
( ) ( )H
Wl
n
n 22
2
2222 1-n1-1-n1-nn ILIDTWTDTWT −⊗⊕=
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Partition TreesPartition Trees
4
1 1 1 14
1 3
1 11
4
1 3
1 2
11iterative
mixed
recursive
Each WHT algorithm can be represented by a tree,
where a node labeled by n corresponds to nWHT2
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
AlgorithmsAlgorithms = Ruletrees = Formulas= Ruletrees = Formulas)(
8IIDCT
)(4
IIDCT)(
4IVDCT
R1)()( 2/2
)(2/
)(2/
)(n
IVn
IIn
IIn IFDCTDCTPDCT ⊗⋅⊕⋅→
2FR3 R6
2FR4
R3
R1
R6
2F
2FR4
2)(
22
1FDCT II ⋅→
)(2
IIDCT
)(2
IIDST
)(2
IVDCT
)(2
IIDST
)(2
IIDCT
R1R6 SDCTPDCT II
nIV
n ⋅⋅→ )()(
)(4
IIDCT)(2
IVDCT
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Mathematical Mathematical Framework for Filters Framework for Filters and Discreteand Discrete--Time Wavelet TransformsTime Wavelet Transforms
Overlapped direct sum of matrices A (m x n) and B (r x s)
Overlapped tensor product
column overlap
row overlap
A
Bv=⊕ BA v
( ) ( )Svn IIBA ⊕⊕=
AB
v
=⊕ BA v ( )( )BAII ⊕⊕= rv
m
foldk
vvvvk
−
⊕⊕⊕=⊗ BBBBI
foldk
vvvvk
−
⊕⊕⊕=⊗ BBBBI
column overlap
row overlap
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
PropertiesProperties of Overlapped Direct Sums of Overlapped Direct Sums and Tensor Productsand Tensor Products
( ) ( ) x x x 1 1
I B B I I B I Ikk
k v m n m n v n k m n k v nl l= =
⊗ = ⊕ ⊕ = ⊗ ⊗ i i
Column and row overlap direct sum
Column and row overlapped tensor product
AB
v
=⊕ BA v
T
=AT
BTv ( )A B
TT Tv= ⊕
( )Tvn s n v s⊕ = ⊕I I I I
( ) ( )
( ) ( )
x x x 1 11
x
I B I B I I B
I I I B
T Tk k kT Tvk m n v m m n k v m m n
l ll
vk m k m n
= ==
⊗ = ⊕ ⊕ = ⊗ ⊕
= ⊗ ⊗
i i
i
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
FiltersFiltersLinear convolution in matrix form
Block convolutionsOverlap-save rule
Overlap-add rule
Circular convolution rule for circulant K
( ) [ ]T110
1I −−⊗= l
lnn hhhhC
( ) ( ) ( )( )hChC bbnlbl
n/bn ⊗⊗= −+−
/11 III
( ) ( )( )( ) 1,111T
/ III −−−+−⊗⊗= lllblm/bbbmn EhChC m=n+l-1
( ) ( ) nnn hhK DFTDFTdiagDFT 1 ⋅⋅⋅= −h-first column of K
n-length of h
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DiscreteDiscrete--TimeTime Wavelet TransformWavelet Transform
DTWT scaling and wavelet expansion coefficients
Recursive algorithm (Mallat)
( )( ) 1,...,02
2
1,
,0
−=−=
−=
−−∑
∑Jjnkhxd
nkhxc
jJjJ
nnkj
JJ
nnk
scaling coefficients
wavelet coefficients
h(-n)
h1(-n)
2
2
h(-n)
h1(-n)
2
2cJ,k
cJ-1,k
dJ-1,k
cJ-2,k
dJ-2,k
VJ-1
VJ-2
VJ WJ-1
WJ-2
1
, 2 1, , 2 1,j k m k j m j k m k j mm m
c h c d h c− + − += =∑ ∑
Haar wavelets = square waves
First stage: V2 ⇒V1⊕W1
The process is repeated for the upper half of the output
11 1
2 2[ 1, 1], [ 1, 1]h h= = −
( )4 4
2 2 2 21
21
1 1 0 0 1 1 0 0
0 0 1 1 1 1 0 0
1 1 0 0 0 0 1 1
0 0 1 1 0 0 1 1
H L L I Fc
H cd
−−
− −
= = ⊗
= =
( ) ( )1 1 1 1
2
2 2 22 2 2 22,
n
n n n nn
H
HTHT I L I F HT F− − − −= ⊕ ⊗ =
HaarHaar Wavelets Wavelets –– ExampleExample
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DiscreteDiscrete--TimeTime Wavelet TransformWavelet TransformDiscrete-Time Wavelet Transform (DTWT) rule
Scaling (lowpass) and wavelet (highpass) filter coefficients
DTWT - convolution rule
( ) ( )H
Wl
n
n 22
2
2222 1-n1-1-n1-nn ILIDTWTDTWT −⊗⊕=
[ ]( ) ( ) ( )( )([ ]( ) ( ) ( )( ) )
⊗
⋅⋅⊕⋅⊗
⊕⋅⊕⋅⊗=
nn
mmm
nmmm
heChoC
leCloCH
I1
1LI11
LI11
2T
2/T
2/2/
2T
2/T
2/2/
′′′
=−
−
110
110
l
l
hhh
hhhW
lo-lowpass odd coeffs., le-lowpass even coeffs.
ho-highpass odd coeffs., he-highpass even coeffs.
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DTWTDTWT RuletreeRuletree –– ExampleExample
DTWTn
DTWTn/2 C(lo) Cm/2(le) C(ho) C(he)
DTWT-convolution rule
DTWTn/4filters
Overlap-add rule
Cb(le)
Circulant rule
K(le0)
Circular convolution rule
1DFT−s sDFT
m=n+l-1
le0=le padded with b-1 zeroes
s=l/2+b-1
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
FormulaFormula for a DCT, size 16for a DCT, size 16
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
HelpfulHelpful ConceptConcept
DSP Transforms Formal Languages
DSP transform (of size)
Rule
Formula/Algorithm
Non-terminal symbol (with attribute)
Rule (production)
Element in Language (only terminals)
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
MathematicalMathematical Framework: SummaryFramework: Summary
fast algorithms represented as ruletrees (easy generation/manipulation)and as formulas (can be translated into code)
formulas built from few constructs and primitives
many different algorithms/formulas generated from few rules (combinatorial explosion)
these algorithms are (essentially) equal in arithmetic cost, but differ in data flow
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
OrganizationOrganization
Mathematical Framework
Formula Generator
SPL and SPL Compiler
Search Engine
SPIRAL system
Conclusions
Transforms, Rules, and Formulas
Transform → Algorithm
Algorithm → Implementation
How to find the best implementation
Everything taken together
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
FormulaFormula GenerationGeneration
recursive application
Formula Generator
data base (extensible!)
data type
rules control
transforms formulasruletrees
translation
search engine formula translation
(spl compiler)export
runtime
written in GAP/AREP (computer algebra system)
all computation/manipulation is symbolic
exact arithmetic
easy extensible rule and transform data base
verification of rules and formulas
cut here for otheroptimization problems
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Formula Generator: SummaryFormula Generator: Summary
Concept:
Implementation:
• easy extensible• formulas as ruletrees (efficient representation, easy manipulation)• interface to spl• allows implementation of search methods
• efficient implementation moves bottleneck to spl compiler• fully symbolic derivation/manipulation of formulas• verification• tools to analyze structure, arithmetic cost of formulas• interfaces with search engine
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
OrganizationOrganization
Mathematical Framework
Formula Generator
SPL and SPL Compiler
Search Engine
SPIRAL system
Conclusions
Transforms, Rules, and Formulas
Transform → Algorithm
Algorithm → Implementation
How to find the best implementation
Everything taken together
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
FormulasFormulas in SPLin SPL
( compose( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )( permutation ( 1 3 4 2 ) )( tensor
( I 2 )( F 2 )
)( permutation ( 1 4 2 3 ) )( direct_sum
( compose( F 2 )( diagonal ( 1 sqrt(1/2) ) )
)( compose
( matrix( 1 1 0 )( 0 (-1) 1 )
)( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )( matrix( 1 0 )( 1 1 )( 0 1 )
)( permutation ( 2 1 ) )
• • • •
• • • •
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPLSPL Syntax (Subset)Syntax (Subset)matrix operations:
(compose formula formula ...)(tensor formula formula ...)(direct_sum formula formula ...)
direct matrix description:(matrix (a11 a12 ...) (a21 a22 ...) ...)(diagonal (d1 d2 ...))(permutation (p1 p2 ...))
parameterized matrices:(I n)(F n)
scalars:1.5, 2/7, cos(..), w(3), pi, 1.2e-04
definition of new symbols:(define name formula)(template formula (i-code-list)
directives for code generation#codetype real/complex#unroll on/off
allows extension of SPL
controls loop unrolling
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPLSPL Compiler, 4Compiler, 4--point FFTpoint FFT
(compose (tensor (F 2) (I 2)) (T 4 2)(tensor (I 2) (F 2)) (L 4 2))
f0 = x(1) + x(3)f1 = x(1) - x(3)f2 = x(2) + x(4)f3 = x(2) - x(4)f4 = (0.00d0,-1.00d0)*f(3)y(1) = f0 + f2y(2) = f0 - f2y(3) = f1 + f4y(4) = f1 - f4
r0 = x(1) + x(5)r1 = x(1) - x(5)r2 = x(2) + x(6)r3 = x(2) - x(6)r4 = x(3) + x(7)r5 = x(3) - x(7)r6 = x(4) + x(8)r7 = x(4) - x(8)y(1) = r0 + r4y(2) = r1 + r5y(3) = r0 - r4y(4) = r1 - r5y(5) = r2 + r7y(6) = r3 - r6y(7) = r2 - r7y(8) = r3 + r6
fast algorithmas
formulaas
SPL program#codetype
complex real
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPLSPL Compiler: SummaryCompiler: Summary
Parsing
Intermediate Code Generation
Intermediate Code Restructuring
Target Code Generation
Symbol Table Abstract Syntax Tree
I-Code
I-Code
C, FORTRAN function
Template Table
SPL Formula Template DefinitionSymbol Definition
OptimizationI-Code
SPL Program
Built-in optimizations:
• single static assignment code• no reuse of temporary vars• only scalar temporary vars• constants precomputed
Extensible through templates
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
TemplatesTemplates
(template(F n)[ n >= 1 ]( do i=0,n-1
y(i)=0do j=0,n-1y(i)=y(i)+W(n,i*j)*x(j)
endend ))
Pattern
I-code
Condition
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Code Generation and Template MatchingCode Generation and Template Matching
(F 2) matches pattern (F n) and assigns 2 to n.Because n=2 satisfies the condition n>=1,the following i-code is generated from the template:
do i = 0,1y(i) = 0do j = 0,1
y(i) = y(i)+W(2,i*j)*x(j)end
end
Y(0)=x(0)+x(1)y(1)=x(0)-x(1)
Unrolling & Optimization
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SIMDSIMD Short Vector ExtensionsShort Vector Extensions
+ x
vector length = 4(4-way)
Extension to instruction set architectureAvailable on most current architectures (SSE on Pentium, AltiVec on Motorola G4)Originally for multimedia (like MMX for integers)Requires fine grain parallelismLarge potential speed-up
SIMD instructions are architecture specificNo common API (usually assembly hand coding)Performance very sensitive to memory accessAutomatic vectorization very limited
Problems:
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Vector Vector Code Generation from SPL FormulasCode Generation from SPL Formulas
Naturally vectorizable construct
A
x y
4IA ⊗vector length
iiii
k
ii QEIADP )(
1υ⊗∏
=
Pi, Qi permutationsDi, Ei diagonalsAi arbitrary formulasν SIMD vector length
(Current) generic construct completely vectorizable:
Vectorization in two steps:
1. Formula manipulation using manipulation rules2. Code generation (vector code + C code)
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPL SPL Compiler: Vector Code GenerationCompiler: Vector Code Generation
( ) ( )( )⋅′⋅⊗⋅⊗=16444
84416 TIDFTLIDFT
( ) ( ) 16444
1644416 LDFTITIDFTDFT ⋅⊗⋅⋅⊗=
( )( )( ) ( ) ( )( )82444
8442
164
824 LIIDFTLIILLI ⊗⋅⊗⋅⊗⊗⊗
Symbolic vectorization(automatic formula manipulation)
Mapping to C code + vector API
LOAD_VECT(xl0, x + 0);LOAD_VECT(xl4, x + 16);f0 = SIMD_SUB(xl0, xl4);LOAD_VECT(xl1, x + 4);LOAD_VECT(xl5, x + 20);f1 = SIMD_SUB(xl1, xl5);...yl7 = SIMD_SUB(f1, f4);STORE_L_8_4(yl6, yl7, y + 24);yl2 = SIMD_SUB(f0, f5);yl3 = SIMD_ADD(f1, f4);STORE_L_8_4(yl2, yl3, y + 8);
SSESSE2AltiVec…
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
OrganizationOrganization
Mathematical Framework
Formula Generator
SPL and SPL Compiler
Search Engine
SPIRAL system
Conclusions
Transforms, Rules, and Formulas
Transform → Algorithm
Algorithm → Implementation
How to find the best implementation
Everything taken together
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
WhyWhy Search?Search?Toy problem
DCT IV- size 24
~31000 formulas
Search in algorithm space in SPIRAL:Exhaustive & Random Search, DP, Hill climbing, Genetic AlgorithmsBeyond search: design of optimal tree
DCT IV- size 24
~31000 formulas
scheduled
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
WhyWhy Search?Search?
DCT, type IV, size 16
• maaaany different formulas• large spread in runtimes, even for modest size• precisely equal arithmetic cost• best formula is platform-dependent
~31000 formulas
Toy problem:scheduled
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
NumberNumber of Formulas/Algorithmsof Formulas/Algorithms
k
123456789
# DFT, size 2^k
16
40296
27744162570361280~1.01 • 10^27~2.31 • 10^61
~2.86 • 10^133
# DCT-IV, size 2^k
110
12631242
19244433627343815121631354242
~1.07 • 10^38~2.30 • 10^76
~1.06 • 10^153
differ in data flow not in arithmetic cost exponential search space
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SearchSearch Methods Available in SPIRALMethods Available in SPIRAL
Exhaustive SearchDynamic Programming (DP)Random SearchHill ClimbingSTEER (similar to a genetic algorithm)
Good100s-1000sAllHill Climbing
(very) good100s-1000sAllSTEER
fair/goodUser decidedAllRandom
(very) good10s-100sAll DP
BestAllVery smallExhaust
ResultsTimedSizesFormulasPossible
Search over • algorithm space and • implementation options (degree of unrolling)
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
STEERSTEERPopulation n:
Population n+1:
……
……
Mutation
Cross-Breeding expanddifferently
swapexpansions
Survival of Fittest
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
LearningLearning to Generate Fast Algorithmsto Generate Fast Algorithms
• Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy)• Learns from a transform of one size, generates the best algorithm for many sizes• Tested for DFT and WHT
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Some Experimental ResultsSome Experimental Results
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
DCTDCT Type IV Size 16Type IV Size 16
Fastest Found Formulas Number of Formulas Timed
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Experimental Experimental ResultsResults
high performance code(compared with FFTW)
different transforms
search methods(applicable to all transforms)
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Some Experimental ResultsSome Experimental Results
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
GeneratedGenerated DFT Code: Pentium 4, SSEDFT Code: Pentium 4, SSE(P
seu
do
) g
flo
ps
DFT 2n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
n
speedups (to C code) up to factor of 3.1
0
1
2
3
4
5
6
7
4 5 6 7 8 9 10 11 12 13
Spiral SSEIntel MKL interl.FFTW 2.1.3Spiral CSpiral C vectSIMD-FFT
hand-tuned vendor assembly code
compiler vectorization
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
GeneratedGenerated DFT Code: Pentium 4, SSE2DFT Code: Pentium 4, SSE2g
flo
ps
DFT 2n double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
n
speedups (to C code) up to factor of 1.8
0
0.5
1
1.5
2
2.5
3
3.5
4 5 6 7 8 9 10 11 12
Intel MKL splitSpiral SSE2FFTW 2.1.3Spiral CSpiral C vectSpiral SSE2 exh.
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
GeneratedGenerated DFT Code: Pentium III, SSEDFT Code: Pentium III, SSEg
flo
ps
DFT 2n single precision, Pentium III, 1 GHz, using Intel C compiler 6.0n
speedups (to C code) up to factor of 2.1
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
4 5 6 7 8 9 10 11 12 13
Intel MKLSpiral SSESpiral C vectSpiral CFFTW 2.1.3
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Other Other transformstransforms
0
0.5
1
1.5
2
2.5
3
3.5
4 5 6 7 8 9 10 11 12 13 14
Spiral floatSpiral C vectSpiral C SSE
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
4x4 8x8 16x16 32x32 64x64 128x128
Spiral SSESpiral C vectSpiral C float
gfl
op
s
transform size transform size
2-dim DCT 2n x 2n
Pentium 4, 2.53 GHz, SSEWHT 2n
Pentium 4, 2.53 GHz, SSE
speedups (to C code) up to factor of 3
WHT has only additionsvery simple transform
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
Best Best DFT Trees, size DFT Trees, size 210 = 1024= 1024
scalar
C vect
SIMD
Pentium 4float
Pentium 4double
Pentium IIIfloat
AthlonXPfloat
10
6
4
4
10
87
5
10
8
6
4
2
2
2
2 2
2 2
2 2
2
21
2
2 3
10
8
5
2
2
2 3
10
97
5
12
22 3
10
6
2
4
2 2
2 2
4
10
4
2
6
2 4
2 2
2
10
5
2
5
2 3 3
10
6
3
4
2 2 3
10
8
5
2
2
2 3
10
6
4
4
2 2
2 2
2
10
5
2
5
2 3 3
trees platform/datatype dependent
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
CrosstimingCrosstiming of best trees on Pentium 4of best trees on Pentium 4S
low
dow
n f
acto
r w
.r.t
. bes
t
DFT 2n single precision, runtime of best found of other platforms
n
software adaptation is necessary
1.00
1.50
2.00
2.50
3.00
3.50
4.00
4.50
5.00
4 5 6 7 8 9 10 11 12 13
Pentium 4 SSEPentium 4 SSE2AthlonXP SSEPentiumIII SSEPentium 4 float
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
OrganizationOrganization
Mathematical Framework
Formula Generator
SPL and SPL Compiler
Search Engine
SPIRAL system
Conclusions
Transforms, Rules, and Formulas
Transform → Algorithm
Algorithm → Implementation
How to find the best implementation
Everything taken together
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPIRAL SPIRAL SystemSystem
DSP transformspecifies
user
goes for a coffee
Formula Generator
SPL Compiler Sea
rch
En
gin
e
runtime (performance)
controlsimplementation options
controlsalgorithm generation
fast algorithmas SPL formula
C/Fortran/SIMDcodeS P
I R
A L
(or an
espresso
for sm
all transfo
rms)
platform-adaptedimplementation
comes back
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPIRALSPIRAL SystemSystem
Available for download (v3.1): www.ece.cmu.edu/~spiral
Easy installation (Unix: configure/make; Windows: install shield)
Unix/Linux and Windows 98/ME/NT/2000/XP
Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV,
MDCT, Filters, Wavelets, Toeplitz, Circulants
Extensible
New version (4.0) in preparation
~ 30 Publications in the areas of Signal Processing, High
Performance Computing, Compilers, Machine Learning,
Mathematics
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
SPIRALSPIRAL System: SummarySystem: Summary
Available for download: www.ece.cmu.edu/~spiral
Easy installation (Unix: configure/make; Windows: install shield)
Unix/Linux and Windows 98/ME/NT/2000/XP
Current transforms: DFT, DHT, WHT, DCT/DST type I – IV,
MDCT, Filters, Wavelets, Toeplitz, Circulants
Extensible
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
OrganizationOrganization
Mathematical Framework
Formula Generator
SPL and SPL Compiler
Search Engine
SPIRAL system
Conclusions
Transforms, Rules, and Formulas
Transform → Algorithm
Algorithm → Implementation
How to find the best implementation
Everything taken together
Car
negi
e M
ello
n
12/5/2002 IBM-Thomas T. J. Watson Res. Center
ConclusionsConclusions
Mathematical computer representation of algorithmsAutomatic translation of algorithms into code
SPIRAL closes the gap between math domain (algorithms) andimplementation domain (programs)
High level: Mathematical manipulation of algorithmsLow level: Coding degrees of freedom
SPIRAL does automatic optimization by intelligent search/learning in the space of alternatives
a new paradigm of software development