Performance Analysis of Divide and Conquer Algorithms for the WHT
Jeremy Johnson
Mihai Furis, Pawel Hitczenko, Hung-Jen Huang
Dept. of Computer Science
Drexel University
www.spiral.net
LACSI 2006 – Automatic Tuning of Libraries and Applications
Motivation
• On modern machines operation count is not always the most important performance metric.
• Effective utilization of the memory hierarchy, pipelining, and Instruction Level Parallelism is important, and it is not easy to determine such utilization from source code.
• Automatic Performance Tuning and Architecture Adaptation
– Generate and Test
– FFT, Matrix Multiplication, …
• Explain performance distribution
Outline
• Space of WHT Algorithms
• WHT Package and Performance Distribution
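• Performance Model
– Instruction Count
– Cache
Walsh-Hadamard Transform
• $y = \mathrm{WHT}_N\, x$, $N = 2^n$
\[ \mathrm{WHT}_N = \bigotimes_{i=1}^{n} \mathrm{WHT}_2 = \underbrace{\mathrm{WHT}_2 \otimes \cdots \otimes \mathrm{WHT}_2}_{n} \]
\[ \mathrm{WHT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \]
\[ \mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \]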
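Factoring the WHT Matrix
• $AC \otimes BD = (A \otimes B)(C \otimes D)$
• $A \otimes B = (A \otimes I)(I \otimes B)$
• $A \otimes (B \otimes C) = (A \otimes B) \otimes C$
• $I_m \otimes I_n = I_{mn}$
\[ \mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = (\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2) \]
\[ \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix} \]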
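The factorization of WHT_4 can be checked numerically. The following is a minimal sketch in plain Python; the helper names `kron`, `matmul`, and `identity` are illustrative and are not part of the WHT package.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Ordinary matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

WHT2 = [[1, 1], [1, -1]]
WHT4 = kron(WHT2, WHT2)

# WHT_4 = (WHT_2 (x) I_2)(I_2 (x) WHT_2)
factored = matmul(kron(WHT2, identity(2)), kron(identity(2), WHT2))
print(factored == WHT4)
```

Both factors are sparse, which is what makes the factored form cheaper to apply than the dense matrix.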
Recursive Algorithm
\[ \mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)\,\bigl(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2))\bigr) \]
[Figure: WHT_8 written out as the product of the corresponding sparse 8 × 8 factor matrices]
Iterative Algorithm
\[ \mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_2 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2) \]
[Figure: WHT_8 written out as the product of the three corresponding sparse 8 × 8 factor matrices]
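WHT Algorithms
• Recursive
\[ \mathrm{WHT}_N = (\mathrm{WHT}_2 \otimes I_{N/2})(I_2 \otimes \mathrm{WHT}_{N/2}) \]
• Iterative
\[ \mathrm{WHT}_N = \prod_{i=1}^{n} (I_{2^{i-1}} \otimes \mathrm{WHT}_2 \otimes I_{2^{n-i}}) \]
• General
\[ \mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}}), \quad \text{where } n = n_1 + \cdots + n_t \]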
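As an illustration of the recursive and iterative rules, here is a minimal Python sketch. The function names are hypothetical, and the code is not taken from the WHT package; it only shows that the two factorizations compute the same transform.

```python
def wht_recursive(x):
    """WHT_N = (WHT_2 (x) I_{N/2})(I_2 (x) WHT_{N/2}); returns a new list."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    top = wht_recursive(x[:half])       # I_2 (x) WHT_{N/2}: transform each half
    bottom = wht_recursive(x[half:])
    # WHT_2 (x) I_{N/2}: butterflies between the two halves
    return ([a + b for a, b in zip(top, bottom)]
            + [a - b for a, b in zip(top, bottom)])

def wht_iterative(x):
    """Product of the n factors I (x) WHT_2 (x) I, applied with strides 1, 2, 4, ..."""
    y = list(x)
    n = len(y)
    stride = 1
    while stride < n:
        for start in range(0, n, 2 * stride):
            for k in range(start, start + stride):
                a, b = y[k], y[k + stride]
                y[k], y[k + stride] = a + b, a - b
        stride *= 2
    return y

print(wht_recursive([1, 2, 3, 4]))   # [10, -2, -4, 0]
print(wht_iterative([1, 2, 3, 4]))   # [10, -2, -4, 0]
```

Both versions perform the same additions and subtractions; they differ only in the order in which the data is visited, which is the point of the performance comparison in the talk.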
WHT Implementation
– $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
– $x = \mathrm{WHT}_N \cdot x$, $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
\[ \mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}}) \]
• Implementation (nested loop)
R = N; S = 1;
for i = t, …, 1
    R = R / N_i
    for j = 0, …, R-1
        for k = 0, …, S-1
            $x^{N_i}_{jN_iS+k,\,S} = \mathrm{WHT}_{N_i} \cdot x^{N_i}_{jN_iS+k,\,S}$
    S = S * N_i;
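The nested loop can be sketched in Python as follows. This is an illustrative reading of the slide, not the package's C code: a partition tree is assumed to be either an integer n (a leaf, small[n]) or a tuple of subtrees (a split), and the names `size`, `apply_small`, and `apply_wht` are hypothetical.

```python
def size(tree):
    """Exponent n of a tree: an int leaf, or the sum over a tuple of subtrees."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def apply_small(x, n, b, s):
    """Base case: in-place WHT of size 2^n on x[b], x[b+s], x[b+2s], ..."""
    m = 1 << n
    h = 1
    while h < m:
        for start in range(0, m, 2 * h):
            for k in range(start, start + h):
                i0, i1 = b + k * s, b + (k + h) * s
                x[i0], x[i1] = x[i0] + x[i1], x[i0] - x[i1]
        h *= 2

def apply_wht(tree, x, b=0, s=1):
    """In-place WHT on the strided vector x_{b,s}, following the partition tree."""
    if isinstance(tree, int):
        apply_small(x, tree, b, s)
        return
    R, S = 1 << size(tree), 1
    for child in reversed(tree):            # i = t, ..., 1
        Ni = 1 << size(child)
        R //= Ni
        for j in range(R):
            for k in range(S):
                # sub-vector with base j*Ni*S + k and stride S, relative to (b, s)
                apply_wht(child, x, b + (j * Ni * S + k) * s, S * s)
        S *= Ni

x = list(range(16))
apply_wht((1, (1, (1, 1))), x)              # right recursive tree for n = 4
print(x[:4])
```

Every tree with the same total size produces the same output vector; only the order of the memory accesses changes.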
Partition Trees
• Right Recursive: 4 → (1, 3), 3 → (1, 2), 2 → (1, 1)
• Iterative: 4 → (1, 1, 1, 1)
• Left Recursive: 4 → (3, 1), 3 → (2, 1), 2 → (1, 1)
• Balanced: 4 → (2, 2), each 2 → (1, 1)
• General example: root 9 with children (3, 4, 2); lower levels 1 2 1 and 1 1
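Number of Algorithms
\[ T_1 = 1, \qquad T_n = 1 + \sum_{\substack{n_1 + \cdots + n_t = n \\ t > 1}} T_{n_1} \cdots T_{n_t}, \quad n > 1 \]
\[ T(z) = \sum_{n \ge 0} T_n z^n = z + 2z^2 + 6z^3 + 24z^4 + \cdots \]
\[ T(z) = \frac{z}{1-z} + \frac{T(z)^2}{1 - T(z)} \]
\[ T(z) = \frac{1 - \sqrt{1 - 8z + 8z^2}}{4(1 - z)} \]
\[ T_n = \Theta(\alpha^n / n^{3/2}), \quad \alpha \approx 6.8 \]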
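The count can be computed directly. The sketch below assumes the recurrence T_1 = 1, T_n = 1 + (sum over compositions n = n_1 + … + n_t with t > 1 of the product T_{n_1} ⋯ T_{n_t}), which is the reading consistent with the coefficients 1, 2, 6, 24 on the slide; the function name is hypothetical.

```python
def count_wht_algorithms(n_max):
    """Return [T_1, ..., T_{n_max}] for the assumed recurrence."""
    T = [0] * (n_max + 1)   # T[n]: number of partition trees of size n
    C = [0] * (n_max + 1)   # C[m]: sum over all compositions of m (t >= 1) of the product of T values
    for n in range(1, n_max + 1):
        # Compositions with t >= 2 parts: first part k, remainder composed freely.
        split = sum(T[k] * C[n - k] for k in range(1, n))
        T[n] = 1 + split    # the 1 counts the single leaf small[n]
        C[n] = T[n] + split
    return T[1:]

counts = count_wht_algorithms(10)
print(counts[:4])                     # [1, 2, 6, 24]
print(counts[-1] / counts[-2])        # successive ratio, slowly approaching the growth rate alpha
```

The ratio of successive terms converges slowly because of the n^{3/2} factor in the asymptotic form.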
Outline
• WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
– Instruction Count
– Cache
WHT Package
Püschel & Johnson (ICASSP ’00)
• Allows easy implementation of any of the possible WHT algorithms
• Partition tree representation
W(n)=small[n] | split[W(n1),…W(nt)]
• Tools
– Measure runtime of any algorithm
– Measure hardware events (coupled with PCL/PAPI)
– Search for good implementation
• Dynamic programming
• Evolutionary algorithm
Algorithm Comparison
[Figure: four plots against WHT size (2^n). "Recursive/Iterative Runtime": ratio r1/i1, n = 1 to 22. "Rec & Bal/It Instruction Count": rr1/i1, lr1/i1, bal1/i1, n = 1 to 19. "Rec&It/Best Runtime": ratios r1/b, r3/b, i1/b, i3/b, b/b, n = 1 to 22. "Small/It Runtime": ratios I_1/rt and r_1/rt, n = 1 to 8.]
Cache Miss Data
[Figure: four plots against size (1 to 22). "Recursive vs. Best", "Recursive vs. Iterative", and "Iterative vs. Best" each plot ratios of Instructions, L1 Data Cache Misses, and L2 Cache Misses. "Recursive vs. Iterative Normalized to Best" plots Recursive Time and Iterative Time as the ratio Alg Time/Best Time.]
Histogram (n = 16, 10,000 samples)
• Wide range in performance despite equal number of arithmetic operations ($n 2^n$ flops)
• Pentium III vs. UltraSPARC II
Outline
• WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
– Instruction Count
– Cache
WHT Implementation
– $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
– $x = \mathrm{WHT}_N \cdot x$, $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
\[ \mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}}) \]
• Implementation (nested loop)
R = N; S = 1;
for i = t, …, 1
    R = R / N_i
    for j = 0, …, R-1
        for k = 0, …, S-1
            $x^{N_i}_{jN_iS+k,\,S} = \mathrm{WHT}_{N_i} \cdot x^{N_i}_{jN_iS+k,\,S}$
    S = S * N_i;
Instruction Count Model
\[ \mathrm{IC}(n) = \alpha A(n) + \sum_{l=1}^{3} \alpha_l A_l(n) + \sum_{i=1}^{3} \beta_i L_i(n) \]
$A(n)$ = number of calls to the WHT procedure
$\alpha$ = number of instructions outside loops
$A_l(n)$ = number of calls to the base case of size $l$
$\alpha_l$ = number of instructions in the base case of size $l$
$L_i(n)$ = number of iterations of the outer (i=1), middle (i=2), and inner (i=3) loop
$\beta_i$ = number of instructions in the outer (i=1), middle (i=2), and inner (i=3) loop body
Small[1]
.file "s_1.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.globl apply_small1
.type apply_small1,@function
apply_small1:
movl 8(%esp),%edx //load stride S to EDX
movl 12(%esp),%eax //load x array's base address to EAX
fldl (%eax) // st(0)=R7=x[0]
fldl (%eax,%edx,8) //st(0)=R6=x[S]
fld %st(1) //st(0)=R5=x[0]
fadd %st(1),%st // R5=x[0]+x[S]
fxch %st(2) //st(0)=R5=x[0],st(2)=R7=x[0]+x[S]
fsubp %st,%st(1) //st(0)=R6=x[0]-x[S] (GNU as computes %st - %st(1) for this operand form)
fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[0]-x[S]
fstpl (%eax) //store x[0]=x[0]+x[S]
fstpl (%eax,%edx,8) //store x[S]=x[0]-x[S]
ret
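Recurrences
For a node of size $n$ split as $n = n_1 + \cdots + n_t$:
\[ A(n) = 1 + \sum_{i=1}^{t} 2^{n-n_i} A(n_i), \qquad A(n) = 0 \text{ if } n \text{ is a leaf} \]
\[ L_1(n) = t + \sum_{i=1}^{t} 2^{n-n_i} L_1(n_i) \]
\[ L_2(n) = \sum_{i=1}^{t} \left( 2^{n_1+\cdots+n_{i-1}} + 2^{n-n_i} L_2(n_i) \right) \]
\[ L_3(n) = \sum_{i=1}^{t} \left( 2^{n-n_i} + 2^{n-n_i} L_3(n_i) \right) \]
\[ L_i(n) = 0 \text{ if } n \text{ is a leaf} \]
\[ A_l(n) = \lambda_l\, 2^{n-l}, \quad \text{where } \lambda_l = \text{number of leaves of size } l \]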
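The recurrences can be evaluated directly on a partition tree. The sketch below assumes the reading A(n) = 1 + Σ 2^{n−n_i} A(n_i), L_1(n) = t + Σ 2^{n−n_i} L_1(n_i), L_2(n) = Σ (2^{n_1+…+n_{i−1}} + 2^{n−n_i} L_2(n_i)), and L_3(n) = Σ (2^{n−n_i} + 2^{n−n_i} L_3(n_i)), with all counts zero at a leaf. Trees are an int for a leaf or a tuple of subtrees; the function names and the way the constants are passed are illustrative assumptions.

```python
def size(tree):
    """Exponent n of a partition tree: an int leaf, or a tuple of subtrees."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def counts(tree):
    """Return (A, [L1, L2, L3], leaf_calls) for one call on this tree."""
    if isinstance(tree, int):
        return 0, [0, 0, 0], {tree: 1}
    n = size(tree)
    A, L, leaves = 1, [len(tree), 0, 0], {}
    prefix = 0                               # n_1 + ... + n_{i-1}
    for child in tree:
        ni = size(child)
        calls = 1 << (n - ni)                # child is invoked 2^(n - n_i) times
        a, l, lv = counts(child)
        A += calls * a
        L[0] += calls * l[0]
        L[1] += (1 << prefix) + calls * l[1]
        L[2] += calls + calls * l[2]
        for leaf_size, c in lv.items():
            leaves[leaf_size] = leaves.get(leaf_size, 0) + calls * c
        prefix += ni
    return A, L, leaves

def instruction_count(tree, alpha, alpha_l, beta):
    """IC = alpha*A + sum_l alpha_l*A_l + sum_i beta_i*L_i."""
    A, L, leaves = counts(tree)
    return (alpha * A
            + sum(alpha_l[l] * c for l, c in leaves.items())
            + sum(b * li for b, li in zip(beta, L)))

# Example constants (assumed values for illustration; machine-specific in practice).
print(instruction_count((1, 1, 1, 1), 27, {1: 12, 2: 34, 3: 106}, [18, 18, 20]))
```

Because the counts depend only on the tree, the model can rank candidate algorithms without running them.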
Histogram using Instruction Model (P3)
$\alpha_1 = 12$, $\alpha_2 = 34$, and $\alpha_3 = 106$; $\alpha = 27$; $\beta_1 = 18$, $\beta_2 = 18$, and $\beta_3 = 20$
Cache Model
• Different WHT algorithms access data in different patterns
• All algorithms with the same set of leaf nodes have the same number of memory accesses
• Count misses for accesses to data array
– Parameterized by cache size, associativity, and block size
– Simulate using program traces (restrict to data vector accesses)
– Analytic formula?
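A trace-driven miss counter parameterized by cache size C, associativity A, and block size B can be sketched as below. This is an illustrative simulator only: LRU replacement within a set is an assumption (the slides do not state the policy), the function name is hypothetical, and no attempt is made to reproduce the miss counts reported in the talk, which depend on the exact access trace of the package.

```python
from collections import OrderedDict

def count_misses(trace, C, A, B):
    """Count misses for a sequence of element addresses.

    C: cache capacity in elements, A: associativity (lines per set),
    B: block (line) size in elements. Assumes LRU replacement in each set.
    """
    num_sets = C // (A * B)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // B
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(lines) >= A:
                lines.popitem(last=False)   # evict least recently used block
            lines[block] = True
    return misses

# Two addresses that collide in a direct-mapped cache of 4 elements:
print(count_misses([0, 4, 0, 4], C=4, A=1, B=1))   # 4
print(count_misses([0, 4, 0, 4], C=4, A=4, B=1))   # 2
```

Feeding it the data-vector addresses touched by a given partition tree gives the kind of per-algorithm miss count the model is after.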
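Blocked Access
Partition tree: 4 → (1, 3), 3 → (1, 2)
\[ \mathrm{WHT}_{16} = (\mathrm{WHT}_2 \otimes I_8)(I_2 \otimes \mathrm{WHT}_8) = (\mathrm{WHT}_2 \otimes I_8)\bigl(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_4))\bigr) \]
[Figure: access pattern over data indices 0 to 15]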
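Interleaved Access
Partition tree: 4 → (3, 1), 3 → (2, 1)
\[ \mathrm{WHT}_{16} = (\mathrm{WHT}_8 \otimes I_2)(I_8 \otimes \mathrm{WHT}_2) = \bigl(((\mathrm{WHT}_4 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2)) \otimes I_2\bigr)(I_8 \otimes \mathrm{WHT}_2) \]
[Figure: access pattern over data indices 0 to 15]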
Cache Simulator
Partition trees: 4 → (1, 3), 3 → (1, 2) and 4 → (3, 1), 3 → (2, 1)
C = cache size, A = associativity, B = block size
• 144 memory accesses
– C = 4, A = 1, B = 1 (80, 112)
– C = 4, A = 4, B = 1 (48, 48)
– C = 4, A = 1, B = 2 (72, 88)
• Iterative vs. Recursive (192 memory accesses)
– C = 4, A = 1, B = 1 (128, 112)
Cache Misses as a Function of Cache Size
[Figure: four panels, one per cache size: $C = 2^2$, $C = 2^3$, $C = 2^4$, $C = 2^5$]
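Formula for Cache Misses
• $M(L, W_N, R)$ = number of misses for $I_L \otimes \mathrm{WHT}_N \otimes I_R$
[Equation: piecewise recurrence for $M(L, W_N, R)$ over the partition tree $W_N$ with children $W_{n_1}, \ldots, W_{n_t}$, with a separate case when $W_N$ is a leaf and a case conditioned on the cache size $C$; the individual terms are illegible in this transcript]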
Closed Form
• $M(L, W_N, R)$ = number of misses for $I_L \otimes \mathrm{WHT}_N \otimes I_R$
• $M(0, W_n, 0) = 3(n-c)\,2^n + k\,2^n$
• $C = 2^c$, $k$ = number of parts in the rightmost $c$ positions
• $c = 3$, $n = 4$
Iterative: 4 → (1, 1, 1, 1)
Right Recursive: 4 → (1, 3), 3 → (1, 2), 2 → (1, 1)
Balanced: 4 → (2, 2), each 2 → (1, 1)
Across these three trees k takes the values k = 3, k = 2, and k = 1.
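As a worked instance of the closed form, substituting the slide's parameters c = 3 and n = 4 gives the following miss counts for each value of k (plain arithmetic on the stated formula, nothing measured):

```latex
\begin{align*}
M(0, W_4, 0) &= 3(4-3)\,2^4 + k\,2^4 = 48 + 16k \\
k = 1 &: \quad M = 64 \\
k = 2 &: \quad M = 80 \\
k = 3 &: \quad M = 96
\end{align*}
```

With n fixed, the first term is the same for every algorithm, so differences in the predicted miss count come entirely from k.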
Summary of Results and Future Work
• Instruction Count Model
– min, max, expected value, variance, limiting distribution
• Cache Model
– Direct mapped (closed form solution, distribution, expected value, and variance)
• Combine models
• Extend cache formula to include A and B
• Use as heuristic to limit search and predict performance
Sponsors
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting, DESA: Intelligent HW-SW
Compilers for Signal Processing Applications, and NSF
ITR/NGS #0325687: Intelligent HW/SW Compilers for DSP.