Performance Analysis of Divide and Conquer Algorithms for the WHT


Jeremy Johnson

Mihai Furis, Pawel Hitczenko, Hung-Jen Huang

Dept. of Computer Science

Drexel University

www.spiral.net


LACSI 2006 – Automatic Tuning of Libraries and Applications

Motivation

• On modern machines operation count is not always the most important performance metric.

• Effective utilization of the memory hierarchy, pipelining, and Instruction Level Parallelism is important, and it is not easy to determine such utilization from source code.

• Automatic Performance Tuning and Architecture Adaptation
  – Generate and Test
  – FFT, Matrix Multiplication, …

• Explain performance distribution

Outline

• Space of WHT Algorithms

• WHT Package and Performance Distribution

• Performance Model
  – Instruction Count
  – Cache

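Walsh-Hadamard Transform

• $y = \mathrm{WHT}_N \, x$, $N = 2^n$

$$\mathrm{WHT}_N = \bigotimes_{i=1}^{n} \mathrm{WHT}_2 = \underbrace{\mathrm{WHT}_2 \otimes \cdots \otimes \mathrm{WHT}_2}_{n}$$

$$\mathrm{WHT}_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

$$\mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$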
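The definition above can be checked with a few lines of Python. This is a minimal sketch, not code from the WHT package: the helper names `kron` and `wht_matrix` are made up for illustration, and matrices are plain lists of rows.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


WHT2 = [[1, 1], [1, -1]]


def wht_matrix(n):
    """Return the 2^n x 2^n WHT matrix as the n-fold Kronecker power of WHT2."""
    m = [[1]]
    for _ in range(n):
        m = kron(m, WHT2)
    return m


# Print WHT_4 row by row.
for row in wht_matrix(2):
    print(row)
```

Building the dense matrix costs $N^2$ storage, so it is only useful for checking small cases against a fast algorithm.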

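Factoring the WHT Matrix

• $AC \otimes BD = (A \otimes B)(C \otimes D)$

• $A \otimes B = (A \otimes I)(I \otimes B)$

• $A \otimes (B \otimes C) = (A \otimes B) \otimes C$

• $I_m \otimes I_n = I_{mn}$

$$\mathrm{WHT}_4 = \mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = (\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2)$$

$$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}$$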
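The factorization of $\mathrm{WHT}_4$ into two sparse factors can be verified numerically. The sketch below is illustrative only; `kron`, `matmul`, and `identity` are hypothetical helper names, not part of any package mentioned in the slides.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(m):
    """m x m identity matrix."""
    return [[1 if i == j else 0 for j in range(m)] for i in range(m)]


WHT2 = [[1, 1], [1, -1]]

left = kron(WHT2, identity(2))    # WHT_2 (x) I_2
right = kron(identity(2), WHT2)   # I_2 (x) WHT_2
product = matmul(left, right)
wht4 = kron(WHT2, WHT2)           # WHT_4 by definition

print(product == wht4)
```

Each factor has only two nonzeros per row, which is where the $N \log_2 N$ operation count of the fast algorithms comes from.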

Recursive Algorithm

$$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)\,\bigl(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2))\bigr)$$

[Figure: the $8 \times 8$ WHT matrix written as the product of its sparse factor matrices]

Iterative Algorithm

$$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_2 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2)$$

[Figure: the $8 \times 8$ WHT matrix written as the product of its three sparse factor matrices]

WHT Algorithms

• Recursive

$$\mathrm{WHT}_N = (\mathrm{WHT}_2 \otimes I_{N/2})(I_2 \otimes \mathrm{WHT}_{N/2})$$

• Iterative

$$\mathrm{WHT}_N = \prod_{i=1}^{n} (I_{2^{i-1}} \otimes \mathrm{WHT}_2 \otimes I_{2^{n-i}})$$

• General

$$\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}}), \quad \text{where } n = n_1 + \cdots + n_t$$
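The recursive and iterative formulas translate directly into two short routines. The following is a minimal sketch on Python lists (function names are illustrative, and the input length is assumed to be a power of two); it is not the WHT package's implementation.

```python
def wht_iterative(x):
    """Apply the n factors I (x) WHT_2 (x) I one after another (radix-2 passes)."""
    y = list(x)
    n = len(y)
    stride = 1
    while stride < n:
        for base in range(0, n, 2 * stride):
            for k in range(base, base + stride):
                a, b = y[k], y[k + stride]
                y[k], y[k + stride] = a + b, a - b
        stride *= 2
    return y


def wht_recursive(x):
    """WHT_N = (WHT_2 (x) I_{N/2}) (I_2 (x) WHT_{N/2})."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # I_2 (x) WHT_{N/2}: transform each half independently.
    top = wht_recursive(x[:half])
    bottom = wht_recursive(x[half:])
    # WHT_2 (x) I_{N/2}: butterfly between the two halves.
    return ([a + b for a, b in zip(top, bottom)] +
            [a - b for a, b in zip(top, bottom)])


print(wht_iterative([1, 2, 3, 4]))
print(wht_recursive([1, 2, 3, 4]))
```

Both routines perform the same additions and subtractions; they differ only in the order in which the data are visited, which is the property the rest of the talk studies.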

WHT Implementation

– $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$

$$\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} (I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}})$$

– $x = \mathrm{WHT}_N \cdot x$, $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$

• Implementation (nested loop)

    R = N; S = 1;
    for i = t, …, 1
        R = R / N_i;
        for j = 0, …, R-1
            for k = 0, …, S-1
                x^{N_i}_{j N_i S + k, S} = WHT_{N_i} * x^{N_i}_{j N_i S + k, S};
        S = S * N_i;
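The nested loop above can be sketched in Python as follows. The representation is an assumption made for illustration: a partition tree is an `int` n for a leaf of size $2^n$, or a list of subtrees for a split, and the base case is a simple in-place radix-2 loop rather than the unrolled code the package generates.

```python
def size(tree):
    """Exponent n of a partition tree: an int leaf, or a list of subtrees."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)


def apply_small(x, n, base, stride):
    """Base case: in-place WHT of size 2^n on x[base], x[base+stride], ..."""
    m = 1 << n
    h = 1
    while h < m:
        for blk in range(0, m, 2 * h):
            for k in range(blk, blk + h):
                i = base + k * stride
                j = base + (k + h) * stride
                x[i], x[j] = x[i] + x[j], x[i] - x[j]
        h *= 2


def apply_wht(x, tree, base=0, stride=1):
    """Apply the WHT algorithm described by `tree` to x in place."""
    if isinstance(tree, int):
        apply_small(x, tree, base, stride)
        return
    r = 1 << size(tree)              # R = N
    s = 1                            # S = 1
    for child in reversed(tree):     # i = t, ..., 1
        ni = 1 << size(child)        # N_i
        r //= ni
        for j in range(r):
            for k in range(s):
                apply_wht(x, child, base + (j * ni * s + k) * stride, s * stride)
        s *= ni


x = [1, 2, 3, 4]
apply_wht(x, [1, 1])
print(x)
```

Every tree of the same total size computes the same transform; for example `[1, 1, 1, 1]`, `[1, [1, [1, 1]]]`, and `[[1, 1], [1, 1]]` all give the size-16 WHT.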

Partition Trees

• Right Recursive: 4 → (1, 3), 3 → (1, 2), 2 → (1, 1)

• Iterative: 4 → (1, 1, 1, 1)

• Left Recursive: 4 → (3, 1), 3 → (2, 1), 2 → (1, 1)

• Balanced: 4 → (2, 2), each 2 → (1, 1)

• General example (levels, top to bottom): 9; 3 4 2; 1 2 1; 1 1

Number of Algorithms

$$T_1 = 1, \qquad T_n = 1 + \sum_{\substack{n_1 + \cdots + n_t = n \\ t > 1}} T_{n_1} \cdots T_{n_t}$$

$$T(z) = \sum_{n \ge 0} T_n z^n = \frac{z}{1-z} + \frac{T(z)^2}{1 - T(z)} = \frac{1 - \sqrt{1 - 8z + 8z^2}}{4(1-z)} = z + 2z^2 + 6z^3 + 24z^4 + \cdots$$

$$T_n = \Theta\!\left(\alpha^n / n^{3/2}\right), \qquad \alpha = 2(2 + \sqrt{2}) = 4 + \sqrt{8} \approx 6.8$$
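The first few values of $T_n$ can be computed from the recurrence. The sketch below assumes the reading in which every size n may be a leaf (the leading 1) or any split into at least two parts; the helper array `C` (ordered sequences of trees with a given total size) is introduced only to avoid enumerating compositions explicitly.

```python
def count_wht_algorithms(n_max):
    """Return [T_1, ..., T_{n_max}], the number of partition trees per size."""
    T = [0] * (n_max + 1)   # T[n]: number of trees of size n
    C = [0] * (n_max + 1)   # C[m]: number of sequences of >= 1 trees of total size m
    C[0] = 1                # empty remainder after the last part
    for n in range(1, n_max + 1):
        # One leaf, plus every split whose first part k is smaller than n.
        T[n] = 1 + sum(T[k] * C[n - k] for k in range(1, n))
        C[n] = sum(T[k] * C[n - k] for k in range(1, n + 1))
    return T[1:]


print(count_wht_algorithms(10))
```

The first values 1, 2, 6, 24 agree with the leading coefficients of the generating function.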

Outline

• WHT Algorithms

• WHT Package and Performance Distribution

• Performance Model
  – Instruction Count
  – Cache

WHT Package

Püschel & Johnson (ICASSP ’00)

• Allows easy implementation of any of the possible WHT algorithms

• Partition tree representation

W(n) = small[n] | split[W(n1), …, W(nt)]

• Tools
  – Measure runtime of any algorithm
  – Measure hardware events (coupled with PCL/PAPI)
  – Search for good implementation
    • Dynamic programming
    • Evolutionary algorithm
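The partition-tree grammar is small enough to parse by hand. The sketch below is a hypothetical reader and printer for strings of the form shown in the grammar; it is not the package's own parser, and the nested-list representation (an `int` for `small[n]`, a list for `split[...]`) is an assumption made for illustration.

```python
def parse_tree(text):
    """Parse 'small[n]' or 'split[W,...,W]' into an int leaf or a list of subtrees."""
    text = text.replace(" ", "")
    pos = 0

    def parse():
        nonlocal pos
        if text.startswith("small[", pos):
            end = text.index("]", pos)
            leaf = int(text[pos + len("small["):end])
            pos = end + 1
            return leaf
        if text.startswith("split[", pos):
            pos += len("split[")
            children = [parse()]
            while text.startswith(",", pos):
                pos += 1
                children.append(parse())
            if not text.startswith("]", pos):
                raise ValueError(f"expected ']' at position {pos}")
            pos += 1
            return children
        raise ValueError(f"expected small[ or split[ at position {pos}")

    tree = parse()
    if pos != len(text):
        raise ValueError("trailing characters after tree")
    return tree


def format_tree(tree):
    """Inverse of parse_tree: render a nested tree back into grammar form."""
    if isinstance(tree, int):
        return f"small[{tree}]"
    return "split[" + ",".join(format_tree(c) for c in tree) + "]"


print(parse_tree("split[small[1], split[small[1], small[2]]]"))
```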

Algorithm Comparison

[Figure: four plots of ratios against WHT size ($2^n$)
  – Recursive/Iterative Runtime: ratio r1/i1, n = 1 to 22
  – Rec & Bal/It Instruction Count: ratios rr1/i1, lr1/i1, bal1/i1, n = 1 to 19
  – Rec & It/Best Runtime: ratios r1/b, r3/b, i1/b, i3/b, b/b, n = 1 to 22
  – Small/It Runtime: ratios I_1/rt, r_1/rt, n = 1 to 8]

Cache Miss Data

[Figure: four plots against size (n = 1 to 22)
  – Recursive vs. Best: ratios for Instructions, L1 Data Cache Misses, L2 Cache Misses
  – Recursive vs. Iterative: Ratio Recursive/Iterative for Instructions, L1 Data Cache Misses, L2 Data Cache Misses
  – Recursive vs. Iterative Normalized to Best: Ratio Alg Time/Best Time for Recursive Time and Iterative Time
  – Iterative vs. Best: Ratio Iterative/Best for Instructions, L1 Data Cache Misses, L2 Cache Misses]

Histogram (n = 16, 10,000 samples)

• Wide range in performance despite equal number of arithmetic operations ($n 2^n$ flops)

• Pentium III vs. UltraSPARC II

Outline

• WHT Algorithms

• WHT Package and Performance Distribution

• Performance Model
  – Instruction Count
  – Cache


Instruction Count Model

$$\mathrm{IC}(n) = \alpha\, A(n) + \sum_{i=1}^{3} \alpha_i L_i(n) + \sum_{l=1}^{3} \beta_l A_l(n)$$

– $A(n)$ = number of calls to the WHT procedure
– $\alpha$ = number of instructions outside loops
– $A_l(n)$ = number of calls to the base case of size l
– $\beta_l$ = number of instructions in the base case of size l
– $L_i(n)$ = number of iterations of the outer (i=1), middle (i=2), and inner (i=3) loop
– $\alpha_i$ = number of instructions in the outer (i=1), middle (i=2), and inner (i=3) loop body

Small[1]

    .file "s_1.c"
    .version "01.01"
gcc2_compiled.:
    .text
    .align 4
    .globl apply_small1
    .type apply_small1,@function
apply_small1:
    movl 8(%esp),%edx      // load stride S into EDX
    movl 12(%esp),%eax     // load base address of array x into EAX
    fldl (%eax)            // st(0) = x[0]
    fldl (%eax,%edx,8)     // st(0) = x[S], st(1) = x[0]
    fld %st(1)             // st(0) = x[0], st(1) = x[S], st(2) = x[0]
    fadd %st(1),%st        // st(0) = x[0]+x[S]
    fxch %st(2)            // st(0) = x[0], st(2) = x[0]+x[S]
    fsubp %st,%st(1)       // st(1) = x[0]-x[S] (AT&T reversed-operand form), then pop: st(0) = x[0]-x[S]
    fxch %st(1)            // st(0) = x[0]+x[S], st(1) = x[0]-x[S]
    fstpl (%eax)           // store x[0] = x[0]+x[S]
    fstpl (%eax,%edx,8)    // store x[S] = x[0]-x[S]
    ret

Recurrences

For a split $n = n_1 + \cdots + n_t$:

$$A(n) = 1 + \sum_{i=1}^{t} 2^{n-n_i} A(n_i), \qquad A(n) = 0 \text{ if } n \text{ is a leaf}$$

$$L_1(n) = t + \sum_{i=1}^{t} 2^{n-n_i} L_1(n_i)$$

$$L_2(n) = \sum_{i=1}^{t} \left( 2^{n_1 + \cdots + n_{i-1}} + 2^{n-n_i} L_2(n_i) \right)$$

$$L_3(n) = \sum_{i=1}^{t} \left( 2^{n-n_i} + 2^{n-n_i} L_3(n_i) \right)$$

$$L_i(n) = 0 \text{ if } n \text{ is a leaf}$$

$$A_l(n) = \lambda_l\, 2^{n-l}, \quad \text{where } \lambda_l = \text{number of leaves of size } l$$

Histogram using Instruction Model (P3)

$\beta_1 = 12$, $\beta_2 = 34$, and $\beta_3 = 106$; $\alpha = 27$; $\alpha_1 = 18$, $\alpha_2 = 18$, and $\alpha_3 = 20$
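The instruction count model can be evaluated for any partition tree by following the recurrences. The sketch below makes several assumptions that are not spelled out on the slides: trees are nested Python lists with `int` leaves, the coefficients on this slide are read as $\beta_1 = 12$, $\beta_2 = 34$, $\beta_3 = 106$, $\alpha = 27$, $\alpha_1 = \alpha_2 = 18$, $\alpha_3 = 20$, and the middle-loop count for factor i is taken as $2^{n_1 + \cdots + n_{i-1}}$, as implied by the nested-loop implementation.

```python
ALPHA = 27                         # instructions outside the loops (assumed reading)
ALPHA_LOOP = (18, 18, 20)          # outer, middle, inner loop bodies (assumed reading)
BETA = {1: 12, 2: 34, 3: 106}      # base cases of size 1, 2, 3 (assumed reading)


def size(tree):
    """Exponent n of a partition tree: an int leaf, or a list of subtrees."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)


def counts(tree):
    """Return (A, (L1, L2, L3), {l: A_l}) for a partition tree."""
    if isinstance(tree, int):
        return 0, (0, 0, 0), {tree: 1}
    n = size(tree)
    a = 1
    l1, l2, l3 = len(tree), 0, 0
    leaf_calls = {}
    prefix = 0                     # n_1 + ... + n_{i-1}
    for child in tree:
        ni = size(child)
        mult = 1 << (n - ni)       # child i is called 2^(n - n_i) times
        ca, (c1, c2, c3), cl = counts(child)
        a += mult * ca
        l1 += mult * c1
        l2 += (1 << prefix) + mult * c2
        l3 += mult + mult * c3
        for l, c in cl.items():
            leaf_calls[l] = leaf_calls.get(l, 0) + mult * c
        prefix += ni
    return a, (l1, l2, l3), leaf_calls


def instruction_count(tree):
    """IC(n) = alpha*A + sum_i alpha_i*L_i + sum_l beta_l*A_l."""
    a, loops, leaf_calls = counts(tree)
    return (ALPHA * a
            + sum(al * li for al, li in zip(ALPHA_LOOP, loops))
            + sum(BETA[l] * c for l, c in leaf_calls.items()))


print(instruction_count([1, 1]))
```

For the tree `[1, 1]` the counts are $A = 1$, $(L_1, L_2, L_3) = (2, 3, 4)$, and $A_1 = 4$, so the model gives $27 + 36 + 54 + 80 + 48 = 245$ instructions under these assumed coefficients.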

Cache Model

• Different WHT algorithms access data in different patterns

• All algorithms with the same set of leaf nodes have the same number of memory accesses

• Count misses for accesses to data array
  – Parameterized by cache size, associativity, and block size
  – Simulate using program traces (restrict to data vector accesses)
  – Analytic formula?

Blocked Access

Partition tree: 4 → (1, 3), 3 → (1, 2)

$$\mathrm{WHT}_{16} = (\mathrm{WHT}_2 \otimes I_8)(I_2 \otimes \mathrm{WHT}_8) = (\mathrm{WHT}_2 \otimes I_8)\bigl(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_4))\bigr)$$

[Figure: access pattern over data indices 0 to 15]

Interleaved Access

Partition tree: 4 → (3, 1), 3 → (2, 1)

$$\mathrm{WHT}_{16} = (\mathrm{WHT}_8 \otimes I_2)(I_8 \otimes \mathrm{WHT}_2) = \bigl(((\mathrm{WHT}_4 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2)) \otimes I_2\bigr)(I_8 \otimes \mathrm{WHT}_2)$$

[Figure: access pattern over data indices 0 to 15]

Cache Simulator

Partition trees compared: 4 → (1, 3), 3 → (1, 2) and 4 → (3, 1), 3 → (2, 1)

C = cache size, A = associativity, B = block size; each pair gives the miss counts for the two algorithms.

• 144 memory accesses
  – C = 4, A = 1, B = 1 (80, 112)
  – C = 4, A = 4, B = 1 (48, 48)
  – C = 4, A = 1, B = 2 (72, 88)

• Iterative vs. Recursive (192 memory accesses)
  – C = 4, A = 1, B = 1 (128, 112)
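A trace-driven simulation of this kind can be sketched in a few lines. The code below is not the authors' simulator. It assumes a direct-mapped cache with block size 1, models only `small[1]` leaves, and assumes each butterfly touches memory in the order read a, read b, read a, read b, write a, write b (three accesses per element, chosen because it is consistent with the 192 accesses quoted for n = 4). Trees are nested lists with `int` leaves.

```python
def size(tree):
    """Exponent n of a partition tree: an int leaf, or a list of subtrees."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)


def trace(tree, base=0, stride=1, out=None):
    """Return, in order, the data-array indices touched by the algorithm."""
    if out is None:
        out = []
    if isinstance(tree, int):
        if tree != 1:
            raise ValueError("this sketch only models small[1] leaves")
        a, b = base, base + stride
        # Assumed access order: read a, b for the sum, read a, b for the
        # difference, then write a and b (6 accesses per butterfly).
        out.extend([a, b, a, b, a, b])
        return out
    r, s = 1 << size(tree), 1
    for child in reversed(tree):           # i = t, ..., 1
        ni = 1 << size(child)
        r //= ni
        for j in range(r):
            for k in range(s):
                trace(child, base + (j * ni * s + k) * stride, s * stride, out)
        s *= ni
    return out


def misses_direct_mapped(addresses, cache_size):
    """Count misses in a direct-mapped cache with one element per line."""
    lines = [None] * cache_size
    misses = 0
    for addr in addresses:
        slot = addr % cache_size
        if lines[slot] != addr:
            lines[slot] = addr
            misses += 1
    return misses


iterative = [1, 1, 1, 1]
recursive = [1, [1, [1, 1]]]
for name, tree in (("iterative", iterative), ("recursive", recursive)):
    t = trace(tree)
    print(name, len(t), misses_direct_mapped(t, 4))
```

Under these assumptions both traces contain 192 accesses, and the simulated miss counts for C = 4 are 128 (iterative) and 112 (recursive), the same pair quoted above.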

Cache Misses as a Function of Cache Size

[Figure: four panels, for cache sizes $C = 2^2$, $C = 2^3$, $C = 2^4$, and $C = 2^5$]

Formula for Cache Misses

• $M(L, W_N, R)$ = number of misses for $I_L \otimes \mathrm{WHT}_N \otimes I_R$

[Equation: piecewise recurrence for $M(L, W_N, R)$, with a case for $W_N$ a leaf, a case conditioned on the cache size $C$, and a sum over $i = 1, \ldots, t$ for a split $W_N$ with parts $n_1, \ldots, n_t$]

Closed Form

• $M(L, W_N, R)$ = number of misses for $I_L \otimes \mathrm{WHT}_N \otimes I_R$

• $M(0, W_n, 0) = 3(n-c)\,2^n + k\,2^n$

• $C = 2^c$, k = number of parts in the rightmost c positions

• Example: c = 3, n = 4
  – Iterative, 4 → (1, 1, 1, 1): k = 3
  – Balanced, 4 → (2, 2), each 2 → (1, 1): k = 2
  – Right Recursive, 4 → (1, 3), 3 → (1, 2), 2 → (1, 1): k = 1
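As a worked instance of the closed form with the slide's parameters c = 3 and n = 4 (plain arithmetic, no further assumptions), the three values of k give:

```latex
\begin{align*}
M(0, W_4, 0) &= 3(n-c)\,2^n + k\,2^n = 3(4-3)\cdot 16 + 16k = 48 + 16k \\
k = 3 &: \quad 48 + 48 = 96 \\
k = 2 &: \quad 48 + 32 = 80 \\
k = 1 &: \quad 48 + 16 = 64
\end{align*}
```

The first term is the same for every algorithm of this size, so the differences in misses come entirely from k.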

Summary of Results and Future Work

• Instruction Count Model
  – min, max, expected value, variance, limiting distribution

• Cache Model
  – Direct mapped (closed form solution, distribution, expected value, and variance)

• Combine models

• Extend cache formula to include associativity (A) and block size (B)

• Use as heuristic to limit search and predict performance

Sponsors

Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting, DESA: Intelligent HW-SW Compilers for Signal Processing Applications, and NSF ITR/NGS #0325687: Intelligent HW/SW Compilers for DSP.