The Study of Cache Oblivious Algorithms Prepared by Jia Guo.

The Study of Cache Oblivious Algorithms

Prepared by Jia Guo


CS598dhp

Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
    Matrix multiplication
    Matrix transposition
    FFT
Conclusion

Assumption

Only two levels of memory hierarchy:

An ideal cache
    Fully associative
    Optimal replacement strategy
    “Tall cache” (Z = Ω(L²))
A very large memory

An Ideal Cache Model

An ideal cache model (Z,L)

Z: Total words in the cache

L: Words in one cache line

Cache Complexity

An algorithm with input size n is measured by:
    Work complexity W(n)
    Cache complexity Q(n; Z, L): the number of cache misses it incurs

Cache Aware Algorithms

Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L).

Need to adjust parameters when running on different platforms.

Example:

A blocked matrix multiplication algorithm

s is a tuning parameter to make the algorithm run fast

[Figure: an n x n matrix A partitioned into s x s blocks, with block A11 labeled]
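The blocked scheme on this slide can be sketched as follows. This is a minimal illustration, assuming square matrices stored as Python lists of lists; the function name `blocked_matmul` is a placeholder chosen here, not a name from the paper.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices (lists of lists) using s x s blocks.

    s is the tuning parameter: it should be chosen so that three
    s x s blocks fit into the cache at the same time.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Iterate over the blocks of C (ii, jj) and the shared dimension (kk).
    for ii in range(0, n, s):
        for jj in range(0, n, s):
            for kk in range(0, n, s):
                # Multiply one block of A by one block of B and
                # accumulate the result into the matching block of C.
                for i in range(ii, min(ii + s, n)):
                    for k in range(kk, min(kk + s, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on s; only the memory access order does, which is why s has to be re-tuned for each platform.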

Example (2)

Cache complexity
    The three s x s submatrices should fit into the cache, so they occupy $\Theta(s + s^2/L) = \Theta(\max(s, s^2/L))$ cache lines.
    Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
    Z/L cache misses are needed to bring the 3 submatrices into cache.
    $n^2/L$ cache misses are needed to read the $n^2$ elements.
    The total cache complexity is $\Theta(1 + n^2/L + (n/s)^3 (Z/L)) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.
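As a check on the last simplification: the blocked algorithm performs (n/s)³ block multiplications, each costing Z/L misses. Taking s = √Z exactly (an assumption made here only to keep constants out of the way), the product reduces as follows.

```latex
\left(\frac{n}{s}\right)^{3} \cdot \frac{Z}{L}
  = \frac{n^{3}}{Z^{3/2}} \cdot \frac{Z}{L}
  = \frac{n^{3}}{L\sqrt{Z}}
  \qquad \text{for } s = \sqrt{Z}.
```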

Cache Oblivious Algorithms

Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
No tuning needed; platform independent.

The algorithms introduced in the following slides are proved to have optimal cache complexity.

Matrix Multiplication

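Partition matrices A and B in half along the largest dimension. A: n x m, B: m x p

Case n ≥ max (m, p): split A into top and bottom halves A1, A2; A*B is A1*B stacked on top of A2*B.
Case m ≥ max (n, p): split A into left and right halves A1, A2 and B into top and bottom halves B1, B2; A*B = A1*B1 + A2*B2.
Case p ≥ max (n, m): split B into left and right halves B1, B2; A*B is A*B1 placed next to A*B2.

Proceed recursively until reaching the base case of one element.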

Matrix Multiplication (2)

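Assume the sizes of A and B are n x 4n and 4n x n.

Here (X Y) places blocks side by side and (X; Y) stacks them.

A*B = (A1 A2)(B1; B2) = A1*B1 + A2*B2
A1*B1 = (A11 A12)(B11; B12) = A11*B11 + A12*B12
A2*B2 = (A21 A22)(B21; B22) = A21*B21 + A22*B22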

Matrix Multiplication (3)

Intuitively, once a sub problem fits into the cache, its smaller sub problems can be solved in cache with no further misses.
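The recursion can be sketched as below. This is an illustrative version, assuming matrices stored as Python lists of lists and non-empty inputs; the names `rec_matmul` and `matmul` are placeholders chosen here. Note that no cache parameter appears anywhere in the code.

```python
def rec_matmul(A, B, C, i0, n, k0, m, j0, p):
    """Add A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C[i0:i0+n, j0:j0+p]."""
    if n == 1 and m == 1 and p == 1:
        # Base case: one element of A times one element of B.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):
        # Split the rows of A (and of C).
        h = n // 2
        rec_matmul(A, B, C, i0, h, k0, m, j0, p)
        rec_matmul(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        # Split the shared dimension; both products add into the same C block.
        h = m // 2
        rec_matmul(A, B, C, i0, n, k0, h, j0, p)
        rec_matmul(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        # Split the columns of B (and of C).
        h = p // 2
        rec_matmul(A, B, C, i0, n, k0, m, j0, h)
        rec_matmul(A, B, C, i0, n, k0, m, j0 + h, p - h)


def matmul(A, B):
    """Multiply an n x m matrix A by an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_matmul(A, B, C, 0, n, 0, m, 0, p)
    return C
```

A practical implementation would stop the recursion at a larger base case to reduce call overhead; the one-element base case is kept here to match the slide.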

Matrix Multiplication (4)

Cache complexity
    Can achieve the same cache complexity as the Block-MULT algorithm (cache aware)
    For square matrices, the optimal cache complexity is achieved.

Matrix Transposition

A (m x n) → B = A^T (n x m)

for i ← 1 to m
    for j ← 1 to n
        B( j, i ) = A( i, j )

If n is very large, accessing B by column causes a cache miss every time!
(No spatial locality in B)

Matrix Transposition (2)

Partition array A along the longer dimension and recursively execute the transpose function.

A11  A12        A11^T  A21^T
A21  A22   →    A12^T  A22^T
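The recursive partitioning can be sketched as follows, assuming a non-empty matrix stored as a Python list of lists and an out-of-place result. The names `rec_transpose` and `transpose` are placeholders chosen for this illustration.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 1 and n == 1:
        # Base case: copy a single element to its mirrored position.
        B[j0][i0] = A[i0][j0]
    elif m >= n:
        # Rows are the longer dimension: split them in half.
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        # Columns are the longer dimension: split them in half.
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)


def transpose(A):
    """Return the transpose of an m x n matrix as a new n x m matrix."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

Because each call halves the longer side, the subproblems stay roughly square, so a small enough subproblem touches only a few cache lines of both A and B.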

Matrix Transposition (3)

Cache complexity
    It has the optimal cache complexity
    Q(m, n) = Θ(1 + mn/L)
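One way to see the difference between the row-by-row loop and the recursive order is to count misses in a small simulated cache. The sketch below makes several assumptions that are not in the slides: it uses LRU replacement as a stand-in for the ideal cache's optimal replacement, square n x n row-major matrices, B stored directly after A in memory, and the hypothetical names `LRUCache`, `naive_misses`, and `recursive_misses`.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of Z words with L-word lines, LRU replacement."""

    def __init__(self, Z, L):
        self.capacity = Z // L      # number of cache lines
        self.L = L
        self.lines = OrderedDict()  # line index -> True, oldest first
        self.misses = 0

    def access(self, addr):
        line = addr // self.L
        if line in self.lines:
            self.lines.move_to_end(line)         # mark as most recently used
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)   # evict least recently used


def naive_misses(n, Z, L):
    """Misses of the doubly nested loop transposing an n x n matrix."""
    cache = LRUCache(Z, L)
    base_b = n * n  # assumption: B starts right after A
    for i in range(n):
        for j in range(n):
            cache.access(i * n + j)           # read A(i, j)
            cache.access(base_b + j * n + i)  # write B(j, i)
    return cache.misses


def recursive_misses(n, Z, L):
    """Misses of the recursive transposition of an n x n matrix."""
    cache = LRUCache(Z, L)
    base_b = n * n

    def rec(i0, m, j0, k):
        if m == 1 and k == 1:
            cache.access(i0 * n + j0)
            cache.access(base_b + j0 * n + i0)
        elif m >= k:
            h = m // 2
            rec(i0, h, j0, k)
            rec(i0 + h, m - h, j0, k)
        else:
            h = k // 2
            rec(i0, m, j0, h)
            rec(i0, m, j0 + h, k - h)

    rec(0, n, 0, n)
    return cache.misses
```

For example, comparing `naive_misses(64, 256, 8)` with `recursive_misses(64, 256, 8)` should show the recursive order incurring noticeably fewer simulated misses, since the loop version finds almost no line of B still resident when it returns to it.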

Fast Fourier Transform

Use the Cooley-Tukey algorithm.
Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as:
    Perform n2 DFTs of size n1.
    Multiply by complex roots of unity called twiddle factors.
    Perform n1 DFTs of size n2.

$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}, \qquad \omega_n = e^{2\pi\sqrt{-1}/n}$$

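$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$$

$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$

[Figure: X viewed as a matrix with dimensions labeled n1 and n2]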

Assume X is a row-major n1 × n2 matrix.
Steps:
    1. Transpose X in place.
    2. Compute n2 DFTs.
    3. Multiply by twiddle factors.
    4. Transpose X in place.
    5. Compute n1 DFTs.
    6. Transpose X in place.
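The six steps can be sketched in Python as below. Several choices here are assumptions made for illustration and are not taken from the slides: the transposes are done out of place on flat lists for simplicity, n is required to be a power of 2, n1 and n2 are picked as powers of 2 near √n, sizes 1 and 2 serve as base cases, and the sign convention is ω_n^{-ij} with ω_n = e^{2π√-1/n}. The names `fft`, `transpose`, and `naive_dft` are placeholders.

```python
import cmath


def transpose(x, rows, cols):
    """Return the transpose of a row-major rows x cols matrix as a flat list."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]


def fft(x):
    """Recursive Cooley-Tukey FFT following the six steps (n a power of 2)."""
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    # Factor n = n1 * n2 with both factors powers of 2 close to sqrt(n).
    log_n = n.bit_length() - 1
    n1 = 1 << ((log_n + 1) // 2)
    n2 = n // n1
    # Step 1: transpose the n1 x n2 matrix, giving an n2 x n1 matrix.
    t = transpose(x, n1, n2)
    # Step 2: n2 DFTs of size n1, one per (now contiguous) row.
    for j2 in range(n2):
        t[j2 * n1:(j2 + 1) * n1] = fft(t[j2 * n1:(j2 + 1) * n1])
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(-i1*j2).
    for j2 in range(n2):
        for i1 in range(n1):
            t[j2 * n1 + i1] *= cmath.exp(-2j * cmath.pi * i1 * j2 / n)
    # Step 4: transpose back to an n1 x n2 matrix.
    t = transpose(t, n2, n1)
    # Step 5: n1 DFTs of size n2, one per row.
    for i1 in range(n1):
        t[i1 * n2:(i1 + 1) * n2] = fft(t[i1 * n2:(i1 + 1) * n2])
    # Step 6: final transpose so that output index i = i1 + i2 * n1.
    return transpose(t, n1, n2)
```

With these choices an input of length 8 is split as n1 = 4, n2 = 2. The result can be compared against `naive_dft` on the same input, up to floating-point rounding.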

Fast Fourier Transform

[Figure: worked example with n1=4, n2=2]
    Transpose to select n2 DFTs of size n1
    Call FFT recursively with n1=2, n2=2; reach the base case, return
    Multiply by twiddle factors
    Transpose to select n1 DFTs of size n2
    Transpose and return

Fast Fourier Transform

Cache complexity
    Optimal for a Cooley-Tukey algorithm when n is an exact power of 2
    Q(n) = O(1 + (n/L)(1 + log_Z n))

Other Cache Oblivious Algorithms

Funnelsort
Distribution sort
LU decomposition without pivots

Questions

How large is the range of practicality of cache-oblivious algorithms?

What are the relative strengths of cache-oblivious and cache-aware algorithms?

Practicality of Cache-oblivious Algorithms

[Figure: average time to transpose an N x N matrix, divided by N²]

Practicality of Cache-oblivious Algorithms (2)

[Figure: average time taken to multiply two N x N matrices, divided by N³]

Question 2

Do cache-oblivious algorithms perform as well as cache-aware algorithms?
    FFTW library
    No answer yet.

References

Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Cache-Oblivious Algorithms, by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.

Optimizing Matrix Multiplication with a Classifier Learning System, by Xiaoming Li and María Jesus Garzarán. LCPC 2005.