The Study of Cache Oblivious Algorithms Prepared by Jia Guo.


2CS598dhp

Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.


Outline

Cache complexity

Cache aware algorithms

Cache oblivious algorithms: matrix multiplication, matrix transposition, FFT

Conclusion


Assumption

Only two levels of memory hierarchy:

An ideal cache: fully associative, optimal replacement strategy, "tall cache"

A very large memory


An Ideal Cache Model

An ideal cache model (Z,L)

Z: Total words in the cache

L: Words in one cache line


Cache Complexity

An algorithm with input size n is measured by:

Work complexity W(n)

Cache complexity Q(n; Z, L): the number of cache misses it incurs
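To make the miss count concrete, here is a minimal sketch that counts misses for a trace of word addresses on a fully associative (Z, L) cache. It assumes LRU replacement as a stand-in for the optimal replacement strategy of the ideal cache, and the function name `count_misses` is hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count cache misses for a trace of word addresses.

    Assumption: LRU replacement is used here; the ideal cache model
    assumes an optimal replacement strategy instead.
    """
    capacity = Z // L          # number of cache lines that fit in the cache
    cache = OrderedDict()      # line id -> None, ordered from oldest to newest
    misses = 0
    for addr in addresses:
        line = addr // L       # the cache line holding this word
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = None
    return misses


# A sequential scan of 1024 words touches 1024 / 8 = 128 lines.
scan_misses = count_misses(range(1024), Z=256, L=8)

# A stride-8 walk touches a new line on every access.
stride_misses = count_misses(range(0, 1024 * 8, 8), Z=256, L=8)
```

The two traces read the same number of words, but the strided one gets no benefit from the line length L.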


Cache Aware Algorithms

Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L).

Need to adjust parameters when running on different platforms.


Example:

A blocked matrix multiplication algorithm

s is a tuning parameter to make the algorithm run fast

[Figure: an n x n matrix A partitioned into s x s blocks such as A11]


Example (2)

Cache complexity

The three s x s submatrices should fit into the cache, so they occupy $\Theta(s + s^2/L) = \Theta(\max(s, s^2/L))$ cache lines.

Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.

$Z/L$ cache misses are needed to bring the 3 submatrices into cache.

$n^2/L$ cache misses are needed to read $n^2$ elements.

The cache complexity is

$$\Theta\left(1 + n^2/L + (n/s)^3 (Z/L)\right) = \Theta\left(1 + n^2/L + n^3/(L\sqrt{Z})\right)$$
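The blocked scheme can be sketched as follows. This is a minimal illustration of the loop structure, not a tuned implementation: Python lists of lists do not expose real cache behavior, and the function name `blocked_matmul` is hypothetical. The parameter `s` is the tuning parameter from the slides.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices (lists of lists) using s x s blocks."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):              # block row of C
        for j0 in range(0, n, s):          # block column of C
            for k0 in range(0, n, s):      # blocks of A and B being combined
                # Multiply one s x s block of A by one s x s block of B.
                # min() handles the ragged edge when s does not divide n.
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C


C = blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], s=1)
# C == [[19, 22], [43, 50]]
```

On a real machine, s would be chosen so that three s x s blocks fit in the cache, which is exactly the platform-specific tuning the cache-aware approach requires.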


Cache Oblivious Algorithms

Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).

No tuning needed; platform independent.

The algorithms introduced below are proved to have optimal cache complexity.


Matrix Multiplication

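Partition A and/or B in half along the largest dimension, where A is n x m and B is m x p.

Proceed recursively until reaching the base case of one element.

If n ≥ max(m, p): split A horizontally, so A*B = [A1; A2]*B = [A1*B; A2*B].

If m ≥ max(n, p): split A vertically and B horizontally, so A*B = [A1 A2]*[B1; B2] = A1*B1 + A2*B2.

If p ≥ max(n, m): split B vertically, so A*B = A*[B1 B2] = [A*B1 A*B2].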
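The three cases can be sketched as a recursive routine over index ranges. This is a minimal illustration under the assumption that matrices are non-empty lists of lists; the names `rec_mult` and `matmul` are hypothetical, and no cache parameter appears anywhere in the code.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Add A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C[i0:i0+n, j0:j0+p]."""
    if n == 0 or m == 0 or p == 0:
        return
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]   # base case: one element
    elif n >= max(m, p):
        h = n // 2                           # split the rows of A
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        h = m // 2                           # split columns of A and rows of B
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        h = p // 2                           # split the columns of B
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)


def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C


C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# C == [[19, 22], [43, 50]]
```

A practical version would stop the recursion at a small block and switch to plain loops; recursing down to a single element mirrors the base case stated on the slide.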


Matrix Multiplication (2)

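Assume the sizes of A and B are n x 4n and 4n x n.

$$A B = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$$

$$A_1 B_1 = \begin{pmatrix} A_{11} & A_{12} \end{pmatrix} \begin{pmatrix} B_{11} \\ B_{12} \end{pmatrix} = A_{11} B_{11} + A_{12} B_{12}$$

$$A_2 B_2 = \begin{pmatrix} A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{21} \\ B_{22} \end{pmatrix} = A_{21} B_{21} + A_{22} B_{22}$$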



Intuitively, once a sub problem fits into the cache, its smaller sub problems can be solved in cache with no further misses.



Cache complexity

Achieves the same cache complexity as the cache-aware Block-MULT algorithm.

For square matrices, the optimal cache complexity is achieved.


Matrix Transposition

Compute B = A^T, where A is m x n and B is n x m:

for i ← 1 to m
    for j ← 1 to n
        B(j, i) = A(i, j)

If n is very large, the column-wise access to B causes a cache miss on every access (no spatial locality in B).


Matrix Transposition (2)

Partition array A along the longer dimension and recursively execute the transpose function.

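$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \quad\Rightarrow\quad A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$$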
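The recursive transpose can be sketched as below. This is a minimal illustration, assuming A is a list of lists and B is preallocated; the name `rec_transpose` is hypothetical. Each call halves the longer dimension, so the subproblems eventually become small enough to fit in any cache without the code knowing Z or L.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 0 or n == 0:
        return
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]              # base case: one element
    elif m >= n:
        h = m // 2                         # split along the rows of A
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        h = n // 2                         # split along the columns of A
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)


A = [[1, 2, 3], [4, 5, 6]]                 # m = 2, n = 3
B = [[0] * 2 for _ in range(3)]            # n x m result
rec_transpose(A, B, 0, 2, 0, 3)
# B == [[1, 4], [2, 5], [3, 6]]
```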


Matrix Transposition (3)

Cache complexity

It has the optimal cache complexity: Q(m, n) = Θ(1 + mn/L)


Fast Fourier Transform

Use the Cooley-Tukey algorithm.

Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1 n2 as:

Perform n2 DFTs of size n1.

Multiply by complex roots of unity called twiddle factors.

Perform n1 DFTs of size n2.

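$$Y[i] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{-ij}$$

$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$

where $\omega_n = e^{2\pi\sqrt{-1}/n}$.

[Figure: X laid out as a matrix with dimensions n1 and n2]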
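As a sanity check on the decomposition, the exponent of the size-n DFT can be expanded by hand. The sketch below assumes the convention $\omega_n = e^{2\pi\sqrt{-1}/n}$ with a negative exponent in the forward transform (the convention of the cited paper), and substitutes $i = i_1 + i_2 n_1$ and $j = j_1 n_2 + j_2$ with $n = n_1 n_2$:

```latex
\begin{aligned}
\omega_n^{-(i_1 + i_2 n_1)(j_1 n_2 + j_2)}
  &= \omega_n^{-i_1 j_1 n_2}\;\omega_n^{-i_1 j_2}\;\omega_n^{-i_2 j_1 n_1 n_2}\;\omega_n^{-i_2 j_2 n_1} \\
  &= \omega_{n_1}^{-i_1 j_1}\;\omega_n^{-i_1 j_2}\;\cdot 1 \cdot\;\omega_{n_2}^{-i_2 j_2}
\end{aligned}
```

The second line uses $\omega_n^{n_2} = \omega_{n_1}$, $\omega_n^{n_1} = \omega_{n_2}$, and $\omega_n^{n} = 1$. The first factor yields the DFTs of size n1, the middle factor is the twiddle factor, and the last factor yields the DFTs of size n2.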


Assume X is a row-major n1 × n2 matrix.

Steps:

1. Transpose X in place.

2. Compute n2 DFTs of size n1.

3. Multiply by twiddle factors.

4. Transpose X in place.

5. Compute n1 DFTs of size n2.

6. Transpose X in place.
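The six steps can be sketched in pure Python as follows. Several things here are assumptions for illustration: n is a power of 2, the transposes are done out of place with list comprehensions rather than in place, the forward transform uses a negative exponent, and the names `dft` and `fft` are hypothetical. The split n1 = 2^ceil(lg n / 2) follows the choice made in the cited paper.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]


def fft(x):
    """Six-step Cooley-Tukey FFT for len(x) a power of 2."""
    n = len(x)
    if n <= 2:
        return dft(x)
    lg = n.bit_length() - 1
    n1 = 1 << ((lg + 1) // 2)          # n1 = 2^ceil(lg n / 2)
    n2 = n // n1
    # Step 1: view x as a row-major n1 x n2 matrix and transpose it.
    rows = [[x[j1 * n2 + j2] for j1 in range(n1)] for j2 in range(n2)]
    # Step 2: n2 DFTs of size n1 (one per row), computed recursively.
    rows = [fft(r) for r in rows]
    # Step 3: multiply entry (j2, i1) by the twiddle factor w_n^(-i1*j2).
    rows = [[rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
             for i1 in range(n1)] for j2 in range(n2)]
    # Step 4: transpose back to n1 rows of length n2.
    cols = [[rows[j2][i1] for j2 in range(n2)] for i1 in range(n1)]
    # Step 5: n1 DFTs of size n2 (one per row), computed recursively.
    cols = [fft(c) for c in cols]
    # Step 6: transpose and flatten so that Y[i1 + i2*n1] = cols[i1][i2].
    return [cols[i1][i2] for i2 in range(n2) for i1 in range(n1)]


signal = [complex(k, 0) for k in range(8)]
spectrum = fft(signal)                 # here n1 = 4 and n2 = 2
```

Comparing `fft(signal)` against `dft(signal)` is a quick way to check the index bookkeeping in steps 1, 4, and 6.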



Example with n1 = 4, n2 = 2:

Transpose to select n2 DFTs of size n1.

Call FFT recursively with n1 = 2, n2 = 2; reach the base case and return.

Multiply by the twiddle factors.

Transpose to select n1 DFTs of size n2.

Transpose and return.



Cache complexity

Optimal for a Cooley-Tukey algorithm when n is an exact power of 2.

Q(n) = O(1 + (n/L)(1 + log_Z n))


Other Cache Oblivious Algorithms

Funnelsort

Distribution sort

LU decomposition without pivots


Questions

How large is the range of practicality of cache-oblivious algorithms?

What are the relative strengths of cache-oblivious and cache-aware algorithms?


Practicality of Cache-oblivious Algorithms

Average time to transpose an N x N matrix, divided by N^2


Practicality of Cache-oblivious Algorithms (2)

Average time taken to multiply two N x N matrices, divided by N^3


Question 2

Do cache-oblivious algorithms perform as well as cache-aware algorithms?

FFTW library

No answer yet.


References

Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.

Cache-Oblivious Algorithms, by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.

Optimizing Matrix Multiplication with a Classifier Learning System, by Xiaoming Li and María Jesus Garzarán. LCPC 2005.
