Cache Performance Analysis

Page 1: Cache Performance Analysis

Faculty of Computer Science

CMPUT 229 © 2006

Cache Performance Analysis

Hitting for performance

Page 2: Cache Performance Analysis

© 2006

Department of Computing Science

CMPUT 229

Standard Matrix Multiplication

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        c[i,j] = 0.0;
        for (k = 0; k < n; k++) {
            c[i,j] = c[i,j] + a[i,k] * b[k,j];
        }
    }
}

Assume that:

Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct mapped;
n = 1024, Address(a[0,0]) = $80000000, Address(b[0,0]) = $80800000, Address(c[0,0]) = $81000000.

What is the data cache hit ratio for this program?

Written in terms of loads and stores, the same loop nest is:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++) {
            temp1 ← load(a[i,k]);
            temp2 ← load(b[k,j]);
            sum ← sum + temp1 * temp2;
        }
        store(c[i,j]) ← sum;
    }
}

Page 3: Cache Performance Analysis

Data Access Pattern

[Figure: matrices A and B; the inner loop walks along a row of A and down a column of B.]

Page 4: Cache Performance Analysis

Cache Access Analysis

With the assumptions stated for the standard matrix multiplication, the number of lines in the cache is:

\[ \frac{32\ \text{Kbytes/cache}}{128\ \text{bytes/line}} = 256\ \text{lines/cache} \]

Page 5: Cache Performance Analysis

Cache Access Analysis (continued)

The cache has 256 lines (32 Kbytes divided by 128 bytes per line), and each line holds:

\[ \frac{128\ \text{bytes/line}}{8\ \text{bytes/element}} = 16\ \text{elements/line} \]

Page 6: Cache Performance Analysis

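Cache Data Access Pattern

The cache has 256 lines and each line holds 16 elements.

If we ignore conflict misses, then:

Every 16th access of A is a miss;
Every access to B is a miss.

Consecutive accesses to B are one full row (1024 × 8 = 8192 bytes) apart, so each one falls in a different cache line, and a column of B touches 1024 lines, more than the 256 lines the cache holds.

How many hits and misses will occur to compute one element of C?

In A there will be 1024/16 = 64 misses and 1024 - 64 = 960 hits.
In B there will be 1024 misses.

Thus, what is the hit ratio?

\[ \text{Hit ratio} = \frac{\#\ \text{hits}}{\#\ \text{accesses}} = \frac{960\ \text{hits}}{2048\ \text{accesses}} \approx 0.47 = 47\% \]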
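The counts above follow directly from the cache geometry. The following is a minimal Python sketch of that arithmetic, ignoring conflict misses as the slide does. The function name and its parameters are illustrative assumptions, not part of the original slides.

```python
def inner_loop_hit_ratio(n=1024, elem_bytes=8, line_bytes=128):
    """Hits, misses and hit ratio for computing one element of C,
    ignoring conflict misses (same simplification as the slide)."""
    elems_per_line = line_bytes // elem_bytes   # 128 / 8 = 16 elements per line
    a_misses = n // elems_per_line              # row of A: one miss per cache line
    a_hits = n - a_misses
    b_misses = n                                # column of B: every access misses
    accesses = 2 * n                            # one load of A and one of B per k
    return a_hits, a_misses + b_misses, a_hits / accesses


hits, misses, ratio = inner_loop_hit_ratio()
print(hits, misses, round(ratio, 2))  # 960 1088 0.47
```

Changing `elem_bytes` or `line_bytes` shows how the ratio depends on the number of elements per line; with B always missing, it can never exceed 50%.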

Page 7: Cache Performance Analysis

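Address anatomy

The data cache has 32 Kbytes and 128-byte cache lines, which gives 256 lines/cache and 16 elements/line.

128 = 2^7, so the offset within a line takes 7 bits.
256 = 2^8, so the line index takes 8 bits.

Field     Bits      Width
Tag       31-15     17 bits
Index     14-7      8 bits
Offset    6-0       7 bits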
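The field boundaries can be checked with a few shifts and masks. This is a small Python sketch of the split for a 32-bit address; the helper name `split_address` is an assumption made for illustration.

```python
OFFSET_BITS = 7   # 128-byte lines  -> bits 6-0
INDEX_BITS = 8    # 256 cache lines -> bits 14-7


def split_address(addr):
    """Return (tag, index, offset) of a 32-bit address for this cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)      # remaining 17 bits
    return tag, index, offset


tag, index, offset = split_address(0x80802008)
print(hex(tag), index, offset)  # 0x10100 64 8
```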

Page 8: Cache Performance Analysis

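Conflict Misses

The cache has 256 lines and each line holds 16 elements.

Access    Address      Index    Outcome
A[0,0]    $80000000    0        miss
B[0,0]    $80800000    0        miss
A[0,1]    $80000004    0        miss
B[1,0]    $80801000    32       miss
A[0,2]    $80000008    0        hit
B[2,0]    $80802000    64       miss
A[0,3]    $8000000C    0        hit
B[3,0]    $80803000    96       miss
A[0,4]    $80000010    0        hit
B[4,0]    $80804000    128      miss
A[0,5]    $80000014    0        hit
B[5,0]    $80805000    160      miss
A[0,6]    $80000018    0        hit
B[6,0]    $80806000    192      miss
A[0,7]    $8000001C    0        hit
B[7,0]    $80807000    224      miss
A[0,8]    $80000020    0        hit
B[8,0]    $80808000    0        miss
A[0,9]    $80000024    0        miss
B[9,0]    $80809000    32       miss

[Figure: the cache, with the lines at indices 0, 32, 64, 96, 128, 160, 192 and 224 marked.]

The addresses in this table advance by 4 bytes per element of A and by $1000 bytes per row of B, which corresponds to 4-byte elements. With the 8-byte elements assumed in this analysis, the strides are 8 bytes and $2000 bytes (an index step of 64), and the same kind of conflict occurs whenever an element of B maps to the cache line that holds the current elements of A.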

In general:

A 1024-element row of A occupies 64 16-element cache lines.
There will be 2 conflict misses in two of these lines, for a total of 4 conflict misses per row.
Thus the accesses of A will result in 68 misses and 956 hits for each 1024 accesses.
The conflict misses are not significant and can be ignored.
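The effect of conflict misses can be checked with a tiny direct-mapped cache model. The sketch below is an illustration under stated assumptions: 8-byte elements, row-major storage, the base addresses $80000000 for a and $80800000 for b, a cold cache at the start, and only the inner loop for one (i, j) pair. The function and variable names are hypothetical.

```python
LINE_BYTES = 128
NUM_LINES = 256
ELEM_BYTES = 8
N = 1024
A_BASE = 0x80000000   # assumed address of a[0,0]
B_BASE = 0x80800000   # assumed address of b[0,0]


def simulate_inner_loop(i=0, j=0):
    """Count misses on a and b for the k loop of one c[i,j], cold cache."""
    cache = {}                      # index -> memory line currently stored there
    misses = {"a": 0, "b": 0}

    def access(name, addr):
        line = addr // LINE_BYTES   # memory line number (tag and index together)
        index = line % NUM_LINES    # direct mapped: exactly one possible slot
        if cache.get(index) != line:
            misses[name] += 1
            cache[index] = line     # evict whatever was in that slot

    for k in range(N):
        access("a", A_BASE + (i * N + k) * ELEM_BYTES)   # a[i,k], row-major
        access("b", B_BASE + (k * N + j) * ELEM_BYTES)   # b[k,j], row-major
    return misses


m = simulate_inner_loop()
print(m["a"], N - m["a"], m["b"])  # 68 956 1024
```

Under these assumptions the model reports 64 + 4 = 68 misses on a and a miss on every access to b, so the conflict misses change the hit count only slightly.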

Page 9: Cache Performance Analysis

Matrix Multiplication with Transpose

First, b is transposed into b1:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        b1[i,j] = b[j,i];
    }
}

Then the multiplication reads b1 by rows, like a:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i,j] = c[i,j] + a[i,k] * b1[j,k];
        }
    }
}

Assume the same element size, cache, n, and addresses of a, b and c as for the standard matrix multiplication.

What is the data cache hit ratio for this program?

Where in memory should we place matrix b1 to reduce conflict misses?

Page 10: Cache Performance Analysis

Where to place matrix b1?

[Address layout: Tag = bits 31-15, Index = bits 14-7, Offset = bits 6-0.]

Intuitively the index of b1[0][0] should be away from the index of a[0][0].

The index of a[0][0] is 0.

Thus we could aim to place b1 at an address whose index is 128.
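One way to turn this intuition into an address is to start from some free location and move forward to the next line whose index is 128. The Python sketch below does that. The starting address $81800000 is a hypothetical choice (the first byte after an 8-Mbyte matrix c stored at $81000000), and the helper names are assumptions, not something the slides specify.

```python
LINE_BYTES = 128
NUM_LINES = 256


def cache_index(addr):
    """Index field (bits 14-7) of an address."""
    return (addr // LINE_BYTES) % NUM_LINES


def next_address_with_index(start, target_index):
    """First line-aligned address at or after start that maps to target_index."""
    line = (start + LINE_BYTES - 1) // LINE_BYTES          # round up to a line boundary
    delta = (target_index - line % NUM_LINES) % NUM_LINES  # lines to skip forward
    return (line + delta) * LINE_BYTES


# Hypothetical free region right after c; not given in the slides.
b1_base = next_address_with_index(0x81800000, 128)
print(hex(b1_base), cache_index(b1_base))  # 0x81804000 128
```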

Page 11: Cache Performance Analysis

Cache Access Pattern for the Transpose

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        b1[i,j] = b[j,i];
    }
}

If we ignore conflict misses, then:

Every 16th access of b1 is a miss;
Every access to b is a miss.

The transpose's inner loop yields 2048 accesses and 960 hits.

The inner loop is repeated 1024 times, which gives 1024 × 2048 accesses and 1024 × 960 hits.

Thus, the hit ratio is:

\[ \text{Hit ratio} = \frac{\#\ \text{hits}}{\#\ \text{accesses}} = \frac{1024 \times 960\ \text{hits}}{1024 \times 2048\ \text{accesses}} = \frac{960}{2048} \approx 0.47 = 47\% \]

Page 12: Cache Performance Analysis

Cache Access Pattern for the Multiplication

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++) {
            temp1 ← load(a[i,k]);
            temp2 ← load(b1[j,k]);
            sum ← sum + temp1 * temp2;
        }
        store(c[i,j]) ← sum;
    }
}

If we ignore conflict misses, then:

Every 16th access of a is a miss;
Every 16th access to b1 is a miss.

Thus the inner loop yields 2048 accesses and 2 × 960 = 1920 hits.

The inner loop is executed n^2 times. The total number of accesses (ignoring accesses to c) in the multiplication is:

1024 × 1024 × 2048 accesses
1024 × 1024 × 1920 hits
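The 1920 hits per inner loop depend on a and b1 not evicting each other. The Python sketch below models the direct-mapped cache for a single inner loop (i = 0, j = 0, cold cache, 8-byte elements, row-major storage) and compares two placements of b1. Both b1 addresses are hypothetical values chosen for illustration: $81804000 maps b1[0,0] to index 128, and $81800000 maps it to index 0, the same index as a[0,0].

```python
LINE_BYTES, NUM_LINES, ELEM_BYTES, N = 128, 256, 8, 1024
A_BASE = 0x80000000   # assumed address of a[0,0]


def inner_loop_hits(b1_base, i=0, j=0):
    """Hits in the k loop of one c[i,j] when b1 starts at b1_base."""
    cache = {}    # index -> memory line currently stored there
    hits = 0
    for k in range(N):
        for addr in (A_BASE + (i * N + k) * ELEM_BYTES,     # a[i,k]
                     b1_base + (j * N + k) * ELEM_BYTES):   # b1[j,k]
            line = addr // LINE_BYTES
            index = line % NUM_LINES
            if cache.get(index) == line:
                hits += 1
            else:
                cache[index] = line   # miss: load the line, evicting the old one
    return hits


print(inner_loop_hits(0x81804000))  # b1[0,0] at index 128: 1920
print(inner_loop_hits(0x81800000))  # b1[0,0] at index 0:   0
```

Only one (i, j) pair is simulated here. Other rows of b1 start at other indices, so this sketch illustrates the conflict problem rather than settling where b1 should go.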

Page 13: Cache Performance Analysis

Hit Ratio for Multiplication with Transpose

The transpose yields:

1024 × 2048 accesses
1024 × 960 hits

The multiplication (ignoring accesses to c) yields:

1024 × 1024 × 2048 accesses
1024 × 1024 × 1920 hits

\[ \text{Hit ratio} = \frac{1024 \times 960 + 1024 \times 1024 \times 1920\ \text{hits}}{2048 \times 1024 + 1024 \times 1024 \times 2048\ \text{accesses}} = \frac{960 + 1024 \times 1920}{1025 \times 2048} \approx 0.937 = 93.7\% \]
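The combined figure is easy to re-derive numerically. This short Python sketch simply adds the transpose and multiplication counts from the slides; the variable names are illustrative.

```python
n = 1024

# Transpose: n inner loops of 2048 accesses and 960 hits each.
transpose_accesses = n * 2048
transpose_hits = n * 960

# Multiplication: n * n inner loops of 2048 accesses and 1920 hits each.
mult_accesses = n * n * 2048
mult_hits = n * n * 1920

hit_ratio = (transpose_hits + mult_hits) / (transpose_accesses + mult_accesses)
print(f"{hit_ratio:.3f}")  # 0.937
```

The transpose contributes only one part in 1025 of the accesses, which is why its poor 47% hit ratio barely lowers the overall result.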

Page 14: Cache Performance Analysis

Blocked Matrix Multiplication*

// In the loop bounds, b is the block size; b[k,j] is an element of matrix b.
for (i0 = 0; i0 < n; i0 = i0 + b) {
    for (j0 = 0; j0 < n; j0 = j0 + b) {
        for (k0 = 0; k0 < n; k0 = k0 + b) {
            for (i = i0; i < min(i0+b, n); i++) {
                for (j = j0; j < min(j0+b, n); j++) {
                    for (k = k0; k < min(k0+b, n); k++) {
                        c[i,j] = c[i,j] + a[i,k] * b[k,j];
                    }
                }
            }
        }
    }
}

* Code adapted from http://www.netlib.org/utk/papers/autoblock/node2.html

Assumes that all elements of matrix c were initialized to zero beforehand.
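For readers who want to run the blocked loop nest, here is a minimal Python sketch of it, checked against the plain triple loop. It is an illustration only: the names `blocked_matmul`, `naive_matmul`, `b_mat` and `bs` are assumptions (the block size is called `bs` to keep it apart from the matrix), and Python lists stand in for the arrays.

```python
def naive_matmul(a, b_mat, n):
    """Standard i-j-k matrix multiplication."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b_mat[k][j]
    return c


def blocked_matmul(a, b_mat, n, bs):
    """Same product, computed one bs-by-bs block at a time."""
    c = [[0.0] * n for _ in range(n)]          # c must start at zero
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # min(...) trims the last block when bs does not divide n
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        for k in range(k0, min(k0 + bs, n)):
                            c[i][j] += a[i][k] * b_mat[k][j]
    return c


n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b_mat = [[float(i * j + 1) for j in range(n)] for i in range(n)]
print(blocked_matmul(a, b_mat, n, 2) == naive_matmul(a, b_mat, n))  # True
```

The test size n = 5 with a block size of 2 exercises the partial last block, which is where an off-by-one in the loop bounds would show up.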

Page 15: Cache Performance Analysis

Data Access Pattern

[Figure, built up step by step over pages 15 to 23: a block of A and a block of B, with each access marked as a miss or a hit.]

Running totals while the first row of the block of A is multiplied by the block of B:

Step    Misses    Hits
1       2         0
2       3         1
3       4         2
4       4         4
5       4         6
6       4         8
7       4         10
8       4         12
9       4         14

Multiplying the first row of the block of A by the block of B required 18 accesses that resulted in 4 misses.

How many of the 18 accesses required to multiply the second row of the block of A by the block of B will be misses?

Page 24: Cache Performance Analysis

Data Access Pattern

[Figure, continued over pages 24 to 26: the second and third rows of the block of A are multiplied by the block of B.]

Row of the block of A    Misses    Hits
First                    4         14
Second                   1         17
Third                    1         17
Total                    6         48
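These per-row counts can be reproduced with a very small model. The Python sketch below assumes, as the figure appears to, that each row of a block sits in its own cache line, that nothing is evicted during the block multiplication, and that accesses to c are ignored. The function name is hypothetical.

```python
def block_row_counts(b):
    """(misses, hits) for each row of a b-by-b block of A times a block of B."""
    cached = set()      # block rows (one cache line each) loaded so far
    rows = []
    for i in range(b):
        misses = hits = 0
        for j in range(b):
            for k in range(b):
                # a[i,k] lives in row i of A's block, b[k,j] in row k of B's block
                for line in (("A", i), ("B", k)):
                    if line in cached:
                        hits += 1
                    else:
                        misses += 1
                        cached.add(line)
        rows.append((misses, hits))
    return rows


print(block_row_counts(3))  # [(4, 14), (1, 17), (1, 17)]
```

The first row pays for its own line of A and for all three lines of B; every later row only misses on its own line of A.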

Page 27: Cache Performance Analysis

Data Access Pattern

[Figure: blocks of A and B with each access marked as a miss or a hit.]

First block multiplication:

Misses: 4 + 1 + 1 = 6
Hits: 14 + 17 + 17 = 48

What is the hit ratio for the next block multiplication?

3 misses and 54 references.

In general, there are b misses and 2b^3 accesses:

\[ \text{Hit ratio} = \frac{2b^3 - b}{2b^3} = \frac{2b^2 - 1}{2b^2} \]

Page 29: Cache Performance Analysis

Data Access Pattern

[Figure: blocks of A and B with each access marked as a miss or a hit.]

Assume that:

Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct mapped.

What should be the value of b?

Do the memory locations of A and B matter?

Page 30: Cache Performance Analysis

Cache Usage for Blocked Matrix Multiplication

Assume that:

Each matrix element is stored in 8 bytes;
The data cache has 32 Kbytes and 128-byte cache lines;
The data cache is direct mapped.

// In the loop bounds, b is the block size; b[k,j] is an element of matrix b.
for (i0 = 0; i0 < n; i0 = i0 + b) {
    for (j0 = 0; j0 < n; j0 = j0 + b) {
        for (k0 = 0; k0 < n; k0 = k0 + b) {
            for (i = i0; i < min(i0+b, n); i++) {
                for (j = j0; j < min(j0+b, n); j++) {
                    for (k = k0; k < min(k0+b, n); k++) {
                        c[i,j] = c[i,j] + a[i,k] * b[k,j];
                    }
                }
            }
        }
    }
}

Ignore conflict misses.

Estimate the hit ratio for the block computation if b = 16.

\[ \text{Hit ratio} = \frac{2b^2 - 1}{2b^2} = \frac{2(16)^2 - 1}{2(16)^2} = \frac{511}{512} \approx 99.8\% \]
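The general formula can be evaluated for any block size. The Python sketch below uses exact fractions and takes the slide's model as given (b misses out of 2b^3 accesses per block multiplication, conflict misses ignored); the function name is an assumption.

```python
from fractions import Fraction


def block_hit_ratio(b):
    """Hit ratio of one block multiplication: (2b^3 - b) / (2b^3)."""
    accesses = 2 * b**3     # b^3 inner iterations, two loads each
    misses = b              # the slide's estimate of misses per block
    return Fraction(accesses - misses, accesses)


for b in (3, 16):
    r = block_hit_ratio(b)
    print(b, r, f"{float(r):.1%}")
# 3 17/18 94.4%
# 16 511/512 99.8%
```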