1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.
-
Upload
godwin-mcdaniel -
Category
Documents
-
view
230 -
download
0
description
Transcript of 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.
![Page 1: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/1.jpg)
1
Cache Memory
![Page 2: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/2.jpg)
2
Outline
• Cache mountain• Matrix multiplication• Suggested Reading: 6.6, 6.7
![Page 3: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/3.jpg)
3
6.6 Putting it Together: The Impact of Caches on Program Performance
6.6.1 The Memory Mountain
![Page 4: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/4.jpg)
4
The Memory Mountain P512
• Read throughput (read bandwidth)– The rate that a program reads data from the
memory system• Memory mountain
– A two-dimensional function of read bandwidth versus temporal and spatial locality
– Characterizes the capabilities of the memory system for each computer
![Page 5: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/5.jpg)
5
Memory mountain main routineFigure 6.41 P513
/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23) /* ... up to 8 MB */
#define MAXSTRIDE 16 /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)
int data[MAXELEMS]; /* The array we'll be traversing */
![Page 6: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/6.jpg)
6
Memory mountain main routine
int main()
{
int size; /* Working set size (in bytes) */
int stride; /* Stride (in array elements) */
double Mhz; /* Clock frequency */
init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
Mhz = mhz(0); /* Estimate the clock frequency */
![Page 7: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/7.jpg)
7
Memory mountain main routine
for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
for (stride = 1; stride <= MAXSTRIDE; stride++)
printf("%.1f\t", run(size, stride, Mhz));
printf("\n");
}
exit(0);
}
![Page 8: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/8.jpg)
8
Memory mountain test functionFigure 6.40 P512
/* The test function */
void test (int elems, int stride) {
int i, result = 0;
volatile int sink;
for (i = 0; i < elems; i += stride)
result += data[i];
sink = result; /* So compiler doesn't optimize away the loop */
}
![Page 9: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/9.jpg)
9
Memory mountain test function
/* Run test (elems, stride) and return read throughput (MB/s) */
double run (int size, int stride, double Mhz)
{
double cycles;
int elems = size / sizeof(int);
test (elems, stride); /* warm up the cache */
cycles = fcyc2(test, elems, stride, 0); /* call test (elems,stride) */
return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
![Page 10: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/10.jpg)
10
The Memory Mountain
• Data– Size
• MAXBYTES(8M) bytes or MAXELEMS(2M) words
– Partially accessed• Working set: from 8MB to 1KB• Stride: from 1 to 16
![Page 11: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/11.jpg)
11
The Memory MountainFigure 6.42 P514
s1
s3
s5
s7
s9
s11
s13
s15
8m
2m 512k 12
8k 32k 8k
2k
0
200
400
600
800
1000
1200
read
thro
ughp
ut (M
B/s
)
stride (words) working set size (bytes)
Pentium III Xeon550 MHz16 KB on-chip L1 d-cache16 KB on-chip L1 i-cache512 KB off-chip unifiedL2 cache
Ridges ofTemporalLocality
L1
L2
mem
Slopes ofSpatialLocality
xe
![Page 12: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/12.jpg)
12
Ridges of temporal locality
• Slice through the memory mountain with stride=1– illuminates read throughputs of different
caches and memory
Ridges: 山脊
![Page 13: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/13.jpg)
13
Ridges of temporal localityFigure 6.43 P515
0
200
400
600
800
1000
12008m 4m 2m
1024
k
512k
256k
128k
64k
32k
16k
8k 4k 2k 1k
working set size (bytes)
read
thro
ugpu
t (M
B/s
)
L1 cacheregion
L2 cacheregion
main memoryregion
![Page 14: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/14.jpg)
14
A slope of spatial locality
• Slice through memory mountain with size=256KB– shows cache block size.
![Page 15: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/15.jpg)
15
A slope of spatial localityFigure 6.44 P516
0
100
200
300
400
500
600
700
800
s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16
stride (words)
read
thro
ughp
ut (M
B/s
)
one access per cache line
![Page 16: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/16.jpg)
16
6.6 Putting it Together: The Impact of Caches on Program Performance
6.6.2 Rearranging Loops to Increase Spatial Locality
![Page 17: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/17.jpg)
17
2221
1211
2221
1211
2221
1211
bbbb
aaaa
cccc
2222122122
2122112121
2212121112
2112111111
babacbabacbabacbabac
Matrix Multiplication P517
![Page 18: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/18.jpg)
18
Matrix Multiplication ImplementationFigure 6.45 (a) P518
/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { c[i][j] = 0.0; for (k=0; k<n; k++) c[i][j] += a[i][k] * b[k][j]; }} O(n3)adds and multipliesEach n2 elements of A and B is read n times
![Page 19: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/19.jpg)
19
Matrix Multiplication P517
• Assumptions:– Each array is an nn array of double, with size 8– There is a single cache with a 32-byte block size
( B=32 )– The array size n is so large that a single matrix row
does not fit in the L1 cache– The compiler stores local variables in registers, and
thus references to local variables inside loops do not require any load and store instructions.
![Page 20: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/20.jpg)
20
/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}
Variable sumheld in register
Matrix MultiplicationFigure 6.45 (a) P518
![Page 21: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/21.jpg)
21
/* ijk */for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; }}
A B C
(i,*)
(*,j)(i,j)
Inner loop:
Column-wise
Row-wise Fixed
• Misses per Inner Loop Iteration:A B C
0.25 1.0 0.0
Matrix multiplication (ijk)
Figure 6.46 P519
1) (AB)
![Page 22: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/22.jpg)
22
/* jik */for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum }}
A B C
(i,*)
(*,j)(i,j)
Inner loop:
Row-wise Column-wise
Fixed• Misses per Inner Loop Iteration:A B C0.25 1.0 0.0
Matrix multiplication (jik)Figure 6.45 (b) P518
Figure 6.46 P519
1) (AB)
![Page 23: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/23.jpg)
23
/* kij */for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}
A B C
(i,*)(i,k) (k,*)
Inner loop:
Row-wise Row-wiseFixed
• Misses per Inner Loop Iteration:A B C0.0 0.25 0.25
Matrix multiplication (kij)Figure 6.45 (e) P518
Figure 6.46 P519
3) (BC)
![Page 24: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/24.jpg)
24
/* ikj */for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; }}
A B C
(i,*)(i,k) (k,*)
Inner loop:
Row-wise Row-wiseFixed
• Misses per Inner Loop Iteration:A B C
0.0 0.25 0.25
Matrix multiplication (ikj)Figure 6.45 (f) P518
Figure 6.46 P519
3) (BC)
![Page 25: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/25.jpg)
25
/* jki */for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}
A B C
(*,j)(k,j)
Inner loop:
(*,k)
Column -wise
Column-wise
Fixed• Misses per Inner Loop Iteration:
A B C1.0 0.0 1.0
Matrix multiplication (jki)Figure 6.45 (c) P518
Figure 6.46 P519
2) (AC)
![Page 26: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/26.jpg)
26
/* kji */for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; }}
A B C
(*,j)(k,j)
Inner loop:
(*,k)
FixedColumn-wise
Column-wise
• Misses per Inner Loop Iteration:A B C1.0 0.0 1.0
Matrix multiplication (kji)Figure 6.45 (d) P518
Figure 6.46 P519
2) (AC)
![Page 27: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/27.jpg)
27
Pentium matrix multiply performanceFigure 6.47 (d) P519
0
10
20
30
40
50
60
25 50 75 100 125 150 175 200 225 250 275 300 325 350 375 400
Array size (n)
Cyc
les/
itera
tion
kjijkikijikjjikijk
2) (AC)
3) (BC)
1) (AB)
![Page 28: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/28.jpg)
28
Pentium matrix multiply performance
• Notice that miss rates are helpful but not perfect predictors.– Code scheduling matters, too.
![Page 29: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/29.jpg)
29
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
ijk (& jik): • 2 loads, 0 stores• misses/iter = 1.25
for (k=0; k<n; k++) {
for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}
for (j=0; j<n; j++) {
for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}
kij (& ikj): • 2 loads, 1 store• misses/iter = 0.5
jki (& kji): • 2 loads, 1 store• misses/iter = 2.0
Summary of matrix multiplication
1) (AB) 3) (BC) 2) (AC)
![Page 30: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/30.jpg)
30
6.6 Putting it Together: The Impact of Caches on Program Performance
6.6.3 Using Blocking to Increase Temporal Locality
![Page 31: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/31.jpg)
31
Improving temporal locality by blocking P520
• Example: Blocked matrix multiplication– “block” (in this context) does not mean “cache
block”.– Instead, it mean a sub-block within the matrix.– Example: N = 8; sub-block size = 4
![Page 32: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/32.jpg)
32
Improving temporal locality by blocking
C11 = A11B11 + A12B21 C12 = A11B12 + A12B22
C21 = A21B11 + A22B21 C22 = A21B12 + A22B22
A11 A12
A21 A22
B11 B12
B21 B22X =
C11 C12
C21 C22
Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
![Page 33: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/33.jpg)
33
for (jj=0; jj<n; jj+=bsize) { for (i=0; i<n; i++) for (j=jj; j < min(jj+bsize,n); j++) c[i][j] = 0.0; for (kk=0; kk<n; kk+=bsize) { for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; } } }}
Blocked matrix multiply (bijk)Figure 6.48 P521
![Page 34: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/34.jpg)
34
Blocked matrix multiply analysis
• Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into 1 X bsize sliver of C– Loop over i steps through n row slivers of A & C,
using same B
Sliver: 长条
![Page 35: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/35.jpg)
35
A B C
block reused n times in succession
row sliver accessedbsize times
Update successiveelements of sliver
i ikk
kk jjjj
for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; }
InnermostLoop Pair
Blocked matrix multiply analysis
Figure 6.49 P522
![Page 36: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/36.jpg)
36
Pentium blocked matrix multiply performanceFigure 6.50 P523
0
10
20
30
40
50
60
Array size (n)
Cyc
les/
itera
tion
kjijkikijikjjikijkbijk (bsize = 25)bikj (bsize = 25)
2)3)1)
![Page 37: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/37.jpg)
37
6.7 Putting it Together: Exploring Locality in Your Programs
![Page 38: 1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.](https://reader036.fdocuments.in/reader036/viewer/2022062503/5a4d1acd7f8b9ab0599701f6/html5/thumbnails/38.jpg)
38
Techniques P523
• Focus your attention on the inner loops• Try to maximize the spatial locality in your
programs by reading data objects sequentially, in the order they are stored in memory
• Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory
• Miss rates, the number of memory accesses