Class11 Cache


CS 105: Tour of the Black Holes of Computing
Cache Memories

Topics
  • Generic cache-memory organization
  • Direct-mapped caches
  • Set-associative caches
  • Impact of caches on performance

New Topic: Cache
  • A buffer between the processor and memory
  • Often several levels of caches
  • Small but fast
  • Old values are removed from the cache to make space for new values
  • Capitalizes on spatial locality and temporal locality
      • Spatial locality: if a value is used, nearby values are likely to be used
      • Temporal locality: if a value is used, it is likely to be used again soon
  • Parameters vary by system and are unknown to the programmer
  • Goal: cache-friendly code

Cache Memories
  • Cache memories are small, fast SRAM-based memories managed automatically in hardware
  • Hold frequently accessed blocks of main memory
  • The CPU looks first for data in L1, then in L2, then in main memory
  • Typical bus structure: the register file, ALU, and L1 cache sit on the CPU chip; the bus interface connects over a cache bus to the L2 cache, and over the system bus and I/O bridge to main memory on the memory bus

Inserting an L1 Cache Between the CPU and Main Memory
  • The tiny, very fast CPU register file has room for four 4-byte words; the transfer unit between the register file and the cache is a 4-byte block (one line at a time)
  • The small, fast L1 cache has room for two 4-word blocks; it is an associative memory
  • The big, slow main memory has room for many 4-word blocks (e.g., block 10 holding "abcd", block 21 holding "pqrs", block 30 holding "wxyz"); the transfer unit between the cache and main memory is a 4-word block (16 bytes)

General Organization of a Cache Memory
  • A cache is an array of S = 2^s sets
  • Each set contains E lines
  • Each line holds a block of B = 2^b data bytes (numbered 0 to B-1), plus 1 valid bit and t tag bits
  • Set selection works like a hash: the set index is the hash code, and the tag is the hash key
  • Cache size: C = B x E x S data bytes

Addressing Caches
  • An m-bit address A is divided into three fields: t tag bits (high), s set-index bits (middle), and b block-offset bits (low)
  • The word at address A is in the cache if some line in the set selected by the set-index bits is valid and its tag matches the tag bits of A
  • The word's contents begin at <block offset> bytes from the beginning of the block

Direct-Mapped Cache
  • The simplest kind of cache
  • Characterized by exactly one line per set (E = 1)
  • Each of the S sets holds one valid bit, one tag, and one cache block

Accessing Direct-Mapped Caches: Set Selection
  • Use the set-index bits of the address to determine the set of interest
  • The selected set's single line is then checked for a match

Accessing Direct-Mapped Caches: Line Matching and Word Selection
  • Line matching: find a valid line in the selected set with a matching tag
  • Word selection: then extract the word
  (1) The valid bit must be set
  (2) The tag bits in the cache line must match the tag bits in the address
  (3) If (1) and (2), then it is a cache hit, and the block offset selects the starting byte
  • Example: in selected set i, an address with tag 0110 and block offset 100 matches the valid line with tag 0110; byte offset 4 selects word w2 of the block's words w0-w3

Direct-Mapped Cache Simulation
  • M = 16 addressable bytes (4-bit addresses), B = 2 bytes/block, S = 4 sets, E = 1 line/set
  • Address fields: t = 1 tag bit, s = 2 set-index bits, b = 1 block-offset bit
  • Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

  (1) 0 [0000] miss: block M[0-1] is loaded into set 0 with tag 0
  (2) 1 [0001] hit: same block, already in set 0
  (3) 13 [1101] miss: block M[12-13] is loaded into set 2 with tag 1
  (4) 8 [1000] miss: block M[8-9] replaces M[0-1] in set 0 (tag becomes 1)
  (5) 0 [0000] miss: block M[0-1] replaces M[8-9] in set 0 (tag becomes 0 again)

Why Use Middle Bits as the Index?
  • Consider a 4-line cache and 16 memory lines (block addresses 0000 through 1111)
  • High-order bit indexing: adjacent memory lines map to the same cache entry, making poor use of spatial locality
  • Middle-order bit indexing: consecutive memory lines map to different cache lines, so the cache can hold a contiguous C-byte region of the address space at one time

Set-Associative Caches
  • Characterized by more than one line per set (here, E = 2 lines per set)
  • Each set holds E valid bits, E tags, and E cache blocks

Accessing Set-Associative Caches: Set Selection
  • Identical to a direct-mapped cache: the set-index bits of the address select the set
  • The difference is that the selected set now contains E lines rather than one

Accessing Set-Associative Caches: Line Matching and Word Selection
  • Must compare the tag in each valid line in the selected set
  (1) The valid bit must be set
  (2) The tag bits in one of the cache lines must match the tag bits in the address
  (3) If (1) and (2), then it is a cache hit, and the block offset selects the starting byte
  • Example: in selected set i, lines with tags 1001 and 0110 are both valid; an address with tag 0110 matches the second line

Write Strategies
  • On a hit:
      • Write-through: write to both the cache and memory
      • Write-back: write just to the cache; write to memory only when the block is replaced (requires a dirty bit)
  • On a miss:
      • Write-allocate: allocate a cache line for the value to be written
      • Write-no-allocate: don't allocate a line
  • Some processors buffer writes: the processor proceeds to the next instruction before the write completes

Multi-Level Caches
  • Options: separate data and instruction caches, or a unified cache
  • Typical hierarchy: processor registers -> L1 d-cache and L1 i-cache -> unified L2 cache -> memory -> disk
  • Each level is larger, slower, and cheaper per byte than the one above it:

    Level          | Size     | Speed | $/MB     | Line size
    ---------------|----------|-------|----------|----------
    Registers      | 200 B    | 3 ns  |          | 8 B
    L1 caches      | 8-64 KB  | 3 ns  |          | 32 B
    L2 (SRAM)      | 1-4 MB   | 6 ns  | $100/MB  | 32 B
    Memory (DRAM)  | 128 MB   | 60 ns | $1.50/MB | 8 KB
    Disk           | 30 GB    | 8 ms  | $0.05/MB |

Intel Pentium Cache Hierarchy
  • L1 data cache: 16 KB, 4-way set-associative, write-through, 32 B lines, 1-cycle latency
  • L1 instruction cache: 16 KB, 4-way set-associative, 32 B lines
  • Registers and both L1 caches are on the processor chip
  • Unified L2 cache: 128 KB-2 MB, 4-way set-associative, write-back, write-allocate, 32 B lines
  • Main memory: up to 4 GB

Cache Performance Metrics
  • Miss rate
      • Fraction of memory references not found in the cache (misses/references)
      • Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
  • Hit time
      • Time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache)
      • Typical numbers: 1 clock cycle for L1; 3-8 clock cycles for L2
  • Miss penalty
      • Additional time required because of a miss; typically 25-100 cycles for main memory
  • Average access time = hit time + miss rate * miss penalty

Writing Cache-Friendly Code
  • Repeated references to variables are good (temporal locality)
  • Stride-1 reference patterns are good (spatial locality)
  • Examples: cold cache, 4-byte words, 4-word cache blocks

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    Miss rate = 1/4 = 25%

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

    Miss rate = 100%

The Memory Mountain
  • Read throughput (read bandwidth): number of bytes read from memory per second (MB/s)
  • Memory mountain: measured read throughput as a function of spatial and temporal locality
  • A compact way to characterize memory system performance

Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* Warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* Call test(elems, stride) */
        return (size / stride) / (cycles / Mhz); /* Convert cycles to MB/s */
    }

Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    #define MINBYTES (1