Cache Memory

Big is Slow

• The more phone numbers stored, the slower the access

7.1

555-1212

Spatial Locality - You’re likely to call a lot of people you know

Temporal Locality - If you call somebody today, you’re more likely to call them tomorrow, too

• Consider looking up a telephone number

• In your memory

• In your personal organizer

• In the personal directory

• In the Phone book

And so it is with Computers

• Main memory

• Big

• Slow

• “Far” from CPU

7.1

[Figure: CPU with Registers connected to Main Memory; a Load or I-Fetch reads from memory and a Store writes to it]

Assembly language programmers and compilers manage all transitions between registers and main memory

• Our system has two kinds of memory

• Registers

• Close to CPU

• Small number of them

• Fast

The problem...

7.1

[Figure: pipeline stages IF, RF, EX, M, WB for a LW instruction; both the Instruction Fetch (IF) and Memory Access (M) stages go to memory]

• Since every instruction has to be fetched from memory, we lose big time

• We lose double big time when executing a load or store

• DRAM Memory access takes around 5ns

• At 1 GHz, that’s 5 cycles

• At 2 GHz, that’s 10 cycles

• At 3 GHz, that’s way too many cycles…
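The cycle counts above follow from multiplying the access time by the clock rate. A minimal sketch of that arithmetic, assuming a hypothetical helper `access_cycles` (not part of the slides):

```python
def access_cycles(access_ns, clock_ghz):
    """Clock cycles that pass during one memory access."""
    # A 1 GHz clock ticks once per nanosecond, so cycles = ns * GHz.
    return access_ns * clock_ghz


# The slide's numbers: a 5 ns access at 1, 2 and 3 GHz.
for ghz in (1, 2, 3):
    print(f"{ghz} GHz: {access_cycles(5, ghz):g} cycles")
```

The same function shows how the cost grows if a longer access time is plugged in.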

Note: Access time is much faster in some memory modes, but basic access is around 50ns


A hopeful thought

7.1

• Static RAMs are much faster than DRAMs

• <1 ns possible (instead of 5ns)

• So, build memory out of SRAMs

• SRAMs cost about 20 times as much as DRAM

• Technology limitations cause the price difference

• Access time gets worse if larger SRAM systems are needed (small is fast...)

A more hopeful thought

7.1

• Remember the telephone directory?

• Do the same thing with computer memory

[Figure: CPU with Registers, then an SRAM Cache, then Main Memory (DRAM); a Load or I-Fetch and a Store go through the cache]

The big question: What goes in the cache?

• Build a hierarchy of memories between the registers and main memory

• Closer to CPU: Small and fast (frequently used)

• Closer to Main Memory: Big and slow (more rarely used)

Locality

7.1

Temporal locality: The program is very likely to access the same data again and again over time

    i = i+1;
    if (i<20) {
        z = i*i + 3*i - 2;
    }
    q = A[i];

Spatial locality: The program is very likely to access data that is close together

    p = A[i];
    q = A[i+1];
    r = A[i] * A[i+3] - A[i+2];

    name = employee.name;
    rank = employee.rank;
    salary = employee.salary;

The Cache

7.2

Cache: the 4 most recently accessed memory locations (exploits temporal locality)

Address   Data
1000      5600
1016      0
1048      2447
1028      43

Main Memory Fragment

Address   Data
1000      5600
1004      3223
1008      23
1012      1122
1016      0
1020      32324
1024      845
1028      43
1032      976
1036      77554
1040      433
1044      7785
1048      2447
1052      775
1056      433

Issues: How do we know what’s in the cache? What if the cache is full?

Goals for Cache Organization

• Complete

• Data may come from anywhere in main memory

• Fast lookup

• We have to look up data in the cache on every memory access

• Exploits temporal locality

• Stores only the most recently accessed data

• Exploits spatial locality

• Stores related data

Direct Mapping

7.2

6-bit Address = Tag (2 bits) | Index (2 bits) | Byte offset (2 bits, always zero for words)

Main Memory

Address (Tag Index Offset)   Data
00 00 00                     5600
00 01 00                     3223
00 10 00                     23
00 11 00                     1122
01 00 00                     0
01 01 00                     32324
01 10 00                     845
01 11 00                     43
10 00 00                     976
10 01 00                     77554
10 10 00                     433
10 11 00                     7785
11 00 00                     2447
11 01 00                     775
11 10 00                     433
11 11 00                     3649

Cache

Index   Valid   Tag   Data
00      Y       00    5600
01      Y       11    775
10      Y       01    845
11      N       00    33234

In a direct-mapped cache:
- Each memory address corresponds to one location in the cache
- There are many different memory locations for each cache entry (four in this case)
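To see why four memory locations share each cache entry, the 6-bit address can be split in code. This is a small sketch assuming the 2/2/2 bit split shown on the slide; the helper names `split_address` and `addresses_for_index` are made up for illustration.

```python
TAG_BITS, INDEX_BITS, OFFSET_BITS = 2, 2, 2  # 6-bit address from the slide


def split_address(addr):
    """Return (tag, index, byte_offset) for a 6-bit byte address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def addresses_for_index(index):
    """All word addresses that compete for one cache entry."""
    return [a for a in range(0, 1 << 6, 4) if split_address(a)[1] == index]


# The four word addresses that map to cache index 01.
print([f"{a:06b}" for a in addresses_for_index(0b01)])
```

Only the tag distinguishes the four candidates, which is why the cache must store it next to the data.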

Hits and Misses

7.2

• The hit rate and miss rate are the fraction of memory accesses that are hits and misses

• Typically, hit rates are around 95%

• Many times instructions and data are considered separately when calculating hit/miss rates

• When the CPU reads from memory:

• Calculate the index and tag

• Is the data in the cache? Yes: a hit, you’re done!

• Data not in cache? This is a miss.

• Read the word from memory, give it to the CPU.

• Update the cache so we won’t miss again. Write the data and tag for this memory location to the cache. (Exploits temporal locality)
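The read steps above can be sketched as a tiny simulator. This is only an illustration under stated assumptions: one-word blocks, a Python dict standing in for main memory, and a hypothetical class name `DirectMappedCache`. The data values are taken from the 6-bit example.

```python
class DirectMappedCache:
    """Hypothetical direct-mapped cache with 1-word blocks (read path only)."""

    def __init__(self, index_bits, memory):
        self.index_bits = index_bits
        self.memory = memory                        # dict: word address -> data
        self.entries = [None] * (1 << index_bits)   # None = invalid, else (tag, data)
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        word_addr = addr & ~0b11                    # drop the byte offset
        index = (word_addr >> 2) & ((1 << self.index_bits) - 1)
        tag = word_addr >> (2 + self.index_bits)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            self.hits += 1                          # hit: we're done
            return entry[1]
        self.misses += 1                            # miss: read the word from memory
        data = self.memory.get(word_addr, 0)
        self.entries[index] = (tag, data)           # write data and tag to the cache
        return data

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


memory = {0b000000: 5600, 0b000100: 3223, 0b110100: 775}
cache = DirectMappedCache(index_bits=2, memory=memory)
for addr in (0b000100, 0b000100, 0b110100, 0b000100):
    cache.read(addr)
print(cache.hits, cache.misses, cache.hit_rate())
```

The second access hits because the first one filled the entry. The last access misses again because address 110100 evicted it from the shared index.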

A 1024-entry Direct-mapped Cache

7.2

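[Figure: 1024-entry direct-mapped cache with entries 0 to 1023, each holding V, Tag and a 32-bit Data word (One Block). The Memory Address splits into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2) and a 2-bit Byte offset (bits 1-0). The Index selects an entry; Hit! is signaled when the entry is valid and its Tag matches.]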

Example - 1024-entry Direct Mapped Cache

[Figure: contents of the 1024-entry cache, with columns Index (0 to 1023), V, Tag and Data. The address splits into Tag (20 bits, bits 31-12), Index (10 bits, bits 11-2) and byte offset (bits 1-0).]

Assume the cache has been used for a while, so it’s not empty...

LW $t3, 0x0000E00C($0)

address = 0000 0000 0000 0000 1110 0000 0000 1100
tag = 14, index = 3, byte offset = 0

Hit: Data is 34238829

LB $t3, 0x00003005($0) (let’s assume the word at mem[0x00003004] = 8764)

address = 0000 0000 0000 0000 0011 0000 0000 0101
tag = 3, index = 1, byte offset = 1

Miss: load word from mem[0x00003004] and write into cache at index 1

Cache entry at index 1 afterwards: Tag = 3, Data = 8764
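The tag, index and byte offset in this example can be checked with a few shifts and masks. A minimal sketch, assuming the 20/10/2 bit split of the 1024-entry cache; `split_1024_entry` is a made-up helper name.

```python
def split_1024_entry(addr):
    """Split a 32-bit byte address for a 1024-entry cache with 1-word blocks."""
    byte_offset = addr & 0x3         # bits 1-0
    index = (addr >> 2) & 0x3FF      # bits 11-2 (10 bits)
    tag = addr >> 12                 # bits 31-12 (20 bits)
    return tag, index, byte_offset


print(split_1024_entry(0x0000E00C))  # (14, 3, 0)
print(split_1024_entry(0x00003005))  # (3, 1, 1)
```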

7.2


So, how’d we do?

7.2

Miss rates for DEC 3100 (MIPS machine)

Separate 64KB Instruction/Data Caches (16K 1-word blocks)

Benchmark   Instruction miss rate   Data miss rate   Combined miss rate
spice       1.2%                    1.3%             1.2%
gcc         6.1%                    2.1%             5.4%

Note: The combined miss rate isn’t just the average of the two; it is weighted by how often each cache is accessed.

Direct Mapping Review

7.2

[Figure: the 6-bit-address direct-mapped cache and main memory from the Direct Mapping slide, repeated]

Memory Address = Tag | Index | Byte offset (always zero for words). The split depends on cache size.

Each word has only one place it can be in the cache: Index must match exactly

Total Memory Requirements

7.2

For a direct-mapped cache with 2^n slots and 32-bit addresses

Address = Tag (32 - n - 2 bits) | Index (n bits) | Byte offset (2 bits)

One Slot:

V       Tag               Data (1 word)
1 bit   32 - n - 2 bits   32 bits

Total size of a direct-mapped cache with 2^n blocks = 2^n x (32 + (32 - n - 2) + 1) = 2^n x (63 - n) bits

Note: Small caches take more space per entry!

Warning: Normally “cache size” refers only to the data portion and ignores the tags and valid bits.
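The size formula is easy to evaluate for concrete values of n. A small sketch, assuming 32-bit addresses and one-word blocks as on the slide; the function names are made up.

```python
def direct_mapped_bits(n, addr_bits=32, word_bits=32):
    """Total storage bits for a direct-mapped cache with 2**n one-word slots."""
    tag_bits = addr_bits - n - 2            # address minus index and byte offset
    per_slot = word_bits + tag_bits + 1     # data + tag + valid bit
    return (1 << n) * per_slot


def overhead_ratio(n):
    """Total bits divided by data bits; larger means more overhead per entry."""
    return direct_mapped_bits(n) / ((1 << n) * 32)


print(direct_mapped_bits(10))               # 1024 slots x 53 bits = 54272
print(overhead_ratio(10), overhead_ratio(14))
```

The ratio shrinks as n grows because a larger index leaves fewer tag bits per slot.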

Missed me, Missed me...

7.2

• What to do on a hit:

• Carry on... (Hits should take one cycle or less)

• What to do on an instruction fetch miss:

• Undo PC increment (PC <-- PC-4)

• Do a memory read

• Stall until memory returns the data

• Update the cache (data, tag and valid) at index

• Un-stall

• What to do on a load miss:

• Same thing, except don’t mess with the PC

Missed me, Missed me...

7.2

• What to do on a store (hit or miss)

• Won’t do to just write it to the cache

• The cache would have a different (newer) value than main memory

• Simple Write-Through

• Write both the cache and memory

• Works correctly, but slowly

• Buffered Write-Through

• Write the cache

• Buffer a write request to main memory

• 1 to 10 buffer slots are typical

Types of misses

• Cold miss: During initialization, when the cache is empty

• Capacity miss: When the cache is full

• Conflict miss: The cache is not full, but the data asks for a location that’s taken

Replacement Policy

• Which data should be replaced on a capacity/conflict miss?

• Random: simple, but not very useful

• Least Recently Used (LRU): exploits temporal locality

• Least Frequently Used (LFU): better, but harder to implement
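As an illustration of the LRU policy, here is a minimal sketch built on Python's `collections.OrderedDict`. The class name `LRUSet` and the idea of a small group of `ways` blocks are assumptions for the example, not something defined on the slides.

```python
from collections import OrderedDict


class LRUSet:
    """Hypothetical group of `ways` cache blocks with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()             # tag -> data, least recent first

    def access(self, tag, data=None):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)        # now the most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)     # evict the least recently used
        self.blocks[tag] = data
        return False


s = LRUSet(ways=2)
print([s.access(t) for t in ("A", "B", "A", "C", "B")])
# [False, False, True, False, False]
```

Block B is evicted when C arrives because A was touched more recently, so the final access to B misses.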

Splitting up

• It is common to use two separate caches for Instructions and for Data

• All Instruction fetches use the I-cache

• All data accesses (loads and stores) use the D-cache

• This allows the CPU to access the I-cache at the same time it is accessing the D-cache

• Still have to share a single memory

[Figure: pipeline stages IF, RF, EX, M, WB; IF uses the I-cache while M uses the D-cache]

7.2

Note: The hit rate will probably be lower than for a combined cache of the same total size.

What about Spatial Locality?

7.2

• Spatial locality says that physically close data is likely to be accessed close together

• On a cache miss, don’t just grab the word needed, but also the words nearby

• The easiest way to do this is to increase the block size

[Figure: a Cache Entry holding V, Tag and a 4-word Data block (Word 3, Word 2, Word 1, Word 0)]

Address = Tag (18 bits, bits 31-14) | Index (10 bits, bits 13-4) | Block offset (2 bits, bits 3-2) | Byte offset (2 bits, bits 1-0)

One 4-word Block: All words in the same block have the same index and tag

Note: 2^2 = 4

32KByte/4-Word Block D.M. Cache

7.2

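[Figure: 32 KByte direct-mapped cache with entries 0 to 2047, each holding V, Tag and a 4-word Data block. The tag comparison signals Hit!, and a 4-to-1 Mux (inputs 0 1 2 3) driven by the Block offset selects the 32-bit Data word.]

Address = Tag (17 bits, bits 31-15) | Index (11 bits, bits 14-4) | Block offset (2 bits, bits 3-2) | Byte offset (2 bits, bits 1-0)

32 KB / 4 Words/Block / 4 Bytes/Word --> 2K blocks

2^11 = 2K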
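The field widths for this cache come straight from the cache size and block size. A hedged sketch, assuming power-of-two sizes and 4-byte words; `cache_fields` and its `ways` parameter are hypothetical names introduced for illustration.

```python
def cache_fields(cache_bytes, block_words, ways=1, addr_bits=32):
    """Return (tag, index, block_offset, byte_offset) widths in bits.

    Assumes all sizes are powers of two and words are 4 bytes.
    """
    byte_offset = 2                                  # 4 bytes per word
    block_offset = (block_words - 1).bit_length()    # log2(words per block)
    blocks = cache_bytes // (block_words * 4)
    sets = blocks // ways
    index = (sets - 1).bit_length()                  # log2(number of sets)
    tag = addr_bits - index - block_offset - byte_offset
    return tag, index, block_offset, byte_offset


print(cache_fields(32 * 1024, 4))   # (17, 11, 2, 2): the 32 KByte cache above
print(cache_fields(4 * 1024, 1))    # (20, 10, 0, 2): the 1024-entry cache
```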

How Much Change?

7.2


Separate 64KB Instruction/Data Caches (16K 1-word blocks or 4K 4-word blocks)

Benchmark   Block size (words)   Instruction miss rate   Data miss rate   Combined miss rate
spice       1                    1.2%                    1.3%             1.2%
gcc         1                    6.1%                    2.1%             5.4%
spice       4                    0.3%                    0.6%             0.4%
gcc         4                    2.0%                    1.7%             1.9%

The issue of Writes

7.2

Perform a write to a location with index 1000, tag 2420, word 1 (value 4334)

On a read miss, we read the entire block from memory into the cache

On a write hit, we write one word into the block. The other words in theblock are unchanged.

On a write miss, we write one word into the block and update the tag.

Block with index 1000:

          V   tag    word 3   word 2   word 1   word 0
Before:   1   3000   23       322      355      2
After:    1   2420   23       322      4334     2

The other words are still the old data (for tag 3000). Bad news!

Solution 1: Don’t update the cache on a write miss. Write only to memory.

Solution 2: On a write miss, first read the referenced block in (including the old value of the word being written), then write the new word into the cache and write-through to memory.
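The difference between the naive write miss and Solution 2 can be shown with a toy model. Everything here is an assumption made for illustration: the dict-based block, the memory keyed by (tag, index), and the data values other than tag 3000, tag 2420, index 1000 and the written value 4334.

```python
def store_word(block, memory, tag, index, word, value, fetch_on_miss):
    """Write-through store of one word into a 4-word cache block.

    block  : dict with keys "V", "tag", "words" (list, word 0 first)
    memory : dict mapping (tag, index) -> list of 4 words (toy model)
    """
    hit = block["V"] and block["tag"] == tag
    if not hit:
        if fetch_on_miss:
            # Solution 2: read the whole referenced block in first.
            block["words"] = list(memory[(tag, index)])
        block["V"], block["tag"] = True, tag
    block["words"][word] = value            # write the new word into the cache
    memory[(tag, index)][word] = value      # write-through to memory


def demo(fetch_on_miss):
    memory = {(3000, 1000): [2, 355, 322, 23], (2420, 1000): [10, 20, 30, 40]}
    block = {"V": True, "tag": 3000, "words": [2, 355, 322, 23]}
    store_word(block, memory, tag=2420, index=1000, word=1, value=4334,
               fetch_on_miss=fetch_on_miss)
    return block["words"]


print(demo(False))  # [2, 4334, 322, 23]: words 0, 2, 3 still belong to tag 3000
print(demo(True))   # [10, 4334, 30, 40]: the whole block now belongs to tag 2420
```

Solution 1 would skip the cache update entirely on a miss and only perform the memory write.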

Choosing a block size

7.2

• Large block sizes help with spatial locality, but...

• It takes time to read the memory in

• Larger block sizes increase the time for misses

• It reduces the number of blocks in the cache

• Number of blocks = cache size/block size

• Need to find a middle ground

• 16-64 bytes works nicely

Other Cache organizations

7.3

Direct Mapped

[Figure: 16 cache entries with Index 0 to 15, each holding V, Tag and Data]

Address = Tag | Index | Block offset

Each address has only one possible location

Fully Associative

[Figure: cache entries holding V, Tag and Data]

No Index

Address = Tag | Block offset

Fully Associative vs. Direct Mapped

7.3

• Fully associative caches provide much greater flexibility

• Nothing gets “thrown out” of the cache until it is completely full

• Direct-mapped caches are more rigid

• Any cached data goes directly where the index says to, even if the rest of the cache is empty

• A problem, though...

• Fully associative caches require a complete search through all the tags to see if there’s a hit

• Direct-mapped caches only need to look one place

A Compromise

7.3

2-Way set associative

[Figure: indexes 0 to 7, each with two entries holding V, Tag and Data]

Each address has two possible locations with the same index

One fewer index bit: 1/2 the indexes

4-Way set associative

[Figure: indexes 0 to 3, each with four entries holding V, Tag and Data]

Each address has four possible locations with the same index

Two fewer index bits: 1/4 the indexes

Example: ARM processor cache

• 4 Kbyte Direct mapped cache

• Source: ARM system developer’s guide

Example: ARM processor cache

• 4 Kbyte 4-way set associative cache

• Source: ARM system developer’s guide

Set Associative Example

7.3

Address = Tag (3-5 bits) | Index (1-3 bits) | Block offset (2 bits) | Byte offset (2 bits)

Access sequence (10-bit addresses): 0100111000, 1100110100, 0100111100, 0110110000, 1100111000

Organization                                        Access 1   Access 2   Access 3   Access 4   Access 5
Direct-Mapped (indexes 000 to 111, 3-bit tag)       Miss       Miss       Miss       Miss       Miss
2-Way Set Assoc. (indexes 00 to 11, 4-bit tag)      Miss       Miss       Hit        Miss       Miss
4-Way Set Assoc. (indexes 0 to 1, 5-bit tag)        Miss       Miss       Hit        Miss       Hit

[Figure: the three caches (V, Tag, Data per entry), all initially invalid, shown with the tags left in them after the five accesses]
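The hit/miss pattern for the three organizations can be reproduced with a short simulation. This is a sketch under stated assumptions: 8 blocks in total, 4 offset bits (block plus byte offset), and LRU replacement within a set, which is consistent with the results shown but is not stated on the slide. The function name `simulate` is made up.

```python
from collections import OrderedDict


def simulate(addresses, total_blocks=8, ways=1, offset_bits=4):
    """Return 'Hit' or 'Miss' per access, using LRU replacement within each set."""
    num_sets = total_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for addr in addresses:
        block_number = addr >> offset_bits      # drop block and byte offsets
        index = block_number % num_sets
        tag = block_number // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)              # mark as most recently used
            results.append("Hit")
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)       # evict the least recently used
            lines[tag] = True
            results.append("Miss")
    return results


trace = [0b0100111000, 0b1100110100, 0b0100111100, 0b0110110000, 0b1100111000]
for ways in (1, 2, 4):
    print(ways, simulate(trace, ways=ways))
```

All five addresses fall into the same set in every organization, so only the added ways let earlier blocks survive long enough to be hit again.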

New Performance Numbers

7.3


Separate 64KB Instruction/Data Caches (4K 4-word blocks)

Benchmark   Associativity   Instruction miss rate   Data miss rate   Combined miss rate
spice       Direct          0.3%                    0.6%             0.4%
gcc         Direct          2.0%                    1.7%             1.9%
spice       2-way           0.3%                    0.6%             0.4%
gcc         2-way           1.6%                    1.4%             1.5%
spice       4-way           0.3%                    0.6%             0.4%
gcc         4-way           1.6%                    1.4%             1.5%

Example

• Assuming a memory access time of 40 ns and a cache access time of 4 ns, estimate the average access time with a

• Hit rate of 90%

• Hit rate of 95%

• Hit rate of 99%

Example 2

• Assuming a memory access time of 40 ns and a cache access time of 4 ns, calculate the hit rate required for an average access time of 5 ns
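One way to set up the two examples above is a weighted average in which a hit costs the cache access time and a miss costs the memory access time. That model is an assumption: the slides do not state the formula, and another common convention adds the cache time to every miss. The function names are hypothetical.

```python
def average_access_time(hit_rate, cache_ns=4.0, memory_ns=40.0):
    """Weighted average: hits cost cache_ns, misses cost memory_ns (assumed model)."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


def required_hit_rate(target_ns, cache_ns=4.0, memory_ns=40.0):
    """Solve the weighted-average formula above for the hit rate."""
    return (memory_ns - target_ns) / (memory_ns - cache_ns)


for h in (0.90, 0.95, 0.99):
    print(f"hit rate {h:.0%}: {average_access_time(h):.2f} ns")
print(f"hit rate needed for 5 ns: {required_hit_rate(5.0):.4f}")
```

Under this model the average access time drops quickly as the hit rate approaches 100%, which is why a few percentage points of hit rate matter so much.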

Example 3

(A) Calculate the cache size, block size and total RAM size for a cache in a 32-bit address bus system assuming:

• A 2-bit block offset and a 6-bit index and direct-mapped cache

• A 3-bit block offset and a 20-bit tag and a 4-way set associative cache

(B) Assuming a 32-bit memory address bus with a main memory access time of 40 ns, and assuming a direct-mapped cache with an access time of 4 ns, a cache size of 1 KB and a block size of 4 words:

i) Indicate whether each access in the following address access sequence is a hit or a miss, and what kind

ii) Calculate the access times. The replacement policy is Least Recently Used (LRU).

• Lw $4, 0x732A2120($0)

• Lb $5, 0x732A2130($0)

• Lb $6, 0xA32A232E($0)

• Lb $6, 0x732A2320($0)

• Lb $6, 0x923B2120($0)

• Lb $6, 0x923B2121($0)

• Lb $6, 0x532C2122($0)

• Lb $6, 0xA32A2123($0)

• Lw $4, 0x732A2120($0)

• Lb $5, 0x532C2130($0)

iii) Calculate the cache hit rate after the above accesses

(C) Repeat for a 2-way associative cache

(D) Again for a 4-way associative cache

Example 4

Calculate the tag and index fields for a cache in a 32-bit address bus system assuming:

• 1 MByte cache size, block size 16, direct mapped cache

• 8 KByte cache size, block size 4, 4-way set associative cache

Calculate the total RAM required for the implementation of the two caches

Example 5

• Assuming a main memory access time of 40 ns, an L1 cache access time of 4 ns and an L2 cache access time of 10 ns:

• Calculate the average access time assuming L1 hit rate 95% and combined L1-L2 hit rate 98%

• Calculate the L1-L2 combined hit rate required for an average access time of 4.5 ns assuming L1 hit rate 95%

Example

• For the 4KByte, direct-mapped cache with block size 4 shown, show the cache contents after the following accesses

• Lw $4, 0x52134120($0)

• Lb $5, 0x732A2130($0)

• Lb $6, 0xA32A232E($0)

• Lb $6, 0x245A2310($0)

• Lb $6, 0x923B2120($0)

Index V Tag Word 3 Word 2 Word 1 Word 0

0

1

2

3

Address Data

0x245A2310 0x00001234

0x245A2314 0x00012345

0x245A2318 0x00023456

0x245A231C 0x00034567

…

0x52134120 0xFFFFABCD

0x52134124 0xFFFFBCDE

0x52134128 0xFFFFCDEF

0x5213412C 0xFFFFDEF0

…

0x732A2130 0x45678901

0x732A2134 0x56789012

0x732A2138 0x67890123

0x732A213C 0x78901234

…

0x923B2120 0x0000ABCD

0x923B2124 0x0000BCDE

0x923B2128 0x0000CDEF

0x923B212C 0x0000DEF0

…

0xA32A2320 0x2222ABCD

0xA32A2324 0x2222BCDE

0xA32A2328 0x2222CDEF

0xA32A232C 0x2222DEF0

…
