Memory Hierarchy Design
Topic 8
Introduction
The five classic components of a computer: Control, Datapath, Memory, Input, and Output (the Control and Datapath together form the Processor).
Where do we fetch instructions to execute?
- Build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory).
- Instructions are first fetched from external storage, such as the hard disk, and are kept in main memory. Before they go to the CPU, they are typically copied into the caches.
Programmers would like an indefinitely large memory in which any particular word is available in FAST memory.
This motivates constructing a hierarchy of memories as the solution.
Technology Trends
DRAM:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
2000   256 Mb   100 ns

         Capacity           Speed (latency)
CPU:     2x in 1.5 years    2x in 1.5 years
DRAM:    4x in 3 years      2x in 10 years
Disk:    4x in 3 years      2x in 10 years

From 1980 to 2000, DRAM capacity improved about 4000:1 while its cycle time improved only about 2.5:1.
Performance Gap between CPUs and Memory
[Figure: processor vs. memory performance over time (improvement ratio) — CPU improving 1.35x/year in the early years and 1.55x/year later, memory only 7%/year]
The gap (latency) grows about 50% per year! From 2005 onwards there has been little change in processor performance per core.
Levels of the Memory Hierarchy (Typical Server)
Level           Capacity                  Access Time
CPU registers   ~1000 bytes               300 ps (0.30 ns)
Cache           64 KB / 256 KB / 2-4 MB   1 ns / 3-10 ns / 10-20 ns
Main memory     4-16 GB                   50-100 ns
Disk            4-16 TB                   5-10 ms
Upper levels are faster; lower levels are larger.
[Figure: hierarchy pyramid trading speed for capacity — Registers, Cache, Memory, Disk Storage; data moves between cache and memory in blocks and between memory and disk in pages (both with the inclusion property), and is organized on disk as files]
Levels of the Memory Hierarchy (Personal Mobile Device)
Level           Capacity           Access Time
CPU registers   ~500 bytes         500 ps (0.50 ns)
Cache           64 KB / 256 KB     2 ns / 10-20 ns
Main memory     256-512 MB         50-100 ns
Flash (EEPROM)  4-8 GB             25-50 us
Upper levels are faster; lower levels are larger.
[Figure: the same hierarchy pyramid — Registers, Cache, Memory, Storage; blocks between cache and memory and pages between memory and storage, both with the inclusion property]
ABCs of Caches
Cache:
- mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
- the term is applied whenever buffering is employed to reuse commonly occurring items, e.g. file caches, name caches, and so on
Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Guideline: for a given implementation technology and power budget, smaller hardware can be made faster.
Memory Hierarchy: Terminology
Traditionally, designers of memory hierarchies focused on optimizing average memory access time, which is determined by cache access time, miss rate, and miss penalty:
- Hit: the data appears in some block in the cache (example: Block X)
  - Hit rate: the fraction of cache accesses found in the cache
  - Hit time: time to access the upper level, consisting of RAM access time + time to determine hit/miss
- Miss: the data must be retrieved from a block in main memory (Block Y) (3 Cs model of misses: Compulsory, Capacity, Conflict)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: time to replace a block in the cache + time to deliver the block to the processor
Cache Measures
CPU execution time combined with cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
Memory stall clock cycles = Number of misses × Miss penalty
  = IC × (Misses/Instruction) × Miss penalty
  = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
  = IC × Reads per instruction × Read miss rate × Read miss penalty
    + IC × Writes per instruction × Write miss rate × Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.
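These formulas are easy to evaluate numerically. Below is a minimal C sketch (ours, not from the slides; the names are illustrative) that computes CPU time from the quantities just defined — running it with the numbers from the example on the next slide reproduces its 1.75 result:

    #include <stdio.h>

    /* CPU time = (IC*CPI + memory stall cycles) * clock cycle time, with
       stalls = IC * (accesses/instruction) * miss rate * miss penalty */
    double cpu_time(double ic, double cpi, double acc_per_instr,
                    double miss_rate, double miss_penalty, double cycle_time)
    {
        double stalls = ic * acc_per_instr * miss_rate * miss_penalty;
        return (ic * cpi + stalls) * cycle_time;
    }

    int main(void)
    {
        /* CPI = 1.0, 1.5 accesses/instruction, 2% misses, 25-cycle penalty */
        printf("%.2f\n", cpu_time(1.0, 1.0, 1.5, 0.02, 25.0, 1.0)); /* 1.75 */
        return 0;
    }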
Example
Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?
Answer:
(A) If accesses always hit in the cache, CPI = 1.0 and there are no memory stalls, so
    CPU(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
    Memory stalls = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
                  = IC × (1 + 50%) × 2% × 25 = IC × 0.75
    CPU(B) = (IC + IC × 0.75) × Clock cycle time = 1.75 × IC × Clock cycle time
The performance ratio is the inverse ratio of the CPU execution times:
    CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
Four Memory Hierarchy Questions
Q1 (block placement): Where can a block/line be placed in the upper level (cache)?
- (most popular: set-associative, direct-mapped, fully-associative)
Q2 (block identification): How is a block found if it is in the upper level (cache)?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy): What happens on a write? (Caching read-only data is very simple.)
- (Write-through or write-back? Write buffer?)
Q1 (block placement): Where can a block be placed?
- Direct mapped: (Block number) mod (Number of blocks in cache)
- Set associative: (Block number) mod (Number of sets in cache)
  - # of sets <= # of blocks; n-way: n blocks per set; 1-way = direct mapped
- Fully associative: # of sets = 1
Example: block 12 placed in an 8-block cache (see the sketch below).
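The placement rule in C (our illustrative sketch, not from the slides):

    /* (Block number) mod (Number of sets). For an 8-block cache:
       direct mapped = 8 sets of 1 block, 2-way = 4 sets of 2 blocks,
       fully associative = 1 set of 8 blocks. */
    unsigned set_index(unsigned block_number, unsigned num_sets)
    {
        return block_number % num_sets;
    }
    /* block 12: set 4 if direct mapped (12 mod 8), set 0 if 2-way (12 mod 4),
       set 0 (the only set) if fully associative; any way within the set */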
Simplest Cache: Direct Mapped (1-way)
[Figure: a 16-block memory (block numbers 0-F) mapped onto a 4-block direct-mapped cache (block indices 0-3)]
Each block has only one place it can appear in the cache. The mapping is usually:
(Block address) MOD (Number of blocks in cache)
Example: 1 KB Direct Mapped Cache, 32 B Blocks
For a 2^N-byte cache with 2^M-byte blocks:
- The uppermost (32 - N) bits are always the Cache Tag (stored as part of the cache state, together with a Valid bit)
- The lowest M bits are the Byte Select (Block Size = 2^M)
- The (N - M) bits in between are the Cache Index
[Figure: a 32-bit address split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00); the cache data array holds 32 blocks of 32 bytes (Byte 0 ... Byte 1023), each entry with a Valid bit and its tag]
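For this configuration (N = 10, M = 5), extracting the three fields takes two shifts and two masks. A minimal C sketch (ours, not from the slides) that reproduces the slide's example values (tag 0x50, index 0x01, byte 0x00):

    #include <stdint.h>
    #include <stdio.h>

    #define BYTE_SELECT_BITS 5   /* 32-byte blocks: M = 5 */
    #define INDEX_BITS       5   /* 1 KB / 32 B = 32 blocks: N - M = 5 */

    int main(void)
    {
        uint32_t addr = 0x00014020u;   /* an arbitrary example address */
        uint32_t byte_sel = addr & ((1u << BYTE_SELECT_BITS) - 1);
        uint32_t index = (addr >> BYTE_SELECT_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag = addr >> (BYTE_SELECT_BITS + INDEX_BITS); /* upper 22 bits */
        printf("tag=0x%x index=0x%x byte=0x%x\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_sel);
        return 0;   /* prints tag=0x50 index=0x1 byte=0x0 */
    }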
Q2 (block identification): How is a block found?
The Block Offset selects the desired data from the block, the Index field selects the set, and the Tag field is compared against the CPU address for a hit:
- Use the Cache Index to select the cache set
- Check the Tag on each block in that set (no need to check the index or block offset)
- A Valid bit is added to the tag to indicate whether or not this entry contains a valid address
- Select the desired bytes using the Block Offset
Increasing associativity shrinks the index and expands the tag.
Three portions of an address in a set-associative or direct-mapped cache:
| Tag | Cache/Set Index | Block Offset (Block Size) |
(the Tag and Index together form the Block Address)
Example: Two-Way Set-Associative Cache
- The Cache Index selects a set from the cache
- The two tags in the set are compared in parallel
- The data is selected based on the tag comparison result
[Figure: two banks of (Valid, Cache Tag, Cache Data) entries indexed by the Cache Index; the address tag (example 0x50) is compared against both stored tags in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that picks the matching Cache Block]
Disadvantage of Set-Associative Caches
N-way set-associative cache vs. direct-mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- The data comes AFTER the hit/miss decision
In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision:
- It is possible to assume a hit and continue, and recover later if it was a miss.
[Figure: the same two-way set-associative datapath as on the previous slide]
Q3 (block replacement): Which block should be replaced on a cache miss?
Easy for direct mapped — hardware decisions are simplified:
- Only one block frame is checked, and only that block can be replaced.
Set associative or fully associative:
- There are many blocks to choose from on a miss.
Three primary strategies for selecting the block to be replaced:
- Random: randomly selected (to spread allocation uniformly)
- LRU: the Least Recently Used block is removed (relies on locality of reference)
- FIFO: first in, first out
Data cache misses per 1000 instructions for various replacement strategies:

         2-way                 4-way                 8-way
Size     LRU   Random FIFO     LRU   Random FIFO     LRU   Random FIFO
16 KB    114.1 117.3  115.5    111.7 115.1  113.3    109.0 111.8  110.4
64 KB    103.4 104.3  103.9    102.4 102.3  103.1    99.7  100.5  100.3
256 KB   92.2  92.1   92.5     92.1  92.1   92.5     92.1  92.1   92.5

There is little difference between LRU and random for the largest cache size; LRU outperforms the others for smaller caches, and FIFO generally outperforms random at the smaller cache sizes.
Q4 (write strategy): What happens on a write?
Reads dominate processor cache accesses: e.g., writes are 7% of overall memory traffic, while 21% of data cache accesses are writes. Make the common case fast (Amdahl's Law).
Two options when writing to the cache:
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since identical information is already below.
Pros and cons (a sketch of both policies follows this list):
- WT: simple to implement. The cache is always clean, so read misses cannot result in writes.
- WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
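A minimal C sketch of the two policies' write-hit and eviction paths (ours, not from the slides; the types and names are illustrative):

    #include <stdbool.h>
    #include <string.h>

    typedef struct {
        bool     valid, dirty;
        unsigned tag;
        unsigned char data[32];
    } CacheLine;

    /* On a write hit: write through updates memory immediately; write back
       only marks the line dirty and defers the memory write to eviction. */
    void write_hit(CacheLine *line, unsigned offset, unsigned char byte,
                   bool write_through, unsigned char *memory_block)
    {
        line->data[offset] = byte;
        if (write_through)
            memory_block[offset] = byte;   /* lower level updated now */
        else
            line->dirty = true;            /* updated later, on replacement */
    }

    /* On eviction under write back: copy the block out only if dirty. */
    void evict(CacheLine *line, unsigned char *memory_block)
    {
        if (line->valid && line->dirty)
            memcpy(memory_block, line->data, sizeof line->data);
        line->valid = false;
        line->dirty = false;
    }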
Write Stall and Write Buffer
When the CPU must wait for writes to complete during write through, the CPU is said to write stall.
A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.
A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
- The write buffer is just a FIFO; typical number of entries: 4
[Figure: Processor -> Cache and Write Buffer; Write Buffer -> DRAM]
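A minimal C sketch of such a 4-entry FIFO write buffer (ours; illustrative only — the real buffer is a hardware structure):

    #include <stdbool.h>

    #define WB_ENTRIES 4   /* typical number of entries, per the slide */

    typedef struct { unsigned addr; unsigned data; } WBEntry;
    typedef struct { WBEntry e[WB_ENTRIES]; int head, count; } WriteBuffer;

    /* Processor side: returns false (write stall) only if the FIFO is full. */
    bool wb_push(WriteBuffer *wb, unsigned addr, unsigned data)
    {
        if (wb->count == WB_ENTRIES) return false;   /* CPU must stall */
        wb->e[(wb->head + wb->count) % WB_ENTRIES] = (WBEntry){addr, data};
        wb->count++;
        return true;                                 /* CPU continues */
    }

    /* Memory-controller side: drains the oldest entry to DRAM. */
    bool wb_pop(WriteBuffer *wb, WBEntry *out)
    {
        if (wb->count == 0) return false;
        *out = wb->e[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }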
Write-Miss Policy: Write Allocate vs. No-Write Allocate
Two options on a write miss:
- Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
  - Write misses act like read misses.
- No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
  - Blocks stay out of the cache under no-write allocate until the program tries to read them, whereas with write allocate even blocks that are only written will be in the cache.
Write Through with Write Allocate:
- on hits, it writes to the cache and to main memory
- on misses, it updates the block in main memory and brings the block into the cache
- Bringing the block into the cache on a miss does not make much sense in this combination, because the next hit to this block will generate a write to main memory anyway (per the write-through policy).
Write Through with No-Write Allocate:
- on hits, it writes to the cache and to main memory
- on misses, it updates the block in main memory without bringing the block into the cache
- Subsequent writes to the block will update main memory anyway, because write through is employed; so some time is saved by not bringing the block into the cache on a miss, since doing so appears useless anyway.
Write Back with Write Allocate:
- on hits, it writes to the cache and sets the dirty bit for the block; main memory is not updated
- on misses, it updates the block in main memory and brings the block into the cache
- Subsequent writes to the same block, if the block originally caused a miss, will hit in the cache and set its dirty bit. That eliminates extra memory accesses and results in very efficient execution compared with the Write Through with Write Allocate combination.
Write Back with No-Write Allocate:
- on hits, it writes to the cache and sets the dirty bit for the block; main memory is not updated
- on misses, it updates the block in main memory without bringing the block into the cache
- Subsequent writes to the same block, if the block originally caused a miss, will generate misses all the way, resulting in very inefficient execution.
Write-Miss Policy Example
Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
    Write Mem[100]
    Write Mem[100]
    Read  Mem[200]
    Write Mem[200]
    Write Mem[100]
What are the numbers of hits and misses (reads and writes inclusive) when using no-write allocate versus write allocate?
Answer:
No-write allocate:                    Write allocate:
    Write Mem[100]  1 write miss          Write Mem[100]  1 write miss
    Write Mem[100]  1 write miss          Write Mem[100]  1 write hit
    Read  Mem[200]  1 read miss           Read  Mem[200]  1 read miss
    Write Mem[200]  1 write hit           Write Mem[200]  1 write hit
    Write Mem[100]  1 write miss          Write Mem[100]  1 write hit
    4 misses; 1 hit                       2 misses; 3 hits
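These counts are easy to verify with a few lines of C. The following minimal simulation (ours, not from the slides) prints 4 misses/1 hit and 2 misses/3 hits:

    #include <stdio.h>
    #include <stdbool.h>

    /* Tiny fully associative "cache": just remember which addresses are
       resident; enough entries that nothing is ever evicted. */
    static unsigned resident[8];
    static int n;

    static bool in_cache(unsigned a) {
        for (int i = 0; i < n; i++) if (resident[i] == a) return true;
        return false;
    }
    static void allocate(unsigned a) { if (!in_cache(a)) resident[n++] = a; }

    static void run(bool write_allocate) {
        struct { char op; unsigned addr; } seq[] = {
            {'W',100},{'W',100},{'R',200},{'W',200},{'W',100} };
        int hits = 0, misses = 0;
        n = 0;
        for (int i = 0; i < 5; i++) {
            bool hit = in_cache(seq[i].addr);
            hit ? hits++ : misses++;
            /* reads always allocate; writes only under write allocate */
            if (!hit && (seq[i].op == 'R' || write_allocate))
                allocate(seq[i].addr);
        }
        printf("%-18s %d misses, %d hits\n",
               write_allocate ? "write allocate:" : "no-write allocate:",
               misses, hits);
    }

    int main(void) { run(false); run(true); return 0; }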
Example: Consider a computer with the following features:
- 90% of all memory accesses are found in the cache (hit ratio = 0.9)
- The block size is 2 words, and the whole block is read on any miss
- The CPU sends references to the cache at the rate of 10^7 words per second
- 25% of those references are writes (writes = 25%, reads = 75%)
- The bus can support 10^7 words per second, reads or writes (total bus bandwidth = 10^7)
- The bus reads or writes a single word at a time
- Assume that at any one time, 30% of the block frames in the cache have been modified
- The cache uses write allocate on a write miss
Calculate the percentage of the bus bandwidth used, on average, in the two cases: Write Back and Write Through.
Write-Miss Policy: Write Allocate
Write Through (total bus B/W used = B/W for read hits + read misses + write hits + write misses):
    Read Hit:   0
    Read Miss:  10^7 × 0.1 × 0.75 × 2 = 0.15 × 10^7
    Write Hit:  10^7 × 0.9 × 0.25 × 1 = 0.225 × 10^7
    Write Miss: 10^7 × 0.1 × 0.25 × (2 + 1) = 0.075 × 10^7
    Total bus B/W used = (0.15 + 0.225 + 0.075) × 10^7 = 0.45 × 10^7 (45%)
Write Back:
    Read Hit:   0
    Read Miss:  10^7 × 0.1 × 0.75 × (2 + 0.3 × 2) = 0.195 × 10^7
    Write Hit:  0
    Write Miss: 10^7 × 0.1 × 0.25 × (2 + 0.3 × 2) = 0.075 × 10^7
    Total bus B/W used = (0.195 + 0.075) × 10^7 = 0.27 × 10^7 (27%)
Cache Performance
Example: split cache vs. unified cache — which has the better average memory access time? A 16 KB instruction cache with a 16 KB data cache (split cache), or a 32 KB unified cache?
Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%        -
32 KB   -                   -            3.18%
Assume:
- A hit takes 1 clock cycle and the miss penalty is 100 cycles
- A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
- 36% of the instructions are data transfer instructions; about 74% of the memory accesses are instruction references
Answer:
Average memory access time (split)
= % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)
= 74% × (1 + 0.4% × 100) + 26% × (1 + 11.4% × 100) = 4.24
Average memory access time (unified)
= 74% × (1 + 3.18% × 100) + 26% × (1 + 1 + 3.18% × 100) = 4.44
Impact of Memory Access on CPU Performance
Example: Suppose a processor with
- ideal CPI = 1.0 (ignoring memory stalls)
- average miss rate of 2%
- average memory references per instruction of 1.5
- miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?
Answer:
CPI = CPU execution cycles per instruction + Memory stall cycles per instruction
    = CPI_execution + Miss rate × Memory accesses per instruction × Miss penalty
CPI with cache = 1.0 + 2% × 1.5 × 100 = 4.0
CPI without cache = 1.0 + 1.5 × 100 = 151
CPU time with cache = IC × CPI × Clock cycle time = IC × 4.0 × Clock cycle time
CPU time without cache = IC × 151 × Clock cycle time
Without a cache, the CPI of the processor increases from 1 to 151!
With the cache, the processor is still stalled waiting for memory 75% of the time (CPI goes from 1 to 4).
Impact of Cache Organization on CPU Performance
Example: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
- Ideal CPI = 2.0 (ignoring memory stalls)
- Clock cycle time is 1.0 ns
- Average memory references per instruction: 1.5
- Cache size: 64 KB, block size: 64 bytes
- For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
- Cache miss penalty is 75 ns; hit time is 1 clock cycle
- Miss rates: direct mapped 1.4%; 2-way set associative 1.0%
Answer:
Avg. memory access time (1-way) = 1.0 + (0.014 × 75) = 2.05 ns
Avg. memory access time (2-way) = 1.0 × 1.25 + (0.01 × 75) = 2.00 ns
CPU time (1-way) = IC × (CPI_execution × Clock cycle time + Memory accesses per instruction × Miss rate × Miss penalty)
                 = IC × (2.0 × 1.0 + 1.5 × 0.014 × 75) = 3.58 × IC
CPU time (2-way) = IC × (2.0 × 1.0 × 1.25 + 1.5 × 0.01 × 75) = 3.63 × IC
Summary of Performance Equations
[Figure: table collecting the performance equations introduced in this topic]
Improving Cache Performance
CPU Execution Time = IC × (CPI_execution + (Memory accesses/Instruction) × Miss rate × Miss penalty) × Clock cycle time
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Next we look at ways to improve cache and memory access times.
Reducing Cache Miss Penalty
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
The time to handle a miss is becoming more and more the controlling factor, because of the great improvement in the speed of processors compared to the speed of memory.
Five optimizations:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffer
5. Victim caches

O1: Multilevel Caches
Approaches:
- Make the cache faster, to keep pace with the speed of CPUs
- Make the cache larger, to overcome the widening gap
- L1: fast hits; L2: fewer misses
L2 equations:
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2)
so
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × (Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2))
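A minimal C helper for the expanded two-level equation (ours; the example numbers in the comment are illustrative, not from the slides):

    /* AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * Penalty_L2) */
    double amat_two_level(double hit_l1, double mr_l1,
                          double hit_l2, double mr_l2, double penalty_l2)
    {
        double miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2;  /* L2 equation */
        return hit_l1 + mr_l1 * miss_penalty_l1;
    }
    /* e.g. amat_two_level(1, 0.04, 10, 0.5, 100) = 1 + 0.04*(10 + 50)
       = 3.4 cycles */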
Design of L2 Cache
Size:
- Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1.
Whether data in L1 is also in L2:
- novice approach: design L1 and L2 independently
- multilevel inclusion: L1 data are always present in L2
  - Advantage: easy consistency between I/O and the caches (check L2 only)
  - Drawback: L2 must invalidate all L1 blocks that map onto a replaced second-level block => slightly higher first-level miss rate
  - e.g. Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
- multilevel exclusion: L1 data is never found in L2
  - A cache miss in L1 results in a swap of blocks between L1 and L2
  - Advantage: prevents wasting space in L2
  - e.g. AMD Athlon: 64 KB L1 and 256 KB L2
O2: Critical Word First and Early Restart
Don't wait for the full block to be loaded before restarting the CPU.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Given spatial locality, the CPU tends to want the next sequential word anyway, so it is not clear how much early restart benefits.
Generally useful only with large blocks.
O3: Giving Priority to Read Misses over Writes
Serve reads before writes have been completed — with write through and a write buffer, a read miss must not return stale data still sitting in the buffer. The transcript's example is cut off here; the classic sequence (512 and 1024 map to the same direct-mapped cache block) is:
SW R3, 512(R0)    ; M[512] <- R3   (cache index 0)
LW R1, 1024(R0)   ; R1 <- M[1024]  (cache index 0)
LW R2, 512(R0)    ; R2 <- M[512]   (cache index 0)
If the write buffer has not drained when the read miss for M[512] occurs, R2 could get the old value. Either make the read miss wait until the buffer is empty, or check the buffer's contents and let the read miss proceed if there is no conflict.
O4: Merging Write Buffer
If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective.
Usually a write buffer entry holds multiple words.
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.
[Figure: a write buffer with 4 entries, each holding four 64-bit words — without merging, four sequential writes occupy all four entries; with merging, they combine into a single entry]
Writing multiple words at the same time is faster than writing them one at a time.
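A minimal C sketch of the merging check (ours; it assumes 64-bit words and entry-aligned 32-byte blocks, which is our illustration rather than anything the slides specify):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ENTRY 4          /* four 64-bit words per entry */

    typedef struct {
        bool     valid;
        uint64_t block_addr;           /* address of the aligned 32-byte block */
        uint64_t word[WORDS_PER_ENTRY];
        bool     word_valid[WORDS_PER_ENTRY];
    } WBufEntry;

    /* Try to merge a new 64-bit write into an existing valid entry whose
       block address matches; returns true if merged. */
    bool try_merge(WBufEntry *buf, int entries, uint64_t addr, uint64_t data)
    {
        uint64_t block = addr / (8 * WORDS_PER_ENTRY);
        int      slot  = (int)((addr / 8) % WORDS_PER_ENTRY);
        for (int i = 0; i < entries; i++) {
            if (buf[i].valid && buf[i].block_addr == block) {
                buf[i].word[slot] = data;      /* combine with existing entry */
                buf[i].word_valid[slot] = true;
                return true;                   /* no new entry consumed */
            }
        }
        return false;                          /* caller allocates a new entry */
    }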
O5: Victim Caches
The idea is recycling: remember what was most recently discarded on a cache miss in case it is needed again, rather than simply discarding it or swapping it into L2.
Victim cache: a small, fully associative cache between a cache and its refill path:
- contains only blocks that were discarded from the cache because of a miss — the "victims"
- checked on a miss before going to the next lower-level memory
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches.
e.g. AMD Athlon: 8 entries
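A minimal C sketch of the victim-cache probe (ours; a real victim cache is hardware and would also swap the data block back into the main cache on a hit):

    #include <stdbool.h>

    #define VICTIMS 4   /* 1-5 entries are already effective, per the slide */

    typedef struct { bool valid; unsigned tag; /* + data block */ } Victim;
    static Victim vc[VICTIMS];

    /* On a main-cache miss, probe the victim cache before going to the
       next lower level. Fully associative: compare every entry. */
    bool victim_probe(unsigned block_tag)
    {
        for (int i = 0; i < VICTIMS; i++)
            if (vc[i].valid && vc[i].tag == block_tag)
                return true;    /* recycle the recently discarded block */
        return false;           /* genuine miss: fetch from the lower level */
    }

    /* On eviction from the main cache, the discarded victim goes here
       (simple rotation stands in for a real replacement policy). */
    void victim_insert(unsigned block_tag)
    {
        static int next;
        vc[next] = (Victim){ true, block_tag };
        next = (next + 1) % VICTIMS;
    }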
Reducing Miss Rate
3 Cs of Cache Misses
- Compulsory: the first access to a block cannot find it in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur, due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur, because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would hit in a fully associative cache of size X.)
3 Cs of Cache Misses
2:1 Cache Rule: miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2.
Compulsory misses are vanishingly small.
[Figure: 3Cs absolute miss rate (SPEC92) — miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; the miss rate is split into conflict, capacity, and compulsory components]
3Cs Relative Miss Rate
[Figure: the same SPEC92 data normalized to 100% — relative miss rate per type vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity]
Flaw: this analysis assumes a fixed block size.
Good: insight => invention.
Five Techniques to Reduce Miss Rate
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations
O1: Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1 KB to 256 KB — miss rate falls as blocks grow, then rises again for the small caches]
Take advantage of spatial locality:
- The larger the block, the greater the chance parts of it will be used again (principle of locality).
- The number of blocks is reduced for a cache of the same size => increased miss penalty.
- It may increase conflict misses, and even capacity misses if the cache is small.
- High latency and high bandwidth memory encourage large block sizes.
O2: Larger Caches
- Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
- May mean a longer hit time and higher cost.
- Trend: larger L2 or L3 off-chip caches.
[Figure: the same 3Cs absolute miss rate plot as before — the capacity component shrinks as cache size grows]
O3: Higher Associativity
Figures 5.14 and 5.15 show how miss rates improve with higher associativity:
- 8-way set associative is as effective as fully associative for practical purposes
- 2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way set-associative cache of size N/2
Trade-off: a more highly associative cache complicates the circuitry and may have a longer clock cycle:
- Beware: execution time is the only final measure!
- Will clock cycle time increase as a result of having a more complicated cache?
- Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal one.
O4: Way Prediction & Pseudoassociative Caches
Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
Example: the 2-way I-cache of the Alpha 21264:
- If the predictor is correct, I-cache latency is 1 clock cycle.
- If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles.
- Accuracy is in excess of 85%.
- Reduces conflict misses while maintaining the hit speed of a direct-mapped cache.
Pseudoassociative (column associative) caches:
- On a miss, a second cache entry is checked before going to the next lower level: one fast hit and one slow hit.
- Invert the most significant index bit to find the other block in the "pseudoset" (see the sketch below).
- The miss penalty may become slightly longer.
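The second-probe index computation fits in one line; a tiny C sketch (ours, illustrative):

    /* Pseudoassociative second probe: on a miss in set `index`, check the
       companion set found by inverting the most significant index bit. */
    unsigned pseudo_set(unsigned index, unsigned index_bits)
    {
        return index ^ (1u << (index_bits - 1));
    }
    /* e.g. with 5 index bits, a miss in set 0x01 probes set 0x11 next:
       one fast hit in the primary set, one slow hit in the companion. */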
O5: Compiler Optimizations
Improve the hit rate through compile-time optimization.
Reordering instructions with profiling information (McFarling [1989]):
- reduced misses by 50% for a 2 KB direct-mapped 4-byte-block I-cache, and by 75% for an 8 KB cache
- best performance when it was possible to prevent some instructions from entering the cache at all
Aligning basic blocks so the entry point is at the beginning of a cache block:
- decreases the chance of a cache miss for sequential code
Loop interchange: exchange the nesting of loops
- improves spatial locality => reduces misses
- makes data be accessed in the order it is stored => maximizes use of the data in a cache block before it is discarded
The transcript's code is cut off here; the classic loop-interchange example (x is a row-major 2-D array) reads:

/* Before: column-order traversal; successive accesses stride through memory */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: row-order traversal; successive accesses touch sequential words */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];
Blocking: operating on submatrices, or blocks
- Maximize accesses to the data loaded into the cache before it is replaced
- Improves temporal locality
Example: matrix multiply X = Y × Z. The transcript's code is truncated; the canonical before/after pair (with blocking factor B, and x zero-initialized) is:

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: work on B x B submatrices that fit in the cache */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
Three techniques that overlap memory accesses with the execution of instructions:
1. Nonblocking caches to reduce stalls on cache misses (to match out-of-order processors)
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching
O1: Nonblocking Caches to Reduce Stalls on Cache Misses
For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss.
Separate I-cache and D-cache: continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data.
Nonblocking (lockup-free) cache:
- hit under miss: the D-cache continues to supply cache hits during a miss
- hit under multiple miss / miss under miss: overlap multiple misses
[Figure: ratio of average memory stall time for a blocking cache vs. hit-under-miss schemes — for the first 14 (FP) programs the averages are 76% for 1 outstanding miss, 51% for 2, and 39% for 64; for the final 4 (integer) programs, 81%, 78%, and 78%]
O2: Hardware Prefetching of Instructions and Data
Prefetch instructions or data before they are requested by the CPU, either directly into the caches or into an external buffer (faster to reach than main memory).
Instruction prefetch is frequently done in hardware outside the cache. Fetch two blocks on a miss:
- the requested block is placed in the I-cache when it returns
- the prefetched block is placed in an instruction stream buffer (ISB)
- a single ISB would catch 15% to 25% of the misses from a 4 KB, 16-byte-block direct-mapped I-cache; 4 ISBs increased the data hit rate to 43% (Jouppi 1990)
UltraSPARC III: data prefetch. If a load hits in the prefetch cache:
- the block is read from the prefetch cache
- the next prefetch request is issued, calculating the stride of the next prefetched block from the difference between the current address and the previous address (a sketch of this calculation follows)
- up to 8 simultaneous prefetches
Prefetching may interfere with demand misses, lowering performance.
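A minimal C sketch of the stride calculation described above (ours; the real mechanism is hardware inside the prefetch unit):

    #include <stdint.h>

    /* Predict the next prefetch address from the difference between the
       current and previous miss addresses (current - previous = stride). */
    typedef struct { uint64_t prev_addr; int64_t stride; } StridePredictor;

    uint64_t next_prefetch(StridePredictor *p, uint64_t addr)
    {
        p->stride = (int64_t)(addr - p->prev_addr);
        p->prev_addr = addr;
        return addr + (uint64_t)p->stride;   /* predicted next block */
    }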
O3: Compiler-Controlled Prefetching
The compiler inserts prefetch instructions:
- Register prefetch: load the value into a register
- Cache prefetch: load the data only into the cache (not a register)
Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations.
- A normal load instruction is, in effect, a faulting register prefetch instruction.
The most effective prefetch is semantically invisible to the program: it doesn't change the contents of registers or memory, and it cannot cause virtual memory faults.
- Nonbinding prefetch = nonfaulting cache prefetch
- Overlapping execution: the CPU proceeds while the prefetched data are being fetched
Advantage: the compiler can avoid prefetches that hardware would issue unnecessarily.
Drawback: prefetch instructions incur instruction overhead.
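For a sense of what such a prefetch looks like in source form, here is a hand-written nonbinding cache prefetch using the GCC/Clang builtin __builtin_prefetch (our illustration, not from the slides; the prefetch distance of 16 elements is arbitrary):

    /* Nonfaulting cache prefetch: loads data only into the cache, never a
       register, and cannot cause a virtual memory fault. */
    void scale(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&b[i + 16], 0, 1); /* read, low reuse */
            a[i] = 2.0 * b[i];
        }
    }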
5.7 Reducing Hit Time
Importance of cache hit time:
- Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
- More importantly, cache access time limits the clock cycle rate in many processors today!
Fast hit time: quickly and efficiently find out whether the data is in the cache, and if it is, get it out of the cache.
Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
O1: Small and Simple Caches
A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address.
Guideline: smaller hardware is faster.
- Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache? A small data cache permits a fast clock rate.
Guideline: simpler hardware is faster.
- Direct mapped, on chip.
General design: a small and simple first-level cache; keep the tags on chip and the data off chip for second-level caches.
The recent emphasis is on fast clock time, while hiding L1 misses with dynamic execution and using L2 caches to avoid going all the way to memory.
O2: Avoiding Address Translation during Cache Indexing
Two tasks: indexing the cache and comparing addresses.
Virtually vs. physically addressed caches:
- virtual cache: uses the virtual address (VA) for the cache
- physical cache: uses the physical address (PA) obtained after translating the virtual address
Challenges for virtual caches:
1. Protection: page-level protection (read-write / read-only / invalid) must be checked; normally it is checked as part of the virtual-to-physical address translation. Solution: an additional field copies the protection information from the TLB, and it is checked on every access to the cache.
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed. Solution: widen the cache address tag with a process-identifier tag (PID).
3. Synonyms or aliases: two different VAs for the same PA create an inconsistency problem — two copies of the same data in a virtual cache.
   - Hardware antialiasing solution: guarantee every cache block a unique PA. Alpha 21264: check all possible locations; if a duplicate is found, it is invalidated.
   - Software page-coloring solution: force aliases to share some address bits. Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs in the cache.
4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12).
Virtually Indexed, Physically Tagged Cache
[Figure: three organizations.
(1) Conventional: CPU -> TLB (VA to PA) -> cache accessed with the PA -> memory.
(2) Virtually addressed cache: CPU -> cache accessed with the VA, translating to a PA only on a miss -> memory; suffers the synonym problem.
(3) Overlapped: the cache access proceeds in parallel with the VA translation in the TLB, and the physical tags are compared afterward; this requires the cache index to remain invariant across translation. An L2 cache with PA tags sits below.]
O3: Pipelined Cache Access
Simply pipeline the cache access, so that a first-level cache hit takes multiple clock cycles.
Advantage: a fast cycle time (at the cost of slow hits).
Example: clock cycles to access instructions from the I-cache:
- Pentium: 1 clock cycle
- Pentium Pro through Pentium III: 2 clock cycles
- Pentium 4: 4 clock cycles
Drawback: increasing the number of pipeline stages leads to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of its data.
Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.
O4: Trace Caches
A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
- The cache blocks contain dynamic traces of the executed instructions as determined by the CPU, rather than static sequences of instructions as laid out in memory.
- Branch prediction is folded into the cache: the predictions are validated along with the addresses for the fetch to be valid.
- e.g. the Intel NetBurst microarchitecture.
Advantage: better utilization.
- Trace caches store instructions only from the branch entry point to the exit of the trace.
- In a conventional I-cache, the unused part of a long block that is entered or exited by a taken branch is fetched but never used.
Downside: the same instructions may be stored multiple times.
Cache Optimization Summary
- 5.4: reducing miss penalty
- 5.5: reducing miss rate
- 5.6: reducing miss penalty or miss rate via parallelism
- 5.7: reducing hit time