Memory Hierarchy Design
Topic 8
Introduction
The five classic components of a computer: Control, Datapath, Memory, Input, and Output (the Control and Datapath together form the Processor).
Where do we fetch instructions to execute?
- Build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory).
- Instructions are first fetched from external storage, such as the hard disk, and are kept in main memory. Before they go to the CPU, they are typically copied into the caches.
Programmers would like an indefinitely large memory in which any particular word is available in FAST memory.
This motivates constructing a hierarchy of memories as the solution.
Technology Trends
DRAM:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
2000   256 Mb   100 ns

         Capacity           Speed (latency)
CPU:     2x in 1.5 years    2x in 1.5 years
DRAM:    4x in 3 years      2x in 10 years
Disk:    4x in 3 years      2x in 10 years

From 1980 to 2000, DRAM capacity improved about 4000:1 while its cycle time improved only about 2.5:1.
Performance Gap between CPUs and Memory
[Figure: processor vs. memory performance over time (improvement ratio) — CPU improving 1.35x/year in the early years and 1.55x/year later, memory only 7%/year]
The gap (latency) grows about 50% per year! From 2005 onwards there has been little change in processor performance per core.
Levels of the Memory Hierarchy (Typical Server)
Level           Capacity                  Access Time
CPU registers   ~1000 bytes               300 ps (0.30 ns)
Cache           64 KB / 256 KB / 2-4 MB   1 ns / 3-10 ns / 10-20 ns
Main memory     4-16 GB                   50-100 ns
Disk            4-16 TB                   5-10 ms
Upper levels are faster; lower levels are larger.
[Figure: hierarchy pyramid trading speed for capacity — Registers, Cache, Memory, Disk Storage; data moves between cache and memory in blocks and between memory and disk in pages (both with the inclusion property), and is organized on disk as files]
Levels of the Memory Hierarchy (Personal Mobile Device)
Level           Capacity           Access Time
CPU registers   ~500 bytes         500 ps (0.50 ns)
Cache           64 KB / 256 KB     2 ns / 10-20 ns
Main memory     256-512 MB         50-100 ns
Flash (EEPROM)  4-8 GB             25-50 us
Upper levels are faster; lower levels are larger.
[Figure: the same hierarchy pyramid — Registers, Cache, Memory, Storage; blocks between cache and memory and pages between memory and storage, both with the inclusion property]
ABCs of Caches
Cache:
- mainly means the first level of the memory hierarchy encountered once the address leaves the CPU
- the term is applied whenever buffering is employed to reuse commonly occurring items, e.g. file caches, name caches, and so on
Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
Guideline: for a given implementation technology and power budget, smaller hardware can be made faster.
Memory Hierarchy: Terminology
Traditionally, designers of memory hierarchies focused on optimizing average memory access time, which is determined by cache access time, miss rate, and miss penalty:
- Hit: the data appears in some block in the cache (example: Block X)
  - Hit rate: the fraction of cache accesses found in the cache
  - Hit time: time to access the upper level, consisting of RAM access time + time to determine hit/miss
- Miss: the data must be retrieved from a block in main memory (Block Y) (3 Cs model of misses: Compulsory, Capacity, Conflict)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: time to replace a block in the cache + time to deliver the block to the processor
Cache Measures
CPU execution time combined with cache performance:
CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.
Memory stall clock cycles = Number of misses × Miss penalty
  = IC × (Misses/Instruction) × Miss penalty
  = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
  = IC × Reads per instruction × Read miss rate × Read miss penalty
    + IC × Writes per instruction × Write miss rate × Write miss penalty
Memory accesses consist of fetching instructions and reading/writing data.
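These formulas are easy to evaluate numerically. Below is a minimal C sketch (ours, not from the slides; the names are illustrative) that computes CPU time from the quantities just defined — running it with the numbers from the example on the next slide reproduces its 1.75 result:

    #include <stdio.h>

    /* CPU time = (IC*CPI + memory stall cycles) * clock cycle time, with
       stalls = IC * (accesses/instruction) * miss rate * miss penalty */
    double cpu_time(double ic, double cpi, double acc_per_instr,
                    double miss_rate, double miss_penalty, double cycle_time)
    {
        double stalls = ic * acc_per_instr * miss_rate * miss_penalty;
        return (ic * cpi + stalls) * cycle_time;
    }

    int main(void)
    {
        /* CPI = 1.0, 1.5 accesses/instruction, 2% misses, 25-cycle penalty */
        printf("%.2f\n", cpu_time(1.0, 1.0, 1.5, 0.02, 25.0, 1.0)); /* 1.75 */
        return 0;
    }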
Example
Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?
Answer:
(A) If accesses always hit in the cache, CPI = 1.0 and there are no memory stalls, so
    CPU(A) = (IC × CPI + 0) × Clock cycle time = IC × Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
    Memory stalls = IC × (Memory accesses/Instruction) × Miss rate × Miss penalty
                  = IC × (1 + 50%) × 2% × 25 = IC × 0.75
    CPU(B) = (IC + IC × 0.75) × Clock cycle time = 1.75 × IC × Clock cycle time
The performance ratio is the inverse ratio of the CPU execution times:
    CPU(B)/CPU(A) = 1.75
The computer with no cache misses is 1.75 times faster.
Four Memory Hierarchy Questions
Q1 (block placement): Where can a block/line be placed in the upper level (cache)?
- (most popular: set-associative, direct-mapped, fully-associative)
Q2 (block identification): How is a block found if it is in the upper level (cache)?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy): What happens on a write? (Caching read-only data is very simple.)
- (Write-through or write-back? Write buffer?)
Q1 (block placement): Where can a block be placed?
- Direct mapped: (Block number) mod (Number of blocks in cache)
- Set associative: (Block number) mod (Number of sets in cache)
  - # of sets <= # of blocks; n-way: n blocks per set; 1-way = direct mapped
- Fully associative: # of sets = 1
Example: block 12 placed in an 8-block cache (see the sketch below).
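The placement rule in C (our illustrative sketch, not from the slides):

    /* (Block number) mod (Number of sets). For an 8-block cache:
       direct mapped = 8 sets of 1 block, 2-way = 4 sets of 2 blocks,
       fully associative = 1 set of 8 blocks. */
    unsigned set_index(unsigned block_number, unsigned num_sets)
    {
        return block_number % num_sets;
    }
    /* block 12: set 4 if direct mapped (12 mod 8), set 0 if 2-way (12 mod 4),
       set 0 (the only set) if fully associative; any way within the set */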
Simplest Cache: Direct Mapped (1-way)
[Figure: a 16-block memory (block numbers 0-F) mapped onto a 4-block direct-mapped cache (block indices 0-3)]
Each block has only one place it can appear in the cache. The mapping is usually:
(Block address) MOD (Number of blocks in cache)
Example: 1 KB Direct Mapped Cache, 32 B Blocks
For a 2^N-byte cache with 2^M-byte blocks:
- The uppermost (32 - N) bits are always the Cache Tag (stored as part of the cache state, together with a Valid bit)
- The lowest M bits are the Byte Select (Block Size = 2^M)
- The (N - M) bits in between are the Cache Index
[Figure: a 32-bit address split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00); the cache data array holds 32 blocks of 32 bytes (Byte 0 ... Byte 1023), each entry with a Valid bit and its tag]
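For this configuration (N = 10, M = 5), extracting the three fields takes two shifts and two masks. A minimal C sketch (ours, not from the slides) that reproduces the slide's example values (tag 0x50, index 0x01, byte 0x00):

    #include <stdint.h>
    #include <stdio.h>

    #define BYTE_SELECT_BITS 5   /* 32-byte blocks: M = 5 */
    #define INDEX_BITS       5   /* 1 KB / 32 B = 32 blocks: N - M = 5 */

    int main(void)
    {
        uint32_t addr = 0x00014020u;   /* an arbitrary example address */
        uint32_t byte_sel = addr & ((1u << BYTE_SELECT_BITS) - 1);
        uint32_t index = (addr >> BYTE_SELECT_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag = addr >> (BYTE_SELECT_BITS + INDEX_BITS); /* upper 22 bits */
        printf("tag=0x%x index=0x%x byte=0x%x\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_sel);
        return 0;   /* prints tag=0x50 index=0x1 byte=0x0 */
    }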
Q2 (block identification): How is a block found?
The Block Offset selects the desired data from the block, the Index field selects the set, and the Tag field is compared against the CPU address for a hit:
- Use the Cache Index to select the cache set
- Check the Tag on each block in that set (no need to check the index or block offset)
- A Valid bit is added to the tag to indicate whether or not this entry contains a valid address
- Select the desired bytes using the Block Offset
Increasing associativity shrinks the index and expands the tag.
Three portions of an address in a set-associative or direct-mapped cache:
| Tag | Cache/Set Index | Block Offset (Block Size) |
(the Tag and Index together form the Block Address)
Example: Two-Way Set-Associative Cache
- The Cache Index selects a set from the cache
- The two tags in the set are compared in parallel
- The data is selected based on the tag comparison result
[Figure: two banks of (Valid, Cache Tag, Cache Data) entries indexed by the Cache Index; the address tag (example 0x50) is compared against both stored tags in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that picks the matching Cache Block]
Disadvantage of Set-Associative Caches
N-way set-associative cache vs. direct-mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- The data comes AFTER the hit/miss decision
In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision:
- It is possible to assume a hit and continue, and recover later if it was a miss.
[Figure: the same two-way set-associative datapath as on the previous slide]
Q3 (block replacement): Which block should be replaced on a cache miss?
Easy for direct mapped — hardware decisions are simplified:
- Only one block frame is checked, and only that block can be replaced.
Set associative or fully associative:
- There are many blocks to choose from on a miss.
Three primary strategies for selecting the block to be replaced:
- Random: randomly selected (to spread allocation uniformly)
- LRU: the Least Recently Used block is removed (relies on locality of reference)
- FIFO: first in, first out
Data cache misses per 1000 instructions for various replacement strategies:

         2-way                 4-way                 8-way
Size     LRU   Random FIFO     LRU   Random FIFO     LRU   Random FIFO
16 KB    114.1 117.3  115.5    111.7 115.1  113.3    109.0 111.8  110.4
64 KB    103.4 104.3  103.9    102.4 102.3  103.1    99.7  100.5  100.3
256 KB   92.2  92.1   92.5     92.1  92.1   92.5     92.1  92.1   92.5

There is little difference between LRU and random for the largest cache size; LRU outperforms the others for smaller caches, and FIFO generally outperforms random at the smaller cache sizes.
Q4 (write strategy): What happens on a write?
Reads dominate processor cache accesses: e.g., writes are 7% of overall memory traffic, while 21% of data cache accesses are writes. Make the common case fast (Amdahl's Law).
Two options when writing to the cache:
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  - To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If clean, no write back is needed, since identical information is already below.
Pros and cons (a sketch of both policies follows this list):
- WT: simple to implement. The cache is always clean, so read misses cannot result in writes.
- WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.
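A minimal C sketch of the two policies' write-hit and eviction paths (ours, not from the slides; the types and names are illustrative):

    #include <stdbool.h>
    #include <string.h>

    typedef struct {
        bool     valid, dirty;
        unsigned tag;
        unsigned char data[32];
    } CacheLine;

    /* On a write hit: write through updates memory immediately; write back
       only marks the line dirty and defers the memory write to eviction. */
    void write_hit(CacheLine *line, unsigned offset, unsigned char byte,
                   bool write_through, unsigned char *memory_block)
    {
        line->data[offset] = byte;
        if (write_through)
            memory_block[offset] = byte;   /* lower level updated now */
        else
            line->dirty = true;            /* updated later, on replacement */
    }

    /* On eviction under write back: copy the block out only if dirty. */
    void evict(CacheLine *line, unsigned char *memory_block)
    {
        if (line->valid && line->dirty)
            memcpy(memory_block, line->data, sizeof line->data);
        line->valid = false;
        line->dirty = false;
    }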
Write Stall and Write Buffer
When the CPU must wait for writes to complete during write through, the CPU is said to write stall.
A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.
A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
- The write buffer is just a FIFO; typical number of entries: 4
[Figure: Processor -> Cache and Write Buffer; Write Buffer -> DRAM]
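A minimal C sketch of such a 4-entry FIFO write buffer (ours; illustrative only — the real buffer is a hardware structure):

    #include <stdbool.h>

    #define WB_ENTRIES 4   /* typical number of entries, per the slide */

    typedef struct { unsigned addr; unsigned data; } WBEntry;
    typedef struct { WBEntry e[WB_ENTRIES]; int head, count; } WriteBuffer;

    /* Processor side: returns false (write stall) only if the FIFO is full. */
    bool wb_push(WriteBuffer *wb, unsigned addr, unsigned data)
    {
        if (wb->count == WB_ENTRIES) return false;   /* CPU must stall */
        wb->e[(wb->head + wb->count) % WB_ENTRIES] = (WBEntry){addr, data};
        wb->count++;
        return true;                                 /* CPU continues */
    }

    /* Memory-controller side: drains the oldest entry to DRAM. */
    bool wb_pop(WriteBuffer *wb, WBEntry *out)
    {
        if (wb->count == 0) return false;
        *out = wb->e[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return true;
    }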
Write-Miss Policy: Write Allocate vs. No-Write Allocate
Two options on a write miss:
- Write allocate: the block is allocated on a write miss, followed by the write-hit actions.
  - Write misses act like read misses.
- No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
  - Blocks stay out of the cache under no-write allocate until the program tries to read them, whereas with write allocate even blocks that are only written will be in the cache.
Write Through with Write Allocate:
- on hits, it writes to the cache and to main memory
- on misses, it updates the block in main memory and brings the block into the cache
- Bringing the block into the cache on a miss does not make much sense in this combination, because the next hit to this block will generate a write to main memory anyway (per the write-through policy).
Write Through with No-Write Allocate:
- on hits, it writes to the cache and to main memory
- on misses, it updates the block in main memory without bringing the block into the cache
- Subsequent writes to the block will update main memory anyway, because write through is employed; so some time is saved by not bringing the block into the cache on a miss, since doing so appears useless anyway.
Write Back with Write Allocate:
- on hits, it writes to the cache and sets the dirty bit for the block; main memory is not updated
- on misses, it updates the block in main memory and brings the block into the cache
- Subsequent writes to the same block, if the block originally caused a miss, will hit in the cache and set its dirty bit. That eliminates extra memory accesses and results in very efficient execution compared with the Write Through with Write Allocate combination.
Write Back with No-Write Allocate:
- on hits, it writes to the cache and sets the dirty bit for the block; main memory is not updated
- on misses, it updates the block in main memory without bringing the block into the cache
- Subsequent writes to the same block, if the block originally caused a miss, will generate misses all the way, resulting in very inefficient execution.
Write-Miss Policy Example
Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:
    Write Mem[100]
    Write Mem[100]
    Read  Mem[200]
    Write Mem[200]
    Write Mem[100]
What are the numbers of hits and misses (reads and writes inclusive) when using no-write allocate versus write allocate?
Answer:
No-write allocate:                    Write allocate:
    Write Mem[100]  1 write miss          Write Mem[100]  1 write miss
    Write Mem[100]  1 write miss          Write Mem[100]  1 write hit
    Read  Mem[200]  1 read miss           Read  Mem[200]  1 read miss
    Write Mem[200]  1 write hit           Write Mem[200]  1 write hit
    Write Mem[100]  1 write miss          Write Mem[100]  1 write hit
    4 misses; 1 hit                       2 misses; 3 hits
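These counts are easy to verify with a few lines of C. The following minimal simulation (ours, not from the slides) prints 4 misses/1 hit and 2 misses/3 hits:

    #include <stdio.h>
    #include <stdbool.h>

    /* Tiny fully associative "cache": just remember which addresses are
       resident; enough entries that nothing is ever evicted. */
    static unsigned resident[8];
    static int n;

    static bool in_cache(unsigned a) {
        for (int i = 0; i < n; i++) if (resident[i] == a) return true;
        return false;
    }
    static void allocate(unsigned a) { if (!in_cache(a)) resident[n++] = a; }

    static void run(bool write_allocate) {
        struct { char op; unsigned addr; } seq[] = {
            {'W',100},{'W',100},{'R',200},{'W',200},{'W',100} };
        int hits = 0, misses = 0;
        n = 0;
        for (int i = 0; i < 5; i++) {
            bool hit = in_cache(seq[i].addr);
            hit ? hits++ : misses++;
            /* reads always allocate; writes only under write allocate */
            if (!hit && (seq[i].op == 'R' || write_allocate))
                allocate(seq[i].addr);
        }
        printf("%-18s %d misses, %d hits\n",
               write_allocate ? "write allocate:" : "no-write allocate:",
               misses, hits);
    }

    int main(void) { run(false); run(true); return 0; }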
Example: Consider a computer with the following features:
- 90% of all memory accesses are found in the cache (hit ratio = 0.9)
- The block size is 2 words, and the whole block is read on any miss
- The CPU sends references to the cache at the rate of 10^7 words per second
- 25% of those references are writes (writes = 25%, reads = 75%)
- The bus can support 10^7 words per second, reads or writes (total bus bandwidth = 10^7)
- The bus reads or writes a single word at a time
- Assume that at any one time, 30% of the block frames in the cache have been modified
- The cache uses write allocate on a write miss
Calculate the percentage of the bus bandwidth used, on average, in the two cases: Write Back and Write Through.
Write-Miss Policy: Write Allocate
Write Through (total bus B/W used = B/W for read hits + read misses + write hits + write misses):
    Read Hit:   0
    Read Miss:  10^7 × 0.1 × 0.75 × 2 = 0.15 × 10^7
    Write Hit:  10^7 × 0.9 × 0.25 × 1 = 0.225 × 10^7
    Write Miss: 10^7 × 0.1 × 0.25 × (2 + 1) = 0.075 × 10^7
    Total bus B/W used = (0.15 + 0.225 + 0.075) × 10^7 = 0.45 × 10^7 (45%)
Write Back:
    Read Hit:   0
    Read Miss:  10^7 × 0.1 × 0.75 × (2 + 0.3 × 2) = 0.195 × 10^7
    Write Hit:  0
    Write Miss: 10^7 × 0.1 × 0.25 × (2 + 0.3 × 2) = 0.075 × 10^7
    Total bus B/W used = (0.195 + 0.075) × 10^7 = 0.27 × 10^7 (27%)
Cache Performance
Example: split cache vs. unified cache — which has the better average memory access time? A 16 KB instruction cache with a 16 KB data cache (split cache), or a 32 KB unified cache?
Miss rates:
Size    Instruction Cache   Data Cache   Unified Cache
16 KB   0.4%                11.4%        -
32 KB   -                   -            3.18%
Assume:
- A hit takes 1 clock cycle and the miss penalty is 100 cycles
- A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
- 36% of the instructions are data transfer instructions; about 74% of the memory accesses are instruction references
Answer:
Average memory access time (split)
= % instructions × (Hit time + Instruction miss rate × Miss penalty) + % data × (Hit time + Data miss rate × Miss penalty)
= 74% × (1 + 0.4% × 100) + 26% × (1 + 11.4% × 100) = 4.24
Average memory access time (unified)
= 74% × (1 + 3.18% × 100) + 26% × (1 + 1 + 3.18% × 100) = 4.44
Impact of Memory Access on CPU Performance
Example: Suppose a processor with
- ideal CPI = 1.0 (ignoring memory stalls)
- average miss rate of 2%
- average memory references per instruction of 1.5
- miss penalty of 100 cycles
What is the impact on performance when the behavior of the cache is included?
Answer:
CPI = CPU execution cycles per instruction + Memory stall cycles per instruction
    = CPI_execution + Miss rate × Memory accesses per instruction × Miss penalty
CPI with cache = 1.0 + 2% × 1.5 × 100 = 4.0
CPI without cache = 1.0 + 1.5 × 100 = 151
CPU time with cache = IC × CPI × Clock cycle time = IC × 4.0 × Clock cycle time
CPU time without cache = IC × 151 × Clock cycle time
Without a cache, the CPI of the processor increases from 1 to 151!
With the cache, the processor is still stalled waiting for memory 75% of the time (CPI goes from 1 to 4).
Impact of Cache Organization on CPU Performance
Example: What is the impact of two different cache organizations (direct mapped vs. 2-way set associative) on the performance of a CPU?
- Ideal CPI = 2.0 (ignoring memory stalls)
- Clock cycle time is 1.0 ns
- Average memory references per instruction: 1.5
- Cache size: 64 KB, block size: 64 bytes
- For the set-associative cache, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer
- Cache miss penalty is 75 ns; hit time is 1 clock cycle
- Miss rates: direct mapped 1.4%; 2-way set associative 1.0%
Answer:
Avg. memory access time (1-way) = 1.0 + (0.014 × 75) = 2.05 ns
Avg. memory access time (2-way) = 1.0 × 1.25 + (0.01 × 75) = 2.00 ns
CPU time (1-way) = IC × (CPI_execution × Clock cycle time + Memory accesses per instruction × Miss rate × Miss penalty)
                 = IC × (2.0 × 1.0 + 1.5 × 0.014 × 75) = 3.58 × IC
CPU time (2-way) = IC × (2.0 × 1.0 × 1.25 + 1.5 × 0.01 × 75) = 3.63 × IC
Summary of Performance Equations
[Figure: table collecting the performance equations introduced in this topic]
Improving Cache Performance
CPU Execution Time = IC × (CPI_execution + (Memory accesses/Instruction) × Miss rate × Miss penalty) × Clock cycle time
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Next we look at ways to improve cache and memory access times.
Reducing Cache Miss Penalty
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
The time to handle a miss is becoming more and more the controlling factor, because of the great improvement in the speed of processors compared to the speed of memory.
Five optimizations:
1. Multilevel caches
2. Critical word first and early restart
3. Giving priority to read misses over writes
4. Merging write buffer
5. Victim caches

O1: Multilevel Caches
Approaches:
- Make the cache faster, to keep pace with the speed of CPUs
- Make the cache larger, to overcome the widening gap
- L1: fast hits; L2: fewer misses
L2 equations:
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2)
so
Average Memory Access Time = Hit Time(L1) + Miss Rate(L1) × (Hit Time(L2) + Miss Rate(L2) × Miss Penalty(L2))
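A minimal C helper for the expanded two-level equation (ours; the example numbers in the comment are illustrative, not from the slides):

    /* AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * Penalty_L2) */
    double amat_two_level(double hit_l1, double mr_l1,
                          double hit_l2, double mr_l2, double penalty_l2)
    {
        double miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2;  /* L2 equation */
        return hit_l1 + mr_l1 * miss_penalty_l1;
    }
    /* e.g. amat_two_level(1, 0.04, 10, 0.5, 100) = 1 + 0.04*(10 + 50)
       = 3.4 cycles */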
Design of L2 Cache
Size:
- Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1.
Whether data in L1 is also in L2:
- novice approach: design L1 and L2 independently
- multilevel inclusion: L1 data are always present in L2
  - Advantage: easy consistency between I/O and the caches (check L2 only)
  - Drawback: L2 must invalidate all L1 blocks that map onto a replaced second-level block => slightly higher first-level miss rate
  - e.g. Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2
- multilevel exclusion: L1 data is never found in L2
  - A cache miss in L1 results in a swap of blocks between L1 and L2
  - Advantage: prevents wasting space in L2
  - e.g. AMD Athlon: 64 KB L1 and 256 KB L2
O2: Critical Word First and Early Restart
Don't wait for the full block to be loaded before restarting the CPU.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Given spatial locality, the CPU tends to want the next sequential word anyway, so it is not clear how much early restart benefits.
Generally useful only with large blocks.
O3: Giving Priority to Read Misses over Writes
Serve reads before writes have been completed — with write through and a write buffer, a read miss must not return stale data still sitting in the buffer. The transcript's example is cut off here; the classic sequence (512 and 1024 map to the same direct-mapped cache block) is:
SW R3, 512(R0)    ; M[512] <- R3   (cache index 0)
LW R1, 1024(R0)   ; R1 <- M[1024]  (cache index 0)
LW R2, 512(R0)    ; R2 <- M[512]   (cache index 0)
If the write buffer has not drained when the read miss for M[512] occurs, R2 could get the old value. Either make the read miss wait until the buffer is empty, or check the buffer's contents and let the read miss proceed if there is no conflict.
O4: Merging Write Buffer
If the write buffer is empty, the data and the full address are written into the buffer, and the write is finished from the CPU's perspective.
Usually a write buffer entry holds multiple words.
Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write-buffer entry. If so, the new data are combined with that entry.
[Figure: a write buffer with 4 entries, each holding four 64-bit words — without merging, four sequential writes occupy all four entries; with merging, they combine into a single entry]
Writing multiple words at the same time is faster than writing them one at a time.
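A minimal C sketch of the merging check (ours; it assumes 64-bit words and entry-aligned 32-byte blocks, which is our illustration rather than anything the slides specify):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ENTRY 4          /* four 64-bit words per entry */

    typedef struct {
        bool     valid;
        uint64_t block_addr;           /* address of the aligned 32-byte block */
        uint64_t word[WORDS_PER_ENTRY];
        bool     word_valid[WORDS_PER_ENTRY];
    } WBufEntry;

    /* Try to merge a new 64-bit write into an existing valid entry whose
       block address matches; returns true if merged. */
    bool try_merge(WBufEntry *buf, int entries, uint64_t addr, uint64_t data)
    {
        uint64_t block = addr / (8 * WORDS_PER_ENTRY);
        int      slot  = (int)((addr / 8) % WORDS_PER_ENTRY);
        for (int i = 0; i < entries; i++) {
            if (buf[i].valid && buf[i].block_addr == block) {
                buf[i].word[slot] = data;      /* combine with existing entry */
                buf[i].word_valid[slot] = true;
                return true;                   /* no new entry consumed */
            }
        }
        return false;                          /* caller allocates a new entry */
    }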
O5: Victim Caches
The idea is recycling: remember what was most recently discarded on a cache miss in case it is needed again, rather than simply discarding it or swapping it into L2.
Victim cache: a small, fully associative cache between a cache and its refill path:
- contains only blocks that were discarded from the cache because of a miss — the "victims"
- checked on a miss before going to the next lower-level memory
Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches.
e.g. AMD Athlon: 8 entries
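A minimal C sketch of the victim-cache probe (ours; a real victim cache is hardware and would also swap the data block back into the main cache on a hit):

    #include <stdbool.h>

    #define VICTIMS 4   /* 1-5 entries are already effective, per the slide */

    typedef struct { bool valid; unsigned tag; /* + data block */ } Victim;
    static Victim vc[VICTIMS];

    /* On a main-cache miss, probe the victim cache before going to the
       next lower level. Fully associative: compare every entry. */
    bool victim_probe(unsigned block_tag)
    {
        for (int i = 0; i < VICTIMS; i++)
            if (vc[i].valid && vc[i].tag == block_tag)
                return true;    /* recycle the recently discarded block */
        return false;           /* genuine miss: fetch from the lower level */
    }

    /* On eviction from the main cache, the discarded victim goes here
       (simple rotation stands in for a real replacement policy). */
    void victim_insert(unsigned block_tag)
    {
        static int next;
        vc[next] = (Victim){ true, block_tag };
        next = (next + 1) % VICTIMS;
    }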
Reducing Miss Rate
3 Cs of Cache Misses
- Compulsory: the first access to a block cannot find it in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur, due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur, because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would hit in a fully associative cache of size X.)
3 Cs of Cache Misses
2:1 Cache Rule: miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2.
Compulsory misses are vanishingly small.
[Figure: 3Cs absolute miss rate (SPEC92) — miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; the miss rate is split into conflict, capacity, and compulsory components]
3Cs Relative Miss Rate
[Figure: the same SPEC92 data normalized to 100% — relative miss rate per type vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity]
Flaw: this analysis assumes a fixed block size.
Good: insight => invention.
Five Techniques to Reduce Miss Rate
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations
O1: Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1 KB to 256 KB — miss rate falls as blocks grow, then rises again for the small caches]
Take advantage of spatial locality:
- The larger the block, the greater the chance parts of it will be used again (principle of locality).
- The number of blocks is reduced for a cache of the same size => increased miss penalty.
- It may increase conflict misses, and even capacity misses if the cache is small.
- High latency and high bandwidth memory encourage large block sizes.
O2: Larger Caches
- Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
- May mean a longer hit time and higher cost.
- Trend: larger L2 or L3 off-chip caches.
[Figure: the same 3Cs absolute miss rate plot as before — the capacity component shrinks as cache size grows]
O3: Higher Associativity
Figures 5.14 and 5.15 show how miss rates improve with higher associativity:
- 8-way set associative is as effective as fully associative for practical purposes
- 2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way set-associative cache of size N/2
Trade-off: a more highly associative cache complicates the circuitry and may have a longer clock cycle:
- Beware: execution time is the only final measure!
- Will clock cycle time increase as a result of having a more complicated cache?
- Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal one.
O4: Way Prediction & Pseudoassociative Caches
Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
Example: the 2-way I-cache of the Alpha 21264:
- If the predictor is correct, I-cache latency is 1 clock cycle.
- If incorrect, it tries the other block, changes the way predictor, and has a latency of 3 clock cycles.
- Accuracy is in excess of 85%.
- Reduces conflict misses while maintaining the hit speed of a direct-mapped cache.
Pseudoassociative (column associative) caches:
- On a miss, a second cache entry is checked before going to the next lower level: one fast hit and one slow hit.
- Invert the most significant index bit to find the other block in the "pseudoset" (see the sketch below).
- The miss penalty may become slightly longer.
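The second-probe index computation fits in one line; a tiny C sketch (ours, illustrative):

    /* Pseudoassociative second probe: on a miss in set `index`, check the
       companion set found by inverting the most significant index bit. */
    unsigned pseudo_set(unsigned index, unsigned index_bits)
    {
        return index ^ (1u << (index_bits - 1));
    }
    /* e.g. with 5 index bits, a miss in set 0x01 probes set 0x11 next:
       one fast hit in the primary set, one slow hit in the companion. */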
O5: Compiler Optimizations
Improve the hit rate through compile-time optimization.
Reordering instructions with profiling information (McFarling [1989]):
- reduced misses by 50% for a 2 KB direct-mapped 4-byte-block I-cache, and by 75% for an 8 KB cache
- best performance when it was possible to prevent some instructions from entering the cache at all
Aligning basic blocks so the entry point is at the beginning of a cache block:
- decreases the chance of a cache miss for sequential code
Loop interchange: exchange the nesting of loops
- improves spatial locality => reduces misses
- makes data be accessed in the order it is stored => maximizes use of the data in a cache block before it is discarded
The transcript's code is cut off here; the classic loop-interchange example (x is a row-major 2-D array) reads:

/* Before: column-order traversal; successive accesses stride through memory */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After: row-order traversal; successive accesses touch sequential words */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];
Blocking: operating on submatrices, or blocks
- Maximize accesses to the data loaded into the cache before it is replaced
- Improves temporal locality
Example: matrix multiply X = Y × Z. The transcript's code is truncated; the canonical before/after pair (with blocking factor B, and x zero-initialized) is:

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: work on B x B submatrices that fit in the cache */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
Three techniques that overlap memory accesses with the execution of instructions:
1. Nonblocking caches to reduce stalls on cache misses (to match out-of-order processors)
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching
O1: Nonblocking Caches to Reduce Stalls on Cache Misses
For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss.
Separate I-cache and D-cache: continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data.
Nonblocking (lockup-free) cache:
- hit under miss: the D-cache continues to supply cache hits during a miss
- hit under multiple miss / miss under miss: overlap multiple misses
[Figure: ratio of average memory stall time for a blocking cache vs. hit-under-miss schemes — for the first 14 (FP) programs the averages are 76% for 1 outstanding miss, 51% for 2, and 39% for 64; for the final 4 (integer) programs, 81%, 78%, and 78%]
O2: Hardware Prefetching of Instructions and Data
Prefetch instructions or data before they are requested by the CPU, either directly into the caches or into an external buffer (faster to reach than main memory).
Instruction prefetch is frequently done in hardware outside the cache. Fetch two blocks on a miss:
- the requested block is placed in the I-cache when it returns
- the prefetched block is placed in an instruction stream buffer (ISB)
- a single ISB would catch 15% to 25% of the misses from a 4 KB, 16-byte-block direct-mapped I-cache; 4 ISBs increased the data hit rate to 43% (Jouppi 1990)
UltraSPARC III: data prefetch. If a load hits in the prefetch cache:
- the block is read from the prefetch cache
- the next prefetch request is issued, calculating the stride of the next prefetched block from the difference between the current address and the previous address (a sketch of this calculation follows)
- up to 8 simultaneous prefetches
Prefetching may interfere with demand misses, lowering performance.
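A minimal C sketch of the stride calculation described above (ours; the real mechanism is hardware inside the prefetch unit):

    #include <stdint.h>

    /* Predict the next prefetch address from the difference between the
       current and previous miss addresses (current - previous = stride). */
    typedef struct { uint64_t prev_addr; int64_t stride; } StridePredictor;

    uint64_t next_prefetch(StridePredictor *p, uint64_t addr)
    {
        p->stride = (int64_t)(addr - p->prev_addr);
        p->prev_addr = addr;
        return addr + (uint64_t)p->stride;   /* predicted next block */
    }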
O3: Compiler-Controlled Prefetching
The compiler inserts prefetch instructions:
- Register prefetch: load the value into a register
- Cache prefetch: load the data only into the cache (not a register)
Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations.
- A normal load instruction is, in effect, a faulting register prefetch instruction.
The most effective prefetch is semantically invisible to the program: it doesn't change the contents of registers or memory, and it cannot cause virtual memory faults.
- Nonbinding prefetch = nonfaulting cache prefetch
- Overlapping execution: the CPU proceeds while the prefetched data are being fetched
Advantage: the compiler can avoid prefetches that hardware would issue unnecessarily.
Drawback: prefetch instructions incur instruction overhead.
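For a sense of what such a prefetch looks like in source form, here is a hand-written nonbinding cache prefetch using the GCC/Clang builtin __builtin_prefetch (our illustration, not from the slides; the prefetch distance of 16 elements is arbitrary):

    /* Nonfaulting cache prefetch: loads data only into the cache, never a
       register, and cannot cause a virtual memory fault. */
    void scale(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&b[i + 16], 0, 1); /* read, low reuse */
            a[i] = 2.0 * b[i];
        }
    }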
5.7 Reducing Hit Time
Importance of cache hit time:
- Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
- More importantly, cache access time limits the clock cycle rate in many processors today!
Fast hit time: quickly and efficiently find out whether the data is in the cache, and if it is, get it out of the cache.
Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
O1: Small and Simple Caches
A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address.
Guideline: smaller hardware is faster.
- Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache? A small data cache permits a fast clock rate.
Guideline: simpler hardware is faster.
- Direct mapped, on chip.
General design: a small and simple first-level cache; keep the tags on chip and the data off chip for second-level caches.
The recent emphasis is on fast clock time, while hiding L1 misses with dynamic execution and using L2 caches to avoid going all the way to memory.
O2: Avoiding Address Translation during Cache Indexing
Two tasks: indexing the cache and comparing addresses.
Virtually vs. physically addressed caches:
- virtual cache: uses the virtual address (VA) for the cache
- physical cache: uses the physical address (PA) obtained after translating the virtual address
Challenges for virtual caches:
1. Protection: page-level protection (read-write / read-only / invalid) must be checked; normally it is checked as part of the virtual-to-physical address translation. Solution: an additional field copies the protection information from the TLB, and it is checked on every access to the cache.
2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed. Solution: widen the cache address tag with a process-identifier tag (PID).
3. Synonyms or aliases: two different VAs for the same PA create an inconsistency problem — two copies of the same data in a virtual cache.
   - Hardware antialiasing solution: guarantee every cache block a unique PA. Alpha 21264: check all possible locations; if a duplicate is found, it is invalidated.
   - Software page-coloring solution: force aliases to share some address bits. Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs in the cache.
4. I/O: typically uses PAs, so it needs to interact with the cache (see Section 5.12).
Virtually Indexed, Physically Tagged Cache
[Figure: three organizations.
(1) Conventional: CPU -> TLB (VA to PA) -> cache accessed with the PA -> memory.
(2) Virtually addressed cache: CPU -> cache accessed with the VA, translating to a PA only on a miss -> memory; suffers the synonym problem.
(3) Overlapped: the cache access proceeds in parallel with the VA translation in the TLB, and the physical tags are compared afterward; this requires the cache index to remain invariant across translation. An L2 cache with PA tags sits below.]
O3: Pipelined Cache Access
Simply pipeline the cache access, so that a first-level cache hit takes multiple clock cycles.
Advantage: a fast cycle time (at the cost of slow hits).
Example: clock cycles to access instructions from the I-cache:
- Pentium: 1 clock cycle
- Pentium Pro through Pentium III: 2 clock cycles
- Pentium 4: 4 clock cycles
Drawback: increasing the number of pipeline stages leads to a greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of its data.
Note that pipelining increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit.
O4: Trace Caches
A trace cache for instructions finds a dynamic sequence of instructions, including taken branches, to load into a cache block.
- The cache blocks contain dynamic traces of the executed instructions as determined by the CPU, rather than static sequences of instructions as laid out in memory.
- Branch prediction is folded into the cache: the predictions are validated along with the addresses for the fetch to be valid.
- e.g. the Intel NetBurst microarchitecture.
Advantage: better utilization.
- Trace caches store instructions only from the branch entry point to the exit of the trace.
- In a conventional I-cache, the unused part of a long block that is entered or exited by a taken branch is fetched but never used.
Downside: the same instructions may be stored multiple times.
Cache Optimization Summary
- 5.4: reducing miss penalty
- 5.5: reducing miss rate
- 5.6: reducing miss penalty or miss rate via parallelism
- 5.7: reducing hit time