Computer Architecture
Cache Memory
By Yoav Etsion and Dan Tsafrir
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
In the old days…

EDVAC (Electronic Discrete Variable Automatic Computer)
- The successor of ENIAC (the first general-purpose electronic computer)
- Designed & built in 1944-1949 by Eckert & Mauchly (who also built ENIAC), with John von Neumann
- Unlike ENIAC, binary rather than decimal, and a "stored program" machine
- Operational until 1961
In the olden days…

In 1945, von Neumann wrote:
"…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory."

[Photo: von Neumann & EDVAC]
In the olden days…

Later, in 1946, he wrote:
"…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available… We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

[Photo: von Neumann & EDVAC]
Not so long ago…

In 1994, in their paper "Hitting the Memory Wall: Implications of the Obvious", William Wulf and Sally McKee wrote:
"We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one."
Not so long ago…

[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance improves ~60% per year (2× in 1.5 years); DRAM improves ~9% per year (2× in 10 years); the gap grew ~50% per year.]
More recently (2008)…

[Figure: "The memory wall in the multicore era" - performance in seconds (lower = slower) as a function of the number of processor cores, for a conventional architecture.]
Memory Trade-Offs

- Large (dense) memories are slow
- Fast memories are small, expensive, and consume high power
- Goal: give the processor the illusion of a memory that is large (dense), fast, low-power, and cheap
- Solution: a hierarchy of memories

  CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM)
  Speed:  fastest → slowest
  Size:   smallest → biggest
  Cost:   highest → lowest
  Power:  highest → lowest
Typical levels in the memory hierarchy

  Memory level             Response time   Size
  CPU registers            ≈ 0.5 ns        ≈ 100 bytes
  L1 cache                 ≈ 1 ns          ≈ 64 KB
  Last-level cache (LLC)   ≈ 20 ns         ≈ 8 – 32 MB
  Main memory (DRAM)       ≈ 150 ns        ≈ 4 – 100s GB
  SSD                      ≈ 100 µs        ≈ 128 GB
  Hard disk (SATA)         ≈ 5 ms          ≈ 1 – 4 TB
Why Hierarchy Works: Locality

- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
  • Example: code and variables in loops
  • → keep recently accessed data closer to the processor
- Spatial locality (locality in space): if an item is referenced, nearby items tend to be referenced soon
  • Example: scanning an array
  • → move contiguous blocks closer to the processor
- Due to locality, a memory hierarchy is a good idea: we are going to use what we've just recently used, and its immediate neighborhood (see the sketch below)
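To make the two notions concrete, here is a minimal C sketch (illustrative only, not from the slides): summing an array touches consecutive addresses (spatial locality) while reusing the same few variables on every iteration (temporal locality).

    #include <stddef.h>

    /* Summing an array exhibits both kinds of locality. */
    long sum(const int *a, size_t n)
    {
        long s = 0;            /* s and i are reused every iteration: temporal locality */
        for (size_t i = 0; i < n; i++)
            s += a[i];         /* consecutive addresses: spatial locality; one 32/64-byte
                                  line fill serves the next several iterations for free */
        return s;
    }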
Programs with locality cache well…

[Figure: memory address vs. time, one dot per memory access. Horizontal bands show temporal locality, diagonal bands show spatial locality, and a scattered region shows bad locality behavior. Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Memory Hierarchy: Terminology

For each memory level, define the following:
- Hit: the data appears in that memory level
- Hit rate: the fraction of accesses found in that level
- Hit latency: the time to access that level (also includes the time to determine hit/miss)
- Miss: the data must be retrieved from the next level
- Miss rate: 1 - (hit rate)
- Miss penalty: the time to bring in the missing info (replace a block) + the time to deliver the info to the accessor

Average memory access time:
  t_effective = (hit latency × hit rate) + (miss penalty × miss rate)
              = (hit latency × hit rate) + (miss penalty × (1 - hit rate))

If the hit rate is close to 1, t_effective is close to the hit latency, which is generally what we want.
Effective Memory Access Time

- The cache holds a subset of the memory
  • Hopefully, the subset that is being used now, known as "the working set"
- Effective memory access time:
  t_effective = (t_cache × hit rate) + (t_mem × (1 - hit rate))
  • t_mem includes the time it takes to detect a cache miss
- Example: assume t_cache = 10 ns, t_mem = 100 ns

  Hit rate   t_eff (ns)
  0%         100
  50%        55
  90%        19
  99%        10.9
  99.9%      10.1

- As t_mem/t_cache goes up, a hit rate closer to 1 becomes more important (the program below reproduces the table)
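A minimal C sketch of the slide's formula (function names are mine), which reproduces the table above:

    #include <stdio.h>

    /* t_effective = t_cache*hit_rate + t_mem*(1 - hit_rate) */
    static double t_effective(double t_cache, double t_mem, double hit_rate)
    {
        return t_cache * hit_rate + t_mem * (1.0 - hit_rate);
    }

    int main(void)
    {
        const double rates[] = { 0.0, 0.5, 0.9, 0.99, 0.999 };
        for (int i = 0; i < 5; i++)   /* prints 100, 55, 19, 10.9, 10.1 ns */
            printf("hit rate %5.1f%% -> t_eff = %6.2f ns\n",
                   100.0 * rates[i], t_effective(10.0, 100.0, rates[i]));
        return 0;
    }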
Cache – main idea

- The cache holds a small part of the entire memory
  • Need to map parts of the memory into the cache
- Main memory is (logically) partitioned into "blocks" or "lines" (or, when the info is cached, "cachelines")
  • Typical block size: 32 or 64 bytes
  • Blocks are "aligned" in memory
- The cache is partitioned into cache lines
  • Each cache line holds a block, together with a tag that identifies it
  • Only a subset of the blocks is mapped to the cache at a given time
- The cache views an address as: | block # | offset |
- Why use lines/blocks rather than words? (To exploit spatial locality: one fill serves the neighboring words too)

[Diagram: memory blocks 0, 1, 2, …, 90, 91, 92, 93, …; a few blocks (e.g., 42, 90, 92) currently reside in cache lines, each stored alongside its tag.]
Cache Lookup

- Cache hit: the block is mapped to the cache → return data according to the block's offset
- Cache miss: the block is not mapped to the cache → do a cacheline fill
  • Fetch the block into a fill buffer (may require a few cycles)
  • Write the fill buffer into the cache
  • May need to evict another block from the cache, to make room for the new block

[Diagram: same memory/cache picture as on the previous slide.]
Checking valid bit & tag

- Initially the cache is empty → need a "line valid" indication: a valid bit per line
- A line may also be invalidated later

[Diagram: the address (bits 31…0) is split into Tag (bits 31…5) and Offset (bits 4…0). Each tag-array entry holds a valid bit v and a tag. A hit requires that the stored tag equals the address tag AND that the valid bit is set; on a hit, the data array supplies the line, and the offset selects the data within it.]
Cache organization

Basic questions:
- Associativity: where can we place a memory block in the cache?
- Eviction policy: which cache line should be evicted on a miss?

Associativity:
- Ideally, every memory block can go to each cache line
  • Called a fully-associative cache
  • Most flexible, but most expensive
- Compromise: simpler designs, in which blocks can only reside in a subset of the cache lines
  • Direct-mapped cache
  • 2-way set-associative cache
  • N-way set-associative cache
Fully Associative Cache

- An address is partitioned into: a block number, and an offset within the block
- Each block may be mapped to each of the cache lines → look the block up in all lines:
  • Each cache line has a tag
  • All tags are compared to the block # in parallel → need a comparator per line
  • If one of the tags matches the block #, we have a hit → supply data according to the offset
- Best hit rate, but most wasteful → must be relatively small

[Diagram: address fields - Tag = block # (bits 31…5), Offset (bits 4…0); the address tag is compared against every tag-array entry in parallel.]
Fully Associative Cache

- Is said to be a "CAM": Content Addressable Memory

[Diagram: same as on the previous slide.]
Direct Map Cache

- Each memory block can be mapped to only a single cache line
- Offset: the byte within the cache line (bits 4…0)
- Set: the index into the data array and the tag array (bits 13…5; here, 2^9 = 512 sets)
  • Of all the blocks that share a given set index, only one can reside in the cache at a time
- Tag: the remaining block-number bits (bits 31…14) are used as the tag
  • The tag uniquely identifies the memory block
  • Must compare the tag stored in the tag array to the tag of the address

[Diagram: the address is split into Tag (31…14), Set (13…5), and Offset (4…0); the block number is bits 31…5; the set field indexes both the tag array and the data array.]
Direct Map Cache (cont)

- Partition memory into slices; slice size = cache size
- Partition each slice into blocks; block size = cache line size
  • The distance of a block from the start of its slice indicates its position in the cache (the set)
- Advantages:
  • Easy & fast hit/miss resolution
  • Easy & fast replacement algorithm
  • Lowest power
- Disadvantage:
  • A line has only "one chance"; lines are replaced due to "conflict misses"
  • The organization with the highest miss rate

[Diagram: memory divided into consecutive cache-size slices; the blocks at distance x from the start of each slice all map to the same set X.]
Direct Map Cache – Example

- Line size: 32 bytes → 5 offset bits
- Cache size: 16 KB = 2^14 bytes
- #lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512
- #sets = #lines = 512 → #set bits = 9 (bits 5…13)
- #tag bits = 32 - (#set bits + #offset bits) = 32 - (9 + 5) = 18 (bits 14…31)

Lookup address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
- offset = 1 1000 = 0x18
- set = 0 1011 0011 = 0x0B3
- tag = 00 0100 1000 1101 0001 = 0x048D1

The tag stored at set 0x0B3 is compared with 0x048D1 → hit/miss. (A sanity-check program follows.)
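The same decomposition in C, as a sanity check (a sketch assuming the 16 KB direct-mapped cache above; the constants are the example's):

    #include <stdio.h>

    enum { OFFSET_BITS = 5, SET_BITS = 9 };   /* 32-byte lines, 512 sets */

    int main(void)
    {
        unsigned addr   = 0x12345678;
        unsigned offset =  addr & ((1u << OFFSET_BITS) - 1);              /* bits 4..0   */
        unsigned set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1); /* bits 13..5  */
        unsigned tag    =  addr >> (OFFSET_BITS + SET_BITS);              /* bits 31..14 */
        printf("offset=0x%X set=0x%X tag=0x%X\n", offset, set, tag);
        /* prints: offset=0x18 set=0xB3 tag=0x48D1 */
        return 0;
    }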
Direct map (tiny example)

- Assume:
  • Memory size is 2^5 = 32 bytes → need a 5-bit address
  • A block comprises 4 bytes → thus, there are exactly 8 blocks
- Note:
  • Need only 3 bits to identify a block
  • The offset is used exclusively within the cache lines
  • The offset is not used to locate the cache line

[Diagram: memory as 8 blocks (indices 000…111) of 4 bytes each (offsets 00…11); e.g., address 00001 = block 000 / offset 01, address 01110 = block 011 / offset 10, address 11111 = block 111 / offset 11.]
Direct map (tiny example)

- Further assume the size of our cache is 2 cache lines (→ need 2 = 5-2-1 tag bits)
- The address divides like so:
  b4 b3 | b2 | b1 b0
   tag  | set | offset

[Diagram: a tag array (2 entries × 2 bits) and a data array (2 lines × 4 bytes) beside the 8-block memory array; even-numbered blocks map to cache line 0, odd-numbered blocks to cache line 1.]
Direct map (tiny example)

- Accessing address 0 0 0 1 0 (= the byte marked "C")
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (00) | set (0) | offset (10)

[Diagram: block 000 (bytes A B C D) is brought into cache line 0; the tag-array entry for set 0 becomes 00.]
Direct map (tiny example)

- Accessing address 0 1 0 1 0 (= Y)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (01) | set (0) | offset (10)

[Diagram: block 010 (bytes W X Y Z) replaces the previous contents of cache line 0; the tag for set 0 becomes 01.]
Direct map (tiny example)

- Accessing address 1 0 0 1 0 (= Q)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (10) | set (0) | offset (10)

[Diagram: block 100 (bytes T R Q P) replaces the contents of cache line 0; the tag for set 0 becomes 10.]
Direct map (tiny example)

- Accessing address 1 1 0 1 0 (= J)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (11) | set (0) | offset (10)

[Diagram: block 110 (bytes L K J I) replaces the contents of cache line 0; the tag for set 0 becomes 11.]
Direct map (tiny example)

- Accessing address 0 0 1 1 0 (= B)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (00) | set (1) | offset (10)

[Diagram: block 001 (bytes D C B A) is brought into cache line 1; the tag for set 1 becomes 00.]
Direct map (tiny example)

- Accessing address 0 1 1 1 0 (= Y)
- The address divides like so:
  b4 b3 | b2 | b1 b0
  tag (01) | set (1) | offset (10)

[Diagram: block 011 (bytes W Z Y X) replaces the contents of cache line 1; the tag for set 1 becomes 01.]
Direct map (tiny example)

- Now assume the size of our cache is 4 cache lines
- The address divides like so:
  b4 | b3 b2 | b1 b0
  tag |  set  | offset

[Diagram: block 100 (bytes D C B A) maps to set 00, with the single tag bit b4 = 1.]
Direct map (tiny example)

- Again with a 4-line cache and the split b4 | b3 b2 | b1 b0 (tag | set | offset)

[Diagram: block 000 (bytes W Z Y X) also maps to set 00, but with tag b4 = 0; the two blocks compete for the same set.]
2-Way Set Associative Cache

- Each set holds two lines (way 0 and way 1)
- Each block can be mapped into one of the two lines in the appropriate set (the HW checks both ways in parallel)
- The cache is effectively partitioned into two

Example:
- Line size: 32 bytes
- Cache size: 16 KB
- #lines: 512
- #sets: 256
- Offset bits: 5 (bits 4…0)
- Set bits: 8 (bits 5…12)
- Tag bits: 19 (bits 13…31)

Address: 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
- Offset: 1 1000 = 0x18 = 24
- Set: 1011 0011 = 0x0B3 = 179
- Tag: 000 1001 0001 1010 0010 = 0x091A2

[Diagram: two tag arrays and two cache-storage arrays (way 0 and way 1), both indexed by the same set #.]
2-Way Cache – Hit Decision

[Diagram: the set field (bits 12…5) indexes the tag and data arrays of both ways; the address tag (bits 31…13) is compared against both stored tags in parallel; the comparator outputs determine hit/miss and drive a MUX that selects which way's data is sent out. A software sketch follows.]
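In software terms, the hit decision can be sketched as follows (a toy model with my own structure names; real hardware performs the two tag compares and the output MUX in parallel):

    #include <stdbool.h>
    #include <stdint.h>

    enum { OFFSET_BITS = 5, SET_BITS = 8, NSETS = 1 << SET_BITS,
           NWAYS = 2, LINE = 32 };                 /* the 16 KB, 2-way example above */

    struct line { bool valid; uint32_t tag; uint8_t data[LINE]; };
    static struct line cache[NSETS][NWAYS];        /* [set][way] */

    /* Returns true on a hit and copies the requested byte into *out. */
    static bool lookup(uint32_t addr, uint8_t *out)
    {
        uint32_t offset = addr & (LINE - 1);
        uint32_t set    = (addr >> OFFSET_BITS) & (NSETS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);
        for (int way = 0; way < NWAYS; way++) {      /* HW checks both ways in parallel */
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *out = cache[set][way].data[offset]; /* the MUX: select the hitting way */
                return true;
            }
        }
        return false;                                /* miss: a line fill would follow */
    }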
2-Way Set Associative Cache (cont)

- Partition memory into "slices" (ways); slice size = way size = ½ cache size
- Partition each slice into blocks; block size = cache line size
  • The distance of a block from the start of its slice indicates its position in the cache (the set)
- Compared to a direct-mapped cache:
  • Half-size slices → 2× #slices → 2× #blocks mapped to each cache set
  • But each set can hold 2 blocks at a given time
  • ++ fewer collisions/evictions
  • -- more logic, more power consuming

[Diagram: memory divided into way-size slices; the blocks at distance x from the start of each slice all map to set X.]
N-way set associative cache

- Similar to 2-way
- At the extreme, when every cache line is a way, the cache is fully associative
Cache organization summary

- Increasing set associativity:
  • Improves the hit rate
  • Increases power consumption
  • Increases access time
- Strike a balance
Cache Read Miss

- On a read miss: perform a cache line fill
  • Fetch the entire block that contains the missing data from memory
- The block is fetched into the cache line fill buffer
  • May take a few bus cycles to complete the fetch
  • e.g., a 64-bit (8-byte) data bus and a 32-byte cache line → 4 bus cycles
  • Can stream (forward) the critical chunk into the core before the line fill ends
- Once the entire block is fetched into the fill buffer, it is moved into the cache
Cache Replacement Policy

- Direct-mapped cache: easy
  • A new block is mapped to a single line in the cache
  • The old line is evicted (re-written to memory if needed)
- N-way set-associative cache: harder
  • Choose a victim from all the ways in the appropriate set
  • But which? To determine, use a replacement algorithm
- Example replacement policies:
  • Optimum (theoretical, postmortem, called "Belady")
  • FIFO (First In, First Out)
  • Random
  • LRU (Least Recently Used): a decent approximation of Belady
- More on this next week…
LRU Implementation

- 2 ways:
  • 1 bit per set marks the latest way accessed in the set
  • Evict the way not pointed to by the bit
- k-way set associative LRU:
  • Requires a full ordering of the way accesses
  • Algorithm, when way i is accessed:

      x = counter[i];
      counter[i] = k-1;
      for (j = 0; j < k; j++)
          if (j != i && counter[j] > x)
              counter[j]--;

  • When replacement is needed: evict the way with counter = 0
  • Expensive even for small k's, because it is invoked for every load/store
  • Needs a log2(k)-bit counter per line
- Example (k = 4):
  Initial state:   Way 0 1 2 3, Count 0 1 2 3
  Access way 2:    Way 0 1 2 3, Count 0 1 3 2
  Access way 0:    Way 0 1 2 3, Count 3 0 2 1
Pseudo LRU (PLRU)

- In practice, it is sufficient to approximate LRU efficiently
  • Maintain k-1 bits, instead of k∙log2(k) bits
- Assume k = 4, and let's enumerate the way's cache lines
  • We need 2 bits: cache line 00, cl-01, cl-10, and cl-11
- Use a binary search tree to represent the 4 cache lines
  • Set each of the 3 (= k-1) internal nodes to hold a bit variable: B0, B1, and B2
- Whenever accessing a cache line b1b0:
  • Along the path to that line, set each bit variable Bj to the corresponding cache-line bit
  • Think of the bit value as: Bj = 1 means "the right side was referenced more recently"
- Need to evict? Walk the tree as follows (see the sketch below):
  • Go left if Bj = 1; go right if Bj = 0
  • Evict the leaf you've reached (= the opposite direction relative to previous accesses)

[Diagram: a binary tree with root B0, children B1 and B2, and leaves = cache lines 00, 01, 10, 11.]
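A compact C sketch of tree-PLRU for k = 4 (structure and names are mine, following the bit convention above: B = 1 means the right side was referenced more recently):

    #include <stdio.h>

    struct plru4 { unsigned b0, b1, b2; };  /* b0 = root; b1, b2 = its children */

    static void access_way(struct plru4 *t, unsigned way)  /* way = bits b1b0, 0..3 */
    {
        unsigned hi = (way >> 1) & 1, lo = way & 1;
        t->b0 = hi;                 /* root records which half was touched */
        if (hi == 0) t->b1 = lo;    /* left subtree: ways 00, 01 */
        else         t->b2 = lo;    /* right subtree: ways 10, 11 */
    }

    static unsigned victim(const struct plru4 *t)
    {
        unsigned hi = !t->b0;                    /* left if B = 1, right if B = 0 */
        unsigned lo = (hi == 0) ? !t->b1 : !t->b2;
        return (hi << 1) | lo;                   /* the least recently touched leaf */
    }

    int main(void)
    {
        struct plru4 t = {0, 0, 0};
        access_way(&t, 3); access_way(&t, 0); access_way(&t, 2); access_way(&t, 1);
        printf("victim = %u\n", victim(&t));     /* prints 3, as in the example below */
        return 0;
    }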
Pseudo LRU (PLRU) – Example

- Access 3 (11), 0 (00), 2 (10), 1 (01)
- → the next victim is 3 (11), as expected

[Diagram: the tree bits after each access. After 3: B0=1, B2=1. After 0: B0=0, B1=0. After 2: B0=1, B2=0. After 1: B0=0, B1=1. Walking the final tree (left if 1, right if 0): B0=0 → right, B2=0 → right → leaf 11 is evicted.]
LRU vs. Random vs. FIFO

- LRU: hardest
- FIFO: easier; approximates LRU (evicts the oldest rather than the least recently used)
- Random: easiest
- Results:
  • Misses per 1000 instructions in the L1 data cache, on average
  • Averaged across ten SPECint2000 / SPECfp2000 benchmarks
  • PLRU turns out rather similar to LRU

  Size   2-way: LRU / Rand / FIFO    4-way: LRU / Rand / FIFO    8-way: LRU / Rand / FIFO
  16K    114.1 / 117.3 / 115.5       111.7 / 115.1 / 113.1       109.0 / 111.8 / 110.4
  64K    103.4 / 104.3 / 103.9       102.4 / 102.3 / 103.1        99.7 / 100.5 / 100.3
  256K    92.2 /  92.1 /  92.5        92.1 /  92.1 /  92.5        92.1 /  92.1 /  92.5
Effect of Cache on Performance

- MPKI (misses per kilo-instruction)
  • The average number of misses for every 1000 instructions
  • MPKI = memory accesses per kilo-instruction × miss rate
- Memory stall cycles
  = |memory accesses| × miss rate × miss penalty cycles
  = IC/1000 × MPKI × miss penalty cycles
- CPU time
  = (CPU execution cycles + memory stall cycles) × cycle time
  = IC/1000 × (1000 × CPI_execution + MPKI × miss penalty cycles) × cycle time

(A worked example follows.)
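A worked example with assumed, illustrative numbers (not taken from the slides): suppose IC = 10^9 instructions, CPI_execution = 1, MPKI = 20, miss penalty = 100 cycles, and cycle time = 1 ns. Then memory stall cycles = IC/1000 × MPKI × miss penalty = 10^6 × 20 × 100 = 2×10^9, and CPU time = (10^9 × 1 + 2×10^9) × 1 ns = 3 s. That is, cache misses triple the runtime of this hypothetical program.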
Memory Update Policy on Writes

- Write back: lazy writes to the next cache level; prefer the cache
- Write through: immediately update the next cache level
Write Back: Cheaper writes

- Store operations that hit the cache:
  • Write only to the cache; the next cache level (or memory) is not accessed
  • The line is marked as "modified" or "dirty"
  • When evicted, the line is written to the next level only if dirty
- Pros:
  • Saves memory accesses when a line is updated more than once
  • Attractive for multicore/multiprocessor
- Cons:
  • On eviction, the entire line must be written to memory (there is no indication of which bytes within the line were modified)
  • A read miss might require writing to memory (if the evicted line is dirty)
Write Through: Cheaper evictions

- Stores that hit the cache:
  • Write to the cache, and
  • Write to the next cache level (or memory)
  • Need to write only the bytes that were changed, not the entire line → less work
- When evicting, no need to write to the next cache level
  • Lines are never dirty, so they don't need to be written back
  • Still need to throw stuff out, though
- Use write buffers, to mask the wait for the lower-level memory
Write through: need a write buffer

- A write buffer between the cache & memory
  • The processor core writes data into the cache & the write buffer
  • The write buffer allows the processor to avoid stalling on writes
- Works OK if the store frequency (in cycles) << the DRAM write cycle
  • Otherwise the store buffer overflows, no matter how big it is
- Write combining: combine adjacent writes to the same location in the write buffer
- Note: on a cache miss, need to look up the write buffer (or drain it)

[Diagram: Processor → Cache, with a Write Buffer between the cache and DRAM.]
Cache Write Miss

- The processor is not waiting for the written data → it continues to work
- Option 1, write allocate: fetch the line into the cache
  • Goes with the write-back policy, because with write back, write ops are quicker if the line is in the cache
  • Assumes more writes/reads to the cache line will be performed soon
  • Hopes that subsequent accesses to the line hit the cache
- Option 2, write no-allocate: do not fetch the line into the cache (see the sketch below)
  • Goes with the write-through policy
  • Subsequent writes would update memory anyhow
  • (If reads occur, the first read will bring the line into the cache)
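A C sketch contrasting the two pairings (the helper functions are hypothetical stand-ins for the cache machinery, stubbed out so the sketch compiles):

    #include <stdbool.h>

    /* Hypothetical stand-ins for the cache machinery described above. */
    static bool cache_lookup(unsigned addr)            { (void)addr; return false; }
    static void cache_fill(unsigned addr)              { (void)addr; }  /* line fill */
    static void cache_write(unsigned addr, int v, bool dirty)
                                            { (void)addr; (void)v; (void)dirty; }
    static void next_level_write(unsigned addr, int v) { (void)addr; (void)v; }

    /* Write-back + write-allocate: a missing line is fetched; writes stay in the cache. */
    void store_wb_alloc(unsigned addr, int v)
    {
        if (!cache_lookup(addr))
            cache_fill(addr);            /* write allocate: bring the line in */
        cache_write(addr, v, true);      /* dirty; next level updated only on eviction */
    }

    /* Write-through + write-no-allocate: a miss does not fetch the line. */
    void store_wt_noalloc(unsigned addr, int v)
    {
        if (cache_lookup(addr))
            cache_write(addr, v, false); /* never dirty */
        next_level_write(addr, v);       /* the next level is always updated */
    }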
WT vs. WB – Summary

                                    Write-Through                    Write-Back
  Policy                            Data written to the cache        Write data only to the cache;
                                    block (if present) is also       update the lower level when a
                                    written to lower-level memory    block falls out of the cache
  Complexity                        Less                             More
  Can read misses produce writes?   No                               Yes
  Do repeated writes make it
  to the lower level?               Yes                              No
  Upon write miss                   Write no-allocate                Write allocate
Write Buffers for WT – Summary

[Diagram: Processor → Cache → Write Buffer → Lower-Level Memory; the buffer holds data awaiting write-through to the lower-level memory.]

- Q: Why a write buffer?
  A: So the CPU doesn't stall.
- Q: Why a buffer, why not just one register?
  A: Bursts of writes are common.
- Q: Are Read-After-Write (RAW) hazards an issue for the write buffer?
  A: Yes! Drain the buffer before the next read, or check the buffer contents on reads.
Write-back vs. Write-through

- Commercial processors favor write-back
  • Write bursts to the same line are common
  • Write-back simplifies the management of multi-cores:
    data in two consecutive cache levels is inconsistent while a write is in flight,
    and with write-through this happens on every write
Optimizing the Hierarchy
Cache Line Size

- A larger line size takes advantage of spatial locality
- But too-big blocks may fetch unused data, while possibly evicting useful data → the miss rate goes up
- A larger line size also means a larger miss penalty:
  • Longer time to fill the line (critical-chunk-first reduces the problem)
  • Longer time to evict

avgAccessTime = missPenalty × missRate + hitTime × (1 - missRate)
Classifying Misses: the 3 Cs

- Compulsory: the first access to a block, which cannot yet be in the cache
  • The block must be brought into the cache; the cache size does not matter
  • Solution: prefetching
- Capacity: the cache cannot contain all the blocks needed during program execution
  • Blocks are evicted and later retrieved
  • Solution: increase the cache size; stream buffers
- Conflict: occurs in set-associative or direct-mapped caches when too many blocks are mapped to the same set
  • Solution: increase associativity; victim cache
3Cs in SPEC92

[Figure: miss rate per type (fraction, 0 to 0.14) vs. cache size (1 to 128 KB), for 1-, 2-, 4-, and 8-way associativity. Compulsory misses are a negligible sliver; capacity misses dominate and shrink as the cache grows; conflict misses shrink as associativity increases.]
Multi-ported Cache

- An n-ported cache enables n accesses in parallel
  • Parallelizes cache access across different pipeline stages
  • Parallelizes cache access in a superscalar processor
- But for n = 2, it more than doubles the cache area
  • Wire complexity also degrades access times
- A "banked cache" can help:
  • Each line is divided into n banks
  • Can fetch data from k ≤ n different banks, in possibly different lines
Separate Code / Data Caches

- Parallelizes data access and instruction fetch
- The code cache is a read-only cache
  • No need to write the line back into memory when evicted
  • Simpler to manage
- What about self-modifying code?
  • The I-cache "snoops" (= monitors) all write ops
  • This requires a dedicated snoop port (read the tag array + match the tag); otherwise snoops would stall fetch
  • If the code cache contains the written address:
    invalidate the corresponding cache line, and
    flush the pipeline, as it may contain stale code
Last-level cache (LLC)

- The LLC is either L2 or L3
- The LLC is bigger, but has a higher latency
  • It reduces the L1 miss penalty: saves an access to memory
  • On modern processors, the LLC is located on-chip
- Since the LLC contains L1, it needs to be significantly larger
  • Data is replicated across the cache levels: fetching from the LLC into L1 replicates the data
  • E.g., if the LLC is only 2× L1, half of the LLC is duplicated in L1
- The LLC is typically unified (code / data)
Core 2 Duo Die Photo

[Die photo, highlighting the L2 cache. Core 2 Duo's L2 is up to 6 MB and is shared by the cores.]
Ivy Bridge (L3, "last level" cache)

[Die photo. 64 KB data + 64 KB instruction L1 cache per core; 512 KB L2 data cache per core; and up to 32 MB L3 cache shared by all cores.]
AMD Phenom II Six Core

[Die photo.]
LLC: Inclusiveness

- Data replication across cache levels presents a tradeoff: inclusive vs. non-inclusive caches
- Inclusive: the LLC contains all the data present in the higher cache levels
  • Evicting a line from the LLC also evicts it from the higher levels
  • Pro: makes the cache hierarchy easy to manage (the LLC serves as a coordination point)
  • Con: wasted cache space
- Non-inclusive: L1 may contain data not present in the LLC
  • Pro: better use of cache resources
  • Con: how do we know what data is in the caches?
- A critical issue in multicore design: data coherency and consistency across the individual L1 caches
LLC: Inclusiveness

- Practicality wins: the LLC is typically inclusive
  • All addresses in L1 are also contained in the LLC
- The LLC eviction process:
  • An address evicted from the LLC → snoop-invalidate it in L1
  • But the data in L1 might be newer than in L2
    (evicting a dirty line from L1 → it is written back to L2)
  • Thus, when evicting from L2 a line that is dirty in L1:
    the snoop-invalidate to L1 generates a write from L1 to L2,
    the line is marked as modified in L2 → and the line is written to memory
Victim Cache

- The load on a cache's sets may be non-uniform: some sets may suffer more conflict misses than others
- Solution: allocate ways to sets dynamically
- A victim buffer adds some associativity to direct-mapped caches:
  • A line evicted from the L1 cache is placed in the victim cache
  • If the victim cache is full → evict its LRU line
  • On an L1 cache lookup, also search the victim cache in parallel

[Diagram: a direct-mapped cache beside a small fully-associative victim buffer.]
Victim Cache

- On a victim-cache hit:
  • The line is moved back into the cache
  • The line it evicts is moved into the victim cache
  • Same access time as a cache hit

[Diagram: as above.]
Stream Buffers

- Before inserting a new line into the cache, put the new line in a stream buffer
- If the line is expected to be accessed again, move it from the stream buffer into the cache
  • E.g., if the line hits in the stream buffer
- Example: scanning a very large array (much larger than the cache)
  • Each item in the array is accessed just once
  • If the array elements were inserted into the cache, the entire cache would be thrashed
  • If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array's lines into the cache
Prefetching

- Predict future memory accesses, and fetch them from memory ahead of time
- Instruction prefetching:
  • On a cache miss, prefetch sequential lines into the stream buffers
  • Branch-predictor-directed prefetching: let the branch predictor run ahead
- Data prefetching: predict future data accesses
  • Next sequential (block prefetcher)
  • Stride
  • General pattern
- Software prefetching: the compiler injects special prefetch instructions
Prefetching

- Prefetching can greatly improve performance…
- …but incurs high overheads:
  • Predictions are not 100% accurate (closer to 50-60% in practice)
  • Need to predict the correct address, and make sure the data arrives on time:
    too early → the line may be evicted before use; too late → the processor has to stall
  • Prefetching can waste memory bandwidth and power: in some commodity processors, roughly 50% of the data brought from memory is never used, due to aggressive prefetching
Critical Word First: Reduce the Miss Penalty

- Don't wait for the full block to be loaded before restarting the CPU
- Early restart:
  • As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first:
  • Request the missed word first from memory, and send it to the CPU as soon as it arrives
  • Let the CPU continue execution while filling the rest of the words in the line
  • Also called wrapped fetch, or requested word first
- Example (Pentium): an 8-byte bus and a 32-byte cache line → 4 bus cycles to fill a line. Fetching data from address 95H, the chunks arrive in wrap-around order (a small program below reproduces this order):

  Chunk:      80H-87H   88H-8FH   90H-97H   98H-9FH
  Bus cycle:     3         4         1         2
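The wrap-around order can be computed with simple modular arithmetic; a small C sketch of the example above:

    #include <stdio.h>

    enum { LINE = 32, CHUNK = 8, NCHUNKS = LINE / CHUNK };

    int main(void)
    {
        unsigned addr = 0x95;                           /* the missed address */
        unsigned base = addr & ~(unsigned)(LINE - 1);   /* line start: 0x80 */
        unsigned crit = (addr & (LINE - 1)) / CHUNK;    /* critical chunk: index 2 */
        for (unsigned i = 0; i < NCHUNKS; i++) {
            unsigned c = (crit + i) % NCHUNKS;          /* wrap from the critical chunk */
            printf("bus cycle %u: %02XH-%02XH\n",
                   i + 1, base + c * CHUNK, base + c * CHUNK + CHUNK - 1);
        }
        /* cycles 1..4 fetch 90H-97H, 98H-9FH, 80H-87H, 88H-8FH */
        return 0;
    }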
Non-Blocking Cache

- Very important in OoO processors
- Hit under miss:
  • Allow cache hits while one miss is in progress
  • Another miss has to wait
- Miss under miss, hit under multiple misses:
  • Allow hits and misses while other misses are in progress
  • The memory system must allow multiple pending requests
  • Manage a list of outstanding cache misses; when a miss is served and its data gets back, update the list
- Pending operations are managed by MSHRs, also known as "Miss-Status Holding Registers"
Compiler/Programmer Optimizations: Merging Arrays

- Merge 2 arrays into a single array of compound elements:

/* BEFORE: two sequential arrays */
int val[SIZE];
int key[SIZE];

/* AFTER: one array of structures */
struct merge {
    int val;
    int key;
} merged_array[SIZE];

- Reduces conflicts between val and key
- Improves spatial locality
Compiler optimizations: Loop Fusion

- Combine 2 independent loops that have the same looping and some variable overlap
- Assume each element of a is 4 bytes, a 32 KB cache, and 32 B / line:

for (i = 0; i < 10000; i++)
    a[i] = 1 / a[i];
for (i = 0; i < 10000; i++)
    sum = sum + a[i];

- First loop: hits on 7/8 of the iterations (8 array elements per line)
- Second loop: the array (40 KB) > cache → same hit rate as in the 1st loop
- Fuse the loops:

for (i = 0; i < 10000; i++) {
    a[i] = 1 / a[i];
    sum = sum + a[i];
}

- First line: hits on 7/8 of the iterations
- Second line: hits on all iterations
Compiler Optimizations: Loop Interchange

- Change the loop nesting to access data in the order in which it is stored in memory
- A two-dimensional array is laid out in memory as:
  x[0][0] x[0][1] … x[0][99] x[1][0] x[1][1] …

/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

- Sequential accesses, instead of striding through memory every 100 words
- Improved spatial locality
Summary: Cache and Performance

- Reduce the cache miss rate:
  • Larger cache
  • Reduce compulsory misses: larger block size; HW prefetching (instructions, data); SW prefetching (data)
  • Reduce conflict misses: higher associativity; victim cache
  • Stream buffers: reduce cache thrashing
  • Compiler optimizations
- Reduce the miss penalty:
  • Early restart and critical word first on a miss
  • Non-blocking caches (hit under miss, miss under miss)
  • 2nd/3rd-level caches
- Reduce the cache hit time:
  • On-chip caches
  • Smaller caches (hit time increases with cache size)
  • Direct-mapped cache (hit time increases with associativity)
- Bring frequently accessed data closer to the processor