Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005
MAMAS – Computer Architecture 234367
Lectures 3-4: Memory Hierarchy and Cache Memories
Dr. Avi Mendelson
Some of the slides were taken from: (1) Lihu Rapoport, (2) Randy Katz, and (3) Patterson
Technology Trends

DRAM:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
(capacity improved 1000:1; cycle time improved only about 2:1)

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years
Processor-DRAM Memory Gap (latency)

[Figure: performance on a log scale (1 to 1000) vs. time, 1980-2000. CPU performance climbs steeply while DRAM performance improves slowly; the processor-memory performance gap grows about 50% per year.]
Why can't we build Memory at the same frequency as Logic?

1. It is too expensive to build a large memory with that technology.
2. The size of a memory determines its access time: the larger, the slower.

We do not aim for the best-performing solution. We aim for the best COST-EFFECTIVE solution (the best performance for a given amount of money).
Important observation – programs exhibit locality (and we can help them)

Temporal Locality (locality in time):
– If an item is referenced, it will tend to be referenced again soon
– Example: code and variables in loops
=> Keep the most recently accessed data items closer to the processor

Spatial Locality (locality in space):
– If an item is referenced, nearby items tend to be referenced soon
– Example: scanning an array
=> Move blocks of contiguous words closer to the processor

Locality + smaller HW is faster + Amdahl's law => memory hierarchy
The Goal: illusion of large, fast, and cheap memory

Fact: large memories are slow; fast memories are small. How do we create a memory that is large, cheap and fast (most of the time)? – Hierarchy:

CPU → Level 1 → Level 2 → Level 3 → Level 4
Speed: fastest ........ slowest
Size:  smallest ....... biggest
Cost:  highest ........ lowest
Levels of the Memory Hierarchy

Level        Capacity        Access Time     Cost                  Staging/Xfer unit   Managed by        Transfer size
Registers    100s of bytes   < 10s ns                              instr. operands     prog./compiler    1-8 bytes
Cache        K bytes         10-100 ns       $.01-.001/bit         blocks              cache controller  8-128 bytes
Main memory  M bytes         100 ns - 1 us   $.01-.001             pages               OS                512 B-4 KB
Disk         G bytes         ms              10^-3 - 10^-4 cents   files               user/operator     Mbytes
Backup       infinite        sec-min         10^-6 cents

Upper levels are smaller and faster; lower levels are larger and slower.
Simple performance evaluation

Suppose we have a processor that can execute one instruction per cycle when working from the first level of the memory hierarchy (i.e., it hits in L1).

Example: if the information is not found in the first level, the CPU waits 10 cycles; if it is found only in the third level, it costs another 100 cycles.

CPU → Level 1 → Level 2 → Level 3
Cache Performance

CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time

Memory stall cycles = Reads × Read miss rate × Read miss penalty
                    + Writes × Write miss rate × Write miss penalty

Combining reads and writes into a single miss rate and penalty:

Memory stall cycles = Memory accesses × Miss rate × Miss penalty

CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Since Misses per instruction = Memory accesses per instruction × Miss rate:

CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
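The last formula translates directly into code. A minimal sketch in Python; the workload numbers (IC, miss rate, penalty) are hypothetical, chosen only to illustrate the arithmetic:

```python
def cpu_time(ic, cpi_execution, mem_accesses_per_instr, miss_rate,
             miss_penalty, clock_cycle_time):
    # CPU time = IC x (CPI_execution + misses/instr x miss penalty) x cycle time
    misses_per_instr = mem_accesses_per_instr * miss_rate
    return ic * (cpi_execution + misses_per_instr * miss_penalty) * clock_cycle_time

# 1M instructions, base CPI 1, 1.5 accesses/instr, 2% miss rate,
# 50-cycle miss penalty, 1 ns cycle: stalls add 1.5 * 0.02 * 50 = 1.5 CPI
t = cpu_time(1e6, 1.0, 1.5, 0.02, 50, 1e-9)   # 2.5 ms total
```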
Example

Consider a program that executes 10×10^6 instructions with CPI=1.
Each instruction causes (on average) 0.5 accesses to data; 95% of the accesses hit in L1; 50% of the accesses to L2 miss and so need to be looked up in L3. What is the slowdown due to the memory hierarchy?

Solution:
The program generates 15×10^6 accesses to memory, which could be executed in 10×10^6 cycles if all the information were in level 1.
0.05 × 15×10^6 = 750,000 accesses go to L2, and 375,000 of those go on to L3.
New cycles = 10×10^6 + 10 × 750,000 + 100 × 375,000 = 55×10^6
That is a 5.5× slowdown!
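The arithmetic in this example can be re-checked mechanically; all the numbers below come from the example itself:

```python
# Re-checking the slowdown example step by step.
instructions = 10_000_000
accesses = instructions + instructions // 2    # 1 fetch + 0.5 data per instr = 15M
l2_accesses = int(0.05 * accesses)             # 5% miss L1 -> 750,000
l3_accesses = l2_accesses // 2                 # 50% of L2 accesses miss -> 375,000
cycles = instructions + 10 * l2_accesses + 100 * l3_accesses
slowdown = cycles / instructions               # 55M / 10M = 5.5
```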
The first level of the memory hierarchy: Cache memories – Main Idea

At this point we assume only two levels of memory hierarchy: main memory and cache memory.
For simplicity we also assume that the entire program (data and instructions) is placed in main memory.
The cache memory(ies) is part of the processor:
– Same technology
– Speed: same order of magnitude as accessing registers
It is relatively small and expensive.
It acts like a HASH function: it holds parts of the program's address space. It needs to achieve:
– Fast access time
– A fast search mechanism
– A fast replacement mechanism
– A high hit ratio
Cache – Main Idea (cont)

When the processor needs an instruction or data, it first looks for it in the cache. If that fails, it brings the data from main memory to the cache and uses it from there.

The address space (or main memory) is partitioned into blocks:
– Typical block size is 32, 64 or 128 bytes
– The block address is the address of the first byte in the block
– A block address is aligned (a multiple of the block size)

The cache holds lines; each line holds a block:
– Need to determine which line a block is mapped to (if at all)
– A block may not exist in the cache – a cache miss

If we miss the cache:
– The entire block is fetched into a line fill buffer (which may require a few bus cycles), and then put into the cache
– Before putting the new block in the cache, another block may need to be evicted from the cache (to make room for the new block)
Memory Hierarchy: Terminology

For each memory level we can define the following:
– Hit: the data appears in the memory level
– Hit Rate: the fraction of memory accesses which are hits
– Hit Time: time to access the memory level (including the time to determine hit/miss)
– Miss: the data needs to be retrieved from a lower level
– Miss Rate = 1 − Hit Rate
– Miss Penalty: time to replace a block in the current level + time to deliver the data to the processor

Average memory-access time:
t_effective = (Hit time × Hit rate) + (Miss time × Miss rate)
            = (Hit time × Hit rate) + (Miss time × (1 − Hit rate))
– If the hit rate is close to 1, t_effective is close to the hit time
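The t_effective formula is easy to play with in code; a minimal sketch with illustrative numbers:

```python
def t_effective(hit_time, miss_time, hit_rate):
    # average access time = hit contribution + miss contribution
    return hit_time * hit_rate + miss_time * (1.0 - hit_rate)

# 1-cycle hit, 100-cycle miss, 98% hit rate (illustrative numbers):
t_effective(1, 100, 0.98)   # 2.98 cycles - already close to the hit time
```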
Four Questions for Memory Hierarchy Designers

In order to increase efficiency, we move data in blocks between the different levels of memory; e.g., pages in main memory.
To achieve that we need to answer (at least) 4 questions:

Q1: Where can a block be placed when it is brought in? (Block placement)
Q2: How is a block found when needed? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Q1-2: Where can a block be placed, and how can we find it?

Direct mapped: each block has only one place it can appear in the cache.
Fully associative: each block can be placed anywhere in the cache.
Set associative: each block can be placed in a restricted set of places in the cache.
– If there are n blocks in a set, the cache placement is called n-way set associative

What is the associativity of a direct mapped cache?
Fully Associative Cache

[Figure: address fields – tag (= block#, bits 31:4) and line offset (bits 3:0); a tag array with one comparator per line next to the data array; a match on any tag raises the hit signal and selects the data.]

An address is partitioned into:
– an offset within the block
– a block number

Each block may be mapped to any of the cache lines:
– need to look up the block in all lines

Each cache line has a tag:
– the tag is compared to the block number
– if one of the tags matches the block#, we have a hit, and the line is accessed according to the line offset
– a comparator per line is needed
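The lookup just described can be modeled in a few lines of Python. This is a toy sketch (16-byte blocks, i.e. 4 offset bits, and no fill or eviction logic), not a hardware description:

```python
OFFSET_BITS = 4   # assumed 16-byte blocks

class FullyAssociativeCache:
    def __init__(self):
        # tag -> block data; dict membership models "compare against every line"
        self.lines = {}

    def lookup(self, address):
        tag = address >> OFFSET_BITS            # the whole block number is the tag
        offset = address & ((1 << OFFSET_BITS) - 1)
        if tag in self.lines:                   # hardware: one comparator per line
            return self.lines[tag][offset]      # hit: index the line by the offset
        return None                             # miss

cache = FullyAssociativeCache()
cache.lines[0x1234] = bytes(range(16))          # pretend block 0x1234 was fetched
cache.lookup(0x12345)                           # hit: tag 0x1234, offset 5 -> 5
cache.lookup(0x99990)                           # miss -> None
```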
Fully associative – Cont

Advantages:
– Good utilization of the area, since any block in main memory can be mapped to any cache line

Disadvantages:
– A lot of hardware
– Complicated hardware that slows down the access time
Direct Mapped Cache

The l.s. bits of the block number determine which cache line the block is mapped to – this is called the set number.
– Each block is mapped to a single line in the cache
– If a block is mapped to the same line as another block, it will replace it
The rest of the block number bits are used as a tag:
– Compared to the tag stored in the cache for the appropriate set

[Figure: address split into tag (bits 31:14), set (bits 13:5) and line offset (bits 4:0); the block number is bits 31:5. The set# indexes a 512-set tag array and the cache storage, and a single comparator checks the stored tag against the address tag.]
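Splitting an address according to this direct-mapped layout (32-byte lines and 512 sets, matching the figure) can be sketched as:

```python
OFFSET_BITS, SET_BITS = 5, 9          # 32-byte lines, 512 sets -> a 16 KB cache

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_no = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_no, offset

# Two addresses exactly one cache size (16 KB) apart land in the same set
# with different tags - a conflict in a direct-mapped cache:
split_address(0x0000)   # (0, 0, 0)
split_address(0x4000)   # (1, 0, 0)
```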
Direct Mapped Cache (cont)

Memory is conceptually divided into slices whose size is the cache size.
The offset from the slice start indicates the position in the cache (the set).
Addresses with the same offset map into the same line.
One tag per line is kept.

Advantages:
– Easy hit/miss resolution
– Easy replacement algorithm
– Lowest power and complexity

Disadvantage:
– Excessive line replacement due to "conflict misses"

[Figure: memory drawn as a stack of cache-size slices; the blocks at the same offset in every slice (marked x) all map to the same set X.]
2-Way Set Associative Cache

Each set holds two lines (way 0 and way 1). Each block can be mapped into one of two lines in the appropriate set.

[Figure: address split into tag (bits 31:13), set (bits 12:5) and line offset (bits 4:0); the set# indexes the tag array and cache storage of both way 0 and way 1.]

Example:
Line size:   32 bytes
Cache size:  16 KB
# of lines:  512
# of sets:   256
Offset bits: 5
Set bits:    8
Tag bits:    19

Address 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
Offset: 1 1000 = 0x18 = 24
Set:    1011 0011 = 0xB3 = 179
Tag:    000 1001 0001 1010 0010 = 0x091A2
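The field split in this example can be verified directly (32-byte lines and 256 sets, as in the example):

```python
OFFSET_BITS, SET_BITS = 5, 8    # 32-byte lines, 256 sets per way

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_no = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_no, offset

split(0x12345678)   # (0x091A2, 179, 24) - matching the example's numbers
```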
2-Way Cache – Hit Decision

[Figure: the set# selects one entry from each way's tag array; both stored tags are compared in parallel against the address tag; a match in either way raises Hit, and a MUX selects the Data Out from the matching way.]
2-Way Set Associative Cache (cont)

Memory is conceptually divided into slices whose size is 1/2 the cache size (the way size).
The offset from the slice start indicates the set#. Each set now contains two potential lines!
Addresses with the same offset map into the same set.
Two tags per set – one tag per line – are needed.

[Figure: memory drawn as a stack of way-size slices; the blocks at the same offset in every slice (marked x) map to the same set X.]
What happens on a Cache miss?

Read miss:
– Cache line fill – fetch the entire block that contains the missing data from memory
– The block is fetched into the cache line fill buffer
– It may take a few bus cycles to complete the fetch; e.g., with a 64-bit (8-byte) data bus and a 32-byte cache line: 4 bus cycles
– Once the entire line is fetched, it is moved from the fill buffer into the cache

What happens on a write miss?
– The processor does not wait for the data; it continues its work
– 2 options: write allocate and write no allocate
– Write allocate: fetch the line into the cache
   Assumes that we may read from the line soon
   Goes with a write-back policy (hoping that subsequent writes to the line hit the cache)
– Write no allocate: do not fetch the line into the cache on a write miss
   Goes with a write-through policy (subsequent writes would update memory anyhow)
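The two write-miss options can be sketched as a toy policy function. This is a simplification (the cache is a dict keyed by block address, and the fetch of the rest of the line on an allocate is omitted):

```python
def write(cache, memory, block_addr, value, write_allocate):
    if block_addr in cache:
        cache[block_addr] = value          # write hit: update the cached line
        return "hit"
    if write_allocate:
        cache[block_addr] = value          # miss + allocate: line brought into the
        return "miss-allocated"            # cache (fetch of the rest omitted here)
    memory[block_addr] = value             # miss + no-allocate: write through,
    return "miss-not-allocated"            # cache left untouched

cache, memory = {}, {}
write(cache, memory, 0x100, 7, write_allocate=True)    # line 0x100 now cached
write(cache, memory, 0x200, 9, write_allocate=False)   # memory updated, cache not
```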
Replacement

Each line contains a Valid indication.
Direct mapped: simple – a line can be brought to only one place
– The old line is evicted (written back to memory, if needed)
n-way: need to choose among the ways in the set
– Options: FIFO, LRU, Random, Pseudo-LRU
– LRU is the best (on average)
LRU:
– 2 ways: requires 1 bit per set to mark the latest accessed way
– 4 ways: need to save the full ordering
– Fully associative: the full ordering cannot be saved (too many bits) → approximate LRU
Implementing LRU in a k-way set associative cache

For each set, hold a k×k bit matrix.

Initialization (row i starts with i ones, followed by zeros):
row 0:   0 0 0 … 0
row 1:   1 0 0 … 0
row 2:   1 1 0 … 0
…
row k-1: 1 1 1 … 1 0

When line j is accessed:
– Set all bits in row j to 1 (done in parallel by hardware)
– Then reset all bits in column j to 0 (in the same cycle)

Evict the way whose row is ALL "0".
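The matrix scheme above can be modeled directly in software (ways are 0-indexed here):

```python
class MatrixLRU:
    def __init__(self, k):
        self.k = k
        self.m = [[0] * k for _ in range(k)]   # one k x k bit matrix per set

    def access(self, j):
        for c in range(self.k):                # set all bits in row j to 1 ...
            self.m[j][c] = 1
        for r in range(self.k):                # ... then clear all bits in column j
            self.m[r][j] = 0

    def victim(self):
        # the LRU way is the one whose row is all zeros
        return next(r for r in range(self.k) if not any(self.m[r]))

lru = MatrixLRU(4)
for way in (3, 0, 2, 1):
    lru.access(way)
lru.victim()   # 3 - the least recently accessed way
```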
Pseudo LRU

We will use a 4-way set associative cache as an example. Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on).

Pseudo-LRU (PLRU) records only a partial order, using 3 bits per set:
– Bit0 specifies whether the LRU way is one of {0, 1} or one of {2, 3}
– Bit1 specifies which of ways 0 and 1 was least recently used
– Bit2 specifies which of ways 2 and 3 was least recently used

For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit0=1, bit1=1, bit2=1.

[Figure: a binary decision tree – bit0 at the root chooses between the {0,1} and {2,3} pairs; bit1 and bit2 choose within each pair.]
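One way to realize this 3-bit scheme in software is sketched below. The bit polarities are an assumption chosen to reproduce the slide's 3, 0, 2, 1 example (the tree diagram's labels did not survive extraction):

```python
class PLRU4:
    # Bit polarities here are an assumption chosen so that the access
    # sequence 3, 0, 2, 1 yields bit0=bit1=bit2=1, as in the example.
    def __init__(self):
        self.bit0 = self.bit1 = self.bit2 = 0

    def access(self, way):
        if way in (0, 1):
            self.bit0 = 1                      # LRU way is now on the {2,3} side
            self.bit1 = way                    # accessed 1 -> way 0 is pair-LRU
        else:
            self.bit0 = 0                      # LRU way is now on the {0,1} side
            self.bit2 = 1 if way == 2 else 0   # accessed 2 -> way 3 is pair-LRU

    def victim(self):
        if self.bit0:
            return 3 if self.bit2 else 2
        return 0 if self.bit1 else 1

p = PLRU4()
for way in (3, 0, 2, 1):
    p.access(way)
(p.bit0, p.bit1, p.bit2)   # (1, 1, 1)
p.victim()                 # 3 - the true LRU way in this case
```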
Write Buffer for Write Through

A write buffer is needed between the cache and memory:
– The write buffer is just a FIFO
– Processor: writes data into the cache and into the write buffer
– Memory controller: writes the contents of the buffer to memory
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
If store frequency (w.r.t. time) > 1 / DRAM write cycle persists for a long period of time (CPU cycle time too fast and/or too many store instructions in a row), the write buffer will overflow no matter how big you make it.
Write combining: combine writes in the write buffer.
On a cache miss we also need to look up the write buffer.

[Figure: Processor ↔ Cache, with a Write Buffer between them and the DRAM.]
Improving Cache Performance

Separate the data cache from the instruction cache (will be discussed in future lectures).
Reduce the miss rate:
– In order to reduce misses, we need to understand why misses happen
Reduce the miss penalty:
– Bring the information to the processor as soon as possible
Reduce the time to hit in the cache:
– By Amdahl's law, since most of the time we hit in the cache, it is important to accelerate the hit path
Classifying Misses: the 3 Cs

Compulsory:
– The first access to a block cannot be in the cache, so the block must be brought into the cache
– Also called cold-start misses or first-reference misses
– These are misses even in an infinite cache
– Solution (for a fixed cache-line size): prefetching

Capacity:
– The cache cannot contain all the blocks needed during program execution (also termed: the working set of the program is too big), so blocks are evicted and later retrieved
– Solutions: increase the cache size, stream buffers, software solutions

Conflict:
– Occurs in set associative or direct mapped caches when too many blocks map to the same set
– Also called collision misses or interference misses
– Solutions: increase associativity, victim cache, linker optimizations
3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB), for 1-way, 2-way, 4-way and 8-way associativity. Conflict misses shrink as associativity grows, capacity misses shrink as cache size grows, and compulsory misses are vanishingly small.]
How Can We Reduce Misses?

3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:
1) Change the block size: which of the 3Cs is obviously affected?
2) Change the associativity: which of the 3Cs is obviously affected?
3) Change the compiler: which of the 3Cs is obviously affected?
Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K to 256K. Larger blocks reduce compulsory misses; for the smallest caches the miss rate climbs again at the largest block sizes.]
Reduce Misses via Higher Associativity

We have two conflicting trends here. Higher associativity:
– Improves the hit ratio
BUT
– Increases the access time
– Slows down the replacement
– Increases the complexity

Most modern cache memory systems use at least 4-way set associative cache memories.
Example: Avg. Memory Access Time vs. Miss Rate

Example: assume a cache access time of 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, relative to the CAT of a direct mapped cache.

Effective access time to the cache:

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(In the original slide, red entries marked cases that are not improved by more associativity – broadly, the larger cache sizes.)
Note this is for a specific example.
Reducing Miss Penalty by Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.

Example:
– 64-bit = 8-byte bus, 32-byte cache line → 4 bus cycles to fill a line
– Fetch data from address 95H: of the four chunks 80H-87H, 88H-8FH, 90H-97H, 98H-9FH, the chunk 90H-97H (which contains 95H) is fetched first, and the fetch then wraps around the rest of the line.
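The wrapped fetch order for this example can be sketched as follows (the chunk geometry is from the example; the helper name is ours):

```python
LINE, CHUNK = 32, 8    # 32-byte line, 8-byte bus chunks, as in the example

def wrapped_fetch_order(addr):
    line_base = addr & ~(LINE - 1)
    first = (addr & (LINE - 1)) // CHUNK      # chunk containing the missed word
    n = LINE // CHUNK
    return [line_base + ((first + i) % n) * CHUNK for i in range(n)]

wrapped_fetch_order(0x95)   # [0x90, 0x98, 0x80, 0x88]: 90H-97H arrives first
```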
Prefetchers

In order to avoid compulsory misses, we need to bring the information in before it is requested by the program.
We can exploit the locality-of-reference behavior:
– Space → bring the surrounding environment
– Time → the same "patterns" repeat themselves
Prefetching relies on having extra memory bandwidth that can be used without penalty.
There are hardware and software prefetchers.
Hardware Prefetching

Instruction prefetching:
– The Alpha 21064 fetches 2 blocks on a miss
– The extra block is placed in a stream buffer, to avoid possible cache pollution in case the prefetched instructions are not required
– On a miss, check the stream buffer
– Branch-predictor-directed prefetching: let the branch predictor run ahead

Data prefetching – try to predict future data accesses:
– Next sequential
– Stride
– General pattern
Software Prefetching

Data prefetch:
– Load data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
– Special prefetching instructions cannot cause faults; a form of speculative execution
How it is done:
– A special prefetch intrinsic in the language
– Automatically by the compiler
Issuing prefetch instructions takes time:
– Is the cost of the prefetch issues < the savings in reduced misses?
– Wider superscalar machines reduce the difficulty of issue bandwidth
Other techniques
Multi-ported cache and Banked Cache

An n-ported cache enables n cache accesses in parallel:
– Parallelize cache accesses in different pipeline stages
– Parallelize cache accesses in a superscalar processor
A second port effectively doubles the cache die size.
A possible solution: a banked cache:
– Each line is divided into n banks
– Can fetch data from k ≤ n different banks (in possibly different lines)
Separate Code / Data Caches

Enables parallelism between data accesses (done in the memory-access stage) and instruction fetch (done in the fetch stage of a pipelined processor).
The code cache is a read-only cache:
– No need to write the line back into memory when it is evicted
– Simpler to manage
What about self-modifying code? (x86 only)
– Whenever executing a memory write, we need to snoop the code cache
– If the code cache contains the written address, the line that contains the address is invalidated
– Now the code cache is accessed both in the fetch stage and in the memory-access stage; the tags need to be dual-ported to avoid stalling
Increasing the size with minimum latency loss – the L2 cache

L2 is much larger than L1 (256 KB-1 MB compared to 32 KB-64 KB).
It used to be an off-chip cache (between the cache and the memory bus). Now, most implementations are on-chip (but some architectures have an off-chip level-3 cache).
– If L2 is on-chip, why not just make L1 larger?
L2 can be inclusive:
– All addresses in L1 are also contained in L2
– Data in L1 may be more up to date than in L2
– L2 is unified (code / data)
Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are).
Victim Cache

Problem: the per-set load may be non-uniform
– Some sets may have more conflict misses than others
Solution: allocate ways to sets dynamically, according to the load.
When a line is evicted from the cache, it is placed in the victim cache:
– If the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1
On a cache lookup, a victim-cache lookup is also performed (in parallel). On a victim-cache hit:
– The line is moved back to the cache
– The evicted line is moved to the victim cache
– Same access time as a cache hit
Especially effective for a direct mapped cache:
– Combines the fast hit time of a direct mapped cache with reduced conflict misses
Stream Buffers

Before inserting a new line into the cache, put it in a stream buffer.
The line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed again in the future.
Example:
– Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once
– If we insert the array into the cache, it will thrash the entire cache
– If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array lines into the cache
Backup
Compiler issues

Data alignment:
– A misaligned access might span several cache lines
– Prohibited in some architectures (Alpha, SPARC)
– Very slow in others (x86)
– Solution 1: add padding to data structures
– Solution 2: make sure memory allocations are aligned

Code alignment:
– A misaligned instruction might span several cache lines
– x86 only; VERY slow
– Solution: insert NOPs to make sure instructions are aligned
Compiler issues 2

Overalignment:
– The alignment of an array can be a multiple of the cache size
– Several arrays then map to the same cache lines
– Excessive conflict misses (thrashing)

for (int i = 0; i < N; i++)
    a[i] = a[i] + b[i] * c[i];

Solution 1: increase cache associativity
Solution 2: break the alignment
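The thrashing described above can be demonstrated with a toy direct-mapped index function; the cache geometry and array base addresses below are hypothetical:

```python
CACHE_SIZE, LINE_SIZE = 16 * 1024, 32        # assumed direct-mapped geometry
NUM_SETS = CACHE_SIZE // LINE_SIZE

def cache_set(addr):
    return (addr // LINE_SIZE) % NUM_SETS

# Three arrays whose bases are aligned to a multiple of the cache size:
a_base, b_base, c_base = 0x10000, 0x14000, 0x18000
i = 100                                       # any element index (4-byte elements)
sets = {cache_set(base + 4 * i) for base in (a_base, b_base, c_base)}
len(sets)    # 1: a[i], b[i], c[i] fight over the same set every iteration

# Breaking the alignment with a small pad separates the sets:
cache_set(b_base + 64 + 4 * i) != cache_set(a_base + 4 * i)   # True
```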