Memory Hierarchy Design Adapted from CS 252 Spring 2006 UC Berkeley Copyright (C) 2006 UCB.
Memory Hierarchy Design
Adapted from CS 252, Spring 2006, UC Berkeley. Copyright (C) 2006 UCB.
Since 1980, CPU has outpaced DRAM ...

[Figure: performance (1/latency, log scale from 10 to 1000) vs. year, 1980-2000. CPU improved 60% per year (2x in 1.5 years); DRAM improved 9% per year (2x in 10 years). The gap grew 50% per year.]

Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.
1977: DRAM faster than microprocessors

Apple ][ (1977), designed by Steve Wozniak and Steve Jobs.
CPU: 1000 ns. DRAM: 400 ns.
Levels of the Memory Hierarchy

| Level | Capacity | Access Time | Cost | Staging Xfer Unit | Managed by |
| --- | --- | --- | --- | --- | --- |
| Registers | 100s bytes | <10s ns | | Instr. operands (1-8 bytes) | prog./compiler |
| Cache | K bytes | 10-100 ns | 1-0.1 cents/bit | Blocks (8-128 bytes) | cache controller |
| Main memory | M bytes | 200-500 ns | $.0001-.00001 cents/bit | Pages (512-4K bytes) | OS |
| Disk | G bytes | 10 ms (10,000,000 ns) | 10^-5 to 10^-6 cents/bit | Files (Mbytes) | user/operator |
| Tape | infinite | sec-min | 10^-8 cents/bit | | |

Upper levels are faster; lower levels are larger.
Memory Hierarchy: Apple iMac G5 (1.6 GHz)

| | Reg | L1 Inst | L1 Data | L2 | DRAM | Disk |
| --- | --- | --- | --- | --- | --- | --- |
| Size | 1K | 64K | 32K | 512K | 256M | 80G |
| Latency (cycles, time) | 1, 0.6 ns | 3, 1.9 ns | 3, 1.9 ns | 11, 6.9 ns | 88, 55 ns | 10^7, 12 ms |

Registers are managed by the compiler; the caches by hardware; DRAM and disk by the OS, hardware, and the application.

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

Goal: Illusion of large, fast, cheap memory.
iMac’s PowerPC 970: All caches on-chip

[Die photo: registers (1K), L1 instruction cache (64K), L1 data cache (32K), and 512K L2.]
Memory Hierarchy

• Provides (virtually) unlimited amounts of fast memory.
• Takes advantage of locality.
– Principle of locality (p. 47): programs tend to reuse data and instructions they have used recently (temporal and spatial).
– Most programs do not access all code or data uniformly.
The Principle of Locality

• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
• For the last 15 years, hardware has relied on locality for speed.

Locality is a property of programs that is exploited in machine design.
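As a rough illustration (not from the slides), the same sum computed in row-major versus column-major order shows the two access patterns; in a language with contiguous arrays (C, NumPy), the row-major loop walks consecutive addresses and caches far better. Python lists of lists are not guaranteed contiguous, so this is only a sketch of the pattern:

```python
# Same work, two traversal orders. Row-major has spatial locality
# (consecutive elements of a row are touched in sequence); column-major
# strides across rows and, on real arrays, misses the cache far more.
N = 256
matrix = [[1] * N for _ in range(N)]

row_major = sum(matrix[i][j] for i in range(N) for j in range(N))  # good locality
col_major = sum(matrix[i][j] for j in range(N) for i in range(N))  # poor locality
print(row_major == col_major)  # True: same result, different access pattern
```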
Programs with locality cache well ...

[Figure: memory address (one dot per access) plotted against time, showing bands of spatial locality, regions of temporal locality, and a region of bad locality behavior. Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Multi-Level Memory Hierarchy

[Figure: multi-level hierarchy with typical values; levels closer to the CPU are smaller, faster, and more expensive per bit.]
Memory Hierarchy

• Goal: to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
• All data in one level are found in the level below. (The upper level is a subset of the lower level.)
Why Important?

• Microprocessor performance improved 35% per year until 1986, and 55% per year since 1987.
• Memory: 7% per year performance improvement in latency.
• The result is a processor-memory performance gap.
Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X).
– Hit rate: the fraction of memory accesses found in the upper level.
– Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: data needs to be retrieved from a block in the lower level (Block Y).
– Miss rate = 1 - (hit rate).
– Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty (500 instructions on the Alpha 21264!)

[Diagram: the processor exchanges Block X with the upper-level memory and Block Y with the lower-level memory.]
Cache Measures

• Hit rate: fraction of accesses found in that level.
– Usually so high that we talk about the miss rate instead.
– Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance; by itself it can mislead.
• Average memory access time = hit time + miss rate x miss penalty (ns or clocks).
• Miss penalty: time to replace a block from the lower level, including time to replace it in the CPU.
– Access time: time to reach the lower level = f(latency to lower level).
– Transfer time: time to transfer the block = f(bandwidth between upper and lower levels).
Cache Performance

Memory stall cycles
= Number of misses x Miss penalty
= IC x (Misses / Instruction) x Miss penalty
= IC x (Memory accesses / Instruction) x Miss rate x Miss penalty
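The stall-cycle formula above translates directly into code. This is a minimal sketch; the function name and the example numbers are illustrative, not from the slides:

```python
def memory_stall_cycles(instruction_count, accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles = IC x (accesses/instruction) x miss rate x miss penalty."""
    return instruction_count * accesses_per_instr * miss_rate * miss_penalty

# e.g., 1M instructions, 1.5 memory accesses per instruction,
# 2% miss rate, 25-cycle miss penalty
stalls = memory_stall_cycles(1_000_000, 1.5, 0.02, 25)
print(stalls)  # 750000.0
```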
Example
• Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
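A worked solution, assuming the usual interpretation that every instruction makes one instruction fetch plus the stated data accesses (the slide poses the question but does not show the answer):

```python
cpi_perfect = 1.0
accesses_per_instr = 1 + 0.5   # 1 instruction fetch + 0.5 data accesses per instruction
miss_rate = 0.02
miss_penalty = 25              # clock cycles

stall_cpi = accesses_per_instr * miss_rate * miss_penalty  # 0.75 extra CPI from misses
cpi_real = cpi_perfect + stall_cpi                         # 1.75

# Same instruction count and clock rate, so speedup is the CPI ratio.
speedup = cpi_real / cpi_perfect
print(speedup)  # 1.75
```

So a computer with no cache misses would be 1.75 times faster.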
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Block Placement

• Direct mapped: (block address) mod (number of blocks in cache).
• Fully associative: a block can be placed anywhere in the cache.
• Set associative: (block address) mod (number of sets in cache).
– n-way set associative: n blocks in a set.
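The two mod rules above can be sketched as follows; the cache geometry in the example is hypothetical:

```python
def direct_mapped_frame(block_addr, num_blocks):
    """Direct mapped: the block can live in exactly one frame."""
    return block_addr % num_blocks

def set_assoc_set(block_addr, num_sets):
    """Set associative: the block can live anywhere within one set."""
    return block_addr % num_sets

# 8-block cache: direct mapped vs. 2-way set associative (8 / 2 = 4 sets)
print(direct_mapped_frame(12, 8))  # 4
print(set_assoc_set(12, 4))        # 0
```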
Block Placement

[Figure: the same memory block placed in a direct-mapped, a set-associative, and a fully associative cache.]
Block Placement
• The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative.
Block Identification

The address is divided into three fields:
• Tag: used to check all the blocks in the set.
• Index: used to select the set.
• Block offset: the address of the desired data within the block.
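Splitting an address into these fields is simple bit arithmetic. A sketch, with field widths chosen purely for illustration (here 4 offset bits and 8 index bits):

```python
def split_address(addr, offset_bits, index_bits):
    """Return (tag, index, offset) for a cache with the given field widths."""
    offset = addr & ((1 << offset_bits) - 1)                 # low bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # middle bits
    tag = addr >> (offset_bits + index_bits)                 # remaining high bits
    return tag, index, offset

tag, index, offset = split_address(0x12345678, offset_bits=4, index_bits=8)
print(hex(tag), hex(index), hex(offset))  # 0x12345 0x67 0x8
```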
DECstation 3100 cache

[Figure: address (showing bit positions 31-0) split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a byte offset (bits 1-0). 16K entries, each holding a valid bit, a 16-bit tag, and 32 bits of data. Block size = one word. A hit is signaled when the valid bit is set and the stored tag matches.]
64 KB Cache Using 4-Word Blocks

[Figure: address (showing bit positions 31-0) split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a byte offset (bits 1-0). 4K entries, each holding a valid bit, a 16-bit tag, and 128 bits of data. A 4:1 multiplexer driven by the block offset selects one of the four 32-bit words on a hit.]
Block Size vs. Miss Rate

• Increasing the block size decreases the miss rate: instruction references have better spatial locality.
• Serious problem with just increasing the block size: the miss penalty increases.
– Miss penalty: the time required to fetch the block from the next lower level of the hierarchy and load it into the cache.
– Miss penalty = (latency to the first word) + (transfer time for the rest of the block).
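The miss-penalty decomposition above can be sketched as follows; the numbers are illustrative, and the block transfer is approximated as block size divided by bandwidth:

```python
def miss_penalty_ns(first_word_latency_ns, block_bytes, bandwidth_bytes_per_ns):
    """Miss penalty = latency to the first word + time to transfer the block."""
    return first_word_latency_ns + block_bytes / bandwidth_bytes_per_ns

# 64-byte block, 30 ns to the first word, 8 bytes/ns transfer bandwidth:
# larger blocks lower the miss rate but raise this penalty.
print(miss_penalty_ns(30, 64, 8))  # 38.0
```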
Block Size vs. Miss Rate

[Figure: miss rate (0% to 40%) vs. block size (16 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. Miss rate falls as block size grows, but for small caches it rises again at large block sizes.]
Block Identification

• If the total cache size is kept the same, increasing associativity:
– increases the number of blocks per set,
– decreases the size of the index,
– increases the size of the tag.
Block Replacement

• Direct mapped: only one block frame is checked for a hit, and only that block can be replaced.
• Fully associative or set associative:
– Random
– Least-recently used (LRU)
– First in, first out (FIFO)
• See Figure 5.6 on p. 400.
Write Strategy

• Write through
– The information is written both to the block in the cache and to the block in memory.
– Easier to implement; the cache is always clean.
• Write back
– The information is written only to the block in the cache. The modified block is written to main memory only when it is replaced.
– Less memory bandwidth, less power.
• Write through with a write buffer.
What happens on a write?

| | Write-Through | Write-Back |
| --- | --- | --- |
| Policy | Data written to the cache block is also written to lower-level memory | Data is written only to the cache; the lower level is updated when the block falls out of the cache |
| Debug | Easy | Hard |
| Do read misses produce writes? | No | Yes |
| Do repeated writes make it to the lower level? | Yes | No |

Additional option: let writes to an un-cached address allocate a new cache line (“write-allocate”).
Write Buffers for Write-Through Caches

[Diagram: processor, cache, and a write buffer sitting between the cache and lower-level memory, holding data awaiting write-through.]

Q. Why a write buffer?
A. So the CPU doesn’t stall while writes complete.

Q. Why a buffer; why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Either drain the buffer before the next read, or check the write buffer for a match and let the read proceed first.
Write Miss Policy

• Write allocate
– The block is allocated on a write miss, and the write that follows is a hit.
• No-write allocate
– Write misses do not affect the cache; the block is modified only in memory.
– Unusual.
Example

• Assume a fully associative write-back cache with many cache entries that starts empty. What are the numbers of hits and misses when using no-write allocate versus write allocate, given the following sequence?
Write Mem[100];
Write Mem[100];
Read Mem[200];
Write Mem[200];
Write Mem[100];
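One way to work the example above is to simulate it directly. The sketch below counts hits and misses under each policy (the cache is fully associative and starts empty, so no replacements occur):

```python
def simulate(ops, write_allocate):
    """Count (hits, misses) for a fully associative cache that starts empty."""
    cache, hits, misses = set(), 0, 0
    for kind, addr in ops:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            # Reads always allocate; writes allocate only under write-allocate.
            if kind == "read" or write_allocate:
                cache.add(addr)
    return hits, misses

ops = [("write", 100), ("write", 100), ("read", 200),
       ("write", 200), ("write", 100)]
print(simulate(ops, write_allocate=False))  # (1, 4): only Write Mem[200] hits
print(simulate(ops, write_allocate=True))   # (3, 2): misses on first touch only
```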
Write Miss Policy
• Write-back caches normally use write allocate.
• Write-through caches often use no-write allocate.
Example

• Assume a cache of 16K blocks (1 block = 1 word), 32-bit addresses, and a byte-addressable machine. Find the total number of sets and the total number of tag bits for caches that are direct mapped, 2-way set associative, 4-way set associative, and fully associative.
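A worked solution (the slide poses the question only). With one-word blocks and byte addressing, 2 bits of the 32-bit address are the byte offset, leaving a 30-bit block address:

```python
import math

def tag_bits(num_blocks, assoc, block_addr_bits=30):
    """Return (sets, tag width, total tag bits) for the given associativity."""
    num_sets = num_blocks // assoc
    index_bits = int(math.log2(num_sets))        # bits needed to select a set
    tag = block_addr_bits - index_bits           # remaining address bits
    return num_sets, tag, tag * num_blocks       # every block stores a tag

blocks = 16 * 1024
for ways in (1, 2, 4, blocks):                   # fully associative = 16K-way
    print(ways, tag_bits(blocks, ways))
# direct mapped: 16384 sets, 16-bit tags, 262144 total tag bits
# 2-way: 8192 sets, 17-bit tags, 278528 total
# 4-way: 4096 sets, 18-bit tags, 294912 total
# fully associative: 1 set, 30-bit tags, 491520 total
```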
Cache Performance

Average memory access time = Hit time + Miss rate x Miss penalty
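The AMAT formula above, as code; the example numbers are illustrative, not from the slides:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit time, 5% miss rate, 20-cycle miss penalty
print(amat(1, 0.05, 20))  # 2.0 cycles on average
```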