1999 ©UCB CS 161 Ch 7: Memory Hierarchy, LECTURE 14. Instructor: L.N. Bhuyan (www.cs.ucr.edu/~bhuyan)
Recap: Machine Organization: the 5 classic components of any computer

Personal Computer:
° Processor (CPU) (active)
  • Control ("brain")
  • Datapath ("brawn")
° Memory (passive): where programs and data live when running
° Devices
  • Input
  • Output

Components of every computer belong to one of these five categories.
Memory Trends (2004 data)

° Users want large and fast memories!
° SRAM access times are 0.5 - 5 ns, at a cost of $4000 to $10,000 per GB.
° DRAM access times are 50 - 70 ns, at a cost of $100 to $200 per GB.
° Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB.
Memory Latency Problem

[Figure: performance (log scale, 1 to 1000) versus time, 1980-2000. Microprocessor performance grows about 60%/yr (2x every 1.5 years); DRAM performance grows about 5%/yr (2x every 15 years). The processor-DRAM performance gap grows about 50% per year: the motivation for a memory hierarchy.]
The Goal: Illusion of large, fast, cheap memory

° Fact: Large memories are slow; fast memories are small.
° How do we create a memory that is large, cheap, and fast (most of the time)?
° Hierarchy of levels:
  • Uses smaller and faster memory technologies close to the processor
  • Fast access time in the highest level of the hierarchy
  • Cheap, slow memory furthest from the processor
° The aim of memory hierarchy design is to achieve an access time close to that of the highest level with a size equal to that of the lowest level.
Recap: Memory Hierarchy Pyramid

[Figure: a pyramid with the Processor (CPU) at the apex and Levels 1, 2, 3, ..., n below, connected by a transfer datapath (bus). Moving down the pyramid, distance from the CPU and the size of memory at each level increase while cost per MB decreases; moving up, distance from the CPU and access time (memory latency) decrease.]
Why Hierarchy Works: Natural Locality

° The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
° Temporal Locality (Locality in Time): recently accessed data tend to be referenced again soon.
° Spatial Locality (Locality in Space): items near a referenced item tend to be referenced soon.

[Figure: probability of reference (0 to 1) plotted against memory address (0 to 2^n - 1), peaking over a small portion of the address space.]
Memory Hierarchy: Terminology

° Hit: the data appears in the upper level (level X). Hit Rate: the fraction of memory accesses found in the upper level.
° Miss: the data must be retrieved from a block in the lower level (block Y). Miss Rate = 1 - Hit Rate.
° Hit Time: the time to access the upper level, which consists of the time to determine hit/miss plus the memory access time.
° Miss Penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor.
° Note: Hit Time << Miss Penalty.
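The terms above combine into the standard average memory access time (AMAT) formula, AMAT = Hit Time + Miss Rate x Miss Penalty. A minimal sketch in Python (the numeric values are illustrative assumptions, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same time units as hit_time."""
    return hit_time + miss_rate * miss_penalty

# Assumed example: 1 ns hit time, 2% miss rate, 100 ns miss penalty.
print(amat(1.0, 0.02, 100.0))  # 3.0 ns per access on average
```

Because Hit Time << Miss Penalty, even a small miss rate can dominate the average.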
Current Memory Hierarchy

[Figure: processor (control, datapath, registers) connected to an L1 cache, an L2 cache, main memory, and secondary memory.]

              Regs     L1 cache  L2 cache  Main memory  Secondary memory
Speed (ns):   1        2         6         100          10,000,000
Size (MB):    0.0005   0.1       1-4       100-1000     100,000
Cost ($/MB):  --       $100      $30      $1           $0.05
Technology:   Regs     SRAM      SRAM      DRAM         Disk
Memory Hierarchy Technology

° Random Access: the access time is the same for all locations (a hardware decoder is used).
° Sequential Access: very slow; data are accessed sequentially, so the access time is location-dependent; treated as I/O (examples: disks and tapes).
° DRAM: Dynamic Random Access Memory
  • High density, low power, cheap, slow
  • Dynamic: needs to be "refreshed" regularly
° SRAM: Static Random Access Memory
  • Low density, high power, expensive, fast
  • Static: contents last "forever" (until power is lost)
Memories: Review

° SRAM:
  • The value is stored on a pair of inverting gates
  • Very fast, but takes up more space than DRAM (4 to 6 transistors per cell)
° DRAM:
  • The value is stored as a charge on a capacitor (must be refreshed)
  • Very small, but slower than SRAM (by a factor of 5 to 10)

[Figure: an SRAM cell (cross-coupled inverters with word line and bit lines B and B-bar) and a DRAM cell (word line, pass transistor, capacitor, bit line).]
How is the hierarchy managed?

° Registers <-> Memory
  • By the compiler (or the assembly-language programmer)
° Cache <-> Main Memory
  • By hardware
° Main Memory <-> Disks
  • By a combination of hardware and the operating system (virtual memory: covered next)
  • By the programmer (files)
Measuring Cache Performance

CPU time = Execution cycles x Clock cycle time

With cache misses:
CPU time = (Execution cycles + Memory-stall cycles) x Clock cycle time

Read-stall cycles = #reads x Read miss rate x Read miss penalty
Write-stall cycles = #writes x Write miss rate x Write miss penalty

Memory-stall cycles = Read-stall cycles + Write-stall cycles
                    = Memory accesses x Miss rate x Miss penalty
                    = #instructions x Misses/instruction x Miss penalty
Example

Q: The cache miss penalty is 50 cycles, and all instructions take 2.0 cycles without memory stalls. Assume a cache miss rate of 2% and 1.33 memory references per instruction (why 1.33? one instruction fetch plus about 0.33 data references per instruction). What is the impact of the cache?

Ans: CPU time = IC x (CPI + Memory-stall cycles per instruction) x cycle time t

Performance including cache misses:
CPU time = IC x (2.0 + (1.33 x 0.02 x 50)) x t = IC x 3.33 x t

For a perfect cache that never misses, CPU time = IC x 2.0 x t.

Hence, memory stalls stretch CPU time by a factor of 3.33/2.0 = 1.67.

But without a memory hierarchy, every reference would go to main memory, and the CPI would increase to 2.0 + (1.33 x 50) = 68.5, a factor of more than 30 times longer.
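The arithmetic in this example is easy to check with a short script (the helper function and its name are ours; the numbers are the slide's):

```python
def effective_cpi(cpi_base, refs_per_instn, miss_rate, miss_penalty):
    # Memory-stall cycles per instruction = refs/instn x miss rate x miss penalty
    return cpi_base + refs_per_instn * miss_rate * miss_penalty

with_cache = effective_cpi(2.0, 1.33, 0.02, 50)  # 2.0 + 1.33 x 0.02 x 50
no_cache = effective_cpi(2.0, 1.33, 1.00, 50)    # every reference misses

print(round(with_cache, 2))  # 3.33
print(round(no_cache, 1))    # 68.5
# Slowdown vs. a perfect cache: with_cache / 2.0, about 1.67x
```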
Cache Organization

(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?

° The answers to (1) and (2) depend on the type, or organization, of the cache.
° In a direct-mapped cache, each memory address is associated with exactly one possible block within the cache.
  • Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.
Simplest Cache: Direct Mapped

[Figure: a 16-block main memory (block addresses 0-15) mapped onto a 4-block direct-mapped cache (cache indexes 0-3).]

• Cache block 0 can be occupied by data from memory blocks 0, 4, 8, 12 (addresses 0000two, 0100two, 1000two, 1100two).
• Cache block 1 can be occupied by data from memory blocks 1, 5, 9, 13.
• Block size = 32/64 bytes.
Simplest Cache: Direct Mapped

[Figure: main memory blocks 0-15 mapped onto a 4-block direct-mapped cache; memory blocks 0010two, 0110two, 1010two, 1110two all map to cache index 2. The memory block address divides into a tag and an index.]

° The index determines the block's location in the cache.
° index = (memory block address) mod (# of cache blocks)
° If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory block address, where n = log2(# of blocks).
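The mapping above can be sketched in code: with a power-of-two number of blocks, the index is just the low-order bits of the block address (the 4-block configuration matches the example cache; the function name is ours):

```python
NUM_BLOCKS = 4                             # as in the 4-block example cache
INDEX_BITS = NUM_BLOCKS.bit_length() - 1   # log2(4) = 2

def split_block_address(block_addr):
    index = block_addr % NUM_BLOCKS   # same as block_addr & 0b11
    tag = block_addr >> INDEX_BITS    # remaining upper bits
    return tag, index

# Memory blocks 0, 4, 8, 12 all map to cache index 0, with different tags:
for b in (0, 4, 8, 12):
    print(b, split_block_address(b))  # tags 0, 1, 2, 3; index always 0
```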
Simplest Cache: Direct Mapped with Tag

[Figure: the same 4-block direct-mapped cache, with each cache line holding a tag and data; e.g., cache index 2 holds tag 11two, identifying memory block 1110two.]

° The tag determines which memory block currently occupies the cache block.
° tag = the left-hand (upper) bits of the address.
° Hit: the cache tag field = the tag bits of the address.
° Miss: the cache tag field ≠ the tag bits of the address.
Finding the Item within a Block

° In reality, a cache block consists of a number of bytes/words (32 or 64 bytes) to (1) increase the cache hit rate, thanks to the locality property, and (2) reduce the cache miss time.
° Mapping: memory block i can be mapped to cache block frame (i mod x), where x is the number of blocks in the cache.
  - This is called congruent mapping.
° Given the address of an item, the index tells which block of the cache to look in.
° Then, how do we find the requested item within the cache block?
° Or, equivalently, "What is the byte offset of the item within the cache block?"
Issues with Direct-Mapped Caches

° If the block size > 1, the rightmost bits of the index are really the offset within the indexed block:

  ttttttttttttttttt iiiiiiiiii oooo

  • tag: to check whether we have the correct block
  • index: to select the block
  • byte offset: to select the byte within the block
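This three-field split can be sketched in code. The configuration below is a hypothetical example (16-byte blocks and 1024 blocks are assumptions chosen for illustration); the field widths follow directly from those choices:

```python
OFFSET_BITS = 4   # 16-byte blocks -> 4 byte-offset bits (assumed)
INDEX_BITS = 10   # 1024 blocks -> 10 index bits (assumed)

def decompose(addr):
    """Split a byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(decompose(0x0000001C))  # (0, 1, 12): tag 0, index 1, byte offset 0xc
```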
Accessing Data in a Direct-Mapped Cache

Three types of events:
° Cache miss: nothing is in the cache in the appropriate block, so fetch the block from memory.
° Cache hit: the cache block is valid and contains the proper address, so read the desired word.
° Cache miss with block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory.

Cache access procedure:
(1) Use the index bits to select the cache block.
(2) If the valid bit is 1, compare the tag bits of the address with the cache block's tag bits.
(3) If they match, use the offset to read out the word/byte.
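The access procedure can be sketched as a small simulation (the class, its names, and the string payloads are illustrative stand-ins, not real hardware state):

```python
class DirectMappedCache:
    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.valid = [False] * num_blocks
        self.tags = [None] * num_blocks
        self.data = [None] * num_blocks   # one block payload per cache line

    def access(self, addr, memory):
        block_addr = addr // self.block_size
        index = block_addr % self.num_blocks   # step 1: select the block
        tag = block_addr // self.num_blocks
        if self.valid[index] and self.tags[index] == tag:  # step 2: compare tags
            return "hit", self.data[index]
        # Miss (possibly with replacement): fetch the block from memory.
        self.valid[index] = True
        self.tags[index] = tag
        self.data[index] = memory[block_addr]
        return "miss", self.data[index]

memory = {b: f"block-{b}" for b in range(16)}
cache = DirectMappedCache(num_blocks=4, block_size=16)
print(cache.access(0x00, memory))  # ('miss', 'block-0')  cold miss
print(cache.access(0x04, memory))  # ('hit', 'block-0')   same block
print(cache.access(0x40, memory))  # ('miss', 'block-4')  replaces index 0
```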
Example access: data valid, tag OK, so read the offset and return word d.

[Figure: a 1024-entry cache with a valid bit, a tag, and a 16-byte block (bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f) per entry. The address 000000000000000000 0000000001 1100two splits into tag 0...0, index 1, and byte offset 1100two (0xc). Entry 1 is valid with a matching tag and holds the words a, b, c, d, so the access hits and returns word d.]
An Example Cache: DECstation 3100

° Commercial workstation: ~1985
° MIPS R2000 processor (similar to the pipelined machine of Chapter 6)
° Separate instruction and data caches:
  • Direct mapped
  • 64K bytes (16K words) each
  • Block size: 1 word (low spatial locality)
    - Solution: increase the block size (second example below)
DECstation 3100 Cache

[Figure: address (showing bit positions 31-0) split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a 2-bit byte offset (bits 1-0). The 16K-entry cache stores a valid bit, a 16-bit tag, and 32 bits of data per entry; a hit is signaled when the stored tag matches the address tag.]

If a miss occurs, the cache controller stalls the processor and loads the data from main memory.
64KB Cache with 4-word (16-byte) Blocks

[Figure: address (showing bit positions 31 ... 16, 15 ... 4, 3 2 1 0) split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a byte offset. The 4K-entry cache stores a valid bit, a 16-bit tag, and a 128-bit (4-word) data block per entry; on a hit, the block offset drives a multiplexor that selects one of the four 32-bit words.]
Miss rates: 1-word vs. 4-word blocks (cache similar to the DECstation 3100)

                         I-cache     D-cache     Combined
  Program                miss rate   miss rate   miss rate
  gcc   (1-word block)   6.1%        2.1%        5.4%
  spice (1-word block)   1.2%        1.3%        1.2%
  gcc   (4-word block)   2.0%        1.7%        1.9%
  spice (4-word block)   0.3%        0.6%        0.4%
Miss Rate versus Block Size

[Figure 7.12: miss rate (0% to 40%) versus block size (up to 256 bytes) for a direct-mapped cache, with one curve per total cache size (1 KB, 8 KB, 16 KB, 64 KB, 256 KB).]
Extreme Example: 1-block cache

° Suppose we choose block size = cache size. Then there is only one block in the cache.
° Temporal locality says that if an item is accessed, it is likely to be accessed again soon.
  • But it is unlikely to be accessed again immediately!
  • The next access is likely to be a miss.
    - We continually load data into the cache but are forced to discard it before it is used again.
    - This is the worst nightmare of a cache designer: the Ping-Pong Effect.
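The ping-pong effect is easy to demonstrate with a toy model (the function and the access trace below are illustrative, not from the slides):

```python
def one_block_cache_misses(accesses):
    """Count misses for a cache that can hold exactly one block."""
    resident = None   # the single block currently cached
    misses = 0
    for block in accesses:
        if block != resident:
            misses += 1
            resident = block   # evict whatever was there
    return misses

# Alternating between two blocks misses on every single access:
trace = [0, 1, 0, 1, 0, 1, 0, 1]
print(one_block_cache_misses(trace), "misses out of", len(trace))  # 8 out of 8
```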
Block Size and Miss Penalty

° As block size increases, the cost of a miss also increases.
° Miss penalty: the time to fetch the block from the next lower level of the hierarchy and load it into the cache.
° With very large blocks, the increase in miss penalty overwhelms the decrease in miss rate.
° Average access time can be minimized if the memory system is designed right.
Block Size Tradeoff

[Figure: three sketches as a function of block size. Miss penalty rises with block size. Miss rate first falls as larger blocks exploit spatial locality, then rises when having fewer blocks compromises temporal locality. Average access time therefore falls and then rises again (increased miss penalty and miss rate), with a minimum at an intermediate block size.]
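The tradeoff sketched above can be illustrated with a toy cost model (every constant below is made up purely to produce the described curve shapes; none are measurements):

```python
def avg_access_time(block_bytes, hit_time=1.0):
    # Miss penalty grows with block size (transfer takes longer).
    miss_penalty = 50 + block_bytes / 4
    # Miss rate is U-shaped: spatial locality helps at first, then
    # having fewer blocks hurts temporal locality.
    miss_rate = 0.01 + 0.16 / block_bytes + block_bytes / 8192
    return hit_time + miss_rate * miss_penalty

# The average access time has a minimum at an intermediate block size:
for b in (4, 16, 64, 256):
    print(b, round(avg_access_time(b), 3))
```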