1999 ©UCB CS 161 Ch 7: Memory Hierarchy, LECTURE 14. Instructor: L.N. Bhuyan (www.cs.ucr.edu/~bhuyan)
Recap: Machine Organization: the 5 classic components of any computer

Personal Computer:
° Processor (CPU) (active)
  • Control ("brain")
  • Datapath ("brawn")
° Memory (passive): where programs and data live when running
° Devices
  • Input
  • Output

Components of every computer belong to one of these five categories.
Memory Trends (2004 data)

° Users want large and fast memories!
° SRAM access times are 0.5 - 5 ns, at a cost of $4000 to $10,000 per GB.
° DRAM access times are 50 - 70 ns, at a cost of $100 to $200 per GB.
° Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB.
Memory Latency Problem

[Figure: performance (log scale, 1 to 1000) versus time, 1980-2000. Microprocessor performance grows about 60%/yr (2x every 1.5 years); DRAM performance grows about 5%/yr (2x every 15 years). The processor-DRAM performance gap grows about 50% per year: the motivation for a memory hierarchy.]
The Goal: Illusion of large, fast, cheap memory

° Fact: Large memories are slow; fast memories are small.
° How do we create a memory that is large, cheap, and fast (most of the time)?
° Hierarchy of levels:
  • Uses smaller and faster memory technologies close to the processor
  • Fast access time in the highest level of the hierarchy
  • Cheap, slow memory furthest from the processor
° The aim of memory hierarchy design is to achieve an access time close to that of the highest level with a size equal to that of the lowest level.
Recap: Memory Hierarchy Pyramid

[Figure: a pyramid with the Processor (CPU) at the apex and Levels 1, 2, 3, ..., n below, connected by a transfer datapath (bus). Moving down the pyramid, distance from the CPU and the size of memory at each level increase while cost per MB decreases; moving up, distance from the CPU and access time (memory latency) decrease.]
Why Hierarchy Works: Natural Locality

° The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
° Temporal Locality (Locality in Time): recently accessed data tend to be referenced again soon.
° Spatial Locality (Locality in Space): items near a referenced item tend to be referenced soon.

[Figure: probability of reference (0 to 1) plotted against memory address (0 to 2^n - 1), peaking over a small portion of the address space.]
Memory Hierarchy: Terminology

° Hit: the data appears in the upper level (level X). Hit Rate: the fraction of memory accesses found in the upper level.
° Miss: the data must be retrieved from a block in the lower level (block Y). Miss Rate = 1 - Hit Rate.
° Hit Time: the time to access the upper level, which consists of the time to determine hit/miss plus the memory access time.
° Miss Penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor.
° Note: Hit Time << Miss Penalty.
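The terms above combine into the standard average memory access time (AMAT) formula, AMAT = Hit Time + Miss Rate x Miss Penalty. A minimal sketch in Python (the numeric values are illustrative assumptions, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same time units as hit_time."""
    return hit_time + miss_rate * miss_penalty

# Assumed example: 1 ns hit time, 2% miss rate, 100 ns miss penalty.
print(amat(1.0, 0.02, 100.0))  # 3.0 ns per access on average
```

Because Hit Time << Miss Penalty, even a small miss rate can dominate the average.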
Current Memory Hierarchy

[Figure: processor (control, datapath, registers) connected to an L1 cache, an L2 cache, main memory, and secondary memory.]

              Regs     L1 cache  L2 cache  Main memory  Secondary memory
Speed (ns):   1        2         6         100          10,000,000
Size (MB):    0.0005   0.1       1-4       100-1000     100,000
Cost ($/MB):  --       $100      $30      $1           $0.05
Technology:   Regs     SRAM      SRAM      DRAM         Disk
Memory Hierarchy Technology

° Random Access: the access time is the same for all locations (a hardware decoder is used).
° Sequential Access: very slow; data are accessed sequentially, so the access time is location-dependent; treated as I/O (examples: disks and tapes).
° DRAM: Dynamic Random Access Memory
  • High density, low power, cheap, slow
  • Dynamic: needs to be "refreshed" regularly
° SRAM: Static Random Access Memory
  • Low density, high power, expensive, fast
  • Static: contents last "forever" (until power is lost)
Memories: Review

° SRAM:
  • The value is stored on a pair of inverting gates
  • Very fast, but takes up more space than DRAM (4 to 6 transistors per cell)
° DRAM:
  • The value is stored as a charge on a capacitor (must be refreshed)
  • Very small, but slower than SRAM (by a factor of 5 to 10)

[Figure: an SRAM cell (cross-coupled inverters with word line and bit lines B and B-bar) and a DRAM cell (word line, pass transistor, capacitor, bit line).]
How is the hierarchy managed?

° Registers <-> Memory
  • By the compiler (or the assembly-language programmer)
° Cache <-> Main Memory
  • By hardware
° Main Memory <-> Disks
  • By a combination of hardware and the operating system (virtual memory: covered next)
  • By the programmer (files)
Measuring Cache Performance

CPU time = Execution cycles x Clock cycle time

With cache misses:
CPU time = (Execution cycles + Memory-stall cycles) x Clock cycle time

Read-stall cycles = #reads x Read miss rate x Read miss penalty
Write-stall cycles = #writes x Write miss rate x Write miss penalty

Memory-stall cycles = Read-stall cycles + Write-stall cycles
                    = Memory accesses x Miss rate x Miss penalty
                    = #instructions x Misses/instruction x Miss penalty
Example

Q: The cache miss penalty is 50 cycles, and all instructions take 2.0 cycles without memory stalls. Assume a cache miss rate of 2% and 1.33 memory references per instruction (why 1.33? one instruction fetch plus about 0.33 data references per instruction). What is the impact of the cache?

Ans: CPU time = IC x (CPI + Memory-stall cycles per instruction) x cycle time t

Performance including cache misses:
CPU time = IC x (2.0 + (1.33 x 0.02 x 50)) x t = IC x 3.33 x t

For a perfect cache that never misses, CPU time = IC x 2.0 x t.

Hence, memory stalls stretch CPU time by a factor of 3.33/2.0 = 1.67.

But without a memory hierarchy, every reference would go to main memory, and the CPI would increase to 2.0 + (1.33 x 50) = 68.5, a factor of more than 30 times longer.
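The arithmetic in this example is easy to check with a short script (the helper function and its name are ours; the numbers are the slide's):

```python
def effective_cpi(cpi_base, refs_per_instn, miss_rate, miss_penalty):
    # Memory-stall cycles per instruction = refs/instn x miss rate x miss penalty
    return cpi_base + refs_per_instn * miss_rate * miss_penalty

with_cache = effective_cpi(2.0, 1.33, 0.02, 50)  # 2.0 + 1.33 x 0.02 x 50
no_cache = effective_cpi(2.0, 1.33, 1.00, 50)    # every reference misses

print(round(with_cache, 2))  # 3.33
print(round(no_cache, 1))    # 68.5
# Slowdown vs. a perfect cache: with_cache / 2.0, about 1.67x
```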
Cache Organization

(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?

° The answers to (1) and (2) depend on the type, or organization, of the cache.
° In a direct-mapped cache, each memory address is associated with exactly one possible block within the cache.
  • Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.
Simplest Cache: Direct Mapped

[Figure: a 16-block main memory (block addresses 0-15) mapped onto a 4-block direct-mapped cache (cache indexes 0-3).]

• Cache block 0 can be occupied by data from memory blocks 0, 4, 8, 12 (addresses 0000two, 0100two, 1000two, 1100two).
• Cache block 1 can be occupied by data from memory blocks 1, 5, 9, 13.
• Block size = 32/64 bytes.
Simplest Cache: Direct Mapped

[Figure: main memory blocks 0-15 mapped onto a 4-block direct-mapped cache; memory blocks 0010two, 0110two, 1010two, 1110two all map to cache index 2. The memory block address divides into a tag and an index.]

° The index determines the block's location in the cache.
° index = (memory block address) mod (# of cache blocks)
° If the number of cache blocks is a power of 2, then the cache index is just the lower n bits of the memory block address, where n = log2(# of blocks).
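The mapping above can be sketched in code: with a power-of-two number of blocks, the index is just the low-order bits of the block address (the 4-block configuration matches the example cache; the function name is ours):

```python
NUM_BLOCKS = 4                             # as in the 4-block example cache
INDEX_BITS = NUM_BLOCKS.bit_length() - 1   # log2(4) = 2

def split_block_address(block_addr):
    index = block_addr % NUM_BLOCKS   # same as block_addr & 0b11
    tag = block_addr >> INDEX_BITS    # remaining upper bits
    return tag, index

# Memory blocks 0, 4, 8, 12 all map to cache index 0, with different tags:
for b in (0, 4, 8, 12):
    print(b, split_block_address(b))  # tags 0, 1, 2, 3; index always 0
```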
Simplest Cache: Direct Mapped with Tag

[Figure: the same 4-block direct-mapped cache, with each cache line holding a tag and data; e.g., cache index 2 holds tag 11two, identifying memory block 1110two.]

° The tag determines which memory block currently occupies the cache block.
° tag = the left-hand (upper) bits of the address.
° Hit: the cache tag field = the tag bits of the address.
° Miss: the cache tag field ≠ the tag bits of the address.
Finding the Item within a Block

° In reality, a cache block consists of a number of bytes/words (32 or 64 bytes) to (1) increase the cache hit rate, thanks to the locality property, and (2) reduce the cache miss time.
° Mapping: memory block i can be mapped to cache block frame (i mod x), where x is the number of blocks in the cache.
  - This is called congruent mapping.
° Given the address of an item, the index tells which block of the cache to look in.
° Then, how do we find the requested item within the cache block?
° Or, equivalently, "What is the byte offset of the item within the cache block?"
Issues with Direct-Mapped Caches

° If the block size > 1, the rightmost bits of the index are really the offset within the indexed block:

  ttttttttttttttttt iiiiiiiiii oooo

  • tag: to check whether we have the correct block
  • index: to select the block
  • byte offset: to select the byte within the block
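This three-field split can be sketched in code. The configuration below is a hypothetical example (16-byte blocks and 1024 blocks are assumptions chosen for illustration); the field widths follow directly from those choices:

```python
OFFSET_BITS = 4   # 16-byte blocks -> 4 byte-offset bits (assumed)
INDEX_BITS = 10   # 1024 blocks -> 10 index bits (assumed)

def decompose(addr):
    """Split a byte address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(decompose(0x0000001C))  # (0, 1, 12): tag 0, index 1, byte offset 0xc
```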
Accessing Data in a Direct-Mapped Cache

Three types of events:
° Cache miss: nothing is in the cache in the appropriate block, so fetch the block from memory.
° Cache hit: the cache block is valid and contains the proper address, so read the desired word.
° Cache miss with block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory.

Cache access procedure:
(1) Use the index bits to select the cache block.
(2) If the valid bit is 1, compare the tag bits of the address with the cache block's tag bits.
(3) If they match, use the offset to read out the word/byte.
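The access procedure can be sketched as a small simulation (the class, its names, and the string payloads are illustrative stand-ins, not real hardware state):

```python
class DirectMappedCache:
    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.valid = [False] * num_blocks
        self.tags = [None] * num_blocks
        self.data = [None] * num_blocks   # one block payload per cache line

    def access(self, addr, memory):
        block_addr = addr // self.block_size
        index = block_addr % self.num_blocks   # step 1: select the block
        tag = block_addr // self.num_blocks
        if self.valid[index] and self.tags[index] == tag:  # step 2: compare tags
            return "hit", self.data[index]
        # Miss (possibly with replacement): fetch the block from memory.
        self.valid[index] = True
        self.tags[index] = tag
        self.data[index] = memory[block_addr]
        return "miss", self.data[index]

memory = {b: f"block-{b}" for b in range(16)}
cache = DirectMappedCache(num_blocks=4, block_size=16)
print(cache.access(0x00, memory))  # ('miss', 'block-0')  cold miss
print(cache.access(0x04, memory))  # ('hit', 'block-0')   same block
print(cache.access(0x40, memory))  # ('miss', 'block-4')  replaces index 0
```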
Example access: data valid, tag OK, so read the offset and return word d.

[Figure: a 1024-entry cache with a valid bit, a tag, and a 16-byte block (bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f) per entry. The address 000000000000000000 0000000001 1100two splits into tag 0...0, index 1, and byte offset 1100two (0xc). Entry 1 is valid with a matching tag and holds the words a, b, c, d, so the access hits and returns word d.]
An Example Cache: DECstation 3100

° Commercial workstation: ~1985
° MIPS R2000 processor (similar to the pipelined machine of Chapter 6)
° Separate instruction and data caches:
  • Direct mapped
  • 64K bytes (16K words) each
  • Block size: 1 word (low spatial locality)
    - Solution: increase the block size (second example below)
DECstation 3100 Cache

[Figure: address (showing bit positions 31-0) split into a 16-bit tag (bits 31-16), a 14-bit index (bits 15-2), and a 2-bit byte offset (bits 1-0). The 16K-entry cache stores a valid bit, a 16-bit tag, and 32 bits of data per entry; a hit is signaled when the stored tag matches the address tag.]

If a miss occurs, the cache controller stalls the processor and loads the data from main memory.
64KB Cache with 4-word (16-byte) Blocks

[Figure: address (showing bit positions 31 ... 16, 15 ... 4, 3 2 1 0) split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a byte offset. The 4K-entry cache stores a valid bit, a 16-bit tag, and a 128-bit (4-word) data block per entry; on a hit, the block offset drives a multiplexor that selects one of the four 32-bit words.]
Miss rates: 1-word vs. 4-word blocks (cache similar to the DECstation 3100)

                         I-cache     D-cache     Combined
  Program                miss rate   miss rate   miss rate
  gcc   (1-word block)   6.1%        2.1%        5.4%
  spice (1-word block)   1.2%        1.3%        1.2%
  gcc   (4-word block)   2.0%        1.7%        1.9%
  spice (4-word block)   0.3%        0.6%        0.4%
Miss Rate versus Block Size

[Figure 7.12: miss rate (0% to 40%) versus block size (up to 256 bytes) for a direct-mapped cache, with one curve per total cache size (1 KB, 8 KB, 16 KB, 64 KB, 256 KB).]
Extreme Example: 1-block cache

° Suppose we choose block size = cache size. Then there is only one block in the cache.
° Temporal locality says that if an item is accessed, it is likely to be accessed again soon.
  • But it is unlikely to be accessed again immediately!
  • The next access is likely to be a miss.
    - We continually load data into the cache but are forced to discard it before it is used again.
    - This is the worst nightmare of a cache designer: the Ping-Pong Effect.
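The ping-pong effect is easy to demonstrate with a toy model (the function and the access trace below are illustrative, not from the slides):

```python
def one_block_cache_misses(accesses):
    """Count misses for a cache that can hold exactly one block."""
    resident = None   # the single block currently cached
    misses = 0
    for block in accesses:
        if block != resident:
            misses += 1
            resident = block   # evict whatever was there
    return misses

# Alternating between two blocks misses on every single access:
trace = [0, 1, 0, 1, 0, 1, 0, 1]
print(one_block_cache_misses(trace), "misses out of", len(trace))  # 8 out of 8
```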
Block Size and Miss Penalty

° As block size increases, the cost of a miss also increases.
° Miss penalty: the time to fetch the block from the next lower level of the hierarchy and load it into the cache.
° With very large blocks, the increase in miss penalty overwhelms the decrease in miss rate.
° Average access time can be minimized if the memory system is designed right.
Block Size Tradeoff

[Figure: three sketches as a function of block size. Miss penalty rises with block size. Miss rate first falls as larger blocks exploit spatial locality, then rises when having fewer blocks compromises temporal locality. Average access time therefore falls and then rises again (increased miss penalty and miss rate), with a minimum at an intermediate block size.]
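The tradeoff sketched above can be illustrated with a toy cost model (every constant below is made up purely to produce the described curve shapes; none are measurements):

```python
def avg_access_time(block_bytes, hit_time=1.0):
    # Miss penalty grows with block size (transfer takes longer).
    miss_penalty = 50 + block_bytes / 4
    # Miss rate is U-shaped: spatial locality helps at first, then
    # having fewer blocks hurts temporal locality.
    miss_rate = 0.01 + 0.16 / block_bytes + block_bytes / 8192
    return hit_time + miss_rate * miss_penalty

# The average access time has a minimum at an intermediate block size:
for b in (4, 16, 64, 256):
    print(b, round(avg_access_time(b), 3))
```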