
Page 1:

Memory Hierarchy & Cache Memory
MAMAS – Computer Architecture 234367
Lectures 3-4: Memory Hierarchy and Cache Memories
Dr. Avi Mendelson, 3/2005

Some of the slides were taken from: (1) Lihu Rapoport, (2) Randy Katz and (3) Patterson

Page 2:

Technology Trends

DRAM generations:

Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

Over this period capacity improved 1000:1, while cycle time improved only about 2:1.

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years

Page 3:

Processor-DRAM Memory Gap (latency)

Processor-Memory Performance Gap: grows 50% / year

[Figure: performance on a log scale (1 to 1000) vs. time (1980-2000); CPU performance climbs steeply while DRAM latency improves slowly, so the gap widens every year.]

Page 4:

Why can’t we build Memory at the same frequency as Logic?

1. It is too expensive to build a large memory out of that technology.

2. The size of a memory determines its access time: the larger, the slower.

We do not aim for the best-performance solution. We aim for the best COST-EFFECTIVE solution (the best performance for a given amount of money).

Page 5:

Important observation – programs exhibit locality (and we can help it)

Temporal Locality (Locality in Time):

– If an item is referenced, it will tend to be referenced again soon

– Example: code and variables in loops

=> Keep most recently accessed data items closer to the processor

Spatial Locality (Locality in Space):

– If an item is referenced, nearby items tend to be referenced soon

– Example: scanning an array

=> Move blocks of contiguous words closer to the processor

• Locality + "smaller HW is faster" + Amdahl's law => memory hierarchy (see the sketch below)
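To make the two kinds of locality concrete, here is a minimal C sketch (illustrative only, not from the original slides; the function name and types are mine):

    #include <stddef.h>

    /* Spatial locality: a linear scan touches a[i] and then a[i+1],
       which usually lives in the same (or the next) cache block.   */
    long sum_array(const int *a, size_t n) {
        long sum = 0;                      /* temporal locality: sum, i  */
        for (size_t i = 0; i < n; i++)     /* and the loop code itself   */
            sum += a[i];                   /* are reused every iteration */
        return sum;
    }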

Page 6:

The Goal: illusion of large, fast, and cheap memory

Fact: large memories are slow; fast memories are small.

How do we create a memory that is large, cheap and fast (most of the time)? Hierarchy:

[Diagram: CPU → Level 1 → Level 2 → Level 3 → Level 4; moving away from the CPU, speed goes from fastest to slowest, size from smallest to biggest, and cost from highest to lowest.]

Page 7:

Levels of the Memory Hierarchy

Level         Capacity     Access Time    Cost                  Transfer unit (staged by)
Registers     100s bytes   < 10s ns       -                     Instr. operands, 1-8 bytes (prog./compiler)
Cache         K bytes      10-100 ns      $.01-.001/bit         Blocks, 8-128 bytes (cache controller)
Main Memory   M bytes      100 ns - 1 µs  $.01-.001             Pages, 512-4K bytes (OS)
Disk          G bytes      ms             10^-3 - 10^-4 cents   Files, Mbytes (user/operator)
Backup        infinite     sec-min        10^-6                 -

Toward the upper level (the CPU): faster; toward the lower level: larger.

Page 8:

Simple performance evaluation

Suppose we have a processor that can execute one instruction per cycle as long as it works out of the first level of the memory hierarchy (i.e., it hits in L1).

Example: if the information is not found in the first level, the CPU waits 10 cycles; if it is found only in the third level, it costs another 100 cycles.

[Diagram: CPU → Level 1 → Level 2 → Level 3]

Page 9:

Cache Performance

CPU time = (CPU execution cycles + Memory stall cycles) × cycle time

Memory stall cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty

Memory stall cycles = Memory accesses × Miss rate × Miss penalty

CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Misses per instruction = Memory accesses per instruction × Miss rate

CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
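As a sketch, the last formula translates directly into C (the function and parameter names are mine, not the slides'):

    /* CPU time per the formula above:
       ic            - instruction count (IC)
       cpi_exec      - CPI_execution
       mem_per_instr - memory accesses per instruction
       miss_rate     - fraction of accesses that miss
       miss_penalty  - cycles lost per miss
       cycle_time    - seconds per clock cycle               */
    double cpu_time(double ic, double cpi_exec, double mem_per_instr,
                    double miss_rate, double miss_penalty, double cycle_time) {
        double misses_per_instr = mem_per_instr * miss_rate;
        return ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;
    }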

Page 10:

Example

Consider a program that executes 10×10^6 instructions with CPI = 1. Each instruction causes (on average) 0.5 accesses to data; 95% of all accesses hit in L1; 50% of the accesses that reach L2 miss there and must be looked up in L3. What is the slowdown due to the memory hierarchy?

Solution: The program generates 15×10^6 memory accesses (10×10^6 instruction fetches plus 5×10^6 data accesses), which could be served in 10×10^6 cycles if all the information were in level 1. 0.05 × 15×10^6 = 750,000 accesses go to L2, and half of those, 375,000, continue to L3.

New cycles = 10×10^6 + 10 × 750,000 + 100 × 375,000 = 55×10^6

It is a 5.5× slowdown!!!!!
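The arithmetic is easy to check with a short C program (the 10- and 100-cycle penalties come from the example on page 8):

    #include <stdio.h>

    int main(void) {
        double instr    = 10e6;            /* instructions, CPI = 1       */
        double accesses = instr * 1.5;     /* 1 fetch + 0.5 data / instr  */
        double to_l2    = accesses * 0.05; /* 5% of accesses miss L1      */
        double to_l3    = to_l2 * 0.5;     /* half of those miss L2 too   */
        double cycles   = instr + 10.0 * to_l2 + 100.0 * to_l3;
        printf("cycles = %.0f  slowdown = %.1fx\n", cycles, cycles / instr);
        /* prints: cycles = 55000000  slowdown = 5.5x */
        return 0;
    }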

Page 11:

The first level of the memory hierarchy: Cache memories – Main Idea

At this point we assume only two levels of memory hierarchy: Main memory and cache memory

For simplicity we also assume that the whole program (data and instructions) resides in main memory.

The cache memory(ies) is part of the processor:
– Same technology
– Speed: same order of magnitude as accessing registers

Relatively small and expensive. Acts like a HASH function: it holds parts of the programs' address spaces. It needs to achieve:
– Fast access time
– Fast search mechanism
– Fast replacement mechanism
– High hit ratio

Page 12:

Cache - Main Idea (cont)

When the processor needs an instruction or data, it first tries to find it in the cache. If that fails, it brings the data from main memory into the cache and uses it from there.

The address space (or main memory) is partitioned into blocks:
– Typical block size is 32, 64 or 128 bytes
– The block address is the address of the first byte in the block
– A block address is aligned (a multiple of the block size)

The cache holds lines; each line holds a block:
– Need to determine which line the block is mapped to (if at all)
– A block may not exist in the cache - a cache miss

If we miss the cache:
– The entire block is fetched into a line fill buffer (may require a few bus cycles) and then put into the cache
– Before putting the new block in the cache, another block may need to be evicted from the cache (to make room for the new block)

Page 13:

Memory Hierarchy: Terminology

For each memory level we can define the following:

– Hit: the data appears in that memory level

– Hit Rate: the fraction of memory accesses which are hits

– Hit Time: the time to access the memory level (also includes the time to determine hit/miss)

– Miss: the data needs to be retrieved from the lower level

– Miss Rate = 1 - (Hit Rate)

– Miss Penalty: the time to replace a block in the current level + the time to deliver the data to the processor

Average memory-access time:

t_effective = (Hit time × Hit rate) + (Miss time × Miss rate)
            = (Hit time × Hit rate) + (Miss time × (1 - Hit rate))

– If the hit rate is close to 1, t_effective is close to the hit time.

Page 14:

Four Questions for Memory Hierarchy Designers

In order to increase efficiency, we move data between the different levels of memory in blocks; e.g., pages in main memory.

In order to achieve that we need to answer (at least) 4 questions:

Q1: Where can a block be placed when brought? (Block placement)

Q2: How is a block found when needed? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

Page 15:

Q1-2: Where can a block be placed and how can we find it?

Direct Mapped: Each block has only one place that it can appear in the cache.

Fully associative: Each block can be placed anywhere in the cache.

Set associative: Each block can be placed in a restricted set of places in the cache.
– If there are n blocks in a set, the cache placement is called n-way set associative.

What is the associativity of a direct mapped cache?

Page 16:

Fully Associative Cache

[Diagram: the address (bits 31..0) is split into Tag = Block# (bits 31..5) and Line Offset (bits 4..0); every entry of the tag array is compared (=) against the block# in parallel, and a match selects the line in the data array, producing hit + data.]

An address is partitioned into:
– the offset within the block
– the block number

Each block may be mapped to any of the cache lines:
– need to look up the block in all lines

Each cache line has a tag:
– the tag is compared to the block number
– If one of the tags matches the block#, we have a hit, and the line is accessed according to the line offset
– needs a comparator per line
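A minimal C model of a fully associative lookup (illustrative; the structure names and the 64-line size are invented for the example):

    #include <stdbool.h>
    #include <stdint.h>

    #define NLINES      64     /* number of lines - arbitrary choice */
    #define OFFSET_BITS 5      /* 32-byte blocks                     */

    struct line { bool valid; uint32_t tag; uint8_t data[1 << OFFSET_BITS]; };
    static struct line cache[NLINES];

    /* In hardware all NLINES tag comparisons happen in parallel
       (one comparator per line); this loop only models that logic. */
    bool lookup(uint32_t addr, uint8_t *out) {
        uint32_t block = addr >> OFFSET_BITS;             /* tag = block# */
        uint32_t off   = addr & ((1u << OFFSET_BITS) - 1);
        for (int i = 0; i < NLINES; i++) {
            if (cache[i].valid && cache[i].tag == block) {
                *out = cache[i].data[off];                /* hit          */
                return true;
            }
        }
        return false;                                     /* miss         */
    }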

Page 17:

Fully associative - Cont

Advantages:
– Good utilization of the area, since any block in main memory can be mapped to any cache line.

Disadvantages:
– A lot of hardware.
– Complicated hardware that slows down the access time.

Page 18:

Direct Mapped Cache

The l.s. bits of the block number determine which cache line the block is mapped to - they are called the set number.

– Each block is mapped to a single line in the cache.

– If a block is mapped to the same line as another block, it will replace it.

The rest of the block number bits are used as a tag:

– Compared to the tag stored in the cache for the appropriate set.

[Diagram: the address is split into Tag (bits 31..14), Set (bits 13..5, selecting one of 512 sets) and Line Offset (bits 4..0); the set# indexes the tag array, a single comparator (=) checks the stored tag against the address tag, and the cache storage supplies the line.]
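The field extraction is easy to express in C (a sketch; the widths follow the diagram above - 32-byte lines and 512 sets - and are example parameters, not fixed by the architecture):

    #include <stdint.h>

    #define OFFSET_BITS 5   /* 32-byte line */
    #define SET_BITS    9   /* 512 sets     */

    static inline uint32_t offset_of(uint32_t a) {
        return a & ((1u << OFFSET_BITS) - 1);
    }
    static inline uint32_t set_of(uint32_t a) {
        return (a >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    }
    static inline uint32_t tag_of(uint32_t a) {
        return a >> (OFFSET_BITS + SET_BITS);
    }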

Page 19:

Direct Mapped Cache (cont)

Memory is conceptually divided into slices whose size is the cache size.

The offset from the slice start indicates the position in the cache (the set).

Addresses with the same offset map into the same line.

One tag per line is kept.

Advantages:
– Easy hit/miss resolution
– Easy replacement algorithm
– Lowest power and complexity

Disadvantage:
– Excessive line replacement due to "conflict misses"

[Diagram: memory drawn as consecutive cache-size slices of lines 1..n; the addresses marked x, one per slice at the same offset, all map to the same set.]

Page 20:

2-Way Set Associative Cache

Each set holds two lines (way 0 and way 1).

Each block can be mapped into one of the two lines in the appropriate set.

[Diagram: the address is split into Tag (bits 31..13), Set (bits 12..5) and Line Offset (bits 4..0); the set# indexes two tag arrays (way 0 and way 1), each backed by its own cache storage.]

Example:
– Line size: 32 bytes
– Cache size: 16 KB
– # of lines: 512
– # of sets: 256
– Offset bits: 5
– Set bits: 8
– Tag bits: 19

Address 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000

Offset: 1 1000 = 0x18 = 24
Set: 1011 0011 = 0xB3 = 179
Tag: 000 1001 0001 1010 0010 = 0x091A2
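This decomposition is easy to verify in C (field widths exactly as in the example above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & 0x1F;         /* 5 offset bits */
        uint32_t set    = (addr >> 5) & 0xFF;  /* 8 set bits    */
        uint32_t tag    = addr >> 13;          /* 19 tag bits   */
        printf("offset=0x%X set=0x%X tag=0x%X\n", offset, set, tag);
        /* prints: offset=0x18 set=0xB3 tag=0x91A2 */
        return 0;
    }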

Page 21:

2-Way Cache - Hit Decision

[Diagram: the address is split into Tag, Set and Line Offset; the set# indexes both ways, each way's stored tag is compared (=) with the address tag, the comparator outputs form the Hit/Miss signal, and a MUX selects Data Out from the matching way.]

Page 22:

2-Way Set Associative Cache (cont)

Memory is conceptually divided into slices whose size is 1/2 of the cache size (the way size).

The offset from the slice start indicates the set#. Each set now contains two potential lines!

Addresses with the same offset map into the same set.

Two tags per set, one tag per line, are needed.

[Diagram: memory drawn as consecutive way-size slices of lines 1..n; the addresses marked x, at the same offset in different slices, all map to the same set.]

Page 23:

What happens on a Cache miss?

Read miss:
– Cache line fill - fetch the entire block that contains the missing data from memory
– The block is fetched into the cache line fill buffer
– The fetch may take a few bus cycles to complete; e.g., with a 64-bit (8-byte) data bus and a 32-byte cache line it takes 4 bus cycles
– Once the entire line is fetched, it is moved from the fill buffer into the cache

What happens on a write miss?
– The processor does not wait for the data; it continues its work
– 2 options: write allocate and write no allocate
– Write allocate: fetch the line into the cache. Assumes that we may read from the line soon. Goes with a write-back policy (hoping that subsequent writes to the line hit the cache).
– Write no allocate: do not fetch the line into the cache on a write miss. Goes with a write-through policy (subsequent writes would update memory anyhow). The sketch below contrasts the two options.
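A schematic C fragment contrasting the two write-miss policies (the helper functions are hypothetical stand-ins for cache-controller actions, not a real API):

    #include <stdint.h>

    #define OFFSET_BITS 5   /* 32-byte line, as in the example above */

    /* Hypothetical helpers - stand-ins for the cache controller.   */
    void fetch_block_into_cache(uint32_t block);
    void write_word_in_cache(uint32_t addr, uint32_t d);  /* marks line dirty */
    void write_word_to_memory(uint32_t addr, uint32_t d); /* via write buffer */

    void handle_write_miss(uint32_t addr, uint32_t data, int write_allocate) {
        if (write_allocate) {
            /* Write allocate (usually paired with write back):
               bring the line in, then update it in the cache.  */
            fetch_block_into_cache(addr >> OFFSET_BITS);
            write_word_in_cache(addr, data);
        } else {
            /* Write no allocate (usually paired with write through):
               update memory only; the cache is left untouched.     */
            write_word_to_memory(addr, data);
        }
    }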

Page 24:

Replacement

Each line contains a Valid indication.

Direct mapped: simple, a line can be brought to only one place
– The old line is evicted (written back to memory, if needed)

n ways: need to choose among the ways in the set
– Options: FIFO, LRU, Random, Pseudo-LRU
– LRU is the best (on average)

LRU:
– 2 ways: requires 1 bit per set to mark the latest accessed way
– 4 ways: need to save the full ordering
– Fully associative: the full ordering cannot be saved (too many bits) => approximate LRU

Page 25:

Implementing LRU in a k-way set associative cache

For each set hold a k×k bit matrix.

– Initialization (row j starts with j-1 ones, followed by zeros):

  row 1: 0 0 0 0 ... 0
  row 2: 1 0 0 0 ... 0
  row 3: 1 1 0 0 ... 0
  ...
  row k: 1 1 1 ... 1 0

When line j (1 ≤ j ≤ k) is accessed:
– Set all bits in row j to 1 (done in parallel by hardware)
– THEN reset all bits in column j to 0 (in the same cycle), so row j becomes 1...1 0 1...1

Evict the row with ALL "0".
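A small C model of this matrix scheme (illustrative; ways are 0-indexed here, unlike the 1-indexed slides, and K = 4 is an example associativity):

    #include <stdint.h>

    #define K 4   /* associativity - example value */

    /* One K x K bit matrix per set; row j all-zero <=> way j is LRU. */
    typedef struct { uint8_t m[K][K]; } lru_matrix;

    void lru_access(lru_matrix *s, int j) {
        for (int c = 0; c < K; c++) s->m[j][c] = 1;  /* set row j    */
        for (int r = 0; r < K; r++) s->m[r][j] = 0;  /* clear col j  */
    }

    int lru_victim(const lru_matrix *s) {
        for (int r = 0; r < K; r++) {
            int all_zero = 1;
            for (int c = 0; c < K; c++)
                if (s->m[r][c]) { all_zero = 0; break; }
            if (all_zero) return r;                  /* row of all 0s */
        }
        return 0;   /* not reached once the matrix is initialized */
    }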

Page 26:

Pseudo LRU

We will use a 4-way set associative cache as an example.

Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on).

Pseudo LRU (PLRU) records only a partial order, using 3 bits per set:
– Bit0 specifies whether the LRU way is one of {0, 1} or one of {2, 3}
– Bit1 specifies which of ways 0 and 1 was least recently used
– Bit2 specifies which of ways 2 and 3 was least recently used

For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit0=1, bit1=1, bit2=1.

[Diagram: a binary decision tree over the 4 ways; bit0 at the root selects between the {0, 1} and {2, 3} pairs, and bit1 / bit2 select a way within each pair.]
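A C sketch of tree-PLRU for 4 ways (a common encoding; the slide's bit polarity may differ, so treat the 0/1 conventions below as one possible choice):

    /* bit0: 0 -> LRU pair is {0,1}, 1 -> LRU pair is {2,3};
       bit1: LRU way within {0,1};  bit2: LRU way within {2,3}. */
    typedef struct { unsigned bit0 : 1, bit1 : 1, bit2 : 1; } plru4;

    void plru_access(plru4 *p, int way) {
        if (way < 2) {
            p->bit0 = 1;            /* LRU pair is now {2,3}       */
            p->bit1 = (way == 0);   /* the sibling becomes pair-LRU */
        } else {
            p->bit0 = 0;            /* LRU pair is now {0,1}       */
            p->bit2 = (way == 2);
        }
    }

    int plru_victim(const plru4 *p) {
        return p->bit0 ? (p->bit2 ? 3 : 2) : (p->bit1 ? 1 : 0);
    }

With the access order 3, 0, 2, 1 from the slide, plru_victim correctly returns way 3, the true LRU.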

Page 27:

Write Buffer for Write Through

A write buffer is needed between the cache and memory:
– The write buffer is just a FIFO
– Processor: writes data into the cache and into the write buffer
– Memory controller: writes the contents of the buffer to memory

– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
– If store frequency (w.r.t. time) > 1 / DRAM write cycle for a long period of time (the CPU cycle time is too quick and/or there are too many store instructions in a row), the store buffer will overflow no matter how big you make it - unless the CPU cycle time is ≥ the DRAM write cycle time

Write combining: combine writes in the write buffer.
On a cache miss we need to look up the write buffer.

[Diagram: Processor → Cache, with a Write Buffer between the cache and DRAM.]

Page 28:

Improving Cache Performance

Separate the data cache from the instruction cache (will be discussed in future lectures).

Reduce the miss rate:
– In order to reduce misses, we need to understand why misses happen.

Reduce the miss penalty:
– Bring the information to the processor as soon as possible.

Reduce the time to hit in the cache:
– By Amdahl's law, since most of the time we hit in the cache, it is important to accelerate the hit path.

Page 29:

Classifying Misses: 3 Cs

Compulsory:
– The first access to a block cannot find it in the cache, so the block must be brought into the cache
– Also called cold-start misses or first-reference misses
– Occur even in an infinite cache
– Solution (for a fixed cache-line size): prefetching

Capacity:
– The cache cannot contain all the blocks needed during program execution (also termed: the working set of the program is too big); blocks are evicted and later retrieved
– Solution: increase the cache size, stream buffers, software solutions

Conflict:
– Occurs in set associative or direct mapped caches when too many blocks map to the same set
– Also called collision misses or interference misses
– Solution: increase associativity, victim cache, linker optimizations

Page 30:

3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way and 8-way associativity; the conflict component shrinks as associativity grows, the capacity component dominates, and the compulsory component is vanishingly small.]

Page 31:

How Can We Reduce Misses?

3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:

1) Change the block size: which of the 3Cs is obviously affected?

2) Change the associativity: which of the 3Cs is obviously affected?

3) Change the compiler: which of the 3Cs is obviously affected?

Page 32:

Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K and 256K; larger blocks reduce the miss rate up to a point, beyond which it rises again for the smaller caches.]

Page 33:

Reduce Misses via Higher Associativity

We have two conflicting trends here. Higher associativity:
– improves the hit ratio
BUT
– increases the access time
– slows down replacement
– increases complexity

Most modern cache memory systems use at least 4-way set associative cache memories.

Page 34:

Example: Avg. Memory Access Time vs. Miss Rate

Example: assume a cache access time of 1.10 for 2-way, 1.12 for 4-way and 1.14 for 8-way, relative to the CAT of a direct mapped cache.

Cache Size (KB)   1-way   2-way   4-way   8-way
  1               2.33    2.15    2.07    2.01
  2               1.98    1.86    1.76    1.68
  4               1.72    1.67    1.61    1.53
  8               1.46    1.48    1.47    1.43
 16               1.29    1.32    1.32    1.32
 32               1.20    1.24    1.25    1.27
 64               1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

Effective access time to the cache (in the original slide, red entries mark the cases that are not improved by more associativity).

Note: this is for a specific example.

Page 35:

Reducing Miss Penalty by Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:

– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution

– Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.

Example:
– 64-bit = 8-byte bus, 32-byte cache line => 4 bus cycles to fill a line
– Fetch data from address 95H: the chunk 90H-97H (which contains 95H) is transferred first, then 98H-9FH, and the fetch wraps around to 80H-87H and 88H-8FH
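A tiny C sketch of how the wrapped burst order can be computed (bus and line widths per the example above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        const uint32_t line = 32, chunk = 8;   /* 32-byte line, 8-byte bus */
        uint32_t addr  = 0x95;
        uint32_t base  = addr & ~(line - 1);   /* line base:     0x80      */
        uint32_t first = addr & ~(chunk - 1);  /* critical chunk: 0x90     */

        for (uint32_t i = 0; i < line / chunk; i++) {
            uint32_t off = (first - base + i * chunk) % line;  /* wraps */
            printf("cycle %u: %02X-%02X\n", i + 1,
                   base + off, base + off + chunk - 1);
        }
        /* prints: 90-97, 98-9F, 80-87, 88-8F */
        return 0;
    }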

Page 36:

Prefetchers

In order to avoid compulsory misses, we need to bring the information in before it is requested by the program.

We can exploit the program's locality-of-reference behavior:
– Space -> bring the surroundings of the current access.
– Time -> the same "patterns" repeat themselves.

Prefetching relies on having extra memory bandwidth that can be used without penalty.

There are hardware and software prefetchers.

Page 37:

Hardware Prefetching

Instruction prefetching:
– The Alpha 21064 fetches 2 blocks on a miss
– The extra block is placed in a stream buffer, to avoid possible cache pollution in case the prefetched instructions turn out not to be required
– On a miss, check the stream buffer
– Branch-predictor-directed prefetching: let the branch predictor run ahead

Data prefetching - try to predict future data accesses:
– Next sequential
– Stride
– General pattern

Page 38:

Software Prefetching

Data prefetch:
– Load data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
– Special prefetching instructions cannot cause faults; a form of speculative execution

How it is done:
– A special prefetch intrinsic in the language (see the sketch below)
– Automatically by the compiler

Issuing prefetch instructions takes time:
– Is the cost of issuing prefetches < the savings from reduced misses?
– Higher superscalar width reduces the difficulty of finding issue bandwidth for them
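For instance, with GCC or Clang a software prefetch can be issued through the __builtin_prefetch intrinsic; a minimal sketch (the distance of 16 elements ahead is an arbitrary tuning choice):

    #include <stddef.h>

    long sum_with_prefetch(const int *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)   /* read-only hint, low temporal reuse */
                __builtin_prefetch(&a[i + 16], 0, 1);
            sum += a[i];
        }
        return sum;
    }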

Page 39:

Other techniques

Page 40:

Multi-ported cache and Banked Cache

An n-ported cache enables n cache accesses in parallel:
– Parallelizes cache accesses in different pipeline stages
– Parallelizes cache accesses in a super-scalar processor

But it effectively doubles the cache die size.

Possible solution: a banked cache:
– Each line is divided into n banks
– Data can be fetched from k ≤ n different banks (in possibly different lines)

Page 41:

Separate Code / Data Caches

Enables parallelism between data accesses (done in the memory-access stage) and instruction fetch (done in the fetch stage of a pipelined processor).

The code cache is a read-only cache:
– No need to write back a line to memory when it is evicted
– Simpler to manage

What about self-modifying code? (x86 only)
– Whenever executing a memory write, we need to snoop the code cache
– If the code cache contains the written address, the line containing that address is invalidated
– Now the code cache is accessed both in the fetch stage and in the memory-access stage, so the tags need to be dual ported to avoid stalling

Page 42:

Increasing the size with minimum latency loss - L2 cache

L2 is much larger than L1 (256K-1M, compared to 32K-64K).

It used to be an off-chip cache (between the cache and the memory bus). Now most implementations are on-chip (but some architectures have an off-chip level-3 cache).
– If L2 is on-chip, why not just make L1 larger?

L2 can be inclusive:
– All addresses in L1 are also contained in L2
– Data in L1 may be more up to date than in L2
– L2 is unified (code / data)

Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are).

Page 43:

Victim Cache

Problem: the per-set load may be non-uniform:
– some sets may have more conflict misses than others

Solution: allocate ways to sets dynamically, according to the load.

When a line is evicted from the cache it is placed in the victim cache:
– If the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1

On a cache lookup, a victim-cache lookup is also performed (in parallel).

On a victim-cache hit:
– the line is moved back into the cache
– the evicted line is moved into the victim cache
– same access time as a cache hit

Especially effective for direct mapped caches:
– combines the fast hit time of a direct mapped cache with reduced conflict misses

Page 44:

Stream Buffers

Before inserting a new line into the cache, put it in a stream buffer.

The line is moved from the stream buffer into the cache only if we get some indication that it will be accessed again in the future.

Example:
– Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once
– If we insert the array into the cache it will thrash the entire cache
– If we detect that this is a scan-once operation (e.g., using a hint from the software), we can avoid putting the array's lines into the cache

Page 45:

Backup

Page 46:

Compiler issues

Data alignment:
– A misaligned access might span several cache lines
– Prohibited in some architectures (Alpha, SPARC)
– Very slow in others (x86)
– Solution 1: add padding to data structures
– Solution 2: make sure memory allocations are aligned (see the sketch below)

Code alignment:
– A misaligned instruction might span several cache lines
– x86 only. VERY slow.
– Solution: insert NOPs to make sure instructions are aligned
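As an illustration of solution 2, C11 provides aligned_alloc (the 64-byte alignment here is an assumed cache-line size, not universal):

    #include <stdlib.h>

    #define LINE 64   /* assumed cache-line size */

    int main(void) {
        /* size must be a multiple of the alignment for aligned_alloc */
        double *buf = aligned_alloc(LINE, 1024 * sizeof(double));
        if (!buf) return 1;
        /* ... buf now starts on a cache-line boundary ... */
        free(buf);
        return 0;
    }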

Page 47:

Compiler issues 2

Overalignment:
– The alignment of an array can be a multiple of the cache size
– Several arrays then map to the same cache lines
– Excessive conflict misses (thrashing), e.g. in:

    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i] * c[i];

Solution 1: increase cache associativity.
Solution 2: break the alignment (see the sketch below).
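One way to break the alignment is to pad between the arrays so that they start in different sets (a sketch under assumed sizes; the 64-byte PAD is one cache line and N is arbitrary):

    #define N   (1 << 16)
    #define PAD 64   /* shift each following array by one cache line */

    /* With the pads, a, b and c no longer start at addresses that
       differ by an exact multiple of the cache (or way) size.     */
    static double a[N], pad1[PAD / sizeof(double)],
                  b[N], pad2[PAD / sizeof(double)],
                  c[N];

    void madd(void) {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i] * c[i];
    }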