Memory Hierarchy & Cache Memory © Avi Mendelson, 3/2005
MAMAS – Computer Architecture 234367
Lectures 3-4: Memory Hierarchy and Cache Memories
Dr. Avi Mendelson
Some of the slides were taken from: (1) Lihu Rapoport, (2) Randy Katz, and (3) Patterson
Technology Trends

DRAM:
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
(capacity improved 1000:1; cycle time improved only about 2:1)

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years
Processor-DRAM Memory Gap (latency)

[Figure: performance on a log scale (1 to 1000) vs. time, 1980-2000. CPU performance climbs steeply while DRAM performance improves slowly; the processor-memory performance gap grows about 50% per year.]
Why can't we build Memory at the same frequency as Logic?

1. It is too expensive to build a large memory with that technology.
2. The size of a memory determines its access time: the larger, the slower.

We do not aim for the best-performing solution. We aim for the best COST-EFFECTIVE solution (the best performance for a given amount of money).
Important observation – programs exhibit locality (and we can help them)

Temporal Locality (locality in time):
– If an item is referenced, it will tend to be referenced again soon
– Example: code and variables in loops
=> Keep the most recently accessed data items closer to the processor

Spatial Locality (locality in space):
– If an item is referenced, nearby items tend to be referenced soon
– Example: scanning an array
=> Move blocks of contiguous words closer to the processor

Locality + smaller HW is faster + Amdahl's law => memory hierarchy
The Goal: illusion of large, fast, and cheap memory

Fact: large memories are slow; fast memories are small. How do we create a memory that is large, cheap and fast (most of the time)? – Hierarchy:

CPU → Level 1 → Level 2 → Level 3 → Level 4
Speed: fastest ........ slowest
Size:  smallest ....... biggest
Cost:  highest ........ lowest
Levels of the Memory Hierarchy

Level        Capacity        Access Time     Cost                  Staging/Xfer unit   Managed by        Transfer size
Registers    100s of bytes   < 10s ns                              instr. operands     prog./compiler    1-8 bytes
Cache        K bytes         10-100 ns       $.01-.001/bit         blocks              cache controller  8-128 bytes
Main memory  M bytes         100 ns - 1 us   $.01-.001             pages               OS                512 B-4 KB
Disk         G bytes         ms              10^-3 - 10^-4 cents   files               user/operator     Mbytes
Backup       infinite        sec-min         10^-6 cents

Upper levels are smaller and faster; lower levels are larger and slower.
Simple performance evaluation

Suppose we have a processor that can execute one instruction per cycle when working from the first level of the memory hierarchy (i.e., it hits in L1).

Example: if the information is not found in the first level, the CPU waits 10 cycles; if it is found only in the third level, it costs another 100 cycles.

CPU → Level 1 → Level 2 → Level 3
Cache Performance

CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time

Memory stall cycles = Reads × Read miss rate × Read miss penalty
                    + Writes × Write miss rate × Write miss penalty

Combining reads and writes into a single miss rate and penalty:

Memory stall cycles = Memory accesses × Miss rate × Miss penalty

CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time

Since Misses per instruction = Memory accesses per instruction × Miss rate:

CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
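The last formula translates directly into code. A minimal sketch in Python; the workload numbers (IC, miss rate, penalty) are hypothetical, chosen only to illustrate the arithmetic:

```python
def cpu_time(ic, cpi_execution, mem_accesses_per_instr, miss_rate,
             miss_penalty, clock_cycle_time):
    # CPU time = IC x (CPI_execution + misses/instr x miss penalty) x cycle time
    misses_per_instr = mem_accesses_per_instr * miss_rate
    return ic * (cpi_execution + misses_per_instr * miss_penalty) * clock_cycle_time

# 1M instructions, base CPI 1, 1.5 accesses/instr, 2% miss rate,
# 50-cycle miss penalty, 1 ns cycle: stalls add 1.5 * 0.02 * 50 = 1.5 CPI
t = cpu_time(1e6, 1.0, 1.5, 0.02, 50, 1e-9)   # 2.5 ms total
```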
Example

Consider a program that executes 10×10^6 instructions with CPI=1.
Each instruction causes (on average) 0.5 accesses to data; 95% of the accesses hit in L1; 50% of the accesses to L2 miss and so need to be looked up in L3. What is the slowdown due to the memory hierarchy?

Solution:
The program generates 15×10^6 accesses to memory, which could be executed in 10×10^6 cycles if all the information were in level 1.
0.05 × 15×10^6 = 750,000 accesses go to L2, and 375,000 of those go on to L3.
New cycles = 10×10^6 + 10 × 750,000 + 100 × 375,000 = 55×10^6
That is a 5.5× slowdown!
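The arithmetic in this example can be re-checked mechanically; all the numbers below come from the example itself:

```python
# Re-checking the slowdown example step by step.
instructions = 10_000_000
accesses = instructions + instructions // 2    # 1 fetch + 0.5 data per instr = 15M
l2_accesses = int(0.05 * accesses)             # 5% miss L1 -> 750,000
l3_accesses = l2_accesses // 2                 # 50% of L2 accesses miss -> 375,000
cycles = instructions + 10 * l2_accesses + 100 * l3_accesses
slowdown = cycles / instructions               # 55M / 10M = 5.5
```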
The first level of the memory hierarchy: Cache memories – Main Idea

At this point we assume only two levels of memory hierarchy: main memory and cache memory.
For simplicity we also assume that the entire program (data and instructions) is placed in main memory.
The cache memory(ies) is part of the processor:
– Same technology
– Speed: same order of magnitude as accessing registers
It is relatively small and expensive.
It acts like a HASH function: it holds parts of the program's address space. It needs to achieve:
– Fast access time
– A fast search mechanism
– A fast replacement mechanism
– A high hit ratio
Cache – Main Idea (cont)

When the processor needs an instruction or data, it first looks for it in the cache. If that fails, it brings the data from main memory to the cache and uses it from there.

The address space (or main memory) is partitioned into blocks:
– Typical block size is 32, 64 or 128 bytes
– The block address is the address of the first byte in the block
– A block address is aligned (a multiple of the block size)

The cache holds lines; each line holds a block:
– Need to determine which line a block is mapped to (if at all)
– A block may not exist in the cache – a cache miss

If we miss the cache:
– The entire block is fetched into a line fill buffer (which may require a few bus cycles), and then put into the cache
– Before putting the new block in the cache, another block may need to be evicted from the cache (to make room for the new block)
Memory Hierarchy: Terminology

For each memory level we can define the following:
– Hit: the data appears in the memory level
– Hit Rate: the fraction of memory accesses which are hits
– Hit Time: time to access the memory level (including the time to determine hit/miss)
– Miss: the data needs to be retrieved from a lower level
– Miss Rate = 1 − Hit Rate
– Miss Penalty: time to replace a block in the current level + time to deliver the data to the processor

Average memory-access time:
t_effective = (Hit time × Hit rate) + (Miss time × Miss rate)
            = (Hit time × Hit rate) + (Miss time × (1 − Hit rate))
– If the hit rate is close to 1, t_effective is close to the hit time
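The t_effective formula is easy to play with in code; a minimal sketch with illustrative numbers:

```python
def t_effective(hit_time, miss_time, hit_rate):
    # average access time = hit contribution + miss contribution
    return hit_time * hit_rate + miss_time * (1.0 - hit_rate)

# 1-cycle hit, 100-cycle miss, 98% hit rate (illustrative numbers):
t_effective(1, 100, 0.98)   # 2.98 cycles - already close to the hit time
```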
Four Questions for Memory Hierarchy Designers

In order to increase efficiency, we move data in blocks between the different levels of memory; e.g., pages in main memory.
To achieve that we need to answer (at least) 4 questions:

Q1: Where can a block be placed when it is brought in? (Block placement)
Q2: How is a block found when needed? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Q1-2: Where can a block be placed, and how can we find it?

Direct mapped: each block has only one place it can appear in the cache.
Fully associative: each block can be placed anywhere in the cache.
Set associative: each block can be placed in a restricted set of places in the cache.
– If there are n blocks in a set, the cache placement is called n-way set associative

What is the associativity of a direct mapped cache?
Fully Associative Cache

[Figure: address fields – tag (= block#, bits 31:4) and line offset (bits 3:0); a tag array with one comparator per line next to the data array; a match on any tag raises the hit signal and selects the data.]

An address is partitioned into:
– an offset within the block
– a block number

Each block may be mapped to any of the cache lines:
– need to look up the block in all lines

Each cache line has a tag:
– the tag is compared to the block number
– if one of the tags matches the block#, we have a hit, and the line is accessed according to the line offset
– a comparator per line is needed
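The lookup just described can be modeled in a few lines of Python. This is a toy sketch (16-byte blocks, i.e. 4 offset bits, and no fill or eviction logic), not a hardware description:

```python
OFFSET_BITS = 4   # assumed 16-byte blocks

class FullyAssociativeCache:
    def __init__(self):
        # tag -> block data; dict membership models "compare against every line"
        self.lines = {}

    def lookup(self, address):
        tag = address >> OFFSET_BITS            # the whole block number is the tag
        offset = address & ((1 << OFFSET_BITS) - 1)
        if tag in self.lines:                   # hardware: one comparator per line
            return self.lines[tag][offset]      # hit: index the line by the offset
        return None                             # miss

cache = FullyAssociativeCache()
cache.lines[0x1234] = bytes(range(16))          # pretend block 0x1234 was fetched
cache.lookup(0x12345)                           # hit: tag 0x1234, offset 5 -> 5
cache.lookup(0x99990)                           # miss -> None
```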
Fully associative – Cont

Advantages:
– Good utilization of the area, since any block in main memory can be mapped to any cache line

Disadvantages:
– A lot of hardware
– Complicated hardware that slows down the access time
Direct Mapped Cache

The l.s. bits of the block number determine which cache line the block is mapped to – this is called the set number.
– Each block is mapped to a single line in the cache
– If a block is mapped to the same line as another block, it will replace it
The rest of the block number bits are used as a tag:
– Compared to the tag stored in the cache for the appropriate set

[Figure: address split into tag (bits 31:14), set (bits 13:5) and line offset (bits 4:0); the block number is bits 31:5. The set# indexes a 512-set tag array and the cache storage, and a single comparator checks the stored tag against the address tag.]
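Splitting an address according to this direct-mapped layout (32-byte lines and 512 sets, matching the figure) can be sketched as:

```python
OFFSET_BITS, SET_BITS = 5, 9          # 32-byte lines, 512 sets -> a 16 KB cache

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_no = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_no, offset

# Two addresses exactly one cache size (16 KB) apart land in the same set
# with different tags - a conflict in a direct-mapped cache:
split_address(0x0000)   # (0, 0, 0)
split_address(0x4000)   # (1, 0, 0)
```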
Direct Mapped Cache (cont)

Memory is conceptually divided into slices whose size is the cache size.
The offset from the slice start indicates the position in the cache (the set).
Addresses with the same offset map into the same line.
One tag per line is kept.

Advantages:
– Easy hit/miss resolution
– Easy replacement algorithm
– Lowest power and complexity

Disadvantage:
– Excessive line replacement due to "conflict misses"

[Figure: memory drawn as a stack of cache-size slices; the blocks at the same offset in every slice (marked x) all map to the same set X.]
2-Way Set Associative Cache

Each set holds two lines (way 0 and way 1). Each block can be mapped into one of two lines in the appropriate set.

[Figure: address split into tag (bits 31:13), set (bits 12:5) and line offset (bits 4:0); the set# indexes the tag array and cache storage of both way 0 and way 1.]

Example:
Line size:   32 bytes
Cache size:  16 KB
# of lines:  512
# of sets:   256
Offset bits: 5
Set bits:    8
Tag bits:    19

Address 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000
Offset: 1 1000 = 0x18 = 24
Set:    1011 0011 = 0xB3 = 179
Tag:    000 1001 0001 1010 0010 = 0x091A2
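The field split in this example can be verified directly (32-byte lines and 256 sets, as in the example):

```python
OFFSET_BITS, SET_BITS = 5, 8    # 32-byte lines, 256 sets per way

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    set_no = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)
    return tag, set_no, offset

split(0x12345678)   # (0x091A2, 179, 24) - matching the example's numbers
```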
2-Way Cache – Hit Decision

[Figure: the set# selects one entry from each way's tag array; both stored tags are compared in parallel against the address tag; a match in either way raises Hit, and a MUX selects the Data Out from the matching way.]
2-Way Set Associative Cache (cont)

Memory is conceptually divided into slices whose size is 1/2 the cache size (the way size).
The offset from the slice start indicates the set#. Each set now contains two potential lines!
Addresses with the same offset map into the same set.
Two tags per set – one tag per line – are needed.

[Figure: memory drawn as a stack of way-size slices; the blocks at the same offset in every slice (marked x) map to the same set X.]
What happens on a Cache miss?

Read miss:
– Cache line fill – fetch the entire block that contains the missing data from memory
– The block is fetched into the cache line fill buffer
– It may take a few bus cycles to complete the fetch; e.g., with a 64-bit (8-byte) data bus and a 32-byte cache line: 4 bus cycles
– Once the entire line is fetched, it is moved from the fill buffer into the cache

What happens on a write miss?
– The processor does not wait for the data; it continues its work
– 2 options: write allocate and write no allocate
– Write allocate: fetch the line into the cache
   Assumes that we may read from the line soon
   Goes with a write-back policy (hoping that subsequent writes to the line hit the cache)
– Write no allocate: do not fetch the line into the cache on a write miss
   Goes with a write-through policy (subsequent writes would update memory anyhow)
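The two write-miss options can be sketched as a toy policy function. This is a simplification (the cache is a dict keyed by block address, and the fetch of the rest of the line on an allocate is omitted):

```python
def write(cache, memory, block_addr, value, write_allocate):
    if block_addr in cache:
        cache[block_addr] = value          # write hit: update the cached line
        return "hit"
    if write_allocate:
        cache[block_addr] = value          # miss + allocate: line brought into the
        return "miss-allocated"            # cache (fetch of the rest omitted here)
    memory[block_addr] = value             # miss + no-allocate: write through,
    return "miss-not-allocated"            # cache left untouched

cache, memory = {}, {}
write(cache, memory, 0x100, 7, write_allocate=True)    # line 0x100 now cached
write(cache, memory, 0x200, 9, write_allocate=False)   # memory updated, cache not
```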
Replacement

Each line contains a Valid indication.
Direct mapped: simple – a line can be brought to only one place
– The old line is evicted (written back to memory, if needed)
n-way: need to choose among the ways in the set
– Options: FIFO, LRU, Random, Pseudo-LRU
– LRU is the best (on average)
LRU:
– 2 ways: requires 1 bit per set to mark the latest accessed way
– 4 ways: need to save the full ordering
– Fully associative: the full ordering cannot be saved (too many bits) → approximate LRU
Implementing LRU in a k-way set associative cache

For each set, hold a k×k bit matrix.

Initialization (row i starts with i ones, followed by zeros):
row 0:   0 0 0 … 0
row 1:   1 0 0 … 0
row 2:   1 1 0 … 0
…
row k-1: 1 1 1 … 1 0

When line j is accessed:
– Set all bits in row j to 1 (done in parallel by hardware)
– Then reset all bits in column j to 0 (in the same cycle)

Evict the way whose row is ALL "0".
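The matrix scheme above can be modeled directly in software (ways are 0-indexed here):

```python
class MatrixLRU:
    def __init__(self, k):
        self.k = k
        self.m = [[0] * k for _ in range(k)]   # one k x k bit matrix per set

    def access(self, j):
        for c in range(self.k):                # set all bits in row j to 1 ...
            self.m[j][c] = 1
        for r in range(self.k):                # ... then clear all bits in column j
            self.m[r][j] = 0

    def victim(self):
        # the LRU way is the one whose row is all zeros
        return next(r for r in range(self.k) if not any(self.m[r]))

lru = MatrixLRU(4)
for way in (3, 0, 2, 1):
    lru.access(way)
lru.victim()   # 3 - the least recently accessed way
```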
Pseudo LRU

We will use a 4-way set associative cache as an example. Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on).

Pseudo-LRU (PLRU) records only a partial order, using 3 bits per set:
– Bit0 specifies whether the LRU way is one of {0, 1} or one of {2, 3}
– Bit1 specifies which of ways 0 and 1 was least recently used
– Bit2 specifies which of ways 2 and 3 was least recently used

For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit0=1, bit1=1, bit2=1.

[Figure: a binary decision tree – bit0 at the root chooses between the {0,1} and {2,3} pairs; bit1 and bit2 choose within each pair.]
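One way to realize this 3-bit scheme in software is sketched below. The bit polarities are an assumption chosen to reproduce the slide's 3, 0, 2, 1 example (the tree diagram's labels did not survive extraction):

```python
class PLRU4:
    # Bit polarities here are an assumption chosen so that the access
    # sequence 3, 0, 2, 1 yields bit0=bit1=bit2=1, as in the example.
    def __init__(self):
        self.bit0 = self.bit1 = self.bit2 = 0

    def access(self, way):
        if way in (0, 1):
            self.bit0 = 1                      # LRU way is now on the {2,3} side
            self.bit1 = way                    # accessed 1 -> way 0 is pair-LRU
        else:
            self.bit0 = 0                      # LRU way is now on the {0,1} side
            self.bit2 = 1 if way == 2 else 0   # accessed 2 -> way 3 is pair-LRU

    def victim(self):
        if self.bit0:
            return 3 if self.bit2 else 2
        return 0 if self.bit1 else 1

p = PLRU4()
for way in (3, 0, 2, 1):
    p.access(way)
(p.bit0, p.bit1, p.bit2)   # (1, 1, 1)
p.victim()                 # 3 - the true LRU way in this case
```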
Write Buffer for Write Through

A write buffer is needed between the cache and memory:
– The write buffer is just a FIFO
– Processor: writes data into the cache and into the write buffer
– Memory controller: writes the contents of the buffer to memory
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
If store frequency (w.r.t. time) > 1 / DRAM write cycle persists for a long period of time (CPU cycle time too fast and/or too many store instructions in a row), the write buffer will overflow no matter how big you make it.
Write combining: combine writes in the write buffer.
On a cache miss we also need to look up the write buffer.

[Figure: Processor ↔ Cache, with a Write Buffer between them and the DRAM.]
Improving Cache Performance

Separate the data cache from the instruction cache (will be discussed in future lectures).
Reduce the miss rate:
– In order to reduce misses, we need to understand why misses happen
Reduce the miss penalty:
– Bring the information to the processor as soon as possible
Reduce the time to hit in the cache:
– By Amdahl's law, since most of the time we hit in the cache, it is important to accelerate the hit path
Classifying Misses: the 3 Cs

Compulsory:
– The first access to a block cannot be in the cache, so the block must be brought into the cache
– Also called cold-start misses or first-reference misses
– These are misses even in an infinite cache
– Solution (for a fixed cache-line size): prefetching

Capacity:
– The cache cannot contain all the blocks needed during program execution (also termed: the working set of the program is too big), so blocks are evicted and later retrieved
– Solutions: increase the cache size, stream buffers, software solutions

Conflict:
– Occurs in set associative or direct mapped caches when too many blocks map to the same set
– Also called collision misses or interference misses
– Solutions: increase associativity, victim cache, linker optimizations
3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB), for 1-way, 2-way, 4-way and 8-way associativity. Conflict misses shrink as associativity grows, capacity misses shrink as cache size grows, and compulsory misses are vanishingly small.]
How Can We Reduce Misses?

3 Cs: Compulsory, Capacity, Conflict. In all cases, assume the total cache size is not changed. What happens if we:
1) Change the block size: which of the 3Cs is obviously affected?
2) Change the associativity: which of the 3Cs is obviously affected?
3) Change the compiler: which of the 3Cs is obviously affected?
Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K to 256K. Larger blocks reduce compulsory misses; for the smallest caches the miss rate climbs again at the largest block sizes.]
Reduce Misses via Higher Associativity

We have two conflicting trends here. Higher associativity:
– Improves the hit ratio
BUT
– Increases the access time
– Slows down the replacement
– Increases the complexity

Most modern cache memory systems use at least 4-way set associative cache memories.
Example: Avg. Memory Access Time vs. Miss Rate

Example: assume a cache access time of 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, relative to the CAT of a direct mapped cache.

Effective access time to the cache:

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(In the original slide, red entries marked cases that are not improved by more associativity – broadly, the larger cache sizes.)
Note this is for a specific example.
Reducing Miss Penalty by Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.

Example:
– 64-bit = 8-byte bus, 32-byte cache line → 4 bus cycles to fill a line
– Fetch data from address 95H: of the four chunks 80H-87H, 88H-8FH, 90H-97H, 98H-9FH, the chunk 90H-97H (which contains 95H) is fetched first, and the fetch then wraps around the rest of the line.
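The wrapped fetch order for this example can be sketched as follows (the chunk geometry is from the example; the helper name is ours):

```python
LINE, CHUNK = 32, 8    # 32-byte line, 8-byte bus chunks, as in the example

def wrapped_fetch_order(addr):
    line_base = addr & ~(LINE - 1)
    first = (addr & (LINE - 1)) // CHUNK      # chunk containing the missed word
    n = LINE // CHUNK
    return [line_base + ((first + i) % n) * CHUNK for i in range(n)]

wrapped_fetch_order(0x95)   # [0x90, 0x98, 0x80, 0x88]: 90H-97H arrives first
```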
Prefetchers

In order to avoid compulsory misses, we need to bring the information in before it is requested by the program.
We can exploit the locality-of-reference behavior:
– Space → bring the surrounding environment
– Time → the same "patterns" repeat themselves
Prefetching relies on having extra memory bandwidth that can be used without penalty.
There are hardware and software prefetchers.
Hardware Prefetching

Instruction prefetching:
– The Alpha 21064 fetches 2 blocks on a miss
– The extra block is placed in a stream buffer, to avoid possible cache pollution in case the prefetched instructions are not required
– On a miss, check the stream buffer
– Branch-predictor-directed prefetching: let the branch predictor run ahead

Data prefetching – try to predict future data accesses:
– Next sequential
– Stride
– General pattern
Software Prefetching

Data prefetch:
– Load data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
– Special prefetching instructions cannot cause faults; a form of speculative execution
How it is done:
– A special prefetch intrinsic in the language
– Automatically by the compiler
Issuing prefetch instructions takes time:
– Is the cost of the prefetch issues < the savings in reduced misses?
– Wider superscalar machines reduce the difficulty of issue bandwidth
Other techniques
Multi-ported cache and Banked Cache

An n-ported cache enables n cache accesses in parallel:
– Parallelize cache accesses in different pipeline stages
– Parallelize cache accesses in a superscalar processor
A second port effectively doubles the cache die size.
A possible solution: a banked cache:
– Each line is divided into n banks
– Can fetch data from k ≤ n different banks (in possibly different lines)
Separate Code / Data Caches

Enables parallelism between data accesses (done in the memory-access stage) and instruction fetch (done in the fetch stage of a pipelined processor).
The code cache is a read-only cache:
– No need to write the line back into memory when it is evicted
– Simpler to manage
What about self-modifying code? (x86 only)
– Whenever executing a memory write, we need to snoop the code cache
– If the code cache contains the written address, the line that contains the address is invalidated
– Now the code cache is accessed both in the fetch stage and in the memory-access stage; the tags need to be dual-ported to avoid stalling
Increasing the size with minimum latency loss – the L2 cache

L2 is much larger than L1 (256 KB-1 MB compared to 32 KB-64 KB).
It used to be an off-chip cache (between the cache and the memory bus). Now, most implementations are on-chip (but some architectures have an off-chip level-3 cache).
– If L2 is on-chip, why not just make L1 larger?
L2 can be inclusive:
– All addresses in L1 are also contained in L2
– Data in L1 may be more up to date than in L2
– L2 is unified (code / data)
Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are).
Victim Cache

Problem: the per-set load may be non-uniform
– Some sets may have more conflict misses than others
Solution: allocate ways to sets dynamically, according to the load.
When a line is evicted from the cache, it is placed in the victim cache:
– If the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1
On a cache lookup, a victim-cache lookup is also performed (in parallel). On a victim-cache hit:
– The line is moved back to the cache
– The evicted line is moved to the victim cache
– Same access time as a cache hit
Especially effective for a direct mapped cache:
– Combines the fast hit time of a direct mapped cache with reduced conflict misses
Stream Buffers

Before inserting a new line into the cache, put it in a stream buffer.
The line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed again in the future.
Example:
– Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once
– If we insert the array into the cache, it will thrash the entire cache
– If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array lines into the cache
Backup
Compiler issues

Data alignment:
– A misaligned access might span several cache lines
– Prohibited in some architectures (Alpha, SPARC)
– Very slow in others (x86)
– Solution 1: add padding to data structures
– Solution 2: make sure memory allocations are aligned

Code alignment:
– A misaligned instruction might span several cache lines
– x86 only; VERY slow
– Solution: insert NOPs to make sure instructions are aligned
Compiler issues 2

Overalignment:
– The alignment of an array can be a multiple of the cache size
– Several arrays then map to the same cache lines
– Excessive conflict misses (thrashing)

for (int i = 0; i < N; i++)
    a[i] = a[i] + b[i] * c[i];

Solution 1: increase cache associativity
Solution 2: break the alignment
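The thrashing described above can be demonstrated with a toy direct-mapped index function; the cache geometry and array base addresses below are hypothetical:

```python
CACHE_SIZE, LINE_SIZE = 16 * 1024, 32        # assumed direct-mapped geometry
NUM_SETS = CACHE_SIZE // LINE_SIZE

def cache_set(addr):
    return (addr // LINE_SIZE) % NUM_SETS

# Three arrays whose bases are aligned to a multiple of the cache size:
a_base, b_base, c_base = 0x10000, 0x14000, 0x18000
i = 100                                       # any element index (4-byte elements)
sets = {cache_set(base + 4 * i) for base in (a_base, b_base, c_base)}
len(sets)    # 1: a[i], b[i], c[i] fight over the same set every iteration

# Breaking the alignment with a small pad separates the sets:
cache_set(b_base + 64 + 4 * i) != cache_set(a_base + 4 * i)   # True
```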