Transcript of EECS 470 Lecture 15: Prefetching (Winter 2021)

Lecture 17 Slide 1 EECS 470

EECS 470

Lecture 15

Prefetching

Winter 2021

Jon Beaumont

http://www.eecs.umich.edu/courses/eecs470

[Title-slide figure: a history table (latest addresses A0, A1) indexing a correlating prediction table to prefetch A3]

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.


Lecture 17 Slide 2 EECS 470

Administrative

HW #4 due Friday (4/2)

Let me know if there are any issues with other HW on Gradescope

Milestone III next week

• No submissions needed

• You should aim to have simple programs (including memory ops) running correctly

• The remaining couple of weeks should focus on testing and optimizing


Lecture 17 Slide 3 EECS 470

Last time

Cache techniques to reduce cache misses and miss penalties


Lecture 17 Slide 4 EECS 470

Today

Finish up cache enhancements

Reduce number of cache misses through prefetching


Lecture 17 Slide 5 EECS 470

Large Blocks

Pros of large cache blocks:
+ Smaller tag overhead
+ Take advantage of spatial locality

Cons:
- Takes longer to fill
- Wasted bandwidth if block size is larger than spatial locality

Poll: What are the advantages of large cache blocks?


Lecture 17 Slide 6 EECS 470

Large Blocks and Subblocking

Can get the best of both worlds.

Large cache blocks can take a long time to refill:
- refill the cache line critical word first
- restart the cache access before the refill completes

Large cache blocks can waste bus bandwidth if block size is larger than spatial locality:
- divide a block into subblocks
- associate separate valid bits with each subblock
- only load a subblock on access, but still have reduced tag overhead

[Figure: cache line layout — one tag followed by a (valid bit, subblock) pair for each subblock]
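To make the organization concrete, here is a minimal C sketch of a subblocked cache line; the four-subblock split and 16-byte subblock size are illustrative assumptions, not parameters from the slide:

#include <stdint.h>
#include <stdbool.h>

#define SUBBLOCKS_PER_LINE 4      /* assumed: each line split into 4 subblocks */
#define SUBBLOCK_BYTES     16     /* assumed subblock size */

/* One cache line: a single tag shared by all subblocks, but a separate
 * valid bit per subblock, so only the referenced subblock has to be
 * fetched from memory on a miss.                                        */
typedef struct {
    uint32_t tag;
    bool     valid[SUBBLOCKS_PER_LINE];
    uint8_t  data[SUBBLOCKS_PER_LINE][SUBBLOCK_BYTES];
} subblocked_line_t;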


Lecture 17 Slide 7 EECS 470

Multi-Level Caches

Processors are getting faster relative to main memory: larger caches reduce the frequency of the more costly misses, but larger caches are too slow for the processor => gradually reduce the miss cost with multiple levels.

t_avg = t_hit + miss_ratio × t_miss


Lecture 17 Slide 8 EECS 470

Multi-Level Cache Design

[Figure: processor with split L1I and L1D caches backed by a unified L2]

Each level uses different technology and has different requirements, so each makes a different choice of capacity, block size, and associativity.

t_avg_L1 = t_hit_L1 + miss_ratio_L1 × t_avg_L2

t_avg_L2 = t_hit_L2 + miss_ratio_L2 × t_memory

What is the miss ratio?
- global: L2 misses / L1 accesses
- local: L2 misses / L1 misses
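As a worked example of these formulas, the following C sketch plugs in assumed (not slide-provided) hit times and miss ratios and computes the two-level average access time along with the global L2 miss ratio:

#include <stdio.h>

int main(void) {
    /* Illustrative numbers only, not from the slides */
    double t_hit_L1 = 1, t_hit_L2 = 10, t_memory = 200;   /* cycles */
    double miss_ratio_L1 = 0.10;   /* L1 misses / L1 accesses        */
    double local_miss_L2 = 0.40;   /* L2 misses / L1 misses (local)  */

    double t_avg_L2 = t_hit_L2 + local_miss_L2 * t_memory;   /* 10 + 0.4*200 = 90  */
    double t_avg_L1 = t_hit_L1 + miss_ratio_L1 * t_avg_L2;   /* 1 + 0.1*90  = 10   */

    /* Global L2 miss ratio = L2 misses / L1 accesses */
    double global_miss_L2 = miss_ratio_L1 * local_miss_L2;   /* 0.04 */

    printf("t_avg = %.1f cycles, global L2 miss ratio = %.2f\n",
           t_avg_L1, global_miss_L2);
    return 0;
}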


Lecture 17 Slide 9 EECS 470

The Inclusion Property

Inclusion means L2 is a superset of L1 (ditto for L3…)

Why?
- if an addr is in L1, then it must be frequently used
- makes L1 writeback simpler
- L2 can handle external coherence checks without L1

Inclusion takes effort to maintain: L2 must track what is cached in L1

On L2 replacement, must flush corresponding blocks from L1

How can inclusion be violated?

Consider:

1. L1 block size < L2 block size

2. different associativity in L1

3. L1 filters L2 access sequence; affects LRU replacement order


Lecture 17 Slide 10 EECS 470

Possible Inclusion Violation

Setup: a 2-way set-associative L1 and a direct-mapped L2.
- a, b, c have the same L1 index bits
- b, c have the same L2 index bits
- a and {b, c} have different L2 index bits
- L1 initially holds a and b

step 1. L1 miss on c
step 2. a displaced from L1 to L2
step 3. b replaced by c in L2 — but b is still in L1, so L2 is no longer a superset of L1 (inclusion violated)


Lecture 17 Slide 11 EECS 470

Non-blocking Caches

Also known as lock-up free caches

Instead of stalling pending accesses to cache on a miss, keep track of misses in special registers and keep handling new requests

Key implementation problems:
- handle reads to a pending miss
- handle writes to a pending miss
- keep multiple requests straight

[Figure: a memory access stream — ld A (hit), ld B (miss), ld C (miss), ld D (hit), st B (miss, pending) — feeding a non-blocking cache, with the outstanding misses to B and C tracked in Miss Status Holding Registers]
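A minimal sketch of what one miss status holding register might record, assuming an 8-entry MSHR file and up to 4 queued accesses per miss (both numbers are illustrative, not from the slide):

#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS   8      /* assumed number of outstanding misses supported  */
#define MAX_TARGETS 4      /* assumed loads/stores that can wait on one miss  */

/* One MSHR: identifies the in-flight block and remembers which later
 * accesses must be replayed when the fill returns from memory.        */
typedef struct {
    bool     valid;                    /* entry in use?                      */
    uint64_t block_addr;               /* address of the pending block       */
    int      num_targets;              /* queued accesses to this block      */
    struct {
        bool     is_store;
        uint8_t  dest_reg;             /* for loads: register to write back  */
        uint16_t offset;               /* byte offset within the block       */
    } target[MAX_TARGETS];
} mshr_t;

/* On a new miss: if the block already has an MSHR, just add a target
 * (a read or write to a pending miss); otherwise allocate a free MSHR,
 * and stall only when all MSHRs are occupied.                           */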


Lecture 17 Slide 12 EECS 470

EECS 470 Roadmap

[Course roadmap figure: ways to speed up programs — parallelize, reduce instruction latency, reduce the number of instructions, exploit instruction-level parallelism, and reduce average memory latency (instruction flow, caching, memory flow) — with this lecture covering prefetching]


Lecture 17 Slide 13 EECS 470

The memory wall

Today: 1 memory access costs roughly as much time as 500 arithmetic ops

How to reduce memory stalls for existing SW?

[Plot: processor vs. memory performance, 1985–2010, on a log scale from 1 to 10,000 — the processor curve pulls steadily away from memory. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]


Lecture 17 Slide 14 EECS 470


Conventional approach #1: Avoid main memory accesses

Cache hierarchies:

Trade off capacity for speed

Add more cache levels?

Diminishing locality returns

No help for shared data in MPs

[Figure: CPU backed by a 64K cache (2 clk), a 4M cache (20 clk), and main memory (200 clk), with read- and write-data paths through the hierarchy]


Lecture 17 Slide 15 EECS 470

Conventional approach #2: Hide memory latency

Out-of-order execution:

Overlap compute & mem stalls

Expand OoO instruction window?

Issue & load-store logic hard to scale

No help for dependent instructions

[Figure: execution time split into compute and memory stall, compared for out-of-order (OoO) vs. in-order execution]


Lecture 17 Slide 16 EECS 470

What is Prefetching?

• Fetch memory before it's needed

• Targets compulsory, capacity, & coherence misses

Big challenges:

1. knowing "what" to fetch
• Fetching useless info wastes valuable resources

2. knowing "when" to fetch it
• Fetching too early clutters storage
• Fetching too late defeats the purpose of "pre"-fetching


Lecture 17 Slide 17 EECS 470

Software Prefetching

Compiler/programmer places prefetch instructions; requires ISA support

Why not use regular loads?

Found in recent ISAs such as SPARC V9

Prefetch into:
- a register (binding)
- caches (non-binding)


Lecture 17 Slide 18 EECS 470

Software Prefetching (Cont.)

e.g.,

for (i = 1; i < rows; i++)
  for (j = 1; j < columns; j++)
  {
    prefetch(&x[i+1][j]);   /* prefetch the element one row ahead */
    sum = sum + x[i][j];
  }
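For reference, GCC and Clang expose this kind of non-binding prefetch as the __builtin_prefetch intrinsic; a sketch of the same loop, with an assumed 1024x1024 matrix of doubles, might look like:

#define ROWS 1024                 /* assumed dimensions, for illustration only */
#define COLS 1024
static double x[ROWS][COLS];

double sum_with_prefetch(void) {
    double sum = 0.0;
    for (int i = 1; i + 1 < ROWS; i++)        /* keep &x[i+1][j] in bounds */
        for (int j = 1; j < COLS; j++) {
            /* Non-binding hint: pull the next row's element toward the cache.
             * Args: address, rw = 0 (read), locality hint = 1 (low reuse).    */
            __builtin_prefetch(&x[i + 1][j], 0, 1);
            sum += x[i][j];
        }
    for (int j = 1; j < COLS; j++)            /* last row, no prefetch needed */
        sum += x[ROWS - 1][j];
    return sum;
}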


Lecture 17 Slide 19 EECS 470

Hardware Prefetching

What to prefetch?
- one block spatially ahead?
- use address predictors — works for regular patterns (x, x+8, x+16, ...)

When to prefetch?
- on every reference
- on every miss
- when prior prefetched data is referenced

Where to put prefetched data?
- auxiliary buffers
- caches

Poll: Which cache is probably easier to design a prefetcher for?

Poll: We've already seen one implicit form of prefetching. When?


Lecture 17 Slide 20 EECS 470

Spatial Locality and Sequential Prefetching

Sequential prefetching: just grab the next few lines from memory

Works well for the I-cache
- instruction fetch tends to access memory sequentially

Doesn't work very well for the D-cache
- more irregular access patterns
- regular patterns may have non-unit stride (e.g. matrix code)

Relatively easy to implement
- a large cache block size already has the effect of prefetching
- after loading one cache line, start loading the next line automatically if that line is not in the cache and the bus is not busy (see the sketch after this list)
- if we know the typical basic block size (i.e. the average distance between branches), we can fetch the next several lines
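A minimal sketch of the next-line policy just described, assuming a 64-byte line and stubbed-out cache/bus hooks (in_cache, bus_idle, issue_prefetch are placeholders, not a real interface):

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64u                        /* assumed cache line size */

/* Stubs standing in for the cache and bus model (assumptions, not a real API). */
static bool in_cache(uint64_t line)       { (void)line; return false; }
static bool bus_idle(void)                { return true; }
static void issue_prefetch(uint64_t line) { (void)line; /* would enqueue a fill */ }

/* Next-line (sequential) prefetch: after a demand access to one line,
 * fetch the following line if it is absent and the bus is free.        */
void on_demand_access(uint64_t addr) {
    uint64_t line      = addr / LINE_BYTES;
    uint64_t next_line = line + 1;
    if (!in_cache(next_line) && bus_idle())
        issue_prefetch(next_line);
}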


Lecture 17 Slide 21 EECS 470

Stride Prefetchers

The access pattern of a particular static load is more predictable.

Reference Prediction Table (RPT)
- remembers previously executed loads: their PC, the last address referenced, and the stride between the last two references
- when executing a load, look it up in the RPT and compute the distance between the current data address and the last address
  - if the new distance matches the old stride, we found a pattern: go ahead and prefetch "current addr + stride"
  - update "last addr" and "last stride" for the next lookup

[Table: RPT entry — Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags]
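A minimal C sketch of the RPT lookup/update just described; the table size, PC hashing, and issue_prefetch hook are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define RPT_ENTRIES 64                        /* assumed table size */

/* One Reference Prediction Table entry, tagged by the load's PC. */
typedef struct {
    bool     valid;
    uint64_t pc_tag;
    uint64_t last_addr;
    int64_t  last_stride;
} rpt_entry_t;

static rpt_entry_t rpt[RPT_ENTRIES];

static void issue_prefetch(uint64_t addr) { (void)addr; /* would enqueue a fill */ }

/* Called once per executed load with its PC and data address. */
void rpt_access(uint64_t pc, uint64_t addr) {
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];

    if (e->valid && e->pc_tag == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->last_stride)
            issue_prefetch(addr + (uint64_t)stride);   /* pattern found: prefetch addr+stride */
        e->last_stride = stride;                       /* update for the next lookup */
    } else {                                           /* first use (or tag conflict): (re)allocate */
        e->valid       = true;
        e->pc_tag      = pc;
        e->last_stride = 0;
    }
    e->last_addr = addr;
}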


Lecture 17 Slide 22 EECS 470

Stream Buffers [Jouppi]

Each stream buffer holds one stream of sequentially prefetched cache lines.

On a load miss, check the head of all stream buffers for an address match
- if it hits, pop the entry from the FIFO and update the cache with the data
- if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)

Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy.

Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams.

[Figure: four stream-buffer FIFOs sitting between the memory interface and the D-cache; prefetched lines wait in the buffers rather than in the cache, so there is no cache pollution]
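A minimal sketch of the head-of-buffer check on a load miss, assuming four 4-entry stream buffers as in the figure (data movement and LRU recycling are only described in comments):

#include <stdint.h>
#include <stdbool.h>

#define NUM_STREAM_BUFFERS 4                  /* assumed, as in the figure   */
#define FIFO_DEPTH         4                  /* assumed entries per buffer  */

typedef struct {
    bool     valid;
    uint64_t line_addr[FIFO_DEPTH];           /* head of the stream is index 0 */
    /* line data storage omitted in this sketch */
} stream_buffer_t;

static stream_buffer_t sb[NUM_STREAM_BUFFERS];

/* On an L1 load miss: compare the miss address against the head of every
 * stream buffer.  A hit means the head entry is popped and its data moved
 * into the cache; a miss means a buffer is (re)allocated (LRU victim) to
 * start a new sequential stream at the missing line.                       */
bool stream_buffer_hit(uint64_t miss_line) {
    for (int i = 0; i < NUM_STREAM_BUFFERS; i++)
        if (sb[i].valid && sb[i].line_addr[0] == miss_line)
            return true;   /* pop head, shift FIFO, top off the tail in the background */
    return false;          /* allocate a new stream buffer for miss_line + 1, + 2, ... */
}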


Lecture 17 Slide 23 EECS 470

Generalized Access Pattern Prefetchers

How do you prefetch

1. Heap data structures?

2. Indirect array accesses?

3. Generalized memory access patterns?

Current proposals:

• Precomputation prefetchers (runahead execution)

• Address correlating prefetchers (temporal memory streaming)

• Spatial pattern prefetchers (spatial memory streaming)


Lecture 17 Slide 24 EECS 470

Runahead Prefetchers

Proposed for I/O prefetching first (Gibson et al.)

Duplicate the program

• Only execute the address generating stream

• Let it run ahead

May run as a thread on

• A separate processor

• The same multithreaded processor

Or custom address generation logic

Many names: slipstream, precomp., runahead, …

[Figure: a main thread and a prefetch thread running side by side, with the prefetch thread running ahead of the main thread]


Lecture 17 Slide 25 EECS 470

Runahead Prefetcher

To get ahead:

• Must avoid waiting

• Must compute less

Predict

1. Control flow thru branch prediction

2. Data flow thru value prediction

3. Address generation computation only

+ Prefetch any pattern (need not be repetitive)

― Prediction only as good as branch + value prediction

How much prefetch lookahead?


Lecture 17 Slide 26 EECS 470

Correlation-Based Prefetching

Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C

After referencing a particular address (say A or E), are some addresses more likely to be referenced next?

[Figure: Markov model over the addresses A–F, with transition probabilities estimated from the access history above]


Lecture 17 Slide 27 EECS 470

Correlation-Based Prefetching

Track the likely next addresses after seeing a particular address.

Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth).

Prefetch accuracy can be improved by using longer history
- decide which address to prefetch next by looking at the last K load addresses instead of just the current one
- e.g. index with the XOR of the data addresses from the last K loads
- using a history of a couple of loads can increase accuracy dramatically

This technique can also be applied to just the load miss stream.

[Table: correlation entry — Load Data Addr (tag) | Prefetch Candidate 1 | Confidence | … | Prefetch Candidate N | Confidence]
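A minimal C sketch of a correlation table keyed by the previous address (K = 1) with N = 2 prefetch candidates per entry; the sizes and the issue_prefetch hook are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>

#define CORR_ENTRIES   1024                   /* assumed table size            */
#define NUM_CANDIDATES 2                      /* prefetch up to N successors   */

/* One entry: likely successors of tag_addr, each with a confidence counter. */
typedef struct {
    bool     valid;
    uint64_t tag_addr;
    uint64_t next_addr[NUM_CANDIDATES];
    uint8_t  confidence[NUM_CANDIDATES];
} corr_entry_t;

static corr_entry_t table[CORR_ENTRIES];

static void issue_prefetch(uint64_t addr) { (void)addr; /* would enqueue a fill */ }

/* Called on each load (or load-miss) address in program order. */
void correlate(uint64_t addr) {
    static uint64_t prev;
    static bool     have_prev;

    /* Learn: record addr as a successor of the previous address. */
    if (have_prev) {
        corr_entry_t *e = &table[prev % CORR_ENTRIES];
        if (!e->valid || e->tag_addr != prev)          /* (re)allocate the entry */
            *e = (corr_entry_t){ .valid = true, .tag_addr = prev };
        int slot = 0;                                  /* matching or least-confident slot */
        for (int i = 0; i < NUM_CANDIDATES; i++) {
            if (e->next_addr[i] == addr) { slot = i; break; }
            if (e->confidence[i] < e->confidence[slot]) slot = i;
        }
        if (e->next_addr[slot] != addr) { e->next_addr[slot] = addr; e->confidence[slot] = 0; }
        if (e->confidence[slot] < 255) e->confidence[slot]++;
    }

    /* Predict: prefetch the recorded successors of the current address. */
    corr_entry_t *p = &table[addr % CORR_ENTRIES];
    if (p->valid && p->tag_addr == addr)
        for (int i = 0; i < NUM_CANDIDATES; i++)
            if (p->confidence[i] > 0)
                issue_prefetch(p->next_addr[i]);

    prev = addr;
    have_prev = true;
}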


Lecture 17 Slide 28 EECS 470

More info on Prefetching?

Professor Wenisch (a professor here, currently working at Google) wrote a great summary of the state of the art, available through umich IP addresses:

https://www.morganclaypool.com/doi/abs/10.2200/S00581ED1V01Y201405CAC028


Lecture 17 Slide 29 EECS 470

Improving Cache Performance: Summary

Miss rate
- large block size
- higher associativity
- victim caches
- skewed-/pseudo-associativity
- hardware/software prefetching
- compiler optimizations

Miss penalty
- give priority to read misses over writes/writebacks
- subblock placement
- early restart and critical word first
- non-blocking caches
- multi-level caches

Hit time (difficult?)
- small and simple caches
- avoiding translation during L1 indexing (later)
- pipelining writes for fast write hits
- subblock placement for fast write hits in write-through caches


Lecture 17 Slide 30 EECS 470

Next Time

Multicore!

Lingering questions / feedback? I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD