Predictor-Directed Stream Buffers Timothy Sherwood Suleyman Sair Brad Calder.

Date posted: 22-Dec-2015
Predictor-Directed Stream Buffers

Timothy Sherwood

Suleyman Sair

Brad Calder

Overview

• Introduction

• Past Stream Buffer work

• Predictor-Directed Stream Buffers

• Policy Improvements

• Results

• Contribution

Introduction

• Memory Wall

• Latency reduction through prefetching
  – without eating too much bandwidth

• Stream Buffers are one of the most used prefetching mechanisms
  – simple to implement
  – very efficient

• Pointer-based codes

Past Stream Buffer work

• Jouppi 1990
  – consecutive cache line FIFO

• Palacharla and Kessler 1994
  – non-unit stride (based on memory chunk)
  – allocation filters

• Farkas et al. 1997
  – PC-based stride
  – fully associative / non-overlapping
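
As a rough illustration of the earliest design in this list, the sketch below models a consecutive-cache-line FIFO. It is a simplified reading of the idea, not the original hardware: the 32-byte line size, the 4-entry depth, and the head-only comparison are assumptions made for the example, and data arrival timing is not modeled.

```python
from collections import deque

LINE_SIZE = 32  # bytes per cache line; an assumed value for illustration


class SequentialStreamBuffer:
    """FIFO of consecutive cache-line addresses, allocated on a miss."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()
        self.next_line = None

    def allocate(self, miss_addr):
        # Restart the stream at the line that follows the missing line.
        self.fifo.clear()
        self.next_line = (miss_addr // LINE_SIZE + 1) * LINE_SIZE
        self._refill()

    def _refill(self):
        # Keep the FIFO full of successive line addresses (the prefetches).
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += LINE_SIZE

    def lookup(self, addr):
        # Only the head entry is compared in this simple FIFO model.
        line = addr // LINE_SIZE * LINE_SIZE
        if self.fifo and self.fifo[0] == line:
            self.fifo.popleft()
            self._refill()
            return True
        return False
```

After `allocate(0x1000)`, a lookup of `0x1020` hits at the head and the buffer advances; a lookup of an unrelated address such as `0x2000` misses and would trigger a fresh allocation in a full design.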

Past Stream Buffer work

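[Figure: Stride-based stream buffer organization. N buffers, each holding entries with a tag, cache block, and comparator, plus per-buffer Predicted Stride and Last Address fields. The predicted stride is stored in the streaming buffer on allocation. Buffers connect from/to the next lower level of memory and to the data cache, register file, and MSHRs.]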
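
The Predicted Stride and Last Address fields suggest how such a buffer generates addresses on its own once allocated. The following is a minimal sketch under assumptions: the `PCStrideTable` class, its dictionary layout, and the helper `next_prefetch` are hypothetical names for this example, and the stride is simply the difference between the last two addresses seen by a load.

```python
class PCStrideTable:
    """Per-load (PC-indexed) stride table; a simplified stand-in."""

    def __init__(self):
        self.entries = {}  # load PC -> (last_address, stride)

    def update(self, pc, addr):
        # Stride is the difference between this address and the previous one.
        last = self.entries.get(pc)
        stride = addr - last[0] if last else 0
        self.entries[pc] = (addr, stride)

    def allocate_stream(self, pc, miss_addr):
        # On allocation, the predicted stride is copied into the buffer.
        _, stride = self.entries.get(pc, (miss_addr, 0))
        return {"last_address": miss_addr, "predicted_stride": stride}


def next_prefetch(buffer_state):
    # Each prefetch address is last_address + predicted_stride.
    buffer_state["last_address"] += buffer_state["predicted_stride"]
    return buffer_state["last_address"]
```

For a load at PC `0x400` that touched addresses 100 and then 164, an allocated buffer would produce 228, 292, and so on.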

Past Stream Buffer work

• Past work targeted streaming through arrays
  – either in sequential order
  – or in stride order (multidimensional arrays)

• Could not handle pointer codes
  – repetitive non-striding references

• Need a more General Predictor

Predictor-Directed Stream Buffer

• The Goal: Simple and efficient hardware based prefetching of complex but predictable streams

• Approach: Take a general predictor and hook it up to the well established stream buffer front end.

• Separate the predictor from the prefetcher

• Can use almost any predictor
  – 2 Delta
  – Context
  – Markov
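
One way to picture this separation is a stream buffer that knows nothing about how addresses are produced and only calls a predictor interface. The sketch below is illustrative: the `AddressPredictor` protocol, the `StreamBuffer` class, and the table layout are assumptions for this example. The predictor shown is a common two-delta stride scheme, in which the predicting stride changes only after the same new delta has been seen twice in a row.

```python
from typing import Protocol


class AddressPredictor(Protocol):
    def update(self, pc: int, addr: int) -> None: ...
    def predict(self, pc: int, last_addr: int) -> int: ...


class TwoDeltaStride:
    """Two-delta stride predictor (table layout is an assumption)."""

    def __init__(self):
        self.table = {}  # pc -> [last_addr, last_delta, predicting_stride]

    def update(self, pc, addr):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [addr, 0, 0]
            return
        delta = addr - entry[0]
        if delta == entry[1]:
            entry[2] = delta  # same delta twice in a row: adopt it
        entry[0], entry[1] = addr, delta

    def predict(self, pc, last_addr):
        entry = self.table.get(pc)
        return last_addr + (entry[2] if entry else 0)


class StreamBuffer:
    """Front end that only asks the predictor for the next address."""

    def __init__(self, predictor: AddressPredictor, pc: int, start_addr: int):
        self.predictor, self.pc, self.last_addr = predictor, pc, start_addr

    def next_prefetch(self):
        self.last_addr = self.predictor.predict(self.pc, self.last_addr)
        return self.last_addr
```

Any object with the same `update` and `predict` methods, for example a context or Markov predictor, could be passed to `StreamBuffer` without changing the buffer code.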

PSB Generalized Architecture

[Figure: PSB generalized architecture. N buffers, each holding tag / cache block / comparator entries plus per-buffer Prediction Info (Load PC, History, Stride, Confidence, Last Address). An Address Predictor receives load info (PC, address) from the write-back stage to update its prediction information. Labels between the buffers and the predictor: "subset of prediction info" and "predicted address". Buffers connect from/to the next lower level of memory and to the data cache, register file, and MSHRs.]

PSB Stages

• Allocation

• Prediction

• Probe

• Prefetching

• Lookup

Stage Descriptions

• Allocation
  – a Stream Buffer is allocated to a particular load
  – the buffer is initialized
  – subject to Allocation Filters

• Prediction
  – an empty buffer entry asks for an address
  – subject to limited predictor speed

Stage Descriptions (Continued)

• Probe
  – if there are free ports, remove useless prefetches
  – not mandatory

• Prefetching
  – subject to scheduling for ports and priority, prefetches are sent to memory

• Lookup
  – when a load performs an L1 access, the Stream Buffers are checked in parallel
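
The five stages can be sketched as functions over a single buffer. This is a toy model under stated assumptions, not the design from the talk: the `PSBuffer` class and every function name are hypothetical, the predictor is any callable `(pc, last_addr) -> addr`, each stage does at most one unit of work per call to mimic limited predictor and port bandwidth, and memory latency is not modeled.

```python
class PSBuffer:
    def __init__(self, n_entries=4):
        self.pc = None
        self.last_addr = None
        self.entries = [None] * n_entries  # None = empty, else a dict


def allocate(buf, pc, miss_addr):
    # Allocation: bind the buffer to a load and reset its entries.
    buf.pc, buf.last_addr = pc, miss_addr
    buf.entries = [None] * len(buf.entries)


def predict_stage(buf, predictor):
    # Prediction: one empty entry asks for one address per call.
    for i, e in enumerate(buf.entries):
        if e is None:
            buf.last_addr = predictor(buf.pc, buf.last_addr)
            buf.entries[i] = {"addr": buf.last_addr, "issued": False}
            return


def probe_stage(buf, cache_lines):
    # Probe (optional): drop predictions already present in the cache.
    for i, e in enumerate(buf.entries):
        if e and not e["issued"] and e["addr"] in cache_lines:
            buf.entries[i] = None


def prefetch_stage(buf, memory_requests):
    # Prefetching: issue at most one pending prefetch (one port assumed).
    for e in buf.entries:
        if e and not e["issued"]:
            e["issued"] = True
            memory_requests.append(e["addr"])
            return


def lookup(buf, addr):
    # Lookup: checked alongside the L1 access; a hit frees the entry.
    for i, e in enumerate(buf.entries):
        if e and e["issued"] and e["addr"] == addr:
            buf.entries[i] = None
            return True
    return False
```

A freed entry becomes empty again, so the next call to `predict_stage` refills it, which is what keeps the buffer running ahead of the load.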

PSB Implementation

• Tried many different address predictors

• Best is Stride Filtered Markov
  – similar to Joseph and Grunwald’s Predictor
  – first order Markov
  – striding behavior is filtered out

• Difference is stored to reduce size
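
A minimal sketch of these ideas follows, with several assumptions labeled: the 16-bit difference width, the modulo index, the full-address tag, and the "same delta as last time" stride test are all stand-ins chosen for brevity and are not taken from the talk. The point is only that a first-order table can record next-minus-current differences for non-striding misses and skip anything that does not fit in the stored width.

```python
DELTA_BITS = 16  # assumed width of a stored difference


def fits(delta, bits=DELTA_BITS):
    # True when delta is representable as a signed two's-complement value.
    return -(1 << (bits - 1)) <= delta < (1 << (bits - 1))


class StrideFilteredMarkov:
    """First-order Markov table over miss addresses that skips strided
    misses and stores differences instead of full next addresses."""

    def __init__(self, size=2048):
        self.size = size
        self.table = {}         # index -> (tag, delta)
        self.prev_miss = None
        self.prev_delta = None

    def _index(self, addr):
        return addr % self.size

    def train(self, miss_addr):
        if self.prev_miss is not None:
            delta = miss_addr - self.prev_miss
            is_stride = delta == self.prev_delta  # stand-in stride filter
            if not is_stride and fits(delta):
                self.table[self._index(self.prev_miss)] = (self.prev_miss, delta)
            self.prev_delta = delta
        self.prev_miss = miss_addr

    def predict(self, addr):
        entry = self.table.get(self._index(addr))
        if entry and entry[0] == addr:
            return addr + entry[1]  # rebuild the address from the difference
        return None                 # Markov miss
```

Training on the miss sequence 100, 5000, 164 lets `predict(100)` return 5000 and `predict(5000)` return 164, while a purely strided sequence such as 0, 64, 128, 192 adds nothing after its first transition.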

Difference Storing

[Figure: Difference Storing. Y-axis: Percent of L1 Misses (0% to 100%). X-axis: Number of bits (1 to 19). Series: burg, delta, gs, sis, turb3d, health.]

PSB with SFM

[Figure: PSB with SFM. Load info (PC, address) from the write-back stage feeds a Markov Predictor and a Stride Predictor. There are 8 buffers, each with tag / cache block / comparator entries and per-buffer Predicted Stride and Last Address fields; the predicted stride is stored in the streaming buffer on allocation. The last address is sent to the Markov Predictor, which returns a predicted address if it hits. A MUX controlled by "markov hit?" selects between the predicted Markov address and the predicted stride address. Buffers connect from/to the next lower level of memory and to the data cache, register file, and MSHRs.]
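
The selection step in this organization can be written in a few lines. The function name `psb_next_address` and the `markov_lookup` callable are hypothetical; the sketch assumes only that a Markov hit takes precedence and that the stride path is the fallback.

```python
def psb_next_address(last_addr, predicted_stride, markov_lookup):
    """Select the next prefetch address for one stream buffer.

    markov_lookup is a hypothetical callable that returns a predicted
    address on a Markov table hit, or None on a miss.
    """
    markov_addr = markov_lookup(last_addr)
    if markov_addr is not None:          # "markov hit?" drives the MUX
        return markov_addr
    return last_addr + predicted_stride  # fall back to the stride path
```

With a one-entry table `{0x1000: 0x8040}` and a stride of 64, the address after `0x1000` is `0x8040` (Markov hit) and the address after `0x8040` is `0x8080` (stride fallback).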

Methods

• SimpleScalar 3.0

• Rewrote memory hierarchy

• Model bandwidth between all levels

• Added perfect store sets

• Ran over set of Pointer Benchmarks

• 2K entry predictor table

• 8 buffers x 4 entry Stream Buffers

• 32K 4-way associative cache

Speedup from PSB

[Figure: Speedup from PSB. Y-axis: Percent Speedup (0% to 45%). X-axis: health, burg, deltablue, gs, sis, turb3d. Series: PC-Stride, PSB w/ SFM.]

Allocation Filtering

• Farkas et al. showed how two-miss filtering
  – prevents too many streams from requesting resources

• Does not work as well for pointer codes
  – irregular miss patterns

• We use Priority and Accuracy Counters
  – track the behavior of loads
  – allocate to loads that are behaving well
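
As a hedged sketch of accuracy-based allocation, the class below keeps a small saturating counter per load and permits allocation only above a threshold. The class name, the counter range (0 to 7), the threshold of 2, and the exact-match accuracy test are assumptions for illustration; the talk does not give these values.

```python
class AccuracyCounters:
    """Per-load saturating counters tracking how well each load predicts."""

    def __init__(self, max_value=7, threshold=2):
        self.max_value = max_value
        self.threshold = threshold
        self.counters = {}  # load PC -> counter value

    def record(self, pc, predicted_addr, actual_addr):
        # Count up on a correct prediction, down on a wrong one.
        c = self.counters.get(pc, 0)
        if predicted_addr == actual_addr:
            c = min(c + 1, self.max_value)
        else:
            c = max(c - 1, 0)
        self.counters[pc] = c

    def may_allocate(self, pc):
        # Only loads that have been predicting well get a stream buffer.
        return self.counters.get(pc, 0) >= self.threshold
```

Unlike a filter keyed to consecutive misses, this decision depends on how predictable the load has been, which is the property that matters when miss patterns are irregular.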

Allocation Filtering Speedup

[Figure: Allocation Filtering Speedup. Y-axis: Percent Speedup (0% to 45%). X-axis: health, burg, deltablue, gs, sis, turb3d. Series: PC-Stride, 2 Miss, ConfAlloc.]

Stream Buffer Priority

• Round Robin
  – give each active buffer equal resources
  – predictor and prefetching

• Priority Counters
  – use small counters with each buffer
  – use the counters to rank buffers
  – give more resources to better performing buffers
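
The two policies differ only in how the next buffer to serve is chosen. The functions below are a sketch with hypothetical names; the tie-break toward the lowest index, the counter ceiling of 7, and raising priority on a lookup hit are assumptions, since the slides do not specify how the counters are updated.

```python
def round_robin(active, last_served):
    """Pick the next active buffer index after last_served, wrapping."""
    later = [b for b in sorted(active) if b > last_served]
    return later[0] if later else min(active)


def by_priority(active, priority):
    """Pick the active buffer with the highest priority counter.
    Ties go to the lowest index (an arbitrary choice for this sketch)."""
    return max(sorted(active), key=lambda b: priority[b])


def on_lookup_hit(priority, buf, max_value=7):
    # A useful prefetch raises the buffer's priority (saturating).
    priority[buf] = min(priority[buf] + 1, max_value)
```

With active buffers `{0, 2, 3}` and counters `[1, 0, 5, 5]`, round robin after buffer 0 serves buffer 2, while priority scheduling also picks buffer 2 until a hit in buffer 3 raises its counter above the tie.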

Priority Scheduling Speedup

[Figure: Priority Scheduling Speedup. Y-axis: Percent Speedup (0% to 45%). X-axis: health, burg, deltablue, gs, sis, turb3d. Series: PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, ConfAlloc-Priority.]

Latency Reduction

[Figure: Latency Reduction. Y-axis: Avg. Access Latency (cycles), 0 to 12. X-axis: health, burg, deltablue, gs, sis, turb3d. Series: Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, ConfAlloc-Priority.]

Contributions

• Predictor-Directed Stream Buffers allow decoupling of Stream Buffer front end from address generation

• Accuracy-based allocation filtering and priority scheduling can make a large difference in performance

• With some simple compression, even small Markov tables can be very effective

Accuracy

[Figure: Accuracy. Y-axis: Percent Accuracy (0% to 90%). X-axis: health, burg, deltablue, gs, sis, turb3d. Series: PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, ConfAlloc-Priority.]

Bus Results

[Figure: Bus Results. For each benchmark (health, burg, deltablue, gs, sis, turb3d), the configurations Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, and Conf-Pri are shown. Axes: L1 to L2 Bus Utilization (0% to 100%) and L2 to Mem Bus Utilization (0% to 20%).]