Temporal Memory Streaming
Transcript of Temporal Memory Streaming
© 2005 Babak Falsafi
Temporal Memory Streaming
Babak Falsafi
Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch
Collaborators: Anastassia Ailamaki & Andreas Moshovos
STEMS
Computer Architecture Lab, Carnegie Mellon
http://www.ece.cmu.edu/CALCM
The Memory Wall
Logic/DRAM speed gap continues to increase!
[Chart: clocks per instruction and clocks per DRAM access (log scale, 0.01 to 1000) for core(s) vs. memory, across VAX/1980, PPro/1996, and 2010+.]
Current Approach
Cache hierarchies:
• Trade off capacity for speed
• Exploit “reuse”
But, in modern servers:
• Only 50% utilization of one processor [Ailamaki, VLDB’99]
• Much bigger problem in MPs
What is wrong?
• Demand fetch/replacement of data
[Diagram: memory hierarchy: CPU, L1 64K (1 clk), L2 2M (10 clk), L3 64M (100 clk), DRAM 100G (1000 clk).]
Prior Work (SW-Transparent)
• Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04]: simple patterns or low accuracy
• Large execution windows / runahead [Mutlu 03]: fetch dependent addresses serially
• Coherence optimizations [Stenström 93] [Lai 00] [Huh 04]: limited applicability (e.g., migratory)
Need solutions for arbitrary access patterns
Our Solution: Spatio-Temporal Memory Streaming
Observation:
• Data are spatially/temporally correlated
• Arbitrary, yet repetitive, patterns
Approach: Memory Streaming
• Extract spatial/temporal patterns
• Stream data to/from the CPU
• Manage resources for multiple blocks; break dependence chains
• In HW, SW, or both
[Diagram: STEMS sits between the CPU/L1 and DRAM or other CPUs, streaming blocks in ahead of fetches and replacing blocks out.]
Contribution #1: Temporal Shared-Memory Streaming
• Recent coherence miss sequences recur
  50% of misses closely follow a previous sequence
  Large opportunity to exploit MLP
• Temporal Streaming Engine
  Ordered streams allow practical HW
  Performance improvement: 7%-230% in scientific apps; 6%-21% in commercial Web & OLTP apps
Contribution #2: Last-touch Correlated Data Streaming
• Last-touch prefetchers
  Cache block dead time >> live time
  Fetch on a predicted “last touch”
  But designs are impractical (> 200MB on-chip)
• Last-touch correlated data streaming
  Miss order ~ last-touch order
  Stream table entries from off-chip
  Eliminates 75% of all L1 misses with ~200KB
Outline
• STEMS Overview
• Example Temporal Streaming Techniques
1. Temporal Shared-Memory Streaming
2. Last-Touch Correlated Data Streaming
• Summary
Temporal Shared-Memory Streaming [ISCA’05]
• Record sequences of memory accesses
• Transfer data sequences ahead of requests
[Diagram: in the baseline system, the CPU misses on A and waits for Fill A, then misses on B and waits for Fill B. In the streaming system, on Miss A the memory fills A, B, C, … ahead of the requests.]
• Accelerates arbitrary access patterns; parallelizes the critical path of pointer-chasing
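As a toy illustration of the record-and-replay idea (a Python sketch of the concept only, not the hardware; the class name, stream length, and single global log are invented for exposition):

```python
class TemporalStreamer:
    """Toy model of temporal streaming: log misses in order and, when a
    miss recurs, replay the addresses that followed it last time."""

    def __init__(self, stream_len=4):
        self.log = []          # ordered miss history (plays the role of a CMOB)
        self.last_pos = {}     # address -> most recent index in the log
        self.stream_len = stream_len

    def miss(self, addr):
        # If this address missed before, stream the addresses that
        # followed it on that earlier occasion.
        prefetches = []
        if addr in self.last_pos:
            start = self.last_pos[addr] + 1
            prefetches = self.log[start:start + self.stream_len]
        # Record the new miss.
        self.last_pos[addr] = len(self.log)
        self.log.append(addr)
        return prefetches

s = TemporalStreamer()
for a in ["Q", "W", "A", "B", "C", "D"]:
    s.miss(a)
# The stream recurs: a repeat miss on A streams B, C, D ahead of use.
print(s.miss("A"))   # ['B', 'C', 'D']
```

Note how a single recurring miss (A) yields several prefetch candidates at once, which is where the lookahead comes from.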
Relationship Between Misses
• Intuition: miss sequences repeat because code sequences repeat
• Observed for uniprocessors in [Chilimbi’02]
• Temporal Address Correlation: the same miss addresses repeat in the same order
Correlated miss sequence = stream
Miss seq.: … Q W A B C D E R T A B C D E Y …
Relationship Between Streams
• Intuition: streams exhibit temporal locality
  Because the working set exhibits temporal locality
  For shared data, repetition is often across nodes
• Temporal Stream Locality Recent streams likely to recur
Node 1: Q W A B C D E R
Node 2: T A B C D E Y
Addr. correlation + stream locality = temporal correlation
Memory Level Parallelism
• Streams create MLP for dependent misses
• Not possible with larger windows / runahead
Temporal streaming breaks dependence chains
[Diagram: in the baseline, the CPU must wait to follow pointers A → B → C serially; with temporal streaming, A, B, and C are fetched in parallel.]
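A back-of-the-envelope latency model makes the MLP argument concrete (the 100-cycle figure and the perfect-overlap assumption are illustrative, not measured):

```python
MISS_LATENCY = 100  # hypothetical cycles per memory access

def baseline_cycles(chain_len):
    # Pointer chasing: each node's address comes from the previous
    # node's data, so the misses serialize.
    return chain_len * MISS_LATENCY

def streaming_cycles(chain_len):
    # After the first miss, the recorded stream supplies the remaining
    # addresses, so those fetches overlap (perfect MLP assumed here).
    return MISS_LATENCY if chain_len else 0

print(baseline_cycles(4), streaming_cycles(4))  # 400 100
```

The gap grows linearly with chain length, which is why streaming pays off most on long dependent chains.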
Temporal Streaming
[Diagram, step 1 (Record): Node 1 records its coherence miss sequence A, B, C, D; the directory and Node 2 are also shown.]
Temporal Streaming
[Diagram, step 2: Node 1 has recorded misses A, B, C, D; Node 2 misses on A, sends Req. A to the directory, and receives Fill A.]
Temporal Streaming
[Diagram, step 3 (Locate): on Node 2’s request for A, the directory locates A in Node 1’s recorded sequence, and Node 1 streams B, C, D to Node 2.]
Temporal Streaming
[Diagram, step 4 (Stream): Node 2 fetches B, C, D ahead of use; its subsequent accesses to B and C hit locally.]
Temporal Streaming Engine
Record
• Coherence Miss Order Buffer (CMOB)
  ~1.5MB circular buffer per node, held in local memory
  Addresses only; coalesced accesses
[Diagram: on Fill E, the CPU appends E’s address to the CMOB in local memory: … Q W A B C D E R T Y …]
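The CMOB’s two behaviors, a fixed-size circular address log and coalesced (grouped) appends, can be sketched as follows (capacity and group size are toy values; the real buffer is ~1.5MB per node):

```python
class CMOB:
    """Toy coherence-miss order buffer: a fixed-size circular buffer of
    miss addresses, with appends coalesced into block-sized groups."""

    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.head = 0          # next append position (monotonic)
        self.pending = []      # appends waiting to be written as a group

    def append(self, addr, group=4):
        self.pending.append(addr)
        if len(self.pending) >= group:       # write a full group at once
            for a in self.pending:
                self.buf[self.head % len(self.buf)] = a
                self.head += 1
            self.pending.clear()

    def read_stream(self, index, n):
        # Return up to n recorded addresses starting at a CMOB pointer.
        return [self.buf[(index + i) % len(self.buf)]
                for i in range(n)
                if self.buf[(index + i) % len(self.buf)] is not None]

cmob = CMOB()
for a in ["A", "B", "C", "D"]:
    cmob.append(a)
print(cmob.read_stream(1, 3))  # ['B', 'C', 'D']
```

Coalescing is what keeps the append traffic to local memory cheap: one write per group rather than one per miss.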
Temporal Streaming Engine
Locate
• Annotate the directory, which already has coherence info for every block
  On a CMOB append, send a pointer to the directory
  On a coherence miss, forward a stream request
[Directory example: A: shared, Node 4 @ CMOB[23]; B: modified, Node 11 @ CMOB[401].]
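The annotated directory is essentially a map from block to coherence state plus a CMOB pointer; a minimal sketch, using the example entries from the slide (the tuple format and function name are invented):

```python
# Hypothetical directory entries extended with CMOB pointers, as on the
# slide: block A was last recorded at Node 4's CMOB entry 23, etc.
directory = {
    "A": {"state": "shared",   "recorder": 4,  "cmob_index": 23},
    "B": {"state": "modified", "recorder": 11, "cmob_index": 401},
}

def on_coherence_miss(block):
    """On a coherence miss, the directory forwards a stream request to
    the node that recorded this block's miss, at the saved CMOB slot."""
    entry = directory.get(block)
    if entry is None:
        return None        # no recorded stream for this block
    return ("stream_request", entry["recorder"], entry["cmob_index"])

print(on_coherence_miss("A"))  # ('stream_request', 4, 23)
```

Because the directory is consulted on every coherence miss anyway, the extra pointer lookup adds no new traffic.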
Temporal Streaming Engine
Stream
• Fetch data to match the use rate
  Addresses held in a FIFO stream queue
  Fetched into a streamed value buffer (~32 entries)
[Diagram: on “Node i: stream {A, B, C, …}”, addresses F, E, D, C, B wait in the stream queue while A is fetched into the streamed value buffer beside the L1.]
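The stream-side mechanism pairs a FIFO of pending addresses with a small value buffer; a toy sketch (class and method names are invented, and `memory` is just a dict standing in for remote data):

```python
from collections import deque

class StreamBuffer:
    """Toy streamed value buffer: fetch addresses from a FIFO stream
    queue into a small buffer, pacing fetches to the CPU's use rate."""

    def __init__(self, capacity=32):
        self.queue = deque()       # addresses still to fetch (FIFO)
        self.buffer = {}           # addr -> data; bounded, ~32 entries
        self.capacity = capacity

    def enqueue_stream(self, addrs):
        self.queue.extend(addrs)

    def fetch_next(self, memory):
        # Fetch one queued address into the value buffer if there is room.
        if self.queue and len(self.buffer) < self.capacity:
            a = self.queue.popleft()
            self.buffer[a] = memory[a]

    def lookup(self, addr):
        # A hit in the value buffer turns a would-be miss into a fast hit.
        return self.buffer.pop(addr, None)

memory = {"B": 1, "C": 2, "D": 3}
sb = StreamBuffer()
sb.enqueue_stream(["B", "C", "D"])
sb.fetch_next(memory)
print(sb.lookup("B"))  # 1
```

Bounding the buffer is the point: the queue can hold a long stream while only a few blocks’ worth of data occupy on-chip storage at any instant.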
Practical HW Mechanisms
• Streams recorded and followed in order
  FIFO stream queues; ~32-entry streamed value buffer; coalesced cache-block-sized CMOB appends
• Predicts many misses from one request
  More lookahead; allows off-chip stream storage; leverages the existing directory lookup
Methodology: Infrastructure
SimFlex [SIGMETRICS’04]
Statistical sampling → uArch simulation in minutes
Full-system MP simulation (boots Linux & Solaris)
Uni, CMP, DSM timing models
Real server software (e.g., DB2 & Oracle)
Component-based; FPGA board interface for hybrid simulation/prototyping
Publicly available at http://www.ece.cmu.edu/~simflex
Methodology: Benchmarks & Parameters
Model Parameters
• 16 4GHz SPARC CPUs
• 8-wide OoO; 8-stage pipe
• 256-entry ROB/LSQ
• 64K L1, 8MB L2
• TSO w/ speculation
Benchmark Applications
• Scientific: em3d, moldyn, ocean
• OLTP: TPC-C 3.0, 100 warehouses, on IBM DB2 7.2 and Oracle 10g
• SPECweb99 w/ 16K connections on Apache 2.0 and Zeus 4.3
TSE Coverage Comparison
[Chart: % of coherent read misses covered (and discarded) by Stride, GHB G/DC, GHB G/AC, and TSE for em3d, moldyn, ocean, Apache, DB2, Oracle, and Zeus.]
TSE outperforms Stride and GHB for coherence misses
Stream Lengths
[Chart: cumulative % of all hits vs. stream length (1 to 128K streamed blocks) for Apache, DB2, Oracle, Zeus, em3d, moldyn, and ocean.]
• Commercial: short streams; low base MLP (1.2-1.3)
• Scientific: long streams; high base MLP (1.6-6.6)
• Temporal streaming addresses both cases
TSE Performance Impact
[Chart: normalized execution time broken into busy, other stalls, and coherent read stalls for base vs. TSE, with speedups (up to 3.3x) and 95% confidence intervals, for em3d, moldyn, ocean, Apache, DB2, Oracle, and Zeus.]
• TSE eliminates 25%-95% of coherent read stalls, a 6% to 230% performance improvement
TSE Conclusions
• Temporal Streaming
  Intuition: recent coherence miss sequences recur
  Impact: eliminates 50-100% of coherence misses
• Temporal Streaming Engine
  Intuition: in-order streams enable practical HW
  Impact: 7%-230% performance improvement in scientific apps; 6%-21% in commercial Web & OLTP apps
Outline
• Big Picture
• Example Streaming Techniques
1. Temporal Shared Memory Streaming
2. Last-Touch Correlated Data Streaming
• Summary
Enhancing Lookahead
Observation [Mendelson, Wood & Hill]:
• Few live sets: each is used until its last “hit”
• Data reuse → high hit rate
• ~80% dead frames!
Exploit this for lookahead:
• Predict the last “touch” prior to “death”
• Evict, predict, and fetch the next line
[Diagram: L1 contents at times T1 and T2, partitioned into live sets and dead sets.]
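Live time and dead time fall straight out of a block’s access trace; a small worked example (cycle numbers are invented):

```python
def frame_times(access_cycles, evict_cycle):
    """Live time: first touch to last touch. Dead time: last touch to
    eviction. Dead time >> live time is what makes last-touch
    prediction such a large lookahead opportunity."""
    live = access_cycles[-1] - access_cycles[0]
    dead = evict_cycle - access_cycles[-1]
    return live, dead

# A block touched at cycles 100, 120, 150 and evicted at cycle 1150
# spends 50 cycles live and 1000 cycles dead.
print(frame_times([100, 120, 150], 1150))  # (50, 1000)
```

A predictor that fires at the last touch (cycle 150) has the entire dead time in which to fetch the frame’s next occupant.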
How Much Lookahead?
Predicting last touches can eliminate all of the latency!
[Histogram: distribution of frame dead times (cycles), with the L2 and DRAM latencies marked for comparison.]
Dead-Block Prediction [ISCA’00 & ’01]
• Per-block traces of memory accesses predict repetitive last-touch events
[Diagram: accesses to a block frame. First touch: PC1: load/store A1 (miss); then PC3: load/store A1 (hit) twice, the second being the last touch; an unrelated PC0: load/store A0 (hit) intervenes; finally PC5: load/store A3 (miss) evicts A1. Trace = A1 (PC1, PC3, PC3).]
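Slide 31 names the trace encoding as truncated addition; a hedged sketch of one plausible such encoding (the exact hardware encoding is not specified here, so the addresses, bit width, and mixing below are purely illustrative):

```python
def signature(addr, pcs, bits=16):
    """Hypothetical last-touch signature: add the block address and the
    PCs that touched the block, truncating to a fixed bit width
    (one possible reading of 'truncated addition')."""
    sig = addr
    for pc in pcs:
        sig = (sig + pc) & ((1 << bits) - 1)   # truncate to 'bits' bits
    return sig

# The slide's trace for block A1 is (PC1, PC3, PC3); the address and
# PC values here are made up for the example.
A1, PC1, PC3 = 0x1000, 0x400, 0x430
print(hex(signature(A1, [PC1, PC3, PC3])))  # 0x1c60
```

The appeal of such an encoding is that a whole variable-length trace collapses to one fixed-width table index.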
Dead-Block Prefetcher (DBCP)
[Diagram: current access PC3 to A1. The history table (HT) holds A1’s latest trace (PC1, PC3); appending PC3 yields A1 (PC1, PC3, PC3), which matches a correlation-table entry whose 2-bit counter is saturated (11) and points to A3, so DBCP evicts A1 and fetches A3.]
• History & correlation tables
  History table ~ L1 tag array; correlation table ~ memory footprint
• Encoding: truncated addition
• Two-bit saturating counters
DBCP Coverage with Unlimited Table Storage
[Chart: % of cache misses classified as correct, incorrect, train, or early for Olden, SPEC INT, and SPEC FP.]
• High average L1 miss coverage
• Low misprediction rate (2-bit counters)
Impractical On-Chip Storage Size
[Chart: % of achievable coverage (average and worst case) vs. on-chip correlation table size, from 160KB to 160MB.]
Needs over 150MB to achieve full potential!
Our Observation: Signatures are Temporally Correlated
Signatures need not reside on chip:
1. Last-touch sequences recur
• Much as cache miss sequences recur [Chilimbi’02]
• Often due to large structure traversals
2. Last-touch order ~ cache miss order
• Off by at most the L1 cache capacity
Key implications:
• Can record last touches in miss order
• Store & stream signatures from off-chip
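The off-chip storage idea mirrors temporal streaming, applied to signatures: keep only sequence heads on chip and stream the rest in on demand. A toy sketch (the two-level dict/set layout and names are invented; in hardware the sequences live in DRAM and stream into a small cache):

```python
# Hypothetical sketch of the LT-CORDS storage split: last-touch
# signature sequences live off chip; only the sequence "heads" stay on.
offchip_sequences = {}   # head signature -> rest of its sequence (DRAM)
onchip_heads = set()     # small on-chip set of heads

def record_sequence(sigs):
    head, rest = sigs[0], sigs[1:]
    offchip_sequences[head] = rest   # stored off chip in the real design
    onchip_heads.add(head)

def on_last_touch(sig):
    """Seeing a head signature cues a stream fetch of the rest of the
    sequence into the on-chip stream cache."""
    if sig in onchip_heads:
        return offchip_sequences[sig]
    return []

record_sequence(["s1", "s2", "s3", "s4"])
print(on_last_touch("s1"))  # ['s2', 's3', 's4']
```

On-chip state scales with the number of sequence heads, not with the application’s footprint, which is exactly the property the next slide claims.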
Last-Touch Correlated Data Streaming (LT-CORDS)
• Stream signatures onto the chip
  Keep all signatures, in sequence, in off-chip DRAM
  Retain sequence “heads” on chip; a “head” signals a stream fetch
• Small (~200KB) on-chip stream cache
  Tolerates order mismatch; provides lookahead for stream startup
DBCP coverage with moderate on-chip storage!
DBCP Mechanisms
[Diagram: core with L1/L2 and history table (HT); all signatures (160MB) sit in a random-access on-chip table, alongside DRAM.]
What LT-CORDS Does
• Signatures stored off-chip, and only in order
• Only a subset is needed at a time; the “head” is the cue for the “stream”
[Diagram: core with L1/L2 and history table (HT); signature sequences move off chip to DRAM.]
LT-CORDS Mechanisms
• On-chip storage independent of footprint
[Diagram: core with L1/L2, history table (HT), and stream cache (SC); heads (10K) and streamed signatures (200K) on chip; full sequences in DRAM.]
Methodology
• SimpleScalar CPU model with Alpha ISA
  SPEC CPU2000 & Olden benchmarks
• 8-wide out-of-order processor
  2-cycle L1, 16-cycle L2, 180-cycle DRAM
  FU latencies similar to Alpha EV8
  64KB 2-way L1D, 1MB 8-way L2
• LT-CORDS with 214KB on-chip storage
• Apps. with significant memory stalls
LT-CORDS vs. DBCP Coverage
[Chart: % of cache misses classified as correct, incorrect, train, or early for DBCP vs. LT-CORDS on Olden, SPEC INT, and SPEC FP.]
LT-CORDS matches the coverage of DBCP with infinite table storage
LT-CORDS Speedup
[Chart: speedup (1x to 5x) for Infinite Cache, LT-CORDS, and GHB (PC/DC) on treeadd, mgrid, swim, equake, applu, mcf, fma3d, art, em3d, facerec, wupwise, bh, ammp, gcc, parser, and sixtrack.]
LT-CORDS hides a large fraction of memory latency
LT-CORDS Conclusions
• Intuition: signatures are temporally correlated
  Cache miss & last-touch sequences recur
  Miss order ~ last-touch order
• Impact: eliminates 75% of all misses
  Retains DBCP coverage, lookahead, and accuracy
  On-chip storage independent of footprint
  2x fewer memory stalls than the best prior work
For more information, visit our website: http://www.ece.cmu.edu/CALCM