Temporal Memory Streaming
Transcript of Temporal Memory Streaming
© 2005 Babak Falsafi
Temporal Memory Streaming
Babak Falsafi
Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch
Collaborators: Anastassia Ailamaki & Andreas Moshovos
STEMS
Computer Architecture Lab, Carnegie Mellon
http://www.ece.cmu.edu/CALCM
The Memory Wall
Logic/DRAM speed gap continues to increase!
[Chart: clocks per instruction and clocks per DRAM access (log scale, 0.01 to 1000) for core(s) vs. memory, across VAX/1980, PPro/1996, and 2010+.]
Current Approach
Cache hierarchies:
• Trade off capacity for speed
• Exploit “reuse”
But, in modern servers:
• Only 50% utilization of one processor [Ailamaki, VLDB’99]
• Much bigger problem in MPs
What is wrong?
• Demand fetch/replacement of data
[Diagram: memory hierarchy: CPU, L1 64K (1 clk), L2 2M (10 clk), L3 64M (100 clk), DRAM 100G (1000 clk).]
Prior Work (SW-Transparent)
• Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04]: simple patterns or low accuracy
• Large execution windows / runahead [Mutlu 03]: fetch dependent addresses serially
• Coherence optimizations [Stenström 93] [Lai 00] [Huh 04]: limited applicability (e.g., migratory)
Need solutions for arbitrary access patterns
Our Solution: Spatio-Temporal Memory Streaming
Observation:
• Data are spatially/temporally correlated
• Arbitrary, yet repetitive, patterns
Approach: Memory Streaming
• Extract spatial/temporal patterns
• Stream data to/from the CPU
• Manage resources for multiple blocks; break dependence chains
• In HW, SW, or both
[Diagram: STEMS sits between the CPU/L1 and DRAM or other CPUs, streaming blocks in ahead of fetches and replacing blocks out.]
Contribution #1: Temporal Shared-Memory Streaming
• Recent coherence miss sequences recur
  50% of misses closely follow a previous sequence
  Large opportunity to exploit MLP
• Temporal Streaming Engine
  Ordered streams allow practical HW
  Performance improvement: 7%-230% in scientific apps; 6%-21% in commercial Web & OLTP apps
Contribution #2: Last-touch Correlated Data Streaming
• Last-touch prefetchers
  Cache block dead time >> live time
  Fetch on a predicted “last touch”
  But designs are impractical (> 200MB on-chip)
• Last-touch correlated data streaming
  Miss order ~ last-touch order
  Stream table entries from off-chip
  Eliminates 75% of all L1 misses with ~200KB
Outline
• STEMS Overview
• Example Temporal Streaming Techniques
1. Temporal Shared-Memory Streaming
2. Last-Touch Correlated Data Streaming
• Summary
Temporal Shared-Memory Streaming [ISCA’05]
• Record sequences of memory accesses
• Transfer data sequences ahead of requests
[Diagram: in the baseline system, the CPU misses on A and waits for Fill A, then misses on B and waits for Fill B. In the streaming system, on Miss A the memory fills A, B, C, … ahead of the requests.]
• Accelerates arbitrary access patterns; parallelizes the critical path of pointer-chasing
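As a toy illustration of the record-and-replay idea (a Python sketch of the concept only, not the hardware; the class name, stream length, and single global log are invented for exposition):

```python
class TemporalStreamer:
    """Toy model of temporal streaming: log misses in order and, when a
    miss recurs, replay the addresses that followed it last time."""

    def __init__(self, stream_len=4):
        self.log = []          # ordered miss history (plays the role of a CMOB)
        self.last_pos = {}     # address -> most recent index in the log
        self.stream_len = stream_len

    def miss(self, addr):
        # If this address missed before, stream the addresses that
        # followed it on that earlier occasion.
        prefetches = []
        if addr in self.last_pos:
            start = self.last_pos[addr] + 1
            prefetches = self.log[start:start + self.stream_len]
        # Record the new miss.
        self.last_pos[addr] = len(self.log)
        self.log.append(addr)
        return prefetches

s = TemporalStreamer()
for a in ["Q", "W", "A", "B", "C", "D"]:
    s.miss(a)
# The stream recurs: a repeat miss on A streams B, C, D ahead of use.
print(s.miss("A"))   # ['B', 'C', 'D']
```

Note how a single recurring miss (A) yields several prefetch candidates at once, which is where the lookahead comes from.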
Relationship Between Misses
• Intuition: miss sequences repeat because code sequences repeat
• Observed for uniprocessors in [Chilimbi’02]
• Temporal Address Correlation: the same miss addresses repeat in the same order
Correlated miss sequence = stream
Miss seq.: … Q W A B C D E R T A B C D E Y …
Relationship Between Streams
• Intuition: streams exhibit temporal locality
  Because the working set exhibits temporal locality
  For shared data, repetition is often across nodes
• Temporal Stream Locality Recent streams likely to recur
Node 1: Q W A B C D E R
Node 2: T A B C D E Y
Addr. correlation + stream locality = temporal correlation
Memory Level Parallelism
• Streams create MLP for dependent misses
• Not possible with larger windows / runahead
Temporal streaming breaks dependence chains
[Diagram: in the baseline, the CPU must wait to follow pointers A → B → C serially; with temporal streaming, A, B, and C are fetched in parallel.]
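A back-of-the-envelope latency model makes the MLP argument concrete (the 100-cycle figure and the perfect-overlap assumption are illustrative, not measured):

```python
MISS_LATENCY = 100  # hypothetical cycles per memory access

def baseline_cycles(chain_len):
    # Pointer chasing: each node's address comes from the previous
    # node's data, so the misses serialize.
    return chain_len * MISS_LATENCY

def streaming_cycles(chain_len):
    # After the first miss, the recorded stream supplies the remaining
    # addresses, so those fetches overlap (perfect MLP assumed here).
    return MISS_LATENCY if chain_len else 0

print(baseline_cycles(4), streaming_cycles(4))  # 400 100
```

The gap grows linearly with chain length, which is why streaming pays off most on long dependent chains.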
Temporal Streaming
[Diagram, step 1 (Record): Node 1 records its coherence miss sequence A, B, C, D; the directory and Node 2 are also shown.]
Temporal Streaming
[Diagram, step 2: Node 1 has recorded misses A, B, C, D; Node 2 misses on A, sends Req. A to the directory, and receives Fill A.]
Temporal Streaming
[Diagram, step 3 (Locate): on Node 2’s request for A, the directory locates A in Node 1’s recorded sequence, and Node 1 streams B, C, D to Node 2.]
Temporal Streaming
[Diagram, step 4 (Stream): Node 2 fetches B, C, D ahead of use; its subsequent accesses to B and C hit locally.]
Temporal Streaming Engine
Record
• Coherence Miss Order Buffer (CMOB)
  ~1.5MB circular buffer per node, held in local memory
  Addresses only; coalesced accesses
[Diagram: on Fill E, the CPU appends E’s address to the CMOB in local memory: … Q W A B C D E R T Y …]
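The CMOB’s two behaviors, a fixed-size circular address log and coalesced (grouped) appends, can be sketched as follows (capacity and group size are toy values; the real buffer is ~1.5MB per node):

```python
class CMOB:
    """Toy coherence-miss order buffer: a fixed-size circular buffer of
    miss addresses, with appends coalesced into block-sized groups."""

    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.head = 0          # next append position (monotonic)
        self.pending = []      # appends waiting to be written as a group

    def append(self, addr, group=4):
        self.pending.append(addr)
        if len(self.pending) >= group:       # write a full group at once
            for a in self.pending:
                self.buf[self.head % len(self.buf)] = a
                self.head += 1
            self.pending.clear()

    def read_stream(self, index, n):
        # Return up to n recorded addresses starting at a CMOB pointer.
        return [self.buf[(index + i) % len(self.buf)]
                for i in range(n)
                if self.buf[(index + i) % len(self.buf)] is not None]

cmob = CMOB()
for a in ["A", "B", "C", "D"]:
    cmob.append(a)
print(cmob.read_stream(1, 3))  # ['B', 'C', 'D']
```

Coalescing is what keeps the append traffic to local memory cheap: one write per group rather than one per miss.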
Temporal Streaming Engine
Locate
• Annotate the directory, which already has coherence info for every block
  On a CMOB append, send a pointer to the directory
  On a coherence miss, forward a stream request
[Directory example: A: shared, Node 4 @ CMOB[23]; B: modified, Node 11 @ CMOB[401].]
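The annotated directory is essentially a map from block to coherence state plus a CMOB pointer; a minimal sketch, using the example entries from the slide (the tuple format and function name are invented):

```python
# Hypothetical directory entries extended with CMOB pointers, as on the
# slide: block A was last recorded at Node 4's CMOB entry 23, etc.
directory = {
    "A": {"state": "shared",   "recorder": 4,  "cmob_index": 23},
    "B": {"state": "modified", "recorder": 11, "cmob_index": 401},
}

def on_coherence_miss(block):
    """On a coherence miss, the directory forwards a stream request to
    the node that recorded this block's miss, at the saved CMOB slot."""
    entry = directory.get(block)
    if entry is None:
        return None        # no recorded stream for this block
    return ("stream_request", entry["recorder"], entry["cmob_index"])

print(on_coherence_miss("A"))  # ('stream_request', 4, 23)
```

Because the directory is consulted on every coherence miss anyway, the extra pointer lookup adds no new traffic.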
Temporal Streaming Engine
Stream
• Fetch data to match the use rate
  Addresses held in a FIFO stream queue
  Fetched into a streamed value buffer (~32 entries)
[Diagram: on “Node i: stream {A, B, C, …}”, addresses F, E, D, C, B wait in the stream queue while A is fetched into the streamed value buffer beside the L1.]
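The stream-side mechanism pairs a FIFO of pending addresses with a small value buffer; a toy sketch (class and method names are invented, and `memory` is just a dict standing in for remote data):

```python
from collections import deque

class StreamBuffer:
    """Toy streamed value buffer: fetch addresses from a FIFO stream
    queue into a small buffer, pacing fetches to the CPU's use rate."""

    def __init__(self, capacity=32):
        self.queue = deque()       # addresses still to fetch (FIFO)
        self.buffer = {}           # addr -> data; bounded, ~32 entries
        self.capacity = capacity

    def enqueue_stream(self, addrs):
        self.queue.extend(addrs)

    def fetch_next(self, memory):
        # Fetch one queued address into the value buffer if there is room.
        if self.queue and len(self.buffer) < self.capacity:
            a = self.queue.popleft()
            self.buffer[a] = memory[a]

    def lookup(self, addr):
        # A hit in the value buffer turns a would-be miss into a fast hit.
        return self.buffer.pop(addr, None)

memory = {"B": 1, "C": 2, "D": 3}
sb = StreamBuffer()
sb.enqueue_stream(["B", "C", "D"])
sb.fetch_next(memory)
print(sb.lookup("B"))  # 1
```

Bounding the buffer is the point: the queue can hold a long stream while only a few blocks’ worth of data occupy on-chip storage at any instant.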
Practical HW Mechanisms
• Streams recorded and followed in order
  FIFO stream queues; ~32-entry streamed value buffer; coalesced cache-block-sized CMOB appends
• Predicts many misses from one request
  More lookahead; allows off-chip stream storage; leverages the existing directory lookup
Methodology: Infrastructure
SimFlex [SIGMETRICS’04]
Statistical sampling → uArch simulation in minutes
Full-system MP simulation (boots Linux & Solaris)
Uni, CMP, DSM timing models
Real server software (e.g., DB2 & Oracle)
Component-based; FPGA board interface for hybrid simulation/prototyping
Publicly available at http://www.ece.cmu.edu/~simflex
Methodology: Benchmarks & Parameters
Model Parameters
• 16 4GHz SPARC CPUs
• 8-wide OoO; 8-stage pipe
• 256-entry ROB/LSQ
• 64K L1, 8MB L2
• TSO w/ speculation
Benchmark Applications
• Scientific: em3d, moldyn, ocean
• OLTP: TPC-C 3.0, 100 warehouses, on IBM DB2 7.2 and Oracle 10g
• SPECweb99 w/ 16K connections on Apache 2.0 and Zeus 4.3
TSE Coverage Comparison
[Chart: % of coherent read misses covered (and discarded) by Stride, GHB G/DC, GHB G/AC, and TSE for em3d, moldyn, ocean, Apache, DB2, Oracle, and Zeus.]
TSE outperforms Stride and GHB for coherence misses
Stream Lengths
[Chart: cumulative % of all hits vs. stream length (1 to 128K streamed blocks) for Apache, DB2, Oracle, Zeus, em3d, moldyn, and ocean.]
• Commercial: short streams; low base MLP (1.2-1.3)
• Scientific: long streams; high base MLP (1.6-6.6)
• Temporal streaming addresses both cases
TSE Performance Impact
[Chart: normalized execution time broken into busy, other stalls, and coherent read stalls for base vs. TSE, with speedups (up to 3.3x) and 95% confidence intervals, for em3d, moldyn, ocean, Apache, DB2, Oracle, and Zeus.]
• TSE eliminates 25%-95% of coherent read stalls, a 6% to 230% performance improvement
TSE Conclusions
• Temporal Streaming
  Intuition: recent coherence miss sequences recur
  Impact: eliminates 50-100% of coherence misses
• Temporal Streaming Engine
  Intuition: in-order streams enable practical HW
  Impact: 7%-230% performance improvement in scientific apps; 6%-21% in commercial Web & OLTP apps
Outline
• Big Picture
• Example Streaming Techniques
1. Temporal Shared Memory Streaming
2. Last-Touch Correlated Data Streaming
• Summary
Enhancing Lookahead
Observation [Mendelson, Wood & Hill]:
• Few live sets: each is used until its last “hit”
• Data reuse → high hit rate
• ~80% dead frames!
Exploit this for lookahead:
• Predict the last “touch” prior to “death”
• Evict, predict, and fetch the next line
[Diagram: L1 contents at times T1 and T2, partitioned into live sets and dead sets.]
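Live time and dead time fall straight out of a block’s access trace; a small worked example (cycle numbers are invented):

```python
def frame_times(access_cycles, evict_cycle):
    """Live time: first touch to last touch. Dead time: last touch to
    eviction. Dead time >> live time is what makes last-touch
    prediction such a large lookahead opportunity."""
    live = access_cycles[-1] - access_cycles[0]
    dead = evict_cycle - access_cycles[-1]
    return live, dead

# A block touched at cycles 100, 120, 150 and evicted at cycle 1150
# spends 50 cycles live and 1000 cycles dead.
print(frame_times([100, 120, 150], 1150))  # (50, 1000)
```

A predictor that fires at the last touch (cycle 150) has the entire dead time in which to fetch the frame’s next occupant.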
How Much Lookahead?
Predicting last touches can eliminate all of the latency!
[Histogram: distribution of frame dead times (cycles), with the L2 and DRAM latencies marked for comparison.]
Dead-Block Prediction [ISCA’00 & ’01]
• Per-block traces of memory accesses predict repetitive last-touch events
[Diagram: accesses to a block frame. First touch: PC1: load/store A1 (miss); then PC3: load/store A1 (hit) twice, the second being the last touch; an unrelated PC0: load/store A0 (hit) intervenes; finally PC5: load/store A3 (miss) evicts A1. Trace = A1 (PC1, PC3, PC3).]
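Slide 31 names the trace encoding as truncated addition; a hedged sketch of one plausible such encoding (the exact hardware encoding is not specified here, so the addresses, bit width, and mixing below are purely illustrative):

```python
def signature(addr, pcs, bits=16):
    """Hypothetical last-touch signature: add the block address and the
    PCs that touched the block, truncating to a fixed bit width
    (one possible reading of 'truncated addition')."""
    sig = addr
    for pc in pcs:
        sig = (sig + pc) & ((1 << bits) - 1)   # truncate to 'bits' bits
    return sig

# The slide's trace for block A1 is (PC1, PC3, PC3); the address and
# PC values here are made up for the example.
A1, PC1, PC3 = 0x1000, 0x400, 0x430
print(hex(signature(A1, [PC1, PC3, PC3])))  # 0x1c60
```

The appeal of such an encoding is that a whole variable-length trace collapses to one fixed-width table index.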
Dead-Block Prefetcher (DBCP)
[Diagram: current access PC3 to A1. The history table (HT) holds A1’s latest trace (PC1, PC3); appending PC3 yields A1 (PC1, PC3, PC3), which matches a correlation-table entry whose 2-bit counter is saturated (11) and points to A3, so DBCP evicts A1 and fetches A3.]
• History & correlation tables
  History table ~ L1 tag array; correlation table ~ memory footprint
• Encoding: truncated addition
• Two-bit saturating counters
DBCP Coverage with Unlimited Table Storage
[Chart: % of cache misses classified as correct, incorrect, train, or early for Olden, SPEC INT, and SPEC FP.]
• High average L1 miss coverage
• Low misprediction rate (2-bit counters)
Impractical On-Chip Storage Size
[Chart: % of achievable coverage (average and worst case) vs. on-chip correlation table size, from 160KB to 160MB.]
Needs over 150MB to achieve full potential!
Our Observation: Signatures are Temporally Correlated
Signatures need not reside on chip:
1. Last-touch sequences recur
• Much as cache miss sequences recur [Chilimbi’02]
• Often due to large structure traversals
2. Last-touch order ~ cache miss order
• Off by at most the L1 cache capacity
Key implications:
• Can record last touches in miss order
• Store & stream signatures from off-chip
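The off-chip storage idea mirrors temporal streaming, applied to signatures: keep only sequence heads on chip and stream the rest in on demand. A toy sketch (the two-level dict/set layout and names are invented; in hardware the sequences live in DRAM and stream into a small cache):

```python
# Hypothetical sketch of the LT-CORDS storage split: last-touch
# signature sequences live off chip; only the sequence "heads" stay on.
offchip_sequences = {}   # head signature -> rest of its sequence (DRAM)
onchip_heads = set()     # small on-chip set of heads

def record_sequence(sigs):
    head, rest = sigs[0], sigs[1:]
    offchip_sequences[head] = rest   # stored off chip in the real design
    onchip_heads.add(head)

def on_last_touch(sig):
    """Seeing a head signature cues a stream fetch of the rest of the
    sequence into the on-chip stream cache."""
    if sig in onchip_heads:
        return offchip_sequences[sig]
    return []

record_sequence(["s1", "s2", "s3", "s4"])
print(on_last_touch("s1"))  # ['s2', 's3', 's4']
```

On-chip state scales with the number of sequence heads, not with the application’s footprint, which is exactly the property the next slide claims.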
Last-Touch Correlated Data Streaming (LT-CORDS)
• Stream signatures onto the chip
  Keep all signatures, in sequence, in off-chip DRAM
  Retain sequence “heads” on chip; a “head” signals a stream fetch
• Small (~200KB) on-chip stream cache
  Tolerates order mismatch; provides lookahead for stream startup
DBCP coverage with moderate on-chip storage!
DBCP Mechanisms
[Diagram: core with L1/L2 and history table (HT); all signatures (160MB) sit in a random-access on-chip table, alongside DRAM.]
What LT-CORDS Does
• Signatures stored off-chip, and only in order
• Only a subset is needed at a time; the “head” is the cue for the “stream”
[Diagram: core with L1/L2 and history table (HT); signature sequences move off chip to DRAM.]
LT-CORDS Mechanisms
• On-chip storage independent of footprint
[Diagram: core with L1/L2, history table (HT), and stream cache (SC); heads (10K) and streamed signatures (200K) on chip; full sequences in DRAM.]
Methodology
• SimpleScalar CPU model with Alpha ISA
  SPEC CPU2000 & Olden benchmarks
• 8-wide out-of-order processor
  2-cycle L1, 16-cycle L2, 180-cycle DRAM
  FU latencies similar to Alpha EV8
  64KB 2-way L1D, 1MB 8-way L2
• LT-CORDS with 214KB on-chip storage
• Apps. with significant memory stalls
LT-CORDS vs. DBCP Coverage
[Chart: % of cache misses classified as correct, incorrect, train, or early for DBCP vs. LT-CORDS on Olden, SPEC INT, and SPEC FP.]
LT-CORDS matches the coverage of DBCP with infinite table storage
LT-CORDS Speedup
[Chart: speedup (1x to 5x) for Infinite Cache, LT-CORDS, and GHB (PC/DC) on treeadd, mgrid, swim, equake, applu, mcf, fma3d, art, em3d, facerec, wupwise, bh, ammp, gcc, parser, and sixtrack.]
LT-CORDS hides a large fraction of memory latency
LT-CORDS Conclusions
• Intuition: signatures are temporally correlated
  Cache miss & last-touch sequences recur
  Miss order ~ last-touch order
• Impact: eliminates 75% of all misses
  Retains DBCP coverage, lookahead, and accuracy
  On-chip storage independent of footprint
  2x fewer memory stalls than the best prior work
For more information, visit our website: http://www.ece.cmu.edu/CALCM