Teaching Old Caches New Tricks: Predictor Virtualization
Transcript of Teaching Old Caches New Tricks: Predictor Virtualization
Teaching Old Caches New Tricks: Predictor Virtualization
Andreas Moshovos, Univ. of Toronto
Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)
2 Prediction: The Way Forward
[Diagram: CPU with predictors for prefetching, branch target and direction, cache replacement, and cache hit prediction.]
• Application footprints grow
• Predictors need to scale to remain effective
• Ideally: fast, accurate predictions
• Can't have this with conventional technology
Prediction has proven useful – many forms – which to choose?
3 The Problem with Conventional Predictors
Predictor Virtualization: approximate large, accurate, fast predictors.
[Diagram: predictor design trades off hardware cost, accuracy, and latency.]
• What we have: small, fast, not-so-accurate
• What we want: small, fast, accurate
4 Why Now?
[Diagram: CMP with four CPUs (I$/D$ each), a 10-100MB L2 cache, and physical memory.]
Extra resources: CMPs with large caches.
5 Predictor Virtualization (PV)
[Diagram: CMP with four CPUs (I$/D$ each), L2 cache, and physical memory.]
Use the on-chip cache to store metadata: reduce the cost of dedicated predictors.
6 Predictor Virtualization (PV)
[Diagram: CMP with four CPUs (I$/D$ each), L2 cache, and physical memory.]
Use the on-chip cache to store metadata: implement otherwise impractical predictors.
7 Research Overview
• PV breaks the conventional predictor design trade-offs
  – Lowers the cost of adoption
  – Facilitates implementation of otherwise impractical predictors
• Freeloads on existing resources
  – Adaptive demand
• Key design challenge
  – How to compensate for the longer latency to metadata
• PV in action
  – Virtualized "Spatial Memory Streaming"
  – Virtualized Branch Target Buffers
8 Talk Roadmap
• PV Architecture
• PV in Action
  – Virtualizing "Spatial Memory Streaming"
  – Virtualizing Branch Target Buffers
• Conclusions
9 PV Architecture
[Diagram: CPU (I$/D$), L2 cache, and physical memory; an Optimization Engine sends requests to a Predictor Table and receives predictions. The Predictor Table is what gets virtualized into the memory hierarchy.]
10 PV Architecture
[Diagram: the dedicated Predictor Table is replaced by a small PVCache fronted by a PVProxy; the full PVTable lives in the L2 cache and physical memory.]
Requires access to the L2, on the back side of the L1: not as performance critical.
11 PV Challenge: Prediction Latency
[Diagram: predictor requests first probe the PVCache (common case: hit); on a miss, the PVProxy fetches metadata from the L2 cache (infrequent; latency 12-18 cycles) or from physical memory (rare; latency ~400 cycles).]
Key: how to pack metadata into L2 cache blocks to amortize costs.
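The common/infrequent/rare breakdown above can be sketched as a toy model. This is illustrative Python, not the hardware: PVProxy and PVCache are the talk's names, but the concrete latencies, the tiny capacity, and the eviction policy here are assumptions.

```python
# Toy model of a PVProxy serving predictor lookups from a small PVCache,
# falling back to the L2 and then to memory. Latencies are assumed values
# that mirror the slide (hit ~1 cycle, L2 fill ~15, memory fill ~400).

class PVProxy:
    def __init__(self, pv_cache_lines=8):
        self.pv_cache = {}              # line address -> metadata line
        self.capacity = pv_cache_lines
        self.stats = {"pv_hit": 0, "l2_fill": 0, "mem_fill": 0}

    def lookup(self, line_addr, l2, memory):
        """Return (metadata_line, latency_cycles) for a predictor request."""
        if line_addr in self.pv_cache:           # common case: PVCache hit
            self.stats["pv_hit"] += 1
            return self.pv_cache[line_addr], 1
        if line_addr in l2:                      # infrequent: fill from L2
            self.stats["l2_fill"] += 1
            line, latency = l2[line_addr], 15
        else:                                    # rare: fill from memory
            self.stats["mem_fill"] += 1
            line, latency = memory.get(line_addr, b"\x00" * 64), 400
        if len(self.pv_cache) >= self.capacity:  # simplistic eviction
            self.pv_cache.pop(next(iter(self.pv_cache)))
        self.pv_cache[line_addr] = line          # install the fetched line
        return line, latency
```

The point of the sketch is only the latency hierarchy: the PVCache makes the common case fast, so the virtualized predictor pays the L2 (or memory) latency only on PVCache misses.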
12 To Virtualize or Not To Virtualize
• Predictors redesigned with PV in mind
• Overcoming the latency challenge
  – Metadata reuse
    • Intrinsic: one entry used for multiple predictions
    • Temporal: one entry reused in the near future
    • Spatial: one miss amortized by several subsequent hits
  – Metadata access-pattern predictability
    • Predictor metadata prefetching
• Looks similar to designing caches, BUT:
  – Does not have to be correct all the time
  – Time limit on usefulness
13 PV in Action
• Data prefetching: virtualize "Spatial Memory Streaming" [ISCA06]
  – Performance within 1% of the original
  – Hardware cost from 60KB down to < 1KB
• Branch prediction: virtualize branch target buffers
  – Increases the perceived BTB capacity
  – Up to 12.75% IPC improvement with 8% hardware overhead
14 Spatial Memory Streaming [ISCA06]
[Diagram: accesses to memory regions are summarized as spatial patterns (bit vectors such as 1100001010001… and 1101100000001…) stored in a Pattern History Table.]
[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming."
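The spatial-pattern idea can be illustrated with a toy model. The region and block sizes, the (PC, block-offset) index, and the unbounded tables here are all simplifying assumptions for illustration; the actual ISCA'06 design differs in detail.

```python
# Toy model of SMS-style spatial patterns: record which blocks of a region
# were touched as a bit vector, keyed by the trigger access; replay the
# learned pattern as prefetches on the next matching trigger.
# Sizes below (2KB regions of 64B blocks) are assumptions.

REGION_SIZE = 2048
BLOCK_SIZE = 64
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE   # 32 bits per pattern

class SpatialDetector:
    def __init__(self):
        self.active = {}   # region base -> [trigger key, observed pattern]
        self.pht = {}      # (pc, block offset) -> learned bit pattern

    def access(self, pc, addr):
        """Record one data access in its region's bit pattern."""
        base = addr - addr % REGION_SIZE
        block = (addr % REGION_SIZE) // BLOCK_SIZE
        if base not in self.active:                 # first touch = trigger
            self.active[base] = [(pc, block), 0]
        self.active[base][1] |= 1 << block          # mark block as touched

    def evict_region(self, base):
        """On region eviction, commit the observed pattern to the PHT."""
        key, pattern = self.active.pop(base)
        self.pht[key] = pattern

    def predict(self, pc, addr):
        """Block addresses to prefetch when a trigger access recurs."""
        base = addr - addr % REGION_SIZE
        block = (addr % REGION_SIZE) // BLOCK_SIZE
        pattern = self.pht.get((pc, block), 0)
        return [base + b * BLOCK_SIZE
                for b in range(BLOCKS_PER_REGION) if pattern >> b & 1]
```

A pattern learned in one region can then drive prefetches in a different region touched by the same (PC, offset) trigger, which is what makes the table contents predictor metadata rather than cached data.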
15 Spatial Memory Streaming (SMS)
[Diagram: a Detector (~1KB) observes the data access stream and records spatial patterns; a Predictor (~60KB) matches a trigger access against a stored pattern and issues prefetches. The ~60KB predictor table is the virtualization target.]
16 Virtualizing SMS
[Diagram: the virtual table holds 1K sets of 11 ways of tag+pattern entries; one set packs into a single L2 cache line, with a few bytes unused. The PVCache holds 8 sets of 11 ways.]
Region-level prefetching is naturally tolerant of longer prediction latencies: simply pack predictor entries spatially.
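How one predictor set might pack into a single L2 line can be sketched as follows. The 5-byte tag+pattern encoding is an assumption chosen so that 11 ways fit in a 64-byte line, matching the slide's 1K-set, 11-way geometry; the real layout may differ.

```python
# Sketch: pack one 11-way predictor set into one 64B L2 cache line, and
# look entries up associatively within the line. The entry encoding
# (1-byte tag + 32-bit pattern = 5B, so 11 entries use 55B and 9B stay
# unused) is an assumed layout, not the paper's.
import struct

LINE_SIZE = 64
WAYS = 11                                 # ways per set, as on the slide
NUM_SETS = 1024                           # 1K sets, as on the slide
ENTRY_FMT = "<BI"                         # assumed: 1B tag + 4B pattern
ENTRY_SIZE = struct.calcsize(ENTRY_FMT)   # 5 bytes

def pack_set(entries):
    """Pack one predictor set (list of (tag, pattern)) into an L2 line."""
    assert len(entries) <= WAYS
    line = bytearray(LINE_SIZE)
    for way, (tag, pattern) in enumerate(entries):
        struct.pack_into(ENTRY_FMT, line, way * ENTRY_SIZE, tag, pattern)
    return bytes(line)

def line_addr(table_base, set_index):
    """One set per line: a predictor set index maps directly to a line."""
    return table_base + (set_index % NUM_SETS) * LINE_SIZE

def lookup_in_line(line, tag):
    """Associative search of the ways packed into one line."""
    for way in range(WAYS):
        t, pattern = struct.unpack_from(ENTRY_FMT, line, way * ENTRY_SIZE)
        if t == tag:
            return pattern
    return None
```

Packing a whole set per line is what amortizes the L2 round trip: one PVCache miss brings in 11 ways, and subsequent lookups to the same set hit locally.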
17 Experimental Methodology
• SimFlex: full-system, cycle-accurate simulator
• Baseline processor configuration
  – 4-core OoO CMP
  – L1D/L1I: 64KB, 4-way set-associative
  – UL2: 8MB, 16-way set-associative
• Commercial workloads
  – Web servers: Apache and Zeus
  – TPC-C: DB2 and Oracle
  – TPC-H: several queries
  – Developed by the Impetus group at CMU (Anastasia Ailamaki & Babak Falsafi, PIs)
SMS Performance Potential
[Chart: percentage of L1 read misses (covered, uncovered, overpredictions) for Apache, Oracle, and Qry 17, as the pattern table shrinks from infinite through 1K-16way, 1K-11way, 512-11way, 256-11way, 128-11way, 64-11way, 32-11way, 16-11way, and 8-11way configurations.]
Conventional predictor degrades with limited storage.
19 Virtualized SMS
[Chart: speedup; higher is better.]
Hardware cost: original prefetcher ~60KB, virtualized prefetcher < 1KB.
Impact of Virtualization on L2 Requests
[Chart: percentage increase in L2 requests (0-45%) for PV-8 and PV-16 on Apache, Oracle, and Qry 17.]
Impact of Virtualization on Off-Chip Bandwidth
[Chart: off-chip bandwidth increase (under 5%) from L2 misses and L2 writebacks, for PV-8 and PV-16 on Apache, Oracle, and Qry 17.]
22 PV in Action
• Data prefetching: virtualize "Spatial Memory Streaming" [ISCA06]
  – Same performance
  – Hardware cost from 60KB down to < 1KB
• Branch prediction: virtualize branch target buffers
  – Increases the perceived BTB capacity
  – Up to 12.75% IPC improvement with 8% hardware overhead
23 The Need for Larger BTBs
[Chart: branch MPKI vs. number of BTB entries; lower is better.]
Commercial applications benefit from large BTBs.
24 Virtualizing BTBs: Phantom-BTB
[Diagram: a small, fast PC-indexed BTB backed by a large, slow virtual table in the L2 cache.]
• Latency challenge: BTB lookups are not tolerant of longer prediction latencies
• Solution: predictor metadata prefetching
  – Virtual table decoupled from the BTB
  – Virtual table entry: a temporal group
Facilitating Metadata Prefetching
• Intuition: programs mostly follow similar paths
[Diagram: detection path vs. subsequent path.]
26 Temporal Groups
Past misses are a good indicator of future misses; the dedicated predictor acts as a filter.
27 Fetch Trigger
A preceding miss triggers the temporal group fetch; the trigger is not precise, but a region around the miss.
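A toy model of temporal-group generation and the fetch trigger: the 6-entry group size comes from the methodology slide, while the region-based trigger tag and everything else here are simplifying assumptions.

```python
# Toy model: BTB misses are accumulated into fixed-size temporal groups;
# a full group is installed in a virtual table keyed by an imprecise
# region around the first miss, and a later miss in that region fetches
# the whole group. Group tagging details are assumed, not the paper's.

GROUP_SIZE = 6     # entries per temporal group (methodology slide)
REGION_BITS = 6    # assumed: trigger tagged by a 64-byte PC region

class TemporalGroupGenerator:
    def __init__(self):
        self.virtual_table = {}   # trigger region -> list of BTB entries
        self.current = []         # group being accumulated
        self.trigger = None

    def on_btb_miss(self, pc, target):
        """Each BTB miss joins the open group; a full group is installed."""
        if self.trigger is None:
            self.trigger = pc >> REGION_BITS   # first miss names the group
        self.current.append((pc, target))
        if len(self.current) == GROUP_SIZE:
            self.virtual_table[self.trigger] = self.current
            self.current, self.trigger = [], None

    def prefetch(self, miss_pc):
        """A later miss in the same region fetches the whole group."""
        return self.virtual_table.get(miss_pc >> REGION_BITS, [])
```

Because BTB hits never touch this machinery, metadata traffic scales with the miss rate: an application whose working set fits in the dedicated BTB pays nothing.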
28 Temporal Group Prefetching
29 Temporal Group Prefetching
30 Phantom-BTB Architecture
[Diagram: a PC-indexed BTB, a Temporal Group Generator, a Prefetch Engine, and the L2 cache.]
• Temporal Group Generator: generates and installs temporal groups in the L2 cache
• Prefetch Engine: prefetches temporal groups
31 Temporal Group Generation
[Diagram: the branch stream probes the BTB; hits cause no PBTB activity, while misses feed the Temporal Group Generator, which installs temporal groups into the L2 cache.]
BTB misses generate temporal groups; BTB hits do not generate any PBTB activity.
32 Branch Metadata Prefetching
[Diagram: the branch stream looks up the BTB and the prefetch buffer in parallel; a miss in both triggers the Prefetch Engine, which fetches the matching temporal group from the virtual table in the L2 cache into the prefetch buffer.]
BTB misses trigger metadata prefetches; lookups proceed in parallel in the BTB and the prefetch buffer.
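The lookup path can be sketched as a toy model (illustrative only; the dict-based structures, the 64-entry buffer, and the simplistic eviction are assumptions, and the region-tagged virtual table mirrors the earlier temporal-group sketch):

```python
# Toy model of the PBTB lookup path: probe the dedicated BTB and the
# prefetch buffer in parallel; on a miss in both, prefetch the matching
# temporal group from the virtual table into the prefetch buffer so that
# nearby future misses become prefetch-buffer hits.

class PhantomBTB:
    def __init__(self, btb, prefetch_buffer_size=64):
        self.btb = btb                  # pc -> target (dedicated BTB)
        self.pbuf = {}                  # pc -> target (prefetch buffer)
        self.pbuf_size = prefetch_buffer_size

    def lookup(self, pc, virtual_table, region_bits=6):
        """Return a predicted target, or None on a miss in both structures."""
        target = self.btb.get(pc)       # in hardware this probe and the
        if target is None:              # prefetch-buffer probe are parallel
            target = self.pbuf.get(pc)
        if target is None:              # miss in both: fetch temporal group
            group = virtual_table.get(pc >> region_bits, [])
            for gpc, gtarget in group:
                if len(self.pbuf) >= self.pbuf_size:
                    self.pbuf.pop(next(iter(self.pbuf)))  # crude eviction
                self.pbuf[gpc] = gtarget
        return target
```

The miss that triggers the prefetch still pays its penalty; the win comes from the rest of the group turning later misses in the same region into hits.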
33 Phantom-BTB Advantages
• "Pay-as-you-go" approach
  – Practical design
  – Increases the perceived BTB capacity
  – Dynamic allocation of resources
• Branch metadata allocated on demand
  – On-the-fly adaptation to application demands
  – Metadata generation and retrieval performed only on BTB misses
  – Metadata survives in the L2 as long as there is sufficient capacity and demand
34 Experimental Methodology
• Flexus: cycle-accurate, full-system simulator
• Uniprocessor, OoO
  – 1K-entry conventional BTB
  – 64KB 2-way I-cache / D-cache
  – 4MB 16-way L2 cache
• Phantom-BTB
  – 64-entry prefetch buffer
  – 6-entry temporal groups
  – 4K-entry virtual table
• Commercial workloads
35 PBTB vs. Conventional BTBs
[Chart: speedup; higher is better.]
Performance within 1% of a 4K-entry BTB with 3.6x less storage.
36 Phantom-BTB with Larger Dedicated BTBs
[Chart: speedup; higher is better.]
PBTB remains effective with larger dedicated BTBs.
37 Increase in L2 MPKI
[Chart: L2 MPKI; lower is better.]
Marginal increase in L2 misses.
38 Increase in L2 Accesses
[Chart: L2 accesses per kilo-instruction; lower is better.]
PBTB follows application demand for BTB capacity.
39 Summary
• Predictor metadata stored in the memory hierarchy
  – Benefits
    • Reduces dedicated predictor resources
    • Emulates large predictor tables for increased prediction accuracy
  – Why now? Large on-chip caches, CMPs, and the need for large predictors
  – Predictor virtualization advantages
    • Predictor adaptation
    • Metadata sharing
• Moving forward
  – Virtualize other predictors
  – Expose the predictor interface to the software level