Adaptive Insertion Policies for High-Performance Caching
Transcript of Adaptive Insertion Policies for High-Performance Caching
![Page 1: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/1.jpg)
1
Adaptive Insertion Policies for High-Performance Caching
Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr., Joel Emer
International Symposium on Computer Architecture (ISCA) 2007
![Page 2: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/2.jpg)
2
Background
– L1 misses: short latency, can be hidden
– L2 misses: long latency, hurts performance
– Important to reduce last-level (L2) cache misses

Fast processor + slow memory → cache hierarchy:
Proc → L1 (~2 cycles) → L2 (~10 cycles) → on L2 miss → Memory (~300 cycles)
![Page 3: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/3.jpg)
3
Motivation
– L1 for latency, L2 for capacity
– Traditionally L2 managed similar to L1 (typically LRU)
– L1 filters temporal locality → poor locality at L2
– LRU causes thrashing when working set > cache size; most lines remain unused between insertion and eviction
![Page 4: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/4.jpg)
4
Dead on Arrival (DoA) Lines
DoA Lines: Lines unused between insertion and eviction
For the 1MB 16-way L2, 60% of lines are DoA
Ineffective use of cache space
[Figure: DoA lines (%) per benchmark]
![Page 5: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/5.jpg)
5
Why DoA lines?
– Streaming data: never reused, so L2 caching doesn't help
– Working set of the application greater than the cache size

Solution: if working set > cache size, retain some of the working set
[Figures: misses per 1000 instructions vs. cache size (MB) for art and mcf]
![Page 6: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/6.jpg)
6
Overview
Problem: LRU replacement inefficient for L2 caches
Goal: A replacement policy that has:
1. Low hardware overhead
2. Low complexity
3. High performance
4. Robustness across workloads
Proposal: A mechanism that reduces misses by 21% and has total storage overhead < two bytes
![Page 7: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/7.jpg)
7
Outline
Introduction
Static Insertion Policies
Dynamic Insertion Policies
Summary
![Page 8: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/8.jpg)
8
Cache Insertion Policy
Simple changes to the insertion policy can greatly improve cache performance for memory-intensive workloads.

Two components of cache replacement:

1. Victim selection: which line to replace for the incoming line? (e.g., LRU, Random, FIFO, LFU)
2. Insertion policy: where is the incoming line placed in the replacement list? (e.g., insert the incoming line at the MRU position)
![Page 9: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/9.jpg)
9
LRU-Insertion Policy (LIP)
a b c d e f g h   (MRU → LRU)

Reference to 'i' with traditional LRU policy:
i a b c d e f g   (insert at MRU)

Reference to 'i' with LIP:
a b c d e f g i   (choose the victim; do NOT promote to MRU)

Lines do not enter non-LRU positions unless reused.
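The slide's example can be reproduced with a small Python model of one 8-way recency stack; this is an illustrative sketch of the two insertion points, not the hardware implementation:

```python
def access(stack, block, policy="LRU"):
    """Access `block` in a full recency stack (index 0 = MRU, -1 = LRU).

    On a hit the block is promoted to MRU. On a miss the LRU line is
    evicted (victim selection), and the insertion policy decides where
    the incoming line lands: "LRU" inserts at MRU, "LIP" at LRU.
    """
    if block in stack:
        stack.remove(block)
        stack.insert(0, block)      # hit: promote to MRU
        return True
    stack.pop()                     # miss: evict the LRU line
    if policy == "LRU":
        stack.insert(0, block)      # traditional: incoming line at MRU
    else:                           # LIP
        stack.append(block)         # incoming line stays at LRU
    return False

s = list("abcdefgh")                # MRU ... LRU, as on the slide
access(s, "i", policy="LRU")
print(s)                            # ['i', 'a', 'b', 'c', 'd', 'e', 'f', 'g']

s = list("abcdefgh")
access(s, "i", policy="LIP")
print(s)                            # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'i']
```

Under LIP the newly inserted line is the next eviction victim unless it is reused first, which is exactly how unreused lines are kept out of the non-LRU positions.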
![Page 10: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/10.jpg)
10
Bimodal Insertion Policy (BIP)

LIP does not age older lines. Instead, infrequently insert lines in the MRU position. Let ε be the bimodal throttle parameter:

if ( rand() < ε )
    Insert at MRU position;
else
    Insert at LRU position;

For small ε, BIP retains the thrashing protection of LIP while responding to changes in the working set.
![Page 11: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/11.jpg)
11
Circular Reference Model [Smith & Goodman, ISCA '84]

Reference stream has T blocks and repeats N times. Cache has K blocks (K < T and N >> T). Hit rates for two back-to-back cyclic streams:

| Policy | (a1 a2 a3 … aT)^N | (b1 b2 b3 … bT)^N |
| --- | --- | --- |
| LRU | 0 | 0 |
| OPT | (K−1)/(T−1) | (K−1)/(T−1) |
| LIP | (K−1)/T | 0 |
| BIP (small ε) | ≈ (K−1)/T | ≈ (K−1)/T |
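The table's hit rates can be checked empirically with a small simulation of a fully-associative K-block cache driven by a cyclic stream. The function below is an assumed sketch (the list-based recency model and parameter values are illustrative):

```python
import random

def hit_rate(policy, stream, k, epsilon=1/32, seed=0):
    """Simulate a fully-associative cache of k blocks over `stream`.

    The cache is a recency list (index 0 = MRU). Victim selection is
    always LRU; `policy` picks the insertion point: "LRU" -> MRU,
    "LIP" -> LRU, "BIP" -> MRU with probability epsilon, else LRU.
    """
    rng = random.Random(seed)
    cache, hits = [], 0
    for block in stream:
        if block in cache:
            hits += 1
            cache.remove(block)
            cache.insert(0, block)          # promote to MRU on hit
            continue
        if len(cache) == k:
            cache.pop()                     # evict the LRU block
        if policy == "LRU" or (policy == "BIP" and rng.random() < epsilon):
            cache.insert(0, block)          # insert at MRU
        else:                               # LIP, or BIP's common case
            cache.append(block)             # insert at LRU
    return hits / len(stream)

K, T, N = 8, 12, 1000
stream = list(range(T)) * N                 # (a1 a2 ... aT)^N with T > K
print(hit_rate("LRU", stream, K))           # 0.0: LRU thrashes
print(hit_rate("LIP", stream, K))           # approaches (K-1)/T
```

LRU gets zero hits because the cyclic working set exceeds the cache, while LIP pins K−1 blocks of the working set and approaches the (K−1)/T rate from the table.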
![Page 12: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/12.jpg)
12
Results for LIP and BIP
Changing the insertion policy increases misses for LRU-friendly workloads.

[Figure: % reduction in L2 MPKI for LIP and BIP (ε = 1/32)]
![Page 13: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/13.jpg)
13
Outline
Introduction
Static Insertion Policies
Dynamic Insertion Policies
Summary
![Page 14: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/14.jpg)
14
Dynamic-Insertion Policy (DIP)
Two types of workloads: LRU-friendly or BIP-friendly
DIP can be implemented by:
1. Monitor both policies (LRU and BIP)
2. Choose the best-performing policy
3. Apply the best policy to the cache
Need a cost-effective implementation: “Set Dueling”
![Page 15: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/15.jpg)
15
DIP via “Set Dueling”

Divide the cache into three:
– Dedicated LRU sets
– Dedicated BIP sets
– Follower sets (use the winner of LRU vs. BIP)

An n-bit saturating counter monitors both policies: a miss in an LRU set increments the counter; a miss in a BIP set decrements it.

The counter's MSB decides the policy for the follower sets:
– MSB = 0: use LRU
– MSB = 1: use BIP

Monitor, choose, and apply, all with a single counter.
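A minimal sketch of the set-dueling machinery, assuming a 10-bit PSEL counter and a simple hypothetical mapping from set index to set type (the paper's actual set-selection scheme may differ):

```python
class SetDueling:
    """n-bit saturating counter (PSEL) that picks the follower policy."""

    def __init__(self, n_bits=10):
        self.max = (1 << n_bits) - 1
        self.msb = 1 << (n_bits - 1)
        self.psel = 0                         # start in the LRU region

    def on_miss(self, set_type):
        if set_type == "LRU":                 # LRU set missed: LRU looks worse
            self.psel = min(self.psel + 1, self.max)
        elif set_type == "BIP":               # BIP set missed: BIP looks worse
            self.psel = max(self.psel - 1, 0)

    def follower_policy(self):
        return "BIP" if self.psel & self.msb else "LRU"   # MSB decides

def set_type(set_index, period=32):
    """Hypothetical mapping: one dedicated LRU set and one dedicated BIP
    set per `period` sets; everything else follows the winner."""
    if set_index % period == 0:
        return "LRU"
    if set_index % period == 1:
        return "BIP"
    return "FOLLOWER"

duel = SetDueling()
for _ in range(600):                          # many misses in the LRU sets...
    duel.on_miss("LRU")
print(duel.follower_policy())                 # prints "BIP"
```

Only misses in the dedicated sets update PSEL; follower sets simply read the MSB, which is what keeps the total overhead to a single counter.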
![Page 16: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/16.jpg)
16
Bounds on Dedicated Sets
How many dedicated sets required for “Set Dueling”?
μLRU, σLRU, μBIP, σBIP = average misses and standard deviation for LRU and BIP

P(Best) = probability of selecting the best policy:

P(Best) = P(Z < r√n), where n = number of dedicated sets, Z = a standard Gaussian variable, and r = |μLRU − μBIP| / √(σLRU² + σBIP²)

For the majority of workloads r > 0.2, so 32-64 dedicated sets are sufficient.
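The bound can be evaluated directly with the standard normal CDF. A quick check using Python's math.erf; the numeric values in the comments are my computed estimates, not figures from the slide:

```python
import math

def p_best(r, n):
    """P(Best) = P(Z < r * sqrt(n)) for a standard Gaussian Z,
    computed via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    z = r * math.sqrt(n)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# With r = 0.2, the threshold the slide quotes for most workloads:
print(round(p_best(0.2, 32), 3))   # ~0.871
print(round(p_best(0.2, 64), 3))   # ~0.945
```

So with 64 dedicated sets, even a workload at r = 0.2 picks the better policy roughly 95% of the time, which is why 32-64 dedicated sets suffice.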
![Page 17: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/17.jpg)
17
Results for DIP
DIP reduces average MPKI by 21% and requires < two bytes of storage overhead.

[Figure: % reduction in L2 MPKI for BIP and DIP (32 dedicated sets)]
![Page 18: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/18.jpg)
18
DIP vs. Other Policies
[Figure: % reduction in average L2 MPKI (0-35%) for (LRU+RND), (LRU+LFU), (LRU+MRU), DIP, OPT, and Double (2MB)]

DIP bridges two-thirds of the gap between LRU and OPT.
![Page 19: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/19.jpg)
19
IPC Improvement
Processor: 4-wide, 32-entry window. Memory: 270 cycles. L2: 1MB 16-way LRU.

[Figure: IPC improvement with DIP (%) per benchmark]

DIP improves IPC by 9.3% on average.
![Page 20: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/20.jpg)
20
Outline
Introduction
Static Insertion Policies
Dynamic Insertion Policies
Summary
![Page 21: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/21.jpg)
21
Summary
LRU is inefficient for L2 caches: most lines remain unused between insertion and eviction.

The proposed change to the cache insertion policy (DIP) has:

1. Low hardware overhead: requires < two bytes of storage overhead
2. Low complexity: trivial to implement, no changes to the cache structure
3. High performance: reduces misses by 21%, two-thirds as good as OPT
4. Robustness across workloads: almost as good as LRU for LRU-friendly workloads
![Page 23: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/23.jpg)
23
DIP vs. LRU Across Cache Sizes
[Figure: MPKI relative to 1MB LRU (%), smaller is better, for LRU and DIP at 1MB, 2MB, 4MB, and 8MB; benchmarks: art, mcf, equake, swim, health, Avg_16]

MPKI decreases until the workload fits in the cache.
![Page 24: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/24.jpg)
24
DIP with 1MB 8-way L2 Cache
MPKI reduction with 8-way (19%) similar to 16-way (21%)
[Figure: % reduction in L2 MPKI (0-50%) with a 1MB 8-way L2]
![Page 25: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/25.jpg)
25
Interaction with Prefetching

[Figure: % reduction in L2 MPKI for LRU-Pref, DIP-NoPref, and DIP-Pref]
DIP also works well in presence of prefetching
(PC-based stride prefetcher)
![Page 26: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/26.jpg)
26
mcf snippet
![Page 27: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/27.jpg)
27
art snippet
![Page 28: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/28.jpg)
28
health mpki
![Page 29: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/29.jpg)
29
swim mpki
![Page 30: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/30.jpg)
30
DIP Bypass
![Page 31: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/31.jpg)
31
DIP (design and implementation)
![Page 32: Adaptive Insertion Policies for High-Performance Caching](https://reader035.fdocuments.in/reader035/viewer/2022081503/56815731550346895dc4cf25/html5/thumbnails/32.jpg)
32
Random Replacement (Success Function)
Cache contains K blocks and the reference stream contains T blocks.

Probability that a block in the cache survives one eviction = (1 − 1/K)
Total number of evictions = (T − 1) · Pmiss

Phit = (1 − 1/K)^((T−1) · Pmiss) = (1 − 1/K)^((T−1)(1 − Phit))

Iterative solution, starting at Phit = 0:
1. Phit = (1 − 1/K)^(T−1)
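The fixed point can be found numerically by repeating the substitution above; a short sketch, where the parameter values K = 16 and T = 32 are arbitrary examples:

```python
def random_hit_rate(k, t, iters=100):
    """Iterate Phit = (1 - 1/k) ** ((t - 1) * (1 - Phit)),
    starting from Phit = 0 as the slide suggests."""
    p_hit = 0.0
    for _ in range(iters):
        p_hit = (1.0 - 1.0 / k) ** ((t - 1) * (1.0 - p_hit))
    return p_hit

# Example: K = 16 blocks, T = 32 distinct references per cycle.
p = random_hit_rate(16, 32)
print(round(p, 3))   # the self-consistent Phit for random replacement
```

Each iteration plugs the previous Phit estimate back into the right-hand side; the sequence converges because the map is a contraction near the fixed point for these parameters.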