Cooperative Cache Scrubbing
![Page 1: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/1.jpg)
Cooperative Cache Scrubbing
Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^
PACT 2014
![Page 2: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/2.jpg)
Multicore Challenge
[Diagram: a chip with four processors (P), each with a cache ($), and an LLC, connected to memory (DRAM). Software layers shown: application, managed language runtime environment, operating system.]
Objects rapidly allocated and short-lived
![Page 3: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/3.jpg)
Problem: Allocation Wall
[Diagram: a chip with four processors (P), each with a cache ($), and an LLC, connected to memory (DRAM). Software layers shown: application, managed language runtime environment, operating system. Cache lines are marked DEAD.]
Objects rapidly allocated and short-lived
![Page 4: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/4.jpg)
Problem: Bandwidth & Power Wall
[Diagram: a chip with four processors (P), each with a cache ($), and an LLC, connected to memory (DRAM). Software layers shown: application, managed language runtime environment, operating system. Cache lines are marked DEAD, next to a zeroed (0000...) region.]
Objects rapidly allocated and short-lived
Zero initialization
![Page 5: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/5.jpg)
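Cooperative Cache Scrubbing
[Diagram: a chip with four processors (P), each with a cache ($), and an LLC, connected to memory (DRAM) by arrows labeled "write" and "read". Software layers shown: application, managed language runtime environment, operating system. Cache lines are marked DEAD, next to a zeroed (0000...) region.]
Objects rapidly allocated and short-lived
Zero initialization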
![Page 6: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/6.jpg)
Generational Garbage Collection
- Young objects die quickly
- Nursery
  - Traced for live objects
  - Copy to mature space
  - Reclaimed ‘en masse’
[Diagram: nursery and mature spaces beside an 8MB LLC whose lines are marked DEAD.]
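The nursery lifecycle on this slide can be sketched as a toy model. This is only an illustration, not Jikes RVM code: the `Obj` and `Nursery` classes, the bump-pointer allocation, and the `live` flag standing in for reachability are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    size: int
    live: bool = True  # hypothetical flag standing in for reachability


@dataclass
class Nursery:
    capacity: int                  # bytes available for young objects
    cursor: int = 0                # bump pointer: next free byte (assumed)
    objects: list = field(default_factory=list)

    def alloc(self, size):
        """Bump-allocate; return None when a collection is needed."""
        if self.cursor + size > self.capacity:
            return None
        obj = Obj(size)
        self.objects.append(obj)
        self.cursor += size
        return obj

    def collect(self, mature):
        """Copy survivors to the mature space, then reclaim everything."""
        survivors = [o for o in self.objects if o.live]
        mature.extend(survivors)           # copy to mature space
        dead_bytes = self.cursor - sum(o.size for o in survivors)
        self.objects.clear()               # reclaimed 'en masse'
        self.cursor = 0
        return dead_bytes


nursery, mature = Nursery(capacity=1024), []
objs = [nursery.alloc(64) for _ in range(16)]  # fills the nursery exactly
for o in objs[1:]:
    o.live = False                         # most young objects die quickly
dead = nursery.collect(mature)
```

After the collection, every byte the nursery held is dead from the program's point of view, even though copies of those bytes may still sit in the cache.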
![Page 7: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/7.jpg)
Dead Lines in LLC (8MB)
![Page 8: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/8.jpg)
Dead Data Written Back?
[Diagram: a chip with four processors (P), each with a cache ($), and an LLC, connected to memory (DRAM). Software layers shown: application, managed language runtime environment, operating system. Cache lines are marked DEAD.]
![Page 9: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/9.jpg)
Useless Write Backs (8MB LLC)
![Page 10: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/10.jpg)
Cooperative Cache Scrubbing
- Communicate managed language’s semantic information to hardware
- Caches ‘scrub’ dead lines (targets DRAM writes)
  - Invalidate
  - Unset dirty bit
- Zero lines without fetch (targets DRAM reads)
- Result
  - Better cache management
  - Avoid traffic to DRAM
  - Save DRAM energy
![Page 11: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/11.jpg)
Dead Data Written in Cache?
- Young objects die quickly
- Nursery
  - Traced for live objects
  - Copy to mature space
  - Reclaimed ‘en masse’
[Diagram: nursery and mature spaces beside an LLC whose lines are marked DEAD, with a zeroed (0000000) region.]
![Page 12: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/12.jpg)
Dead Lines Written in LLC (8MB)
![Page 13: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/13.jpg)
SW-HW Cooperative Scrubbing
- Software
  - Identify cache line-aligned dead/zero region
  - Generational Immix collector (stop-the-world)
    - After nursery collection, call scrub instruction on each line in entire range
    - Call zero instructions to zero region (32KB)
- Hardware
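The software side described above can be sketched as a loop over cache-line-aligned addresses. This is a minimal sketch under stated assumptions: the 64-byte line size is assumed (the slides do not give it), and the `scrub` and `zero` callbacks are hypothetical stand-ins for the scrub and zero instructions.

```python
LINE_SIZE = 64           # assumed cache line size in bytes
ZERO_REGION = 32 * 1024  # 32KB zeroing region, as on the slide


def aligned_lines(start, end, line=LINE_SIZE):
    """Addresses of the cache lines lying entirely inside [start, end)."""
    first = (start + line - 1) // line * line   # round start up
    last = end // line * line                   # round end down
    return range(first, last, line)


def scrub_range(start, end, scrub):
    """After a nursery collection: issue one scrub per line in the range."""
    count = 0
    for addr in aligned_lines(start, end):
        scrub(addr)      # hypothetical hook for the scrub instruction
        count += 1
    return count


def zero_region(start, zero, size=ZERO_REGION):
    """Zero a region one cache line at a time."""
    count = 0
    for addr in aligned_lines(start, start + size):
        zero(addr)       # hypothetical hook for the zero instruction
        count += 1
    return count


scrubbed, zeroed = [], []
n_scrub = scrub_range(0x1000_0010, 0x1000_0010 + 4096, scrubbed.append)
n_zero = zero_region(0x2000_0000, zeroed.append)
```

Rounding the start up and the end down keeps the loop inside the dead region, so a line that is only partly dead is never scrubbed.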
![Page 14: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/14.jpg)
SW-HW Cooperative Scrubbing
- Software
- Hardware
  - Scrubbing (LLC)
    - clinvalidate: invalidates cache line (similar to PowerPC’s dcbi, ARM)
    - clundirty: clears dirty bit
    - clclean: clears dirty bit, moves line to LRU
  - Zeroing (L2)
    - clzero: zero cache line without fetch (similar to PowerPC’s dcbz)
  - Modifications to MESI cache coherence protocol
    - Back-propagation from LLC to L1/L2 cache levels
    - Local coherence transitions (no off-chip)
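To make the effect of the four instructions concrete, here is a toy single-level write-back cache that counts DRAM reads and writes. It is a sketch under stated assumptions, not the simulated hardware: the `ToyCache` class, its write-allocate policy, and the use of an ordered dictionary for LRU order are all choices made for this example.

```python
from collections import OrderedDict


class ToyCache:
    """Single-level write-back cache; the first key is the LRU line."""

    def __init__(self):
        self.lines = OrderedDict()   # addr -> dirty flag
        self.dram_reads = 0
        self.dram_writes = 0

    def store(self, addr):
        """Ordinary store: write-allocate fetches the line on a miss."""
        if addr not in self.lines:
            self.dram_reads += 1
        self.lines[addr] = True
        self.lines.move_to_end(addr)              # most recently used

    def evict(self, addr):
        """Eviction writes the line back only if it is dirty."""
        if self.lines.pop(addr, False):
            self.dram_writes += 1

    def clinvalidate(self, addr):
        self.lines.pop(addr, None)                # drop line, no write-back

    def clundirty(self, addr):
        if addr in self.lines:
            self.lines[addr] = False              # clear dirty bit only

    def clclean(self, addr):
        if addr in self.lines:
            self.lines[addr] = False              # clear dirty bit
            self.lines.move_to_end(addr, last=False)  # move to LRU position

    def clzero(self, addr):
        self.lines[addr] = True                   # zeroed in cache, no fetch
        self.lines.move_to_end(addr)


base = ToyCache()
base.store(0x40)         # allocation touches a fresh line: fetch from DRAM
base.evict(0x40)         # dead but dirty: useless write-back

coop = ToyCache()
coop.clzero(0x40)        # zero without fetch
coop.store(0x40)         # hit
coop.clclean(0x40)       # line is dead after the nursery collection
coop.evict(0x40)         # clean line: nothing written back
```

In this model the baseline pays one DRAM read and one DRAM write for a line whose contents die in the cache, while the cooperative sequence pays neither.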
![Page 15: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/15.jpg)
MESI Coherence Transitions
[State diagram: MESI states M, E, S, I with transitions labeled clclean/- and clinvalidate/-.]
![Page 16: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/16.jpg)
MESI Coherence Transitions
[State diagram: MESI states M, E, S, I with transitions labeled clzero/-, clzero/BusInvalidate, and BusInvalidate (external: from another LLC).]
![Page 17: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/17.jpg)
Methodology
- Sniper simulator
  - 4 cores, 8MB shared L3 (LLC), McPAT
  - Extensions for JVM
    - Works with JIT compiler
    - Emulate system calls (futex & nanosleep)
    - JVM-simulator communication with new instruction
- Jikes RVM 3.1.2 and DaCapo benchmarks
  - Generational Immix garbage collector
  - 4 application, 4 GC threads
  - 2x minimum heap
  - Replay compilation, 2nd invocation
![Page 18: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/18.jpg)
DRAM Writes (8MB nursery)
[Bar chart: DRAM writes relative to baseline (%), y-axis 0 to 120, for antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan, and the mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]
![Page 19: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/19.jpg)
DRAM Writes (8MB nursery)
![Page 20: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/20.jpg)
DRAM Writes (8MB nursery)
![Page 21: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/21.jpg)
DRAM Reads (8MB nursery)
[Bar chart: DRAM reads relative to baseline (%), y-axis 0 to 225, for antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan, and the mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]
![Page 22: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/22.jpg)
DRAM Reads (8MB nursery)
![Page 23: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/23.jpg)
DRAM Reads (8MB nursery)
![Page 24: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/24.jpg)
DRAM Reads (8MB nursery)
![Page 25: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/25.jpg)
DRAM Reads (8MB nursery)
![Page 26: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/26.jpg)
Dynamic DRAM Energy (8MB nursery)
[Bar chart: dynamic DRAM energy reduction (%), y-axis 0 to 80, mean over benchmarks. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]
![Page 27: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/27.jpg)
Dynamic DRAM Energy (8MB nursery)
![Page 28: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/28.jpg)
Total DRAM Energy
[Bar chart: total DRAM energy reduction (%), y-axis -5 to 25, grouped by 4M, 8M, and 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. One bar is annotated -22%.]
![Page 29: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/29.jpg)
Total DRAM Energy
![Page 30: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/30.jpg)
Total DRAM Traffic
[Bar chart: total DRAM traffic reduction (%), y-axis -50 to 100, grouped by 4M, 8M, and 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. One bar is annotated -14x.]
![Page 31: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/31.jpg)
clclean+clzero Improvements
[Bar chart: improvements with clclean+clzero, y-axis 0% to 100%, for DRAM reads, DRAM writes, total DRAM traffic, LLC misses, execution time, dynamic DRAM energy, and total DRAM energy. Series: 4MB, 8MB, 16MB.]
![Page 32: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/32.jpg)
Related Work
- Cooperative cache management
  - ESKIMO by Isen & John, MICRO 09
    - Useless reads and writes to DRAM by sequential C programs
    - Reduce energy
    - Require large map in hardware, extra cache bits
  - Wang et al., PACT 02 / ISCA 03; Sartor et al., 05
    - C & Fortran static analysis to give cache hints to evict or keep data
- Zero initialization [Yang et al., OOPSLA 11]
  - Studied costs in time, cache and traffic
  - Use non-temporal writes to DRAM, increase bandwidth
![Page 33: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/33.jpg)
Conclusions
- Software-hardware cooperative cache scrubbing
- Leverages region allocation semantics
- Changes to MESI coherence protocol
- New multicore architectural simulation methodology
- Reductions
  - 59% traffic
  - 14% DRAM energy
  - 4.6% execution time
http://users.elis.ugent.be/~jsartor/
![Page 34: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/34.jpg)
![Page 35: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/35.jpg)
Execution Time (8MB nursery)
[Bar chart: execution time reduction (%), y-axis 0 to 7, mean over benchmarks. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]
![Page 36: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/36.jpg)
Changes to MESI coherence protocol
| State | clinvalidate | clundirty/clclean | clzero | BusInvalidate |
|---|---|---|---|---|
| M | invalidate L1/L2 (no WB) → I | invalidate L1/L2 (no WB) → E (clclean LRU) | ⁄ | invalidate L1/L2 (no WB) → I |
| E | invalidate L1/L2 → I | invalidate L1/L2 (clclean LRU) | → M | invalidate L1/L2 → I |
| S | invalidate L1/L2 → I | invalidate L1/L2 (clclean LRU) | BusInvalidate → M | invalidate L1/L2 → I |
| I | ⁄ | ⁄ | BusInvalidate → M | ⁄ |
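The table on this slide can be read as a lookup from (state, event) to an action. The sketch below encodes one reading of it. The tuple layout, the `step` helper, and the treatment of "⁄" cells as "no change" are assumptions of this example, and the LRU move performed by clclean is not modeled.

```python
# (state, event) -> (next_state, bus_message, invalidate_L1_L2)
TRANSITIONS = {
    # clinvalidate: drop the line without a write-back
    ("M", "clinvalidate"): ("I", None, True),
    ("E", "clinvalidate"): ("I", None, True),
    ("S", "clinvalidate"): ("I", None, True),
    # clclean: a modified line becomes clean; E and S keep their state
    ("M", "clclean"): ("E", None, True),
    ("E", "clclean"): ("E", None, True),
    ("S", "clclean"): ("S", None, True),
    # clzero: the line ends up modified; S and I must invalidate other copies
    ("E", "clzero"): ("M", None, False),
    ("S", "clzero"): ("M", "BusInvalidate", False),
    ("I", "clzero"): ("M", "BusInvalidate", False),
    # BusInvalidate arriving from another LLC
    ("M", "BusInvalidate"): ("I", None, True),
    ("E", "BusInvalidate"): ("I", None, True),
    ("S", "BusInvalidate"): ("I", None, True),
}

# clundirty shares a column with clclean in the table
for _state in "MES":
    TRANSITIONS[(_state, "clundirty")] = TRANSITIONS[(_state, "clclean")]


def step(state, event):
    """Return (next_state, bus_message, invalidate_L1_L2) for one event."""
    # Cells shown as "⁄" in the table are treated as "no change" here.
    return TRANSITIONS.get((state, event), (state, None, False))
```

For example, `step("S", "clzero")` yields `("M", "BusInvalidate", False)`, while none of the scrub events produce a bus message, which matches the "local coherence transitions (no off-chip)" point made earlier in the deck.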
![Page 37: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/37.jpg)
Total DRAM Energy (8MB nursery)
[Bar chart: total DRAM energy reduction (%), y-axis -10 to 60, for antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan, and the mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]
![Page 38: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/38.jpg)
Execution Time Across Nurseries
![Page 39: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/39.jpg)
Execution Time
![Page 40: Cooperative Cache Scrubbing](https://reader033.fdocuments.in/reader033/viewer/2022061610/56813dab550346895da77093/html5/thumbnails/40.jpg)
Dynamic DRAM Energy 8MB Nursery