BEAR: MITIGATING BANDWIDTH BLOAT IN GIGASCALE DRAM CACHES
Chiachen Chou, Georgia Tech
Aamer Jaleel, NVIDIA*
Moinuddin K. Qureshi, Georgia Tech
ISCA 2015, Portland, OR
June 15, 2015
3D DRAM HELPS MITIGATE BANDWIDTH WALL
3D DRAM: Hybrid Memory Cube (HMC), High Bandwidth Memory (HBM)
[Figure: HMC/HBM products such as Intel Xeon Phi and NVIDIA Pascal; images courtesy of Micron, JEDEC, Intel, NVIDIA]
Stacked DRAM provides 4-8X the bandwidth of off-chip DRAM, but has limited capacity
3D DRAM IS USED AS A CACHE (DRAM CACHE)
[Figure: memory hierarchy from fast to slow: CPU, L1$, L2$, L3$, then the 3D DRAM cache in front of off-chip memory]
DRAM$ stores tags in 3D DRAM for scalability:
1GB DRAM$ = 16M cache lines; at 4B per tag, that is 64MB of tag storage
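The tag-storage arithmetic above can be checked with a short sketch (assuming 64-byte cache lines and 4-byte tags, per the slide):

```python
# Tag storage needed when tags for a gigascale DRAM cache live in SRAM.
# Assumes 64-byte cache lines and 4-byte tags, as stated on the slide.
CACHE_SIZE = 1 << 30      # 1GB DRAM cache
LINE_SIZE = 64            # bytes per cache line
TAG_SIZE = 4              # bytes of tag metadata per line

num_lines = CACHE_SIZE // LINE_SIZE          # 16M cache lines
tag_storage = num_lines * TAG_SIZE           # 64MB of tags

print(num_lines // (1 << 20), "M lines")     # -> 16 M lines
print(tag_storage // (1 << 20), "MB tags")   # -> 64 MB tags
```

At 64MB, the tag array is far too large for on-chip SRAM, which is why the design places tags in the 3D DRAM itself.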
CAN DRAM CACHE PROVIDE 4X BANDWIDTH?
[Figure: CPU <-> DRAM Cache (Tags + Data) <-> Memory, with 4X bandwidth to the DRAM$ and 1X to memory]
✔ Hit: good use of bandwidth
✘ Secondary operations (waste bandwidth): Miss Detection, Miss Fill, Writeback Detection, Writeback Fill
DRAM$ does not utilize its full bandwidth
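One way to read this: if the stacked DRAM offers 4X raw bandwidth but every useful byte drags secondary traffic along with it, the delivered bandwidth shrinks by the bloat factor. A back-of-envelope sketch (the 3.8 bloat factor is the baseline value reported later in the talk; the division itself is our illustration, not a formula from the slides):

```python
# Effective bandwidth ~= raw bandwidth / bloat factor.
# Illustrative values: 4X raw stacked-DRAM bandwidth, and the
# baseline bloat factor of 3.8 reported later in this deck.
raw_bw = 4.0          # stacked DRAM bandwidth, relative to off-chip DRAM
bloat_factor = 3.8    # total traffic per unit of useful traffic

effective_bw = raw_bw / bloat_factor
print(f"{effective_bw:.2f}X")   # -> 1.05X: barely better than off-chip DRAM
```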
AGENDA
• Introduction
• Background
  – DRAM Cache Designs
  – Secondary Operations
  – Bloat Factor
• BEAR
• Results
• Summary
DRAM CACHE HAS NARROW BUS
Alloy Cache [Qureshi and Loh, MICRO'12]: the DRAM$ accesses tag and data over narrow 16-byte buses; each 2KB row buffer holds tag-and-data (TAD) units of 8B tag + 64B data.
Useful operation: Hit (HIT)
Secondary operations: Miss Detection (MD), Miss Fill (MF), WB Detection (WD), WB Fill (WF)
CACHE REQUIRES MAINTENANCE OPERATIONS
[Figure: L3$ <-> DRAM Cache <-> Memory. A hit returns Line X from the DRAM$; a miss fetches Line X from memory and installs it (Miss Fill); evicting dirty Line Y triggers WB Detection and WB Fill.]
DRAM$ bandwidth is used for secondary operations
QUANTIFYING THE BANDWIDTH USAGE
[Figure: transfers on the bus interleave useful hits (HIT) with secondary operations (MD, MF, WD, WF)]
Bloat Factor: total bus traffic normalized to the useful data transferred; it indicates the bandwidth inefficiency
[Chart: Bloat Factor breakdown (Hit, Hit Tag+Data, Miss Detection, Miss Fill, WB Detection, WB Fill) for Baseline vs. Ideal]
BLOAT FACTOR BREAKDOWN
Baseline has a Bloat Factor of 3.8:
HIT (tag+data) 1.25, Miss Detection 0.7, Miss Fill 0.7, WB Detection 0.6, WB Fill 0.6
Config: 8-core, 8MB shared L3$, 1GB DRAM$, 16GB memory; SPEC2006: 16 rate and 38 mix workloads
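The breakdown above can be totaled directly (the per-operation values on the slide are rounded, so the sum lands slightly above the quoted 3.8):

```python
# Baseline bloat-factor components from the slide (rounded values).
components = {
    "HIT (tag+data)": 1.25,
    "Miss Detection": 0.7,
    "Miss Fill": 0.7,
    "WB Detection": 0.6,
    "WB Fill": 0.6,
}

bloat_factor = sum(components.values())
print(f"Bloat Factor ~= {bloat_factor:.2f}")   # -> 3.85, quoted as 3.8

# Share of DRAM$ traffic that is secondary operations:
secondary = bloat_factor - components["HIT (tag+data)"]
print(f"{secondary / bloat_factor:.0%} secondary")   # -> 68%
```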
POTENTIAL PERFORMANCE OF 22%
[Charts: average hit latency (cycles) for Baseline vs. Ideal, and speedup of the Ideal (bloat-free) design: 22% over Baseline]
Reducing Bloat Factor improves performance
NOT ALL OPERATIONS ARE CREATED EQUAL
Opportunities to remove secondary operations:
1. Operations that improve cache performance (e.g., inserting data into the DRAM$ on a miss)
2. Operations that ensure correctness (e.g., probing whether a line exists)
We propose BEAR to exploit these opportunities
AGENDA
• Introduction
• Background
• BEAR: Bandwidth-Efficient ARchitecture
  1. Bandwidth-Efficient Miss Fill
  2. Bandwidth-Efficient Writeback Detection
  3. Bandwidth-Efficient Miss Detection
• Results
• Summary
BANDWIDTH-EFFICIENT MISS FILL
When a line returns from memory on a miss, it can be inserted into the DRAM$ or thrown away (bypassed): bypass with probability P, insert with probability 1-P.
[Charts at P=90%, per workload (mcf, lbm, soplex, milc, mix1-mix7, AVG): hit-rate change (%), hit-latency reduction (%), and performance improvement (%); annotated points: 12%, -5%, +10%]
How do we enable bypass without hit-rate degradation?
BAB LIMITS THE HIT RATE LOSS
Bandwidth-Aware Bypass (BAB) dedicates a few Insert Sets (no bypass) and Bypass Sets (90% probabilistic bypass) in the DRAM$ and compares their hit rates X and Y. If the hit-rate loss X - Y < Δ, the remaining sets use probabilistic bypass; otherwise they do not bypass.
Use Probabilistic Bypass when the hit-rate loss is small
[Chart: Bloat Factor breakdown, Baseline vs. BAB]
BAB IMPROVES PERFORMANCE BY 5%
Hit Rate: Alloy 64%, BAB 62%
[Chart: speedup of BAB over Baseline (~5%). Bloat Factor breakdown with BAB: HIT 1.25, MD 0.7, MF reduced from 0.7 to 0.1, WD 0.6, WF 0.6]
BAB trades off a small hit-rate loss for a 5% performance improvement
WHAT IS A WRITEBACK DETECTION?
When the L3$ evicts a dirty Line Y(new), the DRAM$ must first probe whether a stale Line Y(old) exists in the cache (WB Detection) before the writeback can be installed.
How can we remove Writeback Detection?
DRAM CACHE PRESENCE FOR WB DETECTION
DRAM Cache Presence (DCP): alongside the Valid and Dirty bits, each L3$ line carries a presence bit recording whether the line resides in the DRAM$. On a dirty eviction, if the DCP bit is True the DRAM$ performs only the WB Fill; if False, it performs WB Detection + WB Fill.
DRAM Cache Presence reduces WB Detection
[Chart: Bloat Factor breakdown for Baseline, BAB, and BAB+DCP]
DCP IMPROVES PERFORMANCE BY 4%
[Chart: speedup over Baseline for BAB and BAB+DCP. Bloat Factor breakdown with BAB+DCP: HIT 1.25, MD 0.7, MF 0.1, WD reduced from 0.6 to 0.1, WF 0.6]
DCP provides a 4% improvement in addition to BAB
WHAT IS A MISS DETECTION?
When the L3$ misses on Line X, the DRAM$ (Tag + Data) must be probed to discover that Line X is absent (Miss Detection) before the request can be sent to memory.
Can we detect a miss without using DRAM$ bandwidth?
NEIGHBOR’S TAG COMES FREE WITH DEMAND
A demand access to address X streams Tag+Data+Tag (8+64+8 = 80 bytes) out of the 2KB DRAM row buffer, so the tag of the neighboring TAD arrives for free. A small Neighboring Tag Cache (NTC) keeps these tags; a later request to the neighbor can then be identified as a miss without accessing the DRAM$.
Neighboring Tag Cache saves Miss Detection
[Chart: Bloat Factor breakdown for Baseline, BAB, BAB+DCP, and BEAR]
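How the NTC short-circuits miss detection can be sketched as follows. The dictionary-backed NTC and the function names are illustrative assumptions; the key idea from the slide is only that the neighbor's tag rides along free with each 80-byte demand read:

```python
# Neighboring Tag Cache sketch: each demand access to the DRAM$
# returns Tag+Data+Tag (8+64+8 = 80B), so the neighboring line's
# tag is captured for free and can answer a later lookup.
ntc = {}   # set index -> tag most recently observed for that set

def demand_access(set_idx, tag, neighbor_set, neighbor_tag):
    """A normal DRAM$ access; the neighbor's tag rides along for free."""
    ntc[neighbor_set] = neighbor_tag

def needs_miss_detection(set_idx, tag):
    """True if the DRAM$ must be probed to find out whether `tag` misses."""
    if set_idx in ntc and ntc[set_idx] != tag:
        return False   # NTC proves a miss: go straight to memory
    return True        # unknown (or possible hit): probe the DRAM$

demand_access(10, tag=0xA, neighbor_set=11, neighbor_tag=0xB)
print(needs_miss_detection(11, 0xC))   # -> False (NTC detects the miss)
print(needs_miss_detection(12, 0xC))   # -> True  (no NTC information)
```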
NTC SHOWS 2% PERFORMANCE IMPROVEMENT
[Chart: speedup over Baseline for BAB, BAB+DCP, and BAB+DCP+NTC. Bloat Factor breakdown with all three techniques: HIT 1.25, MD reduced from 0.7 to 0.5, MF 0.1, WD 0.1, WF 0.6]
NTC improves performance by an additional 2%
AGENDA
• Introduction
• Background
• BEAR
• Results
• Summary
METHODOLOGY
Core chips: 8 cores @ 3.2 GHz, 2-wide OOO, 8MB 16-way shared L3 cache

            DRAM Cache               Off-chip DRAM
Capacity    1GB                      16GB
Bus         DDR 3.2GHz, 128-bit      DDR 1.6GHz, 64-bit
Channels    4 channels, 16 banks/ch  2 channels, 8 banks/ch

• Baseline: Alloy Cache [MICRO'12]
• SPEC2006 (16 memory-intensive apps): 16 rate and 38 mix workloads
BEAR REDUCES BLOAT FACTOR BY 32%
[Charts over all 54 workloads: Bloat Factor for Baseline, BEAR, and Ideal; speedup for BEAR and Ideal]
BEAR improves performance by 11%
BW BLOAT IN TAGS-IN-SRAM DESIGNS
Tags-in-SRAM (TIS) designs keep the tags in a 64MB SRAM, paying (1) storage overhead and (2) access latency. Even so, Hit, Miss Fill, and WB Fill traffic still consume DRAM$ bandwidth.
[Chart: Bloat Factor for Alloy, BEAR, and TIS (64MB)]
Tags-in-SRAM also has the bandwidth bloat problem
TAGS-IN-SRAM PERFORMS SIMILARLY TO BEAR
[Chart: speedup for Alloy, BEAR, TIS (64MB), and SC]
BEAR can be applied to reduce BW bloat in Tags-in-SRAM DRAM$ designs
SUMMARY
• 3D DRAM used as a cache mitigates the memory bandwidth wall.
• In DRAM caches, secondary operations slow down the delivery of critical data.
• We propose BEAR, which targets three sources of bandwidth bloat in DRAM caches:
  1. Bandwidth-Efficient Miss Fill
  2. Bandwidth-Efficient Writeback Detection
  3. Bandwidth-Efficient Miss Detection
• Overall, BEAR reduces bandwidth bloat by 32% and improves performance by 11%.
THANK YOU
Computer Architecture and Emerging Technologies Lab, Georgia Tech
Backup Slides
THE OVERHEAD OF BEAR IS NEGLIGIBLY SMALL

Design                    Cost                     Total
Bandwidth-Aware Bypass    8 bytes per thread       64 bytes
DRAM Cache Presence       One bit per line in LLC  16K bytes
Neighboring Tag Register  44 bytes per bank        3.2K bytes
Total Cost                                         19.2K bytes

Overall, BEAR incurs a HW overhead of 19.2KB
COMPARISON TO OTHER DRAM$ DESIGNS
[Chart: speedup w.r.t. no L4 cache for RATE, MIX, and ALL workloads, comparing LH-Cache, Alloy, Incl-Alloy, and BEAR; annotated gains: 28% and 11%]
BEAR outperforms other Tags-in-DRAM DRAM$ designs