BEAR: MITIGATING BANDWIDTH BLOAT IN GIGASCALE DRAM CACHES
Chiachen Chou, Georgia Tech
Aamer Jaleel, NVIDIA*
Moinuddin K. Qureshi, Georgia Tech
ISCA 2015, Portland, OR
June 15, 2015
3D DRAM HELPS MITIGATE BANDWIDTH WALL
3D DRAM: Hybrid Memory Cube (HMC), High Bandwidth Memory (HBM)
[Figure: HMC/HBM products such as Intel Xeon Phi and NVIDIA Pascal; images courtesy of Micron, JEDEC, Intel, NVIDIA]
Stacked DRAM provides 4-8X the bandwidth of off-chip DRAM, but has limited capacity
3D DRAM IS USED AS A CACHE (DRAM CACHE)
[Figure: memory hierarchy from fast to slow: CPU, L1$, L2$, L3$, then the 3D DRAM cache in front of off-chip memory]
DRAM$ stores tags in 3D DRAM for scalability:
1GB DRAM$ = 16M cache lines; at 4B per tag, that is 64MB of tag storage
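The tag-storage arithmetic above can be checked with a short sketch (assuming 64-byte cache lines and 4-byte tags, per the slide):

```python
# Tag storage needed when tags for a gigascale DRAM cache live in SRAM.
# Assumes 64-byte cache lines and 4-byte tags, as stated on the slide.
CACHE_SIZE = 1 << 30      # 1GB DRAM cache
LINE_SIZE = 64            # bytes per cache line
TAG_SIZE = 4              # bytes of tag metadata per line

num_lines = CACHE_SIZE // LINE_SIZE          # 16M cache lines
tag_storage = num_lines * TAG_SIZE           # 64MB of tags

print(num_lines // (1 << 20), "M lines")     # -> 16 M lines
print(tag_storage // (1 << 20), "MB tags")   # -> 64 MB tags
```

At 64MB, the tag array is far too large for on-chip SRAM, which is why the design places tags in the 3D DRAM itself.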
CAN DRAM CACHE PROVIDE 4X BANDWIDTH?
[Figure: CPU <-> DRAM Cache (Tags + Data) <-> Memory, with 4X bandwidth to the DRAM$ and 1X to memory]
✔ Hit: good use of bandwidth
✘ Secondary operations (waste bandwidth): Miss Detection, Miss Fill, Writeback Detection, Writeback Fill
DRAM$ does not utilize its full bandwidth
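One way to read this: if the stacked DRAM offers 4X raw bandwidth but every useful byte drags secondary traffic along with it, the delivered bandwidth shrinks by the bloat factor. A back-of-envelope sketch (the 3.8 bloat factor is the baseline value reported later in the talk; the division itself is our illustration, not a formula from the slides):

```python
# Effective bandwidth ~= raw bandwidth / bloat factor.
# Illustrative values: 4X raw stacked-DRAM bandwidth, and the
# baseline bloat factor of 3.8 reported later in this deck.
raw_bw = 4.0          # stacked DRAM bandwidth, relative to off-chip DRAM
bloat_factor = 3.8    # total traffic per unit of useful traffic

effective_bw = raw_bw / bloat_factor
print(f"{effective_bw:.2f}X")   # -> 1.05X: barely better than off-chip DRAM
```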
AGENDA
• Introduction
• Background
  – DRAM Cache Designs
  – Secondary Operations
  – Bloat Factor
• BEAR
• Results
• Summary
DRAM CACHE HAS NARROW BUS
Alloy Cache [Qureshi and Loh, MICRO'12]: the DRAM$ accesses tag and data over narrow 16-byte buses; each 2KB row buffer holds tag-and-data (TAD) units of 8B tag + 64B data.
Useful operation: Hit (HIT)
Secondary operations: Miss Detection (MD), Miss Fill (MF), WB Detection (WD), WB Fill (WF)
CACHE REQUIRES MAINTENANCE OPERATIONS
[Figure: L3$ <-> DRAM Cache <-> Memory. A hit returns Line X from the DRAM$; a miss fetches Line X from memory and installs it (Miss Fill); evicting dirty Line Y triggers WB Detection and WB Fill.]
DRAM$ bandwidth is used for secondary operations
QUANTIFYING THE BANDWIDTH USAGE
[Figure: transfers on the bus interleave useful hits (HIT) with secondary operations (MD, MF, WD, WF)]
Bloat Factor: total bus traffic normalized to the useful data transferred; it indicates the bandwidth inefficiency
[Chart: Bloat Factor breakdown (Hit, Hit Tag+Data, Miss Detection, Miss Fill, WB Detection, WB Fill) for Baseline vs. Ideal]
BLOAT FACTOR BREAKDOWN
Baseline has a Bloat Factor of 3.8:
HIT (tag+data) 1.25, Miss Detection 0.7, Miss Fill 0.7, WB Detection 0.6, WB Fill 0.6
Config: 8-core, 8MB shared L3$, 1GB DRAM$, 16GB memory; SPEC2006: 16 rate and 38 mix workloads
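The breakdown above can be totaled directly (the per-operation values on the slide are rounded, so the sum lands slightly above the quoted 3.8):

```python
# Baseline bloat-factor components from the slide (rounded values).
components = {
    "HIT (tag+data)": 1.25,
    "Miss Detection": 0.7,
    "Miss Fill": 0.7,
    "WB Detection": 0.6,
    "WB Fill": 0.6,
}

bloat_factor = sum(components.values())
print(f"Bloat Factor ~= {bloat_factor:.2f}")   # -> 3.85, quoted as 3.8

# Share of DRAM$ traffic that is secondary operations:
secondary = bloat_factor - components["HIT (tag+data)"]
print(f"{secondary / bloat_factor:.0%} secondary")   # -> 68%
```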
POTENTIAL PERFORMANCE OF 22%
[Charts: average hit latency (cycles) for Baseline vs. Ideal, and speedup of the Ideal (bloat-free) design: 22% over Baseline]
Reducing Bloat Factor improves performance
NOT ALL OPERATIONS ARE CREATED EQUAL
Opportunities to remove secondary operations:
1. Operations that improve cache performance (e.g., inserting data into the DRAM$ on a miss)
2. Operations that ensure correctness (e.g., probing whether a line exists)
We propose BEAR to exploit these opportunities
AGENDA
• Introduction
• Background
• BEAR: Bandwidth-Efficient ARchitecture
  1. Bandwidth-Efficient Miss Fill
  2. Bandwidth-Efficient Writeback Detection
  3. Bandwidth-Efficient Miss Detection
• Results
• Summary
BANDWIDTH-EFFICIENT MISS FILL
When a line returns from memory on a miss, it can be inserted into the DRAM$ or thrown away (bypassed): bypass with probability P, insert with probability 1-P.
[Charts at P=90%, per workload (mcf, lbm, soplex, milc, mix1-mix7, AVG): hit-rate change (%), hit-latency reduction (%), and performance improvement (%); annotated points: 12%, -5%, +10%]
How do we enable bypass without hit-rate degradation?
BAB LIMITS THE HIT RATE LOSS
Bandwidth-Aware Bypass (BAB) dedicates a few Insert Sets (no bypass) and Bypass Sets (90% probabilistic bypass) in the DRAM$ and compares their hit rates X and Y. If the hit-rate loss X - Y < Δ, the remaining sets use probabilistic bypass; otherwise they do not bypass.
Use Probabilistic Bypass when the hit-rate loss is small
[Chart: Bloat Factor breakdown, Baseline vs. BAB]
BAB IMPROVES PERFORMANCE BY 5%
Hit Rate: Alloy 64%, BAB 62%
[Chart: speedup of BAB over Baseline (~5%). Bloat Factor breakdown with BAB: HIT 1.25, MD 0.7, MF reduced from 0.7 to 0.1, WD 0.6, WF 0.6]
BAB trades off a small hit-rate loss for a 5% performance improvement
WHAT IS A WRITEBACK DETECTION?
When the L3$ evicts a dirty Line Y(new), the DRAM$ must first probe whether a stale Line Y(old) exists in the cache (WB Detection) before the writeback can be installed.
How can we remove Writeback Detection?
DRAM CACHE PRESENCE FOR WB DETECTION
DRAM Cache Presence (DCP): alongside the Valid and Dirty bits, each L3$ line carries a presence bit recording whether the line resides in the DRAM$. On a dirty eviction, if the DCP bit is True the DRAM$ performs only the WB Fill; if False, it performs WB Detection + WB Fill.
DRAM Cache Presence reduces WB Detection
[Chart: Bloat Factor breakdown for Baseline, BAB, and BAB+DCP]
DCP IMPROVES PERFORMANCE BY 4%
[Chart: speedup over Baseline for BAB and BAB+DCP. Bloat Factor breakdown with BAB+DCP: HIT 1.25, MD 0.7, MF 0.1, WD reduced from 0.6 to 0.1, WF 0.6]
DCP provides a 4% improvement in addition to BAB
WHAT IS A MISS DETECTION?
When the L3$ misses on Line X, the DRAM$ (Tag + Data) must be probed to discover that Line X is absent (Miss Detection) before the request can be sent to memory.
Can we detect a miss without using DRAM$ bandwidth?
NEIGHBOR’S TAG COMES FREE WITH DEMAND
A demand access to address X streams Tag+Data+Tag (8+64+8 = 80 bytes) out of the 2KB DRAM row buffer, so the tag of the neighboring TAD arrives for free. A small Neighboring Tag Cache (NTC) keeps these tags; a later request to the neighbor can then be identified as a miss without accessing the DRAM$.
Neighboring Tag Cache saves Miss Detection
[Chart: Bloat Factor breakdown for Baseline, BAB, BAB+DCP, and BEAR]
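How the NTC short-circuits miss detection can be sketched as follows. The dictionary-backed NTC and the function names are illustrative assumptions; the key idea from the slide is only that the neighbor's tag rides along free with each 80-byte demand read:

```python
# Neighboring Tag Cache sketch: each demand access to the DRAM$
# returns Tag+Data+Tag (8+64+8 = 80B), so the neighboring line's
# tag is captured for free and can answer a later lookup.
ntc = {}   # set index -> tag most recently observed for that set

def demand_access(set_idx, tag, neighbor_set, neighbor_tag):
    """A normal DRAM$ access; the neighbor's tag rides along for free."""
    ntc[neighbor_set] = neighbor_tag

def needs_miss_detection(set_idx, tag):
    """True if the DRAM$ must be probed to find out whether `tag` misses."""
    if set_idx in ntc and ntc[set_idx] != tag:
        return False   # NTC proves a miss: go straight to memory
    return True        # unknown (or possible hit): probe the DRAM$

demand_access(10, tag=0xA, neighbor_set=11, neighbor_tag=0xB)
print(needs_miss_detection(11, 0xC))   # -> False (NTC detects the miss)
print(needs_miss_detection(12, 0xC))   # -> True  (no NTC information)
```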
NTC SHOWS 2% PERFORMANCE IMPROVEMENT
[Chart: speedup over Baseline for BAB, BAB+DCP, and BAB+DCP+NTC. Bloat Factor breakdown with all three techniques: HIT 1.25, MD reduced from 0.7 to 0.5, MF 0.1, WD 0.1, WF 0.6]
NTC improves performance by an additional 2%
AGENDA
• Introduction
• Background
• BEAR
• Results
• Summary
METHODOLOGY
Core chips: 8 cores @ 3.2 GHz, 2-wide OOO, 8MB 16-way shared L3 cache

            DRAM Cache               Off-chip DRAM
Capacity    1GB                      16GB
Bus         DDR 3.2GHz, 128-bit      DDR 1.6GHz, 64-bit
Channels    4 channels, 16 banks/ch  2 channels, 8 banks/ch

• Baseline: Alloy Cache [MICRO'12]
• SPEC2006 (16 memory-intensive apps): 16 rate and 38 mix workloads
BEAR REDUCES BLOAT FACTOR BY 32%
[Charts over all 54 workloads: Bloat Factor for Baseline, BEAR, and Ideal; speedup for BEAR and Ideal]
BEAR improves performance by 11%
BW BLOAT IN TAGS-IN-SRAM DESIGNS
Tags-in-SRAM (TIS) designs keep the tags in a 64MB SRAM, paying (1) storage overhead and (2) access latency. Even so, Hit, Miss Fill, and WB Fill traffic still consume DRAM$ bandwidth.
[Chart: Bloat Factor for Alloy, BEAR, and TIS (64MB)]
Tags-in-SRAM also has the bandwidth bloat problem
TAGS-IN-SRAM PERFORMS SIMILARLY TO BEAR
[Chart: speedup for Alloy, BEAR, TIS (64MB), and SC]
BEAR can be applied to reduce BW bloat in Tags-in-SRAM DRAM$ designs
SUMMARY
• 3D DRAM used as a cache mitigates the memory bandwidth wall.
• In DRAM caches, secondary operations slow down the delivery of critical data.
• We propose BEAR, which targets three sources of bandwidth bloat in DRAM caches:
  1. Bandwidth-Efficient Miss Fill
  2. Bandwidth-Efficient Writeback Detection
  3. Bandwidth-Efficient Miss Detection
• Overall, BEAR reduces bandwidth bloat by 32% and improves performance by 11%.
THANK YOU
Computer Architecture and Emerging Technologies Lab, Georgia Tech
Backup Slides
THE OVERHEAD OF BEAR IS NEGLIGIBLY SMALL

Design                    Cost                     Total
Bandwidth-Aware Bypass    8 bytes per thread       64 bytes
DRAM Cache Presence       One bit per line in LLC  16K bytes
Neighboring Tag Register  44 bytes per bank        3.2K bytes
Total Cost                                         19.2K bytes

Overall, BEAR incurs a HW overhead of 19.2KB
COMPARISON TO OTHER DRAM$ DESIGNS
[Chart: speedup w.r.t. no L4 cache for RATE, MIX, and ALL workloads, comparing LH-Cache, Alloy, Incl-Alloy, and BEAR; annotated gains: 28% and 11%]
BEAR outperforms other Tags-in-DRAM DRAM$ designs