THE DYNAMIC GRANULARITY MEMORY SYSTEM

37
THE DYNAMIC GRANULARITY MEMORY SYSTEM Doe Hyun Yoon IIL, HP Labs Michael Sullivan Min Kyu Jeong Mattan Erez ECE, UT Austin

Transcript of THE DYNAMIC GRANULARITY MEMORY SYSTEM

Page 1: THE DYNAMIC GRANULARITY MEMORY SYSTEM

THE DYNAMIC GRANULARITY MEMORY SYSTEM

Doe Hyun Yoon

IIL, HP Labs

Michael Sullivan

Min Kyu Jeong

Mattan Erez

ECE, UT Austin

Page 2: THE DYNAMIC GRANULARITY MEMORY SYSTEM

MEMORY ACCESS GRANULARITY • The size of block for accessing main memory

– Often, equal to last-level cache line size

• Modern systems use coarse-grained (CG)

memory access

– 64B or larger

– Amortize control & ECC overhead

– Prefetching

2

Page 3: THE DYNAMIC GRANULARITY MEMORY SYSTEM

CG ACCESS MAY WASTE BW

• Waste BW on unused data

3

for( i=0; i<N; i++ ) {

a[ b[i] ] += x;

}

GUPS microbenchmark Buffer a

Initialized with random

numbers

Page 4: THE DYNAMIC GRANULARITY MEMORY SYSTEM

CAN WE WASTE BW? • CG access often improves performance

– Large cache lines reduce miss rate

due to prefetching

• Off-chip BW doesn’t scale with # cores

• Power is the limiting factor

• We shouldn’t waste the finite off-chip BW

4

Page 5: THE DYNAMIC GRANULARITY MEMORY SYSTEM

HOW TO EFFICIENTLY UTILIZE OFF-CHIP BW?

• Prior work: AGMS [ISCA’11]

– Combine CG and FG accesses

– Need SW help for ECC support

• Source code, compiler, OS, virtual memory, …

• DGMS

– HW-only variant of AGMS

– Truly dynamic granularity adaptation

5

Page 6: THE DYNAMIC GRANULARITY MEMORY SYSTEM

ADAPTIVE GRANULARITY

MEMORY SYSTEM [ISCA’11]

6

Page 7: THE DYNAMIC GRANULARITY MEMORY SYSTEM

AGMS [ISCA’11] • Combine coarse-grained (CG) and

fine-grained (FG) accesses

• CG for high spatial locality regions

• FG for low spatial locality regions

• Higher throughput

• Lower DRAM power

7

Page 8: THE DYNAMIC GRANULARITY MEMORY SYSTEM

SUB-RANKED DRAM MODULE • Independently control individual DRAM chips

• Access granularity = 8bit x burst 8 = 8B

8

DBUS

x8

x8

x8

x8

x8

x8

x8

x8

x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8

Re

g/D

em

ux

Page 9: THE DYNAMIC GRANULARITY MEMORY SYSTEM

9

Burst 8

B0

B8

B16

B24

B32

B40

B48

B56

B1

B9

B17

B25

B33

B41

B49

B57

B2

B10

B18

B26

B34

B42

B50

B58

B3

B11

B19

B27

B35

B43

B51

B59

B4

B12

B20

B28

B36

B44

B52

B60

B5

B13

B21

B29

B37

B45

B53

B61

B6

B14

B22

B30

B38

B46

B54

B62

B7

B15

B23

B31

B39

B47

B55

B63

E 0-7

E 8-15

E 16-23

E 24-31

E 32-39

E 40-47

E 48-55

E 56-63

64-bit data + 8-bit ECC (SEC-DED)

x8

x8

x8

x8

x8

x8

x8

x8

x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8

Re

g/D

em

ux

Page 10: THE DYNAMIC GRANULARITY MEMORY SYSTEM

10

B0

B1

B2

B3

B4

B5

B6

B7

E0

E1

E2

E3

E4

E5

E6

E7

B8

B9

B10

B11

B12

B13

B14

B15

E8

E9

E10

E11

E12

E13

E14

E15

B16

B17

B18

B19

B20

B21

B22

B23

E16

E17

E18

E19

E20

E21

E22

E23

B24

B25

B26

B27

B28

B29

B30

B31

E24

E25

E26

E27

E28

E29

E30

E31

x8

x8

x8

x8

x8

x8

x8

x8

x8

ABUS

SR0 SR1 SR2 SR3 SR4 SR5 SR6 SR7 SR8

Re

g/D

em

ux

8-bit data + 5-bit SEC-DED or 8-bit DEC

Burst 8

Page 11: THE DYNAMIC GRANULARITY MEMORY SYSTEM

SOFTWARE SUPPORT IN AGMS • Different data/ECC layouts for CG & FG

• Requires software help

– Extend virtual memory interface

– OS/runtime manages CG&FG pages

– Programmer/compiler annotates preferred granularity

• Need to change every level of system hierarchy!

11

Page 12: THE DYNAMIC GRANULARITY MEMORY SYSTEM

DYNAMIC GRANULARITY

MEMORY SYSTEM

12

Page 13: THE DYNAMIC GRANULARITY MEMORY SYSTEM

DGMS • Unified data/ECC layout for CG & FG

– No SW support

• HW-only variant of AGMS

– Comparable or better performance

– Easier to implement

• Challenge:

– How to predict access granularity dynamically? 13

Page 14: THE DYNAMIC GRANULARITY MEMORY SYSTEM

UNIFIED DATA/ECC LAYOUT

14

B8

B9

B10

B11

B12

B13

B14

B15

B0

B1

B2

B3

B4

B5

B6

B7

B16

B17

B18

B19

B20

B21

B22

B23

B24

B25

B26

B27

B28

B29

B30

B31

B32

B33

B34

B35

B36

B37

B38

B39

B40

B41

B42

B43

B44

B45

B46

B47

B48

B49

B50

B51

B52

B53

B54

B55

B56

B57

B58

B59

B60

B61

B62

B63

E 0-7

E 8-15

E 16-23

E 24-31

E 32-39

E 40-47

E 48-55

E 56-63

Burst 8

64-bit data 8-bit ECC (SEC-DED)

Page 15: THE DYNAMIC GRANULARITY MEMORY SYSTEM

CG ACCESS • Access the whole 72B

15

B8

B9

B10

B11

B12

B13

B14

B15

B0

B1

B2

B3

B4

B5

B6

B7

B16

B17

B18

B19

B20

B21

B22

B23

B24

B25

B26

B27

B28

B29

B30

B31

B32

B33

B34

B35

B36

B37

B38

B39

B40

B41

B42

B43

B44

B45

B46

B47

B48

B49

B50

B51

B52

B53

B54

B55

B56

B57

B58

B59

B60

B61

B62

B63

E 0-7

E 8-15

E 16-23

E 24-31

E 32-39

E 40-47

E 48-55

E 56-63

Burst 8

Page 16: THE DYNAMIC GRANULARITY MEMORY SYSTEM

FG ACCESS • Access 8B data and 8B ECC

16

B8

B9

B10

B11

B12

B13

B14

B15

B0

B1

B2

B3

B4

B5

B6

B7

B16

B17

B18

B19

B20

B21

B22

B23

B24

B25

B26

B27

B28

B29

B30

B31

B32

B33

B34

B35

B36

B37

B38

B39

B40

B41

B42

B43

B44

B45

B46

B47

B48

B49

B50

B51

B52

B53

B54

B55

B56

B57

B58

B59

B60

B61

B62

B63

E 0-7

E 8-15

E 16-23

E 24-31

E 32-39

E 40-47

E 48-55

E 56-63

Burst 8

Page 17: THE DYNAMIC GRANULARITY MEMORY SYSTEM

AVOIDING CONTENTION ON ECC DRAM

17

ECC 8 B

SR 0 SR 1 SR 2 SR 3 SR 4 SR 5 SR 6 SR 7 SR 8

8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

ECC 8 B 8 B 8 B 8 B 8 B 8 B 8 B 8 B

Page 18: THE DYNAMIC GRANULARITY MEMORY SYSTEM

DGMS DESIGN

18

Last Level Cache

Memory Controller

Core 0

$I $D

L2

Core 1

$I $D

L2

Core N-1

$I $D

L2

Sector

cache

Sub-ranked memory

w/ unified data/ECC layout DRAM

Page 19: THE DYNAMIC GRANULARITY MEMORY SYSTEM

GRANULARITY PREDICTION

19

Last Level Cache

Memory Controller

Core 0

$I $D

L2

SPP

Core 1

$I $D

L2

SPP

Core N-1

$I $D

L2

SPP

Spatial

Pattern

Predictor

Sub-ranked Memory

[Chen; HPCA’04]

Which words within a cache line will be used?

Page 20: THE DYNAMIC GRANULARITY MEMORY SYSTEM

SPATIAL PATTERN PREDICTOR [CHEN; HPCA’04]

20

Tag Status Data

L1 Data Cache

. . . . . .

. . .

00101101 01001011 00001000 10000000

00110000 11010001

Used Idx

CPT

. . . . . .

Idx Status 01000000

Pattern

00001000 00001110 01001100

11010001 01110000

. . . . . .

. . .

PHT

Load/Store

Evicted or

Subsector miss

Update CPT

Request To L2

PHT hit

Default

PC DA +

F

T

Tag Status Data

L1 Data Cache

. . . . . .

. . .

Load/Store

Page 21: THE DYNAMIC GRANULARITY MEMORY SYSTEM

SPP ACCURACY

21

0.0

0.2

0.4

0.6

0.8

1.0

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN streamcluster stream

Not predicted, but Referenced Predicted & Referenced

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream

Page 22: THE DYNAMIC GRANULARITY MEMORY SYSTEM

SPP LIMITATIONS • Case 1)

– Application accesses 5~7 words per cache line

• Case 2)

– App1 has low spatial locality, MPKI is 1

– App2 has high spatial locality, MPKI is 20

• Minimizing traffic doesn’t always

improve performance 22

Page 23: THE DYNAMIC GRANULARITY MEMORY SYSTEM

23

PREDICTION CONTROLLER

Last Level Cache

Memory Controller

Core 0

$I $D

L2

SPP

GPC

Sub-ranked Memory

Core 1

$I $D

L2

SPP

Core N-1

$I $D

L2

SPP

Ignore SPP

if AvgRefWord > 3.75, or

if row-buffer hirate > 0.8

Treat all requests are CG

if CG requests are dominant

(more than 80% of MC queue)

LPC & GPC prevent performance degradation in some CG-friendly apps

LPC LPC LPC

Page 24: THE DYNAMIC GRANULARITY MEMORY SYSTEM

EVALUATION

24

Page 25: THE DYNAMIC GRANULARITY MEMORY SYSTEM

EVALUATION • Zesto simulator

– 8 out-of-order x86 cores

– Private caches: 32kB I/D L1, 256kB unified L2

– Shared last-level cache: 8MB

• DrSim: detailed DDR3 DRAM model

• Memory intensive multiprogrammed workloads

25

Page 26: THE DYNAMIC GRANULARITY MEMORY SYSTEM

0

1

2

3

4

5

6

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

SYSTEM THROUGHPUT

26

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

CG

Page 27: THE DYNAMIC GRANULARITY MEMORY SYSTEM

0

1

2

3

4

5

6

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

SYSTEM THROUGHPUT

27

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

CG AGMS

Page 28: THE DYNAMIC GRANULARITY MEMORY SYSTEM

0

1

2

3

4

5

6

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

SYSTEM THROUGHPUT

28

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

CG AGMS DGMS

Page 29: THE DYNAMIC GRANULARITY MEMORY SYSTEM

LOW SPATIAL LOCALITY APPS

29

0

1

2

3

4

5

6

SSCA2 canneal em3d mst gups mcf omnetpp

We

igh

ted

Sp

ee

du

p

We

igh

ted

Sp

ee

du

p

CG

AGMS

DGMS

Page 30: THE DYNAMIC GRANULARITY MEMORY SYSTEM

0

1

2

3

4

5

6

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

SYSTEM THROUGHPUT

30

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

CG AGMS DGMS

Page 31: THE DYNAMIC GRANULARITY MEMORY SYSTEM

HIGH SPATIAL LOCALITY APPS

31 mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3

We

igh

ted

Sp

ee

du

p

6

5

4

3

2

1

0

CG AGMS DGMS

Page 32: THE DYNAMIC GRANULARITY MEMORY SYSTEM

0

1

2

3

4

5

6

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

SYSTEM THROUGHPUT

32

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

CG AGMS DGMS

Page 33: THE DYNAMIC GRANULARITY MEMORY SYSTEM

MIXED CASES

33 cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

4

3

2

1

0

6 MIX1: SSCA2 x2, mst x2, em3d x2, canneal x2 MIX2: SSCA2 x2, canneal x2, mcf x2, OCEAN x2 MIX3: canneal x2, mcf x2, bzip2 x2, hmmer x2

MIX4: mcf x4, omnetpp x4 MIX5: SSCA2 x2, canneal x2, mcf x2, streamcluster x2

CG

AGMS

DGMS

Page 34: THE DYNAMIC GRANULARITY MEMORY SYSTEM

POWER EFFICIENCY

34

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

3.2 2.8

No

rma

lize

d T

hro

ug

hp

ut/

Po

we

r

CG AGMS DGMS

Page 35: THE DYNAMIC GRANULARITY MEMORY SYSTEM

35

0

1

2

3

4

5

6

7

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

SSCA2 canneal em3d mst gups mcf omnetpp lbm OCEAN s-cluster stream MIX1 MIX2 MIX3 MIX4 MIX5

We

igh

ted

Sp

ee

du

p

CG AGMS DGMS

SYSTEM THROUGHPUT (NO ECC)

Page 36: THE DYNAMIC GRANULARITY MEMORY SYSTEM

CONCLUSIONS • Dynamic Granularity Memory System

– HW-only variant of AGMS

– Truly dynamic granularity adaptation

– Higher performance [31% vs. CG]

– Lower DRAM power [13% vs. CG]

• More in the paper – Reg/demux and address/command bus bandwidth

– LPC&GPC details

– DGMS with chipkill-correct support 36

Page 37: THE DYNAMIC GRANULARITY MEMORY SYSTEM

THE DYNAMIC GRANULARITY MEMORY SYSTEM

Doe Hyun Yoon

IIL, HP Labs

Michael Sullivan

Min Kyu Jeong

Mattan Erez

ECE, UT Austin