
Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scalability

36th International Symposium on Computer Architecture (ISCA 2009)

Brian Rogers†‡, Anil Krishna†‡, Gordon Bell‡, Ken Vu‡, Xiaowei Jiang†, Yan Solihin†

NC State University


As Process Technology Scales …

[Figure: a small CMP (a few cores P with caches $) next to a future CMP (many cores and caches), both attached to the same DRAM; on-chip cores and caches multiply each generation while off-chip DRAM bandwidth stays roughly fixed.]


Problem: core-count growth >> memory-bandwidth growth

• Cores: roughly exponential growth (driven by Moore's Law)
• Bandwidth: much slower growth (pin and power limitations)
• At each relative technology generation T: (# cores $= 2^T$) >> (bandwidth $= B^T$)

Some key questions (our contributions):
• How constraining is the increasing gap between the number of cores and the available memory bandwidth?
• How should future CMPs be designed; how should we allocate transistors to caches and cores?
• What techniques can best reduce memory traffic demand?

Approach: build an analytical CMP memory bandwidth model.


Agenda
• Background / Motivation
• Assumptions / Scope
• CMP Memory Traffic Model
• Alternate Views of Model
• Memory Traffic Reduction Techniques: Indirect, Direct, Dual
• Conclusions


Assumptions / Scope

• Homogeneous cores
• Single-threaded cores (multi-threading only adds to the problem)
• Co-scheduled sequential applications; multi-threaded apps with data sharing are evaluated separately
• Enough work to keep all cores busy
• Workloads static across technology generations
• Equal amount of cache per core
• Power/energy constraints are outside the scope of this study



Cache Miss Rate vs. Cache Size

[Figure: miss rate vs. number of cache ways (8 KB/way) on log-log axes, for the SPEC 2006 average and for a commercial-workload average; each measured curve is closely tracked by a power-law fit.]

The relationship follows the Power Law (Hartstein et al.'s "√2 rule"):

$M = M_0 \, R^{-\alpha}$

where $R$ = new cache size / old cache size, and $\alpha$ = the sensitivity of the workload to the cache-size change.
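A minimal sketch of this model in Python ($M_0$ and α below are illustrative values, not measurements from the paper):

```python
# Power-law miss-rate model: M = M0 * R^-alpha, with R = new size / old size.
def miss_rate(new_size, old_size, m0, alpha):
    r = new_size / old_size
    return m0 * r ** -alpha

# Doubling the cache at alpha = 0.5 cuts misses by sqrt(2) -- the "√2 rule".
print(miss_rate(16, 8, m0=0.05, alpha=0.5))  # ~0.0354
```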


CMP Traffic Model

Express chip area in terms of Core Equivalent Areas (CEAs): one core = 1 CEA, one unit of cache = 1 CEA. Let P = # cores, C = # cache CEAs, N = P + C, and S = C/P. Assume that non-core and non-cache components require a constant fraction of the area.

Adding a number-of-cores term to the miss-rate model gives the CMP traffic model:

$M = P \, M_0 \, (S / S_0)^{-\alpha}$
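A quick sanity check at α = 0.5 (the value used in the scaling example below): doubling P doubles the traffic, while doubling S shrinks it only by a factor of $2^{0.5} \approx 1.41$, so a chip that doubles both its cores and its cache per core still generates roughly 41% more memory traffic.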


CMP Traffic Model (2)

Going from CMP1 = <P1, C1> to CMP2 = <P2, C2>:

$M_1 = P_1 \, M_0 \, (S_1 / S_0)^{-\alpha}$, $\quad M_2 = P_2 \, M_0 \, (S_2 / S_0)^{-\alpha}$

Removing common terms expresses $M_2$ in terms of $M_1$:

$M_2 = M_1 \, \dfrac{P_2}{P_1} \left(\dfrac{S_2}{S_1}\right)^{-\alpha}$

(P = # cores, C = # cache CEAs, N = P + C, S = C/P)
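A minimal sketch of this ratio in Python (the function name and arguments are mine, for illustration):

```python
# M2/M1 when scaling from (p1 cores, s1 cache CEAs per core) to (p2, s2).
def traffic_ratio(p1, s1, p2, s2, alpha):
    return (p2 / p1) * (s2 / s1) ** -alpha

# Doubling cores and cache per core still grows traffic by ~41% at alpha = 0.5.
print(traffic_ratio(8, 1.0, 16, 2.0, 0.5))  # ~1.414
```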


One Generation of Scaling

Baseline processor: 8 cores, 8 cache CEAs (N1 = 16, P1 = 8, C1 = 8, S1 = 1), with off-chip bandwidth assumed ~fully utilized, and α = 0.5.

How many cores are possible if 32 CEAs are now available? Ideal scaling would be 2x the number of cores at each successive technology generation.

[Figure: memory traffic, normalized to the 8-core/8-cache baseline, vs. number of cores on a 32-CEA chip. Traffic rises steeply with core count; ideal scaling (16 cores) far exceeds the baseline traffic level, while bandwidth-limited scaling stops at roughly 11 cores.]
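Working the example: at constant bandwidth, set $M_2 = M_1$. With $S_2 = (32 - P_2)/P_2$ and α = 0.5, $\frac{P_2}{8}\left(\frac{32 - P_2}{P_2}\right)^{-1/2} = 1$, which simplifies to $P_2^3 = 64\,(32 - P_2)$ and gives $P_2 \approx 11$: about 1.4x the baseline core count, well short of the ideal 16.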



CMP Design Constraint

If the available off-chip bandwidth grows by a factor of B, total memory traffic may grow by at most a factor of B each generation:

$B \, M_1 = M_2 \;\Rightarrow\; B = \dfrac{P_2}{P_1} \left(\dfrac{S_2}{S_1}\right)^{-\alpha}$

Writing $S_2$ in terms of $P_2$ and $N_2$, i.e. $S_2 = (N_2 - P_2)/P_2$:

$B = \dfrac{P_2}{P_1} \left(\dfrac{N_2 - P_2}{P_2 \, S_1}\right)^{-\alpha}$

For a new technology with $N_2$ CEAs and bandwidth factor B, solve numerically for $P_2$, the number of cores that can be supported.

(P = # cores, C = # cache CEAs, N = P + C, S = C/P)
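A minimal numerical sketch of that solve (Python, bisection; the names are mine and the defaults are the earlier 8-core baseline):

```python
# M2/M1 for a chip with p2 cores out of n2 CEAs (baseline: p1 = 8 cores, S1 = 1).
def traffic_growth(p2, n2, p1=8, s1=1.0, alpha=0.5):
    s2 = (n2 - p2) / p2  # cache CEAs per core
    return (p2 / p1) * (s2 / s1) ** -alpha

# Largest p2 whose traffic growth stays within B. Growth rises monotonically
# with p2 (more cores, less cache each), so bisection applies.
def max_cores(n2, b=1.0, **kw):
    lo, hi = 1e-6, n2 - 1e-6
    for _ in range(100):
        mid = (lo + hi) / 2
        if traffic_growth(mid, n2, **kw) <= b:
            lo = mid
        else:
            hi = mid
    return lo

print(max_cores(32))         # ~11 cores at constant bandwidth (B = 1)
print(max_cores(32, b=2.0))  # 16 cores: doubled bandwidth restores ideal scaling
```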


Scaling Under Area Constraints

With an increasing number of CEAs available, how many cores can be supported at a constant bandwidth requirement?

[Figure: number of cores (left axis) and portion of chip area devoted to cores (right axis) vs. die-area scaling ratio, 1x to 128x. The supportable core count grows far more slowly than the area, and the fraction of the chip given to cores falls steadily.]

• 2x die area: 1.4x cores
• 4x die area: 1.9x cores
• 8x die area: 2.4x cores
• 16x die area: 3.2x cores
• ...
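The solver sketched earlier roughly reproduces this trend under the same constant-bandwidth assumption (baseline chip = 16 CEAs; the small differences from the slide's numbers presumably come from rounding):

```python
# Area-scaling trend using max_cores() from the earlier sketch.
for scale in (2, 4, 8, 16):
    p2 = max_cores(16 * scale)  # 2x die area = 32 CEAs, and so on
    print(f"{scale}x area: {p2:.1f} cores ({p2 / 8:.1f}x)")
```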



Categories of Techniques

Indirect (increase the effective cache size):
• Cache Compression
• DRAM Caches
• 3D-Stacked Cache
• Unused Data Filter
• Smaller Cores

Direct (reduce the traffic per miss):
• Link Compression
• Sectored Caches

Dual (both):
• Cache + Link Compression
• Small Cache Lines
• Data Sharing

The model gains a traffic-compression factor R:

$M_2 = \dfrac{1}{R} \, M_1 \, \dfrac{P_2}{P_1} \left(\dfrac{S_2}{S_1}\right)^{-\alpha}$
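As a sketch of how the technique factors slot into the earlier code (the next slides also use F, an effective-cache-size multiplier), traffic_growth can be extended so that max_cores() keeps working unchanged:

```python
# Extended model sketch: F multiplies the effective cache size (indirect
# techniques); R divides the traffic per miss (direct techniques).
# F = R = 1 recovers the base model.
def traffic_growth(p2, n2, f=1.0, r=1.0, p1=8, s1=1.0, alpha=0.5):
    s2 = (n2 - p2) / p2
    return (p2 / p1) * (f * s2 / s1) ** -alpha / r
```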


Indirect – DRAM Cache

F is influenced by the increased density of DRAM relative to SRAM:

$M_2 = M_1 \, \dfrac{P_2}{P_1} \left(\dfrac{F \, S_2}{S_1}\right)^{-\alpha}$

[Figure: number of supportable CMP cores for an SRAM L2 vs. a DRAM L2 at 4x, 8x, and 16x density, each under pessimistic, realistic, and optimistic assumptions, compared against ideal scaling.]
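For instance, under the illustrative (optimistic) assumption that an 8x-denser DRAM L2 maps directly to F = 8:

```python
# Illustrative: an 8x-denser DRAM L2 modeled as F = 8 in the extended sketch.
print(max_cores(32, f=8.0))  # ~19 cores (vs. ~11 baseline, 16 ideal)
```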


Direct – Link Compression

R is influenced by the achieved compression ratio:

$M_2 = \dfrac{1}{R} \, M_1 \, \dfrac{P_2}{P_1} \left(\dfrac{S_2}{S_1}\right)^{-\alpha}$

[Figure: number of supportable CMP cores vs. compression effectiveness (no compression, then 1.25x through 4.0x), under pessimistic, realistic, and optimistic assumptions, compared against ideal scaling.]
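For instance, modeling a 2x compression ratio as R = 2 in the sketch:

```python
# Illustrative: 2x link compression modeled as R = 2.
print(max_cores(32, r=2.0))  # 16 cores: enough for one generation of ideal scaling
```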


Dual – Small Cache Lines

Both F and R are influenced by the percentage of unused data in each cache line:

$M_2 = \dfrac{1}{R} \, M_1 \, \dfrac{P_2}{P_1} \left(\dfrac{F \, S_2}{S_1}\right)^{-\alpha}$

[Figure: number of supportable CMP cores vs. average amount of unused data (0%, 10%, 20%, 40%, 80%), under pessimistic, realistic, and optimistic assumptions, compared against ideal scaling.]
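One illustrative mapping (my assumption, not the paper's calibrated model): if a fraction u of each line is unused, eliminating it might raise the effective capacity by F = 1/(1 − u) and cut the traffic per miss by R = 1/(1 − u):

```python
# Illustrative mapping from unused-data fraction u to F and R (an assumption,
# not the paper's model): both improve by a factor of 1 / (1 - u).
u = 0.4  # 40% of each line unused
print(max_cores(32, f=1 / (1 - u), r=1 / (1 - u)))  # ~17 cores
```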


Dual – Data Sharing

[Figure: normalized memory traffic vs. fraction of shared data (0.1 to 1.0), plotted for 16, 32, 64, and 128 cores; y-axis from 0% to 500%.]

Please see the paper for details on the modeling of sharing. Data sharing is unlikely to provide a scalable solution.


Summary of Individual Techniques

[Figure: number of supportable cores (up to ~140) at chip sizes of 32, 64, 128, and 256 CEAs for IDEAL, BASE, the indirect techniques (cache compression CC, DRAM cache, 3D-stacked cache, unused-data filter Fltr, smaller cores SmCo), the direct techniques (link compression LC, sectored caches Sect), and the dual techniques (small cache lines SmCl, cache+link compression CC/LC).]


Summary of Combined Techniques

[Figure: number of supportable cores (up to ~300) at die-area scalings of 2x, 4x, 8x, and 16x for IDEAL, BASE, and combinations of techniques: CC+DRAM+3D, CC/LC+DRAM, CC+3D+Fltr, CC/LC+Fltr, DRAM+3D+LC, DRAM+Fltr+LC, DRAM+LC+Sect, 3D+Fltr+LC, SmCl+LC, CC/LC+SmCl, DRAM+3D+SmCl, CC/LC+DRAM+SmCl, CC/LC+3D+SmCl, CC/LC+DRAM+3D, and CC/LC+DRAM+3D+SmCl.]


Conclusions

Contributions:
• A simple but powerful analytical CMP memory traffic model
• Quantified the significance of the memory bandwidth wall: only ~10% of chip area goes to cores within 4 generations under a constant traffic requirement
• Guidance for the core-vs-cache design of future CMPs: given a fixed chip area and bandwidth scaling, how many cores?
• Evaluation of memory traffic reduction techniques: combinations can enable ideal scaling for several generations

We need bandwidth-efficient computing:
• Hardware/architecture level: DRAM caches, cache/link compression, prefetching, smarter memory controllers, etc.
• Technology level: 3D chips, optical interconnects, etc.
• Application level: working-set reduction, locality enhancement, data vs. pipelined parallelism, computation vs. communication trade-offs, etc.


Questions?

Thank You

Brian [email protected]