Brian Rogers †‡ , Anil Krishna †‡ , Gordon Bell ‡ , Ken Vu ‡ , Xiaowei Jiang † , Yan...
-
Upload
erica-king -
Category
Documents
-
view
26 -
download
2
description
Transcript of Brian Rogers †‡ , Anil Krishna †‡ , Gordon Bell ‡ , Ken Vu ‡ , Xiaowei Jiang † , Yan...
Scaling the Bandwidth Wall:Challenges in and Avenues for
CMP Scalability
36th International Symposium on Computer Architecture
Brian Rogers†‡, Anil Krishna†‡, Gordon Bell‡,
Ken Vu‡, Xiaowei Jiang†, Yan Solihin†
NC STATE UNIVERSITY
† ‡
2 Scaling the Bandwidth Wall -- ISCA 2009
As Process Technology Scales …
P
P P
P
$ $ $
DRAM
P
DRAM
PP
PP
PP
PP
PP
PP
PP
P
$
$
$
$
$
$ $
$ $
$$
$
3 Scaling the Bandwidth Wall -- ISCA 2009
Problem Core growth >> Memory bandwidth growth
Cores: ~ exponential growth (driven by Moore’s Law) Bandwidth: ~ much slower growth (pin and power limitations)
At each relative technology generation (T): (# Cores = 2T) >> (Bandwidth = BT)
Some key questions (Our contributions): How constraining is increasing gap between # of cores and
available memory bandwidth? How should future CMPs be designed; how should we
allocate transistors to caches and cores? What techniques can best reduce memory traffic demand?
Build Analytical CMP Memory Bandwidth Model
4 Scaling the Bandwidth Wall -- ISCA 2009
Agenda Background / Motivation Assumptions / Scope CMP Memory Traffic Model Alternate Views of Model Memory Traffic Reduction Techniques
Indirect Direct Dual
Conclusions
5 Scaling the Bandwidth Wall -- ISCA 2009
Assumptions / Scope
Homogenous cores Single-threaded cores (multi-threading adds to problem) Co-scheduled sequential applications
Multi-threaded apps with data sharing evaluated separately
Enough work to keep all cores busy Workloads static across technology generations Equal amount of cache per core Power/Energy constraints outside scope of this study
6 Scaling the Bandwidth Wall -- ISCA 2009
Agenda Background / Motivation Assumptions / Scope CMP Memory Traffic Model Alternate Views of Model Memory Traffic Reduction Techniques
Indirect Direct Dual
Conclusions
7 Scaling the Bandwidth Wall -- ISCA 2009
SPEC 2006 Average
0.01
0.10
1.00
1 10 100 1000# of Cache Ways (8KB/way)
Mis
s R
ate
average
Pow er (average)
Commercial Average
0.001
0.010
0.100
1.000
1 10 100 1000
# of Cache Ways (8KB/way)
Mis
s R
ate
average
Pow er (average)
Cache Miss Rate vs. Cache Size Relationship follows the Power Law, Hartstein et al. (√2 Rule)
M = M0 * R-αR = New cache size / Old cache size
α = Sensitivity of workload to cache size change
8 Scaling the Bandwidth Wall -- ISCA 2009
CMP Traffic Model Express chip area in terms of Core Equivalent Areas
(CEAs) Core = 1 CEA, Unit_of_Cache = 1 CEA P = # cores, C = # cache CEAs, N = P+C, S = C/P
Assume that non-core and non-cache components require constant fraction of area
Add # of cores term for CMP model:
α
SSMPM
00
9 Scaling the Bandwidth Wall -- ISCA 2009
CMP Traffic Model (2)
Going from CMP1=<P1,C1> to CMP2=<P2,C2>
Remove common terms, express M2 in terms of M1
0
101
0
202
1
2
SSMP
SSMP
M
M
11
2
1
22 M
S
S
P
PM
11
2
1
22 M
S
S
P
PM
P = # cores, C = # cache CEAS N = P+C, S = C/P
10 Scaling the Bandwidth Wall -- ISCA 2009
One Generation of Scaling Baseline Processor: 8 cores, 8 cache CEAs
N1=16, P1=8, C1=8, S1=1, and ~ fully utilized BW α = 0.5
How many cores possible if 32 CEAS now available? Ideal Scaling = 2X # of cores at each successive technology generation
0
1
2
3
4
5
6
7
8
9
10
0 1 2 3 4 5 6 7 8 91
01
11
21
31
41
51
61
71
81
92
02
12
22
32
42
52
62
72
82
93
03
13
2
Number of Cores
No
rma
lize
d T
raff
ic (
8 C
ore
, 8 C
ac
he
)
Base Traffic (8_8)New Chip Traffic
Ideal Scaling
BW Limited Scaling
11 Scaling the Bandwidth Wall -- ISCA 2009
Agenda Background / Motivation Assumptions / Scope CMP Memory Traffic Model Alternate Views of Model Memory Traffic Reduction Techniques
Indirect Direct Dual
Conclusions
12 Scaling the Bandwidth Wall -- ISCA 2009
CMP Design Constraint If available off-chip BW grows by factor of B:
Total memory traffic should grow by at most a factor of B each generation
Write S2 in terms of P2 and N2:
New technology: N2 CEAs, B bandwidth => solve for P2 numerically
12 MBM
1
2
1
2
S
S
P
PB
1
222
1
2
S
PPN
P
PB
P2 is # of cores that can be supported
P = # cores, C = # cache CEAS N = P+C, S = C/P
13 Scaling the Bandwidth Wall -- ISCA 2009
Scaling Under Area Constraints
With an increasing # of CEAs available, how many cores can be supported at constant BW requirement
0
10
20
30
40
50
60
1x 2x 4x 8x 16x 32x 64x 128x
Scaling Ratio
Nu
mb
er
of
Co
res
0%
10%
20%
30%
40%
50%
60%
Po
rtio
n o
f C
hip
Are
a
# of Cores% of Chip Area for Cores • 2x die area: 1.4x cores
• 4x die area: 1.9x cores
• 8x die area: 2.4x cores
• 16x die area: 3.2x cores
• …
14 Scaling the Bandwidth Wall -- ISCA 2009
Agenda Background / Motivation Assumptions / Scope CMP Memory Traffic Model Alternate Views of Model Memory Traffic Reduction Techniques
Indirect Direct Dual
Conclusions
15 Scaling the Bandwidth Wall -- ISCA 2009
Categories of Techniques
Indirect
• Cache Compression• DRAM Caches• 3D-stacked Cache• Unused Data Filter• Smaller Cores
Direct
• Link Compression• Sectored Caches
Dual
• Cache+Link Compress• Small Cache Lines• Data Sharing
11
2
1
22
1M
RS
S
P
PM
16 Scaling the Bandwidth Wall -- ISCA 2009
Indirect – DRAM Cache
F – Influenced by Increased Density
0
5
10
15
20
25
SRAM L2 DRAM L2 (4x) DRAM L2 (8x) DRAM L2 (16x)
L2 Cache Configuration
Nu
mb
er
of
CM
P C
ore
s Pessimistic
Realistic
Optimistic
Ideal Scaling
11
2
1
22 M
S
SF
P
PM
17 Scaling the Bandwidth Wall -- ISCA 2009
Direct – Link Compression
R – Influenced by Compression Ratio11
2
1
22
1M
RS
S
P
PM
0
5
10
15
20
25
NoCompress
1.25x 1.50x 1.75x 2.0x 2.5x 3.0x 3.5x 4.0x
Compression Effectiveness
Nu
mb
er
of
CM
P C
ore
s
Pessimistic
Realistic
Optimistic
Ideal Scaling
18 Scaling the Bandwidth Wall -- ISCA 2009
Dual – Small Cache Lines
F,R – Influenced by % Unused Data1
1
2
1
22
1M
RS
SF
P
PM
0
5
10
15
20
25
30
0% 10%
20%
40%
80%
Average Amount of Unused Data
Nu
mb
er
of
CM
P C
ore
s
Pessimistic
Realistic
Optimistic
Ideal Scaling
19 Scaling the Bandwidth Wall -- ISCA 2009
Dual – Data Sharing
0%
50%
100%
150%
200%
250%
300%
350%
400%
450%
500%
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Fraction of Shared Data
No
rma
lize
d T
raff
ic
16 Cores
32 Cores
64 Cores
128 Cores
Please see paper for details on modeling of sharing Data sharing unlikely to provide a scalable solution
20 Scaling the Bandwidth Wall -- ISCA 2009
Summary of Individual Techniques
0
20
40
60
80
100
120
140
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
32 64 128
256
IDEAL BASE CC DRAM 3D Fltr SmCo LC Sect SmCl CC/LC
Nu
mb
er o
f S
up
por
tab
le C
ores
Indirect Direct Dual
21 Scaling the Bandwidth Wall -- ISCA 2009
Summary of Combined Techniques
0
50
100
150
200
250
300
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
2x
4x
8x
16x
IDEAL BASE CC +DRAM+ 3D
CC/LC+
DRAM
CC +3D +Fltr
CC/LC+ Fltr
DRAM+ 3D +
LC
DRAM+ Fltr +
LC
DRAM+ LC +Sect
3D +Fltr +LC
SmCl+ LC
CC/LC+ SmCl
DRAM+ 3D +SmCl
CC/LC+
DRAM+ SmCl
CC/LC+ 3D +SmCl
CC/LC+
DRAM+ 3D
CC/LC+
DRAM+ 3D +SmCl
Nu
mb
er
of
Su
pp
ort
ab
le C
ore
s
22 Scaling the Bandwidth Wall -- ISCA 2009
Conclusions Contributions
Simple, powerful analytical CMP memory traffic model Quantify significance of memory BW wall problem
10% chip area for cores in 4 generations if constant traffic req. Guide design (cores vs. cache) of future CMPs
Given fixed chip area and BW scaling, how many cores? Evaluate memory traffic reduction techniques
Combinations can enable ideal scaling for several generations
Need bandwidth-efficient computing: Hardware/Architecture level: DRAM caches, cache/link
compression, prefetching, smarter memory controllers, etc. Technology level: 3D chips, optical interconnects, etc. Application level: working set reduction, locality
enhancement, data vs. pipelined parallelism, computation vs. communication, etc.