Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies
Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski,
David Blaauw, and Trevor Mudge
University of Michigan, Ann Arbor
HPCA 19
February 27, 2013

Many-Core Trend
  Thousand-core chips are in our future
  A scalable on-chip interconnect is required
[Figure: number of cores on a chip (0 to 120) versus year (2000 to 2011), marking TILE64, Intel SCC, and TILE Gx100; interconnect labels: Crossbar or Ring, Mesh]

Outline
  Motivation
  Symmetric Low-Radix and High-Radix Designs
  Asymmetric High-Radix Designs
    Super-Star
    Super-StarX
  Results
  Conclusion

Mesh Topology
  Popular in tile-based many-core processors
  Low complexity
  Planar 2D layout properties
[Figure: Tilera's TILE64 64-core processor]
  Can Mesh topology scale to 100s of cores?

High-Radix Topologies
  Alternative to low-radix topologies
  Concentration
[Figure: router (R) grids illustrating concentration; labels: Tile, 6 tile x 6 tile]
  Fewer hops improve latency, but links become bottlenecks

High-Radix Topologies
  Improve throughput
  Additional connectivity: parallel links, express links
[Figure: grid of routers (R); label: High-Radix Router]

High-Radix Switch: Swizzle-Switch
Traditional Matrix-Style Crossbar
  Separate crossbar & arbiter
  Not scalable as radix increases:
    Routing to/from arbiter becomes more challenging
    Arbitration logic grows more complex
Swizzle-Switch*
  Combines routing-dominated arbiter with logic-dominated crossbar
  SRAM-like technology
  Scales to radix-64 in 32nm @ 1.5GHz
*VLSIC 2011, ISSCC 2012, DAC 2012, JETCAS 2012, HotChips 2012

High-Radix Topologies
[Figure: delay versus hop count for local and global communication, shown for a low-radix router and a high-radix router; legend: Conventional Router Delay, Swizzle-Switch Router Delay]
  Symmetric high-radix topologies trade off efficiency of local communication to achieve faster global communication

Outline
  Motivation
  Symmetric Low-Radix and High-Radix Designs
  Asymmetric High-Radix Designs
    Super-Star
    Super-StarX
  Results
  Conclusion

Asymmetric High-Radix Topologies
[Figure: local routers (LR) connected to a global router (GR); LR = Local Router (fast, low-radix), GR = Global Router (slow, high-radix)]
  Low-radix topologies optimize local communication
  High-radix topologies optimize global communication
  Asymmetric high-radix topologies merge the best features of both low-radix and high-radix topologies

Asymmetric High-Radix Topologies
  Decouple local and global communication
  Match router speed to wire speed
  Local communication: short wires, fast low-radix routers
  Global communication: long wires, slow high-radix routers, reduced hop count

Super-Star
  Each local router connects a cluster of tiles
  Each global router connects to all local routers
[Figure: local routers (LR), each linked to a global router (GR)]
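The hop-count argument can be made concrete with a minimal sketch that counts routers traversed per message under uniform random traffic. It assumes a 24 x 24 mesh with minimal Manhattan paths, and a Super-Star of 36 clusters in which every inter-cluster message crosses source local router, one global router, and destination local router. These routing assumptions and the function names are illustrative and are not taken from the slides.

```python
from itertools import product

TILES = 576      # total tiles, from the evaluation setup
MESH_SIDE = 24   # assumption: square mesh, since 24 * 24 = 576
CLUSTERS = 36    # number of local routers, from the router table


def mesh_avg_routers(side):
    """Average routers traversed between two distinct tiles on a
    side x side mesh, assuming minimal (Manhattan) paths."""
    coords = list(product(range(side), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1) in coords:
        for (x2, y2) in coords:
            if (x1, y1) == (x2, y2):
                continue
            # Link hops plus one gives the number of routers on the path.
            total += abs(x1 - x2) + abs(y1 - y2) + 1
            pairs += 1
    return total / pairs


def superstar_avg_routers(tiles, clusters):
    """Average routers traversed between two distinct tiles in Super-Star,
    assuming 1 router inside a cluster and 3 (LR, GR, LR) across clusters."""
    per_cluster = tiles // clusters
    same_cluster_peers = per_cluster - 1
    other_cluster_peers = tiles - per_cluster
    return (same_cluster_peers * 1 + other_cluster_peers * 3) / (tiles - 1)


if __name__ == "__main__":
    print("mesh:", mesh_avg_routers(MESH_SIDE))
    print("super-star:", superstar_avg_routers(TILES, CLUSTERS))
```

Under these assumptions the mesh averages 17 routers per message and Super-Star just under 3. Hop count alone does not determine latency, since the global routers are slower and their links are longer.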

Super-StarX
  Inter-cluster links further reduce local communication latency
  Locality-aware routing policy
    Low load: inter-cluster links
    High load: inter-cluster links + global router
[Figure: local routers (LR) joined by inter-cluster links and connected to a global router (GR)]
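One way to read the locality-aware policy is sketched below. The slides only state that inter-cluster links carry neighbor traffic at low load and that the global router is used as well at high load. The grid adjacency rule, the queue-depth congestion signal, and the `CONGESTION_THRESHOLD` value are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical congestion threshold, in flits queued on an inter-cluster link.
CONGESTION_THRESHOLD = 4


@dataclass(frozen=True)
class Cluster:
    """Position of a local router, assuming clusters sit on a 2D grid."""
    row: int
    col: int


def are_neighbors(a, b):
    # Assumption: an inter-cluster link joins grid-adjacent local routers.
    return abs(a.row - b.row) + abs(a.col - b.col) == 1


def choose_route(src, dst, queued_flits):
    """Pick a path class for one message.

    queued_flits is the occupancy of the inter-cluster link toward dst;
    it is only meaningful when src and dst are neighbors.
    """
    if src == dst:
        return "local-router"
    if are_neighbors(src, dst) and queued_flits < CONGESTION_THRESHOLD:
        # Low load: stay on the short inter-cluster link.
        return "inter-cluster-link"
    # Non-neighbor traffic, or neighbor traffic spilling over at high load.
    return "global-router"
```

For example, `choose_route(Cluster(0, 0), Cluster(0, 1), queued_flits=0)` selects the inter-cluster link, while the same pair with a full queue falls back to the global router.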

Super-StarX
  Multiple global routers
  Higher throughput, energy proportionality
[Figure: local routers (LR) connected to four global routers (GR)]

Super-StarX Layout
[Figure: chip layout with a 6 x 6 grid of local routers (LR), global routers (GR), and inter-cluster links; each LR serves a 4 tile x 4 tile cluster; 576 tiles in total; dimension labels: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm]

Evaluation
  576 tiles
  Synthetic uniform random traffic, 4-flit messages
  128-bit Swizzle-Switch in 15nm
  4 VCs/port, buffer depth 5 flits/VC
  Power & delay from SPICE modeling in 32nm, scaled to 15nm

Router Information & Link Dimensions
(Topologies with a single router type list one value per field.)

Topology     # Routers        Radix            Network   Avg. Link Length (mm)
             Local   Global   Local   Global   Area      Local   Global
mesh         576              5                38.19     0.79
cmesh-low    144              8                13.18     1.28
cmesh-high   16               52               15.20     3.25
fbfly        16               42               10.82     3.56
superstar    36      8        24      36       18.24     1.80    12.90
superstarX   36      8        28      36       21.45     2.11    11.30
superring    36      4        17      11       7.12      1.80    6.48
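The radix column can be cross-checked against the tile and router counts with a short sketch. Router counts and radix values come from the table; the split of each radix into tile ports and network ports is an inferred reading that happens to match the numbers, not something the slides state. Examples of that inference: 16 tiles + 8 global routers for a Super-Star local router, and 4 extra inter-cluster ports for Super-StarX.

```python
TILES = 576
FLIT_BITS = 128        # 128-bit Swizzle-Switch channels
VCS_PER_PORT = 4
FLITS_PER_VC = 5
FLITS_PER_MESSAGE = 4

# name: (router count, radix from the table, assumed network ports)
# The network-port figures are assumptions chosen to explain the table.
TILE_ROUTERS = {
    "mesh":             (576, 5, 4),    # 4 mesh neighbors
    "cmesh-low":        (144, 8, 4),    # 4 mesh neighbors
    "cmesh-high":       (16, 52, 16),   # assumed parallel/express links
    "fbfly":            (16, 42, 6),    # 3 row peers + 3 column peers
    "superstar local":  (36, 24, 8),    # 8 global routers
    "superstarX local": (36, 28, 12),   # 8 global routers + 4 neighbors
    "superring local":  (36, 17, 1),    # 1 global router
}

# Global routers attach to no tiles: name -> (radix from table, assumed ports)
GLOBAL_ROUTERS = {
    "superstar global": (36, 36),       # all 36 local routers
    "superring global": (11, 9 + 2),    # 9 local routers + 2 ring neighbors
}


def radix_mismatches():
    """Return the names whose modeled radix disagrees with the table."""
    bad = []
    for name, (count, radix, net_ports) in TILE_ROUTERS.items():
        tile_ports = TILES // count  # tiles concentrated on each router
        if tile_ports + net_ports != radix:
            bad.append(name)
    for name, (radix, net_ports) in GLOBAL_ROUTERS.items():
        if net_ports != radix:
            bad.append(name)
    return bad


def buffer_bits_per_port():
    """Input buffering per port implied by the evaluation parameters."""
    return VCS_PER_PORT * FLITS_PER_VC * FLIT_BITS


MESSAGE_BITS = FLITS_PER_MESSAGE * FLIT_BITS

if __name__ == "__main__":
    print("mismatches:", radix_mismatches())
    print("buffer bits per port:", buffer_bits_per_port())
    print("message bits:", MESSAGE_BITS)
```

With this breakdown every row is consistent, each port buffers 2560 bits, and a 4-flit message carries 512 bits.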

Results: Latency
  Compared with Mesh topology, Super-Star topologies have 39% more throughput, 45% reduction in latency
[Figure: average network latency per packet (ns, 0 to 20) versus injection rate (packets/ns/node, 0.00 to 0.20) for Low-Radix Mesh, Symmetric High-Radix (Fbfly), and Asymmetric High-Radix (Super-StarX)]

Results: Power
  Compared with Mesh topology, Super-Star topologies have 40% less power. At 30W, 3x more throughput
[Figure: network power (Watts, 0 to 100) versus throughput (packets/ns/node, 0.00 to 0.20) for Low-Radix Mesh, Symmetric High-Radix (Fbfly), and Asymmetric High-Radix (Super-StarX); annotations: 3x, 2.3x, High Perf., Low Power]

Results: Energy Proportionality
  Available throughput can be tuned using global routers
  A single global router can provide full network connectivity
[Figure: network power (Watts, 0 to 100) versus throughput (packets/ns/node, 0.00 to 0.20); series: Mesh, GR 1, GR 2, GR 4, GR 8]

Results: Localized Traffic
  Nearest neighbor traffic between LRs
  Maximum one hop
[Figure: average network latency per packet (ns, 0 to 10) versus injection rate (packets/ns/node, 0.00 to 0.25) for Super-Star and Super-StarX; annotations: Inter-Cluster Links; Inter-Cluster Links + Global Routers]

Results: Applications
Processor Configuration
  576 nodes: 552 cores + 24 memory controllers (1 GHz frequency)
  Private L1 cache; shared, distributed L2 cache
Workloads
  4 workloads of 12 SPEC CPU 2006 benchmarks each
  1 workload of 8 SPLASH-2 benchmarks
Metrics
  Performance (execution time in cycles)
  Power
Results: Super-StarX
  Average over Mesh: 17% performance improvement, 39% less power
  Average over Fbfly: 32% performance improvement, 5% worse power

Conclusion
  Goal: a scalable on-chip network topology for kilo-core chips
  Made feasible by Swizzle-Switches
  Asymmetric high-radix topologies: Super-Star and Super-StarX
    Fast low-radix local routers, slow high-radix global routers
    Multiple global routers for higher throughput and energy proportionality
  Results: Super-StarX
    Average latency: 45% reduction over Mesh
    Power: 40% less than Mesh
    Throughput @ 30W TDP: 3x Mesh, 2.3x Fbfly
Thank You!
BACKUP SLIDES

High-Radix Switch: Swizzle-Switch
  Radix-64
  128-bit channels
  32nm
  1.5 GHz
  2W of power
  ~2mm² of area

Super-Star Layout
[Figure: chip layout with a 6 x 6 grid of local routers (LR) and global routers (GR); each LR serves a 4 tile x 4 tile cluster; 576 tiles in total; dimension labels: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm]

Super-Ring (Anti-design)
  Medium-radix local and global routers
  Limited connectivity hinders scalability
[Figure: four groups of nine local routers (LR), each group attached to its own global router (GR)]