Page 1: Scaling Towards Kilo-Core Processors with Asymmetric High-Radix Topologies

Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge

University of Michigan, Ann Arbor

HPCA 19

February 27, 2013

Page 2: Many-Core Trend

Thousand-core chips are in our future
A scalable on-chip interconnect is required

[Figure: number of cores on a chip (0 to 120) versus year (2000 to 2011). Interconnect labels: Crossbar or Ring, Mesh. Labeled chips: TILE64, Intel SCC, TILE Gx100.]

Page 3: Outline

Motivation
Symmetric Low-Radix and High-Radix Designs
Asymmetric High-Radix Designs
  Super-Star
  Super-StarX
Results
Conclusion

Page 4: Mesh Topology

Popular in tile-based many-core processors
Low complexity
Planar 2D layout properties

[Figure: Tilera's TILE64 64-core processor]

Can Mesh topology scale to 100s of cores?

Page 5: High-Radix Topologies

Alternative to low-radix topologies
Concentration

[Figure: grids of routers (R) and tiles, before and after concentration. Dimension labels: 6 tile by 6 tile.]

Fewer hops improve latency, but links become bottlenecks

Page 6: High-Radix Topologies

Improve throughput
Additional connectivity
  Parallel links
  Express links

[Figure: grid of routers (R). Label: High-Radix Router.]

Page 7: High-Radix Switch: Swizzle-Switch

Traditional Matrix-Style Crossbar
  Separate crossbar & arbiter
  Not scalable as radix increases:
    Routing to/from arbiter becomes more challenging
    Arbitration logic grows more complex

Swizzle-Switch*
  Combines routing-dominated arbiter with logic-dominated crossbar
  SRAM-like technology
  Scales to radix-64 in 32nm @ 1.5GHz

*VLSIC 2011, ISSCC 2012, DAC 2012, JETCAS 2012, HotChips 2012

Page 8: High-Radix Topologies

[Figure: delay versus hop count. Labels: Low-Radix Router, High-Radix Router, Local Communication, Global Communication, Conventional Router Delay, Swizzle-Switch Router Delay.]

Symmetric high-radix topologies trade off the efficiency of local communication for faster global communication
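
The trade-off can be sketched with a simple zero-load model in which path delay is the hop count multiplied by the per-hop router and link delay. Every number below is a made-up placeholder chosen only to show the shape of the trade-off; none of them comes from the slides, and `path_delay_ns` is a hypothetical helper.

```python
def path_delay_ns(hops: int, router_delay_ns: float, link_delay_ns: float) -> float:
    """Zero-load delay of a path: each hop pays one router and one link traversal."""
    return hops * (router_delay_ns + link_delay_ns)


# Placeholder per-hop delays (assumptions, not measured values).
LOW_RADIX = {"router_delay_ns": 0.5, "link_delay_ns": 0.2}   # fast router, short link
HIGH_RADIX = {"router_delay_ns": 1.5, "link_delay_ns": 1.0}  # slow router, long link

# Placeholder hop counts: (low-radix hops, high-radix hops).
HOPS = {"local": (2, 1), "global": (20, 2)}

delays = {
    kind: {
        "low_radix": path_delay_ns(low_hops, **LOW_RADIX),
        "high_radix": path_delay_ns(high_hops, **HIGH_RADIX),
    }
    for kind, (low_hops, high_hops) in HOPS.items()
}

for kind, d in delays.items():
    print(f"{kind}: low-radix {d['low_radix']:.1f} ns, high-radix {d['high_radix']:.1f} ns")
```

With these placeholders the low-radix path is faster for local traffic and the high-radix path is faster for global traffic, which is the asymmetry the slide describes.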

Page 9: Outline

Motivation
Symmetric Low-Radix and High-Radix Designs
Asymmetric High-Radix Designs
  Super-Star
  Super-StarX
Results
Conclusion

Page 10: Asymmetric High-Radix Topologies

Low-radix topologies optimize local communication
High-radix topologies optimize global communication
Asymmetric high-radix topologies merge the best features of both low-radix and high-radix topologies

[Figure: local routers (LR) around a global router (GR). LR = Local Router (fast, low-radix), GR = Global Router (slow, high-radix).]

Page 11: Asymmetric High-Radix Topologies

Decouple local and global communication
Match router speed to wire speed

Local communication
  Short wires
  Fast low-radix routers

Global communication
  Long wires
  Slow high-radix routers
  Reduced hop count

Page 12: Super-Star

Each local router connects a cluster of tiles
Each global router connects to all local routers

[Figure: local routers (LR) linked to a global router (GR)]
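
A minimal sketch of why this arrangement cuts hop count: it builds the router graph described above and measures the worst-case number of router-to-router links with breadth-first search. The count of 36 local routers is taken from the evaluation table; laying out the 576-router mesh baseline as a 24 x 24 grid is an assumption, and `super_star`, `mesh`, and `diameter` are hypothetical helper names.

```python
from collections import deque


def super_star(num_lr: int = 36, num_gr: int = 1) -> dict:
    """Router graph in which every global router links to every local router."""
    adj = {("LR", i): [("GR", g) for g in range(num_gr)] for i in range(num_lr)}
    adj.update({("GR", g): [("LR", i) for i in range(num_lr)] for g in range(num_gr)})
    return adj


def mesh(side: int = 24) -> dict:
    """2D mesh of side x side routers (square arrangement is an assumption)."""
    adj = {}
    for r in range(side):
        for c in range(side):
            adj[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < side and 0 <= c + dc < side
            ]
    return adj


def diameter(adj: dict) -> int:
    """Worst-case shortest-path length, in links, over all router pairs."""
    worst = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(dist.values()))
    return worst


star_diameter = diameter(super_star(36, 1))
mesh_diameter = diameter(mesh(24))
print(f"Super-Star: {star_diameter} links, mesh: {mesh_diameter} links")
```

Under these assumptions any two local routers are two links apart (LR to GR to LR), while the corner-to-corner path in the 24 x 24 mesh takes 46 links.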

Page 13: Super-StarX

Inter-cluster links further reduce local communication latency
Locality-aware routing policy
  Low load: inter-cluster links
  High load: inter-cluster links + global router

[Figure: local routers (LR) joined by inter-cluster links, with a global router (GR)]
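
One way to read the locality-aware policy is as a per-packet choice between three paths. The sketch below is an illustration under stated assumptions, not the routing algorithm from the talk: it assumes the 36 local routers sit in a 6 x 6 grid, that inter-cluster links join grid-adjacent clusters only, and that "high load" can be reduced to a single `link_busy` flag. `Path`, `are_neighbors`, and `choose_path` are hypothetical names.

```python
from enum import Enum
from typing import Tuple

GRID = 6  # assumed: 36 local routers arranged as a 6 x 6 grid


class Path(Enum):
    LOCAL = "local router only"
    INTER_CLUSTER = "inter-cluster link"
    GLOBAL = "global router"


def lr_coords(lr: int) -> Tuple[int, int]:
    """Row and column of a local router numbered row by row."""
    return divmod(lr, GRID)


def are_neighbors(a: int, b: int) -> bool:
    """True when two local routers are adjacent in the assumed grid."""
    (row_a, col_a), (row_b, col_b) = lr_coords(a), lr_coords(b)
    return abs(row_a - row_b) + abs(col_a - col_b) == 1


def choose_path(src_lr: int, dst_lr: int, link_busy: bool) -> Path:
    """Pick a path for a packet travelling between two clusters."""
    if src_lr == dst_lr:
        return Path.LOCAL  # source and destination share a local router
    if are_neighbors(src_lr, dst_lr) and not link_busy:
        return Path.INTER_CLUSTER  # low load: stay on the short link
    return Path.GLOBAL  # distant cluster, or neighbor traffic spilling over


print(choose_path(0, 1, link_busy=False).value)
print(choose_path(0, 1, link_busy=True).value)
```

Sending neighbor traffic through the global router only when the direct link is busy keeps the short, fast path for the common case while still using the extra global bandwidth under load.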

Page 14: Super-StarX

Multiple global routers
Higher throughput, energy proportionality

[Figure: local routers (LR) linked to four global routers (GR)]

Page 15: Super-StarX Layout

[Figure: chip layout with a 6 x 6 grid of local routers (LR), each serving a 4 tile by 4 tile cluster, four global routers (GR), and inter-cluster links. Dimension labels: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm.]

576 tiles in total
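
The tile count follows from the layout labels. The short calculation below assumes that the 3.6mm label is the edge of one 4 tile by 4 tile cluster and that clusters form a 6 x 6 grid; both are readings of the figure, not statements made on the slide.

```python
import math

CLUSTER_EDGE_MM = 3.6        # assumed: edge of one cluster, from the figure labels
TILES_PER_CLUSTER_EDGE = 4   # "4 tile" label on each side of a cluster
CLUSTERS_PER_CHIP_EDGE = 6   # six rows of six local routers in the figure

tile_edge_mm = CLUSTER_EDGE_MM / TILES_PER_CLUSTER_EDGE
chip_edge_mm = CLUSTER_EDGE_MM * CLUSTERS_PER_CHIP_EDGE
tiles_per_cluster = TILES_PER_CLUSTER_EDGE ** 2
local_routers = CLUSTERS_PER_CHIP_EDGE ** 2
total_tiles = tiles_per_cluster * local_routers

# The derived chip edge should agree with the 21.6mm label in the figure.
assert math.isclose(chip_edge_mm, 21.6)

print(f"tile edge: {tile_edge_mm:.1f} mm, chip edge: {chip_edge_mm:.1f} mm")
print(f"{local_routers} local routers x {tiles_per_cluster} tiles = {total_tiles} tiles")
```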

Page 16: Evaluation

576 tiles
Synthetic uniform random traffic, 4-flit messages
128-bit Swizzle-Switch in 15nm
4 VCs/port, buffer depth 5 flits/VC
Power & delay from SPICE modeling in 32nm, scaled to 15nm

Router Information & Link Dimensions

Topology     # Routers        Radix            Network Area   Avg. Link Length (mm)
             Local   Global   Local   Global                  Local   Global
mesh         576     -        5       -        38.19          0.79    -
cmesh-low    144     -        8       -        13.18          1.28    -
cmesh-high   16      -        52      -        15.20          3.25    -
fbfly        16      -        42      -        10.82          3.56    -
superstar    36      8        24      36       18.24          1.80    12.90
superstarX   36      8        28      36       21.45          2.11    11.30
superring    36      4        17      11       7.12           1.80    6.48
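
The radix values in the table can be cross-checked by counting ports per router. The port breakdown below is inferred, not stated on the slide: it assumes one port per attached tile, four mesh neighbors, a 4 x 4 flattened butterfly (three row plus three column links), one local-router port per global router, four inter-cluster links per Super-StarX local router, and two ring neighbors per Super-Ring global router. cmesh-high is left out because the split of its 52 ports is not evident from the slides.

```python
TILES = 576
LOCAL_ROUTERS = 36
GLOBAL_ROUTERS = 8                      # superstar and superstarX
TILES_PER_LR = TILES // LOCAL_ROUTERS   # tiles attached to each local router


def radix(endpoint_ports: int, network_ports: int) -> int:
    """Radix = ports to attached tiles or local routers + ports to peer routers."""
    return endpoint_ports + network_ports


# Inferred port breakdown (assumptions, see the paragraph above).
computed = {
    "mesh": radix(TILES // 576, 4),
    "cmesh-low": radix(TILES // 144, 4),
    "fbfly": radix(TILES // 16, 3 + 3),
    "superstar local": radix(TILES_PER_LR, GLOBAL_ROUTERS),
    "superstarX local": radix(TILES_PER_LR, GLOBAL_ROUTERS + 4),
    "superstar global": radix(LOCAL_ROUTERS, 0),
    "superring local": radix(TILES_PER_LR, 1),
    "superring global": radix(LOCAL_ROUTERS // 4, 2),
}

# Radix values copied from the table.
table = {
    "mesh": 5,
    "cmesh-low": 8,
    "fbfly": 42,
    "superstar local": 24,
    "superstarX local": 28,
    "superstar global": 36,
    "superring local": 17,
    "superring global": 11,
}

mismatches = {k: (computed[k], table[k]) for k in table if computed[k] != table[k]}
print(mismatches or "all computed radices match the table")
```

That the inferred breakdown reproduces every listed radix suggests it is a reasonable reading of the table, though the slides do not confirm it.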

Page 17: Results: Latency

Compared with Mesh topology, Super-Star topologies have
  39% more throughput, 45% reduction in latency

[Figure: average network latency per packet (ns, 0 to 20) versus injection rate (packets/ns/node, 0.00 to 0.20). Series: Low-Radix Mesh, Symmetric High-Radix (Fbfly), Asymmetric High-Radix (Super-StarX).]

Page 18: Results: Power

Compared with Mesh topology, Super-Star topologies have
  40% less power. At 30W, 3x more throughput

[Figure: network power (Watts, 0 to 100) versus throughput (packets/ns/node, 0.00 to 0.20). Series: Low-Radix Mesh, Symmetric High-Radix (Fbfly), Asymmetric High-Radix (Super-StarX). Annotations: 3x, 2.3x, High Perf., Low Power.]

Page 19: Results: Energy Proportionality

Available throughput can be tuned using global routers
A single global router can provide full network connectivity

[Figure: network power (Watts, 0 to 100) versus throughput (packets/ns/node, 0.00 to 0.20). Series: Mesh, GR 1, GR 2, GR 4, GR 8.]
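
A hedged sketch of how such tuning might be expressed: pick the smallest global-router configuration whose aggregate capacity covers the demand. The configurations 1, 2, 4, and 8 come from the figure legend; the assumption that throughput scales linearly with the number of active global routers, the `capacity_per_gr` parameter, and the function name are all hypothetical.

```python
GR_CONFIGS = (1, 2, 4, 8)  # configurations shown in the figure legend


def active_global_routers(demand: float, capacity_per_gr: float) -> int:
    """Smallest configuration whose assumed linear capacity covers the demand.

    At least one global router always stays on, since a single one already
    provides full connectivity. Demand beyond the largest configuration is
    served by all global routers.
    """
    if capacity_per_gr <= 0:
        raise ValueError("capacity_per_gr must be positive")
    for count in GR_CONFIGS:
        if count * capacity_per_gr >= demand:
            return count
    return GR_CONFIGS[-1]


# Placeholder units: demand and capacity only need to share a unit.
for demand in (0.5, 1.5, 3.0, 5.0):
    print(demand, active_global_routers(demand, capacity_per_gr=1.0))
```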

Page 20: Results: Localized Traffic

Nearest neighbor traffic between LRs
Maximum one hop

[Figure: average network latency per packet (ns, 0 to 10) versus injection rate (packets/ns/node, 0.00 to 0.25). Series: Super-Star, Super-StarX. Annotations: Inter-Cluster Links; Inter-Cluster Links + Global Routers.]

Page 21: Results: Applications

Processor Configuration
  576 nodes: 552 cores + 24 memory controllers (1 GHz frequency)
  Private L1 cache; shared, distributed L2 cache

Workloads
  4 workloads: 12 SPEC CPU2006 benchmarks each
  1 workload: 8 SPLASH-2 benchmarks

Metrics
  Performance (execution time in cycles)
  Power

Results: Super-StarX
  Average over Mesh: 17% performance improvement, 39% less power
  Average over Fbfly: 32% performance improvement, 5% worse power

Page 22: Conclusion

Goal: a scalable on-chip network topology for kilo-core chips
  Made feasible by Swizzle-Switches

Asymmetric high-radix topologies: Super-Star and Super-StarX
  Fast low-radix local routers, slow high-radix global routers
  Multiple global routers for higher throughput and energy proportionality

Results: Super-StarX
  Average latency: 45% reduction over Mesh
  Power: 40% less than Mesh
  Throughput @ 30W TDP: 3x Mesh, 2.3x Fbfly

Page 23: Thank You!

Page 24: BACKUP SLIDES

Page 25: High-Radix Switch: Swizzle-Switch

Radix-64
128-bit channels
32nm
1.5 GHz
2W of power
~2 mm^2 of area

Page 26: Super-Star Layout

[Figure: chip layout with a 6 x 6 grid of local routers (LR), each serving a 4 tile by 4 tile cluster, and four global routers (GR). Dimension labels: 3.6mm, 7.2mm, 10.8mm, 14.4mm, 18mm, 21.6mm, 25.2mm.]

576 tiles in total

Page 27: Super-Ring (Anti-design)

Medium-radix local and global routers
Limited connectivity hinders scalability

[Figure: four groups of nine local routers (LR), each group attached to its own global router (GR)]