Multistage Switches are not Crossbars: Effects of Static ...

19
Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks - A Case Study with InfiniBand - Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine Open Systems Lab Indiana University Bloomington, USA IEEE Cluster 2008 Tsukuba, Japan September, 29th 2008 Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Transcript of Multistage Switches are not Crossbars: Effects of Static ...

Page 1: Multistage Switches are not Crossbars: Effects of Static ...

Multistage Switches are not Crossbars: Effects

of Static Routing in High-Performance

Networks- A Case Study with InfiniBand -

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine

Open Systems LabIndiana UniversityBloomington, USA

IEEE Cluster 2008

Tsukuba, Japan

September, 29th 2008

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 2: Multistage Switches are not Crossbars: Effects of Static ...

Introduction

large-scale networks are common on HPC

huge variety of different technologies (IB, QSNet, Myrinet)

offering: offload, onload, OS bypass

we focus on topologies and routing!

Topologies

flat: Ring, Kautz, k-ary n-cubes (Torus, Hypercube)

MIN: Omega, Banyan, Clos, k-ary n-tree (Fat Tree)

Routing

oblivious: fully random, random paths, online, ...

adaptive: simple adaptive, probing adaptive, ...

⇒ focus on Fat Tree Topologies with oblivious routing!

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 3: Multistage Switches are not Crossbars: Effects of Static ...

Why Fat Trees?

Fat Tree networks seem to have several advantages:

simple construction rule

Clos networks are a special case

high bandwidth at large scale

well understood since the 60s (Telephone)

used by many switch vendors

can be built with full bisection bandwidth (FBB)

maps many (all?) patterns optimally

simple deadlock-free routing

... so it seems ...

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 4: Multistage Switches are not Crossbars: Effects of Static ...

What is Bisection Bandwidth?

Definition: For a general network with N endpoints,

represented as a graph with a bandwidth of one on every edge,

BB is defined as the minimum number of edges that have to be

removed in order to split the graphs into two equallysized

unconnected parts.

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 5: Multistage Switches are not Crossbars: Effects of Static ...

Clos Networks

see [Clos’53] for details

can be built blocking, rearrangable non-blocking, strictly

non-blocking

rearrangable non-blocking is most widely usedN

2 + N N × N crossbarsN

2 · N endpoints and connections (“cables”)

8x8 8x8 8x8 8x8 8x8 8x8 8x8 8x8

8x88x88x88x8

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 6: Multistage Switches are not Crossbars: Effects of Static ...

k-ary n-trees (Fat Trees)

see [Leiserson’90] for details

“generalisation” of Clos networks

much more flexible in size and bandwidth

similar principles

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 7: Multistage Switches are not Crossbars: Effects of Static ...

Oblivious Routing and InfiniBand

Oblivious Routing

static routing without considering the traffic demands

e.g., Ethernet, InfiniBand, IP, ...

adaptive routing has limits (fast changing patterns with

small packets)

InfiniBand

Subnet Manager (SM) discovers topology and computes

routes

crossbars have destination-based forwarding tables

24-port crossbars -> Clos network has 288 ports

recursive Clos up to 41472 ports

biggest chassis has 3456 ports (Fat Tree)

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 8: Multistage Switches are not Crossbars: Effects of Static ...

An FBB Network Pattern

1,5,9 (up)

2,6,10 (up)7,11,15 (up) 3,7,11 (up)

4,8,12 (up)

2,10,14 (up)

1,9,13 (up)

3,11,15 (up)

4,12,16 (up)

1 .. 4

....9 .. 12

8 portcrossbar

5,9,13 (up)2

2

12

6,10,14 (up)

8, 12, 16 (up)

1 (down)

2 (down)

6 (down)

5 (down)

(to 9+10)

13 (down)

14 (down)

3 (down)

4 (down)

7 (down)

8 (down)

15 (down)

16 (down)

5 .. 81

13 .. 16

1

2

1

2

1

1

(to 11+12)

crossbar8 port

8 portcrossbar

8 portcrossbar

8 portcrossbar

two distinct communications (1 to 6 and 4 to 14) in an FBB

network

⇒ no full bandwidth!

Bandwidth depends on traffic patterns, routing and topology!

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 9: Multistage Switches are not Crossbars: Effects of Static ...

What is the essence of Bisection Bandwidth?

Is it an upper bound to real bandwidth?

no, see example on last slide!

Is it a lower bound to real bandwidth?

no, see:

Is it the expected (average) bandwidth?

not easy to assess

simulate different traffic patterns!

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 10: Multistage Switches are not Crossbars: Effects of Static ...

The effective Bisection Bandwidth (eBB)

models real bandwidth as the average bandwidth of all

bisect patterns

constructing a bisect pattern:

divide network in two equal partitions A and B

find exactly one peer in B for each node in A(

PP

2

)

ways to partition P nodes

P

2 ! ways to pair P

2 nodes

→ huge number of patterns

many of them have full bandwidthno closed form yet, thus simulate

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 11: Multistage Switches are not Crossbars: Effects of Static ...

The Network Simulator

model physical network as a graph

construct random bisect patterns

simulate packet routing and record edge congestion

find maximum congestion c along each path

compute average bandwidth per path b = 1c

repeat simulation with many patterns

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 12: Multistage Switches are not Crossbars: Effects of Static ...

Simulating Real-World Systems

retrieved topology via ibnetdiscover and ibdiagnet

four large-scale InfiniBand systems queried:

Thunderbird at SNL - 4390 nodes

Atlas at LLNL - 1142 nodes

Ranger at TACC - 3908 nodes

CHiC at TUC - 566 nodes

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 13: Multistage Switches are not Crossbars: Effects of Static ...

Influence of Head of Line Blocking

0

1000

2000

3000

4000

5000

6000

7000

8000

0.001 0.01 0.1 1 10 100 1000 10000

Bandw

idth

(M

Bit/s

)

Datasize (kiB)

no congestioncongestion: 1congestion: 3congestion: 5congestion: 7

congestion: 10

0

5

10

15

20

25

30

35

40

0 1 2 3 4 5 6 7 8 9 10 11 0

1000

2000

3000

4000

5000

6000

7000

8000

1-b

yte

Late

ncy (

us)

Peak B

andw

idth

(M

bps)

Congestion Factor

LatencyBandwidth

communication between different pairs (bisect)

laid out to cause congestion

24 ports → max. congestion is 11

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 14: Multistage Switches are not Crossbars: Effects of Static ...

Simulation and Reality

compare 512 node CHiC run with 566 node simulation

random bisect patterns, bins of size 50 MiB/s

measured and simulated > 99.9% into only 4 bins!

0

1

2

3

4

5

6

7

Num

ber

of O

ccurr

ences (

x 1

00,0

00)

627.4 MiB/s281.2 MiB/s181.2 MiB/s133.6 MiB/s

measuredsimulated

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 15: Multistage Switches are not Crossbars: Effects of Static ...

Results on other Systems

Effective Bisection Bandwidth

Ranger (3908 nodes, FBB): 57.5%

Atlas (1142 nodes, FBB): 55.6%

Thunderbird (4390 nodes, 1/2 FBB): 40.6%

Other Effects of Congestion?

bandwidth varies with comm. pattern

not easy to predict/model

effects on latency are not trivial (buffering etc.)

leads to network skew (problem at large scale)

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 16: Multistage Switches are not Crossbars: Effects of Static ...

Influence on Applications?

Application Analysis

MPQC - MPI_Reduce, MPI_Bcast, MPI_Allreduce

MILC - Neighbor Exchange, MPI_Allreduce

POP - Neighbor Exchange, MPI_Allreduce

Octopus - Neighbor Exchange, MPI_Allreduce

Conclusions?

many applications use collective communication

nearest neighbor exchange is also collective

patterns can be scaled up!

simulate collective patterns: tree, dissemination, six

neighbor

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 17: Multistage Switches are not Crossbars: Effects of Static ...

Results on other Systems

Six Neighbor Bandwidth

Ranger: 62.4%

Atlas: 60.7%

Thunderbird: 37.4%

Tree Bandwidth

Ranger: 69.9%

Atlas: 71.3%

Thunderbird: 57.4%

Dissemination Bandwidth

Ranger: 41.9%

Atlas: 40.2%

Thunderbird: 27.4%

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 18: Multistage Switches are not Crossbars: Effects of Static ...

Conclusions and Future Work

Conclusions

bisection bandwidth does reflect practice well

effective bisection bandwidth is harder to assess but more

realistic

applications bandwidths suffer, even on FBB networks

Future Work

develop better oblivious routing for IB

analyze more systems and applications

look into adaptive routing options? or LMC??

look at other interconnects and topologies

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars

Page 19: Multistage Switches are not Crossbars: Effects of Static ...

Conclusions and Future Work

Conclusions

bisection bandwidth does reflect practice well

effective bisection bandwidth is harder to assess but more

realistic

applications bandwidths suffer, even on FBB networks

Future Work

develop better oblivious routing for IB

analyze more systems and applications

look into adaptive routing options? or LMC??

look at other interconnects and topologies

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine MINs are not Crossbars