The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology...

17
The Impact of Network Noise on Large-Scale Communication Performance Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine Open Systems Lab Indiana University Bloomington, USA Workshop on Large-Scale Parallel Processing/IPDPS’09 Rome, Italy May, 29th 2009 Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Per

Transcript of The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology...

Page 1: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

The Impact of Network Noise on Large-Scale

Communication Performance

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine

Open Systems LabIndiana UniversityBloomington, USA

Workshop on Large-Scale Parallel Processing/IPDPS’09

Rome, Italy

May, 29th 2009

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 2: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Motivation

operating system noise is a known phenomenon

local interruptions by daemons, interrupts, ...

not problematic (<2%) for serial applications

noise propagation is problematic

can lower application performance significantly

pure system issues, often “simple” to solve

2 3 41 2 3 41

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 3: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Motivation

effects in the network can cause similar behavior

⇒ network noise (net noise)

management, filesystem, other application, ... traffic

such congestion causes delays

delays optimized communication patterns (collectives)

propagation can lead to delays

applications interfering with themselves is not net noise!

2 3 41 2 3 41 2 3 41

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 4: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Our Approach

OS noise is modelled with statistical or signal processing

methods

network noise depends on:

topology and routingnetwork technology, buffer policies, sizes etc.

number of PEs per endpoint (multicore)

communication pattern of all endpoints

⇒ not as easy to model

approach: benchmark + simulation

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 5: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Target Architecture

complex topologies

network noise can easily be avoided in tori/hypercubes

(make sure all allocations are convex sets)

other topologies (fat tree, Kautz) are not as simplewe focus on fat trees

random application/application interaction

most common

also models random filesystem traffic

collective communication patterns (including stencil)

most common in HPC scenarios

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 6: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Benchmark Method

create two random communicators

with given rat io and warm them up

MPI_Bcast(..);

syncronize clocks on all ranks and

start next step synchronously

t

t

0 72 5

random pattern

3 4 6 1

syncronize clocks on all ranks and

start next step synchronously

3 5

MPI_Bcast(..);

6

MPI_Bcast(..);

0 2 5

7

random pattern

1 2 4 0

syncronize clocks on all ranks and

start next step synchronously

syncronize clocks on all ranks and

start next step synchronously

MPI_Bcast(..);

3 5 6

create two random communicators

with given rat io and warm them up

pert

nopert

tpert

tnopert

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 7: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Benchmark Results

2 4 6 8 10 14 18 22 26 30

01

00

02

00

03

00

04

00

0

Nodes in collective

Tim

e [

us]

no perturbation

with perturbation

boxplot, 32 nodes, MPI_Allreduce (single MPI_DOUBLE)

Open MPI 1.2.8, SDR/IB, 566 node fat tree, FBB

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 8: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Benchmark Results

0.2 0.4 0.6 0.8

10

01

50

20

02

50

Perturbation Ratio

Slo

wd

ow

n r

ela

tive

to

un

pe

rtu

rba

ted

ru

n [

%]

Broadcast with 208 nodes

Reduce with 492 nodes

Allreduce with 492 nodes

different collectives, 128 measurements, average plotted

very high variance (only 128 samples, background load)

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 9: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Benchmark Results

0 50 100 150 200 250

11

01

20

13

01

40

15

0

Application Communicator Size

Slo

wd

ow

n a

t R

atio

of

0.5

[in

%]

fixed perturbation ratio (0.5)

slowdown with increasing communicator size

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 10: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Simulation Methodology

needs to consider topology

and routing

use IB as a model

simple linear congestion

model

we model collective

operations as a set of

dependencies

(collective) level-wise

simulation

1

0

2

45 6

3

7

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 11: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Simulation Methodology

route every logical link through the network

record congestion on edges

4x4 4x4 4x4 4x4

4x4 4x4

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

4,8,

12 5,9,13 6,10,14 7,11,15 0,8,12 1,9,13 2,10,14 3,11,15 0,4,12 1,5,13 2,6,14 3,7,150,4,8 1,5,9 2,6,10 3,7,11

0,

1

2,

3

4,

5

6,

7

8,

9

10,

11

12,

13

14,

15

0,

1

2,

3

4,

5

6,

7

8,

9

10,

11

12,

13

14,

15

r0 r1r2r3

r4 r5r6 r7

1+1

> 4 > 1 1 > 1 5 > 1 2 > 1 4 > 1 > 8 > 2> 7 > 0 > 9 > 1 0

1

1

1

1

1+1

1+1

1+1

+ 1

+ 1

+ 1

+ 2

+ 1

+ 1

+ 1 + 1+ 1

+ 1

+ 1

1

0

2

45 6

3

7

2

1

1

2

1

21

annotate collective graph with maximum path congestion

longest path from any root node is reported

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 12: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Simulation Systems

real-world system inputs (IB network maps)

Odin @ IU (128 nodes, FBB fat tree)

CHiC @ TUC (566 nodes, FBB fat tree)

Atlas @ LLNL (1142 nodes, FBB fat tree)

Ranger @ TACC (3908 nodes, FBB fat tree)

TBird @ SNL (4391 nodes, 1/2 BB fat tree)

⇒ your system? Please give us the maps!

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 13: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Simulation Results

0.3 0.4 0.5 0.6 0.7

15

02

00

25

03

00

Perturbation Ratio

Slo

wd

ow

n r

ela

tive

to

un

pe

rtu

rba

ted

ru

n [

%]

Odin (128)

CHiC (566)

Atlas (1142)

Ranger (3908)

TBird (4391)

binomial tree pattern (small message Bcast, Reduce)

CHiC results reflect microbenchmark accurately!

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 14: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Real Large-Scale Simulation Results

simulated large exteded generalized fat trees (XGFT)

24 port crossbars, full bisection bandwidth

fat tree optimized routing (OpenSM)

144 nodes (one level) to 20,736 nodes (three levels)

above: 144 nodes, below: 1152 nodes

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 15: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Simulation Results

0 5000 10000 15000 20000

15

02

00

25

03

00

35

0

Network Size

Slo

wd

ow

n a

t ra

tio

of

0.5

[in

%]

perturbation ratio 0.5, tree pattern

logarithmic shape reflects CHiC benchmarks!

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 16: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Conclusions and Future Works

Conclusions

network noise must be considered

significant impact, similar to OS noise

no known real-world analyses yet

network topology and routing are very important

Future Work

good process-to-node mapping could reduce problems

topology-aware communication algorithms

extend analysis to real applications (profiling, tracing)

analyze several network topologies and workarounds

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf

Page 17: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint

Thanks for your attention!

Questions?

Download the (research-quality) ORCS simulator at:

http://www.unixer.de/ORCS

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf