The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology...
Transcript of The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology...
![Page 1: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/1.jpg)
The Impact of Network Noise on Large-Scale
Communication Performance
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine
Open Systems LabIndiana UniversityBloomington, USA
Workshop on Large-Scale Parallel Processing/IPDPS’09
Rome, Italy
May, 29th 2009
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 2: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/2.jpg)
Motivation
operating system noise is a known phenomenon
local interruptions by daemons, interrupts, ...
not problematic (<2%) for serial applications
noise propagation is problematic
can lower application performance significantly
pure system issues, often “simple” to solve
2 3 41 2 3 41
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 3: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/3.jpg)
Motivation
effects in the network can cause similar behavior
⇒ network noise (net noise)
management, filesystem, other application, ... traffic
such congestion causes delays
delays optimized communication patterns (collectives)
propagation can lead to delays
applications interfering with themselves is not net noise!
2 3 41 2 3 41 2 3 41
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 4: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/4.jpg)
Our Approach
OS noise is modelled with statistical or signal processing
methods
network noise depends on:
topology and routingnetwork technology, buffer policies, sizes etc.
number of PEs per endpoint (multicore)
communication pattern of all endpoints
⇒ not as easy to model
approach: benchmark + simulation
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 5: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/5.jpg)
Target Architecture
complex topologies
network noise can easily be avoided in tori/hypercubes
(make sure all allocations are convex sets)
other topologies (fat tree, Kautz) are not as simplewe focus on fat trees
random application/application interaction
most common
also models random filesystem traffic
collective communication patterns (including stencil)
most common in HPC scenarios
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 6: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/6.jpg)
Benchmark Method
create two random communicators
with given rat io and warm them up
MPI_Bcast(..);
syncronize clocks on all ranks and
start next step synchronously
t
t
0 72 5
random pattern
3 4 6 1
syncronize clocks on all ranks and
start next step synchronously
3 5
MPI_Bcast(..);
6
MPI_Bcast(..);
0 2 5
7
random pattern
1 2 4 0
syncronize clocks on all ranks and
start next step synchronously
syncronize clocks on all ranks and
start next step synchronously
MPI_Bcast(..);
3 5 6
create two random communicators
with given rat io and warm them up
pert
nopert
tpert
tnopert
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 7: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/7.jpg)
Benchmark Results
2 4 6 8 10 14 18 22 26 30
01
00
02
00
03
00
04
00
0
Nodes in collective
Tim
e [
us]
no perturbation
with perturbation
boxplot, 32 nodes, MPI_Allreduce (single MPI_DOUBLE)
Open MPI 1.2.8, SDR/IB, 566 node fat tree, FBB
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 8: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/8.jpg)
Benchmark Results
0.2 0.4 0.6 0.8
10
01
50
20
02
50
Perturbation Ratio
Slo
wd
ow
n r
ela
tive
to
un
pe
rtu
rba
ted
ru
n [
%]
Broadcast with 208 nodes
Reduce with 492 nodes
Allreduce with 492 nodes
different collectives, 128 measurements, average plotted
very high variance (only 128 samples, background load)
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 9: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/9.jpg)
Benchmark Results
0 50 100 150 200 250
11
01
20
13
01
40
15
0
Application Communicator Size
Slo
wd
ow
n a
t R
atio
of
0.5
[in
%]
fixed perturbation ratio (0.5)
slowdown with increasing communicator size
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 10: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/10.jpg)
Simulation Methodology
needs to consider topology
and routing
use IB as a model
simple linear congestion
model
we model collective
operations as a set of
dependencies
(collective) level-wise
simulation
1
0
2
45 6
3
7
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 11: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/11.jpg)
Simulation Methodology
route every logical link through the network
record congestion on edges
4x4 4x4 4x4 4x4
4x4 4x4
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
4,8,
12 5,9,13 6,10,14 7,11,15 0,8,12 1,9,13 2,10,14 3,11,15 0,4,12 1,5,13 2,6,14 3,7,150,4,8 1,5,9 2,6,10 3,7,11
0,
1
2,
3
4,
5
6,
7
8,
9
10,
11
12,
13
14,
15
0,
1
2,
3
4,
5
6,
7
8,
9
10,
11
12,
13
14,
15
r0 r1r2r3
r4 r5r6 r7
1+1
> 4 > 1 1 > 1 5 > 1 2 > 1 4 > 1 > 8 > 2> 7 > 0 > 9 > 1 0
1
1
1
1
1+1
1+1
1+1
+ 1
+ 1
+ 1
+ 2
+ 1
+ 1
+ 1 + 1+ 1
+ 1
+ 1
1
0
2
45 6
3
7
2
1
1
2
1
21
annotate collective graph with maximum path congestion
longest path from any root node is reported
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 12: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/12.jpg)
Simulation Systems
real-world system inputs (IB network maps)
Odin @ IU (128 nodes, FBB fat tree)
CHiC @ TUC (566 nodes, FBB fat tree)
Atlas @ LLNL (1142 nodes, FBB fat tree)
Ranger @ TACC (3908 nodes, FBB fat tree)
TBird @ SNL (4391 nodes, 1/2 BB fat tree)
⇒ your system? Please give us the maps!
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 13: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/13.jpg)
Simulation Results
0.3 0.4 0.5 0.6 0.7
15
02
00
25
03
00
Perturbation Ratio
Slo
wd
ow
n r
ela
tive
to
un
pe
rtu
rba
ted
ru
n [
%]
Odin (128)
CHiC (566)
Atlas (1142)
Ranger (3908)
TBird (4391)
binomial tree pattern (small message Bcast, Reduce)
CHiC results reflect microbenchmark accurately!
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 14: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/14.jpg)
Real Large-Scale Simulation Results
simulated large exteded generalized fat trees (XGFT)
24 port crossbars, full bisection bandwidth
fat tree optimized routing (OpenSM)
144 nodes (one level) to 20,736 nodes (three levels)
above: 144 nodes, below: 1152 nodes
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 15: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/15.jpg)
Simulation Results
0 5000 10000 15000 20000
15
02
00
25
03
00
35
0
Network Size
Slo
wd
ow
n a
t ra
tio
of
0.5
[in
%]
perturbation ratio 0.5, tree pattern
logarithmic shape reflects CHiC benchmarks!
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 16: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/16.jpg)
Conclusions and Future Works
Conclusions
network noise must be considered
significant impact, similar to OS noise
no known real-world analyses yet
network topology and routing are very important
Future Work
good process-to-node mapping could reduce problems
topology-aware communication algorithms
extend analysis to real applications (profiling, tracing)
analyze several network topologies and workarounds
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf
![Page 17: The Impact of Network Noise on Large-Scale Communication ... · network noise depends on: topology and routing network technology, buffer policies, sizes etc. number of PEs per endpoint](https://reader035.fdocuments.in/reader035/viewer/2022081614/5fd01841251bfa1b6115ab02/html5/thumbnails/17.jpg)
Thanks for your attention!
Questions?
Download the (research-quality) ORCS simulator at:
http://www.unixer.de/ORCS
Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine The Impact of Network Noise on Large-Scale Communication Perf