Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees
Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)
Grid 2005 (Nov. 13, 2005)
Message Passing in WANs
- Increase in bandwidth of Wide-Area Networks
  - More opportunities to perform parallel computation using multiple clusters connected by a WAN
- Demands to use message passing for parallel computation in WANs
  - Existing programs written using message passing
  - Familiar programming model
Collective Operations
- Communication operations in which all processors participate
  - E.g., broadcast and reduction
- Importance (Gorlatch et al. 2004)
  - Easier to program using collective operations than with just point-to-point operations (i.e., send/receive)
  - Libraries can provide a faster implementation than with user-level send/recv
  - Libraries can take advantage of knowledge of the underlying hardware
[Figure: broadcast from the root]
Coll. Ops. in LANs and WANs
- Collective operations for LANs
  - Optimized under the assumption that all links have the same latency/bandwidth
- Collective operations for WANs
  - Wide-area links are much slower than local-area links
  - Collective operations need to avoid links with high latency or low bandwidth
  - Existing methods use static Grid-aware trees constructed using manually supplied information
Problems of Existing Methods
- Large-scale, long-lasting applications
  - More situations in which…
    - Different computational resources are used upon each invocation
    - Computational resources are automatically allocated by middleware
    - Processors are added/removed after application startup
- Difficult to manually supply topology information
- Static trees used in existing methods won’t work
Contribution of Our Work
- Method to perform collective operations
  - Efficient in clustered wide-area systems
  - Doesn’t require manual configuration
  - Adapts to new topologies when processes are added/removed
- Implementation for the Phoenix Message Passing System
- Implementation for MPI (work in progress)
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
MPICH
- Thakur et al. 2005
- Assumes that the latency and bandwidth of all links are the same
- Short messages: latency-aware algorithm (binomial tree from the root)
- Long messages: bandwidth-aware algorithm (scatter from the root, then ring all-gather)
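To make the short-message case concrete, here is a minimal sketch of a binomial-tree broadcast schedule. The function name and the round-by-round layout are illustrative assumptions; this is one common way to build a binomial tree and is not claimed to match MPICH's exact send ordering.

```python
def binomial_broadcast_rounds(n, root=0):
    """Return the (sender, receiver) pairs of each round of a binomial-tree broadcast."""
    rounds = []
    have = 1  # relative ranks 0..have-1 already hold the message
    while have < n:
        sends = []
        for rel in range(have):
            dst = rel + have
            if dst < n:
                # Convert ranks relative to the root back to absolute ranks.
                sends.append(((rel + root) % n, (dst + root) % n))
        rounds.append(sends)
        have *= 2  # the set of holders doubles every round
    return rounds


# Usage: 8 processes need 3 rounds.
for i, sends in enumerate(binomial_broadcast_rounds(8)):
    print(f"round {i}: {sends}")
```

Because the number of holders doubles each round, the number of rounds grows logarithmically with the process count, which suits latency-bound short messages when all links are alike.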
MagPIe
- Kielmann et al. ’99
- Separate static trees for wide-area and local-area communication
- Broadcast
  - Root sends to the “coordinator node” of each cluster
  - Coordinator nodes perform an MPICH-like broadcast within each cluster
[Figure: three LANs; the root sends to the coordinator node (Coord. Node) of each other LAN]
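The two-level structure can be sketched as follows. This is only an illustration: the function names are hypothetical, the coordinator is assumed to be the first process listed for each cluster (the root acts as coordinator in its own cluster), and a binomial tree stands in for the "MPICH-like" local broadcast.

```python
def binomial_edges(members):
    """Edges of a binomial tree rooted at members[0]."""
    edges = []
    have = 1
    while have < len(members):
        for i in range(have):
            if i + have < len(members):
                edges.append((members[i], members[i + have]))
        have *= 2
    return edges


def two_level_broadcast(clusters, root):
    """Return (wide_area_edges, local_edges) for a static two-level broadcast."""
    wide, local = [], []
    for members in clusters.values():
        if root in members:
            coord = root  # the root coordinates its own cluster
        else:
            coord = members[0]  # assumption: first listed process is the coordinator
            wide.append((root, coord))  # exactly one wide-area message per remote cluster
        ordered = [coord] + [m for m in members if m != coord]
        local.extend(binomial_edges(ordered))
    return wide, local


clusters = {"a": [0, 1, 2], "b": [3, 4], "c": [5, 6, 7, 8]}
wide, local = two_level_broadcast(clusters, root=1)
print("wide-area:", wide)
print("local:", local)
```

Note that the cluster membership is supplied by hand here, which is exactly the manual configuration that static Grid-aware trees depend on.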
Other Works on Coll. Ops. for WANs
- Other works that rely on manually supplied information
  - Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)
  - MPICH-G2 (Karonis et al. 2003)
Content Delivery Networks
- Application-level multicast mechanisms using topology-aware overlay networks
  - Overcast (Jannotti et al. 2000)
  - SelectCast (Bozdog et al. 2003)
  - Banerjee et al. 2002
- Designed for content delivery; don’t work for message passing
  - Data loss
  - Single source
  - Only 1-to-N operations
Phoenix
- Taura et al. (PPoPP 2003)
- Phoenix Programming Model
  - Message passing model
    - Programs are written using send/receive
    - Messages are addressed to virtual nodes
  - Strong support for addition/removal of processes during execution
- Phoenix Message Passing Library
  - Message passing library based on the Phoenix Programming Model
  - Basis of our implementation
Addition/Removal of Processes
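- Virtual node namespace
  - Messages are addressed to virtual nodes instead of to processes
- API to “migrate” virtual nodes supports addition/removal of processes during execution
[Figure: two processes hold virtual nodes 0-19 and 20-39, so send(30) reaches the second process; when a process joins, virtual nodes 30-39 migrate to it while 20-29 stay, and send(30) then reaches the joining process]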
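The idea of addressing virtual nodes rather than processes can be sketched with a toy lookup table. The class and method names below are hypothetical and are not the Phoenix API; a single centralized dictionary is assumed purely for illustration.

```python
class VirtualNodeMap:
    """Toy mapping from virtual node ids to the process that currently holds them."""

    def __init__(self):
        self._owner = {}

    def assign(self, vnodes, process):
        """Give a set of virtual nodes to a process at startup."""
        for v in vnodes:
            self._owner[v] = process

    def migrate(self, vnodes, new_process):
        """Move already-assigned virtual nodes to another process (e.g., one that just joined)."""
        for v in vnodes:
            if v not in self._owner:
                raise KeyError(f"virtual node {v} is not assigned")
            self._owner[v] = new_process

    def route(self, vnode):
        """Return the process that a message addressed to vnode should reach."""
        return self._owner[vnode]


vmap = VirtualNodeMap()
vmap.assign(range(0, 20), "A")
vmap.assign(range(20, 40), "B")
print(vmap.route(30))            # held by B before the join
vmap.migrate(range(30, 40), "C")  # process C joins and takes over 30-39
print(vmap.route(30))            # the same send(30) now reaches C
```

The sender's code does not change when a process joins or leaves; only the ownership of virtual nodes does.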
Broadcast in Phoenix
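[Figure: broadcast in MPI (root to processes) compared with broadcast in Phoenix (root to virtual nodes 0-4, one of which has migrated); messages for virtual nodes held by the same process may be delivered together]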
Reduction in Phoenix
[Figure: reduction in MPI compared with reduction in Phoenix over virtual nodes 0-4, with the operation (Op) applied on the way to the root]
Overview of Our Proposal
- Create topology-aware spanning trees at run-time
  - Latency-aware trees (for short messages)
  - Bandwidth-aware trees (for long messages)
- Perform broadcasts/reductions along the generated trees
- Update the trees when processes are added/removed
Spanning Tree Creation
- Create a spanning tree for each process, with that process at the root
- Each process autonomously
  - Measures the RTT (or bandwidth) between itself and randomly selected other processes
  - Searches for a suitable parent for each spanning tree
[Figure: a process measuring the RTT to several other processes in the tree of a given root]
Latency-Aware Trees
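- Goal
  - Few edges between clusters
  - Moderate fan-out and depth within clusters
- Parent selection
  - $\mathrm{RTT}_{p,\mathrm{cand}} < \mathrm{RTT}_{p,\mathrm{parent}}$
  - $\mathrm{dist}_{\mathrm{cand},\mathrm{root}} < \mathrm{dist}_{p,\mathrm{root}}$
[Figure: $\mathrm{dist}_{p,\mathrm{root}}$ shown between process p and the root]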
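The parent-selection rule can be sketched as below. Several things are assumptions made for illustration: that both conditions must hold together before p switches, that dist to the root is itself a measured RTT, the function names, and the RTT values in the table (invented, in milliseconds).

```python
def make_rtt(table):
    """Build a symmetric RTT lookup from a table keyed by unordered pairs."""
    def rtt(a, b):
        if a == b:
            return 0.0
        return table[(a, b)] if (a, b) in table else table[(b, a)]
    return rtt


def choose_parent(p, parent, candidates, root, rtt):
    """Return p's parent after probing each candidate once."""
    for cand in candidates:
        if cand == p:
            continue
        closer_to_p = rtt(p, cand) < rtt(p, parent)        # RTT_{p,cand} < RTT_{p,parent}
        closer_to_root = rtt(cand, root) < rtt(p, root)    # dist_{cand,root} < dist_{p,root}
        if closer_to_p and closer_to_root:
            parent = cand
    return parent


# Root r is in one cluster; x and y share another cluster (invented RTTs in ms).
rtt = make_rtt({("r", "x"): 10.0, ("r", "y"): 10.2, ("x", "y"): 0.1})
print(choose_parent("y", "r", ["x"], "r", rtt))  # y moves under its cluster-mate x
print(choose_parent("x", "r", ["y"], "r", rtt))  # x keeps r, since y is farther from the root
```

Under these assumptions the second condition means a parent is always strictly closer to the root than its child, so following parent links cannot form a cycle, and in the example only one wide-area edge (r to x) remains.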
[Figure sequence: processes in three LANs change parents step by step (labels: Root, Root (w/in cluster), Parent Change), with RTT and dist annotated between p, cand, parent, and root]
Bandwidth-Aware Trees
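- Goal
  - Efficient use of bandwidth during a broadcast/reduction
- Bandwidth estimation
  - $\mathrm{est}_{p,\mathrm{cand}} = \min\left(\mathrm{est}_{\mathrm{cand}},\ \mathrm{bw}_{\mathrm{p2p}} / (n_{\mathrm{children}} + 1)\right)$
- Parent selection
  - $\mathrm{est}_{p,\mathrm{cand}} > \mathrm{est}_{p,\mathrm{parent}}$
[Figure: cand, cand’s parent, and p, annotated with $\mathrm{est}_{\mathrm{cand}}$, $n_{\mathrm{children}}$, and $\mathrm{bw}_{\mathrm{p2p}}$]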
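A small worked example of the estimate may help. The function names and the numbers (in MB/s) are invented for illustration; only the formula and the comparison come from the slide.

```python
def estimate_via(est_cand, bw_p2p, n_children):
    """est_{p,cand} = min(est_cand, bw_p2p / (n_children + 1))."""
    # The candidate's outgoing bandwidth is shared among its current children plus p.
    return min(est_cand, bw_p2p / (n_children + 1))


def should_switch(est_p_parent, est_cand, bw_p2p, n_children):
    """Switch only if the candidate promises more bandwidth than the current parent."""
    return estimate_via(est_cand, bw_p2p, n_children) > est_p_parent


# p currently receives an estimated 40 MB/s through its parent.
print(estimate_via(100.0, 90.0, 2))         # busy candidate: limited by the shared link
print(should_switch(40.0, 100.0, 90.0, 2))  # not worth switching
print(should_switch(40.0, 50.0, 80.0, 0))   # idle candidate: worth switching
```

The division by n_children + 1 is what discourages large fan-out: a candidate with many children looks slow even when its own incoming bandwidth is high.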
Broadcast
- Short message (<128 KB)
  - Forward along a latency-aware tree
- Long message (>128 KB)
  - Pipeline along a bandwidth-aware tree
- Include in the header
  - The set of virtual nodes to be forwarded via the receiving process
[Figure: root 0 sends to node 1 with header {1, 3, 4} and to node 2 with header {2, 5}; these forward with headers {3}, {4}, and {5}; virtual node 5 is shown migrating]
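The header mechanism can be sketched as follows, using the tree from the figure. This is an illustration under assumptions: one virtual node per process, a sender that knows which virtual nodes lie behind each child (the slides do not say how this is tracked), hypothetical function names, and a tree choice at exactly 128 KB that the slide leaves unspecified.

```python
SHORT_LIMIT = 128 * 1024  # bytes; which tree handles exactly 128 KB is an assumption here


def pick_tree(nbytes):
    """Choose the tree type by message size."""
    return "latency" if nbytes < SHORT_LIMIT else "bandwidth"


def subtree(tree, node):
    """All nodes reachable from node, including node itself."""
    nodes = {node}
    for child in tree.get(node, []):
        nodes |= subtree(tree, child)
    return nodes


def forward(tree, node, header, log):
    """Forward to each child the part of the header that lies in that child's subtree."""
    for child in tree.get(node, []):
        child_header = header & subtree(tree, child)
        if child_header:
            log.append((node, child, child_header))
            forward(tree, child, child_header, log)


tree = {0: [1, 2], 1: [3, 4], 2: [5]}
log = []
forward(tree, 0, {1, 2, 3, 4, 5}, log)
for sender, receiver, header in log:
    print(sender, "->", receiver, sorted(header))
```

Carrying the target set in the header lets a receiver know which virtual nodes it is responsible for, even if the tree changes while the message is in flight.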
Reduction
- Each processor
  - Waits for a message from all of its children
  - Performs the specified operation
  - Forwards the result to its parent
- Timeout mechanism
  - To avoid waiting forever for a child that has already sent its message to another process
[Figure: p changes its parent from the old parent to a new parent (Parent Change); the old parent stops waiting for p after a timeout]
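The per-process reduction step with a timeout can be sketched as below. This is a single-process illustration using Python's standard `queue` module as a stand-in for the message inbox; the function name, the single shared deadline, and the timeout value are assumptions, not the authors' implementation.

```python
import operator
import queue
import time


def collect_and_reduce(own_value, inbox, n_children, op, timeout_s):
    """Combine own_value with up to n_children messages, giving up at the deadline."""
    result = own_value
    deadline = time.monotonic() + timeout_s
    for _ in range(n_children):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline passed: stop waiting for missing children
        try:
            result = op(result, inbox.get(timeout=remaining))
        except queue.Empty:
            break  # a child never reported, e.g. it changed parents
    return result  # the caller would forward this to its own parent


# Three children are expected, but one has moved to a new parent and never reports.
inbox = queue.Queue()
inbox.put(3)
inbox.put(4)
print(collect_and_reduce(1, inbox, 3, operator.add, timeout_s=0.05))
```

The missing child's contribution is not lost in the scenario on the slide, because that child has already sent its message to its new parent.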
Broadcast (1-byte)
- MPI-like broadcast
  - Mapped 1 virtual node to each process
  - 201 processes in 3 clusters
[Figure: three panels (Our Implementation; MagPIe-like (Static Grid-aware) Impl.; MPICH-like (Grid-unaware) Impl.), each plotting Completion Time (ms), 0 to 40, against Process Number, 0 to 200]
Broadcast (Long)
- MPI-like broadcast
  - Mapped 1 virtual node to each process
  - 137 processes in 4 clusters
[Figure: Bandwidth (MB/sec), 0 to 800, against Message Size (Bytes), 1.E+04 to 1.E+08, for Dynamic (Our Imp.), MPICH-like, MagPIe-like, and List]
Reduction
- MPI-like reduction
  - Mapped 1 virtual node to each process
  - 128 processes in 3 clusters
[Figure: Completion Time (microsecs), 1.E+01 to 1.E+04, against Integers Summed, 1.E+01 to 1.E+07, for Dynamic (Our Imp.), MagPIe-like, and MPICH-like]
Addition/Removal of Processes
- Repeated 4-MB broadcasts
- 160 procs. in 4 clusters
- Added/removed procs. while broadcasting
  - t = 0 [s]: 1 virtual node/process
  - t = 60 [s]: Remove half of the processes
  - t = 90 [s]: Re-add the removed processes
[Figure: Bandwidth (MB/sec), 0 to 600, against Elapsed Time (secs), 0 to 120, with the removal (Rm.) and addition (Add) points marked]
Conclusion
- Presented a method to perform broadcasts and reductions in WANs without manual configuration
- Experiments
  - Stable-state broadcast/reduction
    - 1-byte broadcast 3+ times faster than MPICH, within a factor of 2 of MagPIe
  - Addition/removal of processes
    - Effective execution resumed 8 seconds after adding/removing processes
Future Work
- Optimize broadcast/reduction
  - Reduce the gap between our method and static Grid-enabled methods
- Other collective operations
  - All-to-all
  - Barrier
Thank you!