A bandwidth latency tradeoff for broadcast and reduction


Information Processing Letters 86 (2003) 33–38. doi:10.1016/S0020-0190(02)00473-8

www.elsevier.com/locate/ipl


Peter Sanders a,∗, Jop F. Sibeyn b

a Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
b Department of Computing Science, Umeå University, 901 87 Umeå, Sweden

Received 9 August 2000; received in revised form 16 January 2002

Communicated by F. Dehne

Abstract

The “fractional tree” algorithm for broadcasting and reduction is introduced. Its communication pattern interpolates between two well known patterns—sequential pipeline and pipelined binary tree. The speedup over the best of these simple methods can approach two for large systems and messages of intermediate size. For networks which are not very densely connected, the new algorithm seems to be the best known method for the important case that each processor has only a single (possibly bidirectional) channel into the communication network. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Collective communication; Broadcast; Reduction; Tree; Single ported; Half-duplex; Full-duplex; Parallel algorithms; Mesh; Hierarchical Crossbar

1. Introduction

Consider P processing units (PUs) of a parallel machine. Broadcasting, the operation in which one processor has to send a message M to all other PUs, is a crucial building block for many parallel algorithms. Since it can be implemented once and for all in communication libraries such as MPI [7], it makes sense to invest in algorithms which are close to optimal for all P and all message lengths k. Since broadcasting is sometimes a bottleneck operation, even constant factors should be considered. In addition, by reversing the direction of communication, broadcasting algorithms can usually be turned into reduction algorithms. Reduction is the task of computing a generalized sum ⊕_{i<P} M_i, where initially message M_i is stored on PU i and where “⊕” can be any associative operator. Broadcasting and reduction are among the most important communication primitives. For example, some of the best algorithms for matrix multiplication or dense matrix–vector multiplication have these two functions as their sole communication routines [5].

* Corresponding author.
E-mail addresses: [email protected] (P. Sanders), [email protected] (J.F. Sibeyn).
URLs: http://www.mpi-sb.mpg.de/~sanders, http://www.cs.umu.se/~jopsi.

We study broadcasting long messages for a simple synchronous, symmetric communication model which is intended as a least common denominator of practical protocols able to support high bandwidth for long messages: it takes time t + k to transfer a message of size k, regardless of which PUs are involved. This is realistic on many modern machines, where network latencies are small compared to the start-up overhead t. Both sender and receiver have to cooperate in transmitting a message. We consider two variants. Our default is the duplex model, where a PU can concurrently send a message to one partner and receive a message from a possibly different partner. We use the name send|recv to denote this parallel operation in pseudo-code. The more restrictive half-duplex model permits only one communication direction per processor. The broadcasting time for half-duplex is at most twice that for duplex communication for half as many PUs.¹

Let us start our discussion with a simple non-pipelined algorithm that is very good for short messages with k = O(t). In the binomial tree algorithm, PU i sends data to processors i + 2^z, i + 2^{z−1}, …, i + 1 if the binary representation of its processor index has z + 1 trailing zeros. The resulting execution time

$$T_{\text{binomial}} = (k + t)\lceil\log P\rceil \qquad (1)$$

is rather large for k ≫ t because processor 0 has to send all the data ⌈log P⌉ times.
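To make the schedule concrete, here is a minimal Python sketch (our illustration, not from the paper; the function name and the (round, sender, receiver) representation are our own):

    import math

    def binomial_broadcast_schedule(P):
        # (round, sender, receiver) triples for a binomial tree broadcast
        # among PUs 0..P-1 rooted at PU 0; ceil(log2 P) rounds in total,
        # matching T_binomial = (k + t) * ceil(log P).
        rounds = math.ceil(math.log2(P))
        have = {0}                       # PUs that already hold the message
        schedule = []
        for j in range(rounds):
            step = 2 ** (rounds - 1 - j)
            new = {i + step for i in have if i + step < P and i + step not in have}
            schedule.extend((j, i, i + step) for i in sorted(have) if i + step in new)
            have |= new
        return schedule

    # For P = 8: round 0: 0->4; round 1: 0->2, 4->6;
    # round 2: 0->1, 2->3, 4->5, 6->7.

Note how every PU sends and receives at most one message per round, so the sketch respects the single-ported model.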

We begin our description of pipelined algorithms in Section 2 by reviewing simple results. By arranging the PUs in a simple chain, execution time

$$T^*_\infty = k\left(1 + O\left(\sqrt{tP/k}\right)\right) + O(tP) \qquad (2)$$

can be achieved. Except for very long messages, a better approach is to arrange the PUs into a binary tree. This approach achieves broadcasting time²

$$T^*_1 = k\left(2 + O\left(\sqrt{t\log(P)/k}\right)\right) + O(t\log P) \qquad (3)$$

(replace “2” by “3” for the half-duplex model). We also give lower bounds.

The main contribution of this paper is the fractional tree algorithm described in Section 3. It is a generalization of the two above algorithms and achieves an execution time of

$$T^{**} = k\left(1 + O\left(\left(\frac{t\log P}{k}\right)^{1/3}\right)\right) + O(t\log P), \qquad (4)$$

i.e., it combines the advantage of the chain algorithm of having a (1 + o(1)) factor in the k-dependent term with the advantage of the binary tree algorithm of having a logarithmic dependence on P in the t-dependent term of the execution time. For large P and medium k the improvement over both simple algorithms approaches a factor of two (3/2 for the half-duplex model).

¹ A couple of half-duplex PUs emulate each communication of a duplex PU in two substeps. In the first substep one partner acts as a sender and the other as a receiver for communicating with other couples. In the second substep the previously received data is forwarded to the partner.

² Throughout this paper log x stands for log₂ x.

For some powerful network topologies, somewhat better algorithms are known. For hypercubes, there is an elegant and fast algorithm which runs in time

$$T^*_{\text{HC}} = k\left(1 + \sqrt{t\log(P)/k}\right)^2 = k\left(1 + O\left(\sqrt{t\log(P)/k}\right)\right) + O(t\log P)$$

[1,4]. However, no similarly good algorithm was known for networks with low bisection³ bandwidth, e.g., meshes. Even for fully connected networks the best known algorithms for arbitrary P are quite complicated [2,6]. The fractional tree algorithm does not have this problem. In Section 4 we explain how it can be adapted to several sparse topologies like hierarchical networks and meshes.

2. Basic results on broadcasting

2.1. Lower bounds

All non-source PUs must receive the k data elements, and the whole broadcasting takes at least log P steps. Thus in the duplex model there is a lower bound of

$$T_{\text{lower}} = k + t\log P. \qquad (5)$$

In the half-duplex model, all non-source PUs must receive the k data elements. Hence the total volume of data received is at least (P − 1)·k. Since all this data also has to be sent, this implies a time bound of 2(1 − 1/P)k even if all PUs are communicating all the time. Also taking start-up time into account gives

$$T_{\text{lower, half-duplex}} = 2(1 - 1/P)\,k + t(\log P - 4). \qquad (6)$$

³ The bisection width of a network is the smallest number of connections one needs to cut in order to produce two disconnected components of size ⌈P/2⌉ and ⌊P/2⌋.


Fig. 1. Examples for fractional trees with r ∈ {1, 2, 3, ∞} where the last PU receives its first packet after 5 steps. The case r = 1 corresponds to plain binary trees, and pipelines can be considered the case r = ∞. Part (e) shows the communication pattern in a group of r PUs which cooperate to supply two successor nodes with all the data they have to receive. Edges are labeled with the first time step when they are active.

The ‘−4’ stems from the fact that during a start-up phase, where not all PUs have received data yet, we can nevertheless perform some communication between the other PUs. We omit the detailed derivation of relation (6) because it is somewhat tedious but not really difficult.

These lower bounds hold in full generality. For a large and natural class of algorithms, we can prove a stronger bound though. Consider algorithms that divide the total data set of size k into s packets of size k/s each. All PUs operate synchronously, and in every step they send or receive at most one packet. So, until step s − 1, there is still at least one packet known only to the source. Thus, for given s, at least s − 1 + log P steps are required in the duplex model. Each step takes k/s + t time. For given k, t and P, the minimum is attained for s = √(k·log(P)/t):

$$T^*_{\text{lower}} = k\left(1 + \sqrt{t\log(P)/k}\right)^2. \qquad (7)$$

2.2. Two simple pipelined algorithms

For k ≫ t, a central idea for fast broadcasting is to chop the message into s packets of size k/s and to forward these packets in a pipelined fashion. The simplest pipelined algorithm arranges all PUs into a chain of length P − 1. The head of the chain feeds packets downward. Interior PUs receive one packet in the first step and then in each step receive the next packet and forward the previously received packet. Fig. 1(d) gives an example. It is easy to see that one gets an execution time of T^s_∞ := (P − 2 + s)·(t + k/s). The optimal choice for s is √(k(P − 2)/t). Substituting this into T^s_∞ yields

$$T^*_\infty := k\left(1 + \sqrt{t(P-2)/k}\right)^2 = k\left(1 + O\left(\sqrt{tP/k}\right)\right) + O(tP).$$
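As a quick numeric check, the chain time and its optimized packet count are easy to evaluate directly (our sketch; the function name is ours):

    import math

    def chain_time(P, k, t, s=None):
        # T_inf^s = (P - 2 + s)(t + k/s); if s is not given, use the
        # optimal choice s = sqrt(k (P - 2) / t) from the text.
        if s is None:
            s = max(1, round(math.sqrt(k * (P - 2) / t)))
        return (P - 2 + s) * (t + k / s)

    # With t = 1 (so k is measured in units of t), P = 1024 and k/t = 4096
    # the chain needs about 2.25 k, far above the lower bound k + t log P.
    print(chain_time(1024, 4096, 1) / 4096)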

For k ≫ tP the performance of this algorithm is quite close to the lower bound (5). However, since t is usually a large constant, on systems with large P we only get good performance for messages which are extremely large. We can reduce the dependence on P by arranging the PUs into a binary tree. Now every interior node forwards every packet to both successors. This takes two steps per packet. The execution time is T^s_1 := (d + 2s)·(t + k/s), where d is the time step just before the last leaf receives the first packet; d is defined by the recurrence P_i = 1 + P_{i−1} + P_{i−2}, P_0 = 1, P_1 = 2. We have d = min{i: P_i ≥ P} − 1 ≈ log_{1.62} P. For our purposes it is sufficient to note that d = O(log P). Fig. 1(a) shows the tree with P_5 = 18 PUs. Choosing s = √(k·d/(2t)), one gets

$$T^*_1 := k\left(\sqrt{2} + \sqrt{d\,t/k}\right)^2 = k\left(2 + O\left(\sqrt{t\log(P)/k}\right)\right) + O(t\log P).$$

(For the half-duplex model replace the two by a three.) For small and medium k this is much better than the chain algorithm, yet for large k it is almost two times slower.
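The recurrence for d and the resulting time are easy to evaluate; a small sketch (ours, same conventions as the sketches above, assuming P ≥ 2):

    import math

    def binary_tree_d(P):
        # d = min{i : P_i >= P} - 1 with P_i = 1 + P_{i-1} + P_{i-2},
        # P_0 = 1, P_1 = 2; the Fibonacci-like growth gives d = O(log P).
        a, b, i = 1, 2, 1                # a = P_{i-1}, b = P_i
        while b < P:
            a, b, i = b, 1 + b + a, i + 1
        return i - 1

    def tree_time(P, k, t):
        # T_1^s = (d + 2 s)(t + k/s) with the optimal s = sqrt(k d / (2 t)).
        d = binary_tree_d(P)
        s = max(1, round(math.sqrt(k * d / (2 * t))))
        return (d + 2 * s) * (t + k / s)

    # P = 1024, k/t = 4096: about 2.16 k, slightly better than the chain here.
    print(tree_time(1024, 4096, 1) / 4096)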

3. Fractional tree broadcasting

The starting point for this paper was the question whether there exists a communication pattern which allows a more flexible tradeoff between the high bandwidth of a chain (i.e., a tree with degree one) and the low latency of a binary tree. We present the family of fractional tree communication patterns, which have this property.

The idea for fractional trees is to replace each node of a binary tree by a group of r PUs forming a chain. The input is fed into the head of this chain. The data is passed down the chain and on to a successor group as in the single chain algorithm. In addition, the PUs of the group cooperate in feeding the data to the head of a second successor group. Fig. 1 shows the structure of a group and several examples.


    Procedure broadcastFT(r, s, 0 ≤ i < r : Integer; var D[0..s − 1] : Packet)
        recv(D[0])                        — wait for first packet
        pipeDown(r, 0, D)                 — First phase
        for k := r to s − r step r do     — Remaining phases
            sendRight|Recv(D[k − r + i], D[k])
            pipeDown(r, k, D)
        sendRight(D[s − r + i])

    Procedure pipeDown(r, k : Integer; var D[..] : Packet)
        (* send packets D[k..k + r − 1]; receive packets D[k + 1..k + r − 1] *)
        for j := k to k + r − 2 do sendDown|Recv(D[j], D[j + 1])
        sendDown(D[k + r − 1])

Fig. 2. Pseudocode executed on each PU for fractional tree broadcasting, where i is the index of the PU within its group, s is a multiple of r, and the array D is the input on the root and the output on the other PUs. For the root PU, receiving is a no-op. For the top PU of a group, receiving means receiving from any PU in the predecessor group; for the other PUs it means receiving from the predecessor in the group. Sending down means sending to the next PU in the group, respectively sending to the top PU of the successor group. Sending right means sending to the top PU of the right successor group. If the successor defined by this convention does not exist, sending is a no-op.


All PUs execute the same code, shown in Fig. 2. All timing considerations are naturally handled by the synchronization implicit in synchronous point-to-point communication. The input is conceptually subdivided into s/r runs of r packets each. The only nontrivial point is that the ith member of a group is responsible for passing the ith packet of every run of r packets to the right. The effect is that every r + 1 steps the head of the right successor gets a run of r packets in the correct order. The pause after this run is used to pass the last packet downward. Packets are passed right while the next run arrives.

As in the special case of binary trees (r = 1), the right successors receive data one step later than the downward successors. Therefore, optimal tree layouts are somewhat skew. The number of nodes reachable within d + 1 steps is governed by the recurrence P_i = r + P_{i−r} + P_{i−r−1} (P_i = i + 1 for i ≤ r), so that d = min{i: P_i ≥ P} − 1. This implies d = O(r log(P/r)). Using this recurrence each processor can find its place in the tree in time O(d) and without any communication.
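A direct rendition of this recurrence (our sketch); note that r = 1 reproduces the binary tree recurrence above:

    def fractional_tree_d(P, r):
        # d = min{i : P_i >= P} - 1, where P_i = i + 1 for i <= r and
        # P_i = r + P_{i-r} + P_{i-r-1} otherwise; d = O(r log(P/r)).
        Ps = [i + 1 for i in range(r + 1)]   # P_0 .. P_r
        i = 0
        while Ps[i] < P:
            i += 1
            if i == len(Ps):
                Ps.append(r + Ps[i - r] + Ps[i - r - 1])
        return i - 1

    # For P = 1024 and r = 8 this gives d = 57, the value used in Section 3.2.
    print(fractional_tree_d(1024, 8))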

Here we described the algorithm for the duplex model. As already outlined in the introduction, any duplex algorithm can be turned into a half-duplex algorithm running in double time. Optimizing for this special case, replacing send|recv operations by sequences “send, recv” or “recv, send” such that no delays or deadlocks occur, we obtain a faster direct implementation which is able to forward a run of r packets in 2r + 1 steps (instead of r + 1) and hence is only a factor (2r + 1)/(r + 1) = 2 − 1/(r + 1) slower than on the duplex model.

3.1. Performance analysis

Having established a smooth timing of the algorithm, the analysis can proceed analogously to that of the simple algorithms from the introduction. Every communication step takes time (t + k/s), and d + s·(1 + 1/r) steps are needed until all s/r runs have reached the last leaf group. We get a total time of

$$T^s_r := \left(d + s\left(1 + \frac{1}{r}\right)\right)\left(t + \frac{k}{s}\right). \qquad (8)$$

Using calculus one gets s = √(k·d·r/(t(r + 1))) as an optimal choice for the number of packets. Substituting this into Eq. (8) yields

$$T^*_r := k\left(1 + \frac{1}{r}\right)\left(1 + \sqrt{\frac{d\,r\,t}{k(r+1)}}\right)^2 = k\left(1 + \frac{1}{r} + O\left(\sqrt{\frac{r\,t\log P}{k}}\right)\right) + O(rt\log P). \qquad (9)$$


Since d depends on r in a complicated way, there seems to be no closed form formula for an optimal r. But we get a close to optimal value for r by setting d = d′·(r + 1) and ignoring that d′ depends on r. We get r ≈ (k(r + 1)/(d·t))^{1/3}. After rounding, these values make sense for k(r + 1) ≥ d·t. In a program the equations can be solved numerically. For smaller k one should use r = 1 or even a non-pipelined algorithm. Substituting r and s into T^s_r we get a broadcasting algorithm with execution time

r we get a broadcastingalgorithm with execution time

$$T^{**} \le k\left(1 + \left(\frac{d\,t}{k(r+1)}\right)^{1/3}\right)^3 = k\left(1 + O\left(\left(\frac{t\log P}{k}\right)^{1/3}\right)\right) + O(t\log P).$$

For k ≫ t log P the algorithm performs quite close to the lower bound (5).
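As noted above, the equations can be solved numerically in a program. A brute-force sketch (ours; it reuses fractional_tree_d from the previous sketch rather than the authors' rounding scheme):

    import math

    def fractional_tree_time(P, k, t, r, s=None):
        # T_r^s = (d + s (1 + 1/r))(t + k/s), Eq. (8), with the optimal
        # s = sqrt(k d r / (t (r + 1))) when s is not given.
        d = fractional_tree_d(P, r)
        if s is None:
            s = max(1, round(math.sqrt(k * d * r / (t * (r + 1)))))
        return (d + s * (1 + 1 / r)) * (t + k / s)

    def best_r(P, k, t, r_max=64):
        # r is small in practice, so exhaustive search is cheap.
        return min(range(1, r_max + 1), key=lambda r: fractional_tree_time(P, k, t, r))

For P = 1024 and k/t = 4096 this search returns r near 10 with T ≈ 1.39k, in line with the example in Section 3.2 below.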

3.2. Performance examples

How does the algorithm compare to the two simple algorithms? For example, consider the case P = 1024 and k/t = 4096. (On a machine with 10⁻⁵ s start-up overhead and 10⁸ byte/s peak bandwidth this choice corresponds to rather realistic messages of length about 4096·10⁻⁵·10⁸ ≈ 4 MB.) We choose d′ = d/(r + 1) ≈ log P − 1 = 9 and get r ≈ (k/(d′·t))^{1/3} ≈ 8. This yields d = 57 and we get s = √(4096·57·8/9) ≈ 456. With these values T^s_r ≈ 1.389k. These choices are quite robust. For example, a better approximation of the optimal r yields r = 10 and s = 503, but the resulting T^s_r ≈ 1.387k is less than 0.2% better. With our new approach the broadcasting time is only about 2/3 of what would be needed when applying the linear pipeline or pipelined binary tree broadcasting. We would be a factor seven faster than binomial tree broadcasting.
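These numbers can be reproduced with the sketches above (ours; t = 1, so k is in units of t):

    import math

    P, k, t = 1024, 4096, 1
    print((k + t) * math.ceil(math.log2(P)) / k)   # binomial, Eq. (1): ~10.0
    print(chain_time(P, k, t) / k)                 # chain:             ~2.25
    print(tree_time(P, k, t) / k)                  # binary tree:       ~2.16
    print(fractional_tree_time(P, k, t, 8) / k)    # fractional, r = 8: ~1.39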

Fig. 3 plots the achievable speedup for three different machine sizes. Even for a medium size parallel computer (P = 64) an improvement of up to a factor 1.29 can occur. For very large machines (P = 16384) the improvements reach up to a factor of 1.8 and a significant improvement is observed over a large range of message lengths.

Our conclusion is that fractional tree broadcasting yields a small improvement for “everyday parallel computing” yet constitutes a significant contribution to the difficult task of exploiting high end machines such as the ones currently built in the ASCI program. For example, Compaq plans to achieve 100 TFlops with 16384 Alpha processors by the year 2004 [3].

Fig. 3. Improvement of fractional tree broadcasting over the best of pipelined binary tree and sequential pipeline algorithms as a function of k/t.


4. Sparse interconnection networks

Since fractional trees are almost trees, it is not astonishing that they are relatively easy to embed into many kinds of interconnection networks. Here we concentrate on a simple yet important case and only mention that similar results can be obtained for other important networks such as two-dimensional meshes.

Compaq’s above mentioned 16384 PU system is expected to consist of 256 SMP modules with 64 PUs each. We view it as unlikely that it will get an interconnection network with enough bisection width to efficiently implement the hypercube algorithm. Rather, each module will only have a limited number of channels to other modules. We call such a system a 256 × 64 hierarchical crossbar. Systems with similar properties are currently built by several companies.

We now explain how a fractional tree with group size r can be embedded into an a × b hierarchical crossbar if b ≥ r and if each module supports at least two incoming and outgoing channels to arbitrary other modules.⁴

⁴ A generalization to more than two levels of hierarchy is also possible. If fewer connections per module are available, the method can be adapted at the price of increasing d. For three connections per node, d = O(r log(P/r)) is still possible. If the modules are themselves single-ported, d increases to O(r² log(P/r)).


Fig. 4. Embedding of a fractional tree with r = 3 into an 8 × 16 hierarchical crossbar. After 18 steps all PUs receive the first data.

First, one group in each module is connected to form a global binary tree with a nodes. Next, the b − r remaining PUs in each module are connected to form a local fractional tree. What remains to be done is to connect the local trees by the global tree. Groups in the global tree with degree one can directly link with their local tree. Leaf groups in the global tree use one of their free links to connect to their local tree. The remaining free links are used to connect to the local trees of modules with a group in the global tree of degree two. There will be one remaining unused link which can be used to further optimize the structure. Fig. 4 gives an example where this link is used to build a tree where all PUs receive the first data at about the same time.

References

[1] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C. Ho, S. Kipnis, M. Snir, CCL: A portable and tunable collective communication library for scalable parallel computers, IEEE Trans. Parallel Distrib. Systems 6 (2) (1995) 154–164.

[2] A. Bar-Noy, S. Kipnis, Broadcasting multiple messages in simultaneous send/receive systems, in: 5th IEEE Symp. Parallel Distributed Processing, 1993, pp. 344–347.

[3] Compaq, AlphaServer SC series product brochure, 1999, http://www.digital.com/hpc/news/news_sc_launch.html.

[4] S.L. Johnsson, C.T. Ho, Optimum broadcasting and personalized communication in hypercubes, IEEE Trans. Comput. 38 (9) (1989) 1249–1268.

[5] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing. Design and Analysis of Algorithms, Benjamin/Cummings, New York, 1994.

[6] E.E. Santos, Optimal and near-optimal algorithms for k-item broadcast, J. Parallel Distrib. Comput. 57 (1999) 121–139.

[7] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI—The Complete Reference, MIT Press, Cambridge, MA, 1996.