(How to Implement) Basic Communication...
Transcript of (How to Implement) Basic Communication...
26+29-03-2010 MVP'10 - Aalborg University 2
Overview
One-to-all broadcast & all-to-one reductionAll-to-all broadcast and reductionAll-reduce and prefix-sum operationsScatter and GatherAll-to-All Personalized CommunicationCircular ShiftImproving the Speed of Some Communication Operations
26+29-03-2010 MVP'10 - Aalborg University 3
Collective Communication Operations
Represent regular communication patterns.Used extensively in most data-parallel algorithms.Critical for efficiency.Available in most parallel libraries.Very useful to “get started” in parallel processing.Basic model: ts+mtw time for exchanging a m-word message with cut-through routing.
26+29-03-2010 MVP'10 - Aalborg University 4
Interesting:
To know:Data transfer time is roughly the same between all pairs of nodes.Homogeneity true on modern hardware (randomized routing, cut-through routing…).
ts+mtwAdjust tw for congestion: effective tw.
Model: bidirectional links, single port.Communication with point-to-point primitives.
26+29-03-2010 MVP'10 - Aalborg University 5
Broadcast/ReductionOne-to-all broadcast:
Single process sends identical data to all (or subset of) processes.
All-to-one reduction:Dual operation.P processes have m words to send to one destination.Parts of the message need to be combined.
26+29-03-2010 MVP'10 - Aalborg University 6
Broadcast/Reduction
Broadcast Reduce
26+29-03-2010 MVP'10 - Aalborg University 7
One-to-All Broadcast – Ring/Linear Array
Naïve approach: send sequentially.Bottleneck.Poor utilization of the network.
Recursive doubling:Broadcast in logp steps (instead of p).Divide-and-conquer type of algorithm.Reduction is similar.
26+29-03-2010 MVP'10 - Aalborg University 8
Recursive Doubling
0 1 2 3
7 6 5 4
1
4
2
6
2
2
3 3
33
1 3
7 5
26+29-03-2010 MVP'10 - Aalborg University 9
Example: Matrix*Vector
1) 1->all
2) Compute
3) All->1
26+29-03-2010 MVP'10 - Aalborg University 10
One-to-All Broadcast – MeshExtensions of the linear array algorithm.
Rows & columns = arrays.Broadcast on a row, broadcast on columns.Similar for reductions.Generalize for higher dimensions (cubes…).
26+29-03-2010 MVP'10 - Aalborg University 11
Broadcast on a Mesh
26+29-03-2010 MVP'10 - Aalborg University 12
One-to-All Broadcast – Hypercube
Hypercube with 2d nodes = d-dimensional mesh with 2 nodes in each direction.Similar algorithm in d steps.Also in logp steps.Reduction follows the same pattern.
26+29-03-2010 MVP'10 - Aalborg University 13
Broadcast on a Hypercube
26+29-03-2010 MVP'10 - Aalborg University 14
All-to-One BroadcastBalanced Binary Tree
Processing nodes = leaves.Hypercube algorithm maps well.Similarly good w.r.t. congestion.
26+29-03-2010 MVP'10 - Aalborg University 15
Broadcast on a Balanced Binary Tree
26+29-03-2010 MVP'10 - Aalborg University 16
AlgorithmsSo far we saw pictures.Not enough to implement.Precise description
to implement.to analyze.
Description for hypercube.Execute the following procedure on all the nodes.
26+29-03-2010 MVP'10 - Aalborg University 17
Broadcast Algorithm
000 001101
100
010
110 111
011
111
011 001 000
011
011
001
001 000
000
000000
Current dimension
26+29-03-2010 MVP'10 - Aalborg University 18
Broadcast Algorithm
000 001101
100
010
110 111
011
001 000
26+29-03-2010 MVP'10 - Aalborg University 19
Broadcast Algorithm
000 001101
100
010
110 111
011
000
26+29-03-2010 MVP'10 - Aalborg University 20
Algorithm For Any Source
26+29-03-2010 MVP'10 - Aalborg University 21
Reduce Algorithm
In a nutshell:reverse the previous one.
26+29-03-2010 MVP'10 - Aalborg University 22
Cost Analysis
p processes → logp steps (point-to-pointtransfers in parallel).Each transfer has a time cost ofts+twm.Total time: T=(ts+twm) logp.
26+29-03-2010 MVP'10 - Aalborg University 23
All-to-All Broadcast and Reduction
Generalization of broadcast:Each processor is a source and destination.Several processes broadcast different messages.
Used in matrix multiplication (and matrix-vector multiplication).Dual: all-to-all reduction.
26+29-03-2010 MVP'10 - Aalborg University 24
All-to-All Broadcast and Reduction
26+29-03-2010 MVP'10 - Aalborg University 25
All-to-All Broadcast – Rings
0 1 2 3
7 6 5 446
21 3
7 5
0 1 2 3
4567
0 1 2
3456
7 7 0 1
2345
6
etc…
26+29-03-2010 MVP'10 - Aalborg University 26
All-to-All Broadcast Algorithm
Ring: mod p.Receive & send – point-to-point.Initialize the loop.
Forward msg.Accumulate result.
26+29-03-2010 MVP'10 - Aalborg University 27
All-to-All Reduce Algorithm
Accumulate and forward.
Last message for my_id.
26+29-03-2010 MVP'10 - Aalborg University 28
1 2 3 4
5670
All-to-All Reduce – Rings
0 1 2 3
7 6 5 446
21 3
7 5
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
1
2 3 4 5
670
2 3 4 5
6701
26+29-03-2010 MVP'10 - Aalborg University 29
All-to-All Reduce – Rings
0 1 2 3
7 6 5 446
21 3
7 5
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
12
3 4 5 6
702
3 4 5 6
7012 1 0 7
6543
26+29-03-2010 MVP'10 - Aalborg University 30
All-to-All Broadcast – MeshesTwo phases:
All-to-all on rows – messages size m.Collect sqrt(p) messages.
All-to-all on columns – messages size sqrt(p)*m.
26+29-03-2010 MVP'10 - Aalborg University 31
All-to-All Broadcast – Meshes
26+29-03-2010 MVP'10 - Aalborg University 32
Algorithm
26+29-03-2010 MVP'10 - Aalborg University 33
All-to-All Broadcast - Hypercubes
Generalization of the mesh algorithm to logp dimensions.Message size doubles at every step.Number of steps: logp.
26+29-03-2010 MVP'10 - Aalborg University 34
All-to-All Broadcast – Hypercubes
26+29-03-2010 MVP'10 - Aalborg University 35
Algorithm
Loop on the dimensions
Exchange messages
Forward (double size)
26+29-03-2010 MVP'10 - Aalborg University 36
All-to-All Reduction – Hypercubes
Similar patternin reverse order.
Combine results
26+29-03-2010 MVP'10 - Aalborg University 37
Cost Analysis (Time)Ring:
T=(ts + twm)(p-1).Mesh:
T=(ts + twm)(√p-1)+(ts + twm√p) (√p-1)= 2ts(√p – 1) + twm(p-1).
Hypercube:
logp stepsmessage of size 2i-1m.
26+29-03-2010 MVP'10 - Aalborg University 38
Dense to Sparser: Congestion
26+29-03-2010 MVP'10 - Aalborg University 39
All-ReduceEach node starts with a buffer of size m.The final result is the same combination of all buffers on every node.Same as all-to-one reduce + one-to-all broadcast.Different from all-to-all reduce.
1 2 3 4 1234 1234 1234 1234
26+29-03-2010 MVP'10 - Aalborg University 40
All-Reduce AlgorithmUse all-to-all broadcast but
Combine messages instead of concatenating them.The size of the messages does not grow.Cost (in logp steps): T=(ts+twm) logp.
26+29-03-2010 MVP'10 - Aalborg University 41
Prefix-SumGiven p numbers n0,n1,…,np-1 (one on each node), the problem is to compute the sums sk = ∑i
k= 0 ni for all k between 0 and p-1.
Initially, nk is on the node labeled k, and at the end, the same node holds Sk.
26+29-03-2010 MVP'10 - Aalborg University 42
Prefix-Sum Algorithm
All-reduce
Prefix-sum
26+29-03-2010 MVP'10 - Aalborg University 43
Prefix-Sum
0 1
2 3
4
6 7
5
0 1
54
2
6 7
3
0 1
5
2
7
3
4
6 67
45
01
6
4
0
23
2
1 0
3 2
5 4
7 6
0 1
2 3
4 5
6 7
1 00 1
4 5 5 4
Buffer = all-reduce sum
26+29-03-2010 MVP'10 - Aalborg University 44
Scatter and GatherScatter: A node sends a unique message to every other node – unique per node.Gather: Dual operation but the target node does not combine the messages into one.
0 1 2 … 0 1 2 …
M0 M0
M1
M1
M2
M2
Scatter
Gather
26+29-03-2010 MVP'10 - Aalborg University 45
26+29-03-2010 MVP'10 - Aalborg University 46
Cost AnalysisNumber of steps: logp.Size transferred: pm/2, pm/4,…,m.
Geometric sum
Cost T=tslogp+twm(p-1).)222(
1)211(2)
211(2
2...
42
211
211
2...
42
log11
1
1
p
ppp
pppppp
ppppp
pn
nn
n
n
==
−=−−=−−=+++
−
−=++++
++
+
+
26+29-03-2010 MVP'10 - Aalborg University 47
All-to-All Personalized Communication
Each node sends a distinct message to every other node.
0 1 2 … 0 1 2 …
M0,0
M0,1
M0,2
M0,0 M0,1 M0,2M1,0
M1,1
M1,2
M1,0 M1,1 M1,2
M2,0
M2,1
M2,2 M2,0 M2,1 M2,2
26+29-03-2010 MVP'10 - Aalborg University 48
Example: Transpose
26+29-03-2010 MVP'10 - Aalborg University 49
Total Exchange on a Ring
0
5
21
34
0 1 2 3 4 5
0 1 2 3 4 5
0 1 2 3 4 5
0 1 2 3 4 5
0 1 2 3 4 5
0 1 2 3 4 5
1 2 3 4 5 0 2 3 4 5
0 1 3 4 50 1 2 4 50 1 2 3 5
0 1 2 3 4
0
1
2
3
4
5
26+29-03-2010 MVP'10 - Aalborg University 50
Total Exchange on a Ring
0
5
21
4 3
00
11
22
33
44
55 0 1 4 5
0 1 2 5
0 1 2 3 1 2 3 4 2 3 4 5
0 3 4 5
0
1
2
3
4
5
26+29-03-2010 MVP'10 - Aalborg University 51
Cost AnalysisNumber of steps: p-1.Size transmitted: m(p-1),m(p-2)…,m.
)1)(2/()1(1
1−+=+−= ∑
−
=
pmpttmitptT ws
p
iws
Optimal
26+29-03-2010 MVP'10 - Aalborg University 52
Optimal?Check the lowest bound for communication and compare to the one we have.
Average distance a packet travels = p/2.There are p nodes that need to transmit m(p-1) words.Total traffic = m(p-1)*p/2*p.Number of link that support the load = p, so communication time ≥ twm(p-1)p/2.
26+29-03-2010 MVP'10 - Aalborg University 53
Total Exchange on a Mesh
0 1 2
3 4 5
6 7 8
26+29-03-2010 MVP'10 - Aalborg University 54
Total Exchange on a Mesh
0 1 2
3 4 5
6 7 8
26+29-03-2010 MVP'10 - Aalborg University 55
Total Exchange on a Mesh
0 1 2
3 4 5
6 7 8
26+29-03-2010 MVP'10 - Aalborg University 56
Cost AnalysisSubstitute p by √p (number of nodes per dimension).Substitute message size m by m√p.Cost is the same for each dimension.T=(2ts+twmp)(√p-1)
26+29-03-2010 MVP'10 - Aalborg University 57
Total Exchange on a HypercubeGeneralize the mesh algorithm to logp steps = number of dimensions, with 2 nodes per dimension.Same procedure as all-to-all broadcast.
26+29-03-2010 MVP'10 - Aalborg University 58
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 59
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 60
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 61
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 62
Cost Analysis
Number of steps: logp.Size transmitted per step: pm/2.Cost: T=(ts+twmp/2) logp.Optimal?Each node sends and receives m(p-1) words. Average distance = ( logp)/2. Total traffic = p*m(p-1)* logp/2.Number of links = p logp/2.Time lower bound = twm(p-1).
NO
26+29-03-2010 MVP'10 - Aalborg University 63
An Optimal AlgorithmHave every pair of nodes communicate directly with each other – p-1 communication steps – but without congestion.At jth step node i communicates with node (i xor j) with E-cube routing.
26+29-03-2010 MVP'10 - Aalborg University 64
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 65
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 66
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 67
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 68
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
26+29-03-2010 MVP'10 - Aalborg University 69
Total Exchange on a Hypercube
0 1
2 3
4 5
6 7
Etc…
26+29-03-2010 MVP'10 - Aalborg University 70
Cost AnalysisRemark: Transmit less, only what is needed, but more steps.Number of steps: p-1.Transmission: size m per step.Cost: T=(ts+twm)(p-1).Compared withT=(ts+twmp/2) logp.Previous algorithm better for small messages.
26+29-03-2010 MVP'10 - Aalborg University 71
Circular ShiftIt’s a particular permutation.Circular q-shift: Node i sends data to node (i+q) mod p (in a set of p nodes).Useful in some matrix operations and pattern matching.Ring: intuitive algorithm in min{q,p-q}neighbor to neighbor communication steps. Why?
26+29-03-2010 MVP'10 - Aalborg University 72
q mod √p on rowscompensate⎣q / √p⎦ on colums
Circular 5-shifton a mesh.
26+29-03-2010 MVP'10 - Aalborg University 73
Circular Shift on a HypercubeMap a linear array with 2d nodes onto a hypercube of dimension d.Expand q shift as a sum of powers of 2 (e.g. 5-shift = 20+22).Perform the decomposed shifts.Use bi-directional links for “forward” (shift itself) and “backward” (rotation part)… logpsteps.
26+29-03-2010 MVP'10 - Aalborg University 74
Or better:DirectE-cube routing.q-shifts on a8-nodehypercube.
26+29-03-2010 MVP'10 - Aalborg University 75
Improving PerformanceSo far messages of size m were not split.If we split them into p parts:
One-to-all broadcast = scatter + all-to-all broadcast of messages of size m/p.All-to-one reduction = all-to-all reduce + gather of messages of size m/p.All-reduce = all-to-all reduction + all-to-all broadcast of messages of size m/p.