(How to Implement) Basic Communication...

75
(How to Implement) Basic Communication Operations Alexandre David 1.2.05 [email protected]

Transcript of (How to Implement) Basic Communication...

Page 1: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

(How to Implement)Basic Communication Operations

Alexandre David1.2.05

[email protected]

Page 2: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 2

Overview

One-to-all broadcast & all-to-one reductionAll-to-all broadcast and reductionAll-reduce and prefix-sum operationsScatter and GatherAll-to-All Personalized CommunicationCircular ShiftImproving the Speed of Some Communication Operations

Page 3: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 3

Collective Communication Operations

Represent regular communication patterns.Used extensively in most data-parallel algorithms.Critical for efficiency.Available in most parallel libraries.Very useful to “get started” in parallel processing.Basic model: ts+mtw time for exchanging a m-word message with cut-through routing.

Page 4: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 4

Interesting:

To know:Data transfer time is roughly the same between all pairs of nodes.Homogeneity true on modern hardware (randomized routing, cut-through routing…).

ts+mtwAdjust tw for congestion: effective tw.

Model: bidirectional links, single port.Communication with point-to-point primitives.

Page 5: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 5

Broadcast/ReductionOne-to-all broadcast:

Single process sends identical data to all (or subset of) processes.

All-to-one reduction:Dual operation.P processes have m words to send to one destination.Parts of the message need to be combined.

Page 6: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 6

Broadcast/Reduction

Broadcast Reduce

Page 7: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 7

One-to-All Broadcast – Ring/Linear Array

Naïve approach: send sequentially.Bottleneck.Poor utilization of the network.

Recursive doubling:Broadcast in logp steps (instead of p).Divide-and-conquer type of algorithm.Reduction is similar.

Page 8: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 8

Recursive Doubling

0 1 2 3

7 6 5 4

1

4

2

6

2

2

3 3

33

1 3

7 5

Page 9: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 9

Example: Matrix*Vector

1) 1->all

2) Compute

3) All->1

Page 10: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 10

One-to-All Broadcast – MeshExtensions of the linear array algorithm.

Rows & columns = arrays.Broadcast on a row, broadcast on columns.Similar for reductions.Generalize for higher dimensions (cubes…).

Page 11: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 11

Broadcast on a Mesh

Page 12: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 12

One-to-All Broadcast – Hypercube

Hypercube with 2d nodes = d-dimensional mesh with 2 nodes in each direction.Similar algorithm in d steps.Also in logp steps.Reduction follows the same pattern.

Page 13: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 13

Broadcast on a Hypercube

Page 14: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 14

All-to-One BroadcastBalanced Binary Tree

Processing nodes = leaves.Hypercube algorithm maps well.Similarly good w.r.t. congestion.

Page 15: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 15

Broadcast on a Balanced Binary Tree

Page 16: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 16

AlgorithmsSo far we saw pictures.Not enough to implement.Precise description

to implement.to analyze.

Description for hypercube.Execute the following procedure on all the nodes.

Page 17: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 17

Broadcast Algorithm

000 001101

100

010

110 111

011

111

011 001 000

011

011

001

001 000

000

000000

Current dimension

Page 18: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 18

Broadcast Algorithm

000 001101

100

010

110 111

011

001 000

Page 19: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 19

Broadcast Algorithm

000 001101

100

010

110 111

011

000

Page 20: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 20

Algorithm For Any Source

Page 21: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 21

Reduce Algorithm

In a nutshell:reverse the previous one.

Page 22: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 22

Cost Analysis

p processes → logp steps (point-to-pointtransfers in parallel).Each transfer has a time cost ofts+twm.Total time: T=(ts+twm) logp.

Page 23: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 23

All-to-All Broadcast and Reduction

Generalization of broadcast:Each processor is a source and destination.Several processes broadcast different messages.

Used in matrix multiplication (and matrix-vector multiplication).Dual: all-to-all reduction.

Page 24: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 24

All-to-All Broadcast and Reduction

Page 25: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 25

All-to-All Broadcast – Rings

0 1 2 3

7 6 5 446

21 3

7 5

0 1 2 3

4567

0 1 2

3456

7 7 0 1

2345

6

etc…

Page 26: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 26

All-to-All Broadcast Algorithm

Ring: mod p.Receive & send – point-to-point.Initialize the loop.

Forward msg.Accumulate result.

Page 27: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 27

All-to-All Reduce Algorithm

Accumulate and forward.

Last message for my_id.

Page 28: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 28

1 2 3 4

5670

All-to-All Reduce – Rings

0 1 2 3

7 6 5 446

21 3

7 5

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

1

2 3 4 5

670

2 3 4 5

6701

Page 29: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 29

All-to-All Reduce – Rings

0 1 2 3

7 6 5 446

21 3

7 5

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

12

3 4 5 6

702

3 4 5 6

7012 1 0 7

6543

Page 30: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 30

All-to-All Broadcast – MeshesTwo phases:

All-to-all on rows – messages size m.Collect sqrt(p) messages.

All-to-all on columns – messages size sqrt(p)*m.

Page 31: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 31

All-to-All Broadcast – Meshes

Page 32: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 32

Algorithm

Page 33: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 33

All-to-All Broadcast - Hypercubes

Generalization of the mesh algorithm to logp dimensions.Message size doubles at every step.Number of steps: logp.

Page 34: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 34

All-to-All Broadcast – Hypercubes

Page 35: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 35

Algorithm

Loop on the dimensions

Exchange messages

Forward (double size)

Page 36: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 36

All-to-All Reduction – Hypercubes

Similar patternin reverse order.

Combine results

Page 37: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 37

Cost Analysis (Time)Ring:

T=(ts + twm)(p-1).Mesh:

T=(ts + twm)(√p-1)+(ts + twm√p) (√p-1)= 2ts(√p – 1) + twm(p-1).

Hypercube:

logp stepsmessage of size 2i-1m.

Page 38: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 38

Dense to Sparser: Congestion

Page 39: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 39

All-ReduceEach node starts with a buffer of size m.The final result is the same combination of all buffers on every node.Same as all-to-one reduce + one-to-all broadcast.Different from all-to-all reduce.

1 2 3 4 1234 1234 1234 1234

Page 40: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 40

All-Reduce AlgorithmUse all-to-all broadcast but

Combine messages instead of concatenating them.The size of the messages does not grow.Cost (in logp steps): T=(ts+twm) logp.

Page 41: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 41

Prefix-SumGiven p numbers n0,n1,…,np-1 (one on each node), the problem is to compute the sums sk = ∑i

k= 0 ni for all k between 0 and p-1.

Initially, nk is on the node labeled k, and at the end, the same node holds Sk.

Page 42: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 42

Prefix-Sum Algorithm

All-reduce

Prefix-sum

Page 43: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 43

Prefix-Sum

0 1

2 3

4

6 7

5

0 1

54

2

6 7

3

0 1

5

2

7

3

4

6 67

45

01

6

4

0

23

2

1 0

3 2

5 4

7 6

0 1

2 3

4 5

6 7

1 00 1

4 5 5 4

Buffer = all-reduce sum

Page 44: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 44

Scatter and GatherScatter: A node sends a unique message to every other node – unique per node.Gather: Dual operation but the target node does not combine the messages into one.

0 1 2 … 0 1 2 …

M0 M0

M1

M1

M2

M2

Scatter

Gather

Page 45: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 45

Page 46: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 46

Cost AnalysisNumber of steps: logp.Size transferred: pm/2, pm/4,…,m.

Geometric sum

Cost T=tslogp+twm(p-1).)222(

1)211(2)

211(2

2...

42

211

211

2...

42

log11

1

1

p

ppp

pppppp

ppppp

pn

nn

n

n

==

−=−−=−−=+++

−=++++

++

+

+

Page 47: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 47

All-to-All Personalized Communication

Each node sends a distinct message to every other node.

0 1 2 … 0 1 2 …

M0,0

M0,1

M0,2

M0,0 M0,1 M0,2M1,0

M1,1

M1,2

M1,0 M1,1 M1,2

M2,0

M2,1

M2,2 M2,0 M2,1 M2,2

Page 48: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 48

Example: Transpose

Page 49: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 49

Total Exchange on a Ring

0

5

21

34

0 1 2 3 4 5

0 1 2 3 4 5

0 1 2 3 4 5

0 1 2 3 4 5

0 1 2 3 4 5

0 1 2 3 4 5

1 2 3 4 5 0 2 3 4 5

0 1 3 4 50 1 2 4 50 1 2 3 5

0 1 2 3 4

0

1

2

3

4

5

Page 50: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 50

Total Exchange on a Ring

0

5

21

4 3

00

11

22

33

44

55 0 1 4 5

0 1 2 5

0 1 2 3 1 2 3 4 2 3 4 5

0 3 4 5

0

1

2

3

4

5

Page 51: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 51

Cost AnalysisNumber of steps: p-1.Size transmitted: m(p-1),m(p-2)…,m.

)1)(2/()1(1

1−+=+−= ∑

=

pmpttmitptT ws

p

iws

Optimal

Page 52: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 52

Optimal?Check the lowest bound for communication and compare to the one we have.

Average distance a packet travels = p/2.There are p nodes that need to transmit m(p-1) words.Total traffic = m(p-1)*p/2*p.Number of link that support the load = p, so communication time ≥ twm(p-1)p/2.

Page 53: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 53

Total Exchange on a Mesh

0 1 2

3 4 5

6 7 8

Page 54: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 54

Total Exchange on a Mesh

0 1 2

3 4 5

6 7 8

Page 55: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 55

Total Exchange on a Mesh

0 1 2

3 4 5

6 7 8

Page 56: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 56

Cost AnalysisSubstitute p by √p (number of nodes per dimension).Substitute message size m by m√p.Cost is the same for each dimension.T=(2ts+twmp)(√p-1)

Page 57: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 57

Total Exchange on a HypercubeGeneralize the mesh algorithm to logp steps = number of dimensions, with 2 nodes per dimension.Same procedure as all-to-all broadcast.

Page 58: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 58

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 59: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 59

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 60: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 60

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 61: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 61

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 62: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 62

Cost Analysis

Number of steps: logp.Size transmitted per step: pm/2.Cost: T=(ts+twmp/2) logp.Optimal?Each node sends and receives m(p-1) words. Average distance = ( logp)/2. Total traffic = p*m(p-1)* logp/2.Number of links = p logp/2.Time lower bound = twm(p-1).

NO

Page 63: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 63

An Optimal AlgorithmHave every pair of nodes communicate directly with each other – p-1 communication steps – but without congestion.At jth step node i communicates with node (i xor j) with E-cube routing.

Page 64: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 64

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 65: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 65

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 66: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 66

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 67: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 67

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 68: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 68

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Page 69: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 69

Total Exchange on a Hypercube

0 1

2 3

4 5

6 7

Etc…

Page 70: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 70

Cost AnalysisRemark: Transmit less, only what is needed, but more steps.Number of steps: p-1.Transmission: size m per step.Cost: T=(ts+twm)(p-1).Compared withT=(ts+twmp/2) logp.Previous algorithm better for small messages.

Page 71: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 71

Circular ShiftIt’s a particular permutation.Circular q-shift: Node i sends data to node (i+q) mod p (in a set of p nodes).Useful in some matrix operations and pattern matching.Ring: intuitive algorithm in min{q,p-q}neighbor to neighbor communication steps. Why?

Page 72: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 72

q mod √p on rowscompensate⎣q / √p⎦ on colums

Circular 5-shifton a mesh.

Page 73: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 73

Circular Shift on a HypercubeMap a linear array with 2d nodes onto a hypercube of dimension d.Expand q shift as a sum of powers of 2 (e.g. 5-shift = 20+22).Perform the decomposed shifts.Use bi-directional links for “forward” (shift itself) and “backward” (rotation part)… logpsteps.

Page 74: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 74

Or better:DirectE-cube routing.q-shifts on a8-nodehypercube.

Page 75: (How to Implement) Basic Communication Operationspeople.cs.aau.dk/~adavid/teaching/MVP-10/12-MPI... · 26+29-03-2010 MVP'10 - Aalborg University 3 Collective Communication Operations

26+29-03-2010 MVP'10 - Aalborg University 75

Improving PerformanceSo far messages of size m were not split.If we split them into p parts:

One-to-all broadcast = scatter + all-to-all broadcast of messages of size m/p.All-to-one reduction = all-to-all reduce + gather of messages of size m/p.All-reduce = all-to-all reduction + all-to-all broadcast of messages of size m/p.