Collective Operations
Dr. Stephen Tse
stse@forbin.qc.edu
908-872-2108
Lesson 12

Collective Communication

• A collective communication is
  – A communication pattern that involves all the processes in a communicator
  – It involves more than two processes.

• Different Collective Communication Operations:
  – Broadcast
  – Gather and Scatter
  – Allgather
  – Alltoall


Consider the following Arrangement

              ----> Data
  Process 0 | A0  A1  A2  A3  A4 . . .
  Process 1 |
  Process 2 |
      :     |
  Process n |
  (Rows are the processes in the communicator; columns are the data items each process holds.)


Broadcast

• A broadcast is a collective communication in which a single process sends the same data to every process in the communicator.

              ----> Data
  Process 0 | A0   ======>   A0
  Process 1 |      bcast     A0
  Process 2 |                A0
  Process 3 |                A0
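A minimal usage sketch of MPI_Bcast in C (the variable n and the value 100 are illustrative, not from the slides): process 0 owns the data and broadcasts it so that every process in MPI_COMM_WORLD ends up with the same copy.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, n = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            n = 100;                      /* the root owns the original data */

        /* every process calls MPI_Bcast; after the call all ranks hold n = 100 */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("Process %d: n = %d\n", rank, n);
        MPI_Finalize();
        return 0;
    }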


Matrix-Vector Product

• If A = (aij) is an m×n matrix and x = (x0, x1, …, xn-1)T is an n-dimensional vector, then the matrix-vector product y = Ax is the m-dimensional vector whose i-th component is yi = ai0 x0 + ai1 x1 + … + ai,n-1 xn-1, i.e. the dot product of row i of A with x.

[Diagram: y = A x, with the rows of A and the corresponding entries of y distributed block-wise across Process 0, Process 1, Process 2, and Process 3.]
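As a reference point, a minimal serial C sketch of the component formula above (the function name mat_vect and the row-major storage of A are illustrative assumptions):

    /* y = A*x for an m x n matrix stored row-major in a[];
       y[i] is the dot product of row i of A with x[] */
    void mat_vect(double a[], double x[], double y[], int m, int n) {
        for (int i = 0; i < m; i++) {
            y[i] = 0.0;
            for (int j = 0; j < n; j++)
                y[i] += a[i*n + j] * x[j];
        }
    }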


A Gather

• A collective communication in which a root process receives data from every other process.
• In order to form the dot product of each row of A with x, we need to gather all of x onto each process.

[Diagram: x0, x1, x2, x3 distributed one entry per process (Process 0 through Process 3), alongside the locally stored rows of A.]


A Scatter

• A collective communication in which a fixed root process sends a distinct collection of data to every other process.
• Scatter the rows of A across the processes.

[Diagram: after the scatter, Process 0 holds the row a00 a01 a02 a03 of A, and Process 1 through Process 3 each hold their own row of A.]


Gather and Scatter

              ----> Data
  Process 0 | A0  A1  A2  A3  A4              A0
  Process 1 |              ====>              A1
  Process 2 |             Scatter             A2
  Process 3 |              <====              A3
  Process 4 |             Gather              A4


Allgather

              ----> Data
  Process 0 | A0          A0 B0 C0 D0 E0
  Process 1 | B0          A0 B0 C0 D0 E0
  Process 2 | C0          A0 B0 C0 D0 E0
  Process 3 | D0          A0 B0 C0 D0 E0
  Process 4 | E0          A0 B0 C0 D0 E0

• Simultaneously gather all of x onto each process.
• Gathering a distributed array to every process.
• It gathers the contents of each process’s send_data into each process’s recv_data.
• After the function returns, all the processes in the communicator will have the result stored in the memory referenced by recv_data.


Alltoall (transpose)

              ----> Data
  Process 0 | A0 A1 A2 A3 A4          A0 B0 C0 D0 E0
  Process 1 | B0 B1 B2 B3 B4          A1 B1 C1 D1 E1
  Process 2 | C0 C1 C2 C3 C4          A2 B2 C2 D2 E2
  Process 3 | D0 D1 D2 D3 D4          A3 B3 C3 D3 E3
  Process 4 | E0 E1 E2 E3 E4          A4 B4 C4 D4 E4

• The heart of the redistribution of the keys is that each process sends its original local keys to the appropriate process.

• This is a collective communication operation in which each process sends a distinct collection of data to every other process.


Tree-Structured Communication

• To improve the coding, we should focus on the distribution of the input data.
• How can we divide the work more evenly among processes?
• We think of the processes as forming a tree, with process 0 as the root.
• During the 1st stage of the data distribution: 0 sends data to 1.
• During the 2nd stage: 0 sends the data to 2, while 1 sends data to 3.
• During the 3rd stage: 0 sends to 4, while 1 sends to 5, 2 sends to 6, and 3 sends to 7.
• So we reduce the input distribution loop from 7 stages to 3 stages.
• In general, if we have p processes, this procedure allows us to distribute the input data in ceil(log2(p)) stages, where ceil(log2(p)) is the smallest whole number greater than or equal to log2(p), called the ceiling of log2(p).

(See the process configuration tree; a code sketch of the staged distribution follows below.)
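A minimal sketch of this staged distribution in C, assuming the data is a single int and using point-to-point MPI_Send/MPI_Recv (the function name tree_distribute and the layout are illustrative assumptions, not the lecture's code):

    #include <mpi.h>

    /* Tree-structured distribution: at each stage, every process with
       rank < stage_size that already holds the data sends it to rank + stage_size. */
    void tree_distribute(int *data, int my_rank, int p, MPI_Comm comm) {
        int stage_size;          /* number of processes that currently hold the data */
        MPI_Status status;

        for (stage_size = 1; stage_size < p; stage_size *= 2) {
            if (my_rank < stage_size && my_rank + stage_size < p) {
                /* I already have the data; pass it one level down the tree */
                MPI_Send(data, 1, MPI_INT, my_rank + stage_size, 0, comm);
            } else if (my_rank >= stage_size && my_rank < 2 * stage_size) {
                /* I receive the data from my parent in the tree */
                MPI_Recv(data, 1, MPI_INT, my_rank - stage_size, 0, comm, &status);
            }
        }
    }

With p = 8 this reproduces exactly the three stages listed above: 0->1, then 0->2 and 1->3, then 0->4, 1->5, 2->6, 3->7.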


Data Distribution Stages

[Diagram: tree of processes 0 through 7 showing the three distribution stages: stage 1: 0->1; stage 2: 0->2 and 1->3; stage 3: 0->4, 1->5, 2->6, 3->7.]

1. This distribution reduces the original p-1 stages to ceil(log2(p)) stages.
2. If p = 7, it reduces the time required for the program to complete the data distribution from 6 stages to 3, a reduction of 50%.
3. There is no canonical choice of ordering.
4. We have to know the topology of the system in order to make a better choice of scheme.


Reduce the Burden of the Final Sum

• In the final summation phase, process 0 always gets a disproportionate amount of work, i.e. computing the global sum of the results from all other processes.

• To accelerate the final phase, we can use the tree concept in reverse to reduce the load on process 0.

• Distribute the work as follows (see the reverse-tree process configuration and the code sketch below):
  – Stage 1:
    1. 4 sends to 0; 5 sends to 1; 6 sends to 2; 7 sends to 3.
    2. 0 adds the integral from 4; 1 adds the integral from 5; 2 adds the integral from 6; 3 adds the integral from 7.
  – Stage 2:
    1. 2 sends to 0; 3 sends to 1.
    2. 0 adds the integral from 2; 1 adds the integral from 3.
  – Stage 3:
    1. 1 sends to 0.
    2. 0 adds the integral from 1.
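A minimal sketch of this reverse-tree (fan-in) summation, assuming the number of processes p is a power of two and each process starts with its partial result in local_sum (the function name tree_sum is illustrative):

    #include <mpi.h>

    /* Reverse-tree global sum: at each stage the upper half of the
       still-active processes send their partial sums to the lower half. */
    double tree_sum(double local_sum, int my_rank, int p, MPI_Comm comm) {
        double received;
        MPI_Status status;
        int half;

        for (half = p / 2; half >= 1; half /= 2) {
            if (my_rank < half) {
                /* receive a partial sum from rank my_rank + half and add it in */
                MPI_Recv(&received, 1, MPI_DOUBLE, my_rank + half, 0, comm, &status);
                local_sum += received;
            } else if (my_rank < 2 * half) {
                /* send my partial sum to rank my_rank - half; I am done after this */
                MPI_Send(&local_sum, 1, MPI_DOUBLE, my_rank - half, 0, comm);
            }
        }
        return local_sum;   /* the complete global sum is valid only on rank 0 */
    }

With p = 8 this reproduces the stages above: 4->0, 5->1, 6->2, 7->3; then 2->0, 3->1; then 1->0.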


Reverse-Tree Process Configuration

[Diagram: reverse tree of processes 0 through 7 showing the three summation stages: stage 1: 4->0, 5->1, 6->2, 7->3; stage 2: 2->0 and 3->1; stage 3: 1->0.]


Reduction Operations

• The “global sum” calculation is an instance of a general class of collective communication operations called reduction operations.

• In a global reduction operation, all the processes in a communicator contribute data, and the data are combined using a binary operation.

• Typical operations are addition, max, min, logical AND, etc.
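A minimal global-sum sketch using MPI_Reduce with the predefined MPI_SUM operation (variable names and values are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        double local_sum, global_sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local_sum = (double) rank;   /* stand-in for each process's partial result */

        /* combine every local_sum with MPI_SUM; the result is defined
           only in global_sum on the root, rank 0 */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        /* other predefined operations include MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND */
        MPI_Finalize();
        return 0;
    }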


Simple Reduce

              ----> Data
  Process 0 | A0 A1 A2          A0+B0+C0  A1+B1+C1  A2+B2+C2
  Process 1 | B0 B1 B2
  Process 2 | C0 C1 C2


Allreduce

• In the simple reduce function, only process 0 receives the global sum result; the other processes do not receive it.

• If we want to use the result for subsequent calculations, we would like each process to end up with the same correct result.

• The obvious approach is to follow the call to MPI_Reduce with a call to MPI_Bcast; MPI also provides MPI_Allreduce, which combines the two (see the sketch below).
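A minimal sketch of both approaches, assuming MPI has already been initialized and each process holds its partial result in local_sum (names and values are illustrative):

    double local_sum = 1.0;   /* stand-in for each process's partial result */
    double global_sum;

    /* Approach 1: reduce onto the root, then broadcast the result to everyone */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Approach 2: MPI_Allreduce does both in one call;
       every process ends up with global_sum */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);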


Every Process Has the Same Result

              ----> Data              Results
  Process 0 | A0 A1 A2          A0+B0+C0  A1+B1+C1  A2+B2+C2
  Process 1 | B0 B1 B2          A0+B0+C0  A1+B1+C1  A2+B2+C2
  Process 2 | C0 C1 C2          A0+B0+C0  A1+B1+C1  A2+B2+C2


Implementation in MPI: MPI_Gather

1. MPI_Gather(sendbuffer, sendcount, sendtype,
              recvbuffer, recvcount, recvtype,
              root rank, comm)

Remarks:
1. All processes in “comm”, including the root, send their “sendbuffer” to the root’s “recvbuffer”.
2. The root collects these “sendbuffer” contents and puts them in rank order in “recvbuffer”.
3. “recvbuffer” is ignored in all processes except the “root”.
4. Its inverse operation is MPI_Scatter().
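A minimal usage sketch of MPI_Gather, gathering one int from every process onto rank 0 (buffer names and values are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, my_value;
        int *all_values = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        my_value = 10 * rank;                       /* each process's contribution */
        if (rank == 0)                              /* only the root needs a recvbuffer */
            all_values = malloc(size * sizeof(int));

        /* root receives one int from every process, stored in rank order */
        MPI_Gather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size; i++)
                printf("from rank %d: %d\n", i, all_values[i]);
            free(all_values);
        }
        MPI_Finalize();
        return 0;
    }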


Implementation in MPI: MPI_Scatter

2. MPI_Scatter(sendbuffer, sendcount, sendtype,
               recvbuffer, recvcount, recvtype,
               root rank, comm)

Remarks:
1. The root sends pieces of “sendbuffer” to all processes, including the root itself.
2. The pieces are taken from “sendbuffer” in rank order; piece i lands in process i’s “recvbuffer”.
3. The root cuts its message into “n” equal parts and then sends them to the “n” processes.
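A minimal usage sketch of MPI_Scatter, with rank 0 handing one row of a small matrix to each process (the sketch assumes it is run with exactly N processes; names are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    #define N 4   /* matrix dimension; run this sketch with exactly N processes */

    int main(int argc, char *argv[]) {
        double A[N * N];      /* the full matrix, significant only on the root */
        double my_row[N];     /* each process receives one row */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)                       /* root fills the matrix row-major */
            for (int i = 0; i < N * N; i++)
                A[i] = (double) i;

        /* root cuts A into N equal pieces of N doubles and sends piece i to rank i */
        MPI_Scatter(A, N, MPI_DOUBLE, my_row, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d got the row starting with %.1f\n", rank, my_row[0]);
        MPI_Finalize();
        return 0;
    }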


Implementation in MPI: MPI_Gatherv

3. MPI_Gatherv(sendbuffer, sendcount, sendtype,
               recvbuffer, recvcounts,
               displacements,   /* integer array of displacements */
               recvtype,
               root rank, comm)

Remarks:
1. This is a more general and more flexible function.
2. It allows a varying count of data from each process.
3. The variation is recorded in “recvcounts” and “displacements”, which are n-dimensional arrays (one entry per process).
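A minimal MPI_Gatherv sketch in which process i contributes i+1 ints and the root assembles the variable-sized blocks contiguously; the recvcounts and displs arrays are the illustrative part:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mycount = rank + 1;                     /* process i sends i+1 ints */
        int *mydata = malloc(mycount * sizeof(int));
        for (int i = 0; i < mycount; i++)
            mydata[i] = rank;

        int *recvcounts = NULL, *displs = NULL, *all = NULL;
        int total = 0;
        if (rank == 0) {                            /* root builds counts and displacements */
            recvcounts = malloc(size * sizeof(int));
            displs     = malloc(size * sizeof(int));
            for (int i = 0; i < size; i++) {
                recvcounts[i] = i + 1;              /* how many ints rank i contributes */
                displs[i]     = total;              /* where rank i's block starts in "all" */
                total += recvcounts[i];
            }
            all = malloc(total * sizeof(int));
        }

        /* varying-size blocks are gathered onto the root in rank order */
        MPI_Gatherv(mydata, mycount, MPI_INT,
                    all, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("root gathered %d ints in total\n", total);

        MPI_Finalize();
        return 0;
    }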


Implementation in MPI: MPI_Allgather

4. MPI_Allgather(sendbuffer, sendcount, sendtype,
                 recvbuffer, recvcount, recvtype,
                 comm)

Remarks:
1. This operation is similar to the all-to-all operation.
2. Instead of specifying a “root”, every process sends its data to all other processes.
3. The block of data sent from process “j” is received by every process and is placed in the “j-th” block of the buffer “recvbuf”.
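A minimal sketch of the matrix-vector use case from the earlier slides: each process owns a block of x and calls MPI_Allgather so that every process ends up with the whole vector (the function name gather_vector and the assumption that the vector length divides evenly are illustrative):

    #include <mpi.h>

    /* Each process contributes its local block of x (local_n entries);
       after the call, every process holds the full vector in x_full,
       whose length is local_n times the number of processes. */
    void gather_vector(double local_x[], double x_full[], int local_n, MPI_Comm comm) {
        MPI_Allgather(local_x, local_n, MPI_DOUBLE,
                      x_full,  local_n, MPI_DOUBLE, comm);
    }

With x_full available on every process, each process can form the dot products of its own rows of A with x, as in the matrix-vector product slides.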


Implementation in MPI: MPI_Allgatherv

5. MPI_Allgatherv(sendbuffer, sendcount, sendtype,
                  recvbuffer, recvcounts,
                  displacements,
                  recvtype,
                  comm)

Remarks:
(1) This is an operation similar to the all-to-all operation.
(2) Instead of specifying a “root”, every process sends its data to all other processes.
(3) The block of data sent from process “j” is received by every process and is placed in the “j-th” block of the buffer “recvbuf”.
(4) But the blocks from different processes need not be uniform in size.


Implementation in MPI: MPI_Alltoall

6. MPI_Alltoall(sendbuffer, sendcount, sendtype,
                recvbuffer, recvcount, recvtype,
                comm)

Remarks:
(1) This is an all-to-all operation.
(2) The “j-th” block sent from process “i” is placed in the “i-th” block of process “j”’s receive buffer.
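A minimal MPI_Alltoall sketch: each process sends one distinct int to every other process, the transpose pattern shown on the Alltoall slide (names and values are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendbuf = malloc(size * sizeof(int));
        int *recvbuf = malloc(size * sizeof(int));
        for (int j = 0; j < size; j++)
            sendbuf[j] = 100 * rank + j;    /* block j is destined for process j */

        /* block j of sendbuf goes to process j;
           block i of recvbuf arrived from process i */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        printf("rank %d received %d from rank 0\n", rank, recvbuf[0]);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }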


Implementation in MPI: MPI_Alltoallv

7. MPI_Alltoallv(sendbuffer, sendcounts, s-displacements, sendtype,
                 recvbuffer, recvcounts, r-displacements, recvtype,
                 comm)

Remarks:
(1) This is an all-to-all operation with varying block sizes.
(2) The “j-th” block sent from process “i” is placed in the “i-th” block of process “j”’s receive buffer.
