Page 1

Parallel Programming – Process-Based Communication Operations

David Monismith, CS599

Based upon notes from Introduction to Parallel Programming, Second Edition by Grama, Gupta, Karypis, and Kumar and from CS550

Page 2

Last time

• We reviewed the Scan pattern.

• We will continue with OpenMP scheduling operations later in the course.

• For now, we are going to move on to MPI so we can make use of multi-process programming.

Page 3

Interprocess Communication

• Often communication between processes is necessary.

• Communication may occur sporadically from one process to another.

• It may also occur in well-defined patterns, some of which are collective (used by all processes).

• Collective patterns are frequently used in parallel algorithms.

Page 4

Send and Receive (Abstract operations)

• Point-to-point (i.e., process-to-process) communication occurs as send and receive operations.

• send – send data from this process to a process identified by rank.
  – Example: send(myMessage, rank)

• receive – receive data in this process from the process with identifier rank.
  – Example: receive(receivedMessage, rank)

Page 5

MPI Message Passing

• Send and receive are implemented concretely in MPI using the MPI_Send and MPI_Recv functions.

• MPI, the Message Passing Interface, allows for interprocess communication (IPC) between running processes, even processes launched from the same source code.

Page 6

Using MPI

• Processes use MPI by using #include "mpi.h" or #include <mpi.h>, depending upon the system and MPI stack.

• MPI is started in a program using:

MPI_Init(&argc, &argv);

• and ended with:

MPI_Finalize();

• These calls function almost like curly brackets that open and close the parallel portion of the program.

Page 7

Using MPI on LittleFe

• Anything between the MPI_Init and MPI_Finalize statements runs in as many processes as are requested by "mpirun" at the command line.

• For example, on LittleFe:

mpirun -np 12 -machinefile machines-openmpi prog1.exe

• runs 12 processes using the executable code from prog1.exe.

Page 8

Try running MPI Hello World on LittleFe1 or LittleFe2
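
A minimal sketch of such a hello-world program is given below; the printed text is illustrative, but the calls (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize) are the standard ones.

    /* hello_mpi.c - a minimal MPI hello-world sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start the parallel portion */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identifier  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                         /* end the parallel portion */
        return 0;
    }

Compiled with mpicc and launched with mpirun (or ibrun, as on the next slide), every requested process prints its own rank.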

Page 9

Using MPI on Stampede

• On Stampede, one specifies the number of tasks in a batch script using the -n option.

• Example:
  – #SBATCH -n 32
  – Specifies the use of 32 tasks (MPI processes, one per CPU core)

• After all options have been specified, an MPI program is started in the script using ibrun.

• Example:
  – ibrun prog1.exe

Page 10

Identifying Processes in MPI

• The MPI_Comm_rank and MPI_Comm_size functions get the rank (process identifier) and the number of processes (e.g., the value 12 after -np and the value 32 after #SBATCH -n on the previous slides).

• These were previously reviewed in class.

Page 11

MPI Message Passing

• Messages are passed in MPI using MPI_Send and MPI_Recv

• MPI_Send - sends a message of a given size with a given type to a process with a specific rank.

• MPI_Recv - receives a message of a maximum size with a given type from a process with a specific rank.

• MPI_COMM_WORLD – the default communicator, i.e., the "world" containing all of the processes. This is a predefined constant.

Page 12

Sending and Receiving Messages

• MPI_Send and MPI_Recv have the following parameters:

MPI_Send( pointer to message,
          message size,
          message type,
          process rank to send to,
          message tag or id,
          MPI_COMM_WORLD )

MPI_Recv( pointer to variable used to receive,
          maximum receive size,
          message type,
          process rank to receive from,
          message tag or id,
          MPI_COMM_WORLD,
          MPI_STATUS_IGNORE )
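
• As a concrete example of these parameters, the following sketch (an illustration, not code from the course) has rank 0 send a single integer to rank 1; it must be run with at least two processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                                   /* message contents */
            /* buffer, count, type, destination rank, tag, communicator */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* buffer, max count, type, source rank, tag, communicator, status */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }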

Page 13

MPI Types

MPI_CHAR     MPI_LONG
MPI_SHORT    MPI_FLOAT
MPI_INT      MPI_DOUBLE

• Many other types exist.

• These types are analogous to C primitive types.

• See the MPI reference manual for more examples.

Page 14

Blocking I/O

• MPI_Send and MPI_Recv are blocking I/O operations.

• In blocking I/O, when a message is sent, a process waits until it has acknowledgement that the message has been received before it can continue processing.

• Similarly, when a message is requested (a receive method/function is called) the program waits until the message has been received before continuing processing.

Page 15

Blocking I/O Example

    Process 1                        Process 2
    +--------------+  1. send msg  +--------------+
    | MPI_Send     |  ---------->  | MPI_Recv     |
    | wait for ack |               | wait for msg |
    | ack received |  <----------  | ack receipt  |
    | 3b. continue |  2. send ack  | 3a. continue |
    +--------------+               +--------------+

Page 16

Before we continue…

• Try #1 from worksheet 6, and DON'T PANIC!!!

• Most functional MPI programs can be implemented with only 6 functions:
  – MPI_Init
  – MPI_Finalize
  – MPI_Send
  – MPI_Recv
  – MPI_Comm_rank
  – MPI_Comm_size

Page 17

Why are Send and Receive Important?

• MPI is not the only framework in which send and receive operations are used.

• Send and receive exist in Java, Android Services, iOS, Web Services (e.g., GET and POST), etc.

• It is likely that you have used these operations before and that you will use them again.

Page 18

Collective Message Patterns

• We will investigate commonly used collective message communication patterns.

• Collective means that the functions representing these patterns must be called in ALL processes.

• These include:
  – Broadcast
  – Reduction
  – All-to-all
  – Scatter
  – Gather
  – Scan
  – And more

• Communication patterns on simple interconnect networks will also be covered for linear arrays, meshes, and hypercubes.

Page 19

One to All Broadcast

• Send identical data from one process to all other processes or a subset thereof.

• Initially, only the root process has the data (of size m).

• After the operation completes, there are p copies of the data, where p is the number of processes to which the data was broadcast.

• Implemented by MPI_Bcast
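
• A minimal sketch of MPI_Bcast usage follows, assuming the root is rank 0 and the data is a single integer (the value 99 is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            data = 99;        /* only the root has the data initially */

        /* buffer, count, type, root rank, communicator; every process calls it */
        MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("Rank %d now has data = %d\n", rank, data);
        MPI_Finalize();
        return 0;
    }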

Page 20

All-to-One Reduction

• Each of p processes starts with a buffer B of size m.

• Data from all processes is combined using an associative operator such as +, *, min, max, etc.

• Data is accumulated at a single process into one buffer B_reduce of size m.

• Element i of B_reduce is the sum, product, minimum, maximum, etc., of the ith elements of all of the original buffers B.

• This reduction is implemented by MPI_Reduce
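
• A minimal sketch of MPI_Reduce follows, assuming a sum reduction of one integer per process to rank 0 (each process's contribution of rank + 1 is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, local, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = rank + 1;    /* each process contributes its own value */

        /* send buffer, recv buffer, count, type, operator, root, communicator */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum of all contributions = %d\n", sum);

        MPI_Finalize();
        return 0;
    }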

Page 21

Broadcasting

• On a ring or linear array, the naïve way to send data is to send p – 1 messages from the source to the other p – 1 processes.

• After the first message is sent, recursive doubling can be used to send the message to two processes.

• That is, the message can be sent from both the original source and the first destination to two additional processes.

• This algorithm can be repeated to reduce the number of steps required for the broadcast to log(p).

• Note that on a linear array, the initial message must be sent to the node farthest away (half the array length from the source); thereafter, the distances are halved.

Page 22

Mesh

• Communication on a mesh can be regarded as an extension of the linear array.

• A 2D mesh of p nodes consists of sqrt(p) rows, each of which is a linear array of sqrt(p) nodes.

• Therefore, the message can first be broadcast from the root along its row to the other sqrt(p) - 1 nodes in that row.

• From there, messages may be broadcast in parallel down each of the sqrt(p) columns.

• A similar process can be carried out with a hypercube of size 2^d, as it can be modeled as a d-dimensional mesh with 2 nodes per dimension.

• Therefore, on a hypercube, a broadcast may be carried out in d steps.

Page 23

Hypercube Broadcast Algorithm

one_to_all_bc(d, my_id, X)
  mask = 2^d - 1                  // Set d bits of mask to 1
  for i = d - 1 downto 0          // Outer loop
    mask = mask XOR 2^i           // Set bit i of mask to 0
    if (my_id AND mask) == 0      // If lower i bits of my_id are 0
      if (my_id AND 2^i) == 0
        dest = my_id XOR 2^i
        send X to dest
      else
        source = my_id XOR 2^i
        recv X from source
      endif
    endif
  endfor
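
For comparison, the following is a sketch (not the textbook's code) of the same algorithm expressed with MPI point-to-point calls, assuming the number of processes is an exact power of two and the source of the broadcast is rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int my_id, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int d = 0;                      /* d = log2(p), assuming p is a power of 2 */
        while ((1 << d) < p) d++;

        int x = (my_id == 0) ? 42 : 0;  /* only the root starts with the data */
        int mask = (1 << d) - 1;        /* set d bits of mask to 1 */

        for (int i = d - 1; i >= 0; i--) {
            mask ^= (1 << i);               /* clear bit i of mask */
            if ((my_id & mask) == 0) {      /* lower i bits of my_id are 0 */
                int partner = my_id ^ (1 << i);
                if ((my_id & (1 << i)) == 0)
                    MPI_Send(&x, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
                else
                    MPI_Recv(&x, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            }
        }
        printf("Rank %d has x = %d\n", my_id, x);
        MPI_Finalize();
        return 0;
    }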

Page 24

All-to-All Broadcast and Reduction

• In an all-to-all broadcast, every one of the p processes simultaneously initiates a broadcast.

• Each process sends the same message of size m to every other process, but different processes may broadcast different messages.

• This is useful in matrix multiplication and matrix-vector multiplication.

• Naïve implementations may take p times as long as the one-to-all broadcast.

• It is possible to implement the all-to-all algorithm in such a manner as to take advantage of the interconnect network, so that all messages traversing the same path at the same time are concatenated.

• The dual operation of such a broadcast is an all-to-all reduction, in which every node is the destination of an all-to-one reduction.

• These operations are implemented via the MPI_Allgather (all-to-all broadcast) and MPI_Reduce_scatter (all-to-all reduction) operations.
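
• A minimal sketch of MPI_Allgather follows, in which each process contributes one integer and every process receives the full set (the fixed array bound of 64 is an assumption of this sketch):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        int mine, everyone[64];       /* assumes at most 64 processes */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        mine = rank * 10;             /* this process's contribution */

        /* send buf, send count, send type, recv buf, recv count per process,
           recv type, communicator; called by every process */
        MPI_Allgather(&mine, 1, MPI_INT, everyone, 1, MPI_INT, MPI_COMM_WORLD);

        printf("Rank %d sees first and last values %d and %d\n",
               rank, everyone[0], everyone[size - 1]);
        MPI_Finalize();
        return 0;
    }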

Page 25

Ring All-to-All Broadcast

• Consider a ring topology.

• All links can be kept busy until the all-to-all broadcast is complete.

• An algorithm for such a broadcast follows below.

all_to_all_ring_bc(myId, myMsg, p, result)
  left  = (myId - 1 + p) % p    // add p so the modulus stays non-negative
  right = (myId + 1) % p
  result = myMsg
  msg = result
  for i = 1 to p - 1
    send msg to right
    recv msg from left
    result = concat(result, msg)
  endfor
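
The following is a sketch (not the course's code) of this ring algorithm using MPI point-to-point operations; MPI_Sendrecv_replace is used so that the simultaneous blocking sends around the ring cannot deadlock, and each process contributes a single integer rather than a general message:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int myId, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myId);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int left  = (myId - 1 + p) % p;
        int right = (myId + 1) % p;

        int result[64];               /* assumes at most 64 processes */
        result[myId] = myId * 100;    /* this process's own message */
        int msg = result[myId];

        for (int i = 1; i < p; i++) {
            /* pass the current message to the right neighbor and receive a new
               one from the left; after step i we hold the message that
               originated i positions to our left */
            MPI_Sendrecv_replace(&msg, 1, MPI_INT, right, 0, left, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            result[(myId - i + p) % p] = msg;
        }

        printf("Rank %d gathered the message from rank 0: %d\n", myId, result[0]);
        MPI_Finalize();
        return 0;
    }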

Page 26

Ring All-to-All Reduce Algorithm

all_to_all_ring_reduce(myId, myMsg, p, result)
  left  = (myId - 1 + p) % p
  right = (myId + 1) % p
  recvVal = 0
  for i = 1 to p - 1
    j = (myId + i) % p          // element to forward at this step
    temp = myMsg[j] + recvVal
    send temp to left
    recv recvVal from right
  endfor
  result = myMsg[myId] + recvVal

Page 27

Mesh and Hypercube Implementations

• Mesh and hypercube implementations can be constructed by expanding upon the linear array and ring algorithms; on a mesh, the operation is carried out in two phases (first along rows, then along columns).

• The hypercube algorithm is a generalization of the mesh algorithm to log(p) dimensions.

• It is important to realize that such implementations are used to take advantage of the existing interconnect networks on large scale systems.

Page 28

Scatter and Gather

• Scatter and gather are personalized operations.

• Scatter – a single node sends a unique message of size m to every other node.

• Scatter is a one-to-many personalized communication.

• Gather – a single node collects unique messages from each node.

• These operations are implemented using MPI_Scatter and MPI_Gather, respectively.
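
• A minimal sketch of MPI_Scatter and MPI_Gather follows, assuming rank 0 is the root and each process handles one integer; the doubling step is just an illustrative computation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        int sendbuf[64], recvbuf[64], mine;   /* assumes at most 64 processes */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)                        /* only the root fills the send buffer */
            for (int i = 0; i < size; i++)
                sendbuf[i] = i;

        /* scatter one int to each process: send buf, send count per process,
           type, recv buf, recv count, type, root, communicator */
        MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

        mine *= 2;                            /* each process works on its piece */

        /* gather the results back at the root */
        MPI_Gather(&mine, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Root gathered: first = %d, last = %d\n",
                   recvbuf[0], recvbuf[size - 1]);

        MPI_Finalize();
        return 0;
    }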

Page 29

MPI Operations

• One-to-all broadcast – MPI_Bcast
• All-to-one reduction – MPI_Reduce
• All-to-all broadcast – MPI_Allgather
• All-to-all reduction – MPI_Reduce_scatter
• All-reduce – MPI_Allreduce
• Gather – MPI_Gather, MPI_Gatherv
• Scatter – MPI_Scatter, MPI_Scatterv
• All-to-all personalized – MPI_Alltoall
• Scan – MPI_Scan

Page 30

Next Time: All-to-All Personalized Communication

• Total exchange

• Used in FFT, matrix transpose, sample sort, and parallel DB join operations

• Different algorithms exist for:
  – Linear Array
  – Mesh
  – Hypercube