MPI: Parallel Programming for Extreme Machines
Si Hammond, High Performance Systems Group
1
Quick Introduction
Si Hammond, WPRF/PhD Research student, High Performance Systems Group, Computer Science
Platforms - Cray XT3/4, CSC Francesca, small AMD Opteron/Intel Xeon clusters
2
What’s in this talk?
Parallel programming methodologies - why MPI?
Where can I use MPI?
MPI in action
Getting MPI to work at Warwick
Examples
3
Programming in Parallel
[Diagram]
One computer, multiple processors/multiple cores: OpenMP, threads.
Many computers (each with multiple processors/cores) connected by a network: MPI, network sockets.
4
What is MPI?
Message Passing Interface
Programming paradigm for writing parallel codes
Defines what ‘messages’ of data are passed between
processes - how this happens at the underlying layers
doesn’t matter.
Runs on everything from single multi-core/multi-processor machines and small distributed clusters, right the way up to IBM BlueGene and IBM RoadRunner
5
Why learn MPI?
Used in almost every major parallel scientific code
Can be used with Fortran, C, C++ (Java almost)
So far the only messaging paradigm to scale to 100k+
nodes
Highly tuned and optimised - if you want parallel codes that perform well, you really need this.
6
MPI - The Theory
MPI Programs are Single Program Multiple Data
Same executable running multiple times but each will
have its own separate data - there is no global
memory.
MPI is programmer driven - you have to write the parallelism yourself; there are no compiler directives to do it for you (unlike OpenMP).
7
MPI in Action
8
MPI in Action
The first step is assigning each process a rank.
MPI will do this for you
automatically.
Ranks start at 0 and go
to n-1
Usually refer to rank 0 as
‘the root’
[Diagram: two processes, Rank 0 and Rank 1, connected through the MPI library]
9
Sending a Message
Let's send a message from 0 to 1.
Rank 0 posts a send to
rank 1.
Rank 1 posts a receive
from rank 0.
Data is exchanged.
[Diagram: Rank 0 calls Send(1, "Hello"); Rank 1 calls Recv(0); the MPI library carries the data between them]
10
MPI Program Outline in C
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  // Program goes here
  MPI_Finalize();
  return 0;
}
MPI_Init must come before any use of MPI functions.
MPI_Finalize must be the last use of MPI.
11
What Rank Am I?
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int my_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  if(my_rank == 0) printf("I am the boss\n"); else printf("I am not the boss :( \n");
  MPI_Finalize();
  return 0;
}
12
What Rank Am I?
(Same code as the previous slide, repeated for the callout below.)
MPI_COMM_WORLD is the
communicator group, we’ll
come back to this
13
What Rank Am I?
(Same code again.)
Parameters to MPI are usually passed as pointers.
14
What Rank Am I?
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int my_rank; int world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  if(my_rank == 0) printf("I am the boss\n"); else printf("I am not the boss :( \n");
  MPI_Finalize();
  return 0;
}
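To try this out (a sketch - the launcher flags depend on your MPI installation, and 'rank_example' is just an illustrative executable name):
mpirun -np 4 ./rank_example
With 4 ranks you should see one "I am the boss" line and three "I am not the boss :(" lines, in no particular order.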
15
Send a message
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int my_rank; int a[1]; a[0] = 42; int tag = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  if(my_rank == 0) {
    MPI_Send(a, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
  } else {
    MPI_Status status;
    MPI_Recv(a, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  }
  MPI_Finalize();
  return 0;
}
Send: data from array 'a', 1 item of type MPI_INT, to rank 1.
Recv: data into array 'a', 1 item of type MPI_INT, from rank 0.
(Note: as written this assumes exactly two ranks - with more, every non-zero rank would wait for a message that never arrives.)
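To check the message actually arrived, the receiving branch could print the value after the Recv - a hypothetical addition to the slide's code, assuming <stdio.h> is included:
printf("Rank %d received %d\n", my_rank, a[0]);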
16
MPI - Data Types
The purpose of MPI data types is to let the programmer say how many items of data should be transmitted without worrying about how many bytes of memory that is.
The compiler and MPI will work this out for you!
It's a good idea to stick to these - they will be correct for your architecture, e.g. 64-bit, 32-bit etc.
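For example, a minimal sketch (the buffer name is illustrative): to send 16 doubles you pass a count of 16 and MPI_DOUBLE, and MPI works out the number of bytes for your platform.
double values[16];
// ... fill values ...
// the count is in items, not bytes - MPI_DOUBLE tells MPI how big each item is on this machine
MPI_Send(values, 16, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);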
17
MPI - Data Types
MPI_CHAR = signed char
MPI_SHORT = signed short int
MPI_INT = signed int
MPI_LONG = signed long int
MPI_FLOAT = float
MPI_DOUBLE = double
MPI_LONG_DOUBLE = long double
18
MPI Gets Serious
19
MPI Collectives
The real power of MPI is in the advanced data-handling functions - these are known as the collectives.
Typical situations:
Multiple ranks each hold a piece of data you need to carry out some operation with.
One rank has a big piece of data you need to split up between multiple ranks.
Collectives are highly tuned for these cases = fast performance
20
MPI Broadcast
MPI_Bcast(void* msg, int
count, MPI_Datatype datatype,
int root, MPI_Comm comm);
e.g. MPI_Bcast(a, 16, MPI_INT,
0, MPI_COMM_WORLD);
Broadcasts data from the rank which matches root to all ranks in the communicator group.
[Diagram: the root rank sending the same data to every other rank]
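A minimal sketch (array name and size are illustrative, assuming the usual MPI_Init/Comm_rank boilerplate): rank 0 fills a buffer, then every rank - including rank 0 - calls MPI_Bcast with the same arguments, and afterwards all of them hold the same data.
int a[16];
if(my_rank == 0) { for(int i = 0; i < 16; i++) a[i] = i; }  // only the root fills the buffer
MPI_Bcast(a, 16, MPI_INT, 0, MPI_COMM_WORLD);               // every rank makes the same call
// a[] now holds 0..15 on every rank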
21
MPI Reduce
MPI_Reduce(void* operand,
void* result, int count,
MPI_Datatype datatype,
MPI_Op operation, int root,
MPI_Comm comm);
Reduces data from all the ranks in the world down to one single rank (the root).
[Diagram: every rank sending its operand to the root]
22
MPI Reduce
Reduce operations essentially apply a mathematical operation to pieces of data held on the different ranks.
MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD etc.
You must not specify the operand and result to be the same location in memory - the result is only meaningful on the root; on every other rank it is not used.
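A minimal sketch of a sum reduction (names are illustrative, assuming the usual boilerplate and <stdio.h>): each rank contributes a value and only the root ends up with the total.
int my_value = my_rank;   // each rank's contribution
int total = 0;            // separate buffer for the result, as required
MPI_Reduce(&my_value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if(my_rank == 0) printf("Sum of all ranks = %d\n", total);  // total is only meaningful on the root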
23
MPI All-Reduce
MPI_Allreduce(void* operand,
void* result, int count,
MPI_Datatype datatype,
MPI_Op operation, MPI_Comm
comm);
Reduces data from all ranks and delivers the combined result to every rank in the world - note there is no root parameter.
[Diagram: all ranks exchanging data so that every rank holds the result]
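The same sum as above, but now every rank receives the answer (again a sketch with illustrative names):
int my_value = my_rank, total = 0;
MPI_Allreduce(&my_value, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
// total now holds the sum of all ranks on every rank - no root needed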
24
MPI Collectives and Arrays
MPI has a lot of collectives to handle decomposition
and recomposition of arrays.
Typically very useful for matrix/vector operations which
are conducted in parallel.
25
Scatter and Gather
[Diagram: Scatter splits an array a0, a1, a2, a3 held on the root so each rank receives one element; Gather is the reverse, collecting the elements back into one array on the root]
26
MPI Scatter
MPI_Scatter(void* send_data, int send_count,
MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root,
MPI_Comm comm)
The send parameters are used on the root
The recv parameters are used by everyone
Decomposes arrays onto ranks (inc. the root)
27
MPI Gather
MPI_Gather(void* send_data, int send_count,
MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root,
MPI_Comm comm)
Recombines sub-arrays on world ranks (inc. root) into
one large array on the root.
MPI_Allgather allows you to gather with results being
copied to all ranks.
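A minimal sketch (sizes and names are illustrative; it assumes the program is run with exactly 4 ranks so the 16-element array divides evenly): the root scatters the array in 4-element chunks, each rank works on its chunk, and the root gathers the pieces back.
int full[16];   // only needs to be filled on the root before the Scatter
int chunk[4];   // each rank receives 4 elements
MPI_Scatter(full, 4, MPI_INT, chunk, 4, MPI_INT, 0, MPI_COMM_WORLD);
for(int i = 0; i < 4; i++) chunk[i] *= 2;   // work on the local piece
MPI_Gather(chunk, 4, MPI_INT, full, 4, MPI_INT, 0, MPI_COMM_WORLD);
// full[] on the root now holds all the doubled values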
28
Non-Blocking MPI
29
Non-Blocking Send/Recv
Until now, all the MPI operations we have seen block until they complete.
Good if you are worried your data might be overwritten; bad for performance if you know it's safe to proceed.
Can we issue the send/recv now, do some work, and then wait for the operation to complete at some later point?
..... Yes.
30
Non-Blocking Send/Recv
MPI_Isend(void* buffer, int count,
MPI_Datatype datatype, int destination, int tag,
MPI_Comm comm, MPI_Request* request)
The Isend operation issues the send and then returns
immediately, the ‘request’ parameter is a hook to the
operation that allows you to monitor it.
There is an overhead with issuing an Isend
31
Non-Blocking Send/Recv
MPI_Irecv(void* buffer, int count,
MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Request* request)
The Irecv operation issues the recv operation and then
returns.
Again, there is an overhead for this.
32
Non-Blocking Send/Recv
MPI_Wait(MPI_Request* request, MPI_Status* status)
The wait operation blocks on the request until it completes; status is updated with the appropriate information.
Using Isend/Irecv enables you to issue the operation, do some more processing, and then wait later (by which point the operation will hopefully have completed).
Overlapping communication and computation in this way can seriously improve performance.
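A minimal sketch of the pattern (it reuses the two-rank send/recv setup from slide 16; do_other_work is a hypothetical function standing in for computation that does not touch the buffer 'a'):
MPI_Request request;
MPI_Status status;
if(my_rank == 0) MPI_Isend(a, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &request);
else MPI_Irecv(a, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &request);
do_other_work();               // overlap: compute while the message is in flight
MPI_Wait(&request, &status);   // only now is it safe to reuse (rank 0) or read (rank 1) 'a'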
33
Using MPI at Warwick
34
Loading the MPI Compilers
To deal with the nitty gritty of linking MPI libraries etc., MPI provides compiler wrappers - these are the same compilers but with all the necessary command line options enabled.
Load the compilers:
module load intel/intel64
module load intel/ompi64
This loads the MPI compiler with the underlying compiler set.
Francesca, Skua etc.
35
Compiling MPI
Use the MPI compiler equivalent in place of your
normal compiler.
mpicc (C programs)
mpicxx (C++ programs)
mpif77 (Fortran 77)
mpif90 (Fortran 90)
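For example (file and executable names are illustrative; the wrappers accept the usual compiler flags):
mpicc -O2 -o my_program my_program.c
mpif90 -O2 -o my_program my_program.f90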
36
MPI and PBS Pro
In your PBS script you must use the command:
mpirun <executable>
This will load your executable under MPI and ensure all
the ranks etc are set up correctly.
When you submit a job across multiple nodes, PBS will automatically sort the MPI side out (provided you use MPI_Init in your program).
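A minimal PBS script sketch (the resource-request line is an assumption - the exact syntax varies between PBS versions and the local setup - and 'my_program' is an illustrative name):
#!/bin/bash
#PBS -l nodes=2:ppn=4
cd $PBS_O_WORKDIR
module load intel/intel64
module load intel/ompi64
mpirun ./my_program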
37
MPI Performance
38
MPI Performance
[Chart: NAS Parallel Benchmarks (NPB-MPI 2.4) - CG, EP, LU, MG and SP kernels on CSC-Francesca, run at 16, 32 and 64 processes]
39
Conclusions...
40
Summary
In this presentation we’ve met the Message Passing
Interface (MPI).
MPI uses a Single Program Multiple Data programming paradigm (one executable; each copy has its own data).
No shared memory - MPI is all about you saying what data to move around the system.
MPI is programmer driven, not compiler driven like OpenMP.
Used for everything from small clusters right the way up to ultra-scale petaflop computing.
41
Summary
We have looked at:
MPI Point to Point (Send/Recv) Operations
MPI Collectives
MPI Non-Blocking Sends/Recvs
42
What's Next?
So what can I do now?
http://go.warwick.ac.uk/csrcbc
MPI has lots more operations for improving performance and making programming easier.
MPI I/O Operations for Parallel Data Processing
43
Questions? Thanks for listening.
44