MPI: Parallel Programming for Extreme Machines
Si Hammond, High Performance Systems Group
1
Quick Introduction
Si Hammond, WPRF/PhD Research student, High Performance Systems Group, Computer Science
Platforms - Cray XT3/4, CSC Francesca, small AMD Opteron/Intel Xeon clusters
2
What’s in this talk?
Parallel programming methodologies - why MPI?
Where can I use MPI?
MPI in action
Getting MPI to work at Warwick
Examples
3
Programming in Parallel
[Diagram]
One computer, multiple processors/multiple cores: OpenMP, threads.
Many computers (each with multiple processors/cores) connected by a network: MPI, network sockets.
4
What is MPI?
Message Passing Interface
Programming paradigm for writing parallel codes
Defines what ‘messages’ of data are passed between
processes - how this happens at the underlying layers
doesn’t matter.
Runs on everything from single multi-core/multi-processor machines and small distributed clusters, right the way up to IBM BlueGene and IBM RoadRunner
5
Why learn MPI?
Used in almost every major parallel scientific code
Can be used with Fortran, C, C++ (Java almost)
So far the only messaging paradigm to scale to 100k+
nodes
Highly tuned and optimised - if you want parallel codes that perform well, you really need this.
6
MPI - The Theory
MPI Programs are Single Program Multiple Data
Same executable running multiple times but each will
have its own separate data - there is no global
memory.
MPI is programmer driven - you have to write the parallelism yourself; there are no compiler directives to do it for you (unlike OpenMP).
7
MPI in Action
8
MPI in Action
The first step is assigning each process a rank.
MPI will do this for you
automatically.
Ranks start at 0 and go
to n-1
Usually refer to rank 0 as
‘the root’
[Diagram: two processes, Rank 0 and Rank 1, connected through the MPI library]
9
Sending a Message
Let's send a message from 0 to 1.
Rank 0 posts a send to
rank 1.
Rank 1 posts a receive
from rank 0.
Data is exchanged.
[Diagram: Rank 0 calls Send(1, "Hello"); Rank 1 calls Recv(0); the MPI library carries the data between them]
10
MPI Program Outline in C
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  // Program goes here
  MPI_Finalize();
  return 0;
}
MPI_Init must come before any use of MPI functions.
MPI_Finalize must be the last use of MPI.
11
What Rank Am I?
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int my_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  if(my_rank == 0) printf("I am the boss\n"); else printf("I am not the boss :( \n");
  MPI_Finalize();
  return 0;
}
12
What Rank Am I?
(Same code as the previous slide, repeated for the callout below.)
MPI_COMM_WORLD is the
communicator group, we’ll
come back to this
13
What Rank Am I?
(Same code again.)
Parameters to MPI are usually passed as pointers.
14
What Rank Am I?
#include <stdio.h>
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int my_rank; int world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  if(my_rank == 0) printf("I am the boss\n"); else printf("I am not the boss :( \n");
  MPI_Finalize();
  return 0;
}
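To try this out (a sketch - the launcher flags depend on your MPI installation, and 'rank_example' is just an illustrative executable name):
mpirun -np 4 ./rank_example
With 4 ranks you should see one "I am the boss" line and three "I am not the boss :(" lines, in no particular order.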
15
Send a message
#include "mpi.h"
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int my_rank; int a[1]; a[0] = 42; int tag = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  if(my_rank == 0) {
    MPI_Send(a, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
  } else {
    MPI_Status status;
    MPI_Recv(a, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
  }
  MPI_Finalize();
  return 0;
}
Send: data from array 'a', 1 item of type MPI_INT, to rank 1.
Recv: data into array 'a', 1 item of type MPI_INT, from rank 0.
(Note: as written this assumes exactly two ranks - with more, every non-zero rank would wait for a message that never arrives.)
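To check the message actually arrived, the receiving branch could print the value after the Recv - a hypothetical addition to the slide's code, assuming <stdio.h> is included:
printf("Rank %d received %d\n", my_rank, a[0]);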
16
MPI - Data Types
The purpose of MPI data types is to let the programmer say how many items of data should be transmitted without worrying about how many bytes of memory that is.
The compiler and MPI will work this out for you!
It's a good idea to stick to these - they will be correct for your architecture, e.g. 64-bit, 32-bit etc.
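For example, a minimal sketch (the buffer name is illustrative): to send 16 doubles you pass a count of 16 and MPI_DOUBLE, and MPI works out the number of bytes for your platform.
double values[16];
// ... fill values ...
// the count is in items, not bytes - MPI_DOUBLE tells MPI how big each item is on this machine
MPI_Send(values, 16, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);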
17
MPI - Data Types
MPI_CHAR = signed char
MPI_SHORT = signed short int
MPI_INT = signed int
MPI_LONG = signed long int
MPI_FLOAT = float
MPI_DOUBLE = double
MPI_LONG_DOUBLE = long double
18
MPI Gets Serious
19
MPI Collectives
The real power of MPI is in the advanced data-handling functions - these are known as the collectives.
Typical situations:
Multiple ranks each hold a piece of data you need to carry out some operation with.
One rank has a big piece of data you need to split up between multiple ranks.
Collectives are highly tuned for these cases = fast performance
20
MPI Broadcast
MPI_Bcast(void* msg, int
count, MPI_Datatype datatype,
int root, MPI_Comm comm);
e.g. MPI_Bcast(a, 16, MPI_INT,
0, MPI_COMM_WORLD);
Broadcasts data from the rank which matches root to all ranks in the communicator group.
[Diagram: the root rank sending the same data to every other rank]
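A minimal sketch (array name and size are illustrative, assuming the usual MPI_Init/Comm_rank boilerplate): rank 0 fills a buffer, then every rank - including rank 0 - calls MPI_Bcast with the same arguments, and afterwards all of them hold the same data.
int a[16];
if(my_rank == 0) { for(int i = 0; i < 16; i++) a[i] = i; }  // only the root fills the buffer
MPI_Bcast(a, 16, MPI_INT, 0, MPI_COMM_WORLD);               // every rank makes the same call
// a[] now holds 0..15 on every rank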
21
MPI Reduce
MPI_Reduce(void* operand,
void* result, int count,
MPI_Datatype datatype,
MPI_Op operation, int root,
MPI_Comm comm);
Reduces data from all the ranks in the world down to one single rank (the root).
[Diagram: every rank sending its operand to the root]
22
MPI Reduce
Reduce operations essentially apply a mathematical operation to pieces of data held on the different ranks.
MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD etc.
You must not specify the operand and result to be the same location in memory - the result is only meaningful on the root; on every other rank it is not used.
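A minimal sketch of a sum reduction (names are illustrative, assuming the usual boilerplate and <stdio.h>): each rank contributes a value and only the root ends up with the total.
int my_value = my_rank;   // each rank's contribution
int total = 0;            // separate buffer for the result, as required
MPI_Reduce(&my_value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if(my_rank == 0) printf("Sum of all ranks = %d\n", total);  // total is only meaningful on the root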
23
MPI All-Reduce
MPI_Allreduce(void* operand,
void* result, int count,
MPI_Datatype datatype,
MPI_Op operation, MPI_Comm
comm);
Reduces data from all ranks and delivers the combined result to every rank in the world - note there is no root parameter.
[Diagram: all ranks exchanging data so that every rank holds the result]
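The same sum as above, but now every rank receives the answer (again a sketch with illustrative names):
int my_value = my_rank, total = 0;
MPI_Allreduce(&my_value, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
// total now holds the sum of all ranks on every rank - no root needed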
24
MPI Collectives and Arrays
MPI has a lot of collectives to handle decomposition
and recomposition of arrays.
Typically very useful for matrix/vector operations which
are conducted in parallel.
25
Scatter and Gather
[Diagram: Scatter splits an array a0, a1, a2, a3 held on the root so each rank receives one element; Gather is the reverse, collecting the elements back into one array on the root]
26
MPI Scatter
MPI_Scatter(void* send_data, int send_count,
MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root,
MPI_Comm comm)
The send parameters are used on the root
The recv parameters are used by everyone
Decomposes arrays onto ranks (inc. the root)
27
MPI Gather
MPI_Gather(void* send_data, int send_count,
MPI_Datatype send_type, void* recv_data,
int recv_count, MPI_Datatype recv_type, int root,
MPI_Comm comm)
Recombines sub-arrays on world ranks (inc. root) into
one large array on the root.
MPI_Allgather allows you to gather with results being
copied to all ranks.
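A minimal sketch (sizes and names are illustrative; it assumes the program is run with exactly 4 ranks so the 16-element array divides evenly): the root scatters the array in 4-element chunks, each rank works on its chunk, and the root gathers the pieces back.
int full[16];   // only needs to be filled on the root before the Scatter
int chunk[4];   // each rank receives 4 elements
MPI_Scatter(full, 4, MPI_INT, chunk, 4, MPI_INT, 0, MPI_COMM_WORLD);
for(int i = 0; i < 4; i++) chunk[i] *= 2;   // work on the local piece
MPI_Gather(chunk, 4, MPI_INT, full, 4, MPI_INT, 0, MPI_COMM_WORLD);
// full[] on the root now holds all the doubled values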
28
Non-Blocking MPI
29
Non-Blocking Send/Recv
Until now, all the MPI operations we have seen block until they complete.
Good if you are worried your data might be overwritten; bad for performance if you know it's safe to proceed.
Can we issue the send/recv now, do some work, and then wait for the operation to complete at some later point?
..... Yes.
30
Non-Blocking Send/Recv
MPI_Isend(void* buffer, int count,
MPI_Datatype datatype, int destination, int tag,
MPI_Comm comm, MPI_Request* request)
The Isend operation issues the send and then returns
immediately, the ‘request’ parameter is a hook to the
operation that allows you to monitor it.
There is an overhead with issuing an Isend
31
Non-Blocking Send/Recv
MPI_Irecv(void* buffer, int count,
MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Request* request)
The Irecv operation issues the recv operation and then
returns.
Again, there is an overhead for this.
32
Non-Blocking Send/Recv
MPI_Wait(MPI_Request* request, MPI_Status* status)
The wait operation blocks on the request until it completes; status is updated with the appropriate information.
Using Isend/Irecv enables you to issue the operation, do some more processing, and then wait later (by which point the operation will hopefully have completed).
Overlapping communication and computation in this way can seriously improve performance.
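A minimal sketch of the pattern (it reuses the two-rank send/recv setup from slide 16; do_other_work is a hypothetical function standing in for computation that does not touch the buffer 'a'):
MPI_Request request;
MPI_Status status;
if(my_rank == 0) MPI_Isend(a, 1, MPI_INT, 1, tag, MPI_COMM_WORLD, &request);
else MPI_Irecv(a, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &request);
do_other_work();               // overlap: compute while the message is in flight
MPI_Wait(&request, &status);   // only now is it safe to reuse (rank 0) or read (rank 1) 'a'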
33
Using MPI at Warwick
34
Loading the MPI Compilers
To deal with the nitty gritty of linking MPI libraries etc., MPI provides compiler wrappers - these are the same compilers but with all the necessary command line options enabled.
Load the compilers:
module load intel/intel64
module load intel/ompi64
This loads the MPI compiler with the underlying compiler set.
Francesca, Skua etc.
35
Compiling MPI
Use the MPI compiler equivalent in place of your
normal compiler.
mpicc (C programs)
mpicxx (C++ programs)
mpif77 (Fortran 77)
mpif90 (Fortran 90)
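For example (file and executable names are illustrative; the wrappers accept the usual compiler flags):
mpicc -O2 -o my_program my_program.c
mpif90 -O2 -o my_program my_program.f90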
36
MPI and PBS Pro
In your PBS script you must use the command:
mpirun <executable>
This will load your executable under MPI and ensure all
the ranks etc are set up correctly.
When you submit a job across multiple nodes, PBS will automatically sort the MPI side out (provided you use MPI_Init in your program).
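A minimal PBS script sketch (the resource-request line is an assumption - the exact syntax varies between PBS versions and the local setup - and 'my_program' is an illustrative name):
#!/bin/bash
#PBS -l nodes=2:ppn=4
cd $PBS_O_WORKDIR
module load intel/intel64
module load intel/ompi64
mpirun ./my_program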
37
MPI Performance
38
MPI Performance
[Chart: NAS Parallel Benchmarks (NPB-MPI 2.4) - CG, EP, LU, MG and SP kernels on CSC-Francesca, run at 16, 32 and 64 processes]
39
Conclusions...
40
Summary
In this presentation we’ve met the Message Passing
Interface (MPI).
MPI uses a Single Program Multiple Data programming paradigm (one executable; each copy has its own data).
No shared memory - MPI is all about you saying what data to move around the system.
MPI is programmer driven, not compiler driven like OpenMP.
Used for everything from small clusters right the way up to ultra-scale petaflop computing.
41
Summary
We have looked at:
MPI Point to Point (Send/Recv) Operations
MPI Collectives
MPI Non-Blocking Sends/Recvs
42
What's Next?
So what can I do now?
http://go.warwick.ac.uk/csrcbc
MPI has lots more operations for improving performance and making programming easier.
MPI I/O Operations for Parallel Data Processing
43
Questions? Thanks for listening.
44