Transcript of: Porting MPI Programs to the IBM Cluster 1600, Peter Towers, ECMWF, March 2004

Slide 1

Porting MPI Programs to the IBM Cluster 1600

Peter Towers

March 2004

Slide 2

Topics

The current hardware switch

Parallel Environment (PE)

Issues with Standard Sends/Receives

Use of non blocking communications

Debugging MPI programs

MPI tracing

Profiling MPI programs

Tasks per Node

Communications Optimisation

The new hardware switch

Third Practical

Slide 3

The current hardware switch

Designed for a previous generation of IBM hardware

Referred to as the Colony switch

2 switch adaptors per logical node

- 8 processors share 2 adaptors

- called a dual plane switch

Adaptors are multiplexed

- software stripes large messages across both adaptors

Minimum latency 21 microseconds

Maximum bandwidth approx 350 MBytes/s

- about 45 MB/s per task when all going off node together

Slide 4

Parallel Environment (PE)

MPI programs are managed by the IBM PE

IBM documentation refers to PE and POE

- POE stands for Parallel Operating Environment

- many environment variables to tune the parallel environment

- talks about launching parallel jobs interactively

ECMWF uses Loadleveler for batch jobs

- PE usage becomes almost transparent

Slide 5

Issues with Standard Sends/Receives

The MPI standard can be implemented in different ways

Programs may not be fully portable across platforms

Standard Sends and Receives can cause problems

- Potential for deadlocks

- need to understand Blocking vs Non Blocking communications

- need to understand Eager versus Rendezvous protocols

IFS had to be modified to run on IBM

Slide 6

Blocking Communications

MPI_Send is a blocking routine

It returns when it is safe to re-use the buffer being sent

- the send buffer can then be overwritten

The MPI layer may have copied the data elsewhere

- using internal buffer/mailbox space

- the message is then in transit but not yet received

- this is called an “eager” protocol

- good for short messages

The MPI layer may have waited for the receiver

- the data is copied from send to receive buffer directly

- lower overhead transfer

- this is called a “rendezvous” protocol

- good for large messages

Slide 7

MPI_Send on the IBM

Uses the “Eager” protocol for short messages

- by default short means up to 4096 bytes

the higher the task count, the lower the value

Uses the “Rendezvous” protocol for long messages

Potential for send/send deadlocks

- tasks block in mpi_send

if (me .eq. 0) then
   him = 1
else
   him = 0
endif

call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)

Slide 8

Solutions to Send/Send deadlocks

Pair up sends and receives

use MPI_SENDRECV

use a buffered send

- MPI_BSEND

use asynchronous sends/receives

- MPI_ISEND/MPI_IRECV

Slide 9

Paired Sends and Receives

More complex code

Requires close synchronisation

if (me .eq. 0) then
   him = 1
   call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
   call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
else
   him = 0
   call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
   call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
endif

Slide 10

MPI_SENDRECV

Easier to code

Still implies close synchronisation

call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                  rbuff,n,MPI_REAL8,him,1, &
                  MPI_COMM_WORLD,stat,ierror)

Slide 11

MPI_BSEND

This performs a send using an additional buffer

- the buffer is allocated by the program via MPI_BUFFER_ATTACH

- done once as part of the program initialisation

Typically quick to implement

- add the mpi_buffer_attach call (see the sketch below)

how big to make the buffer?

- change MPI_SEND to MPI_BSEND everywhere

But introduces additional memory copy

- extra overhead

- not recommended for production codes
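
A minimal sketch of the buffered send approach, assuming two MPI tasks that exchange n words of MPI_REAL8; the message length and buffer sizing are illustrative only, and the program must be run on exactly two tasks:

program bsend_sketch
implicit none
include 'mpif.h'
integer, parameter :: n = 1000                 ! illustrative message length
integer :: me, him, tag, ierror, bufsize
integer :: stat(MPI_STATUS_SIZE)
real*8  :: sbuff(n), rbuff(n)
real*8  :: buffer(n + MPI_BSEND_OVERHEAD)      ! generously sized attach buffer

call mpi_init(ierror)
call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
if (me .eq. 0) then
   him = 1
else
   him = 0
endif
tag = 1
sbuff = 1.0d0

! attach the buffer once as part of the program initialisation
! the size is in bytes: one n-word message plus the MPI overhead
bufsize = 8*n + MPI_BSEND_OVERHEAD
call mpi_buffer_attach(buffer, bufsize, ierror)

! mpi_bsend copies the data into the attached buffer and returns,
! so the send/send pattern no longer deadlocks
call mpi_bsend(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
call mpi_recv (rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)

call mpi_buffer_detach(buffer, bufsize, ierror)
call mpi_finalize(ierror)
end program bsend_sketch

One answer to the buffer sizing question above: the attached buffer must hold all messages outstanding at any one time, plus MPI_BSEND_OVERHEAD bytes per message.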

Slide 12

MPI_IRECV / MPI_ISEND

Uses Non Blocking Communications

Routines return without completing the operation

- the operations run asynchronously

- Must NOT reuse the buffer until safe to do so

Later test that the operation completed

- via an integer identification handle passed to MPI_WAIT

I stands for immediate

- the call returns immediately

call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
call mpi_send (sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
call mpi_wait (request,stat,ierror)

Alternatively MPI_ISEND and MPI_RECV could have been used, as sketched below
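
A sketch of that alternative, reusing the variable names from the fragment above; the non blocking send is posted first and completed with mpi_wait after the blocking receive has returned:

call mpi_isend(sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
call mpi_recv (rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,stat,ierror)
call mpi_wait (request,stat,ierror)
! sbuff may only be overwritten after mpi_wait returns

Either variant avoids the send/send deadlock because the non blocking call returns without waiting for the matching operation on the other task.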

Slide 13

Non Blocking Communications

Routines include

- MPI_ISEND

- MPI_IRECV

- MPI_WAIT

- MPI_WAITALL (see the sketch below)
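
A minimal sketch of these routines used together, assuming the same sbuff/rbuff exchange with a partner task him as in the earlier fragments:

integer :: requests(2), stats(MPI_STATUS_SIZE,2)

! post both non blocking operations, then complete them in one call
call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,requests(1),ierror)
call mpi_isend(sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,requests(2),ierror)
call mpi_waitall(2,requests,stats,ierror)
! only now is it safe to overwrite sbuff or read rbuff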

Slide 14

Debugging MPI Programs

The Universal Debug Tool

and

Totalview

Slide 15

The Universal Debug Tool

The Print/Write Statement

Recommend the use of call flush(unit_number)

- ensures output is not left in runtime buffers

Recommend the use of separate output files eg:

unit_number=100+mytask
write(unit_number,*) ......
call flush(unit_number)

Or set the Environment variable MP_LABELIO=yes

Do not output too much

Use as few processors as possible

Think carefully.....

Discuss the problem with a colleague

Slide 16

Totalview

Assumes you can launch X-Windows remotely

Run Totalview as part of a Loadleveler job

export DISPLAY=.......
poe totalview -a a.out <arguments>

But you have to wait for the job to run.....

Use only a few processors

- minimises the queuing time

- minimises the waste of resource while thinking....

Slide 17

MPI Trace Tools

Identify message passing hot spots

Just link with

/usr/local/lib/trace/libmpiprof.a

low overhead timer for all mpi routine calls

Produces output files named mpi_profile.N

- where N is the task number

Examples of the output follow

Slides 18-19: examples of the mpi_profile output

Slide 20

Profiling MPI programs

The same as for serial codes

Use the -pg flag at compile and/or link time

Produces multiple gmon.out.N files

- N is the task number

gprof a.out gmon.out.*

The routine .kickpipes often appears high up the profile

- an internal mpi library routine

- where the mpi library spins waiting for something

eg a message to be sent

or in a barrier

Slide 21

Tasks Per Node ( 1 of 2 )

Try both 7 and 8 tasks per node for multi node jobs

- 7 tasks may run faster than 8

- depends on the frequency of communications

7 tasks leaves a processor spare

- used by the OS and background daemons, such as those for GPFS

- mpi tasks run with minimal scheduling interference

8 tasks are subject to scheduling interference

- by default mpi tasks cpu spin in kickpipes

- they may spin waiting for a task that has been scheduled out

- the OS has to schedule cpu time for background processes

- random interference across nodes is cumulative

Slide 22

Tasks Per Node ( 2 of 2 )

Also try 8 tasks per node and MP_WAIT_MODE=sleep

- export MP_WAIT_MODE=sleep

- tasks give up the cpu instead of spinning

- increases latency but reduces interference

- effect varies from application to application

Mixed mode MPI/OpenMP works well (see the sketch after this list)

- master OpenMP thread does the message passing

- while slave OpenMP threads go to sleep

- cpu cycles are freed up for background processes

- used by the IFS to good effect

2 tasks each of 4 threads per node

suspect success depends on the parallel granularity
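
A minimal sketch of the mixed mode layout referred to above; it assumes arrays sbuff, rbuff and a of length n, a partner task him and the usual MPI variables are already set up, and it follows the pattern of calling MPI only from the master thread, outside the OpenMP parallel region:

! message passing on the master thread only, before the threaded region
call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                  rbuff,n,MPI_REAL8,him,1, &
                  MPI_COMM_WORLD,stat,ierror)

! threaded computation: the OpenMP threads only become active here
!$OMP PARALLEL DO PRIVATE(i)
do i = 1, n
   a(i) = a(i) + rbuff(i)
end do
!$OMP END PARALLEL DO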

Slide 23

Communications Optimisation

Communications costs often impact parallel speedup

Concatenate messages (see the sketch after this list)

- fewer larger messages are better

- reduces the effect of latency

Increase MP_EAGER_LIMIT

- export MP_EAGER_LIMIT=65536

- maximum size for messages sent with the “eager” protocol

Use collective routines

Use ISEND/IRECV

Remove barriers

Experiment with tasks per node
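
A minimal sketch of message concatenation, as referenced in the list above; it assumes two arrays u and v of length n that were previously sent as two separate messages, plus send/receive buffers sbuff and rbuff of length 2*n:

! before: two messages, two latencies
!   call mpi_send(u,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
!   call mpi_send(v,n,MPI_REAL8,him,2,MPI_COMM_WORLD,ierror)

! after: pack both arrays into one buffer and send a single message
sbuff(1:n)     = u(1:n)
sbuff(n+1:2*n) = v(1:n)
call mpi_send(sbuff,2*n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)

! the receiving task issues one matching receive and unpacks
call mpi_recv(rbuff,2*n,MPI_REAL8,him,1,MPI_COMM_WORLD,stat,ierror)
u(1:n) = rbuff(1:n)
v(1:n) = rbuff(n+1:2*n)

One message of length 2*n pays the latency only once, which matters most when n is small.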

Slide 24

The new hardware switch

Designed for the Cluster 1600

Referred to as the Federation switch

2 switch adaptors per physical node

- 2 links each of 2 GB/s per adaptor

- 32 processors share 4 links

Adaptors/links are NOT multiplexed

Minimum latency 10 microseconds

Maximum bandwidth approx 2000 MBytes/s

- about 250 MB/s per task when all going off node together

Up to 5 times better performance

32 processor nodes

- will affect how we schedule and run jobs

Slide 25

Third Practical

Contained in the directory

/home/ectrain/trx/mpi/exercise3 on hpca

Parallelising the computation of PI

See the README for details