Transcript of: Porting MPI Programs to the IBM Cluster 1600, Peter Towers, ECMWF, March 2004

Slide 1

Porting MPI Programs to the IBM Cluster 1600

Peter Towers

March 2004

Slide 2

Topics

The current hardware switch

Parallel Environment (PE)

Issues with Standard Sends/Receives

Use of non blocking communications

Debugging MPI programs

MPI tracing

Profiling MPI programs

Tasks per Node

Communications Optimisation

The new hardware switch

Third Practical

Slide 3

The current hardware switch

Designed for a previous generation of IBM hardware

Referred to as the Colony switch

2 switch adaptors per logical node

- 8 processors share 2 adaptors

- called a dual plane switch

Adaptors are multiplexed

- software stripes large messages across both adaptors

Minimum latency 21 microseconds

Maximum bandwidth approx 350 MBytes/s

- about 45 MB/s per task when all going off node together

Slide 4

Parallel Environment (PE)

MPI programs are managed by the IBM PE

IBM documentation refers to PE and POE

- POE stands for Parallel Operating Environment

- many environment variables to tune the parallel environment

- talks about launching parallel jobs interactively

ECMWF uses Loadleveler for batch jobs

- PE usage becomes almost transparent

Slide 5

Issues with Standard Sends/Receives

The MPI standard can be implemented in different ways

Programs may not be fully portable across platforms

Standard Sends and Receives can cause problems

- Potential for deadlocks

- need to understand Blocking vs Non Blocking communications

- need to understand Eager versus Rendezvous protocols

IFS had to be modified to run on IBM

Slide 6

Blocking Communications

MPI_Send is a blocking routine

It returns when it is safe to re-use the buffer being sent

- the send buffer can then be overwritten

The MPI layer may have copied the data elsewhere

- using internal buffer/mailbox space

- the message is then in transit but not yet received

- this is called an “eager” protocol

- good for short messages

The MPI layer may have waited for the receiver

- the data is copied from send to receive buffer directly

- lower overhead transfer

- this is called a “rendezvous” protocol

- good for large messages

Slide 7

MPI_Send on the IBM

Uses the “Eager” protocol for short messages

- by default short means up to 4096 bytes

the higher the task count, the lower the value

Uses the “Rendezvous” protocol for long messages

Potential for send/send deadlocks

- tasks block in mpi_send

if (me .eq. 0) then
   him = 1
else
   him = 0
endif

call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)

Slide 8

Solutions to Send/Send deadlocks

Pair up sends and receives

use MPI_SENDRECV

use a buffered send

- MPI_BSEND

use asynchronous sends/receives

- MPI_ISEND/MPI_IRECV

Slide 9

Paired Sends and Receives

More complex code

Requires close synchronisation

if (me .eq. 0) then
   him = 1
   call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
   call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
else
   him = 0
   call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
   call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
endif

Slide 10

MPI_SENDRECV

Easier to code

Still implies close synchronisation

call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                  rbuff,n,MPI_REAL8,him,1, &
                  MPI_COMM_WORLD,stat,ierror)

Slide 11

MPI_BSEND

This performs a send using an additional buffer

- the buffer is allocated by the program via MPI_BUFFER_ATTACH

- done once as part of the program initialisation

Typically quick to implement

- add the mpi_buffer_attach call (see the sketch below)

how big to make the buffer?

- change MPI_SEND to MPI_BSEND everywhere

But introduces additional memory copy

- extra overhead

- not recommended for production codes
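
A minimal sketch of the buffered send approach, assuming two MPI tasks that exchange n words of MPI_REAL8; the message length and buffer sizing are illustrative only, and the program must be run on exactly two tasks:

program bsend_sketch
implicit none
include 'mpif.h'
integer, parameter :: n = 1000                 ! illustrative message length
integer :: me, him, tag, ierror, bufsize
integer :: stat(MPI_STATUS_SIZE)
real*8  :: sbuff(n), rbuff(n)
real*8  :: buffer(n + MPI_BSEND_OVERHEAD)      ! generously sized attach buffer

call mpi_init(ierror)
call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
if (me .eq. 0) then
   him = 1
else
   him = 0
endif
tag = 1
sbuff = 1.0d0

! attach the buffer once as part of the program initialisation
! the size is in bytes: one n-word message plus the MPI overhead
bufsize = 8*n + MPI_BSEND_OVERHEAD
call mpi_buffer_attach(buffer, bufsize, ierror)

! mpi_bsend copies the data into the attached buffer and returns,
! so the send/send pattern no longer deadlocks
call mpi_bsend(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
call mpi_recv (rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)

call mpi_buffer_detach(buffer, bufsize, ierror)
call mpi_finalize(ierror)
end program bsend_sketch

One answer to the buffer sizing question above: the attached buffer must hold all messages outstanding at any one time, plus MPI_BSEND_OVERHEAD bytes per message.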

Slide 12

MPI_IRECV / MPI_ISEND

Uses Non Blocking Communications

Routines return without completing the operation

- the operations run asynchronously

- Must NOT reuse the buffer until safe to do so

Later test that the operation completed

- via an integer identification handle passed to MPI_WAIT

I stands for immediate

- the call returns immediately

call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
call mpi_send (sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
call mpi_wait (request,stat,ierror)

Alternatively MPI_ISEND and MPI_RECV could have been used, as sketched below
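
A sketch of that alternative, reusing the variable names from the fragment above; the non blocking send is posted first and completed with mpi_wait after the blocking receive has returned:

call mpi_isend(sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
call mpi_recv (rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,stat,ierror)
call mpi_wait (request,stat,ierror)
! sbuff may only be overwritten after mpi_wait returns

Either variant avoids the send/send deadlock because the non blocking call returns without waiting for the matching operation on the other task.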

Slide 13

Non Blocking Communications

Routines include

- MPI_ISEND

- MPI_IRECV

- MPI_WAIT

- MPI_WAITALL (see the sketch below)
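
A minimal sketch of these routines used together, assuming the same sbuff/rbuff exchange with a partner task him as in the earlier fragments:

integer :: requests(2), stats(MPI_STATUS_SIZE,2)

! post both non blocking operations, then complete them in one call
call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,requests(1),ierror)
call mpi_isend(sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,requests(2),ierror)
call mpi_waitall(2,requests,stats,ierror)
! only now is it safe to overwrite sbuff or read rbuff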

Slide 14

Debugging MPI Programs

The Universal Debug Tool

and

Totalview

Slide 15

The Universal Debug Tool

The Print/Write Statement

Recommend the use of call flush(unit_number)

- ensures output is not left in runtime buffers

Recommend the use of separate output files eg:

unit_number=100+mytask
write(unit_number,*) ......
call flush(unit_number)

Or set the Environment variable MP_LABELIO=yes

Do not output too much

Use as few processors as possible

Think carefully.....

Discuss the problem with a colleague

Slide 16

Totalview

Assumes you can launch X-Windows remotely

Run Totalview as part of a Loadleveler job

export DISPLAY=.......
poe totalview -a a.out <arguments>

But you have to wait for the job to run.....

Use only a few processors

- minimises the queuing time

- minimises the waste of resource while thinking....

Slide 17

MPI Trace Tools

Identify message passing hot spots

Just link with

/usr/local/lib/trace/libmpiprof.a

low overhead timer for all mpi routine calls

Produces output files named mpi_profile.N

- where N is the task number

Examples of the output follow

Slides 18-19: examples of the mpi_profile output

Slide 20

Profiling MPI programs

The same as for serial codes

Use the -pg flag at compile and/or link time

Produces multiple gmon.out.N files

- N is the task number

gprof a.out gmon.out.*

The routine .kickpipes often appears high up the profile

- an internal mpi library routine

- where the mpi library spins waiting for something

eg a message to be sent

or in a barrier

Slide 21

Tasks Per Node ( 1 of 2 )

Try both 7 and 8 tasks per node for multi node jobs

- 7 tasks may run faster than 8

- depends on the frequency of communications

7 tasks leaves a processor spare

- used by the OS and background daemons, such as those for GPFS

- mpi tasks run with minimal scheduling interference

8 tasks are subject to scheduling interference

- by default mpi tasks cpu spin in kickpipes

- they may spin waiting for a task that has been scheduled out

- the OS has to schedule cpu time for background processes

- random interference across nodes is cumulative

Slide 22

Tasks Per Node ( 2 of 2 )

Also try 8 tasks per node and MP_WAIT_MODE=sleep

- export MP_WAIT_MODE=sleep

- tasks give up the cpu instead of spinning

- increases latency but reduces interference

- effect varies from application to application

Mixed mode MPI/OpenMP works well (see the sketch after this list)

- master OpenMP thread does the message passing

- while slave OpenMP threads go to sleep

- cpu cycles are freed up for background processes

- used by the IFS to good effect

2 tasks each of 4 threads per node

suspect success depends on the parallel granularity
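
A minimal sketch of the mixed mode layout referred to above; it assumes arrays sbuff, rbuff and a of length n, a partner task him and the usual MPI variables are already set up, and it follows the pattern of calling MPI only from the master thread, outside the OpenMP parallel region:

! message passing on the master thread only, before the threaded region
call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                  rbuff,n,MPI_REAL8,him,1, &
                  MPI_COMM_WORLD,stat,ierror)

! threaded computation: the OpenMP threads only become active here
!$OMP PARALLEL DO PRIVATE(i)
do i = 1, n
   a(i) = a(i) + rbuff(i)
end do
!$OMP END PARALLEL DO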

Slide 23

Communications Optimisation

Communications costs often impact parallel speedup

Concatenate messages (see the sketch after this list)

- fewer larger messages are better

- reduces the effect of latency

Increase MP_EAGER_LIMIT

- export MP_EAGER_LIMIT=65536

- maximum size for messages sent with the “eager” protocol

Use collective routines

Use ISEND/IRECV

Remove barriers

Experiment with tasks per node
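
A minimal sketch of message concatenation, as referenced in the list above; it assumes two arrays u and v of length n that were previously sent as two separate messages, plus send/receive buffers sbuff and rbuff of length 2*n:

! before: two messages, two latencies
!   call mpi_send(u,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
!   call mpi_send(v,n,MPI_REAL8,him,2,MPI_COMM_WORLD,ierror)

! after: pack both arrays into one buffer and send a single message
sbuff(1:n)     = u(1:n)
sbuff(n+1:2*n) = v(1:n)
call mpi_send(sbuff,2*n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)

! the receiving task issues one matching receive and unpacks
call mpi_recv(rbuff,2*n,MPI_REAL8,him,1,MPI_COMM_WORLD,stat,ierror)
u(1:n) = rbuff(1:n)
v(1:n) = rbuff(n+1:2*n)

One message of length 2*n pays the latency only once, which matters most when n is small.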

Slide 24

The new hardware switch

Designed for the Cluster 1600

Referred to as the Federation switch

2 switch adaptors per physical node

- 2 links each of 2 GB/s per adaptor

- 32 processors share 4 links

Adaptors/links are NOT multiplexed

Minimum latency 10 microseconds

Maximum bandwidth approx 2000 MBytes/s

- about 250 MB/s per task when all going off node together

Up to 5 times better performance

32 processor nodes

- will affect how we schedule and run jobs

Slide 25

Third Practical

Contained in the directory

/home/ectrain/trx/mpi/exercise3 on hpca

Parallelising the computation of PI

See the README for details