Toward MPI Asynchronous Progress on Cray XE Systems
Howard Pritchard – [email protected]
Duncan Roweth, David Henseler, Paul Cassella
Cray, Inc.
5/3/2012 Cray User Group Proceedings 2012
Overview
● MPI Progress – what is it?
● XE6 (Gemini) features relevant to MPI progress
● MPICH2 enhancements for MPI progress on Gemini
● Core Specialization and MPI progress
● Benchmark and application results
● Future work
MPI Progress Rule
● Determines how an MPI communication operation completes once it has been initiated
● Strict interpretation
● Once a communication has been initiated, no further MPI calls are required in order to complete it
● Weak interpretation
● An application must make further MPI library calls in order to make progress
● Primarily an issue for point-to-point calls, but also relevant to non-blocking collectives – to be introduced in MPI version 3
Receive side progression example
● A well written application should post its receives early so as to overlap data transfer with computation
● But messages typically arrive during the computation
● Matching occurs and bulk data transfer starts in the next MPI call, MPI_Wait(), if only the weak MPI progression interpretation is supported
● How do we achieve independent progression?
● Asynchronous progress thread(s)
● Offload to the NIC
[Figure: timeline – MPI_Irecv(), computation, message arrives, MPI_Wait()]
Gemini Hardware Features and MPI Progress
● DMA engine with four virtual channels
● Local interrupt can be generated along with a Completion Queue Event (CQE) upon completion of a TX descriptor (RDMA transaction)
● Remote interrupt and remote CQE can be generated at a target node when an RDMA transaction has completed
● RDMA fence capability
● Completion Queues configurable for polling or blocking (interrupt driven)
● Gemini has a low overhead method for a process operating in user space to generate an interrupt at a remote node (along with a CQE)
● No explicit support for MPI
MPICH2 Enhancements to Support Asynchronous Progress
MPICH2 on XE - Basics
● Based on Argonne National Lab (ANL) MPICH2 Nemesis CH3 channel
● A GNI Nemesis Network Module (Netmod) interfaces the upper part of MPICH2 to the XE network software stack
● Full support for MPI-2 thread safety level MPI_THREAD_MULTIPLE
● Progress thread infrastructure (although simplistic) already exists in the library
MPICH2 on XE Basics (2)
● The eager protocol both for intra-node and inter-node messaging is CPU intensive and not suitable for asynchronous progress
● The rendezvous path for intra-node messages is also CPU intensive and not suitable for asynchronous progress
● The rendezvous – Long Message Transfer (LMT) – path for inter-node messages is not CPU intensive (usually):
● Rendezvous messages up to 4 MB in size normally use an RDMA read LMT protocol
● Messages larger than 4 MB, or in the case of shorter messages when hardware resource depletion is occurring, use a cooperative RDMA write LMT protocol
● The Gemini DMA engine (BTE) is normally used for these transfers
Enhancing MPICH2 on XE for Better MPI Progress - Goals
● Keep complexity/new software to a minimum
● Minimize impact on short message latency/message rate
● Don't focus on a single message transfer protocol path for providing MPI progress (e.g. don't focus just on RDMA read protocol)
● First phase focuses on better progression of longer message sizes (32 – 64 KB or longer)
MPICH2 - Progress Thread Model
● Each MPI rank on a node starts an extra progress thread during MPI_Init
● An extra completion queue (CQ) is created with blocking attribute during MPI_Init
● The progress threads call GNI_CqWait and remain blocked in the kernel until an interrupt is delivered by the Gemini NIC indicating progress on an MPI message receive or send
● Interrupts are only generated when progressing inter-node LMT (rendezvous) messages
[Figure: per-rank application thread and progress thread]
LMT RDMA Write Path – No Progress
[Figure: sender and receiver PEs exchange SMSG mailbox messages; step 2 – register chunk of application receive buffer; step 4 – register chunk of application send buffer]
LMT RDMA Write Path – With Progress
[Figure: same SMSG mailbox flow as the previous slide, with a BTE RX CQ added so that DMA completions can wake the progress thread]
Core Specialization
● Modifications to the Linux kernel to allow for dynamic partitioning of the cores on a node between application threads and OS kernel threads and system daemons
● Provides an interface to allow application libraries to specify whether threads/processes they create are to be scheduled on the application partition or the OS partition
● User of a parallel application specifies at launch time how many cores, if any, per node to reserve for the OS partition
Core Specialization and MPI progress
● Typical HPC application threads tend to run hot, i.e. they don't typically make calls that result in yielding of the core on which they are scheduled
● Because of this, MPI progress threads need to have at least one core of a compute unit available per node for efficient handling of interrupts received from Gemini
● Core Specialization provides a convenient way to partition cores on a node between hot application threads, and cool system service daemon threads as well as MPI progress threads
Job Container/Corespec Example

Tell the job container package that the next thread/process created by this process should not be placed like an application thread/process:

/* requires <fcntl.h>, <limits.h>, <stdio.h>, <string.h>, <sys/syscall.h>, <unistd.h> */
int disable_job_aff_algo(void) {
    int rc;
    int fd = open("/proc/job", O_WRONLY);
    if (fd == -1) {
        return -1;
    }
    rc = write(fd, "disable_affinity_apply", strlen("disable_affinity_apply"));
    if (rc < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

Tell the CoreSpec package that this task (thread) should be scheduled on the OS partition if one is available, otherwise use the default Linux scheduler policy:

int task_is_not_app(void) {
    size_t count;
    FILE *f;
    char zero[] = "0";
    char filename[PATH_MAX];

    snprintf(filename, sizeof(filename),
             "/proc/self/task/%ld/task_is_app", syscall(SYS_gettid));
    f = fopen(filename, "w");
    if (!f) {
        return -1;
    }
    count = fwrite(zero, sizeof(zero), 1, f);
    if (count != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
MPI Asynchronous Progress - enabling
● export MPICH_NEMESIS_ASYNC_PROGRESS=1
● export MPICH_MAX_THREAD_SAFETY=multiple
Sandia MPI Benchmark (mpi_overhead)
Measures MPI message overhead:
overhead = iter_t – work_t
Measures application availability:
avail = 1 – (iter_t – work_t)/base_t
[Figure: timeline of MPI_Irecv(), work loop (work_t), and MPI_Wait() within one iteration (iter_t); base_t is the baseline work loop time]
http://www.cs.sandia.gov/smb/index.html
SMB – Receive side 64 KB message size (no progress)
[Chart: base_t, work_t, and iter_t (µsecs, 0–180) vs. work loop iterations]
SMB – Receive side 64 KB message size (progress enabled)
[Chart: base_t, work_t, and iter_t (µsecs, 0–180) vs. work loop iterations]
Availability – no progress
[Chart: sender and receiver availability (%) vs. message length, 0 bytes to 16 MB]
Availability – progression enabled
[Chart: sender and receiver availability (%) vs. message length, 0 bytes to 16 MB]
S3D Application
● Hybrid MPI/OpenMP
● Joint work by ORNL and Cray to prepare app for either using accelerators or OpenMP only on node
● Large messages for the data set considered

|--------|------------|---------|--------------
| Time%  | Time       | Calls   | Group
|--------|------------|---------|--------------
| 100.0% | 333.554284 | 2022296 | Total
|  56.7% | 189.059471 |  605408 | USER
|  38.3% | 127.905679 |  869027 | MPI
||  23.7% |  78.906595 |   61200 | mpi_waitall_
||  11.8% |  39.426609 |    6960 | mpi_wait_
||   2.4% |   7.933348 |  399480 | mpi_isend_
|   4.1% |  13.729663 |  546009 | OMP
|…
| MPI Msg Bytes | MPI Msg Count | < 16 Bytes | 16 – 256 Bytes | 256B – 4KB | 4KB – 64KB | 64KB – 1MB |
| 22238303636   | 401311        | 1795       | 27             | 100806     | 2          | 298681     |
S3D Timings with and without Progression
[Chart: time per S3D timestep (seconds, 0–6) vs. application cores per node (13–16), with and without progression]
Runs done on 64 XK nodes
MILC Application
● Flat MPI application
● Message sizes more varied, both eager and rendezvous protocols are being used
● Significant amount of intra-node messaging
[Chart: message count (up to ~1.4×10^6) vs. message length]
MILC Application – MPI_Wait time
● App was modified to measure time spent in core gather/scatter communication scheme
● Non-blocking send/recvs used with some intervening work
● Max MPI_Wait time reduced significantly
[Chart: average and max MPI_Wait time (seconds, 0–1200) at 2048, 4096, and 8192 ranks, with and without progress]
2048-rank runs had to be run 16 ranks per node due to memory size
MILC Application – actual runtime
● The actual runtime changed much more modestly
● Load imbalance and MPI_Allreduce sync points may limit the actual speedup for this particular job type for MILC
● ~8% improvement at 8192 ranks
● ~2% improvement at 4096 ranks
[Chart: runtime (seconds, 0–4500) at 2048, 4096, and 8192 ranks, with and without progress]
Conclusions
● Thread-based method to progress does show some promise based on micro-benchmarks
● The current approach can be effective for applications with larger messages (at least 32 KB), but a mixture of messages using eager and rendezvous paths limits the usefulness of the progress threads
● Not shown in the data here, but shown in the paper: for flat MPI codes, CoreSpec is highly recommended when trying to use this thread-based progress mechanism
Future Work
● Enhance GNI interface to allow for more efficient use of the DMA (BTE) engine, allowing for better handling of shorter MPI messages (< 32 KB)
● Use extensions to the CoreSpec/job package to schedule the progress threads on unused hyperthreads for the next generation of Gemini interconnect
● Enhancements to GNI to more efficiently wake up progress threads
● Integrate the improved progress thread framework back into mainline Nemesis MPICH2 code
Acknowledgements
● John Levesque and Steve Whalen for assistance with MILC and S3D.
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency. This work was supported in part by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357.