Resilience Extensions for MPI: ULFMicl.cs.utk.edu/graphics/posters/files/SC13-ULFM.pdf · 1 4 16 64...
Transcript of Resilience Extensions for MPI: ULFMicl.cs.utk.edu/graphics/posters/files/SC13-ULFM.pdf · 1 4 16 64...
SPONSORED BY
FIND OUT MORE AT http://fault-tolerance.org/
Resilience Extensions for MPI: ULFM
User Level Failure Mitigation is a set of MPI interface extensions enabling Message Passing programs to restore MPI communication capabilities affected by process failures. It supports rebuilding communicators, RMA windows and I/O Files.
No particular recovery model is imposed or favored. Instead, a set of versatile APIs is included that provides support for different recovery styles (checkpoint, ABFT, iterative, Master-Worker, etc.).
Application directs the recovery, it pays only for the level of protection it needs.
Recovery can be restricted to a subgroup, preserving scalability and easing the composition of libraries.
Backward compatible with legacy, fragile applications.
Simple and familiar concepts to repair MPI.
Portability guaranteed by standardization.
Provides key MPI concepts to enable FT support from library, runtime and language extensions.
Protective actions are outside of critical MPI routines.
MPI implementors can uphold communication, collective, one-sided and I/O management algorithms unmodified.
Encourages programs to be reactive to failures, cost manifests only at recovery.
FLEXIBILITY PERFORMANCE PRODUCTIVITY
The ULFM specification is a crucial infrastructure that will enable the deployment of advanced, production quality fault tolerant techniques; it is a versatile solution to improve the efficiency of novel and established techniques.
National Science Foundation
a b
c
d
b
e
Master
Worker0 Worker1 Worker2
TIME
Pro
tectio
n b
locks
Fa
ctorize
d in
p
revio
us ite
ratio
ns
trailing matrix & protection
update by applying the
same operations
Fa
ctorize
d in
p
revio
us ite
ratio
ns
Fa
cto
rize
ABFT
Coordinated Checkpoint/Restart, Automatic, Compiler Assisted, User-driven Checkpointing, etc.
ULFM makes these approaches portable across MPI implementations
Naturally Fault Tolerant Applications, Master-Worker, Domain Decomposition, etc.
ULFM allows for the deployment of ultra-scalable, algorithm specific FT techniques.
ULFM MPISpecification
Uncoordinated Checkpoint/Restart, Transactional FT, Migration, Replication, etc.
Algorithm Fault Tolerance
Application continues a simple communication pattern, ignoring failures
In-place restart (i.e., without disposing of non-failed processes) accelerates recovery, permits in-memory checkpoint
Resilience Extensions for MPI: ULFMULFM provides targeted interfaces to empower recovery strategies with adequate options to restore communication capabilities and global consistency, at the necessary levels only.
Sequoia AMG is an unstructured physics mesh application with a complex communication pattern that employs both point-to-point and collective operations. Its failure free performance is unchanged whether it is deployed with ULFM or normal Open MPI.
The failure of rank 3 is detected and managed by rank 2 during the 512 bytes message test. The connectivity and bandwidth between rank 0 and rank 1 are unaffected by failure handling activities at rank 2.
CONTINUE ACROSS ERRORS
In ULFM, failures do not alter the state of MPI communicators. Point-to-point operations can continue undisturbed between non-faulty processes. ULFM imposes no recovery cost on simple communication patterns that can proceed despite failures.
GROUP EXCEPTIONS
Consistent reporting of failures would add an unacceptable performance penalty. In ULFM, errors are raised only at ranks where an operation is disrupted; other ranks may still complete their operations. A process can use MPI_[Comm,Win,File]_revoke to propagate an error notification on the entire group, and could, for example, interrupt other ranks to join a coordinated recovery.
COLLECTIVE OPERATIONS
Allowing collective operations to operate on damaged MPI objects (Communicators, RMA windows or Files) would incur unacceptable overhead. The MPI_Comm_shrink routine builds a replacement communicator, excluding failed processes, which can be used to resume collective communications, spawn replacement processes, and rebuild RMA Windows and Files.
Master
W1
W2
Wn
Send (W1,T1)Submit T1
Send (W2,T1)Resubmit
Recv (ANY)Detected W1
Recv(P1): failureP2 calls RevokeP1
P2
P3
Pn
Recv(P1) Recv(P1): revoked
Recovery
P1
P2
P3
Pn
Bcast
Bcast
Shrink
Bcast
BA
ND
WID
TH
(G
bit
/s)
MESSAGE SIZE (Bytes)
ULFM Fault Tolerant MPI Performance with failuresIMB Ping-pong between ranks 0 and 1 (IB20G)
Open MPIFT Open MPI (w/failure at rank 3)
0
1
2
3
4
5
6
7
8
9
10
11
12
1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M
LA
TE
NC
Y (
us
)
4
4.5
5
5.5
6
6.5
7
7.5
8
1 4 16 64 256 1K
-1%
-0.5%
+0%
+0.5%
+1%
8 16 32 64 128 256 512
DIF
FE
RE
NC
E IN
RU
NN
ING
TIM
E
NUMBER OF PROCESSES
Sequoia AMG Performance with Fault Tolerance
No
n-F
T is f
aste
rU
LF
M is f
aste
r
OPEN MPI ULFM IMPLEMENTATION PERFORMANCE