Resilience Extensions for MPI: ULFMicl.cs.utk.edu/graphics/posters/files/SC13-ULFM.pdf · 1 4 16 64...

2
SPONSORED BY FIND OUT MORE AT http://fault-tolerance.org/ Resilience Extensions for MPI: ULFM User Level Failure Mitigation is a set of MPI interface extensions enabling Message Passing programs to restore MPI communication capabilities affected by process failures. It supports rebuilding communicators, RMA windows and I/O Files. No particular recovery model is imposed or favored. Instead, a set of versatile APIs is included that provides support for different recovery styles (checkpoint, ABFT, iterative, Master-Worker, etc.). Application directs the recovery, it pays only for the level of protection it needs. Recovery can be restricted to a subgroup, preserving scalability and easing the composition of libraries. Backward compatible with legacy, fragile applications. Simple and familiar concepts to repair MPI. Portability guaranteed by standardization. Provides key MPI concepts to enable FT support from library, runtime and language extensions. Protective actions are outside of critical MPI routines. MPI implementors can uphold communication, collective, one-sided and I/O management algorithms unmodified. Encourages programs to be reactive to failures, cost manifests only at recovery. FLEXIBILITY PERFORMANCE PRODUCTIVITY The ULFM specification is a crucial infrastructure that will enable the deployment of advanced, production quality fault tolerant techniques; it is a versatile solution to improve the efficiency of novel and established techniques. National Science Foundation a b c d b e Master Worker0 Worker1 Worker2 TIME Protection blocks Factorized in previous iterations trailing matrix & protection update by applying the same operations Factorized in previous iterations Factorize ABFT Coordinated Checkpoint/Restart, Automatic, Compiler Assisted, User-driven Checkpointing, etc. ULFM makes these approaches portable across MPI implementations Naturally Fault Tolerant Applications, Master-Worker, Domain Decomposition, etc. ULFM allows for the deployment of ultra-scalable, algorithm specific FT techniques. ULFM MPI Specification Uncoordinated Checkpoint/Restart, Transactional FT, Migration, Replication, etc. Algorithm Fault Tolerance Application continues a simple communication pattern, ignoring failures In-place restart (i.e., without disposing of non-failed processes) accelerates recovery, permits in-memory checkpoint

Transcript of Resilience Extensions for MPI: ULFMicl.cs.utk.edu/graphics/posters/files/SC13-ULFM.pdf · 1 4 16 64...

Page 1: Resilience Extensions for MPI: ULFMicl.cs.utk.edu/graphics/posters/files/SC13-ULFM.pdf · 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M LATENCY (us) 4 4.5 5 5.5 6 6.5 7 7.5 8 1 4 16 64 256

SPONSORED BY

FIND OUT MORE AT http://fault-tolerance.org/

Resilience Extensions for MPI: ULFM

User Level Failure Mitigation is a set of MPI interface extensions enabling Message Passing programs to restore MPI communication capabilities affected by process failures. It supports rebuilding communicators, RMA windows and I/O Files.

No particular recovery model is imposed or favored. Instead, a set of versatile APIs is included that provides support for different recovery styles (checkpoint, ABFT, iterative, Master-Worker, etc.).

Application directs the recovery, it pays only for the level of protection it needs.

Recovery can be restricted to a subgroup, preserving scalability and easing the composition of libraries.

Backward compatible with legacy, fragile applications.

Simple and familiar concepts to repair MPI.

Portability guaranteed by standardization.

Provides key MPI concepts to enable FT support from library, runtime and language extensions.

Protective actions are outside of critical MPI routines.

MPI implementors can uphold communication, collective, one-sided and I/O management algorithms unmodified.

Encourages programs to be reactive to failures, cost manifests only at recovery.

FLEXIBILITY PERFORMANCE PRODUCTIVITY

The ULFM specification is a crucial infrastructure that will enable the deployment of advanced, production quality fault tolerant techniques; it is a versatile solution to improve the efficiency of novel and established techniques.

National Science Foundation

a b

c

d

b

e

Master

Worker0 Worker1 Worker2

TIME

Pro

tectio

n b

locks

Fa

ctorize

d in

p

revio

us ite

ratio

ns

trailing matrix & protection

update by applying the

same operations

Fa

ctorize

d in

p

revio

us ite

ratio

ns

Fa

cto

rize

ABFT

Coordinated Checkpoint/Restart, Automatic, Compiler Assisted, User-driven Checkpointing, etc.

ULFM makes these approaches portable across MPI implementations

Naturally Fault Tolerant Applications, Master-Worker, Domain Decomposition, etc.

ULFM allows for the deployment of ultra-scalable, algorithm specific FT techniques.

ULFM MPISpecification

Uncoordinated Checkpoint/Restart, Transactional FT, Migration, Replication, etc.

Algorithm Fault Tolerance

Application continues a simple communication pattern, ignoring failures

In-place restart (i.e., without disposing of non-failed processes) accelerates recovery, permits in-memory checkpoint

Page 2: Resilience Extensions for MPI: ULFMicl.cs.utk.edu/graphics/posters/files/SC13-ULFM.pdf · 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M LATENCY (us) 4 4.5 5 5.5 6 6.5 7 7.5 8 1 4 16 64 256

Resilience Extensions for MPI: ULFMULFM provides targeted interfaces to empower recovery strategies with adequate options to restore communication capabilities and global consistency, at the necessary levels only.

Sequoia AMG is an unstructured physics mesh application with a complex communication pattern that employs both point-to-point and collective operations. Its failure free performance is unchanged whether it is deployed with ULFM or normal Open MPI.

The failure of rank 3 is detected and managed by rank 2 during the 512 bytes message test. The connectivity and bandwidth between rank 0 and rank 1 are unaffected by failure handling activities at rank 2.

CONTINUE ACROSS ERRORS

In ULFM, failures do not alter the state of MPI communicators. Point-to-point operations can continue undisturbed between non-faulty processes. ULFM imposes no recovery cost on simple communication patterns that can proceed despite failures.

GROUP EXCEPTIONS

Consistent reporting of failures would add an unacceptable performance penalty. In ULFM, errors are raised only at ranks where an operation is disrupted; other ranks may still complete their operations. A process can use MPI_[Comm,Win,File]_revoke to propagate an error notification on the entire group, and could, for example, interrupt other ranks to join a coordinated recovery.

COLLECTIVE OPERATIONS

Allowing collective operations to operate on damaged MPI objects (Communicators, RMA windows or Files) would incur unacceptable overhead. The MPI_Comm_shrink routine builds a replacement communicator, excluding failed processes, which can be used to resume collective communications, spawn replacement processes, and rebuild RMA Windows and Files.

Master

W1

W2

Wn

Send (W1,T1)Submit T1

Send (W2,T1)Resubmit

Recv (ANY)Detected W1

Recv(P1): failureP2 calls RevokeP1

P2

P3

Pn

Recv(P1) Recv(P1): revoked

Recovery

P1

P2

P3

Pn

Bcast

Bcast

Shrink

Bcast

BA

ND

WID

TH

(G

bit

/s)

MESSAGE SIZE (Bytes)

ULFM Fault Tolerant MPI Performance with failuresIMB Ping-pong between ranks 0 and 1 (IB20G)

Open MPIFT Open MPI (w/failure at rank 3)

0

1

2

3

4

5

6

7

8

9

10

11

12

1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M

LA

TE

NC

Y (

us

)

4

4.5

5

5.5

6

6.5

7

7.5

8

1 4 16 64 256 1K

-1%

-0.5%

+0%

+0.5%

+1%

8 16 32 64 128 256 512

DIF

FE

RE

NC

E IN

RU

NN

ING

TIM

E

NUMBER OF PROCESSES

Sequoia AMG Performance with Fault Tolerance

No

n-F

T is f

aste

rU

LF

M is f

aste

r

OPEN MPI ULFM IMPLEMENTATION PERFORMANCE