Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji
Argonne National Laboratory
{wbland, huiweilu, sseo, balaji}@anl.gov
May 5, 2015
Lessons Learned Implementing User-Level Failure Mitigation in MPICH
CCGrid 2015
User-Level Failure Mitigation (ULFM)
• What is ULFM?
– A proposal for a standardized way of handling fail-stop process failures in MPI
– Mechanisms necessary to implement fault tolerance in applications and libraries in order to allow applications to continue execution after failures
• ULFM introduces semantics to define failure notification, propagation, and recovery within MPI
[Figure: three panels titled Failure Notification, Failure Propagation, and Failure Recovery. Labels: ranks 0 and 1, MPI_Recv, MPI_ERR_PROC_FAILED, MPI_COMM_SHRINK()]
Motivation & Goal
• ULFM is becoming the front-running solution for process fault tolerance in MPI
– Not yet adopted into the MPI standard
– Being used by applications and libraries and is being
• Introduce an implementation of ULFM in MPICH
– MPICH is a high-performance and widely portable implementation of the MPI standard
– Implementing ULFM in MPICH will expedite wider adoption by other MPI implementations
• Demonstrate that, although this is still a reference implementation, the runtime cost of the new API calls is relatively low
Implementation & Evaluation
• ULFM Implementation in MPICH
• Failure Detection
– Local failures detected by Hydra and netmods
– Error codes are returned to the user from the API calls
• Agreement
– Uses two group-based allreduce operations
– If either fails, an error is returned to the user
• Revocation
– Non-optimized implementation done with message flood
• Shrinking
– All processes construct a consistent group of failed processes via allreduce
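The message-flood revocation above can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not MPICH's code: it assumes "flood" means that every live process forwards the revoke notice to all other ranks the first time it sees it. The function name flood_revoke, the alive/revoked arrays, and the MAX_PROCS bound are hypothetical and exist only for this illustration; no MPI calls are used.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64   /* hypothetical bound for this simulation */

/* Simulate a flooded revoke notice (assumed behaviour, not MPICH's code).
 * alive[i]   : whether rank i is still running
 * revoked[i] : set to true for every live rank that learns of the revoke
 * Returns the total number of messages sent, including sends to dead ranks. */
int flood_revoke(int nprocs, const bool alive[], int initiator, bool revoked[])
{
    int queue[MAX_PROCS];   /* ranks that still have to forward the notice */
    int head = 0, tail = 0;
    int messages = 0;

    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    assert(initiator >= 0 && initiator < nprocs && alive[initiator]);

    for (int i = 0; i < nprocs; i++)
        revoked[i] = false;

    revoked[initiator] = true;
    queue[tail++] = initiator;

    while (head < tail) {
        int sender = queue[head++];
        for (int dest = 0; dest < nprocs; dest++) {
            if (dest == sender)
                continue;
            messages++;                  /* the sender cannot know dest is dead */
            if (alive[dest] && !revoked[dest]) {
                revoked[dest] = true;    /* first notice: mark, forward later */
                queue[tail++] = dest;    /* each live rank is queued only once */
            }
        }
    }
    return messages;
}
```

In this model every live rank sends to every other rank exactly once, so the message count grows quadratically with the number of survivors, which is consistent with the slide calling the approach non-optimized.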
Shrinking
Slower than MPI_COMM_DUP because of failure detection. As expected, introducing failures in the middle of the algorithm causes a large runtime increase, as the algorithm must restart.
[Figure: runtime in milliseconds (ms) versus number of processes (16, 32, 64, 128, 256, 512) for three series: Dup, No Failure, One Failure. Y-axis ticks: 0, 17.5, 35, 52.5, 70]
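The core idea of the shrinking step, building one consistent group of failed processes with an allreduce, can be sketched without MPI as follows. This is a minimal simulation under stated assumptions: the allreduce is modelled as a bitwise OR over the local failure views of the live ranks, survivors are assumed to keep their relative order in the new communicator, and at most 64 ranks are supported. The names agree_on_failed and shrink_ranks are hypothetical. The restart on a failure in the middle of the algorithm is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated allreduce (bitwise OR) over the live ranks' local views.
 * local_view[r] is rank r's bitmask of ranks it believes have failed
 * (bit k set = rank k failed). Ranks absent from alive_mask contribute
 * nothing, just as a dead process would not take part in the reduction. */
uint64_t agree_on_failed(int nprocs, uint64_t alive_mask,
                         const uint64_t local_view[])
{
    uint64_t failed = 0;

    assert(nprocs > 0 && nprocs <= 64);
    for (int r = 0; r < nprocs; r++) {
        if (alive_mask & (UINT64_C(1) << r))
            failed |= local_view[r];
    }
    return failed;
}

/* Assign ranks in the shrunken communicator. new_rank[r] receives the new
 * rank of old rank r, or -1 if r is in the failed group. Survivors keep
 * their relative order (an assumption of this sketch).
 * Returns the size of the shrunken communicator. */
int shrink_ranks(int nprocs, uint64_t failed, int new_rank[])
{
    int next = 0;

    assert(nprocs > 0 && nprocs <= 64);
    for (int r = 0; r < nprocs; r++) {
        if (failed & (UINT64_C(1) << r))
            new_rank[r] = -1;
        else
            new_rank[r] = next++;
    }
    return next;
}
```

Because every survivor ends up with the same OR-ed bitmask, each one derives the same new rank assignment locally, even when only some ranks had noticed the failure beforehand.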