
Transcript of: Lessons Learned Implementing User-Level Failure Mitigation in MPICH (CCGrid 2015)

Page 1:

Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji
Argonne National Laboratory

{wbland, huiweilu, sseo, balaji}@anl.gov

May 5, 2015

Lessons Learned Implementing User-Level Failure Mitigation in MPICH

CCGrid 2015

Page 2:


User-Level Failure Mitigation (ULFM)

• What is ULFM?
– A proposed, standardized way of handling fail-stop process failures in MPI
– The mechanisms necessary to implement fault tolerance in applications and libraries, allowing applications to continue execution after failures

• ULFM introduces semantics to define failure notification, propagation, and recovery within MPI


[Figure: two processes, ranks 0 and 1. An MPI_Recv involving the failed process returns MPI_ERR_PROC_FAILED (Failure Notification); the failure is then propagated (Failure Propagation), and MPI_COMM_SHRINK() rebuilds the communicator (Failure Recovery).]
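
The figure's three stages map directly onto API calls. Below is a minimal C sketch of that cycle, assuming MPICH's MPIX_-prefixed spellings of the proposed ULFM interfaces (MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED); it is illustrative, not code from the talk.

/* Minimal sketch: notification -> propagation -> recovery, assuming a
 * two-process job in which the peer process has crashed. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, rc, eclass;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Failure notification is delivered as an error code, so errors
     * must be returned to the caller rather than aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Notification: a receive involving the failed peer completes with
     * an error of class MPIX_ERR_PROC_FAILED instead of hanging. */
    rc = MPI_Recv(&buf, 1, MPI_INT, (rank + 1) % 2, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE);
    MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED) {
        /* Propagation: make every rank aware of the failure. */
        MPIX_Comm_revoke(MPI_COMM_WORLD);

        /* Recovery: build a new communicator without the failed ranks. */
        MPIX_Comm_shrink(MPI_COMM_WORLD, &newcomm);
        /* ... continue the application on newcomm ... */
    }

    MPI_Finalize();
    return 0;
}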

Page 3:


Motivation & Goal

• ULFM is becoming the front-running solution for process fault tolerance in MPI
– Not yet adopted into the MPI standard
– Being used by applications and libraries

• Introduce an implementation of ULFM in MPICH
– MPICH is a high-performance and widely portable implementation of the MPI standard
– Implementing ULFM in MPICH will expedite wider adoption by other MPI implementations

• Demonstrate that, while still a reference implementation, the runtime cost of the new API calls is relatively low


Page 4:


Implementation & Evaluation

• ULFM Implementation in MPICH

• Failure Detection
– Local failures are detected by Hydra and the netmods
– Error codes are returned to the user from the API calls

• Agreement
– Uses two group-based allreduce operations
– If either fails, an error is returned to the user (see the usage sketch after this list)

• Revocation
– Non-optimized implementation done with a message flood

• Shrinking
– All processes construct a consistent group of failed procs via allreduce
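
For the agreement step, the user-facing entry point in MPICH's ULFM extension is MPIX_Comm_agree, which agrees on an integer value by bitwise AND across the surviving processes. The sketch below shows one way to use it (the function and variable names here are hypothetical); the two group-based allreduce operations the slide mentions are hidden behind the call.

/* Sketch: fault-tolerant agreement via MPICH's MPIX_Comm_agree. */
#include <mpi.h>

int agree_on_success(MPI_Comm comm, int local_step_ok)
{
    int flag = local_step_ok ? 1 : 0;
    /* Bitwise AND of flag across all surviving processes. */
    int rc = MPIX_Comm_agree(comm, &flag);
    if (rc != MPI_SUCCESS) {
        /* A process failed during the agreement itself; the caller can
         * revoke and shrink the communicator to recover. */
        return -1;
    }
    return flag;   /* identical on every surviving rank */
}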


Shrinking

Shrinking is slower than MPI_COMM_DUP because of the added failure detection. As expected, introducing a failure in the middle of the algorithm causes a large runtime increase, since the algorithm must restart.

[Figure: Shrinking time vs. number of processes (16 to 512). Y-axis: milliseconds (ms), 0 to 70. Series: Dup, No Failure, One Failure.]
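
For context on the measurement, a plausible timing harness looks like the following sketch (an assumed structure, not the authors' actual benchmark). MPI_Comm_dup is the natural failure-free baseline, since both calls produce a new communicator.

/* Sketch: timing MPI_Comm_dup vs. MPIX_Comm_shrink (assumed harness). */
#include <mpi.h>
#include <stdio.h>

static void time_dup_vs_shrink(void)
{
    MPI_Comm dup_comm, shrunk_comm;
    double t0, t_dup, t_shrink;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);
    t_dup = MPI_Wtime() - t0;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    /* Includes the allreduce-based agreement on the failed-process
     * group, hence the extra cost relative to MPI_Comm_dup. */
    MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk_comm);
    t_shrink = MPI_Wtime() - t0;

    printf("dup: %.3f ms, shrink: %.3f ms\n", t_dup * 1e3, t_shrink * 1e3);
}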