Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou,
Thara Angskun, George Bosilca, Jack Dongarra
Presented by Todd Gamblin
Background: Failure
- The MTTF of high-performance computers is becoming shorter than the execution times of HPC applications: even 10,000 processors can imply failures every hour, and BlueGene/L already has 65,000 processors, growing to 131,000.
- Why so bad? Commodity parts are cheap, so more and more computers are built with them.
- Commodity parts => commodity-targeted MTTF: great for your desktop, not so great for 100,000 of them.
General solution: Checkpointing
- Save application state at synchronization points so that we can recover from a fault.
- Disk-based: save state to disk or other stable storage. High overhead for copying data to disk, but persistent; enables two-level fault tolerance and can survive failure of all processors.
- Diskless: keep redundant copies of state in the memory of other CPUs. Faster, but can't survive total failure.
What policy?
1. Runtime system does checkpointing Completely general, no programmer effort Must save everything Binary memory dumps rule out recovery on heterogeneous
systems Binary dumps don’t work in some cases (round-off error in
reversed FP computations during rollback causes failure)
2. Application does checkpointing (Authors like this one) Requires programmer effort Can streamline amt. of state that needs to be saved Can get consistency for free by placing checkpoints at
application’s synch points Can store machine-independent data, recover on diverse
systems
FT-MPI: Application-level checkpointing
What happens to communicators after a failure? Four modes:
- Abort everything (the default in all MPI implementations).
- Failed processes just die, the others keep running, and MPI_COMM_WORLD has holes.
- Failed processes die, but MPI_COMM_WORLD shrinks and ranks can change.
- Failed processes are respawned, ranks stay the same, and MPI_COMM_WORLD keeps its size.
What happens to messages on failure?
- All operations that would have returned MPI_SUCCESS finish properly, even if a process died.
- All operations in a collective communication fail if a process fails.
That's it! Everything else is up to the application.
Diskless Checkpointing
So… we should probably do something about those faults, since FT-MPI doesn't.
The paper tells us how to restore the state for floating-point data.
Two schemes:
1. Mirrored: store copies of data on neighbors.
2. Checksum: store a checksum of the FP values on neighbors.
Neighbor-Based Checkpointing
• Mirroring: dedicated checkpoint processors hold full copies. Survives up to n failures, so long as a compute processor and its checkpoint processor don't both fail. Requires redundant processors.
• Ring neighbor: no redundant processors; each processor stores its checkpoint in a neighbor's memory. Survives up to floor(n/2) failures, again depending on their distribution; two adjacent neighbors can't both fail.
• Pair neighbor: best fault tolerance of these, but a pair of partnered processors still can't both fail.
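One of these neighbor-based schemes pairs processors so that each holds a copy of its partner's checkpoint, letting either member of a pair be rebuilt. A minimal sketch of that idea (all names here are illustrative, not FT-MPI API):

```python
# Pair-neighbor checkpointing sketch: each partner stores a copy of the
# other's data, so the data of a single failed partner can be restored.

def checkpoint_pair(data_a, data_b):
    """Each partner stores a copy of the other's checkpoint."""
    return {"a_copy_on_b": list(data_a), "b_copy_on_a": list(data_b)}

def recover(lost, ckpt):
    """Rebuild one partner's data from the surviving copy."""
    if lost == "a":
        return list(ckpt["a_copy_on_b"])   # copy survives on processor B
    return list(ckpt["b_copy_on_a"])       # copy survives on processor A

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
ckpt = checkpoint_pair(a, b)
assert recover("a", ckpt) == a   # A's state rebuilt after A fails
```

If both partners in a pair fail, the copy is gone too, which is exactly the "neighbors can't both fail" restriction above.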
Basic Checksums
• Straight sum of the FP numbers, held on a checkpoint processor.
• Can't withstand more than one failure: on a single failure there is one equation and one unknown, so we recalculate the unknown.
• The likelihood of an unrecoverable failure depends on the distribution of failures among the groups.
• Checkpoint encodings can be done in parallel.
• Probability of failure: formula given on the slide (not captured in this transcript).
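The one-equation-one-unknown recovery can be sketched in a few lines (a toy stand-in for the paper's MPI implementation; function names are invented for this example):

```python
# Basic checksum scheme sketch: the checkpoint processor holds the
# element-wise sum of all compute processors' floating-point vectors.

def encode(vectors):
    """Checkpoint = element-wise sum across processors."""
    return [sum(col) for col in zip(*vectors)]

def recover_one(vectors, checksum, failed):
    """One failure => one equation, one unknown per element:
    lost = checksum - sum(survivors)."""
    survivors = [v for i, v in enumerate(vectors) if i != failed]
    return [c - sum(col) for c, col in zip(checksum, zip(*survivors))]

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 compute processors
cs = encode(data)                              # held on the checkpoint processor
assert recover_one(data, cs, failed=1) == [3.0, 4.0]
```

With two failures there would be two unknowns per element but still only one equation, which is why this scheme tolerates at most one failure per group.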
Weighted Checksums
• Can survive as long as there are more live checksum processors than failed compute processors.
• Each checksum processor j holds a weighted sum c_j = w_j1·x_1 + … + w_jn·x_n of the compute processors' data; to regenerate the data at each failed P_i, we solve this system of equations.
• Multiple groups repeat the same setup.
• Weightings and the number of checkpoint nodes can be adapted to the reliability of particular subgroups.
Need to avoid numerical error
- Recomputing checkpoints involves solving a system of equations, so we need a well-conditioned weighting matrix; in fact, every square submatrix must be well-conditioned.
- Solution: use a Gaussian random matrix. Gaussian random matrices are well-conditioned with high probability, and they have the nice property that any submatrix of a Gaussian random matrix is itself Gaussian.
- Average loss of 1 digit of precision on reconstruction; the probability of losing 2 digits is 3.1e-11.
- See the paper (actually another referenced paper) for the proof.
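A toy sketch of weighted-checksum recovery under these assumptions: weights are drawn from a Gaussian, the surviving processors' contributions are moved to the right-hand side, and the remaining square system in the lost values is solved by Gaussian elimination. Everything here (the `solve` helper, variable names, sizes) is invented for illustration, not the paper's code:

```python
# Weighted-checksum recovery sketch: m checksum processors each hold a
# Gaussian-weighted sum of the n compute processors' values; any m failed
# compute values can be recovered by solving an m x m linear system.
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

random.seed(0)
n, m = 4, 2                      # 4 compute procs, 2 checksum procs
data = [1.0, 2.0, 3.0, 4.0]      # one FP value per compute processor
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
checksums = [sum(W[j][i] * data[i] for i in range(n)) for j in range(m)]

# Two compute processors fail; two checksums survive, so recovery works.
failed = [1, 3]
A = [[W[j][i] for i in failed] for j in range(m)]
b = [checksums[j] - sum(W[j][i] * data[i] for i in range(n) if i not in failed)
     for j in range(m)]
lost = solve(A, b)
assert all(abs(lost[k] - data[f]) < 1e-6 for k, f in enumerate(failed))
```

The Gaussian weights make the sub-system `A` well-conditioned with high probability, which is exactly why the reconstruction above loses so little precision.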
Results
- Tested checkpointing & FT-MPI with a Conjugate Gradient solver, checkpointing only 3 vectors and 2 scalars (a light load).
- More performance overhead than the mirrored approach, but half the storage overhead.
- Performance of FT-MPI: comparable to MPICH2 (slightly faster), and 2x the speed of MPICH-1.
- Overhead of weighted checkpointing: about 2% with 5 checkpoint nodes and 64 compute nodes.
- Overhead of recovery: about 1% with 5 checkpoint nodes and 64 compute nodes.
- Numerical error in the solver's residuals: < 5.0e-6.
Questions
How easy would it be to automate FP checkpointing like this? It seems like a pain to add to everything. The authors suggest adding it to numerical packages. Could we make a tool? CpPablo?
Can we make the weights/groups of checkpointed processors adaptive? E.g., we might want to assign groups based on hot/cold areas in the machine room.
What other ways are there around the problems of binary checkpointing in heterogeneous environments?