Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou,
Thara Angskun, George Bosilca, Jack Dongarra
Presented by Todd Gamblin
Background: Failure
- The MTTF of high-performance computers is becoming shorter than the execution times of HPC applications: even 10,000 processors can imply failures every hour, and BlueGene/L already has 65,000 processors, growing to 131,000.
- Why so bad? Commodity parts are cheap, so more and more computers are built with them.
- Commodity parts => commodity-targeted MTTF: great for your desktop, not so great for 100,000 of them.
General solution: Checkpointing
- Save application state at synchronization points so that we can recover from a fault.
- Disk-based: save state to disk or other stable storage. High overhead for copying data to disk, but persistent; enables two-level fault tolerance and can survive failure of all processors.
- Diskless: keep redundant copies of state in the memory of other CPUs. Faster, but can't survive total failure.
What policy?
1. Runtime system does checkpointing Completely general, no programmer effort Must save everything Binary memory dumps rule out recovery on heterogeneous
systems Binary dumps don’t work in some cases (round-off error in
reversed FP computations during rollback causes failure)
2. Application does checkpointing (Authors like this one) Requires programmer effort Can streamline amt. of state that needs to be saved Can get consistency for free by placing checkpoints at
application’s synch points Can store machine-independent data, recover on diverse
systems
FT-MPI: Application-level checkpointing
What happens to communicators after a failure? Four modes:
- Abort everything (the default in all MPI implementations).
- Failed processes just die, the others keep running, and MPI_COMM_WORLD has holes.
- Failed processes die, but MPI_COMM_WORLD shrinks and ranks can change.
- Failed processes are respawned, ranks stay the same, and MPI_COMM_WORLD keeps its size.
What happens to messages on failure?
- All operations that would have returned MPI_SUCCESS finish properly, even if a process died.
- All operations in a collective communication fail if a process fails.
That's it! Everything else is up to the application.
Diskless Checkpointing
So… we should probably do something about those faults, since FT-MPI doesn't.
The paper tells us how to restore the state for floating-point data.
Two schemes:
1. Mirrored: store copies of data on neighbors.
2. Checksum: store a checksum of the FP values on neighbors.
Neighbor-Based Checkpointing
• Mirroring: dedicated checkpoint processors hold full copies. Survives up to n failures, so long as a compute processor and its checkpoint processor don't both fail. Requires redundant processors.
• Ring neighbor: no redundant processors; each processor stores its checkpoint in a neighbor's memory. Survives up to floor(n/2) failures, again depending on their distribution; two adjacent neighbors can't both fail.
• Pair neighbor: best fault tolerance of these, but a pair of partnered processors still can't both fail.
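One of these neighbor-based schemes pairs processors so that each holds a copy of its partner's checkpoint, letting either member of a pair be rebuilt. A minimal sketch of that idea (all names here are illustrative, not FT-MPI API):

```python
# Pair-neighbor checkpointing sketch: each partner stores a copy of the
# other's data, so the data of a single failed partner can be restored.

def checkpoint_pair(data_a, data_b):
    """Each partner stores a copy of the other's checkpoint."""
    return {"a_copy_on_b": list(data_a), "b_copy_on_a": list(data_b)}

def recover(lost, ckpt):
    """Rebuild one partner's data from the surviving copy."""
    if lost == "a":
        return list(ckpt["a_copy_on_b"])   # copy survives on processor B
    return list(ckpt["b_copy_on_a"])       # copy survives on processor A

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
ckpt = checkpoint_pair(a, b)
assert recover("a", ckpt) == a   # A's state rebuilt after A fails
```

If both partners in a pair fail, the copy is gone too, which is exactly the "neighbors can't both fail" restriction above.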
Basic Checksums
• Straight sum of the FP numbers, held on a checkpoint processor.
• Can't withstand more than one failure: on a single failure there is one equation and one unknown, so we recalculate the unknown.
• The likelihood of an unrecoverable failure depends on the distribution of failures among the groups.
• Checkpoint encodings can be done in parallel.
• Probability of failure: formula given on the slide (not captured in this transcript).
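The one-equation-one-unknown recovery can be sketched in a few lines (a toy stand-in for the paper's MPI implementation; function names are invented for this example):

```python
# Basic checksum scheme sketch: the checkpoint processor holds the
# element-wise sum of all compute processors' floating-point vectors.

def encode(vectors):
    """Checkpoint = element-wise sum across processors."""
    return [sum(col) for col in zip(*vectors)]

def recover_one(vectors, checksum, failed):
    """One failure => one equation, one unknown per element:
    lost = checksum - sum(survivors)."""
    survivors = [v for i, v in enumerate(vectors) if i != failed]
    return [c - sum(col) for c, col in zip(checksum, zip(*survivors))]

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 compute processors
cs = encode(data)                              # held on the checkpoint processor
assert recover_one(data, cs, failed=1) == [3.0, 4.0]
```

With two failures there would be two unknowns per element but still only one equation, which is why this scheme tolerates at most one failure per group.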
Weighted Checksums
• Can survive as long as there are more live checksum processors than failed compute processors.
• Each checksum processor j holds a weighted sum c_j = w_j1·x_1 + … + w_jn·x_n of the compute processors' data; to regenerate the data at each failed P_i, we solve this system of equations.
• Multiple groups repeat the same setup.
• Weightings and the number of checkpoint nodes can be adapted to the reliability of particular subgroups.
Need to avoid numerical error
- Recomputing checkpoints involves solving a system of equations, so we need a well-conditioned weighting matrix; in fact, every square submatrix must be well-conditioned.
- Solution: use a Gaussian random matrix. Gaussian random matrices are well-conditioned with high probability, and they have the nice property that any submatrix of a Gaussian random matrix is itself Gaussian.
- Average loss of 1 digit of precision on reconstruction; the probability of losing 2 digits is 3.1e-11.
- See the paper (actually another referenced paper) for the proof.
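A toy sketch of weighted-checksum recovery under these assumptions: weights are drawn from a Gaussian, the surviving processors' contributions are moved to the right-hand side, and the remaining square system in the lost values is solved by Gaussian elimination. Everything here (the `solve` helper, variable names, sizes) is invented for illustration, not the paper's code:

```python
# Weighted-checksum recovery sketch: m checksum processors each hold a
# Gaussian-weighted sum of the n compute processors' values; any m failed
# compute values can be recovered by solving an m x m linear system.
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

random.seed(0)
n, m = 4, 2                      # 4 compute procs, 2 checksum procs
data = [1.0, 2.0, 3.0, 4.0]      # one FP value per compute processor
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
checksums = [sum(W[j][i] * data[i] for i in range(n)) for j in range(m)]

# Two compute processors fail; two checksums survive, so recovery works.
failed = [1, 3]
A = [[W[j][i] for i in failed] for j in range(m)]
b = [checksums[j] - sum(W[j][i] * data[i] for i in range(n) if i not in failed)
     for j in range(m)]
lost = solve(A, b)
assert all(abs(lost[k] - data[f]) < 1e-6 for k, f in enumerate(failed))
```

The Gaussian weights make the sub-system `A` well-conditioned with high probability, which is exactly why the reconstruction above loses so little precision.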
Results
- Tested checkpointing & FT-MPI with a Conjugate Gradient solver, checkpointing only 3 vectors and 2 scalars (a light load).
- More performance overhead than the mirrored approach, but half the storage overhead.
- Performance of FT-MPI: comparable to MPICH2 (slightly faster), and 2x the speed of MPICH-1.
- Overhead of weighted checkpointing: about 2% with 5 checkpoint nodes and 64 compute nodes.
- Overhead of recovery: about 1% with 5 checkpoint nodes and 64 compute nodes.
- Numerical error in the solver's residuals: < 5.0e-6.
Questions
How easy would it be to automate FP checkpointing like this? It seems like a pain to add to everything. The authors suggest adding it to numerical packages. Could we make a tool? CpPablo?
Can we make the weights/groups of checkpointed processors adaptive? E.g., we might want to assign groups based on hot/cold areas in the machine room.
What other ways are there around the problems of binary checkpointing in heterogeneous environments?