An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific...
![Page 1: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/1.jpg)
An introduction to checkpointing
for scientific applications
damien.francois@uclouvain.be
UCL/CISM - FNRS/CÉCI
November 2013
CISM/CÉCI training session
![Page 2: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/2.jpg)
What is checkpointing?
![Page 3: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/3.jpg)
Without checkpointing:
$ ./count
1
2
3
^C
$ ./count
1
2
3

With checkpointing:
$ ./count
1
2
3
^C
$ ./count
4
5
6
![Page 4: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/4.jpg)
Checkpointing:
'saving' a computation so that it can be resumed later
(rather than started again)
![Page 5: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/5.jpg)
Why do we need checkpointing?
![Page 6: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/6.jpg)
Imagine a text editor without 'checkpointing' ...
![Page 7: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/7.jpg)
Goals of checkpointing in HPC:
1. Fit in time constraints
2. Debugging, monitoring
3. Cope with NODE_FAILs
4. Gang scheduling and preemption
![Page 8: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/8.jpg)
The idea:
Save the program state (values in variables, open files, ...)
every time a checkpoint is encountered (position in the code, signal or event, ...)
and restart from there upon (un)planned stop
rather than bootstrap again from scratch (starting loops at iteration 0, creating tmp files, ...)
![Page 9: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/9.jpg)
The key questions ...
Transparency for developer: Do I need to write a lot of additional code?
Portability to other systems: Can I stop on one system and restart on another?
Size of state to save: How many GB of disk does it require?
Checkpointing overhead: How many FLOPs are lost to ensure checkpointing?
![Page 10: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/10.jpg)
Who's in charge of all that?

| | Transparency for developer | Portability to other systems | Size of state to save | Checkpointing overhead |
| --- | --- | --- | --- | --- |
| the application itself | -- | +++ | -- | - |
| a library | - | ++ | -- | - |
| the compiler | + | ++ | - | + |
| a run-time | + | + | ++ | + |
| the OS | ++ | - | ++ | ++ |
| the hardware | +++ | -- | +++ | +++ |
![Page 11: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/11.jpg)
Today's agenda:
1. How to make your program checkpoint-able
-> concepts and examples
-> recipes (design patterns)
Slurm integration
2. How to make someone else's program checkpoint-able
-> BLCR
-> DMTCP
![Page 12: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/12.jpg)
Part One: Checkpointing when you have the code
![Page 13: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/13.jpg)
So you can play
On hmem: ~dfr/checkpoint.tgz
![Page 14: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/14.jpg)
1 Making a program checkpoint-able by saving its state every iteration and looking for a state file on startup.
![Page 15: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/15.jpg)
Python recipe
![Page 16: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/16.jpg)
R recipe
![Page 17: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/17.jpg)
Octave recipe
![Page 18: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/18.jpg)
Fortran recipe
![Page 19: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/19.jpg)
C recipe
![Page 20: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/20.jpg)
The general recipe
1. Look for a state file (name can be hardcoded or, better, passed as a parameter)
2. If found, restore the state (initialize all variables with the content of the state file)
Else, bootstrap (create the initial state)
3. Periodically save the state
In the previous example:
The state is just an integer
Periodically means at each iteration
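The three steps of the recipe can be sketched in Python as follows. This is a minimal illustration, not the code shown on the slides: the function names, the plain-text state file and the `steps` parameter are assumptions made for the example.

```python
import os
import tempfile

def load_or_bootstrap(state_file):
    """Steps 1 and 2: restore the counter if a state file exists, else start at 0."""
    if os.path.exists(state_file):
        with open(state_file) as f:
            return int(f.read().strip())
    return 0

def save_state(state_file, counter):
    """Step 3: write the current counter to the state file."""
    with open(state_file, "w") as f:
        f.write(str(counter))

def count(state_file, steps):
    """Print the next `steps` integers, checkpointing after each one."""
    counter = load_or_bootstrap(state_file)
    for _ in range(steps):
        counter += 1
        print(counter)
        save_state(state_file, counter)
    return counter

if __name__ == "__main__":
    # Hypothetical state file location, used only for this demonstration
    path = os.path.join(tempfile.mkdtemp(), "count.state")
    count(path, 3)  # first run
    count(path, 3)  # "restarted" run resumes after the last saved value
```

Run as a script, the two calls print 1 to 3 and then 4 to 6, which mirrors the "with checkpointing" behaviour of ./count.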
![Page 21: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/21.jpg)
2 Using UNIX signals to reduce overhead : do not save the state at each iteration -- wait for the signal.
![Page 22: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/22.jpg)
UNIX processes can receive 'signals' from the user, the OS, or another process
![Page 23: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/23.jpg)
UNIX processes can receive 'signals' from the user, the OS, or another process
^C
^Z
^D
fg, bg
kill -9
kill
![Page 24: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/24.jpg)
![Page 25: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/25.jpg)
![Page 26: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/26.jpg)
UNIX processes can receive 'signals' with an associated default action
![Page 27: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/27.jpg)
UNIX processes can receive 'signals' and handle ('trap') them
![Page 28: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/28.jpg)
Previous C recipe
![Page 29: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/29.jpg)
C signal recipe
![Page 30: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/30.jpg)
C signal recipe
![Page 31: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/31.jpg)
Fortran signal recipe
![Page 32: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/32.jpg)
Fortran signal recipe
![Page 33: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/33.jpg)
Python signal recipe
![Page 34: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/34.jpg)
Octave signal recipe
![Page 35: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/35.jpg)
R signal recipe
![Page 36: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/36.jpg)
The general recipe
1. Register a signal handler (a function that will modify a global variable when receiving a signal)
2. Test the value of the global variable periodically (at a moment when the state is consistent and easy to recreate)
3. If the value indicates so, save state to disk (and optionally gracefully stop)
In the previous example:
The state is just an integer
Periodically means at each iteration
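A possible Python rendering of these three steps is sketched below. It assumes a UNIX system and the SIGUSR1 signal; the names `request_checkpoint`, `count` and the `signal_at` parameter are hypothetical. `signal_at` exists only so that the example can send the signal to itself instead of relying on an external `kill -USR1 <pid>`.

```python
import os
import signal
import tempfile

checkpoint_requested = False  # global flag, set only by the signal handler

def request_checkpoint(signum, frame):
    """Step 1: the handler only records that a signal arrived."""
    global checkpoint_requested
    checkpoint_requested = True

def count(state_file, limit, signal_at=None):
    """Count up to `limit`; save the state and stop if a signal arrived.

    `signal_at` is a demo-only parameter: the process sends SIGUSR1 to itself
    when the counter reaches that value.
    """
    global checkpoint_requested
    checkpoint_requested = False
    signal.signal(signal.SIGUSR1, request_checkpoint)
    counter = 0
    if os.path.exists(state_file):  # restore a previous state if there is one
        with open(state_file) as f:
            counter = int(f.read())
    while counter < limit:
        counter += 1
        if counter == signal_at:
            os.kill(os.getpid(), signal.SIGUSR1)
        # Step 2: test the flag at the end of an iteration, when the state is consistent
        if checkpoint_requested:
            # Step 3: save the state to disk and stop gracefully
            with open(state_file, "w") as f:
                f.write(str(counter))
            return counter, True
    return counter, False

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "count.state")
    print(count(path, 10, signal_at=4))  # interrupted: state saved at 4
    print(count(path, 10))               # resumed from 4, runs to completion
```

Compared with saving at each iteration, nothing is written to disk unless a signal has been received.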
![Page 37: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/37.jpg)
3 Use Slurm signaling abilities to manage checkpoint-able software in Slurm scripts on the clusters.
![Page 38: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/38.jpg)
scancel is used to send signals to jobs
![Page 39: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/39.jpg)
Python signal recipe
Example: use scancel --signal USR1 $SLURM_JOB_ID to force state dump for reviewing/debugging
![Page 40: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/40.jpg)
--signal to have Slurm send signals automatically before the end of the allocation
![Page 41: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/41.jpg)
Example: send SIGINT 60 seconds before job is killed (so, here, after 2 minutes)
![Page 42: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/42.jpg)
Set non-zero return code when stopping because of a received signal
Fortran signal recipe
![Page 43: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/43.jpg)
Then you can have your job re-queued automatically
![Page 44: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/44.jpg)
Note the --open-mode=append
![Page 45: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/45.jpg)
Or chain the jobs...
![Page 46: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/46.jpg)
Set a non-zero exit code
C: exit(1)
Fortran: stop 1
Octave: exit( 1 )
R: quit( status=1 )
Python: sys.exit( 1 )
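The exit code is what lets the launcher (the Slurm submission script in the slides) tell an interrupted run from a finished one. The Python sketch below illustrates the idea outside of Slurm: `run_and_decide` is a hypothetical helper, and the two child commands are stand-ins for a checkpointing program that either stops after a signal (exit 1) or completes (exit 0).

```python
import subprocess
import sys

def run_and_decide(command):
    """Run `command`; report whether it asked to be restarted (non-zero exit)."""
    returncode = subprocess.call(command)
    return returncode, returncode != 0

if __name__ == "__main__":
    # Hypothetical stand-ins for a checkpointing program
    interrupted = [sys.executable, "-c", "import sys; sys.exit(1)"]
    finished = [sys.executable, "-c", "import sys; sys.exit(0)"]
    for cmd in (interrupted, finished):
        code, resubmit = run_and_decide(cmd)
        print(code, "resubmit" if resubmit else "done")
```

In a real job, the "resubmit" branch is where the re-queueing or job chaining would take place.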
![Page 47: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/47.jpg)
Using a signal-based watchdog to re-queue the job just before it is killed
![Page 48: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/48.jpg)
4 Use serialization tools and libraries for efficient and persistent data storage on disk
![Page 49: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/49.jpg)
Standard data file formats allow browsing, postprocessing and transmitting intermediate data

| Data size | Storage type |
| --- | --- |
| ~10MB | CSV |
| ~10GB | Zipped CSV or Binary |
| ~100GB | HDF5, sqlite |
| ~10TB | MongoDB, Postgres |
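For the smallest row of the table, a state saved as CSV can be opened with any spreadsheet or plotting tool. The sketch below uses Python's standard `csv` module. Writing to a temporary file and renaming it afterwards is an addition not taken from the slides: it is one common way to avoid leaving a half-written state file if the job is killed during the save. The names `save_rows` and `load_rows` are hypothetical.

```python
import csv
import os
import tempfile

def save_rows(path, header, rows):
    """Write rows as CSV to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
            f.flush()
            os.fsync(f.fileno())
        # The rename is atomic on POSIX when both files are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_rows(path):
    """Read the CSV back; returns (header, rows) with all values as strings."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, [row for row in reader]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.csv")
    save_rows(path, ["i", "x"], [[1, 0.5], [2, 0.25]])
    print(load_rows(path))
```

Values come back as strings, so the restore step has to convert them to the right types.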
![Page 50: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/50.jpg)
5 Parallel programs are better checkpointed after a global synchronization.
![Page 51: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/51.jpg)
Checkpoint here
In the fork-join model, checkpoint after a join and before a fork
Easily ensure state consistency
Allows restarting with a different number of threads
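A small Python sketch of this pattern, assuming a thread pool stands in for the fork-join region: each iteration forks the work, joins, and only then writes the checkpoint. The `relax` function, the JSON state layout and the initial values are hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def relax(value):
    """Hypothetical per-element work for one iteration."""
    return value * 0.5 + 1.0

def run(state_file, iterations, workers):
    """Run fork-join iterations, checkpointing between a join and the next fork."""
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)
    else:
        state = {"iteration": 0, "values": [0.0, 2.0, 4.0, 6.0]}
    while state["iteration"] < iterations:
        with ThreadPoolExecutor(max_workers=workers) as pool:  # fork
            values = list(pool.map(relax, state["values"]))
        # Leaving the `with` block joins all workers: the state is now consistent
        state = {"iteration": state["iteration"] + 1, "values": values}
        with open(state_file, "w") as f:  # checkpoint after the join
            json.dump(state, f)
    return state

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    run(path, 2, workers=2)         # first run stops after 2 iterations
    print(run(path, 5, workers=4))  # restart with a different number of threads
```

Because the saved state contains no per-thread information, the restart is free to use a different `workers` value.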
![Page 52: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/52.jpg)
![Page 53: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/53.jpg)
![Page 54: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/54.jpg)
![Page 55: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/55.jpg)
Part Two: Checkpointing when you do not have the code
![Page 56: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/56.jpg)
6 Use programs and libraries that enable other programs with checkpoint/restart capabilities.
![Page 57: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/57.jpg)
Such a program needs to:
1. Access the process' memory (the c/r program forks itself as the process, or uses a kernel module)
2. Access the processor state at any moment (it uses signals to interrupt the process and provoke storage of the registers on the stack)
3. Track the state changing actions (fork, exec, system, etc.) (wrap standard library functions with LD_PRELOAD'ed custom functions)
4. Inject checkpointing code in the program (LD_PRELOAD a library with signal handlers)
![Page 58: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/58.jpg)
LD_PRELOAD magic
![Page 59: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/59.jpg)
LD_PRELOAD magic
![Page 60: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/60.jpg)
7 BLCR: the Berkeley Lab Checkpoint/Restart for Linux works with a kernel module and a shared library
![Page 61: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/61.jpg)
![Page 62: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/62.jpg)
Advertised Features
● Fully SMP safe
● Rebuilds the virtual address space and restores registers
● Supports the NPTL implementation of POSIX threads (LinuxThreads is no longer supported)
● Restores file descriptors, and state associated with an open file
● Restores signal handlers, signal mask, and pending signals.
● Restores the process ID (PID), thread group ID (TGID), parent process ID (PPID), and process tree to old state.
● Supports save and restore of groups of related processes and the pipes that connect them.
● Should work with nearly any x86 or x86_64 Linux system that uses a 2.6 kernel (see FAQ for most recent info). Verified to work on SuSE Linux 9.x and up; Red Hat 8 and 9; Red Hat Enterprise Linux version 3, 4 and 5; Fedora Core 5 through 10; and many vanilla Linux kernels (from kernel.org) from 2.6.0 on up (and many more).
● Experimental support is present for PPC, PPC64 and ARM architectures. We consider this support experimental mainly because of our limited ability to test it.
● Xen dom0 and domU are both supported with Xen 3.1.2 or newer.
● Tested with the GNU C library (glibc) versions 2.1 through 2.6
![Page 63: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/63.jpg)
Recall the non-checkpointable program
![Page 64: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/64.jpg)
Run with cr_run ; restart with cr_restart
![Page 65: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/65.jpg)
The submission script looks for checkpoint and cr_runs or cr_restarts accordingly
![Page 66: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/66.jpg)
Two jobs are submitted
A checkpoint is created periodically
![Page 67: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/67.jpg)
At restart, note ./count still writes to res1 while the submission script writes to res2
![Page 68: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/68.jpg)
Alternatively, use a signal watchdog
![Page 69: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/69.jpg)
Stick to node
![Page 70: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/70.jpg)
8 DMTCP: Distributed MultiThreaded CheckPointing works with an independent monitoring process and a shared library
![Page 71: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/71.jpg)
![Page 72: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/72.jpg)
Advertised Features
● Distributed Multi-Threaded CheckPointing
● Works with Linux Kernel 2.6.9 and later
● Supports sequential and multi-threaded computations across single/multiple hosts
● Entirely in user space (no kernel modules or root privilege)
● Transparent (no recompiling, no re-linking)
● Written at Northeastern U. and MIT and under active development for 4+ years
● LGPL'd and freely available
● No remote I/O
● Supports threads, mutexes/semaphores, forks, shared memory, exec, and many more

From their FAQ:
“What types of programs can DMTCP checkpoint? It checkpoints most binary programs on most Linux distributions. Some examples on which users have verified that DMTCP works are: Matlab, R, Java, Python, Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs, vi/cscope, Open MPI, MPICH-2, OpenMP, and Cilk. See Supported Applications for further details. Our goal is to support DMTCP for all vanilla programs. If DMTCP does not work correctly on your program, then this is a bug in DMTCP. We would be appreciative if you can then file a bug report with DMTCP.”
![Page 73: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/73.jpg)
Recall the non-checkpointable program
![Page 74: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/74.jpg)
Run with dmtcp_launch (runs monitoring daemon if necessary)
![Page 75: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/75.jpg)
Restart with dmtcp_restart_script.sh
![Page 76: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/76.jpg)
Launch the coordinator and the program with automatic checkpointing every 30 seconds
![Page 77: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/77.jpg)
Launch coordinator and restart program
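The DMTCP steps on these slides can be sketched roughly as follows. This is a minimal sketch, not the exact commands shown on the slides: it assumes DMTCP 2.x is on the PATH, that `./count` is the example program, and that the `--daemon` and `-i` options behave as in the DMTCP documentation for your installed version. The `run` helper and the `DRY_RUN` variable are hypothetical additions so that the commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# Sketch of the DMTCP workflow from the slides. Assumes DMTCP 2.x is
# installed and ./count is the example program; flags may differ by version.

# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# First run: start a coordinator in the background, then launch the program
# with an automatic checkpoint every 30 seconds (-i sets the interval).
first_run() {
    run dmtcp_coordinator --daemon
    run dmtcp_launch -i 30 ./count
}

# Later run: start a coordinator again and resume from the latest checkpoint
# with the restart script that DMTCP writes alongside the checkpoint images.
resume() {
    run dmtcp_coordinator --daemon
    run ./dmtcp_restart_script.sh
}

# Resume if a restart script is present, otherwise start from scratch.
if [ -x ./dmtcp_restart_script.sh ]; then resume; else first_run; fi
```

Run it once as is to see the commands, then with `DRY_RUN=0` to execute them.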
![Page 78: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/78.jpg)
![Page 79: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/79.jpg)
Check whether your scientific software is checkpointable. Many packages are...
![Page 80: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/80.jpg)
![Page 81: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/81.jpg)
![Page 82: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/82.jpg)
![Page 83: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/83.jpg)
![Page 84: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/84.jpg)
Summary, Wrap-up and Conclusions.
damien.francois@uclouvain.be, UCL/CISM - FNRS/CÉCI
November 2013, CISM/CÉCI training session
![Page 85: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/85.jpg)
Never click 'Discard' again...
![Page 86: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/86.jpg)
● Application-based checkpointing
  ● Efficient: save only needed data
  ● Coarse temporal granularity: good for fault tolerance, bad for preemption
  ● Requires effort by programmer
● Library-based (DMTCP)
  ● Portable across platforms
  ● Transparent to application
  ● Can't restore all resources
● Kernel-based checkpointing (BLCR)
  ● Not portable
  ● Transparent to application
  ● Needs root access to install
  ● Can save/restore all resources
![Page 87: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/87.jpg)
● If you're the developer:
  ● Make initializations conditional
  ● Save minimal reconstructable state periodically
  ● Save full workspace upon signal
  ● Checkpoint after a synchronization
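The developer advice above can be illustrated with a small counting job. This is a sketch under assumptions: the `count` and `save_state` functions, the `STATE_FILE` variable and the `count.ckpt` file name are all hypothetical, and the "workspace" is a single integer. The signal handler only raises a flag, and the save happens at the end of an iteration, when the state is consistent. That is the serial analogue of checkpointing after a synchronization; a parallel code would save after a barrier instead.

```shell
#!/bin/bash
# Hypothetical counting job; file name and parameters are illustrative.
STATE_FILE="${STATE_FILE:-count.ckpt}"

save_state() {   # $1 = last number printed
    # Write to a temporary file, then rename, so that an interruption
    # during the write cannot leave a truncated checkpoint behind.
    echo "$1" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
}

count() {        # $1 = last number to print, $2 = save every $2 steps
    local last="$1" every="$2" i=0
    stop_requested=0
    # Save upon signal: the handler only raises a flag; the actual save
    # happens at the end of the current iteration.
    trap 'stop_requested=1' INT TERM USR1

    # Conditional initialization: resume from the checkpoint if present.
    if [ -f "$STATE_FILE" ]; then i="$(cat "$STATE_FILE")"; fi

    while [ "$i" -lt "$last" ]; do
        i=$((i + 1))
        echo "$i"
        # Periodic save of the minimal state (here, a single integer).
        if [ $((i % every)) -eq 0 ]; then save_state "$i"; fi
        # A stop was requested: save and leave at a consistent point.
        if [ "$stop_requested" -eq 1 ]; then save_state "$i"; return 0; fi
    done
    rm -f "$STATE_FILE"   # finished: the next run starts from scratch
}
```

For example, `count 6 2` prints 1 to 6. If it is interrupted after printing 3, the next `count 6 2` prints 4 to 6.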
![Page 88: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/88.jpg)
The submission script(s)
● Either one big script or two small ones
● Checkpoint periodically or upon --signal
● Requeue automatically
● --open-mode=append
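A "one big script" variant of these points might look like the sketch below. It assumes a Slurm cluster (the `--signal`, `--requeue` and `--open-mode` options and `scontrol requeue` are Slurm features) and a program `./count` that checkpoints itself and exits when it receives USR1. The `run_until_signal` helper, the job name and the time limit are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=count
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@60     # send USR1 to this script 60 s before the limit
#SBATCH --requeue              # allow the job to be requeued
#SBATCH --open-mode=append     # keep earlier output when the job restarts

# Hypothetical helper: run a command in the background, forward USR1 to it,
# and return 1 if the run was interrupted by that signal, 0 otherwise.
run_until_signal() {
    interrupted=0
    trap 'interrupted=1; kill -USR1 "$pid" 2>/dev/null' USR1
    "$@" &
    pid=$!
    # wait returns early when a trapped signal arrives, so keep waiting
    # until the program has really exited (checkpoint written).
    while kill -0 "$pid" 2>/dev/null; do wait "$pid"; done
    return "$interrupted"
}

# Only run when executed as a Slurm job; elsewhere this file just defines
# the helper above.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    if run_until_signal ./count; then
        echo "job finished"
    else
        echo "interrupted before the time limit, requeueing"
        scontrol requeue "$SLURM_JOB_ID"
    fi
fi
```

The "two small ones" variant would split this into a first script that starts the program and a second that only restarts it from the checkpoint.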
![Page 89: An introduction to checkpointing - UCLouvain · An introduction to checkpointing for scientific applications damien.francois@uclouvain.be UCL/CISM - FNRS/CÉCI November 2013 CISM/CÉCI](https://reader030.fdocuments.in/reader030/viewer/2022040415/5f2c95054df82267396f6b55/html5/thumbnails/89.jpg)
BLCR, DMTCP, own recipe...