http://ftg.lbl.gov/checkpoint checkpoint@lbl.gov An Overview of Berkeley Lab’s Linux Checkpoint/Restart (BLCR) Paul Hargrove with Jason Duell and Eric Roman January 13th, 2004 (Based on slides by Jason Duell)


An Overview of

Berkeley Lab’s

Linux Checkpoint/Restart

(BLCR)

Paul Hargrove with Jason Duell and Eric Roman

January 13th, 2004

(Based on slides by Jason Duell)


Linux checkpoint/restart

Outline

Project goals

System design

Extension interface

Current status

Future work


Uses of Checkpoint/Restart

Gang scheduling
● No queue drain for maintenance, policy change
● Higher utilization and/or more flexible scheduling

Process migration
● Save job if node failure imminent
● Pack jobs for optimal network performance

Periodic backup
● Not our main focus
● Application can always do more efficiently
● But may be useful for systems with long jobs, fast I/O, and/or high node failure rates


Implementation Strategies

Application-based checkpointing
● Efficient: save only needed data as step completes
● Good for fault tolerance; bad for preemption
● Requires per-application effort by programmer

Library-based checkpointing
● Portable across operating systems
● Transparent to application (but may require relink, etc.)
● Can’t (generally) restore all resources (e.g., process IDs)
● Can’t checkpoint shell scripts

Kernel-based checkpointing
● Not portable, and harder to implement
● Can save/restore all resources


Design Goals

Target: parallel scientific applications
● MPI is a must
● But allow support for other programs/models, too
● Esoteric features (ptrace, Unix domain sockets) have lower implementation priority

Implementation: Linux kernel module
● Lower barrier to adoption than kernel patch
● Allows upgrades, bug fixes, without reboot


Design Goals II

Provide ‘toolkit’ for distributed C/R
● We provide single node checkpoint/restart
● We don’t support distributed operating system features
  • No built-in support for TCP sockets, bproc namespaces, etc.
● We provide hooks to allow parallel runtimes/libraries to implement distributed checkpoint/restart
  • So the MPI library needs to know about checkpointing, but user applications don’t


Extension Interface

Callback functions
● Registered at startup (or as needed)
● Run at checkpoint time, then resume at restart/continue
● Handle parallel coordination and/or unsupported objects

Two types of callbacks
● Signal handler context
  • Runs with same PID (LinuxThreads); no thread-safety needed
  • But callback limited to calling signal-safe functions (small subset of POSIX)
● Separate thread context
  • Can call any function
  • But code needs to be thread-safe, and runs with a separate PID (LinuxThreads)

Critical sections
● Use to protect uncheckpointable sections of code
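The callback and critical-section ideas above can be illustrated with a small in-process model. This is only a sketch under stated assumptions: it is not the BLCR API, and the names `CheckpointCoordinator`, `register_callback`, `critical_section`, and `network_callback` are hypothetical. A callback is modeled as a generator that does its pre-checkpoint work, pauses while the checkpoint is taken, and then resumes with either "continue" or "restart". As a simplification, a checkpoint requested during a critical section is refused here rather than deferred.

```python
import threading


class CheckpointCoordinator:
    """Toy model of checkpoint callbacks and critical sections (not the BLCR API)."""

    def __init__(self):
        self._callbacks = []              # factories that create callback generators
        self._cs_lock = threading.Lock()  # held while code is inside a critical section
        self.checkpoints_taken = 0

    def register_callback(self, make_callback):
        # Register at startup, or later as needed.
        self._callbacks.append(make_callback)

    def critical_section(self):
        # Use as: with coord.critical_section(): ...
        return self._cs_lock

    def checkpoint(self, outcome="continue"):
        # Simplification: refuse (return False) if a critical section is active.
        if not self._cs_lock.acquire(blocking=False):
            return False
        try:
            paused = []
            for make_callback in self._callbacks:
                gen = make_callback()
                next(gen)                  # run the pre-checkpoint half
                paused.append(gen)
            self.checkpoints_taken += 1    # stand-in for saving process state
            for gen in paused:
                try:
                    gen.send(outcome)      # resume: "continue" or "restart"
                except StopIteration:
                    pass
        finally:
            self._cs_lock.release()
        return True


def network_callback(log):
    """Hypothetical callback for a parallel runtime's network state."""
    log.append("quiesce network")
    outcome = yield                        # the checkpoint is taken here
    log.append("reconnect after " + outcome)
```

In this model, a parallel library would register something like `network_callback` once, and wrap code that must not be interrupted in `with coord.critical_section():`.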


Current Status

Support LAM/MPI jobs
● Both TCP and Myrinet supported
● Infrastructure in place for InfiniBand, Quadrics
● Process migration: currently must restart whole job

Simple semantics for open files
● Reopen and seek to original position
● Must be regular files (pipe support coming soon)
● Files must exist in same location on filesystem
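The "reopen and seek" semantics above can be sketched in user space as follows. This is a minimal illustration, assuming the file was opened by path; the helper names `save_open_file` and `restore_open_file` are hypothetical and do not reflect how BLCR implements this inside the kernel module.

```python
import os
import stat


def save_open_file(f):
    """Record what is needed to reopen a regular file later."""
    mode_bits = os.fstat(f.fileno()).st_mode
    if not stat.S_ISREG(mode_bits):
        # Mirrors the restriction above: only regular files are handled.
        raise ValueError("only regular files are supported")
    return {
        "path": os.path.abspath(f.name),
        "mode": f.mode,
        "offset": f.tell(),
    }


def restore_open_file(record):
    """Reopen the file at its original path and seek to the saved offset."""
    mode = record["mode"]
    if "w" in mode:
        # Reopening with "w" would truncate the file, so use read/write instead.
        mode = "r+" + ("b" if "b" in mode else "")
    # Raises FileNotFoundError if the file is no longer at the same location.
    f = open(record["path"], mode)
    f.seek(record["offset"])
    return f
```

Note that nothing here saves the file's contents, so a file that changed between checkpoint and restart is reopened as-is.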

Single- and multi-threaded processes
● Checkpoint of ‘mpirun’ checkpoints whole MPI job
● Will support process groups, sessions in future
● Restore original PID


Current status II

Works with a wide variety of 2.4 kernels
● kernel.org versions 2.4.3 onwards
● RedHat: 7.2 through 9
● SuSE: 7.2 through 9.0
● autoconf feature probing, so support of custom patched kernels likely to be automatic
● We’ll maintain 2.4 support once 2.6 comes out

Supports both new and old pthreads
● I.e., old “LinuxThreads”, plus new 2.6 pthreads (backported to 2.4 by Red Hat)


Future Work

Support for sessions & process groups
● Including pipes, mmaps, etc., shared within group
● Full restoration of parent/child tree, with original PIDs

More semantics for files
● Allow checksum of file, with restart error if it has changed
● Allow saving contents of file (restore either clobbers, or opens anonymously)
● Support files that are not open at checkpoint time, but are specified as being part of the checkpoint
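The proposed checksum option for files could look roughly like the sketch below. It is an illustration only: the choice of SHA-256 and the names `record_file`, `verify_on_restart`, and `FileChangedError` are assumptions, not part of BLCR.

```python
import hashlib


class FileChangedError(RuntimeError):
    """Raised at restart when a file differs from its checkpointed state."""


def file_checksum(path, chunk_size=65536):
    """Hash the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_file(path):
    """At checkpoint time: remember the path and its checksum."""
    return {"path": path, "sha256": file_checksum(path)}


def verify_on_restart(record):
    """At restart time: fail if the file has changed since the checkpoint."""
    if file_checksum(record["path"]) != record["sha256"]:
        raise FileChangedError(record["path"] + " changed since checkpoint")
```

Hashing every file adds I/O at both checkpoint and restart, which is one reason such a check would be optional per file.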

Laundry list of other resources to support
● Page 4 of “Design and Implementation” paper


Future Work II

Integration with parallel job systems
● Funded to work within suite from DOE Scalable Systems Software SciDAC. Work is in progress.
● Possibility of OpenPBS, PBSPro support
● Interested in others (LSF, SGE, SLURM, etc.)

More MPI implementations
● MPICH 2 support anticipated
● Vendor support (Quadrics)?
● LAM/MPI support for partial/live migration


Conclusion

http://ftg.lbl.gov/checkpoint

Papers (available from website):
● “Design and Implementation of BLCR”: high-level system design, including description of user API
● “Requirements for Linux Checkpoint/Restart”: exhaustive list of Unix features we will support (or not).
● “A Survey of Checkpoint/Restart Implementations”: focusing on open source versions that run on Linux
● “The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing”: implementation with LAM/MPI