GGF10 - GridCPR-WG PARIS project-team Activities in Checkpoint Recovery

Berlin, March 11th, 2004


Christine Morin

[email protected]

PARIS INRIA project-team

IRISA – Rennes (France)

http://www.irisa.fr/paris


Cluster Federations

- A particular case of grid
  - Interconnection of several clusters of moderate size
- Homogeneity and heterogeneity
  - More and more homogeneous platforms: PC, Linux
  - Heterogeneous networks (SAN, LAN, WAN)
  - Clusters with different amounts and kinds of resources
- Considered applications
  - Scientific applications (numerical simulation)
    - Sequential and parallel applications based on either the shared-memory or the message-passing communication paradigm
    - Code coupling applications
  - Applications requiring a huge amount of resources (memory, computing power)
- Dynamicity
  - A cluster may join or leave the federation at any time
  - Individual nodes may fail in a cluster

[Figure: cluster federation diagram; network labels: SAN, SAN, LAN, WAN]


Grid-aware OS for Cluster Federations

- A single system image (SSI) OS on each cluster
  - A cluster appears as a single machine which offers a kind of standard interface
  - Mosix, Amoeba, Kerrighed
- A cluster federation is seen as a set of peers
  - Structured peer-to-peer (P2P) network (instead of a hierarchy)
    - Fully decentralized control
    - Native support for dynamicity
    - Designed for scalability
      - Size of the routing tables bounded by log(N)
      - Probabilistic log(N) bounds on the number of routing hops
    - “Standardization” of the APIs (IRIS project)
    - Promising work to take into account the network's topology and security issues (Pastry)
  - Structured P2P systems usually provide distributed hash tables (DHT)
    - Building block for higher level services

[Figure: diagram with labels DSM, DFS, CPU]
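To make the DHT building block more concrete, the following is a minimal Python sketch of consistent hashing, the key-placement idea behind structured P2P systems. It is an illustration only and not code from Kerrighed or Pastry: the class `ToyDHT`, the helper `ring_position` and the 32-bit identifier ring are assumptions made for the example. The sketch models only which cluster is responsible for a key and how that changes when a cluster joins or leaves. It does not model the log(N) routing tables, since every lookup here sees the whole ring.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a name (cluster or key) to a point on a 2**32 identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class ToyDHT:
    """Consistent-hashing ring: a key belongs to its clockwise successor."""

    def __init__(self):
        self._points = []   # sorted ring positions of the live clusters
        self._owner = {}    # ring position -> cluster name

    def join(self, cluster: str) -> None:
        point = ring_position(cluster)
        if point not in self._owner:
            bisect.insort(self._points, point)
            self._owner[point] = cluster

    def leave(self, cluster: str) -> None:
        point = ring_position(cluster)
        if self._owner.get(point) == cluster:
            self._points.remove(point)
            del self._owner[point]

    def lookup(self, key: str) -> str:
        """Return the cluster currently responsible for `key`."""
        if not self._points:
            raise LookupError("no cluster in the federation")
        index = bisect.bisect_left(self._points, ring_position(key))
        # Wrap around to the first cluster when the key is past the last one.
        return self._owner[self._points[index % len(self._points)]]
```

When a cluster leaves, only the keys it was responsible for move to another cluster; all other keys keep their owner, which is what makes this placement scheme suited to a federation whose membership changes over time.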


Current Work on Checkpoint Recovery

- Cluster Federation
  - Execution of multithreaded applications in cluster federations
    - A coherence protocol for cached copies of volatile objects in peer-to-peer systems (multiple failures tolerated)
  - Hierarchical checkpointing protocol for code coupling applications
- Cluster
  - SSI operating system: Kerrighed
    - Full Posix thread interface
    - Global process and memory management
    - Configurable global scheduler
  - High availability
    - Dynamic resource management for tolerating cluster reconfigurations (node addition, eviction or failure)
    - Checkpoint recovery mechanisms


Goals for Checkpoint Recovery in Kerrighed

- Experimental platform for checkpointing strategies for parallel applications
  - Basic mechanisms common to different checkpointing protocols in message-passing (MP) and shared-memory (SM) systems
- Being able to checkpoint any kind of parallel application
  - Transparent checkpointing
- Implementation in a single system of various checkpointing strategies
  - To allow the programmer to choose a suitable strategy for a particular application
  - To be able to compare several strategies with realistic (industrial) applications
  - Avoid code duplication in the system
    - Robustness
    - Fair comparison

- Common framework
  - Checkpoint and rollback servers
  - Checkpoint numbering
  - Dependency management
    - Unified model for message-passing and shared memory models
    - Direct Dependency Vector (DDV) management
  - Message logging
  - Incremental checkpointing
  - Checkpointing in background
- Communication system
  - Atomic multicast
- Stable storage
  - Different implementations
    - Disk
    - Memory
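As an illustration of two items of the common framework, checkpoint numbering and incremental checkpointing, here is a minimal Python sketch. It is not Kerrighed code: the class `IncrementalCheckpointer`, its page-granularity model and the in-memory list standing in for stable storage are all assumptions made for the example. The first checkpoint saves every page, and each later checkpoint saves only the pages written since the previous one.

```python
class IncrementalCheckpointer:
    """Saves only the pages modified since the previous checkpoint."""

    def __init__(self, num_pages: int):
        self.pages = [b""] * num_pages        # live application state
        self._dirty = set(range(num_pages))   # nothing has been saved yet
        self._deltas = []                     # stand-in for stable storage

    def write(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self._dirty.add(page)

    def checkpoint(self) -> int:
        """Store the dirty pages and return the checkpoint number."""
        delta = {page: self.pages[page] for page in self._dirty}
        self._deltas.append(delta)
        self._dirty.clear()
        return len(self._deltas) - 1

    def saved_page_count(self, number: int) -> int:
        """Number of pages actually written for checkpoint `number`."""
        return len(self._deltas[number])

    def rollback(self, number: int) -> None:
        """Rebuild the state of checkpoint `number` by replaying deltas."""
        restored = [b""] * len(self.pages)
        for delta in self._deltas[: number + 1]:
            for page, data in delta.items():
                restored[page] = data
        self.pages = restored
        del self._deltas[number + 1:]         # later checkpoints are undone
        self._dirty.clear()
```

The trade-off this sketch exposes is the usual one: checkpoints become cheaper to take, but a rollback has to combine several saved increments.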


Checkpoint Recovery in Kerrighed: Current Status and Work Directions

- Current Status
  - Linux-based Kerrighed prototype (2.4)
    - Small kernel patch and a set of modules
  - Transparent checkpoint recovery for individual (computing) processes
    - Virtualization of a process in the cluster
    - Unique ghost mechanism for process migration, checkpointing and restoration
    - Easy specialization of the stable storage implementation
      - Ghost can be sent to or retrieved from network, memory or disk
- Work Directions
  - Complete the debugging of coordinated checkpointing (and recovery) for multithreaded and message-passing based applications
    - Checkpointable locks and barriers in a cluster
  - Disk I/O management
  - Posix extension for a proper integration of transparent checkpointing/recovery in the operating system

[Figure: ghost process; labels: Memory, Disk, Network; Duplication, Migration, Checkpoint/restart]
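The idea of a single ghost mechanism shared by migration, checkpointing and restoration can be sketched in Python as one export/import pair that works on any byte stream. This is only an analogy under stated assumptions: the function names, the dictionary used as a toy process state and the JSON encoding are all hypothetical, and the real mechanism operates on kernel process structures. A network target is left out, but a socket file object would fit the same stream interface.

```python
import io
import json
from typing import BinaryIO


def export_process(state: dict, ghost: BinaryIO) -> None:
    """Serialize a (toy) process state into any writable byte stream."""
    ghost.write(json.dumps(state, sort_keys=True).encode("utf-8"))


def import_process(ghost: BinaryIO) -> dict:
    """Rebuild the process state from any readable byte stream."""
    return json.loads(ghost.read().decode("utf-8"))


def checkpoint_to_memory(state: dict) -> io.BytesIO:
    """Memory target: the ghost is an in-memory buffer."""
    ghost = io.BytesIO()
    export_process(state, ghost)
    ghost.seek(0)   # rewind so the buffer can be read back
    return ghost


def checkpoint_to_disk(state: dict, path: str) -> None:
    """Disk target: the ghost is an ordinary file."""
    with open(path, "wb") as ghost:
        export_process(state, ghost)


def restart_from_disk(path: str) -> dict:
    with open(path, "rb") as ghost:
        return import_process(ghost)
```

Because the export and import code never looks at what kind of stream it was given, changing the stable storage implementation only means supplying a different stream.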


Hierarchical Checkpoint Recovery for Cluster Federations

- Relaxed inter-cluster synchronism to reflect the architecture
  - Coordinated checkpointing in a cluster
  - Communication-induced checkpointing between clusters
    - Independent checkpoints in each cluster
    - Forced checkpoints when a communication generates a new dependency
      - Force a checkpoint only if the sender has saved a checkpoint since its last send
- Several cluster checkpoints are kept
  - Management of Direct Dependency Vectors (DDV) to detect dependencies
    - DDV included in inter-cluster messages
    - DDV associated with cluster checkpoints
  - Garbage collection of useless cluster checkpoints

- Evaluation by discrete-event simulation
  - Works well if
    - Few inter-cluster communications
    - Inter-cluster communications are "quasi-unidirectional"

[Figure: code coupling example; labels: Simulation, Processing, Display, Simulation, Simulation]
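The forced-checkpoint rule between clusters can be sketched as follows. This is a simplified reading of the bullets on this slide, not the protocol's actual implementation: the class `Cluster`, the choice to start sequence numbers at 0, and the decision to record the new dependency in the DDV stored with the forced checkpoint are assumptions. Each cluster is treated as a single unit, and its coordinated intra-cluster checkpoint is reduced to incrementing a counter.

```python
class Cluster:
    """One cluster of the federation, seen as a single checkpointing unit."""

    def __init__(self, index: int, federation_size: int):
        self.index = index
        # ddv[index] is this cluster's own checkpoint sequence number;
        # ddv[j] is the highest sequence number of cluster j it depends on.
        self.ddv = [0] * federation_size
        self.checkpoints = [list(self.ddv)]   # a DDV is kept per checkpoint
        self.forced = 0

    def take_checkpoint(self) -> None:
        """Cluster-wide coordinated checkpoint, reduced here to a counter."""
        self.ddv[self.index] += 1
        self.checkpoints.append(list(self.ddv))

    def send(self) -> tuple:
        """Inter-cluster message header: sender index and a copy of its DDV."""
        return self.index, list(self.ddv)

    def receive(self, header: tuple) -> bool:
        """Handle a message header; return True if it forced a checkpoint."""
        sender, sender_ddv = header
        sender_sn = sender_ddv[sender]
        if sender_sn <= self.ddv[sender]:
            return False                  # dependency already recorded
        # The sender has checkpointed since its last message to us:
        # record the new dependency and checkpoint before delivery.
        self.ddv[sender] = sender_sn
        self.take_checkpoint()
        self.forced += 1
        return True
```

With this rule, a stream of messages sent between two of the sender's checkpoints forces at most one checkpoint on the receiver, which is consistent with the observation that the approach works well when inter-cluster communications are few.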


Future Work

- Checkpoint recovery in the large (we plan to hire a PhD student)
  - Dealing with applications with huge data sets executed in cluster federations
  - Follow-up of our preliminary work on a hierarchical checkpointing protocol for code coupling applications in cluster federations
- Based on Kerrighed experimental platform
  - Not only basic coordinated checkpointing but also variants of independent and communication-induced strategies
  - Standard interface and basic building blocks
- Implementation in Kerrighed of ideas studied in previous projects
  - ICARE fault tolerant software DSM
    - Combining replication inherent to the DSM with the replication needed for ensuring recovery data stability
    - Extension of the coherence protocol to manage recovery data in memory
  - HA-PSLS
    - Integration of a DSM and a parallel file system
    - Upgrading ICARE
      - Coexistence of persistent and memory checkpoints
      - Swap management (to avoid memory size limitation and to evict recovery data from memory)
      - Mapped file management (in-place checkpoints)


http://www.kerrighed.org

Kerrighed is registered as a community trademark.

[email protected]


Software Distribution

- Kerrighed web site
  - http://www.kerrighed.org (open since mid-November 2002)
  - Open source under GPL licence
  - Current version: Kerrighed V0.81 based on Linux 2.4.24
- Kerrighed users mailing-list
  - [email protected] (created in April 2003)
- Kerrighed forum (created February 2004)
- Notes
  - Kerrighed is a registered trademark
  - Kerrighed deposit at APP for each public release
- Kerrighed tutorial (in conjunction with ICS’04, Saint-Malo (France), June 27th, 2004)


RoadMap for Kerrighed Prototype

- March 2004
  - MPI (with migration)
- April 2004: Kerrighed V1.00 (SSI-OSCAR)
  - SGFD
- January 2005: Kerrighed V1.10
  - 64 bits (Opteron)
  - Checkpointing for parallel applications
- July 2005: Kerrighed V2.0
  - High availability


Current Support: EDF

- Kerrighed research prototype (2000-2003)
  - CRECO EDF/INRIA
    - CIFRE Ph.D. grant (Geoffroy Vallée)
    - Industrial Post-Doc (Renaud Lottiaux)
  - Experiments with first industrial applications provided by EDF
    - HRM1D, CATHARE, Cyrano 3, Aster
- Kerrighed integration in OSCAR (2004-2005)
  - INRIA Industrial Post-Doc (G. Vallée) with EDF & ORNL
  - SSI-OSCAR


Current Support: DGA

- Kerrighed robustness and full set of functionalities (2003-2005)
  - COCA PEA funded by DGA
    - Partnership with CGEY and ONERA-CERT
    - 2 full time engineers (Renaud Lottiaux, David Margery)
  - Experiments with industrial applications
    - Ligase, Gorf3D, Mixsar, RTI HLA


Current Kerrighed Team (being part of the PARIS project-team)

- Faculty
  - Christine Morin (DR, INRIA)
- PhD students
  - Pascal Gallard (INRIA)
  - Gaël Utard (INRIA)
  - Louis Rilling (ENS-Cachan)
- Post-doc
  - Geoffroy Vallée (PDI-EDF)
- Engineers
  - Renaud Lottiaux (INRIA)
  - David Margery (INRIA)
- Invited researcher
  - Isaac Scherson (UCI)
- Master students
  - Jamal Ghaffour
  - Etienne Rivière
- Former members
  - Ramamurthy Badrinath (assistant professor, IIT Kharagpur, India), May 2002 – April 2003
  - Viet Hoa Dinh (engineer), September 2001-September 2002
  - Jean-Yves Burlett (Master student, univ. Rennes 1), February-June 2001
  - Sébastien Monnet (Master student, univ. Rennes 1), February-June 2003
  - H. Maka (Bachelor student, IIT Kharagpur), May-July 2003


Academic Collaborations

- University of Ulm, Germany
  - Checkpointing for shared memory parallel applications
- Rutgers University, USA
  - Myrinet, Infiniband
  - Self healing clusters
- ORNL
  - SSI-OSCAR
- University of California, Irvine, USA
  - Global scheduling
- Deakin University, Australia
  - SSI (informal contacts)
