Checkpoint Based Recovery from Power Failures Christopher Sutardja Emil Stefanov.


Goals

• Consistent checkpoint
  – A consistent snapshot of memory for a specific time in the past.
• Safe even under power failure
  – The checkpoint is never “in transition”.
• Small storage overhead
  – Not much more than double the memory.
• Low performance overhead
  – Should not stall the processor for too long.
• Scalable
  – Scales well in large core networks such as meshes.

Related Work

• On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
  – Speculates about the future role of checkpointing in parallel machines.
  – As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.
  – Error correction codes and other redundancies would introduce too much overhead when used alone.
  – As a result, research on checkpoint recovery is growing in importance.

Related Work

• Modular Checkpointing for Atomicity by L. Ziarek et al.
  – Introduces an abstraction called stabilizers to make checkpointing easier.
  – Targets message-passing machines.
    • Makes consistent checkpointing more challenging.

Emil

How do stabilizers work? What exactly are they?

Related Work

• SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
  – Explores the concept of checkpointing in logical time.
  – Multiple checkpoints.
  – Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
  – Low execution overhead.
  – Not safe from power failures.

Related Work

• ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
  – Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:
    • the scope of the checkpoint
    • memory
    • checkpointing mechanism.
  – Achieves about 6% checkpointing overhead.
  – Not safe from power failures.
  – Not geared towards non-volatile memory: requires fast writes.

Related Work

• Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
  – As flash memory becomes cheaper and denser, the uses for flash increase.
  – Uses flash for recovering file systems.
  – Yet another use of flash for recovery.
  – Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.

[Figure: system architecture. Labeled components: Core, L1 and L2 caches, DRAM banks, Cache Checkpoint Controllers with Checkpoint A and Checkpoint B storage, DRAM Checkpointers (each with a Buffer and a Log), Address Decoder, Checkpoint Coordinator.]

Checkpointing Techniques

• For Caches and Cores:
  – Each cache/core has two flash storages adjacent to it.
    • One for the previous checkpoint.
    • One for the current checkpoint.
  – During a checkpoint, the cache/core internal state is copied to flash storage.
• For DRAM:
  – The checkpointing system snoops on DRAM.
  – DRAM changes are continuously logged to flash memory.
  – A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
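The DRAM logging path above can be sketched in a few lines. This is a minimal software model, not the hardware design: the class name `DramCheckpointer`, the buffer capacity of 2, and the address-modulo routing in `route` are all assumptions made for illustration. It shows snooped writes landing in a small buffer that drains into an append-only flash log, with an address decoder spreading writes across several checkpointers.

```python
from dataclasses import dataclass, field


@dataclass
class DramCheckpointer:
    """Hypothetical model of one DRAM checkpointer (buffer plus flash log)."""
    buffer_capacity: int = 2                        # assumed size, for illustration
    buffer: list = field(default_factory=list)      # pending snooped writes
    flash_log: list = field(default_factory=list)   # append-only log in flash

    def snoop_write(self, address, value):
        # Record the write seen on the DRAM bus; drain once the buffer is full.
        self.buffer.append(("write", address, value))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        # Move every buffered change into the flash log.
        self.flash_log.extend(self.buffer)
        self.buffer.clear()


def route(checkpointers, address):
    # Assumed address decoder: pick a checkpointer by address modulo count.
    return checkpointers[address % len(checkpointers)]


checkpointers = [DramCheckpointer() for _ in range(4)]
for address in range(10):
    route(checkpointers, address).snoop_write(address, address * 2)
```

Because each checkpointer has its own buffer and log, a burst of writes is spread over several flash devices instead of queuing behind one.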

Responsibilities of the Main Components

• Checkpoint Coordinator
  – Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
• DRAM Checkpointer
  – Continuously logs DRAM changes.
  – Checkpoints when instructed by the coordinator.
• Cache Checkpoint Controller
  – Checkpoints the adjacent cache when instructed by the coordinator.

Steps for Checkpointing (1 of 2)

1. The coordinator sets the checkpoint signal to 1.
2. In parallel, each:
   a. Core:
      i. Pauses processing instructions.
      ii. Copies internal state to flash memory.
   b. Cache Checkpoint Controller:
      i. Copies cache internal state to flash memory (data is copied one line at a time).
   c. DRAM Checkpointer:
      i. Flushes buffer to flash log.
      ii. Notifies checkpoint coordinator that the buffer has been flushed.

Steps for Checkpointing (2 of 2)

3. The coordinator sets the checkpoint signal to 0.
4. In parallel, each:
   a. Core:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   b. Cache Checkpoint Controller:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   c. DRAM Checkpointer:
      i. Marks checkpoint boundary in flash log.
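The four steps can be sketched as a two-phase protocol. The model below is an illustration under assumptions: `CheckpointStore`, `DramLog`, `take_checkpoint`, and the `"CHECKPOINT"` marker are hypothetical names, and the phases run sequentially here although the slides describe the components working in parallel. The point it shows is that new state is only ever written into the slot that is not marked valid, so a power failure before the bit flip leaves the previous checkpoint untouched.

```python
class CheckpointStore:
    """Hypothetical model of the two flash storages next to one core or cache."""

    def __init__(self, initial_state):
        # slots[0] plays the role of Checkpoint A, slots[1] of Checkpoint B.
        self.slots = [dict(initial_state), {}]
        # Flash bit naming the slot that holds the last complete checkpoint.
        self.valid_bit = 0

    def copy_state(self, state):
        # Step 2: write only into the slot that is NOT marked valid.
        self.slots[1 - self.valid_bit] = dict(state)

    def flip_bit(self):
        # Step 4: one bit flip makes the new copy the current checkpoint.
        self.valid_bit = 1 - self.valid_bit

    def last_complete(self):
        return self.slots[self.valid_bit]


class DramLog:
    """Hypothetical model of one DRAM checkpointer's buffer and flash log."""

    BOUNDARY = "CHECKPOINT"  # assumed marker value

    def __init__(self):
        self.buffer = []
        self.flash_log = []

    def flush(self):
        # Step 2c: move buffered changes into the flash log.
        self.flash_log.extend(self.buffer)
        self.buffer.clear()

    def mark_boundary(self):
        # Step 4c: record where this checkpoint ends in the log.
        self.flash_log.append(self.BOUNDARY)


def take_checkpoint(core_state, cache_state, core_store, cache_store, dram_log):
    # Steps 1-2 (checkpoint signal = 1): copy state and flush buffers.
    core_store.copy_state(core_state)
    cache_store.copy_state(cache_state)
    dram_log.flush()
    # Steps 3-4 (checkpoint signal = 0): commit everywhere.
    core_store.flip_bit()
    cache_store.flip_bit()
    dram_log.mark_boundary()


core_store = CheckpointStore({"pc": 0})
core_store.copy_state({"pc": 100})
interrupted = core_store.last_complete()   # a power failure here still sees pc = 0
core_store.flip_bit()
committed = core_store.last_complete()     # after the flip, pc = 100
```

Keeping the commit down to a single bit flip is what lets the checkpoint avoid ever being "in transition": at any instant exactly one slot is named as valid.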





[Figure: checkpoint storage and flash log layout. Labeled components: Core, L1, L2, Checkpoint A, Checkpoint B, a row of cells marked F, Buffers and Logs behind an Address Decoder. Log region labels: Previous Checkpoint (random access), Previous Checkpoint Changes, Next Checkpoint Changes, Buffered Changes, start, end.]

Recovering

1. Determining which checkpoint to use
   a. The system checks which checkpoint is the most recent.
   b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring Previous State
   a. Each architectural register is rewritten.
   b. Each cache is written to by its adjacent flash buffer (one cache line at a time).
   c. Main memory is recovered.
   d. Take advantage of pipelined writes if available.
3. Resume Execution
   a. Resume program counter.
   b. Notify the CPUs that the system is restoring from a checkpoint (single bit).
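The recovery steps above can be sketched as follows. The slides do not spell out how main memory is rebuilt, so this model assumes one plausible reading: logged changes are replayed onto the previous checkpoint image up to the last boundary marker, and anything logged after that marker belongs to an unfinished checkpoint and is ignored. The names `choose_checkpoint`, `recover_memory`, and `BOUNDARY` are hypothetical.

```python
BOUNDARY = "CHECKPOINT"  # assumed marker written at each checkpoint boundary


def choose_checkpoint(slots, valid_bit):
    # Step 1: the flash bit names the last complete checkpoint. A checkpoint
    # that was in progress never flipped the bit, so the older one is used.
    return slots[valid_bit]


def recover_memory(base_image, flash_log):
    """Step 2c: replay logged writes up to the last checkpoint boundary."""
    last_boundary = -1
    for index, entry in enumerate(flash_log):
        if entry == BOUNDARY:
            last_boundary = index

    memory = dict(base_image)
    # With no boundary in the log the slice is empty and nothing is replayed.
    for entry in flash_log[:last_boundary + 1]:
        if entry != BOUNDARY:
            address, value = entry
            memory[address] = value
    return memory


# Core state: slot 1 was being written when power failed, bit still names slot 0.
registers = choose_checkpoint([{"pc": 40}, {"pc": 77}], valid_bit=0)

# Memory: the write to address 1 came after the last boundary and is dropped.
memory = recover_memory({0: 1, 1: 1}, [(0, 5), BOUNDARY, (1, 9)])
```

After both calls, execution would resume from `registers["pc"]` with `memory` as the restored DRAM contents.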
