Checkpoint Based Recovery from Power Failures Christopher Sutardja Emil Stefanov.
Date posted: 22-Dec-2015
Checkpoint Based Recovery from Power Failures
Christopher Sutardja, Emil Stefanov
Goals
• Consistent checkpoint
  – A consistent snapshot of memory for a specific time in the past.
• Safe even under power failure
  – The checkpoint is never "in transition".
• Small storage overhead
  – Not much more than double the memory.
• Low performance overhead
  – Should not stall the processor for too long.
• Scalable
  – Scales well in large core networks such as meshes.
Related Work
• On the feasibility of incremental checkpointing for scientific computing by J. Sancho et al.
  – Speculates about the future role of checkpointing in parallel machines.
  – As the number of processing nodes grows exponentially, failure of any one node becomes much more likely.
  – Error correction codes and other redundancies would introduce too much overhead when used alone.
  – As a result, research on checkpoint recovery is growing in importance.
Related Work
• Modular Checkpointing for Atomicity by L. Ziarek et al.
  – Introduces an abstraction called stabilizers to make checkpointing easier.
  – Targets message-passing machines.
    • Makes consistent checkpointing more challenging.
Related Work
• SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery by D. Sorin et al.
  – Explores the concept of checkpointing in logical time.
  – Multiple checkpoints.
  – Each dirty cache line has a tag indicating when it was modified relative to a checkpoint.
  – Low execution overhead.
  – Not safe from power failures.
Related Work
• ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors by M. Prvulovic et al.
  – Explores different ways of rollback recovery in shared-memory multiprocessor systems. Considers:
    • the scope of the checkpoint
    • memory
    • checkpointing mechanism.
  – Achieves about 6% checkpointing overhead.
  – Not safe from power failures.
  – Not geared towards non-volatile memory: requires fast writes.
Related Work
• Efficient Initialization and Crash Recovery for Log-based File Systems over Flash Memory by Chin Wu et al.
  – As flash memory becomes cheaper and denser, the uses for flash increase.
  – Uses flash for recovering file systems.
  – Yet another use of flash for recovery.
  – Uses a log-based method to accelerate remounting after a system crash by minimizing the amount of information that has to be changed upon reboot.
[Figure: system architecture diagram. Labels: Core, L1, L2, Cache Checkpoint Controller, Checkpoint A, Checkpoint B, Address Decoder, DRAM, DRAM Checkpointer, Checkpoint, Buffer, Log, Checkpoint Coordinator.]
Checkpointing Techniques
• For Caches and Cores:
  – Each cache/core has two flash storages adjacent to it.
    • One is for the previous checkpoint.
    • One is for the current checkpoint.
  – During a checkpoint, the cache/core internal state is copied to flash storage.
• For DRAM:
  – The checkpointing system snoops on DRAM.
  – DRAM changes are continuously logged to flash memory.
  – A chain of parallel buffers ensures that DRAM checkpointing almost never causes a stall.
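The DRAM logging path described above can be sketched as a small software model. This is only an illustration under stated assumptions: the class name `DramCheckpointer`, the `buffer_capacity` parameter, and the choice to log the new value of each write are hypothetical, and a single staging buffer stands in for the chain of parallel buffers mentioned on the slide.

```python
class DramCheckpointer:
    """Toy model of one DRAM checkpointer: snoop writes, stage them, append to a flash log."""

    def __init__(self, buffer_capacity=4):
        self.buffer_capacity = buffer_capacity  # hypothetical size, not from the slides
        self.buffer = []       # volatile staging buffer
        self.flash_log = []    # models the non-volatile, append-only flash log

    def snoop_write(self, address, value):
        # Called for every DRAM write observed by the checkpointer.
        self.buffer.append((address, value))
        if len(self.buffer) >= self.buffer_capacity:
            self.flush()

    def flush(self):
        # Drain the staging buffer into the flash log in arrival order.
        self.flash_log.extend(self.buffer)
        self.buffer.clear()


ckpt = DramCheckpointer(buffer_capacity=2)
ckpt.snoop_write(0x10, 7)
ckpt.snoop_write(0x14, 9)   # buffer is full here, so it drains to the flash log
ckpt.snoop_write(0x10, 8)   # this write is still only in the staging buffer
```

In this model the processor never waits on flash: writes land in the staging buffer, and only the drain touches the log, which is the property the parallel buffers are meant to provide in hardware.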
Responsibilities of the Main Components
• Checkpoint Coordinator
  – Notifies the nodes and DRAM checkpointers that a checkpoint is beginning.
• DRAM Checkpointer
  – Continuously logs DRAM changes.
  – Checkpoints when instructed by the coordinator.
• Cache Checkpoint Controller
  – Checkpoints the adjacent cache when instructed by the coordinator.
Steps for Checkpointing (1 of 2)
1. The coordinator sets the checkpoint signal to 1.
2. In parallel, each:
   a. Core:
      i. Pauses processing instructions.
      ii. Copies internal state to flash memory.
   b. Cache Checkpoint Controller:
      i. Copies cache internal state to flash memory (data is copied one line at a time).
   c. DRAM Checkpointer:
      i. Flushes buffer to flash log.
      ii. Notifies checkpoint coordinator that the buffer has been flushed.
Steps for Checkpointing (2 of 2)
3. The coordinator sets the checkpoint signal to 0.
4. In parallel, each:
   a. Core:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   b. Cache Checkpoint Controller:
      i. Flips flash memory bit to indicate the new checkpoint buffer.
   c. DRAM Checkpointer:
      i. Marks checkpoint boundary in flash log.
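The four steps can be modeled as a two-phase protocol: a save phase while the checkpoint signal is 1, followed by a commit phase when it drops to 0. The sketch below is a sequential simulation, not the hardware design; `Node`, `DramLog`, `run_checkpoint`, the `"BOUNDARY"` marker, and the `crash_before_commit` flag are hypothetical names introduced for illustration.

```python
class Node:
    """A core or cache checkpoint controller with two flash slots and a selector bit."""

    def __init__(self, state):
        self.state = dict(state)                  # live, volatile state
        self.slots = [dict(state), dict(state)]   # flash: Checkpoint A and Checkpoint B
        self.valid_bit = 0                        # slot holding the last complete checkpoint

    def copy_to_flash(self):
        # Step 2: write into the slot that is NOT currently marked valid.
        self.slots[1 - self.valid_bit] = dict(self.state)

    def flip_bit(self):
        # Step 4: a one-bit update switches to the new checkpoint buffer.
        self.valid_bit = 1 - self.valid_bit


class DramLog:
    def __init__(self):
        self.buffer, self.log = [], []

    def write(self, address, value):
        self.buffer.append((address, value))

    def flush(self):
        # Step 2c: drain the buffer, then report completion to the coordinator.
        self.log.extend(self.buffer)
        self.buffer.clear()
        return True

    def mark_boundary(self):
        # Step 4c: record where this checkpoint ends in the log.
        self.log.append("BOUNDARY")


def run_checkpoint(nodes, dram_logs, crash_before_commit=False):
    # Steps 1-2 (signal = 1): every component saves its state.
    for node in nodes:
        node.copy_to_flash()
    flushed = all([log.flush() for log in dram_logs])
    if crash_before_commit or not flushed:
        return False  # power lost: no bit was flipped, old checkpoint stays valid
    # Steps 3-4 (signal = 0): every component commits.
    for node in nodes:
        node.flip_bit()
    for log in dram_logs:
        log.mark_boundary()
    return True


core, dram = Node({"pc": 100}), DramLog()
core.state["pc"] = 200
dram.write(0x20, 5)
run_checkpoint([core], [dram], crash_before_commit=True)
pc_after_crash = core.slots[core.valid_bit]["pc"]   # still the old checkpoint
run_checkpoint([core], [dram])
pc_after_commit = core.slots[core.valid_bit]["pc"]  # now the new checkpoint
```

The design point the model tries to capture is that the bulk copy always targets the inactive slot, so an interruption during step 2 cannot damage the checkpoint that recovery would select.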
[Figure: checkpoint storage and DRAM log layout. Labels: Core, L1, L2, Cache Checkpoint Controller, Checkpoint A, Checkpoint B, F, Checkpoint, Buffer, Log, Address Decoder. Log labels: start, end, Previous Checkpoint (random access), Previous Checkpoint Changes, Next Checkpoint Changes, Buffered Changes.]
Recovering
1. Determining which checkpoint to use
   a. The system checks which checkpoint is the most recent.
   b. If the most recent checkpoint was in progress during the crash, the older checkpoint is used.
2. Restoring previous state
   a. Each architectural register is rewritten.
   b. Each cache is written from its adjacent flash buffer (one cache line at a time).
   c. Main memory is recovered.
   d. Take advantage of pipelined writes if available.
3. Resume execution
   a. Resume from the restored program counter.
   b. Notify the CPUs that the system is restoring from a checkpoint (single bit).
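A rough sketch of recovery steps 1 and 2c for main memory follows. It assumes, beyond what the slides state, that the flash log stores the new value of each write and that a `BOUNDARY` marker separates checkpoints; `recover_dram` and `select_slot` are hypothetical helper names.

```python
BOUNDARY = "BOUNDARY"  # hypothetical marker written when a checkpoint completes


def recover_dram(base_image, flash_log):
    """Rebuild main memory as of the most recent complete checkpoint."""
    # Entries after the last boundary belong to a checkpoint that never finished.
    last = max((i for i, e in enumerate(flash_log) if e == BOUNDARY), default=-1)
    memory = dict(base_image)
    for entry in flash_log[:last + 1]:
        if entry != BOUNDARY:
            address, value = entry
            memory[address] = value   # replay logged writes in order
    return memory


def select_slot(slots, valid_bit):
    """Pick the core or cache checkpoint that the selector bit marks as complete."""
    return dict(slots[valid_bit])


log = [(0x10, 1), (0x14, 2), BOUNDARY, (0x10, 99)]
restored = recover_dram({0x10: 0, 0x18: 3}, log)
registers = select_slot([{"pc": 100}, {"pc": 200}], valid_bit=1)
```

The write of 99 to address 0x10 is discarded because no boundary follows it, which mirrors step 1b: an in-progress checkpoint is ignored and the older one is used.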