Checkpointing and Recovery
Purpose
• Consider a long-running application
  – Regularly checkpoint the application
    • Expensive task
  – In case of failure, restore to the previous checkpoint
• What happens in case of a distributed application?
  – One (or more) processes fail
  – Restoration to previous checkpoint should be done consistently
Examples
What to Save?
• Depends on application
  – Could be as simple as just program counter information
  – Could be the state of the entire process, including messages received, etc.
Stable Storage
• Checkpoints must survive failure of processes (including failure during a disk write)
  – A simple approach for stable storage
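The slide does not say which approach it has in mind. One common technique, shown here only as a sketch, is to write the new checkpoint to a temporary file, force it to disk, and then atomically rename it over the old one, so a crash during the write leaves the previous checkpoint intact. The function names, the JSON encoding, and the file layout are assumptions made for illustration; the sketch does not protect against failure of the disk itself.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` so that a crash mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before renaming
        # Atomic replacement: readers see either the old or the new checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recent complete checkpoint."""
    with open(path) as f:
        return json.load(f)
```

A process would call `save_checkpoint` periodically and `load_checkpoint` once on restart.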
Approaches
• Asynchronous
  – The local checkpoints at different processes are taken independently
• Synchronous
  – The local checkpoints at different processes are coordinated
  – They may not be at the same time
Asynchronous Checkpointing
• Problem
  – Domino effect
[Figure: domino effect; label: Failed process]
Other Issues with Asynchronous Checkpointing
• Useless checkpoints
• Need for garbage collection
• Recovery requires significant coordination
Asynchronous Checkpointing (Continued)
• Identify dependency between different checkpoint intervals
• This information is stored along with checkpoints in a stable storage
• When a process repairs, it requests this information from others to determine the need for rollback
Two Examples of Asynchronous Checkpointing
• Bhargava and Lian
• Wang et al
Algorithm by Bhargava et al
• Draw an edge from $c_{i,x}$ to $c_{j,y}$ if either
  – $i = j$ and $y = x+1$
  – $i \neq j$ and a message m is sent from $I_{i,x}$ and received in $I_{j,y}$
• Where $I_{i,x}$ is the interval between $c_{i,x-1}$ and $c_{i,x}$
• Rollback recovery line used for recovery as well as garbage collection
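The graph construction above can be sketched in a few lines. This is an illustration rather than the authors' code: it assumes each process's volatile state at the time of failure is treated as its last checkpoint, that everything reachable from a failed process's volatile state must be undone, and that the recovery line is the last unmarked checkpoint of each process. The function name and the message encoding `(i, x, j, y)` are hypothetical.

```python
from collections import defaultdict, deque


def recovery_line(num_checkpoints, messages, failed):
    """Return, per process, the index of the checkpoint to roll back to.

    num_checkpoints[i]: number of checkpoint indices of process i, counting
        the initial checkpoint (index 0) and the volatile state (last index).
        Assumed to be at least 2 for every process.
    messages: tuples (i, x, j, y) meaning a message was sent by process i in
        interval I(i,x) and received by process j in interval I(j,y).
    failed: set of indices of failed processes.
    """
    edges = defaultdict(list)
    # Rule 1: edge from c(i,x) to c(i,x+1) within each process.
    for i, n in enumerate(num_checkpoints):
        for x in range(n - 1):
            edges[(i, x)].append((i, x + 1))
    # Rule 2: edge from c(i,x) to c(j,y) for each message between intervals.
    for i, x, j, y in messages:
        edges[(i, x)].append((j, y))

    # Mark everything reachable from the volatile state of a failed process.
    marked = set()
    queue = deque((i, num_checkpoints[i] - 1) for i in failed)
    while queue:
        node = queue.popleft()
        if node in marked:
            continue
        marked.add(node)
        queue.extend(edges[node])

    # The recovery line is the last unmarked checkpoint of every process.
    return [
        max(x for x in range(n) if (i, x) not in marked)
        for i, n in enumerate(num_checkpoints)
    ]
```

With two processes that each have an initial checkpoint, one later checkpoint, and a volatile state, `recovery_line([3, 3], [(0, 2, 1, 1)], {0})` rolls process 0 back to checkpoint 1 and process 1 back to checkpoint 0. Adding a second message `(1, 1, 0, 1)` sends both processes back to their initial checkpoints, which is the domino effect.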
Algorithm by Wang et al
• Difference
  – If a message sent from $I_{i,x}$ is received in $I_{j,y}$, then draw an edge from $c_{i,x-1}$ to $c_{j,y}$
• Recovery line obtained is similar to that by Bhargava and Lian
• Advantage
  – Number of useful checkpoints is at most N(N+1)/2
    • This can be shown by considering the number of checkpoints that are ahead of the recovery line
Coordinated Checkpointing
• Using diffusing computation
  – How can we use diffusing computation to obtain a consistent snapshot?
Algorithm by Tamir and Sequin
• Blocking checkpoint
  – A coordinator decides when a checkpoint is taken
  – Coordinator sends a request message to all
  – Each process
    • Stops executing
    • Flushes the channels
    • Takes a tentative checkpoint
    • Replies to coordinator
  – When all processes send replies, the coordinator asks them to change it to a permanent checkpoint
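The steps above can be sketched as a single-threaded simulation. It is a simplified illustration under several assumptions: the `Process` class, the integer state, and the `inbox` list standing in for the incoming channels are all hypothetical, and a real system would exchange the request, reply, and commit as network messages.

```python
class Process:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = state      # application state (an integer in this sketch)
        self.inbox = []         # messages still in the incoming channels
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_request(self):
        """Handle the coordinator's checkpoint request."""
        self.running = False            # stop executing
        for msg in self.inbox:          # flush channels: deliver pending messages
            self.state += msg
        self.inbox.clear()
        self.tentative = self.state     # take a tentative checkpoint
        return True                     # reply to the coordinator

    def on_commit(self):
        """Turn the tentative checkpoint into a permanent one and resume."""
        self.permanent = self.tentative
        self.tentative = None
        self.running = True


def coordinated_checkpoint(processes):
    """Coordinator: request, collect replies, then commit."""
    replies = [p.on_request() for p in processes]
    if all(replies):
        for p in processes:
            p.on_commit()
    return all(replies)
```

Because every process is stopped and its channels are empty when the tentative checkpoints are taken, no checkpoint records a message whose send is missing from another checkpoint.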
Algorithm by Tamir and Sequin
• How many checkpoints need to be stored per process?
Checkpointing in Timed Systems
• What if clocks are perfectly synchronized?
Checkpointing in Timed Systems
• What if clocks are loosely synchronized?
  – Max clock drift is known?
• All processes take a checkpoint at a fixed (local) time
  – After the checkpoint, a process does not send any messages for twice the maximum clock drift
  – The set of local checkpoints is guaranteed to be consistent
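A small randomized check can make the silence period plausible. It assumes, as an interpretation not stated on the slide, that "maximum clock drift" means every local clock stays within `eps` of real time, and that message delay is non-negative (zero delay is the worst case). The function name and parameters are hypothetical.

```python
import random


def orphan_possible(eps, silence, checkpoint_time=100.0, trials=10_000, seed=0):
    """Search for a message sent after the sender's checkpoint that could
    arrive before the receiver's checkpoint (which would make it an orphan)."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Real times at which the two processes take the checkpoint that both
        # schedule for the same local time.
        sender_cp = checkpoint_time + rng.uniform(-eps, eps)
        receiver_cp = checkpoint_time + rng.uniform(-eps, eps)
        # Earliest possible arrival: sent right after the silence period,
        # delivered with zero delay.
        earliest_receive = sender_cp + silence
        if earliest_receive < receiver_cp:
            return True
    return False
```

Under these assumptions the sender checkpoints no earlier than `checkpoint_time - eps` and the receiver no later than `checkpoint_time + eps`, so a silence of `2 * eps` guarantees that any later message arrives after the receiver's checkpoint, while a shorter silence does not.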
Minimal Checkpoint Coordination
• Approach by Koo and Toueg
  – Require processes to take a checkpoint only if they have to
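One way to read "only if they have to" is that a process must checkpoint when the initiator's new checkpoint would otherwise record a message whose send is not recorded anywhere. The sketch below computes that set as a transitive closure. It is a simplified illustration, not the full protocol: the function name and the `received_from` map (for each process, the processes it has received messages from since its last checkpoint) are assumptions, and the tentative/permanent phases are omitted.

```python
def must_checkpoint(initiator, received_from):
    """Return the set of processes that need to take a checkpoint.

    received_from[p]: processes from which p has received a message since
        p's last checkpoint (hypothetical bookkeeping kept by each process).
    """
    needed = {initiator}
    stack = [initiator]
    while stack:
        p = stack.pop()
        # If p checkpoints, every sender of a message p has received must
        # also checkpoint, so that the send is recorded too.
        for q in received_from.get(p, ()):
            if q not in needed:
                needed.add(q)
                stack.append(q)
    return needed
```

For example, with `received_from = {"P1": {"P2"}, "P2": {"P3"}, "P4": {"P1"}}`, a checkpoint initiated by `P1` involves `P1`, `P2`, and `P3`, while `P4` is left alone.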
Logging Protocols
• Pessimistic
• Optimistic
• Causal