Scalable Algorithms for Global Snapshots in Distributed Systems
Rahul Garg IBM India Research Lab
Vijay K. Garg Univ. of Texas at Austin
Yogish Sabharwal IBM India Research Lab
Motivation for Global Snapshot
Checkpoint to tolerate faults
  Take global snapshot periodically
  On failure, restart from the last checkpoint
Global property detection
  Detecting deadlock, loss of a token, etc.
Distributed debugging
  Inspecting the global state
Consistent and inconsistent cuts
G1 is not consistent
G2 is consistent but m3 must be recorded
[Figure: space-time diagram of processes P1, P2, P3 exchanging messages m1, m2, m3, with cuts G1 and G2]
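The difference between G1 and G2 can be sketched as a small check. This is a minimal sketch under assumptions: events on each process are numbered from 1, a cut is a per-process count of included events, and the event indices below are hypothetical (the figure itself is not recoverable from the transcript). The names `Message`, `is_consistent`, and `in_transit` are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    send_event: int   # index of the send event on the sender
    receiver: str
    recv_event: int   # index of the receive event on the receiver

def is_consistent(cut, messages):
    """A cut is consistent if every message received inside it was also sent inside it."""
    for m in messages:
        received = m.recv_event <= cut[m.receiver]
        sent = m.send_event <= cut[m.sender]
        if received and not sent:
            return False
    return True

def in_transit(cut, messages):
    """Messages sent inside the cut but not yet received inside it."""
    return [m.name for m in messages
            if m.send_event <= cut[m.sender] and m.recv_event > cut[m.receiver]]

# Hypothetical execution, chosen so the cuts behave as the slide describes.
msgs = [
    Message("m1", "P1", 1, "P2", 1),
    Message("m2", "P3", 1, "P2", 2),
    Message("m3", "P2", 3, "P1", 3),
]
g1 = {"P1": 1, "P2": 2, "P3": 0}  # contains the receive of m2 but not its send
g2 = {"P1": 2, "P2": 3, "P3": 1}  # m3 is sent inside the cut, received outside

print(is_consistent(g1, msgs))  # False
print(is_consistent(g2, msgs))  # True
print(in_transit(g2, msgs))     # ['m3']
```

With these assumed indices, G2 is consistent only if m3 is saved as part of the channel state, which is the point the slide makes.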
Model of the System
No shared clock
No shared memory
Processes communicate using messages
Messages are reliable
No upper bound on message delivery time
Checkpoint
A process must be red to receive a red message
  A white process turns red on receiving a red message
Any white message received by a red process must be recorded as an in-transit message
[Figure: processes P and Q exchanging messages labeled ww, wr, rr, rw]
Classification of Messages
  w – white process (pre-recording local state)
  r – red process (post-recording)
  e.g. rw – sent by a red process, received by a white process
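The coloring rules can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `Process` class, its integer `state`, and the tuple message format are assumptions made for the example.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.color = "white"      # white = local state not yet recorded
        self.state = 0            # toy application state: messages delivered so far
        self.local_state = None   # recorded checkpoint
        self.in_transit = []      # white messages received after recording

    def record(self):
        """Record the local state and turn red (idempotent)."""
        if self.color == "white":
            self.local_state = self.state
            self.color = "red"

    def send(self, payload):
        # Every message carries the sender's current color.
        return (self.color, payload)

    def receive(self, message):
        sender_color, payload = message
        if sender_color == "red":
            self.record()                     # must be red before receiving a red message
        elif self.color == "red":
            self.in_transit.append(payload)   # wr message: part of the channel state
        self.state += 1
        # Classification: sender color, then receiver color at delivery.
        return sender_color[0] + self.color[0]

p, q = Process("P"), Process("Q")
m1 = p.send("a")       # sent while P is white
p.record()
m2 = p.send("b")       # sent while P is red
print(q.receive(m2))   # rr: Q turns red first (m2 overtook m1 on a non-FIFO channel)
print(q.receive(m1))   # wr: recorded as in transit
print(q.in_transit)    # ['a']
```

Because a white process turns red before delivery, an rw message is never actually delivered to a white process in this sketch.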
Previous Work
Chandy and Lamport’s algorithm
  Assumes FIFO channels
  Requires one message (marker) per channel
  Marker indicates the end of white messages
Mattern’s algorithm; Schulz, Bronevetsky et al.
  Work for non-FIFO channels
  Require a message that indicates the total number of white messages sent on the channel
Results
Algorithm     Message Complexity        Message Size   Space
CLM           O(N^2)                    O(1)           O(N)
Grid-based    O(N^(3/2))                O(√N)          O(N)
Tree-based    O(N log N log(W/N))       O(1)           O(1)
Centralized   O(N log(W/N))             O(1)           O(1)
Grid-based Algorithm
Idea 1
  Previously: send the number of white messages per channel
  This algorithm: send the total number of white messages destined to a process
Idea 2
  Previously: send N messages of size O(1)
  Now: send √N messages of size √N
Grid-based Algorithm
Algorithm for P(r,c)
  Step 1: send row i of the whiteSent matrix to P(r,i)
  Step 2: compute cumulative count for row c
    Send this count to P(c,c)
  Step 3: if (r=c) // diagonal entry
    Receive count from all processes in the column
    Send jth entry to P(c,j)
whiteSent =
  [ 1 0 3 ]
  [ 2 1 0 ]
  [ 4 0 0 ]
Grid-based Algorithm - Example
Step 2: a process adds the rows it received in Step 1
  [ 1 2 3 ] + [ 2 1 0 ] + [ 1 4 1 ] = [ 4 7 4 ]
  For each processor of second row: count of messages sent to it from processors in third row
Step 3: the diagonal process adds the counts received from its column
  [ 4 7 4 ] + [ 2 1 2 ] + [ 1 0 1 ] = [ 7 8 6 ]
  For each processor of second row: count of messages sent to it from all processors
Tree/Centralized Algorithms
Idea
  Previously: maintain white messages sent for every destination
  These algorithms: nodes maintain local deficits
  Local deficit = white messages sent - white messages received
  Total deficit = sum of all local deficits
Distributed Message Counting Problem
  W in-transit messages destined for N processors
  Detect when all messages have been received
  W tokens: a token is consumed when a message is received
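As a small worked example of the deficit definitions, assuming each process keeps just two counters (the `counters` values below are made up for illustration):

```python
def local_deficit(white_sent, white_received):
    """White messages sent minus white messages received by one process."""
    return white_sent - white_received

def total_deficit(counters):
    """Sum of local deficits = number of white messages still in transit (W)."""
    return sum(local_deficit(sent, received) for sent, received in counters)

# Hypothetical (sent, received) counters for three processes.
counters = [(5, 2), (1, 3), (4, 1)]
print([local_deficit(s, r) for s, r in counters])  # [3, -2, 3]
print(total_deficit(counters))                     # 4
```

Ten white messages were sent and six received, so four are still in transit. A process only needs O(1) space for its counters, rather than one counter per destination.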
Tree/Centralized Algorithms
Distributed Message Counting Algorithm
  Arrange nodes in a suitable data structure
  Distribute tokens equally to all processors at start: w = W/n
  Each node has a color:
    Green (Rich): has more than w/2 tokens
    Yellow (Debt-free): has <= w/2 tokens
    Orange (Poor): has no tokens and has received a white message
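The color rule might be modeled as below. This sketch makes one interpretive assumption that the slides do not spell out: a node counts as orange once a white message arrives when it has no token left to consume, tracked here in a hypothetical `debt` field.

```python
class Node:
    def __init__(self, tokens, w):
        self.tokens = tokens   # tokens currently held
        self.w = w             # initial share per node, w = W/n
        self.debt = 0          # assumption: white messages received with no token left

    def receive_white(self):
        if self.tokens > 0:
            self.tokens -= 1   # a token is consumed when a message is received
        else:
            self.debt += 1

    @property
    def color(self):
        if self.tokens > self.w / 2:
            return "green"     # rich
        if self.tokens == 0 and self.debt > 0:
            return "orange"    # poor
        return "yellow"        # debt-free

node = Node(tokens=4, w=4)
colors = [node.color]
for _ in range(5):
    node.receive_white()
    colors.append(node.color)
print(colors)  # ['green', 'green', 'yellow', 'yellow', 'yellow', 'orange']
```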
Tree-based Algorithm: High level idea
Arrange nodes as a binary tree
Progresses in rounds
  In each round all the nodes start off rich
  A token is consumed on receiving a message
  Debt-free node cannot have a rich child
    Ensured by transfer of tokens
Starting a new round
  When the root is no longer rich
  By then at least half of the tokens have been consumed
Tree-based Algorithm
Invariants
  I1: Yellow process cannot have green child
  I2: Root is always green
  I3: Any orange node eventually becomes yellow
[Figure: binary tree of colored nodes annotated with I1 and I2]
Tree-based Algorithm - Example
[Figure: a node violates I1; labels: Swap Request, Swap Accept]
[Figure: a node violates I3; labels: Split Request, Split Accept]
[Figure: the root violates I2]
Reset Round
  Recalculate remaining tokens W’ ( <= nw/2 = W/2 )
  Start new round with W’
  Redistribute tokens equally
  All nodes turn Green
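A checker for the invariants could look like the sketch below. It assumes the binary tree is stored heap-style in a list (children of node i at 2i+1 and 2i+2), which is a representation chosen for the example and not stated in the slides. Since I3 is an "eventually" property, an orange node is reported only as pending work.

```python
def violations(colors):
    """colors[i] is the color of node i in a heap-ordered binary tree."""
    found = []
    if colors[0] != "green":
        found.append(("I2", 0))               # root is no longer green: reset the round
    for i, color in enumerate(colors):
        if color == "yellow":
            children = (2 * i + 1, 2 * i + 2)
            if any(ch < len(colors) and colors[ch] == "green" for ch in children):
                found.append(("I1", i))       # yellow node with a green child
        elif color == "orange":
            found.append(("I3", i))           # orange node still needs tokens
    return found

tree = ["green", "yellow", "green", "green", "yellow", "orange", "green"]
print(violations(tree))                            # [('I1', 1), ('I3', 5)]
print(violations(["yellow", "yellow", "yellow"]))  # [('I2', 0)]
```

In the slides' example, the figure labels pair an I1 violation with a swap request, an I3 violation with a split request, and an I2 violation with a round reset.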
Tree-based Algorithm – Analysis
Number of rounds
  If W < 2n, only O( n ) messages are required
  Tokens reduce by half in every round
  # of rounds = O( log(W/n) )
Number of control messages per round
  O( log n ) control messages per color change
  Whenever color changes, some green node turns yellow
  O( n ) color changes per round
  # of control messages per round = O( n log n )
Total control messages = O( n log n log(W/n) )
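The round count can be checked numerically. This sketch assumes the worst case for the bound, where exactly half of the tokens survive each reset, and stops once W < 2n as in the analysis above. The values n = 32 and W = 2880992 are taken from the experimental section of these slides; `count_rounds` is a hypothetical helper.

```python
import math

def count_rounds(W, n):
    """Rounds until fewer than 2n tokens remain, halving W at every reset."""
    rounds = 0
    while W >= 2 * n:
        W //= 2          # W' <= W/2 tokens remain when the round is reset
        rounds += 1
    return rounds

n, W = 32, 2880992
print(count_rounds(W, n))                 # 16
print(math.ceil(math.log2(W / n)))        # 17, the log(W/n) bound
```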
Centralized Algorithm
Idea
  In the tree-based algorithm, every color change requires a search for a green node to split/swap tokens with
    Requires O( log n ) control messages
  Can we find a green node with O(1) control messages?
    Master node (tail) maintains list of all green nodes
[Figure: Master node]
Centralized Algorithm – Analysis
Number of rounds
  If W < 2n, only O( n ) messages are required
  Tokens reduce by half in every round
  # of rounds = O( log(W/n) )
Number of control messages per round
  O( 1 ) control messages per color change
  Whenever color changes, some green node turns yellow
  O( n ) color changes per round
  # of control messages per round = O( n )
Total control messages = O( n log(W/n) )
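The master's bookkeeping might look like the following sketch. The slides only say that the master maintains a list of green nodes, so the `Master` class, its method names, and the way control messages are counted (one notification, or one request plus one reply) are all assumptions for illustration.

```python
class Master:
    def __init__(self, node_ids):
        self.green = set(node_ids)   # all nodes start each round green
        self.control_messages = 0

    def report_not_green(self, node_id):
        """A node tells the master it is no longer green."""
        self.green.discard(node_id)
        self.control_messages += 1   # one notification

    def find_green(self):
        """Return some green node, or None if a new round must be started."""
        self.control_messages += 2   # one request, one reply
        return next(iter(self.green), None)

master = Master(range(4))
donor = master.find_green()          # O(1) control messages, no tree search
master.report_not_green(donor)
print(donor in master.green)         # False
```

Each lookup costs a constant number of control messages here, compared with the O( log n ) search in the tree-based algorithm.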
Lower Bound
Observation
  Suppose there are W outstanding tokens
  Some process must generate a control message on receiving W/n white messages
[Figure: processes, each labeled W/n]
Send W/n white messages to that processor
  Remaining tokens = (n-1)W/n
  Repeat argument recursively
Tokens remaining after i control messages >= ((n-1)/n)^i · W
# of control messages = Ω( n log(W/n) )
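The step from the token bound to the control message count can be written out as below. This is a sketch of the calculation, not the authors' proof: it assumes the recursive argument can be repeated until fewer than n tokens remain, and writes T_i for the tokens remaining after i control messages.

```latex
T_i \;\ge\; \left(\frac{n-1}{n}\right)^{i} W
\quad\text{and}\quad T_i < n
\;\Longrightarrow\;
i \;>\; \frac{\ln(W/n)}{\ln\!\left(\frac{n}{n-1}\right)}
\;\ge\; (n-1)\,\ln\frac{W}{n},
\qquad\text{since}\quad
\ln\frac{n}{n-1} = \ln\!\left(1+\frac{1}{n-1}\right) \le \frac{1}{n-1}.
```

Under that assumption, at least (n-1) ln(W/n) control messages are generated, which is Ω( n log(W/n) ) for n >= 2.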
Experimental Results
[Figure: Total Latencies in milliseconds (axis 0 to 80) for Grid, Tree, and Centralized at (N, W) = (32, 2880992), (64, 5764032), (128, 11536256), (256, 23105280), (512, 46341632)]
Experimental Results
[Figure: Average Message Counts (W=40,000); x-axis N = 32, 64, 128, 256, 512; y-axis Count (0 to 300); series Grid, Tree, Centralized]
Conclusions
Global Snapshots in distributed systems
  Distributed Message Counting problem
Optimal algorithm
  Message Complexity O( n log(W/n) )
  Matching lower bound
  Centralized algorithm
Open Problem
  Decentralized algorithm?