Dynamic atomic storage without consensus
DYNAMIC ATOMIC STORAGE WITHOUT CONSENSUS
Aguilera, Keidar, Malkhi, Shraer, J. ACM 58, 2, 2011
Sarai Duek
THE PROBLEM
Implement an atomic read/write register in a dynamic system.
|Read
|Write
|Reconfig
What is atomicity?
Atomicity is when each operation appears to occur at some point between its invocation and response.
[Figure: timeline of read (R) and write (W) operations.]
What is liveness?
Liveness is a guarantee that the system will make progress under some conditions (e.g. when a majority of the processes is correct).
t-resilient R/W storage guarantees progress if fewer than t processes crash. For an n-process system, it is well known that t-resilient R/W storage exists when t < n/2, and does not exist when t ≥ n/2.
[Figure: four processes P0 to P3 with a write (W) and a read (R); P2 and P3 are highlighted, with ⊥ and × marks.]
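The n/2 bound above comes down to quorum arithmetic. The following is a minimal sketch, with hypothetical helper names that do not appear in the slides: any two majorities of n processes share a process, while two sets of exactly half the processes need not.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of processes that is more than half of n."""
    return n // 2 + 1

def crashes_survivable(n: int) -> int:
    """Largest number of crashes that still leaves a majority alive."""
    return n - majority(n)

def always_intersect(n: int, size: int) -> bool:
    """True if every two subsets of the given size share a process."""
    subsets = [set(s) for s in combinations(range(n), size)]
    return all(a & b for a in subsets for b in subsets)

for n in (3, 4, 5):
    # Majorities always intersect, so a reader meets at least one
    # replica that saw the latest completed write.
    print(n, majority(n), crashes_survivable(n), always_intersect(n, majority(n)))

# With n = 4, two halves such as {0, 1} and {2, 3} are disjoint.
print(always_intersect(4, 2))
```

The brute-force check is only meant for small n; it enumerates all subsets of the given size.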
In a dynamic system the majority can change, and liveness is maintained through the reconfig operation.
[Figure: processes P0 to P4; the operation reconfig1(+,4) is shown.]
The model
|Unknown and unbounded universe of processes Π.
|Asynchronous reliable communication channels between each pair of processes.
|Processes can be added, removed, crash or halt.
[Figure: processes p1 to p9, with p… indicating further processes.]
A view is a set of changes.
Changes lead to a new configuration of processes.
Liveness conditions
|The set of crashed processes and those whose removal is pending is a minority of the current view and of any pending future view.
|No new reconfig operations are invoked for “sufficiently long”, so that the operations already started can complete.
[Figure: processes p0 to p5.]
|MWMR: any process can write and read.
|Written values are unique: (val, pid, ts).
|Every process in the system knows the initial view.
|We say, by convention, that a reconfig(Init) completes by time 0.
|Members of view w store information about the current view.
Changes: {Remove, Add}
View: set of changes
For view w:
|w.remove: removal set
|w.join: join set
|w.members: the set w.join \ w.remove
V(t): union of all sets c such that a reconfig(c) completes by time t
Init = V(0)
P(t): set of pending changes at time t
F(t): set of processes that crashed by time t
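The definitions above can be illustrated with a small sketch. The encoding is an assumption, not the slides' own: a change is a pair such as ("+", 3) or ("-", 1), mirroring the (+,3) notation used elsewhere in the deck, and a view is a frozenset of such pairs. The function names are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Change = Tuple[str, int]   # ("+", pid) adds pid, ("-", pid) removes pid
View = FrozenSet[Change]   # a view is a set of changes

def join_set(w: View) -> Set[int]:
    """w.join: processes added by some change in w."""
    return {pid for op, pid in w if op == "+"}

def remove_set(w: View) -> Set[int]:
    """w.remove: processes removed by some change in w."""
    return {pid for op, pid in w if op == "-"}

def members(w: View) -> Set[int]:
    """w.members = w.join \\ w.remove."""
    return join_set(w) - remove_set(w)

init: View = frozenset({("+", 0), ("+", 1), ("+", 2)})
w: View = init | {("+", 3), ("-", 1)}   # views only grow: changes are never deleted
print(sorted(members(init)))  # [0, 1, 2]
print(sorted(members(w)))     # [0, 2, 3]
```

Because a removal stays in the view forever, a process that was removed cannot become a member again under this encoding.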
Dynamic Service Liveness
If at every time t in the execution, fewer than |V(t).members|/2 processes out of V(t).members ∪ P(t).join are in F(t) ∪ P(t).remove, and the number of different changes proposed in the execution is finite, then the following hold:
|Eventually, the enable operations event occurs at every active process that was added by a complete reconfig operation.
|Every operation invoked at an active process eventually completes.
[Figure: example of the Dynamic Service Liveness condition over processes p0 to p10, marking V(t), P(t).join, P(t).remove and the crashed processes F(t) (×); the slide compares the count 3 with the bound 4.5.]
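The precondition of Dynamic Service Liveness can be checked mechanically for one instant t. This is a sketch under stated assumptions: the function name and the plain sets of process ids are hypothetical, and the example sets are illustrative rather than read off the slide's figure.

```python
from typing import Set

def liveness_precondition(v_members: Set[int], p_join: Set[int],
                          p_remove: Set[int], crashed: Set[int]) -> bool:
    """Fewer than |V(t).members|/2 processes out of V(t).members ∪ P(t).join
    are in F(t) ∪ P(t).remove."""
    candidates = v_members | p_join
    unavailable = candidates & (crashed | p_remove)
    return len(unavailable) < len(v_members) / 2

v_members = set(range(9))   # |V(t).members| = 9, so the bound is 4.5
p_join = {9, 10}            # joins that are still pending
p_remove = {8}              # removal that is still pending
crashed = {4, 7}            # F(t)

print(liveness_precondition(v_members, p_join, p_remove, crashed))          # True: 3 < 4.5
print(liveness_precondition(v_members, p_join, p_remove, {1, 4, 5, 7}))     # False: 5 >= 4.5
```

The full condition must hold at every time t, and the number of proposed changes must be finite; the sketch covers only the counting part.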
THE ALGORITHM OUTLINE
Read phase
|send a request to all processes
|each recipient sends back the current value of its replica
|wait for a majority to reply
|return the value associated with the largest sequence number
Write phase
|generate the next sequence number
|send a message with the value and the sequence number to all processes
|each recipient updates its replica and sends an ack
|the writer waits for a majority of acks
With reconfiguration, each phase also reads configuration information:
Read phase
|read configuration information
|if a new view was discovered, restart the read phase in the new view
|send a request to all processes
|each recipient sends back the current value of its replica
|wait for a majority to reply
|return the value associated with the largest sequence number
Write phase
|generate the next sequence number
|send a message with the value and the sequence number to all processes
|each recipient updates its replica and sends an ack
|the writer waits for a majority of acks
|read configuration information
|if a new view was discovered, restart the read phase in the new view (followed by a write phase again)
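The read and write phases outlined above, without reconfiguration, can be sketched as a single-threaded simulation in which "send to all" is a loop over in-memory replicas. The class names, the (sequence number, writer id) timestamp, the read phase run before a write to learn the latest sequence number, and the write-back at the end of a read are assumptions borrowed from the usual majority-based register emulation; the slides do not spell them out.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Timestamp = Tuple[int, int]  # (sequence number, writer id), compared lexicographically

@dataclass
class Replica:
    ts: Timestamp = (0, 0)
    val: Any = None
    crashed: bool = False

    def on_read(self):
        """Reply with the current (timestamp, value), or nothing if crashed."""
        return None if self.crashed else (self.ts, self.val)

    def on_write(self, ts: Timestamp, val: Any) -> bool:
        """Adopt the value if it is newer; the return value is the ack."""
        if self.crashed:
            return False
        if ts > self.ts:
            self.ts, self.val = ts, val
        return True

class Client:
    def __init__(self, pid: int, replicas: List[Replica]):
        self.pid, self.replicas = pid, replicas
        self.majority = len(replicas) // 2 + 1

    def _read_phase(self):
        replies = [r for r in (rep.on_read() for rep in self.replicas) if r is not None]
        if len(replies) < self.majority:
            raise RuntimeError("no majority replied")  # a real client would keep waiting
        return max(replies, key=lambda reply: reply[0])  # largest timestamp wins

    def _write_phase(self, ts: Timestamp, val: Any) -> None:
        acks = sum(rep.on_write(ts, val) for rep in self.replicas)
        if acks < self.majority:
            raise RuntimeError("no majority acked")

    def write(self, val: Any) -> None:
        (num, _), _ = self._read_phase()           # learn the latest sequence number
        self._write_phase((num + 1, self.pid), val)

    def read(self) -> Any:
        ts, val = self._read_phase()
        self._write_phase(ts, val)                 # write back before returning
        return val

replicas = [Replica() for _ in range(5)]
replicas[3].crashed = replicas[4].crashed = True   # a minority has crashed
alice, bob = Client(1, replicas), Client(2, replicas)
alice.write("x")
print(bob.read())   # x
```

Raising an exception when no majority answers stands in for blocking forever, which is what an asynchronous client would do.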
Reconfiguration
|write information about the new view to a quorum of the old one
|execute the read and write phases, starting in the old view
WEAK OBJECT
Allows a fixed set of processes P to use two operations:
|arrive_i(c)
|query_i()
Arrive and query obey the following semantics:
|Integrity
|Validity
|Monotonicity of queries
|Non-empty common intersection
|Termination
Each process p_i in P has a value field p_i.val.
SWMR: only p_i can use p_i.val.write(c), but all processes can use p_i.val.read().
The weak object algorithm
Operation arrive_i(c)
  if collect() = Ø then p_i.val.write(c)
  return OK
Operation query_i()
  C1 ← collect()
  if C1 = Ø then return Ø
  C2 ← collect()
  return C2
Procedure collect()
  C ← Ø
  foreach p_k ∈ P
    c ← p_k.val.read()
    if c ≠ ⊥ then C ← C ∪ {c}
  return C
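The weak object pseudocode above translates almost line by line into Python. This is a single-threaded sketch: the class name is hypothetical, None plays the role of ⊥, and collect() reads all registers in one step, whereas in the real setting the reads interleave with other processes' writes.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Optional

class WeakObject:
    def __init__(self, processes: Iterable[int]):
        # One single-writer register per process; None stands for ⊥.
        self.val: Dict[int, Optional[Hashable]] = {p: None for p in processes}

    def collect(self) -> FrozenSet[Hashable]:
        """Read every register and gather the non-⊥ values."""
        return frozenset(c for c in self.val.values() if c is not None)

    def arrive(self, i: int, c: Hashable) -> str:
        """Process i proposes c, but only if it sees no earlier proposal."""
        if not self.collect():
            self.val[i] = c
        return "OK"

    def query(self, i: int) -> FrozenSet[Hashable]:
        """Return the empty set, or the result of a second collect."""
        c1 = self.collect()
        if not c1:
            return frozenset()
        return self.collect()

obj = WeakObject(range(6))
print(obj.query(1))          # frozenset()
obj.arrive(0, "v1")          # stored: nothing was visible before
obj.arrive(5, "v2")          # not stored: v1 is already visible
print(obj.query(1))          # frozenset({'v1'})
```

In a sequential run at most one value is ever stored. Several values can coexist only when arrive operations overlap, which this sketch does not model.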
[Figure: processes P0 to P5; P0 invokes arrive(v1) and P5 invokes arrive(v2); the collect shown returns C = { }; P0 stores v1 and P5 stores v2.]
[Figure: query() over processes P0 to P5, where P0 holds v1 and P5 holds v2; collect results shown: C = { }, C = {v1}, C = {v1, v2}; labels: query a { }, query b { }.]
[Figure: queries a and b with collect results {a} and {a, b} in one case, and {a} and {b} in the other.]
THE ALGORITHM
operation read_i():
  pickNewTS_i ← FALSE
  newView ← Traverse(∅, ⊥)
  NotifyQ(newView)
  return v_i^max
operation write_i(v):
  pickNewTS_i ← TRUE
  newView ← Traverse(∅, v)
  NotifyQ(newView)
  return OK
operation reconfig_i(cng):
  pickNewTS_i ← FALSE
  newView ← Traverse(cng, ⊥)
  NotifyQ(newView)
  return OK
procedure NotifyQ(w)
  if did not receive {NOTIFY, w} then send {NOTIFY, w} to w.members
  wait for {NOTIFY, w} from majority of w.members
procedure Traverse(cng, v)
  desiredView ← curView_i ∪ cng
  Front ← {curView_i}
  do
    s ← min{|ℓ| : ℓ ∈ Front}
    w ← any ℓ ∈ Front s.t. |ℓ| = s
    if (i ∉ w.members) then halt_i
    if w ≠ desiredView then
      arrive_i(w, desiredView \ w)
    ChangeSets ← ReadInView(w)
    if ChangeSets ≠ ∅ then
      Front ← Front \ {w}
      foreach c ∈ ChangeSets
        desiredView ← desiredView ∪ c
        Front ← Front ∪ {w ∪ c}
    else ChangeSets ← WriteInView(w, v)
  while ChangeSets ≠ ∅
  curView_i ← desiredView
  return desiredView
Traverse is used to look for the next view considering all the changes suggested so far.
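The control flow of Traverse can be followed in a simplified, single-threaded sketch. Everything outside the loop structure is an assumption: views are frozensets of ("+", pid) / ("-", pid) pairs, the weak object of each view is collapsed into one shared set of change sets, the quorum communication inside ReadInView and WriteInView is omitted, and halting is modelled by an exception.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Change = Tuple[str, int]
View = FrozenSet[Change]

class Halted(Exception):
    """Raised when the process finds itself outside the view it inspects."""

def members(w: View) -> Set[int]:
    return {p for op, p in w if op == "+"} - {p for op, p in w if op == "-"}

class Process:
    def __init__(self, pid: int, init: View, store: Dict[View, Set[View]]):
        self.pid, self.cur_view = pid, init
        self.store = store  # shared: view -> change sets proposed in that view

    def arrive(self, w: View, c: View) -> None:
        if not self.store.setdefault(w, set()):   # only if nothing is proposed yet
            self.store[w].add(frozenset(c))

    def query(self, w: View) -> Set[View]:
        return set(self.store.get(w, set()))

    def read_in_view(self, w: View) -> Set[View]:
        return self.query(w)        # ContactQ(R, w.members) omitted in this sketch

    def write_in_view(self, w: View, v) -> Set[View]:
        return self.query(w)        # ContactQ(W, w.members) omitted in this sketch

    def traverse(self, cng: Iterable[Change], v=None) -> View:
        desired = self.cur_view | frozenset(cng)
        front = {self.cur_view}
        while True:
            w = min(front, key=len)               # a smallest view in Front
            if self.pid not in members(w):
                raise Halted(self.pid)
            if w != desired:
                self.arrive(w, desired - w)       # propose the missing changes
            change_sets = self.read_in_view(w)
            if change_sets:
                front.discard(w)
                for c in change_sets:
                    desired |= c
                    front.add(w | c)
            else:
                change_sets = self.write_in_view(w, v)
            if not change_sets:                   # do ... while ChangeSets ≠ ∅
                break
        self.cur_view = desired
        return desired

init: View = frozenset({("+", 0), ("+", 1), ("+", 2)})
store: Dict[View, Set[View]] = {}
p0, p1 = Process(0, init, store), Process(1, init, store)
print(sorted(members(p0.traverse({("+", 3)}))))   # [0, 1, 2, 3]
print(sorted(members(p1.traverse({("-", 2)}))))   # [0, 1, 3]
```

In the second call, p1 first discovers the change (+,3) stored in the initial view, moves to the next view, and only there manages to propose its own removal of process 2.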
[Figure: the graph of views explored by Traverse, from Initview through V1 to V6, with the Front shown initially and after iterations 1, 4 and 6. Edges are labeled with change sets such as {(+,3)}, {(-,1), (+,4)}, {(+,5)} and {(+,7)}; the legend distinguishes edges returned from ReadInView from edges updated by p_i. The final view is InitView ∪ {(+,3), (+,5), (-,1), (+,4), (+,7)}.]
procedure ReadInView(w)
  ChangeSets ← query_i(w)
  ContactQ(R, w.members)
  return ChangeSets
procedure WriteInView(w, v)
  if pickNewTS_i then
    (pickNewTS_i, v_i^max, ts_i^max) ← (FALSE, v, (ts_i^max.num + 1, i))
  ContactQ(W, w.members)
  ChangeSets ← query_i(w)
  return ChangeSets
Procedure ContactQ sends a write-request including v_i^max and ts_i^max when writing a quorum, and a read-request when reading a quorum.
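The timestamp step of WriteInView can be isolated in a short sketch. The class and method names are hypothetical, and the tuple (num, pid) with lexicographic comparison is an assumed encoding of ts_i^max. It shows the effect of resetting pickNewTS_i: if WriteInView runs in several views during one Traverse, only the first call picks a new timestamp.

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass
class WriterState:
    pid: int
    pick_new_ts: bool = False
    v_max: Any = None
    ts_max: Tuple[int, int] = (0, 0)   # (num, pid), compared lexicographically

    def pick_timestamp(self, v: Any) -> None:
        """First step of WriteInView: choose a timestamp at most once per write."""
        if self.pick_new_ts:
            self.pick_new_ts = False
            self.v_max = v
            self.ts_max = (self.ts_max[0] + 1, self.pid)

state = WriterState(pid=2, ts_max=(4, 1), v_max="old")
state.pick_new_ts = True      # set by write_i(v) before calling Traverse
state.pick_timestamp("new")   # first WriteInView of this write
state.pick_timestamp("new")   # a later WriteInView in another view changes nothing
print(state.ts_max, state.v_max)   # (5, 2) new
print((5, 2) > (5, 1))             # True: the process id breaks ties
```

Including the process id in the timestamp is what keeps values written concurrently with the same number distinct.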
ESTABLISHED VIEWS
The unique sequence of established views E is constructed as follows:
|the first view in E is the initial view Init
|if w is in E, then the next view after w in E is w' = w ∪ c, where c is an element chosen arbitrarily from the intersection of all sets C ≠ ∅ returned by some query(w) operation in the execution.
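One step of this construction can be sketched as follows. The function name and the data encoding are hypothetical: views and change sets are frozensets of ("+", pid) / ("-", pid) pairs, and query_results lists the sets returned by query(w) operations. The arbitrary choice is replaced by a deterministic one so that the example is reproducible, and returning None when every query came back empty is a convention of this sketch only.

```python
from typing import FrozenSet, Iterable, Optional, Set, Tuple

Change = Tuple[str, int]
View = FrozenSet[Change]

def next_established(w: View, query_results: Iterable[Set[View]]) -> Optional[View]:
    """Return w ∪ c for some c common to all non-empty query(w) results."""
    non_empty = [set(C) for C in query_results if C]
    if not non_empty:
        return None                      # no query(w) returned a change set
    common = set.intersection(*non_empty)
    # The weak object's "non-empty common intersection" property is what
    # guarantees that `common` has at least one element here.
    c = min(common, key=sorted)          # deterministic stand-in for "arbitrarily"
    return w | c

init: View = frozenset({("+", 0), ("+", 1), ("+", 2)})
add3 = frozenset({("+", 3)})
swap = frozenset({("-", 1), ("+", 4)})

results = [{add3}, {add3, swap}, set()]  # three query(Init) results, one of them empty
print(next_established(init, results) == init | add3)   # True
```

Here swap was seen by only one query, so it is not in the common intersection and cannot determine the next established view.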
THANK YOU