RAMBO: Reconfigurable Atomic Memory for Dynamic Networks
Seth Gilbert, Nancy Lynch, Alexander Shvartsman
Presenter: Anastasia Braginsky (December 2013)
Slides partially borrowed from Seth Gilbert (DSN ’03) and Edward Bortnikov (talk)
RAMBO name
» Reconfigurable Atomic Memory for Basic Objects
Outline
» Introduction
» Background
˃ Static Quorum Systems
˃ Consensus
» RAMBO high level overview
» Preliminaries
» The RAMBO algorithm
» The reconfiguration service
» Conclusions
Distributed Shared Memory
[Figure: clients issue Read, Write(7), and Write(0) against the shared memory]
Atomic Consistency (linearizability)
» Definition: Each operation appears to occur at some point between its invocation and response
» Sufficient condition: For each object x, all the read and write operations for x can be partially ordered by ≺, so that:
1. No operation has infinitely many other operations ordered before it
2. ≺ is consistent with the order of invocations and responses: there are no operations π1, π2 such that π1 completes before π2 starts, yet π2 ≺ π1
3. All write operations are ordered with respect to each other and with respect to all the reads.
4. Every read returns the value of the last write preceding it in ≺
» In other words: if op A completes before op B begins, then B returns the results of A
[Figure: timeline of Write(7), Write(0), and a Read that returns 7]
Suggestions?
» Central server?
˃ Performance bottleneck
˃ Single point of failure
» So multiple servers need to replicate the content
˃ And the world should not stop when some reconfiguration is needed
» But then how do we find the latest value of a replicated object?
Distributed Networked System
» All-to-all connectivity, but messages can be lost, delayed, or re-ordered
» No global clock or synchronization mechanism (asynchrony)
» Nodes can fail
» A distributed networked system can be static (fixed set of participating nodes) or dynamic
And if everything fails?
» Memory access operations are guaranteed to terminate only under certain assumptions
» Static:
˃ A majority of the replicas needs to be active
˃ Network delays are bounded
» Dynamic:
˃ A dynamically changing subset of replicas needs to be active during certain periods
» Otherwise… Sorry…
˃ Operations may not terminate
Dependable Systems and Networks 2003
Quorums
[Figure: quorums of replicas, with a Write(7) and a Read operation shown]
Dynamic Atomic Memory
Static Quorum Systems
» Upfal and Wigderson (85)
˃ First general scheme for emulating shared memory in a message-passing system
˃ Majority sets of readers and writers
…
» Attiya, Bar-Noy and Dolev (90/95)
˃ Dijkstra Award in 2011
˃ Including extensions to the original algorithm [N. Lynch and A. Shvartsman. Robust emulation of shared memory using dynamic quorum-acknowledged broadcast. 1997]
A(ttiya) B(ar-Noy) D(olev)
» Algorithm uses replication to achieve fault-tolerance and availability
» n nodes
» The system tolerates fewer than n/2 crashes (a majority of the nodes must stay alive)
ABD for a single register
» Each node i maintains the local value of the register
» value_i and tag_i = <seq, pid>
» Tags are compared lexicographically
» Each new write assigns a unique tag (pid to break ties)
» Read and write operations have two phases
˃ Query replicas for information
˃ Propagate information to replicas
» Each phase sends to everyone and waits until a majority responds
Read: Phase I (queries sent)
Node i: Value = 5, Tag = <4,j>
Node j: Value = 5, Tag = <4,j>
Node k: Value = 3, Tag = <5,k>
Node l: Value = 6, Tag = <5,l>
Node i sends a Query to each of the other nodes.

Read: Phase I (responses received)
Node states are unchanged.
Response: 3, <5,k>
Response: 6, <5,l>

Read: Phase II (propagation sent)
Node i: Value = 6, Tag = <5,l>
Node j: Value = 5, Tag = <4,j>
Node k: Value = 3, Tag = <5,k>
Node l: Value = 6, Tag = <5,l>
Node i sends Propagate 6, <5,l> to each of the other nodes.

Read: Phase II (acknowledgments received)
Node i: Value = 6, Tag = <5,l>
Node j: Value = 5, Tag = <4,j>
Node k: Value = 6, Tag = <5,l>
Node l: Value = 6, Tag = <5,l>
Two Acknowledgment messages arrive at node i.
Consistency
» Any two majorities have a non-empty intersection
» So at least one node participates both in the Propagation phase of the previous operation and in the Query phase of this one
» All writes are ordered by their tags
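The two-phase reads and writes described on the ABD slides can be sketched roughly as follows. This is a simplified, single-process illustration and not the authors' code: the names `Replica`, `abd_read`, and `abd_write` are hypothetical, message passing is replaced by direct method calls, and every live replica is contacted rather than an arbitrary majority.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One node's copy of the register (hypothetical helper class)."""
    value: int = 0
    tag: tuple = (0, 0)   # (seq, pid); Python compares tuples lexicographically
    alive: bool = True

    def query(self):
        return self.value, self.tag

    def propagate(self, value, tag):
        # Adopt the incoming pair only if its tag is newer.
        if tag > self.tag:
            self.value, self.tag = value, tag


def _majority(replicas):
    """Return the live replicas, or raise if they do not form a majority."""
    live = [r for r in replicas if r.alive]
    if len(live) <= len(replicas) // 2:
        raise RuntimeError("no majority responded; the operation cannot terminate")
    return live


def abd_read(replicas):
    live = _majority(replicas)
    # Phase I (query): pick the value carrying the largest tag.
    value, tag = max((r.query() for r in live), key=lambda vt: vt[1])
    # Phase II (propagate): write that value back before returning it.
    for r in live:
        r.propagate(value, tag)
    return value


def abd_write(replicas, pid, new_value):
    live = _majority(replicas)
    # Phase I (query): learn the largest sequence number in use.
    seq = max(r.query()[1] for r in live)[0]
    tag = (seq + 1, pid)   # pid breaks ties between concurrent writers
    # Phase II (propagate): install the new value under the new tag.
    for r in live:
        r.propagate(new_value, tag)
    return tag
```

Because any two majorities share a node, the query phase of a later operation always meets at least one replica updated by the propagation phase of an earlier one, which is the consistency argument on the slide.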
Too long waiting for the majority?
» Use quorum systems
» A quorum is a subset of nodes
» Any two quorums intersect
» A quorum can be much smaller than a majority
» The majority-based implementations tolerate crashes of any minority
» The quorum-based implementations require that the nodes in at least one quorum do not crash
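As one illustration of a quorum system whose quorums are smaller than a majority, the sketch below builds a grid quorum system. The grid construction is a standard textbook example chosen here for illustration, not something the slides prescribe, and the function names are hypothetical.

```python
from itertools import product


def grid_quorums(side):
    """Quorum (r, c) consists of every cell in row r plus every cell in column c."""
    quorums = []
    for r, c in product(range(side), repeat=2):
        row = {(r, j) for j in range(side)}
        col = {(i, c) for i in range(side)}
        quorums.append(frozenset(row | col))
    return quorums


def all_intersect(quorums):
    """True if every pair of quorums shares at least one node."""
    return all(q1 & q2 for q1 in quorums for q2 in quorums)


def some_quorum_alive(quorums, crashed):
    """True if at least one quorum contains no crashed node."""
    return any(q.isdisjoint(crashed) for q in quorums)


side = 4
quorums = grid_quorums(side)
n = side * side                  # 16 nodes in total
majority = n // 2 + 1            # 9 nodes
quorum_size = len(quorums[0])    # 7 nodes: one row plus one column

# Crash everything outside row 0 and column 0: a majority is down,
# yet one quorum survives.
survivors = quorums[0]
crashed_many = {(i, j) for i in range(side) for j in range(side)} - survivors

# Crash only the diagonal: a minority is down, yet every quorum is hit.
crashed_diagonal = {(i, i) for i in range(side)}
```

Row r of one quorum always crosses column c of another, so any two quorums intersect. The two crash sets show the trade-off from the slide: the system survives some crashes of a majority, but an unlucky minority can disable every quorum.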
Consensus
» A set of processes needs to agree on a value
» Nodes propose several values for consideration
» Any solution must satisfy:
˃ Agreement: no two processes decide on different values
˃ Validity: the value decided was proposed by some node
˃ Termination: all correct processes reach a decision
» In an asynchronous system, consensus termination cannot be guaranteed in the presence of even a single process crash
» Paxos is a consensus algorithm
RAMBO multi-reader, multi-writer
» Short term: Quorum-based Replication – to provide fault tolerance
˃ Read- and write-quorums collected into configurations
˃ Any quorum configuration can be installed at any time
» Long term: Reconfiguration – to cope with changing participants
» Participants can join and fail
Rambo
» Decouple read/write ops and reconfiguration
˃ Fast read/write ops, even if recon is slow
» A stable state (no reconfigurations) is similar to the static two-phase ABD, but
˃ Extended for multi-writer registers
˃ Generalized to use quorum systems
» New participants can join the service by contacting at least one existing participant
Quorum Reconfigurations
» Performed concurrently with any ongoing reads and writes
» Multiple reconfigurations can be in progress concurrently
» Reconfiguration involves
˃ Introduction of a new configuration
˃ Garbage collection of obsolete configuration(s)
Rambo stabilization
» Frequent reconfiguration?
» Clocks out of sync?
» Messages lost?
» Messages delayed?
[Figure: timeline marking the points "Network stabilizes" and "Rambo stabilizes"]
Three Sub-Protocols
» Joiner
˃ Joiner is notified by a device that the device wants to join
˃ The device provides the initial world view (the set of devices that this device thinks have already joined)
˃ Joiner contacts this world and retrieves the information necessary for the new device to participate
» Reader-Writer: executes read and write operations and garbage-collects old configurations
» Recon: produces new configurations
Configuration map
» Each participant maintains a configuration map – cmap – to store the sequence of configurations
» For node i, cmap_i(k) is
˃ the configuration number k if the configuration is active
˃ or a notification that this configuration doesn’t yet exist
˃ or a notification that this configuration was already garbage-collected
» This sequence evolves as new configurations are introduced by Recon and as old configurations are garbage-collected
CMAP Evolution (± marks a garbage-collected configuration)
c0  ...
c0  c1  ...
c0  c1  c2  ...
±   c1  c2  ...
±   ±   c2  ...
±   ±   ±   c3  ...
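A minimal sketch of how two nodes' configuration maps might be merged when they gossip. The marker names `UNDEFINED` and `GC`, the dictionary representation, and the exact merge rule are illustrative assumptions consistent with the three cmap states listed above, not a transcription of the algorithm.

```python
UNDEFINED = "UNDEFINED"   # hypothetical marker: configuration k does not exist yet
GC = "GC"                 # hypothetical marker: configuration k was garbage-collected


def update(cmap, received):
    """Merge two cmaps index by index.

    A garbage-collected entry wins over everything, and a known
    configuration wins over an undefined entry.
    """
    merged = {}
    for k in set(cmap) | set(received):
        mine = cmap.get(k, UNDEFINED)
        theirs = received.get(k, UNDEFINED)
        if GC in (mine, theirs):
            merged[k] = GC
        elif mine != UNDEFINED:
            merged[k] = mine
        else:
            merged[k] = theirs
    return merged


def active_configs(cmap):
    """Configurations that are defined and not yet garbage-collected."""
    return {k: c for k, c in cmap.items() if c not in (UNDEFINED, GC)}


# Node A already garbage-collected c0; node B has learned about c2.
cmap_a = {0: GC, 1: "c1"}
cmap_b = {0: "c0", 1: "c1", 2: "c2"}
merged = update(cmap_a, cmap_b)
```

After the merge the node holds the row `± c1 c2` from the evolution shown above: information only moves forward, from undefined to defined to garbage-collected.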
Reader-Writer
» Each read or write executes in the context of one or more active configurations (must use all active configurations)
» Reads and writes proceed concurrently with ongoing reconfigurations
» Two phases
˃ Query phase – information is retrieved from one (or more) read-quorums of all active configurations
˃ Propagate phase – information is updated in one (or more) write-quorums of all active configurations
» Garbage-Collection (GC) – removing old configurations
˃ Notifying about old configuration(s)
˃ Propagating information from old configuration to the next
RAMBO Assumptions
» Assumptions regarding RAMBO behavior:
˃ Each node regularly sends gossip messages to the other participants
˃ The initial world views overlap sufficiently, so that every node that has joined the system becomes aware of every other node soon enough
˃ Every configuration remains viable until sufficiently long after the next new configuration is installed
˃ Reconfigurations are not initiated too frequently
The system
» Set of devices communicating via all-to-all asynchronous message-passing network
» I : totally ordered set of device identifiers
[Figure: seven devices, numbered 1 to 7; each device is a node, or participant]
The system
» Nodes may fail by stopping (all components) without warning
[Figure: the same seven devices, each running three components: Joiner, Read-Write, and Recon]
Shared Memory Read/Write Objects
» X : set of object identifiers
» For each object x ∈ X, V_x is the set of values that x may take on
» (v0)_x – the initial value of object x
» (i0)_x – the initial creator of object x, the node that is initially responsible for object x (this responsibility can be delegated)
» T = N × I : set of tags, used to order the values written to the system
Configurations
» C : set of configuration identifiers
» Each identifier c ∈ C is associated with a unique configuration consisting of:
˃ members(c) – a finite subset of I
˃ read-quorums(c) – a set of finite subsets of members(c)
˃ write-quorums(c) – a set of finite subsets of members(c)
» For every c ∈ C, for every R ∈ read-quorums(c), and for every W ∈ write-quorums(c) : R ∩ W ≠ ∅
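The definition of a configuration can be written down directly as a small data structure with a validity check. The class name `Configuration` and the method `is_valid` are hypothetical, introduced only to illustrate the intersection requirement R ∩ W ≠ ∅.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    members: frozenset      # finite subset of the node identifiers I
    read_quorums: tuple     # tuple of frozensets, each a subset of members
    write_quorums: tuple    # tuple of frozensets, each a subset of members

    def is_valid(self):
        quorums = self.read_quorums + self.write_quorums
        # Every quorum must consist of members only.
        inside = all(q <= self.members for q in quorums)
        # Every read-quorum must intersect every write-quorum.
        intersecting = all(
            r & w for r in self.read_quorums for w in self.write_quorums
        )
        return inside and intersecting


# "Read one, write all": any single member is a read-quorum.
read_one_write_all = Configuration(
    members=frozenset({1, 2, 3}),
    read_quorums=(frozenset({1}), frozenset({2}), frozenset({3})),
    write_quorums=(frozenset({1, 2, 3}),),
)

# Broken: the read-quorum {1} misses the write-quorum {2, 3}.
broken = Configuration(
    members=frozenset({1, 2, 3}),
    read_quorums=(frozenset({1}),),
    write_quorums=(frozenset({2, 3}),),
)
```

Note that the requirement only relates read-quorums to write-quorums; two read-quorums, such as {1} and {2} above, are allowed to be disjoint.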
RAMBO API
Domains
» I = set of Nodes
» V = set of Values
» C = set of Configurations
Input (Request)
» Join(J) // J – initial world view
» Read
» Write(v)
» Recon(c, c’) // reconfiguration request
» Fail
Output (Response)
» Join-ack
» Read-ack(v)
» Write-ack
» Recon-ack // request has been processed
» Report(c) // new configuration
Inputs and Outputs are all asynchronous, per node i ∈ I and object x ∈ X
Requests’ Well-Formedness
» No requests after fail
» Each client issues at most one join request and waits for the acknowledgment before any further requests
» Before issuing a new read/write/recon, the client waits for the previous acknowledgment
» For every c, at most one recon(*,c) request is issued (configuration identifiers are unique)
» A client can request reconfiguration from c to c’ only if c was installed and all members of c’ have already joined
Responses’ Well-Formedness
» No responses after fail
» Responses come only in reply to requests
Reconfiguration service API
Domains
» I = set of Nodes
» V = set of Values
» C = set of Configurations
Input (Request)
» Join
» Recon(c, c’)
» Request-config(k) // the client has learned of every configuration preceding k
» Fail
Output (Response)
» Join-ack
» Recon-ack
» New-config(c, k) // the kth configuration has been agreed upon
» Report(c)
Inputs and Outputs are all asynchronous, per node i ∈ I and object x ∈ X
Recon Service Specification
» Recon
˃ Chooses configurations
˃ Tells members of the previous and new configuration.
˃ Informs Reader-Writer components (new-config).
» Behavior (assuming well-formedness):
˃ Agreement: Two configs never assigned to same k.
˃ Validity: Any announced new-config was previously requested by someone.
˃ No duplication: No configuration is assigned to more than one k.
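The three behavioral properties can be phrased as a check over a recorded trace of events. The trace format below (a set of requested configuration identifiers and a list of `(k, c)` pairs for new-config announcements) and the function name `check_recon_trace` are assumptions made for this sketch.

```python
def check_recon_trace(requested, announced):
    """Check a trace against the Recon service properties.

    requested: set of configuration ids that some client asked for via recon
    announced: list of (k, c) pairs, one per new-config(c, k) output
    Returns "ok" or the name of the first violated property.
    """
    by_index = {}    # k -> configuration announced for index k
    by_config = {}   # c -> index the configuration was announced for
    for k, c in announced:
        # Agreement: two configurations are never assigned to the same k.
        if by_index.setdefault(k, c) != c:
            return "agreement violated"
        # Validity: every announced configuration was requested by someone.
        if c not in requested:
            return "validity violated"
        # No duplication: no configuration is assigned to more than one k.
        if by_config.setdefault(c, k) != k:
            return "no-duplication violated"
    return "ok"


requested = {"c1", "c2"}
good_trace = [(1, "c1"), (1, "c1"), (2, "c2")]   # same decision seen at two nodes
```

The same `(k, c)` pair may appear several times in a trace, since each node announces the decision locally; only conflicting pairs count as violations.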
Suppress explicit mention of x
» The shared memory is described as the composition of a separate implementation for each object x ∈ X
» V, v0, c0, and i0 are shorthand for V_x, (v0)_x, (c0)_x, and (i0)_x
[Figure: at each node (nodes i and j are shown), the user interacts with three components: Joiner, Read-Write, and Recon]
Joiner automata state
» status ∈ {idle, joining, active, failed}, initially idle
» others-status, a mapping from {Recon, Reader-Writer} to {idle, joining, active}, initially everywhere idle
» initial-world (iw) ⊆ I, initially ∅
Joiner automata
[Figure sequence: node i joins while node j is already in the system; each node has Joiner, Recon, and Read-Write (RW) components]
1. The user of node i issues Join(J); Joiner status = idle, iw = ∅
2. Joiner status = joining, iw = J; a join message is sent out ("Hope at least one will answer…")
3. More join messages are sent (two join arrows in the figure); Joiner status = joining, iw = J
4. Join-ack is returned to the user of node i; Joiner status = active, iw = J
R/W Automata State
» status ∈ {idle, joining, active, failed}, initially idle
» world ⊆ I, initially ∅
» value ∈ V, initially v0, and tag ∈ T, initially (0, i0)
» cmap – which configurations are currently active, according to this node's understanding
» op – record that keeps track of the status of the current locally initiated read/write operation
˃ op includes acc, the set of clients that have sent responses
» pnum-local – counts phases of locally initiated operations
» pnum-vector[] – records latest known phase numbers for all locations
» gc – record that keeps track of the status of the current locally initiated GC operation
“Recent” messages
» A message from i to j is deemed “recent” by j if i knows about j’s current phase:
pnum-vector[j]_i = pnum-local_j
» That is, i has received a message from j that was sent after j began the new phase and was received before i sent its message to j
Reader-Writer automata: join
[Figure: the Joiner of node i passes Join(k) to Read-Write, which sets world = world ∪ {k}; Join-ack is returned once status is active]
On Join(k):
if (status == idle) {
    if (k == i) status = active;
    else status = joining;
}
Reader-Writer of node i
» In each phase Reader-Writer contacts a set of quorums
˃ At least one for each active configuration
» To obtain a recent value, tag, and cmap
» This is done by sending and receiving messages in the background
Send periodically from i to j
» If status is active then send to every j in your world:
world, value, tag, cmap, pnum-local, pnum-vector[j]
Receive from j, if i isn’t idle or failed:
status = active; // in case it was joining
world = world ∪ world-received;
if (tag-received > tag) (value, tag) = (value-received, tag-received);
// learn about new and old configurations
cmap = update(cmap, cmap-received);
pnum-vector[j] = max(pnum-vector[j], pnum-received);
if (ongoing-operation and message-is-recent)
    if (op.configuration is consistent) op.acc = op.acc ∪ {j};
    else restart operation;
else if (ongoing-gc and message-is-recent)
    gc.acc = gc.acc ∪ {j};
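The receive step can be sketched in Python as below. This is a reduced illustration under stated assumptions: the classes `Message` and `ReaderWriter` and their field names are hypothetical, and the cmap merge, the configuration-consistency check, and the GC branch are left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    world: set
    value: object
    tag: tuple
    pnum_local: int          # the sender's own phase counter
    pnum_for_receiver: int   # the sender's pnum-vector entry for the receiver


@dataclass
class ReaderWriter:
    node_id: int
    status: str = "idle"
    world: set = field(default_factory=set)
    value: object = None
    tag: tuple = (0, 0)
    pnum_local: int = 0
    pnum_vector: dict = field(default_factory=dict)
    op_active: bool = False
    op_acc: set = field(default_factory=set)

    def start_phase(self):
        """Begin a new query or propagation phase."""
        self.pnum_local += 1
        self.op_active = True
        self.op_acc = set()

    def receive(self, m):
        if self.status in ("idle", "failed"):
            return
        self.status = "active"            # in case it was joining
        self.world |= m.world
        if m.tag > self.tag:              # adopt the newer value
            self.value, self.tag = m.value, m.tag
        self.pnum_vector[m.sender] = max(
            self.pnum_vector.get(m.sender, 0), m.pnum_local
        )
        # "Recent": the sender already knew about our current phase.
        recent = m.pnum_for_receiver == self.pnum_local
        if self.op_active and recent:
            self.op_acc.add(m.sender)
```

A stale message still delivers useful state (world, value, tag), but only a recent one counts toward the quorum being collected in `op_acc`.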
Read/Write Query Phase
[Figure: the user of node i issues Read or Write(v). RW: wait for at least one quorum from each active c to update you]
On Read / Write(v):
if (status ≠ idle) and (status ≠ failed) {
    pnum-local++;
    op.type = read/write;
    op.phase = query;
    op.acc = ∅;
}
Read/Write Propagation Phase
[Figure: once enough updates have arrived, RW starts the propagation phase. R: propagate the value to be returned. W: update value and tag, then propagate. Read/write-ack is returned once at least one write-quorum from each active c updates you]
On enough updates:
if (status ≠ idle) and (status ≠ failed) {
    pnum-local++;
    op.type = read/write;
    op.phase = propagate;
    op.acc = ∅;
}
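The condition that ends a phase, namely at least one quorum from each active configuration, can be sketched as a small predicate. The function name `phase_done` and the dictionary layout of a configuration are assumptions made for the example.

```python
def phase_done(acc, active_configs, kind):
    """True once acc covers one quorum of the given kind in every active configuration.

    acc: set of node ids that have sent a recent message in this phase
    active_configs: list of dicts with "read" and "write" lists of quorums (sets)
    kind: "read" for the query phase, "write" for the propagation phase
    """
    return all(
        any(quorum <= acc for quorum in cfg[kind])
        for cfg in active_configs
    )


# Two active configurations: an old one and one being installed.
old_cfg = {"read": [{1, 2}, {2, 3}], "write": [{1, 2, 3}]}
new_cfg = {"read": [{4}, {5}], "write": [{4, 5}]}
active = [old_cfg, new_cfg]
```

With `acc = {1, 2, 4}` the query phase is complete, because `{1, 2}` is a read-quorum of the old configuration and `{4}` is one of the new. The propagation phase is not, since neither write-quorum is fully covered.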
Garbage-collection (GC)
» Old configuration identifiers are garbage-collected at each node i
˃ If node i hears that another node has already garbage-collected some configuration
˃ If node i itself initiates garbage-collection of the configurations with identifiers l < k, which it can do once it has learned about configuration k
» During GC, information is propagated from one configuration to the next
» GC proceeds concurrently with reads and writes on the same node
GC in Two Phases
» Query phase
˃ Communicate with both a read-quorum and a write-quorum of every active configuration l < k
˃ Accomplish two tasks:
+ Ensure that all nodes with “old” configurations learn about the existence of configuration k and learn that all configurations smaller than k are garbage-collected
+ Collect recent tag and value from old configurations. By the end of the Query Phase we have received a value as recent as any written prior to the GC beginning
» Propagation phase
˃ Propagate value and tag to a write-quorum of configuration k
GC termination
» No “fixed point” test
» A GC operation does not expand to include configurations learned after it started
» GC has a fixed amount of work
» GC can also terminate if it discovers that someone else has already garbage-collected all the configurations smaller than k
Proof Sketch
» ≤ ordering of tags between sequential GC operations
˃ ∩ between the R-quorum of CMAP[k] and W-quorum of CMAP[k+1]
» Ordering between sequential GC and R/W
˃ ≤ ordering of tags between the GC and READ operations
˃ < ordering of tags between the GC and WRITE operations
» Ordering between sequential R and W
˃ ≤ ordering between */R
˃ < ordering between */W
˃ Either there is a common configuration C
+ Tag conveyed through the quorum ∩ property
˃ … or the tag info is conveyed through the GC of some configuration in between
Sequence of configurations
» Recall, for node i, cmap_i(k) is the configuration number k
» Recon always emits a unique new configuration k to be stored at cmap_i(k)
» Any node i that is a member of its latest known configuration c = cmap_i(k) can propose a new configuration at any time
» Different proposals are reconciled by executing consensus among the members of c (e.g. using a version of Paxos)
Although consensus may be slow
» In fact, in some situations, it may not even terminate…
» Reconfiguration doesn’t delay reads and writes
˃ Provided at least one quorum set is alive for each active configuration
The reconfiguration service
» Built using a collection of global consensus services Cons(k,c)
˃ One for each k > 0 and for each c ∈ C
» Goal – to reach agreement among members of configuration c
» API for Cons(k,c) for fixed k and c
˃ Input
+ init(c’)_{k,c,i}, where c, c’ ∈ C and i ∈ members(c); at most one event per k, c, i
+ fail_i, where c ∈ C and i ∈ members(c)
˃ Output
+ decide(c’)_{k,c,i}, where c, c’ ∈ C and i ∈ members(c)
Recon Automata State
» status ∈ {idle, active, failed}, initially idle
» rec-status ∈ {idle} ∪ (active × N+), initially idle
˃ Whether a reconfiguration request has been submitted, and if so, with respect to which configuration index
» rec-cmap – which configurations are assigned to each index k
» cons-data ∈ (N+ → (C × C)), initially ⊥ everywhere
Recon: handling a request
[Figure: the user of node i issues Recon(c, c’). Recon: if active, then among all configuration indexes in rec-cmap find the next one; init(c’)_{k,c,i} is submitted to consensus]
Recon: handling a decision
[Figure: decide(c’’)_{k,c,i} arrives from consensus. Recon: update rec-cmap; the outputs are Recon-ack, new-config(c,k), and Report(c)]
CONCLUSIONS
» RAMBO – an algorithm for implementing a reconfigurable read/write shared memory in an asynchronous message-passing system
» RAMBO guarantees atomicity, regardless of network instability and timing asynchrony
» When the network is stable, RAMBO guarantees good performance under reasonable assumptions
» RAMBO can be viewed as a framework for refinements and optimizations
THANK YOU !