OPODIS 05

Reconfigurable Distributed Storage for Dynamic Networks

Gregory Chockler, Seth Gilbert, Vincent Gramoli, Peter M Musial, Alexander A Shvartsman

Goals

Reconfigurable Distributed Storage (RDS)
• Atomic consistency (read/write)
• Fault tolerance

…in Dynamic and Asynchronous Systems.

Distributed Storage

Data is replicated at several network locations.

[Figure: replicated storage. Labels: Write, Read, Operation policy]

…in Dynamic Networks

Distributed Storage in Dynamic Networks

[Figure: dynamic network. Labels: leaving nodes, joining nodes]

…requires a reconfiguration process…

…by achieving agreement.

Model

• Distributed
  – Connected set of processors
  – Each processor has a unique id i ∈ I
  – MWMR: any processor is a potential client

• Asynchronous
  – Asynchronous processors
  – Point-to-point asynchronous unreliable channels

• Dynamic
  – Processors join and leave the system
  – Processors may crash

What is a configuration?

• Configuration <members, read-quorums, write-quorums>
  – members is a set of processors,
  – read-quorums and write-quorums are two sets of quorums: for all RQ ∈ read-quorums and WQ ∈ write-quorums,
    • RQ ⊆ members
    • WQ ⊆ members
    • RQ ∩ WQ ≠ ∅ (only for a given configuration)

• Every client maintains a set of configurations, initially containing the default one.
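The configuration triple can be sketched as a small data structure. This is an illustrative Python sketch, not the authors' code: it assumes the usual quorum-system reading of the conditions above (every quorum is a subset of members, and every read quorum intersects every write quorum of the same configuration). The names `Configuration`, `is_valid` and `majority_configuration` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Iterable

Quorum = FrozenSet[int]  # a quorum is a set of processor ids


@dataclass(frozen=True)
class Configuration:
    """Hypothetical model of <members, read-quorums, write-quorums>."""
    members: FrozenSet[int]
    read_quorums: FrozenSet[Quorum]
    write_quorums: FrozenSet[Quorum]

    def is_valid(self) -> bool:
        # Every quorum must be made of members of this configuration.
        quorums = self.read_quorums | self.write_quorums
        if not all(q <= self.members for q in quorums):
            return False
        # Every read quorum must intersect every write quorum.
        return all(rq & wq
                   for rq in self.read_quorums
                   for wq in self.write_quorums)


def majority_configuration(members: Iterable[int]) -> Configuration:
    """Assumed helper: minimal majorities serve as read and write quorums."""
    ms = frozenset(members)
    size = len(ms) // 2 + 1
    majorities = frozenset(frozenset(c) for c in combinations(sorted(ms), size))
    return Configuration(ms, majorities, majorities)


# A client starts with a set holding only the default configuration.
default = majority_configuration({1, 2, 3})
configs = {default}
```

Because the dataclass is frozen and built from frozensets, configurations are hashable and can be stored directly in the per-client set.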

Single Object Operations Overview

After [ABD95]
• tag = <c,i> ∈ N × I; val is a possible value

• val = Read()i:
    (<c,j>,val) = query();
    [prop(<c,j>,val);]

• Write(val)i:
    (<c’,j>,val’) = query();
    prop(<c’+1,i>,val);

1. (tag,val) ← query(): gathers the (tag,val) pairs of all processors of an RQ and returns the pair with the largest tag.

2. prop(tag,val): updates the (tag,val) pairs at all processors of a WQ; returns nothing.

[Figure labels: Write tag, Read tag]
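The query/prop structure of reads and writes can be illustrated with a sequential, in-memory sketch. It is only an approximation under stated assumptions: there are no messages, no failures and no concurrency, tags are Python tuples `(c, i)` compared lexicographically, and the class `Replicas` and the functions `read` and `write` are hypothetical names.

```python
from typing import Any, Dict, Iterable, Tuple

Tag = Tuple[int, int]  # (counter c, processor id i), compared lexicographically


class Replicas:
    """Assumed in-memory stand-in for the processors holding (tag, val)."""

    def __init__(self, ids: Iterable[int]) -> None:
        self.state: Dict[int, Tuple[Tag, Any]] = {i: ((0, 0), None) for i in ids}

    def query(self, read_quorum: Iterable[int]) -> Tuple[Tag, Any]:
        # Gather (tag, val) from a read quorum and keep the largest tag.
        return max((self.state[i] for i in read_quorum), key=lambda p: p[0])

    def prop(self, write_quorum: Iterable[int], tag: Tag, val: Any) -> None:
        # Update every processor of a write quorum that holds an older tag.
        for i in write_quorum:
            if tag > self.state[i][0]:
                self.state[i] = (tag, val)


def read(replicas: Replicas, rq: Iterable[int], wq: Iterable[int]) -> Any:
    tag, val = replicas.query(rq)
    replicas.prop(wq, tag, val)  # write back the value that is returned
    return val


def write(replicas: Replicas, rq: Iterable[int], wq: Iterable[int],
          writer_id: int, val: Any) -> None:
    (c, _), _ = replicas.query(rq)
    replicas.prop(wq, (c + 1, writer_id), val)  # new, strictly larger tag


replicas = Replicas([1, 2, 3])
write(replicas, rq={1, 2}, wq={2, 3}, writer_id=1, val="a")
latest = read(replicas, rq={1, 2}, wq={1, 2})
```

The read sees the written value because the read quorum {1, 2} and the write quorum {2, 3} share processor 2.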

Reconfiguration Design Goals

• Sound
  – Totally ordered configurations

• Flexible
  – No dependencies between configurations

• Non-intrusive
  – Makes concurrent read/write operations possible

• Fast
  – Strengthening fault tolerance

Decoupling Reconfiguration

• Reconfiguration = Replacing Configurations
  – {I} Installing a new configuration
  – {R} Removing old configuration(s)

• If {R} ≺ {I}: operations are delayed

• If {I} ≺ {R}: a stronger configuration viability assumption is required

Solution

Neither ({R} ≺ {I}) nor ({I} ≺ {R}):

{I} // {R}

Tighter coupling between removal and installation: both proceed in parallel.

RDS Reconfiguration

• Reconfiguration is based on Paxos (a three-phase, leader-based consensus algorithm)
• l is the leader
• c is the current configuration
• configs is the set of active configurations
• A ballot has a unique identifier b and a value v, which is a configuration
• Paxos phases:
  – Prepare: l creates a new ballot and chooses/gets the value to propose.
  – Propose: l proposes <b,v> and gathers votes from a majority.
  – Propagate: l propagates the decision.

[Figure: message exchange for Recon(c,c’). Labels: l, RQ, WQ]

Prepare phase
• l creates a new, larger ballot b
• l sends <1a, b>
• Replies: <1b, b, configs, <b’’, c’’>>
• l updates its ballot’s value v with the one received
• l updates its configs set

Propose phase
• l sends <2a, b, c, v>
• Votes: <2b, b, c, v, tag, val>
• Receivers update their tag and val
• Receivers add v to their configs set

Propagation phase
• l sends <3a, c, v, tag, val>
• Receivers update their tag and val
• Receivers remove configuration c from their configs set
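The message sequence can be walked through with a single-process sketch. This is not the RDS algorithm itself: it assumes every message is delivered immediately, counts a simple majority of all nodes rather than quorums of the active configurations, and treats every node as a receiver of the 2b and 3a messages. `Node`, `recon`, `promised` and `voted` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Set, Tuple

Ballot = Tuple[int, int]  # (round, leader id), compared lexicographically
Tag = Tuple[int, int]     # (counter, processor id)


@dataclass
class Node:
    ident: int
    configs: Set[str]                           # active configurations
    tag: Tag = (0, 0)
    val: Any = None
    promised: Ballot = (0, 0)                   # largest ballot seen so far
    voted: Optional[Tuple[Ballot, str]] = None  # <b'', c''>, if any


def recon(leader: Node, nodes: List[Node], c: str, c_new: str) -> Optional[str]:
    """Replace configuration c; returns the decided configuration or None."""
    majority = len(nodes) // 2 + 1

    # Prepare: new, larger ballot b; send <1a, b>; collect <1b, ...> replies.
    b = (leader.promised[0] + 1, leader.ident)
    v = c_new
    replies_1b = []
    for n in nodes:
        if b > n.promised:
            n.promised = b
            replies_1b.append((set(n.configs), n.voted))
    if len(replies_1b) < majority:
        return None
    prior = [vote for _, vote in replies_1b if vote is not None]
    if prior:
        v = max(prior, key=lambda vote: vote[0])[1]  # adopt received value
    for cfgs, _ in replies_1b:
        leader.configs |= cfgs                       # update configs set

    # Propose: send <2a, b, c, v>; votes <2b, b, c, v, tag, val> carry state.
    replies_2b = []
    for n in nodes:
        if b >= n.promised:
            n.promised, n.voted = b, (b, v)
            replies_2b.append((n.tag, n.val))
    if len(replies_2b) < majority:
        return None
    tag, val = max(replies_2b, key=lambda p: p[0])
    for n in nodes:
        if tag > n.tag:
            n.tag, n.val = tag, val
        n.configs.add(v)

    # Propagate: <3a, c, v, tag, val>; refresh state and drop the old config.
    for n in nodes:
        if tag > n.tag:
            n.tag, n.val = tag, val
        n.configs.discard(c)
    return v


nodes = [Node(i, {"c0"}) for i in (1, 2, 3)]
nodes[2].tag, nodes[2].val = (4, 3), "x"
decided = recon(nodes[0], nodes, "c0", "c1")
```

In this run all three nodes end with configs equal to {"c1"} and the most recent (tag, val) pair, which shows how the 2b and 3a messages carry operation state along with the decision.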

Proving Atomicity

• Ordering configurations

  Theorem 1: The set of installed configurations in the system is totally ordered.

• Ordering operations

  Theorem 2: If operation 1 precedes operation 2, then 1’s tag is not larger than 2’s tag.

Additional Assumptions

• Eventual stabilization with
  – Unique leader l
  – Message delay bound d (unknown to the algorithm)
  – Gossip with frequency d
  – Restricted reconfiguration rate
  – Some quorums remain alive in active configurations

ts: system stabilization time
tl: algorithm stabilization time
Let tr be the request time.

[Timeline figure. Labels: ts, 2d, tl]

Reconfiguration Latency

Worst-case scenario: the last reconfiguration was done by a different leader.

[Timeline figure, starting at max(tl, tr) and ending at te]
• Prepare: 2d
• Propose: 2d
• Propagate: d

te: end time, reconfiguration is complete.

Total: 5d

Reconfiguration Latency

Other cases: the leader made the previous reconfiguration.

[Timeline figure, starting at max(tl, tr) and ending at te]
• Propose: 2d
• Propagate: d

te: end time, reconfiguration is complete.

Total: 3d
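The two reconfiguration bounds are sums of per-phase delays, which a few lines of Python make explicit. This is a hedged sketch: it assumes the phase costs read off the timelines (Prepare 2d, Propose 2d, Propagate d, with Prepare skipped when the leader also made the previous reconfiguration), and the function names are hypothetical.

```python
def reconfiguration_latency(d: float, same_leader: bool) -> float:
    """Upper bound on reconfiguration time after stabilization."""
    prepare = 0.0 if same_leader else 2 * d  # Prepare is skipped for a stable leader
    propose = 2 * d                          # one round trip to gather votes
    propagate = d                            # one-way propagation of the decision
    return prepare + propose + propagate


def end_time(t_l: float, t_r: float, d: float, same_leader: bool) -> float:
    """te: the reconfiguration starts at max(tl, tr)."""
    return max(t_l, t_r) + reconfiguration_latency(d, same_leader)


worst = reconfiguration_latency(1.0, same_leader=False)  # 5d with d = 1
other = reconfiguration_latency(1.0, same_leader=True)   # 3d with d = 1
```

Since d is unknown to the algorithm, these values are analysis bounds and not timeouts the implementation would use.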

Operation Latency

Phase latency:
• 2d is sufficient for the phase round trip.
• In some cases (pending reconfiguration), the phase might be delayed twice.

[Figure: 1st round trip (2d), new configuration discovered, 2nd round trip (2d)]

Operation latency:
• Operations are bounded by 8d.
• In some cases, the propagation phase of the read operation can be ignored, leading to a possible bound of 2d.
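One way to read these bounds, assuming each operation consists of a query phase followed by a propagation phase, and that a phase which discovers a new configuration needs a second round trip. The symbols T_phase, T_op and T_read are introduced here for illustration only.

```latex
\begin{align*}
T_{\mathrm{phase}} &\le \underbrace{2d}_{\text{1st round trip}} + \underbrace{2d}_{\text{2nd round trip}} = 4d \\
T_{\mathrm{op}} &\le 2 \cdot T_{\mathrm{phase}} \le 8d \\
T_{\mathrm{read}} &\le 2d \quad \text{(query phase only, single round trip)}
\end{align*}
```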

Experimental Results

• IOA specification translated to Java code following a set of rules.

• Implementation of the Attiya, Bar-Noy, and Dolev algorithm (ABD, without reconfiguration) and of RDS, which shares parts of the ABD code.

• Using majority-based configurations.

• Measuring operation latency
  1. While varying configuration size
  2. While varying algorithm instances

• Operation latency of RDS is competitive with ABD, confirming the theory.

• Reconfiguration messages contain operation information which might accelerate operations in RDS.

Conclusion

• RDS, Reconfigurable Distributed Storage.
• With sound, flexible, non-intrusive and fast reconfiguration.
• It solves two problems in one: configuration replacement and consensus.
• Reconfiguration is inexpensive (time).
• Fault tolerance is strengthened.
• RAMBO can become more aggressive: it is exactly what we did here!