Time Bounds for Shared Objects in Partially Synchronous Systems Jennifer L. Welch Dagstuhl Seminar on Consistency in Distributed Systems Feb 2013

Time Bounds for Shared Objects in Partially Synchronous Systems

Jennifer L. Welch

Dagstuhl Seminar on Consistency in Distributed Systems, Feb 2013

Acknowledgment

Joint work with Jiaqi Wang, Hyunyoung Lee, and Edward Talmage

Jiaqi Wang’s M.S. thesis, CSE, TAMU, 2011

PODC 2011 brief announcement


Model

Fixed set of n nodes
Nodes communicate through reliable message passing
  delay in range [d−u, d]
Nodes have approximately synchronized clocks with skew ε
  ε ≥ (1−1/n)u [Lundelius and Lynch 1984]
  no clock drift
No node failures

Problem

Each node runs an application process
Application processes communicate through (logically) shared variables of arbitrary data types
How to implement the shared variables that the application processes use?
Desired consistency condition is linearizability
Focus on elapsed time of implemented operations

Related Work: Lower Bounds

Lipton and Sandberg 1988
  |read| + |write| ≥ d (for sequential consistency)
Attiya and Welch 1991, 1994
  |read| ≥ u/4
  |write/enq/push| ≥ u/2
Mavronicolas and Roth 1991, 1992, 1999
  |read/write| ≥ min{ε/2, u/2}
  |read| + |write| ≥ d + min{ε/2, u/2}

Related Work: Lower Bounds

Kosa 1994, 1999
  Generalizes arguments in Attiya & Welch to arbitrary data types
  Inspired by classification of operations by Weihl 1988 based on commutativity
  for op that “does not commute w/ itself”: |op| ≥ d
    implies |deq/pop| ≥ d
  for op1 and op2 that “immediately do not commute”: |op1| + |op2| ≥ d
    implies |read/deq| + |write/enq| ≥ d
  for op that is a “pure mutator”: |op| ≥ u/2
    implies |write/enq/push| ≥ u/2
  for op that is an “accessor”: |op| ≥ u/2
    implies |read/peek| ≥ u/2

Related Work: Upper Bounds for Read-Write Registers

Mavronicolas and Roth 1991, 1992, 1999:
  |read| ≤ βd + 3u + min{ε,u} + γ
  |write| ≤ (1−β)d + 3u
  β is a tradeoff parameter in [0, 1−u/d)
  γ is a small constant
Chaudhuri, Gawlick and Lynch 1993:
  |read| ≤ u + c
  |write| ≤ d + u − c
  c is a tradeoff parameter in [0, d]

Related Work: General Upper Bounds

Folklore algorithm #1: centralized (single copy):
  send operation invocation to node with the copy
  node with copy serializes invocations and updates the copy
  node with copy sends response to invoker
Each operation takes 2d time

Related Work: General Upper Bounds

Folklore algorithm #2: use atomic broadcast (full replication):
  broadcast invocation
  upon receipt, do the operation
  invoker waits for broadcast time and provides response
Each operation takes h time, where h is broadcast time: h = 2d

Overview of Our Results

Lower bound #1: (1 − 1/n)u
  for operations which can be executed in any order but result in different states for different orders
  includes write, push and enq
  improves previously known bound of u/2
  uses classic shifting technique

Overview of Results

Lower bound #2: d + min{ε, u, d/3}
  for operations that “immediately” do not commute with themselves (invalidate each other)
  includes RMW, pop, deq
  improves previous lower bound of d
  uses a new shifting technique, which provides a larger bound by shifting by a larger amount, then manipulating the new execution to fix message delays that are too big or too small

Overview of Results

New generic algorithm for any data type
Partitions operations into
  pure accessors (don’t change state)
  pure mutators (don’t observe state)
  other
Upper bounds are, for any X in [0, d+ε−u],
  d + ε − X for pure accessor
  ε + X for pure mutator
  d + ε for other
Improves on folklore algorithms (2d time per op)
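The tradeoff controlled by X can be made concrete with a small calculation. The following is a minimal sketch, not part of the talk: the function name `upper_bounds` and the sample parameter values are assumptions chosen for illustration, and only the three formulas come from the slide.

```python
def upper_bounds(d, u, eps, X):
    """Return the per-class upper bounds stated on the slide.

    d   : maximum message delay
    u   : delay uncertainty (delays lie in [d - u, d])
    eps : clock skew
    X   : tradeoff parameter, required to lie in [0, d + eps - u]
    """
    if not 0 <= X <= d + eps - u:
        raise ValueError("X must lie in [0, d + eps - u]")
    return {
        "pure accessor": d + eps - X,
        "pure mutator": eps + X,
        "other": d + eps,
    }


# Hypothetical parameters: d = 10, u = 4, eps = 3 (so X ranges over [0, 9]).
for X in (0, 9):
    print(X, upper_bounds(10, 4, 3, X), "folklore:", 2 * 10)
```

With these sample values, X = 0 favors pure mutators (time ε) and X = d+ε−u favors pure accessors (time u); in both cases every bound stays below the folklore 2d.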


Bounds for Read-Modify-Write Register

operation            lower bound               upper bound
read-modify-write    d + min{ε,u,d/3}          d + ε (all X)
read                 u/2 [Kosa]                u (X = d+ε−u)
write                (1−1/n)u                  ε (X = 0)
read + write         d [Lipton & Sandberg]     d + 2ε (all X)

Recall ε can be as small as (1−1/n)u

Bounds for Queue

operation            lower bound               upper bound
enq                  (1−1/n)u                  ε (X = 0)
deq                  d + min{ε,u,d/3}          d + ε (all X)
peek                 u/2 [Kosa]                u (X = d+ε−u)
peek + enq           d + min{ε,u,d/3}          d + 2ε (all X)

Recall ε can be as small as (1−1/n)u

Bounds for Stack

operation            lower bound               upper bound
push                 (1−1/n)u                  ε (X = 0)
pop                  d + min{ε,u,d/3}          d + ε (all X)
peek                 u/2 [Kosa]                u (X = d+ε−u)
peek + push          d + min{ε,u,d/3}          d + 2ε (all X)

Recall ε can be as small as (1−1/n)u

Terminology

operation: operation w/o arg and return value. Ex: read
operation instance: operation w/ arg and return value. Ex: read(-,3)
legal op sequence: one of the sequences in the sequential spec of the data type. Ex: for register, every read returns value of latest preceding write
equivalent sequences of ops, ρ1 and ρ2: for all op sequences ρ3, ρ1.ρ3 is legal iff ρ2.ρ3 is legal
OP is a mutator: there exist op sequence ρ and op instance op in OP s.t. ρ.op and ρ are not equivalent
OP is an accessor: there exist legal op sequence ρ and op instance op in OP s.t. ρ.op is illegal
Pure mutator: mutator but not accessor
Pure accessor: accessor but not mutator
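These definitions can be checked by brute force on a tiny data type. The sketch below is illustrative only and rests on several assumptions that are not in the talk: a register over the values {0, 1} with initial value 0, quantifiers bounded to sequences of length at most 2, and a reading of the mutator definition in which ρ.op must itself be legal (otherwise any illegal read would trivially count). All function names are hypothetical.

```python
from itertools import product

VALUES = (0, 1)   # assumed tiny value domain
INITIAL = 0       # assumed initial register value

# Operation instances as (name, argument, response).
INSTANCES = [("read", None, v) for v in VALUES] + [("write", v, None) for v in VALUES]


def is_legal(seq):
    """Sequential spec of a register: each read returns the latest written value."""
    state = INITIAL
    for name, arg, resp in seq:
        if name == "read":
            if resp != state:
                return False
        else:
            state = arg
    return True


def sequences(max_len):
    """All instance sequences (as tuples) of length 0..max_len."""
    for k in range(max_len + 1):
        yield from product(INSTANCES, repeat=k)


def equivalent(r1, r2, max_len=2):
    """Bounded check: r1.r3 is legal iff r2.r3 is legal, for all short r3."""
    return all(is_legal(r1 + r3) == is_legal(r2 + r3) for r3 in sequences(max_len))


def is_mutator(name, max_len=2):
    # Assumption: only legal rho.op are considered.
    return any(
        is_legal(rho + (op,)) and not equivalent(rho + (op,), rho)
        for rho in sequences(max_len) for op in INSTANCES if op[0] == name
    )


def is_accessor(name, max_len=2):
    return any(
        is_legal(rho) and not is_legal(rho + (op,))
        for rho in sequences(max_len) for op in INSTANCES if op[0] == name
    )


def classify(name):
    mut, acc = is_mutator(name), is_accessor(name)
    if mut and acc:
        return "other"
    if mut:
        return "pure mutator"
    return "pure accessor" if acc else "neither"


print(classify("read"), "/", classify("write"))
```

Under these assumptions the check classifies read as a pure accessor and write as a pure mutator, which matches how the talk treats the two register operations.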

Lower Bound #1 (write, push, enq, etc.)

If
  for all operation sequences ρ and all instances op1 and op2 of OP, ρ.op1 and ρ.op2 legal => ρ.op1.op2 and ρ.op2.op1 are both legal, and
  there exists operation sequence ρ and instances op1, op2, ..., opn of OP s.t.
    ρ.opi is legal, i = 1, ..., n, and
    for all permutations π1 and π2 of op1, ..., opn, last(π1) ≠ last(π2) => ρ.π1 and ρ.π2 are not equivalent
then |OP| ≥ (1 − 1/n)u.
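The second hypothesis can be tested mechanically for small n. The sketch below is an illustration under stated assumptions: ρ is taken to be the empty sequence, the instances carry the distinct arguments 1..n, and two sequences are treated as equivalent exactly when they leave the object in the same state (reasonable for registers and stacks, whose state is fully observable, but an assumption nonetheless). The `add` operation is a hypothetical counterexample that is not mentioned in the talk.

```python
from itertools import permutations


def push(state, v):
    """Stack push: the state is the tuple of pushed values."""
    return state + (v,)


def write(state, v):
    """Register write: the state is just the last value written."""
    return (v,)


def add(state, v):
    """Hypothetical commutative operation: add v to a counter."""
    return (sum(state) + v,)


def final_state(apply, seq, state=()):
    for v in seq:
        state = apply(state, v)
    return state


def distinct_last_gives_distinct_state(apply, n):
    """Check: permutations with different last elements reach different states."""
    perms = list(permutations(range(1, n + 1)))
    return all(
        final_state(apply, p1) != final_state(apply, p2)
        for p1 in perms for p2 in perms
        if p1[-1] != p2[-1]
    )


for name, apply in (("push", push), ("write", write), ("add", add)):
    print(name, distinct_last_gives_distinct_state(apply, 4))
```

Push and write pass the check, while the commutative `add` fails it, so the (1 − 1/n)u bound would not follow for `add` from this theorem. The first hypothesis holds trivially for all three, since each is legal after any sequence.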


Classic Shifting Proof Idea

Assume in contradiction there is an implementation with |OP| < (1 − 1/n)u
Specify a carefully designed reference execution
  Specify which operations are invoked when, message delays, and clock skews
Shift the real times when events occur in reference execution to get a new execution that still should be correct, but because of the shifting, the semantics of OP are violated
  Carefully design shift amounts to keep msg delays and clock skews within bounds
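The bookkeeping behind a shift can be sketched in a few lines. This is the standard shifting calculation rather than anything specific to the talk: if the events of node i are moved x[i] later in real time while every local clock reading is preserved, a message from i to j with delay δ gets delay δ − x[i] + x[j], and the clock offset of node i changes by −x[i]. The function names, the dictionary layout, and the sample numbers are assumptions for illustration.

```python
def shift(delays, offsets, x):
    """Shift each node i by x[i] (events occur x[i] later in real time).

    delays[(i, j)] : delay of messages sent from i to j
    offsets[i]     : clock offset of node i (clock time minus real time)
    """
    new_delays = {(i, j): delta - x[i] + x[j] for (i, j), delta in delays.items()}
    new_offsets = {i: c - x[i] for i, c in offsets.items()}
    return new_delays, new_offsets


def admissible(delays, offsets, d, u, eps):
    """True if all delays lie in [d - u, d] and the clock skew is at most eps."""
    delays_ok = all(d - u <= delta <= d for delta in delays.values())
    skew = max(offsets.values()) - min(offsets.values())
    return delays_ok and skew <= eps


# Hypothetical two-node reference execution with d = 10, u = 4, eps = 2.
delays = {(0, 1): 8, (1, 0): 8}
offsets = {0: 0, 1: 0}
print(admissible(*shift(delays, offsets, {0: 0, 1: 2}), d=10, u=4, eps=2))
print(admissible(*shift(delays, offsets, {0: 0, 1: 3}), d=10, u=4, eps=2))
```

In this example, shifting node 1 by 2 keeps the execution admissible, while shifting it by 3 pushes one delay above d. That is the kind of violation the classic technique must avoid, and which the new technique described later repairs after the fact.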

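Classic Shifting Picture

[Figure: two sets of timelines for p1 to p4, each showing ρ, the concurrent operations op1 to op4, and the observing ops. The first is the reference execution, with the operation that is linearized last marked. The second is the execution after shifting p3, in which a different operation is linearized last; it is marked “Wrong!”.]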

Shifting Proof Idea: Some Details

Reference execution:
  Execute ρ sequentially (from 2nd condition)
  Have n procs concurrently invoke op1, ..., opn
    Argue that the responses of the concurrent operations are the same as for the opi’s
  Execute a sequence of operations that “observe” the result of the concurrent operations
  Specify the message delays carefully
Identify the last operation of the permutation into which the opi’s are linearized
Shift carefully so that this last operation finishes before the first one starts => permutation in which the operations are linearized in shifted execution has different last operation
Since different last operations produce non-equivalent states, “observer” sequence is incorrect, contradiction

Lower Bound #2 (rmw, pop, deq, etc.)

If there exist operation sequence ρ and instances op1 and op2 of OP s.t.
  ρ.op1 and ρ.op2 are both legal and
  ρ.op1.op2 and ρ.op2.op1 are both illegal
then |OP| ≥ d + min{ε, u, d/3}.
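The hypothesis is easy to check for deq on a FIFO queue. The sketch below is illustrative: the instance encoding, the use of `None` for a deq on an empty queue, and the choice of op1 and op2 as the same instance deq returning "a" are all assumptions made for the example, and the function names are hypothetical.

```python
def is_legal(seq):
    """Sequential spec of a FIFO queue.

    Instances are ("enq", value) or ("deq", returned_value); a deq on an
    empty queue is assumed to return None.
    """
    queue = []
    for name, value in seq:
        if name == "enq":
            queue.append(value)
        else:
            expected = queue.pop(0) if queue else None
            if value != expected:
                return False
    return True


def immediately_self_noncommuting(rho, op1, op2):
    """rho.op1 and rho.op2 are legal, but neither order of both is."""
    return (
        is_legal(rho + [op1]) and is_legal(rho + [op2])
        and not is_legal(rho + [op1, op2])
        and not is_legal(rho + [op2, op1])
    )


def lower_bound_2(d, u, eps):
    return d + min(eps, u, d / 3)


rho = [("enq", "a")]
print(immediately_self_noncommuting(rho, ("deq", "a"), ("deq", "a")))  # deq qualifies
print(immediately_self_noncommuting(rho, ("enq", "x"), ("enq", "y")))  # enq does not
print(lower_bound_2(d=9, u=6, eps=4))
```

After a single enq, two deq instances that each return the enqueued element are individually legal but cannot both occur, which is the "invalidate each other" property the bound relies on. Two enq instances, by contrast, are legal in either order.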


Proof Idea

New shifting method:
  Shift reference execution by a (larger) amount so that there is one pair of nodes with too large message delay
  Chop the shifted execution as late as possible before first violation of message delay bound
    Different nodes are chopped at different, carefully chosen, points that form a consistent cut
  Extend prefix of shifted execution from the cut to have correct message delays

Proof Idea

[Figure: three pairs of timelines for p1 and p2.]

Reference execution: p1 performs op1 = op(arg1,resp1) and p2 performs op2 = op(arg2,resp2); op1 starts at t, op2 starts at t+m, m = min{ε,u,d/3}

Shift p2 by −m: p1 performs op(arg1,resp1’) and p2 performs op(arg2,resp2’). Shift amount of m is too large for classic shift; use new shift and operation properties to prove that resp1’ = resp1 and resp2’ = resp2. Thus operations are still op1 and op2.

Shift p1 by m: p1 performs op(arg1,resp1’’) and p2 performs op(arg2,resp2’’). Shift amount of m is too large for classic shift; use new shift and operation properties to prove that resp1’’ = resp1 and resp2’’ = resp2. Thus operations are still op1 and op2. Contradiction

Algorithm Intuition for Mutators

Mutators must be executed in same order at every node
On invocation, broadcast to all nodes w/ timestamp
  If pure mutator, wait ε+X and return to user
  Wait d−u to simulate minimum message delay to self
When broadcast is received, add to pending set
Wait long enough (u+ε) to ensure that no operation with smaller timestamp can be received, and then execute locally all pending ops with smaller or equal timestamp
  If not pure mutator, then return to user

Algorithm Intuition for Pure Accessors

Pure accessors only need to execute locally so no need to exchange messages

This allows squeezing the timing, since we only have to make sure no remote invocations with smaller timestamps will arrive after the pure accessor executes and returns

Give pure accessor a special timestamp X in the past

Wait d+ε−X time, then execute locally all pending ops with smaller timestamp, execute locally the pure accessor, and return to user


Algorithm

when a pure accessor aop(arg) is invoked at node i at clock time T:
  set timer to respond to (aop,arg,(T−X,i)) for d+ε−X in the future

when timer to respond to (aop,arg,ts) expires:
  execute all ops in pending set with timestamp < ts, in timestamp order, and cancel associated execute timers
  execute aop
  respond to user

when a non-pure-accessor op(arg) is invoked at node i at clock time T:
  if op is a pure mutator then set timer to respond to (op,arg,(T,i)) for ε+X in the future
  set timer to add (op,arg,(T,i)) to pending set for d−u in the future
  send (op,arg,(T,i)) msg to all other nodes

when timer to respond to pure mutator (mop,arg,ts) expires:
  respond to user

when timer to add (op,arg,ts) to pending set expires or (op,arg,ts) msg is received:
  add (op,arg,ts) to pending set
  set timer to execute (op,arg,ts) for u+ε in the future

when timer to execute (op,arg,ts) expires:
  execute all ops in pending set with timestamp ≤ ts, in timestamp order, and cancel associated execute timers
  if i is the invoker of (op,arg,ts) then respond to user
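The handlers above can be exercised with a small discrete-event simulation. The following is a sketch under simplifying assumptions that are not in the talk: all clocks are perfectly synchronized (clock time equals real time, which is within any skew bound), every message takes the same delay, the object is an integer register with read (pure accessor), write (pure mutator), and a fetch-and-add style rmw (other), and the invoker of an rmw responds at the moment it executes the operation locally. The name `simulate` and the event encoding are hypothetical.

```python
import heapq
from itertools import count


def simulate(invocations, n, d, u, eps, X, delay=None):
    """invocations: list of (time, node, op, arg), op in {"read", "write", "rmw"}.

    Returns (responses, states): responses maps the invocation index to
    (response_time, returned_value); states holds each node's final copy.
    """
    delay = d if delay is None else delay   # assumed: one fixed delay for all messages
    assert d - u <= delay <= d and 0 <= X <= d + eps - u
    events, tick = [], count()

    def schedule(time, node, kind, item):
        heapq.heappush(events, (time, next(tick), node, kind, item))

    for idx, (t, node, op, arg) in enumerate(invocations):
        schedule(t, node, "invoke", (op, arg, idx))
    state = [0] * n                          # one register copy per node
    pending = [dict() for _ in range(n)]     # per node: timestamp -> (op, arg, idx)
    responses = {}

    def execute_upto(node, now, ts, inclusive):
        for key in sorted(pending[node]):
            if key < ts or (inclusive and key == ts):
                op, arg, idx = pending[node].pop(key)  # removal cancels its execute timer
                old = state[node]
                state[node] = arg if op == "write" else old + arg
                if op == "rmw" and key[1] == node:     # invoker responds on execution
                    responses[idx] = (now, old)

    while events:
        now, _, node, kind, item = heapq.heappop(events)
        if kind == "invoke":
            op, arg, idx = item
            if op == "read":                 # pure accessor: timestamp X in the past
                schedule(now + d + eps - X, node, "respond_read", ((now - X, node), idx))
                continue
            ts = (now, node)
            if op == "write":                # pure mutator: respond after eps + X
                schedule(now + eps + X, node, "respond_write", idx)
            schedule(now + d - u, node, "add", (ts, op, arg, idx))
            for other in range(n):
                if other != node:
                    schedule(now + delay, other, "add", (ts, op, arg, idx))
        elif kind == "add":
            ts, op, arg, idx = item
            pending[node][ts] = (op, arg, idx)
            schedule(now + u + eps, node, "execute", ts)
        elif kind == "execute":
            if item in pending[node]:        # otherwise the timer was cancelled
                execute_upto(node, now, item, inclusive=True)
        elif kind == "respond_read":
            ts, idx = item
            execute_upto(node, now, ts, inclusive=False)
            responses[idx] = (now, state[node])
        else:                                # "respond_write"
            responses[idx] = (now, None)
    return responses, state
```

For example, with n = 3, d = 10, u = 4, ε = 3, X = 2 and three well-separated invocations (a write of 5 at time 0, an rmw adding 1 at time 100, a read at time 200), the simulated latencies are ε+X = 5, d+ε = 13, and d+ε−X = 11, and all three copies end with the value 6.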

Algorithm Example: Operations in Isolation

[Figure: real-time timelines for p0, p1, and p2, with time marks t, t+ε+X, t+d−u, t+d+ε−X, and t+d+ε. A read is invoked, executed, and returned at its invoker. A write is invoked, responded to, added, and executed at its invoker, and added and executed at the other two nodes. An RMW is invoked, added, executed, and responded to at its invoker, and added and executed at the other two nodes.]

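Algorithm Example: Operations Interacting (T2 < T1)

[Figure: real-time timelines for p0, p1, and p2, with time marks t, t+ε+X, t+d−u, t+d+ε−X, and t+d+ε, and two timestamps T1 and T2 with T2 < T1. As in the isolation example, the read is invoked, executed, and returned at its invoker; the write is invoked, responded to, added, and executed at its invoker, and added and executed at the other two nodes; the RMW is invoked, added, executed, and responded to at its invoker, and added and executed at the other two nodes.]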

Algorithm Analysis

Linearizability shown in a standard way (provide an ordering of the operations and show it satisfies the properties)
  Mutators are linearized by timestamps
  Accessors fit in between to reflect what they saw
Time bounds:
  pure accessor: timer ensures d+ε−X
  pure mutator: timer ensures ε+X
  other: two timers ensure (d−u)+(u+ε) = d+ε
X is a parameter to trade off the time of pure accessors and pure mutators (as in [Mavronicolas and Roth 1999] for registers)
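As a quick consistency check of the combined entries in the bounds tables, the accessor and mutator timers sum to a quantity that does not depend on X, and the two extreme choices of X give the single-operation entries. A short derivation, using only the symbols defined in the talk:

```latex
\begin{align*}
|\text{pure accessor}| + |\text{pure mutator}|
  &\le (d + \varepsilon - X) + (\varepsilon + X) = d + 2\varepsilon, \\
X = 0 \;&\Rightarrow\; |\text{pure mutator}| \le \varepsilon, \\
X = d + \varepsilon - u \;&\Rightarrow\;
  |\text{pure accessor}| \le d + \varepsilon - (d + \varepsilon - u) = u .
\end{align*}
```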


Conclusion

Summary:
  Showed improved lower bounds on elapsed time of operations for linearizable implementations of arbitrary data types in partially synchronous systems
  Presented generic algorithm for the problem
  Tight and almost tight bounds in many cases for some common data types
Open problems:
  Tighten gaps
  Consider clock drift, failures, churn, …
  Other consistency conditions?