CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS Fall 2011 Prof. Jennifer Welch CSCE 668 Set 15: Broadcast 1

Page 1: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Fall 2011, Prof. Jennifer Welch

CSCE 668 Set 15: Broadcast

Page 2: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Broadcast Specifications

Recall the specification of a broadcast service given in the last set of slides:

Inputs: bc-sendi(m)
  - an input to the broadcast service
  - pi wants to use the broadcast service to send m to all the procs

Outputs: bc-recvi(m,j)
  - an output of the broadcast service
  - the broadcast service is delivering msg m, sent by pj, to pi

Page 3: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Broadcast Specifications

A sequence of inputs and outputs (bc-sends and bc-recvs) is allowable iff there exists a mapping from each bc-recvi(m,j) event to an earlier bc-sendj(m) event s.t.
  - the mapping is well-defined: every msg bc-recv'ed was previously bc-sent (Integrity)
  - the mapping, restricted to bc-recvi events for each i, is one-to-one: no msg is bc-recv'ed more than once at any single proc. (No Duplicates)
  - the mapping, restricted to bc-recvi events for each i, is onto: every msg bc-sent is received at every proc. (Liveness)
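The three properties can be checked mechanically on a finite trace. The following is a minimal sketch, not part of the original slides: it assumes each message is unique per sender, treats the finite trace as a complete execution, and uses a hypothetical event encoding (`("send", i, m)` and `("recv", i, m, j)`).

```python
from collections import Counter

def check_basic_broadcast(events, n):
    """Return the set of violated properties for a finite trace.

    events: list of ("send", i, m) or ("recv", i, m, j) tuples in
    execution order (hypothetical encoding); n: number of procs.
    """
    violated = set()
    sent = set()           # (j, m) pairs that have been bc-sent so far
    delivered = Counter()  # (i, j, m) -> number of bc-recv events at p_i
    for ev in events:
        if ev[0] == "send":
            _, i, m = ev
            sent.add((i, m))
        else:
            _, i, m, j = ev
            if (j, m) not in sent:
                violated.add("Integrity")      # no earlier bc-send
            delivered[(i, j, m)] += 1
            if delivered[(i, j, m)] > 1:
                violated.add("No Duplicates")  # received twice at p_i
    for (j, m) in sent:
        for i in range(n):
            if delivered[(i, j, m)] == 0:
                violated.add("Liveness")       # some proc never got m
    return violated
```

An empty result means the trace is allowable under the assumptions above.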

Page 4: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Ordering Properties

Sometimes we might want a broadcast service that also provides some kind of guarantee on the order in which messages are delivered.

We can add additional constraints on the mapping:
  - single-source FIFO, or
  - totally ordered, or
  - causally ordered

Page 5: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Single-Source FIFO Ordering

For all messages m1 and m2 and all pi and pj, if pi sends m1 before it sends m2, and if pj receives m1 and m2, then pj receives m1 before it receives m2.

Phrased carefully to avoid requiring that both messages are received; that is the responsibility of a liveness property.
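As an illustration, the single-source FIFO condition can be tested on recorded send and receive orders. This is a sketch under stated assumptions: messages are unique per sender, Integrity already holds, and the dictionary-based encoding (`send_order`, `recv_order`) is hypothetical.

```python
def is_single_source_fifo(send_order, recv_order):
    """send_order[j]: msgs in the order p_j bc-sent them.
    recv_order[i]: (m, j) pairs in the order p_i bc-recv'ed them."""
    for received in recv_order.values():
        for j, sent in send_order.items():
            rank = {m: k for k, m in enumerate(sent)}
            # Positions (in p_j's send order) of the msgs from p_j,
            # listed in the order this proc received them.
            ranks = [rank[m] for (m, src) in received if src == j]
            if ranks != sorted(ranks):
                return False
    return True
```

Note that a proc that receives only m2 does not violate the property, matching the careful phrasing above.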

Page 6: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Totally Ordered

For all messages m1 and m2 and all pi and pj, if both pi and pj receive both messages, then they receive them in the same order.

Phrased carefully to avoid requiring that both messages are received by both procs; that is the responsibility of a liveness property.
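A similar check works for total ordering: for every pair of procs, the messages they both received must appear in the same relative order. The sketch below assumes globally unique message ids and a hypothetical `recv_order` encoding.

```python
from itertools import combinations

def is_totally_ordered(recv_order):
    """recv_order[i]: message ids in the order p_i received them."""
    pos = {i: {m: k for k, m in enumerate(seq)}
           for i, seq in recv_order.items()}
    for i, j in combinations(pos, 2):
        # Msgs received by both procs, in p_i's order ...
        common = [m for m in recv_order[i] if m in pos[j]]
        # ... must appear in increasing position at p_j as well.
        ranks = [pos[j][m] for m in common]
        if ranks != sorted(ranks):
            return False
    return True
```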

Page 7: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Happens Before for Broadcast Messages

Earlier we defined "happens before" relation for events.

Now extend this definition to broadcast messages.

Assume all communication is through broadcast sends and receives.

Msg m1 happens before msg m2 if
  - some bc-recv event for m1 happens before (in the old sense) the bc-send event for m2, or
  - m1 and m2 are bc-sent by the same proc. and m1 is bc-sent before m2 is bc-sent.

Page 8: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Example of Happens Before for Broadcast Messages

[Figure: space-time diagram with broadcast messages m1, m2, m3, and m4]

m1 happens before m3 and m4

m2 happens before m4

m3 happens before m4

Page 9: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Causally Ordered

For all messages m1 and m2 and all pi, if m1 happens before m2, and if pi receives both m1 and m2, then pi receives m1 before it receives m2.

Phrased carefully to avoid requiring that both messages are received; that is the responsibility of a liveness property.

Page 10: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Example

[Figure: example execution with two broadcast messages, a and b]

single-source FIFO?

totally ordered?

causally ordered?

Page 11: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Example

[Figure: example execution with two broadcast messages, a and b]

single-source FIFO?

totally ordered?

causally ordered?

Page 12: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Example

[Figure: example execution with two broadcast messages, a and b]

single-source FIFO?

totally ordered?

causally ordered?

Page 13: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Algorithm BB to Simulate Basic Broadcast on Top of Point-to-Point

When bc-sendi(m) occurs:
  - pi sends a separate copy of m to every processor (including itself) using the underlying point-to-point message passing communication system

When can pi perform bc-recvi(m)?
  - when it receives m from the underlying point-to-point message passing communication system
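Algorithm BB can be sketched as a small simulation. This is an illustration under assumptions, not the slides' own code: the class `PointToPointNetwork` is a hypothetical stand-in for the asynchronous point-to-point system, and it delivers in-flight messages in an arbitrary (seeded random) order.

```python
import random

class PointToPointNetwork:
    """Hypothetical asynchronous pt-to-pt system: every msg sent is
    eventually delivered exactly once, in arbitrary order."""
    def __init__(self, n, seed=0):
        self.n = n
        self.in_flight = []            # (dest, src, payload) triples
        self.rng = random.Random(seed)

    def send(self, src, dest, payload):
        self.in_flight.append((dest, src, payload))

    def deliver_one(self):
        k = self.rng.randrange(len(self.in_flight))
        return self.in_flight.pop(k)

class BB:
    """Algorithm BB at proc. p_i."""
    def __init__(self, i, net):
        self.i, self.net = i, net
        self.delivered = []            # outputs bc-recv_i(m, j)

    def bc_send(self, m):
        # One separate copy to every proc., including p_i itself.
        for dest in range(self.net.n):
            self.net.send(self.i, dest, m)

    def on_recv(self, m, j):
        # bc-recv as soon as the pt-to-pt layer hands m over.
        self.delivered.append((m, j))

def run(n, sends, seed=0):
    """sends: list of (i, m) meaning p_i bc-sends m."""
    net = PointToPointNetwork(n, seed)
    procs = [BB(i, net) for i in range(n)]
    for i, m in sends:
        procs[i].bc_send(m)
    while net.in_flight:
        dest, src, m = net.deliver_one()
        procs[dest].on_recv(m, src)
    return [p.delivered for p in procs]
```

Every proc ends up with the same set of messages, but the order may differ between procs, which is why the ordering algorithms on the later slides are needed.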

Page 14: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Basic Broadcast Simulation

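[Figure: Alg BB consists of per-proc. modules BB0 … BBn-1. Each module BBi accepts bc-sendi and produces bc-recvi at its upper interface (basic broadcast), and uses sendi and recvi on the underlying asynch pt-to-pt message passing system.]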

Page 15: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Correctness of Basic Broadcast Algorithm

Assume the underlying point-to-point message passing system is correct (i.e., conforms to the spec given in previous set of slides).

Check that the simulated broadcast service satisfies:
  - Integrity
  - No Duplicates
  - Liveness

Page 16: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Single-Source FIFO Algorithm

Assume the underlying communication system is basic broadcast.

When ssf-bc-sendi(m) occurs:
  - pi uses the underlying basic broadcast service to bcast m together with a sequence number
  - pi increments its sequence number by 1 each time it initiates a bcast

When can pi perform ssf-bc-recvi(m)?
  - when pi has bc-recv'ed m with sequence number T and has ssf-bc-recv'ed messages from pj (the ssf-bc-sender of m) with all smaller sequence numbers
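The sequence-number rule can be sketched as a per-proc. layer that buffers out-of-order messages. This is a minimal illustration: the class name `SSFBroadcast`, the `bc_send` callback standing in for the basic broadcast service, and the choice to start sequence numbers at 0 are all assumptions.

```python
class SSFBroadcast:
    """Single-source FIFO layer at proc. p_i (sketch)."""
    def __init__(self, i, n, bc_send):
        self.i = i
        self.bc_send = bc_send     # assumed basic-broadcast primitive
        self.next_seq = 0          # seq number of p_i's next bcast
        self.expected = [0] * n    # expected[j]: next seq to deliver from p_j
        self.buffer = {}           # (j, seq) -> m, received but held back
        self.delivered = []        # outputs ssf-bc-recv_i(m, j)

    def ssf_bc_send(self, m):
        self.bc_send((m, self.next_seq))
        self.next_seq += 1

    def on_bc_recv(self, tagged, j):
        m, seq = tagged
        self.buffer[(j, seq)] = m
        # Deliver from p_j only once all smaller seq numbers are delivered.
        while (j, self.expected[j]) in self.buffer:
            msg = self.buffer.pop((j, self.expected[j]))
            self.delivered.append((msg, j))
            self.expected[j] += 1
```

For example, if the message with sequence number 1 from pj arrives before the one with sequence number 0, it is held in the buffer and both are delivered, in order, when number 0 arrives.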

Page 17: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Single-Source FIFO Algorithm

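[Figure: layered simulation. The user of SSF bcast invokes ssf-bc-send on, and gets ssf-bc-recv from, the SSF alg (timestamps); this interface is the ssf bcast service. The SSF alg invokes bc-send on, and gets bc-recv from, the basic bcast alg (n copies); this interface is the basic bcast service. The basic bcast alg invokes send on, and gets recv from, point-to-point message passing.]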

Page 18: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Asymmetric Algorithm for Totally Ordered Broadcast

Assume underlying communication service is basic broadcast.

There is a distinguished proc. pc

When to-bc-sendi(m) occurs:
  - pi sends m to pc (either assume the basic broadcast service also has a point-to-point mechanism, or have recipients other than pc ignore the msg)

When pc receives m from pi from the basic broadcast service:
  - append a sequence number to m and bc-send it

Page 19: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Asymmetric Algorithm for Totally Ordered Broadcast

When can pi perform to-bc-recv(m)?
  - when pi has bc-recv'ed m with sequence number T and has to-bc-recv'ed messages with all smaller sequence numbers
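The asymmetric algorithm splits into two roles: the distinguished proc. pc, which assigns sequence numbers, and every proc., which delivers in sequence-number order. The sketch below is an illustration only; the class names `Sequencer` and `TOReceiver` and the `bc_send` callback are hypothetical, and sequence numbers are assumed to start at 0.

```python
class Sequencer:
    """Role played by the distinguished proc. p_c (sketch)."""
    def __init__(self, bc_send):
        self.bc_send = bc_send   # assumed basic-broadcast primitive
        self.next_seq = 0

    def on_request(self, m, i):
        # p_c received m from p_i: tag it with the next seq number.
        self.bc_send((m, i, self.next_seq))
        self.next_seq += 1

class TOReceiver:
    """Delivery rule at every proc. (sketch)."""
    def __init__(self):
        self.expected = 0        # next sequence number to deliver
        self.buffer = {}         # seq -> (m, original sender)
        self.delivered = []      # outputs to-bc-recv(m, j)

    def on_bc_recv(self, tagged):
        m, i, seq = tagged
        self.buffer[seq] = (m, i)
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            self.expected += 1
```

Two receivers that get the sequenced messages in different arrival orders still deliver them in the same order, because delivery follows the sequence numbers chosen by pc.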

Page 20: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Asymmetric Algorithm Discussion

  - Simple
  - Only requires basic broadcast
  - But pc is a bottleneck
  - Alternative approach next…

Page 21: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Symmetric Algorithm for Totally Ordered Broadcast

Assume the underlying communication service is single-source FIFO broadcast.

Each proc. tags each msg it sends with a timestamp (increasing). Break ties using proc. ids.

Each proc. keeps a vector of estimates of the other procs' timestamps:
  - If pi's estimate for pj is k, then pi will not receive any later msg from pj with timestamp ≤ k.
  - Estimates are updated based on msgs received and "timestamp update" msgs.

Page 22: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Symmetric Algorithm for Totally Ordered Broadcast

Each proc. keeps its timestamp ≥ all its estimates:
  - when pi has to increase its timestamp because of the receipt of a message, it sends a timestamp update msg

A proc. can deliver a msg with timestamp T once every entry in the proc's vector of estimates is at least T.

Page 23: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Symmetric Algorithm

when to-bc-sendi(m) occurs:
    ts[i]++
    add (m,ts[i],i) to pending
    invoke ssf-bc-sendi((m,ts[i]))

when ssf-bc-recvi((m,T)) from pj occurs, j ≠ i (msgs from self are ignored):
    ts[j] := T
    add (m,T,j) to pending
    if T > ts[i] then
        ts[i] := T
        invoke ssf-bc-sendi("ts-up",T)

when ssf-bc-recvi("ts-up",T) from pj occurs, j ≠ i:
    ts[j] := T

invoke to-bc-recvi(m,j) when:
    (m,T,j) is the entry in pending with smallest (T,j)
    T ≤ ts[k] for all k
  result: remove (m,T,j) from pending
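The symmetric algorithm can be exercised with a small simulation. This is a sketch under assumptions rather than the definitive algorithm: the SSF broadcast service is modeled by one FIFO queue per (sender, receiver) pair, a proc. ignores the SSF copies of its own messages (so its own timestamp never moves backwards and no entry is added to pending twice), and the names `SymmetricTO` and `simulate` are hypothetical.

```python
import random
from collections import deque

class SymmetricTO:
    """Symmetric totally ordered broadcast at proc. p_i (sketch)."""
    def __init__(self, i, n, ssf_bc_send):
        self.i, self.n = i, n
        self.ssf_bc_send = ssf_bc_send  # assumed SSF broadcast primitive
        self.ts = [0] * n               # ts[i] own timestamp, ts[j] estimates
        self.pending = set()            # entries (T, j, m)
        self.delivered = []             # outputs to-bc-recv_i(m, j)

    def to_bc_send(self, m):
        self.ts[self.i] += 1
        self.pending.add((self.ts[self.i], self.i, m))
        self.ssf_bc_send(self.i, ("msg", m, self.ts[self.i]))
        self.try_deliver()

    def on_ssf_bc_recv(self, payload, j):
        if j == self.i:
            return                      # assumption: ignore own copies
        kind, m, T = payload
        self.ts[j] = T
        if kind == "msg":
            self.pending.add((T, j, m))
            if T > self.ts[self.i]:
                self.ts[self.i] = T
                self.ssf_bc_send(self.i, ("ts-up", None, T))
        self.try_deliver()

    def try_deliver(self):
        while self.pending:
            T, j, m = min(self.pending)  # smallest (T, j) first
            if any(T > self.ts[k] for k in range(self.n)):
                break                    # some estimate is still below T
            self.pending.remove((T, j, m))
            self.delivered.append((m, j))

def simulate(n, sends, seed=0):
    """sends: list of (i, m); all sends are issued before any delivery."""
    rng = random.Random(seed)
    chan = {(s, d): deque() for s in range(n) for d in range(n)}

    def ssf_bc_send(src, payload):
        for d in range(n):
            chan[(src, d)].append(payload)  # FIFO per (sender, receiver)

    procs = [SymmetricTO(i, n, ssf_bc_send) for i in range(n)]
    for i, m in sends:
        procs[i].to_bc_send(m)
    while True:
        ready = [k for k, q in chan.items() if q]
        if not ready:
            break
        s, d = rng.choice(ready)
        procs[d].on_ssf_bc_recv(chan[(s, d)].popleft(), s)
    return [p.delivered for p in procs]
```

Whatever interleaving the seed produces, all procs should deliver the messages in the same (timestamp, sender id) order.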

Page 24: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

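[Figure: layered simulation. The user of TO bcast invokes to-bc-send on, and gets to-bc-recv from, the symmetric TO alg; this interface is the TO bcast service. The symmetric TO alg invokes ssf-bc-send on, and gets ssf-bc-recv from, the SSF alg (timestamps); this interface is the ssf bcast service. The SSF alg invokes bc-send on, and gets bc-recv from, the basic bcast alg (n copies); this interface is the basic bcast service. The basic bcast alg invokes send on, and gets recv from, point-to-point message passing.]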

Page 25: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Correctness of Symmetric Algorithm

Lemma (8.2): Timestamps assigned to msgs form a total order (break ties with id of sender).

Theorem (8.3): Symmetric algorithm simulates totally ordered broadcast service.

Proof: Must show top-level outputs of symmetric algorithm satisfy 4 properties, in every admissible execution (relies on underlying ssf-bcast service being correct).

Page 26: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Correctness of Symmetric Alg.

Integrity: follows from same property for ssf-bcast.

No Duplicates: follows from same property for ssf-bcast.

Liveness:
  - Suppose in contradiction some pi has some entry (m,T,j) stuck in its pending set forever, where (T,j) is the smallest timestamp of all stuck entries.
  - Eventually (m,T,j) has the smallest timestamp of all entries in pi's pending set.
  - Why is (m,T,j) stuck at pi? Because pi's estimate of some pk's timestamp is stuck at some value T' < T.
  - But that would mean either pk never receives (m,T,j) or pk's timestamp-update msg resulting from pk receiving (m,T,j) is never received at pi, contradicting correctness of the SSF broadcast.

Page 27: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Correctness of Symmetric Alg.

Total Ordering: Suppose pi invokes to-bc-recv for msg m with timestamp (T,j), and later it invokes to-bc-recv for msg m' with timestamp (T',j'). Show (T,j) < (T',j').

By the code, if (m',T',j') is in pi's pending set when pi invokes the to-bc-recv for m, then (T,j) < (T',j').

Suppose (m',T',j') is not yet in pi's pending set at that time.

When pi invokes the to-bc-recv for m, precondition ensures that T ≤ ts[j']. So pi has received a msg from pj' with timestamp ≥ T.

By the SSF property, every subsequent msg pi receives from pj' will have timestamp > T, so T' must be > T.

Page 28: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Causal Ordering Algorithms

The symmetric total ordering algorithm ensures causal ordering: timestamp order extends the happens-before order on messages.

Causal ordering can also be attained without the overhead of total ordering, by using an algorithm based on vector clocks…

Page 29: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Causal Order Algorithm

Code for pi:

when co-bc-sendi(m) occurs:
    vt[i]++
    invoke co-bc-recvi(m)
    invoke bc-sendi((m,vt))

when bc-recvi((m,w)) from pj occurs:
    add (m,w,j) to pending

invoke co-bc-recvi(m,j) when:
    (m,w,j) is in pending
    w[j] = vt[j] + 1
    w[k] ≤ vt[k] for all k ≠ j
  result:
    remove (m,w,j) from pending
    vt[j]++

Note: vt[j] records how many msgs from pj have been co-bc-recv'ed by pi
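The vector-clock delivery rule can be sketched as follows. This is an illustration under assumptions: the class name `CausalBroadcast` and the `bc_send` callback are hypothetical, and the basic-broadcast copy of a proc.'s own message is dropped on receipt, since that message is already delivered locally at send time.

```python
class CausalBroadcast:
    """Causally ordered broadcast at proc. p_i (sketch)."""
    def __init__(self, i, n, bc_send):
        self.i, self.n = i, n
        self.bc_send = bc_send   # assumed basic-broadcast primitive
        self.vt = [0] * n        # vt[j]: msgs from p_j delivered so far
        self.pending = []        # (m, w, j) entries waiting for delivery
        self.delivered = []      # outputs co-bc-recv_i(m, j)

    def co_bc_send(self, m):
        self.vt[self.i] += 1
        self.delivered.append((m, self.i))   # deliver own msg at once
        self.bc_send((m, tuple(self.vt)))

    def on_bc_recv(self, tagged, j):
        if j == self.i:
            return               # assumption: own copy is dropped
        m, w = tagged
        self.pending.append((m, w, j))
        self._deliver_ready()

    def _deliver_ready(self):
        progress = True
        while progress:          # one delivery may unblock others
            progress = False
            for entry in list(self.pending):
                m, w, j = entry
                others_ok = all(w[k] <= self.vt[k]
                                for k in range(self.n) if k != j)
                if w[j] == self.vt[j] + 1 and others_ok:
                    self.pending.remove(entry)
                    self.vt[j] += 1
                    self.delivered.append((m, j))
                    progress = True
```

If a message that causally depends on an earlier one arrives first, it stays in `pending` until the earlier message has been delivered.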

Page 30: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Causal Order Algorithm Discussion

Vector clocks are implemented slightly differently than in the point-to-point case.

In point-to-point case, we exploited indirect (transitive) information about messages received by other procs.

In the broadcast case, we don't need to do that, since every proc will eventually receive every message directly.

Page 31: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Causal Order Algorithm Example

Algorithm delays the delivery of the C.O. msgs until causal order property won't be violated.

[Figure: space-time diagram in which msgs carry the vector timestamps (0,1,0), (0,2,0), (0,3,0), and (1,3,0)]

Page 32: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Correctness of Causal Order Algorithm (Sketch)

Lemma (8.6): The local array variables vt serve as vector clocks.

Theorem (8.7): The algorithm simulates causally ordered broadcast, if the underlying communication system satisfies (basic) broadcast.

Proof: Integrity and No Duplicates follow from the same properties of the basic broadcast. Liveness requires some arguing. Causal Ordering follows from the lemma.

Page 33: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Reliable Broadcast

What do we require of a broadcast service when some of the procs can be faulty?

Specifications differ from the corresponding non-fault-tolerant specs in two ways:

1. proc indices are partitioned into "faulty" and "nonfaulty"

2. Liveness property is modified…

Page 34: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Reliable Broadcast Specification

Nonfaulty Liveness: Every msg bc-sent by a nonfaulty proc is eventually bc-recv'ed by all nonfaulty procs.

Faulty Liveness: Every msg bc-sent by a faulty proc is bc-recv'ed by either all the nonfaulty procs or none of them.

Page 35: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Discussion of Reliable Bcast Spec

Specification is independent of any particular fault model.

We will only consider implementations for crash faults.

No guarantee is given concerning which messages are received by faulty procs.

Can extend this spec to the various ordering variants:
  - msgs that are received by nonfaulty procs must conform to the relevant ordering property.

Page 36: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Spec of Failure-Prone Point-to-Point Message Passing System

Before we can design an algorithm to implement reliable (i.e., fault-tolerant) broadcast, we need to know what we can rely on from the lower layer communication system.

Modify the previous point-to-point spec from the no-fault case in two ways:

1. partition proc indices into "faulty" and "nonfaulty"

2. Liveness property is modified…

Page 37: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Spec of Failure-Prone Point-to-Point Message Passing System

Nonfaulty Liveness: every msg sent by a nonfaulty proc to any nonfaulty proc is eventually received.

Note that this places no constraints on the eventual delivery of messages to faulty procs.

Page 38: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Reliable Broadcast Algorithm

when rel-bc-sendi(m) occurs:
    invoke sendi(m) to all procs

when recvi(m) from pj occurs:
    if m has not already been recv'ed then
        invoke sendi(m) to all procs
        invoke rel-bc-recvi(m)
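The relay step is what gives Faulty Liveness under crash faults: a message that reaches even one nonfaulty proc gets forwarded to all the others. The sketch below is an illustration under assumptions: messages are unique (for example, tagged by the application with sender id and a counter), and the class name `ReliableBroadcast` and the `send(src, dest, m)` callback are hypothetical.

```python
class ReliableBroadcast:
    """Reliable broadcast for crash faults at proc. p_i (sketch)."""
    def __init__(self, i, n, send):
        self.i, self.n = i, n
        self.send = send         # assumed pt-to-pt primitive send(src, dest, m)
        self.seen = set()        # msgs already recv'ed
        self.delivered = []      # outputs rel-bc-recv_i(m)

    def rel_bc_send(self, m):
        for d in range(self.n):
            self.send(self.i, d, m)

    def on_recv(self, m, j):
        if m not in self.seen:
            self.seen.add(m)
            # Relay before delivering, so a crash right after delivery
            # cannot leave other nonfaulty procs without m.
            for d in range(self.n):
                self.send(self.i, d, m)
            self.delivered.append(m)
```

For instance, if a faulty sender crashes after its copy reaches only one nonfaulty proc, that proc's relay still gets the message to every other nonfaulty proc.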

Page 39: CSCE 668 DISTRIBUTED ALGORITHMS AND SYSTEMS

Correctness of Reliable Bcast Alg

Integrity: follows from Integrity property of underlying point-to-point msg system.

No Duplicates: follows from No Duplicates property of underlying point-to-point msg system and the check that this msg was not already received.

Nonfaulty Liveness: follows from Nonfaulty Liveness property of underlying point-to-point msg system.

Faulty Liveness: follows from relaying and underlying Nonfaulty Liveness.