CMPT 431 Lecture IX: Coordination And Agreement

CMPT 431 © A. Fedorova

A Replicated Service

[Figure: clients reach a group of servers (one master, two slaves) over the network. Arrow legend: W = write, W (a second arrow style) = data replication, R = read.]

A Need For Coordination And Agreement

[Figure: clients, network, and servers (one master, two slaves), with two annotations:]

• Must coordinate election of a new master

• Must agree on a new master

Roadmap

• Today we will discuss protocols for coordination and agreement

• This is a difficult problem because of failures and lack of bound on message delay

• We will begin with a strong set of assumptions (assume few failures), and then we will relax those assumptions

• We will look at several problems requiring communication and agreement: distributed mutual exclusion, election

• Finally, we will learn that in an asynchronous distributed system consensus cannot be guaranteed if even one process may fail

Distributed Mutual Exclusion (DMTX)

• Similar to a local mutual exclusion problem
• Processes in a distributed system share a resource
• Only one process can access a resource at a time
• Examples:
– File sharing
– Sharing a bank account
– Updating a shared database

Assumptions and Requirements

• A synchronous system
• Processes do not fail
• Message delivery is reliable (exactly once)
• Protocol requirements:
– Safety: At most one process may execute in the critical section at a time
– Liveness: Requests to enter and exit the critical section eventually succeed
– Fairness: Requests to enter the critical section are granted in the order in which they were received

Evaluation Criteria of DMTX Algorithms

• Bandwidth consumed
– proportional to the number of messages sent in each entry and exit operation
• Client delay
– delay incurred by a process at each entry and exit operation
• System throughput
– the rate at which processes can access the critical section (number of accesses per unit of time)

DMTX Algorithms

• We will consider the following algorithms:
– Central server algorithm
– Ring-based algorithm
– An algorithm based on voting

The Central Server Algorithm

• Performance:
– Entering a critical section takes two messages (a request message followed by a grant message)
– System throughput is limited by the synchronization delay at the server: the time between the release message to the server and the grant message to the next client
• Fault tolerance
– Does not tolerate failures
– What if the client holding the token fails?
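
The request, grant, and release messages mentioned above can be sketched as a small simulation. This is only an illustrative sketch, not the lecture's implementation: the class name `CentralServer`, its methods, and the `log` list are hypothetical, and processes are assumed never to fail.

```python
from collections import deque


class CentralServer:
    """Hypothetical central server: one token, requests queued in arrival order."""

    def __init__(self):
        self.holder = None      # process currently holding the token
        self.waiting = deque()  # pending requests, first come first served
        self.log = []           # grant messages sent by the server

    def request(self, pid):
        # A request is granted at once if the token is free, otherwise queued.
        if self.holder is None:
            self._grant(pid)
        else:
            self.waiting.append(pid)

    def release(self, pid):
        # Only the current holder may release the token.
        if pid != self.holder:
            raise ValueError(f"{pid} does not hold the token")
        self.holder = None
        if self.waiting:
            self._grant(self.waiting.popleft())

    def _grant(self, pid):
        self.holder = pid
        self.log.append(("grant", pid))


def demo():
    server = CentralServer()
    for pid in ("p1", "p2", "p3"):
        server.request(pid)
    for pid in ("p1", "p2", "p3"):
        server.release(pid)
    return [pid for _, pid in server.log]
```

In this sketch `demo()` grants the token in request order. If the holder never calls `release`, every queued process waits forever, which is the fault-tolerance problem raised in the last bullet.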

A Ring-Based Algorithm

A Ring-Based Algorithm (cont)

• Processes are arranged in a ring
• There is a communication channel from process p(i) to process p((i+1) mod N)
• They continuously pass the mutual exclusion token around the ring
• A process that does not need to enter the critical section (CS) passes the token along
• A process that needs to enter the CS retains the token; once it exits the CS, it passes the token on
• No fault tolerance
• Excessive bandwidth consumption
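
A minimal sketch of the token circulation described above, assuming no failures. The function name `ring_token_pass` and its parameters are hypothetical; each hop of the token is counted as one message.

```python
def ring_token_pass(n, wants_cs, start=0, laps=1):
    """Circulate the token around a ring of n processes for `laps` circuits.

    Returns (entries, messages): the order in which processes entered the
    critical section, and the number of token-passing messages sent.
    """
    pending = set(wants_cs)
    entries = []
    messages = 0
    holder = start
    for _ in range(n * laps):
        if holder in pending:
            # The holder retains the token while it is in the critical section.
            entries.append(holder)
            pending.discard(holder)
        holder = (holder + 1) % n  # pass the token to p((i+1) mod N)
        messages += 1
    return entries, messages


busy = ring_token_pass(5, {3, 1})
idle = ring_token_pass(4, set(), laps=3)
```

Here `busy` is `([1, 3], 5)`, and `idle` is `([], 12)`: twelve messages are sent even though no process wants the critical section, which illustrates the bandwidth concern.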

Maekawa’s Voting Algorithm

• To enter a critical section a process must receive permission from a subset of its peers
• Processes are organized in voting sets
• A process is a member of M voting sets
• All voting sets are of equal size (for fairness)

Maekawa’s Voting Algorithm

[Figure: processes p1, p2, p3, p4]

• Intersection of voting sets guarantees mutual exclusion

• To avoid deadlock, requests to enter critical section must be ordered
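
One simple way to obtain voting sets that pairwise intersect is to place N = k*k processes on a k-by-k grid and let each process's voting set be its row plus its column. This grid layout is an assumption made for illustration (the slides do not say how the sets are built), and the helper names are hypothetical. Any two such sets share at least one member, and a member grants its vote to only one requester at a time, so two processes cannot both collect all the permissions they need.

```python
from itertools import combinations


def grid_voting_sets(k):
    """Voting set of each process = its row plus its column in a k x k grid."""
    sets = {}
    for p in range(k * k):
        row, col = divmod(p, k)
        row_members = {row * k + c for c in range(k)}
        col_members = {r * k + col for r in range(k)}
        sets[p] = row_members | col_members
    return sets


def all_pairs_intersect(sets):
    """True if every two voting sets have at least one common member."""
    return all(sets[a] & sets[b] for a, b in combinations(sets, 2))


voting = grid_voting_sets(3)
sizes = {len(members) for members in voting.values()}
memberships = {sum(p in members for members in voting.values()) for p in voting}
```

For k = 3 every voting set has 2k - 1 = 5 members and every process belongs to M = 5 voting sets, so the equal-size and equal-membership properties both hold in this construction.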

Elections

• Election algorithms are used when a unique process must be chosen to play a particular role:
– Master in a master-slave replication system
– Central server in the DMTX protocol
• We will look at the bully election algorithm
• The bully algorithm tolerates failstop failures
• But it works only in a synchronous system with reliable messaging

The Bully Election Algorithm

• All processes are assigned identifiers
• The system always elects a coordinator with the highest identifier:
– Each process must know all processes with higher identifiers than its own
• Three types of messages:
– election: a process begins an election
– answer: a process acknowledges the election message
– coordinator: an announcement of the identity of the elected process

The Bully Election Algorithm (cont.)

• Initiation of election:
– Process p1 detects that the existing coordinator p4 has crashed and initiates the election
– p1 sends an election message to all processes with a higher identifier than its own

[Figure: p1 sends election messages to p2, p3, and p4]

The Bully Election Algorithm (cont.)

• What happens if there are no further crashes:
– p2 and p3 receive the election message from p1, send back the answer message to p1, and begin their own elections
– p3 sends answer to p2
– p3 receives no answer message from p4, so after a timeout it elects itself as a leader (knowing it has the highest ID)

[Figure: election, answer, and coordinator messages exchanged among p1, p2, p3, p4]

The Bully Election Algorithm (cont.)

• What happens if p3 also crashes after sending the answer message but before sending the coordinator message?

• In that case, p2 will time out while waiting for coordinator message and will start a new election

[Figure: election and answer messages exchanged among p1, p2, p3, p4]

The Bully Election Algorithm (summary)

• The algorithm does not require a central server
• Does not require knowing identities of all the processes
• Requires knowing identities of processes with higher IDs
• Survives crashes
• Assumes a synchronous system (relies on timeouts)
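
The message flow of the bully algorithm can be sketched as a small synchronous simulation. This is an illustrative sketch under simplifying assumptions: the function `bully_election`, the `alive` set, and the tuple format of the log are hypothetical, a crashed process is modelled as one that never answers (standing in for a timeout), and no process crashes during the election itself.

```python
def bully_election(ids, alive, initiator):
    """Simulate one bully election; return (coordinator, message log).

    ids: all process identifiers; alive: identifiers of processes that are up.
    Log entries are (message_type, sender, receiver).
    """
    log = []
    started = set()
    coordinator = None

    def hold_election(p):
        nonlocal coordinator
        if p in started:
            return
        started.add(p)
        answered = False
        for q in sorted(i for i in ids if i > p):
            log.append(("election", p, q))
            if q in alive:
                log.append(("answer", q, p))
                answered = True
        if not answered:
            # No answer arrives before the timeout: p has the highest live ID.
            coordinator = p
            for q in sorted(i for i in ids if i < p):
                log.append(("coordinator", p, q))
        else:
            # Every process that answered begins its own election.
            for q in sorted(i for i in ids if i > p and i in alive):
                hold_election(q)

    hold_election(initiator)
    return coordinator, log


leader, messages = bully_election([1, 2, 3, 4], alive={1, 2, 3}, initiator=1)
leader_without_p3, _ = bully_election([1, 2, 3, 4], alive={1, 2}, initiator=1)
```

With p4 crashed, `leader` is 3; if p3 is also down before the election starts, `leader_without_p3` is 2. The case where p3 crashes after answering needs a second timeout at p2 and is not modelled here.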

Consensus With General Failures

• The algorithms we’ve covered so far tolerated only failstop failures

• Let’s look at reaching consensus in the presence of more general failures
– Omission
– Byzantine

Consensus

• All processes agree on the same value (or set of values)
• When do you need consensus?
– Leader (master) election
– Mutual exclusion
– Transaction involving multiple parties (banking)
• We will look at several variants of the consensus problem
– Consensus
– Byzantine generals

System Model

• There is a set of processes Pi
• There is a set of values {v0, …, vN-1} proposed by processes
• Each process Pi decides on di
• di belongs to the set {v0, …, vN-1}
• Assumptions:
– Synchronous system (for now)
– Failstop failures
– Byzantine failures
– Reliable channels

Consensus

[Figure, two panels around a "Consensus algorithm" box:]

Step 1, Propose: P1, P2, P3 propose v1, v2, v3.

Step 2, Decide: P1, P2, P3 decide d1, d2, d3.

Courtesy of Jeff Chase, Duke University

Consensus (C)

Pi selects di from {v0, …, vN-1}.

All Pi select the same vk (make the same decision): di = vk

Courtesy of Jeff Chase, Duke University

Conditions for Consensus

• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
• Integrity: If all correct processes propose the same v, then every correct process decides di = v

Consensus in a Synchronous System Without Failures

• Each process pi proposes a decision value vi

• All proposed vi are sent around, such that each process knows all proposed vi

• Once all processes receive all proposed v’s, they apply to them the same function, such as: minimum(v1, v2, …, vN)
• Each process pi sets di = minimum(v1, v2, …, vN)
• The consensus is reached
• What if processes fail? Can other processes still reach an agreement?
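
The failure-free case can be sketched in a few lines. The function name `decide_without_failures` and the dictionary representation of proposals are assumptions made for illustration.

```python
def decide_without_failures(proposals):
    """proposals maps each process to its proposed value; returns its decision.

    With no failures, after one exchange every process knows every proposed
    value, so applying the same function (here, min) yields the same decision.
    """
    known = {p: set(proposals.values()) for p in proposals}
    return {p: min(values) for p, values in known.items()}


decisions = decide_without_failures({"p1": 7, "p2": 3, "p3": 5})
```

Every process in `decisions` ends up with 3, the minimum of the proposed values.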

Consensus in a Synchronous System With Failstop & Omission Failures

• We assume that at most f out of N processes fail
• To reach a consensus despite f failures, we must extend the algorithm to take f+1 rounds
• At round 1: each process pi sends its proposed vi to all other processes and receives v’s from other processes
• At each subsequent round process pi sends v’s that it has not sent before and receives new v’s
• The algorithm terminates after f+1 rounds
• Let’s see why it works…
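
The f+1-round exchange can be sketched as follows, under stated assumptions: the function `flooding_consensus` and the `crashes` schedule are hypothetical, a crashing process is modelled as reaching only some recipients in its final round, and the common decision function is the minimum.

```python
def flooding_consensus(proposals, f, crashes=None):
    """Run f+1 rounds of value exchange; return decisions of surviving processes.

    proposals: {process: value}.
    crashes: {process: (round, recipients)}; the process crashes in that round
    after its message reached only the listed recipients.
    """
    crashes = crashes or {}
    procs = sorted(proposals)
    known = {p: {proposals[p]} for p in procs}   # every value p has seen
    unsent = {p: {proposals[p]} for p in procs}  # values p has not forwarded yet
    crashed = set()
    for rnd in range(1, f + 2):                  # rounds 1 .. f+1
        inbox = {p: set() for p in procs}
        for p in procs:
            if p in crashed:
                continue
            if p in crashes and crashes[p][0] == rnd:
                recipients = crashes[p][1]       # partial send, then crash
                crashed.add(p)
            else:
                recipients = [q for q in procs if q != p]
            for q in recipients:
                inbox[q] |= unsent[p]
            unsent[p] = set()
        for p in procs:
            if p not in crashed:
                new = inbox[p] - known[p]
                known[p] |= new
                unsent[p] = new
    return {p: min(known[p]) for p in procs if p not in crashed}


proposals = {1: 5, 2: 9, 3: 1, 4: 7}
schedule = {3: (1, [1])}  # process 3 crashes in round 1, reaching only process 1
two_rounds = flooding_consensus(proposals, f=1, crashes=schedule)
one_round = flooding_consensus(proposals, f=0, crashes=schedule)
```

With f = 1 (two rounds) all survivors decide 1, because process 1 forwards the value it alone received. With only one round the survivors disagree: process 1 decides 1 while processes 2 and 4 decide 5.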

Proof that Consensus is Reached

• We will prove this by contradiction
• Suppose some correct process pi possesses a value that another correct process pj does not possess
• This must have happened because some other process pk sent that value to pi but crashed before sending it to pj (or the message to pj was lost)
• The crash must have happened in round f+1 (the last round). Otherwise, pi would have sent that value to pj in round f+1
• But why did pj not receive that value in any of the previous rounds?
• There must have been a crash in every previous round: some process sent the value to some other processes, but did not send it to pj
• But this implies that there must have been f+1 failures, one in each of the f+1 rounds
• This is a contradiction: we assumed at most f failures

A Take-Away Point

• If you cannot build a fully failproof algorithm...
• Build an algorithm that is guaranteed to tolerate some number f of failures
• Then build a system that has no more than f failures with high probability

Byzantine Generals Problem (BG)

• Two types of generals: commander and subordinates

• A commander proposes an action (vi).

• Subordinates must agree

[Figure: the leader (commander) sends vleader to each subordinate (lieutenant); the subordinates decide di = vleader and dj = vleader]

Courtesy of Jeff Chase, Duke University

Conditions for Consensus

• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
• Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed

Consensus in a Synchronous System With Byzantine Failures

• Byzantine failure: a process can forward to another process an arbitrary value v

• Byzantine generals: the commander...
– says to one lieutenant that v = A
– says to another lieutenant that v = B

• We will show that consensus is impossible with only 3 generals

• Pease et al. generalized this to the impossibility of consensus when N ≤ 3f, where N is the number of generals and f is the number of faulty generals

BG: Impossibility With Three Generals

• Scenario 1: p2 must decide v (by the integrity condition)
• But p2 cannot distinguish between Scenario 1 and Scenario 2
• If it decides to believe the commander, it will decide v in Scenario 2
• By symmetry, p3 will decide u in Scenario 2
• p2 and p3 will have reached different decisions

Scenario 1 (p3 is faulty): p1 (commander) sends 1:v to p2 and 1:v to p3; p2 relays 2:1:v to p3; p3 relays 3:1:u to p2.

Scenario 2 (p1 is faulty): p1 (commander) sends 1:v to p2 and 1:u to p3; p2 relays 2:1:v to p3; p3 relays 3:1:u to p2.

In the original figure, faulty processes are shown shaded. “3:1:u” means “3 says 1 says u”.
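
The indistinguishability argument can be checked mechanically. In this sketch the helper names (`run_round`, `honest`, `lie_u`) are hypothetical; each lieutenant's view is the pair (value received from the commander, value the other lieutenant claims to have received).

```python
def honest(value):
    """A correct lieutenant relays exactly what it received."""
    return value


def lie_u(value):
    """A faulty lieutenant claims it received u, whatever it actually got."""
    return "u"


def run_round(sent_by_commander, relay):
    """Return each lieutenant's view: (direct value, value relayed by the other)."""
    views = {}
    for me, other in (("p2", "p3"), ("p3", "p2")):
        direct = sent_by_commander[me]
        relayed = relay[other](sent_by_commander[other])
        views[me] = (direct, relayed)
    return views


# Scenario 1: correct commander sends v to both; faulty p3 relays u.
scenario1 = run_round({"p2": "v", "p3": "v"}, {"p2": honest, "p3": lie_u})
# Scenario 2: faulty commander sends v to p2 and u to p3; both lieutenants are honest.
scenario2 = run_round({"p2": "v", "p3": "u"}, {"p2": honest, "p3": honest})
```

In both scenarios p2's view is `("v", "u")`, so any deterministic rule p2 applies must produce the same decision in both, which is the core of the impossibility argument.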

Solution With Four Byzantine Generals

• We can reach consensus if there are 4 generals and at most 1 is faulty

• Intuition: use the majority rule

[Figure: a correct process receives conflicting reports and asks “Who is telling the truth?” Answer: “Majority rules!”]

Solution With Four Byzantine Generals

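Scenario 1 (lieutenant p3 is faulty):
– Round 1: p1 (commander) sends 1:v to p2, p3, and p4
– Round 2: p2 relays 2:1:v to p3 and p4; p4 relays 4:1:v to p2 and p3; p3 relays 3:1:u to p2 and 3:1:w to p4

Scenario 2 (commander p1 is faulty):
– Round 1: p1 sends 1:u to p2, 1:w to p3, and 1:v to p4
– Round 2: p2 relays 2:1:u, p3 relays 3:1:w, and p4 relays 4:1:v to the other two lieutenants

In the original figure, faulty processes are shown shaded.

Round 1: The commander sends v to all other generals
Round 2: All lieutenants exchange the values that they received from the commander
The decision is made based on majority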

Solution With Four Byzantine Generals

[Figure: p3 is faulty. p1 (commander) sends 1:v to p2, p3, and p4; p2 relays 2:1:v and p4 relays 4:1:v; p3 relays 3:1:u to p2 and 3:1:w to p4]

p2 receives: {v, v, u}. Decides v
p4 receives: {v, v, w}. Decides v

Solution With Four Byzantine Generals

[Figure: the commander p1 is faulty. p1 sends 1:u to p2, 1:w to p3, and 1:v to p4; p2 relays 2:1:u, p3 relays 3:1:w, and p4 relays 4:1:v]

p2 receives: {u, w, v}. Decides NULL
p4 receives: {u, v, w}. Decides NULL
p3 receives: {w, u, v}. Decides NULL

The result generalizes to systems with N ≥ 3f + 1 (N is the number of processes, f is the number of faulty processes)
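
The majority rule applied by each lieutenant can be sketched as below. The function name `majority` is hypothetical, and `None` stands in for the NULL decision used on the slides.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of reports, else None."""
    top, count = Counter(values).most_common(1)[0]
    return top if count > len(values) / 2 else None


# Faulty lieutenant p3: the correct lieutenants still agree on v.
p2_decision = majority(["v", "v", "u"])
p4_decision = majority(["v", "v", "w"])
# Faulty commander: no value has a majority, so every lieutenant decides NULL.
faulty_commander_decision = majority(["u", "w", "v"])
```

In the first case both correct lieutenants decide v; in the second all three decide NULL, which still satisfies agreement because they all make the same decision.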

Consensus in an Asynchronous System

• In the algorithms we’ve looked at, consensus has been reached by using several rounds of communication

• The systems were synchronous, so each round always terminated

• If a process has not received a message from another process in a given round, it could assume that the process is faulty

• In an asynchronous system this assumption cannot be made!
• Fischer-Lynch-Paterson (1985): No consensus can be guaranteed in an asynchronous communication system if even one process may fail.

• Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time.

Consensus in Practice

• Real distributed systems are by and large asynchronous
• How do they operate if consensus cannot be reached?
• Assume a synchronous system: use manual fault resolution if something goes wrong
• Fault masking: assume that failed processes always recover, and define a way to reintegrate them into the group.
– If you haven’t heard from a process, just keep waiting…
– A round terminates when every expected message is received.
• Failure detectors: construct a failure detector that can determine if a process has failed.
– A round terminates when every expected message is received, or the failure detector reports that its sender has failed.

Failure Detectors

• First problem: how to detect that a member has failed?
– pings, timeouts, beacons, heartbeats
– recovery notifications
• Is the failure detector accurate?
– Does it accurately detect failures?
• Is the failure detector live?
– Are there bounds on failure detection time?
• In an asynchronous system, it is impossible for a failure detector to be both accurate and live
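
A timeout-based heartbeat detector illustrates the trade-off. This is a minimal sketch: the class `HeartbeatDetector`, its methods, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all assumptions made for illustration.

```python
class HeartbeatDetector:
    """Suspects a process if no heartbeat arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process -> time of its latest heartbeat

    def heartbeat(self, pid, now):
        self.last_seen[pid] = now

    def suspected(self, now):
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3)
detector.heartbeat("p1", now=0)
detector.heartbeat("p2", now=0)
detector.heartbeat("p1", now=2)
suspects_at_4 = detector.suspected(now=4)  # p2 has been silent for 4 > 3 units
detector.heartbeat("p2", now=5)            # p2 was only slow, not crashed
suspects_at_5 = detector.suspected(now=5)
```

At time 4 the detector suspects p2, yet p2's heartbeat at time 5 shows the suspicion was wrong. The bounded timeout makes the detector live but not accurate; waiting indefinitely would make it accurate but not live.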

Summary

• Coordination and agreement are essential in real distributed systems

• Real distributed systems are asynchronous
• Consensus cannot be guaranteed in an asynchronous distributed system
• Nevertheless, people still build useful distributed systems that rely on consensus
• Fault recovery and masking are used as mechanisms for helping processes reach consensus
• Popular fault masking and recovery techniques are transactions and replication, the topics of the next few lectures

and replication – the topics of the next few lectures