PAXOS Lecture by Avi Eyal Based on: Deconstructing Paxos – by Rajsbaum Paxos Made Simple – by Lamport Reconstructing Paxos – by Rajsbaum
Posted: 21-Dec-2015

PAXOS

Lecture by Avi Eyal

Based on:

Deconstructing Paxos – by Rajsbaum

Paxos Made Simple – by Lamport

Reconstructing Paxos – by Rajsbaum

Our Goals

• Agree on values (Consensus)

• Arrange those values in a “Total Order”

[Figure: a row of positions numbered 1 to 5, with the values 23, 55 and 2 placed in some of them.]

The Scene

• Complete graph

• Asynchronous system and no FIFO

• Machine may crash (first we deal with “crash-stop”)

• No Byzantine errors

• No corruption of messages

• The number of machines is known

• The system stabilizes after a finite time

A word about stability

By the FLP theorem, Consensus is not solvable in an asynchronous system if even a single process might crash.

We assume that after an unknown finite time, every process that crashes, crashes for good, and every active process is active for good (i.e. no process is unstable forever)

Consensus

• If process Pi proposes a value over and over, then either Pi crashes or Pi decides.

• If Pi decides on a value, then eventually every correct process decides the same value.

Consensus

How can we assure that only a single value is chosen when some machines are unstable?

Do we need a consistent leader?

What if we had more than one leader at a time?

Consensus

A decision is taken by more than half of the processes, and we will make sure that the rest get the message.

We will show that we do NOT need a consistent leader at that point, but…

If we have 2 leaders, they might cause each other to abort.
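The reason for insisting on more than half of the processes can be checked by brute force: any two such groups share at least one process, while two groups of exactly half need not. The sketch below is only an illustration; the function names are hypothetical and do not come from the lecture.

```python
from itertools import combinations


def majorities(n):
    """Yield every group of process ids containing more than half of n processes."""
    for size in range(n // 2 + 1, n + 1):
        for group in combinations(range(n), size):
            yield frozenset(group)


def all_majorities_intersect(n):
    """Return True if every pair of majorities has at least one common process."""
    groups = list(majorities(n))
    return all(a & b for a in groups for b in groups)


def exact_halves_can_be_disjoint(n):
    """Return True if n processes split into two disjoint groups of exactly n/2."""
    first = frozenset(range(n // 2))
    second = frozenset(range(n // 2, n))
    return len(first) == len(second) == n // 2 and not (first & second)
```

Because two majorities always overlap, a later proposer that hears from a majority is guaranteed to reach at least one witness that took part in any earlier majority.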

Proposers & Witnesses

“Read”

• Make sure that more than half of the witnesses will not work with anyone whose round number is less than mine.

• Get the decided value, if one exists.

“Write”

• Set a value on more than half of the witnesses

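Message flow between a proposer and witness j:

Proposer → Witness: [“read”, k]

Witness: update readj

Witness → Proposer: [ackRead, k, writej, vj] or [nackRead, k]

Proposer: update v* or abort

Proposer → Witness: [“write”, k, v*]

Witness: update writej, vj

Witness → Proposer: [ackWrite, k] or [nackWrite, k]

Proposer: decide v* or abort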

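Consensus

Propose(v)

    k = k + n
    send [“read”, k] to all
    wait for replies from more than n/2 witnesses
    if any reply is [nackRead, k]: abort
    // every reply is [ackRead, k, k’, v’], where k’ and v’ are the witness’s writej and vj
    v* = the v’ with the largest k’, or v if no witness reported a value
    send [“write”, k, v*] to all
    wait for replies from more than n/2 witnesses
    if any reply is [nackWrite, k]: abort
    decide(v*)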
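Two steps of Propose(v) are easy to get wrong: keeping round numbers distinct across proposers, and choosing v* from the read replies. The sketch below isolates them as pure functions. The names are hypothetical. It also assumes that process i starts from k = i (the lecture only gives the step k = k + n) and that a witness that has never written reports None as its value.

```python
def next_round(k, n):
    """Advance a round number by n, as in the step k = k + n."""
    return k + n


def round_numbers(process_id, n, count):
    """First `count` rounds of one proposer, assuming it starts from k = process_id."""
    k = process_id  # assumption: the initial k is the process id
    rounds = []
    for _ in range(count):
        k = next_round(k, n)
        rounds.append(k)
    return rounds


def is_majority(replies, n):
    """True when strictly more than half of the n witnesses have replied."""
    return 2 * replies > n


def pick_value(acks, own_value):
    """Choose v* from ackRead replies given as (write_round, value) pairs.

    Takes the value with the largest write round, or own_value when no
    witness reported a value (value None is assumed to mean "nothing written").
    """
    written = [(write_round, value) for write_round, value in acks if value is not None]
    if not written:
        return own_value
    return max(written, key=lambda pair: pair[0])[1]
```

With n = 3, process 0 uses rounds 3, 6, 9 and process 1 uses 4, 7, 10, so two proposers never share a round number under this assumption.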

Upon receive [“read”, k]

    if k < readi or k < writei
        reply [nackRead, k]
    else
        readi = k
        reply [ackRead, k, writei, vi]

Upon receive [“write”, k, v*]

    if writei > k or readi > k
        reply [nackWrite, k]
    else
        writei = k
        vi = v*
        reply [ackWrite, k]
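The two witness handlers above can be sketched as a small class. This is an illustration only: the class name, the tuple encoding of messages, and the initial values (0 for both rounds, None for the value) are assumptions, not part of the lecture.

```python
class Witness:
    """One witness holding read_i, write_i and v_i."""

    def __init__(self):
        self.read = 0      # assumption: no round seen yet is encoded as 0
        self.write = 0     # assumption: no write yet is encoded as 0
        self.value = None  # assumption: None means no value written

    def on_read(self, k):
        """Handle ["read", k]: refuse rounds older than anything already seen."""
        if k < self.read or k < self.write:
            return ("nackRead", k)
        self.read = k
        return ("ackRead", k, self.write, self.value)

    def on_write(self, k, v):
        """Handle ["write", k, v*]: refuse if a newer round has read or written."""
        if self.write > k or self.read > k:
            return ("nackWrite", k)
        self.write = k
        self.value = v
        return ("ackWrite", k)
```

For example, if one witness receives a read for round 3, then a read for round 4, a later write for round 3 is answered with nackWrite because readi is already 4. This is how a second proposer can make the first one abort.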

Some notes about the Consensus algorithm

• Pi may propose a value and crash after its “write” without deciding; Pj can still decide that value later.

• When 2 leaders are proposing simultaneously, it is possible that neither of them decides.

• If fewer than half of the processes have answered the “write” query, we cannot be sure what the decided value will be. (It depends on whether the next proposer gets an answer from them or not.)

Total Order

• If Pi delivers m then eventually every correct process delivers m.

• If Pi delivers m, m’ in this order then Pj delivers m, m’ in the same order.
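The ordering property can be checked on two delivery logs. The sketch below is a hypothetical helper that assumes each message is delivered at most once per process and that messages are hashable. It only compares messages that both processes have already delivered, so it does not check the first property (that every message is eventually delivered everywhere).

```python
def same_relative_order(log_i, log_j):
    """Return True if messages present in both logs appear in the same order."""
    common = set(log_i) & set(log_j)
    order_i = [m for m in log_i if m in common]
    order_j = [m for m in log_j if m in common]
    return order_i == order_j
```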

Total Order

Can we do that without a leader?

For how long will we need that leader?

What if we had more than one leader?

The Paxos Algorithm

• Each process maintains the id of its current leader

• Proposing values is done through the leader

• The leader assigns order numbers to the messages and then uses Consensus to agree on the sequence.

The Paxos Algorithm

The messages proposed contain values and order numbers.

A leader may take care of a few orders at the same time.

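[Figure: Pi and Pj send messages m and m’ to the Leader. The Leader runs Propose(6, m) and Propose(7, m’) at the same time, obtains Decide(7, m*) and Decide(6, m’*), and sends (6, m’*) and (7, m*) to Pi and Pj.]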

Data Structures

• TO_Delivered[]

• TO_Undelivered[]

• AwaitToBeDelivered[] used upon delivery

• nextBatch

The Paxos Algorithm – leader

Converge(L, m)

    returned = abort
    while (returned == abort)
        returned = propose(L, m) // Repeat until decide
    send [decision, L, m] to all processes

Upon new message m

    Verify that m has not yet been delivered
    find k that does not have a Converge(k, *) active
    Converge(k, m)
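The leader's retry loop can be sketched as follows. The `propose` and `broadcast` callables, the `ABORT` sentinel and `first_free_slot` are hypothetical stand-ins for whatever consensus and messaging layer is used. One deliberate difference from the pseudocode is labeled in the code: the sketch broadcasts the value that consensus returned, on the assumption that the decided value may differ from the proposed one.

```python
ABORT = object()  # hypothetical sentinel returned by propose() when a round aborts


def converge(slot, m, propose, broadcast):
    """Repeat propose(slot, m) until it decides, then announce the decision."""
    returned = ABORT
    while returned is ABORT:
        returned = propose(slot, m)
    # Assumption: announce the decided value, which may not be m itself.
    broadcast(("decision", slot, returned))
    return returned


def first_free_slot(active_slots, start):
    """Return the smallest k >= start that has no Converge(k, *) active.

    Starting the search at `start` (for example nextBatch) is an assumption.
    """
    k = start
    while k in active_slots:
        k += 1
    return k
```

Note that the loop has no upper bound on retries, just like the pseudocode; with two competing leaders it may spin until the system stabilizes.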

The Paxos Algorithm – process

Upon new message m or leader change

    Verify that m has not yet been delivered
    Send TO_Undelivered + m to the leader

Upon receive [decision/update, kj, m] from Pj

    stop Converge(kj, *) if active
    if kj = nextBatch: deliver (kj, m) and return
    if kj < nextBatch: update Pj with the messages it is missing
    if kj > nextBatch:
        AwaitToBeDelivered[kj] = m // Will be used upon delivery
        send [update, nextBatch-1, TO_Delivered] to all in order to be updated
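The receive handler is essentially in-order delivery with a buffer for decisions that arrive early. The sketch below is a minimal illustration under several assumptions: nextBatch starts at 1, each order number carries one message, the update sent to a lagging Pj has the same shape as the broadcast update, and stopping an active Converge is left out. The Python names mirror TO_Delivered, AwaitToBeDelivered and nextBatch but are otherwise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessState:
    to_delivered: list = field(default_factory=list)           # TO_Delivered
    await_to_be_delivered: dict = field(default_factory=dict)  # AwaitToBeDelivered
    next_batch: int = 1                                        # nextBatch (assumed to start at 1)


def deliver(state, m):
    """Deliver m, then drain any buffered decisions that are now in order."""
    state.to_delivered.append(m)
    state.next_batch += 1
    while state.next_batch in state.await_to_be_delivered:
        state.to_delivered.append(state.await_to_be_delivered.pop(state.next_batch))
        state.next_batch += 1


def on_decision(state, sender, k, m):
    """Handle [decision/update, k, m] from sender; return (destination, message) pairs."""
    update = ("update", state.next_batch - 1, list(state.to_delivered))
    if k == state.next_batch:
        deliver(state, m)
        return []
    if k < state.next_batch:
        return [(sender, update)]       # the sender is behind: bring it up to date
    state.await_to_be_delivered[k] = m  # too early: keep it for later delivery
    return [("all", update)]            # tell everyone how far this process got
```

Returning the outgoing messages instead of sending them keeps the handler free of networking code, which makes the ordering logic easy to test in isolation.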

Fail Recovery

• Each process keeps readi, writei, vi, TO_Delivered and nextBatch on stable storage in order to recover consistently after a crash.

• If a leader proposes, crashes, recovers and proposes again, it might confuse the replies to the two proposals. Replies to the proposer should therefore contain the message they answer.

• A process should remember all the messages it has received and give the same answer to the same message, in case a proposer proposes twice with the same value.
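One common way to keep such state on stable storage is to write a temporary file, force it to disk, and then rename it over the old copy, so that a crash leaves either the old state or the new state but never a half-written file. The sketch below assumes the state fits in a JSON-serializable dict; the file layout and function names are hypothetical, and it does not sync the containing directory, which some file systems need for the rename itself to be durable.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write state to path so that a crash never leaves a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomically swap in the new state
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path, default):
    """Return the saved state, or default if nothing has been saved yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A witness would call save_state before sending any ackRead or ackWrite, so that a reply is never sent for state that could be lost in a crash.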

Tradeoffs

• If we know that most of the processes never crash, we can rely on them instead of using stable storage.

• If there are unstable processes that elect themselves as leader over and over, each process can store the leaders of all other processes. A process will then elect a leader only if most of the processes have elected that leader (assuming most processes never crash).
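The second rule amounts to a majority count over the leaders that each process knows the others have elected. A minimal sketch, assuming the stored information is a dict from process id to that process's current leader (the function name and data layout are hypothetical):

```python
from collections import Counter


def agreed_leader(views, n):
    """Return the leader elected by more than half of the n processes, or None.

    views maps a process id to the leader that process is known to have elected.
    """
    if not views:
        return None
    leader, votes = Counter(views.values()).most_common(1)[0]
    return leader if 2 * votes > n else None
```

With this rule, an unstable process that keeps electing itself is ignored unless a majority of the processes also name it as their leader.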