L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen!...

44
L20: Replicated state machines with Paxos Sam Madden 6.033 Spring 2014

Transcript of L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen!...

Page 1: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

L20:  Replicated  state  machines  with  Paxos  

     

Sam  Madden  6.033  Spring  2014  

Page 2: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Overall  Goal  

Building  a  stateful  server  (e.g.,  database)  that  remains  available  in  the  presence  of  node  failures  

Page 3: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Last  Eme:  Replicated  State  Machine  (RSM)  

•  Key  idea:    send  all  replicas  same  commands  in  same  order  – This  will  keep  them  in  sync  

•  Idea:    make  one  node  a  primary  (coordinator)  to  order  all  transacEons  and  send  to  others  

•  What  if  primary  fails?    Introduce  a  view  server  that  keeps  track  of  current  primary  /  replica  

Page 4: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

RSM  w/  View  Server  Example  

1:  R1,  R2  

View  Server  

Replica  1  

Replica  2  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

Backup    Primary  

Page 5: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  

View  Server  

Replica  1  

Replica  2  

Client  

Get  view  

Send  request  to  primary  

Send  request  to  backup  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

Page 6: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  

View  Server  

Replica  1  

Replica  2  

Client  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

Page 7: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

Send  request  to  primary  

Send  request  to  backup  

Page 8: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

View:  1:  R1,  R2  

Page 9: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  1:  R1,  R2  

View:  2:  R2,  -­‐-­‐  

View:  1:  R1,  R2  

Send  request  to  Replica  1  

Send  request  to  Replica  2  

Replica  2  refuses  to  process  request  

Error!  

Error!  

Moment  R2  learns  about  new  view  it  becomes  ac4ve  

Page 10: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  1:  R1,  R2  

View:  2:  R2,  -­‐-­‐  

View:  1:  R1,  R2  

Page 11: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  1:  R1,  R2  

View:  2:  R2,  -­‐-­‐  

View:  1:  R1,  R2  

Get  view  

Page 12: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  2:  R2,  -­‐-­‐  

View:  2:  R2,  -­‐-­‐  

View:  2:  R2,  -­‐-­‐  

Get  view  

Page 13: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

1:  R1,  R2  2:  R2,  -­‐-­‐  

View  Server  

Replica  1  

Replica  2  

Client  

View:  2:  R2,  -­‐-­‐  

View:  2:  R2,  -­‐-­‐  

View:  2:  R2,  -­‐-­‐  

Send  request  to  primary  

Send  request  to  backup  

Page 14: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Problem:    How  to  make  view  server  fault  tolerant  

Page 15: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Paxos  Goal  

N  >=  3  nodes,  trying  to  agree  about  some  value  (e.g.,  the  next  view  server  for  the  RSM  protocol)    

Page 16: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Paxos  properEes  

l  All  nodes  agree  on  a  value,  despite  node  failures,  network  failures,  delays  l  E.g.,  X  is  the  next  operaEon  to  execute  l  E.g.,  Y  is  the  next  primary    

l  Fault  tolerant:  succeeds  if  fewer  than  N/2  nodes  fail  l  Liveness  not  guaranteed    

l  AssumpEon:  nodes  are  fail-­‐stop  

Page 17: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Setup  

Servers  each  run  3  processes:    Proposer  –  proposes  new  values    Acceptor  –  accepts  (or  rejects)  proposals    Learner  –  client  waiEng  for  new  value  

   

Page 18: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Paxos  rule  l  If  a  majority  of  nodes  in  an  earlier  proposal  number  accepted  a  value,  later  proposals  must  accept  the  same  value  

 l  State  maintained  by  acceptor:  

l  Np:  largest  proposal  seen  in  prepare  l  Na:  largest  proposal  seen  in  accept  l  Va:  value  accepted  for  proposal  Na    

l  State  must  be  persistent  across  reboot  

Page 19: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Paxos  Propose(V): choose unique N, > Np send Prepare(N) to acceptors if Prepare_OK(Na, Va) from majority: V' = Va with highest Na, or V if none send Accept(N, V') to acceptors if Accept_OK(N) from majority: send Decided(V') to learners

Prepare(N): if N > Np: log Np = N reply Prepare_OK(Na, Va)

Accept(N, V): if N ≥ Np: log Na = N, log Va = V reply Accept_OK(Na, Va)

         Proposer  

Acceptor  

l  Np:  largest  proposal  seen  in  prepare  

l  Na:  largest  proposal  seen  in  accept  

l  Va:  value  accepted  for  proposal  Na  

Page 20: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Paxos  Propose(V): choose unique N, > Np send Prepare(N) to acceptors if Prepare_OK(Na, Va) from majority: V' = Va with highest Na, or V if none send Accept(N, V') to acceptors if Accept_OK(N) from majority: send Decided(V') to learners

Prepare(N): if N > Np: log Np = N reply Prepare_OK(Na, Va)

Accept(N, V): if N ≥ Np: log Na = N, log Va = V reply Accept_OK(Na, Va)

         Proposer  

Acceptor  

l  Np:  largest  proposal  seen  in  prepare  

l  Na:  largest  proposal  seen  in  accept  

l  Va:  value  accepted  for  proposal  Na  

Page 21: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Prep(10)  

Prep(10)  

Prep(10)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Example  1  

Page 22: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok,  None  

Ok,  None  

Ok,  None  

Page 23: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Acc(10,x)  

Acc(10,x)  

Acc(10,x)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Page 24: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok  

Ok  

Ok  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Commit  point  when  majority  of  acceptors  log  value  

Page 25: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors/Learners

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

A1—3  can  share  with  learners  

Dec(10,x)  

Dec(10,x)  

Dec(10,x)  

Page 26: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Prep(10)  

Prep(10)  

Prep(10)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Example  2  

Page 27: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

2

Prep(11)  

Prep(11)  

Prep(11)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Np=11  

Log  Np=11  

Log  Np=11  

Page 28: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok,  None  

Ok,  None  

Ok,  None  

2

Log  Np=11  

Log  Np=11  

Log  Np=11  

Page 29: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok,  None  

Ok,  None  

Ok,  None  

2

Log  Np=11  

Log  Np=11  

Log  Np=11  

Page 30: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

2

Log  Np=11  

Log  Np=11  

Log  Np=11  

P1  aeempts  commit  first,  fails  

Acc(10,x)  

Acc(10,x)  

Acc(10,x)  

Rejected  because  Np  >  10    (This  is  why  we  must  record  Np)    

Decided  Message  Omieed  for  Brevity  

Page 31: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Prep(10)  

Prep(10)  

Prep(10)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Example  3  

2

Page 32: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok,  None  

Ok,  None  

Ok,  None  

2

Page 33: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

2

Acc(10,x)  

Acc(10,x)  

Acc(10,x)  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Page 34: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok  

Ok  

Ok  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

2

Page 35: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

2

Prep(11)  

Prep(11)  

Prep(11)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Np=11  

Log  Np=11  

Log  Np=11  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Page 36: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Ok,  x  

Ok,  x  

Ok,  x  

2

New  proposer  learns  previously  commieed  value  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Np=11  

Log  Np=11  

Log  Np=11  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Decided  Message  Omieed  for  Brevity  

Page 37: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Prep(10)  

Prep(10)  

Prep(10)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Example  4  

2

Page 38: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok,  None  

Ok,  None  

Ok,  None  

2

Page 39: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

2

Acc(10,x)  

Acc(10,x)  

Acc(10,x)  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Page 40: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Log  Np=10  

Log  Np=10  

Log  Np=10  

Ok  

Ok  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  2

Page 41: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

2

Prep(11)  

Prep(11)  

Prep(11)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Np=11  

Log  Np=11  

Log  Np=11  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Page 42: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

Ok,  x  

Ok,  x  2

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Np=11  

Log  Np=11  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Np=11  

P2  learns  value  is  x;  has  to  propagate  it  even  if  hears  from  non-­‐majority    

Ok,  none  

Page 43: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

2

3

1 1

Proposers Acceptors

2

P2  learns  commieed  value  is  x,  tells  A3  

Acc(11,x)  

Acc(11,x)  

Acc(11,x)  

Log  Np=10  

Log  Np=10  

Log  Np=10  

Log  Np=11  

Log  Np=11  

Log  Na=10  

Log  Va=x  

Log  Na=10  

Log  Va=x  

Log  Np=11  

Log  Na=11  

Log  Va=x  

Log  Na=11  

Log  Va=x  

Log  Na=11  

Log  Va=x  

Decided  Message  Omieed  for  Brevity  

Page 44: L20:%Replicated%state%machines% with%Paxos%web.mit.edu/6.033/2014/ · Np:!largestproposal!seen! in!prepare!! Na:!largestproposal!seen! in!accept! Va:!value!accepted!for! proposal!Na.

Summary  

l   Consistency:  single-­‐copy  semanEcs    l  Replicated  state  machines  provide  single-­‐copy  

l  Key  issue:  agreeing  on  order  of  operaEons  l  Hard  case:  network  parEEon  

 l  Paxos  allows  replicas  to  reach  consensus,  in  presence  of  machine  and  network  failures  l  Widely  used  in  pracEce  [Chubby,  ZooKeeper,  etc.]