
Practical Byzantine Fault Tolerance

Castro and Liskov, SOSP '99

Why this paper?

• Kind of incredible that it's even possible
• Let alone a practical NFS implementation with it

• So far we've only considered the fail-stop model

• Quite a bit of research in this area
• Much less real-world deployment
• Most systems being built today don't span trust domains
• Hard to reason about the benefits once a node is compromised

What is Byzantine Behavior?

• Anything that doesn't follow our protocol.
• Malicious code/nodes.
• Buggy code.
• Faulty networks that deliver corrupted packets.
• Disks that corrupt, duplicate, lose, or fabricate data.
• Nodes impersonating others.
• Joining the cluster without permission.
• Operating when they shouldn't (e.g. unexpected clock drift).

• Servicing ops on a partition after the partition was given to another node
• Really wicked bad stuff: any arbitrary behavior.
• Subject to one restriction: independence; we'll come back to this.

Review: Primary/Backup

• Want linearizable semantics
• f+1 replicas to tolerate f failures
• Runs into problems when "view changes" are needed (Lab 2).

[Diagram: client sends put(X,1) to the primary, which forwards put(X,1) to each backup]

Review: Consensus

• Replicated log => replicated state machine
• All execute the same commands in the same order

• Consensus module ensures proper log replication
• Makes progress if any majority of servers are up
• 2f+1 servers to remain available with up to f failures

• Failure model: fail-stop (not Byzantine), delayed/lost messages

[Diagram: clients send commands (e.g. shl) to servers; each server runs a consensus module that replicates the log (add, jmp, mov, shl) feeding its state machine]

3f+1?

• At f+1 we can tolerate f failures and hold on to data.
• At 2f+1 we can tolerate f failures and remain available.
• What do we get for 3f+1?
• SMR that can tolerate f malicious or arbitrarily nasty failures

First, a Few Issues

1. Caveat: Independence
2. Spoofing/Authentication

The Caveat: Independence

• Assumes independent node failures for BFT!
• Is this a big assumption?
• We actually had this assumption with consensus
• If nodes fail in a correlated way it amplifies the loss of a single node
• If the factor is > f then the system still wedges.

• Put another way: for Paxos to remain available when software bugs can produce temporally related crashes, what do we need?
• 2f+1 independent implementations…

The Struggle for Independence

• Same here: for true independence we'll need 3f+1 implementations
• But it is more important here

1. Nodes may be actively malicious and that should be ok.
• But they are looking for our weak spot and will exploit it to amplify their effect.

2. If > f failures happen here, anything can happen to the data.
• An attacker might change it, delete it, etc… We'll never know.

• Requires different implementations, operating systems, root passwords, administrators. Ugh!

Spoofing/Authentication

[Diagram: client sends get(X) to the replicated service]

Malicious Primary?

• Might lie!

[Diagram: client sends get(X); the backups hold X=10, but the primary replies X=-1]

Malicious Primary?

• Might lie!
• Solution: direct responses from participants

[Diagram: client sends get(X); each replica responds directly, so the client sees X=10 from the backups alongside the primary's X=-1]

Malicious Primary?

• Might lie!
• Solution: direct responses from participants
• Problem again: the primary just lies more

[Diagram: client sends get(X); the primary now forges additional X=-1 responses so they appear to come from the backups too]

The Need for Crypto

• Need to be able to authenticate messages
• Public-key crypto for signatures
• Each client and server has a private and public key
• All hosts know all public keys
• Messages are signed with the sender's private key
• The public key can verify that a message came from the host holding the private key
• While we're on it: we'll need hashes/digests also
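As a concrete illustration of this setup, here is a minimal signing sketch in Python. It assumes the `cryptography` package and uses Ed25519 purely for illustration (the paper itself uses RSA signatures); the message bytes are made up.

```python
# A minimal sketch of the signing setup the slides assume (Ed25519 chosen for
# brevity; the scheme is our choice, not the paper's).
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature
import hashlib

# Each host generates a keypair; every host learns every public key out of band.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = b"REQUEST get(X) ts=17 client=C1"
signature = private_key.sign(message)        # signed with the private key

try:
    public_key.verify(signature, message)    # anyone with the public key can check it
    print("signature ok")
except InvalidSignature:
    print("forged or corrupted message")

# Digests ("h(req)" in later slides) are just cryptographic hashes of the request.
digest = hashlib.sha256(message).hexdigest()
```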

Authenticated Messages

• Client rejects duplicates or unknown signatures

[Diagram: client sends get(X) to S1, S2, S3; S2 and S3 each reply X=10 signed with their own keys; compromised S1 replies X=-1 signed S1, but cannot produce the forged replies (X=-1, signed S??) it would need in the others' names]

How is this possible? Why 3f+1?

• First, remember the rules
• Must be able to make progress with n - f responses
• n = 3f+1
• Progress with 3f+1 - f = 2f+1
• Often 4 total, progress with 3

• Why? In case those f will never respond
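The arithmetic behind those rules, written out as a tiny sketch (the variable names are ours):

```python
# Quorum arithmetic behind n = 3f+1: a sketch restating the counts above.
f = 1                      # number of Byzantine replicas we want to tolerate
n = 3 * f + 1              # total replicas (4 in the common case)

progress_quorum = n - f    # we must make progress without waiting on f replicas
assert progress_quorum == 2 * f + 1

# Among any 2f+1 responses, at most f come from faulty replicas,
# so at least f+1 are honest and matching, which outvotes the f liars.
honest_in_quorum = progress_quorum - f
assert honest_in_quorum == f + 1 and honest_in_quorum > f
print(f"n={n}, wait for {progress_quorum} replies, at least {honest_in_quorum} honest")
```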

Try 2f+1, f = 1

• Goal: make (safe) progress with only 2 of 3 responses.

[Diagram: C1 sends get(X) to S1, S2, S3; S3 is down (X=??); S1 and S2 reply X=10, so the client makes progress with 2 of 3 responses]

Try 2f+1, f = 1

• Problem: what if S3 wasn't down, but slow?
• Instead the failure is a compromised S2
• Client can wait for f+1 matching responses

[Diagram: C1 sends get(X); S1 replies X=10, compromised S2 replies X=-1, and S3 is merely slow; with only 3 replicas the client may not find f+1 = 2 matching responses in time]

Try 2f+1, f = 1

• Problem: what if S3 is behind and doesn't know the value of X yet?
• Can't distinguish the truth without f+1 known-good values
• Fix: replicate to at least 2f+1, tolerate f slow/down => 3f+1
• 2f+1 - f = f+1, enough to determine the truth in the face of f lies

[Diagram: C1 sends get(X); S1 replies X=10, compromised S2 replies X=-1, and S3 is behind and does not know X yet (X=??); the client cannot tell truth from lie]

3f+1

• Progress with only 2f+1 responses, and safe
• Among the 2f+1, only f can be bogus. f+1 > f.

[Diagram: C1 sends get(X) to S1–S4; S1, S2, S3 reply X=10 and S4 replies X=-1 (or is slow, X=??); among any 2f+1 = 3 replies at least f+1 = 2 match the correct value]

10,000 ft View

1. Client sends request to primary.
2. Primary sends request to all backups.
3. Replicas execute the request and send the reply to the client.
4. Client waits for f+1 responses with the same result.
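A sketch of step 4 from the client's point of view, assuming replies have already been signature-checked and de-duplicated; the `(replica_id, result)` tuples are a simplification of the real reply messages:

```python
# A sketch of the client accepting a result only once f+1 distinct replicas
# report the same value.
from collections import defaultdict

def await_result(replies, f):
    """replies: iterable of (replica_id, result) from verified, non-duplicate messages."""
    votes = defaultdict(set)                 # result -> set of replica ids backing it
    for replica_id, result in replies:
        votes[result].add(replica_id)
        if len(votes[result]) >= f + 1:      # f+1 matching replies cannot all be faulty
            return result
    return None                              # not enough agreement yet; keep waiting

print(await_result([("S1", 10), ("S4", -1), ("S2", 10)], f=1))   # -> 10
```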

Protocol Pieces

• Deal with failure of primaries
• View changes (Lab 2/4 style)
• Similar to Raft, VR

• Must order operations within a view
• Must ensure operations execute within their view

Views

• System goes through a series of views
• In view v, replica (v mod (3f+1)) is the designated primary
• Responsible for selecting the order of operations
• Assigns an increasing sequence number to each operation

• Tentative order subject to replicas accepting
• May get rejected if a new view is established
• Or if the order is inconsistent with prior operations
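A minimal sketch of how a view maps to a primary and how that primary hands out sequence numbers; the function names are ours, not the paper's:

```python
# Which replica is primary in view v, and the primary's increasing sequence numbers.
import itertools

f = 1
n = 3 * f + 1                        # replica ids 0..3f

def primary_for(view: int) -> int:
    return view % n                  # in view v, replica (v mod 3f+1) is primary

seqnos = itertools.count(1)          # the primary draws the next sequence number from here

print(primary_for(0), primary_for(1), primary_for(4))   # 0 1 0
print(next(seqnos), next(seqnos))                        # 1 2
```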

Request Handling Phases

• In normal-case operation, use a two-phase protocol for request r:
• Phase 1 (pre-prepare, prepare) goal:
• Ensure at least f+1 honest replicas agree that if request r executes in view v, it will execute with sequence number n

• Phase 2 (prepare, commit) goal:
• Ensure at least f+1 honest replicas agree that request r has executed in view v with sequence number n

• 2PC-like:
• Phase 1 quibbles about the order, Phase 2 about atomicity

Phase 1

• Client to Primary: {REQUEST, op, timestamp, clientId}_σc (signed by the client)

• Primary to Replicas: {PRE-PREPARE, view, seqn, h(req)}_σp, req (signed by the primary; the full request is sent alongside)

• Replicas to Replicas: {PREPARE, view, seqn, h(req), replicaId}_σri (each signed by the sending replica)
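To make the notation concrete, here is a sketch of these messages as plain Python records. Field names mirror the slide notation; `sig` stands in for the sender's signature, which is not modeled here:

```python
# Phase 1 messages as plain records (a sketch, not the paper's wire format).
from dataclasses import dataclass
import hashlib

def h(req: bytes) -> str:
    return hashlib.sha256(req).hexdigest()      # the digest h(req)

@dataclass(frozen=True)
class Request:          # client -> primary
    op: str
    timestamp: int
    client_id: str
    sig: bytes

@dataclass(frozen=True)
class PrePrepare:       # primary -> replicas (the full request rides alongside)
    view: int
    seqn: int
    digest: str
    sig: bytes

@dataclass(frozen=True)
class Prepare:          # replica -> replicas
    view: int
    seqn: int
    digest: str
    replica_id: int
    sig: bytes
```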

We define the committed and committed-local predicates as follows: committed(m, v, n) is true if and only if prepared(m, v, n, i) is true for all i in some set of f+1 non-faulty replicas; and committed-local(m, v, n, i) is true if and only if prepared(m, v, n, i) is true and i has accepted 2f+1 commits (possibly including its own) from different replicas that match the pre-prepare for m; a commit matches a pre-prepare if they have the same view, sequence number, and digest.

The commit phase ensures the following invariant: if committed-local(m, v, n, i) is true for some non-faulty replica i, then committed(m, v, n) is true. This invariant and the view-change protocol described in Section 4.4 ensure that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, it ensures that any request that commits locally at a non-faulty replica will commit at f+1 or more non-faulty replicas eventually.

Each replica i executes the operation requested by m after committed-local(m, v, n, i) is true and i's state reflects the sequential execution of all requests with lower sequence numbers. This ensures that all non-faulty replicas execute requests in the same order as required to provide the safety property. After executing the requested operation, replicas send a reply to the client. Replicas discard requests whose timestamp is lower than the timestamp in the last reply they sent to the client to guarantee exactly-once semantics.

We do not rely on ordered message delivery, and therefore it is possible for a replica to commit requests out of order. This does not matter since it keeps the pre-prepare, prepare, and commit messages logged until the corresponding request can be executed.

Figure 1 shows the operation of the algorithm in the normal case of no primary faults. Replica 0 is the primary, replica 3 is faulty, and C is the client.

[Figure 1: Normal Case Operation — request, pre-prepare, prepare, commit, and reply phases between client C and replicas 0–3, with replica 3 faulty]

4.3 Garbage Collection

This section discusses the mechanism used to discard messages from the log. For the safety condition to hold, messages must be kept in a replica's log until it knows that the requests they concern have been executed by at least f+1 non-faulty replicas and it can prove this to others in view changes. In addition, if some replica misses messages that were discarded by all non-faulty replicas, it will need to be brought up to date by transferring all or a portion of the service state. Therefore, replicas also need some proof that the state is correct.

Generating these proofs after executing every operation would be expensive. Instead, they are generated periodically, when a request with a sequence number divisible by some constant (e.g., 100) is executed. We will refer to the states produced by the execution of these requests as checkpoints and we will say that a checkpoint with a proof is a stable checkpoint.

A replica maintains several logical copies of the service state: the last stable checkpoint, zero or more checkpoints that are not stable, and a current state. Copy-on-write techniques can be used to reduce the space overhead to store the extra copies of the state, as discussed in Section 6.3.

The proof of correctness for a checkpoint is generated as follows. When a replica i produces a checkpoint, it multicasts a message {CHECKPOINT, n, d, i} to the other replicas, where n is the sequence number of the last request whose execution is reflected in the state and d is the digest of the state. Each replica collects checkpoint messages in its log until it has 2f+1 of them for sequence number n with the same digest d signed by different replicas (including possibly its own such message). These 2f+1 messages are the proof of correctness for the checkpoint.

A checkpoint with a proof becomes stable and the replica discards all pre-prepare, prepare, and commit messages with sequence number less than or equal to n from its log; it also discards all earlier checkpoints and checkpoint messages.

Computing the proofs is efficient because the digest can be computed using incremental cryptography [1] as discussed in Section 6.3, and proofs are generated rarely.

The checkpoint protocol is used to advance the low and high water marks (which limit what messages will be accepted). The low-water mark h is equal to the sequence number of the last stable checkpoint. The high water mark H = h + k, where k is big enough so that replicas do not stall waiting for a checkpoint to become stable. For example, if checkpoints are taken every 100 requests, k might be 200.

4.4 View Changes

The view-change protocol provides liveness by allowing the system to make progress when the primary fails. View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute. A backup is waiting for a request if it received a valid request…


Phase 1

• Each replica waits for a PRE-PREPARE + 2f matching PREPARE messages
• Puts these messages in its log
• Then we say prepared(req, v, n, i) is TRUE
• If prepared(req, v, n, i) is TRUE for honest replica ri, then prepared(req', v, n, j) where req' != req is FALSE for any honest rj
• So no other operation can execute with view v, sequence number n
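A sketch of the prepared predicate as a log scan, reusing the `PrePrepare`/`Prepare` records from the earlier sketch; `log` is assumed to hold only signature-verified messages:

```python
# prepared(req, v, n, i): the request's PRE-PREPARE plus 2f matching PREPAREs
# from distinct replicas must be in replica i's log.
def prepared(log, digest, view, seqn, f) -> bool:
    has_pre_prepare = any(
        isinstance(m, PrePrepare) and (m.view, m.seqn, m.digest) == (view, seqn, digest)
        for m in log
    )
    prepare_senders = {
        m.replica_id for m in log
        if isinstance(m, Prepare) and (m.view, m.seqn, m.digest) == (view, seqn, digest)
    }
    return has_pre_prepare and len(prepare_senders) >= 2 * f
```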


Why No Double Prepares?

prepared(req, v, n, i) → not prepared(req', v, n, j) for honest ri and rj
The honest intersection of maximally disjoint 2f+1 sets is non-empty

[Diagram: two 2f+1 quorums drawn over the 3f+1 replicas; their intersection contains at least f+1 replicas, hence at least one honest one]
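The counting argument, spelled out (pure arithmetic, no protocol machinery):

```python
# Any two 2f+1 quorums drawn from 3f+1 replicas overlap in at least f+1
# replicas, so at least one honest replica is in both, and it never PREPAREs
# two different requests at the same (view, seqn).
f = 1
n = 3 * f + 1
quorum = 2 * f + 1

min_overlap = 2 * quorum - n      # inclusion-exclusion lower bound on the intersection
assert min_overlap == f + 1       # at least f+1 replicas in common
assert min_overlap - f >= 1       # even if f of those are faulty, one honest remains
```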

Phase 2

• Problem: just because some other req' won't execute at (v, n) doesn't mean req will

Problem: Prepared != Committed

• S3 prepared, but couldn't get its PREPARE out
• S2 becomes primary in the new view
• Can't find a PRE-PREPARE + 2f PREPAREs in any log

• Prepares seen: S1: {S1, S2}, S2: {S1, S2}, S4: {}
• The new primary must fill the 'hole' so the log can move forward

[Diagram: C and replicas S1–S4; S3 prepares but its PREPARE messages are lost; after the view change the new primary sees prepares only at S1 and S2 and none at S4, and its NEW-VIEW must fill the hole]

Phase 2

• Make sure the op doesn't execute until prepared(req, v, n, i) is TRUE for f+1 non-faulty replicas
• We say committed(req, v, n) is TRUE when this property holds
• How does a replica know committed(req, v, n) holds?
• Add one more message: ri -> R {COMMIT, view, seqno, h(req), replicaId}
• Once 2f+1 COMMITs arrive at a node, apply the op and respond to the client
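Continuing the earlier sketch, committed-local at a replica can be checked the same way, now also requiring 2f+1 matching COMMITs; the `Commit` record and `committed_local` name are ours:

```python
# committed-local at replica i: prepared locally, plus 2f+1 matching COMMITs
# (possibly including our own) in the log. Reuses prepared() from above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    view: int
    seqn: int
    digest: str
    replica_id: int
    sig: bytes

def committed_local(log, digest, view, seqn, f) -> bool:
    commit_senders = {
        m.replica_id for m in log
        if isinstance(m, Commit) and (m.view, m.seqn, m.digest) == (view, seqn, digest)
    }
    return prepared(log, digest, view, seqn, f) and len(commit_senders) >= 2 * f + 1
```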

View Changes

• Allows progress if the primary fails (or is slow)
• If an operation on a backup has been pending for a long time, send {VIEW-CHANGE, view+1, seqn, ChkPointMsgs, P, i}_σi
• The new primary issues NEW-VIEW once it has 2f VC msgs

• Includes the signed VIEW-CHANGEs as proof it can change the view
• Q: What goes wrong without this?

• Then, for each seqno since the lowest stable checkpoint:
• Use P from above: the set of sets of PRE-PREPARE + 2f PREPAREs
• For a seqno with a valid PRE-PREPARE + 2f PREPAREs, reissue the PRE-PREPARE in v+1

• For a seqno not in P, issue {PRE-PREPARE, v+1, seqno, null} (a no-op)

• Once committed, at least f+1 non-faulty replicas have agreed on the operation and its placement in the total order of operations
• Even across view changes
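A rough sketch of how the new primary might rebuild the ordering for view v+1 from P, as described in the view-change bullets above; representing the no-op with a "null" digest is an assumption of this sketch, not the paper's encoding:

```python
# Reissue each request proven prepared in P, and plug gaps with null (no-op)
# pre-prepares so the sequence numbers stay dense in the new view.
def new_view_pre_prepares(P, low_seqn, high_seqn, new_view):
    """P: dict mapping seqn -> digest of the request proven prepared at that seqn."""
    reissued = []
    for seqn in range(low_seqn + 1, high_seqn + 1):
        digest = P.get(seqn)                   # None means nothing prepared at this seqn
        reissued.append(PrePrepare(view=new_view, seqn=seqn,
                                   digest=digest if digest else "null",
                                   sig=b""))   # signature omitted in this sketch
    return reissued
```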

Checkpoints / GC

• Need to occasionally snapshot the SM and truncate the log
• Problem: how can one replica trust the checkpoint of another?
• Idea: whenever seqn mod 100 == 0, broadcast {CHECKPOINT, seqn, h(state), i}_σi
• Once 2f+1 CHECKPOINTs have been collected, the CHECKPOINT at seqn with the correct digest can be trusted (at least f+1 non-faulty servers have a correct checkpoint at seqn)
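A sketch of the checkpoint flow under these rules; the `broadcast` callback, vote bookkeeping, and unsigned messages are simplifications of this sketch:

```python
# Take a checkpoint every 100 requests; treat one as stable (trustworthy) once
# 2f+1 replicas have reported CHECKPOINT for the same seqn and state digest.
import hashlib
from collections import defaultdict

CHECKPOINT_INTERVAL = 100
checkpoint_votes = defaultdict(set)            # (seqn, digest) -> replica ids

def maybe_checkpoint(seqn, state_bytes, my_id, broadcast):
    if seqn % CHECKPOINT_INTERVAL == 0:
        digest = hashlib.sha256(state_bytes).hexdigest()
        broadcast(("CHECKPOINT", seqn, digest, my_id))   # signed in the real protocol

def on_checkpoint(seqn, digest, sender_id, f):
    checkpoint_votes[(seqn, digest)].add(sender_id)
    if len(checkpoint_votes[(seqn, digest)]) >= 2 * f + 1:
        return True          # stable: safe to truncate the log up to seqn
    return False
```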

Liveness – View Changes

• Interesting issue: can't let a single node start a view change!
• Why? It could livelock the system by spamming view changes.
• Resolution: wait for f+1 servers to time out and independently send VIEW-CHANGE requests.
• Interacts with an optimization: to help view changes succeed, any node that receives more than f+1 VIEW-CHANGE requests issues one as well.
• This prevents cases where nodes time out slowly and the oldest VIEW-CHANGE issuer rolls over to VIEW-CHANGE v+2.

• Have to be careful still: need to wait on this optimization until f+1 VIEW-CHANGEs away from v.

• Why? Otherwise we might be doing the bidding of a malicious node.

Discussion

• What problem does this solve?
• Would your boss be ok with 4 designs/implementations?
• How can the system tolerate more than f (non-simultaneous) failures over its lifetime?
• Periodically recover each server? Could help some…
• What if a private key is compromised?

• Important point: it is possible to operate in the face of Byzantine faults
• Maybe even efficiently

Performance Tricks

• Don't have all replicas respond with operation results, just digests
• Only the primary has to give the full result

• Delays: client to primary, pre-prepare, prepare, commit, reply
• Idea: commit prepared operations tentatively.
• If wrong, roll back.
• Operations are unlikely to fail to commit if they prepare successfully.

• Tentatively execute reads against tentative operations, but withhold the reply until all operations read from have committed.

Crypto

• Can't afford digital signatures on all messages to authenticate
• Instead all pairs of hosts share a secret key
• Send a MAC of each message (h(m + secret key)) to verify integrity and authenticity.
• Problem: what about messages with multiple recipients?

• e.g. a client operation request message?
• Can't let faulty nodes spoof operations.
• Put a vector of MACs in the message, one for every node in the system.

• Probably 4 or 7 hosts. Constant time to verify, linear to generate.
• Even with 37 replicas, MAC vectors are still ~100x faster to generate than a 1024-bit RSA signature.

• The output is also smaller than a 1024-bit signature.
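A sketch of the MAC-vector idea using Python's standard `hmac` module as a stand-in for the paper's MAC construction; the pairwise keys here are made-up placeholders and key distribution is assumed:

```python
# With a shared secret per pair of hosts, a multicast message carries one HMAC
# per recipient instead of a single public-key signature.
import hmac, hashlib

pairwise_keys = {"S1": b"k1", "S2": b"k2", "S3": b"k3", "S4": b"k4"}   # sender's key with each host

def mac_vector(message: bytes) -> dict:
    return {host: hmac.new(key, message, hashlib.sha256).hexdigest()
            for host, key in pairwise_keys.items()}

def verify_as(host: str, message: bytes, vector: dict) -> bool:
    expected = hmac.new(pairwise_keys[host], message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(vector.get(host, ""), expected)

msg = b"REQUEST put(X,1) ts=18 client=C1"
vec = mac_vector(msg)
print(verify_as("S2", msg, vec))   # True for an untampered message
```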

Why Pre-prepare, Prepare, Commit?
• Pre-prepare

• Broadcasts view number, seqno, and the message digest.
• A backup accepts:

• If the digest is ok for the message
• The backup is in the same view
• It hasn't accepted a pre-prepare for this seqno in this view with a different digest.

• If it accepts, it broadcasts a prepare
• Prepare
• Commit

• Similar to our decided; informs everyone of the chosen value
• Difference: can't take the sender's word for it, need proof that the cluster agrees.
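The backup's accept rule for a PRE-PREPARE, as a sketch mirroring the three conditions above (reusing `h` and the `PrePrepare` record from the earlier message sketch; `accepted_digests` is our bookkeeping, not the paper's):

```python
# Accept a PRE-PREPARE only if the digest matches the request, we are in the
# same view, and no conflicting pre-prepare was already accepted at this
# (view, seqn). accepted_digests maps (view, seqn) -> digest already accepted.
def accept_pre_prepare(msg, req_bytes, current_view, accepted_digests):
    if msg.view != current_view:                       # must be in the same view
        return False
    if msg.digest != h(req_bytes):                     # digest must match the request
        return False
    prior = accepted_digests.get((msg.view, msg.seqn))
    if prior is not None and prior != msg.digest:      # no conflicting pre-prepare
        return False
    accepted_digests[(msg.view, msg.seqn)] = msg.digest
    return True                                        # accept, then broadcast PREPARE
```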

Phase 2

• Just because some other req' won't execute at (v, n) doesn't mean req will
• Suppose ri is compromised right after prepared(req, v, n, i)
• Suppose no other replica received ri's PREPARE
• Suppose f replicas are slow and never even received the PRE-PREPARE
• No other honest replica will know the request prepared!
• In particular, if the primary p fails, the request might not get executed!