Fault Tolerance - Computer...

65
1/65 Fault Tolerance

Transcript of Fault Tolerance - Computer...

Page 1: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

1/65

Fault Tolerance

Page 2: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

2/65

Fault Tolerance

I Fault tolerance is the ability of a distributed system to provideits services even in the presence of faults.

I A distributed system should be able to recover automaticallyfrom partial failures without seriously affecting availability andperformance.

Page 3: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

3/65

Fault Tolerance: Basic Concepts (1)

I Being fault tolerant is strongly related to what are calleddependable systems. Dependability implies the following:

I Availability: The percentage of the time the system isoperational.

I Reliability: The time interval for which the system isoperational.

I SafetyI Maintainability

I In-class Exercise: How is is availability different fromreliability? How about a system that goes down for 1millisecond every hour? How about a system that never goesdown but has to be shut down two weeks every year?

Page 4: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

4/65

Fault Tolerance: Basic Concepts (2)

I A system is said to fail when it cannot meet its promises. Forexample, if it cannot provide a service (or only provide itpartially).

I An error is a part of the system’s state that may lead tofailure. The cause of an error is called a fault.

I Types of faults:I Transient: Occurs once and then disappears. If a operation is

repeated, the fault goes away.I Intermittent: Occurs, then vanishes of its own accord, then

reappears and so on.I Permanent: Continues to exist until the faulty component is

replaced.

I In-class Exercise: Give examples of each type of fault foryour project!

Page 5: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

5/65

Failure Models (1)

Page 6: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

6/65

Failure Models (2)

I Fail-stop: failures that can be reliably detectedI Fail-noisy: failures that are eventually detectedI Fail-silent: we cannot detect between a crash failure or

omission failureI Fail-safe: failures that do no harmI Fail-arbitrary or Byzantine Failures:

I Arbitrary failures where a server is producing output that itshould never have produced, but which cannot be detected asbeing incorrect.

I A faulty server may even be working with other servers toproduce intentionally wrong answers!

Page 7: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

7/65

Failure Masking by Redundancy

I Information Redundancy. For example, adding extra bits (likein Hamming Codes, see the book Coding and InformationTheory) to allow recovery from garbled bits.

I Time Redundancy. Repeat actions if need be.I Physical Redundancy. Extra equipment or processes are added

to make the system tolerate loss of some components.

Page 8: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

8/65

Process Resilience

Achieved by replicating processes into groups.

I How to design fault-tolerant groups?I How to reach an agreement within a group when some

members cannot be trusted to give correct answers?

Page 9: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

9/65

Flat Groups Versus Hierarchical Groups

Page 10: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

10/65

Failure Masking Via Replication

I Primary-backup protocol. A primary coordinates all writeoperations. If it fails, then the others hold an election toreplace the primary.

I Replicated-write protocols. Active replication as well asquorum based protocols. Corresponds to a flat group.

I A system is said to be k fault tolerant if it can survive faults ink components and still meet its specifications.

I For fail-silent components, k+1 are enough to be k faulttolerant.

I For Byzantine failures, at least 2k+1 extra components areneeded to achieve k fault tolerance.

I Requires atomic multicasting: all requests arrive at all serversin same order. This can be relaxed to just be for writeoperations.

Page 11: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

11/65

Consensus in Faulty Systems

The model: a very large collection of clients send commands to agroup of processes that jointly behave as a single, highly robustprocess. To make this work, we need to make an importantassumption:In a fault-tolerant process group, each nonfaulty process executesthe same commands, in the same order, as every other nonfaultyprocess.

I Flooding Consensus (for fail-stop failures)I Paxos Consensus (for fail-noisy failures)I Byzantine Agreement (for arbitrary failures)

Page 12: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

12/65

Flooding Consensus

I A client contacts a group member (P1, . . . ,Pn) requesting it toexecute a command. Every group member maintains a list ofcommands: some from clients and some from other group members.

I In each round, a process Pi sends its list of proposed commands it hasseen to every other process. At the end of a round, each processmerges all received proposed commands into a new list, from which itdeterministically selects a new command to execute (if possible)

I A process will move on to the next round without having made adecision, only when it detects that another process has failed.

P4

P3

P2

P1

decide

decide

decide

Page 13: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

13/65

Consensus in Systems with Arbitrary Faults

I Two Army ProblemI Non-faulty generals with unreliable communication.

Page 14: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

14/65

Byzantine Agreement Problem (1)

Problem:I Red army in the valley, n blue generals each with their own

army surrounding the red army.I Communication is pairwise, instantaneous and perfect.I However m of the blue generals are traitors (faulty processes)

and are actively trying to prevent the loyal generals fromreaching agreement. The generals know the value m.

Goal: The generals need to exchange their troop strengths. At theend of the algorithm, each general has a vector of length n. If ithgeneral is loyal, then the ith element has their correct troopstrength otherwise it is undefined.Conditions for a solution:

I All loyal generals decide upon the same plan of actionI A small number of traitors cannot cause the loyal generals to

adopt a bad plan

Page 15: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

15/65

Byzantine Agreement Problem (2)

Setup:

I Assume that we have n processes, where each process i will provide avalue vi to other processes. The goal is to let each process construct avector of length n such that if process i is non-faulty, V [i ] = vi .Otherwise, V [i ] is undefined.

I We assume that there are at most m faulty processes.

Algorithm:

Step 1 Every non-faulty process i sends vi to every other process usingreliable unicasting. Faulty processes may send anything.

I Moreover, since we are using unicasting, the faulty processes may senddifferent values to different processes.

Step 2 The results of the announcements of Step 1 are collected together inthe form of a vector of length n.

Step 3 Every process passes its vector from Step 2 to every other process.Step 4 Each process examines the ith element of each of the newly received

vectors. If any value has a majority, that value is put into the resultvector. If no value has a majority, the corresponding element of theresult vector is set to UNKNOWN.

Page 16: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

16/65

Byzantine Agreement Example (1)

I We have 2 loyal generals and one traitor.I Note that process 1 sees a value of 2 from process 2 (process 2

vector) but sees a value of b for process 2 (process 3 vector). Butprocess 1 has no way of knowing whether process 2 or process 3 is atraitor. So it cannot decide the right value for V [2].

I Similarly, process 2 cannot ascertain the right value for V [1]. Hence,they cannot reach agreement.

I For m faulty processes, we need 2m+1 non-faulty processes toovercome the traitors. In total, we need a total of 3m+1 processes toreach agreement.

Page 17: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

17/65

Byzantine Agreement Example (2)

The Byzantine generals problem for 3 loyal generals and 1 traitor:(a) The generals announce their troop strengths (let’s say, in units

of 1 kilo soldiers).(b) The vectors that each general assembles based on previous

step.(c) The vectors that each general receives.(d) If a value has a majority, then we know it correctly, else it is

unknown. In this case, we can reach agreement amongst thenon-faulty processes.

Page 18: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

18/65

Limitations on Reaching Consensus (1)

The general goal of distributed consensus algorithms is to have allthe non-faulty processes reach consensus on some issue, and toestablish that consensus within a finite number of steps. Possiblecases:

I Synchronous versus asynchronous systemsI Communication delay is bounded or notI Message delivery from the same sender is ordered or notI Message transmission is done through unicasting or

multicasting

Page 19: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

19/65

Limitations on Reaching Consensus (2)

Synchronous

Asynchronous

OrderedUnordered

Bounded

Bounded

Unbounded

Unbounded

Unicast UnicastMulticast Multicast

X X

X

X

X

X

X

X

Co

mm

un

ica

tion

de

layP

roc

es

s b

eh

av

ior

Message ordering

Message transmission

I Most distributed systems in practice assume that processesbehave asynchronously, message transmission is unicast, andcommunication delays are unbounded.

Page 20: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

20/65

The CAP Theorem

I CAP Theorem (aka Brewer’s Theorem): states that it isimpossible for a distributed computer system to simultaneouslyprovide all three of the following guarantees:

I Consistency (all nodes see the same data at the same time)I Availability (a guarantee that every request receives a response

about whether it was successful or failed)I Partition tolerance (the system continues to operate despite

arbitrary message loss or failure of part of the system)

This is a chord, this is another, this is a third. Now form a band!We know three chords but you can only pick two

Brewer’s CAP Theorem (julianbrowne.com)

Page 21: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

21/65

Dealing with CAP

I Drop Partition ToleranceI Drop AvailabilityI Drop ConsistencyI The BASE Jump: The notion of accepting eventual

consistency is supported via an architectural approach knownas BASE (Basically Available, Soft-state, Eventuallyconsistent). BASE, as its name indicates, is the logicalopposite of ACID.

I Design around it!

Page 22: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

22/65

Reliable Client-Server Communication

RMI semantics in the presence of failures.

I The client is unable to locate the server.I The request message from the client to the server is lost.I The server crashes after receiving a request.I The reply message from the server to the client is lost.I The client crashes after sending a request.

Page 23: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

23/65

RPC Semantics in the Presence of Failures

I RPC failuresI Client cannot locate server: Raise an exception or send a signal

to client leading to loss in transparencyI Lost request messages: Start a timer when sending a request.

If timer expires before a reply is received, send the requestagain. Server would need to detect duplicate requests

I Server crashes: Server crashes before or after executing therequest is indistinguishable from the client side...

I Possible RPC semanticsI Exactly once semanticsI At least once semanticsI At most once semanticsI Guarantee nothing semantics

Page 24: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

24/65

Server Crash (1)

I A server in client-server communication(a) Normal case(b) Crash after execution(c) Crash before execution

Page 25: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

25/65

Server Crash (2)

I Server: Prints text on receiving request from client and sendsmessage to client after text is printed.

I Send a completion message just before it actually tells theprinter to do its work

I Or after the text has been printed

I Client: After a server crashes and recovers.I Never to reissue a request.I Always reissue a request.I Reissue a request only if it did not receive an acknowledgment

of its request being delivered to the server.I Reissue a request only if it has not received an

acknowledgment of its print request.

Page 26: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

26/65

Server Crash (3)

I Three events that can happen at the server:I Send the completion message (M)I Print the text (P)I Crash (C)

I These events can occur in six different orderings:

Page 27: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

27/65

Server Crash (4)

1. M → P → C : A crash occurs after sending the completionmessage and printing the text.

2. M → C (→ P): A crash happens after sending the completionmessage, but before the text could be printed.

3. P →M → C : A crash occurs after sending the completionmessage and printing the text.

4. P → C (→M): The text printed, after which a crash occursbefore the completion message could be sent.

5. C (→ P →M): A crash happens before the server could doanything.

6. C (→M → P): A crash happens before the server could doanything.

Page 28: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

28/65

Server Crash (5)

I M: send the completion message, P: print the text, C : server crash

I OK (text is printed once), DUP (text is printed twice), ZERO (textis not printed at all)

Page 29: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

29/65

RPC Semantics in the Presence of Failures (2)

I Lost Reply Messages. Set a timer on client. If it expireswithout a reply, then send the request again. If requests areidempotent, then they can be repeated again without ill-effects

I Client Crashes. Creates orphans. An orphan is an activecomputation on the server for which there is no client waiting.How to deal with orphans:

I Extermination. Client logs each request in a file before sendingit. After a reboot the file is checked and the orphan is explicitlykilled off. Expensive, cannot locate grand-orphans etc.

I Reincarnation. Divide time into sequentially numbered epochs.When a client reboots, it broadcasts a message declaring a newepoch. This allows servers to terminate orphan computations.

I Gentle Reincarnation. A server tries to locate the owner oforphans before killing the computation.

I Expiration. Each RPC is given a quantum of time to finish itsjob. If it cannot finish, then it asks for another quantum. Aftera crash, a client need only wait for a quantum to make sure allorphans are gone.

Page 30: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

30/65

Idempotent Operations

I An idempotent operation is one that can be repeated as oftenas necessary without any harm being done. E.g. reading ablock from a file.

I In general, try to make RPC/RMI methods be idempotent ifpossible. If not, it can be dealt with in a couple of ways.

I Use a sequence number with each request so server can detectduplicates. But now the server needs to keep state for eachclient....

I Have a bit in the message to distinguish between original andduplicate transmission

Page 31: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

31/65

Reliable Group Communication

I Reliable multicasting guarantees that messages are delivered toall members in a process group. Reliable multicasting turnsout to be tricky.

I Basic reliable multicasting schemesI Scalable reliable multicasting

I Non-hierarchical feedback controlI Hierarchical feedback control

I Atomic multicasting using Virtual Synchrony

Page 32: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

32/65

Basic Reliable Multicasting Schemes (1)

I Reliable point-to-point channels are available but reliablecommunication to a group of processes is rarely built-in to thetransport layer. For example, multicasting uses datagrams,which are not reliable.

I Few processes: Set up reliable point-to-point channels.Inefficient for more processes.

I Issues in reliable multicasting:I What does reliable multicasting mean?I What if a process joins during the communication?I What happens if a sending process crashes during

communication?I How to reach agreement on what does the group look like?

Page 33: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

33/65

Basic Reliable Multicasting Schemes (2)

Assume that processes do not fail and join or leave the groupduring communication so group membership is known.

I Sending process assigns a sequence number to each message itmulticasts. Messages are received in the order which they weresent.

I Each multicast message is kept in a history buffer at thesender. Assuming that the sender knows the receivers, thesender simply keeps the message until all receivers havereturned the acknowledgment (Ack).

I Sender retransmits on a negative Ack or on timeout before allAcks were received. Acks can be piggy-backed.Retransmissions can be done with point-to-pointcommunication.

Page 34: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

34/65

Basic Reliable Multicasting Schemes (3)

I A simple solution to reliable multicasting when all receivers areknown and are assumed not to fail.

I (a) Message transmission. (b) Reporting feedback.

Page 35: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

35/65

Scalability in Reliable Multicasting

I Negative acknowledgments: A receiver returns feedback only ifit is missing a message. This improves scalability by cuttingdown on the number of messages. However this forces thesender to keep a message in its buffer forever (so we need touse timeouts for the buffer)

I Nonhierarchical feedback control: Feedback suppression viamulticasting of negative feedback.

I Hierarchical feedback control: Use subgroups and coordinatorsin each subgroup.

Page 36: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

36/65

Nonhierarchical Feedback Control

I Several receivers have scheduled a request for retransmissionwith random delays, but the first retransmission request leadsto the suppression of others.

Page 37: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

37/65

Hierarchical Feedback Control

I The essence of hierarchical reliable multicasting. Each localcoordinator forwards the message to its children and laterhandles retransmission requests.

Page 38: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

38/65

Reliable Atomic Multicast

I The atomic multicast setup to achieve reliable multicasting inthe presence of process failures requires the followingconditions:

I A message is delivered to all processes or to none at all.I All messages are delivered in the same order to all processes.

I For example, this solves the problem of a replicated databaseon top of a distributed system.

I Atomic multicasting ensures that non-faulty processesmaintain a consistent view of the database, and forcesreconciliation when a replica recovers and rejoins the group.

Page 39: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

39/65

Virtual Synchrony (1)

I The logical organization of a distributed system to distinguishbetween message receipt and message delivery.

Page 40: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

40/65

Virtual Synchrony (2)

I A multicast message m is uniquely associated with a list ofprocesses to which it should be delivered. This delivery listcorresponds to a group view. Each process on the list has thesame view.

I A view change takes place by multicasting a message vcannouncing the joining or leaving of a process.

I Suppose m and vc are simultaneously in transit. We need toguarantee that m is either delivered to all processes in thegroup view G before each of them is delivered message vc, orm is not delivered at all.

I Note that m being not delivered is because the sender of mcrashed.

Page 41: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

41/65

Virtual Synchrony (3)

I A reliable multicast is said to be virtually synchronous if it hasthe following properties:

I A message multicast to group view G is delivered to eachnon-faulty process in G.

I If a sender of the message crashes during the multicast, themessage may either be delivered to all processes, or ignored byeach of them.

I The principle is that all multicasts take place between viewchanges.

I All multicasts that are in transit when a view change takesplace are completed before the view change comes into effect.

Page 42: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

42/65

Virtual Synchrony (4)

I The principle of virtual synchronous multicast.

Page 43: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

43/65

Message Ordering in Multicasting (1)

Virtual synchrony allows us to think about multicasts as takingplace in epochs. But we can have several possible orderings of themulticasts:

I Unordered multicastsI FIFO-ordered multicastsI Causally-ordered multicasts (requires vector timestamps)I Totally-ordered multicasts

Page 44: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

44/65

Message Ordering in Multicasting (2)

I Three communicating processes in the same group. Theordering of events per process is shown along the vertical axis.This shows unordered multicasts.

Page 45: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

45/65

Message Ordering in Multicasting (3)

I Four processes in the same group with two different senders,and a possible delivery order of messages under FIFO-orderedmulticasting.

Page 46: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

46/65

Message Ordering in Multicasting (4)

I Six different versions of virtually synchronous reliablemulticasting.

Page 47: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

47/65

Implementing Virtual Synchrony (1)

I Assume that we have reliable point to point communication(e.g. TCP) and that messages from the same source arereceived in the same order as sent.

I Every process in G keeps message m until it knows for surethat all member in G have received it. If m has been receivedby all members in G, then m is said to be stable. Only stablemessages are allowed to be delivered.

I To ensure stability, it is sufficient to pick an arbitrary processin G and request it to send m to all other processes. Thatarbitrary process can be the coordinator.

I Assumes that no process crashes during a view change(although it can be generalized to handle that as well).

Page 48: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

48/65

Implementing Virtual Synchrony (2)

(a) Process 4 notices that process 7 has crashed, sends a viewchange.

(b) Process 6 sends out all its unstable messages, followed by aflush message.

(c) Process 6 installs the new view when it has received a flushmessage from everyone else.

Page 49: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

49/65

Distributed Commit

I The distributed commit problem involves having an operationbeing performed by each member of a group or none at all.

I Examples:I Reliable multicasting is a specific example with the operation

being the delivery of a message.I A distributed transaction includes one or more statements

that, individually or as a group, update data on two or moredistinct nodes of a distributed database.

I Solutions (these all use a coordinator):I One-phase commit: Coordinator tells all processes to commit

or not. No way for coordinator to know if commit cannot bedone locally.

I Two-phase commit: Coordinator orchestrates the commit usingtwo phases. Does not work if coordinator crashes during thecommit process.

I Three-phase commit: Can work even if the coordinatorcrashes, but rarely used in practice.

Page 50: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

50/65

Two Phase Commit (1)

(a) The finite state machine for the coordinator in Two PhaseCommit (2PC).

(b) The finite state machine for a participant.

Page 51: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

51/65

Two Phase Commit (2)

I The protocol can fail if a process crashes for other processesmay be waiting indefinitely for a message from the crashedprocess. We deal with this using timeouts.

I Participant blocked in INIT: A participant is waiting for aVOTE_REQUEST message from the coordinator. On a timeout,it can locally abort the transaction and thus send aVOTE_ABORT message to the coordinator.

I Coordinator blocked in WAIT: If it doesn’t get all the votes, itvotes for an abort and sends a GLOBAL_ABORT to allparticipants.

I Participant blocked in READY: Participant cannot simply decideto abort. It needs to know what message was sent by thecoordinator.

I We can simply block until the coordinator recovers.I Or contact another participant Q to see if it can decide from

Q’s state what to do. Four cases to deal with here that aresummarized on next slide.

Page 52: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

52/65

Two Phase Commit (3)

I Actions taken by a participant P when residing in stateREADY and having contacted another participant Q.

Page 53: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

53/65

Two Phase Commit (4)

I actions by coordinator

write START_2PC local log;multicast VOTE_REQUEST to all participants;while not all votes have been collected {

wait for any incoming vote;if timeout {

write GLOBAL_ABORT to local log;multicast GLOBAL_ABORT to all participants;exit;

}record vote;

}if all participants sent VOTE_COMMIT and coordinator votes COMMIT{

write GLOBAL_COMMIT to local log;multicast GLOBAL_COMMIT to all participants;

} else {write GLOBAL_ABORT to local log;multicast GLOBAL_ABORT to all participants;

}

Page 54: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

54/65

Two Phase Commit (5)

I actions by participant

write INIT to local log;wait for VOTE_REQUEST from coordinator;if timeout {

write VOTE_ABORT to local log;exit;

}if participant votes COMMIT {

write VOTE_COMMIT to local log;send VOTE_COMMIT to coordinator;wait for DECISION from coordinator;if timeout {

multicast DECISION_REQUEST to other participants;wait until DECISION is received; /* remain blocked */write DECISION to local log;

}if DECISION == GLOBAL_COMMIT

write GLOBAL_COMMIT to local log;else if DECISION == GLOBAL_ABORT

write GLOBAL_ABORT to local log;} else {

write VOTE_ABORT to local log;send VOTE ABORT to coordinator;

}

Page 55: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

55/65

Two Phase Commit (6)

I actions for handling decision requests

/* executed by separate thread */while true {

wait until any incoming DECISION_REQUEST is received; /* remain blocked */read most recently recorded STATE from the local log;if STATE == GLOBAL_COMMIT

send GLOBAL_COMMIT to requesting participant;else if STATE == INIT or STATE == GLOBAL_ABORT

send GLOBAL_ABORT to requesting participant;else

skip; /* participant remains blocked */}

Page 56: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

56/65

Recovery

I Backward recovery. Roll back the system from erroneous stateto a previously correct state. This requires system to becheckpointing, which has the following issues:

I Relatively costly to checkpoint. Often combined with messagelogging for better performance. Messages are logged beforesending or before receiving. Combined with checkpoints tomakes recovery possible. Checkpoints alone cannot solve theissue of replaying all messages in the right order.

I Backward recovery requires a loop of recovery so failuretransparency cannot be guaranteed. Some states can never berolled back to...

I Forward recovery. Bring the system to a correct new statefrom which it can continue execution. E.g. In an (n,k) blockerasure code, a set of k source packets is encoded into a set ofn encoded packets, such that any set of k encoded packets isenough to reconstruct the original k source packets.

Page 57: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

57/65

Stable Storage

I We need fault-tolerant disk storage for the checkpoints andmessage logs. Examples are various RAID (Redundant Arrayof Independent Disks) schemes (although they are used forboth improved fault tolerance as well as improvedperformance). Some common schemes:

I RAID-0 block-level stripingI RAID-1 mirroringI RAID-5 block-level striping with distributed parityI RAID-6 block-level striping with double distributed parityI RAID-10 stripes data across mirrored pairs

Page 58: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

58/65

Checkpointing

I Backward error recovery schemes require that a distributedsystem regularly checkpoints a consistent global state to stablestorage. This is known as a distributed snapshot.

I In a distributed snapshot, if a process P has recorded thereceipt of a message, then there is also a process Q that hasrecorded the sending of that message.

I To recover after a process or system failure, it is best torecover to the most recent distributed snapshot, also known asthe recovery line.

I We will examine the following checkpointing and loggingtechniques:

I Independent checkpointing.I Coordinated checkpointing.I Message logging:

I Optimistic message loggingI Pessimistic message logging

Page 59: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

59/65

Checkpointing and Recovery Line

Page 60: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

60/65

Independent Checkpointing

I Each process saves its state from time to time in anuncoordinated fashion.

I To discover a recovery line requires that each process be rolledback to its most recently saved state. If these local statesjointly do not form a distributed snapshot, further rolling isnecessary. This can lead to a domino effect.

I Find the recovery line in the following timeline:

Page 61: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

61/65

Coordinated Checkpointing (1)

I All processes synchronize to jointly write their state to localstable storage, which implies that the saved state isautomatically consistent.

I Simple Coordinated Checkpointing:I Coordinator multicasts a CHECKPOINT_REQUEST to all

processes.I When a process receives the request, it takes a local

checkpoint, queues any subsequent messages handed to it bythe application it is executing, and acknowledges to thecoordinator.

I When the coordinator has received an acknowledgment fromall processes, it multicasts a CHECKPOINT_DONE message toallow the blocked processes to continue.

Page 62: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

62/65

Coordinated Checkpointing (2)

I Incremental Snapshot:I The coordinator multicasts a checkpoint request only to those

processes it had sent a message to since it last took acheckpoint. The other processes are ignored.

I When a process P receives such a request, it forwards it to allthose processes to which P itself had sent a message since thelast checkpoint and so on.

I A process forwards the request only once.I When all processes have been identified, then a second

message is multicast to trigger checkpointing and to allow theprocesses to continue

Page 63: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

63/65

Message Logging

I Message logging enables us to reduce the number ofcheckpoints, but still enable recovery.

I If the transmission of messages can be replayed, we can stillreach a globally consistent state by starting from acheckpointed state and retransmitting all messages sent since.

I Assumes a piece wise deterministic model, where deterministicintervals occur between sending/receiving messages. Thedeterministic interval starts with a non-deterministic event suchas sending or receiving a message. Then the execution of theprocess is deterministic until the next non-deterministic event.

I An deterministic interval can be replayed with a known result,provided it is replayed starting with the same non-deterministicevent as before. Hence if we record all non-deterministicevents, it becomes possible to completely replay the entireexecution of a process in a deterministic way.

Page 64: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

64/65

Orphan Process

I An orphan process is a process that has survived the crash of anotherprocess, but whose state is inconsistent with the crashed process afterits recovery.

I The figure below shows an incorrect replay of messages after recovery,leading to an orphan process. Note that we assume that m2 was neverlogged so it isn’t replayed. Which one is the orphan process?

I How to deal with orphans?

Page 65: Fault Tolerance - Computer Sciencecs.boisestate.edu/~amit/teaching/555/handouts/fault-tolerance-handout.pdf4/65 Fault Tolerance: Basic Concepts (2) I Asystemissaidtofailwhenitcannotmeetitspromises.

65/65

References

I "Brewer’s CAP Theorem", julianbrowne.com, 02-Mar-2010.I “Brewer’s conjecture and the feasibility of consistent, available,

partition-tolerant web services”, Nancy Lynch and Seth Gilbert.ACM SIGACT News, Volume 33 Issue 2 (2002), pg. 51-59.

I “CAP twelve years later: How the "rules" have changed”, EricBrewer, IEEE Explore, Volume 45, Issue 2 (2012), pg. 23-29.

I CAP Theorem Revisited, Robert Greiner, 2014.I A CAP Solution (proving Brewer Wrong), Guy Pardon’s Blog,

2008.I Your Coffee Shop Doesn’t Use Two-Phase Commit, Gregor

Hohpe. IEEE Software, March/April 2005, pp. 64-66.