7. FAULT TOLERANCE
A single-machine system either works or it does not; it fails totally. A
distributed system can fail partially, and it can even recover
automatically from failures.
The term dependable is used about systems that have the following
properties:
• Availability. System is ready to be used.
• Reliability. System can run continuously without failure.
• Safety. System may fail, but nothing catastrophic happens.
Martti Penttonen: Distributed Systems 2002 248
• Maintainability. System may fail, but it is easy to repair.
Fault tolerance terminology
• Failure. Component does not fulfil the specification.
• Error. The part of the system state that may lead to a
failure.
• Fault. The cause of an error.
• Fault prevention. Prevent the occurrence of the fault.
• Fault tolerance. Mask the faults so that even in case of fault
the system fulfils the specification.
• Fault removal. Reduce the presence, number and seriousness of
faults.
• Fault forecasting. Estimate the number, time and consequences
of faults.
Failure types
• Crash failures. System component behaves correctly until it
totally halts.
• Omission failures. System fails to respond to requests.
• Timing failures. System responds correctly but outside specified
time. (Usually too late.)
• Response failure. The response of the system is incorrect. Either
the value is wrong (value failure), or the system deviates from
the correct flow of control (state transition failure).
• Arbitrary failure. The system may produce arbitrary responses
at arbitrary times.
Remark. Crash failures are least severe, arbitrary failures are most
dangerous.
Failure masking by redundancy
The best way to protect oneself against failures is to mask them
out. An important masking technique is redundancy: the use of
reserve parts. The following kinds of redundancy can be used:
• Information redundancy. Extra information is used to recover
from lost or distorted information. A good example is the
use of error-correcting codes for data.
• Timing redundancy. If a transaction fails, it is tried again. This
works against transient and intermittent failures.
• Physical redundancy. Duplicate hardware. See Figure.
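The physical-redundancy idea can be sketched as a majority vote over replicated components, as in triple modular redundancy (TMR). A minimal illustration, assuming three replicas; the `tmr_vote` name is invented for the example:

```python
from collections import Counter

def tmr_vote(replies):
    """Majority vote over the replies of replicated components (TMR).

    With three replicas, a single faulty unit is masked by the two
    correct ones. If no value has a strict majority, too many
    replicas are faulty and the fault cannot be masked.
    """
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise RuntimeError("no majority -- too many faulty replicas")
    return value

# Two correct replicas outvote one faulty replica:
print(tmr_vote([42, 42, 17]))  # -> 42
```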
PROCESS RESILIENCE
Protect yourself against faulty processes by replicating and
distributing them in a group.
• Flat group. Processes form a symmetric group. If one of the
processes fails, there is just one process less. Very fault tolerant
but control may be difficult.
• Hierarchical group. One of the processes is the coordinator. Easier
to implement, but not very fault tolerant: if the coordinator fails,
a new coordinator must be selected.
Agreement in a faulty system
We approach the problems of faulty communication and faulty
processes through two classic problems:
• Faulty communication: two-army problem
• Faulty processes: Byzantine generals problem
Two-army problem
The following problem illustrates the difficulty of agreement, when
communication is not reliable.
• There are two armies, red army of 5k troops encamped in a
valley, and two blue armies of 3k troops each, on surrounding
hills.
• The red army defeats the blue armies if they do not cooperate,
but loses to them if they attack together.
• General Alexander, commander of blue army 1, sends General
Bonaparte, commander of blue army 2 message: “Hi Bona, let’s
attack at dawn tomorrow”
• General Bonaparte answers: “Great idea Al, see you at dawn
tomorrow”
• General Alexander receives the message, but suddenly he
realizes: Bonaparte does not know whether I got his
acknowledgement and may not dare to attack. Therefore
he sends a messenger to tell Bonaparte that he received the
acknowledgement.
• Bonaparte receives the new message and thinks: Alexander
does not know that I received his message, and may not dare to
attack. Therefore he sends a message ...
Even if every message goes through, the generals never reach
agreement.
Byzantine generals
Now we assume that communication is reliable, but processes
aren’t. The problem of faulty processes is illustrated by Byzantine
generals.
n generals are planning an attack and therefore want to exchange information about their troop strengths, in kilosoldiers. Among the n generals there are m traitors that feed false information. Without knowing who the traitors are, can the loyal generals get reliable information, and under what condition?
For example, consider the case of four generals (n = 4) in the
following Figure, one of whom is a traitor (m = 1). General 1
has 1k troops, general 2 has 2k troops, general 3 always lies,
and general 4 has 4k troops. To exchange the information, the
generals proceed as follows:
1. Generals send each other a message about their strength.
Figure (a).
2. Each general forms a vector of the received information, where
the i’th item is the strength of general i, see Figure (b).
3. Each general now sends his vector to all others. Now every
general knows everybody else’s opinions about the strengths,
see Figure (c).
4. Of the three opinions, the majority is the answer. General
1, for example, calculates maj(1, a, 1) = 1, maj(2, b, 2) = 2,
maj(y, c, z) =?, maj(4, d, 4) = 4.
In the example, loyal generals got reliable information even if the
traitor gave false information.
Lamport et al. proved that in a system with m faulty processes,
agreement can be achieved if there are at least 2m + 1 correctly
functioning processes.
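The vector-exchange rounds above can be sketched in a few lines. This is a simplified illustration, not Lamport's full algorithm: it assumes four generals with general 3 (index 2) as the only traitor, the traitor announces arbitrary values, and all numbers are illustrative.

```python
import random
from collections import Counter

def majority(values):
    """Majority value, or None if no strict majority exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None

TRAITOR = 2                      # general 3 (index 2) always lies
strengths = [1, 2, 0, 4]         # true strengths, in kilosoldiers

def claim(sender):
    """What `sender` announces about its own strength."""
    return random.randint(5, 9) if sender == TRAITOR else strengths[sender]

# Steps 1-2: every general builds a vector of the announced strengths.
vectors = {g: [claim(s) for s in range(4)] for g in range(4)}

# Steps 3-4: a loyal general collects the vectors relayed by the other
# three generals and takes the element-wise majority. Even if the
# traitor also distorted its relayed vector, the two honest copies of
# each loyal entry would still win the vote.
me = 0
relayed = [vectors[other] for other in range(4) if other != me]
decision = [majority([vec[i] for vec in relayed]) for i in range(4)]
print(decision)   # entries for generals 1, 2 and 4 (the loyal ones) are correct
```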
RELIABLE CLIENT-SERVER COMMUNICATION: RPC
We shall now take a look at client-server communication in the case
of RPC. Possible failures are:
1. Client cannot locate server.
2. Client request message is lost.
3. Server crashes.
4. Server response is lost.
5. Client crashes.
Solutions — kind of
• Report back to client.
• Resend message.
• Things should happen as in Figure (a), but (b) or (c) may happen
instead. The problem is that (b) and (c) call for different
reactions: in (b) one should probably report the failure to the
client, while in (c) one should retransmit the request. As no
reply arrives in either case, the client cannot tell which case holds.
There are three schools of thought on what to do:
– At-least-once semantics. Keep retrying until a reply arrives;
the operation is performed at least once.
– At-most-once semantics. Give up immediately and report
failure; the operation is performed at most once.
– Guarantee nothing. When the server crashes, the client gets no help.
• The problem with a lost reply is that it looks the same as a
crashed server. If the request is idempotent, meaning that
repeating it does not change the outcome (a bank transfer is not
idempotent!), the request can simply be repeated. If the request
is not idempotent, one can mark the retransmissions so that
they do not cause a new transaction.
• If the client crashes before getting the reply, the computation
becomes an orphan. Orphans can be harmful, and not only because
of the wasted resources: if the client reboots, issues a new RPC,
and then receives the reply of an orphan, confusion may result. One
solution is extermination of orphans at reboot. Another solution
is expiration: if an RPC does not finish within a time bound, it
expires.
RELIABLE MULTICASTING
Transport layer offers reliable point-to-point channels.
Multicasting is more difficult, in particular if senders and receivers
can fail. The following Figure describes a simple solution, when
all receivers are reliable. The sending process assigns a sequence
number to each message and keeps a history buffer of messages
until an acknowledgement has arrived. Receivers acknowledge the
arriving messages. From the sequence numbers, a receiver notices
when a message is lost or delayed; in that case it sends a negative
acknowledgement carrying the number of the missing message.
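The sequence-number scheme can be sketched as a small receiver class. This is a simplified illustration: the sender, its history buffer, and the retransmissions themselves are omitted.

```python
class Receiver:
    """Receiver side of the simple reliable multicast (a sketch).

    Detects missing messages from the sequence numbers and records a
    negative acknowledgement (NACK) for each gap; duplicates are dropped.
    """
    def __init__(self):
        self.expected = 0     # next sequence number we expect
        self.nacks = []       # numbers of messages to request again

    def on_message(self, seq):
        if seq < self.expected:
            return                           # duplicate, drop it
        if seq > self.expected:
            # messages expected .. seq-1 are missing: ask for retransmission
            self.nacks.extend(range(self.expected, seq))
        self.expected = seq + 1

r = Receiver()
r.on_message(0)
r.on_message(2)        # message 1 was lost on the way
print(r.nacks)         # -> [1]
```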
Scalability
The proposed simple multicast solution is not very scalable.
If a message is multicast to N receivers, the sender gets N
acknowledgements.
One solution is to send only negative acknowledgements, no positive
ones. The drawback is that without positive acknowledgements the
sender must keep its history buffer forever.
Other solutions are Scalable Reliable Multicasting (SRM) protocol
and Hierarchical Feedback Control.
Scalable Reliable Multicasting
The principles of the Scalable Reliable Multicasting are the
following:
• Only negative acknowledgements are sent.
• A process sends its negative acknowledgement only after a
random delay.
• Receivers listen to a common feedback channel that is used for
the negative acknowledgements.
• When a receiver observes a negative acknowledgement sent by
another receiver, it suppresses the one it was about to send
itself: the message will be retransmitted anyway, so there is
no need to waste bandwidth.
See Figure.
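The random-delay suppression can be sketched as a small discrete-event toy: each receiver that missed the message draws a random delay, the earliest NACK reaches the feedback channel first, and the rest stay silent. Names are illustrative.

```python
import random

def srm_feedback(receivers_missing, max_delay=1.0):
    """SRM-style NACK suppression (a simplified sketch).

    Each receiver that missed a message schedules its NACK after a
    random delay; the first NACK on the shared feedback channel
    makes the other receivers suppress theirs.
    """
    delays = {r: random.uniform(0, max_delay) for r in receivers_missing}
    first = min(delays, key=delays.get)   # this NACK hits the channel first
    suppressed = [r for r in receivers_missing if r != first]
    return first, suppressed

sender, silenced = srm_feedback(["R1", "R2", "R3"])
print(f"{sender} sends the NACK; {silenced} stay silent")
```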
Hierarchical Feedback Control
When the number of receivers is very high, multicasting can be
made hierarchical as follows:
• If the number of receivers is high, they are divided into subgroups,
each of which selects a coordinator. If the number of receivers is
very high, this division is continued recursively; thus the
receivers form a tree.
• Messages and acknowledgements are sent to the coordinator.
• Coordinator maintains the history buffer for receivers below it.
Atomic multicast
We now return to the case where processes may fail. In the atomic
multicast problem it is required that
• Messages are delivered to all processes or to none.
• Messages are delivered in the same order to all processes.
The system may continue running after the crash of a process,
but after a crashed process recovers, no updates are allowed until
it has been brought up-to-date with other processes. Thus atomic
multicasting ensures consistency.
DISTRIBUTED COMMIT
The principle of atomic multicasting can be generalized. In
distributed commit, an operation should be performed by all
processes of a group, or by none.
We shall consider two solutions to the problem:
• two-phase commit (2PC), and
• three-phase commit (3PC)
Two-phase commit
The main idea of 2PC is that the client that initiated the
computation acts as the coordinator, while the processes that
need to commit are the participants.
1a Coordinator sends vote-request to participants.
1b Participants respond by vote-commit or vote-abort to the
coordinator and remain waiting.
2a Coordinator collects the votes. If all votes were vote-commit,
it sends global-commit to the participants, otherwise it sends
global-abort.
2b If a participant receives global-commit, it commits the
transaction; in case of global-abort, the transaction is aborted.
Figure (a) describes the steps in the coordinator, Figure (b) the
steps in the participants.
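The failure-free path of 2PC can be sketched as follows, with each participant modeled as a function that returns its vote. This is a simplification: timeouts, logging, and crashes are omitted.

```python
def two_phase_commit(participants):
    """Coordinator side of 2PC (failure-free sketch).

    Each participant is a function returning "vote-commit" or
    "vote-abort"; the names and votes are illustrative.
    """
    # Phase 1: send vote-request and collect the votes.
    votes = [p() for p in participants]
    # Phase 2: make and broadcast the global decision.
    if all(v == "vote-commit" for v in votes):
        return "global-commit"
    return "global-abort"

print(two_phase_commit([lambda: "vote-commit", lambda: "vote-commit"]))
# -> global-commit
print(two_phase_commit([lambda: "vote-commit", lambda: "vote-abort"]))
# -> global-abort
```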
Failing participant in 2PC
For recovery, the coordinator and the participants keep logs and,
until the final commit, make their computations in a temporary
workspace. Depending on the state in which the crash happens, the
participant proceeds as follows:
• Failure at initial state: No problem as participant is unaware of
the protocol.
• Ready state: The participant was waiting for global-commit or
global-abort. After recovery, the participant needs to know
which state transition to make; it must learn the coordinator's
decision from the coordinator's log.
• Abort state: Just remove the workspace to return to the state
before the transaction.
• Commit state: Perform the commit.
Failing coordinator
The coordinator stores its decisions in a persistent log so that
they can be found after a crash.
But what can participants do, if coordinator crashes when it should
make the global decision?
Three-phase commit
1a Coordinator sends vote-request to participants.
1b Participant responds with vote-commit or vote-abort.
2a Coordinator collects the votes. If all vote commit, it sends prepare to the participants, otherwise it sends abort.
2b Participants wait for prepare or abort.
3a Coordinator waits for acknowledgement ack of the reception
of prepare from participants, and then sends commit to
participants.
3b Participants wait for commit.
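The coordinator's moves in the steps above can be written as a small state-transition table. This sketches the failure-free path only; the state and event names are illustrative.

```python
# Coordinator state transitions in 3PC, as a table:
# (state, event) -> next state.
TRANSITIONS = {
    ("INIT", "send vote-request"): "WAIT",
    ("WAIT", "all vote-commit"): "PRECOMMIT",   # coordinator sends prepare
    ("WAIT", "some vote-abort"): "ABORT",       # coordinator sends abort
    ("PRECOMMIT", "all ack"): "COMMIT",         # coordinator sends commit
}

def run(events):
    """Drive the coordinator through a sequence of events."""
    state = "INIT"
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state

print(run(["send vote-request", "all vote-commit", "all ack"]))  # -> COMMIT
print(run(["send vote-request", "some vote-abort"]))             # -> ABORT
```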
3PC failing participant
• The key idea is that on the way towards commit, the coordinator
and the participants never differ by more than one state transition.
• After a crash, a participant uses information available from the
coordinator to decide whether it should abort or continue towards
commit.
• If the coordinator crashes, it may be necessary to select a new
coordinator.
RECOVERY
So far we have concentrated on tolerating faults. How to recover
to an error-free state after a failure? Two main choices are:
1. Forward error recovery: Find a new state where the system can
continue operation.
2. Backward error recovery: Bring the system back into a previous
error-free state. Some recovery points are needed.
A big difficulty in distributed systems is to identify a consistent
state where to continue.
Checkpointing
For recovery, processes may regularly record snapshots of their
states, called checkpoints. A recovery line is the most recent
consistent global checkpoint.
A property of a consistent global checkpoint is that every message
recorded as received is also recorded as sent.
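Assuming the usual consistency condition (every message recorded as received must also be recorded as sent, so that no message is received "from the future"), a consistency check for a global checkpoint can be sketched as:

```python
def consistent(sent, received):
    """Check a global checkpoint for consistency (a sketch).

    `sent` and `received` are sets of message ids collected from the
    processes' checkpoints. The checkpoint is consistent if every
    received message also appears as sent; messages sent but still
    in transit are acceptable.
    """
    return received <= sent

# m2 was received, but the sender checkpointed before recording the send:
print(consistent(sent={"m1"}, received={"m1", "m2"}))  # -> False
print(consistent(sent={"m1", "m2"}, received={"m1"}))  # -> True
```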
If checkpoints are taken at bad moments, rollbacks may cascade
(the domino effect). In the following Figure, one has to roll back
to the initial state!
Independent checkpointing
Let CP[i](m) denote the mth checkpoint of process Pi and
INT[i](m) the interval between checkpoints CP[i](m-1) and
CP[i](m). Proceed as follows:
• When process Pi sends a message in interval INT[i](m), it
piggybacks (i,m) on the message.
• When process Pj receives such a message in interval INT[j](n), it
records the dependency INT[i](m) → INT[j](n).
• The dependency INT[i](m) → INT[j](n) is stored in stable
storage when taking checkpoint CP[j](n).
• When Pi rolls back to CP[i](m-1), Pj must also roll back to
CP[j](n-1).
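The rollback propagation implied by the recorded dependencies can be sketched as a transitive closure over them. This is a toy model: intervals are (process, index) pairs, and the dependency map is illustrative.

```python
def rollback(start, deps):
    """Return all intervals that must be undone when `start` is undone.

    `deps` maps interval (i, m) to the set of intervals (j, n) that
    depend on it: rolling Pi back past INT[i](m) forces Pj back past
    INT[j](n), and so on transitively.
    """
    to_undo, undone = [start], set()
    while to_undo:
        interval = to_undo.pop()
        if interval in undone:
            continue
        undone.add(interval)
        to_undo.extend(deps.get(interval, ()))
    return undone

# P1 sent a message in INT[1](2) that P2 received in INT[2](3):
deps = {(1, 2): {(2, 3)}}
print(rollback((1, 2), deps))  # P2's interval must also be undone
```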
Message logging