7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single...

50
7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially. It can even recover automatically from failures. The term dependable is used about systems that have the following properties: Availability. System is ready to be used. Reliability. System can run continuously without failure. Safety. System may fail, but nothing catastrophic happens. Martti Penttonen: Distributed Systems 2002 248

Transcript of 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single...

Page 1: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

7. FAULT TOLERANCE

A single machine system either works or not, totally. A distributed

system can fail partially. It can even recover automatically from

failures.

The term dependable is used about systems that have the following

properties:

• Availability. System is ready to be used.

• Reliability. System can run continuously without failure.

• Safety. System may fail, but nothing catastrophic happens.

Martti Penttonen: Distributed Systems 2002 248

Page 2: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

• Maintainability. System may fail, but it is easy to repair.

Martti Penttonen: Distributed Systems 2002 249

Page 3: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Fault tolerance terminology

• Failure. Component does not fulfil the specification.

• Error. The part in the state of the system that can lead to a

failure.

• Fault. The cause of an error.

• Fault prevention. Prevent the occurrence of the fault.

• Fault tolerance. Mask the faults so that even in case of fault

the system fulfils the specification.

Martti Penttonen: Distributed Systems 2002 250

Page 4: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

• Fault removal. Reduce the presence, number and seriousness of

faults.

• Fault forecasting. Estimate the number, time and consequences

of faults.

Martti Penttonen: Distributed Systems 2002 251

Page 5: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Failure types

• Crash failures. System component behaves correctly until it

totally halts.

• Omission failures. System fails to respond to requests.

• Timing failures. System responds correctly but outside specified

time. (Usually too late.)

• Response failure. The response of the system is incorrect. Either

the value is wrong (value failure), or the system deviates from

the correct flow of control (state transition failure).

Martti Penttonen: Distributed Systems 2002 252

Page 6: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

• Arbitrary failure. The system may produce arbitrary responses

at arbitrary times.

Remark. Crash failures are least severe, arbitrary failures are most

dangerous.

Martti Penttonen: Distributed Systems 2002 253

Page 7: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Failure masking by redundancy

The best way to protect oneself agains failures is to mask them

out. An important technique of masking is redundancy, use

reserve parts. The followin kinds of redunancy can be used:

• Information redundancy. Extra information is used to recover

from the lost or distorted information. A good example is the

use of error-correcting codes for data.

• Timing redundancy. If a transaction fails, it is tried again. It

applies against intermittent failures, like power failure.

• Physical redundancy. Duplicate hardware. See Figure.

Martti Penttonen: Distributed Systems 2002 254

Page 8: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 255

Page 9: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

PROCESS RESILIENCE

Protect yourself against faulty processes by replicating and

distributing them in a group.

• Flat group. Processes form a symmetric group. If one of the

processes fails, there is just one process less. Very fault tolerant

but control may be difficult.

• Hierarchical group. One of the processes is coordinator. Easier

to implement but not very fault tolerant. A new coordinator

must be selected.

Martti Penttonen: Distributed Systems 2002 256

Page 10: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 257

Page 11: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Agreement if faulty system

We’ll approach to the problem of faulty communication and faulty

processes by two problems

• Faulty communication: two-army problem

• Faulty processes: Byzantine generals problem

Martti Penttonen: Distributed Systems 2002 258

Page 12: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Two-army problem

The following problem illustrates the difficulty of agreement, when

communication is not reliable.

• There are two armies, red army of 5k troops encamped in a

valley, and two blue armies of 3k troops each, on surrounding

hills.

• The red army wins blue armies if they do not cooperate but

loses to them if they cooperate.

• General Alexander, commander of blue army 1, sends General

Martti Penttonen: Distributed Systems 2002 259

Page 13: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Bonaparte, commander of blue army 2 message: “Hi Bona, let’s

attack at dawn tomorrow”

• General Bonaparte answers: “Great idea Al, see you at dawn

tomorrow”

• General Alexander receives the message, but suddenly he

realizes: Bonaparte does not know whether I got his

acknowledgement and may not dare to attack. Therefore

he sends a messenger to tell Bonaparte that he received the

acknowledgement.

• Bonaparte receives the new message and thinks. Alexander

does not know that I received the message and may not dare to

attack. Therefore he sends a message ...

Martti Penttonen: Distributed Systems 2002 260

Page 14: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Even if the message goes through every time, generals will never

agree.

Martti Penttonen: Distributed Systems 2002 261

Page 15: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Byzantine generals

Now we assume that communication is reliable, but processes

aren’t. The problem of faulty processes is illustrated by Byzantine

generals.

n generals are planning an attack and therefore wantto exchange information about their troop strengths, inkilosoldiers. Among n generals there are m traitors thatfeed false information. Without knowing each others, canloyal generals get reliable information, and under whatcondition?

Martti Penttonen: Distributed Systems 2002 262

Page 16: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

For example, consider the case of four generals (n = 4) of the

following Figure, one of whom is traitor (m = 1). General 1

has 1k troops, general 2 has 2k troops, general 3 always lies,

and general 4 has 4k troops. For exchanging the information the

generals proceed as follows

1. Generals send each others a message about their strength.

Figure (a).

2. Each general forms a vector of the received information, where

the i’th item is the strength of general i, see Figure (b).

3. Each general now sends his vector to all others. Now every

general knows everybody else’s opinions about the strengths,

see Figure (c).

Martti Penttonen: Distributed Systems 2002 263

Page 17: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

4. Of the three opinions, the majority is the answer. General

1, for example, calculates maj(1, a, 1) = 1, maj(2, b, 2) = 2,

maj(y, c, z) =?, maj(4, d, 4) = 4.

In the example, loyal generals got reliable information even if the

traitor gave false information.

Lamport et al. proved that in a system with m faulty processes,

agreement can be achieved if there are at least 2m + 1 correctly

functioning processes.

Martti Penttonen: Distributed Systems 2002 264

Page 18: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 265

Page 19: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

RELIABLE CLIENT-SERVERCOMMUNICATION: RPC

We shall now take a look in client-server communication in case

of RPC. Possible failures are:

1. Client cannot locate server.

2. Client request message is lost.

3. Server crashes.

4. Server response is lost.

Martti Penttonen: Distributed Systems 2002 266

Page 20: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

5. Client crashes.

Martti Penttonen: Distributed Systems 2002 267

Page 21: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Solutions — kind of

• Report back to client.

• Resend message.

• Things should happen as in Figure (a) but also (b) or (c) may

happen. A problem is that we should react differently with (b)

and (c). In (b) one should probably report to the client and

in (b) one should retransmit the request. As no reply arrives,

client cannot know what is the case.

Martti Penttonen: Distributed Systems 2002 268

Page 22: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

There are three schools what to do:

– At-least-once semantics. Keep trying, try at least once.

– At-most-once semantics. Give up immediately and report

failure.

– Guarantee nothing. When server crashes, client gets no help.

• The problem of lost reply is that it looks the same as the

crashed server. If the request is idempotent in the sense that

Martti Penttonen: Distributed Systems 2002 269

Page 23: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

it does not change the state of the server (bank transfer is not

idempotent!), request can be repeated. If the request is not

idempotent, one can perhaps mark the retransmissions so that

they do no cause a new transaction.

• If client crashes before getting the reply, computation becomes

orphan. Orphans can be harmful, not only because of the

wasted resource. When the client reboots, does a new RPC and

before the reply gets an orphan, a confusion may result. One

solution is extermination of orphans at reboot. Another solution

is expiration — if a RPC is not finished within a time bound, it

expires.

Martti Penttonen: Distributed Systems 2002 270

Page 24: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

RELIABLE MULTICASTING

Transport layer offers reliable point-to-point channels.

Multicasting is more difficult, in particular if senders and receivers

can fail. The following Figure describes a simple solution, when

all receivers are reliable. The sending process assigns a sequence

number to each message and keeps history buffer of messages

until an acknowledgement has arrived. Receivers acknowledge

the arrived messages. By sequence number a receiver notices if

a message is lost or delayed. In that case it sends a negative

acknowledgement with the number of the missing message.

Martti Penttonen: Distributed Systems 2002 271

Page 25: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 272

Page 26: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Scalability

The proposed simple multicast solution is not very scalable.

If a message is multicast to N receivers, the sender gets N

acknowledgements.

A solution is not to send positive acknowledgements, but only

negative ones. A problem of missing positive acknowledgements

is that the sender has to keep the history buffer forever.

Other solutions are Scalable Reliable Multicasting (SRM) protocol

and Hierarchical Feedback Control.

Martti Penttonen: Distributed Systems 2002 273

Page 27: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Scalable Reliable Multicasting

The principles of the Scalable Reliable Multicasting are the

following:

• Only negative acknowledgements are sent.

• A process sends it negative acknowledgement only after a

random delay.

• Receivers listen to a common feedback channel that is used for

the negative acknowledgements.

Martti Penttonen: Distributed Systems 2002 274

Page 28: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

• When a receiver observes a negative acknowledgement sent by

another receiver, it suppresses its own acknowledgement it was

about to send. That message will be retransmitted anyway,

there is no need to waste bandwidth.

See Figure.

Martti Penttonen: Distributed Systems 2002 275

Page 29: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 276

Page 30: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Hierarchical Feedback Control

When the number of receivers is very high, multicasting can be

made hierarchical as follows:

• If the number of receivers is high, they are divided to subclasses

that select a coordinator. If the number of receivers is very high,

this division is continued recursively. Thus receivers form a tree.

• Messages and acknowledgements are sent to the coordinator.

• Coordinator maintains the history buffer for receivers below it.

Martti Penttonen: Distributed Systems 2002 277

Page 31: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 278

Page 32: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Atomic multicast

We now return the case where processes may fail. In atomic

multicast problem it is required that

• Messages are delivered to all processors or none.

• Messages are delivered in the same order to all processes.

The system may continue running after the crash of a process,

but after a crashed process recovers, no updates are allowed until

it has been brought up-to-date with other processes. Thus atomic

multicasting ensures consistency.

Martti Penttonen: Distributed Systems 2002 279

Page 33: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

DISTRIBUTED COMMIT

The principle of atomic multicasting can be generalized. In

distributed commit, an operation should be performed by a group

of processes , or none.

We shall consider two solution for the problem:

• two-phase commit (2PC), and

• three-phase commit (3PC)

Martti Penttonen: Distributed Systems 2002 280

Page 34: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Two-phase commit

The main idea of 2PC is that the client who initiated the

computation acts as coordinator, processes that need to commit

are the participants.

1a Coordinator sends vote-request to participants.

1b Participants respond by vote-commit or vote-abort to the

coordinator and remain waiting.

2a Coordinator collects the votes. If all votes were vote-commit,it sends global-commit to participants, otherwise it sends

global-abort.

Martti Penttonen: Distributed Systems 2002 281

Page 35: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

2b If participant receives global-commit, it commits the

transaction, in case of global-abort, transaction is aborted.

Figure (a) describes the steps in the coordinator, Figure (b) the

steps in the participants.

Martti Penttonen: Distributed Systems 2002 282

Page 36: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Failing participant in 2PC

For recovery, coordinator and participants keep log, and until final

commit make computations in temporary workspace. Depending

on the state, when crash happens, the participant proceeds as

follows:

• Failure at initial state: No problem as participant is unaware of

the protocol.

• Ready state: Participant is waiting for global-commit or

global-abort. After recovery, participant needs to know

which state transition it should make. It needs to ask the global

log for the coordinator’s decision.

Martti Penttonen: Distributed Systems 2002 283

Page 37: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

• Abort state: Just remove the workspace to return to the state

before transaction.

• Commit state: Perform the commit.

Martti Penttonen: Distributed Systems 2002 284

Page 38: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Failing coordinator

The coordinator stores its decisions in the persistent log so that

they can be found after crash.

But what can participants do, if coordinator crashes when it should

make the global decision?

Martti Penttonen: Distributed Systems 2002 285

Page 39: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Three-phase commit

1a Coordinator sends vote-request to participants.

1b Participant responds with vote-commit or vote-abort.

2a Coordinator collect votes. If all vote commit, it sends prepareto participants, otherwise it sends abort.

2b Participants wait for prepare or abort.

3a Coordinator waits for acknowledgement ack of the reception

of prepare from participants, and then sends commit to

participants.

Martti Penttonen: Distributed Systems 2002 286

Page 40: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

3b Participants wait for commit.

Martti Penttonen: Distributed Systems 2002 287

Page 41: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 288

Page 42: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

3PC failing participant

• The key idea is that on way towards commit, the coordinator

and the participants never differ more than one state transition.

• After crash a participant uses information available from the

coordinator whether it should abort or continue towards commit.

• If coordinator crashes, it may be necessary to select a new

coordinator.

Martti Penttonen: Distributed Systems 2002 289

Page 43: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

RECOVERY

So far we have concentrated on tolerating faults. How to recover

to an error-free state after a failure? Two main choices are:

1. Forward error recovery: Find a new state where the system can

continue operation.

2. Backward error recovery: Bring the system back into a previous

error-free state. Some recovery points are needed.

A big difficulty in distributed systems is to identify a consistent

state where to continue.

Martti Penttonen: Distributed Systems 2002 290

Page 44: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Checkpointing

For recovery, processes may regularly record snapshots about their

states, checkpoints. A recovery line describes the most recent

consistent global checkpoint.

A property of a consistent global checkpoint is that sent messages

should also be received.

Martti Penttonen: Distributed Systems 2002 291

Page 45: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 292

Page 46: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

If checkpoints are made at bad moment, scrollbacks may be

cascaded. In the following Figure, one has to scrollback to the

initial state!

Martti Penttonen: Distributed Systems 2002 293

Page 47: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Independent checkpointing

Let CP[i](m) denote the mth checkpoint of process Pi and

INT[i](m) the interval between checkpoints CP[i](m-1) and

CP[i](m). Proceed as follows:

• When process Pi sends a message in intervall INT[i](m), it

piggybacks (i,m).

• When process Pj receives a message in interval INT[j](n), it

records the depenency INT[i](m) → INT[j](n).

• The dependency INT[i](m) → INT[j](n) is stored in stable

storage when taking checkpoint CP[j](n).

Martti Penttonen: Distributed Systems 2002 294

Page 48: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

• When Pi scrolls back to CP[i](m-1), Pj must scroll back to

CP[j](n-1)

Martti Penttonen: Distributed Systems 2002 295

Page 49: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Message logging

Martti Penttonen: Distributed Systems 2002 296

Page 50: 7. FAULT TOLERANCEcs.uef.fi › ~penttone › ds2002 › ch7.pdf · 7. FAULT TOLERANCE A single machine system either works or not, totally. A distributed system can fail partially.

Martti Penttonen: Distributed Systems 2002 297