7. FAULT TOLERANCE
A single-machine system either works or it does not; it fails totally. A
distributed system can fail partially, and it can even recover
automatically from failures.
The term dependable is used about systems that have the following
properties:
• Availability. System is ready to be used.
• Reliability. System can run continuously without failure.
• Safety. System may fail, but nothing catastrophic happens.
Martti Penttonen: Distributed Systems 2002 248
• Maintainability. System may fail, but it is easy to repair.
Fault tolerance terminology
• Failure. Component does not fulfil the specification.
• Error. The part of the system state that may lead to a
failure.
• Fault. The cause of an error.
• Fault prevention. Prevent the occurrence of the fault.
• Fault tolerance. Mask the faults so that even in case of fault
the system fulfils the specification.
• Fault removal. Reduce the presence, number and seriousness of
faults.
• Fault forecasting. Estimate the number, time and consequences
of faults.
Failure types
• Crash failures. System component behaves correctly until it
totally halts.
• Omission failures. System fails to respond to requests.
• Timing failures. System responds correctly but outside specified
time. (Usually too late.)
• Response failure. The response of the system is incorrect. Either
the value is wrong (value failure), or the system deviates from
the correct flow of control (state transition failure).
• Arbitrary failure. The system may produce arbitrary responses
at arbitrary times.
Remark. Crash failures are least severe, arbitrary failures are most
dangerous.
Failure masking by redundancy
The best way to protect oneself against failures is to mask them
out. An important masking technique is redundancy: the use of
reserve parts. The following kinds of redundancy can be used:
• Information redundancy. Extra information is used to recover
from lost or distorted information. A good example is the
use of error-correcting codes for data.
• Timing redundancy. If a transaction fails, it is tried again. This
works against transient and intermittent failures.
• Physical redundancy. Duplicate hardware. See Figure.
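The physical-redundancy idea can be sketched as a majority vote over replicated components, as in triple modular redundancy (TMR). A minimal illustration, assuming three replicas; the `tmr_vote` name is invented for the example:

```python
from collections import Counter

def tmr_vote(replies):
    """Majority vote over the replies of replicated components (TMR).

    With three replicas, a single faulty unit is masked by the two
    correct ones. If no value has a strict majority, too many
    replicas are faulty and the fault cannot be masked.
    """
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise RuntimeError("no majority -- too many faulty replicas")
    return value

# Two correct replicas outvote one faulty replica:
print(tmr_vote([42, 42, 17]))  # -> 42
```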
PROCESS RESILIENCE
Protect yourself against faulty processes by replicating and
distributing them in a group.
• Flat group. Processes form a symmetric group. If one of the
processes fails, there is just one process less. Very fault tolerant
but control may be difficult.
• Hierarchical group. One of the processes is the coordinator. Easier
to implement, but not very fault tolerant: if the coordinator fails,
a new coordinator must be selected.
Agreement in a faulty system
We approach the problems of faulty communication and faulty
processes through two classic problems:
• Faulty communication: two-army problem
• Faulty processes: Byzantine generals problem
Two-army problem
The following problem illustrates the difficulty of agreement, when
communication is not reliable.
• There are two armies, red army of 5k troops encamped in a
valley, and two blue armies of 3k troops each, on surrounding
hills.
• The red army defeats the blue armies if they do not cooperate,
but loses to them if they attack together.
• General Alexander, commander of blue army 1, sends General
Bonaparte, commander of blue army 2 message: “Hi Bona, let’s
attack at dawn tomorrow”
• General Bonaparte answers: “Great idea Al, see you at dawn
tomorrow”
• General Alexander receives the message, but suddenly he
realizes: Bonaparte does not know whether I got his
acknowledgement and may not dare to attack. Therefore
he sends a messenger to tell Bonaparte that he received the
acknowledgement.
• Bonaparte receives the new message and thinks: Alexander
does not know that I received his message, and may not dare to
attack. Therefore he sends a message ...
Even if every message goes through, the generals never reach
agreement.
Byzantine generals
Now we assume that communication is reliable, but processes
aren’t. The problem of faulty processes is illustrated by Byzantine
generals.
n generals are planning an attack and therefore want to exchange information about their troop strengths, in kilosoldiers. Among the n generals there are m traitors that feed false information. Without knowing who the traitors are, can the loyal generals get reliable information, and under what condition?
For example, consider the case of four generals (n = 4) in the
following Figure, one of whom is a traitor (m = 1). General 1
has 1k troops, general 2 has 2k troops, general 3 always lies,
and general 4 has 4k troops. To exchange the information, the
generals proceed as follows:
1. Generals send each other a message about their strength.
Figure (a).
2. Each general forms a vector of the received information, where
the i’th item is the strength of general i, see Figure (b).
3. Each general now sends his vector to all others. Now every
general knows everybody else’s opinions about the strengths,
see Figure (c).
4. Of the three opinions, the majority is the answer. General
1, for example, calculates maj(1, a, 1) = 1, maj(2, b, 2) = 2,
maj(y, c, z) =?, maj(4, d, 4) = 4.
In the example, loyal generals got reliable information even if the
traitor gave false information.
Lamport et al. proved that in a system with m faulty processes,
agreement can be achieved if there are at least 2m + 1 correctly
functioning processes.
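The vector-exchange rounds above can be sketched in a few lines. This is a simplified illustration, not Lamport's full algorithm: it assumes four generals with general 3 (index 2) as the only traitor, the traitor announces arbitrary values, and all numbers are illustrative.

```python
import random
from collections import Counter

def majority(values):
    """Majority value, or None if no strict majority exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None

TRAITOR = 2                      # general 3 (index 2) always lies
strengths = [1, 2, 0, 4]         # true strengths, in kilosoldiers

def claim(sender):
    """What `sender` announces about its own strength."""
    return random.randint(5, 9) if sender == TRAITOR else strengths[sender]

# Steps 1-2: every general builds a vector of the announced strengths.
vectors = {g: [claim(s) for s in range(4)] for g in range(4)}

# Steps 3-4: a loyal general collects the vectors relayed by the other
# three generals and takes the element-wise majority. Even if the
# traitor also distorted its relayed vector, the two honest copies of
# each loyal entry would still win the vote.
me = 0
relayed = [vectors[other] for other in range(4) if other != me]
decision = [majority([vec[i] for vec in relayed]) for i in range(4)]
print(decision)   # entries for generals 1, 2 and 4 (the loyal ones) are correct
```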
RELIABLE CLIENT-SERVER COMMUNICATION: RPC
We shall now take a look at client-server communication in the case
of RPC. Possible failures are:
1. Client cannot locate server.
2. Client request message is lost.
3. Server crashes.
4. Server response is lost.
5. Client crashes.
Solutions — kind of
• Report back to client.
• Resend message.
• Things should happen as in Figure (a), but (b) or (c) may happen
instead. The problem is that (b) and (c) call for different
reactions: in (b) one should probably report the failure to the
client, while in (c) one should retransmit the request. As no
reply arrives in either case, the client cannot tell which case holds.
There are three schools of thought on what to do:
– At-least-once semantics. Keep retrying until a reply arrives;
the operation is performed at least once.
– At-most-once semantics. Give up immediately and report
failure; the operation is performed at most once.
– Guarantee nothing. When the server crashes, the client gets no help.
• The problem with a lost reply is that it looks the same as a
crashed server. If the request is idempotent, meaning that
repeating it does not change the outcome (a bank transfer is not
idempotent!), the request can simply be repeated. If the request
is not idempotent, one can mark the retransmissions so that
they do not cause a new transaction.
• If the client crashes before getting the reply, the computation
becomes an orphan. Orphans can be harmful, and not only because
of the wasted resources: if the client reboots, issues a new RPC,
and then receives the reply of an orphan, confusion may result. One
solution is extermination of orphans at reboot. Another solution
is expiration: if an RPC does not finish within a time bound, it
expires.
RELIABLE MULTICASTING
Transport layer offers reliable point-to-point channels.
Multicasting is more difficult, in particular if senders and receivers
can fail. The following Figure describes a simple solution, when
all receivers are reliable. The sending process assigns a sequence
number to each message and keeps a history buffer of messages
until an acknowledgement has arrived. Receivers acknowledge the
arriving messages. From the sequence numbers, a receiver notices
when a message is lost or delayed; in that case it sends a negative
acknowledgement carrying the number of the missing message.
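The sequence-number scheme can be sketched as a small receiver class. This is a simplified illustration: the sender, its history buffer, and the retransmissions themselves are omitted.

```python
class Receiver:
    """Receiver side of the simple reliable multicast (a sketch).

    Detects missing messages from the sequence numbers and records a
    negative acknowledgement (NACK) for each gap; duplicates are dropped.
    """
    def __init__(self):
        self.expected = 0     # next sequence number we expect
        self.nacks = []       # numbers of messages to request again

    def on_message(self, seq):
        if seq < self.expected:
            return                           # duplicate, drop it
        if seq > self.expected:
            # messages expected .. seq-1 are missing: ask for retransmission
            self.nacks.extend(range(self.expected, seq))
        self.expected = seq + 1

r = Receiver()
r.on_message(0)
r.on_message(2)        # message 1 was lost on the way
print(r.nacks)         # -> [1]
```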
Scalability
The proposed simple multicast solution is not very scalable.
If a message is multicast to N receivers, the sender gets N
acknowledgements.
One solution is to send only negative acknowledgements, no positive
ones. The drawback is that without positive acknowledgements the
sender must keep its history buffer forever.
Other solutions are Scalable Reliable Multicasting (SRM) protocol
and Hierarchical Feedback Control.
Scalable Reliable Multicasting
The principles of the Scalable Reliable Multicasting are the
following:
• Only negative acknowledgements are sent.
• A process sends its negative acknowledgement only after a
random delay.
• Receivers listen to a common feedback channel that is used for
the negative acknowledgements.
• When a receiver observes a negative acknowledgement sent by
another receiver, it suppresses the one it was about to send
itself: the message will be retransmitted anyway, so there is
no need to waste bandwidth.
See Figure.
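The random-delay suppression can be sketched as a small discrete-event toy: each receiver that missed the message draws a random delay, the earliest NACK reaches the feedback channel first, and the rest stay silent. Names are illustrative.

```python
import random

def srm_feedback(receivers_missing, max_delay=1.0):
    """SRM-style NACK suppression (a simplified sketch).

    Each receiver that missed a message schedules its NACK after a
    random delay; the first NACK on the shared feedback channel
    makes the other receivers suppress theirs.
    """
    delays = {r: random.uniform(0, max_delay) for r in receivers_missing}
    first = min(delays, key=delays.get)   # this NACK hits the channel first
    suppressed = [r for r in receivers_missing if r != first]
    return first, suppressed

sender, silenced = srm_feedback(["R1", "R2", "R3"])
print(f"{sender} sends the NACK; {silenced} stay silent")
```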
Hierarchical Feedback Control
When the number of receivers is very high, multicasting can be
made hierarchical as follows:
• If the number of receivers is high, they are divided into subgroups,
each of which selects a coordinator. If the number of receivers is
very high, this division is continued recursively; thus the
receivers form a tree.
• Messages and acknowledgements are sent to the coordinator.
• Coordinator maintains the history buffer for receivers below it.
Atomic multicast
We now return to the case where processes may fail. In the atomic
multicast problem it is required that
• Messages are delivered to all processes or to none.
• Messages are delivered in the same order to all processes.
The system may continue running after the crash of a process,
but after a crashed process recovers, no updates are allowed until
it has been brought up-to-date with other processes. Thus atomic
multicasting ensures consistency.
DISTRIBUTED COMMIT
The principle of atomic multicasting can be generalized. In
distributed commit, an operation should be performed by all
processes of a group, or by none.
We shall consider two solutions to the problem:
• two-phase commit (2PC), and
• three-phase commit (3PC)
Two-phase commit
The main idea of 2PC is that the client that initiated the
computation acts as the coordinator, while the processes that
need to commit are the participants.
1a Coordinator sends vote-request to participants.
1b Participants respond by vote-commit or vote-abort to the
coordinator and remain waiting.
2a Coordinator collects the votes. If all votes were vote-commit,
it sends global-commit to the participants, otherwise it sends
global-abort.
2b If a participant receives global-commit, it commits the
transaction; in case of global-abort, the transaction is aborted.
Figure (a) describes the steps in the coordinator, Figure (b) the
steps in the participants.
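The failure-free path of 2PC can be sketched as follows, with each participant modeled as a function that returns its vote. This is a simplification: timeouts, logging, and crashes are omitted.

```python
def two_phase_commit(participants):
    """Coordinator side of 2PC (failure-free sketch).

    Each participant is a function returning "vote-commit" or
    "vote-abort"; the names and votes are illustrative.
    """
    # Phase 1: send vote-request and collect the votes.
    votes = [p() for p in participants]
    # Phase 2: make and broadcast the global decision.
    if all(v == "vote-commit" for v in votes):
        return "global-commit"
    return "global-abort"

print(two_phase_commit([lambda: "vote-commit", lambda: "vote-commit"]))
# -> global-commit
print(two_phase_commit([lambda: "vote-commit", lambda: "vote-abort"]))
# -> global-abort
```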
Failing participant in 2PC
For recovery, the coordinator and the participants keep logs and,
until the final commit, make their computations in a temporary
workspace. Depending on the state in which the crash happens, the
participant proceeds as follows:
• Failure at initial state: No problem as participant is unaware of
the protocol.
• Ready state: The participant was waiting for global-commit or
global-abort. After recovery, the participant needs to know
which state transition to make; it must learn the coordinator's
decision from the coordinator's log.
• Abort state: Just remove the workspace to return to the state
before the transaction.
• Commit state: Perform the commit.
Failing coordinator
The coordinator stores its decisions in a persistent log so that
they can be found after a crash.
But what can participants do, if coordinator crashes when it should
make the global decision?
Three-phase commit
1a Coordinator sends vote-request to participants.
1b Participant responds with vote-commit or vote-abort.
2a Coordinator collects the votes. If all vote commit, it sends prepare to the participants, otherwise it sends abort.
2b Participants wait for prepare or abort.
3a Coordinator waits for acknowledgement ack of the reception
of prepare from participants, and then sends commit to
participants.
3b Participants wait for commit.
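The coordinator's moves in the steps above can be written as a small state-transition table. This sketches the failure-free path only; the state and event names are illustrative.

```python
# Coordinator state transitions in 3PC, as a table:
# (state, event) -> next state.
TRANSITIONS = {
    ("INIT", "send vote-request"): "WAIT",
    ("WAIT", "all vote-commit"): "PRECOMMIT",   # coordinator sends prepare
    ("WAIT", "some vote-abort"): "ABORT",       # coordinator sends abort
    ("PRECOMMIT", "all ack"): "COMMIT",         # coordinator sends commit
}

def run(events):
    """Drive the coordinator through a sequence of events."""
    state = "INIT"
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state

print(run(["send vote-request", "all vote-commit", "all ack"]))  # -> COMMIT
print(run(["send vote-request", "some vote-abort"]))             # -> ABORT
```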
3PC failing participant
• The key idea is that on the way towards commit, the coordinator
and the participants never differ by more than one state transition.
• After a crash, a participant uses information available from the
coordinator to decide whether it should abort or continue towards
commit.
• If the coordinator crashes, it may be necessary to select a new
coordinator.
RECOVERY
So far we have concentrated on tolerating faults. How to recover
to an error-free state after a failure? Two main choices are:
1. Forward error recovery: Find a new state where the system can
continue operation.
2. Backward error recovery: Bring the system back into a previous
error-free state. Some recovery points are needed.
A big difficulty in distributed systems is to identify a consistent
state where to continue.
Checkpointing
For recovery, processes may regularly record snapshots of their
states, called checkpoints. A recovery line is the most recent
consistent global checkpoint.
A property of a consistent global checkpoint is that every message
recorded as received is also recorded as sent.
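Assuming the usual consistency condition (every message recorded as received must also be recorded as sent, so that no message is received "from the future"), a consistency check for a global checkpoint can be sketched as:

```python
def consistent(sent, received):
    """Check a global checkpoint for consistency (a sketch).

    `sent` and `received` are sets of message ids collected from the
    processes' checkpoints. The checkpoint is consistent if every
    received message also appears as sent; messages sent but still
    in transit are acceptable.
    """
    return received <= sent

# m2 was received, but the sender checkpointed before recording the send:
print(consistent(sent={"m1"}, received={"m1", "m2"}))  # -> False
print(consistent(sent={"m1", "m2"}, received={"m1"}))  # -> True
```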
If checkpoints are taken at bad moments, rollbacks may cascade
(the domino effect). In the following Figure, one has to roll back
to the initial state!
Independent checkpointing
Let CP[i](m) denote the mth checkpoint of process Pi and
INT[i](m) the interval between checkpoints CP[i](m-1) and
CP[i](m). Proceed as follows:
• When process Pi sends a message in interval INT[i](m), it
piggybacks (i,m) on the message.
• When process Pj receives such a message in interval INT[j](n), it
records the dependency INT[i](m) → INT[j](n).
• The dependency INT[i](m) → INT[j](n) is stored in stable
storage when taking checkpoint CP[j](n).
• When Pi rolls back to CP[i](m-1), Pj must also roll back to
CP[j](n-1).
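The rollback propagation implied by the recorded dependencies can be sketched as a transitive closure over them. This is a toy model: intervals are (process, index) pairs, and the dependency map is illustrative.

```python
def rollback(start, deps):
    """Return all intervals that must be undone when `start` is undone.

    `deps` maps interval (i, m) to the set of intervals (j, n) that
    depend on it: rolling Pi back past INT[i](m) forces Pj back past
    INT[j](n), and so on transitively.
    """
    to_undo, undone = [start], set()
    while to_undo:
        interval = to_undo.pop()
        if interval in undone:
            continue
        undone.add(interval)
        to_undo.extend(deps.get(interval, ()))
    return undone

# P1 sent a message in INT[1](2) that P2 received in INT[2](3):
deps = {(1, 2): {(2, 3)}}
print(rollback((1, 2), deps))  # P2's interval must also be undone
```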
Message logging