SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL,...
-
Upload
pamela-brooking -
Category
Documents
-
view
229 -
download
5
Transcript of SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL,...
![Page 1: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/1.jpg)
SOFSEM 2006 André Schiper 1
Group Communication:from practice to theory
André Schiper
EPFL, Lausanne, Switzerland
![Page 2: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/2.jpg)
SOFSEM 2006 André Schiper 2
Outline
• Introduction to fault tolerance
• Replication for fault tolerance
• Group communication for replication
• Implementation of group communication
Practical issues
Theoretical issues
![Page 3: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/3.jpg)
SOFSEM 2006 André Schiper 3
Introduction
• Computer systems become every day more complex
• Probability of faults increases
• Need to develop techniques to achieve fault tolerance
![Page 4: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/4.jpg)
SOFSEM 2006 André Schiper 4
Introduction (2)
Fault tolerant techniques developed over the year
• Transactions (all or nothing property)
• Checkpointing (prevents state loss)
• Replication (masks faults)
![Page 5: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/5.jpg)
SOFSEM 2006 André Schiper 5
Transactions
• ACID properties (Atomicity, Consistency, Isolation, Durability)
begin-transactionremove (100, account-1)deposit (100, account-2)
end-transaction
Crash during the transaction: rollback to the state before the beginning of the transaction
account-1 account-2
site 1 site 2
![Page 6: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/6.jpg)
SOFSEM 2006 André Schiper 6
Checkpointing
Crash of p1: rollback of p1 to the latest saved state
chkpt chkpt
chkpt
chkpt
Xcrash
p1
p3
p2
![Page 7: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/7.jpg)
SOFSEM 2006 André Schiper 7
Replication
• Crash of s1 is masked to the client
• Upon recovery of s1: state transfer to bring s1 up-to-date
s1
s2
s3
client
replicated server
request
response
![Page 8: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/8.jpg)
SOFSEM 2006 André Schiper 8
Introduction to FT (6)
Comparison of the three techniques
• Only replication masks crashes (i.e., ensures high availability)
• Transactions and checkpointing: progress is only possible after recovery of the crashed process
We will concentrate on replication
![Page 9: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/9.jpg)
SOFSEM 2006 André Schiper 9
Outline
• Introduction to fault tolerance
• Replication for fault tolerance
• Group communication for replication
• Implementation of group communication
![Page 10: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/10.jpg)
SOFSEM 2006 André Schiper 10
Context
Replication in the general context of client-server interaction:
c srequest
response
serverclients
(several)
![Page 11: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/11.jpg)
SOFSEM 2006 André Schiper 11
Context (2)
• Fault tolerance by replicating the server
cs2
request
response
s3
s1
clients(several)
server s
![Page 12: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/12.jpg)
SOFSEM 2006 André Schiper 12
Correctness criterion: linearizability
• Need to keep the replica consistent
• Consistency defined in terms of the operations executed by the clients
ci
cj
S1
S2
S3
op
op
![Page 13: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/13.jpg)
SOFSEM 2006 André Schiper 13
Correctness criterion: linearizability (2)
Example:
• Server: FIFO queue
• Operations: enqueue/dequeue
![Page 14: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/14.jpg)
SOFSEM 2006 André Schiper 14
Correctness criterion: linearizability (3)
Example 1
p
q
execution
execution
enq(a)
enq(b)
deq( )
deq( )
b
a
linearizable execution
![Page 15: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/15.jpg)
SOFSEM 2006 André Schiper 15
Correctness criterion: linearizability (4)
Example 2
p
q
enq(a)
enq(b)
deq( )
deq( )
b
a
non linearizable execution
![Page 16: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/16.jpg)
SOFSEM 2006 André Schiper 16
Replication techniques
Two main replication techniques for linearizability
• Primary-backup replication (or passive replication)
• Active replication (or state machine replication)
![Page 17: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/17.jpg)
SOFSEM 2006 André Schiper 17
Primary-backup replication
client
S1 primary
S2 backup
S3 backup
request
update
response
![Page 18: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/18.jpg)
SOFSEM 2006 André Schiper 18
Primary-backup replication
• Crash cases
client
S1 primary
S2 backup
S3 backup
request
update
response
1 2 3
![Page 19: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/19.jpg)
SOFSEM 2006 André Schiper 19
Primary-backup replication (3)
• Crash detection usually based on time-outs
• Time-outs may lead to incorrectly suspect the crash of a process
• The technique must work correctly even if processes are incorrectly suspected to have crashed
![Page 20: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/20.jpg)
SOFSEM 2006 André Schiper 20
Active replication
client
S1
S2
S3
request response
wait for the first reply
Crash of a server replica transparent to the client
![Page 21: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/21.jpg)
SOFSEM 2006 André Schiper 21
Active replication (2)
• If more than one client, the requests must be received by all the replicas in the same order
• Ensured by a communication primitive called total order broadcast (or atomic broadcast)
• Complexity of active replication hidden in the implementation this primitive
S1
S3
C
C’
S2
![Page 22: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/22.jpg)
SOFSEM 2006 André Schiper 22
Outline
• Introduction to fault tolerance
• Replication for fault tolerance
• Group communication for replication
• Implementation of group communication
![Page 23: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/23.jpg)
SOFSEM 2006 André Schiper 23
Role of group communication
• Active replication: requires communication primitive that orders client requests
• Passive replication: have shown the issues to be addressed
• Group communication: communication infrastructure that provides solutions to these problems
Replication technique
Group communication
Transport layer
![Page 24: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/24.jpg)
SOFSEM 2006 André Schiper 24
Role of group communication (2)
Let g be a group with members p, q, r
• multicast(g, m): allows m to be multicast to g without knowing the membership of g
• IP-multicast (UDP) also provides such a feature
• Group communication provides stronger guarantees (reliability, order, etc.)
![Page 25: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/25.jpg)
SOFSEM 2006 André Schiper 25
Role of group communication (3)
We will discuss:
• Group communication for active replication
• Group communication for passive replication
![Page 26: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/26.jpg)
SOFSEM 2006 André Schiper 26
Atomic broadcast for active replication
Group communication primitive for active replication:
• atomic broadcast (denoted sometimes abcast)
• also called total order broadcast
Replication technique
Group communication
Transport layer
abcast (g, m) adeliver (m)
receive (m)send (m) to p
![Page 27: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/27.jpg)
SOFSEM 2006 André Schiper 27
Atomic broadcast for active replication (2)
S1
S3
C
C’
S2
abcast(gS, req)
abcast(gS, req’)
group gS
![Page 28: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/28.jpg)
SOFSEM 2006 André Schiper 28
Atomic broadcast: specification
• If p executes abcast(g, m) and does not crash, then all processes in g eventually adeliver m
• If some process in g adelivers m and does not crash, then all processes in g that do not crash eventually adeliver m
• If p, q in g adeliver m and m’, then they adeliver them in the same order
![Page 29: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/29.jpg)
SOFSEM 2006 André Schiper 29
Role of group communication (3)
We will discuss:
• Group communication for active replication
• Group communication for passive replication
![Page 30: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/30.jpg)
SOFSEM 2006 André Schiper 30
Generic broadcast for passive replication
• Atomic broadcast can also be used to implement passive replication
• Better solution: generic broadcast
• Same spec as atomic broadcast, except that not all messages are ordered:– Generic broadcast based on a conflict relation on the messages
– Conflicting messages are ordered, non conflicting messages are not ordered
![Page 31: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/31.jpg)
SOFSEM 2006 André Schiper 31
Generic broadcast for passive replication (2)
• Two types of messages:
– Update
– PrimaryChange: issued if a process suspects the current primary
Upon delivery: cyclic permutation of process list
New primary: process at the head of the process list
• Conflict relation:PrimaryChange Update
PrimaryChange no conflict conflict
Update conflict conflict
![Page 32: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/32.jpg)
SOFSEM 2006 André Schiper 32
Generic broadcast for passive replication (3)
client
S1 primary
S2 backup
S3 backup
request
Update
PrimaryChange
![Page 33: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/33.jpg)
SOFSEM 2006 André Schiper 33
Generic broadcast for passive replication (4)
Another way to view the strategy:
• Replicas numbered 0 .. n-1
• Computation divided into rounds
• In round r, the primary is the process with number r mod n
• When process p suspects the primary of round r, it broadcasts newRound (= primaryChange)
newRound Update
newRound no conflict conflict
Update conflict conflict
Conflict relation
All processes deliver Update in the same round
![Page 34: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/34.jpg)
SOFSEM 2006 André Schiper 34
Other group communication issues
To discuss:
• Static vs. dynamic group
• Crash-stop vs. crash-recovery model
![Page 35: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/35.jpg)
SOFSEM 2006 André Schiper 35
Static vs. dynamic groups
• Static group: group whose membership does not change during the lifetime of the system
• A static group is sometimes too limitative from a practical point of view: may want to replace a crashed replica with a new replica
• Dynamic group: group whose membership changes during the lifetime of the system
• Dynamic group requires to address two problems:– How to add / remove processes from the group (group
membership problem)– Semantics of communication primitives
![Page 36: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/36.jpg)
SOFSEM 2006 André Schiper 36
Group membership problem
• Need to distinguish– (1) group
– (2) membership of the group at time t
• New notion: view of group g defined by: (i, s)– i identifies the view
– s: membership of view i (set of processes)
– membership i also denoted vi
![Page 37: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/37.jpg)
SOFSEM 2006 André Schiper 37
Group membership problem (2)
Changing the membership of a group:
• add (g, p): for adding p to group g
• remove (g, p): for removing p from group g
Usual requirement: all the members of a group see the same sequence of views:
if (i, sp) the view i of p
and (i, sq) the view i of q
then sp = sq
![Page 38: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/38.jpg)
SOFSEM 2006 André Schiper 38
Process recovery
• Usual theoretical model: crash-stop – processes do not have access to stable storage (disk)
– if a process crashes, its whole state is lost (i.e. no recovery possible)
• Drawback of the static crash-stop model: if non null crash probability, eventually all processes will have crashed
• Drawback of the dynamic crash-stop model: not tolerant to catastrophic failures (crash of all the processes in a group)
• To tolerate catastrophic failures: crash-recovery model
![Page 39: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/39.jpg)
SOFSEM 2006 André Schiper 39
Combining the different GC models
Combining static/dynamic with crash-stop/crash-recovery:
1. Static groups in the crash-stop model
Considered in most theoretical papers
2. Dynamic groups in the crash-stop model
Considered in most existing GC systems
3. Static groups in the crash-recovery model
4. Dynamic groups in the crash-recovery model
![Page 40: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/40.jpg)
SOFSEM 2006 André Schiper 40
Quorum systems vs. group communication
• Quorum system: context of read/write operations
• Typical situation:
– Access a majority of replicas to perform a read operation
– Access a majority of replicas to perform a write operation
![Page 41: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/41.jpg)
SOFSEM 2006 André Schiper 41
Quorum systems vs. group communication
Example:
• a replicated server managing an integer with read/write operations (called register)
• operation to perform: increment
![Page 42: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/42.jpg)
SOFSEM 2006 André Schiper 42
Quorum systems vs. group communication (2)
• Solution with quorum system:
• Solution with group communication
client
s1
s2
s3
client
s1
s2
s3
Read
Increment
Write
Increment
Critical section
![Page 43: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/43.jpg)
SOFSEM 2006 André Schiper 43
Outline
• Introduction to fault tolerance
• Replication for fault tolerance
• Group communication for replication
• Implementation of group communication
![Page 44: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/44.jpg)
SOFSEM 2006 André Schiper 44
Implementation of group communication
Context:
• Static groups
• Crash-stop model
• Non-malicious processes
• Atomic broadcast
• Generic broadcast
Consensus
Atomicbroadcast
Genericbroadcast
Common denominator: Consensus
![Page 45: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/45.jpg)
SOFSEM 2006 André Schiper 45
Consensus (informal)
p1
p2
p3
4
7
1
7
7
7
![Page 46: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/46.jpg)
SOFSEM 2006 André Schiper 46
Consensus (formal)
A set of processes, each pi with an initial value vi
Three properties:
Validity: If a process decides v, then v is the initial value of some process
Agreement: No two processes decide differently
Termination: Every process eventually decides some value
![Page 47: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/47.jpg)
SOFSEM 2006 André Schiper 47
Impossibility result
Asynchronous system:
• No bound on the message transmission delay
• No bound on the process relative speeds
Fischer-Lynch-Paterson impossibility result (1985):
Consensus is not solvable in an asynchronous system with a deterministic algorithm and reliable links if one single process may crash
![Page 48: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/48.jpg)
SOFSEM 2006 André Schiper 48
Impossibility of atomic broadcast
• By contradiction
p1
p2
p3
4
7
1
7
7
7Abcast(4)
Abcast(7)
Abcast(1)
Adeliver(7)
Adeliver(7)
Adeliver(7)
![Page 49: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/49.jpg)
SOFSEM 2006 André Schiper 49
Models for solving consensus
Synchronous system:– There is a known bound on the transmission delay of messages
– There is a known bound on the process relative speeds
Consensus solvable with up to n-1 faulty processes
Drawback:
• Requires to be pessimistic (large bounds)
• Large bounds lead to a large blackout period in case of a crash
![Page 50: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/50.jpg)
SOFSEM 2006 André Schiper 50
Models for solving consensus (2)
Partially synchronous system (Dwork, Lynch, Stockmeyer, 1988)
Two variants:
• There is a bound on the transmission delay of messages and on the process relative speed, but these bounds are not known
• There are known bounds on the transmission delay of messages and on the process relative speed, but these bounds hold only from some unknown point on.
Consensus solvable with a majority of correct processes
![Page 51: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/51.jpg)
SOFSEM 2006 André Schiper 51
Models for solving consensus (3)
Failure detector model (Chandra,Toueg, 1995)
• Asynchronous model augmented with an oracle (called failure detector) defined by abstract properties
• A failure detector defined by a completeness property and an accuracy property
pFDpr FDr
qFDq
network
![Page 52: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/52.jpg)
SOFSEM 2006 André Schiper 52
Failure detector model (2)
Example: failure detector S• Strong completeness: eventually every process that crashes
is permanently suspected to have crashed by every correct process
• Eventual weak accuracy: There is a time after which some correct process is never suspected by any correct process
Consensus solvable with S and a majority of correct processes
Drawback:
• requires reliable links
![Page 53: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/53.jpg)
SOFSEM 2006 André Schiper 53
RbR model
• Joint work with Bernadette Charron-Bost• Combination of:
– Gafni 1998, Round-by-round failure detectors(unifies synchronous model and failure detector model)
– Santoro, Widmayer, 1989, Time is not a healer (show that dynamic faults have the same destructive effect on solving consensus as asynchronicity)
• Existing models for solving consensus have put emphasis on static faults
• Our new model handles static faults and dynamic faults in the same way
![Page 54: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/54.jpg)
SOFSEM 2006 André Schiper 54
RbR machine
• Distributed algorithm A on the set of processes
• Communication predicate P on the HO’s Examples: P1 : p , r > 0 : | HO(p, r) | > n/2
P2 : r0, HO 2, p : HO(p, r0) = HO
• A problem is solved by a pair ( A, P )
In round r, process p receives messages from the set HO(p, r)
st st’
Round r
p
![Page 55: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/55.jpg)
SOFSEM 2006 André Schiper 55
Round-by-Round (RbR) model (2)
• Assume q HO(p, r): in the RbR model, no tentative to justify this (e.g, channel failure, crash of q, send-omission failure of q, etc.)
• This removes the difference between static and dynamic faults (static: crash-stop; dynamic: channel fault, crash-recovery)
• In the RbR model, no notion of faulty process
![Page 56: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/56.jpg)
SOFSEM 2006 André Schiper 56
Consensus algorithms
• A lot ….
• Most influential algorithms:– Consensus algorithm based on the failure detector S (Chandra-
Toueg, 1995)
– Paxos (Lamport, 1989-1998)
![Page 57: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/57.jpg)
SOFSEM 2006 André Schiper 57
Consensus algorithms (2)
The CT (Chandra-Toueg) and Paxos algorithms have strong similarities but also important differences:
• CT based on rotating coordinator / Paxos based on a dynamically chosen leader
• CT requires reliable links / Paxos tolerates lossy links
• The condition for liveness clearly defined in CT
• The condition for liveness less formally defined in Paxos (can be formally expressed in the RbR model)
![Page 58: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/58.jpg)
SOFSEM 2006 André Schiper 58
Paxos in the RbR modelInitialization
xp := vp
votep := V {?}, initially ?
voteToSendp:=false; decidedp:=false; tsp:=0
Round r = 4Φ – 3 :
Srp : send xp, tsp to leaderp(Φ)
Trp :
if p = leaderp(Φ) and #x, ts rcvd > n/2 then let θ be the largest θ from v, θ received
votep := one v such that v,θ received
voteToSendp := true
Round r = 4Φ – 2 :
Srp : if p = leaderp(Φ) and voteToSendp then
send votep to all processes
Trp :
if receivedv from leaderp(Φ) then
xp := v ; tsp := Φ
Round r = 4Φ – 1 :
Srp :
if tsp = Φ then
send ack to leaderp(Φ)
Trp :
if p = leaderp(Φ) and #ack rcvd > n/2 then
DECIDE (votep); decidedp := true
Round r = 4Φ
Srp :
if p = coordp(Φ) and decidedp then
send votep to all processes
Trp :
if received v and not decidedp then
DECIDE(v) ); decidedp := true
voteToSendp := false leader
![Page 59: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/59.jpg)
SOFSEM 2006 André Schiper 59
Paxos (2)
CONDITION FOR SAFETY CONDITION FOR LIVENESS
None
Φ0 > 0 :
p : | HO(p, Φ0) | > n/2
and
p, q: leaderp(Φ0) = leaderq(Φ0)
and
leaderp(Φ0) K(Φ0)
• Kernel of round r: K (r) = HO (p, r)
• Kernel of a phase Φ : K(Φ) = K (r)
p
r (phase = sequence of rounds)
![Page 60: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/60.jpg)
SOFSEM 2006 André Schiper 60
Atomic broadcast
• A lot of algorithms published …
• Easy to implement using a sequence of instance of consensus
• Consider atomic broadcast within a static group g:– Consensus within g on a set of messages
– Let consensus #k decide on the set Msg(k)
– Each process delivers the messages in Msg(k) before those in Msg(k+1)
– Each process delivers the messages in Msg(k) in a deterministic order (e.g., in the order of the message ids)
![Page 61: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/61.jpg)
SOFSEM 2006 André Schiper 61
Atomic broadcast (2)
Leads to the following algorithm:
• Each process p manages a counter k and a set undeliveredMessages
• Upon abcast(m) do broadcast(m)
• Upon reception of m do– add m to undeliveredMessages
– if no consensus algorithm is running then
Msg(k) consensus (undeliveredMessages)
deliver messages in Msg(k) in some deterministic order
![Page 62: SOFSEM 2006André Schiper1 Group Communication: from practice to theory André Schiper EPFL, Lausanne, Switzerland.](https://reader035.fdocuments.in/reader035/viewer/2022062312/551b516d5503465c7e8b5a6a/html5/thumbnails/62.jpg)
SOFSEM 2006 André Schiper 62
Conclusion
• Necessarily superficial presentation of group communication
• Comments:– Static/crash-stop model has reached maturity
– Maturity not yet reached in the other models (static/crash-recovery, dynamic/crash-stop, …)
– With the RbR model we hope to bridge the gap between static/crash-stop and static/crash-recovery (ongoing implementation work)
– More work needed to quantitatively compare the various atomic broadcast algorithms (and other algorithms)