Overview - Southern Illinois University Carbondale, CS420 (slides/cs420-part4.pdf)
Overview
♦ Introduction
♦ Fundamental Concepts of Distributed Systems
  • System models
  • Review of network architectures
  • Interprocess communication
♦ Time and Global States
  • Clocks and concepts of time
  • Synchronization
  • Global states
♦ Coordination
  • Distributed mutual exclusion
  • Multicast
  • Byzantine problems
♦ Distribution and Operating Systems
  • Protection mechanisms
  • Processes and threads
♦ Distributed File Systems
  • Network file system (NFS)
♦ Middleware
  • Distributed object models
    – Remote invocation
    – CORBA
  • Name services
♦ Security
  • Cryptographic algorithms
  • Digital signatures
♦ Distribution and Database Systems
  • Distribution of databases
  • Transactions and concurrency control
  • Concurrency control in distributed transactions
♦ Distributed Shared Memory
  • Sequential consistency
♦ Telecommunications Systems
  • Distributed multimedia systems
  • Intelligent networks
  • Network management
Coordination
♦ Coordination Problems in Distributed Systems
  • asynchronous distributed systems: no one process has a view of the current global system state
  • need to coordinate the actions of the independent processes to achieve common goals
    – failure detection: how do I know in an asynchronous network whether my peer is dead or alive?
    – mutual exclusion: no two processes will ever get access to a shared resource in a critical section at the same time
    – election: in master-slave systems, how will the system elect a master (either at boot time or when the master fails)?
    – multicast: sending to a group of recipients
      · reliability of multicast
      · order preservation
    – consensus in the presence of faults (Byzantine problems):
      · how to know whether an acknowledgement was received over an unreliable communication medium
      · how to know whether a peer process knows about one's own intentions in the presence of a non-confidential communication channel
Failure Detection
♦ Failure Detector
  • service that possesses the capability to decide whether a particular process has crashed or not
  • local failure detector in each object, collaborating with peers in other processes to detect failure
    – unreliable failure detector: distinguishes suspected and unsuspected peer processes
      · unsuspected: failure is unlikely (e.g., the failure detector has recently received communication from the unsuspected peer)
        * may be inaccurate
      · suspected: indication that the peer process has failed (e.g., no message received in quite some time)
        * may be inaccurate (e.g., the peer process hasn't failed, but the communication link is down, or the peer process is much slower than expected)
    – reliable failure detector
      · unsuspected: potentially inaccurate as above
      · failed
        * accurate determination that the peer process has failed
Failure Detection
♦ Failure Detector
  • implementation of an unreliable failure detector
    – periodically, every T seconds, each process p sends an "I'm alive" message to every other process
    – if the local failure detector at q does not receive "I'm alive" from p within T + D (D = estimated maximum transmission delay), then p is suspected
    – will revise the verdict if a message is subsequently received
  • problem: how to calibrate D
    – for small D, intermittent network performance downgrades will frequently lead to non-crashed processes being suspected, or
    – for large D, crashes will remain unobserved (crashed nodes will be fixed before the timeout expires)
  • solution approaches
    – variable D, based on observed network latencies
  • implementation of reliable failure detectors is only possible in synchronous networks
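The timeout-based scheme above can be sketched in a few lines. This is a minimal, single-threaded illustration (class and method names are invented for the sketch); the verdict is revised whenever a new "I'm alive" message arrives, and a "suspected" verdict may of course be wrong, exactly as the slide notes.

```python
import time

class UnreliableFailureDetector:
    """Minimal sketch of the unreliable failure detector described above.

    Peers send "I'm alive" every T seconds; a peer is suspected if nothing
    has been heard for T + D seconds (D = assumed max transmission delay).
    """

    def __init__(self, T, D):
        self.timeout = T + D
        self.last_heard = {}            # peer id -> time of last "I'm alive"

    def on_alive(self, peer, now=None):
        # record (or revise) the verdict when a message arrives
        self.last_heard[peer] = time.monotonic() if now is None else now

    def status(self, peer, now=None):
        now = time.monotonic() if now is None else now
        seen = self.last_heard.get(peer)
        if seen is None or now - seen > self.timeout:
            return "suspected"          # may be wrong: link down or slow peer
        return "unsuspected"

fd = UnreliableFailureDetector(T=5.0, D=2.0)
fd.on_alive("p1", now=100.0)
print(fd.status("p1", now=104.0))   # within T + D -> unsuspected
print(fd.status("p1", now=110.0))   # silent too long -> suspected
```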
Mutual Exclusion
♦ Mutual exclusion problems
  • prominent problem in multitasking operating systems
    – access to shared memory
    – access to shared resources
    – access to shared data
    – various algorithms to ensure mutual exclusion, e.g.
      · Dijkstra's semaphores
      · monitors
  • mutual exclusion in distributed systems
    – no shared memory
    – usually, no centralized instance like an operating system kernel that would coordinate access
    – based on a synchronous or asynchronous design approach
  • examples
    – consistent access to shared files (e.g., network file systems)
    – coordination of access to an access point in an IEEE 802.11 WaveLAN
Mutual Exclusion
♦ Requirements for Mutual Exclusion Algorithms
  • ME1: at most one process may execute in the critical section at any given point in time (safety)
  • ME2: requests to enter or exit the critical section will eventually succeed (liveness)
    – impossible for one process to enter the critical section more than once while other processes are awaiting entry
  • ME3: if one request to enter the critical section is issued before another request (as per the → relation), then the requests will be served in the same order
[Figure: processes 1 .. n, each executing enter(), access(), exit() around the critical section]
Mutual Exclusion
♦ Performance criteria to be used in the assessment of mutual exclusion algorithms
  • bandwidth consumed (corresponds to the number of messages sent)
  • client delay at each entry and exit
  • throughput: the number of critical-region accesses that the system allows
    – here measured in terms of the synchronization delay between one process exiting the critical section and the next process entering
Mutual Exclusion
♦ Central Server-based Algorithm
  • central server receives access requests
    – if no process is in the critical section, the request will be granted
    – if a process is in the critical section, the request will be queued
  • process leaving the critical section
    – grant access to the next process in the queue, or wait for new requests if the queue is empty
♦ Properties
  • satisfies ME1 and ME2, but not ME3 (network delays may reorder requests)
  • two messages per request, one per exit; exit does not delay the exiting process
  • performance and availability of the server are the bottlenecks
[Figure: central server holding the token and a queue of requests; processes request, release, and are granted the token (© Addison-Wesley Publishers 2000)]
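The server's grant/queue logic can be sketched as follows. This is a single-threaded model where "messages" are method calls (class and method names are illustrative, not from the slides); a real deployment would run this behind a network endpoint.

```python
from collections import deque

# Minimal sketch of the central-server algorithm described above.
class CentralServer:
    def __init__(self):
        self.holder = None           # process currently in the critical section
        self.queue = deque()         # queued requests, FIFO

    def request(self, pid):
        """Return True if the token is granted immediately, else queue."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """The exiting process returns the token; grant to next in queue."""
        assert self.holder == pid
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder           # the process now granted, or None

s = CentralServer()
print(s.request("p1"))   # True: granted immediately
print(s.request("p2"))   # False: queued behind p1
print(s.release("p1"))   # p2: token passed to the next process in the queue
```

Note that FIFO queueing at the server still does not give ME3: a request issued earlier (in the happened-before sense) may arrive at the server later.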
Mutual Exclusion
♦ Ring-based Algorithm
  • logical, not necessarily physical, links: every process pi has a connection to process p(i+1) mod N
  • the token passes in one direction through the ring
  • token arrival
    – only the process in possession of the token may access the critical region
    – if there is no request upon arrival of the token, or when exiting the critical region, pass the token on to the neighbour
  • satisfies ME1 and ME2, but not ME3
  • performance
    – constant bandwidth consumption
    – entry delay between 0 and N message transmission times
    – synchronization delay between 1 and N message transmission times
[Figure: processes p1 .. pn arranged in a ring, passing the token in one direction (© Addison-Wesley Publishers 2000)]
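A tiny simulation makes the token circulation concrete. This sketch (function and parameter names are invented) models one token hopping around the ring; a process enters the critical section only while it holds the token.

```python
# Minimal sketch of the ring-based algorithm: the token circulates and a
# process uses the critical section only while holding the token.
def ring_rounds(n, wants, rounds=2):
    """Simulate token passing among processes 0..n-1.

    `wants` is the set of processes wanting the critical section;
    returns the order in which they enter.
    """
    order = []
    pending = set(wants)
    token = 0                         # process currently holding the token
    for _ in range(rounds * n):
        if token in pending:
            order.append(token)       # enter, then exit the critical section
            pending.discard(token)
        token = (token + 1) % n       # pass the token to the neighbour
    return order

print(ring_rounds(5, {3, 1}))  # [1, 3]: the token reaches p1 first, then p3
```

Entry order is determined by ring position relative to the token, not by request time, which is why ME3 does not hold.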
Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • based on multicast
    – a process requesting access multicasts its request to all other processes
    – a process may only enter the critical section once all other processes have returned positive acknowledgement messages
  • assumptions
    – all processes have communication channels to all other processes
    – all processes have distinct numeric IDs and maintain logical clocks

  On initialization
      state := RELEASED;
  To enter the section
      state := WANTED;
      Multicast request to all processes;   (processing of incoming requests deferred here)
      T := request's timestamp;
      Wait until (number of replies received = (N – 1));
      state := HELD;
  On receipt of a request <Ti, pi> at pj (i ≠ j)
      if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
      then queue request from pi without replying;
      else reply immediately to pi;
      end if
  To exit the critical section
      state := RELEASED;
      reply to any queued requests;
  (© Addison-Wesley Publishers 2000)
Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • if a request is broadcast and the state of all other processes is RELEASED, then all processes will reply immediately and the requester will obtain entry
  • if at least one process is in state HELD, that process will not reply until it has left the critical section, hence mutual exclusion
  • if two or more processes request at the same time, whichever request bears the lower timestamp will be the first to get N – 1 replies
  • in case of equal timestamps, the process with the lower ID wins
Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • p3 is not attempting to enter; p1 and p2 request entry simultaneously
  • p3 replies immediately
  • p2 receives the request from p1; timestamp(p2) < timestamp(p1), therefore p2 does not reply
  • p1 sees its own timestamp to be larger than that of the request from p2, hence it replies immediately and p2 is granted access
  • p2 will reply to p1's request after exiting the critical section
[Figure: p1 (timestamp 41) and p2 (timestamp 34) multicast their requests; p3 replies to both, p1 replies to p2, and p2 enters first (© Addison-Wesley Publishers 2000)]
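The receipt rule from the pseudocode can be sketched directly, using (timestamp, id) pairs as the totally ordered Lamport pairs. Names here are illustrative; the example reproduces the p1/p2 scenario above.

```python
# Sketch of the "On receipt of a request" rule of Ricart and Agrawala.
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def on_request(state, own_pair, req_pair):
    """Return 'queue' or 'reply' for an incoming request.

    own_pair = (T, pj), this process's own pending request (or None);
    req_pair = (Ti, pi), the incoming request.  Tuple comparison gives the
    total order: lower timestamp wins, lower id breaks ties.
    """
    if state == HELD or (state == WANTED and own_pair < req_pair):
        return "queue"        # defer the reply until we exit
    return "reply"            # grant immediately

# p2 (timestamp 34) receives p1's request (timestamp 41): p2's pair is
# smaller, so p2 queues the request; conversely, p1 replies to p2.
print(on_request(WANTED, (34, 2), (41, 1)))  # queue
print(on_request(WANTED, (41, 1), (34, 2)))  # reply
print(on_request(RELEASED, None, (41, 1)))   # reply
```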
Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • the algorithm satisfies ME1
    – two processes pi and pj could only access the critical section at the same time if they had replied to each other
    – since the pairs <Ti, pi> are totally ordered, this cannot happen
  • the algorithm also satisfies ME2 and ME3
Mutual Exclusion
♦ Algorithm by Ricart and Agrawala
  • performance
    – gaining access requires 2(N – 1) messages per request
    – synchronization delay: only one message transmission time
    – client delay: just one round-trip time (previous algorithms: up to N message times)
  • protocol improvements
    – repeated entry of the same process without executing the protocol
    – optimization possible to N messages per request (with hardware support for multicast)
Mutual Exclusion
♦ Maekawa's Voting Algorithm
  • observation
    – to get access, not all processes have to agree
    – it suffices to split the set of processes into overlapping subsets ("voting sets")
    – it suffices that there is consensus within every subset
  • model
    – processes p1, .., pN
    – voting sets V1, .., VN chosen such that, for all i, k and for some integer M:
      · pi ∈ Vi
      · Vi ∩ Vk ≠ ∅ (some overlap between any two voting sets)
      · |Vi| = K (fairness: all voting sets have equal size)
      · each process pk is contained in M voting sets
Mutual Exclusion
♦ Maekawa's Voting Algorithm
  • protocol
    – to obtain entry to the critical section, pi sends request messages to all K – 1 other members of its voting set Vi
    – pi cannot enter until K – 1 replies have been received
    – when leaving the critical section, send release to all members of Vi
    – when receiving a request
      · if state = HELD, or a reply (vote) has already been sent since the last release, then queue the request
      · else immediately send a reply
    – when receiving a release
      · remove the request at the head of the queue and send a reply
Mutual Exclusion
♦ Maekawa's Voting Algorithm

  On initialization
      state := RELEASED; voted := FALSE;
  For pi to enter the critical section
      state := WANTED;
      Multicast request to all processes in Vi – {pi};
      Wait until (number of replies received = (K – 1));
      state := HELD;
  On receipt of a request from pi at pj (i ≠ j)
      if (state = HELD or voted = TRUE)
      then queue request from pi without replying;
      else send reply to pi; voted := TRUE;
      end if
  For pi to exit the critical section
      state := RELEASED;
      Multicast release to all processes in Vi – {pi};
  On receipt of a release from pi at pj (i ≠ j)
      if (queue of requests is non-empty)
      then remove head of queue – from pk, say;
           send reply to pk; voted := TRUE;
      else voted := FALSE;
      end if
  (© Addison-Wesley Publishers 2000)
Mutual Exclusion
♦ Maekawa's Voting Algorithm
  • optimization goal: minimize K while achieving mutual exclusion
    – can be shown to be reached when K ~ √N and M = K
  • optimal voting sets: nontrivial to calculate
    – approximation: derive Vi so that |Vi| ~ 2√N
      · place the processes in a √N by √N matrix
      · let Vi be the union of the row and the column containing pi
  • satisfies ME1
    – if it were possible for two processes to enter the critical section simultaneously, then the processes in the non-empty intersection of their voting sets would have granted access to both
    – impossible, since all processes cast at most one vote after receiving a request
  • deadlocks are possible
    – consider three processes with
      · V1 = {p1, p2}, V2 = {p2, p3}, V3 = {p3, p1}
    – possible to construct a cyclic wait graph:
      · p1 replies to p2, but queues the request from p3
      · p2 replies to p3, but queues the request from p1
      · p3 replies to p1, but queues the request from p2
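The row-plus-column approximation above can be sketched concretely. This sketch assumes N is a perfect square for simplicity (function name and layout are illustrative); any two voting sets overlap because a row and a column of the matrix always intersect.

```python
import math

# Sketch of the approximation: place N = k*k processes in a k x k matrix;
# the voting set of pi is the union of its row and its column.
def voting_set(i, n):
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes N is a perfect square"
    r, c = divmod(i, k)
    row = {r * k + j for j in range(k)}
    col = {j * k + c for j in range(k)}
    return row | col                  # |Vi| = 2*sqrt(N) - 1, includes pi

# With N = 9 processes in a 3x3 grid:
print(sorted(voting_set(4, 9)))      # [1, 3, 4, 5, 7]
# Pairwise overlap: the voting sets of p0 and p8 still intersect.
print(sorted(voting_set(0, 9) & voting_set(8, 9)))  # [2, 6]
```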
Mutual Exclusion
♦ Maekawa's Voting Algorithm
  • the algorithm can be adapted to become deadlock-free
    – use of logical clocks
    – processes queue requests in happened-before order
    – this means that ME3 is also satisfied
  • performance
    – bandwidth utilization
      · 2√N messages per entry, √N per exit; the total of 3√N is better than Ricart and Agrawala for N > 4
    – synchronization delay
      · round-trip time instead of the single message transmission time of Ricart and Agrawala
Mutual Exclusion
♦ Notes on Fault Tolerance
  • none of these algorithms tolerates message loss
  • the ring algorithm cannot tolerate a single crash failure
  • Maekawa's algorithm can tolerate some crash failures
    – if the crashed process is in a voting set that is not required, the rest of the system is not affected
  • central server: tolerates the crash failure of a node that has neither requested access nor is currently in the critical section
  • the Ricart and Agrawala algorithm can be modified to tolerate crash failures by assuming that a failed process grants all requests immediately
    – requires a reliable failure detector
Election Algorithms
♦ Election
  • algorithm designed to designate one unique process, out of a set of processes with similar capabilities, to take over certain functions in a distributed system
    – central server for mutual exclusion
    – ring master in token ring networks
    – bus master
  • necessary when
    – the system is booted
    – the server fails
    – the server retires
  • properties, to hold during any particular run of the system
    – E1: every process pi has electedi = ⊥ (undefined) or electedi = P, where P is the non-crashed process with the largest identifier at the end of the run (safety)
    – E2: all processes pi eventually set electedi ≠ ⊥ (liveness)
  • performance
    – network bandwidth utilization (proportional to the total number of messages sent)
    – turnaround time: the number of serialized message transmission times between initiation and termination of a single run
Election Algorithms
♦ Ring-based Algorithm
  • assumptions
    – all nodes communicate on a unidirectional ring structure
    – all processes have a unique integer ID
    – asynchronous, reliable system
  • initially, all processes are marked "non-participant"
  • to begin an election, a process places an election message with its identifier on the ring and marks itself "participant"
  • upon receipt of an election message, compare the received identifier with your own
    – if the received ID is greater than your own ID, forward the message to your neighbour
    – if the received ID is smaller than your own ID
      · if your status is "non-participant", then substitute your own ID in the election message and forward it on the ring
      · otherwise, do not forward the message (already "participant")
    – if the received ID is identical to your own ID
      · this process's ID must be the greatest and it becomes elected
      · mark your status as "non-participant"
      · send out an "elected" message
  • upon any forwarding, mark your state as "participant"
  • when receiving an "elected" message
    – mark your status as "non-participant"
    – set electedi appropriately and forward the elected message
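The rules above can be exercised with a small single-initiator simulation (names and the one-message-in-flight simplification are mine; a real ring is asynchronous with possibly several concurrent elections).

```python
# Minimal simulation of the ring election: one election message circulates,
# growing to the largest identifier, followed by one elected message.
def ring_election(ids, starter):
    n = len(ids)
    participant = [False] * n
    elected = [None] * n
    participant[starter] = True
    pos, msg = (starter + 1) % n, ("election", ids[starter])
    while True:
        kind, val = msg
        if kind == "election":
            if val > ids[pos]:
                participant[pos] = True              # forward the larger id
            elif val < ids[pos] and not participant[pos]:
                participant[pos] = True
                msg = ("election", ids[pos])         # substitute own id
            elif val == ids[pos]:
                msg = ("elected", ids[pos])          # own id came back: won
                participant[pos] = False
        else:  # "elected" message
            if elected[pos] == val:
                return elected                       # went full circle: done
            elected[pos] = val
            participant[pos] = False
        pos = (pos + 1) % n                          # pass to the neighbour

print(ring_election([3, 17, 4, 24, 9], starter=0))  # every process elects 24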
Election Algorithms
♦ Ring-based Algorithm
  • properties
    – E1 satisfied, since all identifiers are compared
    – E2 follows from the reliable communication property
  • performance
    – at worst 2N – 1 messages (when the elected process is the initiator's right-hand neighbour)
    – another N elected messages
  • failures
    – tolerates no failures
Election Algorithms
♦ The Bully Algorithm
  • works for synchronous networks
    – nodes can crash, and crashes are detected reliably
  • assumptions
    – each node knows the identifiers of all other nodes
    – every node can communicate with every other node
  • message types
    – election: announce an election
    – answer: reply to an election message
    – coordinator: announce the identity of the elected process
Election Algorithms
♦ The Bully Algorithm
  • initiation of the algorithm: reliable failure detection
    – a peer process has failed if there is no answer to a request within
      · T = 2·Ttrans + Tprocess
  • a process can decide whether to become coordinator by comparing its own ID with all other IDs (highest wins)
    – it announces this by sending a coordinator message to all other nodes with lower IDs
  • a process with a lower ID can bid to become coordinator by sending an election message to all processes with higher IDs
    – if there is no response within T, it considers itself the elected coordinator and sends a coordinator message to all processes with lower IDs
    – otherwise, it waits another T' time units for a coordinator message to arrive from the new coordinator
      · if none arrives, it begins another election
  • a process receiving a coordinator message sets its variable electedi to the ID of the coordinator received in the message
  • if a process receives an election message, it sends back an answer message and begins another election, unless one was already initiated
  • a new process replacing a crashed process
    – if it has the highest ID, it will immediately send a coordinator message and "bully" the current coordinator into resigning
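A heavily simplified sketch of one election round: timeouts are collapsed into membership in an `alive` set, and the cascade of election messages is modeled by recursion (names are mine). The invariant it illustrates is the one the slides describe: the highest live identifier ends up as coordinator.

```python
# Simplified, synchronous sketch of one bully-election round: the initiator
# sends "election" to all higher IDs; any live higher process answers and
# takes over the election; the highest live ID becomes coordinator.
def bully(ids, alive, initiator):
    """Return the elected coordinator's id; `alive` is the set of live ids."""
    higher = [p for p in sorted(ids) if p > initiator and p in alive]
    if not higher:
        return initiator              # timeout T expires with no answer
    # an answering process starts its own election in turn (the recursion
    # stands in for the cascade of election/answer messages)
    return bully(ids, alive, higher[0])

ids = [1, 2, 3, 4]
print(bully(ids, alive={1, 2}, initiator=1))     # 2: p3 and p4 have crashed
print(bully(ids, alive={1, 2, 3}, initiator=1))  # 3
```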
Election Algorithms
♦ The Bully Algorithm
  • example
[Figure: the election of coordinator p2 after the failure of p4 and then p3; Stage 1: p1 sends election messages and receives answers from p2 and p3; Stage 2: p2 and p3 send election messages; Stage 3: p3 fails, and p1's and p2's timeouts expire; Stage 4: eventually, p2 sends the coordinator message (© Addison-Wesley Publishers 2000)]
Election Algorithms
♦ The Bully Algorithm
  • properties
    – E1 satisfied (if no process is replaced and the timeout estimate T is accurate)
    – E2 satisfied (synchronous network, reliable transmission)
    – E1 not satisfied if a crashed process is replaced at the same time as another process announces that it is the new coordinator
  • performance
    – best case: the process with the second-highest identifier detects the coordinator's failure
      · elects itself coordinator and sends N – 2 coordinator messages
    – requires O(N²) messages in the worst case, when the process with the lowest ID detects the failure first
      · N – 1 processes with higher IDs start elections
Multicast
♦ Multicast
  • group communication
    – sending and delivery of messages to more than one recipient
    – membership of the recipient group is transparent to the sender
      · one send operation to one address, without having to send individual messages to all recipients
  • issues
    – addressing
    – coordination
      · guarantees that messages are received by a group of recipients
      · delivery ordering among group members
  • uses of multicast
    – Computer Supported Collaborative Work (CSCW)
      · shared whiteboards
      · video conferencing
    – communication with replicated servers (to achieve fault tolerance)
    – event notification in networks
Multicast
♦ IP-based Multicast
  • only implemented by some IP routers
  • available for the UDP transport service
  • addressing: multicast address and port number
  • IP multicast group
    – class D IP address, for which the first 4 bits are 1110 in IPv4
    – membership is dynamic
    – a computer belongs to a multicast group if one or more of its processes have sockets that belong to the multicast group
  • implementation of multicast in IP routers
    – on local area networks, use the LAN's multicast capabilities (e.g., Ethernet)
      · use a locally valid multicast address and set the Time To Live (TTL) counter in the IP header to 1, so that the packet never gets routed outside the LAN
    – in the Internet, a router forwards messages to all other routers that have members in the multicast group, which in turn forward the datagrams to group members
      · session directory (sd)
        * allows users to advertise multicast sessions as well as their valid multicast addresses
  • no guarantees whatsoever
    – message loss, reordering, duplication, etc.
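For concreteness, here is a sketch of joining an IPv4 multicast group over UDP with standard socket options. The group address and port are arbitrary examples (any class D address works); joining is wrapped defensively, since it can fail on hosts with no multicast route.

```python
import socket
import struct

GROUP, PORT = "224.1.1.1", 5007       # example class D address and port

def is_class_d(addr):
    # class D: first 4 bits of the address are 1110, i.e. 224.0.0.0/4
    return 224 <= int(addr.split(".")[0]) <= 239

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
# keep datagrams on the local network, as described above: TTL = 1
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
try:
    # IP_ADD_MEMBERSHIP takes the packed group address plus local interface
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
except OSError:
    pass  # joining can fail when no multicast route is available
print(is_class_d(GROUP))              # True
sock.close()
```

Any datagram sent to the group address and port is then delivered, best effort, to every joined socket; the "no guarantees" note above still applies.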
Multicast
♦ Properties of multicast
  • achieves not only transparency, but also enables stronger guarantees than "delivery by hand"
    – efficient use of network hardware
      · router sends individual messages
      · uses a tree-like distribution structure, if available
      · use of LAN-based multicast capabilities, if available
    – delivery guarantees
♦ System model
  • message m: contains the ID of the sender and of the destination group
    – multicast(g, m): multicast message m to group g
    – deliver(m): delivery of a message at a recipient
  • a multicast group is
    – closed, if multicast is only allowed from within the group
    – open, if processes that are not members of the group may send to it
Multicast
♦ Basic multicast
  • guaranteed delivery, unless the multicaster crashes
  • primitives and implementation
    – B-multicast(g, m): for each process p ∈ g, send(p, m)
    – B-deliver(m) at p: when receive(m) at p
  • problem in using concurrent send(p, m) operations
    – ack implosion:
      · all recipients acknowledge receipt at about the same time
      · buffer overflow leads to dropping of ack messages
      · retransmits cause even more ack messages
Multicast
♦ Reliable multicast
  • primitives
    – R-multicast(m, g)
    – R-deliver(m)
  • desired properties
    – integrity: a correct process p delivers a message at most once, and the delivered message is identical to the message sent in the multicast send operation (safety)
    – validity: if a correct process multicasts message m, then it will eventually deliver m (liveness)
    – agreement: if a correct process delivers a message m, then all other correct processes in the target group of message m will also deliver m
    – (additionally) uniform agreement: if a process, no matter whether it is correct or fails, delivers a message m, then all correct processes in the group will deliver m as well
  • notes:
    – validity is expressed in terms of self-delivery for reasons of simplicity
      · validity and agreement amount to an overall liveness requirement: if one process (the sender) delivers a message m, then m will eventually be delivered to all the group's correct members
    – agreement is similar to "atomicity": all-or-nothing semantics
Multicast
♦ Reliable multicast
  • implementation
    – R-multicast: B-multicast to all processes in the group
    – R-deliver: on the first B-deliver of m, B-multicast m to all other processes, then deliver m
  • properties
    – validity: a correct process will eventually B-deliver to itself
    – integrity: based on the underlying communication medium
    – agreement: B-multicast to all other processes after B-deliver
  • inefficient, since each message is sent |g| times to each process
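The re-broadcast construction can be sketched with an in-process "network" (class and variable names are mine): each process re-multicasts a message on first delivery, so even if the original sender crashed mid-multicast, any process that delivered the message would propagate it.

```python
# Sketch of reliable multicast built on basic multicast: re-broadcast on
# first delivery; duplicates are filtered via a received-set.
class Process:
    def __init__(self, pid, group, network):
        self.pid, self.group, self.network = pid, group, network
        self.received, self.delivered = set(), []

    def b_multicast(self, m):
        for q in self.group:
            self.network.append((q, m))       # send(q, m)

    def on_b_deliver(self, m):
        if m in self.received:
            return                            # duplicate: integrity
        self.received.add(m)
        self.b_multicast(m)                   # re-broadcast: agreement
        self.delivered.append(m)              # R-deliver(m)

network = []
group = ["p1", "p2", "p3"]
procs = {pid: Process(pid, group, network) for pid in group}
procs["p1"].b_multicast("m1")                 # R-multicast from p1
while network:                                # drain the simulated network
    q, m = network.pop(0)
    procs[q].on_b_deliver(m)
print([p.delivered for p in procs.values()])  # [['m1'], ['m1'], ['m1']]
```

The |g| re-broadcasts per message are exactly the inefficiency the slide points out.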
Multicast
♦ Reliable Multicast over IP Multicast
  • R-IP-multicast is based on the observation that multicast is successful in most cases
    – use negative acknowledgements to indicate non-delivery
  • basic idea
    – closed multicast groups
    – Sgp: sequence number for group g that process p belongs to
    – Rgq: sequence number of the latest message that a process has delivered from process q and that was sent to group g
    – p R-multicasts a message to group g
      · piggyback onto the message
        * Sgp
        * acknowledgements <q, Rgq> for all q
      · IP-multicast the message with the piggybacked information
      · increment Sgp by one
Multicast
♦ Reliable Multicast over IP Multicast
  • basic idea
    – R-deliver a message from p
      · only if the received sequence number S = Rgp + 1
      · then increment Rgp by 1
      · retain any message that cannot yet be delivered in a hold-back queue
[Figure: incoming messages pass through message processing into a hold-back queue, and move to the delivery queue when the delivery guarantees are met (© Addison-Wesley Publishers 2000)]
Multicast
♦ Reliable Multicast over IP Multicast
  • basic idea
    – R-deliver a message from p
      · if S ≤ Rgp, the message has already been delivered: discard it
      · if S > Rgp + 1, or R > Rgq for any enclosed acknowledgement <q, R>, then the receiver has missed one or more messages and requests a retransmit through a negative acknowledgement
  • properties
    – integrity
      · follows from the detection of duplicates and the properties of IP multicast (e.g., checksum to detect message corruption)
    – validity & agreement (validity holds because IP multicast has this property)
      · message loss can only be detected when a successor message is eventually transmitted
      · requires processes to multicast messages indefinitely
      · requires unbounded history for broadcast messages so that retransmission is always possible
Multicast
♦ Ordered Multicast
  • assume: every process belongs to at most one group
  • properties
    – FIFO ordering: if a correct process issues multicast(g, m) and then multicast(g, m'), then every correct process that delivers m' will deliver m before m'
    – causal ordering: if multicast(g, m) → multicast(g, m'), where → is induced by message passing only, then every correct process that delivers m' will deliver m before m'
    – total ordering: if a correct process delivers m before it delivers m', then any other correct process that delivers m' will deliver m before m'
  • notes
    – causal ordering implies FIFO ordering
    – FIFO ordering and causal ordering are partial orders
    – total ordering allows arbitrary ordering of delivery events relative to multicast events, as long as this order is identical in all correct processes
Multicast
♦ Ordered Multicast
  • implementing FIFO ordering
    – Sgp: sequence number for group g that process p belongs to
    – Rgq: sequence number of the latest message that a process has delivered from process q and that was sent to group g
    – assumption: non-overlapping groups
    – FO-multicast(m, g)
      · B-multicast(m, g, <Sgp>)
      · increment Sgp by 1
    – upon receipt of a message from q with sequence number S
      · if S = Rgq + 1, then this is the next message
        * therefore FO-deliver(m)
        * Rgq := S
      · if S > Rgq + 1, then
        * place the message in the hold-back queue until the intervening messages have been delivered and S = Rgq + 1
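The receiver side of FIFO ordering can be sketched directly: a per-sender counter Rgq plus a hold-back queue (class and attribute names are mine). Delivering one message may unblock held-back successors, so the sketch drains the queue after every in-order delivery.

```python
from collections import defaultdict

# Sketch of FIFO-ordered delivery at one receiver, as described above.
class FifoReceiver:
    def __init__(self):
        self.R = defaultdict(int)             # Rgq: last delivered seq per sender
        self.holdback = defaultdict(dict)     # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, msg):
        if seq == self.R[sender] + 1:
            self.delivered.append(msg)        # FO-deliver(m)
            self.R[sender] = seq
            # a gap may now be closed: deliver any queued successors
            while self.R[sender] + 1 in self.holdback[sender]:
                nxt = self.holdback[sender].pop(self.R[sender] + 1)
                self.delivered.append(nxt)
                self.R[sender] += 1
        elif seq > self.R[sender] + 1:
            self.holdback[sender][seq] = msg  # hold back until the gap fills
        # seq <= R: duplicate, discard

r = FifoReceiver()
r.receive("q", 2, "m2")    # arrives out of order: held back
r.receive("q", 1, "m1")    # fills the gap: both delivered in FIFO order
print(r.delivered)         # ['m1', 'm2']
```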
Multicast
♦ Ordered Multicast
  • implementing total ordering
    – idea: assign totally ordered identifiers to multicast messages, so that every process makes the same delivery decision based on these identifiers
    – delivery is similar to FIFO delivery, except that group-specific rather than process-specific sequence numbers are used
    – assumption: non-overlapping groups
    – two main methods for the assignment of identifiers
      · sequencer
      · collective agreement on the assignment of message identifiers
Multicast
♦ Ordered Multicast
  • implementing total ordering
    – sequencer
      · a process wishing to TO-multicast attaches a unique identifier id(m) to the message
      · the message is sent to the sequencer as well as to all members of g
      · the sequencer maintains a group-specific sequence number sg, which it uses to assign increasing and consecutive sequence numbers to the messages it B-delivers
      · it announces the order in which members of g have to deliver these messages using a B-multicast order message
[Figure: sequencer-based algorithm in pseudocode, for a group member and for the sequencer of g (© Addison-Wesley Publishers 2000)]
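The sequencer scheme can be sketched as two roles (class names are mine): the sequencer assigns consecutive group sequence numbers, and each member delivers strictly in that order, holding back data messages whose order message has not yet arrived, and vice versa.

```python
# Sketch of sequencer-based total ordering.
class Sequencer:
    def __init__(self):
        self.s = 0                         # group-specific sequence number sg
    def order(self, msg_id):
        self.s += 1
        return (msg_id, self.s)            # the B-multicast "order" message

class Member:
    def __init__(self):
        self.next_seq = 1
        self.pending = {}                  # msg_id -> message body
        self.orders = {}                   # seq -> msg_id
        self.delivered = []
    def on_message(self, msg_id, body):
        self.pending[msg_id] = body
        self._try_deliver()
    def on_order(self, msg_id, seq):
        self.orders[seq] = msg_id
        self._try_deliver()
    def _try_deliver(self):
        # deliver only in consecutive sequence-number order
        while (self.next_seq in self.orders
               and self.orders[self.next_seq] in self.pending):
            mid = self.orders[self.next_seq]
            self.delivered.append(self.pending.pop(mid))
            self.next_seq += 1

seq, m = Sequencer(), Member()
# order messages may overtake data messages or vice versa; the delivery
# order is still the sequencer's order at every member:
m.on_message("a", "hello")
m.on_order(*seq.order("b"))        # b is ordered first
m.on_message("b", "world")
m.on_order(*seq.order("a"))
print(m.delivered)                 # ['world', 'hello']
```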
Multicast
♦ Ordered Multicast
  • implementing total ordering
    – the sequencer is a bottleneck (performance and/or reliability)
    – collective agreement on the assignment of message identifiers
      · implemented in the ISIS toolkit
      · groups may be open or closed
      · receiving processes bounce proposed sequence numbers back to the sender
      · the sender returns the agreed sequence numbers
      · each process q in group g maintains
        * Agq: the largest agreed sequence number it has observed so far for group g
        * Pgq: its own largest proposed sequence number
Multicast
♦ Ordered Multicast
  • implementing total ordering
    – algorithm for collective agreement on the assignment of message identifiers
      · p B-multicasts <m, i> to g, where i is a unique identifier for m
      · each recipient q replies to p with a proposal for the agreed sequence number
        * Pgq := max(Agq, Pgq) + 1
        * each process q provisionally assigns its own proposed sequence number to the message and queues the message in its hold-back queue, ordered according to proposed sequence numbers
      · p chooses the largest proposed number as the sequence number a
      · p B-multicasts <i, a> to g
      · each process q in the group
        * sets Agq := max(Agq, a)
        * reorders the message in its hold-back queue if the received sequence number differs from the proposed number
        * only when the message at the head of the hold-back queue has been assigned an agreed sequence number will it be moved to the delivery queue
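The propose/agree exchange can be condensed into a few lines (hold-back reordering omitted; names are mine). Each member proposes max(Agq, Pgq) + 1 and the sender adopts the largest proposal, so agreed numbers come out increasing and identical at every member.

```python
# Compact sketch of the ISIS-style agreement on message sequence numbers.
class IsisMember:
    def __init__(self):
        self.A = 0      # Agq: largest agreed sequence number observed
        self.P = 0      # Pgq: own largest proposed sequence number
    def propose(self, msg_id):
        self.P = max(self.A, self.P) + 1
        return self.P
    def agree(self, msg_id, a):
        self.A = max(self.A, a)

members = [IsisMember() for _ in range(3)]

def to_multicast(msg_id):
    proposals = [q.propose(msg_id) for q in members]
    a = max(proposals)              # the sender picks the agreed number
    for q in members:
        q.agree(msg_id, a)
    return a

print(to_multicast("m1"))  # 1
print(to_multicast("m2"))  # 2: agreed numbers are increasing
```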
Multicast
♦ Ordered Multicast
  • implementing causal ordering (after Birman et al.)
    – the algorithm shown here ensures compliance with the happened-before relation only when it is established by multicast messages, not by individual one-to-one communication
    – each process maintains a vector clock counting the multicast events that have happened before a local multicast event
    – CO-multicast(m, g)
      · add one to its own entry of the timestamp
      · B-multicast the message with the timestamp
    – when pi B-delivers a message from pk
      · place it in the hold-back queue until it is assured that all causally preceding messages have been delivered:
      · consider the vector timestamp of the received message
        * wait until pi has delivered any earlier message sent by pk, and
        * until pi has delivered any message that pk had delivered at the time it multicast the current message
      · then update its own vector timestamp in the k-th position
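The two wait conditions above translate into a single deliverability test on vector timestamps: a message from pk with timestamp V is deliverable at pi (whose vector is Vi) when V[k] = Vi[k] + 1 and V[j] ≤ Vi[j] for all j ≠ k. A minimal sketch (names are mine):

```python
# Sketch of the causal-delivery test and hold-back queue for CO-multicast.
def deliverable(V, Vi, k):
    """Message from pk with vector timestamp V is deliverable at a process
    whose vector clock is Vi."""
    return V[k] == Vi[k] + 1 and all(
        V[j] <= Vi[j] for j in range(len(V)) if j != k)

def co_deliver(Vi, queue, delivered):
    """Deliver everything from the hold-back queue that is now enabled."""
    progress = True
    while progress:
        progress = False
        for (k, V, m) in list(queue):
            if deliverable(V, Vi, k):
                delivered.append(m)
                Vi[k] = V[k]            # update the k-th position
                queue.remove((k, V, m))
                progress = True

Vi, queue, delivered = [0, 0, 0], [], []
# p1 multicasts m1 with timestamp (0,1,0); p2 delivers m1 and then
# multicasts m2 with (0,1,1), so m2 causally follows m1.
queue.append((2, (0, 1, 1), "m2"))      # m2 arrives first: held back
co_deliver(Vi, queue, delivered)
queue.append((1, (0, 1, 0), "m1"))
co_deliver(Vi, queue, delivered)        # m1 enables m2
print(delivered)                        # ['m1', 'm2']
```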
Multicast
♦ Ordered Multicast
  • implementing causal ordering (after Birman et al.)
[Figure: the causal-ordering algorithm in pseudocode (© Addison-Wesley Publishers 2000)]
Multicast
♦ Ordered Multicast
  • note: combinations are possible
    – CO-multicast + TO-multicast (sequencer) yields totally and causally ordered message delivery
      · idea: if all processes deliver in the same order, i.e., in the sequencer's order, and this order is causal, we get total and causal order
  • extensions to overlapping groups
    – "naive" extension: implement the orderings across all processes at hand; those that are not in a particular group discard messages not addressed to them
    – an inefficient solution; suggestions for more efficient solutions exist
Group Communication
♦ Multicast communication to groups with dynamic membership
[Figure: processes join, leave, and fail; group membership management maintains the process group, performs group address expansion, and drives multicast communication via group send (© Addison-Wesley Publishers 2000)]
Group Communication
♦ Group membership service
  • interface for group membership changes
  • failure detector
  • notification of membership changes to group members
  • group address expansion
♦ Group views
  • lists of current group members
  • a process may become "suspected"
    – exclusion from the group view
    – if the process has not failed, or has recovered, it needs to re-join the group
    – false suspicions reduce the effectiveness of the group
Group Communication
♦ View delivery
  • necessary to relieve the programmer of querying the state of all other group members before making a send decision
  • the group management service delivers a sequence of views to the members, e.g.
    – v0(g) = {p}, v1(g) = {p, p'}, v2(g) = {p}, ...
  • the system imposes an ordering on the possibly concurrent view changes
  • receiving/delivering a view
    – queue it in a hold-back queue, as for multicast, until all members agree to deliver the view
Consensus
♦ Consensus problems
  • all correct computers controlling a spaceship should decide to proceed with landing, or all of them should decide to abort (after each has proposed one action or the other)
  • in an electronic money transfer transaction, all involved processes must consistently agree on whether to perform the transaction (debit and credit) or not
  • in mutual exclusion, processes need to agree on which process enters the critical section
  • in an election, processes need to agree on the elected process
  • in totally ordered multicast, processes need to agree on a consistent message delivery order
Consensus
♦ Recall process failure models
  • crash failures: processes stop (fail), but remain silent
  • Byzantine failures: processes fail, but may still respond to the environment with arbitrary, erratic behavior (e.g., send false acknowledgements, etc.)
Consensus
♦ Factors threatening consensus
  • failures
    – communication link or process failures
    – crash failures (fail-silent) or Byzantine failures (arbitrary)
      · (after the Byzantine Empire, 330–1453, in which unfaithfulness and untruthfulness have allegedly been very common)
  • network characteristics
    – synchronous or asynchronous
  • failure detectors
    – reliable or unreliable
  • are messages authenticated (digitally signed) or not?
    – can a process lie about the content of a message that it received from a correct process?
    – can an adversary claim to send a message under a false sender's ID?
♦ Model
  • processes communicating by message passing
  • desirable: reaching consensus even in the presence of faults
    – assumption: communication is reliable, but processes may fail
Consensus
♦ The Consensus Problem (C)
8 agreement on the value of a decision variable among all correct processes
– pi is in state undecided and proposes a single value vi
– next, the processes communicate with each other to exchange values
– in doing so, pi sets its decision variable di and enters the decided state, after which the value of di remains unchanged
[Figure: three processes run a consensus algorithm; p1 proposes v1 = proceed, p2 proposes v2 = proceed, and p3 proposes v3 = abort but crashes; the surviving processes decide d1 := proceed and d2 := proceed.]
Consensus
♦ The Consensus Problem (C)
8 properties of a consensus algorithm
– termination: eventually, each correct process sets its decision variable
– agreement: for all correct pi and pk such that state(pi) = state(pk) = decided, di = dk
– integrity: if the correct processes all proposed the same value, then any correct process has chosen that value in the decided state
i variation: ... then some correct process has chosen that value in the decided state
Consensus
♦ The Consensus Problem (C)
8 algorithm to solve consensus in a failure-free environment
– each process reliably multicasts its proposed value
– after receiving the responses, each process evaluates the consensus function majority(v1, .., vN) [remark: other problem-specific functions are possible], which returns the most often proposed value, or undefined if no majority exists
– properties
i termination guaranteed by the reliability of multicast
i agreement, integrity: follow from the definition of majority and the integrity of reliable multicast (all processes evaluate the same function on the same data)
8 when crashes occur
– how to detect failure?
– will the algorithm terminate?
8 when byzantine failures occur
– processes may communicate random values
– evaluation of the consensus function may be inconsistent
– malevolent processes may deliberately propose false or inconsistent values
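A minimal sketch of the failure-free algorithm (the function names are my own). Reliable multicast guarantees that every process sees the same multiset of proposals, so evaluating the same majority function at each process yields agreement:

```python
from collections import Counter

def majority(values):
    """Most often proposed value, or None (for 'undefined') if no strict majority."""
    (value, count), = Counter(values).most_common(1)
    return value if count > len(values) // 2 else None

def failure_free_consensus(proposals):
    # reliable multicast: every process receives the full set of proposals,
    # so every process evaluates the same function on the same data
    seen_by_everyone = list(proposals.values())
    return {pid: majority(seen_by_everyone) for pid in proposals}
```

With the proposals from the figure above (proceed, proceed, abort), all processes decide proceed.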
Consensus
♦ The Byzantine Generals Problem (BG)
8 three or more generals are to agree on an attack or a retreat
8 the commander issues the order
– the others (lieutenants to the commander) have to decide to attack or retreat
8 one of the generals may be treacherous
– if the commander is treacherous, it proposes attacking to one general and retreating to the other
– if lieutenants are treacherous, they tell some of their peers that the commander ordered an attack, and others that the commander ordered a retreat
8 difference to the consensus problem: one process supplies a value that the others have to agree on
8 properties
– termination: eventually each correct process sets its decision variable
– agreement: the decision value of all correct processes is the same
– integrity: if the commander is correct, then all correct processes decide on the value that the commander proposed
i note: integrity implies agreement only if the commander is correct, but the commander need not be correct (see above)
Consensus
♦ Interactive Consistency (IC)
8 each process proposes one value
8 goal: all correct processes agree on a vector of values, each component corresponding to one process's agreed value
– example: agreement about each process's local state
8 requirements
– termination: eventually each correct process sets its decision variable
– agreement: the decision vector of all correct processes is the same
– integrity: if pi is correct, then all correct processes decide on vi as the i-th component of their vector
Consensus
♦ Relationship of Consensus to Other Problems
8 assume that the previous problems can be solved, yielding the following decision variables
– Ci(v1, .., vN) returns the decision value of pi, where v1, .., vN are the values that the processes propose
– BGi(k, v) returns the decision value of pi, where pk is the commander proposing value v
– ICi(v1, .., vN)[k] returns the k-th value in the decision vector of pi, where v1, .., vN are the values that the processes propose
8 possibilities to derive solutions from one another
– IC from BG
i run BG N times, once with each pk acting as commander:
ICi(v1, .., vN)[k] = BGi(k, vk)
– C from IC
i run IC to produce a vector of values at each process
i apply an appropriate function to the vector's values to derive a single value:
Ci(v1, .., vN) = majority(ICi(v1, .., vN)[1], .., ICi(v1, .., vN)[N])
– BG from C
i the commander pk sends its proposed value v to itself and each of the remaining processes
i all processes run C with the values v1, .., vN that they receive
i BGi(k, v) = Ci(v1, .., vN)
– termination, agreement and integrity are preserved in each case
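The C-from-IC and BG-from-C reductions can be sketched directly. The IC function below is only a stand-in that returns the agreed vector every correct process would hold after a real interactive-consistency run, and all names are illustrative assumptions:

```python
from collections import Counter

def majority(values):
    (v, c), = Counter(values).most_common(1)
    return v if c > len(values) // 2 else None

def IC(proposals):
    # stand-in for a solved interactive-consistency run: every correct
    # process ends up with this same decision vector
    return list(proposals)

def C_from_IC(proposals):
    # apply one fixed function to the agreed vector to derive a single value
    return majority(IC(proposals))

def BG_from_C(commander_value, n):
    # the commander sends v to itself and the n-1 remaining processes;
    # all processes then run C on the values they received
    # (here: no faults, so everyone received v)
    received = [commander_value] * n
    return C_from_IC(received)
```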
Consensus
♦ Relationship of Consensus to Other Problems
8 solving consensus is equivalent to solving reliable, totally ordered multicast
– implementing consensus with RTO-multicast
i collect all processes in one group g
i each pi performs RTO-multicast(g, vi)
i each pi chooses di = mi, where mi is the first value that the RTO-multicast delivers
i properties
* termination follows from the reliability of multicast
* agreement and integrity follow from reliability and total ordering
– implementing RTO-multicast from consensus can be shown as well
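The consensus-from-RTO-multicast direction fits in a few lines. The sorted() call below is only a stand-in for the total order that a real RTO-multicast would impose identically at all correct processes:

```python
def rto_deliver(messages):
    # stand-in for RTO-multicast delivery: some total order on the messages,
    # guaranteed to be the same at every correct process
    return sorted(messages)

def consensus_via_rto(proposals):
    # each pi multicasts vi and decides on the first value delivered;
    # total ordering makes that first value identical everywhere
    first = rto_deliver(proposals.values())[0]
    return {pid: first for pid in proposals}
```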
Consensus
♦ Consensus in Synchronous Networks
8 assumption: no more than f of the N processes crash
8 the algorithm proceeds in f+1 rounds
– processes B-multicast values among themselves
– at the end of the f+1 rounds, all surviving processes are in a position to agree
Consensus
♦ Consensus in Synchronous Networks
8 Dolev-Strong algorithm
– Values_i^r: set of proposed values known to process i before round r
– every process multicasts the set of values it has not sent in previous rounds
– then takes delivery of values from other processes
– a round is potentially terminated by timeout
– at the end of f+1 rounds, each process chooses the minimum value
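A toy simulation of the f+1-round algorithm for crash failures. All names and the crash_plan format are my own; for simplicity each process re-multicasts its whole value set every round rather than only the values not sent before, which does not change the outcome:

```python
def synchronous_consensus(initial, f, crash_plan=None):
    """initial: {pid: proposed value}; crash_plan: {round: {pid: set of peers
    that its last multicast still reaches before it crashes}}."""
    crash_plan = crash_plan or {}
    values = {p: {v} for p, v in initial.items()}
    crashed = set()
    for r in range(1, f + 2):                    # rounds 1 .. f+1
        incoming = {p: set() for p in initial}
        for p in initial:
            if p in crashed:
                continue                         # a crashed process stays silent
            receivers = set(initial)
            if p in crash_plan.get(r, {}):       # p crashes mid-multicast:
                receivers = crash_plan[r][p]     # only these peers get the message
                crashed.add(p)
            for q in receivers:
                incoming[q] |= values[p]
        for p in initial:
            values[p] |= incoming[p]
    # every surviving process applies the same (minimum) function to its set
    return {p: min(values[p]) for p in initial if p not in crashed}
```

With f = 1 there are two rounds, so even if a crashing process reaches only one peer in round 1, that peer forwards the value in round 2 and the survivors agree.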
Consensus
♦ Consensus in Synchronous Networks
8 Dolev-Strong algorithm
– termination: guaranteed through the synchronicity of the system
– correctness: will every process arrive at the same set of values at the end of the final round?
i if so, integrity and agreement follow, since the processes consistently apply the minimum function to this set
– proof sketch
i assume two processes differ in their final set of values
i hence, some correct process i possesses a value v that another correct process k (i ≠ k) does not possess
i the only way to explain this is that some other process m, which sent v to i, crashed before v could be delivered to k
i in turn, any process sending v in the previous round must have crashed
i thus we have to assume at least one crash per round
i but there are f+1 rounds and at most f crashes, hence a contradiction
8 it can be shown that in synchronous systems, any algorithm to reach consensus that tolerates up to f crash or byzantine failures requires at least f+1 rounds
Consensus
♦ Byzantine Generals Problem in Synchronous Network
8 allow arbitrary (byzantine) failures
8 up to f faulty processes
8 correct processes can detect the absence of a message through a timeout, but cannot conclude that the sender has crashed, since it may be silent for some time and then start sending messages again
8 assume private communication channels
– a third process cannot detect whether one process sends messages with different content to two of its peers
– no faulty process can inject messages into channels connecting correct processes
8 assume that messages are not digitally signed (authenticated and verifiable)
8 general result (Lamport, Shostak and Pease)
– no solution if N ≤ 3f
– there is an algorithm for N ≥ 3f+1
Consensus
♦ Byzantine Generals Problem in Synchronous Network
8 impossibility for N = 3 processes
– read "3:1:u" as "3 says 1 says u"
– both scenarios show two rounds of messages
– left: all p2 knows is that it has received two different values
– right: same situation for p2, even though now the commander is faulty
– assume a solution existed
i p2 would have to decide on value v in the left-hand scenario, by the integrity condition of BG
i since no algorithm can let p2 distinguish the two scenarios locally, p2 would also have to decide on w (the value sent to it by the commander) in the right-hand scenario
i the same reasoning applies to p3: it will have to decide on the value it received from the commander, which violates agreement in the right-hand scenario, hence a contradiction
[Figure: two scenarios with commander p1 and lieutenants p2, p3; faulty processes are shown shaded. Left (p3 faulty): p1 sends 1:v to both lieutenants; p2 relays 2:1:v, but p3 relays 3:1:u. Right (p1 faulty): p1 sends 1:w to p2 and 1:x to p3; p2 relays 2:1:w and p3 relays 3:1:x.]
Consensus
♦ Byzantine Generals Problem in Synchronous Network
8 sketch of the impossibility for N ≤ 3f (Pease, Shostak and Lamport)
– assume a solution existed for some N ≤ 3f
– let each of three processes p1, p2 and p3 simulate n1, n2 and n3 generals, where n1 + n2 + n3 = N and n1, n2, n3 ≤ N/3
– assume that one of the three processes is faulty
– the correct processes simulate correct generals
i internal interaction of their "own" generals
i sending messages from their "own" generals to the generals simulated by the other processes
– the generals of the faulty process are faulty and may emit spurious messages
– since n1 + n2 + n3 = N and n1, n2, n3 ≤ N/3, at most f generals are faulty
– since the algorithm that is run by the generals is correct, the simulation will terminate
– but now there is a way for the two correct processes out of three to reach consensus: each process decides on the value chosen by all of its simulated generals
– this contradicts the impossibility for N = 3
Consensus
♦ Byzantine Generals Problem in Synchronous Network
8 solution for N ≥ 3f+1
– the solution by Pease, Shostak and Lamport is too complex to present here
– therefore: presentation of the solution for N = 4, f = 1
– the correct generals reach agreement in two rounds:
i first, the commander sends its value to each lieutenant
i second, each lieutenant sends the value it received to all of its peers
– a lieutenant receives
i a value from the commander
i N-2 values from its peers
– if the commander is faulty, then all lieutenants are correct, and each will have gathered exactly the set of values that the commander sent out
– if one lieutenant is faulty, each of its peers receives N-2 copies of the value the commander sent out, plus the faulty lieutenant's value
– to reach agreement, a simple majority function suffices
i since N ≥ 4, N-2 ≥ 2, the majority function will ignore the value of the faulty lieutenant and produce the commander's value if the commander is correct (it will produce ⊥ if the commander is faulty)
– note: BG requires agreement only if the commander is correct
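The lieutenant's decision rule for N = 4, f = 1 is just a majority over one commander value and N-2 relayed values. The names below are illustrative, and None stands in for ⊥:

```python
from collections import Counter

def majority(values):
    (v, c), = Counter(values).most_common(1)
    return v if c > len(values) // 2 else None   # None plays the role of ⊥

def lieutenant_decide(from_commander, relayed):
    # one value straight from the commander plus N-2 values relayed by peers
    return majority([from_commander] + list(relayed))
```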
Consensus
♦ Byzantine Generals Problem in Synchronous Network
[Figure: the N = 4, f = 1 algorithm in three scenarios; faulty processes are shown shaded, and each correct lieutenant applies majority to the commander's value plus the two relayed values. With a correct commander proposing v and one faulty lieutenant injecting u or w, the correct lieutenants compute, e.g., majority({v,u,v}) = v, majority({v,v,w}) = v and majority({v,w,v}) = v. With a faulty commander proposing u, v and w to the three lieutenants, each lieutenant gathers {u,v,w} and computes majority({u,v,w}) = ⊥.]
Consensus
♦ Impossibility of Agreement in Asynchronous Systems
8 the previous algorithms rely on synchrony assumptions
– message exchanges in rounds
– timeouts
8 in asynchronous systems, no algorithm can guarantee reaching consensus, even with just one process crash failure (Fischer, Lynch and Paterson, 1985)
– proof idea
i show that there is always some continuation of the processes' execution that avoids consensus being reached
8 consequences
– in asynchronous systems, there is no guaranteed solution to BG, IC or RTO-multicast
8 of course, in practice consensus can often be reached, but a residual probability that consensus cannot be reached remains
8 possible approaches to reaching consensus by weakening the system assumptions
– partial synchrony
– masking faults
– modified failure detectors
– randomized algorithms
Consensus
♦ Impossibility of Agreement in Asynchronous Systems
8 partial synchrony
– message delays are bounded, but the bound is unknown
– or: the bound is known, but transmission delays are longer for some finite initial period of time
8 masking faults
– design the system so that failures appear like an intermittent slowdown in the processing of messages
i store the system state on persistent storage before a crash
i restart the system in that state after recovery
8 modified failure detectors
– in the ISIS system (Birman, 1993)
i deem a process that has not responded as failed
i treat this process as if it had crashed, i.e., discard any subsequent messages from this process
i problems:
* long timeouts are necessary
* false suspicions are possible, which reduce the effectiveness of the system
– eventually weak failure detector (Chandra and Toueg, 1996)
i consensus can be solved, even with an eventually weak failure detector, if fewer than N/2 processes crash and communication is reliable
i an eventually weak failure detector is
* eventually weakly complete: each faulty process is eventually suspected permanently
* eventually weakly accurate: after some time, at least one correct process is never suspected by any correct process
i an eventually weak failure detector cannot be implemented in an asynchronous system based on message passing; however, failure detectors that adapt their timeout values can come close to one
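One way such an adapting detector can be sketched (the class and method names are my own invention, not from ISIS or Chandra and Toueg): whenever a heartbeat arrives from a process that was already suspected, the timeout was evidently too short, so it is increased.

```python
class AdaptiveFailureDetector:
    """Unreliable failure detector with an adaptive timeout (a sketch only)."""

    def __init__(self, initial_timeout=1.0, factor=2.0):
        self.timeout = initial_timeout
        self.factor = factor
        self.last_heartbeat = {}

    def heartbeat(self, pid, now):
        # a heartbeat from a process we already suspected means the timeout
        # was too short: back off, so false suspicions become rarer over time
        last = self.last_heartbeat.get(pid)
        if last is not None and now - last > self.timeout:
            self.timeout *= self.factor
        self.last_heartbeat[pid] = now

    def suspected(self, pid, now):
        # suspect any process that has been silent longer than the timeout
        last = self.last_heartbeat.get(pid)
        return last is not None and now - last > self.timeout
```

Completeness still holds (a crashed process eventually exceeds any timeout), while the growing timeout approximates eventual weak accuracy for slow but correct processes.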