Distributed Systems: Shared Data
Transcript of Distributed Systems: Shared Data
Overview of chapters
• Introduction
• Co-ordination models and languages
• General services
• Distributed algorithms
• Shared data
– Ch 13 Transactions and concurrency control, 13.1-13.4
– Ch 14 Distributed transactions
– Ch 15 Replication
Overview
• Transactions and locks
• Distributed transactions
• Replication
Overview
• Transactions
• Nested transactions
• Locks
• Distributed transactions
• Replication
Known material!
Transactions: Introduction
• Environment
– data partitioned over different servers on different systems
– a sequence of operations as an individual unit
– long-lived data at servers (cf. databases)
• transactions = approach to achieve consistency of data in a distributed environment
Transactions: Introduction
• Example
Person 1:
Withdraw(A, 100);
Deposit(B, 100);
Person 2:
Withdraw(C, 200);
Deposit(B, 200);
(accounts A, B and C held at the servers)
Transactions: Introduction
• Critical section
– a group of instructions → an indivisible block with respect to other critical sections
– short duration
• atomic operation (within a server)
– the operation is free of interference from operations being performed on behalf of other (concurrent) clients
– concurrency in the server → multiple threads
– atomic operation ≠ critical section
• transaction
Transactions: Introduction
• Critical section
• atomic operation
• transaction
– a group of different operations + properties
– a single transaction may contain operations on different servers
– possibly long duration
ACID properties
Transactions: ACID
• Properties concerning a sequence of operations that read or modify shared data:
Atomicity
Consistency
Isolation
Durability
Transactions: ACID
• Atomicity, or the “all-or-nothing” property
– a transaction
• commits = completes successfully or
• aborts = has no effect at all
– the effect of a committed transaction
• is guaranteed to persist
• can be made visible to other transactions
– transaction aborts can be initiated by
• the system (e.g. when a node fails) or
• a user issuing an abort command
Transactions: ACID
• Consistency
– a transaction moves data from one consistent state to another
• Isolation
– no interference from other transactions
– intermediate effects invisible to other transactions
The isolation property has 2 parts:
– serializability: running concurrent transactions has the same effect as some serial ordering of the transactions
– failure isolation: a transaction cannot see the uncommitted effects of another transaction
Transactions: ACID
• Durability
– once a transaction commits, the effects of the transaction are preserved despite subsequent failures
Transactions: Life histories
• Transactional service operations
– OpenTransaction() → Trans
• starts a new transaction
• returns a unique identifier for the transaction
– CloseTransaction(Trans) → (Commit, Abort)
• ends the transaction
• returns Commit if the transaction committed, else Abort
– AbortTransaction(Trans)
• aborts the transaction
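The interface can be made concrete with a small sketch in Python; the class, method and account names below are illustrative assumptions (not from the slides), and tentative values are simply kept in a per-transaction workspace:

    import itertools

    class TransactionalServer:
        """Illustrative server offering the three transactional operations."""
        def __init__(self):
            self.committed = {"A": 200, "B": 50}   # long-lived data at the server
            self.workspace = {}                    # tentative values per open transaction
            self._ids = itertools.count(1)

        def open_transaction(self):                # OpenTransaction() -> Trans
            trans = next(self._ids)                # unique identifier for the transaction
            self.workspace[trans] = dict(self.committed)
            return trans

        def read(self, trans, item):
            return self.workspace[trans][item]

        def write(self, trans, item, value):
            self.workspace[trans][item] = value

        def close_transaction(self, trans):        # CloseTransaction(Trans) -> Commit/Abort
            self.committed.update(self.workspace.pop(trans))
            return "Commit"

        def abort_transaction(self, trans):        # AbortTransaction(Trans)
            self.workspace.pop(trans, None)        # tentative values are discarded

    server = TransactionalServer()
    t = server.open_transaction()
    server.write(t, "A", server.read(t, "A") - 100)  # Withdraw(A, 100)
    server.write(t, "B", server.read(t, "B") + 100)  # Deposit(B, 100)
    print(server.close_transaction(t), server.committed)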
Transactions: Life histories
• History 1: success
T := OpenTransaction();
operation;
operation;
….
operation;
CloseTransaction(T);
Operations have read or write semantics.
Transactions: Life histories
• History 2: abort by client
T := OpenTransaction();
operation;
operation;
….
operation;
AbortTransaction(T);
Transactions: Life histories
• History 3: abort by server
T := OpenTransaction();
operation;
operation;
….
operation;
Server aborts!
Error reported
Transactions: Concurrency
• Illustration of well-known problems:
– the lost update problem
– inconsistent retrievals
• operations used + implementations
– Withdraw(A, n):
b := A.read();
A.write(b - n);
– Deposit(A, n):
b := A.read();
A.write(b + n);
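A minimal sketch in Python (names assumed for illustration) of these non-atomic implementations, hard-coding the unfortunate interleaving on B that the next slides walk through:

    accounts = {"A": 100, "B": 200, "C": 300}

    # Withdraw(B, n) is b := B.read(); B.write(b - n) and Deposit(B, n) is
    # b := B.read(); B.write(b + n); neither is atomic, so two transactions
    # depositing to B can interleave their reads and writes:
    bt = accounts["B"]        # T:  bt := B.read()      (bt = 200)
    bu = accounts["B"]        # U:  bu := B.read()      (bu = 200)
    accounts["B"] = bu + 3    # U:  B.write(bu + 3)  -> B = 203
    accounts["B"] = bt + 4    # T:  B.write(bt + 4)  -> B = 204: U's update is lost
    print(accounts["B"])      # 204, while any serial execution gives 207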
Transactions: Concurrency
• The lost update problem:
Transaction T
Withdraw(A,4);
Deposit(B,4);
Transaction U
Withdraw(C,3);
Deposit(B,3);
Interleaved execution of operations on B ?
Transactions: Concurrency
• The lost update problem (initially A: 100, B: 200, C: 300):

Transaction T (Withdraw(A,4); Deposit(B,4))    Transaction U (Withdraw(C,3); Deposit(B,3))
bt := A.read();
A.write(bt-4);               → A: 96
                                               bu := C.read();
                                               C.write(bu-3);    → C: 297
bt := B.read();   (bt = 200)
                                               bu := B.read();   (bu = 200)
                                               B.write(bu+3);    → B: 203
B.write(bt+4);               → B: 204

Correct B = 207!! U’s deposit to B is lost.
Transactions: Concurrency
• The inconsistent retrieval problem:
Transaction T
Withdraw(A,50);
Deposit(B,50);
Transaction U
BranchTotal();
Transactions: Concurrency
• The inconsistent retrieval problem (initially A: 100, B: 200, C: 300):

Transaction T (Withdraw(A,50); Deposit(B,50))   Transaction U (BranchTotal)
bt := A.read();
A.write(bt-50);              → A: 50
                                                bu := A.read();
                                                bu := bu + B.read();
                                                bu := bu + C.read();   → 550
bt := B.read();
B.write(bt+50);              → B: 250

Correct total: 600!! U sees the withdrawal from A but not the deposit to B.
Transactions: Concurrency
• Illustration of well-known problems:
– the lost update problem
– inconsistent retrievals
• elements of a solution
– execute all transactions serially?
• no concurrency → unacceptable
– execute transactions in such a way that the overall execution is equivalent to some serial execution
• sufficient? Yes
• how? Concurrency control
Transactions: Concurrency
• The lost update problem: a serially equivalent interleaving (initially A: 100, B: 200, C: 300):

Transaction T (Withdraw(A,4); Deposit(B,4))    Transaction U (Withdraw(C,3); Deposit(B,3))
bt := A.read();
A.write(bt-4);               → A: 96
                                               bu := C.read();
                                               C.write(bu-3);    → C: 297
bt := B.read();
B.write(bt+4);               → B: 204
                                               bu := B.read();
                                               B.write(bu+3);    → B: 207

Final B = 207: the same effect as executing T entirely before U.
Transactions: Recovery
• Illustration of well-known problems:
– a dirty read
– premature write
• operations used + implementations
– Withdraw(A, n):
b := A.read();
A.write(b - n);
– Deposit(A, n):
b := A.read();
A.write(b + n);
Transactions: Recovery
• A dirty read problem:
Transaction T
Deposit(A,4);
Transaction U
Deposit(A,3);
Interleaved execution and abort ?
Transactions: Recovery
• A dirty read problem (initially A: 100):

Transaction T (Deposit(A,4))      Transaction U (Deposit(A,3))
bt := A.read();
A.write(bt+4);    → A: 104
                                   bu := A.read();
                                   A.write(bu+3);   → A: 107
Abort                              Commit

U read and committed T’s uncommitted value; correct result: A = 103.
Transactions: Recovery
• Premature write, or over-writing uncommitted values:
Transaction T
Deposit(A,4);
Transaction U
Deposit(A,3);
Interleaved execution and Abort ?
Transactions: Recovery
• Over-writing uncommitted values (initially A: 100):

Transaction T (Deposit(A,4))      Transaction U (Deposit(A,3))
bt := A.read();
A.write(bt+4);    → A: 104
                                   bu := A.read();
                                   A.write(bu+3);   → A: 107
                                   Abort

Correct result: A = 104, so U’s abort must restore the value written by T, not the initial value.
Transactions: Recovery
• Illustration of well-known problems:
– a dirty read
– premature write
• elements of a solution:
– cascading aborts: a transaction reading uncommitted data must be aborted if the transaction that modified the data aborts
– to avoid cascading aborts, transactions may only read data written by committed transactions
– undo of write operations must be possible
Transactions: Recovery
• how to preserve data despite subsequent failures?
– usually by using stable storage
• two copies of the data stored
– in separate parts of disks
– not decay-related (the probability of both parts being corrupted is small)
Nested Transactions
• Transactions composed of several sub-transactions
• Why nesting?
– a modular approach to structuring transactions in applications
– a means of controlling concurrency within a transaction
• concurrent sub-transactions accessing shared data are serialized
– finer-grained recovery from failures
• sub-transactions fail independently
Nested Transactions
• Sub-transactions commit or abort independently
– without effect on the outcome of other sub-transactions or enclosing transactions
• the effect of a sub-transaction becomes durable only when the top-level transaction commits
T = Transfer
T1 = Deposit T2 = Withdraw
Concurrency control: locking
• Environment
– shared data in a single server (this section)
– many competing clients
• problem:
– realize transactions
– maximize concurrency
• solution: serial equivalence
• difference with mutual exclusion?
Concurrency control: locking
• Protocols:
– Locks
– Optimistic Concurrency Control
– Timestamp Ordering
Concurrency control: locking
• Example:
– access to shared data within a transaction → lock (= data reserved for …)
– exclusive locks
• exclude access by other transactions
Concurrency control: locking
• Same example (lost update) with locking
Transaction T
Withdraw(A,4);
Deposit(B,4);
Transaction U
Withdraw(C,3);
Deposit(B,3);
(lock ownership is indicated in the trace below)
Concurrency control: locking
• Exclusive locks (initially A: 100, B: 200, C: 300):

Transaction T (Withdraw(A,4); Deposit(B,4))    Transaction U (Withdraw(C,3); Deposit(B,3))
bt := A.read();        (T locks A)
A.write(bt-4);         → A: 96
                                               bu := C.read();    (U locks C)
                                               C.write(bu-3);     → C: 297
bt := B.read();        (T locks B)
                                               bu := B.read();    → wait for T
B.write(bt+4);         → B: 204
CloseTransaction(T);   (T’s locks released)
                                               bu := B.read();    (bu = 204)
                                               B.write(bu+3);     → B: 207
                                               CloseTransaction(U);

Final B = 207: the locks force the serially equivalent result.
Concurrency control: locking
• Basic elements of the protocol
1. serial equivalence
• requirements
– all of a transaction’s accesses to a particular data item should be serialized with respect to accesses by other transactions
– all pairs of conflicting operations of two transactions should be executed in the same order
• how?
– a transaction is not allowed any new locks after it has released a lock
→ two-phase locking
Concurrency control: locking
• Two-phase locking
– growing phase
• new locks can be acquired
– shrinking phase
• no new locks
• locks are released
Concurrency control: locking
• Basic elements of the protocol
1. serial equivalence → two-phase locking
2. hide intermediate results
• conflict:
– release of a lock → access by other transactions possible
– access should be delayed until commit/abort of the transaction
• how?
– a new mechanism?
– (better) release locks only at commit/abort
→ strict two-phase locking
– locks held until the end of the transaction
Concurrency control: locking
• How to increase concurrency and preserve serial equivalence?
– granularity of locks
– appropriate locking rules
Concurrency control: locking
• Granularity of locks
– observations
• large number of data items on a server
• a typical transaction needs only a few items
• conflicts unlikely
– large granularity → limits concurrent access
• example: all accounts in a branch of a bank are locked together
– small granularity → overhead
Concurrency control: locking
• Appropriate locking rules
– when do operations conflict?
→ read & write locks

operation by T   operation by U   conflict
read             read             No
read             write            Yes
write            write            Yes
Concurrency control: locking
• Lock compatibility (for one data item)

                         Lock requested
                         Read    Write
Lock already set   None  OK      OK
                   Read  OK      Wait
                   Write Wait    Wait
Concurrency control: locking
• Strict two-phase locking
– locking
• done by the server (containing the data item)
– unlocking
• done at commit/abort by the transactional service
Concurrency control: locking
• Use of locks in strict two-phase locking
– when an operation accesses a data item:
• not locked yet → lock set & operation proceeds
• conflicting lock set by another transaction → transaction must wait till ...
• non-conflicting lock set by another transaction → lock shared & operation proceeds
• locked by the same transaction → lock promoted if necessary & operation proceeds
Concurrency control: locking
• Use of locks in strict two-phase locking
– when an operation accesses a data item (as above)
– when a transaction is committed/aborted → the server unlocks all data items locked for the transaction
Concurrency control: locking
• Lock implementation
– lock manager
– manages a table of locks:
• transaction identifiers
• identifier of the (locked) data item
• lock type
• condition variable
– for waiting transactions (see the sketch below)
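A possible shape for such a lock manager, sketched in Python under assumed names: a table mapping each data item to its lock type and holder set, with a condition variable on which waiting transactions block. It also encodes the sharing and promotion rules described above.

    import threading

    class LockManager:
        def __init__(self):
            self.locks = {}                     # data item -> (lock type, set of holders)
            self.cond = threading.Condition()   # waiting transactions block here

        def _conflicts(self, kind, held, holders, trans):
            # only read/read is compatible; a lone holder may promote its own lock
            if holders <= {trans}:
                return False
            return kind == "write" or held == "write"

        def acquire(self, trans, item, kind):
            with self.cond:
                while True:
                    held, holders = self.locks.get(item, (None, set()))
                    if held is None or not self._conflicts(kind, held, holders, trans):
                        if kind == "write" or held == "write":
                            held = "write"      # set, or promote if necessary
                        else:
                            held = "read"       # share the read lock
                        self.locks[item] = (held, holders | {trans})
                        return
                    self.cond.wait()            # conflicting lock: wait

        def release_all(self, trans):           # at commit/abort (strict 2PL)
            with self.cond:
                for item, (held, holders) in list(self.locks.items()):
                    holders.discard(trans)
                    if not holders:
                        del self.locks[item]
                self.cond.notify_all()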
Concurrency control: locking
• Deadlocks
– a state in which each member of a group of transactions is waiting for some other member to release a lock
→ no progress possible!
– Example: with read/write locks
Concurrency control: locking
• Same example (lost update) with locking
Transaction T
Withdraw(A,4);
Deposit(B,4);
Transaction U
Withdraw(C,3);
Deposit(B,3);
(lock ownership is indicated in the trace below)
Concurrency control: locking
• Read/write locks (initially A: 100, B: 200, C: 300):

Transaction T (Withdraw(A,4); Deposit(B,4))    Transaction U (Withdraw(C,3); Deposit(B,3))
bt := A.read();        (T read-locks A)
A.write(bt-4);         → A: 96 (lock promoted to a write lock)
                                               bu := C.read();    (U read-locks C)
                                               C.write(bu-3);     → C: 297
bt := B.read();        (T read-locks B)
                                               bu := B.read();    (U shares the read lock on B)
B.write(bt+4);         → wait for release by U
                                               B.write(bu+3);     → wait for release by T

Deadlock!!
Concurrency control: locking
• Solutions to the deadlock problem
– prevention
• by locking all data items used by a transaction when it starts
• by requesting locks on data items in a predefined order
→ evaluation:
• impossible for interactive transactions
• reduction of concurrency
Concurrency control: locking
• Solutions to the deadlock problem
– detection
• the server keeps track of a wait-for graph
– lock: an edge is added
– unlock: an edge is removed
• the presence of cycles may be checked
– when an edge is added
– periodically
– example (see the sketch below)
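A sketch (assumed names) of such a wait-for graph and cycle check in Python; here each transaction waits for at most one other, so following edges suffices, while shared locks would give several outgoing edges and require a full graph search:

    def has_cycle(waits_for, start):
        seen, node = set(), start
        while node in waits_for:          # follow the wait-for edges
            if node in seen:
                return True               # a transaction is reached twice: cycle
            seen.add(node)
            node = waits_for[node]
        return False

    waits_for = {"T": "U"}                # edge added when T blocks on U's lock
    print(has_cycle(waits_for, "T"))      # False: no deadlock yet
    waits_for["U"] = "T"                  # edge added when U blocks on T's lock
    print(has_cycle(waits_for, "U"))      # True: cycle T -> U -> T, deadlock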
Concurrency control: locking
• Wait-for graph for the example above
– after the reads of B: T holds locks on A and B, U holds locks on C and B (the read lock on B is shared)
– after both B.write operations block: T waits for U to release B and U waits for T to release B
– the wait-for graph contains the cycle T → U → T: cycle → deadlock
Concurrency control: locking
• Solutions to the deadlock problem
– detection
• the server keeps track of a wait-for graph
• the presence of cycles must be checked
• once a deadlock is detected, the server must select a transaction and abort it (to break the cycle)
• choice of transaction? Important factors:
– age of the transaction
– number of cycles the transaction is involved in
Concurrency control: locking
• Solutions to the deadlock problem
– timeouts
• locks granted for a limited period of time
– within the period: lock invulnerable
– after the period: lock vulnerable
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
Distributed transactions
• Definition
Any transaction whose activities involve multiple servers
• Examples
– simple: client accesses several servers
– nested: server accesses several other servers
Distributed transactions
• Examples: simple
(figure: a client calls servers X, Y and Z directly)
Serial execution of the requests on the different servers
Distributed transactions
• Examples: nesting
Serial or parallel execution of requests on different servers
(figure: transaction T at the client has sub-transactions T1 and T2; T1 has T11 and T12, T2 has T21 and T22, spread over servers X, Y, Z, M, N and P)
Distributed transactions
• Examples:
(figure: a flat distributed transaction T, where one client accesses servers X, Y and Z directly, and a nested distributed transaction T, where sub-transactions T1, T2, T11, T12, T21 and T22 run at servers X, Y, M, N and P)
Distributed transactions
• Commit: agreement between all servers involved
• to commit
• to abort
• take one server as coordinator → simple (?) protocol
– single point of failure?
• tasks of the coordinator
– keep track of the other servers, called workers
– responsible for the final decision
Distributed transactions
• New service operations:
– AddServer(TransID, CoordinatorID)
• called by clients
• first operation on a server that has not joined the transaction yet
– NewServer(TransID, WorkerID)
• called by the new server on the coordinator
• the coordinator records the ServerID of the worker in its workers list
Distributed transactions
• Example: simple (account A at server X, B at Y, C and D at Z; X acts as coordinator, Y and Z join as workers)

T := OpenTransaction();
X$Withdraw(A,4);
Z$Deposit(C,4);
Y$Withdraw(B,3);
Z$Deposit(D,3);
CloseTransaction(T);

The client program above results in the following calls:
1. T := X$OpenTransaction();
2. X$Withdraw(A,4);
3. Z$AddServer(T, X);
4. X$NewServer(T, Z);
5. Z$Deposit(C,4);
6. Y$AddServer(T, X);
7. X$NewServer(T, Y);
8. Y$Withdraw(B,3);
9. Z$Deposit(D,3);
10. X$CloseTransaction(T);
Distributed transactions
• Example: the data kept at each server

Server   Trans   Role     Coord.   Workers
X        T       Coord    (here)   Y, Z
Y        T       Worker   X
Z        T       Worker   X
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
Atomic Commit protocol
• Elements of the protocol
– each server is allowed to abort its part of a transaction
– if a server votes to commit, it must ensure that it will eventually be able to carry out this commitment
• the transaction must be in the prepared state
• all altered data items must be on permanent storage
– if any server votes to abort, then the decision must be to abort the transaction
Atomic Commit protocol
• Elements of the protocol (cont.)
– the protocol must work correctly, even when
• some servers fail
• messages are lost
• servers are temporarily unable to communicate
Atomic Commit protocol
• Protocol:
– Phase 1: voting phase
– Phase 2: completion according to outcome of vote
Atomic Commit protocol
• Protocol (messages between coordinator and worker)

Coordinator                          Worker
step  status                         step  status
1     prepared to commit      --CanCommit?-->
                                     2     prepared to commit
3     (counting votes)        <--Yes--
      committed
                              --DoCommit-->
                                     4     committed
done                          <--HaveCommitted--
Atomic Commit protocol
• Protocol: Phase 1, the voting phase
– Coordinator: on the operation CloseTransaction
• sends CanCommit to each worker
• behaves as a worker in phase 1
• waits for the replies from the workers
– Worker: when receiving CanCommit
• if the transaction can commit at this worker
– saves the data items
– sends Yes to the coordinator
• if the transaction cannot commit at this worker
– sends No to the coordinator
– clears data structures, removes locks
Atomic Commit protocol
• Protocol: Phase 2, completion according to the outcome of the vote
– Coordinator: collecting the votes
• all votes Yes: commit the transaction; send DoCommit to the workers ← point of decision!!
• one vote No: abort the transaction
– Worker: voted Yes, waits for the decision of the coordinator
• receives DoCommit
– makes the committed data available; removes locks
• receives AbortTransaction
– clears data structures; removes locks
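The decision logic of both phases can be sketched in Python as follows; the class and method names are assumptions for illustration, and failure handling, recovery files and timeouts (next slide) are deliberately left out:

    def coordinator_close_transaction(trans, workers):
        votes = [w.can_commit(trans) for w in workers]  # phase 1: CanCommit?
        if all(votes):                                  # point of decision
            for w in workers:
                w.do_commit(trans)                      # phase 2: DoCommit
            return "Commit"
        for w in workers:
            w.abort_transaction(trans)
        return "Abort"

    class Worker:
        def __init__(self):
            self.prepared, self.committed = set(), set()

        def can_commit(self, trans):      # a real worker may also vote No
            self.prepared.add(trans)      # here: save data items, keep locks
            return True

        def do_commit(self, trans):
            self.prepared.discard(trans)
            self.committed.add(trans)     # make data available, remove locks

        def abort_transaction(self, trans):
            self.prepared.discard(trans)  # clear data structures, remove locks

    print(coordinator_close_transaction("T", [Worker(), Worker()]))  # Commit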
Atomic Commit protocol
• Timeouts:
– worker did all/some operations and waits for CanCommit
• unilateral abort possible
– coordinator waits for the votes of the workers
• unilateral abort possible
– worker voted Yes and waits for the final decision of the coordinator
• wait unavoidable
• extensive delay possible
• an additional operation GetDecision can be used to get the decision from the coordinator or other workers
Atomic Commit protocol
• Performance:
– C → W: CanCommit      N-1 messages
– W → C: Yes/No         N-1 messages
– C → W: DoCommit       N-1 messages
– W → C: HaveCommitted  N-1 messages
+ (unavoidable) delays possible
Atomic Commit protocol
• Nested Transactions
– top-level transaction & subtransactions → transaction tree
Atomic Commit protocol
(figure: the transaction tree from before: T at the client, sub-transactions T1 and T2 with T11, T12, T21 and T22, at servers X, Y, Z, M, N and P)
Atomic Commit protocol
• Nested Transactions
– top-level transaction & subtransactions → transaction tree
– coordinator = top-level transaction
– subtransaction identifiers
• globally unique
• allow derivation of the ancestor transactions (why necessary?)
Atomic Commit protocol
• Nested Transactions: Transaction IDs

TID in example   actual TID
T                Z, nz
T1               Z, nz ; X, nx
T11              Z, nz ; X, nx ; M, nm
T2               Z, nz ; Y, ny
Atomic Commit protocol
• Upon completion of a subtransaction
– independent decision to commit or abort
– commit of a subtransaction
• only provisional
• status (including the status of descendants) reported to the parent
• final outcome dependent on its ancestors
– abort of a subtransaction
• implies abort of all its descendants
• abort reported to its parent (always possible?)
Atomic Commit protocol
• Data structures
– commit list: list of all committed (sub)transactions
– abort list: list of all aborted (sub)transactions
– example
Atomic Commit protocol
• Data structures: example
(figure: the transaction tree of T with the outcomes of the subtransactions: T1 provisional commit (at X); T11 abort (at M); T12 provisional commit (at N); T2 aborted (at Y); T21 provisional commit (at N); T22 provisional commit (at P))
Atomic Commit protocol
• Data structures: example, filled in step by step as the outcomes are reported:
– T starts at Z; T1 starts at X → Z records child T1
– T11 starts at M → X records child T11
– T11 aborts at M → T11 added to the abort lists of M and X
– T12 starts at N and provisionally commits → T12 added to the commit lists of N and X
– T1 provisionally commits at X → T1 added to X’s commit list; Z learns the commit list T12, T1 and the abort list T11
– T2 starts at Y → Z records child T2
– T21 starts at N and provisionally commits → T21 added to the commit lists of N and Y
– T22 starts at P and provisionally commits → T22 added to the commit lists of P and Y
– T2 aborts at Y → T2 added to the abort lists of Y and Z

Accumulated data at the servers:

Server   Trans      ChildTrans   CommitList   AbortList
Z        T          T1, T2       T12, T1      T11, T2
X        T1         T11, T12     T12, T1      T11
Y        T2         T21, T22     T21, T22     T2
M        T11                                  T11
N        T12, T21                T12, T21
P        T22                     T22
Atomic Commit protocol
• Data structures: final data

Server   Trans      ChildTrans   CommitList          AbortList
Z        T                       T12, T1 (at N, X)   T11, T2
X        T1                      T12, T1             T11
Y
M
N        T12, T21                T12, T21
P        T22                     T22
Atomic Commit protocol
• Algorithm of the coordinator (flat protocol)
– Phase 1
• send CanCommit to each worker in the commit list, carrying
– TransactionId: T
– the abort list
• the coordinator behaves as a worker
– Phase 2 (as for non-nested transactions)
• all votes Yes: commit the transaction; send DoCommit to the workers
• one vote No: abort the transaction
Atomic Commit protocol
• Algorithm of a worker (flat protocol)
– Phase 1 (after receipt of CanCommit)
• at least one (provisionally) committed descendant of the top-level transaction:
– transactions with ancestors in the abort list are aborted
– prepare for commit of the other transactions
– send Yes to the coordinator
• no (provisionally) committed descendant:
– send No to the coordinator
– Phase 2 (as for non-nested transactions)
Atomic Commit protocol
• Algorithm of a worker (flat protocol)
– Phase 1 (after receipt of CanCommit)
– Phase 2: voted Yes, waits for the decision of the coordinator
• receives DoCommit
– makes the committed data available; removes locks
• receives AbortTransaction
– clears data structures; removes locks
Atomic Commit protocol
• Timeouts:
– the same 3 as above:
• worker did all/some operations and waits for CanCommit
• coordinator waits for the votes of the workers
• worker voted Yes and waits for the final decision of the coordinator
– a provisionally committed child with an aborted ancestor:
• does not participate in the algorithm
• has to make an enquiry itself
• when?
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
Distributed transactions: Locking
• Locks are maintained locally (at each server)
– the server decides whether
• to grant a lock
• to make the requesting transaction wait
– it cannot release the lock until it knows whether the transaction has been
• committed or
• aborted
at all servers
– deadlocks can occur
Distributed transactions: Locking
• Locking rules for nested transactions (see the sketch after this list)
– a child transaction inherits locks from its parents
– when a nested transaction commits, its locks are inherited by its parents
– when a nested transaction aborts, its locks are removed
– a nested transaction can get a read lock when all the holders of write locks (on that data item) are ancestors
– a nested transaction can get a write lock when all the holders of read and write locks (on that data item) are ancestors
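A sketch in Python (assumed names and hierarchical transaction ids) of the two ancestor rules above:

    def ancestors(tid):
        # with hierarchical ids such as ("Z", "X", "M"), the ancestors of a
        # transaction are all proper prefixes of its id
        return {tid[:i] for i in range(1, len(tid))}

    def can_lock(requester, kind, read_holders, write_holders):
        blocking = set(write_holders)             # write locks always matter
        if kind == "write":
            blocking |= set(read_holders)         # for a write lock, read locks too
        return blocking <= ancestors(requester) | {requester}

    T, T1, T11 = ("Z",), ("Z", "X"), ("Z", "X", "M")
    print(can_lock(T11, "read", [], [T, T1]))     # True: all writers are ancestors
    print(can_lock(T11, "write", [T1], []))       # True: all holders are ancestors
    print(can_lock(T1, "write", [T11], []))       # False: T11 is not an ancestor of T1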
Distributed transactions: Locking
• Who can access A?
(figure: the transaction tree of T as before, with a data item A held at one of the servers)
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
Distributed deadlocks
• Single-server approaches
– prevention: difficult to apply
– timeouts: which value, with variable delays?
→ detection
• a global wait-for graph can be constructed from the local ones
• a cycle in the global graph is possible without a cycle in any local graph
Distributed transactions: Deadlocks
(figure: a distributed deadlock involving transactions U, V and W, with objects A at server X, B at server Y, and C and D at server Z)
Distributed transactions: Deadlocks
• Algorithms:
– centralised deadlock detection: not a good idea
• depends on a single server
• cost of transmission of local wait-for graphs
– distributed algorithm:
• complex
• phantom deadlocks
→ edge-chasing approach
Distributed transactions: Deadlocks
• Phantom deadlocks
– a deadlock is detected that is not really a deadlock
– can arise during deadlock detection
• while the global wait-for graph is being constructed
• a waiting transaction is aborted, leaving a stale edge
Distributed transactions: Deadlocks
• Edge Chasing
– distributed approach to deadlock detection:
• no global wait-for graph is constructed
– servers attempt to find cycles
• by forwarding probes (= messages) that follow the edges of the wait-for graph throughout the distributed system
Distributed transactions: Deadlocks
• Edge Chasing
– three steps:
• initiation: transaction starts waiting
– new probe constructed
• detection: probe received
– extend probe
– check for loop
– forward new probe
• resolution
Distributed transactions: Deadlocks
• Edge Chasing: initiation
– send out a probe ⟨T → U⟩ when transaction T starts waiting for U (and U is already waiting for …)
– in case of lock sharing, different probes are forwarded
Distributed transactions: Deadlocks
(figure: in the example above, W starts waiting for U → initiation: the probe ⟨W → U⟩ is sent)
Distributed transactions: Deadlocks
• Edge Chasing: detection
– when receiving a probe ⟨T → U⟩
• check if U is waiting
• if U is waiting for V (and V is waiting) → add V to the probe: ⟨T → U → V⟩
• check for a loop in the probe
– yes → deadlock
– no → forward the new probe
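A sketch (assumed names) in Python of this detection step, run here over the cycle W → U → V → W used in the figure below:

    def handle_probe(probe, waits_for):
        last = probe[-1]
        if last not in waits_for:         # the last transaction is not waiting: drop
            return ("drop", probe)
        extended = probe + [waits_for[last]]
        if extended[-1] in probe:         # loop in the probe: deadlock detected
            return ("deadlock", extended)
        return ("forward", extended)      # forward to the server of the new entry

    waits_for = {"W": "U", "U": "V", "V": "W"}
    status, probe = "forward", ["W", "U"]   # initiation: W starts waiting for U
    while status == "forward":
        status, probe = handle_probe(probe, waits_for)
    print(status, probe)                  # deadlock ['W', 'U', 'V', 'W']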
Distributed transactions: Deadlocks
(figure: the probe grows as it is forwarded: ⟨W → U⟩, then ⟨W → U → V⟩, then ⟨W → U → V → W⟩ → cycle)
Distributed transactions: Deadlocks
• Edge Chasing: resolution
– abort one transaction
– problem?
• every waiting transaction can initiate deadlock detection
• detection may happen at different servers
• several transactions may be aborted
– solution: transaction priorities
Distributed transactions: Deadlocks
• Edge Chasing: transaction priorities
– assign a priority to each transaction, e.g. using timestamps
– solution of the problem above:
• abort the transaction with the lowest priority
• if different servers detect the same cycle, the same transaction will be aborted
Distributed transactions: Deadlocks
• Edge Chasing: transaction priorities
– other improvements
• reduce the number of initiated probe messages
– detection only initiated when a higher-priority transaction waits for a lower-priority one
• reduce the number of forwarded probe messages
– probes travel downhill, from transactions with high priority to transactions with lower priority
– probe queues required; a more complex algorithm
Overview
• Transactions
• Distributed transactions
– Flat and nested distributed transactions
– Atomic commit protocols
– Concurrency in distributed transactions
– Distributed deadlocks
– Transaction recovery
• Replication
Transactions and failures
• Introduction
– approaches to fault-tolerant systems
• replication
– instantaneous recovery from a single fault
– expensive in computing resources
• restart and restore a consistent state
– less expensive
– requires stable storage
– slow(er) recovery process
Transactions and failures
• Overview
– Stable storage
– Transaction recovery
– Recovery of the two-phase commit protocol
Transactions and failures: Stable storage
• Ensures that any essential permanent data will be recoverable after any single system failure
• allowed system failures:
– during a disk write
– damage to any single disk block
• hardware solution → RAID technology
• software solution:
– based on pairs of blocks for the same data item
– a checksum to determine whether a block is good or bad
Transactions and failures: Stable storage
• Based on the following invariant:
– not more than one block of any pair is bad
– if both are good
• same data
• except during the execution of a write operation
• write operation:
– maintains the invariant
– the writes on the two blocks are done strictly sequentially
• restart of the stable storage server after a crash → recovery procedure to restore the invariant
Transactions and failures: Stable storage
• Recovery for a pair:
– both good and the same → ok
– one good, one bad → copy the good block to the bad block
– both good but different → copy one block to the other
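These write and recovery rules can be sketched in Python; the checksum choice (CRC-32) and the representation of a block as a (data, checksum) pair are assumptions for illustration:

    import zlib

    def checksummed(data):
        return (data, zlib.crc32(data))   # a block carries its own checksum

    def good(block):
        data, crc = block
        return zlib.crc32(data) == crc

    def write(pair, data):
        pair[0] = checksummed(data)       # write the first copy ...
        pair[1] = checksummed(data)       # ... then the second: at most one is bad

    def recover(pair):
        if good(pair[0]) and not good(pair[1]):
            pair[1] = pair[0]             # copy the good block to the bad block
        elif good(pair[1]) and not good(pair[0]):
            pair[0] = pair[1]
        elif pair[0] != pair[1]:
            pair[1] = pair[0]             # both good but different: the first copy
        return pair                       # is the newer one, given the write order

    pair = [checksummed(b"old"), checksummed(b"old")]
    pair[0] = checksummed(b"new")         # crash between the two writes
    print(recover(pair))                  # both blocks now hold b"new"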
Transactions and failures
• Overview
– Stable storage
– Transaction recovery
– Recovery of the two-phase commit protocol
Transactions and failures: Transaction recovery
• the atomicity property of transactions implies:
– durability
• data items stored in permanent storage
• data will remain available indefinitely
– failure atomicity
• effects of transactions are atomic even when servers fail
• recovery should ensure durability and failure atomicity
Transactions and failures: Transaction recovery
• Assumptions about servers
– servers keep data in volatile storage
– committed data recorded in a recovery file
• a single mechanism: the recovery manager
– saves data items in permanent storage for committed transactions
– restores the server’s data items after a crash
– reorganizes the recovery file to improve the performance of recovery
– reclaims storage space in the recovery file
Transactions and failures: Transaction recovery
• Elements of the algorithm:
– each server maintains an intention list for all of its active transactions: pairs of
• name
• new value
– the server decides it is prepared to commit a transaction → intention list saved in the recovery file (stable storage)
– the server receives DoCommit → commit recorded in the recovery file
– after a crash, based on the recovery file:
• effects of committed transactions restored (in the correct order)
• effects of other transactions neglected
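A sketch in Python (assumed names and record formats) of recovery from such a file of intention lists and commit records:

    recovery_file = []                      # stands in for stable storage

    def prepared(trans, intention_list):    # server is prepared to commit trans
        recovery_file.append(("prepare", trans, intention_list))

    def committed(trans):                   # server received DoCommit for trans
        recovery_file.append(("commit", trans))

    def recover():
        data, intentions = {}, {}
        for record in recovery_file:        # replay the file in order
            if record[0] == "prepare":
                intentions[record[1]] = record[2]
            else:                           # commit: apply the saved intention list
                data.update(intentions.pop(record[1], {}))
        return data                         # uncommitted intentions are neglected

    prepared("T", {"A": 96, "B": 204})
    committed("T")
    prepared("U", {"C": 297})               # U had not committed before the crash
    print(recover())                        # {'A': 96, 'B': 204}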
Transactions and failures: Transaction recovery
• Alternative implementations for recovery file:
– logging technique
– shadow versions
• (see book for details)
Transactions and failures
• Overview
– Stable storage
– Transaction recovery
– Recovery of the two-phase commit protocol
Transactions and failures: two-phase commit protocol
• A server can fail during the commit protocol
• each server keeps its own recovery file
• 2 new status values:
– done
– uncertain
Transactions and failures: two-phase commit protocol
• meaning of the status values:
– committed:
• coordinator: the outcome of the votes is Yes
• worker: the protocol is complete
– done:
• coordinator: the protocol is complete
– uncertain:
• worker: voted Yes; outcome unknown
Transactions and failures: two-phase commit protocol
• Recovery actions, per (status@…) in the recovery file:
– prepared@coordinator
• no decision before the failure of the server
• send AbortTransaction to all workers
– aborted@coordinator
• send AbortTransaction to all workers
– committed@coordinator
• decision to commit taken before the crash
• send DoCommit to all workers
• resume the protocol
Transactions and failures: two-phase commit protocol
• Recovery actions, per (status@…) in the recovery file:
– committed@worker
• send HaveCommitted to the coordinator
– uncertain@worker
• send GetDecision to the coordinator to get the status
– prepared@worker
• not yet voted Yes
• unilateral abort possible
– done@coordinator
• no action required
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
Replication
• A technique for enhancing services
– performance enhancement
– increased availability
– fault tolerance
• Requirements
– replication transparency
– consistency
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
System model and group communication
• Architectural model
(figure: clients C exchange requests and replies with front ends FE, which talk to the service’s replica managers RM)
System model and group communication
• 5 phases in the execution of a request:
– Request: the FE issues the request to one or more RMs
– Coordination: needed to execute requests consistently
• FIFO
• Causal
• Total
– Execution: by all managers, perhaps tentatively
– Agreement
– Response
System model and group communication
• Need for dynamic groups!
• Role of the group membership service
– an interface for group membership changes: create/destroy groups, add a process
– implementing a failure detector: monitoring group members
– notifying members of group membership changes
– performing group address expansion
• Handling network partitions: the group is
– reduced: primary-partition
– split: partitionable
System model and group communication
(figure: group membership management handles Join, Leave and Fail events for a process group; multicast communication performs group address expansion before sending to the members)
System model and group communication
• View delivery
– to all members when a change in membership occurs
– delivering a view ≠ receiving a view
• an event occurs in a view v(g) at process p
• Basic requirements for view delivery
– Order: if process p delivers v(g) and then v(g’), then no process delivers v(g’) before v(g)
– Integrity: if p delivers v(g) then p ∈ v(g)
– Non-triviality: if q joins a group and remains reachable, then eventually q ∈ v(g) at p
System model and group communication
• View-synchronous group communication
– reliable multicast + handling changing group views
– Guarantees
• Agreement: correct processes deliver the same set of messages in any given view
• Integrity: if a process delivers m, it will not deliver it again
• Validity: if the system fails to deliver m to q, then the other processes will deliver the view v’(g) (= v(g) – {q}) before delivering m
System model and group communication
(figure: four timelines for processes p, q and r in which p crashes and the view changes from (p, q, r) to (q, r); delivery patterns a and b are allowed, c and d are disallowed)
Overview
• Transactions
• Distributed transactions
• Replication
– System model and group communication
– Fault-tolerant services
– Highly available services
– Transactions with replicated data
Fault-tolerant services
• Goal: provide a service that is correct despite up to f process failures
• Assumptions:
– communication is reliable
– no network partitions
• Meaning of correct in the case of replication
– the service keeps responding
– clients cannot discover a difference with ...
Fault-tolerant services
• Naive replication system:
– clients read and update accounts at a local replica manager
– clients try another replica manager in case of failure
– replica managers propagate updates in the background
• Example: 2 replica managers A and B; 2 bank accounts x and y;
client 1 uses B by preference, client 2 uses A by preference

Client 1                 Client 2
setBalanceB(x,1)
                         setBalanceA(y,2)
                         getBalanceA(y) → 2
                         getBalanceA(x) → 0

Strange behaviour: client 2 sees 0 on account x (and NOT 1) yet 2 on account y, although the update of x happened earlier!!
Fault-tolerant services
• Correct behaviour?
– linearizability
• strong requirement
– sequential consistency
• weaker requirement
Fault-tolerant services
• Linearizability
– Terminology:
• Oij: client i performs operation j
• sequence of operations by one client: O20, O21, O22, ...
• a virtual interleaving of the operations performed by all clients
– Correctness requirements for the interleaved sequence:
• the interleaved sequence of operations meets the specification of a (single) copy of the objects
• the order of operations in the interleaving is consistent with the real times at which the operations occurred
– Real time?
• yes, we prefer up-to-date information
• requires clock synchronization: difficult
Fault-tolerant services
• Sequential consistency
– Correctness requirements for the interleaved sequence (difference highlighted):
• the interleaved sequence of operations meets the specification of a (single) copy of the objects
• the order of operations in the interleaving is consistent with the program order in which each individual client executed them
– Example: sequentially consistent but not linearizable

Client 1                 Client 2
setBalanceB(x,1)
                         getBalanceA(y) → 0
                         getBalanceA(x) → 0
setBalanceA(y,2)
Fault-tolerant services
• Passive (primary-backup) replication
(figure: clients C via front ends FE talk to the primary RM, which propagates state updates to the backup RMs)
Fault-tolerant services
• Passive (primary-backup) replication
– Sequence of events for handling a client request:
• Request: the FE issues the request with a unique id to the primary
• Coordination: requests are handled atomically in order; if the request has already been handled, re-send the response
• Execution: execute the request and store the response
• Agreement: the primary sends the updated state to the backups and waits for acks
• Response: the primary responds to the FE; the FE hands the response back to the client
– Correctness: linearizability
– Failures?
Fault-tolerant services
• Passive (primary-backup) replication
– Failures?
• the primary uses view-synchronous group communication
• linearizability is preserved if
– the primary is replaced by a unique backup
– the surviving replica managers agree on which operations had been performed at the replacement point
– Evaluation:
• non-deterministic behaviour of the primary is supported
• large overhead: view-synchronous communication required
• variation of the model:
– read requests handled by backups: linearizability → sequential consistency
Fault-tolerant services
• Active replication
(figure: front ends FE multicast each request to all replica managers RM)
Fault-tolerant services
• Active replication
– Sequence of events for handling a client request:
• Request: the FE does a reliable TO-multicast(g, <m, i>) and waits for the reply
• Coordination: every correct RM gets the requests in the same order
• Execution: every correct RM executes the request; all RMs execute all requests in the same order
• Agreement: not needed
• Response: every RM returns its result to the FE; when to return the result to the client?
– crash failures: after the first response from an RM
– Byzantine failures: after f+1 identical responses from RMs
– Correctness: sequential consistency, not linearizability
November 2005 Distributed systems: shared data 211
Fault-tolerant services• Active replication
– Evaluation• Reliable + totally ordered multicast ⇔ solving consensus
– possible in a synchronous system, or in an asynchronous system with failure detectors
– Overhead!
• More performance– Relax total order in case operations commute:
result of o1;o2 = result of o2;o1
– Forward read-only request to a single RM
November 2005 Distributed systems: shared data 212
Overview• Transactions
• Distributed transactions
• Replication– System model and group communication– Fault-tolerant services– Highly available services– Transactions with replicated data
November 2005 Distributed systems: shared data 213
Highly available services• Goal
– Provide acceptable level of service– Use minimal number of RMs– Minimize delay for returning result → weaker consistency
• Overview– Coda
– Gossip Architecture
– Bayou
November 2005 Distributed systems: shared data 214
Highly available services Coda
• Aims: constant data availability– better performance, e.g. for bulletin boards,
databases,…– more fault tolerance with increasing scale– support mobile and portable computers
(disconnected operation)
Approach: AFS + replication
November 2005 Distributed systems: shared data 215
Highly available services Coda
• Design AFS+– file volumes replicated on different servers– volume storage group (VSG) per file volume– Available Volume Storage group (AVSG) per
file volume at a particular instant in time– volume disconnected when AVSG is empty;
due to• network failure, partitioning• server failures• deliberate disconnection of portable workstation
November 2005 Distributed systems: shared data 216
Highly available services Coda
• Replication and consistency– file version
• integer number associated with file copy
• incremented when file is changed
– Coda version vector (CVV)• array of numbers stored with file copy on a
particular server (holding a volume)
• one value per volume in VSG
November 2005 Distributed systems: shared data 217
Highly available services Coda
• Replication and consistency: example 1
– File F stored at 3 servers: S1, S2, S3
– Initial values for all CVVs: CVVi = [1,1,1]
– update by C1 at S1 and S2; S3 inaccessible →
CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,1]
– network repaired → no conflict detected (CVV3 is dominated) →
file copy at S3 updated
CVV1 = [2,2,2], CVV2 = [2,2,2], CVV3 = [2,2,2]
November 2005 Distributed systems: shared data 218
Highly available services Coda
• Replication and consistency: example 2
– File F stored at 3 servers: S1, S2, S3
– Initial values for all CVVs: CVVi = [1,1,1]
– update by C1 at S1 and S2; S3 inaccessible →
CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,1]
– update by C2 at S3; S1 and S2 inaccessible →
CVV1 = [2,2,1], CVV2 = [2,2,1], CVV3 = [1,1,2]
– network repaired → conflict detected (CVVs incomparable) →
manual intervention or ….
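Both examples reduce to comparing version vectors. A minimal sketch of the comparison (helper names invented):

# Sketch of Coda-style version-vector comparison.
def compare(cvv1, cvv2):
    """'dominates', 'dominated', 'equal' or 'conflict'."""
    ge = all(a >= b for a, b in zip(cvv1, cvv2))
    le = all(a <= b for a, b in zip(cvv1, cvv2))
    if ge and le: return "equal"
    if ge: return "dominates"
    if le: return "dominated"
    return "conflict"                    # concurrent updates on both sides

print(compare([2, 2, 1], [1, 1, 1]))    # example 1: dominates -> update S3
print(compare([2, 2, 1], [1, 1, 2]))    # example 2: conflict -> manual fix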
November 2005 Distributed systems: shared data 219
Highly available services Coda
• Implementation– On open
• Select one server from AVSG
• check CVV with all servers in AVSG
• files in a replicated volume remain accessible to a client that can access at least one of the replicas
• load sharing over replicated volumes
– On close• multicast file to AVSG
• update of CVV
– manual resolution of conflicts might be necessary
November 2005 Distributed systems: shared data 220
Highly available services Coda
• Caching: update semantics– successful open:
AVSG not empty and latest(F, AVSG, 0) or
AVSG not empty and latest(F, AVSG, T) and
lostcallback(AVSG, T) and incache (F) or
AVSG empty and incache (F)
November 2005 Distributed systems: shared data 221
Highly available services Coda
• Caching: update semantics– failed open:
AVSG not empty and conflict(F, AVSG) or
AVSG empty and not incache(F)
– successful close: AVSG not empty and updated(F, AVSG)
or
AVSG empty
– failed close: AVSG not empty and conflict(F, AVSG)
November 2005 Distributed systems: shared data 222
Highly available services Coda
• Caching: cache coherence– relevant events to be detected by Venus within T seconds of
their occurrence:• enlargement of AVSG
• shrinking of AVSG
• lost callback event
– method: probe message to all servers in VSG of any cached file every T seconds
November 2005 Distributed systems: shared data 223
Highly available services Coda
• Caching: disconnected operation– Cache replacement policy: e.g. least-recently used
– how to support long disconnection of portables:• Venus can monitor file referencing
• users can specify a prioritised list of files to retain on local disk
– reintegration after disconnection• priority for files on server
• client files in conflict are stored on covolumes; client is informed
November 2005 Distributed systems: shared data 224
Highly available services Coda
• Performance: Coda <> AFS– No replication: no significant difference– 3-fold replication & load for 5 users:
load +5%– 3-fold replication & load for 50 users:
load +70% for Coda <> +16% for AFS• Difference: replication + tuning?
• Discussion– Optimistic approach to achieve high availability– Use of semantics-free conflict detection
(except file directories)
November 2005 Distributed systems: shared data 225
Highly available servicesGossip
• Goal of Gossip architecture– Framework for implementing highly ...
– Replicate data close to points where groups of clients need it
• Operations:– 2 types:
• Queries: read-only operations
• Updates: change state (do not read state)
– FE sends operations to any RM; selection criterion: available + reasonable response time
– Guarantees
November 2005 Distributed systems: shared data 226
Highly available servicesGossip
• Goal of Gossip architecture
• Operations:– 2 types: queries & updates– FE sends operations to any RM– Guarantees:
• Each client obtains consistent service over time – even when communicating with different RMs
• Relaxed consistency between replicas– Weaker than sequential consistency
November 2005 Distributed systems: shared data 227
Highly available servicesGossip
• Update ordering:– Causal: least costly
– Forced (= total + causal)
– Immediate• Applied in a consistent order relative to any other update at all
RMs, independent of order requested for other updates
• Choice– Left to application designer
– Reflects trade-off between consistency and operation cost
– Implications for users
November 2005 Distributed systems: shared data 228
Highly available servicesGossip
• Update ordering:– Causal: least costly
– Forced (= total + causal)
– Immediate• Applied in a consistent order relative to any other update at all
RMs, independent of order requested for other updates
• Example electronic bulletin board:– Causal: for posting items
– Forced: for adding new subscriber
– Immediate: for removing a user
November 2005 Distributed systems: shared data 229
Highly available servicesGossip
• Architecture– Clients + FE/client– Timestamps added to operations: in next figure
• Prev: reflects version of latest data values seen by client
• New: reflects state of responding RM
– Gossip messages: • exchange of operations between RMs
November 2005 Distributed systems: shared data 230
Highly available servicesGossip
[Figure: the gossip architecture. Clients issue operations through FEs: queries as (Query, prev) answered by (Val, new), updates as (Update, prev) answered by (Update id); the RMs of the gossip service exchange gossip messages.]
November 2005 Distributed systems: shared data 231
Highly available servicesGossip
• Sequence of events for handling a client request:– Request:
• FE sends request to a single RM
• For query operation: Client blocked
• For update operation: FE returns to client asap; then forwards operation to one RM or f+1 RMs for increased reliability
– Update response: • if update operation, RM replies to FE after receiving the request
– Coordination:• Request stored in log queue till it can be executed
• Gossip messages can be exchanged to update state in RM
– Execution:
November 2005 Distributed systems: shared data 232
Highly available servicesGossip
• Sequence of events for handling a client request:– Request
– Update response
– Coordination
– Execution: • RM executes request
– Query response• If operation is query then RM replies at this point
– Agreement:• Lazy update by RMs
November 2005 Distributed systems: shared data 233
Highly available servicesGossip
• Gossip internals– Timestamps at FEs– State of RMs– Handling of Query operations– Processing of update operations in causal order– Forced and immediate update operations– Gossip messages– Update propagation
November 2005 Distributed systems: shared data 234
Highly available servicesGossip
• Timestamps at FEs– Vector timestamps with entry for every RM
– Local component updated at every operation
– Returned timestamp merged with local one
– Client-to-client operations• Via FEs• Include timestamps (to preserve causal order)
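The underlying vector-timestamp operations are small; a sketch with invented helper names:

# Sketch of the FE's vector-timestamp bookkeeping (one entry per RM).
def merge(ts_a, ts_b):
    """Component-wise maximum of two vector timestamps."""
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

def leq(ts_a, ts_b):
    """ts_a <= ts_b component-wise: all that ts_a has seen, ts_b has too."""
    return all(a <= b for a, b in zip(ts_a, ts_b))

prev = [2, 0, 1]            # latest data values this client has seen
new = [2, 3, 1]             # timestamp returned by the responding RM
prev = merge(prev, new)     # FE merges the returned timestamp into its own
print(prev)                 # [2, 3, 1]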
November 2005 Distributed systems: shared data 235
Highly available servicesGossip
[Figure: the same architecture, with each FE holding a vector timestamp ts that accompanies every operation and is merged with every reply.]
November 2005 Distributed systems: shared data 236
Highly available servicesGossip
• State of RMs– Value: state as maintained by RM
– Value timestamp: associated with value
– Update log: why log?• Operation cannot be executed yet• Operation has to be forwarded to other RMs
– Replica timestamp: reflects updates received by RM
– Executed operation table: to prevent re-execution
– Timestamp table: • timestamps from other RMs• Received with Gossip messages• Used to check for consistency between RMs
November 2005 Distributed systems: shared data 237
Highly available servicesGossip
[Figure: state of a gossip replica manager: value with value timestamp, update log, replica timestamp, executed operation table and timestamp table. Updates (OperationID, Update, Prev) arrive from FEs; stable updates are applied to the value; gossip messages flow to and from other replica managers.]
November 2005 Distributed systems: shared data 238
Highly available servicesGossip
• Handling of Query operations– Query request contains:
• q= operation
• q.prev = timestamp at FE
– If q.prev <= valueTS
then the operation can be executed
else the operation is put in a hold-back queue
– Result contains new timestamp, merged with the timestamp at the FE
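A compact sketch of this test (invented names; one vector entry per RM):

# Sketch of gossip query handling at an RM.
def handle_query(q_prev, value, value_ts):
    """Return (result, new) if executable, else None -> hold-back queue."""
    if all(p <= v for p, v in zip(q_prev, value_ts)):   # q.prev <= valueTS
        return value, list(value_ts)    # new = valueTS; FE merges it locally
    return None                         # RM state not yet recent enough

print(handle_query([2, 0, 0], {"x": 1}, [2, 3, 1]))  # ({'x': 1}, [2, 3, 1])
print(handle_query([2, 4, 0], {"x": 1}, [2, 3, 1]))  # None: hold back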
November 2005 Distributed systems: shared data 239
Highly available servicesGossip
• Processing of update operations in causal order– Update request u
• u.op: specification of the operation (type & parameters)
• u.prev: timestamp generated at FE
• u.id: unique identifier
– Handling of u at RMi
November 2005 Distributed systems: shared data 240
Highly available servicesGossip
• Processing of update operations in causal order– Update request u = <u.op, u.prev, u.id>
– Handling of u at RMi
• u already processed? (check u.id)
• Increment i-th element of replicaTS
• Assign ts to u: ts[i] = replicaTS[i]; ts[k] = u.prev[k], k ≠ i
• Log record r = <i, ts, u.op, u.prev, u.id> added to log
• ts returned to FE
• If stability condition u.prev <= valueTS is satisfied
then value := apply(value, r.u.op)
valueTS := merge(valueTS, r.ts)
executed := executed ∪ {r.u.id}
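The same steps as a runnable sketch (data structures simplified, names invented; the application's update is modelled as a toy key/value write):

# Sketch of causal update processing at RM i, following the steps above.
def receive_update(rm, i, u_op, u_prev, u_id):
    if u_id in rm["executed"]:                    # u already processed?
        return None
    rm["replicaTS"][i] += 1                       # bump own entry
    ts = list(u_prev)                             # build the record's ts
    ts[i] = rm["replicaTS"][i]
    rm["log"].append((i, ts, u_op, u_prev, u_id)) # add log record
    if all(p <= v for p, v in zip(u_prev, rm["valueTS"])):   # stable?
        rm["value"] = apply_op(rm["value"], u_op)
        rm["valueTS"] = [max(a, b) for a, b in zip(rm["valueTS"], ts)]
        rm["executed"].add(u_id)
    return ts                                     # ts goes back to the FE

def apply_op(value, op):                          # toy update: (key, amount)
    value[op[0]] = op[1]
    return value

rm = {"value": {}, "valueTS": [0, 0, 0], "replicaTS": [0, 0, 0],
      "log": [], "executed": set()}
print(receive_update(rm, 0, ("x", 1), [0, 0, 0], "u1"))  # [1, 0, 0]
print(rm["value"])                                       # {'x': 1}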
November 2005 Distributed systems: shared data 241
Highly available servicesGossip
• Forced and immediate update operations– Special treatment
– Forced = total + causal order• Unique global sequence numbers
by primary RM (reelection if necessary)
– Immediate• Placed in sequence by primary RM (forced order)
• Additional communication between RMs
November 2005 Distributed systems: shared data 242
Highly available servicesGossip
• Gossip messages– Gossip message m
• m.log = log
• m.ts = replicaTS
– Tasks done by RM when receiving m• Merge m with its own log
– Drop r when r.ts <= replicaTS
– replicaTS := merge(replicaTS, m.ts)
• Apply updates that have become stable
• Eliminate records from log and executed operations table
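A sketch of these tasks (invented record layout (ts, op, prev, uid); stable updates are applied by iterating to a fixpoint, since applying one update can make others stable):

# Sketch of gossip-message processing at an RM.
def apply_op(value, op):                 # toy key/value update, as before
    value[op[0]] = op[1]
    return value

def receive_gossip(rm, m_log, m_ts):
    for r in m_log:                      # merge m.log into the local log,
        if not all(t <= s for t, s in zip(r[0], rm["replicaTS"])):
            rm["log"].append(r)          # dropping records already seen
    rm["replicaTS"] = [max(a, b) for a, b in zip(rm["replicaTS"], m_ts)]
    changed = True                       # apply updates that became stable;
    while changed:                       # one application can unblock others
        changed = False
        for ts, op, prev, uid in rm["log"]:
            if uid not in rm["executed"] and \
               all(p <= v for p, v in zip(prev, rm["valueTS"])):
                rm["value"] = apply_op(rm["value"], op)
                rm["valueTS"] = [max(a, b) for a, b in zip(rm["valueTS"], ts)]
                rm["executed"].add(uid)
                changed = True
    # (log and executed-operation table can then be pruned)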
November 2005 Distributed systems: shared data 243
Highly available servicesGossip
• Update propagation– Gossip exchange frequency
• To be tuned by application
– Partner selection policy• Random• Deterministic• Topological
– How much time will it take for all RMs to receive an update? Depends on:
• Frequency and duration of network partitions• Frequency for exchanging gossip-messages• Policy for choosing partners
November 2005 Distributed systems: shared data 244
Highly available servicesGossip
• Discussion of architecture+ Clients can continue to obtain a service even with
network partition
- Relaxed consistency guarantees
- Inappropriate for updating replicas in near-real time
? Scalability? Depends upon• Number of updates in a gossip message
• Use read-only replicas
November 2005 Distributed systems: shared data 245
Highly available servicesBayou
• Goal– Data replication for high availability– Weaker guarantees than sequential consistency– Cope with variable connectivity
• Guarantees: Every RM eventually – receives the same set of updates– applies those updates
November 2005 Distributed systems: shared data 246
Highly available servicesBayou
• Approach– Use a domain-specific policy for detecting and
resolving conflicts– Every Bayou update contains
• Dependency check procedure– Check for conflict if update would be applied
• Merge procedure– Adapt update operation
» Achieves something similar
» Passes dependency check
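A toy sketch of such an update (invented names), using Bayou's classic meeting-room scenario: if the dependency check fails, the merge procedure books another room instead.

# Sketch of a Bayou-style update: a write carries a dependency check
# and a merge procedure alongside the operation itself.
def bayou_apply(db, update):
    if update["dep_check"](db):          # no conflict: apply as requested
        update["op"](db)
    else:                                # conflict: let the merge procedure
        update["merge"](db)              # achieve something similar

# Example: book room 101 at 10:00, fall back to room 102 (toy merge rule).
update = {
    "dep_check": lambda db: db.get(("101", "10:00")) is None,
    "op":        lambda db: db.__setitem__(("101", "10:00"), "alice"),
    "merge":     lambda db: db.__setitem__(("102", "10:00"), "alice"),
}
db = {("101", "10:00"): "bob"}           # room already taken
bayou_apply(db, update)
print(db)                                # alice got room 102 instead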
November 2005 Distributed systems: shared data 247
Highly available servicesBayou
• Committed and tentative updates– New updates are applied and marked as tentative
– Tentative updates can be undone and reapplied later
– final order decided by primary replica manager → committed order
– Tentative update ti becomes next committed update• Undo of all tentative updates after last committed update
• Apply ti
• Other tentative updates are reapplied
November 2005 Distributed systems: shared data 248
Highly available servicesBayou
[Figure: committed updates c0 c1 c2 ... cN followed by tentative updates t0 t1 t2 ... ti ti+1 ...; tentative update ti becomes the next committed update and is inserted after the last committed update cN.]
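A sketch of this commit step (updates modelled as simple key/value writes; undo plus reapply is implemented as a full replay):

# Sketch: tentative update t_i becomes the next committed update.
def commit_next(committed, tentative, i):
    committed.append(tentative.pop(i))   # insert after last committed update
    state = {}
    for key, value in committed + tentative:   # undo + reapply = full replay
        state[key] = value
    return state

committed = [("x", 1)]
tentative = [("y", 2), ("x", 3)]
print(commit_next(committed, tentative, 1))
# {'x': 3, 'y': 2}: ('x', 3) committed; ('y', 2) was undone and reapplied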
November 2005 Distributed systems: shared data 249
Highly available servicesBayou
• Discussion– Makes replication non-transparent to the application
• Exploits application’s semantics to increase availability• Maintains replicated state as eventually sequentially consistent
– Disadvantages:• Increased complexity for application programmer• Increased complexity for user: returned results can be changed
– Suitable for applications where:• conflicts are rare• underlying data semantics are simple• users can cope with tentative information
e.g. diary
November 2005 Distributed systems: shared data 250
Overview• Transactions
• Distributed transactions
• Replication– System model and group communication– Fault-tolerant services– Highly available services– Transactions with replicated data
November 2005 Distributed systems: shared data 251
Transactions with replicated data• Introduction
– Replicated transactional service• each data item replicated at a group of servers
→ replica managers
• transparency → one-copy serializability
– Why?• increase availability
• increase performance
– Advantages/disadvantages• higher performance for read-only requests
• degraded performance on update requests
November 2005 Distributed systems: shared data 252
Transactions with replicated data• Architectures for replicated transactions
– questions:
• Can a client send requests to any replica manager?
• How many replica managers are needed for a
successful completion of an operation?
• If one replica manager is addressed, can this one
delay forwarding till commit of transaction?
• How to carry out two-phase commit?
November 2005 Distributed systems: shared data 253
Transactions with replicated data• Architectures for replicated transactions
– Replication Schemes:• Read-one/Write-all
– a read request can be performed by a single replica manager
– a write request must be performed by all replica managers
• Quorum Consensus
• Primary Copy
November 2005 Distributed systems: shared data 254
Transactions with replicated data• Architectures for replicated transactions
– Replication Schemes:• Read-one/Write-all
• Quorum Consensus– nr replica managers required to read data item
– nw replica managers required to update data item
– nr + nw > number of replica managers
– advantage: fewer managers needed for an update
• Primary Copy– all requests directed to a single server
– slaves can take over when primary fails
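The quorum constraints as a one-line check (invented helper; the write-majority clause is the one that reappears in the network-partition discussion later):

# Sketch of the quorum-consensus constraint.
def valid_quorum(n, n_r, n_w):
    """Any read quorum must overlap any write quorum (and writes overlap)."""
    return n_r + n_w > n and n_w > n / 2

print(valid_quorum(5, n_r=2, n_w=4))   # True: reads always see latest write
print(valid_quorum(5, n_r=2, n_w=3))   # False: a read can miss a write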
November 2005 Distributed systems: shared data 255
Transactions with replicated data• Architectures for replicated transactions
– Forwarding update requests?• Read-one/Write-all
– as soon as operation is received (only write requests)
• Quorum Consensus– as soon as operation is received (read + write requests)
• Primary Copy– when transaction commits
November 2005 Distributed systems: shared data 256
Transactions with replicated data
• Architectures for replicated transactions
– Two-phase commit
• locking (read one/write all)
– read operation: lock on single replica
– write operation: locks on all replicas
– one-copy serializability: assured, as read and write have
conflicting locks on at least one replica
• protocol:
November 2005 Distributed systems: shared data 257
Transactions with replicated data• Architectures for replicated transactions
– Two-phase commit• locking (read one/write all)
• protocol:– becomes a two-level nested two-phase commit protocol
– first phase:
» worker receives CanCommit request
request passed to all replica managers
replies collected; one reply to coordinator
– second phase:
» same approach for DoCommit and Abort request
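A sketch of the first phase at a worker (invented names): CanCommit fans out to the replica managers behind the worker, and one combined vote goes back to the coordinator.

# Sketch of the first phase at a worker fronting several replica managers.
def worker_can_commit(trans_id, replica_managers):
    votes = [rm.can_commit(trans_id) for rm in replica_managers]
    return all(votes)                   # one combined reply to the coordinator

class StubRM:
    def __init__(self, ok): self.ok = ok
    def can_commit(self, tid): return self.ok

print(worker_can_commit("T1", [StubRM(True), StubRM(True)]))   # True
print(worker_can_commit("T1", [StubRM(True), StubRM(False)]))  # False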
November 2005 Distributed systems: shared data 258
Transactions with replicated data• Architectures for replicated transactions
– Failure implications?• Read-one/Write-all
– updates impossible → available copies replication
• Quorum Consensus– quorum satisfaction still possible
• Primary Copy– failure of primary election new primary?
– Failure of slave update delayed
November 2005 Distributed systems: shared data 259
Transactions with replicated data• Available copies replication
– updates only performed by all available replica managers (cfr. Coda)
– no failures:• local concurrency control → one-copy serializability
– failures:• additional concurrency control is needed• example (next slides)
November 2005 Distributed systems: shared data 260
Transactions with replicated data
[Figure: available copies example. A is replicated at RMs X and Y; B is replicated at RMs M, P and N; clients C1 and C2. Transaction T: GetBalance(A); Deposit(B,3).]
November 2005 Distributed systems: shared data 261
Transactions with replicated data
[Figure: same configuration. Transaction U: GetBalance(B); Deposit(A,3).]
November 2005 Distributed systems: shared data 262
Transactions with replicated data
[Figure: T and U run concurrently with all RMs available: U waits for A at X, T waits for B at N. Deadlock!]
November 2005 Distributed systems: shared data 263
Transactions with replicated data
[Figure: with RM failures, both T and U can proceed and the read/write conflict is not detected.]
November 2005 Distributed systems: shared data 264
Transactions with replicated data• Available copies replication
– failures & additional concurrency control• at commit time it is checked
– servers unavailable during transaction: still unavailable?– Servers available during transaction: still available?
– If no → abort transaction
• implications for two-phase commit protocol?
November 2005 Distributed systems: shared data 265
Transactions with replicated data
• Replication and network partitions– optimistic approach: available copies
• operations can go on in each partition
• conflicting transactions should be detected and compensated for
• conflict detection:– for file systems: version vector (see Coda)
– for read/write conflicts: ….
November 2005 Distributed systems: shared data 266
Transactions with replicated data
• Replication and network partitions (cont.)
– pessimistic approach: quorum consensus
• operations can go on in a single partition only
• R read quorum & W write quorum– W > half of the votes
– R + W > total number of votes
• out of date copies should be detected– version vectors
– timestamps
November 2005 Distributed systems: shared data 267
Transactions with replicated data• Replication and network partitions (cont.)
– quorum consensus + available copies• virtual partitions advantages of both approaches
• virtual partition = abstraction of real partition
• transaction can operate in a virtual partition– sufficient replica managers to have read & write quorum
– available copies is used in transaction
• virtual partition changes during transaction → abort of transaction
• member of virtual partition cannot access another member → create new partition
November 2005 Distributed systems: shared data 268
Overview• Transactions
• Distributed transactions
• Replication– System model and group communication– Fault-tolerant services– Highly available services– Transactions with replicated data
November 2005 Distributed systems: shared data 269
Distributed Systems:
Shared Data