CS 542 -- Concurrency Control, Distributed Commit
-
Upload
j-singh -
Category
Technology
-
view
2.002 -
download
2
description
Transcript of CS 542 -- Concurrency Control, Distributed Commit
CS 542 Database Management SystemsConcurrency ControlCommit in Distributed Systems
J Singh April 11, 2011
2© J Singh, 2011 2
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
3© J Singh, 2011 3
Scheduler Architecture for CC
• Scheduler has two parts
1. Accepts read/write requests from transactions
2. Assures serialization• Keeps track of active
and pending transactions
• Controls commit, abort, delay
• Today’s lecture discusses Part 2 functionality
4© J Singh, 2011 4
The Lock Table
• A relation that associates database elements with locking information about that element
• Implemented as a hash table
• Size is proportional to the number of lock elements, not to the size of the entire database
DB element A
Lock information for A
5© J Singh, 2011 5
Scheduler Priority Logic
• When a transaction releases a lock that other transactions are waiting for, what policy to use?
– First-Come-First-Served: • Grant the lock to the longest waiting request. • No starvation (waiting forever for lock)
– Priority to Shared Locks: • Grant all S locks waiting, then one X lock. • Grant X lock if no others waiting
– Priority to Upgrading: • If there is a U lock waiting to upgrade to an X lock, grant that first.
• Each has its advantages and disadvantages– Configurable for a database instance
6© J Singh, 2011 6
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
7© J Singh, 2011 7
Motivation for intention locks
• Besides scanning through the table, if we need to modify a few tuples. What kind of lock to put on the table?
• Have to be X (if we only have S or X).
• But, blocks all other read requests!
8© J Singh, 2011 8
Intention Locks
• Allow intention locks IS, IX.
• Before S locking an item, must IS lock the root.
• Before X locking an item, must IX lock the root.
• Should make sure:– If Ti S locks a node, no Tj can X lock an ancestor.
• Achieved if S conflicts with IX
– If Tj X locks a node, no Ti can S or X lock an ancestor.• Achieved if X conflicts with IS and IX.
9© J Singh, 2011 9
Allowed Lock Sharings
IS IX S
IS
IX
S
Ö
Ö
Ö
Ö Ö
Ö
SIX X
ÖSIX
X
Ö Ö
Ö
Lock Requester
Lock
Hold
er
10© J Singh, 2011 10
Multiple Granularity Lock Protocol
• Each txn starts from the root of the hierarchy.
• To get a lock on any node, must hold an intentional lock on its parent node!
– E.g. to get S lock on a node, must hold IS or IX on parent.– E.g. to get X lock on a node, must hold IX or SIX on parent.– Full table of rules:
• Must release locks in bottom-up order.
Parent Locked In
Child may be locked by same txn in
IS IS, S
IX IS, S, IX, X, SIX
S none
SIX X, IX, (also SIX, but not necessary)
X none
11© J Singh, 2011 11
Example 1• T1 needs a shared lock on t2
• T2 needs a shared lock on R1
R1
t1t2 t3
t4
T1(IS)
T1(S)
, T2(S)
12© J Singh, 2011 12
Example 2• T1 needs a shared lock on t2
• T2 needs an exclusive lock on t4 – No conflict
R1
t1t2 t3
t4
T1(IS)
T1(S)
, T2(IX)
T2(IX)
13© J Singh, 2011 13
Examples 3, 4, 5
• T1 scans R, and updates a few tuples:
– T1 gets an SIX lock on R, and occasionally upgrades to X on the tuples.
• T2 uses an index to read only part of R:
– T2 gets an IS lock on R, and repeatedly gets an S lock on tuples of R.
• T3 reads all of R:– T3 gets an S lock on R. – OR, T3 could behave like T2;
can use lock escalation as it goes.
IS IX S
IS
IX
S
Ö
Ö
Ö
Ö Ö
Ö
SIX X
ÖSIX
X
Ö Ö
Ö
Lock Requester
Lock
Hold
er
14© J Singh, 2011 14
Insert and Delete
• Transactions– T1:
SELECT MAX(Price) WHERE Rating = 1;
SELECT MAX(Price) WHERE Rating = 2;
– T2:INSERT <Apple, Arkansas Black, 1,
96>;DELETE WHERE Rating = 2 AND Price = (SELECT MAX(Price)
WHERE Rating = 2);• Execution– T1 locks all records w/Rating=1 and
gets 80.– T2 inserts <Arkansas Black, 96>– T2 deletes <Fuji, 75>– T1 locks all records w/Rating=2 and
gets 65.
Fruit Variety * Price
Apple
Baldwin 1 80
Apple
Cortland 2 65
Apple
Delicious 2 55
Apple
Empire 1 60
Apple
Fuji 2 75
Apple
Granny Smith
1 65
• Result:– From T1: 80, 65– Actual: 96, 65– T1 then T2: 80,
75– T2 then T1: 96,
65
15© J Singh, 2011 15
Insert and Delete Rules
• When T1 inserts t1 into R,– Give X lock on t1 to T1
• When T2 deletes t2 from R,– It must obtain an X lock on t2
– This will fix the Fuji delete problem (how so?)
• But there is still a problem: Phantom Reads. – Seen with Arkansas Black in the example– Solution: use multiple granularity tree– Before inserting Q, obtain an X lock for parent(Q)
16© J Singh, 2011 16
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
17© J Singh, 2011 17
Did Insert/Delete expose a flaw in 2PL?
• The flaw was with the assumption that by locking all tuples, T1 had locked the set!
– We needed to lock the set– Would we bottleneck on the relation if the workload were insert-
and delete-heavy?
• There is another way to solve the problem:– Lock at the index (if one exists)– Since B+ trees are not 100% full, we can maintain multiple
locks in different sections of the tree.
r=1
Index Put a lock here.
18© J Singh, 2011 18
Index Locking (p1)
• Higher levels of the tree only direct searches for leaf pages.
• For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course), only if a split can propagate up to it from the modified leaf. (Similar point holds w.r.t. deletes.)
• We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.
19© J Singh, 2011 19
Index Locking (p2)
• Search: Start at root and go down; repeatedly, S lock child then unlock parent.
• Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe:
– If child is safe, release all locks on ancestors.
• Safe node: Node such that changes will not propagate up beyond this node.
– Inserts: Node is not full.– Deletes: Node is not half-empty.
20© J Singh, 2011 20
Example
Where to lock?1) Delete 38*2) Insert 45*3) Insert 25*
ROOT
A
B
C
D E
F
G H I
20
35
20*
38 44
22* 23* 24* 35* 36* 38* 41* 44*
23
21© J Singh, 2011 21
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
22© J Singh, 2011 22
Optimistic CC
• Locking is a conservative approach in which conflicts are prevented. Disadvantages:
– Lock management overhead.– Deadlock detection/resolution.
• Not discussed in CS-542 lectures, expecting that you are familiar with it
– If conflicts are rare, we may be able to gain performance by not locking, and instead checking for conflicts before txns commit.
• Two approaches– Kung-Robinson Model
• Divides every transaction into three phases: read, validate, write• Makes commit/abort decision based on what’s being read and
written– Timestamp Ordering Algorithms
• Clever use of timestamps to determine which operations are conflict-free and which must be aborted
23© J Singh, 2011 23
Kung-Robinson Model
• Key idea:– Let transactions work
in isolation– Validate reads and
writes when ready to commit
– Make Validation Atomic
– Validated ≡ Committed
• Transactions have three phases:– READ:
• txns read from the database, • make changes to private copies of
objects.– VALIDATE:
• Check if schedule so far is serializable.
– WRITE: • Make local copies of changes public.
ROOT
old
new
modifiedobjects
24© J Singh, 2011 24
Validation
• Test conditions that are sufficient to ensure that no conflict occurred.
– Each txn is assigned a numeric id.• Just use a timestamp.
• Transaction ids assigned at end of READ phase, just before validation begins.
– ReadSet(Ti): Set of objects read by txn Ti.
– WriteSet(Ti): Set of objects modified by Ti.
• Validation is atomic– Done in a critical section
25© J Singh, 2011 25
Validation Tests
• Test
FIN(Ti) < START(Tj)
FIN(Ti) < VAL(Tj) AND
WriteSet(Ti ) ∩ ReadSet(Tj ) is empty.
VAL(Ti) < VAL(Tj) ANDWriteSet(Ti ) ∩ ReadSet(Tj ) is empty
ANDWriteSet(Ti ) ∩ WriteSet(Tj ) is empty.
Ti
TjR V W
R V W
Ti
Tj
R V W
R V W
Ti
Tj
R V W
R V W
• Situation
26© J Singh, 2011 26
Overheads in Kung-Robinson CC
• Must record read/write activity in ReadSet and WriteSet per txn.
– Must create and destroy these sets as needed.
• Must check for conflicts during validation, and must make validated writes “global”.
– Critical section can reduce concurrency.– Scheme for making writes global can reduce clustering of
objects.
• Optimistic CC restarts transactions that fail validation.– Work done so far is wasted; requires clean-up.
27© J Singh, 2011 27
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
28© J Singh, 2011 28
Timestamp Ordering CC
• Main idea:– Put a timestamp on the last read and write action on every
object– Use this timestamp to detect if a transaction attempts an illegal
operation– Abort the offending transaction if it does
• Algorithm: – Give each object a read-timestamp (RTS) and a write-timestamp
(WTS), – Give each txn a timestamp (TS) when it begins– Action ai of txn Ti must occur before action aj of txn Tj if
• If action ai of txn Ti conflicts with action aj of txn Tj, and
• TS(Ti) < TS(Tj), then ai must occur before aj.
– Otherwise, restart the violating txn.
29© J Singh, 2011 29
Rules for Timestamps-Based scheduling
• Algorithm setup– RT(X)
• The read time of X, the highest timestamp of transaction that has read X.
– WT(X)• The write time of X, the highest timestamp of transaction that has
write X.– C(X)
• The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.
• Scheduler receives a request from T to operate on X– The request is realizable under some conditions and not under
others
30© J Singh, 2011 30
Physically Unrealizable
• Read too late– A transaction U that started after transaction T but wrote a
value for X before T reads X
– In other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.
U writes X
T reads X
T start U start
31© J Singh, 2011 31
Physically Unrealizable
• Write too late– A transaction U that started after T, but read X before T got a
chance to write X.
– In other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.
U reads X
T writes X
T start U start
32© J Singh, 2011 32
Dirty Read
• After T reads the value of X written by U, U could abort
– In other words, if TS(T) = RT(X) but TS(T) < WT(X), then the write is physically realizable, but there is already a later value in X.
• If C(X) is true, then the previous writer of X is committed, all is good.• If C(X) is false, we must delay T.
U writes X
T reads X
U start T start
U aborts
33© J Singh, 2011 33
Write after Write
• T tries to write X after a later transaction (U) has written it– OK to ignore the write by T because it will get overwritten
anyway– Except if U aborts
• And the new value of T is lost forever
– Solve this problem by introducing the concept of a “tentative write”
U writes XT writes X
T start
U start
T commit
U abort
34© J Singh, 2011 34
Rules for Timestamps-based Scheduling
• Scheduler receives a request to commit T. – It must find all the database elements X written by T and set
C(X)=true. – If any transactions are waiting for X to be committed, these
transactions are allowed to proceed.
• Scheduler receives a request to abort T or decides to rollback T,
– Any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
35© J Singh, 2011 35
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
36© J Singh, 2011 36
Multiversion Timestamps
• Multiversion schemes keep old versions of data item to increase concurrency.
• Each successful write results in the creation of a new version of the data item written.
• Use timestamps to label versions.
• When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.
37© J Singh, 2011 37
Timestamps vs Locking
• Generally, timestamping performs better than locking in situations where:
– Most transactions are read-only.– It is rare that concurrent transaction will try to read and write
the same element.– This is generally the case for Web Applications
• In high-conflict situation, locking performs better than timestamps
38© J Singh, 2011 38
Practical Use
• 2-Phase Locks (or variants)– Used by most relational databases
• Multi-level granularity– Support for table, page and tuple-level locks– Used by most relational databases
• Multi-version concurrency control– Oracle 8 forward: Divide transactions into read-only and read-
write• Read-only transactions use multi-version concurrency and never wait• Read-write transactions use 2PL
– Postgres, others as well, offer some level of MVCC
39© J Singh, 2011 39
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
40© J Singh, 2011 40
Distributed Commit Motivation
• FruitCo has– Its main Sales office in
Oregon– Farms and Warehouse are in
Washington– Finance is in Utah– All three sites have local
data centers with their own systems
• When an order is placed, the Sales system must send the billing information to Utah and shipping information to Washington.
– When an order is placed, all three databases must be updated, or none should be.
41© J Singh, 2011 41
Two Phase Commit
• The Basic Idea
Transaction Manager (TM) Resource Manager (RM)1. Trans. arrives.
Message to ask for vote is sent to other site(s)
Message is recorded.Site votes Y or N (abort)Vote is sent to site 1
2. The vote is received. If vote = Y on both sites, then Commit else Abort
Either Commit or Abort based on the decision of site 1
42© J Singh, 2011 42
Two-Phase Commit (2PC)
• Phase 1 : The TM gets the RMs ready to write the results into the database
• Phase 2 : Everybody writes the results into the database– TM :The process at the site where the transaction originates
and which controls the execution– RM :The process at the other sites that participate in
executing the transaction
• Global Commit Rule:– The TM aborts a transaction if and only if at least one RM
votes to abort it.– The TM commits a transaction if and only if all of the RMs
vote to commit it.
43© J Singh, 2011 43
Centralized 2PC
ready? yes/nocommit/abort?commited/aborted
Phase 1 Phase 2
C C C
P
P
P
P
P
P
P
P
44© J Singh, 2011 44
State Transitions in 2PC
INITIAL
WAIT
Commit commandPrepare
Vote-commit (all)Global-commit
INITIAL
READY
Prepare Vote-commit
Global-commitAck
Prepare Vote-abort
Global-abortAck
TM RMs
Vote-abort Global-abort
ABORT COMMIT COMMITABORT
45© J Singh, 2011 45
When TM Fails…
• Timeout in INITIAL– Who cares
• Timeout in WAIT– Cannot unilaterally commit– Can unilaterally abort
• Timeout in ABORT or COMMIT
– Stay blocked and wait for the acks
• TM
• INITIAL
• WAIT
• Commit command• Prepare
• Vote-commit • Global-commit
• ABORT • COMMIT
• Vote-abort • Global-abort
46© J Singh, 2011 46
When an RM Fails…
• Timeout in INITIAL– TM must have failed in
INITIAL state– Unilaterally abort
• Timeout in READY– Stay blocked
• INITIAL
• READY
• Prepare • Vote-commit
• Global-commit• Ack
• Prepare • Vote-abort
• Global-abort• Ack
• ABORT • COMMIT
• RMs
47© J Singh, 2011 47
When TM Recovers…
• Failure in INITIAL– Start the commit process
upon recovery
• Failure in WAIT– Restart the commit process
upon recovery
• Failure in ABORT or COMMIT– Nothing special if all the
acks have been received– Otherwise the termination
protocol is involved
• TM
• INITIAL
• WAIT
• Commit command• Prepare
• Vote-commit • Global-commit
• ABORT • COMMIT
• Vote-abort • Global-abort
48© J Singh, 2011 48
When an RM Recovers…
• Failure in INITIAL– Unilaterally abort upon
recovery
• Failure in READY– The TM has been informed
about the local decision– Treat as timeout in READY
state and invoke the termination protocol
• Failure in ABORT or COMMIT– Nothing special needs to be
done
• INITIAL
• READY
• Prepare • Vote-commit
• Global-commit• Ack
• Prepare • Vote-abort
• Global-abort• Ack
• ABORT • COMMIT
• RMs
49© J Singh, 2011 49
2PC Protocol Actions RM TM
No
Yes
VOTE-COMMIT
Yes GLOBAL-ABORT
No
write abortin log
Abort
CommitACK
ACK
INITIAL
write abortin log
write readyin log
write commitin log
Type ofmsg
WAIT
Ready toCommit?
write commitin log
Any No?write abort
in log
ABORTCOMMIT
COMMITABORT
writebegin_commit
in log
writeend_of_transaction
in log
READY
INITIAL
PREPARE
VOTE-ABORT
VOTE-COMMIT
50© J Singh, 2011 50
Two-phase commit commentary
• Two-phase commit protocol limitation: it is a blocking protocol.
– The failure of the TM can cause the protocol to block until the TM is repaired.
• If the TM fails right after every RM has sent a Prepared message, then the other RMs have no way of knowing whether the TM committed or aborted.
• RMs will block resource processes while waiting for a message from the TM.
– A TM will also block resources while waiting for replies from RMs. A TM can also block indefinitely if no acknowledgement is received from the RM.
• “Federated” two-phase commit protocols, aka three-phase protocols, have been proposed but are still unproven.
• Paxos Consensus Algorithm. – Consensus on Transaction Commit, Jim Gray and Leslie Lamport,
Microsoft Research, 2005, MSR-TR-2003-96
51© J Singh, 2011 51
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
52© J Singh, 2011 52
Fault-Tolerant Two Phase Commit
RequestCommit
Prepare
Prepared
client
TM RM
TM RMRequestCommit
Prepare
Prepare
Prepared
Prepared
If the 2PC Transaction Manager (TM) Fails, transaction blocks.
Solution: Add a “spare” transaction manager (non blocking commit, 3 phase commit)
53© J Singh, 2011 53
Fault-Tolerant Two Phase Commit
RequestCommit
Preparecommit
client
TM RM
TM RM
Prepare
Prepare
Prepared
Prepared
commitcommit
abort
commit
If the 2PC Transaction Manager (TM) Fails, transaction blocks.Solution: Add a “spare” transaction manager (non blocking commit, 3 phase commit)
But… What if….?
TM
Prepare
Prepared
commit
abort
Inconsistent! Now What?
The complexity is a mess.
Prepared
54© J Singh, 2011 54
Fault Tolerant 2PC
• Several workarounds proposed in database community:
• Often called "3-phase" or "non-blocking" commit.
• None with complete algorithm and correctness proof.
55© J Singh, 2011 55
W Chosenclient
Propose X
consensusbox
client
clientPropose W
W Chosen
W Chosen
Consensus
• collects proposed values • Picks one proposed value• remembers it forever
56© J Singh, 2011 56
Consensus for Commit – The Obvious Approach
• Get consensus on TM’s decision.• TM just learns consensus value.• TM is “stateless”
RMPropose PreparedPrepared Chosen
consensusbox
Prepared Chosen
Prepared
Prepared
Prepared
RequestCommit
Prepare
Commit
client
TM RM
TMRequest Commit
Prepare
Prepare
CommitCommit
Commit
Commit
Propose Prepared
Prepared Chosen
57© J Singh, 2011 57
Consensus for Commit – The Paxos Commit Approach
• Get consensus on each RM’s choice.• TM just combines consensus values.• TM is “stateless”
RM
RM
RM1 Prepared Chosen
RM1 Prepared Chosen
RM2 Prepared Chosen
RequestCommit
Prepare
Commit
client
TM
TMRequest Commit
Prepare
Prepare
CommitCommit
Commit
Commitconsensus
box
consensusbox
Propose RM2 Prepared
Propose RM1 Prepared
Propose RM1 Prepared
RM2 Prepared Chosen
Propose RM2 Prepared
58© J Singh, 2011 58
Prepared Chosen
Prepared
Prepare
Commit
Propose Prepared
RM1 Prepared Chosen
Prepare
Commit
Propose RM1 Prepared
RM2 Prepared Chosen
Propose RM2 Prepared
The Obvious Approach
Paxos Commit
One fewer message delay
59© J Singh, 2011 59
RM
TM
TM
acceptor
acceptor
acceptor
Consensus boxPropose RM Prepared
Consensus in Action
• The normal (failure-free) case• Two message delays• Can optimize
Propose RM PreparedPropose RM Prepared
Vote RM Prepared
Vote RM Prepared
Vote RM PreparedRM
PreparedChosen
60© J Singh, 2011 60
RM
TM
TM
acceptor
acceptor
acceptor
Consensus box
Consensus in Action
TM
TM can always learn what was chosen,or get Aborted chosen if nothing chosen yet; if majority of acceptors working .
61© J Singh, 2011 61
The Complete Algorithm
• Subtle.
• More weird cases than most people imagine.
• Proved correct.
62© J Singh, 2011 62
Paxos Commit in a Nutshell
• N RMs
• 2F+1 acceptors (~2F+1 TMs)
• If F+1 acceptors see all RMs prepared, then transaction committed.
• 2F(N+1) + 3N + 1 messages5 message delays 2 stable write delays. Clien
t TM
RM1…NAcceptors
0…2Frequestcommit
prepareprepared
all prepared
commit
63© J Singh, 2011 63
Paxos Commit Evaluation
• Two-Phase Commit– 3N+1 messages– N+1 stable writes– 4 message delays– 2 stable-write delays
• Availability is compromised
• Paxos Commit– 3N+ 2F(N+1) +1 messages– N+2F+1 stable writes– 5 message delays– 2 stable-write delays
• Tolerates F Faults
• Paxos ≡ 2PC for F = 0
• Paxos Algorithm is the basis of Google’s Global Distributed Lock Manager
– Chubby has F=2 (5 Acceptors)
64© J Singh, 2011 64
Today’s Meeting
• Concurrency Control– Intention Locks– Index Locking– Optimistic CC
• Validation• Timestamp Ordering
– Multi-version CC
• Commit in Distributed Databases
– Two Phase Commit– Paxos Algorithm
• Concluding thoughts
• References (aside from textbook): 1. Concurrency Control and Recovery
in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March, 1998
3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
5. The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
65© J Singh, 2011 65
OLTP Through the Looking Glass (p1)
• Workload– TPC-C Benchmark
• Quote:– Overall, we identify
overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. …
– Substantial time is spent in logging, latching, locking, Btree, and buffer management.
• OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008
– Took out components of a DBMS and measured its performance impact
66© J Singh, 2011 66
OLTP Through the Looking Glass (p2)
• Concurrency Control– Look for applications where
it can be turned off– Some sort of optimistic
concurrency control
• Multi-core Support– Latching (inter-thread
communication) remains a significant bottleneck
• Cache-conscious B-Trees
• Replication Management– Loss of transactional
consistency if log shipping– Recovery is not
instantaneous– Maintaining transactional
consistency
• Weak Consistency– Starbucks doesn’t need two
phase commit– How to achieve eventual
consistency without transactional consistency
• Areas for Research that may yield dividends
67© J Singh, 2011 67
End of an Era?
• The Relational Model is not necessarily the answer
– It was excellent for data processing
– Not a natural fit for• Data Warehouses• Web-oriented search• Real-time analytics, and• Semi-structured data
– i.e., Semantic Web
• SQL is not the answer– Coupling between modern
programming languages and SQL are “ugly beyond belief”
– Programming languages have evolved while SQL has remained static
• Pascal• C/C++• Java• The little languages:
Python, Perl, PHP, Ruby
• The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
– A critique of the “one size fits all” assumption in DBMS
68© J Singh, 2011 68
What’s so fun about databases?
• Traditional database courses talked about– Employee records– Bank records
• Now we talk about– Web search– Data mining– The collective intelligence of tweets– Scientific and medical databases
• From a personal viewpoint,– I have enjoyed learning this material with you– Thank you.
From our January 13 Lecture…
69© J Singh, 2011 69
About CS 542
• CS 542 will– Build on database concepts
you already know– Provide you tools for
separating hype from reality
– Help you develop skills in evaluating the tradeoffs involved in using and/or creating a database
• CS 542 may– Train you to read technical
journals and apply them
• CS 542 will not– Cover the intricacies of SQL
programming– Spend much effort in
• Dynamic SQL• Stored Procedures• Interfaces with application
programming languages• Connectors, e.g., JDBC,
ODBC
From our January 13 Lecture…
70© J Singh, 2011 70
Thanks
• Contact Information:– President, Early Stage IT – a cloud-based consulting firm
• Email: J [dot] Singh [at] EarlyStageIT [dot] com• Phone: 978-760-2055
– Co-chair of Software and Services SIG at TiE-Boston– Founder, SQLnix.org, a local resource for NoSQL databases
• My WPI email will be good through the summer.