Page 1: CS 542 -- Concurrency Control, Distributed Commit

CS 542 Database Management Systems
Concurrency Control
Commit in Distributed Systems

J Singh
April 11, 2011

Page 2: CS 542 -- Concurrency Control, Distributed Commit

2© J Singh, 2011 2

Today’s Meeting

• Concurrency Control
  – Intention Locks
  – Index Locking
  – Optimistic CC
    • Validation
    • Timestamp Ordering
  – Multi-version CC
• Commit in Distributed Databases
  – Two Phase Commit
  – Paxos Algorithm
• Concluding thoughts
• References (aside from textbook):
  1. Concurrency Control and Recovery in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
  2. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March 1998
  3. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004
  4. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc. ACM SIGMOD, 2008
  5. The End of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007

Page 3: CS 542 -- Concurrency Control, Distributed Commit

Scheduler Architecture for CC

• Scheduler has two parts
  1. Accepts read/write requests from transactions
  2. Assures serialization
     • Keeps track of active and pending transactions
     • Controls commit, abort, delay
• Today’s lecture discusses Part 2 functionality

Page 4: CS 542 -- Concurrency Control, Distributed Commit

The Lock Table

• A relation that associates database elements with locking information about that element
• Implemented as a hash table
• Size is proportional to the number of locked elements, not to the size of the entire database

(Figure: a lock-table entry mapping DB element A to the lock information for A.)
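A minimal sketch of such a lock table, assuming only S and X modes and string identifiers for elements and transactions. The class and method names are hypothetical, and waiting queues are deliberately not modeled.

```python
class LockTable:
    """Hash table from element id to {txn: mode}; only locked elements have entries."""

    def __init__(self):
        self._holders = {}  # element -> {txn: "S" or "X"}

    def acquire(self, element, txn, mode):
        """Grant the lock if compatible with other holders; return True if granted."""
        held = self._holders.get(element, {})
        others = [m for t, m in held.items() if t != txn]
        if others and (mode == "X" or "X" in others):
            return False  # conflict: caller would wait (queueing is not modeled)
        entry = self._holders.setdefault(element, {})
        # Keep the stronger mode if the txn already holds a lock on the element.
        entry[txn] = "X" if "X" in (mode, entry.get(txn)) else "S"
        return True

    def release(self, element, txn):
        """Drop the txn's lock; delete the entry once nobody holds the element."""
        entry = self._holders.get(element)
        if entry is None:
            return
        entry.pop(txn, None)
        if not entry:
            del self._holders[element]

    def __len__(self):
        # Number of entries equals the number of currently locked elements.
        return len(self._holders)
```

Because entries are deleted when the last holder releases, `len(table)` tracks the number of locked elements rather than the size of the database.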

Page 5: CS 542 -- Concurrency Control, Distributed Commit

Scheduler Priority Logic

• When a transaction releases a lock that other transactions are waiting for, which policy should the scheduler use?
  – First-Come-First-Served:
    • Grant the lock to the longest-waiting request.
    • No starvation (waiting forever for a lock)
  – Priority to Shared Locks:
    • Grant all S locks waiting, then one X lock.
    • Grant X lock if no others waiting
  – Priority to Upgrading:
    • If there is a U lock waiting to upgrade to an X lock, grant that first.
• Each has its advantages and disadvantages
  – Configurable for a database instance


Page 7: CS 542 -- Concurrency Control, Distributed Commit

Motivation for intention locks

• Suppose that, besides scanning through the table, we need to modify a few tuples. What kind of lock should we put on the table?
• It has to be X (if we only have S or X).
• But that blocks all other read requests!

Page 8: CS 542 -- Concurrency Control, Distributed Commit

Intention Locks

• Allow intention locks IS, IX.
• Before S locking an item, must IS lock the root.
• Before X locking an item, must IX lock the root.
• Should make sure:
  – If Ti S locks a node, no Tj can X lock an ancestor.
    • Achieved if S conflicts with IX
  – If Tj X locks a node, no Ti can S or X lock an ancestor.
    • Achieved if X conflicts with IS and IX.

Page 9: CS 542 -- Concurrency Control, Distributed Commit

Allowed Lock Sharings

                     Lock Requester
                IS    IX    S     SIX   X
Lock     IS     ✓     ✓     ✓     ✓
Holder   IX     ✓     ✓
         S      ✓           ✓
         SIX    ✓
         X
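The sharing rules can be sketched as a lookup table. This uses the standard multiple-granularity compatibility matrix for IS, IX, S, SIX and X; the names `COMPATIBLE` and `can_grant` are illustrative only.

```python
# For each held mode, the set of modes another transaction may still be granted.
COMPATIBLE = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}


def can_grant(requested, held_modes):
    """True if `requested` is compatible with every mode held by other txns."""
    return all(requested in COMPATIBLE[held] for held in held_modes)
```

For instance, `can_grant("S", ["IX"])` is false, which is the "S conflicts with IX" rule from the Intention Locks slide.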

Page 10: CS 542 -- Concurrency Control, Distributed Commit

Multiple Granularity Lock Protocol

• Each txn starts from the root of the hierarchy.
• To get a lock on any node, must hold an intention lock on its parent node!
  – E.g. to get S lock on a node, must hold IS or IX on parent.
  – E.g. to get X lock on a node, must hold IX or SIX on parent.
  – Full table of rules:

    Parent locked in    Child may be locked by same txn in
    IS                  IS, S
    IX                  IS, S, IX, X, SIX
    S                   none
    SIX                 X, IX (also SIX, but not necessary)
    X                   none

• Must release locks in bottom-up order.
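A small sketch of the protocol, assuming a hierarchy given as a root-to-target path of node names. `ALLOWED_CHILD` transcribes the table of rules; `lock_path` and `obeys_protocol` are hypothetical helpers.

```python
# Parent mode -> modes the same txn may take on a child (from the table of rules).
ALLOWED_CHILD = {
    "IS":  {"IS", "S"},
    "IX":  {"IS", "S", "IX", "X", "SIX"},
    "S":   set(),
    "SIX": {"X", "IX", "SIX"},
    "X":   set(),
}


def lock_path(path, mode):
    """Lock requests, root first, needed to lock path[-1] in mode 'S' or 'X'."""
    intent = "IS" if mode == "S" else "IX"
    return [(node, intent) for node in path[:-1]] + [(path[-1], mode)]


def obeys_protocol(requests):
    """Check each (parent, child) pair of consecutive requests against the table."""
    return all(
        child_mode in ALLOWED_CHILD[parent_mode]
        for (_, parent_mode), (_, child_mode) in zip(requests, requests[1:])
    )
```

Locks are requested in the order returned by `lock_path` and released in the reverse (bottom-up) order.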

Page 11: CS 542 -- Concurrency Control, Distributed Commit

Example 1

• T1 needs a shared lock on t2
• T2 needs a shared lock on R1

(Figure: relation R1 with tuples t1, t2, t3, t4. Locks held: R1: T1(IS), T2(S); t2: T1(S).)

Page 12: CS 542 -- Concurrency Control, Distributed Commit

Example 2

• T1 needs a shared lock on t2
• T2 needs an exclusive lock on t4
  – No conflict

(Figure: relation R1 with tuples t1, t2, t3, t4. Locks held: R1: T1(IS), T2(IX); t2: T1(S); t4: T2(X).)

Page 13: CS 542 -- Concurrency Control, Distributed Commit

Examples 3, 4, 5

• T1 scans R, and updates a few tuples:
  – T1 gets an SIX lock on R, and occasionally upgrades to X on the tuples.
• T2 uses an index to read only part of R:
  – T2 gets an IS lock on R, and repeatedly gets an S lock on tuples of R.
• T3 reads all of R:
  – T3 gets an S lock on R.
  – OR, T3 could behave like T2; can use lock escalation as it goes.


Page 14: CS 542 -- Concurrency Control, Distributed Commit

Insert and Delete

    Fruit   Variety        Rating   Price
    Apple   Baldwin        1        80
    Apple   Cortland       2        65
    Apple   Delicious      2        55
    Apple   Empire         1        60
    Apple   Fuji           2        75
    Apple   Granny Smith   1        65

• Transactions
  – T1:
      SELECT MAX(Price) WHERE Rating = 1;
      SELECT MAX(Price) WHERE Rating = 2;
  – T2:
      INSERT <Apple, Arkansas Black, 1, 96>;
      DELETE WHERE Rating = 2 AND Price = (SELECT MAX(Price) WHERE Rating = 2);
• Execution
  – T1 locks all records w/Rating=1 and gets 80.
  – T2 inserts <Arkansas Black, 96>
  – T2 deletes <Fuji, 75>
  – T1 locks all records w/Rating=2 and gets 65.
• Result:
  – From T1: 80, 65
  – Actual: 96, 65
  – T1 then T2: 80, 75
  – T2 then T1: 96, 65
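The anomaly can be replayed in a few lines. This is only a simulation of the schedule on an in-memory list (no locking is modeled), and the function names are made up for illustration.

```python
def fresh_table():
    # (variety, rating, price) rows from the slide; the Fruit column is always Apple.
    return [("Baldwin", 1, 80), ("Cortland", 2, 65), ("Delicious", 2, 55),
            ("Empire", 1, 60), ("Fuji", 2, 75), ("Granny Smith", 1, 65)]


def max_price(rows, rating):
    return max(price for (_, r, price) in rows if r == rating)


def run_t2(rows):
    """T2: insert Arkansas Black, then delete the most expensive Rating-2 row."""
    rows.append(("Arkansas Black", 1, 96))
    top = max_price(rows, 2)
    rows[:] = [row for row in rows if not (row[1] == 2 and row[2] == top)]


def interleaved():
    """T1's first query, then all of T2, then T1's second query."""
    rows = fresh_table()
    first = max_price(rows, 1)
    run_t2(rows)
    return first, max_price(rows, 2)


def t1_then_t2():
    rows = fresh_table()
    result = (max_price(rows, 1), max_price(rows, 2))
    run_t2(rows)
    return result


def t2_then_t1():
    rows = fresh_table()
    run_t2(rows)
    return max_price(rows, 1), max_price(rows, 2)
```

The interleaved run returns a pair that neither serial order produces, which is why the schedule is not serializable.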

Page 15: CS 542 -- Concurrency Control, Distributed Commit

Insert and Delete Rules

• When T1 inserts t1 into R,
  – Give X lock on t1 to T1
• When T2 deletes t2 from R,
  – It must obtain an X lock on t2
  – This will fix the Fuji delete problem (how so?)
• But there is still a problem: Phantom Reads.
  – Seen with Arkansas Black in the example
  – Solution: use multiple granularity tree
  – Before inserting Q, obtain an X lock for parent(Q)


Page 17: CS 542 -- Concurrency Control, Distributed Commit

Did Insert/Delete expose a flaw in 2PL?

• The flaw was with the assumption that by locking all tuples, T1 had locked the set!
  – We needed to lock the set
  – Would we bottleneck on the relation if the workload were insert- and delete-heavy?
• There is another way to solve the problem:
  – Lock at the index (if one exists)
  – Since B+ trees are not 100% full, we can maintain multiple locks in different sections of the tree.

(Figure: an index with an entry for r=1, annotated "Put a lock here.")

Page 18: CS 542 -- Concurrency Control, Distributed Commit


Index Locking (p1)

• Higher levels of the tree only direct searches for leaf pages.

• For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course), only if a split can propagate up to it from the modified leaf. (Similar point holds w.r.t. deletes.)

• We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.

Page 19: CS 542 -- Concurrency Control, Distributed Commit

Index Locking (p2)

• Search: Start at root and go down; repeatedly, S lock child then unlock parent.
• Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe:
  – If child is safe, release all locks on ancestors.
• Safe node: Node such that changes will not propagate up beyond this node.
  – Inserts: Node is not full.
  – Deletes: Node is not half-empty.
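A sketch of the insert/delete descent, assuming every node has the same key capacity and that "half-empty" means holding no more than `capacity // 2` keys (the usual root exception is ignored). The function names are hypothetical.

```python
def is_safe(num_keys, capacity, op):
    """A node is safe if the operation cannot propagate a change above it."""
    if op == "insert":
        return num_keys < capacity        # not full: no split can reach the parent
    if op == "delete":
        return num_keys > capacity // 2   # above the minimum: no merge needed
    raise ValueError(f"unknown operation: {op}")


def locks_held_after_descent(path, capacity, op):
    """path: (node_name, num_keys) pairs from root to leaf.

    Returns the nodes still X-locked when the leaf is reached.
    """
    held = []
    for name, num_keys in path:
        held.append(name)                 # X-lock the child first
        if is_safe(num_keys, capacity, op):
            held = [name]                 # child is safe: release all ancestors
    return held
```

With a full leaf under a non-full parent, an insert ends up holding only the parent and the leaf, since a split can propagate no further than the parent.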

Page 20: CS 542 -- Concurrency Control, Distributed Commit

Example

Where to lock?
1) Delete 38*
2) Insert 45*
3) Insert 25*

(Figure: B+ tree rooted at ROOT with nodes A through I; index keys 20, 23, 35, 38, 44; leaf entries 20*, 22*, 23*, 24*, 35*, 36*, 38*, 41*, 44*.)


Page 22: CS 542 -- Concurrency Control, Distributed Commit

Optimistic CC

• Locking is a conservative approach in which conflicts are prevented. Disadvantages:
  – Lock management overhead.
  – Deadlock detection/resolution.
    • Not discussed in CS-542 lectures, expecting that you are familiar with it
• If conflicts are rare, we may be able to gain performance by not locking, and instead checking for conflicts before txns commit.
• Two approaches
  – Kung-Robinson Model
    • Divides every transaction into three phases: read, validate, write
    • Makes commit/abort decision based on what’s being read and written
  – Timestamp Ordering Algorithms
    • Clever use of timestamps to determine which operations are conflict-free and which must be aborted

Page 23: CS 542 -- Concurrency Control, Distributed Commit

Kung-Robinson Model

• Key idea:
  – Let transactions work in isolation
  – Validate reads and writes when ready to commit
  – Make Validation Atomic
  – Validated ≡ Committed
• Transactions have three phases:
  – READ:
    • txns read from the database,
    • make changes to private copies of objects.
  – VALIDATE:
    • Check if schedule so far is serializable.
  – WRITE:
    • Make local copies of changes public.

(Figure: ROOT pointing to old and new copies of the modified objects.)

Page 24: CS 542 -- Concurrency Control, Distributed Commit

Validation

• Test conditions that are sufficient to ensure that no conflict occurred.
  – Each txn is assigned a numeric id.
    • Just use a timestamp.
    • Transaction ids assigned at end of READ phase, just before validation begins.
  – ReadSet(Ti): Set of objects read by txn Ti.
  – WriteSet(Ti): Set of objects modified by Ti.
• Validation is atomic
  – Done in a critical section

Page 25: CS 542 -- Concurrency Control, Distributed Commit

Validation Tests

Each test is paired on the slide with a timeline figure (the "Situation") showing the R, V, W phases of Ti and Tj.

• Test 1: FIN(Ti) < START(Tj)
• Test 2: FIN(Ti) < VAL(Tj) AND WriteSet(Ti) ∩ ReadSet(Tj) is empty.
• Test 3: VAL(Ti) < VAL(Tj) AND WriteSet(Ti) ∩ ReadSet(Tj) is empty AND WriteSet(Ti) ∩ WriteSet(Tj) is empty.
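The three tests can be sketched as a predicate over two transactions, assuming Ti validated before Tj and that START, VAL and FIN are plain numbers. The `Txn` record and `validates_against` are hypothetical names, not part of the Kung-Robinson paper.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    start: float   # START: beginning of the read phase
    val: float     # VAL: time of validation
    fin: float     # FIN: end of the write phase (float("inf") if still writing)
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def validates_against(ti, tj):
    """True if Tj passes at least one of the three tests against earlier Ti."""
    no_wr = not (ti.write_set & tj.read_set)    # WriteSet(Ti) and ReadSet(Tj) disjoint
    no_ww = not (ti.write_set & tj.write_set)   # WriteSet(Ti) and WriteSet(Tj) disjoint
    if ti.fin < tj.start:                       # test 1: no overlap at all
        return True
    if ti.fin < tj.val and no_wr:               # test 2
        return True
    if ti.val < tj.val and no_wr and no_ww:     # test 3
        return True
    return False
```

Tj commits only if `validates_against(ti, tj)` holds for every Ti that validated before it; otherwise it is restarted.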

Page 26: CS 542 -- Concurrency Control, Distributed Commit

Overheads in Kung-Robinson CC

• Must record read/write activity in ReadSet and WriteSet per txn.
  – Must create and destroy these sets as needed.
• Must check for conflicts during validation, and must make validated writes “global”.
  – Critical section can reduce concurrency.
  – Scheme for making writes global can reduce clustering of objects.
• Optimistic CC restarts transactions that fail validation.
  – Work done so far is wasted; requires clean-up.


Page 28: CS 542 -- Concurrency Control, Distributed Commit

Timestamp Ordering CC

• Main idea:
  – Put a timestamp on the last read and write action on every object
  – Use this timestamp to detect if a transaction attempts an illegal operation
  – Abort the offending transaction if it does
• Algorithm:
  – Give each object a read-timestamp (RTS) and a write-timestamp (WTS)
  – Give each txn a timestamp (TS) when it begins
  – If action ai of txn Ti conflicts with action aj of txn Tj, and TS(Ti) < TS(Tj), then ai must occur before aj.
  – Otherwise, restart the violating txn.

Page 29: CS 542 -- Concurrency Control, Distributed Commit

Rules for Timestamp-Based Scheduling

• Algorithm setup
  – RT(X)
    • The read time of X: the highest timestamp of a transaction that has read X.
  – WT(X)
    • The write time of X: the highest timestamp of a transaction that has written X.
  – C(X)
    • The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.
• Scheduler receives a request from T to operate on X
  – The request is realizable under some conditions and not under others

Page 30: CS 542 -- Concurrency Control, Distributed Commit

Physically Unrealizable

• Read too late
  – A transaction U that started after transaction T wrote a value for X before T reads X
  – In other words, if TS(T) < WT(X), then the read is physically unrealizable, and T must be rolled back.

(Figure: timeline in which T starts, U starts, U writes X, then T reads X.)

Page 31: CS 542 -- Concurrency Control, Distributed Commit

Physically Unrealizable

• Write too late
  – A transaction U that started after T read X before T got a chance to write X.
  – In other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.

(Figure: timeline in which T starts, U starts, U reads X, then T writes X.)

Page 32: CS 542 -- Concurrency Control, Distributed Commit

Dirty Read

• After T reads the value of X written by U, U could abort
  – In other words, if TS(T) ≥ WT(X), then the read is physically realizable, but the value in X may not be committed yet.
    • If C(X) is true, then the previous writer of X is committed, all is good.
    • If C(X) is false, we must delay T.

(Figure: timeline in which U starts, T starts, U writes X, T reads X, then U aborts.)

Page 33: CS 542 -- Concurrency Control, Distributed Commit

Write after Write

• T tries to write X after a later transaction (U) has written it
  – That is, TS(T) ≥ RT(X) but TS(T) < WT(X): the write is physically realizable, but there is already a later value in X.
  – OK to ignore the write by T because it will get overwritten anyway
  – Except if U aborts
    • And the new value of T is lost forever
  – Solve this problem by introducing the concept of a “tentative write”

(Figure: timeline in which T starts, U starts, U writes X, T writes X, T commits, then U aborts.)

Page 34: CS 542 -- Concurrency Control, Distributed Commit

Rules for Timestamp-Based Scheduling

• Scheduler receives a request to commit T.
  – It must find all the database elements X written by T and set C(X)=true.
  – If any transactions are waiting for X to be committed, these transactions are allowed to proceed.
• Scheduler receives a request to abort T or decides to roll back T.
  – Any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
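The scheduling rules of the last few slides can be collected into a per-element sketch. It follows the usual textbook formulation (read too late, write too late, dirty read, ignored late write); the `Element` record and the `on_*` function names are assumptions, and abort handling (restoring the previous value and write time) is left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    rt: int = 0             # RT(X): highest timestamp of a reader
    wt: int = 0             # WT(X): highest timestamp of a writer
    committed: bool = True  # C(X): has the most recent writer committed?


def on_read(ts, x):
    if ts < x.wt:
        return "rollback"   # read too late: a later txn already wrote X
    if not x.committed and ts != x.wt:
        return "delay"      # would be a dirty read: wait for the writer
    x.rt = max(x.rt, ts)
    return "grant"


def on_write(ts, x):
    if ts < x.rt:
        return "rollback"   # write too late: a later txn already read X
    if ts < x.wt:
        # A later value is already in X: ignore once that writer has committed.
        return "ignore" if x.committed else "delay"
    x.wt = ts
    x.committed = False     # tentative write until the txn commits
    return "grant"


def on_commit(ts, x):
    if x.wt == ts:
        x.committed = True  # delayed txns may now retry their request
```

A delayed transaction simply repeats its request after the writer commits or aborts, as the commit and abort rules above describe.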


Page 36: CS 542 -- Concurrency Control, Distributed Commit

Multiversion Timestamps

• Multiversion schemes keep old versions of data items to increase concurrency.

• Each successful write results in the creation of a new version of the data item written.

• Use timestamps to label versions.

• When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.
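Version selection on read can be sketched with a sorted list of write timestamps. This assumes non-negative integer timestamps and an initial version at timestamp 0; `VersionedItem` is a made-up name, and the rule for rejecting writes that arrive too late is not modeled.

```python
import bisect


class VersionedItem:
    """Keeps every written version of one data item, ordered by write timestamp."""

    def __init__(self, initial):
        self._wts = [0]            # write timestamps, kept sorted
        self._values = [initial]   # value written at the matching timestamp

    def write(self, ts, value):
        """Each successful write creates a new version labeled with ts."""
        i = bisect.bisect_right(self._wts, ts)
        self._wts.insert(i, ts)
        self._values.insert(i, value)

    def read(self, ts):
        """Return the version with the largest write timestamp <= ts."""
        i = bisect.bisect_right(self._wts, ts) - 1
        return self._values[i]
```

A reader with timestamp 15 therefore sees the version written at 10 even after a version with timestamp 20 exists, so it never has to wait for or abort because of the later writer.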

Page 37: CS 542 -- Concurrency Control, Distributed Commit

Timestamps vs Locking

• Generally, timestamping performs better than locking in situations where:
  – Most transactions are read-only.
  – It is rare that concurrent transactions will try to read and write the same element.
  – This is generally the case for Web Applications
• In high-conflict situations, locking performs better than timestamps

Page 38: CS 542 -- Concurrency Control, Distributed Commit

Practical Use

• 2-Phase Locks (or variants)
  – Used by most relational databases
• Multi-level granularity
  – Support for table, page and tuple-level locks
  – Used by most relational databases
• Multi-version concurrency control
  – Oracle 8 forward: Divide transactions into read-only and read-write
    • Read-only transactions use multi-version concurrency and never wait
    • Read-write transactions use 2PL
  – Postgres, others as well, offer some level of MVCC


Page 40: CS 542 -- Concurrency Control, Distributed Commit

Distributed Commit Motivation

• FruitCo has
  – Its main Sales office in Oregon
  – Farms and Warehouse are in Washington
  – Finance is in Utah
  – All three sites have local data centers with their own systems
• When an order is placed, the Sales system must send the billing information to Utah and shipping information to Washington.
  – When an order is placed, all three databases must be updated, or none should be.

Page 41: CS 542 -- Concurrency Control, Distributed Commit

Two Phase Commit

• The Basic Idea

  Step 1
    Transaction Manager (TM): Trans. arrives. Message to ask for vote is sent to other site(s).
    Resource Manager (RM): Message is recorded. Site votes Y or N (abort). Vote is sent to site 1.
  Step 2
    Transaction Manager (TM): The vote is received. If vote = Y on both sites, then Commit else Abort.
    Resource Manager (RM): Either Commit or Abort based on the decision of site 1.

Page 42: CS 542 -- Concurrency Control, Distributed Commit

Two-Phase Commit (2PC)

• Phase 1: The TM gets the RMs ready to write the results into the database
• Phase 2: Everybody writes the results into the database
  – TM: The process at the site where the transaction originates and which controls the execution
  – RM: The process at the other sites that participate in executing the transaction
• Global Commit Rule:
  – The TM aborts a transaction if and only if at least one RM votes to abort it.
  – The TM commits a transaction if and only if all of the RMs vote to commit it.
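A failure-free sketch of the two phases and the global commit rule. Each RM is modeled as a callable that returns its vote; a missing vote (`None`) is treated as an abort vote, which is an assumption of this sketch, and the function name is hypothetical.

```python
def two_phase_commit(participants):
    """participants: dict mapping RM name -> callable returning 'commit' or 'abort'.

    Returns (decision, outcome), where outcome maps each RM to what it applied.
    """
    # Phase 1: the TM asks every RM to prepare and collects the votes.
    votes = {name: vote() for name, vote in participants.items()}

    # Global commit rule: commit iff all RMs voted commit, abort otherwise.
    decision = "commit" if all(v == "commit" for v in votes.values()) else "abort"

    # Phase 2: the TM sends the decision and every RM applies it.
    outcome = {name: decision for name in participants}
    return decision, outcome
```

In the FruitCo setting, the Sales site would act as TM and the Utah and Washington sites as RMs, so an order commits everywhere or nowhere.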

Page 43: CS 542 -- Concurrency Control, Distributed Commit

Centralized 2PC

(Figure: coordinator C exchanging messages with participants P. Phase 1: "ready?" from C, then "yes/no" from each P. Phase 2: "commit/abort?" from C, then "committed/aborted" from each P.)

Page 44: CS 542 -- Concurrency Control, Distributed Commit

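State Transitions in 2PC

Each transition is labeled "message received / message sent".

• TM
  – INITIAL → WAIT: Commit command / Prepare
  – WAIT → ABORT: Vote-abort / Global-abort
  – WAIT → COMMIT: Vote-commit (all) / Global-commit
• RMs
  – INITIAL → READY: Prepare / Vote-commit
  – INITIAL → ABORT: Prepare / Vote-abort
  – READY → ABORT: Global-abort / Ack
  – READY → COMMIT: Global-commit / Ack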

Page 45: CS 542 -- Concurrency Control, Distributed Commit

When TM Fails…

• Timeout in INITIAL
  – Who cares
• Timeout in WAIT
  – Cannot unilaterally commit
  – Can unilaterally abort
• Timeout in ABORT or COMMIT
  – Stay blocked and wait for the acks

(Figure: TM state diagram, as on the "State Transitions in 2PC" slide.)

Page 46: CS 542 -- Concurrency Control, Distributed Commit

When an RM Fails…

• Timeout in INITIAL
  – TM must have failed in INITIAL state
  – Unilaterally abort
• Timeout in READY
  – Stay blocked

(Figure: RM state diagram, as on the "State Transitions in 2PC" slide.)

Page 47: CS 542 -- Concurrency Control, Distributed Commit

When TM Recovers…

• Failure in INITIAL
  – Start the commit process upon recovery
• Failure in WAIT
  – Restart the commit process upon recovery
• Failure in ABORT or COMMIT
  – Nothing special if all the acks have been received
  – Otherwise the termination protocol is involved

(Figure: TM state diagram, as on the "State Transitions in 2PC" slide.)

Page 48: CS 542 -- Concurrency Control, Distributed Commit

When an RM Recovers…

• Failure in INITIAL
  – Unilaterally abort upon recovery
• Failure in READY
  – The TM has been informed about the local decision
  – Treat as timeout in READY state and invoke the termination protocol
• Failure in ABORT or COMMIT
  – Nothing special needs to be done

(Figure: RM state diagram, as on the "State Transitions in 2PC" slide.)

Page 49: CS 542 -- Concurrency Control, Distributed Commit

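2PC Protocol Actions

(Flowchart with an RM column and a TM column, summarized as steps.)

• TM
  1. INITIAL: write begin_commit in log, send PREPARE, enter WAIT.
  2. WAIT: collect the votes. Any No?
     – Yes: write abort in log, send GLOBAL-ABORT, enter ABORT.
     – No: write commit in log, send GLOBAL-COMMIT, enter COMMIT.
  3. On receiving the ACKs: write end_of_transaction in log.
• RM
  1. INITIAL: on PREPARE, ready to commit?
     – No: write abort in log, send VOTE-ABORT, enter ABORT.
     – Yes: write ready in log, send VOTE-COMMIT, enter READY.
  2. READY: check the type of msg received.
     – Abort: write abort in log, send ACK, enter ABORT.
     – Commit: write commit in log, send ACK, enter COMMIT.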

Page 50: CS 542 -- Concurrency Control, Distributed Commit

Two-phase commit commentary

• Two-phase commit protocol limitation: it is a blocking protocol.
  – The failure of the TM can cause the protocol to block until the TM is repaired.
    • If the TM fails right after every RM has sent a Prepared message, then the RMs have no way of knowing whether the TM committed or aborted.
    • RMs will block resource processes while waiting for a message from the TM.
  – A TM will also block resources while waiting for replies from RMs. A TM can also block indefinitely if no acknowledgement is received from the RM.
• “Federated” two-phase commit protocols, aka three-phase protocols, have been proposed but are still unproven.
• Paxos Consensus Algorithm.
  – Consensus on Transaction Commit, Jim Gray and Leslie Lamport, Microsoft Research, 2005, MSR-TR-2003-96


Page 52: CS 542 -- Concurrency Control, Distributed Commit

Fault-Tolerant Two Phase Commit

(Figure: message diagram with a client, two TMs, and two RMs exchanging RequestCommit, Prepare, and Prepared messages.)

If the 2PC Transaction Manager (TM) fails, the transaction blocks.

Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit)

Page 53: CS 542 -- Concurrency Control, Distributed Commit

Fault-Tolerant Two Phase Commit

If the 2PC Transaction Manager (TM) fails, the transaction blocks.
Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit)

But… What if….?

(Figure: message diagram with a client, two TMs, and two RMs exchanging RequestCommit, Prepare, and Prepared messages; one path ends in commit and another in abort.)

Inconsistent! Now What?

The complexity is a mess.

Page 54: CS 542 -- Concurrency Control, Distributed Commit


Fault Tolerant 2PC

• Several workarounds proposed in database community:

• Often called "3-phase" or "non-blocking" commit.

• None with complete algorithm and correctness proof.

Page 55: CS 542 -- Concurrency Control, Distributed Commit

Consensus

• Collects proposed values
• Picks one proposed value
• Remembers it forever

(Figure: clients send "Propose X" and "Propose W" to a consensus box; every client learns "W Chosen".)

Page 56: CS 542 -- Concurrency Control, Distributed Commit

Consensus for Commit – The Obvious Approach

• Get consensus on TM’s decision.
• TM just learns consensus value.
• TM is “stateless”

(Figure: message diagram with a client, two TMs, the RMs, and a consensus box. Messages shown: Request Commit, Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.)

Page 57: CS 542 -- Concurrency Control, Distributed Commit

Consensus for Commit – The Paxos Commit Approach

• Get consensus on each RM’s choice.
• TM just combines consensus values.
• TM is “stateless”

(Figure: message diagram with a client, two TMs, the RMs, and one consensus box per RM. Messages shown: Request Commit, Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit.)

Page 58: CS 542 -- Concurrency Control, Distributed Commit

(Figure: side-by-side message sequences.
The Obvious Approach: Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.
Paxos Commit: Prepare, Propose RM1 Prepared and Propose RM2 Prepared, RM1 Prepared Chosen and RM2 Prepared Chosen, Commit.)

One fewer message delay

Page 59: CS 542 -- Concurrency Control, Distributed Commit

Consensus in Action

• The normal (failure-free) case
• Two message delays
• Can optimize

(Figure: an RM sends "Propose RM Prepared" to each of three acceptors inside the consensus box; each acceptor sends "Vote RM Prepared" to the TMs, which learn "RM Prepared Chosen".)

Page 60: CS 542 -- Concurrency Control, Distributed Commit

Consensus in Action

The TM can always learn what was chosen, or get Aborted chosen if nothing has been chosen yet, provided a majority of acceptors are working.

(Figure: an RM, the TMs, and three acceptors inside the consensus box.)

Page 61: CS 542 -- Concurrency Control, Distributed Commit


The Complete Algorithm

• Subtle.

• More weird cases than most people imagine.

• Proved correct.

Page 62: CS 542 -- Concurrency Control, Distributed Commit

Paxos Commit in a Nutshell

• N RMs
• 2F+1 acceptors (~2F+1 TMs)
• If F+1 acceptors see all RMs prepared, then transaction committed.
• 2F(N+1) + 3N + 1 messages, 5 message delays, 2 stable write delays.

(Figure: message flow among Client, TM, RM 1…N, and Acceptors 0…2F: request commit, prepare, prepared, all prepared, commit.)
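The commit condition can be sketched as a count over acceptors. This is a deliberately simplified view: ballot numbers, abort proposals and leader takeover from the full algorithm are not modeled, and the function and parameter names are hypothetical.

```python
def paxos_commit_decided(acceptor_views, rms, f):
    """acceptor_views: one set per reachable acceptor, holding the RMs that
    acceptor has recorded as prepared. rms: all RMs in the transaction.
    f: number of acceptor faults to tolerate (2f + 1 acceptors configured).
    """
    all_rms = set(rms)
    # Count the acceptors that have seen every RM prepared.
    witnesses = sum(1 for seen in acceptor_views if all_rms <= seen)
    # F+1 is a majority of 2F+1, so the decision survives F acceptor failures.
    return witnesses >= f + 1
```

With F = 1 there are three acceptors, and the transaction is committed as soon as any two of them have recorded every RM as prepared.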

Page 63: CS 542 -- Concurrency Control, Distributed Commit

Paxos Commit Evaluation

• Two-Phase Commit
  – 3N+1 messages
  – N+1 stable writes
  – 4 message delays
  – 2 stable-write delays
  – Availability is compromised
• Paxos Commit
  – 3N + 2F(N+1) + 1 messages
  – N+2F+1 stable writes
  – 5 message delays
  – 2 stable-write delays
  – Tolerates F Faults
• Paxos ≡ 2PC for F = 0
• Paxos Algorithm is the basis of Google’s Global Distributed Lock Manager
  – Chubby has F=2 (5 Acceptors)
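The cost formulas are easy to compare numerically. The sketch below only evaluates the expressions from the slide; the function names are illustrative.

```python
def two_pc_cost(n):
    """Costs of two-phase commit with n RMs, per the slide."""
    return {"messages": 3 * n + 1, "stable_writes": n + 1, "message_delays": 4}


def paxos_commit_cost(n, f):
    """Costs of Paxos Commit with n RMs tolerating f faults, per the slide."""
    return {
        "messages": 3 * n + 2 * f * (n + 1) + 1,
        "stable_writes": n + 2 * f + 1,
        "message_delays": 5,
    }
```

For N = 3 RMs, 2PC needs 10 messages; Paxos Commit needs the same 10 for F = 0 and 26 for F = 2, so fault tolerance costs 2F(N+1) extra messages and 2F extra stable writes.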


Page 65: CS 542 -- Concurrency Control, Distributed Commit

OLTP Through the Looking Glass (p1)

• Workload
  – TPC-C Benchmark
• Quote:
  – Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. …
  – Substantial time is spent in logging, latching, locking, B-tree, and buffer management.
• OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc. ACM SIGMOD, 2008
  – Took out components of a DBMS and measured the performance impact of each

Page 66: CS 542 -- Concurrency Control, Distributed Commit

OLTP Through the Looking Glass (p2)

• Concurrency Control
  – Look for applications where it can be turned off
  – Some sort of optimistic concurrency control
• Multi-core Support
  – Latching (inter-thread communication) remains a significant bottleneck
• Cache-conscious B-Trees
• Replication Management
  – Loss of transactional consistency if log shipping
  – Recovery is not instantaneous
  – Maintaining transactional consistency
• Weak Consistency
  – Starbucks doesn’t need two phase commit
  – How to achieve eventual consistency without transactional consistency
• Areas for Research that may yield dividends

Page 67: CS 542 -- Concurrency Control, Distributed Commit

End of an Era?

• The Relational Model is not necessarily the answer
  – It was excellent for data processing
  – Not a natural fit for
    • Data Warehouses
    • Web-oriented search
    • Real-time analytics, and
    • Semi-structured data
      – i.e., Semantic Web
• SQL is not the answer
  – Coupling between modern programming languages and SQL is “ugly beyond belief”
  – Programming languages have evolved while SQL has remained static
    • Pascal
    • C/C++
    • Java
    • The little languages: Python, Perl, PHP, Ruby
• The End of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007
  – A critique of the “one size fits all” assumption in DBMS

Page 68: CS 542 -- Concurrency Control, Distributed Commit

What’s so fun about databases?

• Traditional database courses talked about
  – Employee records
  – Bank records
• Now we talk about
  – Web search
  – Data mining
  – The collective intelligence of tweets
  – Scientific and medical databases
• From a personal viewpoint,
  – I have enjoyed learning this material with you
  – Thank you.

From our January 13 Lecture…

Page 69: CS 542 -- Concurrency Control, Distributed Commit

About CS 542

• CS 542 will
  – Build on database concepts you already know
  – Provide you tools for separating hype from reality
  – Help you develop skills in evaluating the tradeoffs involved in using and/or creating a database
• CS 542 may
  – Train you to read technical journals and apply them
• CS 542 will not
  – Cover the intricacies of SQL programming
  – Spend much effort in
    • Dynamic SQL
    • Stored Procedures
    • Interfaces with application programming languages
    • Connectors, e.g., JDBC, ODBC

From our January 13 Lecture…

Page 70: CS 542 -- Concurrency Control, Distributed Commit

Thanks

• Contact Information:
  – President, Early Stage IT – a cloud-based consulting firm
    • Email: J [dot] Singh [at] EarlyStageIT [dot] com
    • Phone: 978-760-2055
  – Co-chair of Software and Services SIG at TiE-Boston
  – Founder, SQLnix.org, a local resource for NoSQL databases
• My WPI email will be good through the summer.