Page 1: Scaleable Replicated Databases

Jim Gray (Microsoft)
Pat Helland (Microsoft)
Dennis Shasha (Columbia)
Pat O’Neil (U.Mass)

Page 2: Outline

Replication strategies
– Lazy and Eager
– Master and Group
How centralized databases scale
– deadlocks rise non-linearly with transaction size and concurrency
Replication systems are unstable on scaleup
A possible solution

Page 3: Scaleup, Replication, Partition

[Figure: four configurations]
– Base case: a 1 TPS system (1 TPS server, 100 users)
– Scaleup: to a 2 TPS centralized system (2 TPS server, 200 users)
– Partitioning: two 1 TPS systems (each a 1 TPS server with 100 users; 0 tps flows between them in each direction)
– Replication: two 2 TPS systems (each a 2 TPS server with 100 users; 1 tps flows between them in each direction)
N^2 more work

Page 4: Why Replicate Databases?

Give users a local copy for
– Performance
– Availability
– Mobility (they are disconnected)
But... What if they update it?
Must propagate updates to other copies

Page 5: Propagation Strategies

Eager: Send update right away
– (part of same transaction)
– N times larger transactions
Lazy: Send update asynchronously
– separate transaction
– N times more transactions
Either way
– N times more updates per second per node
– N^2 times more work overall
[Figure: Write A, Write B, Write C, Commit repeated at each of three nodes, shown once for eager and once for lazy propagation]
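The scaling claims on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `replication_load` and its parameters are assumptions, and it assumes every node originates the same transaction load and every update is applied at all nodes.

```python
def replication_load(nodes: int, tps: float, actions: int) -> dict:
    """Update load when each of `nodes` nodes originates `tps` transactions
    per second of `actions` updates each, and every update reaches all nodes.
    (Hypothetical helper, not taken from the slides.)"""
    # Each node applies its own updates plus those of every other node.
    updates_per_node = nodes * tps * actions
    return {
        # Eager: one transaction updates all replicas, so it is N times larger.
        "eager_transaction_size": nodes * actions,
        # Lazy: one root transaction plus N-1 separate replica transactions.
        "lazy_transactions_per_root": nodes,
        "updates_per_second_per_node": updates_per_node,
        # Summed over all N nodes, the work grows as N squared.
        "updates_per_second_overall": nodes * updates_per_node,
    }


if __name__ == "__main__":
    for n in (1, 2, 10):
        print(n, replication_load(n, tps=1.0, actions=3))
```

Going from 1 node to 10 nodes multiplies the per-node update rate by 10 and the overall work by 100.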

Page 6: Update Control Strategies

Master
– Each object has a master node
– All updates start with the master
– Broadcast to the subscribers
Group
– Object can be updated by anyone
– Update broadcast to all others
Everyone wants Lazy Group:
– update anywhere, anytime, anyway

Page 7: Quiz Questions: Name One

Eager
– Master: N-Plexed disks
– Group: ?
Lazy
– Master: Bibles, Bank accounts, SQLserver
– Group: Name servers, Oracle, Access...
Note: Lazy contradicts Serializable
– If two lazy updates collide, then ... reconcile
  discard one transaction (or use some other rule)
  Ask for human advice
Meanwhile, nodes disagree =>
– Network DB state diverges: System Delusion

Page 8: Anecdotal Evidence

Update Anywhere systems are attractive
Products offer the feature
It demos well
But when it scales up
– Reconciliations start to cascade
– Database drifts “out of sync” (System Delusion)
What’s going on?

Page 9: Outline

Replication strategies
– Lazy and Eager
– Master and Group
How centralized databases scale
– deadlocks rise non-linearly
Replication is unstable on scaleup
A possible solution

Page 10: Simple Model of Waits

TPS transactions per second
Each
– Picks Actions records uniformly from set of DB_size records
– Then commits
About Transactions x Actions/2 resources locked
Chance a request waits is
$$\frac{\text{Transactions} \times \text{Actions}}{2 \times \text{DB\_size}}$$
Action rate is TPS x Actions
Active Transactions = TPS x Actions x Action_Time
Wait Rate = Action rate x Chance a request waits
$$\text{Wait Rate} = \frac{\text{TPS}^2 \times \text{Actions}^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}}$$
10x more transactions, 100x more waits
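As a quick check of the wait model, the following sketch multiplies out the slide's terms step by step. The function name `wait_rate` and the sample parameter values are assumptions chosen only for illustration.

```python
def wait_rate(tps: float, actions: int, action_time: float, db_size: float) -> float:
    """Waits per second under the slide's simple model (hypothetical helper)."""
    # Transactions in flight at any instant.
    active_transactions = tps * actions * action_time
    # Each active transaction holds about Actions/2 locks on average,
    # so this is the fraction of the database that is locked.
    chance_request_waits = active_transactions * actions / (2 * db_size)
    # Lock requests issued per second.
    action_rate = tps * actions
    return action_rate * chance_request_waits


if __name__ == "__main__":
    base = wait_rate(tps=10, actions=4, action_time=0.01, db_size=1e6)
    more_tps = wait_rate(tps=100, actions=4, action_time=0.01, db_size=1e6)
    print(more_tps / base)  # ratio of wait rates for 10x the transaction rate
```

Because TPS enters the product twice, ten times the transaction rate gives about one hundred times the waits; Actions enters three times, hence the cubic growth with transaction size.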

Page 11: Simple Model of Deadlocks

A deadlock is a wait cycle
Cycle of length 2:
– Wait rate x Chance Waitee waits for waiter
– Wait rate x (P(wait) / Transactions)
$$\frac{\text{TPS}^2 \times \text{Actions}^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}} \times \frac{\text{TPS} \times \text{Actions}^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}} \times \frac{1}{\text{TPS} \times \text{Actions} \times \text{Action\_Time}} = \frac{\text{TPS}^2 \times \text{Actions}^5 \times \text{Action\_Time}}{4 \times \text{DB\_size}^2}$$
Cycles of length 3 are $PW^3$, so ignored.
10x bigger trans = 100,000x more deadlocks
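The length-2 cycle estimate can be reproduced by composing the three factors named on the slide. This is a sketch under the slide's own assumptions; the function name `deadlock_rate` and the sample values are hypothetical.

```python
def deadlock_rate(tps: float, actions: int, action_time: float, db_size: float) -> float:
    """Deadlocks per second from cycles of length 2 (hypothetical helper)."""
    # Active transactions in the system.
    transactions = tps * actions * action_time
    # Waits per second, from the simple model of waits.
    waits_per_second = tps**2 * actions**3 * action_time / (2 * db_size)
    # Probability that a given transaction waits at some point in its life.
    p_wait = tps * actions**3 * action_time / (2 * db_size)
    # The waitee must in turn wait for this particular waiter,
    # which is one transaction out of `transactions`.
    return waits_per_second * p_wait / transactions


if __name__ == "__main__":
    small = deadlock_rate(tps=10, actions=4, action_time=0.01, db_size=1e6)
    big = deadlock_rate(tps=10, actions=40, action_time=0.01, db_size=1e6)
    print(big / small)  # growth in deadlocks for 10x larger transactions
```

The product simplifies to TPS^2 x Actions^5 x Action_Time / (4 x DB_size^2), so transaction size enters to the fifth power.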

Page 12: Summary So Far

Even centralized systems unstable
Waits:
– Square of concurrency
– 3rd power of transaction size
Deadlock rate
– Square of concurrency
– 5th power of transaction size
[Figure: two plots, each with axes Trans Size and Concurrency]

Page 13: Outline

Replication strategies
How centralized databases scale
Replication is unstable on scaleup
– Eager (master & group)
– Lazy (master & group & disconnected)
A possible solution

Page 14: Eager Transactions are FAT

If N nodes, eager transaction is Nx bigger
– Takes Nx longer
– 10x nodes, 1,000x deadlocks
– (derivation in paper)
Master slightly better than group
Good news:
– Eager transactions only deadlock
– No need for reconciliation
[Figure: Write A, Write B, Write C, Commit repeated at each of three nodes]

Page 15: Lazy Master & Group

Use optimistic concurrency control
– Keep transaction timestamp with record
– Updates carry old+new timestamp
– If record has old timestamp
  set value to new value
  set timestamp to new timestamp
– If record does not match old timestamp
  reject lazy transaction
– Not SNAPSHOT isolation (stale reads)
Reconciliation:
– Some nodes are updated
– Some nodes are “being reconciled”
[Figure: A Lazy Transaction. Write A, Write B, Write C, Commit repeated at each of three nodes; labels: New Timestamp; TRID, Timestamp; OID, old time, new value]
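The timestamp test described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Record` and `LazyUpdate` types, the in-memory `dict` replica, and carrying the new timestamp inside each update are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: object
    timestamp: int  # timestamp of the transaction that last wrote the record


@dataclass(frozen=True)
class LazyUpdate:
    oid: str            # object identifier
    old_timestamp: int  # timestamp the originating node saw before writing
    new_timestamp: int  # timestamp of the originating transaction
    new_value: object


def apply_lazy_update(replica: dict, update: LazyUpdate) -> bool:
    """Apply one replicated update; return False if it must be rejected."""
    record = replica[update.oid]
    if record.timestamp != update.old_timestamp:
        # Someone else changed the record first: reject and reconcile.
        return False
    record.value = update.new_value
    record.timestamp = update.new_timestamp
    return True


if __name__ == "__main__":
    replica = {"acct": Record(value=150, timestamp=1)}
    first = LazyUpdate("acct", old_timestamp=1, new_timestamp=2, new_value=100)
    second = LazyUpdate("acct", old_timestamp=1, new_timestamp=3, new_value=120)
    print(apply_lazy_update(replica, first), apply_lazy_update(replica, second))
```

Both updates were made against timestamp 1, so whichever arrives second at a replica finds a newer timestamp and is rejected. Different replicas may see them in different orders, which is why reconciliation is needed.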

Page 16: Reconciliation

Reconciliation means System Delusion
– Data inconsistent with itself and reality
How frequent is it?
Lazy transactions are not fat
– but N times as many
– Eager waits become Lazy reconciliations
– Rate is:
$$\frac{\text{TPS}^2 \times (\text{Actions} \times \text{Nodes})^3 \times \text{Action\_Time}}{2 \times \text{DB\_size}}$$
– Assuming everyone is connected
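To see how quickly this rate grows with the number of nodes, the formula can be evaluated directly. The function name `reconciliation_rate` and the sample values below are assumptions for illustration; the formula itself is the one on the slide.

```python
def reconciliation_rate(tps: float, actions: int, nodes: int,
                        action_time: float, db_size: float) -> float:
    """Lazy-group reconciliations per second, assuming all nodes are connected.
    (Hypothetical helper that evaluates the slide's formula.)"""
    # Same shape as the wait-rate formula, with Actions replaced by
    # Actions x Nodes because every update is replayed at every node.
    return tps**2 * (actions * nodes)**3 * action_time / (2 * db_size)


if __name__ == "__main__":
    one = reconciliation_rate(tps=10, actions=4, nodes=1, action_time=0.01, db_size=1e6)
    ten = reconciliation_rate(tps=10, actions=4, nodes=10, action_time=0.01, db_size=1e6)
    print(ten / one)  # growth in reconciliations for 10x the nodes
```

Nodes appears inside the cubed term, so ten times the nodes gives about a thousand times the reconciliations.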

Page 17: Eager & Lazy: Disconnected

Suppose mobile nodes disconnected for a day
When reconnect:
– get all incoming updates
– send all delayed updates
Incoming is:
$$\frac{\text{Nodes} \times \text{TPS} \times \text{Actions} \times \text{Disconnect\_Time}}{\text{Action\_Time}}$$
Outgoing is:
$$\frac{\text{TPS} \times \text{Actions} \times \text{Disconnect\_Time}}{\text{Action\_Time}}$$
Conflicts are intersection of these two sets
$$\frac{\text{Disconnect\_Time} \times (\text{TPS} \times \text{Actions} \times \text{Nodes})^2}{\text{DB\_size}}$$
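The idea that conflicts are the intersection of the incoming and outgoing update sets can be illustrated with a small simulation. This sketch assumes that each set consists of distinct records drawn uniformly from the database, and it takes the two set sizes as plain inputs; the function names and the numbers are hypothetical.

```python
import random


def expected_conflicts(incoming: int, outgoing: int, db_size: int) -> float:
    """Expected overlap of two uniformly chosen sets of distinct records."""
    return incoming * outgoing / db_size


def simulated_conflicts(incoming: int, outgoing: int, db_size: int,
                        trials: int = 200, seed: int = 0) -> float:
    """Average overlap measured over `trials` random draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        inbound = set(rng.sample(range(db_size), incoming))
        outbound = set(rng.sample(range(db_size), outgoing))
        # A conflict is a record updated both locally and remotely.
        total += len(inbound & outbound)
    return total / trials


if __name__ == "__main__":
    print(expected_conflicts(1000, 100, 100_000))
    print(simulated_conflicts(1000, 100, 100_000))
```

Since the incoming set grows with the number of nodes and both sets grow with the disconnect time, the overlap grows with their product.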

Page 18: Outline

Replication strategies (lazy & eager, master & group)
How centralized databases scale
Replication is unstable on scaleup
A possible solution
– Two-tier architecture: Mobile & Base nodes
– Base nodes master objects
– Tentative transactions at mobile nodes
  Transactions must be commutative
– Re-apply transactions on reconnect
– Transactions may be rejected

Page 19: Safe Approach

Each object mastered at a node
Update Transactions only read and write master items
Lazy replication to other nodes
Allow reads of stale data (on user request)
PROBLEMS:
– doesn’t support mobile users
– deadlocks explode with scaleup
?? How do banks work???

Page 20: Two Tier Replication

Two kinds of nodes:
– Base nodes always connected, always up
– Mobile nodes occasionally connected
Data mastered at base nodes
Mobile nodes
– have stale copies
– make tentative updates
[Figure: a Base Node and a Mobile node]

Page 21: Mobile Node Makes Tentative Updates

Updates local database while disconnected
Saves transactions
When Mobile node reconnects:
– Tentative transactions re-done as Eager-Master (at original time??)
– Some may be rejected (replaces reconciliation)
No System Delusion.
[Figure: Mobile node and Base Node; arrows labeled "tentative transactions" and "base updates & failed base transactions"]

Page 22: Tentative Transactions

Must be commutative with others
– Debit $50 rather than Change $150 to $100.
Must have acceptance criteria
– Account balance is positive
– Ship date no later than quoted
– Price is no greater than quoted
[Figure: Tentative Transactions at local DB; send Tentative Xacts; Updates & Rejects; Transactions From Others]
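A minimal sketch of how a base node might re-apply tentative transactions on reconnect is shown below. It is one possible reading of the slides, not the authors' design: the `TentativeDebit` type, the `reapply_at_base` function, integer balances, and the non-negative balance test used as the acceptance criterion are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TentativeDebit:
    """A commutative operation: 'debit N', not 'change balance from X to Y'."""
    account: str
    amount: int


def non_negative(balance: int) -> bool:
    """Assumed acceptance criterion: the account must not be overdrawn."""
    return balance >= 0


def reapply_at_base(
    master: Dict[str, int],
    tentative: List[TentativeDebit],
    accept: Callable[[int], bool] = non_negative,
) -> Tuple[List[TentativeDebit], List[TentativeDebit]]:
    """Re-execute saved tentative transactions against the master copy."""
    accepted, rejected = [], []
    for txn in tentative:
        # Recompute from the current master value, not the stale mobile copy.
        new_balance = master[txn.account] - txn.amount
        if accept(new_balance):
            master[txn.account] = new_balance
            accepted.append(txn)
        else:
            # Rejection replaces reconciliation; the mobile user is told at once.
            rejected.append(txn)
    return accepted, rejected


if __name__ == "__main__":
    master = {"acct": 100}
    saved = [TentativeDebit("acct", 50), TentativeDebit("acct", 80),
             TentativeDebit("acct", 30)]
    print(reapply_at_base(master, saved), master)
```

Because each debit is expressed as a delta, it still makes sense after other nodes have changed the master balance; only the acceptance test decides whether it commits.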

Page 23: Virtue of 2-Tier Approach

Allows mobile operation
No system delusion
Rejects detected at reconnect (know right away)
If commutativity works,
– No reconciliations
– Even though work rises as (Mobile + Base)^2

Page 24: Outline

Replication strategies (lazy & eager, master & group)
How centralized databases scale
Replication is unstable on scaleup
A possible solution (two-tier architecture)
– Tentative transactions at mobile nodes
– Re-apply transactions on reconnect
– Transactions may be rejected & reconciled
Avoids system delusion