Post on 05-Jan-2016

Isolation properties, application behaviour, platform performance: The tradeoffs in distributed data stores

Talk given remotely at IITB, March 12 2015

Presented by Alan Fekete (University of Sydney)

Work by Peter Bailis (UC Berkeley), Alan Fekete (U of Sydney), Michael Franklin (UC Berkeley), Ali Ghodsi (UC Berkeley, KTH), Joseph M. Hellerstein (UC Berkeley), Ion Stoica (UC Berkeley)

Internet-Scale Data Storage

• Many early systems* offered scalability and availability but lacked functionality expected in traditional database management platforms (=> “NoSQL”)
– Access by id/key [without content-based access, without joins]
– Operations may see stale data
– No all-or-nothing combination of operations across items

*eg BigTable, PNUTS, S3, Dynamo, MongoDB, Cassandra, SimpleDB, Riak

Wouldn’t it be nice if…

• More recent papers and systems for internet-scale data offer extra features beyond early NoSQL approaches, including some familiar from DBMS
– (Choice of) more consistency in an operation
– Richer operations
– Grouping operations on multiple items
• Our focus is transactions: ways to group operations on multiple items

Returning stale data

• Allowing weak consistency (return of stale data) in accesses was justified by the CAP result: for single-item access, you can’t offer strongly consistent reads and writes that are always available if partitions are possible in the system
– Conjecture of Brewer (2000), proved by Gilbert and Lynch (2002)

Not supporting transactions?

• A system can’t provide serializable transactions that are always available if the system can partition

• This was known long before Brewer; see Davidson et al (ACM Computing Surveys 1985)


Traditional DBMS txns

• Application can declare transaction boundary
– Usually by ending current txn; another will automatically start at the next request
– Sometimes restricted: no DDL statements (eg change to schema), only DML (SELECT, UPDATE etc)
• Application can set isolation level
– Usually done per connection
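
The boundary behaviour described above can be sketched with Python's built-in sqlite3 module, used here only because it ships with Python. The table and account names are made up for illustration. SQLite has no per-connection SQL isolation level setting, so that part of the slide is not shown.

```python
import sqlite3

# In-memory database; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.commit()

# The first DML statement implicitly opens a transaction.
conn.execute("INSERT INTO account VALUES ('a', 100)")
conn.execute("INSERT INTO account VALUES ('b', 0)")
opened_implicitly = conn.in_transaction

# Committing ends the current transaction...
conn.commit()
closed_by_commit = not conn.in_transaction

# ...and the next DML statement starts another one automatically.
conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 'a'")
conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 'b'")
conn.rollback()  # abandon both updates together (all-or-nothing)

balances = dict(conn.execute("SELECT id, balance FROM account"))
print(opened_implicitly, closed_by_commit, balances)
```

The rollback undoes both updates as a unit, which is the grouping property that the early NoSQL systems lacked.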

“Isolated”

• Academic definition: Serializable
– (ought to be default for systems, but not so in practice)
• Key property: interleaved execution is equivalent (same values returned, same final state of db) to some execution where transactions run serially (no interleaving at all)
• No dirty read, no lost update
• If each transaction (running alone) preserves some constraint I, then the whole execution preserves I
• Implemented: Traditionally done with commit-duration locks on data and indices
– “Two Phase Locking (2PL)”
– Also newer multiversion implementations (eg Cahill et al, TODS’09)
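
As a rough illustration of the key property, the sketch below runs two increment transactions against one counter and compares an interleaved schedule with the serial ones. The schedule encoding and the `run` helper are assumptions made for this example, and it compares only the final state, not the values returned to each transaction.

```python
from itertools import permutations

def run(schedule, initial=10):
    """Execute a schedule of (txn, op) steps against one shared counter x."""
    db = {"x": initial}
    local = {}                      # each transaction's private copy of x
    for txn, op in schedule:
        if op == "read":
            local[txn] = db["x"]
        elif op == "write":
            db["x"] = local[txn] + 1
    return db["x"]

T1 = [("T1", "read"), ("T1", "write")]
T2 = [("T2", "read"), ("T2", "write")]

# Final states reachable by running the transactions serially, in either order.
serial_results = {run(a + b) for a, b in permutations([T1, T2])}

# An interleaving in which both transactions read before either writes.
interleaved = [T1[0], T2[0], T1[1], T2[1]]
interleaved_result = run(interleaved)

print(serial_results)                        # {12}
print(interleaved_result)                    # 11
print(interleaved_result in serial_results)  # False: a lost update
```

No serial order produces 11, so this interleaving is not serializable; one of the two increments has been lost.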

Weaker Isolation Levels

• SQL standard offers several isolation levels
• Each transaction can have level set separately
• Read Uncommitted
– Usually only for read-only code
– Implemented: no read locks, commit-duration write locks
• Read Committed
– No dirty reads (can’t see uncommitted, aborted or intermediate values)
– Implemented: short duration read locks, commit-duration write locks
– MV implementation: can return older version, while concurrent update is happening
• Repeatable Read
– Does not rule out “phantoms” (predicate evaluation that sees versions inserted concurrently)
– Implemented: Commit-duration locks on data
  • Should be the same as Serializable for a key-value store
– Some multiversion systems provide “snapshot reading” for this level

ACID Transactions with weaker I

• Serializability is the ideal for isolation of transactions, but most transactions on (conventional, single site) dbms don’t run serializably!
– Read Committed is often the default level

Coordination is Bad

• “Availability during partitions” is a signal of an algorithm that does not need to coordinate across sites
– Coordination damages both latency and throughput, during normal operation
– Especially for georeplicated or geodistributed systems, where intersite latency is high (and can’t be made much lower, because “speed of light”)

Coordination Costs

[Figure: maximum possible throughput for conflicting transactions that need coordination, with current network timings]
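
One way to read the bound in this figure: if conflicting transactions must run one at a time and each needs at least one coordination round trip, throughput cannot exceed one transaction per round trip. The sketch below works through that arithmetic. The function name and the round-trip times are hypothetical, chosen only to show the shape of the bound; they are not the measurements behind the slide.

```python
def max_conflicting_throughput(round_trip_ms: float) -> float:
    """Upper bound (txns/second) when conflicting transactions run one at a
    time, each needing at least one coordination round trip."""
    return 1000.0 / round_trip_ms

# Hypothetical round-trip times, for illustration only.
for label, rtt_ms in [("same datacentre", 0.5), ("cross-continent", 100.0)]:
    print(f"{label}: at most {max_conflicting_throughput(rtt_ms):.0f} txn/s")
```

Protocols that need several message rounds per transaction would lower the bound further.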

HAT

• We propose a guide for platform developers
– offer txns that can be an arbitrary collection of accesses to arbitrary sets of read/write objects,
– with semantics chosen to be as strong as feasible to implement with availability even when partitioned
• “Highly Available Transactions”

Available?

• Clearly not possible if client is partitioned away from its data
– However, we should tolerate partition between item replicas within the data store
• So, we ask for:
– IF client can get to (at least one replica of) each item it asks for, THEN transaction can eventually commit (or it aborts voluntarily)

Isolation levels for HAT

• We have shown (VLDB’14) that you can offer available transactions that have
– All-or-nothing atomicity
– Isolation level like (the definitions of) read committed and repeatable read*
  • But where reads may not always see the most recent committed changes
  • And you don’t get all the extra properties of conventional locking implementation (eg timeline view)
– Causal consistency (including RYW, monotonic reads, write follows reads) {as long as client is sticky to a partition}

*in absence of predicate reads [which is not an issue for key-value store]

Read Atomic

• A new proposal for an isolation level (SIGMOD’14)

• Read committed, PLUS “No fractured reads”
– Avoid the following:
  • T1 writes x, y
  • T2 reads x (seeing T1 or later), y (not seeing T1)

Anomaly Prevented by RA

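[Figure: timeline for T1 and T2, time increasing down the page. x and y are both initially 10. T1 writes x=11 and y=11. T2 reads x (returns 10) and y (returns 11), which is a fractured read.]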
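
The anomaly can be stated as a check over what a transaction read. The sketch below is one possible formulation, assuming each writing transaction has a unique, totally ordered timestamp and writes at most one version per item. The names `is_fractured`, `reads` and `writers` are invented for this example.

```python
def is_fractured(reads, writers):
    """reads: item -> timestamp of the version that was read.
    writers: timestamp -> set of items written by the transaction with
    that timestamp."""
    for item, ts in reads.items():
        for sibling in writers.get(ts, set()):
            # The reader saw this writer's version of `item` but an older
            # version of another item written by the same transaction.
            if sibling in reads and reads[sibling] < ts:
                return True
    return False

writers = {0: {"x", "y"}, 1: {"x", "y"}}   # ts 0 = initial values, ts 1 = T1

print(is_fractured({"x": 0, "y": 1}, writers))  # True: the case in the figure
print(is_fractured({"x": 0, "y": 0}, writers))  # False: T1 not seen at all
print(is_fractured({"x": 1, "y": 1}, writers))  # False: T1 seen in full
```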

Caveat

• RA does not always guarantee transaction consistent snapshot
– Transitive information flow may be fractured
– However, many common coding idioms are supported effectively
  • Eg maintain both ends of bidirectional associations consistently
    – Contrast with Facebook TAO, LinkedIn Espresso etc
  • Eg maintain secondary index consistent with data
  • Eg maintain referential integrity

RAMP Algorithms

• We have shown 3 alternative implementation techniques that provide RA isolation


RAMP-Fast

• Key ideas:
– Multiversion stores, that keep older versions
– Make new version visible once everything written in the txn is stored at all sites
– Store metadata with each version listing other items written in same transaction
– Detect races and repair atomicity by looking for aligned versions

State kept by RAMP

• For each item:
– A set of versions
  • Each version has value, timestamp, metadata (which other items were written together with this)
– latestcommitted timestamp
  • Those versions whose timestamp is greater than latestcommitted are ones whose commit has not yet arrived at this site
• Eg y: (10,0,{x}), (11,1,{x}) latestcommit=0

RAMP-F put-all phase 1

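[Figure: timeline for T1 and T2 over items x and y, both initially 10]
• T1 sends PREP: x=11, ts=1, {y} and PREP: y=11, ts=1, {x}
• State at x before PREP arrives: x: (10,0,{y}) latestcommit=0
• State at x after PREP: x: (10,0,{y}), (11,1,{y}) latestcommit=0
• State at y after PREP: y: (10,0,{x}), (11,1,{x}) latestcommit=0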

RAMP-F put-all phase 2

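[Figure: same timeline, continued after both PREP messages have been stored]
• State after PREP: x: (10,0,{y}), (11,1,{y}) latestcommit=0 and y: (10,0,{x}), (11,1,{x}) latestcommit=0
• T1 sends COMMIT(1) to both items
• State at y after COMMIT(1): y: (10,0,{x}), (11,1,{x}) latestcommit=1
• State at x after COMMIT(1): x: (10,0,{y}), (11,1,{y}) latestcommit=1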
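
The two put-all phases can be sketched in a single process as follows. This is a minimal sketch under stated assumptions: the `Partition` class and `put_all` function are names invented here, the "messages" are plain method calls made in sequence, and failure handling is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """One item's server-side state, following the 'State kept by RAMP' slide."""
    versions: dict = field(default_factory=dict)   # ts -> (value, sibling items)
    latest_commit: int = 0

    def prepare(self, ts, value, siblings):
        # Phase 1: store the version; plain reads do not see it yet.
        self.versions[ts] = (value, frozenset(siblings))

    def commit(self, ts):
        # Phase 2: make the version visible; never move latest_commit backwards.
        self.latest_commit = max(self.latest_commit, ts)


def put_all(store, writes, ts):
    """Write every item in `writes` as one transaction with timestamp `ts`."""
    items = set(writes)
    for item, value in writes.items():      # phase 1: PREP at every item
        store[item].prepare(ts, value, items - {item})
    for item in writes:                     # phase 2: COMMIT at every item
        store[item].commit(ts)


store = {"x": Partition({0: (10, frozenset({"y"}))}),
         "y": Partition({0: (10, frozenset({"x"}))})}
put_all(store, {"x": 11, "y": 11}, ts=1)
print(store["x"].versions[1], store["x"].latest_commit)
```

Because no COMMIT is sent until every PREP has been stored, any version that a reader learns about through metadata is already present at its item's site.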

RAMP-F get-all phase-1 (fast)

[Figure: T2 reads x and y while T1’s PREP messages (x=11, ts=1, {y} and y=11, ts=1, {x}) are in progress and no COMMIT has arrived]
• State at x: (10,0,{y}), possibly also (11,1,{y}); latestcommit=0
• State at y: (10,0,{x}), (11,1,{x}) latestcommit=0
• T2 receives x: 10,0,{y} and y: 10,0,{x}
• No fracture; can return these!

RAMP-F get-all phase-1 (oops)

[Figure: T2 reads x before COMMIT(1) reaches x, and reads y after COMMIT(1) has reached y]
• State at x when read: x: (10,0,{y}), (11,1,{y}) latestcommit=0
• State at y when read: y: (10,0,{x}), (11,1,{x}) latestcommit=1
• T2 receives x: 10,0,{y} and y: 11,1,{x}
• Detect fracture! Missing x with ts=1
• Later, COMMIT(1) reaches x: x: (10,0,{y}), (11,1,{y}) latestcommit=1

RAMP-F get-all phase-2

[Figure: continuation of the previous slide, after T2 has detected the fracture]
• T2 received x: 10,0,{y} and y: 11,1,{x} in phase 1; missing x with ts=1
• T2 asks x for the version with ts=1 and receives x: 11,1,{y}
• This works even though the state at x is still x: (10,0,{y}), (11,1,{y}) latestcommit=0
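
The get-all steps walked through on these slides can be sketched as below. The dictionary layout and the `get` and `get_all` helpers are assumptions made for this example, and the store is a local dictionary rather than a set of remote partitions.

```python
def get(partition, ts=None):
    """Return (value, ts, siblings): the latest committed version, or the
    version with exactly timestamp `ts` when one is requested."""
    if ts is None:
        ts = partition["latestcommit"]
    value, siblings = partition["versions"][ts]
    return value, ts, siblings

def get_all(store, items):
    # Phase 1: read the latest committed version of every item.
    result = {item: get(store[item]) for item in items}
    # For each item, find the highest timestamp that a sibling's metadata
    # says it should have.
    required = {}
    for item, (_, ts, siblings) in result.items():
        for sib in siblings:
            if sib in result:
                required[sib] = max(required.get(sib, 0), ts)
    # Phase 2: re-fetch any item whose returned version is older than required.
    for item, ts in required.items():
        if result[item][1] < ts:
            result[item] = get(store[item], ts)
    return {item: value for item, (value, _, _) in result.items()}

# The "oops" state from the slides: COMMIT(1) has reached y but not yet x.
store = {
    "x": {"versions": {0: (10, {"y"}), 1: (11, {"y"})}, "latestcommit": 0},
    "y": {"versions": {0: (10, {"x"}), 1: (11, {"x"})}, "latestcommit": 1},
}
print(get_all(store, ["x", "y"]))   # {'x': 11, 'y': 11}
```

The phase 2 lookup by exact timestamp succeeds because put-all stores every PREP before sending any COMMIT.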

YCSB Performance (95% Reads)

[Figure: YCSB performance with 95% reads. Series shown: No consistency, Write locks only, RAMP-Fast, Strict 2PL, RAMP-Small, RAMP-Hybrid]

Related work with Availability

• Restricted form of transactions
– Operate on set of items that are colocated
  • Eg Google Megastore entity group, UCSB G-Store
– Multiple gets or multiple puts, not get with put
  • Eg Princeton COPS-GT, Eiger
• Restricted data types
– Only allow commutative operations
  • Eg INRIA CRDTs, Berkeley BloomL
• Weak semantics
– Without isolation properties
  • Eg ETH Consistency Rationing (some choices)

Related work without Availability

• Systems that support general (read committed, SI-like, or even serializable) transactions
– but use coordination: 2PC, Paxos, a master replica for ordering, etc
– Eg Google Megastore (across entity groups), ETH Consistency Rationing (some choices), Google Spanner, MSR Walter, UCSB Paxos-CP, Yale Calvin, Berkeley Planet (formerly MDCC)

Invariants

• RAMP supports many invariant-centric programming idioms

• Can we do this for other invariants?
– Yes! (VLDB’15)
• Is this common?
– Yes! (SIGMOD’15)

I-Confluence

• Given an invariant and a set of operations (and a way to merge conflicting updates)

• Can the invariant be maintained without coordination?
– Yes, if the I-confluence property holds
– Essentially: the result of merging invariant-satisfying changes also satisfies the invariant
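
To make the property concrete, the sketch below brute-forces a small, bounded version of the check for an account balance with the invariant "balance is never negative". Everything here is an assumption for illustration: the invariant, the merge rule (apply both replicas' deltas to the common base), and the restriction to one operation per replica over a small range of starting balances. It is not the general test from the VLDB’15 work.

```python
from itertools import product

def invariant(balance):
    return balance >= 0

def merge(base, a, b):
    """Merge two divergent replicas by applying both deltas to the common base."""
    return base + (a - base) + (b - base)

def i_confluent(deltas, bases):
    """Bounded check: whenever both divergent states satisfy the invariant,
    the merged state must satisfy it too."""
    for base, d1, d2 in product(bases, deltas, deltas):
        a, b = base + d1, base + d2
        if invariant(base) and invariant(a) and invariant(b):
            if not invariant(merge(base, a, b)):
                return False
    return True

bases = range(0, 20)
print(i_confluent(deltas=[1, 5, 10], bases=bases))      # True: deposits only
print(i_confluent(deltas=[-1, -5, -10], bases=bases))   # False: withdrawals
```

Deposits pass because merging can only raise the balance. Withdrawals fail: from a balance of 1, two replicas can each withdraw 1 and stay valid locally, yet the merged balance is -1.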

Invariants in Rails

• Study a sample of the 67 most popular Rails apps from GitHub
• Hardly any app-specified transactions
• Lots of validations (check invariant)
– Many built-in
– Some user-defined
• Most are I-confluent
– Some are not (and not supported correctly by most dbms at default weak isolation)

Conclusion

• We advocate that internet-scale data systems offer clients
– Unrestricted sets of operations on arbitrary multiple items as a transaction
– Semantics as strong as possible while avoiding coordination
• We offer RA, a choice that supports many idioms, and can be implemented efficiently
• We provide theory to check if an invariant can be maintained coordination-free

For further study

• http://www.bailis.org
