Isolation properties, application behaviour, platform performance: The tradeoffs in distributed data stores
Talk given remotely at IITB, March 12, 2015
Presented by Alan Fekete (University of Sydney)
Work by Peter Bailis (UC Berkeley), Alan Fekete (University of Sydney), Michael Franklin (UC Berkeley), Ali Ghodsi (UC Berkeley, KTH), Joseph M. Hellerstein (UC Berkeley), Ion Stoica (UC Berkeley)
Internet-Scale Data Storage
• Many early systems* offered scalability and availability but lacked functionality expected of traditional database management platforms (=> “NoSQL”)
  – Access by id/key [without content-based access, without joins]
  – Operations may see stale data
  – No all-or-nothing combination of operations across items
*e.g. BigTable, PNUTS, S3, Dynamo, MongoDB, Cassandra, SimpleDB, Riak
Wouldn’t it be nice if…
• More recent papers and systems for internet-scale data offer features beyond the early NoSQL approaches, including some familiar from DBMSs
  – (Choice of) more consistency in an operation
  – Richer operations
  – Grouping operations on multiple items
• Our focus is transactions: ways to group operations on multiple items
Returning stale data
• Allowing weak consistency (returning stale data) in accesses was justified by the CAP result: for single-item access, you can’t offer strongly consistent reads and writes that are always available if partitions are possible in the system
  – Conjecture of Brewer (2000), proved by Gilbert and Lynch (2002)
Not supporting transactions?
• A system can’t provide serializable transactions that are always available if the system can partition
• This was known long before Brewer; see Davidson et al (ACM Computing Surveys 1985)
Traditional DBMS txns
• Application can declare transaction boundaries
  – Usually by ending the current txn; another starts automatically at the next request
  – Sometimes restricted: no DDL statements (e.g. changes to schema), only DML (SELECT, UPDATE etc)
• Application can set isolation level
  – Usually done per connection
“Isolated”
• Academic definition: Serializable
  – (ought to be the default for systems, but not so in practice)
• Key property: interleaved execution is equivalent (same values returned, same final state of db) to some execution where transactions run serially (no interleaving at all)
• No dirty read, no lost update
• If each transaction (running alone) preserves some constraint I, then the whole execution preserves I
• Implemented: traditionally done with commit-duration locks on data and indices
  – “Two Phase Locking (2PL)”
  – Also newer multiversion implementations (e.g. Cahill et al, TODS’09)
Weaker Isolation Levels
• SQL standard offers several isolation levels
• Each transaction can have its level set separately
• Read Uncommitted
  – Usually only for read-only code
  – Implemented: no read locks, commit-duration write locks
• Read Committed
  – No dirty reads (can’t see uncommitted, aborted or intermediate values)
  – Implemented: short-duration read locks, commit-duration write locks
  – MV implementation: can return an older version while a concurrent update is happening
• Repeatable Read
  – Does not prevent “phantoms” (a predicate evaluation that sees versions inserted concurrently)
  – Implemented: commit-duration locks on data
    • Should be the same as Serializable for a key-value store
  – Some multiversion systems provide “snapshot reading” for this level
ACID Transactions with weaker I
• Serializability is the ideal for isolation of transactions, but most transactions on (conventional, single-site) DBMSs don’t run serializably!
  – Read Committed is often the default level
Coordination is Bad
• “Availability during partitions” is a signal of an algorithm that does not need to coordinate across sites
  – Coordination damages both latency and throughput during normal operation
  – Especially for georeplicated or geodistributed systems, where intersite latency is high (and can’t be made much lower, because of the “speed of light”)
Coordination Costs
[Figure: maximum possible throughput for conflicting transactions that need coordination, with current network timings]
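The bound behind this figure can be worked through with simple arithmetic. The sketch below assumes that conflicting transactions must run one after another and that each one waits for a single coordination step before the next can start; the function name and the two latency values are illustrative assumptions, not numbers from the talk.

```python
def max_conflicting_throughput(coordination_latency_ms: float) -> float:
    """Upper bound on conflicting transactions per second when each one must
    finish a coordination step of the given latency before the next starts."""
    return 1000.0 / coordination_latency_ms


# Illustrative latencies only (assumed, not measurements from the talk).
for label, latency_ms in [("same datacentre", 0.5), ("cross-continent", 100.0)]:
    bound = max_conflicting_throughput(latency_ms)
    print(f"{label}: at most {bound:.0f} conflicting txn/s")
```

Under these assumptions the bound falls by the same factor that the coordination latency rises, which is why the cost is most visible in geodistributed deployments.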
HAT
• We propose a guide for platform developers
  – offer txns that can be an arbitrary collection of accesses to arbitrary sets of read/write objects,
  – with semantics chosen to be as strong as feasible to implement with availability even when partitioned
• “Highly Available Transactions”
Available?
• Clearly not possible if the client is partitioned away from its data
  – However, we should tolerate partitions between item replicas within the data store
• So, we ask for:
  – IF the client can get to (at least one replica of) each item it asks for, THEN the transaction can eventually commit (or it aborts voluntarily)
Isolation levels for HAT
• We have shown (VLDB’14) that you can offer available transactions that have
  – All-or-nothing atomicity
  – Isolation levels like (the definitions of) read committed and repeatable read*
    • But where reads may not always see the most recent committed changes
    • And you don’t get all the extra properties of conventional locking implementations (e.g. timeline view)
  – Causal consistency (including RYW, monotonic reads, writes follow reads) {as long as the client is sticky to a partition}
*in the absence of predicate reads [which is not an issue for a key-value store]
Read Atomic
• A new proposal for an isolation level (SIGMOD’14)
• Read committed, PLUS “No fractured reads”
  – Avoid the following:
    • T1 writes x, y
    • T2 reads x (seeing T1 or later), y (not seeing T1)
Anomaly Prevented by RA
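[Figure: timeline of T1 and T2, with time increasing down the page. x and y are both initially 10. T1 writes x=11 and y=11; T2 reads x and y and gets 10 for one item and 11 for the other.]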
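The anomaly can be stated as a check over what a reader observed. The sketch below is a hypothetical helper rather than anything from the talk: it assumes each transaction is identified by a timestamp, that `reads` maps each item to the timestamp of the version the reader saw, and that `write_sets` maps a timestamp to the items that transaction wrote.

```python
def fractured_pairs(reads, write_sets):
    """Return (item, sibling) pairs where the reader saw a transaction's write
    to item but an older version of sibling, which that transaction also wrote."""
    bad = []
    for item, ts in reads.items():
        for sibling in write_sets.get(ts, set()) - {item}:
            if sibling in reads and reads[sibling] < ts:
                bad.append((item, sibling))
    return bad


# Timestamp 0 is the initial state; timestamp 1 is T1, which writes x and y.
write_sets = {0: {"x", "y"}, 1: {"x", "y"}}

print(fractured_pairs({"x": 0, "y": 1}, write_sets))  # T2 mixes old x with new y
print(fractured_pairs({"x": 1, "y": 1}, write_sets))  # T2 sees all of T1
print(fractured_pairs({"x": 0, "y": 0}, write_sets))  # T2 sees none of T1
```

Seeing none of T1 is acceptable under RA: the level rules out partial visibility, not staleness.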
Caveat
• RA does not always guarantee a transaction-consistent snapshot
  – Transitive information flow may be fractured
  – However, many common coding idioms are supported effectively
    • E.g. maintain both ends of bidirectional associations consistently
      – Contrast with Facebook TAO, LinkedIn Espresso etc
    • E.g. maintain secondary index consistent with data
    • E.g. maintain referential integrity
RAMP Algorithms
• We have shown 3 alternative implementation techniques that provide RA isolation
RAMP-Fast
• Key ideas:
  – Multiversion stores that keep older versions
  – Make a new version visible once everything written in the txn is stored at all sites
  – Store metadata with each version listing the other items written in the same transaction
  – Detect races and repair atomicity by looking for aligned versions
State kept by RAMP
• For each item:
  – A set of versions
    • Each version has value, timestamp, metadata (which other items were written together with this)
  – latestcommit timestamp
    • Those versions whose timestamp is greater than latestcommit are ones whose commit has not yet arrived at this site
• E.g. y:(10,0,{x}),(11,1,{x}) latestcommit=0
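The per-item state can be written down as a small data structure. This is a minimal sketch; the class and field names (`Version`, `ItemState`, `md`, `latest_commit`) are assumptions chosen to mirror the slide, not names from the authors’ code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    value: int
    ts: int          # timestamp of the writing transaction
    md: frozenset    # metadata: other items written in the same transaction


@dataclass
class ItemState:
    versions: list = field(default_factory=list)
    latest_commit: int = 0

    def pending(self):
        """Versions stored here whose commit has not yet arrived at this site."""
        return [v for v in self.versions if v.ts > self.latest_commit]

    def visible(self):
        """The version a plain read returns: the one at latest_commit."""
        return next(v for v in self.versions if v.ts == self.latest_commit)


# The slide's example: y:(10,0,{x}),(11,1,{x}) latestcommit=0
y = ItemState(
    versions=[Version(10, 0, frozenset({"x"})), Version(11, 1, frozenset({"x"}))],
    latest_commit=0,
)
print(y.visible().value, [v.value for v in y.pending()])
```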
RAMP-F put-all phase 1
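[Figure: timeline of T1 and T2. x and y are both initially 10. T1 sends PREP: x=11, ts=1, {y} to x’s site and PREP: y=11, ts=1, {x} to y’s site.]
State of x before its PREP arrives:
x:(10,0,{y}) latestcommit=0
State after the PREPs are stored:
x:(10,0,{y}),(11,1,{y}) latestcommit=0
y:(10,0,{x}),(11,1,{x}) latestcommit=0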
RAMP-F put-all phase 2
[Figure: timeline of T1 and T2. x and y are both initially 10. After its PREPs, T1 sends COMMIT(1) to each site.]
State before COMMIT(1) arrives:
x:(10,0,{y}),(11,1,{y}) latestcommit=0
y:(10,0,{x}),(11,1,{x}) latestcommit=0
State after COMMIT(1) arrives:
x:(10,0,{y}),(11,1,{y}) latestcommit=1
y:(10,0,{x}),(11,1,{x}) latestcommit=1
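The two put-all phases can be sketched as follows. This is a single-process illustration under stated assumptions: no networking, no failures, timestamps chosen by hand, and hypothetical names (`Partition`, `prepare_all`, `commit_all`, `put_all`, `site_of`) that are not taken from the authors’ implementation.

```python
class Partition:
    """One site holding some items (a sketch; no networking, no failures)."""

    def __init__(self):
        self.versions = {}       # (item, ts) -> (value, metadata)
        self.latest_commit = {}  # item -> highest committed timestamp

    def prepare(self, item, value, ts, metadata):
        # Phase 1: store the version, but do not make it visible yet.
        self.versions[(item, ts)] = (value, metadata)

    def commit(self, ts):
        # Phase 2: advance latest_commit for every item this txn wrote here.
        for item, version_ts in self.versions:
            if version_ts == ts:
                current = self.latest_commit.get(item, ts)
                self.latest_commit[item] = max(current, ts)


def prepare_all(writes, ts, site_of):
    items = frozenset(writes)
    for item, value in writes.items():
        site_of[item].prepare(item, value, ts, items - {item})


def commit_all(writes, ts, site_of):
    for item in writes:
        site_of[item].commit(ts)


def put_all(writes, ts, site_of):
    prepare_all(writes, ts, site_of)  # every PREP is stored first ...
    commit_all(writes, ts, site_of)   # ... and only then is COMMIT sent


site_x, site_y = Partition(), Partition()
site_of = {"x": site_x, "y": site_y}
put_all({"x": 10, "y": 10}, ts=0, site_of=site_of)  # initial values

# Run T1's phases separately to observe the state between them.
t1 = {"x": 11, "y": 11}
prepare_all(t1, ts=1, site_of=site_of)
after_phase_1 = (site_x.latest_commit["x"], site_y.latest_commit["y"])
commit_all(t1, ts=1, site_of=site_of)
after_phase_2 = (site_x.latest_commit["x"], site_y.latest_commit["y"])
print(after_phase_1, after_phase_2)
```

Sending COMMIT only after every PREP is stored is what lets a reader that spots a newer sibling version rely on the matching version already being present at the other site.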
RAMP-F get-all phase-1 (fast)
[Figure: timeline of T1 and T2. x and y are both initially 10. While T1’s PREP: x=11, ts=1, {y} and PREP: y=11, ts=1, {x} are being stored, T2 asks for x and y.]
State at the sites:
x:(10,0,{y}),(11,1,{y}) latestcommit=0
y:(10,0,{x}),(11,1,{x}) latestcommit=0
T2 receives x:10,0,{y} and y:10,0,{x}
No fracture; can return these!
RAMP-F get-all phase-1 (oops)
[Figure: timeline of T1 and T2. x and y are both initially 10. T1 has sent both PREPs; its COMMIT(1) has reached y’s site but not yet x’s site when T2 asks for x and y.]
State at the sites when T2 reads:
x:(10,0,{y}),(11,1,{y}) latestcommit=0
y:(10,0,{x}),(11,1,{x}) latestcommit=1
T2 receives x:10,0,{y} and y:11,1,{x}
Detect fracture! Missing x with ts=1
State of x once COMMIT(1) arrives there:
x:(10,0,{y}),(11,1,{y}) latestcommit=1
RAMP-F get-all phase-2
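[Figure: timeline of T1 and T2. x and y are both initially 10. T1 has sent both PREPs; its COMMIT(1) has reached y’s site but not yet x’s site.]
State at the sites:
x:(10,0,{y}),(11,1,{y}) latestcommit=0
y:(10,0,{x}),(11,1,{x}) latestcommit=1
In phase 1, T2 received x:10,0,{y} and y:11,1,{x}
Detect fracture! Missing x with ts=1
In phase 2, T2 asks x’s site for the version with ts=1 and receives x:11,1,{y}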
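The two get-all rounds can be sketched in the same style. The code below is an illustration under assumptions: the `store` dictionary stands in for the two sites, it is pre-loaded with the state where COMMIT(1) has reached y but not x, and the function names are hypothetical.

```python
# item -> site state; versions maps ts -> (value, metadata)
store = {
    "x": {"latest_commit": 0, "versions": {0: (10, {"y"}), 1: (11, {"y"})}},
    "y": {"latest_commit": 1, "versions": {0: (10, {"x"}), 1: (11, {"x"})}},
}


def get(item, ts=None):
    """Return (value, ts, metadata); with no ts, read at latest_commit."""
    state = store[item]
    if ts is None:
        ts = state["latest_commit"]
    value, metadata = state["versions"][ts]
    return value, ts, metadata


def get_all(items):
    # Round 1: read the latest committed version of every item.
    result = {item: get(item) for item in items}

    # From the metadata, work out the newest timestamp each item should have.
    needed = {}
    for _value, ts, metadata in result.values():
        for sibling in metadata:
            needed[sibling] = max(needed.get(sibling, ts), ts)

    # Round 2: re-fetch, by exact timestamp, any item that lags behind.
    for item in items:
        if needed.get(item, -1) > result[item][1]:
            result[item] = get(item, needed[item])
    return {item: value for item, (value, _ts, _md) in result.items()}


print(get_all(["x", "y"]))
```

The second round reads a version that x’s site has stored but not yet committed; that is safe in this sketch only because a COMMIT is never sent before every PREP of the same transaction has been stored.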
YCSB Performance (95% Reads)
[Figure: YCSB performance with 95% reads, comparing No consistency, Write locks only, RAMP-Fast, RAMP-Small, RAMP-Hybrid and Strict 2PL]
Related work with Availability
• Restricted form of transactions
  – Operate on a set of items that are colocated
    • E.g. Google Megastore entity group, UCSB G-Store
  – Multiple gets or multiple puts, not get with put
    • E.g. Princeton COPS-GT, Eiger
• Restricted data types
  – Only allow commutative operations
    • E.g. INRIA CRDTs, Berkeley BloomL
• Weak semantics
  – Without isolation properties
    • E.g. ETH Consistency rationing (some choices)
Related work without Availability
• Systems that support general (read committed, SI-like, or even serializable) transactions
  – but use coordination: 2PC, Paxos, a master replica for ordering, etc
  – E.g. Google Megastore (across entity groups), ETH Consistency Rationing (some choices), Google Spanner, MSR Walter, UCSB Paxos-CP, Yale Calvin, Berkeley Planet (formerly MDCC)
Invariants
• RAMP supports many invariant-centric programming idioms
• Can we do this for other invariants?
  – Yes! (VLDB’15)
• Is this common?
  – Yes! (SIGMOD’15)
I-Confluence
• Given an invariant and a set of operations (and a way to merge conflicting updates)
• Can the invariant be maintained without coordination?
  – Yes, if the I-confluence property holds
  – Essentially: the result of merging invariant-satisfying changes also satisfies the invariant
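A small brute-force check gives the flavour of the property. The example is an assumption made for illustration and is not taken from the talk: the state is an integer balance, the invariant is that the balance is non-negative, and two replicas that diverged from a common starting value are merged by adding up their changes. The search only considers one operation per replica, so finding no counterexample is evidence, not a proof.

```python
from itertools import product


def invariant(balance: int) -> bool:
    return balance >= 0


def merge(base: int, a: int, b: int) -> int:
    """Combine two replicas that diverged from base by summing their changes."""
    return base + (a - base) + (b - base)


def find_counterexample(base, deltas):
    """Look for two changes that each keep the invariant on their own replica
    but break it once the replicas are merged."""
    for delta_a, delta_b in product(deltas, repeat=2):
        a, b = base + delta_a, base + delta_b
        if invariant(a) and invariant(b) and not invariant(merge(base, a, b)):
            return delta_a, delta_b
    return None


deposits = [1, 5, 10]
withdrawals = [-1, -5, -10]
print(find_counterexample(10, deposits))     # no violation found
print(find_counterexample(10, withdrawals))  # a pair of withdrawals that overdraws
```

In this toy setting deposits can proceed on each replica independently, while withdrawals need some coordination to keep the balance non-negative.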
Invariants in Rails
• Study of a sample of the 67 most popular Rails apps from GitHub
• Hardly any app-specified transactions
• Lots of validations (check invariant)
  – Many built-in
  – Some user-defined
• Most are I-confluent
  – Some are not (and are not supported correctly by most DBMSs at default weak isolation)
Conclusion
• We advocate that internet-scale data systems offer clients
  – Unrestricted sets of operations on arbitrary multiple items as a transaction
  – Semantics as strong as possible while avoiding coordination
• We offer RA, a choice that supports many idioms and can be implemented efficiently
• We provide theory to check if an invariant can be maintained coordination-free
For further study
• http://www.bailis.org