Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions •...

44
Spanner: Google’s Globally- Distributed Database COS 418: Distributed Systems Precept 6 Themis Melissaris and Daniel Suo

Transcript of Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions •...

Page 1: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Spanner: Google’s Globally-

Distributed Database

COS 418: Distributed SystemsPrecept 6

Themis Melissaris and Daniel Suo

Page 2: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Review {linear, serial, strict serial}izability

• Review concurrency controls

• Spanner

• Pro tips for Assignment 3

2

Agenda

Page 3: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

3

ACID properties of transactions

• Atomicity: Either all constituent operations of the transaction complete successfully, or none do

• Consistency: Each transaction in isolation preserves a set of integrity constraints on the data

• Isolation: Transactions’ behavior not impacted by presence of other concurrent transactions

• Durability: The transaction’s effects survive failure of volatile (memory) or non-volatile (disk) storage

Page 4: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Review *izability

4

Page 5: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Terms come from two different communities (database and distributed systems). Overloaded!

• All refer to interleaving operations

• Definitions– Operation: typically refers to a single access

operation (e.g., read, write)– Transaction: one or more operations that must

be committed atomically5

Some context

Page 6: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Guarantee for a single operation on a singleobject

• Informally, writes should appear instantaneously within the system

• All later reads as defined by wall-clock time (i.e., real-time) reflect the written value or some later written value

• ‘Strong Consistency’ in CAP theorem– Yes, we use consistency in ACID to mean something

different6

Linearizability

Page 7: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Guarantee for transactions, or one or more operations on one or more objects

• A set of transactions over some objects should execute as though each transaction ran in someserial order (doesn’t specify which one!)

• No real-time (i.e., world-clock) constraints; in other words, no deterministic order for transactions

• ‘Isolation’ in ACID properties7

Serializability

Page 8: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Linearizability + serializability

• Transactions have some serial behavior and that behavior corresponds to wall-clock time

• Straightforward to reason about for non-overlapping transactions

• What about overlapping transactions?

8

Strict serializability

Page 9: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Review concurrency controls

9

Page 10: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

10

ACID properties of transactions

• Atomicity: write-ahead logs and checkpoints

• Consistency: application logic

• Isolation: concurrency controls (locks, 2PL, OCC, MVCC)

• Durability: write-ahead logs and checkpoints

Page 11: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

11

ACID properties of transactions

• Atomicity: write-ahead logs and checkpoints

• Consistency: application logic

• Isolation: concurrency controls (locks, 2PL, OCC, MVCC)

• Durability: write-ahead logs and checkpoints

Page 12: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Global lock: simple, but slow

• Per-object lock: doesn’t guarantee serializability (isolation)

• 2PL: gives serializability, but leaves opportunities on the table and can deadlock

• OCC: performs well if few conflicts, but poorly if many conflicts

• MVCC: snapshot isolation, not serializability 12

Concurrency controls

Page 13: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Distributed Transactions

13

Page 14: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

14

Consider partitioned data over servers

O

P

Q

• Why not just use 2PL?– Grab locks over entire read and write set

– Perform writes

– Release locks (at commit time)

L

L

L

U

U

U

R

R W

W

Page 15: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

15

Consider partitioned data over servers

O

P

Q

• How do you get serializability?

– On single machine, single COMMIT op in the WAL

– In distributed setting, assign global timestamp to txn(at sometime after lock acquisition and before commit)

• Centralized txn manager • Distributed consensus on timestamp (not all ops)

L

L

L

U

U

U

R

R W

W

Page 16: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

16

Strawman: Consensus per txn group?

O

P

Q

L

L

L

U

U

U

R

R W

W

R

S

• Single Lamport clock, consensus per group?– Linearizability composes!– But doesn’t solve concurrent, non-overlapping txn problem

Page 17: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Spanner: Google’s Globally-Distributed Database

OSDI 2012

17

Page 18: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Dozens of zones (datacenters)

• Per zone, 100-1000s of servers

• Per server, 100-1000 partitions (tablets)

• Every tablet replicated for fault-tolerance (e.g., 5x)

18

Google’s Setting

Page 19: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

19

Scale-out vs. fault tolerance

O

P

QQQ

PP

OO

• Every tablet replicated via Paxos (with leader election)

• So every “operation” within transactions across tablets actually a replicated operation within Paxos RSM

• Paxos groups can stretch across datacenters!– (COPS took same approach within datacenter)

Page 20: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Disruptive idea:

Do clocks really need to be arbitrarily unsynchronized?

Can you engineer some max divergence?

20

Page 21: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• “Global wall-clock time” with bounded uncertainty

time

earliest latest

TT.now()

2*ε

21

TrueTime

Consider event enow which invoked tt = TT.new():Guarantee: tt.earliest <= tabs(enow) <= tt.latest

Page 22: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Timestamps and TrueTime

T

Pick s > TT.now().latest

Acquired locks Release locks

Wait until TT.now().earliest > ss

average ε

Commit wait

average ε

22

Page 23: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Commit Wait and Replication

T

Acquired locks

Start consensus

Notify followers

Commit wait donePick s

23

Achieve consensus

Release locks

Page 24: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Client:

1. Issues reads to leader of each tablet group, which acquires read locks and returns most recent data

2. Locally performs writes

3. Chooses coordinator from set of leaders, initiates commit

4. Sends commit message to each leader, include identify of coordinator and buffered writes

5. Waits for commit from coordinator

24

Client-driven transactions

Page 25: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• On commit msg from client, leaders acquire local write locks

– If non-coordinator:• Choose prepare ts > previous local timestamps• Log prepare record through Paxos• Notify coordinator of prepare timestamp

– If coordinator:• Wait until hear from other participants• Choose commit timestamp >= prepare ts, > local ts• Logs commit record through Paxos• Wait commit-wait period• Sends commit timestamp to replicas, other leaders, client

• All apply at commit timestamp and release locks25

Commit Wait and 2-Phase Commit

Page 26: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Commit Wait and 2-Phase Commit

TC

Acquired locks

TP1

TP2

26

Start logging Done logging

Prepared

Release locks

Acquired locks Release locks

Acquired locks Release locks

Notify participants sc

Commit wait doneCompute sp for each

Compute overall sc

Committed

Send sp

Page 27: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Example

27

TP

Remove X from friend list

Remove myself from X’s friend list

sp= 6

sp= 8

sc= 8 s = 15

Risky post P

sc= 8

Time <8[X]

[me]

15

TC T2

[P]My friendsMy postsX’s friends

8[]

[]

Page 28: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Given global timestamp, can implement read-only transactions lock-free (snapshot isolation)

• Step 1: Choose timestamp sread = TT.now.latest()

• Step 2: Snapshot read (at sread) to each tablet– Can be served by any up-to-date replica

28

Read-only optimizations

Page 29: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Disruptive idea:

Do clocks really need to be arbitrarily unsynchronized?

Can you engineer some max divergence?

29

Page 30: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

TrueTime Architecture

Datacenter 1 Datacenter n…Datacenter 2

GPS timemaster

GPS timemaster

GPS timemaster

Atomic-clock timemaster

GPS timemaster

Client

30

GPS timemaster

Compute reference [earliest, latest] = now ± ε

Page 31: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

time

ε

0sec 30sec 60sec 90sec

+6ms

now = reference now + local-clock offset

ε = reference ε + worst-case local-clock drift= 1ms + 200 μs/sec

31

TrueTime implementation

• What about faulty clocks? – Bad CPUs 6x more likely in 1 year of empirical data

Page 32: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Known unknowns > unknown unknowns

Rethink algorithms to reason about uncertainty

32

Page 33: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Pro tips

33

Page 34: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Not following Figure 2 (and, more generally, the paper) exactly

34

The single greatest source of head-and heartache

Page 35: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• These are just empty AppendEntries RPCs!

• That means you must handle all the same checks as you would for AppendEntries

• Otherwise, bad things can happen

• If just return true, leader thinks that follower’s log matches the leader’s log up through prevLogIndex

35

Ex. 1: heartbeat RPCs

Page 36: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• If the follower’s log isn’t as long as the leaders, conflict!

• Can’t just truncate follower’s log after prevLogIndex. Only do so if an existing entry conflicts with the leader’s

• If all entries match, follower must keep any additional log entries it has. Why?

36

Ex. 2: handling conflicts

Page 37: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• There are only three scenarios– Receive AppendEntries RPC from current leader– Start election– Grant vote to another peer

• Tempting to reset timers everywhere; why not?

37

Ex. 3: reset timers precisely!

Page 38: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• We must start a new election if our election timer fires, even if we were already a candidate in the middle of an election

• What can happen if we don’t?

38

Ex. 4: (re)start elections

Page 39: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• No matter what happens, if we receive a request with a higher term, convert to follower and update currentTerm

• Don’t forget to also change votedFor!

39

Ex. 5: abdicating the throne

Page 40: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• When checking whether a log is up to date, follow section 5.4! Checking length is insufficient

• If a step says ‘reply false’, return immediately and don’t execute subsequent steps

40

Ex. 6: when and when not to be lazy

Page 41: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• If the commitIndex (index of highest log entry known to be committed) is ever greater than lastApplied (index of highest log entry applied to state machine), apply!

• Don’t need to do right away, but should have dedicated way of handling so we don’t have multiple channels trying to apply the same entry

• P.S. – don’t forget to check commitIndex > lastApplied…

41

Ex. 7: applying log entries

Page 42: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• nextIndex is optimistic: assume that follower has all entries from previous interaction unless we received a negative response

• matchIndex is conservative: only update when we know a higher index log entry has been known to be replicated

42

Ex. 8: matchIndex vs. nextIndex

Page 43: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

• Yeah, don’t forget that

43

Ex. 9: ignore RPCs from old terms!

Page 44: Spanner: Google’s Globally- Distributed Database · 3 ACID properties of transactions • Atomicity: Either all constituent operations of the transaction complete successfully,

Monday lecture

Conflicting/concurrent writes in eventual/causal systems:

OT + CRDTs

(aka how Google Docs works)

44