External Consistency and Spanner

31
External Consistency and Spanner CS425/ECE428 — SPRING 2020 NIKITA BORISOV, UIUC

Transcript of External Consistency and Spanner

Page 1: External Consistency and Spanner

External Consistency and SpannerCS425/ECE428 — SPRING 2020

NIKITA BORISOV, UIUC

Page 2: External Consistency and Spanner

Transactions so farObjects distributed / partitioned among different servers

◦ For load balancing (sharding)◦ For separation of concerns / administration

Isolation enforced using two-phase locking (2PL)◦ Each server maintains locks on own objects◦ Deadlocks detected using e.g., edge-chasing

Atomic commit using 2PC◦ Prepare to commit ensures durability◦ Recover from coordinator and participant crashes

Page 3: External Consistency and Spanner

Dealing with FailuresNode failure

◦ Objects unavailable until recovery◦ 2PC “stuck” after coordinator failure

But! Node failure is common

Drive failures => no recovery!

Page 4: External Consistency and Spanner

ReplicationObjects distributed among 1000’s cluster nodes for load-balancing (sharding)

Objects replicated among a handful of nodes for availability / durability◦ Replication across data centers, too

Two-level operation:◦ Use transactions, coordinators, 2PC per object◦ Use Paxos / Raft among object replicas

Note: can be expensive!◦ Coordinator sends Prepare message to leaders of each replica group◦ Each leader uses Paxos / Raft to commit the Prepare to the group logs◦ Once commit succeeds, reply to coordinator◦ Coordinator uses Paxos / Raft to commit decision to its group log

Page 5: External Consistency and Spanner

Example transactionread A -> acquire read lock on A

read B -> acquire read lock on B

write A -> promote A’s lock to write lock

commit -> perform 2PC◦ Coordinator -> A, B: prepare◦ A, B -> OK◦ Coordinator -> A, B: commit

Page 6: External Consistency and Spanner

Read transactionsRead transactions often access many data items

◦ E.g., Facebook ”news feed”◦ E.g., Amazon front page◦ E.g., balances across all accounts

Read transactions still need (read) locks (Why?)

Acquiring locks requires consensus (Why?)

Locks prevent write transactions from moving forward

Page 7: External Consistency and Spanner

LinearizabilitySerial equivalence:

◦ Total effect on system is equivalent to a run that is serial and consistent with each client’s order

Linearizability◦ Total effect on system is equivalent to a run that is serial and consistent with actual order of events

E.g., buying a movie◦ Client makes RPC to bank transfers $3.99 to Amazon account◦ Client requests video from Amazon◦ Amazon makes RPC to bank, does not see transfer, rejects request!

Page 8: External Consistency and Spanner

Spanner: Google’sGlobally-Distributed Database

Wilson Hsieh representing a host of authors

OSDI 2012

Page 9: External Consistency and Spanner

What is Spanner?

• Distributed multiversion database• General-purpose transactions (ACID)• SQL query language• Schematized tables• Semi-relational data model

• Running in production• Storage for Google’s ad data• Replaced a sharded MySQL database

OSDI 2012 9

Page 10: External Consistency and Spanner

Example: Social Network

OSDI 2012

User postsFriend listsUser postsFriend listsUser postsFriend listsUser postsFriend lists

US

Brazil

RussiaSpain

San FranciscoSeattleArizona

Sao PauloSantiagoBuenos Aires

MoscowBerlinKrakow

LondonParisBerlinMadridLisbon

User postsFriend lists

10

x1000

x1000

x1000

x1000

Page 11: External Consistency and Spanner

Overview

• Feature: Lock-free distributed read transactions• Property: External consistency of distributed

transactions– First system at global scale

• Implementation: Integration of concurrency control, replication, and 2PC– Correctness and performance

• Enabling technology: TrueTime– Interval-based global time

OSDI 2012 11

Page 12: External Consistency and Spanner

Read Transactions

• Generate a page of friends’ recent posts– Consistent view of friend list and their posts

OSDI 2012

Why consistency matters1. Remove untrustworthy person X as friend2. Post P: “My government is repressive…”

12

Page 13: External Consistency and Spanner

User postsFriend listsUser postsFriend lists

Single Machine

Friend2 post

Generate my page

Friend1 post

Friend1000 postFriend999 post

Block writes

OSDI 2012

13

Page 14: External Consistency and Spanner

User postsFriend lists User postsFriend lists

Multiple Machines

User postsFriend lists

Generate my page

Friend2 postFriend1 post

Friend1000 postFriend999 post

User postsFriend lists

Block writes

OSDI 2012

14

Page 15: External Consistency and Spanner

User postsFriend lists

User postsFriend lists

User postsFriend lists

Multiple DatacentersUser postsFriend lists

Generate my page

Friend2 post

Friend1 post

Friend1000 post

Friend999 post

OSDI 2012

US

Spain

Russia

Brazil

15

x1000

x1000

x1000

x1000

Page 16: External Consistency and Spanner

Version Management

• Transactions that write use strict 2PL– Each transaction T is assigned a timestamp s– Data written by T is timestamped with s

OSDI 2012 16

Time 8<8

[X]

[me]

15

[P]My friendsMy postsX’s friends

[]

[]

Page 17: External Consistency and Spanner

Synchronizing Snapshots

==External Consistency:

Commit order respects global wall-time order

OSDI 2012 17

==Timestamp order respects global wall-time order

giventimestamp order == commit order

Global wall-clock time

Page 18: External Consistency and Spanner

Timestamps, Global Clock

• Strict two-phase locking for write transactions• Assign timestamp while locks are held

T

Pick s = now()

Acquired locks Release locks

OSDI 2012 18

Page 19: External Consistency and Spanner

Timestamp Invariants

OSDI 2012 19

• Timestamp order == commit order

• Timestamp order respects global wall-time order

T2

T3

T4

T1

Page 20: External Consistency and Spanner

TrueTime

• “Global wall-clock time” with bounded uncertainty

time

earliest latest

TT.now()

2*ε

OSDI 2012 20

Page 21: External Consistency and Spanner

Timestamps and TrueTime

T

Pick s = TT.now().latest

Acquired locks Release locks

Wait until TT.now().earliest > ss

OSDI 2012

average ε

Commit wait

average ε

21

Page 22: External Consistency and Spanner

Commit Wait and Replication

OSDI 2012

T

Acquired locks Release locks

Start consensus Notify slaves

Commit wait donePick s

22

Achieve consensus

Page 23: External Consistency and Spanner

Commit Wait and 2-Phase Commit

OSDI 2012

TC

Acquired locks Release locks

TP1

Acquired locks Release locks

TP2

Acquired locks Release locks

Notify participants of s

Commit wait doneCompute s for each

23

Start logging Done logging

Prepared

Compute overall s

Committed

Send s

Page 24: External Consistency and Spanner

Example

OSDI 2012 24

TP

Remove X from my friend list

Remove myself from X’s friend list

sC=6

sP=8

s=8 s=15

Risky post P

s=8

Time <8

[X]

[me]

15

TC T2

[P]My friendsMy postsX’s friends

8

[]

[]

Page 25: External Consistency and Spanner

What Have We Covered?

• Lock-free read transactions across datacenters• External consistency• Timestamp assignment• TrueTime– Uncertainty in time can be waited out

OSDI 2012 25

Page 26: External Consistency and Spanner

What Haven’t We Covered?

• How to read at the present time• Atomic schema changes–Mostly non-blocking– Commit in the future

• Non-blocking reads in the past– At any sufficiently up-to-date replica

OSDI 2012 26

Page 27: External Consistency and Spanner

TrueTime Architecture

Datacenter 1 Datacenter n…Datacenter 2

GPS timemaster

GPS timemaster

GPS timemaster

Atomic-clock timemaster

GPS timemaster

Client

OSDI 2012 27

GPS timemaster

Compute reference [earliest, latest] = now ± ε

Page 28: External Consistency and Spanner

TrueTime implementation

time

ε

0sec 30sec 60sec 90sec

+6ms

now = reference now + local-clock offsetε = reference ε + worst-case local-clock drift

referenceuncertainty

OSDI 2012 28

200 μs/sec

Page 29: External Consistency and Spanner

What If a Clock Goes Rogue?

• Timestamp assignment would violate external consistency• Empirically unlikely based on 1 year of data– Bad CPUs 6 times more likely than bad clocks

OSDI 2012 29

Page 30: External Consistency and Spanner

Network-Induced Uncertainty

OSDI 2012

Mar 29 Mar 30 Mar 31 Apr 1Date

2

4

6

8

10

Epsil

on (m

s)

99.99990

6AM 8AM 10AM 12PMDate (April 13)

1

2

3

4

5

6

30

Page 31: External Consistency and Spanner

Conclusions

• Reify clock uncertainty in time APIs– Known unknowns are better than unknown

unknowns– Rethink algorithms to make use of uncertainty

• Stronger semantics are achievable– Greater scale != weaker semantics

OSDI 2012 33