Google Spanner

39
Spanner: Google’s Globally- Distributed Database Vaidas Brundza EMDC

description

Presentation for Advanced topics in Distributed Computing class @ KTH.

Transcript of Google Spanner

Page 1: Google Spanner

Spanner: Google’s Globally-Distributed Database

Vaidas Brundza EMDC

Page 2: Google Spanner

What is Spanner ?

2

o Globally distributed multi-version database General-purpose transactions (ACID) SQL-like query language Schematized semi-relational tables

o Currently running in production Storage for Google’s F1 adv. backend data Replaced a sharded MySQL database

Page 3: Google Spanner

Overview

3

o Lock-free distributed read transactions o Global external consistency of distributed

transactions

o Used technologies: concurrency control, replication, 2PC and 2PL

o The key technology: TrueTime service

Page 4: Google Spanner

Overview

3

o Lock-free distributed read transactions o Global external consistency of distributed

transactions Same as linearizability: if a transaction T1 commits before

another transaction T2 starts, then T1’s commit timestamp is smaller than T2’s.

o Used technologies: concurrency control, replication, 2PC and 2PL

o The key technology: TrueTime service

Page 5: Google Spanner

Spanner server organization

4

o A Spanner deployment is called an universe

o It have two singletons: the universe master and the placement driver

Page 6: Google Spanner

Spanner server organization

4

o A Spanner deployment is called an universe

o It have two singletons: the universe master and the placement driver

o Can have up to several thousands spanservers

Page 7: Google Spanner

Spanner server organization

4

o A Spanner deployment is called an universe

o It have two singletons: the universe master and the placement driver

o Can have up to several thousands spanservers

o Organized as a set of zones

Page 8: Google Spanner

Datacenter in US

Datacenter in Spain

Datacenter in Sweden

Datacenter in Russia

Serving data from multiple datacenters

5

Data from US

Data from Spain

Data from Sweden

Data from Russia

Get the complete data set

Block writes

Page 9: Google Spanner

Datacenter in US

Datacenter in Spain

Datacenter in Sweden

Datacenter in Russia

Serving data from multiple datacenters

5

Data from US

Data from Spain

Data from Sweden

Data from Russia

Get the complete data set

Page 10: Google Spanner

Serving data from multiple datacenters

5

Data from US

Data from Spain

Data from Sweden

Data from Russia

Get the complete data set

Page 11: Google Spanner

Serving data from multiple datacenters

5

Data from US

Data from Spain

Data from Sweden

Data from Russia

Get the complete data set

Page 12: Google Spanner

Transaction example

6

Tc

TP1

TP2

Page 13: Google Spanner

Transaction example

6

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Page 14: Google Spanner

Transaction example

6

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Page 15: Google Spanner

time

earliest latest

2*ɛ

TT.now()

True Time API

7

o Provides an absolute time denoted as „Global wall-clock time“.

o Has bounded uncertainty ɛ, which varies between 1 to 7 ms over each poll interval

o Values derived from the worst case local-clock drift scenario

Method Returns

TT.now() TTinterval: [earliest, latest]

TT.after(t) true if t has definitely passed

TT.before(t) true if t has definitely not arrived

Page 16: Google Spanner

time

earliest latest

2*ɛ

TT.now()

True Time API

7

o Provides an absolute time denoted as „Global wall-clock time“.

o Has bounded uncertainty ɛ, which varies between 1 to 7 ms over each poll interval

o Values derived from the worst case local-clock drift scenario

Method Returns

TT.now() TTinterval: [earliest, latest]

TT.after(t) true if t has definitely passed

TT.before(t) true if t has definitely not arrived

Page 17: Google Spanner

time

earliest latest

2*ɛ

TT.now()

True Time API

7

o Provides an absolute time denoted as „Global wall-clock time“.

o Has bounded uncertainty ɛ, which varies between 1 to 7 ms over each poll interval

o Values derived from the worst case local-clock drift scenario

o Magic number: 200 μs/s

Method Returns

TT.now() TTinterval: [earliest, latest]

TT.after(t) true if t has definitely passed

TT.before(t) true if t has definitely not arrived

Page 18: Google Spanner

True Time Architecture

8

GPS timemaster

GPS timemaster

GPS timemaster

Atomic-clock timemaster

Atomic-clock timemaster

GPS timemaster

Client

Datacenter 1 Datacenter 2 Datacenter n …

now = reference now + local-clock offset ɛ = reference ɛ + worst-case local-clock drift

Page 19: Google Spanner

True Time Architecture

8

GPS timemaster

GPS timemaster

GPS timemaster

Atomic-clock timemaster

Atomic-clock timemaster

GPS timemaster

Client

Datacenter 1 Datacenter 2 Datacenter n …

now = reference now + local-clock offset ɛ = reference ɛ + worst-case local-clock drift

Page 20: Google Spanner

Transaction example

9

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Page 21: Google Spanner

Transaction example

9

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging

Page 22: Google Spanner

Spanserver software stack

10

o Tablet implements mappings: (key, timestamp) -> string

Page 23: Google Spanner

Spanserver software stack

10

o Tablet implements mappings: (key, timestamp) -> string

o The Paxos state machine for replication support

o Writes initiate the Paxos protocol at the leader

o Focuses on long-lived transactions

Page 24: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging

Page 25: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Page 26: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Prepared + ts

Page 27: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Prepared + ts

Compute overall ts

Page 28: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Prepared + ts

Compute overall ts

Commit wait done

Page 29: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Prepared + ts

Compute overall ts

Commit wait done

Release locks

Page 30: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Prepared + ts

Compute overall ts

Commit wait done

Release locks

Committed Send overall ts

Page 31: Google Spanner

Transaction example

11

Tc

TP1

TP2

Acquired locks

Acquired locks

Acquired locks

Compute ts for each

Start logging Done logging

Prepared + ts

Compute overall ts

Commit wait done

Release locks

Release locks

Release locks

Committed Send overall ts

Page 32: Google Spanner

Additional uncovered bits

12

Supports atomic schema changes Non-blocking snapshot reads in the past How to read at the present time Paxos protocol restriction Does not support in-Paxos configuration changes

Page 33: Google Spanner

Evaluation: TrueTime uncertainty

13

Distribution of TrueTime ɛ values, sampled right after time-slave daemon polls the time masters

Page 34: Google Spanner

Evaluation: F1 study case

14

# fragments # directories

1 >100M

2-4 341

5-9 5336

10-14 232

15-99 34

100-500 7

operation

latency (ms)

count mean std dev

all reads 8,7 376,4 21,5B

single-site commit 72,3 112,8 31,2M

multi-site commit 103,0 52,2 32,1M

Page 35: Google Spanner

Evaluation: F1 study case

14

# fragments # directories

1 >100M

2-4 341

5-9 5336

10-14 232

15-99 34

100-500 7

operation

latency (ms)

count mean std dev

all reads 8,7 376,4 21,5B

single-site commit 72,3 112,8 31,2M

multi-site commit 103,0 52,2 32,1M

Distribution of directory-fragment counts

Page 36: Google Spanner

Evaluation: F1 study case

14

# fragments # directories

1 >100M

2-4 341

5-9 5336

10-14 232

15-99 34

100-500 7

operation

latency (ms)

count mean std dev

all reads 8,7 376,4 21,5B

single-site commit 72,3 112,8 31,2M

multi-site commit 103,0 52,2 32,1M

Distribution of directory-fragment counts

Perceived operation latencies (over 24 hour course)

Page 37: Google Spanner

Evaluation: Microbenchmarks

15

replicas

latency (ms) throughput (Kops/sec) write read-only

transactions snapshot read write read-only

transactions snapshot read

1D 9,4±0,6 - - 4,0±0,3 - -

1 14,4±1,0 1,4±0,1 1,3±0,1 4,1±0,05 10,9±0,4 13,5±0,1

3 13,9±0,6 1,3±0,1 1,2±0,1 2,2±0,5 13,8±3,2 38,5±0,3

5 14,4±0,4 1,4±0,05 1,3±0,04 2,8±0,3 25,3±5,2 50,0±1,1

Page 38: Google Spanner

Evaluation: Microbenchmarks

15

replicas

latency (ms) throughput (Kops/sec) write read-only

transactions snapshot read write read-only

transactions snapshot read

1D 9,4±0,6 - - 4,0±0,3 - -

1 14,4±1,0 1,4±0,1 1,3±0,1 4,1±0,05 10,9±0,4 13,5±0,1

3 13,9±0,6 1,3±0,1 1,2±0,1 2,2±0,5 13,8±3,2 38,5±0,3

5 14,4±0,4 1,4±0,05 1,3±0,04 2,8±0,3 25,3±5,2 50,0±1,1

participants

latency (ms)

mean 99th percentile

1 17,0±1,4 75,0±34,9

2 24,5±2,5 87,6±35,9

5 31,5±6,2 104,5±52,2

10 30,0±3,7 95,6±25,4

25 35,5±5,6 100,4±42,7

50 42,7±4,1 93,7±22,9

100 71,4±7,6 131,2±17,6

200 150,5±11,0 320,3±35,1

Page 39: Google Spanner

Conclusion

16

The first service to provide global externally consistent

multi-version database

Relies on novel time API (TrueTime)

Improvements introduced over previous services