Google Spanner: our understanding of concepts and implications


Our understanding of Google Spanner and its implications. Spanner is a global-scale distributed database from Google.

Transcript of Google Spanner: our understanding of concepts and implications

Google Spanner: our understanding of concepts and implications

Harisankar H, DOS lab weekly seminar

8/Dec/2012, http://harisankarh.wordpress.com

"Google Spanner: our understanding of concepts and implications" by Harisankar H is licensed under a Creative Commons Attribution 3.0 Unported License.

Outline

• Spanner

– User perspective

• User = application programmer/administrator

– System architecture

– Implications

Spanner: user perspective

• Global-scale database with strict transactional guarantees
  – Global scale
    • Designed to work across datacenters in different continents
    • Claim: “designed to scale up to millions of nodes, hundreds of datacenters, trillions of database rows”
  – Strict transactional guarantees
    • Supports general transactions (even inter-row)
    • Stronger properties than serializability*
  – Replaced the MySQL cluster storing their critical ad-related data
    • Reliable even during wide-area natural disasters
  – Supports hierarchical schema of tables
    • Semi-relational
  – Supports SQL-like query and definition language
  – User-defined locality and availability

* means: explained in later slides

Need for Spanner

• Limitations of existing systems
  – BigTable (could apply to NoSQL systems in general)
    • Needed complex, evolving schemas
    • Only eventual consistency across data centers
      – Needed wide-area replication with strong consistency
    • Transactional scope limited to a single row
      – Needed general cross-row transactions
  – Megastore (relational DB-like system)
    • Low performance
      – Layered on top of BigTable
        » High communication costs
      – Less efficient replica consistency algorithms*
    • Better transactional guarantees in Spanner*

Spanner: transactional guarantee

• External consistency
  – Stricter than serializability
  – E.g.: [Figure: transactions T1, T2 and T3 shown on a physical-time axis, together with several candidate serial orderings; external consistency requires that T2 come after T1 in the serial ordering]

External consistency: motivation

• Facebook-like example from OSDI talk

[Figure: physical-time axis showing T1: Jerry unfriends Tom, T2: Jerry posts a comment (both by Jerry), then T3: Tom views Jerry’s profile (by Tom)]

Jerry unfriends Tom in order to write a controversial comment.

If the serial order nevertheless places T2 before T1 (which serializability alone permits), Tom’s view (T3) can see the comment without the unfriend, and Jerry will be in trouble!

Formally, “If commit of T1 preceded the initiation of a new transaction T2 in wall-clock (physical) time, then commit of T1 should precede commit of T2 in the serial ordering also.”
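As a rough illustration (not from the paper), the sketch below checks this condition against a transaction history; the transaction records and timestamps are made up for the Jerry/Tom example.

```python
# Minimal sketch (not Spanner code): checking external consistency of a
# transaction history. Each transaction records wall-clock start and commit
# times plus its position in the chosen serial order.

def externally_consistent(txns):
    """txns: list of dicts with 'start', 'commit' (wall-clock times) and
    'serial_pos' (position in the serial ordering)."""
    for t1 in txns:
        for t2 in txns:
            # If T1 committed before T2 even started (in physical time),
            # T1 must also come first in the serial ordering.
            if t1["commit"] < t2["start"] and t1["serial_pos"] > t2["serial_pos"]:
                return False
    return True

# Jerry/Tom example: T1 (unfriend) commits before T2 (post) starts,
# so any serial order placing T2 before T1 violates external consistency.
history = [
    {"name": "T1_unfriend", "start": 1.0, "commit": 2.0, "serial_pos": 2},
    {"name": "T2_post",     "start": 3.0, "commit": 4.0, "serial_pos": 1},
    {"name": "T3_view",     "start": 5.0, "commit": 6.0, "serial_pos": 3},
]
print(externally_consistent(history))  # False
```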

Spanner: transactional guarantee

• Additional (weaker) transaction modes for performance
  – Read-only transactions supporting snapshot isolation
    • Snapshot isolation
      – Transactions read a consistent snapshot of the database
      – Values written should not have conflicting updates after the snapshot was read
      – E.g., R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X) is allowed
      – Weaker than serializability, but more efficient (lock-free)
      – Spanner does not allow writes in these transactions
        » Probably, that is how they preserve isolation
  – Snapshot read
    • Read of a consistent state of the database in the past
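A minimal sketch, assuming a first-committer-wins rule, of the conflict check behind snapshot isolation; the transaction records below follow the slide’s R1/W1/R2/W2 example.

```python
# Minimal sketch (not Spanner's implementation): first-committer-wins check
# for snapshot isolation. A transaction may commit only if no other
# transaction has committed a write to the same keys since its snapshot.

def can_commit(txn, committed):
    """txn: dict with 'snapshot_ts' and 'writes' (set of keys).
    committed: list of dicts with 'commit_ts' and 'writes'."""
    for other in committed:
        if other["commit_ts"] > txn["snapshot_ts"] and (other["writes"] & txn["writes"]):
            return False  # conflicting update after our snapshot was taken
    return True

# The slide's example: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X).
# T2 writes Y and T1 writes X: the write sets do not overlap, so both commit.
t2 = {"snapshot_ts": 1, "writes": {"Y"}, "commit_ts": 5}
t1 = {"snapshot_ts": 1, "writes": {"X"}}
print(can_commit(t1, committed=[t2]))  # True: allowed under snapshot isolation
```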

Hierarchical data model

• Universes (Spanner deployments)
  – Databases (collections of tables)
    • Tables with schemas
      – Ordered rows, columns
      – One or more primary-key columns
        » Rows named using their primary keys
    • Hierarchies of tables
      – Directory tables (top of the table hierarchy)
• Directories
  – Each row in a directory table (with key K), along with the rows in descendant tables whose keys start with K, forms a directory

Fig. (a): hierarchical data model (Figures (a), (b) from the Spanner OSDI 2012 paper)
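A small sketch with hypothetical table and row names (Users, Albums), showing how rows that share a directory-table key K group into one directory.

```python
# Minimal sketch (hypothetical table/row names): grouping rows into Spanner
# directories. Rows in a descendant table share the directory-table key K
# as a prefix of their own primary key.
from collections import defaultdict

# Directory table: Users(uid); descendant table: Albums(uid, aid).
users  = [("tom",), ("jerry",)]
albums = [("tom", 1), ("tom", 2), ("jerry", 1)]

directories = defaultdict(list)
for row in users:
    directories[row[0]].append(("Users", row))
for row in albums:
    directories[row[0]].append(("Albums", row))   # key starts with uid

# Each entry is one directory: the Users row plus descendant rows prefixed by K.
for k, rows in directories.items():
    print(k, rows)
```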

User perspective: database configuration

• Database placement and reliability
  – Administrator:
    • Creates options which specify the number of replicas and their placement
      – E.g., option (a): North America: 5 replicas, Europe: 3 replicas
        option (b): Latin America: 3 replicas …
  – Application:
    • Directory is the smallest unit for which these properties can be specified
    • Tags each directory or database with one of these options
      – E.g., TomDir1: option (b)
        JerryDir3: option (a) ….
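A sketch of how these options and tags might be represented; the option and directory names are the illustrative ones from the slide, not real Spanner identifiers.

```python
# Minimal sketch: an administrator defines placement options (replica counts
# per region) and the application tags directories with one of them.

placement_options = {
    "option_a": {"North America": 5, "Europe": 3},   # replicas per region
    "option_b": {"Latin America": 3},
}

# A directory is the smallest unit of placement: each directory gets one option.
directory_tags = {
    "TomDir1":   "option_b",
    "JerryDir3": "option_a",
}

for directory, option in directory_tags.items():
    print(directory, "->", placement_options[option])
```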

Next: System architecture

Spanner architecture: basics

• Replica consistency
  – Using the Paxos protocol
    • Different Paxos groups for different sets of directories
      – Can be across data centers
• Concurrency control
  – Using two-phase locking
    • Chosen over optimistic methods because of long-lived transactions (on the order of minutes)
• Transaction coordination
  – Two-phase commit
    • Two-phase commit on top of Paxos ensures availability
• Timestamps for transactions and data items
  – To support snapshot isolation and snapshot reads
  – Multiple timestamped versions of data items are maintained
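A minimal sketch (not Spanner code) of keeping multiple timestamped versions per data item, so that a snapshot read can return a consistent past state.

```python
# Minimal sketch: a multi-versioned store in which every write is kept with
# its timestamp, and snapshot reads return the latest version at or before
# the requested timestamp.
import bisect

class VersionedStore:
    def __init__(self):
        self.versions = {}            # key -> sorted list of (timestamp, value)

    def write(self, key, value, ts):
        self.versions.setdefault(key, [])
        bisect.insort(self.versions[key], (ts, value))

    def snapshot_read(self, key, ts):
        """Return the latest value written at or before timestamp ts."""
        versions = self.versions.get(key, [])
        timestamps = [t for t, _ in versions]
        i = bisect.bisect_right(timestamps, ts)
        return versions[i - 1][1] if i else None

store = VersionedStore()
store.write("balance", 100, ts=10)
store.write("balance", 250, ts=20)
print(store.snapshot_read("balance", ts=15))  # 100: a consistent past state
```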

Spanner components

• Zone 1 (physical location)
  – Span servers (data)
  – Zone master (assigns data)
  – Location proxies (locate data)
• Zone 2 (physical location)
  – Span servers (data)
  – Zone master (assigns data)
  – Location proxies (locate data)
• …
• TrueTime service*
• Universe master (status + interactive debugging)
• Placement driver (moves data across zones automatically)

(zones and the global components are connected over the network)

Zones, directories and Paxos groups

Fig. (b): zones, directories and Paxos groups (Figures (a), (b) from the Spanner OSDI 2012 paper)

Replication-related components

• Tablet: unit of storage
  – Bag of directories
  – Abstraction on top of the underlying DFS, Colossus
• Single Paxos state machine (replica) per tablet
• Replicas of each tablet form a Paxos group
• A leader is elected within each Paxos group

[Figure: a Paxos group formed by tablet replicas holding the same directories, e.g., one replica on DC1, n2 and another on DC2, n8; one of the replicas is the Paxos leader]
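A rough data-structure sketch, with illustrative datacenter/node names, of tablet replicas and the Paxos group they form; the leader election is only a placeholder.

```python
# Minimal sketch: a tablet holds a bag of directories, each tablet replica
# lives in one zone, and the replicas of a tablet form a Paxos group with
# one elected leader.
from dataclasses import dataclass, field

@dataclass
class TabletReplica:
    location: str                          # e.g. "DC1/n2" (illustrative)
    directories: set = field(default_factory=set)

@dataclass
class PaxosGroup:
    replicas: list                         # replicas of the same tablet
    leader: TabletReplica = None

    def elect_leader(self):
        # Placeholder for Paxos leader election (Spanner uses timed leases).
        self.leader = self.replicas[0]

group = PaxosGroup(replicas=[
    TabletReplica("DC1/n2", {"TomDir1", "JerryDir3"}),
    TabletReplica("DC2/n8", {"TomDir1", "JerryDir3"}),
])
group.elect_leader()
print(group.leader.location)
```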

Transaction-related components

[Figure: transaction T5 spans two Paxos groups of tablet replicas. One group acts as coordinator and the other as participant; the coordinator group’s Paxos leader is the coordinator leader (runs 2PC + 2PL), the participant group’s Paxos leader is the participant leader, and the remaining replicas act as coordinator/participant slaves]
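A minimal sketch (not Spanner’s actual protocol code) of two-phase commit driven by the coordinator leader across the involved Paxos-group leaders; each leader object here stands in for a whole Paxos group, so prepare/commit records are assumed to be replicated through Paxos before the call returns.

```python
# Minimal sketch: 2PC across Paxos-group leaders for a transaction like T5.

class GroupLeader:
    def __init__(self, name):
        self.name = name
    def prepare(self, txn_id):
        # Acquire locks (2PL) and replicate a prepare record via Paxos.
        print(f"{self.name}: prepared {txn_id}")
        return True
    def commit(self, txn_id, commit_ts):
        # Replicate the commit record via Paxos and release locks.
        print(f"{self.name}: committed {txn_id} at {commit_ts}")
    def abort(self, txn_id):
        print(f"{self.name}: aborted {txn_id}")

def two_phase_commit(coordinator, participants, txn_id, commit_ts):
    leaders = [coordinator] + participants
    if all(leader.prepare(txn_id) for leader in leaders):      # phase 1
        for leader in leaders:                                  # phase 2
            leader.commit(txn_id, commit_ts)
        return True
    for leader in leaders:
        leader.abort(txn_id)
    return False

two_phase_commit(GroupLeader("coordinator-leader"),
                 [GroupLeader("participant-leader")],
                 txn_id="T5", commit_ts=42)
```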

Next:

• Serializability is ensured by the components already explained
• External consistency is implemented with the help of the TrueTime service
  – The TrueTime service is also used for leader election using timed leases
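A sketch of the commit-wait idea built on the TrueTime interval API described in the OSDI paper (TT.now() returns an uncertainty interval); the uncertainty bound below is simulated with a fixed value rather than GPS and atomic clocks.

```python
# Minimal sketch of commit wait: pick the commit timestamp at the top of the
# TrueTime uncertainty interval and wait until the clock is certainly past it
# before making the commit visible.
import time

EPSILON = 0.005  # assumed clock uncertainty in seconds (illustrative)

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)       # (earliest, latest)

def commit(txn_id):
    s = tt_now()[1]                          # commit timestamp: TT.now().latest
    while tt_now()[0] <= s:                  # commit wait: until TT.after(s)
        time.sleep(0.001)
    print(f"{txn_id} visible at timestamp {s}")
    return s

commit("T1")
```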

TrueTime + transaction implementation

[by Aditya]

Implications of Spanner

[REMOVED]

Thank you

• Image credits: Figures (a), (b) from the Spanner OSDI 2012 paper