Google Spanner : our understanding of concepts and implications


description

Our understanding of Google Spanner and its implications. Spanner is a global-scale distributed database from Google.

Transcript of Google Spanner : our understanding of concepts and implications

Page 1: Google Spanner : our understanding of concepts and implications

Google Spanner: our understanding of concepts and implications

Harisankar H, DOS lab weekly seminar

8/Dec/2012, http://harisankarh.wordpress.com

"Google Spanner: our understanding of concepts and implications" by Harisankar H is licensed under a Creative Commons Attribution 3.0 Unported License.

Page 2: Google Spanner : our understanding of concepts and implications

Outline

• Spanner

– User perspective

• User = application programmer/administrator

– System architecture

– Implications

Page 3: Google Spanner : our understanding of concepts and implications

Spanner: user perspective

• Global-scale database with strict transactional guarantees
  – Global scale
    • Designed to work across datacenters in different continents
    • Claim: "designed to scale up to millions of nodes, hundreds of datacenters, trillions of database rows"
  – Strict transactional guarantees
    • Supports general transactions (even inter-row)
    • Stronger properties than serializability*
  – Replaced the MySQL cluster storing their critical ad-related data
    • Reliable even during wide-area natural disasters
  – Supports a hierarchical schema of tables
    • Semi-relational
  – Supports an SQL-like query and definition language
  – User-defined locality and availability

* means: explained in later slides

Page 4: Google Spanner : our understanding of concepts and implications

Need for Spanner

• Limitations of existing systems
  – BigTable (could apply to NoSQL systems in general)
    • Needed complex, evolving schemas
    • Only eventual consistency across data centers
      – Needed wide-area replication with strong consistency
    • Transactional scope limited to a single row
      – Needed general cross-row transactions
  – Megastore (relational-DB-like system)
    • Low performance
      – Layered on top of BigTable
        » High communication costs
      – Less efficient replica consistency algorithms*
    • Better transactional guarantees in Spanner*

Page 5: Google Spanner : our understanding of concepts and implications

Spanner: transactional guarantee

• External consistency

– Stricter than serializability

– E.g.,

[Figure: transactions T1, T2 and T3 laid out on a physical-time axis, with candidate serial orderings; because T1 commits before T2 starts, any externally consistent serial ordering must place T2 after T1]

Page 6: Google Spanner : our understanding of concepts and implications

External consistency: motivation

• Facebook-like example from OSDI talk

– Jerry unfriends Tom in order to post a controversial comment
– In physical time: T1 (Jerry unfriends Tom), then T2 (Jerry posts the comment), then T3 (Tom views Jerry's profile)

[Figure: the three transactions shown in that order on a physical-time axis]

If serializability alone allowed a serial order such as T2, T3, T1 (reordering the unfriend after the other two), Tom would still see the comment and Jerry would be in trouble!

Formally: "If the commit of T1 precedes the start of a new transaction T2 in wall-clock (physical) time, then the commit of T1 should also precede the commit of T2 in the serial ordering."
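A tiny illustration of this rule (not from the slides; the wall-clock times below are invented), written as a Python sketch that checks whether a proposed serial order respects it:

```python
# Hypothetical (start, commit) wall-clock times for the three transactions above.
txns = {
    "T1_unfriend": (1.0, 2.0),   # Jerry unfriends Tom
    "T2_post":     (3.0, 4.0),   # Jerry posts the comment (starts after T1 commits)
    "T3_view":     (5.0, 6.0),   # Tom views Jerry's profile
}

def externally_consistent(serial_order, txns):
    """T_a must precede T_b in the serial order whenever
    commit(T_a) < start(T_b) in wall-clock time."""
    pos = {t: i for i, t in enumerate(serial_order)}
    for a, (_, commit_a) in txns.items():
        for b, (start_b, _) in txns.items():
            if a != b and commit_a < start_b and pos[a] > pos[b]:
                return False
    return True

print(externally_consistent(["T1_unfriend", "T2_post", "T3_view"], txns))  # True
print(externally_consistent(["T2_post", "T3_view", "T1_unfriend"], txns))  # False: Tom would see the comment
```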

Page 7: Google Spanner : our understanding of concepts and implications

Spanner: transactional guarantee

• Additional (weaker) transaction modes for performance
  – Read-only transactions supporting snapshot isolation
    • Snapshot isolation
      – Transactions read a consistent snapshot of the database
      – Values written should not have conflicting updates after the snapshot was read
      – E.g., R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X) is allowed
      – Weaker than serializability, but more efficient (lock-free)
      – Spanner does not allow writes in these transactions
        » Probably, that is how they preserve isolation
  – Snapshot read
    • Read of a consistent state of the database in the past
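As a rough sketch of the underlying idea (a toy model, not Spanner's actual code), a multi-versioned store can serve lock-free snapshot reads by returning, for each key, the newest version whose commit timestamp is at or before the snapshot timestamp:

```python
class MultiVersionStore:
    """Toy multi-version store: each key maps to a list of
    (commit_timestamp, value) pairs kept in timestamp order."""
    def __init__(self):
        self.versions = {}

    def write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort(key=lambda tv: tv[0])

    def snapshot_read(self, key, snapshot_ts):
        # Newest value committed at or before the snapshot timestamp.
        older = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return older[-1] if older else None

store = MultiVersionStore()
store.write("balance", 100, commit_ts=10)
store.write("balance", 80, commit_ts=20)
print(store.snapshot_read("balance", snapshot_ts=15))  # 100: a consistent past state
print(store.snapshot_read("balance", snapshot_ts=25))  # 80: the latest committed value
```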

Page 8: Google Spanner : our understanding of concepts and implications

Hierarchical data model

• Universes (Spanner deployment)
  – Databases (collection of tables)
    • Tables with schemas
      – Ordered rows and columns
      – One or more primary-key columns
        » Rows are named by their primary keys
    • Hierarchies of tables
      – Directory tables (top of the table hierarchy)
• Directories
  – Each row in a directory table (with key K), along with the rows in descendant tables whose keys start with K, forms a directory

Fig: (a)

Figures (a),(b) from Spanner, OSDI 2012 paper
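To make the directory notion concrete, here is a small illustrative Python sketch (the table and key names are invented for this example) that groups the rows of a directory table and a descendant table into directories by shared key prefix:

```python
# Hypothetical tables: a directory table "Users" keyed by (uid,)
# and a descendant table "Albums" keyed by (uid, aid).
users  = {("tom",):   {"email": "tom@example.com"},
          ("jerry",): {"email": "jerry@example.com"}}
albums = {("tom", 1):   {"name": "holiday"},
          ("tom", 2):   {"name": "pets"},
          ("jerry", 1): {"name": "work"}}

def build_directories(directory_table, descendant_tables):
    """Group each directory-table row with the descendant rows sharing its key prefix."""
    dirs = {}
    for prefix, root_row in directory_table.items():
        rows = [(prefix, root_row)]
        for table in descendant_tables:
            rows += [(k, r) for k, r in table.items() if k[:len(prefix)] == prefix]
        dirs[prefix] = rows
    return dirs

for prefix, rows in build_directories(users, [albums]).items():
    print(prefix, "->", len(rows), "rows in this directory")
```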

Page 9: Google Spanner : our understanding of concepts and implications

User perspective: database configuration

• Database placement and reliability
  – Administrator:
    • Creates options which specify the number of replicas and their placement
      – E.g., option (a): North America: 5 replicas, Europe: 3 replicas
        option (b): Latin America: 3 replicas …
  – Application:
    • A directory is the smallest unit for which these properties can be specified
    • Tags each directory or database with one of these options
      – E.g., TomDir1: option (b)
        JerryDir3: option (a) ….
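A minimal sketch of how these options might be represented and attached to directories (the option contents follow the examples above; everything else is illustrative):

```python
# Replication/placement options defined by the administrator (illustrative values).
options = {
    "option_a": {"North America": 5, "Europe": 3},   # region -> number of replicas
    "option_b": {"Latin America": 3},
}

# The application tags directories (the smallest placement unit) with an option.
directory_options = {
    "TomDir1": "option_b",
    "JerryDir3": "option_a",
}

def placement_for(directory):
    """Resolve the replica placement that applies to a directory."""
    return options[directory_options[directory]]

print(placement_for("JerryDir3"))  # {'North America': 5, 'Europe': 3}
```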

Next: System architecture

Page 10: Google Spanner : our understanding of concepts and implications

Spanner architecture: basics

• Replica consistency
  – Using the Paxos protocol
    • Different Paxos groups for different sets of directories
      – Can span data centers
• Concurrency control
  – Using two-phase locking
    • Chosen over optimistic methods because of long-lived transactions (on the order of minutes)
• Transaction coordination
  – Two-phase commit
    • Two-phase commit on top of Paxos keeps it available (each participant is itself a replicated Paxos group)
• Timestamps for transactions and data items
  – To support snapshot isolation and snapshot reads
  – Multiple timestamped versions of data items are maintained
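As a very rough sketch of how these pieces fit together (heavily simplified, with invented names and no real Paxos, logging, or lock waiting), a read-write transaction takes locks at each involved group's leader and then runs two-phase commit across the leaders:

```python
class PaxosGroupLeader:
    """Toy stand-in for a Paxos group leader: holds locks and data for its directories."""
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.data = {}

    def acquire_locks(self, keys):
        # Two-phase locking, growing phase (a real system would wait or wound on conflict).
        if any(k in self.locks for k in keys):
            return False
        self.locks.update(keys)
        return True

    def prepare(self):
        # 2PC phase 1; in Spanner the prepare record would be replicated via Paxos.
        return True

    def commit(self, writes, commit_ts):
        # 2PC phase 2; writes become timestamped versions, then locks are released.
        for k, v in writes.items():
            self.data.setdefault(k, []).append((commit_ts, v))
        self.locks.clear()

def run_transaction(writes_per_leader, commit_ts):
    leaders = list(writes_per_leader)
    if not all(l.acquire_locks(w.keys()) for l, w in writes_per_leader.items()):
        return "aborted"
    if all(l.prepare() for l in leaders):
        for l, w in writes_per_leader.items():
            l.commit(w, commit_ts)
        return "committed"
    return "aborted"

g1, g2 = PaxosGroupLeader("group1"), PaxosGroupLeader("group2")
print(run_transaction({g1: {"x": 1}, g2: {"y": 2}}, commit_ts=42))  # committed
```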

Page 11: Google Spanner : our understanding of concepts and implications

Spanner components

• Universe master: status + interactive debugging
• Placement driver: moves data across zones automatically
• TrueTime service*
• Zones (physical locations, e.g., Zone 1, Zone 2, …), connected over the network; each zone contains:
  – Zone master: assigns data
  – Location proxies: locate data
  – Span servers: store and serve the data

Page 12: Google Spanner : our understanding of concepts and implications

Zones, directories and Paxos groups

Fig: (b) [Figures (a), (b) from the Spanner, OSDI 2012 paper]

Page 13: Google Spanner : our understanding of concepts and implications

Replication-related components

• Tablet: unit of storage

– Bag of directories

– Abstraction on top of underlying DFS Colossus

• Single Paxos state machine(replica) per tablet

• Replicas of each tablet form a Paxos group

• A leader is elected within each Paxos group

[Figure: replicas of a tablet (e.g., one at DC1, node n2 and one at DC2, node n8) hold the same set of directories and form a Paxos group, with one replica acting as Paxos leader]
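A toy data-structure sketch of the picture in that figure (the names and the trivial "election" are illustrative only):

```python
from dataclasses import dataclass, field

@dataclass
class TabletReplica:
    """One replica of a tablet, stored at a particular datacenter/node."""
    location: str                         # e.g. "DC1/n2" (illustrative)
    directories: set = field(default_factory=set)

@dataclass
class PaxosGroup:
    """Replicas of the same tablet form a Paxos group; one of them is the leader."""
    replicas: list
    leader: TabletReplica = None

    def elect_leader(self):
        # Stand-in for Paxos-based leader election (with timed leases in Spanner).
        self.leader = self.replicas[0]
        return self.leader

group = PaxosGroup(replicas=[TabletReplica("DC1/n2", {"TomDir1"}),
                             TabletReplica("DC2/n8", {"TomDir1"})])
print(group.elect_leader().location)  # "DC1/n2"
```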

Page 14: Google Spanner : our understanding of concepts and implications

Transaction-related components

[Figure: a transaction (T5) spanning two Paxos groups. One group acts as coordinator: its Paxos leader is the coordinator leader (running 2PC + 2PL) and its other replicas are coordinator slaves. The other group is a participant: its Paxos leader is the participant leader and its other replicas are participant slaves.]

Page 15: Google Spanner : our understanding of concepts and implications

Next:

• Serializability is ensured by the components explained so far
• External consistency is implemented with the help of the TrueTime service
  – The TrueTime service is also used for leader election via timed leases
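A hedged sketch of the timed-lease idea (the numbers and the uncertainty bound are made up; Spanner actually extends leases through Paxos and uses TrueTime's interval API):

```python
import time

class TimedLease:
    """Toy leader lease: a replica acts as leader only while its lease is
    known not to have expired, even allowing for clock uncertainty."""
    def __init__(self, duration_s=10.0, clock_uncertainty_s=0.007):
        self.duration_s = duration_s
        self.uncertainty_s = clock_uncertainty_s   # stand-in for TrueTime's epsilon
        self.expires_at = 0.0

    def renew(self):
        # In Spanner the lease extension would itself be agreed on via Paxos.
        self.expires_at = time.time() + self.duration_s

    def is_leader(self):
        # Conservative check: even the latest possible current time must be
        # strictly before the lease expiry.
        latest_now = time.time() + self.uncertainty_s
        return latest_now < self.expires_at

lease = TimedLease()
lease.renew()
print(lease.is_leader())  # True right after renewal
```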

Page 16: Google Spanner : our understanding of concepts and implications

TrueTime + transaction implementation

[by Aditya]

Page 17: Google Spanner : our understanding of concepts and implications

Implications of Spanner

[REMOVED]

Page 18: Google Spanner : our understanding of concepts and implications

Thank you

• Image credits
  – Figures (a), (b) from the Spanner, OSDI 2012 paper