Noha mega store

MegaStore Google Inc.

Presented by: Noha Elprince

22 June, 2011

Jason Baker, Chris Bond, James C Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh. CIDR 2011.

What is MegaStore?

§  A storage system developed to meet the requirements of today’s online interactive services.

§  Megastore is the data engine supporting Google App Engine (GAE): https://appengine.google.com/

§  GAE cloud computing technology:
   Ø  Hosts/virtualizes web apps across multiple servers on Google’s platform.
   Ø  Fast development and deployment.
   Ø  Simple administration.
   Ø  No need to worry about hardware, patches, backups, or scalability.


Outline

•  Motivation & Problem

•  Methodology

•  Design of Megastore
   Ø  Data Model
   Ø  Data Storage
   Ø  Transactions and Concurrency Control

•  How Megastore achieves Availability and Scalability
   Ø  Paxos
   Ø  Megastore’s approach

•  Experience

•  Related Work

•  Conclusion


Megastore: Motivation

•  Storage requirements of today’s interactive online applications:
   Ø  Highly scalable
   Ø  Rapid development
   Ø  Low latency
   Ø  Durability and consistency
   Ø  Availability and fault tolerance

•  These requirements are in conflict!


CAP Theorem – Eric Brewer 2000

“In a distributed database system, you can only have at most two of the following three characteristics:

Ø  Consistency

Ø  Availability

Ø  Partition tolerance”

ACID = Atomicity, Consistency, Isolation, Durability.


Problem

§  Trade-offs between existing systems:

•  RDBMS: rich set of features and an expressive language that helps development, but difficult to scale. E.g., MySQL, PostgreSQL, MS SQL Server, Oracle RDB.

•  NoSQL datastores: highly scalable, but with limited APIs and loose consistency models. E.g., Google’s Bigtable, Apache HBase, Facebook’s Cassandra.

§  The reliability of a single datacenter can’t be guaranteed 100%.

[“Always expect the unexpected”—James Patterson]


Methodology

•  Megastore blends the scalability of NoSQL with the convenience of a traditional RDBMS.

•  High reliability is achieved by:
   Ø  Keeping data in multiple datacenters.
   Ø  Writing to a majority of datacenters synchronously.
   Ø  Letting the infrastructure decide which datacenter to read from and write to.




Design of Megastore: Data Model

•  The data model is declared in a schema.

•  Each schema has a set of tables: root tables or child tables.

•  Entity group: a root entity along with all of its child entities.


CREATE SCHEMA PhotoApp;

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id),
  ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required int64 time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;


Design of Megastore: Data Model

•  (Hierarchical) data is denormalized to eliminate join costs.

•  Joins are implemented at the application level.

•  Outer joins are implemented with parallel queries using secondary indexes.

•  This provides an efficient stand-in for SQL-style joins.

How is it stored in Bigtable?


“A Bigtable is a compressed, high performance, and proprietary database system built on Google File System (GFS), Chubby Lock Service, and other Google programs.”

Design of Megastore : Data Storage

Example:

User { user_id: 101, name: 'John' }

Photo { user_id: 101, photo_id: 501, time: 2009, full_url: 'john-pic1', tag: 'vacation', tag: 'holiday', tag: 'Paris' }

Photo { user_id: 101, photo_id: 502, time: 2010, full_url: 'john-pic2', tag: 'office', tag: 'friends', tag: 'pub' }


Design of Megastore: Data Storage

User { user_id: 102, name: 'Mary' }

Photo { user_id: 102, photo_id: 600, time: 2009, full_url: 'mary-pic1', tag: 'office', tag: 'picnic', tag: 'Paris' }

Photo { user_id: 102, photo_id: 601, time: 2011, full_url: 'mary-pic2', tag: 'birthday', tag: 'friends' }

Row Key   | User.name | Photo.time | Photo.tag                | Photo URL
101       | John      |            |                          |
101, 501  |           | 2009       | vacation, holiday, Paris | …
101, 502  |           | 2010       | office, friends, pub     | …
102       | Mary      |            |                          |
102, 600  |           | 2009       | office, picnic, Paris    | …
102, 601  |           | 2011       | birthday, friends        | …

Indexing

•  Local index: finds data within an entity group.
   CREATE LOCAL INDEX PhotosByTime ON Photo(user_id, time);

•  Global index: spans entity groups.
   CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbnail_url);

•  The STORING clause copies the listed properties into the index entries, giving faster retrieval of those properties.


How is it stored in Bigtable?

PhotosByTime (local index)

Row Key
101, 2009, 101, 501
101, 2010, 101, 502
102, 2009, 102, 600
102, 2011, 102, 601

PhotosByTag (global index)

Row Key             | Thumbnail.Url
birthday, 102, 601  | …
friends, 101, 502   | …
friends, 102, 601   | …
holiday, 101, 501   | …
office, 101, 502    | …
office, 102, 600    | …
Paris, 101, 501     | …
Paris, 102, 600     | …
pub, 101, 502       | …
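The row-key layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain tuples stand in for Bigtable's encoded row keys, and the helper names (`primary_rows`, `photos_by_time`, `photos_by_tag`) are hypothetical, not Megastore APIs.

```python
# Example entities from the slides (URLs omitted for brevity).
users = [{"user_id": 101, "name": "John"}, {"user_id": 102, "name": "Mary"}]
photos = [
    {"user_id": 101, "photo_id": 501, "time": 2009, "tag": ["vacation", "holiday", "Paris"]},
    {"user_id": 101, "photo_id": 502, "time": 2010, "tag": ["office", "friends", "pub"]},
    {"user_id": 102, "photo_id": 600, "time": 2009, "tag": ["office", "picnic", "Paris"]},
    {"user_id": 102, "photo_id": 601, "time": 2011, "tag": ["birthday", "friends"]},
]


def primary_rows(users, photos):
    """Data rows keyed by primary key. Sorting the keys places each user's
    photos directly after the user row, i.e. one entity group is contiguous."""
    rows = {}
    for u in users:
        rows[(u["user_id"],)] = {"User.name": u["name"]}
    for p in photos:
        rows[(p["user_id"], p["photo_id"])] = {"Photo.time": p["time"], "Photo.tag": p["tag"]}
    return dict(sorted(rows.items()))


def photos_by_time(photos):
    """Local index rows: indexed columns followed by the photo's primary key."""
    return sorted((p["user_id"], p["time"], p["user_id"], p["photo_id"]) for p in photos)


def photos_by_tag(photos):
    """Global index rows: a repeated field yields one index row per tag value."""
    entries = [(t, p["user_id"], p["photo_id"]) for p in photos for t in p["tag"]]
    return sorted(entries, key=lambda k: (k[0].lower(), k[1], k[2]))


for key in primary_rows(users, photos):
    print(key)
```

Because a shorter tuple sorts before any longer tuple it prefixes, the key `(101,)` lands immediately before `(101, 501)` and `(101, 502)`, which mirrors how a root entity and its children share a contiguous row range.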




Transactions and Concurrency Control

•  Each entity group acts as a mini-database that provides ACID semantics.

•  Transactions are managed using write-ahead logging (WAL).

•  Bigtable feature: the ability to store multiple values in the same row/column with different timestamps.

•  Cross-entity-group transactions are supported via two-phase commit (2PC).

•  Entities within an entity group use multiversion concurrency control (MVCC).

•  MVCC (multiversion concurrency control): using timestamps, reads and writes do not block each other.

•  Read consistency
   Ø  Current: first ensures that all previously committed writes are applied, then reads the last committed value.
   Ø  Snapshot: doesn't wait; reads the values of the last fully applied transaction.
   Ø  Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
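A minimal sketch of the timestamped-version idea behind MVCC, assuming an in-memory cell. `VersionedCell` is a hypothetical name used for illustration; it is not Bigtable's or Megastore's API.

```python
import bisect
from typing import Any, List, Optional


class VersionedCell:
    """One row/column holding several values, each tagged with a timestamp."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []
        self._values: List[Any] = []

    def write(self, timestamp: int, value: Any) -> None:
        # Assumption: writers always use a timestamp above every earlier one.
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("timestamp must be higher than any previous one")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def read(self, timestamp: int) -> Optional[Any]:
        """Return the newest value written at or before `timestamp`."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "john-pic1")
cell.write(20, "john-pic1-edited")
print(cell.read(15))  # a reader at timestamp 15 is unaffected by the write at 20
```

Since a new write only appends a new version, a reader pinned to an older timestamp keeps seeing a stable value, which is why reads and writes need not block each other.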

•  Write consistency
   Ø  Determines the next available log position.
   Ø  Assigns the mutations in the write-ahead log (WAL) a timestamp higher than any previous one.
   Ø  Uses Paxos to settle contention: a single winner is selected to write to a given entity group; the others abort and retry their operations.

•  Writes (mutations) use optimistic concurrency control (OCC): transactions assume their data will not conflict and proceed without locks.
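The optimistic write path can be sketched as a compare-and-append on the next log position. This is a simplified single-process illustration: `EntityGroupLog` and `commit_with_retry` are hypothetical names, and a local position check stands in for the Paxos round that picks the winner in Megastore.

```python
from typing import Callable, List


class EntityGroupLog:
    """Append-only log for one entity group."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def next_position(self) -> int:
        return len(self.entries)

    def try_append(self, position: int, mutation: str) -> bool:
        """Succeed only if `position` is still the next free log slot."""
        if position != len(self.entries):
            return False  # another writer won this position
        self.entries.append(mutation)
        return True


def commit_with_retry(log: EntityGroupLog, make_mutation: Callable[[int], str],
                      max_attempts: int = 3) -> int:
    """Optimistic commit: no locks; on conflict, re-read the position and retry."""
    for _ in range(max_attempts):
        position = log.next_position()
        if log.try_append(position, make_mutation(position)):
            return position
    raise RuntimeError("too much contention on this entity group")


log = EntityGroupLog()
a = log.next_position()          # writer A observes position 0
b = log.next_position()          # writer B observes position 0 as well
print(log.try_append(a, "A"))    # A wins the position
print(log.try_append(b, "B"))    # B loses and must abort or retry
```

No writer holds a lock while preparing its mutation; the conflict is only detected at commit time, and the loser pays by retrying against the new log position.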


Transactions and Concurrency Control

•  Queues

§  Provide transactional messaging between entity groups.

§  Each message has a single sending and a single receiving entity group:
   Ø  Synchronous: the sending and receiving entity groups are the same.
   Ø  Asynchronous: the sending and receiving entity groups differ.

§  Useful for operations that affect many entity groups.

Fig. Operations across entity groups

19

Transactions and Concurrency Control

•  Two-Phase Commit (2PC)

§  Coordinator: the component that receives the commit/abort request.

§  Participants: the resource managers that did work on behalf of the transaction (by reading/updating resources).

§  Goal: ensure that the coordinator and all participants either all commit or all abort the transaction, so atomicity is satisfied.

Source: Ref[2]

§  Disadvantage: high latency.

§  Advantage: simplifies code for unique secondary key enforcement.
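The two phases can be sketched as follows. This is a toy in-memory version with hypothetical `Participant` and `two_phase_commit` names; it ignores timeouts, crashes, and logging, which a real coordinator must handle.

```python
from typing import List


class Participant:
    """A resource manager that did work on behalf of the transaction."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> bool:
        """Phase 1: vote yes (and become prepared) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                        # phase 2: abort everywhere
        p.abort()
    return False


groups = [Participant("group-101"), Participant("group-102", can_commit=False)]
print(two_phase_commit(groups), [p.state for p in groups])
```

The extra voting round before any participant commits is what gives atomicity across entity groups, and it is also the source of the higher latency noted above.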

Other Features

•  Integrated backup system
   Ø  Used to restore an entity group’s state to any point in time.

•  Data encryption
   Ø  Uses a distinct key per entity group.




Megastore: Availability / Scalability

•  Megastore Replication System

§  Replication is done per entity group by synchronously replicating the group’s transaction log to a number of replicas.

§  Reads and writes can be initiated from any replica.

§  Writes typically require one round of inter-datacenter communication.

§  ACID semantics are preserved regardless of which replica a client starts from.

Fig. Scalable Replication

Megastore: Replication

•  Paxos algorithm

§  A way to reach consensus among a group of replicas on a single value.

§  Databases typically use Paxos to replicate a transaction log, where a separate instance of Paxos is used for each position in the log.

§  Advantage: tolerates delayed or reordered messages and replicas that fail by stopping; a group of 2F+1 replicas tolerates up to F failures (any minority of the replicas).

§  Disadvantage: high latency, because it demands multiple rounds of communication, so Megastore uses an improved version.

Source: Ref[3]
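A minimal sketch of basic single-value Paxos (not Megastore's optimized variant), assuming in-memory acceptors and synchronous calls. The names `Acceptor` and `propose` are illustrative; failed replicas are modeled simply by leaving them out of the `reachable` list.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None  # (number, value) last accepted

    def prepare(self, n: int) -> Tuple[bool, Optional[Tuple[int, str]]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(reachable: List[Acceptor], cluster_size: int, n: int, value: str) -> Optional[str]:
    """Run one proposal; return the chosen value, or None without a majority."""
    majority = cluster_size // 2 + 1
    replies = [a.prepare(n) for a in reachable]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the one with the
    # highest proposal number instead of the proposer's own value.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in reachable)
    return value if acks >= majority else None


cluster = [Acceptor() for _ in range(5)]
print(propose(cluster[:3], 5, 1, "A"))   # 3 of 5 replicas reachable: a majority
print(propose(cluster[2:], 5, 2, "B"))   # overlaps the first majority, so "A" is kept
print(propose(cluster[:2], 5, 3, "C"))   # only 2 of 5 reachable: no majority
```

Any two majorities share at least one acceptor, so a later proposer always learns about a value that may already have been chosen. The two phases (prepare, then accept) are the multiple communication rounds that make plain Paxos slow across datacenters.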

•  Master-based approach

Ø  A master-slave model is generally used, where the master handles all replication of writes.

Ø  But the master becomes a bottleneck.


•  Megastore replication system (modified Paxos)

§  Fast reads
   -  Local reads are allowed from anywhere.
   -  A coordinator at each replica tracks the set of entity groups for which its replica has observed all Paxos writes; reads of those entity groups are served locally.

§  Fast writes
   -  A specific replica is chosen as the leader for each log position.
   -  The leader arbitrates which value may use proposal number zero.
   -  The first writer to submit a value to the leader wins the right to ask all replicas to accept that value.
   -  The next write’s leader is selected using a closest-replica heuristic, which minimizes writer-leader latency because most applications submit writes from the same region repeatedly.
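The fast-read bookkeeping can be sketched as a set of up-to-date entity groups. This is an assumption-laden illustration: `Coordinator` and `read_entity_group` are hypothetical names, and `local_read` and `catch_up` are caller-supplied callbacks standing in for the replica's storage and catch-up logic.

```python
from typing import Callable, Set


class Coordinator:
    """Tracks entity groups whose local replica has observed every Paxos write."""

    def __init__(self) -> None:
        self._up_to_date: Set[str] = set()

    def validate(self, group: str) -> None:
        self._up_to_date.add(group)

    def invalidate(self, group: str) -> None:
        """Called when a write was not accepted by this replica."""
        self._up_to_date.discard(group)

    def is_up_to_date(self, group: str) -> bool:
        return group in self._up_to_date


def read_entity_group(group: str, coordinator: Coordinator,
                      local_read: Callable[[str], str],
                      catch_up: Callable[[str], None]) -> str:
    """Serve the read locally; catch up first only if the replica may be stale."""
    if not coordinator.is_up_to_date(group):
        catch_up(group)               # fetch missing log entries from other replicas
        coordinator.validate(group)   # the local replica is current again
    return local_read(group)


catch_ups = []
coord = Coordinator()
print(read_entity_group("user:101", coord, lambda g: "rows", catch_ups.append))
print(read_entity_group("user:101", coord, lambda g: "rows", catch_ups.append))
print(catch_ups)  # only the first read needed a catch-up
```

The common case is a purely local set lookup followed by a local read, which is why reads avoid inter-datacenter round trips as long as the replica has seen every write.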




Experience

•  Real-world deployment
   Ø  More than 100 production applications use Megastore (e.g., Google App Engine).
   Ø  Most applications see extremely high availability.
   Ø  Most users see average write latencies of 100 to 400 ms.

Related Work

•  NoSQL data storage systems
   Ø  Bigtable, Cassandra, Yahoo PNUTS, Amazon SimpleDB

•  Data replication process
   Ø  HBase, CouchDB, Dynamo, …
   Ø  Extend the replication scheme of traditional RDBMS systems

•  Paxos algorithm
   Ø  SCALARIS, Keyspace, …
   Ø  Few have used Paxos to achieve synchronous replication

Conclusion

Megastore

Ø  A scalable, highly available datastore for interactive internet services.

Ø  Paxos is used for synchronous replication.

Ø  Uses Bigtable as the scalable datastore while adding richer primitives (ACID, indexes).

Ø  Has over 100 applications in production.

Any Questions?

References

[1] Jason Baker et al. “Megastore: Providing Scalable, Highly Available Storage for Interactive Services.” CIDR 2011.

[2] Philip A. Bernstein and Eric Newcomer. “Principles of Transaction Processing.” Morgan Kaufmann, 2009.

[3] http://paprika.umw.edu/~ernie/cpsc321/10312006.html

[4] Google Megastore presentation at SIGMOD 2008. http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx

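Megastore: Replication

•  Megastore read process

§  Each replica stores mutations and metadata for the log entries.

1. Query local: check with the local coordinator whether the entity group is up to date locally.

2. Find position: determine the highest committed log position and select a replica that has applied through that position.

3. Catch up: for log positions whose consensus value the selected replica does not know, read the value from another replica.

4. Validate: if the local replica was selected and was not previously up to date, mark it as up to date.

5. Query data: read from the selected replica using the timestamp of the selected log position.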

Megastore: Replication

•  Megastore write process

§  Each replica stores mutations and metadata for the log entries.

1. Accept leader: ask the leader to accept the value as proposal number zero. If this succeeds, skip to step 3.

2. Prepare: run the Paxos Prepare phase at all replicas.

3. Accept: ask the remaining replicas to accept the value.

4. Invalidate: handle faults by invalidating the coordinator at each replica that did not accept the value.

5. Apply: apply the value’s mutations at as many replicas as possible.
