MegaStore Google Inc.
Presented by: Noha Elprince
22 June, 2011
Jason Baker, Chris Bond, James C Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh. CIDR 2011.
What is Megastore?
§ A storage system developed to meet the requirements of today’s online interactive services.
§ Megastore is the data engine supporting the Google App Engine (GAE) https://appengine.google.com/
§ GAE cloud computing technology:
Ø Hosts/virtualizes web apps across multiple servers on Google’s platform.
Ø Fast development and deployment.
Ø Simple administration.
Ø No need to worry about hardware, patches, backups, or scalability.
Outline
• Motivation & Problem
• Methodology
• Design of Megastore
  - Data Model
  - Data Storage
  - Transactions and Concurrency Control
• How Megastore achieves Availability and Scalability
  - Paxos
  - Megastore’s approach
• Experience
• Related Work
• Conclusion
Megastore: Motivation
• Storage requirements of today’s interactive online applications:
  - Highly scalable
  - Rapid development
  - Low latency
  - Durability and consistency
  - Availability and fault tolerance
• These requirements are in conflict!
CAP Theorem – Eric Brewer 2000
“In a distributed database system, you can only have at most two of the following three characteristics:
Ø Consistency
Ø Availability
Ø Partition tolerance
”
ACID = Atomicity, Consistency, Isolation, Durability.
Problem
§ Trade-offs between the available systems:
• RDBMS: a rich set of features and an expressive language that helps development, but difficult to scale. E.g. MySQL, PostgreSQL, MS SQL Server, Oracle RDB.
• NoSQL datastores: highly scalable, but with limited APIs and loose consistency models. E.g. Google’s Bigtable, Apache HBase, Facebook’s Cassandra.
§ The reliability of a single datacenter cannot be guaranteed 100%.
[“Always expect the unexpected”—James Patterson]
Methodology
• Megastore blends the scalability of NoSQL with the convenience of a traditional RDBMS.
• High reliability can be achieved as follows:
Ø Data lives in multiple datacenters.
Ø Writes go to a majority of datacenters synchronously.
Ø The infrastructure decides which datacenter to read from and write to.
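The majority rule above can be made concrete with a little arithmetic. The following is a minimal sketch, not Megastore code; the function names are hypothetical and chosen only for illustration.

```python
def majority(n_datacenters: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return n_datacenters // 2 + 1


def write_succeeds(acks: int, n_datacenters: int) -> bool:
    """A synchronous write succeeds once a majority has acknowledged it."""
    return acks >= majority(n_datacenters)


def tolerated_failures(n_datacenters: int) -> int:
    """Datacenters that may be down while a majority is still reachable."""
    return n_datacenters - majority(n_datacenters)
```

With three datacenters a write needs two acknowledgements and survives one datacenter outage; with five it needs three and survives two.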
Design of Megastore: Data Model
• The data model is declared in a schema.
• Each schema has a set of tables: root tables or child tables.
• Entity group: consists of a root entity along with all of its child entities.
CREATE SCHEMA PhotoApp;

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id),
  ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required int64 time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;
Design of Megastore: Data Model
• (Hierarchical) data is de-normalized to eliminate join costs; joins are implemented at the application level.
• Outer joins are implemented with parallel queries using secondary indexes.
• This provides an efficient stand-in for SQL-style joins.
How is it stored in Bigtable?
Design of Megastore: Data Storage
“A Bigtable is a compressed, high performance, and proprietary database system built on: Google File System (GFS), Chubby Lock service and other Google programs”
Example:
User { user_id: 101, name: 'John' }
Photo { user_id: 101, photo_id: 501, time: 2009, full_url: 'john-pic1', tag: 'vacation', tag: 'holiday', tag: 'Paris' }
Photo { user_id: 101, photo_id: 502, time: 2010, full_url: 'john-pic2', tag: 'office', tag: 'friends', tag: 'pub' }
Design of Megastore: Data Storage
User { user_id: 102, name: 'Mary' }
Photo { user_id: 102, photo_id: 600, time: 2009, full_url: 'mary-pic1', tag: 'office', tag: 'picnic', tag: 'Paris' }
Photo { user_id: 102, photo_id: 601, time: 2011, full_url: 'mary-pic2', tag: 'birthday', tag: 'friends' }

Row Key    User.name   Photo.time   Photo.tag                   Photo URL
101        John
101,501                2009         Vacation, Holiday, Paris    …
101,502                2010         Office, Friends, Pub        …
102        Mary
102,600                2009         Office, Picnic, Paris       …
102,601                2011         Birthday, Friends           …
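To see why this row-key layout keeps an entity group together, here is a minimal sketch that uses a plain Python dict as a stand-in for a Bigtable. The dict, the tuple keys, and `scan_entity_group` are assumptions made for illustration; they are not Bigtable or Megastore APIs.

```python
# Hypothetical in-memory stand-in for a Bigtable: row key -> column values.
# Root rows are keyed by (user_id,), child rows by (user_id, photo_id).
rows = {
    (102, 601): {"Photo.time": 2011, "Photo.tag": ["birthday", "friends"]},
    (101,): {"User.name": "John"},
    (102, 600): {"Photo.time": 2009, "Photo.tag": ["office", "picnic", "Paris"]},
    (101, 502): {"Photo.time": 2010, "Photo.tag": ["office", "friends", "pub"]},
    (102,): {"User.name": "Mary"},
    (101, 501): {"Photo.time": 2009, "Photo.tag": ["vacation", "holiday", "Paris"]},
}


def scan_entity_group(table, user_id):
    """Return the root row and all child rows of one user, in key order."""
    ordered = sorted(table.items(), key=lambda item: item[0])
    return [(key, cols) for key, cols in ordered if key[0] == user_id]


# Sorting by key places each user row directly before that user's photos.
sorted_keys = sorted(rows)
john_group = [key for key, _ in scan_entity_group(rows, 101)]
```

Because the root key is a prefix of every child key, a sorted scan reads one user and all of that user's photos as a contiguous range.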
• Indexing
  - Local index: finds data within an entity group.
    CREATE LOCAL INDEX PhotosByTime ON Photo(user_id, time);
  - Global index: spans entity groups.
    CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbnail_url);
• The STORING clause
Ø Stores the listed properties directly in the index entries, so they can be retrieved faster at read time.
Design of Megastore: Data Storage
How is it stored in Bigtable?

PhotosByTime
Row Key
101,2009,101,501
101,2010,101,502
102,2009,102,600
102,2011,102,601

PhotosByTag
Row Key             Thumbnail.Url
Birthday,102,601    …
Friends,101,502     …
Friends,102,601     …
Holiday,101,501     …
Office,101,502      …
Office,102,600      …
Paris,101,501       …
Paris,102,600       …
Pub,101,502         …
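The index rows above follow a simple pattern: the indexed values come first, followed by the primary key of the entity they point to. The sketch below rebuilds both indexes from the example photos. It is only an illustration under that assumption; the function names are hypothetical and tags are kept in the lowercase form used in the example entities.

```python
photos = [
    {"user_id": 101, "photo_id": 501, "time": 2009,
     "tag": ["vacation", "holiday", "Paris"]},
    {"user_id": 101, "photo_id": 502, "time": 2010,
     "tag": ["office", "friends", "pub"]},
    {"user_id": 102, "photo_id": 600, "time": 2009,
     "tag": ["office", "picnic", "Paris"]},
    {"user_id": 102, "photo_id": 601, "time": 2011,
     "tag": ["birthday", "friends"]},
]


def photos_by_time(entities):
    """Local index: (user_id, time) followed by the primary key."""
    return sorted(
        (p["user_id"], p["time"], p["user_id"], p["photo_id"]) for p in entities
    )


def photos_by_tag(entities):
    """Global index: one entry per tag, because `tag` is a repeated field."""
    return sorted(
        (tag, p["user_id"], p["photo_id"]) for p in entities for tag in p["tag"]
    )


def lookup_tag(index, tag):
    """Primary keys of all photos carrying `tag`, across entity groups."""
    return [(user_id, photo_id) for t, user_id, photo_id in index if t == tag]
```

A lookup such as `lookup_tag(photos_by_tag(photos), "friends")` returns photos from both users, which is what distinguishes a global index from a local one.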
Transactions and Concurrency Control
• Each entity group acts as a mini-database that provides ACID semantics.
• Transaction management uses write-ahead logging (WAL).
• Bigtable feature: the ability to store multiple values for the same row/column with different timestamps.
• Cross-entity-group transactions are supported via two-phase commit (2PC).
• Entities in an entity group use multiversion concurrency control (MVCC): because values are timestamped, reads and writes do not block each other.
• Read consistency
  - Current: first ensures that all previously committed writes are applied, then reads the last committed value.
  - Snapshot: doesn’t wait; reads at the timestamp of the last fully applied transaction.
  - Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
• Write consistency
  - Determines the next available log position.
  - Assigns the mutations in the write-ahead log (WAL) a timestamp higher than any previous one.
  - Uses Paxos to settle contention: one writer wins the right to write to a given entity group, and the others abort and retry their operations.
  - Uses optimistic concurrency control (OCC) for mutations (write operations): it assumes that transactions’ data will not conflict, so it proceeds without locks.
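The timestamped-versions idea behind MVCC can be sketched in a few lines. This is a simplified illustration of a single row/column with multiple versions, not Bigtable's implementation; `VersionedCell` and its methods are hypothetical names.

```python
import bisect


class VersionedCell:
    """One row/column holding several timestamped values."""

    def __init__(self):
        self._timestamps = []  # kept in ascending order
        self._values = []      # values aligned with _timestamps

    def write(self, timestamp, value):
        """Add a new version without touching older ones."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def read_at(self, timestamp):
        """Latest value written at or before `timestamp`, or None."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "john-pic1")
cell.write(20, "john-pic1-edited")
```

A reader pinned to timestamp 15 keeps seeing the value written at 10 even after the write at 20 lands, which is why readers and writers do not block each other.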
Transactions and Concurrency Control
• Queues
§ Provide transactional messaging between entity groups.
§ Each message has a single sending and a single receiving entity group. Delivery is:
Ø Synchronous: when the sending and receiving entity group are the same.
Ø Asynchronous: when the sending and receiving entity groups differ.
§ Useful for performing operations that affect many entity groups.
Fig. Operations across entity groups
Transactions and Concurrency Control
• Two-Phase Commit (2PC)
§ Coordinator: the component that receives the commit/abort request.
§ Participants: the resource managers that did work on behalf of the transaction (by reading or updating resources).
* Goal: ensure that the coordinator and all participants either all commit or all abort the transaction, so that atomicity is satisfied.
Source: Ref[2]
Disadvantage: high latency.
Advantage: simplifies code for unique secondary key enforcement.
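The coordinator/participant roles above can be sketched as follows. This is a textbook-style illustration of two-phase commit with no failure handling or timeouts, not Megastore's implementation; all class and function names are hypothetical.

```python
class Participant:
    """A resource manager that did work on behalf of the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase 1: vote yes (and become prepared) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    decision = all(votes)
    for p in participants:                       # phase 2: broadcast decision
        if decision:
            p.commit()
        else:
            p.abort()
    return "committed" if decision else "aborted"
```

The two sequential rounds of messages are the source of the high latency noted above: no participant can finish until every vote has been collected.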
Other Features
• Integrated backup system
Ø Used to restore an entity group’s state to any point in time.
• Data encryption
Ø Uses a distinct key per entity group.
Megastore: Availability / Scalability
• Megastore replication system
  - Replication is done per entity group, by synchronously replicating the group’s transaction log to a number of replicas.
  - Reads and writes can be initiated from any replica.
  - Writes require one round of inter-datacenter communication.
  - ACID semantics are preserved regardless of which replica a client starts from.
Fig. Scalable Replication

Megastore: Replication
• Paxos algorithm
  - A way to reach consensus among a group of replicas on a single value.
  - Databases typically use Paxos to replicate a transaction log, where a separate instance of Paxos is used for each position in the log.
  - Advantage: tolerates delayed or reordered messages and replicas that fail by stopping. A majority of replicas must be reachable, so it tolerates up to F failures with 2F+1 replicas.
  - Disadvantage: high latency, because it demands multiple rounds of communication. Megastore therefore uses an improved version.
Source: Ref[3]
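For readers who have not seen Paxos before, the sketch below shows a single instance of basic two-phase Paxos agreeing on one value, as would be run for one log position. It is a simplified in-process illustration under several assumptions: no networking, unique proposal numbers supplied by the caller, and stopped replicas modelled by leaving them out of the `reachable` list. It is not Megastore's optimized variant, and all names are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1         # highest proposal number promised so far
        self.accepted_n = -1       # proposal number of the accepted value
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(reachable, n, value, cluster_size):
    """Run one proposal; return the chosen value, or None without a majority."""
    majority = cluster_size // 2 + 1
    replies = [a.prepare(n) for a in reachable]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered proposal already accepted.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value
    accepted = sum(a.accept(n, value) for a in reachable)
    return value if accepted >= majority else None
```

Once a value has been accepted by a majority, any later proposal that reaches a majority must adopt that same value, which is how the replicas agree on a single entry for the log position.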
Megastore: Replication
• Master-based approach
Ø A master-slave model is generally used, where the master handles all replication of writes.
Ø But the master becomes a bottleneck.
Megastore: Replication
• Megastore replication system (modified Paxos)
§ Fast reads
- Allow local reads from anywhere.
- A coordinator at each replica tracks the set of entity groups for which that replica has observed all Paxos writes, and local reads are served for those entity groups.
§ Fast writes
- For each log position, a specific replica is chosen as the leader.
- The leader arbitrates which writer’s value may use proposal number zero.
- The first writer to submit a value to the leader wins the right to ask all replicas to accept that value.
- The next write’s leader is selected with a closest-replica heuristic. The aim is to minimize writer-leader latency, based on the observation that most applications submit writes from the same region repeatedly.
Experience
• Real-world deployment
  - More than 100 production applications use Megastore (e.g. Google App Engine).
  - Most applications see extremely high availability.
  - Most users see average write latencies of 100 to 400 ms.
Related Work
• NoSQL data storage systems
  - Bigtable, Cassandra, Yahoo PNUTS, Amazon SimpleDB
• Data replication process
  - HBase, CouchDB, Dynamo, …
  - Extend the replication schemes of traditional RDBMS systems
• Paxos algorithm
  - SCALARIS, Keyspace, …
  - Few have used Paxos to achieve synchronous replication
Conclusion
Megastore
Ø A scalable, highly available datastore for interactive internet services.
Ø Paxos is used for synchronous replication.
Ø Uses Bigtable as the scalable datastore while adding richer primitives (ACID, indexes).
Ø Has over 100 applications in production.
Any Questions?
References
[1] Jason Baker et al. “Megastore: Providing Scalable, Highly Available Storage for Interactive Services.” CIDR 2011.
[2] Philip A. Bernstein, Eric Newcomer. “Principles of Transaction Processing.” Morgan Kaufmann, 2009.
[3] http://paprika.umw.edu/~ernie/cpsc321/10312006.html
[4] Google Megastore’s presentation at SIGMOD 2008. http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx
Megastore: Replication
Megastore Read Process
• Each replica stores mutations and metadata for the log entries.
• Read process
  1. Query local: check whether the local replica is up to date.
  2. Find position: determine the highest log position and select a replica.
  3. Catchup: check the consensus value from other replicas.
  4. Validate: synchronize the local replica so that it is up to date.
  5. Query data: read the data at the selected timestamp.
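The five read steps can be traced in a small simulation. This is a heavily simplified sketch for a single entity group: it always reads from the local replica after catching up, and it assumes every log position up to the highest one is known by some replica. The `Replica` class and `current_read` function are hypothetical, not Megastore APIs.

```python
class Replica:
    def __init__(self, name, log=None, up_to_date=False):
        self.name = name
        self.log = dict(log or {})    # log position -> mutation
        self.up_to_date = up_to_date  # what the local coordinator believes


def current_read(local, others):
    """Return (log position, mutations applied through that position)."""
    # 1. Query local: ask the local coordinator whether we are up to date.
    if not local.up_to_date:
        # 2. Find position: highest log position known to any replica.
        position = max(max(r.log, default=0) for r in [local] + others)
        # 3. Catchup: learn the missing entries from other replicas.
        for pos in range(1, position + 1):
            if pos not in local.log:
                source = next(r for r in others if pos in r.log)
                local.log[pos] = source.log[pos]
        # 4. Validate: tell the coordinator the local replica is current.
        local.up_to_date = True
    # 5. Query data: read everything applied through the chosen position.
    position = max(local.log, default=0)
    return position, [local.log[p] for p in range(1, position + 1)]
```

When the coordinator already reports the entity group as up to date, steps 2 to 4 are skipped and the read is purely local, which is the fast-read path.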
Megastore: Replication
Megastore Write Process
• Each replica stores mutations and metadata for the log entries.
• Write process
  1. Accept leader: ask the leader to accept the value as proposal number zero.
  2. Prepare: run the Paxos Prepare phase at all replicas.
  3. Accept: ask the remaining replicas to accept the value.
  4. Invalidate: fault handling for replicas that did not accept the value.
  5. Apply: apply the value’s mutation at as many replicas as possible.
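The write steps can be traced in the same style. This sketch covers only the fast path: a writer that loses the race for proposal number zero simply gives up, where a full implementation would fall back to the Prepare phase in step 2. Unreachable replicas are modelled with a flag, and all names are hypothetical.

```python
class Leader:
    """Arbitrates which writer may use proposal number zero per position."""

    def __init__(self):
        self.granted = {}  # log position -> writer that won proposal zero

    def accept_leader(self, position, writer):
        return self.granted.setdefault(position, writer) == writer


class Replica:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.accepted = {}             # log position -> accepted value
        self.log = {}                  # log position -> applied value
        self.coordinator_valid = True  # may this replica serve local reads?

    def accept(self, position, value):
        if self.reachable:
            self.accepted[position] = value
        return self.reachable

    def apply(self, position):
        if position in self.accepted:
            self.log[position] = self.accepted[position]


def write(leader, replicas, position, writer, value):
    # 1. Accept leader: only the first writer gets proposal number zero.
    if not leader.accept_leader(position, writer):
        return False  # 2. Prepare would start here; this sketch gives up.
    # 3. Accept: ask the replicas to accept the value.
    acks = [r for r in replicas if r.accept(position, value)]
    if len(acks) < len(replicas) // 2 + 1:
        return False
    # 4. Invalidate: replicas that missed the write must not serve local reads.
    for r in replicas:
        if r not in acks:
            r.coordinator_valid = False
    # 5. Apply: apply the mutation at as many replicas as possible.
    for r in acks:
        r.apply(position)
    return True
```

Invalidating the coordinator in step 4 is what keeps fast local reads safe: a replica that missed a write has to catch up before it can answer current reads again.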