Megastore: Providing scalable and highly available storage

MegaStore: Providing Scalable, Highly Available Storage for Interactive Services Niels Claeys

Description

This presentation covers Google's paper on Megastore, their distributed, scalable, and highly available storage system.

Transcript of Megastore: Providing scalable and highly available storage

Page 1: Megastore: Providing scalable and highly available storage

MegaStore: Providing Scalable, Highly Available Storage for Interactive Services

Niels Claeys

Page 2: Megastore: Providing scalable and highly available storage

Outline

1. Introduction
2. Availability and Scale
3. Megastore features
4. Replication
5. Results
6. Conclusion

Page 3: Megastore: Providing scalable and highly available storage

1. Introduction (1)

• Interactive online services demand:
– High scalability
– Rapid development
– Low latency
– Consistency of data
– High availability

→ conflicting requirements

• Solution: Megastore
– Scalability of NoSQL → partition + replicate
– Convenience of an RDBMS → ACID semantics within a partition
– High availability

Page 4: Megastore: Providing scalable and highly available storage

1. Introduction (2)

• Widely deployed within Google for several years
• >100 production applications
• 3 billion writes and 20 billion reads daily
• A petabyte of data across multiple datacenters
• Available on GAE since January 2011

Page 5: Megastore: Providing scalable and highly available storage

2.1 Availability and scalability

• Availability: Paxos → fault-tolerant consensus algorithm
– No master
– Replicated logs

• Scale:
– Partition data into small databases
– Each partition has its own replicated log

Page 6: Megastore: Providing scalable and highly available storage


2.2 Partitioning

Page 7: Megastore: Providing scalable and highly available storage

Outline

1. Introduction
2. Availability and Scale
3. Megastore features
4. Replication
5. Results
6. Conclusion

Page 8: Megastore: Providing scalable and highly available storage

3.1 Megastore features: API

• Megastore = cost-transparent API
– No expressive queries
– Storing and querying hierarchical data in a key-value store is easy
– Joins in application logic (see the sketch below):
• Merge phase supported
• Outer joins based on indexes

→ understandable performance implications
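The slide says joins are pushed into application logic. As a minimal illustration (my own sketch, not the Megastore client API), a merge join over two result lists that are already sorted by the join key could look like this:

```python
# Illustrative only: a client-side merge join over two result lists that are
# already sorted by the join key, in the spirit of "joins in application logic".
# None of these names come from the Megastore API.

def merge_join(left, right, key):
    """Yield (l, r) pairs whose join keys match; both inputs sorted by `key`."""
    i, j = 0, 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk == rk:
            yield left[i], right[j]
            j += 1          # keep the left row so it can match more right rows
        elif lk < rk:
            i += 1
        else:
            j += 1

users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
photos = [{"user_id": 1, "photo_id": 10}, {"user_id": 2, "photo_id": 20}]
for user, photo in merge_join(users, photos, key=lambda e: e["user_id"]):
    print(user["name"], photo["photo_id"])
```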

Page 9: Megastore: Providing scalable and highly available storage

3.2 Megastore features: Data model

• Megastore tables:
– Entity group root table
– Child table: holds a reference to the root

• Entity: a single row
→ identified by the concatenation of its keys (see the key sketch below)
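A minimal sketch of the key idea (the "Table:zero-padded-id" format below is invented for illustration, not Megastore's actual encoding): a child entity's row key concatenates the root key with the child's own key, so all entities of one entity group sort together.

```python
# Illustrative key construction; the key format is an assumption for this
# sketch, not Megastore's real encoding.

def root_key(user_id: int) -> str:
    return f"User:{user_id:08d}"

def photo_key(user_id: int, photo_id: int) -> str:
    # Child key = root key + child primary key -> clusters under its root.
    return f"{root_key(user_id)}/Photo:{photo_id:08d}"

keys = sorted([photo_key(1, 20), root_key(1), photo_key(1, 5), root_key(2)])
print(keys)  # all rows of entity group User:1 appear adjacent in key order
```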

Page 10: Megastore: Providing scalable and highly available storage

3.2 Megastore features: Indexes

• Two levels of indexes:
– Local: within each entity group
• Updated atomically and consistently
– Global: spans entity groups
• Find entities without knowing their keys
• Not all recent updates are visible

Page 11: Megastore: Providing scalable and highly available storage

3.2 Megastore features: Bigtable

• Primary keys cluster entities together
• Each entity = a single Bigtable row
• "IN TABLE" nests child tables into a single Bigtable
→ key ordering ensures entities are stored adjacently
• Bigtable column name = Megastore table name + property name (see the sketch below)
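As a rough illustration of the last two bullets (my own sketch, not Megastore's actual row format), one entity maps to one Bigtable row, with column names built from the Megastore table name plus the property name:

```python
# Illustrative mapping of a Megastore entity onto a single Bigtable row; the
# helper function and the exact "<table>.<property>" separator are assumptions.

def to_bigtable_row(table: str, row_key: str, properties: dict) -> dict:
    return {
        "row_key": row_key,
        "columns": {f"{table}.{name}": value for name, value in properties.items()},
    }

row = to_bigtable_row(
    "Photo",
    "User:00000001/Photo:00000010",
    {"time": 1714000000, "full_url": "http://example.com/p10.jpg"},
)
print(row["columns"])  # {'Photo.time': ..., 'Photo.full_url': ...}
```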

Page 12: Megastore: Providing scalable and highly available storage

3.3 Megastore features: Transactions (1)

• Entity group = mini-database
→ serializable ACID semantics
• MVCC (MultiVersion Concurrency Control)
→ transaction timestamps
• Reads and writes are isolated

Page 13: Megastore: Providing scalable and highly available storage

3.3 Megastore features: Transactions (2)

• Three levels of read consistency (see the sketch below):
– Current: read the EG after committed write logs have been applied
– Snapshot: read the last completed transaction of the EG
– Inconsistent: ignore the log and read the latest values
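A toy sketch of how the three levels pick what they read (the names and the timestamp model are assumptions for illustration, not the Megastore client API):

```python
# Illustrative only: choosing a read timestamp for the three consistency levels.
from enum import Enum

class ReadLevel(Enum):
    CURRENT = "current"            # wait for committed writes to apply, then read
    SNAPSHOT = "snapshot"          # read at the last fully applied transaction
    INCONSISTENT = "inconsistent"  # ignore the log, read latest values directly

def pick_read_timestamp(level: ReadLevel, last_committed: int, last_applied: int):
    if level is ReadLevel.CURRENT:
        # A real implementation would block here until last_applied == last_committed.
        return last_committed
    if level is ReadLevel.SNAPSHOT:
        return last_applied
    return None  # inconsistent reads are not pinned to a timestamp

print(pick_read_timestamp(ReadLevel.SNAPSHOT, last_committed=42, last_applied=40))  # 40
```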

Page 14: Megastore: Providing scalable and highly available storage

3.3 Megastore features: Transactions (3)

Write transaction (see the sketch after the list):

― Current read: Obtain the timestamp and log position of the last committed transaction

― Application logic: Read from Bigtable and gather writes into a log entry

― Commit: Use Paxos to achieve consensus for appending the log entry to log

― Apply: Write mutations to the entities and indexes in Bigtable

― Clean up: Delete temp data
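A self-contained toy version of these five steps (the in-memory entity group and the simple position check standing in for the Paxos commit are assumptions for illustration, not Megastore's implementation):

```python
# Illustrative sketch of the write path; a real commit would run a Paxos round
# across replicas instead of the single-process position check used here.

class EntityGroup:
    def __init__(self):
        self.log = []    # committed write-ahead log entries
        self.data = {}   # applied entity/index state ("Bigtable")

    def last_committed(self):
        return len(self.log)           # log position doubles as the timestamp here

    def commit(self, position, entry):
        if position != len(self.log):  # stand-in for losing the Paxos race
            return False
        self.log.append(entry)
        return True

def write_transaction(eg: EntityGroup, mutations: dict):
    position = eg.last_committed()        # 1. current read: last committed position
    entry = dict(mutations)               # 2. application logic: gather writes
    if not eg.commit(position, entry):    # 3. commit: consensus on the next log slot
        raise RuntimeError("lost the commit race; retry the transaction")
    eg.data.update(entry)                 # 4. apply: mutate entities and indexes
    # 5. clean up: nothing temporary to delete in this toy model

eg = EntityGroup()
write_transaction(eg, {"Photo.time": 1714000000})
print(eg.log, eg.data)
```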

Page 15: Megastore: Providing scalable and highly available storage

3.3 Megastore features: Transactions (4)

• Queue: transactional messaging between EGs (see the sketch below)
― A transaction handles messages atomically
― Perform operations across many EGs
― One queue associated with each EG (scalable)

• Two-phase commit is also supported
→ queues are preferred over two-phase commit
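A toy illustration of the queue idea (my own sketch, not Megastore's queue API): a transaction on one entity group applies its writes and enqueues messages for other entity groups in the same atomic step, avoiding a cross-group two-phase commit.

```python
# Illustrative only: per-entity-group inboxes plus a commit that applies writes
# and enqueues outgoing messages together, mimicking transactional messaging.
from collections import defaultdict

inboxes = defaultdict(list)   # one message queue per entity group

def commit_with_messages(eg_state: dict, writes: dict, messages: list):
    # In Megastore the writes and the enqueue belong to one transaction;
    # applying them together here stands in for that atomicity.
    eg_state.update(writes)
    for target_eg, payload in messages:
        inboxes[target_eg].append(payload)

sender = {}
commit_with_messages(sender, {"sent_count": 1}, [("User:2", {"msg": "hello"})])
print(inboxes["User:2"])  # consumed later by a transaction on entity group User:2
```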

Page 16: Megastore: Providing scalable and highly available storage

Outline

1. Introduction
2. Availability and Scale
3. Megastore features
4. Replication
5. Results
6. Conclusion

Page 17: Megastore: Providing scalable and highly available storage

4.1 Replication: Paxos

• Reach consensus between replicas
– Tolerates delayed and reordered messages
– A majority of replicas must be reachable

• Roles: proposers, acceptors, learners
• Proposers issue requests with monotonically increasing sequence numbers (see the sketch below)

• Problem
– High latency: multiple round trips
→ adapted for use in Megastore
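A minimal single-decree Paxos round as an in-process toy (my own illustration of the classic algorithm, not Megastore's implementation): one proposer runs a prepare phase and an accept phase against a set of acceptors, and a value is chosen once a majority accepts it.

```python
# Illustrative single-decree Paxos with in-memory acceptors.

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number promised
        self.accepted = None    # (number, value) last accepted, if any

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: prepare with a monotonically increasing proposal number n.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None                                  # no majority of promises
    # If some acceptor already accepted a value, that value must be proposed.
    prior = max((acc for acc in granted if acc), default=None)
    chosen = prior[1] if prior else value
    # Phase 2: ask the acceptors to accept the chosen value.
    votes = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if votes > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "append log entry 42"))  # -> 'append log entry 42'
```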

Page 18: Megastore: Providing scalable and highly available storage


4.1 Replication: Paxos illustration

Page 19: Megastore: Providing scalable and highly available storage

4.2 Replication: Paxos adaptation

• Fast reads: local reads through coordinators (see the sketch below)
– Eliminates the prepare phase
– The coordinator tracks which EGs the local replica is up to date for
→ simple because it holds no database

• Fast writes: through leaders
– Eliminates the prepare phase
– Multiple writes issued to the same leader
– Leader = the replica closest to the writer
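A toy sketch of the fast-read check (the names and the fallback path are assumptions for illustration, not the actual coordinator protocol): the read is served from the local replica only when the local coordinator believes that replica has seen every write for the entity group; otherwise the reader falls back to catching up from other replicas.

```python
# Illustrative only: a coordinator as an in-memory set of "up to date" entity
# groups; the fallback is a placeholder for catching up from other replicas.

class Coordinator:
    def __init__(self):
        self.up_to_date = set()

    def is_current(self, eg: str) -> bool:
        return eg in self.up_to_date

    def invalidate(self, eg: str):
        # Called when a write could not be applied at the local replica.
        self.up_to_date.discard(eg)

def read(eg: str, coord: Coordinator, local_replica: dict, catch_up_read):
    if coord.is_current(eg):
        return local_replica.get(eg)   # fast local read, no wide-area round trip
    return catch_up_read(eg)           # slower path: consult other replicas

coord = Coordinator()
coord.up_to_date.add("User:1")
local = {"User:1": {"name": "ann"}}
print(read("User:1", coord, local, catch_up_read=lambda eg: None))  # {'name': 'ann'}
```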

Page 20: Megastore: Providing scalable and highly available storage

4.3 Replication: Algorithms (1)

• Each replica stores the log entries of an EG
→ can accept entries out of order

• Read:
→ requires at least one up-to-date replica

Page 21: Megastore: Providing scalable and highly available storage

4.3 Replication: Algorithms (2)

• Prepare: package the changes + timestamp + next leader as a log entry

• If a write does not succeed at a replica: invalidate its coordinator

• Data only becomes visible after the invalidate step

Page 22: Megastore: Providing scalable and highly available storage

4.4 Replication: Coordinator availability

• Coordinator: one in each datacenter
→ keeps state about the local replica
→ a simple process = more stable

• Failure detection:
– Chubby locks: show that other coordinators are online
→ losing a majority of locks: consider all EGs out of date
– Datacenter failure: writers wait for the coordinator's locks to expire before the write can complete

• Validation races (see the sketch below):
– Always send the log position
– The higher position wins
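A toy sketch of the race rule (my own illustration, not Megastore's wire protocol): invalidate and validate messages each carry a log position, and a validate that arrives with a lower position than an earlier invalidate is ignored.

```python
# Illustrative only: the coordinator remembers the highest invalidated log
# position per entity group, so a late or stale validate cannot win the race.

class CoordinatorState:
    def __init__(self):
        self.valid_through = {}   # EG -> highest validated log position
        self.invalid_from = {}    # EG -> highest invalidated log position

    def invalidate(self, eg: str, position: int):
        self.invalid_from[eg] = max(self.invalid_from.get(eg, -1), position)

    def validate(self, eg: str, position: int):
        if position >= self.invalid_from.get(eg, -1):   # higher position wins
            self.valid_through[eg] = position

state = CoordinatorState()
state.invalidate("User:1", position=7)
state.validate("User:1", position=6)   # stale validate: ignored
print(state.valid_through)             # {} -> User:1 remains out of date
```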

Page 23: Megastore: Providing scalable and highly available storage

4.5 Replication: Replica types

• Full replicas
― What we have seen until now

• Witness replicas
― Can vote
― Store write-ahead logs but not the data

• Read-only replicas
― Cannot vote
― Hold a full snapshot of the data

Page 24: Megastore: Providing scalable and highly available storage


4.6 Replication: Architecture

Page 25: Megastore: Providing scalable and highly available storage

5. Results

• Read latency: 10+ ms
• Write latency: 300 ms
• Issue: a replica becomes unavailable

• Solution:
– Reroute traffic to nearby servers
– Disable the replica's coordinator

Page 26: Megastore: Providing scalable and highly available storage

6. Conclusion

• Scalability and availability
• Simpler reasoning and usage
→ ACID semantics
• Latency:
→ best effort (low enough for interactive apps)
• Throughput within an EG: a few writes per second
→ if that is not enough: shard the EG or place replicas near each other

Page 27: Megastore: Providing scalable and highly available storage

Questions?

Page 28: Megastore: Providing scalable and highly available storage

Question 1

As stated several times, only two elements of CAP can be kept. Which two does Megastore focus on, and how?

Reasoning:

– Partition tolerance: dividing the database into EGs and replicating them across multiple datacenters

– Availability: providing a service that is highly available through Paxos

– Consistency: Relaxed consistency between EG and global indexes

Page 29: Megastore: Providing scalable and highly available storage

Question 2

Current reads have the following guarantees:

- A read always observes the last-acknowledged write.

- After a write has been observed, all future reads observe that write. (A write might be observed before it is acknowledged.)

Contradiction?

→ No, I do not think so, but it is confusing

Reasoning:

– The two guarantees apply to current reads (these are the reads that preserve consistency)

– The sentence between parentheses means that inconsistent reads are also possible, but those are not current reads

Page 30: Megastore: Providing scalable and highly available storage

Question 3

In my opinion, a lot of their focus goes towards making the system consistent. But in their API they also give you the possibility to request current, snapshot and inconsistent data. Do you think this is a valuable addition?

Reasoning:

→ Mainly due to the performance bottleneck of a fully consistent system

→ Depends on the application: there are applications where you do not mind reading something inconsistent

→ Current and snapshot still maintain consistency

=> Personally I think the value is limited: I cannot think of an application where latency is so critical that it would rather get inconsistent data than wait a bit longer

Page 31: Megastore: Providing scalable and highly available storage

Question 4

Section 4.4.3: It is not clear to me how the 'read-only' replicas receive their data, as they need consistent data. Do you have an idea?

→ not mentioned in the paper

Idea:

– The coordinator of a replica keeps track of the up-to-date EGs

– A mechanism that periodically takes a snapshot of these up-to-date EGs and copies it to the read-only replicas

Page 32: Megastore: Providing scalable and highly available storage

Question 5

Megastore is compared with Bigtable multiple times. What are, according to you, the biggest differences in implementation and in types of usage?

Reasoning:

– Built on top of Bigtable, but designed for different requirements:
→ consistency guarantees + wide-area communication

– Bigtable is used within one datacenter ↔ Megastore spans multiple
→ increased availability (Paxos) but higher latency

– Megastore's consistency guarantees: I suspect lower performance and throughput than with Bigtable

– Implementation:

• Bigtable: a master ensures replication ↔ Megastore: Paxos (no master to recover)

• Bigtable: one log per tablet server ↔ Megastore: one log per EG per replica

• Very different APIs: Megastore supports schemas + indexes

– Note: Google App Engine moved from Bigtable to Megastore
→ problems with replication between datacenters (if one datacenter becomes unhealthy): replication was done in the background and therefore lagged (a bit) behind

Page 33: Megastore: Providing scalable and highly available storage

References

MacDonald A., Paxos by Example, http://angusmacdonald.me/writing/paxos-by-example/, accessed 06-05-14.

Google App Engine, Switch from Bigtable to Megastore, http://googleappengine.blogspot.be/2009/09/migration-to-better-datastore.html, accessed 08-05-13.