Google Cloud Datastore Inside-Out


Etsuji Nakai, Cloud Solutions Architect at Google
February 9, 2017, ver2.1


Twitter @enakai00


Cloud Datastore 101

The mystery of entity groups

Dual nature of entities: a row of a kind
● An entity represents a row of a specific "kind".
● You can think of a "kind" as a table in the relational data model.
● An entity is identified by an ID (a user-specified string or an auto-generated numeric ID) plus its (mysterious) parent key. Together they form the unique identifier of the row.

Dual nature of entities: a node of an entity group
● An entity represents a node of an "entity group" tree.
● An entity group can contain entities from multiple kinds.
● An entity is identified by a key (ancestor path + ID).
  ○ A key must contain all entities from the root.
  ○ Some entities in the ancestor path may not exist.

Example: in the following key, the ancestor path is (Organization, 'Flywheel', User, 'Alice') and the ID is (Mail, '15de6'). The root entity, Organization: Flywheel, doesn't exist.

Key: (Organization, 'Flywheel', User, 'Alice', Mail, '15de6')
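One way to picture "ancestor path + ID" is a flat sequence of (kind, ID) pairs. The following is a minimal Python sketch under that assumption. The helper names `ancestor_path` and `is_descendant` are hypothetical and are not part of any Cloud Datastore client library.

```python
# A key is modeled as a flat tuple: kind, ID, kind, ID, ...
# This is an illustrative model only, not the real key encoding.

def ancestor_path(key):
    """Everything before the final (kind, ID) pair."""
    return key[:-2]

def is_descendant(key, ancestor):
    """True if `ancestor` is a proper prefix of `key`."""
    return len(key) > len(ancestor) and key[:len(ancestor)] == ancestor

mail = ("Organization", "Flywheel", "User", "Alice", "Mail", "15de6")
alice = ("Organization", "Flywheel", "User", "Alice")
root = ("Organization", "Flywheel")  # may not exist as an entity

print(ancestor_path(mail) == alice)
print(is_descendant(mail, root))
```

Nothing in this model requires the root to be stored: the key of the mail entity is well formed even if no entity exists for `root`.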

The bright/dark side of an entity
● It's safe to treat an entity as a member of an entity group.
  ○ Entities treated as part of an entity group are guaranteed to be strongly consistent.
● An ancestor query is a query that specifies an ancestor.
  ○ The search range is limited to the descendants of the specified ancestor.
  ○ Ancestor queries are strongly consistent. In other words, they always retrieve the latest data.
  ○ You can use a single-phase transaction inside an entity group.
  ○ A cross-group transaction can also be used, but it is slower than a single-phase transaction.
● A global query is a query that doesn't specify an ancestor.
  ○ Global queries are eventually consistent.
  ○ You may see old content and/or fail to find newly created entities.

Mystery of composite indexes
● Can you tell which query requires an additional (non-default) index?
  ○ Global query
    ■ SELECT * FROM Mail WHERE size>256 ⇒ ◯ (OK)
    ■ SELECT * FROM Mail WHERE size=256 AND access_count>5 ⇒ △ (Needs an additional index)
    ■ SELECT * FROM Mail WHERE size>256 AND access_count>5 ⇒ ✕ (Not allowed)
    ■ SELECT size FROM Mail WHERE size>256 ⇒ ◯ (OK)
    ■ SELECT title FROM Mail WHERE size>256 ⇒ △ (Needs an additional index)
  ○ Ancestor query
    ■ SELECT * FROM Mail WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ ◯ (OK)
    ■ SELECT * FROM Mail WHERE size=256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ ◯ (OK)
    ■ SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △ (Needs an additional index)

What's happening under the covers?
● How is strong consistency guaranteed for ancestor queries?
● Why do I have to define additional indexes for some queries?
● When and why do I need to specify "ancestor = True" for an index?

Truth is here
● Cloud Datastore is implemented on top of Megastore, which is layered over Bigtable and Google File System. The internal architecture of Megastore, Bigtable and Google File System is explained in the published research papers.
● Megastore: Providing Scalable, Highly Available Storage for Interactive Services
  ○ http://research.google.com/pubs/pub36971.html
● Bigtable: A Distributed Storage System for Structured Data
  ○ http://research.google.com/archive/bigtable.html
● The Google File System
  ○ http://research.google.com/archive/gfs.html

Figure: The layered structure (Google File System, Bigtable, Megastore).

Notes on Colossus
● Colossus is the successor to Google File System and overcomes its shortcomings. Today it is used as infrastructure for Google Cloud Platform as well as for Google's internal systems.
● The following characteristics were mentioned at Google Faculty Summit 2010.
  ○ Next-generation cluster-level file system
  ○ Automatically sharded metadata layer
  ○ Data typically written using Reed-Solomon (1.5x)
  ○ Client-driven replication, encoding and replication
  ○ Metadata space has enabled availability analyses
  ○ http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.reverse-proxy.org/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf
● Since the architectural details of Colossus are not yet published, this presentation explains the architecture of Google File System.

Google File System

What is Google File System?
● Large-scale distributed file system used in Google's internal systems to store large files.
● Optimized for file append and sequential file read for large files.
  ○ Other operations are supported but may be very slow.
● Transparent file replication for redundancy.
  ○ Each file is split into multiple 64MB chunks and each chunk is stored in (at least) three chunk servers.

Figure: Typical access patterns (handing over large data between servers; streaming data aggregation).

Optimized dataflow
● Data is transferred serially from a client to the chunk servers. Each chunk server starts forwarding the data right after it starts receiving it.
  ○ This is faster than sending data from a client to all chunk servers in parallel.
● Control messages are handled by the primary chunk server to keep the replicas consistent.

Figure: Dataflow to append data, and control flow to commit the write, between a client and the chunk servers (one primary, two secondaries).
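A rough back-of-the-envelope model shows why the serial (pipelined) transfer wins. The GFS paper gives the ideal time to push B bytes through R replicas as B/T + R·L, where T is the network throughput and L is the per-hop latency. The formula used below for the parallel case is an assumption of this sketch (the client's outbound link is shared by all replicas) and is not taken from the paper. The function names and the sample numbers are illustrative.

```python
def pipelined_seconds(size_bytes, replicas, throughput_bps, hop_latency_s):
    """Ideal time from the GFS paper: B/T + R*L."""
    return size_bytes * 8 / throughput_bps + replicas * hop_latency_s

def fan_out_seconds(size_bytes, replicas, throughput_bps, hop_latency_s):
    """Assumed model: the client sends R full copies over one outbound link."""
    return replicas * size_bytes * 8 / throughput_bps + hop_latency_s

# 1 MB to 3 replicas over 100 Mbps links with 1 ms per hop.
serial = pipelined_seconds(1_000_000, 3, 100e6, 0.001)
parallel = fan_out_seconds(1_000_000, 3, 100e6, 0.001)
print(f"pipelined: {serial:.3f} s, fan-out from client: {parallel:.3f} s")
```

Under these assumptions the pipelined transfer uses each machine's full outbound bandwidth, while the fan-out transfer is bounded by the client's single link.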

Data corruption detection
● Each chunk is associated with a checksum to detect data corruption.
● The whole chunk is read and validated with the checksum for the read operation.
  ○ This is optimized for the sequential read.
● A new checksum is calculated with appended data and the existing checksum for the write operation.
  ○ This is optimized for the file append.
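The append-friendly checksum update can be sketched with a running CRC. This is only an illustration using Python's `zlib.crc32`, which accepts a starting value; the actual checksum algorithm and block layout used by Google File System are not claimed here. The helper names are hypothetical.

```python
import zlib

def append_with_checksum(chunk, checksum, new_data):
    """Append data and update the checksum from the old checksum and the new data only."""
    chunk.extend(new_data)
    return zlib.crc32(new_data, checksum)

def verify(chunk, checksum):
    """Read path: recompute the checksum over the whole chunk and compare."""
    return zlib.crc32(bytes(chunk)) == checksum

chunk = bytearray()
checksum = zlib.crc32(b"")
checksum = append_with_checksum(chunk, checksum, b"record-1;")
checksum = append_with_checksum(chunk, checksum, b"record-2;")
print(verify(chunk, checksum))
```

The write path never rereads existing data, while the read path validates everything it returns, which matches the append and sequential-read workloads described above.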


Bigtable

What is Bigtable?
● Large-scale distributed key-value style datastore used in Google's internal systems to store structured data with varying data sizes (from web page URLs to satellite imagery).
● Google Cloud Platform offers a managed service for Bigtable with HBase-compatible APIs.

Figure: Column family design to store HTML contents and reverse links (excerpt from the research paper).

Row as a Database
● Data is identified with "Row Key + Column family: Column" (+ timestamp).
● You may think of a single row as a small database.
  ○ A column family represents a table.
  ○ Columns can be dynamically added to a column family.
  ○ Atomic operations can be used within a single row.

Figure: Column family design for user profiles and query histories.
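The addressing scheme "row key + column family: column (+ timestamp)" can be modeled with nested lookups. The following is a toy Python sketch, not the Bigtable API; the class `Row`, the column family names `profile` and `queries`, and the sample values are all hypothetical.

```python
from collections import defaultdict

class Row:
    """One row: (family, column) -> list of (timestamp, value), newest first."""

    def __init__(self):
        self.cells = defaultdict(list)

    def put(self, family, column, timestamp, value):
        versions = self.cells[(family, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, family, column, at=None):
        """Newest value with timestamp <= `at`, or the latest value if `at` is None."""
        for timestamp, value in self.cells.get((family, column), []):
            if at is None or timestamp <= at:
                return value
        return None

# A table maps row keys to rows; columns are added on the fly.
table = {"alice": Row()}
table["alice"].put("profile", "email", 1, "alice@example.com")
table["alice"].put("profile", "email", 5, "alice@flywheel.example")
table["alice"].put("queries", "q-0001", 3, "bigtable paper")

print(table["alice"].get("profile", "email"))
print(table["alice"].get("profile", "email", at=4))
```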

Global view of the "big" table
● Rows are stored in lexicographic order by row key. The row range for a table is dynamically partitioned into units called 'tablets'.
  ○ This strategy is optimized for fast row range scans.
● Tablet servers provide access to tablets. The tablet assignment is managed by a master node.
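Because rows are sorted by key and each tablet covers a contiguous key range, finding the tablet for a row is a binary search over the split points. The following Python sketch assumes each tablet is identified by its (inclusive) end key; the split points and server names are hypothetical.

```python
import bisect

# Each tablet covers the row keys up to and including its end key.
tablet_end_keys = ["f", "m", "t", "\uffff"]
tablet_servers = ["server-a", "server-b", "server-c", "server-d"]

def locate(row_key):
    """Return the server holding the tablet whose range contains `row_key`."""
    return tablet_servers[bisect.bisect_left(tablet_end_keys, row_key)]

print(locate("apple"), locate("m"), locate("zebra"))
```

A row range scan touches only the tablets between `locate(start_key)` and `locate(end_key)`, which is why sorted row keys make range scans fast.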


Tablet representation
● Tablet data consists of in-memory data (memtable) and immutable files (SSTables) stored in Google File System.
  ○ SSTables store a frozen view of a tablet at some point in time. Updates are appended to a tablet log and to the memtable.
  ○ A tablet server constructs the unified view of the tablet by combining the memtable and the SSTables.
● When the memtable becomes too large, a new memtable is created and the old one is frozen into a new SSTable (minor compaction).
● When there are too many SSTables, they are merged into a single SSTable, discarding obsolete entries (major compaction).

Figure: Tablet representation mechanism (excerpt from the research paper).
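The memtable and SSTable mechanism can be sketched in a few lines. This is a toy Python model under simplifying assumptions: SSTables are plain dictionaries, the memtable limit is a row count, and deletions are not modeled. The class and method names are hypothetical.

```python
class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []          # tablet log: every update is appended here first
        self.memtable = {}     # recent updates, in memory
        self.sstables = []     # frozen views, newest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memtable into a new SSTable and start a fresh memtable."""
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}

    def major_compaction(self):
        """Merge all SSTables into one; newer entries replace obsolete ones."""
        merged = {}
        for sstable in reversed(self.sstables):  # oldest first
            merged.update(sstable)
        self.sstables = [merged]

    def read(self, key):
        """Unified view: memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            if key in sstable:
                return sstable[key]
        return None

tablet = Tablet(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    tablet.write(key, value)
print(len(tablet.sstables), tablet.read("a"))
tablet.major_compaction()
print(len(tablet.sstables), tablet.read("a"))
```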

Cloud Datastore / Megastore

Overview of Megastore
● Megastore provides ACID semantics for globally distributed datasets using a fast synchronous replication mechanism based on (an enhanced version of) Paxos.
● This part explains the index structure of Cloud Datastore implemented on top of Megastore.
● Note that ancestor and global queries are additional features of Cloud Datastore. They are not a part of Megastore.

Figure: Multi-datacenter replication architecture of Megastore (excerpt from the research paper).

Index structure for ancestor queries

How are entities stored in Bigtable?
● Row key: entity key (ancestor path + ID).
  ○ The whole entity group can be scanned by a row range scan (depth-first search).
● Column family: properties of an entity.
  ○ An independent column family is used for each property.

Row key | status of the group | email | title | size | access_count
Organization, 'Flywheel' | (see note) | | | |
Organization, 'Flywheel', User, 'Alice' | | xxxx | | |
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | | | xxxx | 1024 | 9
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | | | xxxx | 128 | 5
Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d' | | | xxxx | 256 | 3
Organization, 'Flywheel', User, 'Bob' | | xxxx | | |
...

Note: The transaction log and replication status are recorded in the "status of the group" column of the root row for operations with strong consistency. The rows of one entity group are contiguous, so they can be read with a row range scan.

Ancestor query without inequality filters
● The following queries don't require an additional index since they can be done by a row range scan.
  ○ SELECT * FROM Mail WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
  ○ SELECT * FROM Mail WHERE size=256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
● The scan starts from a row with the specified ancestor key (marked "Starts from here" in the figure).

Row key | status of the group | email | title | size
Organization, 'Flywheel' | | | |
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | | | xxxx | 1024
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | | | xxxx | 128
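The row range scan behind these two queries can be sketched as a prefix scan over sorted row keys. The following Python sketch models row keys as tuples and the entity table as a dictionary; the function `ancestor_query` and its signature are hypothetical and only illustrate the idea.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Entity table: row key -> properties (one column family per property).
rows = {
    ALICE: {"email": "xxxx"},
    ALICE + ("Mail", "15de6"): {"title": "xxxx", "size": 1024, "access_count": 9},
    ALICE + ("Mail", "65067"): {"title": "xxxx", "size": 128, "access_count": 5},
    ALICE + ("Mail", "c4c0d"): {"title": "xxxx", "size": 256, "access_count": 3},
    BOB: {"email": "xxxx"},
}
row_keys = sorted(rows)  # tuple order keeps each subtree contiguous

def ancestor_query(kind, ancestor, **equals):
    """Scan from the ancestor's row until the key prefix no longer matches."""
    results = []
    start = bisect.bisect_left(row_keys, ancestor)
    for key in row_keys[start:]:
        if key[:len(ancestor)] != ancestor:
            break  # left the ancestor's subtree
        props = rows[key]
        if key[-2] == kind and all(props.get(n) == v for n, v in equals.items()):
            results.append(key)
    return results

print(ancestor_query("Mail", ALICE))
print(ancestor_query("Mail", ALICE, size=256))
```

Equality filters are checked row by row during the scan, so no extra index table is needed in this model.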


Ancestor query with inequality filters
● The following query requires an additional index.
  ○ SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
● Theoretically it's possible to do the same table scan, but it may not be efficient enough. Instead, the following index should be used.

  indexes:
  - kind: Mail
    ancestor: yes
    properties:
    - name: size

  ○ The row key of this index table consists of:
    ■ "Ancestor of the entity" + "Property value" + "Entity key (ancestor path + ID)"

Single-property indexes for ancestor queries
● Each entity is mapped to multiple rows corresponding to all its ancestors.
  ○ The following example shows the rows for two entities.
  ○ The rows are then sorted in the order of row keys.

Row key: Ancestor | Property value | Entity key (ancestor path + ID) → Column
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → Pointer to entity
Organization, 'Flywheel', User, 'Alice' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → Pointer to entity
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → Pointer to entity
Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → Pointer to entity
Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → Pointer to entity
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → Pointer to entity
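The mapping from one entity to several index rows can be sketched as follows. This Python sketch assumes keys are flat tuples of (kind, ID) pairs and that the entity's own key is included as the last "ancestor" row, as in the table above; the function name `index_rows` is hypothetical.

```python
def index_rows(entity_key, value):
    """One index row per prefix of the key: (ancestor, property value, entity key)."""
    return [
        (entity_key[:end], value, entity_key)
        for end in range(2, len(entity_key) + 1, 2)
    ]

mail = ("Organization", "Flywheel", "User", "Alice", "Mail", "15de6")
for row in index_rows(mail, 1024):
    print(row)
```

An entity three levels deep produces three rows per indexed property, which is the storage cost of being able to start an ancestor query at any level.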


Single-property indexes for ancestor queries
● Using the row keys, which are sorted in lexicographic order:
  ○ First, the row range is limited by the specified ancestor.
  ○ The row range is narrowed further by the inequality filter.

Row key: Ancestor | Property value | Entity key (ancestor path + ID)
Organization, 'Flywheel' | 64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel', User, 'Alice' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'

SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
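The two narrowing steps amount to two binary searches over the sorted index rows. The following Python sketch rebuilds the index from the six Mail entities in the example; the tuple representation of row keys and the function `size_greater_than` are assumptions made for illustration.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Mail entities and their `size` property.
sizes = {
    BOB + ("Mail", "15de6"): 64,
    ALICE + ("Mail", "65067"): 128,
    ALICE + ("Mail", "c4c0d"): 256,
    BOB + ("Mail", "c6f4c"): 256,
    ALICE + ("Mail", "15de6"): 1024,
    BOB + ("Mail", "f67de"): 1024,
}

# Index rows (ancestor, size, entity key), sorted like Bigtable row keys.
index = sorted(
    (key[:end], size, key)
    for key, size in sizes.items()
    for end in range(2, len(key), 2)
)
prefixes = [(ancestor, size) for ancestor, size, _ in index]

def size_greater_than(ancestor, bound):
    """Single contiguous range: rows of `ancestor` whose size exceeds `bound`."""
    lo = bisect.bisect_right(prefixes, (ancestor, bound))
    hi = bisect.bisect_right(prefixes, (ancestor, float("inf")))
    return [key for _, _, key in index[lo:hi]]

print(size_greater_than(ORG, 256))
print(size_greater_than(ALICE, 256))
```

The query never inspects rows outside `index[lo:hi]`, which is what makes the inequality filter efficient once the index exists.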


Composite indexes for multiple conditions
● Indexes with multiple properties are used for queries with multiple conditions.
● The following query requires the composite index.
  ○ SELECT * FROM Mail WHERE size=256 AND access_count<5 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')

  indexes:
  - kind: Mail
    ancestor: yes
    properties:
    - name: size
    - name: access_count

● The order of properties in the index definition has meaning.
  ○ The property for the equality filter must come first.

Row key: Ancestor | size | access_count | Entity key (ancestor path + ID)
Organization, 'Flywheel' | 64 | 1 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
Organization, 'Flywheel' | 128 | 5 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel' | 256 | 3 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
Organization, 'Flywheel' | 256 | 8 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
Organization, 'Flywheel' | 1024 | 2 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Organization, 'Flywheel' | 1024 | 9 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'

Multiple inequality filters are not allowed!
● The following query is not allowed.
  ○ SELECT * FROM Mail WHERE size>128 AND access_count<5 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
  ○ The rows of the index table that satisfy this condition cannot be expressed as a single range.
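The "single range" restriction can be checked by hand on the composite index rows. The following Python sketch uses the (size, access_count) pairs of the six Mail entities under Organization 'Flywheel' and tests whether the matching rows are contiguous in sorted order; the helper names are hypothetical.

```python
# Composite index rows (size, access_count), sorted like Bigtable row keys.
index = sorted([(64, 1), (128, 5), (256, 3), (256, 8), (1024, 9), (1024, 2)])

def matching_positions(predicate):
    """Positions of the index rows that satisfy the filter."""
    return [i for i, row in enumerate(index) if predicate(*row)]

def is_single_range(positions):
    """True if the positions form one contiguous run (or are empty)."""
    if not positions:
        return True
    return positions == list(range(positions[0], positions[-1] + 1))

equality_then_inequality = matching_positions(lambda size, count: size == 256 and count < 5)
two_inequalities = matching_positions(lambda size, count: size > 128 and count < 5)

print(is_single_range(equality_then_inequality))
print(is_single_range(two_inequalities))
```

With an equality filter on the first property, the second property is sorted within the matching block, so the result is one range. With two inequalities, matching and non-matching rows interleave, so a single range scan cannot answer the query.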


Strong consistency of ancestor queries
● Indexes with "ancestor: yes" are used for ancestor queries, where independent indexes are created for each ancestor tree.
  ○ A single index table contains entries only for one entity group.
● Indexes are created in each datacenter and replicated.
  ○ Replication status is checked before starting a query to guarantee strong consistency.

Row key | status of the group | email | title | size | access_count
Organization, 'Flywheel' (root entity) | Replication status | | | |
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | | | xxxx | 1024 | 9
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | | | xxxx | 128 | 5

Index structure for global queries

Indexes for global queries
● Indexes with "ancestor: no" are used for global queries, where indexes are created for each kind.
  ○ One index table contains all entities of a specific kind, including entities from multiple entity groups.
● Megastore handles operations across entity groups with weaker consistency unless two-phase commit is used.
● On the Cloud Datastore layer, this results in the eventual consistency of global queries.

Figure: Operations across entity groups (excerpt from the research paper).

Default single-property indexes
● Single-property indexes for global queries are automatically created (in both asc and desc orders).
  ○ Ancestors are not included in the row keys of the index table.
● For example, the following queries use the default indexes.
  ○ SELECT * FROM Mail WHERE size>256
  ○ SELECT size FROM Mail WHERE size>256

Row key: Property value | Entity key (ancestor path + ID)
64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'

Composite indexes for global queries
● Indexes with multiple properties (composite indexes) need to be created manually.
  ○ SELECT * FROM Mail WHERE size=256 AND access_count>5
● Projection queries also need composite indexes so that values can be retrieved directly from the index table.
  ○ SELECT title FROM Mail WHERE size>256 (projection query)
  ○ With the second index below, 'title' can be retrieved directly from the index table.

  indexes:
  - kind: Mail
    ancestor: no
    properties:
    - name: size
    - name: access_count
  - kind: Mail
    ancestor: no
    properties:
    - name: size
    - name: title

Index direction matters for sort orders
● "ORDER BY" requires the corresponding index.
● When used with an equality filter, the index direction needs to match the sort order.
  ○ SELECT * FROM Mail WHERE size=256 ORDER BY access_count DESC

  indexes:
  - kind: Mail
    ancestor: no
    properties:
    - name: size
    - name: access_count
      direction: desc

● "ORDER BY" cannot be mixed with an inequality filter on another property.
  ○ The following query is not allowed.
  ○ SELECT * FROM Mail WHERE size>256 ORDER BY access_count DESC

Design guide for entity groups

● Avoid global queries (queries without specifying an ancestor) unless you understand what you are doing.
  ○ Global queries may not retrieve the latest data.
● Split data into entity groups so that updates within a single entity group are infrequent.
  ○ Updates to the entities in a single entity group should stay below 1 update/sec.
● Examples:
  ○ Web mail service
    ■ An entity group of mails for each user.
  ○ SNS user group service
    ■ An entity group of the user profile for each user.
    ■ An entity group of posts for each user group.
    ■ An entity group of group names and pointers to group sites, which provides a catalog of user groups.
  ○ Online map service
    ■ An entity group of patches for an arbitrary region of the globe.

References
● Under the Covers of the Google App Engine Datastore
  ○ https://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore
● How Entities and Indexes are Stored
  ○ https://cloud.google.com/appengine/articles/storage_breakdown
● Balancing Strong and Eventual Consistency with Google Cloud Datastore
  ○ https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/

Notes on Spanner

What is Spanner?
● Spanner: Google's Globally-Distributed Database
  ○ http://research.google.com/archive/spanner.html
● Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is used as a successor to Megastore in Google's internal systems.
● It is designed to overcome the shortcomings of Megastore and to support general-purpose transactions with an SQL-based query language.
● Examples of Megastore's shortcomings:
  ○ It doesn't support the relational data model or an SQL-based query language.
  ○ Transactions and strong consistency are limited to a single entity group.
  ○ The number of updates is limited to 1 update/sec for each entity group.

Infrastructure design
● The overall server architecture of Spanner resembles Megastore over Bigtable.
  ○ A cluster in each zone contains multiple spanservers. Zones are distributed across data centers.
  ○ Each spanserver manages tablets, which hold the key-value mappings: (key: string, timestamp: int64) → value: string
  ○ Backend data files are stored in Colossus.
● Unlike Bigtable, rows in a tablet are versioned with a system time instead of user-specified timestamps.
  ○ The versioning mechanism is used for snapshot reads and lock-free read-only transactions.

Figure: Spanner server organization (excerpt from the research paper).

Paxos-based tablet replication
● Tablets in different zones are replicated with a Paxos-based algorithm.
  ○ A leader in each replication group takes care of row-range write locks during read-write transactions. A leader is re-elected through Paxos if necessary.
  ○ For transactions that involve multiple replication groups, the transaction managers of each group cooperate to perform a two-phase commit.

Figure: Replication between tablets (excerpt from the research paper).

So..., what's the problem?
● The problem with a Paxos-based algorithm is that part of the replication is done asynchronously.
  ○ When a majority of the replicas have agreed to write the data, it's considered to be committed. The remaining replication is done asynchronously.
  ○ If you enforce genuine full replication on each write, performance will be highly degraded. (This is partly the reason for the limited rate of strongly consistent updates on Megastore.)
● Spanner associates timestamps with all writes, and every replica tracks a value called "safe time" (t-safe), which is the maximum timestamp at which the replica is up-to-date.
  ○ A replica can satisfy a read request for a timestamp t if t <= t-safe. If not, another replica is used.
  ○ t-safe advances at each Paxos write. During a transaction, the advancement is delayed until the transaction finishes.
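The read rule "t <= t-safe" can be sketched as a simple replica selection. The following Python sketch assumes each replica exposes its current t-safe as a number; the function `pick_replica`, the zone names, and the timestamps are hypothetical.

```python
def pick_replica(t_safe_by_replica, read_timestamp):
    """Return the first replica that is up to date for `read_timestamp`, or None."""
    for replica, t_safe in t_safe_by_replica.items():
        if read_timestamp <= t_safe:
            return replica
    return None

# Safe times per replica, as arbitrary integer timestamps.
replicas = {"zone-a": 100, "zone-b": 120, "zone-c": 95}

print(pick_replica(replicas, 90))
print(pick_replica(replicas, 110))
print(pick_replica(replicas, 130))
```

When no replica qualifies, as in the last call, a real system would have to wait for t-safe to advance; that waiting logic is outside this sketch.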


So..., again, what's the problem?
● The timestamp-based tracking requires that the clocks on all replicas are synchronized.
  ○ At the very least, clocks should be calibrated to within a limited amount of uncertainty, and the range of uncertainty should be known to the system.
● Spanner clusters are equipped with the TrueTime API system, which consists of multiple time servers using GPS and atomic clocks.
  ○ The TrueTime API provides a time interval that is guaranteed to contain the current time.
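The interval-based view of time can be sketched as follows. The Spanner paper describes TT.now() as returning an interval with earliest and latest bounds, plus TT.after(t) and TT.before(t). The Python functions below imitate that shape under the assumption that the local clock reading and the uncertainty bound are passed in explicitly; they are not a real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float

def tt_now(local_clock, uncertainty):
    """Interval that is assumed to contain the true current time."""
    return TTInterval(local_clock - uncertainty, local_clock + uncertainty)

def tt_after(t, local_clock, uncertainty):
    """True only if `t` has definitely passed."""
    return tt_now(local_clock, uncertainty).earliest > t

def tt_before(t, local_clock, uncertainty):
    """True only if `t` has definitely not arrived yet."""
    return tt_now(local_clock, uncertainty).latest < t

interval = tt_now(100.0, 0.5)
print(interval)
print(tt_after(99.0, 100.0, 0.5), tt_after(99.75, 100.0, 0.5))
```

A timestamp that falls inside the interval is neither definitely past nor definitely future, which is why the size of the uncertainty matters to the system.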

Figure: Fluctuations of time drifts from time servers (excerpt from the research paper). Annotations: hardware maintenance of two time servers; network latency improvement.

Thank you!
