Dissecting Scalable Database Architectures

Doug Judd
CEO, Hypertable Inc.

Talk Outline
• Scalable “NoSQL” Architectures
• Next-generation Architectures
• Future Evolution - Hardware Trends

Scalable NoSQL Architecture Categories
• Auto-sharding
• Dynamo
• Bigtable

Auto-Sharding

Auto-sharding Systems
• Oracle NoSQL Database
• MongoDB

Dynamo
• “Dynamo: Amazon’s Highly Available Key-value Store”
  - Amazon.com, 2007
• Distributed Hash Table (DHT)
• Handles inter-datacenter replication
• Designed for High Write Availability

Consistent Hashing
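
The diagram for this slide is not in the transcript. As a rough illustration of the idea, here is a minimal consistent-hash ring. The class name `HashRing`, the use of MD5, and the virtual-node count are assumptions made for the sketch, not details of Dynamo itself.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a ring position using MD5 (an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed at several positions on the ring.
        for i in range(self.vnodes):
            pos = ring_position(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def lookup(self, key: str) -> str:
        """Return the node owning the first position clockwise from the key."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Adding `node-d` only relocates the keys that now fall on its ring positions; every other key keeps its previous owner, which is the property that makes the scheme attractive for incremental scaling.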

Eventual Consistency

Vector Clocks
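
The figure for this slide is also missing. A minimal sketch of vector-clock bookkeeping follows, with clocks modeled as plain dictionaries from node name to counter. The function names and the node names `sx`, `sy`, `sz` are illustrative assumptions.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if a has seen every event that b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a after b"
    if b_ge_a:
        return "a before b"
    return "concurrent"  # neither version supersedes the other


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum, used when reconciling concurrent versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


v1 = increment({}, "sx")         # first write, handled by sx
v2 = increment(v1, "sx")         # second write, also via sx
v3 = increment(v2, "sy")         # one client updates via sy
v4 = increment(v2, "sz")         # another updates via sz at the same time
relation = compare(v3, v4)       # the two versions conflict
reconciled = increment(merge(v3, v4), "sx")
```

Here `v3` and `v4` both descend from `v2` but not from each other, so a reader sees two sibling versions and must reconcile them before writing back.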

Dynamo-based Systems
• Cassandra
• DynamoDB
• Riak
• Voldemort

Bigtable
• “Bigtable: A Distributed Storage System for Structured Data”
  - Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication

Google Architecture

Google File System

Table: Growth Process

Scaling (part 1)

Scaling (part 2)

Scaling (part 3)

System overview

Database Model

• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key
  - Row (string)
  - Column Family
  - Column Qualifier (string)
  - Timestamp

Table: Visual Representation

Table: Actual Representation

Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones' complement
• Simple byte-wise comparison
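
To show why the complemented, big-endian encoding lets a plain byte-wise comparison work, here is a small sketch. The exact field layout (NUL-terminated row and qualifier, 8-byte timestamp and revision) is an assumption for illustration and may differ from the real on-disk format.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def encode_key(row: str, family: int, qualifier: str,
               timestamp: int, revision: int) -> bytes:
    """Serialize a cell key so that sorting the raw bytes gives the desired order.

    Assumed layout: row NUL | family (1 byte) | qualifier NUL |
    ~timestamp (8 bytes, big-endian) | ~revision (8 bytes, big-endian).
    Rows and qualifiers are assumed not to contain NUL bytes.
    """
    if not 0 <= family <= 0xFF:
        raise ValueError("column family must fit in one byte")
    return (
        row.encode("utf-8") + b"\x00"
        + bytes([family])
        + qualifier.encode("utf-8") + b"\x00"
        + struct.pack(">Q", ~timestamp & MASK64)  # complement: newest sorts first
        + struct.pack(">Q", ~revision & MASK64)
    )


def decode_timestamp(key: bytes) -> int:
    """Recover the timestamp from the fixed-width tail of the key."""
    return struct.unpack(">Q", key[-16:-8])[0] ^ MASK64


versions = [encode_key("com.example", 1, "title", ts, ts) for ts in (100, 300, 200)]
newest_first = [decode_timestamp(k) for k in sorted(versions)]
```

Because the timestamp bits are inverted before being written big-endian, the largest timestamp produces the smallest byte string, so the most recent version of a cell is the first one a scan encounters.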

Log Structured Merge Tree
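
The diagram is not in the transcript. The general log-structured merge idea can be sketched as an in-memory table that is flushed to immutable sorted runs, with reads consulting the newest data first. `TinyLSM` and its parameters are hypothetical names for this sketch; a real range server also keeps a commit log and stores runs on disk.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []  # newest first; each run is a sorted list of (key, value)

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Freeze the memtable into a new sorted run."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, keeping the newest value for each key."""
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # memtable full: flushed as the first run
db.put("a", "3")
db.put("c", "4")   # flushed as a second, newer run
runs_before_compaction = len(db.runs)
value_of_a = db.get("a")
db.compact()
```

Writes never modify an existing run, which turns random updates into sequential I/O at the cost of periodic merges.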

Range Server: CellStore
• Sequence of 65K blocks of compressed key/value pairs

Bloom Filter
• Associated with each Cell Store
• Dramatically reduces disk access
• Tells you if a key is definitely not present
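
A minimal Bloom filter sketch illustrates the "definitely not present" guarantee. The sizes, the SHA-256 based double hashing, and the class name are assumptions chosen for brevity, not what any particular system uses.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means the key is definitely absent; True means 'maybe'."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


stored = [b"row-1", b"row-2", b"row-3"]
bf = BloomFilter()
for key in stored:
    bf.add(key)

absent = [f"missing-{i}".encode() for i in range(1000)]
false_positives = sum(bf.might_contain(k) for k in absent)
```

A lookup that gets `False` from the filter can skip reading the associated cell store entirely; only a `True` answer requires going to disk to confirm.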

Request Routing

Bigtable-based Systems
• Accumulo
• HBase
• Hypertable

Next-generation Architectures

• PNUTS (Yahoo, Inc.)
• Spanner (Google, Inc.)
• Dremel (Google, Inc.)

PNUTS

• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
  - Hashed tables implemented via proprietary disk-based hash
  - Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!

PNUTS System Architecture

Record-level Mastering

• Provides per-record timeline consistency
• Master is adaptively changed to suit workload
• Region names are two bytes associated with each record

PNUTS API

• Read-any
• Read-critical(required_version)
• Read-latest
• Write
• Test-and-set-write(required_version)
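
To make the difference between these calls concrete, here is a toy single-record model with one master copy and one lagging replica. The class, its method names, and the explicit `propagate` step are hypothetical; the real service replicates asynchronously across regions.

```python
class Record:
    """Toy model of one PNUTS record: a master copy and a stale-able replica."""

    def __init__(self, value: str):
        self.master = (1, value)   # (version, value) at the record's master
        self.replica = (1, value)  # copy held by the local, non-master region

    def write(self, value: str) -> int:
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def test_and_set_write(self, required_version: int, value: str):
        """Write only if the current version matches; otherwise return None."""
        if self.master[0] != required_version:
            return None
        return self.write(value)

    def propagate(self) -> None:
        """Stand-in for asynchronous replication catching up."""
        self.replica = self.master

    def read_any(self):
        return self.replica  # fast, possibly stale

    def read_critical(self, required_version: int):
        # Serve locally only if the local copy is at least as new as required.
        if self.replica[0] >= required_version:
            return self.replica
        return self.master

    def read_latest(self):
        return self.master  # always the newest version


rec = Record("v1")
new_version = rec.write("v2")                  # replica still holds version 1
stale = rec.read_any()
fresh_enough = rec.read_critical(new_version)
rejected = rec.test_and_set_write(1, "lost-update")
accepted = rec.test_and_set_write(2, "v3")
```

The trade-off is latency against freshness: `read_any` can always be answered locally, while `read_latest` may have to travel to the record's master region.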

Spanner

• Globally distributed database (cross-datacenter replication)
• Synchronously Replicated
• Externally-consistent distributed transactions
• Globally distributed transaction management
• SQL-based query language

Spanner Server Organization

Spanserver

• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings: (key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet
• Tablet may contain multiple directories
  - Set of contiguous keys that share a common prefix
  - Unit of data placement
  - Can be moved between Tablets for performance reasons

TrueTime

• Universal Clock
• Set of time master servers per-datacenter
  - GPS clock via GPS receivers with dedicated antennas
  - Atomic clock
• Time daemon runs on every machine
• TrueTime API:
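
The API table from the slide did not survive in this transcript. As described in the Spanner paper, `TT.now()` returns an interval guaranteed to contain the true time, and `TT.after(t)` / `TT.before(t)` report whether `t` has definitely passed or definitely not arrived. The sketch below imitates that interface with a simulated clock and a fixed uncertainty `epsilon`; both are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedClock:
    """Hypothetical stand-in for a machine's local clock, advanced manually."""

    def __init__(self, start: float = 100.0):
        self.t = start

    def read(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


class TrueTime:
    def __init__(self, clock: SimulatedClock, epsilon: float):
        self.clock = clock
        self.epsilon = epsilon  # assumed bound on clock uncertainty, in seconds

    def now(self) -> TTInterval:
        t = self.clock.read()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived."""
        return self.now().latest < t


def commit_wait(tt: TrueTime) -> float:
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest
    while not tt.after(commit_ts):
        tt.clock.sleep(tt.epsilon / 4)
    return commit_ts


clock = SimulatedClock(start=100.0)
tt = TrueTime(clock, epsilon=0.5)
ts = commit_wait(tt)
waited = clock.read() - 100.0
```

The wait comes out to a little over twice the uncertainty, which is why keeping `epsilon` small (hence the GPS and atomic clocks) matters for write latency.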

Spanner Software Stack

Externally-consistent Operations
• Read-Write Transaction
• Read-Only Transaction
• Snapshot Read (client-provided timestamp)
• Snapshot Read (client-provided bound)
• Schema Change Transaction

Dremel

• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds

Columnar Storage Format

• Novel format for storing lists of nested records (Protocol Buffers)
  - Highly space-efficient
  - Algorithm for dissecting list of nested records into columns
  - Algorithm for reassembling columns into list of records

Multi-level Execution Trees

• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
  - Query gets re-written as it passes down the execution tree.
  - On the way up, intermediate servers perform a parallel aggregation of partial results.
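
The upward half of this model can be sketched for a `SUM ... GROUP BY` query: leaf servers aggregate their own tablets, and each intermediate level merges the partial results of its children. The function names, fan-out, and sample rows are assumptions; query rewriting on the way down is not modeled here.

```python
from collections import Counter


def leaf_aggregate(rows) -> Counter:
    """Leaf server: one pass over a tablet, producing partial sums per group."""
    partial = Counter()
    for country, amount in rows:
        partial[country] += amount
    return partial


def combine(partials) -> Counter:
    """Intermediate server: merge the partial sums reported by its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


def run_tree(tablets, fanout: int = 2) -> Counter:
    """Aggregate bottom-up until a single (root) result remains."""
    level = [leaf_aggregate(tablet) for tablet in tablets]
    while len(level) > 1:
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()


# Hypothetical (country, amount) rows spread over four tablets.
tablets = [
    [("US", 10), ("DE", 5)],
    [("US", 7)],
    [("DE", 1), ("FR", 4)],
    [("FR", 6), ("US", 3)],
]
result = run_tree(tablets)
```

This works because `SUM` is associative: partial sums can be combined in any grouping, so each level of the tree shrinks the data before passing it upward.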

Performance

Example Queries

• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1

• SELECT country, SUM(item.amount) FROM T2 GROUP BY country

• SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS '.net' GROUP BY domain

• SELECT COUNT(DISTINCT a) FROM T5

Future Evolution - Hardware Trends
• SSD Drives
• Disk Drives
• Networking

Flash Memory Rated Lifetime (P/E Cycles)

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Flash Memory Average BER at Rated Lifetime

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Disk: Maximum Sustained Bandwidth Trend


Time Required to Sequentially Fill a SATA Drive

Average Seek Time


Average Rotational Latency
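
The chart for this slide is not in the transcript, but the quantity itself follows from spindle speed: on average the platter must turn half a revolution before the target sector arrives under the head. The RPM values below are common drive speeds chosen for illustration, not figures taken from the slide.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Half of one full rotation, in milliseconds."""
    ms_per_rotation = 60_000.0 / rpm
    return ms_per_rotation / 2.0


latencies = {rpm: average_rotational_latency_ms(rpm)
             for rpm in (5400, 7200, 10000, 15000)}
```

Since spindle speeds have stayed in this range for years, rotational latency remains in the low milliseconds regardless of how much drive capacity grows.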


Time Required to Randomly Read a SATA Drive

Ethernet
• 10GbE
  - Starting to replace 1GbE for server NICs
  - De facto network port for new servers in 2014
• 40GbE
  - Data center core & aggregation
  - Top-of-rack server aggregation
• 100GbE
  - Service Provider core and aggregation
  - Metro and large Campus core
  - Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE solved using either 4 or 10 parallel 10GbE “lanes”

The End
Thank you!
