Dissecting Scalable Database Architectures

Doug Judd, CEO, Hypertable Inc.


Presentation by Doug Judd, co-founder of Hypertable Inc., at the Groupon office in Palo Alto, CA, on November 15th, 2012.

Transcript of Dissecting Scalable Database Architectures

Page 1: Dissecting Scalable Database Architectures

Dissecting Scalable Database Architectures
Doug Judd
CEO, Hypertable Inc.

Page 2: Dissecting Scalable Database Architectures

Talk Outline
• Scalable “NoSQL” Architectures
• Next-generation Architectures
• Future Evolution - Hardware Trends

Page 3: Dissecting Scalable Database Architectures

Scalable NoSQL Architecture Categories
• Auto-sharding
• Dynamo
• Bigtable

Page 4: Dissecting Scalable Database Architectures

Auto-Sharding

Page 5: Dissecting Scalable Database Architectures

Auto-Sharding

Page 6: Dissecting Scalable Database Architectures

Auto-sharding Systems
• Oracle NoSQL Database
• MongoDB

Page 7: Dissecting Scalable Database Architectures

Dynamo
• “Dynamo: Amazon’s Highly Available Key-value Store”
  – Amazon.com, 2007
• Distributed Hash Table (DHT)
• Handles inter-datacenter replication
• Designed for High Write Availability

Page 8: Dissecting Scalable Database Architectures

Consistent Hashing
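
A minimal sketch of consistent hashing, the key-placement technique named on this slide. The `HashRing` class, the use of MD5, and the `vnodes` parameter are illustrative assumptions, not taken from the talk or from any particular Dynamo-based system.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (a 128-bit integer from MD5)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Toy consistent-hash ring; `vnodes` is a hypothetical tuning knob."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at several positions to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring position at or after the key."""
        idx = bisect.bisect_left(self._points, _hash(key))
        return self._owner[self._points[idx % len(self._points)]]
```

When a node joins, the only keys that change owner are the ones the new node takes over; all other keys stay where they were.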

Page 9: Dissecting Scalable Database Architectures

Eventual Consistency

Page 10: Dissecting Scalable Database Architectures

Vector Clocks
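
A small sketch of vector clocks, assuming a clock is a plain dict from node name to counter. The function names are hypothetical and chosen only for illustration.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions as equal, ordered, or concurrent."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"  # conflicting versions that need reconciliation


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum, used once concurrent versions are reconciled."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

Two writes that both start from the same version but go through different nodes compare as "concurrent", which is the case an eventually consistent store has to surface or resolve.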

Page 11: Dissecting Scalable Database Architectures

Dynamo-based Systems
• Cassandra
• DynamoDB
• Riak
• Voldemort

Page 12: Dissecting Scalable Database Architectures

Bigtable
• “Bigtable: A Distributed Storage System for Structured Data”
  - Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication

Page 13: Dissecting Scalable Database Architectures

Google Architecture

Page 14: Dissecting Scalable Database Architectures

Google File System

Page 15: Dissecting Scalable Database Architectures

Google File System

Page 16: Dissecting Scalable Database Architectures

Table: Growth Process

Page 17: Dissecting Scalable Database Architectures

Scaling (part 1)

Page 18: Dissecting Scalable Database Architectures

Scaling (part 2)

Page 19: Dissecting Scalable Database Architectures

Scaling (part 3)

Page 20: Dissecting Scalable Database Architectures

System overview

Page 21: Dissecting Scalable Database Architectures

Database Model

• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key
  • Row (string)
  • Column Family
  • Column Qualifier (string)
  • Timestamp

Page 22: Dissecting Scalable Database Architectures

Table: Visual Representation

Page 23: Dissecting Scalable Database Architectures

Table: Actual Representation

Page 24: Dissecting Scalable Database Architectures

Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones' complement
• Simple byte-wise comparison
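
A sketch of why this encoding allows plain byte-wise comparison. The exact field layout below (NUL terminators, field order, omission of the revision) is an assumption for illustration and not Hypertable's actual on-disk format; only the 1-byte column family and the big-endian ones' complement timestamp come from the slide.

```python
import struct

MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_key(row: str, family_id: int, qualifier: str, timestamp: int) -> bytes:
    """Serialize a cell key so that comparing raw bytes gives the sort order."""
    # Ones' complement flips every bit, so a larger timestamp becomes a
    # smaller unsigned value; big-endian keeps the high-order byte first.
    inverted = timestamp ^ MASK_64
    return (
        row.encode("utf-8") + b"\x00"          # row, NUL-terminated (assumed)
        + bytes([family_id])                   # column family in 1 byte
        + qualifier.encode("utf-8") + b"\x00"  # qualifier, NUL-terminated (assumed)
        + struct.pack(">Q", inverted)          # 8-byte big-endian timestamp
    )
```

With this layout, keys for the same cell sort newest-first, so the most recent version is the first one a scan encounters.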

Page 25: Dissecting Scalable Database Architectures

Log Structured Merge Tree

Page 26: Dissecting Scalable Database Architectures

Range Server: CellStore
• Sequence of 65K blocks of compressed key/value pairs
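
A toy sketch of the idea: sorted key/value pairs are packed into blocks of roughly 65K, each block is compressed, and a small index of last keys locates the one block a lookup must decompress. The length-prefixed record format, the use of zlib, and all function names are assumptions, not the actual CellStore format.

```python
import bisect
import struct
import zlib


def _pack(pairs):
    """Serialize pairs as length-prefixed records (assumed format)."""
    out = bytearray()
    for key, value in pairs:
        out += struct.pack(">II", len(key), len(value)) + key + value
    return bytes(out)


def _unpack(data):
    pos = 0
    while pos < len(data):
        klen, vlen = struct.unpack_from(">II", data, pos)
        pos += 8
        yield data[pos:pos + klen], data[pos + klen:pos + klen + vlen]
        pos += klen + vlen


def build_cell_store(sorted_pairs, block_size=65536):
    """Return (index, blocks); index[i] is the last key stored in blocks[i]."""
    index, blocks, pending, size = [], [], [], 0
    for key, value in sorted_pairs:
        pending.append((key, value))
        size += 8 + len(key) + len(value)
        if size >= block_size:
            index.append(key)
            blocks.append(zlib.compress(_pack(pending)))
            pending, size = [], 0
    if pending:
        index.append(pending[-1][0])
        blocks.append(zlib.compress(_pack(pending)))
    return index, blocks


def lookup(index, blocks, key):
    """Decompress only the single block whose key range can hold the key."""
    i = bisect.bisect_left(index, key)
    if i == len(index):
        return None
    for k, v in _unpack(zlib.decompress(blocks[i])):
        if k == key:
            return v
    return None
```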

Page 27: Dissecting Scalable Database Architectures

Bloom Filter
• Associated with each Cell Store
• Dramatically reduces disk access
• Tells you if a key is definitely not present
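
A minimal Bloom filter sketch. The bit-array size, the number of hash functions, and the SHA-256 double-hashing scheme are illustrative assumptions, not the parameters any particular system uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions from two base hashes (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

A `False` answer lets the reader skip the file entirely. A `True` answer can be a false positive, so the file still has to be checked.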

Page 28: Dissecting Scalable Database Architectures

Request Routing

Page 29: Dissecting Scalable Database Architectures

Bigtable-based Systems
• Accumulo
• HBase
• Hypertable

Page 30: Dissecting Scalable Database Architectures

Next-generation Architectures

• PNUTS (Yahoo, Inc.)
• Spanner (Google, Inc.)
• Dremel (Google, Inc.)

Page 31: Dissecting Scalable Database Architectures

PNUTS

• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
  • Hashed tables implemented via proprietary disk-based hash
  • Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!

Page 32: Dissecting Scalable Database Architectures

PNUTS System Architecture

Page 33: Dissecting Scalable Database Architectures

Record-level Mastering

• Provides per-record timeline consistency
• Master is adaptively changed to suit workload
• Region names are two bytes associated with each record

Page 34: Dissecting Scalable Database Architectures

PNUTS API

• Read-any
• Read-critical(required_version)
• Read-latest
• Write
• Test-and-set-write(required_version)
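
A toy model of how these five calls differ, assuming a single record with one master copy and one lagging replica. The `Record` class and its `propagate` method are hypothetical stand-ins; the real system replicates asynchronously across regions.

```python
class Record:
    """One record with a master copy and a possibly stale replica."""

    def __init__(self):
        self.master = (0, None)   # (version, value) at the record's master
        self.replica = (0, None)  # lagging copy in another region

    def write(self, value):
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def test_and_set_write(self, required_version, value):
        """Write only if the record is still at required_version."""
        if self.master[0] != required_version:
            return None  # someone else wrote first
        return self.write(value)

    def propagate(self):
        """Stand-in for asynchronous replication catching up."""
        self.replica = self.master

    def read_any(self):
        return self.replica  # fast, but may be stale

    def read_latest(self):
        return self.master   # always current, may cost a remote hop

    def read_critical(self, required_version):
        """Return a copy at least as new as required_version."""
        if self.replica[0] >= required_version:
            return self.replica
        return self.master
```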

Page 35: Dissecting Scalable Database Architectures

Spanner

• Globally distributed database (cross-datacenter replication)
• Synchronously Replicated
• Externally-consistent distributed transactions
• Globally distributed transaction management
• SQL-based query language

Page 36: Dissecting Scalable Database Architectures

Spanner Server Organization

Page 37: Dissecting Scalable Database Architectures

Spanserver

• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings: (key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet
• Tablet may contain multiple directories
  • Set of contiguous keys that share a common prefix
  • Unit of data placement
  • Can be moved between Tablets for performance reasons
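
A sketch of the tablet's data model only, the `(key, timestamp) -> string` mapping, with a read that returns the newest value at or before a given timestamp. The `Tablet` class here is hypothetical and leaves out Paxos, directories, and persistence.

```python
import bisect
from collections import defaultdict


class Tablet:
    """In-memory bag of (key, timestamp) -> value mappings."""

    def __init__(self):
        self._stamps = defaultdict(list)  # key -> sorted timestamps
        self._values = defaultdict(list)  # key -> values, parallel to _stamps

    def write(self, key: str, timestamp: int, value: str) -> None:
        i = bisect.bisect_right(self._stamps[key], timestamp)
        self._stamps[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def read(self, key: str, timestamp: int):
        """Return the newest value written at or before timestamp, else None."""
        stamps = self._stamps.get(key, [])
        i = bisect.bisect_right(stamps, timestamp)
        return self._values[key][i - 1] if i else None
```

Keeping every timestamped version is what makes a snapshot read at an older timestamp possible.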

Page 38: Dissecting Scalable Database Architectures

TrueTime

• Universal Clock
• Set of time master servers per-datacenter
  • GPS clock via GPS receivers with dedicated antennas
  • Atomic clock
• Time daemon runs on every machine
• TrueTime API:
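
A rough sketch of such an interface, following the three calls described in the Spanner paper (`TT.now()`, `TT.after(t)`, `TT.before(t)`). The class name, the fixed `epsilon` uncertainty, and the injectable `clock` are assumptions for illustration; the real service derives its uncertainty bound from the time masters.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Clock that reports an interval guaranteed to contain the true time."""

    def __init__(self, epsilon=0.007, clock=time.time):
        self._epsilon = epsilon  # hypothetical uncertainty bound, in seconds
        self._clock = clock

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self._epsilon, t + self._epsilon)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived yet."""
        return self.now().latest < t
```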

Page 39: Dissecting Scalable Database Architectures

Spanner Software Stack

Page 40: Dissecting Scalable Database Architectures

Externally-consistent Operations
• Read-Write Transaction
• Read-Only Transaction
• Snapshot Read (client-provided timestamp)
• Snapshot Read (client-provided bound)
• Schema Change Transaction

Page 41: Dissecting Scalable Database Architectures

Dremel

• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds

Page 42: Dissecting Scalable Database Architectures

Columnar Storage Format

• Novel format for storing lists of nested records (Protocol Buffers)
  • Highly space-efficient
  • Algorithm for dissecting list of nested records into columns
  • Algorithm for reassembling columns into list of records

Page 43: Dissecting Scalable Database Architectures

Multi-level Execution Trees

• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
  • Query gets re-written as it passes down the execution tree.
  • On the way up, intermediate servers perform a parallel aggregation of partial results.
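
A sketch of the upward half of this model for a `SUM ... GROUP BY` query: leaves aggregate their own rows, and each intermediate server merges the partial results of its children. The tuple-based tree encoding and the function names are assumptions for illustration, and the sketch runs sequentially rather than in parallel.

```python
def leaf_aggregate(rows):
    """Partial SUM(amount) GROUP BY country over one leaf's rows."""
    partial = {}
    for country, amount in rows:
        partial[country] = partial.get(country, 0) + amount
    return partial


def merge_partials(partials):
    """Combine the partial sums returned by child servers."""
    total = {}
    for partial in partials:
        for country, amount in partial.items():
            total[country] = total.get(country, 0) + amount
    return total


def execute(node):
    """node is ("leaf", rows) or ("node", [children])."""
    kind, payload = node
    if kind == "leaf":
        return leaf_aggregate(payload)
    return merge_partials(execute(child) for child in payload)
```

Because every level returns only one row per group, the data flowing up the tree stays small even when the leaves scan very large tables.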

Page 44: Dissecting Scalable Database Architectures

Performance

Page 45: Dissecting Scalable Database Architectures

Example Queries

• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1

• SELECT country, SUM(item.amount) FROM T2 GROUP BY country

• SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain

• SELECT COUNT(DISTINCT a) FROM T5

Page 46: Dissecting Scalable Database Architectures

Future Evolution - Hardware Trends
• SSD Drives
• Disk Drives
• Networking

Page 47: Dissecting Scalable Database Architectures

Flash Memory Rated Lifetime (P/E Cycles)

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Page 48: Dissecting Scalable Database Architectures

Flash Memory Average BER at Rated Lifetime

Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012

Page 49: Dissecting Scalable Database Architectures

Disk: Areal Density Trend

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Page 50: Dissecting Scalable Database Architectures

Disk: Maximum Sustained Bandwidth Trend

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Page 51: Dissecting Scalable Database Architectures

Time Required to Sequentially Fill a SATA Drive
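
The quantity in this chart is capacity divided by sustained bandwidth. A small sketch of that arithmetic follows; the function name is hypothetical, and any figures passed to it should be treated as illustrative inputs rather than numbers from the talk.

```python
def sequential_fill_hours(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Hours needed to write the whole drive at its sustained bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600.0
```

For example, an assumed 1 TB drive at an assumed 100 MB/s gives `sequential_fill_hours(1e12, 100e6)`, a little under three hours. If capacity grows faster than bandwidth, this number keeps rising.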

Page 52: Dissecting Scalable Database Architectures

Average Seek Time

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Page 53: Dissecting Scalable Database Architectures

Average Rotational Latency

Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011

Page 54: Dissecting Scalable Database Architectures

Time Required to Randomly Read a SATA Drive
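
A sketch of how such a figure can be estimated: each random read pays an average seek, an average rotational latency, and the transfer time of one block. The function name, the block size, and the sample figures below are assumptions, not values from the talk.

```python
def random_read_days(capacity_bytes, block_bytes, seek_s, rotational_s, bandwidth_bytes_per_s):
    """Days needed to read the whole drive one block at a time, in random order."""
    reads = capacity_bytes / block_bytes
    per_read_s = seek_s + rotational_s + block_bytes / bandwidth_bytes_per_s
    return reads * per_read_s / 86400.0
```

With assumed values of 1 TB, 4 KB blocks, 8 ms seek, 4 ms rotational latency, and 100 MB/s, `random_read_days(1e12, 4096, 0.008, 0.004, 100e6)` comes to roughly 34 days, dominated almost entirely by seek and rotation rather than transfer.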

Page 55: Dissecting Scalable Database Architectures

Ethernet
• 10GbE
  • Starting to replace 1GbE for server NICs
  • De facto network port for new servers in 2014
• 40GbE
  • Data center core & aggregation
  • Top-of-rack server aggregation
• 100GbE
  • Service Provider core and aggregation
  • Metro and large Campus core
  • Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE solved using either 4 or 10 parallel 10GbE “lanes”

Page 56: Dissecting Scalable Database Architectures

10GbE Adoption Curve (?)

Source: CREHAN RESEARCH Inc. © Copyright 2012

Page 57: Dissecting Scalable Database Architectures

The End
Thank you!