Database Architecture & Scaling Strategies, in the Cloud & on the Rack


Transcript of Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Page 1: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

© 2015 Clustrix

Database Scaling Strategies, in the Cloud & on the Rack

Robbie Mihalyi, @Clustrix

Page 2: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB Overview

SQL SCALE-OUT

Resiliency

Capacity

Elasticity

Cloud

Page 3: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Cloud

o Commoditized hardware resources: rapid deployment and pay-by-the-hour pricing

o Access: publish your applications quickly; use existing services from the provider

o Capacity: scale resources as you need them

Utility Computing (bare metal) / Platform as a Service (PaaS) / SaaS

o Virtualized (shared) resources: you do not always get the performance envelope you ask for

o Dedicated (hardware) resources: available but expensive, and less flexible

Page 4: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

E-Commerce Applications: An Example of a Great Match for the Cloud

o Capacity needs vary with seasonality and specific events: some events can generate 10x normal traffic and increased conversion rates

o Sensitive to performance characteristics: throughput and latency

o Up-time is most crucial at the busiest times: every minute of downtime can mean thousands of dollars in lost revenue

Page 5: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


SQL SCALE-OUT

Resiliency

Capacity

Elasticity

Page 6: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

SQL SCALE-OUT

Resiliency

Capacity

Elasticity

SCALE: data, users, sessions

THROUGHPUT: concurrency, transactions

LATENCY: response time

Page 7: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Application Scaling (App Layer Only)

Easy Installation and Setup

o Load balancer (HAProxy or equivalent) distributes incoming requests

o Scale out by adding servers; all servers are the same, with no master

o Redundant backend network for low-latency cluster intercommunication

[Diagram: load balancer in front of commodity app servers]

Page 8: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Application Scaling (Database Layer)

Database Scaling Is Very Hard

o Data Consistency

o Read vs. Write Scale

o ACID Properties (if you care about them)

o Throughput and Latency

o Application Impact

Page 9: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Non-Relational (NoSQL) Database Architectures

o No imposed structure

o Relaxed or no ACID properties: BASE as the alternative to ACID

o Fast and scalable

o Suited for specific applications: IoT, click-stream, object store, document; good for insert workloads, not good for read/query apps

o An RDBMS can also provide a fast non-structured data store

Page 10: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


RDBMS SCALING

Page 11: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Scaling Up

o Keep increasing the size of the (single) database server

o Pros: simple; no application changes needed

o Cons: expensive (at some point, you're paying 5x for 2x the performance); 'exotic' hardware (128 cores and above) becomes price-prohibitive; eventually you 'hit the wall' and literally cannot scale up anymore

Page 12: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Scaling Reads: Master/Slave

o Add 'slave' read server(s) to your 'master' database server

o Pros: reasonably simple to implement; read/write fan-out can be done at the proxy level

o Cons: only adds read performance; data consistency issues can occur, especially if the application isn't coded to ensure reads from the slave are consistent with reads from the master
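The proxy-level read/write fan-out mentioned above can be sketched as a tiny router: writes go to the master, SELECTs round-robin across the slaves. All addresses here are hypothetical, and a real deployment would do this inside HAProxy or a MySQL-aware proxy rather than application code.

```python
import itertools

# Hypothetical addresses; a real deployment substitutes its own master/slave IPs.
MASTER = "10.0.0.1"
SLAVES = itertools.cycle(["10.0.0.2", "10.0.0.3"])

def route(sql: str) -> str:
    """Send SELECTs to a slave (round-robin) and everything else to the master."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    return next(SLAVES) if verb == "SELECT" else MASTER
```

Note this naive split is exactly where the consistency caveat bites: a SELECT routed to a slave may not yet see a write the application just sent to the master.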

Page 13: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Scaling Writes: Master/Master

o Add additional 'master'(s) to your 'master' database server

o Pros: adds write scaling without needing to shard

o Cons: adds write scaling at the cost of read slaves; adding read slaves would add even more latency; application changes are required to ensure data consistency and conflict resolution

Page 14: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Scaling Reads & Writes: Sharding

SHARD 01 (A - K) | SHARD 02 (L - O) | SHARD 03 (P - S) | SHARD 04 (T - Z)

o Partitioning tables across separate database servers

o Pros: adds both write and read scaling

o Cons: loses the RDBMS's ability to manage transactionality, referential integrity, and ACID; ACID compliance and transactionality must be managed at the application level; consistent backups across all the shards are very hard to manage; reads and writes can be skewed/unbalanced; application changes can be significant
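The alphabetic split above amounts to a range-based router that the application must implement itself. A minimal sketch, with hypothetical shard names:

```python
# Range-based shard router matching the A-K / L-O / P-S / T-Z split above.
SHARD_RANGES = [
    ("A", "K", "shard01"),
    ("L", "O", "shard02"),
    ("P", "S", "shard03"),
    ("T", "Z", "shard04"),
]

def shard_for(key: str) -> str:
    """Route a key to the shard owning its first letter's range."""
    first = key[:1].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard owns key {key!r}")
```

This also illustrates the skew problem: nothing guarantees that A-K holds the same volume or traffic as T-Z.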

Page 15: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Scaling Reads & Writes: MySQL Cluster

o Provides shared-nothing clustering and auto-sharding for MySQL (designed for telco deployments: minimal cross-node transactions, HA emphasis)

o Pros: distributed, multi-master model; provides high availability and high throughput

o Cons: only supports read-committed isolation; long-running transactions can block a node restart; SBR replication is not supported; range scans are expensive and perform worse than in MySQL; unclear how it scales with many nodes

Page 16: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Application Workload Partitioning

o Partition the entire application + RDBMS stack across several "pods"

o Pros: adds both write and read scaling; flexible: can keep scaling with the addition of pods

o Cons: no data consistency across pods (only suited for cases where it is not needed); high overhead in DBMS maintenance and upgrades; queries/reports across all pods can be very complex; complex environment to set up and support

[Diagram: two pods, each with its own set of app servers and database stack]

Page 17: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


SQL SCALE-OUT

Resiliency

Capacity

Elasticity

Page 18: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


SQL SCALE-OUT

Resiliency

Capacity

Elasticity

Ease of ADDING and REMOVING resources

Flex Up or Down Capacity On-Demand

Adapt Resources to Price-Performance Requirements

Page 19: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Elasticity – flexing up and down

Scaling Option             Flex UP                  Flex DOWN
Application (only)         Easy                     Easy
NoSQL databases            Easy                     Unclear if it is possible
Scale-up                   Expensive                Not applicable
Master – Slave             Reasonably simple        Turn off read slaves
Master – Master            Involved                 Involved
Sharding                   Expensive and complex    Not feasible
MySQL Cluster              Involved                 Involved
Application Partitioning   Expensive and complex    Expensive and complex

Page 20: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


SQL SCALE-OUT

Resiliency

Resilience to Failures Hardware or Software

Fault Tolerance and High Availability

Capacity

Elasticity

Page 21: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Resiliency – high availability and fault tolerance

Scaling Option             Resilience to failures
Application (only)         No single point of failure – a failed node is bypassed
NoSQL databases            Support exists
Scale-up                   One large machine – single point of failure
Master – Slave             Fail-over to the slave
Master – Master            Resilient to one of the masters failing
Sharding                   Multiple points of failure
MySQL Cluster              No single point of failure
Application Partitioning   Multiple points of failure

Page 22: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

RDBMS Capacity, Elasticity and Resiliency

RDBMS Scaling    Capacity                      Resiliency                   Elasticity   Application Impact
Scale-up         Many cores – very expensive   Single point of failure      No           None
Master – Slave   Reads only                    Fail-over                    No           Yes – for read scale
Master – Master  Read / write                  Yes                          No           High – update conflicts
MySQL Cluster    Read / write                  Yes                          No           None (or minor)
Sharding         Unbalanced reads/writes       Multiple points of failure   No           Very high

Page 23: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

CLUSTRIXDB: A FULLY ACID-COMPLIANT, MYSQL-COMPATIBLE RDBMS, ARCHITECTED FROM THE GROUND UP TO ADDRESS:

CAPACITY, ELASTICITY AND RESILIENCY.

Page 24: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB – Shared-Nothing Symmetric Architecture

Each node contains:

o Database Engine: all nodes can perform all database operations (no leader, aggregator, leaf, data-only, or special nodes)

o Query Compiler: distributes compiled partial query fragments to the node containing the ranking replica

o Data (table slices): all table slices are auto-redistributed by the Rebalancer (default: replicas = 2)

o Data Map: all nodes know where all replicas are

[Diagram: three identical ClustrixDB nodes, each containing Compiler, Map, Engine, and Data]

Page 25: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Intelligent Data Distribution

o Tables are auto-split into slices

o Every slice has a replica on another server: auto-distributed and auto-protected

[Diagram: database tables holding billions of rows, split into slices S1–S5 with replicas spread across ClustrixDB nodes]
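The "replica on another server" invariant can be sketched as a placement routine: each slice gets a primary and a replica on two distinct nodes. This round-robin policy is a simplification of my own; the real Rebalancer also balances load and storage.

```python
# Sketch: place each slice and its replica so the two copies never share a node.
def place(slices: list[str], nodes: list[str]) -> dict[str, tuple[str, str]]:
    assert len(nodes) >= 2, "a replica needs a second node"
    placement = {}
    for i, s in enumerate(slices):
        primary = nodes[i % len(nodes)]
        replica = nodes[(i + 1) % len(nodes)]  # always a different node
        placement[s] = (primary, replica)
    return placement
```

With this invariant, losing any single node leaves at least one copy of every slice alive, which is what makes the fault-tolerance slide later possible.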

Page 26: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Database Capacity and Elasticity

o Easy and simple flex up (and flex down); flex multiple nodes at the same time

o Data is automatically rebalanced across the cluster

o All servers handle writes + reads

o The application always sees a single database instance

[Diagram: slices S1–S5 automatically rebalanced across ClustrixDB nodes as a node is added]

Page 27: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Built-in Fault Tolerance

o No single point of failure: no data loss, no downtime

o When a server node goes down, data is automatically rebalanced across the remaining nodes

[Diagram: after a node failure, slices S1–S5 are re-protected across the remaining ClustrixDB nodes]

Page 28: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Distributed Query Processing

o Queries are fielded by any peer node and routed to the node holding the data

o Complex queries are split into fragments processed in parallel, automatically distributed for optimized performance

[Diagram: load balancer distributing transactions across ClustrixDB peer nodes]

Page 29: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Replication and Disaster Recovery

o Asynchronous multi-point replication

o Parallel backup, up to 10x faster

o Replicate to any cloud, any datacenter, anywhere

Page 30: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

CLUSTRIXDB UNDER THE HOOD

o DISTRIBUTION STRATEGY
o REBALANCER TASKS
o QUERY OPTIMIZER
o EVALUATION MODEL
o CONCURRENCY CONTROL

Page 31: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB Key Components Enabling Scale-Out

o Shared-nothing architecture: eliminates potential bottlenecks

o Independent index distribution: each distribution key is hashed to a 64-bit number space, divided into ranges with a specific slice owning each range

o Rebalancer: ensures optimal data distribution across all nodes; assigns slices to available nodes to balance data capacity and access

o Query Optimizer: distributed query planner, compiler, and distributed shared-nothing execution engine; executes queries with maximum parallelism, and many simultaneous queries concurrently

o Evaluation Model: parallelizes queries, which are distributed to the node(s) with the relevant data

o Consistency and Concurrency Control: uses Multi-Version Concurrency Control (MVCC) and two-phase locking (2PL)
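The hash-to-range distribution described above can be sketched in a few lines. The hash function and the equal-width ranges are my assumptions for illustration; ClustrixDB's actual hashing and range boundaries are internal.

```python
import hashlib

def slice_for_key(key: bytes, num_slices: int) -> int:
    """Hash a distribution key into a 64-bit number space split into
    equal-width ranges, each owned by one slice (widths assumed equal)."""
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")  # 64-bit value
    range_width = 2**64 // num_slices
    return min(h // range_width, num_slices - 1)
```

Because every node can evaluate this function against the shared data map, any node can route a key lookup directly to the owning slice without a broadcast.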

Page 32: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Rebalancer Process

o User tables are vertically partitioned into representations

o Representations are horizontally partitioned into slices

o The Rebalancer ensures that: each representation has an appropriate number of slices; slices are well distributed around the cluster on storage devices; slices are not placed on server(s) that are being flexed down; reads from each representation are balanced across the nodes

Page 33: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB Rebalancer Tasks

o Flex-UP: redistribute replicas to new nodes

o Flex-DOWN: move replicas from the flexed-down nodes to other nodes in the cluster

o Under-protection (a slice has fewer replicas than desired): create a new copy of the slice on a different node

o Slice too big: split the slice into several new slices and redistribute them
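The four Rebalancer cases above can be sketched as a decision function over a slice's state. The thresholds and field names are hypothetical; they only illustrate how the cases relate, not ClustrixDB's actual policy.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only.
DESIRED_REPLICAS = 2
MAX_SLICE_BYTES = 8 * 2**30  # 8 GiB

@dataclass
class Slice:
    replicas: int
    size_bytes: int
    on_flexing_down_node: bool

def rebalancer_task(s: Slice) -> str:
    """Pick the rebalancer task a slice needs, mirroring the cases above."""
    if s.replicas < DESIRED_REPLICAS:
        return "reprotect"  # under-protection: copy the slice to another node
    if s.on_flexing_down_node:
        return "move"       # flex-down: evacuate the departing node
    if s.size_bytes > MAX_SLICE_BYTES:
        return "split"      # slice too big: split and redistribute
    return "none"
```

Ordering matters in a sketch like this: re-protecting an under-replicated slice is the most urgent case, so it is checked first.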

Page 34: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB Query Optimizer

o Modeled on the Cascades optimization framework; other RDBMSs that leverage Cascades include Tandem's NonStop SQL and Microsoft's SQL Server

o Cost-driven; extensible via a rule-based mechanism; top-down approach

o The Query Optimizer must answer the following for each SQL query: In what order should the tables be joined? Which indexes should be used? Should the sort/aggregate be non-blocking?

Page 35: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB Evaluation Model

o Parallel query evaluation

o Massively Parallel Processing (MPP) for analytic queries

o The Fair Scheduler ensures OLTP is prioritized ahead of OLAP

o Queries are broken into fragments (functions)

o Joins by nature require more data movement, but ClustrixDB achieves minimal movement: each representation (table or index) has its own distribution map, allowing direct look-ups of which node/slice to go to next and removing broadcasts

o There is no central node orchestrating data motion; data moves directly to the next node it needs to go to, reducing hops to the minimum possible given the data distribution

Compilation example: SELECT id, amount FROM donation WHERE id = 15 compiles into fragments:
  Fragment 1 (on the receiving node): node := lookup(id = 15); <forward to node>
  Fragment 2 (on the owning node): SELECT id, amount; <return>
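The two-fragment plan above can be walked through as a toy simulation. The distribution map and row data here are hypothetical stand-ins for the deck's example.

```python
# Toy walk-through of the two-fragment plan for:
#   SELECT id, amount FROM donation WHERE id = 15
dist_map = {range(0, 10): "node1", range(10, 20): "node2"}  # key range -> owning node
node_data = {"node2": {15: {"id": 15, "amount": 250}}}       # rows on each node

def fragment1(key: int) -> str:
    """Runs on the node that received the query: look up the owner, forward."""
    return next(node for r, node in dist_map.items() if key in r)

def fragment2(node: str, key: int) -> dict:
    """Runs on the owning node: read the row and return it to the caller."""
    return node_data[node][key]
```

The point of the split is that fragment 1 needs only the distribution map (present on every node), so a single forward replaces a cluster-wide broadcast.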

Page 36: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Concurrency Control

o Readers never interfere with writers (or vice versa); writers use explicit locking for updates

o MVCC maintains a version of each row as writers modify rows

o Readers get lock-free snapshot isolation, while writers use 2PL to manage conflicts

[Diagram: over time, concurrent readers and writers proceed without blocking; only two writers touching the same row conflict, blocking one writer]

Lock Conflict Matrix

          Reader   Writer
Reader    None     None
Writer    None     Row
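The lock conflict matrix reduces to a single predicate: only two writers touching the same row conflict. A minimal sketch:

```python
# Sketch of the lock conflict matrix above: readers never conflict with anyone;
# only write/write access to the same row takes a row-level lock conflict.
def conflicts(a: tuple[str, int], b: tuple[str, int]) -> bool:
    (mode_a, row_a), (mode_b, row_b) = a, b
    return mode_a == "write" and mode_b == "write" and row_a == row_b
```

Everything else in the matrix is "None" because readers operate on MVCC snapshots rather than acquiring locks.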

Page 37: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


CLUSTRIXDB

DEPLOYMENT EXAMPLES

Page 38: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Example: Huge Write Workload (AWS Deployment)

The Application                    The Database
Inserts   254 million / day        Queries       5-9k per sec
Updates   1.35 million / day       CPU Load      45-65%
Reads     252.3 million / day      Nodes/Cores   10 nodes - 80 cores
Deletes   7,800 / day

Page 39: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Example: Huge Update Workload (Bare-Metal Deployment)

The Application                    The Database
Inserts   31.4 million / day       Queries       35-55k per sec
Updates   3.7 billion / day        CPU Load      25-35%
Reads     1 billion / day          Nodes/Cores   6 nodes - 120 cores
Deletes   4,300 / day

Page 40: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


CLUSTRIXDB

IN DEVELOPMENT

Page 41: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Next Release

o Additional performance improvements: further improvements to read and write scaling

o Deployment and provisioning optimization: cloud templates and deployment scripts; instance testing and validation

o New admin architecture and much-improved Web UI: services-based architecture with a RESTful API; simplified single-click FLEX management; significant graphing and reporting improvements; multi-cluster topology view and management

Page 42: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

New Web UI – Enhanced Dashboard

[Screenshot: dashboard showing 482 tps]

Page 43: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


New Web UI – Historical Workload Comparison

Page 44: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


New Web UI – FLEX Administration

Page 45: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


FINAL THOUGHTS

Page 46: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

ClustrixDB

Capacity: massive read/write scalability; very high concurrency; linear throughput scaling

Elasticity: flex UP in minutes; flex DOWN easily; right-size resources on-demand

Resiliency: automatic, 100% fault tolerance; no single point of failure; battle-tested performance

Flexible Deployment: cloud, VM, or bare metal; virtual images available; point-and-click scale-out

Page 47: Database Architecture & Scaling Strategies, in the Cloud & on the Rack


Thank You.

facebook.com/clustrix

www.clustrix.com

@clustrix

linkedin.com/clustrix

Page 48: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Competitive Cluster Solutions

o Most MySQL clustering solutions leverage Master/Master via replication: MySQL Cluster; Galera (open-source library); Percona XtraDB Cluster (leverages the Galera replication library); Tungsten

o ClustrixDB does NOT use replication to keep all the servers in sync: replication cannot scale writes as highly as our own technology; replication has inherent potential consistency and latency issues; transactional workloads such as OLTP (e.g., e-commerce) are exactly the workloads that replication struggles with the most

Page 49: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

MySQL Cluster

o Provides shared-nothing clustering and auto-sharding for MySQL (designed for telco deployments: minimal cross-node transactions, HA emphasis)

o Pros: distributed, multi-master with no SPOF; designed to provide high availability and high throughput with low latency, while allowing near-linear scalability; synchronous replication with two-phase commit

o Cons: the global checkpoint is 2 sec ("there are no guaranteed durable COMMITs to disk"); only supports read-committed isolation; "MySQL Cluster does not handle large transactions well"; long-running transactions can block a node restart; overflow of data in the replication stream drops a node from the cluster, losing consistency; 'true' HA requires multiple replication lines ("1 is not sufficient" for HA); DELETEs release memory only for the same table, and a full release requires a rolling cluster restart; range scans are expensive and perform worse than in MySQL; no distributed table locks

Page 50: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Galera Cluster

o A multi-master topology using its own replication protocol (designed primarily for high availability, and secondarily for scale)

o Pros: writes to any master are replicated synchronously to the other master(s), ensuring all masters have the same data; open source, with 24/7 support available for $7,950/yr/server (Percona also provides support, at a higher price)

o Cons: write scale is limited: Galera support recommends that writes go to one master rather than be distributed across the nodes, which helps with isolation issues but increases consistency and latency issues across the nodes; snapshot isolation does NOT use first-committer-wins (and so fails the Aphyr Jepsen CAP tests), whereas ClustrixDB does use first-committer-wins for snapshot consistency; writesets are processed as a single memory-resident buffer, so extremely large transactions (e.g., LOAD DATA) may adversely affect node performance; locking is lax with DDL: if your DML transaction uses a table and a parallel DDL statement is started, Galera won't wait for a metadata lock, causing potential consistency issues

Page 51: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Percona XtraDB Cluster

o An active/active high-availability and high-scalability open-source solution for MySQL clustering; it integrates Percona Server and Percona XtraBackup with the Galera replication library

o Pros: synchronous replication; multi-master replication support; parallel replication; automatic node provisioning

o Cons: not designed for write scaling; SELECT FOR UPDATE can easily create deadlocks; not truly synchronous replication but 'virtually synchronous': data is committed on the originating node and an ack is sent to the application, but the other nodes commit asynchronously, which can lead to consistency issues for applications reading from the other nodes; "If multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node can still sit in the receive queue of the node the application is reading from, waiting to be applied."

Page 52: Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Tungsten Replicator

o An open-source replication engine compatible with MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data-warehouse stores such as Vertica, InfiniDB, and Hadoop

o Pros: allows data to be exchanged between different databases and different database versions; during replication, information can be filtered and modified, and deployment can be between on-premise or cloud-based databases; for performance, includes support for parallel replication and advanced topologies such as fan-in, star, and multi-master, and can be used efficiently in cross-site deployments

o Cons: very complicated to set up and maintain; no automated management, automated failover, transparent connections, or built-in conflict resolution; only allows asynchronous replication; cannot suppress slave-side triggers, so each trigger must be altered to add an IF statement that prevents it from running on the slave