Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Transcript of Database Architecture & Scaling Strategies, in the Cloud & on the Rack
© 2015 CLUSTRIX
Database Scaling Strategies, in the Cloud & on the Rack
Robbie Mihalyi, @Clustrix
ClustrixDB Overview2
SQL SCALE-OUT
Resiliency
Capacity
Elasticity

Cloud
o Commoditized hardware resources
  Rapid deployment; pay by the hour
o Access: publish your applications quickly; use existing services from the provider
o Capacity: scale resources as you need them
Utility Computing (bare metal) | Platform as a Service (PaaS) | SaaS
o Virtualized (Shared) Resources: you do not always get the performance envelope you ask for
o Dedicated (Hardware) Resources: available but expensive; less flexible
E-Commerce Applications
Example of a Great Match for Cloud
o Need for capacity varies by seasonality and specific events: some events can generate 10x normal traffic and increased conversion rates
o Sensitive to performance characteristics: throughput and latency
o Uptime is most crucial at the busiest times: every minute of downtime can mean thousands of dollars in lost revenue
SQL SCALE-OUT
Resiliency
Capacity
Elasticity
SCALE: Data, Users, Sessions
THROUGHPUT: Concurrency, Transactions
LATENCY: Response Time
Application Scaling (App Layer Only)
Easy Installation and Setup
o Load balancer: HAProxy or equivalent; distributes incoming requests
o Scale out by adding servers: all servers are the same – no master
o Redundant backend network: low-latency cluster intercommunication
[Diagram: load balancer distributing requests across commodity APP servers]
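The distribution step above can be sketched in a few lines. This is a minimal, illustrative round-robin dispatcher (the default strategy in balancers like HAProxy), not the actual load-balancer implementation; the server names are made up.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy sketch of round-robin request distribution across
    identical app servers -- any server can handle any request."""

    def __init__(self, servers):
        self._pool = cycle(servers)  # endless rotation over the server list

    def route(self, request):
        # Each incoming request simply goes to the next server in turn.
        return next(self._pool)

lb = RoundRobinBalancer(["app1", "app2", "app3"])
print([lb.route(f"req{i}") for i in range(4)])  # ['app1', 'app2', 'app3', 'app1']
```

Because every app server is the same (no master), scaling out is just appending another name to the list.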
Application Scaling (Database Layer)
Database Scaling Is Very Hard
o Data Consistency
o Read vs. Write Scale
o ACID Properties (if you care about it)
o Throughput and Latency
o Application Impact
Non-Relational (NoSQL) Database Architectures
o No imposed structure
o Relaxed or no ACID properties: BASE as an alternative to ACID
o Fast and scalable
o Suited for specific applications: IoT, click-stream, object store, document
  Good for insert workloads; not good for read/query apps
o An RDBMS can also provide a fast non-structured data store
RDBMS SCALING
Scaling-Up
o Keep increasing the size of the (single) database server
o Pros
  Simple; no application changes needed
o Cons
  Expensive: at some point, you're paying 5x for 2x the performance
  'Exotic' hardware (128 cores and above) becomes price-prohibitive
  Eventually you 'hit the wall' and literally cannot scale up anymore
Scaling Reads: Master/Slave
o Add 'Slave' read server(s) to your 'Master' database server
o Pros
  Reasonably simple to implement; read/write fan-out can be done at the proxy level
o Cons
  Only adds read performance
  Data consistency issues can occur, especially if the application isn't coded to ensure reads from the slave are consistent with reads from the master
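The proxy-level read/write fan-out mentioned above can be sketched as follows. This is a hypothetical router, not any particular proxy product; the server names and verb list are assumptions for illustration. Note that it does nothing about the consistency caveat: a read routed to a slave may still lag the master.

```python
import random

class ReadWriteRouter:
    """Toy proxy-level fan-out for a master/slave setup:
    writes go to the master, reads are spread across slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master             # all writes hit the master
        return random.choice(self.slaves)  # reads fan out to the slaves

router = ReadWriteRouter("master", ["slave1", "slave2"])
print(router.route("UPDATE orders SET total = 10"))  # master
```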
Scaling Writes: Master/Master
o Add additional 'Master'(s) to your 'Master' database server
o Pros
  Adds write scaling without needing to shard
o Cons
  Adds write scaling at the cost of read slaves
  Adding read slaves would add even more latency
  Application changes are required to ensure data consistency / conflict resolution
Scaling Reads & Writes: Sharding
[Diagram: four shards covering key ranges A–K, L–O, P–S, T–Z]
o Partitioning tables across separate database servers
o Pros
  Adds both write and read scaling
o Cons
  Loses the RDBMS's ability to manage transactionality, referential integrity, and ACID
  ACID compliance and transactionality must be managed at the application level
  Consistent backups across all the shards are very hard to manage
  Reads and writes can be skewed / unbalanced
  Application changes can be significant
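A minimal sketch of the range-based routing in the diagram, assuming four shards split on the leading letter of the key (the shard names are made up). This is the piece the application itself must own once an RDBMS is sharded:

```python
import bisect

# Hypothetical range sharding matching the A-K / L-O / P-S / T-Z split.
SHARD_UPPER_BOUNDS = ["K", "O", "S", "Z"]   # inclusive upper bound per shard
SHARDS = ["shard01", "shard02", "shard03", "shard04"]

def shard_for(key: str) -> str:
    """Pick the shard whose letter range covers the key's first character."""
    first = key[0].upper()
    return SHARDS[bisect.bisect_left(SHARD_UPPER_BOUNDS, first)]

print(shard_for("Alice"))   # shard01 (A-K)
print(shard_for("Miller"))  # shard02 (L-O)
```

Anything touching rows on two shards (joins, transactions, backups) now spans two independent databases, which is exactly where the cons above come from.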
Scaling Reads & Writes: MySQL Cluster
o Provides shared-nothing clustering and auto-sharding for MySQL (designed for Telco deployments: minimal cross-node transactions, HA emphasis)
o Pros
  Distributed, multi-master model
  Provides high availability and high throughput
o Cons
  Only supports read-committed isolation
  Long-running transactions can block a node restart
  SBR (statement-based) replication is not supported
  Range scans are expensive and perform worse than MySQL
  Unclear how it scales with many nodes
Application Workload Partitioning
o Partition entire application + RDBMS stack across several “pods”
o Pros
  Adds both write and read scaling
  Flexible: can keep scaling by adding pods
o Cons
  No data consistency across pods (only suited for cases where it is not needed)
  High overhead in DBMS maintenance and upgrades
  Queries / reports across all pods can be very complex
  Complex environment to set up and support
[Diagram: multiple pods, each with its own APP servers and database]
SQL SCALE-OUT
Resiliency
Capacity
Elasticity
Ease of ADDING and REMOVING resources
Flex Up or Down Capacity On-Demand
Adapt Resources to Price-Performance Requirements
Elasticity – flexing up and down

Scaling Option           | Flex UP               | Flex DOWN
Application (only)       | Easy                  | Easy
NoSQL databases          | Easy                  | Unclear if it is possible
Scale-up                 | Expensive             | Not applicable
Master – Slave           | Reasonably simple     | Turn off read slaves
Master – Master          | Involved              | Involved
Sharding                 | Expensive and complex | Not feasible
MySQL Cluster            | Involved              | Involved
Application Partitioning | Expensive and complex | Expensive and complex
SQL SCALE-OUT
Resiliency
Resilience to Failures – Hardware or Software
Fault Tolerance and High Availability
Capacity
Elasticity
Resiliency – high availability and fault tolerance

Scaling Option           | Resilience to failures
Application (only)       | No single point of failure – failed node bypassed
NoSQL databases          | Support exists
Scale-up                 | One large machine – single point of failure
Master – Slave           | Fail-over to slave
Master – Master          | Resilient to one of the masters failing
Sharding                 | Multiple points of failure
MySQL Cluster            | No single point of failure
Application Partitioning | Multiple points of failure
RDBMS Capacity, Elasticity and Resiliency
RDBMS Scaling   | Capacity                     | Resiliency                 | Elasticity | Application Impact
Scale-up        | Many cores – very expensive  | Single point of failure    | No         | None
Master – Slave  | Reads only                   | Fail-over                  | No         | Yes – for read scale
Master – Master | Read / Write                 | Yes                        | No         | High – update conflicts
MySQL Cluster   | Read / Write                 | Yes                        | No         | None (or minor)
Sharding        | Unbalanced reads/writes      | Multiple points of failure | No         | Very High
CLUSTRIXDB: A FULL ACID-COMPLIANT, MYSQL-COMPATIBLE RDBMS, ARCHITECTED FROM THE GROUND UP TO ADDRESS CAPACITY, ELASTICITY, AND RESILIENCY.
ClustrixDB – Shared-Nothing Symmetric Architecture
Each Node Contains
o Database Engine: all nodes can perform all database operations (no leader, aggregator, leaf, data-only, or special nodes)
o Query Compiler: distributes compiled partial query fragments to the node containing the ranking replica
o Data – Table Slices: all table slices are auto-redistributed by the Rebalancer (default: replicas=2)
o Data Map: all nodes know where all replicas are
[Diagram: ClustrixDB nodes, each containing Compiler, Map, Engine, and Data]
Intelligent Data Distribution
o Database tables hold billions of rows
o Tables are auto-split into slices
o Every slice has a replica on another server
  Auto-distributed and auto-protected
[Diagram: table slices S1–S5 and their replicas spread across ClustrixDB nodes]
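The placement rule above (every slice gets a replica on a *different* node) can be sketched like this. It is an illustrative toy, not Clustrix's actual Rebalancer logic; the slice and node names are made up.

```python
def place_slices(slices, nodes, replicas=2):
    """Assign each slice `replicas` copies, each on a distinct node,
    so losing any single node never loses a slice entirely."""
    placement = {}
    for i, s in enumerate(slices):
        # Spread primaries round-robin; each extra replica goes one node over.
        placement[s] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

layout = place_slices(["S1", "S2", "S3", "S4", "S5"], ["node1", "node2", "node3"])
# Every slice lives on two distinct nodes, e.g. S1 -> ['node1', 'node2']
```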
Database Capacity and Elasticity
o Easy and simple Flex Up (and Flex Down); flex multiple nodes at the same time
o Data is automatically rebalanced across the cluster
o All servers handle writes and reads
o The application always sees a single database instance
[Diagram: slices S1–S5 rebalanced as nodes are added to the cluster]
Built-in Fault Tolerance
o No single point of failure: no data loss, no downtime
o When a server node goes down, data is automatically rebalanced across the remaining nodes
[Diagram: slices S1–S5 redistributed across the surviving ClustrixDB nodes]
Distributed Query Processing
o Queries are fielded by any peer node and routed to the node holding the data
o Complex queries are split into fragments processed in parallel, automatically distributed for optimized performance
[Diagram: load balancer sending transactions to any ClustrixDB node]
Replication and Disaster Recovery
o Asynchronous multi-point replication
o Parallel backup, up to 10x faster
o Replicate to any cloud, any datacenter, anywhere
CLUSTRIXDB
UNDER THE HOOD
o DISTRIBUTION STRATEGY
o REBALANCER TASKS
o QUERY OPTIMIZER
o EVALUATION MODEL
o CONCURRENCY CONTROL
ClustrixDB key components enabling Scale-Out
o Shared-nothing architecture: eliminates potential bottlenecks
o Independent Index Distribution: each distribution key is hashed into a 64-bit number space divided into ranges, with a specific slice owning each range
o Rebalancer: ensures optimal data distribution across all nodes; assigns slices to available nodes to balance data capacity and access
o Query Optimizer: distributed query planner, compiler, and shared-nothing execution engine; executes queries with maximum parallelism and many simultaneous queries concurrently
o Evaluation Model: parallelizes queries, which are distributed to the node(s) with the relevant data
o Consistency and Concurrency Control: uses Multi-Version Concurrency Control (MVCC) and Two-Phase Locking (2PL)
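The Independent Index Distribution bullet can be made concrete with a small sketch: hash each distribution key into a 64-bit space pre-divided into equal ranges, each owned by one slice. The hash function and slice count here are illustrative assumptions, not Clustrix internals.

```python
import hashlib

NUM_SLICES = 4
RANGE_SIZE = 2**64 // NUM_SLICES  # equal-width ranges over the 64-bit space

def owning_slice(distribution_key: str) -> int:
    """Hash the key to a 64-bit number and map it to the slice
    owning the range that number falls in."""
    h = int.from_bytes(hashlib.sha1(distribution_key.encode()).digest()[:8], "big")
    return min(h // RANGE_SIZE, NUM_SLICES - 1)  # clamp the very top of the space

# Every node can compute this locally, so any node can route a lookup
# straight to the owning slice -- no broadcast required.
```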
Rebalancer Process
o User tables are vertically partitioned into representations
o Representations are horizontally partitioned into slices
o The Rebalancer ensures:
  Each representation has an appropriate number of slices
  Slices are well distributed around the cluster on storage devices
  Slices are not placed on server(s) that are being flexed down
  Reads from each representation are balanced across the nodes
ClustrixDB Rebalancer Tasks
o Flex-UP: redistribute replicas to new nodes
o Flex-DOWN: move replicas from the flex-down nodes to other nodes in the cluster
o Under-Protection (a slice has fewer replicas than desired): create a new copy of the slice on a different node
o Slice Too Big: split the slice into several new slices and redistribute them
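Two of the tasks above, re-protecting an under-protected slice and splitting an oversized one, can be modeled in a few lines. This is a toy model under assumed thresholds (the 2-replica target and size limit are illustrative), not the real Rebalancer.

```python
DESIRED_REPLICAS = 2
MAX_SLICE_BYTES = 8 * 2**30  # illustrative split threshold

def fix_under_protection(slice_nodes, all_nodes):
    """If a slice has fewer replicas than desired, copy it to node(s)
    that do not already hold a replica of it."""
    missing = DESIRED_REPLICAS - len(slice_nodes)
    spares = [n for n in all_nodes if n not in slice_nodes]
    return slice_nodes + spares[:missing]

def maybe_split(slice_name, size_bytes):
    """Split a too-big slice into two new slices for redistribution."""
    if size_bytes <= MAX_SLICE_BYTES:
        return [slice_name]
    return [slice_name + "a", slice_name + "b"]
```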
ClustrixDB Query Optimizer
o The ClustrixDB Query Optimizer is modeled on the Cascades optimization framework
  Other RDBMSs leveraging Cascades include Tandem's NonStop SQL and Microsoft SQL Server
  Cost-driven; extensible via a rule-based mechanism; top-down approach
o The Query Optimizer must answer the following for each SQL query:
  In what order should the tables be joined?
  Which indexes should be used?
  Should the sort/aggregate be non-blocking?
ClustrixDB Evaluation Model
o Parallel query evaluation
o Massively Parallel Processing (MPP) for analytic queries
o The Fair Scheduler ensures OLTP is prioritized ahead of OLAP
o Queries are broken into fragments (functions).
o Joins by nature require more data movement, but ClustrixDB achieves minimal data movement:
  Each representation (table or index) has its own distribution map, allowing direct look-ups of which node/slice to go to next, removing broadcasts
  There is no central node orchestrating data motion; data moves directly to the next node it needs to reach, reducing hops to the minimum possible given the data distribution
[Diagram: SELECT id, amount FROM donation WHERE id=15 is compiled into fragments; fragment 1 looks up the node owning id=15 and forwards, fragment 2 reads the row on that node and returns]
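The fragment flow in the diagram can be sketched as two tiny functions: fragment 1 consults the local distribution map and forwards directly to the owning node (no broadcast), and fragment 2 reads the row there. The map and row data below are invented for illustration.

```python
# Which node owns which id range, known locally on every node.
DISTRIBUTION_MAP = {range(0, 100): "node2"}
# Pretend storage on the owning node.
NODE_DATA = {"node2": {15: ("donation-15", 250)}}

def fragment1_route(row_id):
    """Fragment 1: look up the owner of row_id and forward -- a direct hop."""
    for id_range, node in DISTRIBUTION_MAP.items():
        if row_id in id_range:
            return node

def fragment2_read(node, row_id):
    """Fragment 2: runs on the owning node, reads the row, returns it."""
    return NODE_DATA[node][row_id]

owner = fragment1_route(15)
print(fragment2_read(owner, 15))  # ('donation-15', 250)
```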
Concurrency Control
[Diagram: over time, readers and writers proceed concurrently without blocking; two writers hitting the same row conflict, and one writer is blocked]
o Readers never interfere with writers (or vice-versa). Writers use explicit locking for updates
o MVCC maintains a version of each row as writers modify rows
o Readers have lock-free snapshot isolation while writers use 2PL to manage conflict
Lock Conflict Matrix

       | Reader | Writer
Reader | None   | None
Writer | None   | Row
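The conflict matrix can be illustrated with a toy MVCC row: readers pick a version from a lock-free snapshot and never block, while writers take a row lock (the 2PL side), so only writer-writer on the same row conflicts. This is a didactic sketch, not ClustrixDB's engine.

```python
class Row:
    """Toy MVCC row: versioned values plus a single writer lock."""

    def __init__(self, value):
        self.versions = [(0, value)]  # list of (commit_ts, value)
        self.lock_holder = None       # current writing transaction, if any

    def read(self, snapshot_ts):
        # Lock-free snapshot read: newest version visible at snapshot_ts.
        return max(v for v in self.versions if v[0] <= snapshot_ts)[1]

    def write(self, txn, value, commit_ts):
        # Writer-writer conflict on the same row: second writer is blocked.
        if self.lock_holder not in (None, txn):
            raise RuntimeError("row lock conflict: writer blocked")
        self.lock_holder = txn
        self.versions.append((commit_ts, value))

row = Row("v0")
row.write("T1", "v1", commit_ts=2)
print(row.read(snapshot_ts=1))  # v0 -- the reader is unaffected by the writer
```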
CLUSTRIXDB
DEPLOYMENT EXAMPLES
Example: Huge Write Workload (AWS Deployment)
The Application
  Inserts: 254 million / day
  Updates: 1.35 million / day
  Reads: 252.3 million / day
  Deletes: 7,800 / day
The Database
  Queries: 5–9k per sec
  CPU load: 45–65%
  Nodes / cores: 10 nodes – 80 cores
Example: Huge Update Workload (Bare-Metal Deployment)
The Application
  Inserts: 31.4 million / day
  Updates: 3.7 billion / day
  Reads: 1 billion / day
  Deletes: 4,300 / day
The Database
  Queries: 35–55k per sec
  CPU load: 25–35%
  Nodes / cores: 6 nodes – 120 cores
CLUSTRIXDB
IN DEVELOPMENT
Next Release
o Additional Performance Improvements
  Further improvements to read and write scaling
o Deployment and Provisioning Optimization
  Cloud templates and deployment scripts
  Instance testing and validation
o New Admin architecture and much-improved Web UI
  Services-based architecture with (RESTful) API
  Simplified single-click FLEX management
  Significant graphing and reporting improvements
  Multi-cluster topology view and management
New Web UI – Enhanced Dashboard
[Screenshot: dashboard showing 482 tps]
New Web UI – Historical Workload Comparison
New Web UI – FLEX Administration
FINAL THOUGHTS
Capacity
  Massive read/write scalability
  Very high concurrency
  Linear throughput scale
Elasticity
  Flex UP in minutes
  Flex DOWN easily
  Right-size resources on-demand
Resiliency
  Automatic, 100% fault tolerance
  No single point of failure
  Battle-tested performance
Flexible Deployment
  Cloud, VM, or bare metal
  Virtual images available
  Point/click scale-out
ClustrixDB
Thank You.
facebook.com/clustrix
www.clustrix.com
@clustrix
linkedin.com/clustrix
Competitive Cluster Solutions
o Most MySQL clustering solutions leverage Master/Master via replication:
  MySQL Cluster
  Galera (open-source library)
  Percona XtraDB Cluster (leverages the Galera replication library)
  Tungsten
o ClustrixDB does NOT use replication to keep all the servers in sync
  Replication cannot scale writes as highly as our own technology
  Replication has inherent potential consistency and latency issues
  Transactional workloads such as OLTP (e.g. e-commerce) are exactly the workloads that replication struggles with most
MySQL Cluster
o Provides shared-nothing clustering and auto-sharding for MySQL (designed for Telco deployments: minimal cross-node transactions, HA emphasis)
o Pros:
  Distributed, multi-master with no SPOF
  Designed to provide high availability and high throughput with low latency, while allowing for near-linear scalability
  Synchronous replication, 2-phase commit
o Cons:
  Global checkpoint is 2 sec: "There are no guaranteed durable COMMITs to disk"
  Only supports read_committed isolation
  "MySQL Cluster does not handle large transactions well"
  Long-running transactions can block a node restart
  Overflow of data in the replication stream drops a node from the cluster, with consistency loss
  'True' HA requires multiple replication lines; "1 is not sufficient" for HA
  DELETEs release memory only for the same table; full release requires a rolling cluster restart
  Range scans are expensive and perform worse than MySQL
  No distributed table locks
Galera Cluster
o A multi-master topology using its own replication protocol (designed primarily for high availability, and secondarily for scale)
o Pros:
  Writes to any master are replicated to the other master(s) synchronously, ensuring all masters have the same data
  It is open source; 24/7 support can be purchased for $7,950/yr/server. Percona also provides support, at a higher price
o Cons:
  Write scale is limited. Galera support recommends that writes go to one master rather than be distributed across the nodes; that helps with isolation issues but increases consistency and latency issues across the nodes
  Snapshot isolation does NOT use first-committer-wins (and so fails Aphyr's Jepsen CAP tests); ClustrixDB does use first-committer-wins for snapshot consistency
  Writesets are processed as a single memory-resident buffer; as a result, extremely large transactions (e.g. LOAD DATA) may adversely affect node performance
  Locking is lax with DDL. E.g., if your DML transaction uses a table and a parallel DDL statement is started, Galera won't wait for a metadata lock, causing potential consistency issues
Percona XtraDB Cluster
o An active/active high-availability and high-scalability open-source solution for MySQL clustering. It integrates Percona Server and Percona XtraBackup with the Galera replication library
o Pros:
  Synchronous replication
  Multi-master replication support
  Parallel replication
  Automatic node provisioning
o Cons:
  Not designed for write scaling
  SELECT FOR UPDATE can easily create deadlocks
  Not true synchronous replication, but 'virtually synchronous': the data is committed on the originating node and an ack is sent to the application, but the other nodes commit asynchronously. This can lead to consistency issues for applications reading from the other nodes
  "If multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node, can still sit in the receive queue of the node the application is reading from, waiting to be applied."
Tungsten Replicator
o An open-source replication engine compatible with MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data-warehouse stores such as Vertica, InfiniDB, and Hadoop
o Pros:
  Allows data to be exchanged between different databases and different database versions
  During replication, information can be filtered and modified, and deployment can span on-premise or cloud-based databases
  For performance, Tungsten Replicator includes support for parallel replication and advanced topologies such as fan-in, star, and multi-master, and can be used efficiently in cross-site deployments
o Cons:
  Very complicated to set up and maintain
  No automated management, automated failover, transparent connections, or built-in conflict resolution
  Only allows asynchronous replication
  Cannot suppress slave-side triggers; each trigger must be altered to add an IF statement that prevents it from running on the slave