Database Architecture & Scaling Strategies, in the Cloud & on the Rack
Transcript of Database Architecture & Scaling Strategies, in the Cloud & on the Rack
© 2015 CLUSTRIX
Database Scaling Strategies, in the Cloud & on the Rack
Robbie Mihalyi, @Clustrix
ClustrixDB Overview2
SQL SCALE-OUT
Resiliency
Capacity
Elasticity

Cloud
o Commoditized hardware resources
  Rapid deployment; pay by the hour
o Access: publish your applications quickly; use existing services from the provider
o Capacity: scale resources as you need them
Utility Computing (bare metal) | Platform as a Service (PaaS) | SaaS
o Virtualized (Shared) Resources: you do not always get the performance envelope you ask for
o Dedicated (Hardware) Resources: available but expensive; less flexible
E-Commerce Applications
Example of a Great Match for Cloud
o Need for capacity varies by seasonality and specific events: some events can generate 10x normal traffic and increased conversion rates
o Sensitive to performance characteristics: throughput and latency
o Uptime is most crucial at the busiest times: every minute of downtime can mean thousands of dollars in lost revenue
SQL SCALE-OUT
Resiliency
Capacity
Elasticity
SCALE: Data, Users, Sessions
THROUGHPUT: Concurrency, Transactions
LATENCY: Response Time
Application Scaling (App Layer Only)
Easy Installation and Setup
o Load balancer: HAProxy or equivalent; distributes incoming requests
o Scale out by adding servers: all servers are the same – no master
o Redundant backend network: low-latency cluster intercommunication
[Diagram: load balancer distributing requests across commodity APP servers]
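The distribution step above can be sketched in a few lines. This is a minimal, illustrative round-robin dispatcher (the default strategy in balancers like HAProxy), not the actual load-balancer implementation; the server names are made up.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy sketch of round-robin request distribution across
    identical app servers -- any server can handle any request."""

    def __init__(self, servers):
        self._pool = cycle(servers)  # endless rotation over the server list

    def route(self, request):
        # Each incoming request simply goes to the next server in turn.
        return next(self._pool)

lb = RoundRobinBalancer(["app1", "app2", "app3"])
print([lb.route(f"req{i}") for i in range(4)])  # ['app1', 'app2', 'app3', 'app1']
```

Because every app server is the same (no master), scaling out is just appending another name to the list.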
Application Scaling (Database Layer)
Database Scaling Is Very Hard
o Data Consistency
o Read vs. Write Scale
o ACID Properties (if you care about it)
o Throughput and Latency
o Application Impact
Non-Relational (NoSQL) Database Architectures
o No imposed structure
o Relaxed or no ACID properties: BASE as an alternative to ACID
o Fast and scalable
o Suited for specific applications: IoT, click-stream, object store, document
  Good for insert workloads; not good for read/query apps
o An RDBMS can also provide a fast non-structured data store
RDBMS SCALING
Scaling-Up
o Keep increasing the size of the (single) database server
o Pros
  Simple; no application changes needed
o Cons
  Expensive: at some point, you're paying 5x for 2x the performance
  'Exotic' hardware (128 cores and above) becomes price-prohibitive
  Eventually you 'hit the wall' and literally cannot scale up anymore
Scaling Reads: Master/Slave
o Add 'Slave' read server(s) to your 'Master' database server
o Pros
  Reasonably simple to implement; read/write fan-out can be done at the proxy level
o Cons
  Only adds read performance
  Data consistency issues can occur, especially if the application isn't coded to ensure reads from the slave are consistent with reads from the master
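The proxy-level read/write fan-out mentioned above can be sketched as follows. This is a hypothetical router, not any particular proxy product; the server names and verb list are assumptions for illustration. Note that it does nothing about the consistency caveat: a read routed to a slave may still lag the master.

```python
import random

class ReadWriteRouter:
    """Toy proxy-level fan-out for a master/slave setup:
    writes go to the master, reads are spread across slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master             # all writes hit the master
        return random.choice(self.slaves)  # reads fan out to the slaves

router = ReadWriteRouter("master", ["slave1", "slave2"])
print(router.route("UPDATE orders SET total = 10"))  # master
```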
Scaling Writes: Master/Master
o Add additional 'Master'(s) to your 'Master' database server
o Pros
  Adds write scaling without needing to shard
o Cons
  Adds write scaling at the cost of read slaves
  Adding read slaves would add even more latency
  Application changes are required to ensure data consistency / conflict resolution
Scaling Reads & Writes: Sharding
[Diagram: four shards covering key ranges A–K, L–O, P–S, T–Z]
o Partitioning tables across separate database servers
o Pros
  Adds both write and read scaling
o Cons
  Loses the RDBMS's ability to manage transactionality, referential integrity, and ACID
  ACID compliance and transactionality must be managed at the application level
  Consistent backups across all the shards are very hard to manage
  Reads and writes can be skewed / unbalanced
  Application changes can be significant
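A minimal sketch of the range-based routing in the diagram, assuming four shards split on the leading letter of the key (the shard names are made up). This is the piece the application itself must own once an RDBMS is sharded:

```python
import bisect

# Hypothetical range sharding matching the A-K / L-O / P-S / T-Z split.
SHARD_UPPER_BOUNDS = ["K", "O", "S", "Z"]   # inclusive upper bound per shard
SHARDS = ["shard01", "shard02", "shard03", "shard04"]

def shard_for(key: str) -> str:
    """Pick the shard whose letter range covers the key's first character."""
    first = key[0].upper()
    return SHARDS[bisect.bisect_left(SHARD_UPPER_BOUNDS, first)]

print(shard_for("Alice"))   # shard01 (A-K)
print(shard_for("Miller"))  # shard02 (L-O)
```

Anything touching rows on two shards (joins, transactions, backups) now spans two independent databases, which is exactly where the cons above come from.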
Scaling Reads & Writes: MySQL Cluster
o Provides shared-nothing clustering and auto-sharding for MySQL (designed for Telco deployments: minimal cross-node transactions, HA emphasis)
o Pros
  Distributed, multi-master model
  Provides high availability and high throughput
o Cons
  Only supports read-committed isolation
  Long-running transactions can block a node restart
  SBR (statement-based) replication is not supported
  Range scans are expensive and perform worse than MySQL
  Unclear how it scales with many nodes
Application Workload Partitioning
o Partition entire application + RDBMS stack across several “pods”
o Pros
  Adds both write and read scaling
  Flexible: can keep scaling by adding pods
o Cons
  No data consistency across pods (only suited for cases where it is not needed)
  High overhead in DBMS maintenance and upgrades
  Queries / reports across all pods can be very complex
  Complex environment to set up and support
[Diagram: multiple pods, each with its own APP servers and database]
SQL SCALE-OUT
Resiliency
Capacity
Elasticity
Ease of ADDING and REMOVING resources
Flex Up or Down Capacity On-Demand
Adapt Resources to Price-Performance Requirements
Elasticity – flexing up and down

Scaling Option           | Flex UP               | Flex DOWN
Application (only)       | Easy                  | Easy
NoSQL databases          | Easy                  | Unclear if it is possible
Scale-up                 | Expensive             | Not applicable
Master – Slave           | Reasonably simple     | Turn off read slaves
Master – Master          | Involved              | Involved
Sharding                 | Expensive and complex | Not feasible
MySQL Cluster            | Involved              | Involved
Application Partitioning | Expensive and complex | Expensive and complex
SQL SCALE-OUT
Resiliency
Resilience to Failures – Hardware or Software
Fault Tolerance and High Availability
Capacity
Elasticity
Resiliency – high availability and fault tolerance

Scaling Option           | Resilience to failures
Application (only)       | No single point of failure – failed node bypassed
NoSQL databases          | Support exists
Scale-up                 | One large machine – single point of failure
Master – Slave           | Fail-over to slave
Master – Master          | Resilient to one of the masters failing
Sharding                 | Multiple points of failure
MySQL Cluster            | No single point of failure
Application Partitioning | Multiple points of failure
RDBMS Capacity, Elasticity and Resiliency
RDBMS Scaling   | Capacity                     | Resiliency                 | Elasticity | Application Impact
Scale-up        | Many cores – very expensive  | Single point of failure    | No         | None
Master – Slave  | Reads only                   | Fail-over                  | No         | Yes – for read scale
Master – Master | Read / Write                 | Yes                        | No         | High – update conflicts
MySQL Cluster   | Read / Write                 | Yes                        | No         | None (or minor)
Sharding        | Unbalanced reads/writes      | Multiple points of failure | No         | Very High
CLUSTRIXDB: A FULL ACID-COMPLIANT, MYSQL-COMPATIBLE RDBMS, ARCHITECTED FROM THE GROUND UP TO ADDRESS CAPACITY, ELASTICITY, AND RESILIENCY.
ClustrixDB – Shared-Nothing Symmetric Architecture
Each Node Contains
o Database Engine: all nodes can perform all database operations (no leader, aggregator, leaf, data-only, or special nodes)
o Query Compiler: distributes compiled partial query fragments to the node containing the ranking replica
o Data – Table Slices: all table slices are auto-redistributed by the Rebalancer (default: replicas=2)
o Data Map: all nodes know where all replicas are
[Diagram: ClustrixDB nodes, each containing Compiler, Map, Engine, and Data]
Intelligent Data Distribution
o Database tables hold billions of rows
o Tables are auto-split into slices
o Every slice has a replica on another server
  Auto-distributed and auto-protected
[Diagram: table slices S1–S5 and their replicas spread across ClustrixDB nodes]
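The placement rule above (every slice gets a replica on a *different* node) can be sketched like this. It is an illustrative toy, not Clustrix's actual Rebalancer logic; the slice and node names are made up.

```python
def place_slices(slices, nodes, replicas=2):
    """Assign each slice `replicas` copies, each on a distinct node,
    so losing any single node never loses a slice entirely."""
    placement = {}
    for i, s in enumerate(slices):
        # Spread primaries round-robin; each extra replica goes one node over.
        placement[s] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

layout = place_slices(["S1", "S2", "S3", "S4", "S5"], ["node1", "node2", "node3"])
# Every slice lives on two distinct nodes, e.g. S1 -> ['node1', 'node2']
```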
Database Capacity and Elasticity
o Easy and simple Flex Up (and Flex Down); flex multiple nodes at the same time
o Data is automatically rebalanced across the cluster
o All servers handle writes and reads
o The application always sees a single database instance
[Diagram: slices S1–S5 rebalanced as nodes are added to the cluster]
Built-in Fault Tolerance
o No single point of failure: no data loss, no downtime
o When a server node goes down, data is automatically rebalanced across the remaining nodes
[Diagram: slices S1–S5 redistributed across the surviving ClustrixDB nodes]
Distributed Query Processing
o Queries are fielded by any peer node and routed to the node holding the data
o Complex queries are split into fragments processed in parallel, automatically distributed for optimized performance
[Diagram: load balancer sending transactions to any ClustrixDB node]
Replication and Disaster Recovery
o Asynchronous multi-point replication
o Parallel backup, up to 10x faster
o Replicate to any cloud, any datacenter, anywhere
CLUSTRIXDB
UNDER THE HOOD
o DISTRIBUTION STRATEGY
o REBALANCER TASKS
o QUERY OPTIMIZER
o EVALUATION MODEL
o CONCURRENCY CONTROL
ClustrixDB key components enabling Scale-Out
o Shared-nothing architecture: eliminates potential bottlenecks
o Independent Index Distribution: each distribution key is hashed into a 64-bit number space divided into ranges, with a specific slice owning each range
o Rebalancer: ensures optimal data distribution across all nodes; assigns slices to available nodes to balance data capacity and access
o Query Optimizer: distributed query planner, compiler, and shared-nothing execution engine; executes queries with maximum parallelism and many simultaneous queries concurrently
o Evaluation Model: parallelizes queries, which are distributed to the node(s) with the relevant data
o Consistency and Concurrency Control: uses Multi-Version Concurrency Control (MVCC) and Two-Phase Locking (2PL)
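The Independent Index Distribution bullet can be made concrete with a small sketch: hash each distribution key into a 64-bit space pre-divided into equal ranges, each owned by one slice. The hash function and slice count here are illustrative assumptions, not Clustrix internals.

```python
import hashlib

NUM_SLICES = 4
RANGE_SIZE = 2**64 // NUM_SLICES  # equal-width ranges over the 64-bit space

def owning_slice(distribution_key: str) -> int:
    """Hash the key to a 64-bit number and map it to the slice
    owning the range that number falls in."""
    h = int.from_bytes(hashlib.sha1(distribution_key.encode()).digest()[:8], "big")
    return min(h // RANGE_SIZE, NUM_SLICES - 1)  # clamp the very top of the space

# Every node can compute this locally, so any node can route a lookup
# straight to the owning slice -- no broadcast required.
```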
Rebalancer Process
o User tables are vertically partitioned into representations
o Representations are horizontally partitioned into slices
o The Rebalancer ensures:
  Each representation has an appropriate number of slices
  Slices are well distributed around the cluster on storage devices
  Slices are not placed on server(s) that are being flexed down
  Reads from each representation are balanced across the nodes
ClustrixDB Rebalancer Tasks
o Flex-UP: redistribute replicas to new nodes
o Flex-DOWN: move replicas from the flex-down nodes to other nodes in the cluster
o Under-Protection (a slice has fewer replicas than desired): create a new copy of the slice on a different node
o Slice Too Big: split the slice into several new slices and redistribute them
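Two of the tasks above, re-protecting an under-protected slice and splitting an oversized one, can be modeled in a few lines. This is a toy model under assumed thresholds (the 2-replica target and size limit are illustrative), not the real Rebalancer.

```python
DESIRED_REPLICAS = 2
MAX_SLICE_BYTES = 8 * 2**30  # illustrative split threshold

def fix_under_protection(slice_nodes, all_nodes):
    """If a slice has fewer replicas than desired, copy it to node(s)
    that do not already hold a replica of it."""
    missing = DESIRED_REPLICAS - len(slice_nodes)
    spares = [n for n in all_nodes if n not in slice_nodes]
    return slice_nodes + spares[:missing]

def maybe_split(slice_name, size_bytes):
    """Split a too-big slice into two new slices for redistribution."""
    if size_bytes <= MAX_SLICE_BYTES:
        return [slice_name]
    return [slice_name + "a", slice_name + "b"]
```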
ClustrixDB Query Optimizer
o The ClustrixDB Query Optimizer is modeled on the Cascades optimization framework
  Other RDBMSs leveraging Cascades include Tandem's NonStop SQL and Microsoft SQL Server
  Cost-driven; extensible via a rule-based mechanism; top-down approach
o The Query Optimizer must answer the following for each SQL query:
  In what order should the tables be joined?
  Which indexes should be used?
  Should the sort/aggregate be non-blocking?
ClustrixDB Evaluation Model
o Parallel query evaluation
o Massively Parallel Processing (MPP) for analytic queries
o The Fair Scheduler ensures OLTP is prioritized ahead of OLAP
o Queries are broken into fragments (functions).
o Joins by nature require more data movement, but ClustrixDB achieves minimal data movement:
  Each representation (table or index) has its own distribution map, allowing direct look-ups of which node/slice to go to next, removing broadcasts
  There is no central node orchestrating data motion; data moves directly to the next node it needs to reach, reducing hops to the minimum possible given the data distribution
[Diagram: SELECT id, amount FROM donation WHERE id=15 is compiled into fragments; fragment 1 looks up the node owning id=15 and forwards, fragment 2 reads the row on that node and returns]
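The fragment flow in the diagram can be sketched as two tiny functions: fragment 1 consults the local distribution map and forwards directly to the owning node (no broadcast), and fragment 2 reads the row there. The map and row data below are invented for illustration.

```python
# Which node owns which id range, known locally on every node.
DISTRIBUTION_MAP = {range(0, 100): "node2"}
# Pretend storage on the owning node.
NODE_DATA = {"node2": {15: ("donation-15", 250)}}

def fragment1_route(row_id):
    """Fragment 1: look up the owner of row_id and forward -- a direct hop."""
    for id_range, node in DISTRIBUTION_MAP.items():
        if row_id in id_range:
            return node

def fragment2_read(node, row_id):
    """Fragment 2: runs on the owning node, reads the row, returns it."""
    return NODE_DATA[node][row_id]

owner = fragment1_route(15)
print(fragment2_read(owner, 15))  # ('donation-15', 250)
```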
Concurrency Control
[Diagram: over time, readers and writers proceed concurrently without blocking; two writers hitting the same row conflict, and one writer is blocked]
o Readers never interfere with writers (or vice-versa). Writers use explicit locking for updates
o MVCC maintains a version of each row as writers modify rows
o Readers have lock-free snapshot isolation while writers use 2PL to manage conflict
Lock Conflict Matrix

       | Reader | Writer
Reader | None   | None
Writer | None   | Row
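The conflict matrix can be illustrated with a toy MVCC row: readers pick a version from a lock-free snapshot and never block, while writers take a row lock (the 2PL side), so only writer-writer on the same row conflicts. This is a didactic sketch, not ClustrixDB's engine.

```python
class Row:
    """Toy MVCC row: versioned values plus a single writer lock."""

    def __init__(self, value):
        self.versions = [(0, value)]  # list of (commit_ts, value)
        self.lock_holder = None       # current writing transaction, if any

    def read(self, snapshot_ts):
        # Lock-free snapshot read: newest version visible at snapshot_ts.
        return max(v for v in self.versions if v[0] <= snapshot_ts)[1]

    def write(self, txn, value, commit_ts):
        # Writer-writer conflict on the same row: second writer is blocked.
        if self.lock_holder not in (None, txn):
            raise RuntimeError("row lock conflict: writer blocked")
        self.lock_holder = txn
        self.versions.append((commit_ts, value))

row = Row("v0")
row.write("T1", "v1", commit_ts=2)
print(row.read(snapshot_ts=1))  # v0 -- the reader is unaffected by the writer
```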
CLUSTRIXDB
DEPLOYMENT EXAMPLES
Example: Huge Write Workload (AWS Deployment)
The Application
  Inserts: 254 million / day
  Updates: 1.35 million / day
  Reads: 252.3 million / day
  Deletes: 7,800 / day
The Database
  Queries: 5–9k per sec
  CPU load: 45–65%
  Nodes / cores: 10 nodes – 80 cores
Example: Huge Update Workload (Bare-Metal Deployment)
The Application
  Inserts: 31.4 million / day
  Updates: 3.7 billion / day
  Reads: 1 billion / day
  Deletes: 4,300 / day
The Database
  Queries: 35–55k per sec
  CPU load: 25–35%
  Nodes / cores: 6 nodes – 120 cores
CLUSTRIXDB
IN DEVELOPMENT
Next Release
o Additional Performance Improvements
  Further improvements to read and write scaling
o Deployment and Provisioning Optimization
  Cloud templates and deployment scripts
  Instance testing and validation
o New Admin architecture and much-improved Web UI
  Services-based architecture with (RESTful) API
  Simplified single-click FLEX management
  Significant graphing and reporting improvements
  Multi-cluster topology view and management
New Web UI – Enhanced Dashboard
[Screenshot: dashboard showing 482 tps]
New Web UI – Historical Workload Comparison
New Web UI – FLEX Administration
FINAL THOUGHTS
Capacity
  Massive read/write scalability
  Very high concurrency
  Linear throughput scale
Elasticity
  Flex UP in minutes
  Flex DOWN easily
  Right-size resources on-demand
Resiliency
  Automatic, 100% fault tolerance
  No single point of failure
  Battle-tested performance
Flexible Deployment
  Cloud, VM, or bare metal
  Virtual images available
  Point/click scale-out
ClustrixDB
Thank You.
facebook.com/clustrix
www.clustrix.com
@clustrix
linkedin.com/clustrix
Competitive Cluster Solutions
o Most MySQL clustering solutions leverage Master/Master via replication:
  MySQL Cluster
  Galera (open-source library)
  Percona XtraDB Cluster (leverages the Galera replication library)
  Tungsten
o ClustrixDB does NOT use replication to keep all the servers in sync
  Replication cannot scale writes as highly as our own technology
  Replication has inherent potential consistency and latency issues
  Transactional workloads such as OLTP (e.g. e-commerce) are exactly the workloads that replication struggles with most
MySQL Cluster
o Provides shared-nothing clustering and auto-sharding for MySQL (designed for Telco deployments: minimal cross-node transactions, HA emphasis)
o Pros:
  Distributed, multi-master with no SPOF
  Designed to provide high availability and high throughput with low latency, while allowing for near-linear scalability
  Synchronous replication, 2-phase commit
o Cons:
  Global checkpoint is 2 sec: "There are no guaranteed durable COMMITs to disk"
  Only supports read_committed isolation
  "MySQL Cluster does not handle large transactions well"
  Long-running transactions can block a node restart
  Overflow of data in the replication stream drops a node from the cluster, with consistency loss
  'True' HA requires multiple replication lines; "1 is not sufficient" for HA
  DELETEs release memory only for the same table; full release requires a rolling cluster restart
  Range scans are expensive and perform worse than MySQL
  No distributed table locks
Galera Cluster
o A multi-master topology using its own replication protocol (designed primarily for high availability, and secondarily for scale)
o Pros:
  Writes to any master are replicated to the other master(s) synchronously, ensuring all masters have the same data
  It is open source; 24/7 support can be purchased for $7,950/yr/server. Percona also provides support, at a higher price
o Cons:
  Write scale is limited. Galera support recommends that writes go to one master rather than be distributed across the nodes; that helps with isolation issues but increases consistency and latency issues across the nodes
  Snapshot isolation does NOT use first-committer-wins (and so fails Aphyr's Jepsen CAP tests); ClustrixDB does use first-committer-wins for snapshot consistency
  Writesets are processed as a single memory-resident buffer; as a result, extremely large transactions (e.g. LOAD DATA) may adversely affect node performance
  Locking is lax with DDL. E.g., if your DML transaction uses a table and a parallel DDL statement is started, Galera won't wait for a metadata lock, causing potential consistency issues
Percona XtraDB Cluster
o An active/active high-availability and high-scalability open-source solution for MySQL clustering. It integrates Percona Server and Percona XtraBackup with the Galera replication library
o Pros:
  Synchronous replication
  Multi-master replication support
  Parallel replication
  Automatic node provisioning
o Cons:
  Not designed for write scaling
  SELECT FOR UPDATE can easily create deadlocks
  Not true synchronous replication, but 'virtually synchronous': the data is committed on the originating node and an ack is sent to the application, but the other nodes commit asynchronously. This can lead to consistency issues for applications reading from the other nodes
  "If multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node, can still sit in the receive queue of the node the application is reading from, waiting to be applied."
Tungsten Replicator
o An open-source replication engine compatible with MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data-warehouse stores such as Vertica, InfiniDB, and Hadoop
o Pros:
  Allows data to be exchanged between different databases and different database versions
  During replication, information can be filtered and modified, and deployment can span on-premise or cloud-based databases
  For performance, Tungsten Replicator includes support for parallel replication and advanced topologies such as fan-in, star, and multi-master, and can be used efficiently in cross-site deployments
o Cons:
  Very complicated to set up and maintain
  No automated management, automated failover, transparent connections, or built-in conflict resolution
  Only allows asynchronous replication
  Cannot suppress slave-side triggers; each trigger must be altered to add an IF statement that prevents it from running on the slave