Posted on 20-May-2015
Laine Campbell, Owner/Principal, laine@palominodb.com
Charlie Killian, Director of Engineering, charlie@palominodb.com
Scaling and Performance for Operational Excellence
*
Who we are
● A boutique consultancy offering custom solutions.
● An operations support team providing a combined 100+ years of experience in distributed, performant and scalable solutions.
● A team of architects, engineers and operators who have worked at some of the most trafficked sites, games and companies since 1999.
*
Operational Excellence
● Configuration management and documentation.
● Change management.
● Availability management.
● Incident and problem management.
● Backup, recovery and business continuity.
● Monitoring and trending.
*
Configuration Management
● Consistent Couchbase configurations.
○ GUIs are great, but don't meet automation needs.
● Self-documenting environments.
● Incorporating your infrastructure into your application to leverage Couchbase's ease of scale.
● Chef, Puppet, Ansible or "roll your own" using the Couchbase API.
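
One way to "roll your own" consistency check is to compare each node's settings against a baseline. The following is a minimal sketch, assuming the settings have already been fetched (for example through the Couchbase API) into plain dictionaries. The function name, setting names and node names are hypothetical and chosen only for illustration.

```python
def find_config_drift(baseline, node_configs):
    """Return {node: {setting: (expected, actual)}} for every setting that differs."""
    drift = {}
    for node, config in node_configs.items():
        diffs = {}
        # Check settings present in either the baseline or the node.
        for key in baseline.keys() | config.keys():
            expected = baseline.get(key)
            actual = config.get(key)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            drift[node] = diffs
    return drift


# Hypothetical settings, for illustration only.
baseline = {"ram_quota_mb": 4096, "replicas": 1}
nodes = {
    "cb-node-1": {"ram_quota_mb": 4096, "replicas": 1},
    "cb-node-2": {"ram_quota_mb": 2048, "replicas": 1},
}
print(find_config_drift(baseline, nodes))
# {'cb-node-2': {'ram_quota_mb': (4096, 2048)}}
```

Run from a scheduled job, a check like this also doubles as documentation of what each node is supposed to look like.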
*
Change and Release Management
● Schemaless is great, but data governance is key.
● Your code needs to build a data dictionary or confusion reigns.
● DevOps style relationships build collaboration that can overcome the wild west mentality of schemaless environments.
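
As a rough illustration of a data dictionary built by code, the sketch below scans sample documents and records which value types appear under each top-level field. It assumes documents are already decoded into Python dictionaries; the function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_data_dictionary(documents):
    """Map each top-level field to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in sorted(seen.items())}


# Hypothetical documents, for illustration only.
docs = [
    {"type": "user", "name": "ana", "age": 31},
    {"type": "user", "name": "bo", "age": "unknown"},
    {"type": "order", "total": 12.5},
]
for field, types in build_data_dictionary(docs).items():
    print(field, types)
# age ['int', 'str']
# name ['str']
# total ['float']
# type ['str']
```

A field that shows more than one type, such as `age` here, is exactly the kind of drift that data governance is meant to catch.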
*
Availability Management
● Moxi provides availability during node failures, supporting reads and writes.
● XDCR support in Couchbase 2.0 provides availability across datacenters and regions in an active/active topology.
● Cloud environments need special consideration for AZ and region failovers.
*
Incident and Problem Management
● While not Couchbase specific, crucial to maintaining any highly available architecture.
● Appropriate alerting, response and communication processes ensure that isolated issues don't cascade into massive failures.
● Failing hardware, networks, design issues can all cause failures that can cascade into an entire cluster being down.
● Tracking recurring problems helps drive continuous improvement in meeting SLAs.
*
Backup and Recovery
● Define your recovery SLAs.
● Track how long backups take.
● Test restores and track how long they take.
● Recognize all failure scenarios:
○ Node failure
○ Physical data corruption
○ Logical data corruption
○ Audits and forensics
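
Tracking how long backups and restores take can be as simple as timing the job and comparing the result with the recovery SLA. This is a minimal sketch: `fake_restore` is a stand-in for a real restore job, and the four-hour SLA is an assumed figure, not a recommendation.

```python
import time


def timed(operation):
    """Run operation() and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = operation()
    return result, time.monotonic() - start


def meets_sla(elapsed_seconds, sla_seconds):
    """True when the measured duration fits inside the SLA."""
    return elapsed_seconds <= sla_seconds


def fake_restore():
    # Stand-in for a real restore job (hypothetical).
    time.sleep(0.01)
    return "restored"


result, elapsed = timed(fake_restore)
within = meets_sla(elapsed, sla_seconds=4 * 3600)  # assumed 4-hour SLA
print(f"{result} in {elapsed:.3f}s, within SLA: {within}")
```

Recording these durations over time shows when data growth is about to push a restore past the SLA.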
*
Backup and Recovery 1.8
● In 1.8, per-node backup is supported. Replica sets are also backed up, which can cause long or non-completing backups.
● SQLite3 can be used as a logical dump to ease backups.
● Cluster-wide consistency cannot be guaranteed.
● No incremental backups available.
*
Backup and Recovery 2.0
● Cluster-wide backups are now available, as are incremental backups.
● EBS snapshots (or LVM, hardware, etc.) work well due to log-style writes to disk.
● With incremental, it is easier to meet SLAs without breaking the bank on storage.
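
The storage argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares keeping a full backup every day with keeping one full backup per week plus daily incrementals. All figures (500 GB full backup, 5% daily change, 28-day retention) are assumptions for illustration, and the model treats every incremental as the same size.

```python
def full_only_gb(full_gb, retention_days):
    """Storage needed when every daily backup is a full backup."""
    return full_gb * retention_days


def weekly_full_plus_incrementals_gb(full_gb, daily_change_pct, retention_days):
    """Storage for one full backup per started week plus an incremental on other days.

    Assumes each incremental is daily_change_pct percent of a full backup.
    """
    fulls = (retention_days + 6) // 7  # one full per started week
    incrementals = retention_days - fulls
    incremental_gb = full_gb * daily_change_pct / 100
    return full_gb * fulls + incremental_gb * incrementals


# Assumed figures, for illustration only.
print(full_only_gb(500, 28))                         # 14000
print(weekly_full_plus_incrementals_gb(500, 5, 28))  # 2600.0
```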
*
Monitoring and Alerting
● Use logs! Centralized syslogs, Splunk or custom scripts can identify and track error types and rates.
● Track your app! The latency of web pages, forms and API calls is a key indicator.
● Define key alerts; make them actionable and tie them to documentation.
● Palomino builds plugins and templates to provide proper alerts that are useful and work!
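
A custom script for tracking error types and rates does not need to be elaborate. The sketch below counts errors by type and reports the share of lines that are errors. The log line format is hypothetical; a real script would match whatever format the application or syslog actually emits.

```python
import re
from collections import Counter

# Hypothetical line format: "<timestamp> <LEVEL> <error_type>: <message>"
LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<kind>\w+):")


def summarize_errors(lines):
    """Return (Counter of error types, fraction of parsed lines that are errors)."""
    counts = Counter()
    parsed = 0
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        parsed += 1
        if match.group("level") == "ERROR":
            counts[match.group("kind")] += 1
    rate = sum(counts.values()) / parsed if parsed else 0.0
    return counts, rate


# Hypothetical log lines, for illustration only.
sample = [
    "2013-01-01T00:00:01 ERROR timeout: get user::42",
    "2013-01-01T00:00:02 INFO request: ok",
    "2013-01-01T00:00:03 ERROR timeout: set cart::7",
    "2013-01-01T00:00:04 ERROR connection_refused: node 10.0.0.5",
]
counts, rate = summarize_errors(sample)
print(counts.most_common(), rate)
# [('timeout', 2), ('connection_refused', 1)] 0.75
```

Comparing the rate against a threshold gives an alert that is tied to a specific, named error type rather than a generic "something is wrong".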
*
Trending and Diagnostics
● Alerts aren't enough; you must track usage and internal metrics to understand trends, workloads and bottlenecks.
● Graph everything! All exposed metrics, trend health checks.
● Interleave graphs of internal metrics with external factors: code pushes, application metrics (logins, purchases, API calls).
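
Interleaving metrics with external factors can be sketched as annotating each metric sample with the events that happened shortly before it. The example below is a simplified illustration: timestamps are plain seconds, and the latency samples and the "code push" event are made up.

```python
def annotate_with_events(samples, events, window_seconds):
    """Attach to each (timestamp, value) sample the events seen within the window before it."""
    annotated = []
    for ts, value in samples:
        recent = [
            name for event_ts, name in events
            if 0 <= ts - event_ts <= window_seconds
        ]
        annotated.append((ts, value, recent))
    return annotated


# Hypothetical latency samples (seconds, milliseconds) and one event.
samples = [(0, 12.0), (60, 11.5), (120, 48.0), (180, 47.2)]
events = [(100, "code push")]
for row in annotate_with_events(samples, events, window_seconds=60):
    print(row)
# (0, 12.0, [])
# (60, 11.5, [])
# (120, 48.0, ['code push'])
# (180, 47.2, [])
```

Here the jump in latency at 120 seconds lines up with the code push, which is the kind of correlation a combined graph makes visible at a glance.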
*
Care and Feeding
● Regular performance reviews.
● Defragmentation.
● Incorporate recovery tests into building test and dev environments.
● Scale-up/Scale-down, preferably via automated processes.
● Rolling upgrades.
● Coffee, pie, beer.
*
Partnering with Couchbase
Providing remote Architecture, Engineering and DBA services to clients.
Vendor-neutral operations and scaling expertise for Couchbase clients in need of operators.
*
Remote Architecture and Engineering Services
● Architecture review and recommendations
● Data modeling
● Data model migration
● Data migration
● Cluster sizing
● Tools development
*
DBA and Operations Services
● Infrastructure builds and management
● Proactive operational support
● 24x7 operational support with a 30-minute SLA
● System health checks
● Backup and recovery
● Tuning for performance and scale
● Query reviews, indexing, benchmarking
● Capacity reviews
*
How we can help
● Support your proof of concept
● Migrate you to Couchbase Server
● Support your Couchbase Server clusters
*
Is Couchbase Server a good fit?
● Architecture review
● Data model review
● Recommendation on moving to Couchbase Server
● Data access best practices
*
Migrating from an RDBMS to Couchbase Server?
● Data model migration from relational to document
● Data migration from SQL Server to Couchbase Server
● Couchbase Server cluster sizing
● Infrastructure builds
*
Do you need operational experts?
● 24x7 operational support with a 30-minute SLA
● Multiple Couchbase Server 1.8 clusters
● Wanted Couchbase operational experts
● Escalate to Couchbase for software support
*
Contact Info
Laine Campbell, laine@palominodb.com
Charlie Killian, charlie@palominodb.com
www.palominodb.com
@palominodb on Twitter