Preventing and Resolving MySQL Downtime

Jervin Real, Michael Coburn

Percona

About Us

•Jervin Real, Technical Services Manager

• Engineer Engineering Engineers

• APAC

•Michael Coburn, Principal Technical Account Manager

• Responsible for managing the technical relationship with Percona's highest revenue customers

What is Downtime?

•When your Application is completely unavailable

•When your Application is in a degraded state

•Whenever your boss says so :)

Why Prevent Downtime?

•Your business loses money when the Application is down

•Your reputation, and your team's, suffers

Agenda

•Real world adventures

• Problems

• Solutions

• Prevention

•Putting them all together

I Had a Crash On You

I Had a Crash On You (1): Page Corruption

I Had a Crash On You (1): Page Corruption > About

•Bad disk sectors, not monitored or checked

•Page corruption on disk level

•Server crashes when reading page from disk

•Keeps crashing :(

I Had a Crash On You (1): Page Corruption > Solutions

•Percona Server, we tried:

• innodb_corrupt_table_action = salvage

•Worked!

•Dropped table, recreated - application back online

•Worst case:

• innodb_force_recovery > 0

• Data Recovery

I Had a Crash On You (2): Assertion > About

•Running 5.6.11, early adopter, InnoDB FULLTEXT

•Upgrade to 5.6.18, MySQL crashed

•Data was unusable - bug#72079

I Had a Crash On You (2): Assertion > Solutions

•Downgrade and restore from backup

•Re-execute upgrade to avoid the bug

I Had a Crash On You (1): Page Corruption > Preventions

•innodb_corrupt_table_action=salvage / warn

•pt-table-checksum

• Regularly read through all your data and check the error log for errors

•RAID card health checks

• Can vary by vendor

•SMART checks

• Be vigilant for disk level errors
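The error log check listed above could be automated along these lines. This is only a minimal sketch: `find_corruption_lines` is a hypothetical helper, and the two patterns are illustrative assumptions rather than an exhaustive or version-exact list of InnoDB messages.

```python
import re

# Illustrative patterns only (assumption): adjust them to the messages
# your MySQL / Percona Server version actually writes to its error log.
CORRUPTION_PATTERNS = [
    re.compile(r"page corruption", re.IGNORECASE),
    re.compile(r"assertion failure", re.IGNORECASE),
]


def find_corruption_lines(lines):
    """Return (line_number, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in CORRUPTION_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample lines, only to exercise the function.
    sample = [
        "InnoDB: started\n",
        "InnoDB: Database page corruption on disk or a failed\n",
        "InnoDB: Assertion failure in thread 1\n",
    ]
    for number, text in find_corruption_lines(sample):
        print(number, text)
```

In practice the function would be fed the open error log file and its result wired into whatever alerting is already in place.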

Nobody’s Watching

Nobody’s Watching (1): Nobody Cared > About

•Percona XtraDB Cluster, 3 nodes

•A few months ago, node 3 went down due to a conflict, but nobody noticed

•A few hours ago, node 2 was killed by the OOM killer and the cluster lost quorum

•EVERYBODY NOTICED!

Nobody’s Watching (1): Nobody Cared > Solutions

•Bootstrap remaining node

• SET GLOBAL wsrep_provider_options='pc.bootstrap=1';

•SST second and third node

•Define wsrep_notify_cmd temporarily

•Implement better alerting
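As a rough illustration of what "better alerting" could check, the sketch below inspects three wsrep status variables. The function name `galera_alerts` and the `expected_nodes` default are assumptions made for this example; fetching the values (for instance with SHOW GLOBAL STATUS LIKE 'wsrep_%') is left to the caller.

```python
def galera_alerts(status, expected_nodes=3):
    """Return a list of human-readable problems found in wsrep status.

    `status` maps status variable names to string values, as returned
    by SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    """
    alerts = []
    size = int(status.get("wsrep_cluster_size", 0))
    if size < expected_nodes:
        alerts.append(f"cluster size is {size}, expected {expected_nodes}")
    if status.get("wsrep_cluster_status") != "Primary":
        alerts.append("node is not part of a Primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        alerts.append("node is not Synced")
    return alerts


if __name__ == "__main__":
    # Example values for a cluster that has silently lost one node.
    print(galera_alerts({
        "wsrep_cluster_size": "2",
        "wsrep_cluster_status": "Primary",
        "wsrep_local_state_comment": "Synced",
    }))
```

A check like this, run on every node, would have flagged the loss of node 3 months before quorum was lost. A donor node reports a state other than Synced during SST, so that alert may need a grace period.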

Nobody’s Watching (2): Dropped the Bomb > About

•New sysadmin received disk space alert

•du -hx --max-depth=1 /

•/var has lots of data

•find /var/ -size +5G -exec rm -rf {} \;

•Bam, ibdata1 gone!

•Restart maintenance occurred later in the day ...
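A less destructive way to approach the same disk space alert is to list candidates for review instead of deleting them. The sketch below is one possible way to do that: `large_files` is a hypothetical helper, and the data directory path in the usage note is an assumption about the default location.

```python
import os


def large_files(root, min_bytes, exclude_dirs=()):
    """List (path, size) of files under root at least min_bytes in size.

    Nothing is deleted. Directories in exclude_dirs, and everything
    below them, are skipped.
    """
    excluded = {os.path.abspath(d) for d in exclude_dirs}
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend.
        dirnames[:] = [
            d for d in dirnames
            if os.path.abspath(os.path.join(dirpath, d)) not in excluded
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable
            if size >= min_bytes:
                found.append((path, size))
    # Largest files first, so the list is easy to review by hand.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, `large_files("/var", 5 * 1024**3, exclude_dirs=["/var/lib/mysql"])` would report files of 5 GiB or more while leaving the assumed MySQL data directory out of the list entirely.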

Nobody’s Watching (2): Dropped the Bomb > Solutions

•Restore from backup

•Really, they were lucky!

Nobody’s Watching: Prevention

•Percona Monitoring Plugins

• pmp-check-deleted-files

• pmp-check-mysql-status

• pmp-check-mysql-innodb

•Define a wsrep_notify_cmd script executable by the mysql user

• Triggered on node state changes

•Take backups, and alert on failure

•Don't restart the server if its files were deleted - file handles are still open!
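The idea behind checking for deleted files that are still held open can be sketched as follows. This is not how pmp-check-deleted-files is implemented; it is a Linux-only illustration with hypothetical function names, relying on the fact that the symlinks under /proc/<pid>/fd gain a " (deleted)" suffix once the file is unlinked.

```python
import os

DELETED_SUFFIX = " (deleted)"


def is_deleted_target(link_target):
    """True if a /proc/<pid>/fd symlink target refers to an unlinked file."""
    return link_target.endswith(DELETED_SUFFIX)


def deleted_open_files(pid):
    """Return paths that process `pid` still holds open after deletion.

    Linux only: reads the symlinks under /proc/<pid>/fd, which usually
    requires running as the same user or as root.
    """
    fd_dir = f"/proc/{pid}/fd"
    deleted = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor closed while we were looking
        if is_deleted_target(target):
            deleted.append(target[: -len(DELETED_SUFFIX)])
    return deleted
```

As long as mysqld keeps running, the data behind those descriptors still exists, which is why restarting is the one thing to avoid.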

Self Induced Pain

Self Induced Pain (1): Query Cache

•“Waiting for query cache lock”

root# ~> pt-sift /var/lib/pt-stalk/

--processlist--

90 Waiting for query cache lock

4 Sending data

4 Master has sent all binlog to slave; waiting for binlog to be updated

2 init
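The per-state counts in the processlist summary above are a simple aggregation. The sketch below shows the same idea over the State column of SHOW PROCESSLIST; it is not pt-sift's code, and `summarize_states` is a hypothetical name.

```python
from collections import Counter


def summarize_states(states):
    """Count thread states and return (count, state) pairs, busiest first."""
    counts = Counter(state for state in states if state)
    return [(n, state) for state, n in counts.most_common()]


if __name__ == "__main__":
    # Made-up input, shaped like the State column of SHOW PROCESSLIST.
    states = ["Waiting for query cache lock"] * 3 + ["Sending data"] * 2 + ["init"]
    for count, state in summarize_states(states):
        print(count, state)
```

When one state dominates the list, as "Waiting for query cache lock" does here, that state is the first thing to investigate.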

Self Induced Pain (1): Query Cache > About

● Global mutex

● Point of contention

● Especially on hot dataset/table

● More so, with large QC

Self Induced Pain (1): Query Cache > Solutions

● Set it to a small size to reduce performance overhead

● Disable completely to avoid contention

● Hint offending queries to skip the query cache, e.g. SELECT SQL_NO_CACHE

Self Induced Pain (2): Buffer Pool Dump/Restore

● Dumps buffer pool page list to disk

● Reloads buffer pool based on this list at startup

● Meant to help speed up buffer pool warmup

Self Induced Pain (2): Buffer Pool Dump/Restore > About

● Maintenance restart, buffer dump and restore enabled

● Yay! Expecting everything to go well.

● 30 minutes in, performance still really bad, IO thrashing

● Large buffer pool, busy read/write

Self Induced Pain (2): Buffer Pool Dump/Restore > Solutions

● Extend your maintenance period to let the server warm up if possible, otherwise the buffer pool load and production traffic will contend on IO

● RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool
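A back-of-the-envelope estimate shows why 30 minutes was not enough. The sketch below computes a lower bound on reload time from an assumed read throughput; the 100 MB/s figure is an assumption for illustration, not a measurement from this incident, and real warmup is slower because the reads compete with production traffic.

```python
def warmup_seconds(buffer_pool_gb, read_mb_per_s):
    """Lower-bound time to read a full buffer pool back from disk."""
    return buffer_pool_gb * 1024 / read_mb_per_s


if __name__ == "__main__":
    # 100 MB/s is an assumed throughput, not a figure from the talk.
    minutes = warmup_seconds(240, 100) / 60
    print(f"at least {minutes:.0f} minutes")
```

With those assumed numbers the lower bound for a 240GB buffer pool is about 41 minutes of uncontended reading.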

Self-Induced Pain Prevention

•Percona Toolkit

• pt-stalk

• pt-sift

• pt-kill

•Disable OOM killer

•Configure appropriate disk scheduler

•Check the error log for "Buffer pool load complete"

MySQL, MySQL! What Have Suffereth Ye Thee?

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

•Slow queries

•Connections build up

•Slow response times

•Long running transactions

•Stop the World scenario

--innodb--

txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)

0 queries inside InnoDB, 0 queries in queue

Main thread: sleeping, pending reads 0, writes 28, flush 1

Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191

---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows

mysql tables in use 1, locked 1

80337 lock struct(s), heap size 8271400, 10979242 row lock(s)

MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data

SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > Solutions

•KILL long running trx

•pt-kill for persistent long running trx

•Deploy immediate code changes to disable erroring code
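The selection logic behind killing long-running queries can be sketched as below. This is an illustration of the idea, not pt-kill itself: `kill_statements` is a hypothetical helper, the 60 second threshold is an assumption, and the rows are expected to carry the SHOW PROCESSLIST columns Id, User, Command, Time and Info.

```python
def kill_statements(processlist, busy_time=60, ignore_users=("system user",)):
    """Build KILL statements for queries running longer than busy_time.

    `processlist` is a list of dicts with the SHOW PROCESSLIST columns
    Id, User, Command, Time and Info.
    """
    statements = []
    for row in processlist:
        if row["User"] in ignore_users:
            continue  # never touch replication threads
        if row["Command"] != "Query":
            continue  # skip Sleep and other non-query threads
        if row["Time"] >= busy_time:
            statements.append(f"KILL {row['Id']};")
    return statements
```

The function only builds the statements, so they can be reviewed before anything is executed. It does not catch transactions that sit idle while holding locks; those need to be found through the InnoDB transaction list instead.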

MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > About

•MySQL is still responding

•Contention on all sorts of mutexes

• trx_sys->mutex

• block->lock

• lock_sys->mutex

• lock_sys->wait_mutex

•… and it is killing latency

•Service impact means lost income

MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > Solutions

•innodb_thread_concurrency > 0

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > Solutions

● “Opening tables”, “Closing tables”

--processlist--

578 Opening tables

32 closing tables

● Contention on LOCK_open mutex

● Risk of negative scalability

● Tune table_open_cache/table_definition_cache

● table_open_cache_instances (5.6+)

● Shard either logically or horizontally; run multiple mysql instances to reduce object size per instance
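One way to judge whether table_open_cache needs tuning is the cache hit ratio. The sketch below assumes MySQL 5.6 or later, where the Table_open_cache_hits and Table_open_cache_misses status counters exist; the function name is hypothetical and the values would come from SHOW GLOBAL STATUS.

```python
def table_cache_hit_ratio(status):
    """Fraction of table opens served from the table cache, or None."""
    hits = int(status["Table_open_cache_hits"])
    misses = int(status["Table_open_cache_misses"])
    total = hits + misses
    if total == 0:
        return None  # no lookups recorded yet
    return hits / total
```

A ratio that stays low under steady load, together with many threads in "Opening tables", suggests the cache is too small for the number of tables in use.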

MySQL, MySQL! What Have Suffereth Ye Thee? (2,3): Prevention

•pt-kill --log

•MySQL Server Configuration

a. Remember to tune innodb_thread_concurrency (default is 0)

b. table_open_cache + table_open_cache_instances

•Application Stack Configuration (Schema Design)

a. Single tenant per schema

b. Multiple tenants per schema (each table has client_id column)

c. All tenants in one schema

Wizard of OS (1): Disk Performance

•Disk performance cascading to MySQL to application

Wizard of OS (1): Disk Performance > About

•Slow writes, binlogs, redo logs, syncs

•Transactions stalling on COMMIT, updating, inserting …

•Replication getting delayed if node is a slave

•Translates to latency

Wizard of OS (1): Disk Performance > Solutions

● Check for a RAID controller running in write-through mode

● Could also be a bad disk!

Wizard of OS (2): Swapping

● Swapping heavily, with a significant amount of RAM free

Wizard of OS (2): Swapping > About

● Swapping induces a significant amount of IO

● Swapping in and out of disk is mighty expensive

● Affects MySQL in magnificent ways

● Swap Insanity!

Wizard of OS (2): Swapping > Solutions

● NUMA Interleave

● Percona Server is NUMA configurable

○ numa_interleave

○ flush_caches

● Check numastat - perl check_numa.pl
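The symptom described above, heavy swapping while plenty of RAM is free, usually shows up as one NUMA node being exhausted while another has memory to spare. The sketch below is one way to spot that from the per-node meminfo files on Linux (/sys/devices/system/node/node*/meminfo). It is not a reimplementation of check_numa.pl, and the function names are hypothetical.

```python
import re

# Matches lines such as "Node 0 MemFree:   245512 kB" from the per-node
# meminfo files under /sys/devices/system/node/ on Linux.
MEMFREE_LINE = re.compile(r"^Node\s+(\d+)\s+MemFree:\s+(\d+)\s+kB", re.MULTILINE)


def node_free_kb(meminfo_text):
    """Map NUMA node number to free memory in kB."""
    return {int(node): int(kb) for node, kb in MEMFREE_LINE.findall(meminfo_text)}


def most_starved_node(free_by_node):
    """Return the node with the least free memory, or None if empty."""
    if not free_by_node:
        return None
    return min(free_by_node, key=free_by_node.get)
```

Feeding it the concatenated contents of every node's meminfo file gives a quick view of how unevenly memory is allocated across nodes.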

Wizard of OS: Prevention

● Tune:

○ vm.swappiness

○ NUMA policy

○ disk scheduler

○ mount options appropriately (ext4, xfs)

■ (nobarrier, noatime)

● pt-heartbeat - monitor replication delay

Percona Server Features

•Enable InnoDB Buffer Pool warming

•Enable userstat for table & index statistics

•Enable verbose slow log

•Enable Query Response Time plugin

Thank You!

•Jervin Real jervin.real@percona.com

• Technical Services Manager, APAC

•Michael Coburn michael.coburn@percona.com

• Principal Technical Account Manager, USA
