Preventing and Resolving MySQL Downtime

Post on 16-Apr-2017

Jervin Real, Michael Coburn

Percona

Preventing and Resolving MySQL Downtime

About Us

•Jervin Real, Technical Services Manager

• Engineer Engineering Engineers

• APAC

•Michael Coburn, Principal Technical Account Manager

• Responsible for managing technical relationship with Percona's highest revenue customers

What is Downtime?

•When your Application is completely unavailable

•When your Application is in a degraded state

•Whenever your boss says so :)

Why Prevent Downtime?

•Your business loses money when the Application is down

•Your reputation, and your team's, suffers

Agenda

•Real world adventures

• Problems

• Solutions

• Prevention

•Putting them all together

I Had a Crash On You

I Had a Crash On You (1): Page Corruption

I Had a Crash On You (1): Page Corruption > About

•Disk bad sectors problem, not monitored or checked

•Page corruption on disk level

•Server crashes when reading page from disk

•Keeps crashing :(

I Had a Crash On You (1): Page Corruption > Solutions

•Percona Server, we tried:

• innodb_corrupt_table_action = salvage

•Worked!

•Dropped table, recreated - application back online

•Worst case:

• innodb_force_recovery > 0

• Data Recovery

I Had a Crash On You (2): Assertion > About

•Running 5.6.11, early adopter, InnoDB FULLTEXT

•Upgraded to 5.6.18, MySQL crashed

•Data was unusable - bug#72079

I Had a Crash On You (2): Assertion > Solutions

•Downgrade and restore from backup

•Re-execute upgrade to avoid the bug

I Had a Crash On You (1): Page Corruption > Prevention

•innodb_corrupt_table_action=salvage / warn

•pt-table-checksum

• Regularly read through all your data and check the error log for errors

•RAID card health checks

• Can vary by vendor

•SMART checks

• Be vigilant for disk level errors
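
The SMART check above can be scripted. The following is a minimal sketch, assuming smartmontools is installed and that `smartctl -H` prints either an "overall-health self-assessment test result" line (ATA) or a "SMART Health Status" line (SCSI/SAS). The helper name `smart_health` and the device name in the usage line are illustrative and not from the talk.

```shell
#!/bin/sh
# smart_health (hypothetical helper)
# Reads `smartctl -H <device>` output on stdin and prints PASSED, FAILED
# or UNKNOWN so that a monitoring system can alert on anything but PASSED.
smart_health() {
    # Keep only the health verdict line, if there is one.
    result=$(grep -E 'overall-health self-assessment test result|SMART Health Status' | head -n 1)
    case $result in
        *PASSED*|*": OK"*) echo PASSED ;;
        "")                echo UNKNOWN ;;   # no verdict line found
        *)                 echo FAILED ;;
    esac
}

# Usage (needs root; /dev/sda is an assumption):
#   smartctl -H /dev/sda | smart_health
```

Disks behind a RAID controller usually need a vendor-specific `smartctl -d` device type, which is where the "can vary by vendor" caveat applies.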

Nobody’s Watching

Nobody’s Watching (1): Nobody Cared > About

•Percona XtraDB Cluster, 3 nodes

•A few months ago node 3 went down due to a conflict, but nobody noticed

•A few hours ago node 2 was killed by the OOM killer, and the cluster lost quorum

•EVERYBODY NOTICED!

Nobody’s Watching (1): Nobody Cared > Solutions

•Bootstrap remaining node

• SET GLOBAL wsrep_provider_options='pc.bootstrap=1';

•SST the second and third nodes

•Define wsrep_notify_cmd temporarily

•Implement better alerting
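
A notification script for wsrep_notify_cmd could look roughly like the sketch below. It assumes Galera invokes the command with `--status`, `--uuid`, `--primary`, `--members` and `--index` arguments, each followed by a value, and that `--members` is a comma-separated list. The function name, the expected node count and the message wording are illustrative and not from the talk.

```shell
#!/bin/sh
# Hypothetical handler for wsrep_notify_cmd.
EXPECTED_NODES=3   # assumption: a three-node cluster, as in this story

# notify_message ARGS...: turn the Galera arguments into a one-line alert.
notify_message() {
    status="" primary="" members=""
    while [ $# -gt 0 ]; do
        case $1 in
            --status)  status=$2;  shift 2 ;;
            --primary) primary=$2; shift 2 ;;
            --members) members=$2; shift 2 ;;
            --uuid|--index) shift 2 ;;       # parsed but not used here
            *) shift ;;
        esac
    done
    # Count the comma-separated member entries.
    count=0
    if [ -n "$members" ]; then
        count=$(printf '%s\n' "$members" | tr ',' '\n' | grep -c .)
    fi
    if [ "$primary" = "no" ]; then
        echo "CRITICAL: node is not in a primary component (status=$status)"
    elif [ -n "$members" ] && [ "$count" -lt "$EXPECTED_NODES" ]; then
        echo "WARNING: cluster has $count of $EXPECTED_NODES members (status=$status)"
    else
        echo "OK: status=$status"
    fi
}

# Usage inside the real script (logger is just one possible sink):
#   notify_message "$@" | logger -t wsrep_notify
```

With something like this in place, the loss of node 3 would have produced a warning months before the second node failed.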

Nobody’s Watching (2): Dropped the Bomb > About

•New sysadmin received disk space alert

•du -hx --max-depth=1 /

•/var has lots of data

•find /var/ -size +5G -exec rm -rf {} \;

•Bam, ibdata1 gone!

•Restart maintenance occurred later in the day ...
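
For contrast with the `find ... -exec rm -rf` command above, here is a read-only sketch that only lists the large files so a human can review them first. The helper name `list_large_files` is hypothetical. The size suffixes (`+5G`) follow the same find syntax the original command relies on.

```shell
#!/bin/sh
# list_large_files DIR [SIZE] (hypothetical helper; deletes nothing)
list_large_files() {
    dir=$1 size=${2:-+5G}
    # -xdev   : stay on one filesystem, do not wander into other mounts
    # -type f : regular files only, never directories
    # ls -lh  : show owner and size for review instead of removing
    find "$dir" -xdev -type f -size "$size" -exec ls -lh {} +
}

# Usage:
#   list_large_files /var +5G
# Before removing anything from the list, check whether a process still
# has it open, for example with:  lsof -- /var/lib/mysql/ibdata1
```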

Nobody’s Watching (2): Dropped the Bomb > Solutions

•Restore from backup

•Really, they were lucky!

Nobody’s Watching: Prevention

•Percona Monitoring Plugins

• pmp-check-mysql-deleted-files

• pmp-check-mysql-status

• pmp-check-mysql-innodb

•Define a script executable by mysql user

• Triggered on node state changes

•Take backups, and alert on failure

•Don't restart the server - file handles are still open!
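
The "file handles are still open" point can be checked by hand on Linux. The sketch below assumes the usual /proc layout, where every entry in /proc/<pid>/fd is a symlink whose target ends in " (deleted)" once the file has been unlinked. The helper name `deleted_open_files` is hypothetical, and this is not how the Percona plugin itself is implemented.

```shell
#!/bin/sh
# deleted_open_files FD_DIR (hypothetical helper)
# FD_DIR is normally /proc/<mysqld pid>/fd. Prints one line per open file
# descriptor that points at a file which no longer exists on disk.
deleted_open_files() {
    fd_dir=$1
    for fd in "$fd_dir"/*; do
        # Skip entries that are not symlinks (or an unmatched glob).
        target=$(readlink "$fd") || continue
        case $target in
            *" (deleted)") echo "$fd -> $target" ;;
        esac
    done
}

# Usage (as root or the mysql user; pidof is an assumption about the host):
#   deleted_open_files "/proc/$(pidof mysqld)/fd"
```

mysqld normally keeps a few unlinked temporary files open, so an alert should ignore the temporary directory and focus on the data directory.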

Self Induced Pain

Self Induced Pain (1): Query Cache

•“Waiting for query cache lock”

root# ~> pt-sift /var/lib/pt-stalk/
...
--processlist--
State
226
90 Waiting for query cache lock
4 Sending data
4 Master has sent all binlog to slave; waiting for binlog to be updated
2 init
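
When no pt-stalk capture is available, a similar per-state summary can be produced from a live server. This is a minimal sketch, assuming the tab-separated batch output of the mysql client, where State is the seventh column of SHOW PROCESSLIST. The helper name `count_states` is hypothetical.

```shell
#!/bin/sh
# count_states (hypothetical helper)
# Reads tab-separated SHOW PROCESSLIST output (with header row) on stdin
# and prints "<count> <state>" lines, busiest state first.
count_states() {
    awk -F'\t' '
        NR > 1 { n[$7]++ }                               # skip header, tally State
        END    { for (s in n) printf "%d %s\n", n[s], s }
    ' | sort -rn
}

# Usage (assumes client credentials are already configured):
#   mysql -B -e 'SHOW FULL PROCESSLIST' | count_states
```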

Self Induced Pain (1): Query Cache > About

● Global mutex

● Point of contention

● Especially on hot dataset/table

● More so, with large QC

Self Induced Pain (1): Query Cache > Solutions

● Set it to a small size to reduce performance overhead

● Disable it completely to avoid contention

● Hint offending queries to skip the query cache, e.g. SELECT SQL_NO_CACHE
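
Disabling the query cache at runtime comes down to two statements. The sketch below only prints them, so they can be reviewed before being piped to the client. The helper name `qc_disable_sql` is hypothetical. It assumes MySQL or Percona Server 5.6/5.7, where `query_cache_type` and `query_cache_size` exist (the query cache was removed in MySQL 8.0).

```shell
#!/bin/sh
# qc_disable_sql (hypothetical helper): prints SQL, executes nothing.
qc_disable_sql() {
    cat <<'EOF'
SET GLOBAL query_cache_type = OFF;
SET GLOBAL query_cache_size = 0;
EOF
}

# Usage (assumes client credentials are already configured):
#   qc_disable_sql | mysql
# To make it permanent, also set query_cache_type = 0 and
# query_cache_size = 0 in my.cnf.
```

Setting it in my.cnf matters: when the server starts with query_cache_type = 0, it does not take the query cache mutex at all.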

Self Induced Pain (2): Buffer Pool Dump/Restore

● Dumps buffer pool page list to disk

● Reloads buffer pool based on this list at startup

● Meant to help speed up buffer pool warmup

Self Induced Pain (2): Buffer Pool Dump/Restore > About

● Maintenance restart, buffer dump and restore enabled

● Yay! Expecting everything to go well.

● 30 minutes in, performance is still really bad, IO thrashing

● Large buffer pool, busy read/write

Self Induced Pain (2): Buffer Pool Dump/Restore > Solutions

● Extend your maintenance period to let the server warm up if possible; otherwise the buffer pool load and production traffic will contend for IO

● RAID1 of 2 SATA disks is not a license to use buffer pool warmup on 240GB of buffer pool

Self-Induced Pain Prevention

•Percona Toolkit

• pt-stalk

• pt-sift

• pt-kill

•Disable OOM killer

•Configure appropriate disk scheduler

•Check the error log for "Buffer pool load complete"
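
Besides the error log, the warmup can be followed through the `Innodb_buffer_pool_load_status` status variable. The sketch below classifies its value. It assumes the value contains "load completed" when the load is finished and starts with "Loaded" while pages are still being read; the exact wording can differ between versions, so treat the patterns as assumptions. The helper name `bp_load_state` is hypothetical.

```shell
#!/bin/sh
# bp_load_state STATUS_STRING (hypothetical helper)
# Prints done, loading or unknown for a buffer pool load status value.
bp_load_state() {
    case $1 in
        *"load completed"*) echo done ;;
        Loaded*)            echo loading ;;
        *)                  echo unknown ;;
    esac
}

# Usage (assumes client credentials are already configured):
#   status=$(mysql -NBe "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status'" | cut -f2-)
#   bp_load_state "$status"
```

Ending the maintenance window only once this reports `done` avoids the warmup competing with production traffic for IO.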

MySQL, MySQL! What Have Suffereth Ye Thee?

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

•Slow queries

•Connections build up

•Slow response times

•Long running transactions

•Stop the World scenario

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

--innodb--
txns: 486xACTIVE (28s) 994xnot (0s) 227xLOCK WAIT (25844s)
0 queries inside InnoDB, 0 queries in queue
Main thread: sleeping, pending reads 0, writes 28, flush 1
Log: lsn = 2147483647, chkp = 2147483647, chkp age = 210625191

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > About

---TRANSACTION 230207990, ACTIVE 13779 sec fetching rows
mysql tables in use 1, locked 1
80337 lock struct(s), heap size 8271400, 10979242 row lock(s)
MySQL thread id 671621, OS thread handle 0x7fe03528a700, query id 37505085 localhost magento Sending data
SELECT `sales_flat_quote_item`.* FROM `sales_flat_quote_item` LIMIT 376 OFFSET 491056

MySQL, MySQL! What Have Suffereth Ye Thee? (1): Grind to a Halt > Solutions

•KILL long running trx

•pt-kill for persistent long running trx

•Deploy immediate code changes to disable the offending code
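
To find candidates for KILL, such as the transaction that had been ACTIVE for 13779 seconds, the `information_schema.innodb_trx` table can be queried. The sketch below only generates the query. The helper name `long_trx_sql` and the default threshold are illustrative and not from the talk.

```shell
#!/bin/sh
# long_trx_sql [MIN_SECONDS] (hypothetical helper): prints SQL, runs nothing.
long_trx_sql() {
    min_seconds=${1:-300}   # assumption: 5 minutes counts as "long running"
    cat <<EOF
SELECT trx_mysql_thread_id AS thread_id,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
       trx_rows_locked,
       LEFT(trx_query, 80) AS query
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL $min_seconds SECOND
ORDER BY trx_started;
EOF
}

# Usage (assumes client credentials are already configured):
#   mysql -e "$(long_trx_sql 600)"
# then, after review:  KILL <thread_id>;
```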

MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > About

•MySQL is still responding

•All sorts of mutexes

• trx_sys->mutex

• block->lock

• lock_sys->mutex

• lock_sys->wait_mutex

•… and is killing latency

•Service impact means lost income

MySQL, MySQL! What Have Suffereth Ye Thee? (2): CPU Load > Solutions

•innodb_thread_concurrency > 0

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About

● “Opening tables”, “Closing tables”

--processlist--
State
578 Opening tables
32 closing tables

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > About

● Contention on LOCK_open mutex

● Risk of negative scalability

MySQL, MySQL! What Have Suffereth Ye Thee? (3): CPU Load > Solutions

● Tune table_open_cache/table_definition_cache

● table_open_cache_instances (5.6+)

● Shard logically or horizontally, or run multiple MySQL instances, to reduce the number of objects per instance

MySQL, MySQL! What Have Suffereth Ye Thee? (2,3): Prevention

•pt-kill --log

•MySQL Server Configuration

a. Remember to tune innodb_thread_concurrency (default is 0)

b. table_open_cache + table_open_cache_instances

•Application Stack Configuration (Schema Design)

a. Single tenant per schema

b. Multiple tenants per schema (each table has client_id column)

c. All tenants in one schema

Wizard of OS (1): Disk Performance

•Disk performance cascading to MySQL to application

Wizard of OS (1): Disk Performance > About

•Slow writes, binlogs, redo logs, syncs

•Transactions stalling on COMMIT, updating, inserting …

•Replication getting delayed if node is a slave

•Translates to latency

Wizard of OS (1): Disk Performance > Solutions

● RAID Controller in Write-Through

● Could also be a bad disk!

Wizard of OS (2): Swapping

● Swapping heavily, with significant amount of RAM free

Wizard of OS (2): Swapping > About

● Swapping induces significant amount of IO

● Swapping in and out of disk is mighty expensive

● Affects MySQL in magnificent ways

● Swap Insanity!

Wizard of OS (2): Swapping > Solutions

● NUMA Interleave

● Percona Server is NUMA configurable

○ numa_interleave

○ flush_caches

● Check numastat - perl check_numa.pl

Wizard of OS: Prevention

● Tune:

○ vm.swappiness

○ NUMA policy

○ disk scheduler

○ mount options appropriately (ext4, xfs)

■ (nobarrier, noatime)

● pt-heartbeat - monitor replication delay
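
Before tuning, it helps to see what is currently in effect. This is a minimal sketch for reading the active disk scheduler, assuming the Linux sysfs convention of marking the active entry with square brackets (for example "noop deadline [cfq]"). The helper name `active_scheduler` is hypothetical.

```shell
#!/bin/sh
# active_scheduler LINE (hypothetical helper)
# Extracts the bracketed, currently active scheduler from the contents
# of /sys/block/<device>/queue/scheduler.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage on a Linux host:
#   for f in /sys/block/*/queue/scheduler; do
#       echo "$f: $(active_scheduler "$(cat "$f")")"
#   done
#   cat /proc/sys/vm/swappiness    # current vm.swappiness value
```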

Percona Server Features

•Enable InnoDB Buffer Pool warming

•Enable userstat for table & index statistics

•Enable verbose slow log

•Enable Query Response Time plugin

Thank You!

•Jervin Real jervin.real@percona.com

• Technical Services Manager, APAC

•Michael Coburn michael.coburn@percona.com

• Principal Technical Account Manager, USA