Monitor some of the things

2013-10-18 MONITOR SOME OF THE THINGS

Transcript of Monitor some of the things

Page 1: Monitor some of the things

2013-10-18

MONITOR SOME OF THE THINGS

Page 2: Monitor some of the things

Optimization, Backups, Replication, and more

Baron Schwartz, Peter Zaitsev & Vadim Tkachenko

High Performance MySQL

3rd Edition

Covers Version 5.5

ME

• Cofounder of @VividCortex

• Author of High Performance MySQL

• @xaprb on Twitter


• http://www.linkedin.com/in/xaprb

Page 3: Monitor some of the things

RANT, RECAPPED

• The sky is falling

• Tools drive processes, and we need better tools designed for methods

• Pay attention to CAPS (Capacity, Availability, Performance, Scalability)

• Monitoring tools need to be a lot smarter

• Measure and monitor “work getting done”

Page 4: Monitor some of the things

HARD CAPACITY

• Disk volume

• CPU Cycles

• max_connections

• File descriptors, sockets, TCP port numbers, etc.

• %used, absolute quantity available
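A minimal sketch of the two numbers named on this slide, %used and absolute quantity available, for any hard-capped resource. The function name and the sample values (180 connections against a max_connections of 200) are illustrative assumptions, not figures from the talk.

```python
def capacity(used, limit):
    """Return (percent_used, absolute_remaining) for a hard-capped resource."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit, limit - used


# Illustrative values only: 180 connections in use, max_connections = 200.
pct, free = capacity(used=180, limit=200)
print(f"{pct:.0f}% used, {free} connections left")
```

Reporting both numbers matters because 90% used means something very different on a 200-connection limit than on a 20 TB volume.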

Page 5: Monitor some of the things

SOFT CAPACITY

• Neil Gunther’s Universal Scalability Law

• %used, absolute quantity available

• Throughput, concurrency, errors
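For reference, the Universal Scalability Law models throughput X at concurrency N as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ is the contention coefficient and κ the coherency coefficient. The sketch below evaluates that model; the coefficient values are made up for illustration and would normally be fitted to measured throughput and concurrency.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """USL: X(N) = lam*N / (1 + sigma*(N-1) + kappa*N*(N-1))."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency where modeled throughput peaks: sqrt((1 - sigma) / kappa)."""
    return math.sqrt((1 - sigma) / kappa)


# Hypothetical coefficients, for illustration only.
LAM, SIGMA, KAPPA = 1000.0, 0.05, 0.001
peak = usl_peak_concurrency(SIGMA, KAPPA)
print(f"modeled throughput peaks near N = {peak:.1f}")
for n in (1, 10, 31, 100):
    print(n, round(usl_throughput(n, LAM, SIGMA, KAPPA)))
```

Beyond the peak the model predicts that adding concurrency reduces throughput, which is what makes this capacity "soft".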

Page 6: Monitor some of the things

AVAILABILITY

• Availability is absence of downtime

• %used, absolute quantity available

• Throughput, concurrency, errors

• MTBF, MTTR, MTTD, %availability
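A rough sketch of how the four availability numbers relate. Definitions vary between teams; this one assumes MTBF is uptime divided by the number of failures, MTTR is mean downtime per failure, and MTTD is mean delay between a failure starting and someone noticing. The incident data is invented.

```python
def availability_metrics(incidents, period_s):
    """incidents: list of (seconds_to_detect, seconds_to_repair), both
    measured from the moment each failure began. period_s: observation window."""
    n = len(incidents)
    if n == 0:
        return {"mtbf_s": None, "mttr_s": 0.0, "mttd_s": 0.0, "availability_pct": 100.0}
    downtime = sum(repair for _, repair in incidents)
    uptime = period_s - downtime
    return {
        "mtbf_s": uptime / n,                                # mean time between failures
        "mttr_s": downtime / n,                              # mean time to repair
        "mttd_s": sum(detect for detect, _ in incidents) / n,  # mean time to detect
        "availability_pct": 100.0 * uptime / period_s,
    }


# Invented example: two outages in a 30-day window.
m = availability_metrics([(60, 600), (300, 1800)], period_s=30 * 86400)
print(m)
```

With these definitions, %availability equals MTBF / (MTBF + MTTR), so the numbers are not independent: shortening detection and repair time raises availability without changing how often failures happen.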

Page 7: Monitor some of the things

TASK PERFORMANCE

• Task performance is consistently fast response time.

• Measure an SLA in percentile response time per task, over observation intervals

• %used, absolute quantity available

• Throughput, concurrency, errors

• MTBF, MTTR, MTTD, %availability

• Response time, 95% response time
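One way to read "percentile response time per task, over observation intervals" is sketched below: bucket samples by task and by fixed interval, then take the 95th percentile of each bucket. The nearest-rank percentile, the 60-second interval, and the sample tuple layout are all assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_by_task_and_interval(samples, interval_s=60):
    """samples: iterable of (timestamp_s, task_name, response_time_s).
    Returns {(task_name, interval_index): 95th-percentile response time}."""
    buckets = defaultdict(list)
    for ts, task, rt in samples:
        buckets[(task, int(ts // interval_s))].append(rt)
    return {key: percentile(rts, 95) for key, rts in buckets.items()}


# Invented samples: a hypothetical "login" task observed over two minutes.
samples = [(0, "login", 0.1), (30, "login", 0.3), (61, "login", 0.2)]
print(p95_by_task_and_interval(samples))
```

An SLA check then becomes a comparison of each bucket's value against the agreed threshold, rather than a look at a single average.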

Page 8: Monitor some of the things

RESOURCE PERFORMANCE

• Resource performance is ability to run tasks consistently fast.

• %used, absolute quantity available

• Throughput, concurrency, errors

• MTBF, MTTR, MTTD, %availability

• Response time, 95% response time

• Throughput, concurrency, busy time, total response time, backlog/queue
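The resource metrics on this slide are tied together by simple operational laws: throughput is completions per unit time, utilization is busy time over elapsed time, and average concurrency is total response time over elapsed time (Little's law). A small sketch, with invented inputs:

```python
def resource_metrics(interval_s, completions, busy_s, total_response_s):
    """Derive resource metrics from totals observed over one interval.

    busy_s: time the resource was doing any work at all.
    total_response_s: sum of the response times of all tasks in the interval.
    """
    return {
        "throughput_per_s": completions / interval_s,
        "utilization": busy_s / interval_s,
        "avg_concurrency": total_response_s / interval_s,  # Little's law
        "avg_response_time_s": (
            total_response_s / completions if completions else 0.0
        ),
    }


# Invented numbers: 500 tasks in 10 s, resource busy for 8 s,
# 25 s of summed response time.
print(resource_metrics(10, 500, 8, 25))
```

Average concurrency above 1 with utilization below 100% suggests tasks overlapping or queueing, which is where the backlog/queue metric comes in.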

Page 9: Monitor some of the things

SCALABILITY

• Universal Scalability Law again

• %used, absolute quantity available

• Throughput, concurrency, errors

• MTBF, MTTR, MTTD, %availability

• Response time, 95% response time

• Throughput, concurrency, busy time, total response time, backlog/queue

Page 10: Monitor some of the things

STALL DETECTION

• Overloaded or underperforming?

• %used, absolute quantity available

• Throughput, concurrency, errors

• MTBF, MTTR, MTTD, %availability

• Response time, 95% response time

• Throughput, concurrency, busy time, total response time, backlog/queue

• Utilization, saturation, errors, sources of load/demand

Page 11: Monitor some of the things

GIT ‘ER DONE

MONITOR WORK AND RESOURCES

Page 12: Monitor some of the things

WHAT NOT TO DO

• Don’t use top-N lists from Google

• Don’t just do what’s included in some Nagios plugin

Page 13: Monitor some of the things

№1 TOP 10 LIST

1. MySQL availability

2. Presence of insecure users and databases

3. Aborted connects

4. Error log

5. Deadlocks

6. Change in server configuration

7. Slow query log

8. Slave lag

9. Percentage of maximum allowed connections

10. Percentage of full table scans

Page 14: Monitor some of the things

№2 TOP 10 LIST

1. Threads_connected

2. Created_tmp_disk_tables

3. Handler_read_first

4. Innodb_buffer_pool_wait_free

5. Key_reads

6. Max_used_connections

7. Open_tables

8. Select_full_join

9. Slow_queries

10. Uptime

Page 15: Monitor some of the things

№1 PLUGIN

1. threadcache-hitrate (Hit rate of the thread-cache)

2. slave-io-running (Slave io running: Yes)

3. slave-sql-running (Slave sql running: Yes)

4. qcache-hitrate (Query cache hitrate)

5. qcache-lowmem-prunes (Query cache entries pruned because of low memory)

6. keycache-hitrate (MyISAM key cache hitrate)

7. bufferpool-hitrate (InnoDB buffer pool hitrate)

8. bufferpool-wait-free (InnoDB buffer pool waits for clean page available)

9. log-waits (InnoDB log waits because of a too small log buffer)

10. tablecache-hitrate (Table cache hitrate)

11. table-lock-contention (Table lock contention)

12. index-usage (Usage of indices)

13. tmp-disk-tables (Percent of temp tables created on disk)

14. long-running-procs (long running processes)

Page 16: Monitor some of the things

№2 PLUGIN

1. connection-time

2. uptime

3. threads-connected

4. threadcache-hitrate

5. q[uery]cache-hitrate

6. q[uery]cache-lowmem-prunes

7. [myisam-]keycache-hitrate

8. [innodb-]bufferpool-hitrate

9. [innodb-]bufferpool-wait-free

10. [innodb-]log-waits

11. tablecache-hitrate

12. table-lock-contention

13. index-usage

14. tmp-disk-tables

15. slow-queries

16. long-running-procs

17. slave-lag

18. slave-io-running

19. slave-sql-running

20. sql

21. open-files

22. encode

23. cluster-ndb-running

Page 17: Monitor some of the things

№3 PLUGIN

Page 18: Monitor some of the things

SURFACE AREA

HTTP://WWW.FLICKR.COM/PHOTOS/NASAMARSHALL/5926864640/

Page 19: Monitor some of the things

DUPLICATE SIGNALS

• Queries

• Com_admin_commands

• Com_assign_to_keycache

• Com_alter_db

• Com_alter_db_upgrade

• Com_alter_event

• Com_alter_function

• Com_alter_procedure

• Com_alter_server

• Com_alter_table

• Com_alter_tablespace

• Com_alter_user

• Com_analyze

• Com_begin

• Com_binlog

• Com_ad_nauseum

Page 20: Monitor some of the things

DESIRABLE METRICS

• %used, absolute quantity available

• Throughput, concurrency, errors

• MTBF, MTTR, MTTD, %availability

• Response time, 95% response time

• Throughput, concurrency, busy time, total response time, backlog/queue

• Utilization, saturation, errors, sources of load/demand

Page 21: Monitor some of the things

[Figure labels: Desirable, Easy]

Page 22: Monitor some of the things

[Figure labels: Desirable, Easy]

Page 23: Monitor some of the things

IRRELEVANT

EXAMPLE PLEASE?

Page 24: Monitor some of the things

RESOURCE LIMITS

• Threads_connected near max_connections?

• %table cache used?

• Open file handles?

• Long-running queries/transactions?

Page 25: Monitor some of the things

ERRORS

• Deadlocks?

• Aborted connects?

Page 26: Monitor some of the things

AVAILABILITY

• Ability to connect and run a query?

• Uptime is small?

• Replication is running?
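The three questions above could be folded into one check along these lines. This is only a sketch: it assumes the caller has already tried a test query and fetched SHOW GLOBAL STATUS and SHOW SLAVE STATUS into plain dicts, and the 600-second "recently restarted" threshold is an arbitrary placeholder.

```python
def availability_problems(can_query, global_status, slave_status, min_uptime_s=600):
    """Return a list of human-readable problems; an empty list means healthy.

    can_query: True if a connection and trivial query succeeded.
    global_status: dict of SHOW GLOBAL STATUS values (strings or ints).
    slave_status: dict from SHOW SLAVE STATUS, or None if not a replica.
    min_uptime_s: hypothetical threshold below which a restart is suspected.
    """
    if not can_query:
        return ["cannot connect and run a query"]
    problems = []
    if int(global_status.get("Uptime", 0)) < min_uptime_s:
        problems.append("server restarted recently (Uptime is small)")
    if slave_status is not None:
        for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
            if slave_status.get(thread) != "Yes":
                problems.append(f"{thread} is {slave_status.get(thread)}")
    return problems


# Invented inputs for illustration.
print(availability_problems(
    True,
    {"Uptime": "120"},
    {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No"},
))
```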

Page 27: Monitor some of the things

PERFORMANCE

• You can get throughput (Queries) and concurrency (Threads_running) from MySQL

• But a Nagios check has no context to know whether they’re good or bad

• You generally can’t get response time, busy time, utilization, backlog, etc.

• You can aggregate thread states, thread times, users, databases, query abstracts...
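As a sketch of the first and last bullets: throughput can be derived from two samples of the cumulative Queries counter, and thread states can be tallied from process list rows. The inputs are assumed to be dicts already fetched from SHOW GLOBAL STATUS and SHOW FULL PROCESSLIST; the helper names and sample data are made up.

```python
from collections import Counter


def per_second(before, after, elapsed_s, counter="Queries"):
    """Rate of a cumulative status counter between two samples."""
    return (int(after[counter]) - int(before[counter])) / elapsed_s


def thread_state_counts(processlist_rows):
    """Count non-sleeping threads by State.

    Rows are dicts with 'Command' and 'State' keys, as returned by
    SHOW FULL PROCESSLIST."""
    return Counter(
        row.get("State") or "(none)"
        for row in processlist_rows
        if row.get("Command") != "Sleep"
    )


# Invented samples taken 10 seconds apart.
print(per_second({"Queries": "1000"}, {"Queries": "1600"}, elapsed_s=10))
rows = [
    {"Command": "Query", "State": "Sending data"},
    {"Command": "Query", "State": "Waiting for table metadata lock"},
    {"Command": "Query", "State": "Waiting for table metadata lock"},
    {"Command": "Sleep", "State": ""},
]
print(thread_state_counts(rows))
```

Neither number says much in isolation; the value comes from keeping the history so that the current sample can be compared with what is normal for this server.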

Page 28: Monitor some of the things

NAGIOS IS BEST AT

LIVING IN THE MOMENT

Page 29: Monitor some of the things

THOU SHALT NOT

• Cache hit ratios

  • Thread cache hit ratio

  • Buffer pool cache hit ratio

  • Table cache hit ratio

  • Key cache hit ratio

  • Query cache hit ratio

• Rates of “bad” queries

  • % temp tables on disk

  • % full table scans

  • % slow queries

• Unfixable things

  • Replication delay

Page 30: Monitor some of the things

WHY NOT?

• Those are properties of the workload and application

• They are not conditions to alert/warn about

• They are not fixable / actionable in the service

Page 31: Monitor some of the things

ALERTS ARE

BETTER TOGETHER

Page 32: Monitor some of the things

QUESTION:

WHAT IS BETTER?

Page 33: Monitor some of the things

№1 ALERT!!!!! Disk CRIT 100% /dev/sda2

Page 34: Monitor some of the things

№2 ALERT!!!!! Replication CRIT Slave I/O Thread No

Page 35: Monitor some of the things

№3 ALERT!!!!! Replication CRIT Slave SQL Thread No

Page 36: Monitor some of the things

№4 ALERT!!!!! Replication CRIT Seconds_Behind_Master NULL

Page 37: Monitor some of the things

№5 ALERT!!!!! MySQL CRIT oldest transaction: 86400 seconds

Page 38: Monitor some of the things

- OR -

Page 39: Monitor some of the things

№1 ALERT!!!!! CRIT

* Disk /dev/sda2 full

* Replication stopped

* Oldest transaction 86400 seconds

* 4999 threads in status “Waiting for table metadata lock”
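A minimal sketch of the "better together" idea: collect the findings for one host and send a single notification at the worst severity, instead of one page per check. The severity names, their ordering, and the host name are assumptions for the example, not part of any particular monitoring tool.

```python
SEVERITY_ORDER = {"OK": 0, "WARN": 1, "CRIT": 2}  # assumed ranking


def combine_alerts(host, findings):
    """findings: list of (severity, message) for one host.
    Returns one notification text headed by the worst severity."""
    if not findings:
        return f"OK {host}"
    worst = max((sev for sev, _ in findings), key=SEVERITY_ORDER.__getitem__)
    lines = [f"{worst} {host}"]
    lines += [f"* {msg}" for _, msg in findings]
    return "\n".join(lines)


# Hypothetical host name; messages taken from the slide above.
print(combine_alerts("db1", [
    ("CRIT", "Disk /dev/sda2 full"),
    ("CRIT", "Replication stopped"),
    ("WARN", "Oldest transaction 86400 seconds"),
]))
```

Grouping by host is the simplest choice here; grouping by suspected common cause would be closer to the slide's point but needs more context than a sketch can assume.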