Db Server Health Check

Database Server Health Check Josh Berkus PostgreSQL Experts Inc. pgCon 2010

Transcript of Db Server Health Check

Page 1: Db Server Health Check

Database Server Health Check

Josh Berkus, PostgreSQL Experts Inc., pgCon 2010

Page 2: Db Server Health Check

DATABASE SERVER HELP 5¢

Page 3: Db Server Health Check

Program of Treatment

● What is a Healthy Database?
● Know Your Application
● Load Testing
● Doing a database server checkup
  ● hardware
  ● OS & FS
  ● PostgreSQL
  ● application
● Common Ailments of the Database Server

Page 4: Db Server Health Check

What is a Healthy Database Server?

Page 5: Db Server Health Check

What is a Healthy Database Server?

● Response Times

Page 6: Db Server Health Check

What is a Healthy Database Server?

● Response Times
  ● lower than required
  ● consistent & predictable
● Capacity for more
  ● CPU and I/O headroom
  ● low server load

Page 7: Db Server Health Check

[Figure: median response time (y-axis, 0 to 30) against number of clients (x-axis, 25 to 250), with Max Response Time and Expected Load marked]

Page 8: Db Server Health Check

What is an Unhealthy Database Server?

● Slow response times
● Inconsistent response times
● High server load
● No capacity for growth

Page 9: Db Server Health Check

[Figure: median response time (y-axis, 0 to 30) against number of clients (x-axis, 25 to 250), with Max Response Time and Expected Load marked]

Page 10: Db Server Health Check

A healthy database server is able to maintain consistent and acceptable response times under expected loads with margin for error.

Page 11: Db Server Health Check

[Figure: median response time (y-axis, 0 to 30) against number of clients (x-axis, 25 to 250)]

Page 12: Db Server Health Check

Hitting The Wall

Page 13: Db Server Health Check

CPUs Floored

Average:  CPU   %user  %system  %iowait   %idle
Average:  all   69.36     0.13    24.87    5.77
            0   88.96     0.09    10.03    1.11
            1   12.09     0.02    86.98    0.00
            2   98.90     0.00     0.00   10.10
            3   77.52     0.44     1.70   20.34

16:38:29 up 13 days, 22:10, 3 users, load average: 11.05, 9.08, 8.13


Page 15: Db Server Health Check

IO Saturated

Device:      tps   MB_read/s   MB_wrtn/s
sde       414.33        0.40       38.15
sdf      1452.00       99.14       29.00

Average:  CPU   %user  %system  %iowait   %idle
Average:  all   34.75     0.13    58.75    6.37
            0    8.96     0.09    90.03    1.11
            1   12.09     0.02    86.98    0.00
            2   91.90     0.00     7.00   10.10
            3   27.52     0.44    51.70   20.34

Page 16: Db Server Health Check

Out of Connections

FATAL: connection limit exceeded for non-superusers

Page 17: Db Server Health Check

How close are you to the wall?

Page 18: Db Server Health Check

The Checkup (full physical)

1. Analyze application

2. Analyze platform

3. Correct anything obviously wrong

4. Set up load test

5. Monitor load test

6. Analyze Results

7. Correct issues

Page 19: Db Server Health Check

The Checkup (semi-annual)

1. Check response times

2. Check system load

3. Check previous issues

4. Check for Signs of Illness

5. Fix new issues

Page 20: Db Server Health Check

Know your application!

Page 21: Db Server Health Check

Application database usage

Which does your application do?

✔ small reads

✔ large sequential reads

✔ small writes

✔ large writes

✔ long-running procedures/transactions

✔ bulk loads and/or ETL

Page 22: Db Server Health Check

What Color Is My Application?

● Web Application (Web)
● Online Transaction Processing (OLTP)
● Data Warehousing (DW)

Page 23: Db Server Health Check

What Color Is My Application?

● Web Application (Web)
  ● DB much smaller than RAM
  ● 90% or more simple queries
● Online Transaction Processing (OLTP)
● Data Warehousing (DW)

Page 24: Db Server Health Check

What Color Is My Application?

● Web Application (Web)
  ● DB smaller than RAM
  ● 90% or more simple queries
● Online Transaction Processing (OLTP)
  ● DB slightly larger than RAM to 1TB
  ● 20-40% small data write queries
  ● Some long transactions and complex read queries
● Data Warehousing (DW)

Page 25: Db Server Health Check

What Color Is My Application?

● Web Application (Web)
  ● DB smaller than RAM
  ● 90% or more simple queries
● Online Transaction Processing (OLTP)
  ● DB slightly larger than RAM to 1TB
  ● 20-40% small data write queries
  ● Some long transactions and complex read queries
● Data Warehousing (DW)
  ● Large to huge databases (100GB to 100TB)
  ● Large complex reporting queries
  ● Large bulk loads of data
  ● Also called "Decision Support" or "Business Intelligence"

Page 26: Db Server Health Check

What Color Is My Application?

● Web Application (Web)
  ● CPU-bound
  ● Ailments: idle connections/transactions, too many queries
● Online Transaction Processing (OLTP)
  ● CPU or I/O bound
  ● Ailments: locks, database growth, idle transactions, database bloat
● Data Warehousing (DW)
  ● I/O or RAM bound
  ● Ailments: database growth, longer running queries, memory usage growth

Page 27: Db Server Health Check

Special features required?

● GIS
  ● heavy CPU for GIS functions
  ● lots of RAM for GIS indexes
● TSearch
  ● lots of RAM for indexes
  ● slow response time on writes
● SSL
  ● response time lag on connections

Page 28: Db Server Health Check

Load Testing

Page 29: Db Server Health Check

[Figure: requests per second (y-axis, 0 to 80) by time of day (x-axis, 12:00 AM to 10:00 PM)]

Page 30: Db Server Health Check

[Figure: requests per second (y-axis, 0 to 80) by time of day (x-axis, 12:00 AM to 10:00 PM), with DOWNTIME marked]

Page 31: Db Server Health Check

When preventing downtime, it is not average load that matters; it is peak load.
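
To make the gap between average and peak concrete, here is a minimal Python sketch. The sample values are invented for illustration and are not read off the chart; `load_summary` is a hypothetical helper name.

```python
from statistics import mean


def load_summary(requests_per_second):
    """Return (average, peak, peak-to-average ratio) for a series of samples."""
    avg = mean(requests_per_second)
    peak = max(requests_per_second)
    return avg, peak, peak / avg


# Hypothetical samples, one every two hours across a day.
samples = [5, 4, 6, 20, 45, 60, 75, 70, 50, 40, 25, 10]

avg, peak, ratio = load_summary(samples)
print(f"average={avg:.1f} req/s, peak={peak} req/s, peak/average={ratio:.2f}")
```

A server sized for the average in this made-up series would face more than twice that load at the busiest sample.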

Page 32: Db Server Health Check

What to load test

● Load should be as similar as possible to your production traffic
● You should be able to create your target level of traffic
  ● better: incremental increases
● Test the whole application as well
  ● the database server may not be your weak point

Page 33: Db Server Health Check

How to Load Test

1. Set up a load testing tool

you'll need test servers for this*

2. Turn on PostgreSQL, HW, application monitoring

all monitoring should start at the same time

3. Run the test for a defined time

1 hour is usually good

4. Collect and analyze data

5. Re-run at higher level of traffic
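
For step 4, one possible way to analyze the collected response times is to summarize each run by median, 95th percentile and maximum, then compare runs at increasing traffic levels. This is only a sketch: the numbers are hypothetical and `summarize` is an illustrative helper, not part of any load testing tool.

```python
from statistics import median


def summarize(response_times_ms):
    """Median, 95th percentile (nearest rank) and max of response times."""
    ordered = sorted(response_times_ms)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n)
    return {
        "median": median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical results: number of clients -> response times in ms.
runs = {
    50: [12, 14, 13, 15, 12, 16, 13, 14, 15, 13],
    100: [14, 15, 18, 16, 15, 17, 19, 16, 15, 40],
    150: [25, 60, 31, 200, 28, 45, 33, 29, 30, 350],
}

for clients in sorted(runs):
    print(clients, summarize(runs[clients]))
```

A maximum that pulls away from the median as traffic rises is the inconsistency the earlier slides describe as unhealthy.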

Page 34: Db Server Health Check

Test Servers

● Must be as close as reasonable to production servers
  ● otherwise you don't know how production will be different
  ● there is no predictable multiplier
● Double them up as your development/staging or failover servers
● If your test server is much smaller, then you need to do a same-load comparison

Page 35: Db Server Health Check

Tools for Load Testing

Page 36: Db Server Health Check

Production Test

1. Determine the peak load hour on the production servers

2. Turn on lots of monitoring during that peak load hour

3. Analyze results

Pretty much your only choice without a test server.

Page 37: Db Server Health Check

Issues with Production Test

● Not repeatable

− load won't be exactly the same ever again

● Cannot test target load

− just whatever happens to occur during that hour

− can't test incremental increases either

● Monitoring may hurt production performance

● Cannot test experimental changes

Page 38: Db Server Health Check

The Ad-Hoc Test

● Get 10 to 50 coworkers to open several sessions each

● Have them go crazy on using the application

Page 39: Db Server Health Check

Problems with Ad-Hoc Testing

● Not repeatable
  ● minor changes in response times may be due to changes in worker activity
● Labor intensive
  ● each test run shuts down the office
● Can't reach target levels of load
  ● unless you have a lot of coworkers

Page 40: Db Server Health Check

Siege

● HTTP traffic generator
  ● all test interfaces must be addressable as URLs
  ● useless for non-web applications
● Simple to use
  ● create a simple load test in a few hours
● Tests the whole web application
  ● cannot test database separately
● http://www.joedog.org/index/siege-home

Page 41: Db Server Health Check

pgReplay

● Replays your activity logs at variable speed
  ● get exactly the traffic you get in production
● Good for testing just the database server
● Can take time to set up
  ● need database snapshot, collect activity logs
  ● must already have production traffic
● http://pgreplay.projects.postgresql.org/

Page 42: Db Server Health Check

tsung

● Generic load generator in Erlang
  ● a load testing kit rather than a tool
  ● Generate a tsung file from your activity logs using pgFouine and test the database
  ● Generate load for a web application using custom scripts
● Can be time consuming to set up
  ● but highly configurable and advanced
  ● very scalable - cluster of load testing clients
● http://tsung.erlang-projects.org/

Page 43: Db Server Health Check

pgBench

● Simple micro-benchmark
  ● not like any real application
● Version 9.0 adds multi-threading, customization
  ● write custom pgBench scripts
  ● run against real database
● Fairly ad-hoc compared to other tools
  ● but easy to set up
● ships with PostgreSQL

Page 44: Db Server Health Check

Benchmarks

● Many “real” benchmarks available
  ● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.
● Useful for testing your hardware
  ● not useful for testing your application
● Often time-consuming and complex

Page 45: Db Server Health Check

Platform-specific

● Web framework or platform tests
  ● Rails: ActionController::PerformanceTest
  ● J2EE: OpenDemand, Grinder, many more
    – JBoss, BEA have their own tools
  ● Zend Framework Performance Test
● Useful for testing specific application performance
  ● such as performance of specific features, modules
● Not all platforms have them

Page 46: Db Server Health Check

Flight-Check

● Attend the tutorial tomorrow!

Page 47: Db Server Health Check

monitoring PostgreSQL during load test

logging_collector = on
log_destination = 'csvlog'
log_filename = 'load_test_1_%h'
log_rotation_age = 60min
log_rotation_size = 1GB

log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_temp_files = 100kB
log_lock_waits = on

Page 48: Db Server Health Check

monitoring hardware during load test

sar -A -o load_test_1.sar 30 240

iostat or fsstat or zpool iostat

Page 49: Db Server Health Check

monitoring application during load test

● Collect response times
  ● with timestamp
  ● with activity
● Monitor hardware and utilization
  ● activity
  ● memory & CPU usage
● Record errors & timeouts

Page 50: Db Server Health Check

Checking Hardware

Page 51: Db Server Health Check

Checking Hardware

● CPUs and Cores
● RAM
● I/O & disk support
● Network

Page 52: Db Server Health Check

CPUs and Cores

● Pretty simple:
  ● number
  ● type
  ● speed
  ● L1/L2 cache
● Rules of thumb
  ● fewer faster CPUs is usually better than more slower ones
  ● core != cpu
  ● thread != core
  ● virtual core != core

Page 53: Db Server Health Check

CPU calculations

● ½ to 1 core for OS
● ½ to 1 core for software raid or ZFS
● 1 core for postmaster and bgwriter
● 1 core per:
  ● DW: 1 to 3 concurrent users
  ● OLTP: 10 to 50 concurrent users
  ● Web: 100 to 1000 concurrent users
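
The rules of thumb above can be turned into a rough calculator. The sketch below is one reading of them, assuming the upper end (one full core) of the OS and software RAID ranges; `cores_needed` and the example user counts are illustrative, not from the talk.

```python
import math


def cores_needed(concurrent_users, users_per_core, software_raid=False):
    """Rough core estimate following the slide's rules of thumb."""
    cores = 1                      # OS (upper end of "1/2 to 1 core")
    if software_raid:
        cores += 1                 # software RAID or ZFS (upper end)
    cores += 1                     # postmaster and bgwriter
    cores += math.ceil(concurrent_users / users_per_core)
    return cores


# Hypothetical OLTP system: 200 concurrent users at 25 users per core,
# a value picked from inside the slide's 10-to-50 range.
print(cores_needed(200, 25))                      # hardware RAID
print(cores_needed(200, 25, software_raid=True))  # software RAID or ZFS
```

Because the per-core ranges are wide, it is worth running the estimate at both ends of the range for your application type.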

Page 54: Db Server Health Check

CPU tools

● sar
● mpstat
● pgTop

Page 55: Db Server Health Check

in praise of sar

● collects data about all aspects of HW usage
● available on most OSes
  ● but output is slightly different
● easiest tool for collecting basic information
  ● often enough for server-checking purposes
● BUT: does not report all data on all platforms

Page 56: Db Server Health Check

sar

CPUs: sar -P ALL and sar -u
Memory: sar -r and sar -R
I/O: sar -b and sar -d
network: sar -n

Page 57: Db Server Health Check

sar CPU output

Linux:

06:05:01 AM  CPU   %user   %nice  %system  %iowait  %steal   %idle
06:15:01 AM  all   14.26    0.09     6.01     1.32    0.00   78.32
06:15:01 AM    0   14.26    0.09     6.01     1.32    0.00   78.32

Solaris:

15:08:56   %usr   %sys   %wio   %idle
15:09:26     10      5      0      85
15:09:56      9      7      0      84
15:10:26     15      6      0      80
15:10:56     14      7      0      79
15:11:26     15      5      0      80
15:11:56     14      5      0      81

Page 58: Db Server Health Check

Memory

● Only one statistic: how much?
● Not generally an issue on its own
  ● low memory can cause more I/O
  ● low memory can cause more CPU time

Page 59: Db Server Health Check

memory sizing

[Diagram: RAM divided among Shared Buffers, work_mem / maint_mem, and Filesystem Cache; data is either In Buffer, In Cache, or On Disk]

Page 60: Db Server Health Check

Figure out Memory Sizing

● What is the active portion of your database?
  ● i.e. gets queried frequently
● How large is it?
● Where does it fit into the size categories?
● How large is the inactive portion of your database?
  ● how frequently does it get hit? (remember backups)

Page 61: Db Server Health Check

Memory Sizing

● Other needs for RAM – work_mem:
  ● sorts and aggregates: do you do a lot of big ones?
  ● GIN/GiST indexes: these can be huge
  ● hashes: for joins and aggregates
  ● VACUUM

Page 62: Db Server Health Check

I/O Considerations

● Throughput
  ● how fast can you get data off disk?
● Latency
  ● how long does it take to respond to requests?
● Seek Time
  ● how long does it take to find random disk pages?

Page 63: Db Server Health Check

I/O Considerations

● Throughput
  ● important for large databases
  ● important for bulk loads
● Latency
  ● huge effect on small writes & reads
  ● not so much on large scans
● Seek Time
  ● important for small writes & reads
  ● very important for index lookups

Page 64: Db Server Health Check

I/O Considerations

● Web
  ● concerned about read latency & seek time
● OLTP
  ● concerned about write latency & seek time
● DW/BI
  ● concerned about throughput & seek time

Page 65: Db Server Health Check

           ------Sequential Output------ --Sequential Input- --Random-
           -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Size       K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
32096M     79553  99 240548 45 50646   5 72471  94 185634 10  1140   1

           ------Sequential Output------ --Sequential Input- --Random-
           -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Size       K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
24G                  260044 33 62110  17           89914  15  1167  25
Latency              6549ms    4882ms              3395ms     107ms

Page 66: Db Server Health Check

Common I/O Types

● Software RAID & ZFS
● Hardware RAID Array
● NAS/SAN
● SSD

Page 67: Db Server Health Check

Hardware RAID Sanity Check

● RAID 1 / 10, not 5
● Battery-backed write cache?
  ● otherwise, turn write cache off
● SATA < SCSI/SAS
  ● about ½ real throughput
● Enough drives?
  ● 4-14 for OLTP application
  ● 8-48 for DW/BI

Page 68: Db Server Health Check

Sw RAID / ZFS Sanity Check

● Enough CPUs?
  ● will need one for the RAID
● Enough disks?
  ● same as hardware raid
● Extra configuration?
  ● caching
  ● block size

Page 69: Db Server Health Check

NAS/SAN Sanity Check

● Check latency!
● Check real throughput
  ● drivers often a problem
● Enough network bandwidth?
  ● multipath or fiber required to get HW RAID performance

Page 70: Db Server Health Check

SSD Sanity Check

● 1 SSD = 4 Drives
  ● relative performance
● Check write cache configuration
  ● make sure data is safe
● Test real throughput, seek times
  ● drivers often a problem
● Research durability stats

Page 71: Db Server Health Check

IO Tools

● I/O Tests
  ● dd test
  ● Bonnie++
  ● IOZone
  ● filebench
● Monitoring Tools
  ● sar
  ● mpstat iowait
  ● iostat
  ● on zfs: fsstat, zpool iostat
● EXPLAIN ANALYZE

Page 72: Db Server Health Check

Network

● Throughput
  ● not usually an issue, except:
    – iSCSI / NAS / SAN
    – ELT & Bulk Load Processes
  ● remember that gigabit is only 100MB/s!
● Latency
  ● real issue for Web / OLTP
  ● consider putting app ↔ database on private network

Page 73: Db Server Health Check

Checkups for the Cloud

Page 74: Db Server Health Check

Just like real HW, except ...

● Low ceiling on #cpus, RAM
● Virtual Core < Real Core
  ● “CPU Stealing”
  ● last-generation hardware
  ● calculate 50% more cores

Page 75: Db Server Health Check

Cloud I/O Hell

● I/O tends to be very slow, erratic
  ● comparable to a USB thumb drive
  ● horrible latency, up to ½ second
  ● erratic, speeds go up and down
● RAID together several volumes on EBS
● use asynchronous commit
  – or at least commit_siblings

Page 76: Db Server Health Check

#1 Cloud Rule

If your database doesn't fit in RAM, don't host it on a public cloud

Page 77: Db Server Health Check

Checking Operating System and Filesystem

Page 78: Db Server Health Check

OS Basics

● Use recent versions
  ● large performance, scaling improvements in Linux & Solaris in last 2 years
● Check OS tuning advice for databases
  ● advice for Oracle is usually good for PostgreSQL
● Keep up with information about issues & patches
  ● frequently specific releases have major issues
  ● especially check HW drivers

Page 79: Db Server Health Check

OS Basics

● Use Linux, BSD or Solaris!
  ● Windows has poor performance and weak diagnostic tools
  ● OSX is optimized for desktop and has poor hardware support
  ● AIX and HPUX require expertise just to install, and lack tools

Page 80: Db Server Health Check

Filesystem Layout

● One array / one big pool
● Two arrays / partitions
  ● OS and transaction log
  ● Database
● Three arrays
  ● OS & stats file
  ● Transaction log
  ● Database

Page 81: Db Server Health Check

Linux Tuning

● XFS > Ext3 (but not that much)
  ● Ext3 Tuning: data=writeback,noatime,nodiratime
  ● XFS Tuning: noatime,nodiratime
    – for transaction log: nobarrier
● “deadline” I/O scheduler
● Increase SHMMAX and SHMALL
  ● to ½ of RAM
● Cluster filesystems also a possibility
  ● OCFS, RHCFS

Page 82: Db Server Health Check

Solaris Tuning

● Use ZFS
  ● no advantage to UFS anymore
  ● mixed filesystems cause caching issues
  ● set recordsize
    – 8K small databases
    – 128K large databases
    – check for throughput/latency issues

Page 83: Db Server Health Check

Solaris Tuning

● Set OS parameters via “projects”
● For all databases:
  ● project.max-shm-memory=(priv,12GB,deny)
● For high-connection databases:
  ● use libumem
  ● project.max-shm-ids=(priv,32768,deny)
  ● project.max-sem-ids=(priv,4096,deny)
  ● project.max-msg-ids=(priv,4096,deny)

Page 84: Db Server Health Check

FreeBSD Tuning

● ZFS: same as Solaris
  ● definite win for very large databases
  ● not so much for small databases
● Other tuning per docs

Page 85: Db Server Health Check

PostgreSQL Checkup

Page 86: Db Server Health Check

postgresql.conf: formulae

shared_buffers = available RAM / 4

Page 87: Db Server Health Check

postgresql.conf: formulae

max_connections =
  web: 100 to 200
  OLTP: 50 to 100
  DW/BI: 5 to 20

if you need more, use pooling! 

Page 88: Db Server Health Check

postgresql.conf: formulae

Web/OLTP:
work_mem = AvRAM * 2 / max_connections

DW/BI:
work_mem = AvRAM / max_connections

Page 89: Db Server Health Check

postgresql.conf: formulae

Web/OLTP:
maintenance_work_mem = AvRAM / 16

DW/BI:
maintenance_work_mem = AvRAM / 8

Page 90: Db Server Health Check

postgresql.conf: formulae

autovacuum = on

DW/BI & bulk loads:
autovacuum = off
autovacuum_max_workers = 1/2

Page 91: Db Server Health Check

postgresql.conf: formulae

checkpoint_segments =
  web: 8 to 16
  OLTP: 32 to 64
  BI/DW: 128 to 256

Page 92: Db Server Health Check

postgresql.conf: formulae

wal_buffers = 8MB

effective_cache_size = AvRAM * 0.75
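
As a worked example of the formulae on the preceding slides, the sketch below computes starting values for three of the settings. It assumes "AvRAM" means the same available-RAM figure in every formula, which the slides do not spell out; `suggest_settings` and the example sizes are illustrative only.

```python
def suggest_settings(available_ram_mb, max_connections, workload):
    """Starting values in MB, following the slide formulae.

    workload is "web", "oltp" or "dw"; Web and OLTP share one
    work_mem formula, DW/BI uses the other.
    """
    if workload not in ("web", "oltp", "dw"):
        raise ValueError("workload must be 'web', 'oltp' or 'dw'")
    work_mem_factor = 1 if workload == "dw" else 2
    return {
        "shared_buffers_mb": available_ram_mb // 4,
        "work_mem_mb": available_ram_mb * work_mem_factor // max_connections,
        "effective_cache_size_mb": available_ram_mb * 3 // 4,
    }


# Hypothetical servers with 16 GB of available RAM.
print(suggest_settings(16384, 200, "web"))
print(suggest_settings(16384, 10, "dw"))
```

These are starting points for a load test, not final values; the later slides on memory growth and temp files describe the symptoms of setting work_mem too high or too low.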

Page 93: Db Server Health Check

How much recoverability do you need?

● None:
  ● fsync=off
  ● full_page_writes=off
  ● consider using ramdrive
● Some Loss OK
  ● synchronous_commit = off
  ● wal_buffers = 16MB to 32MB
● Data integrity critical
  ● keep everything on

Page 94: Db Server Health Check

File Locations

● Database
● Transaction Log
● Activity Log
● Stats File
● Tablespaces?

Page 95: Db Server Health Check

Database Checks: Indexes

select relname, seq_scan, seq_tup_read,
    pg_size_pretty(pg_relation_size(relid)) as size,
    coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0) + coalesce(n_tup_del,0) as update_activity
from pg_stat_user_tables
where seq_scan > 1000
    and pg_relation_size(relid) > 1000000
order by seq_scan desc limit 10;

 relname     | seq_scan | seq_tup_read | size    | update_activity
-------------+----------+--------------+---------+-----------------
 permissions |    12264 |        53703 | 2696 kB |             365
 users       |    11697 |       351635 | 17 MB   |             741
 test_set    |     9150 |  18492353300 | 275 MB  |           27643
 test_pool   |     5143 |   3141630847 | 212 MB  |           77755

Page 96: Db Server Health Check

Database Checks: Indexes

SELECT indexrelid::regclass as index, relid::regclass as table
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan < 100 AND indisunique IS FALSE;

 index                  | table
------------------------+--------------
 acct_acctdom_idx       | accounts
 hitlist_acct_idx       | hitlist
 hitlist_number_idx     | hitlist
 custom_field_acct_idx  | custom_field
 user_log_accstrt_idx   | user_log
 user_log_idn_idx       | user_log
 user_log_feed_idx      | user_log
 user_log_inbdstart_idx | user_log
 user_log_lead_idx      | user_log

Page 97: Db Server Health Check

Database Checks: Large Tables

 relname           | total_size | table_size
-------------------+------------+------------
 operations_2008   | 9776 MB    | 3396 MB
 operations_2009   | 9399 MB    | 3855 MB
 request_by_second | 7387 MB    | 5254 MB
 request_archive   | 6975 MB    | 3349 MB
 events            | 92 MB      | 66 MB
 event_edits       | 82 MB      | 68 MB
 2009_ops_eoy      | 33 MB      | 19 MB

Page 98: Db Server Health Check

Database Checks: Heavily-Used Tables

select relname,
    pg_size_pretty(pg_relation_size(relid)) as size,
    coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0) + coalesce(n_tup_del,0) as update_activity
from pg_stat_user_tables
order by update_activity desc limit 10;

 relname             | size    | update_activity
---------------------+---------+-----------------
 session_log         | 344 GB  |         4811814
 feature             | 279 MB  |         1012565
 daily_feature       | 28 GB   |          984406
 cache_queue_2010_05 | 2578 MB |          981812
 user_log            | 30 GB   |          796043
 vendor_feed         | 29 GB   |          479392
 vendor_info         | 23 GB   |          348355
 error_log           | 239 MB  |          214376
 test_log            | 945 MB  |          185785
 settings            | 215 MB  |          117480

Page 99: Db Server Health Check

Database Unit Tests

● You need them!
  ● you will be changing database objects and rewriting queries
  ● find bugs in testing … or in production
● Various tools
  ● pgTAP
  ● Framework-level tests
    – Rails, Django, Catalyst, JBoss, etc.

Page 100: Db Server Health Check

Application Stack Checkup

Page 101: Db Server Health Check

The Layer Cake

[Diagram: stack of layers, top to bottom]

Application: Queries, Transactions
Middleware: Drivers, Connections, Caching
PostgreSQL: Schema, Config
Operating System: Filesystem, Kernel
Hardware: Storage, RAM/CPU, Network


Page 103: Db Server Health Check

The Funnel

[Diagram: funnel with layers Application, Middleware, PostgreSQL, OS, HW]

Page 104: Db Server Health Check

Check PostgreSQL Drivers

● Does the driver version match the PostgreSQL version?

● Have you applied all updates?
● Are you using the best driver?
  ● There are several Python, C++ drivers
  ● Don't use ODBC if you can avoid it.
● Does the driver support cached plans & binary data?
  ● If so, are they being used?

Page 105: Db Server Health Check

Check Caching

Page 106: Db Server Health Check

Check Caching

● Does the application use data caching?
  ● what kind?
  ● could it be used more?
  ● what is the cache invalidation strategy?
  ● is there protection from “cache refresh storms”?
● Does the application use HTTP caching?
  ● could they be using it more?
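
One common form of protection against cache refresh storms is to let only one caller rebuild a missing entry while the others wait for it. The following is a minimal in-process sketch of that idea, not a description of any particular caching product; `SingleFlightCache` and `slow_loader` are hypothetical names.

```python
import threading


class SingleFlightCache:
    """Cache where a missing key is loaded once, even under concurrency."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._values = {}

    def get(self, key):
        if key in self._values:          # fast path, no lock needed
            return self._values[key]
        with self._lock:
            if key not in self._values:  # re-check: another thread may have loaded it
                self._values[key] = self._loader(key)
            return self._values[key]


load_calls = []


def slow_loader(key):
    # Stands in for an expensive database query.
    load_calls.append(key)
    return key.upper()


cache = SingleFlightCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("users",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(load_calls), cache.get("users"))
```

A single global lock keeps the sketch short; a real cache would more likely lock per key so that unrelated keys do not wait on each other.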

Page 107: Db Server Health Check

Check Connection Pooling

● Is the application using connection pooling?
  ● all web applications should, and most OLTP
  ● external or built into the application server?
● Is it configured correctly?
  ● max. efficiency: transaction / statement mode
  ● make sure timeouts match

Page 108: Db Server Health Check

Check Query Design

● PostgreSQL does better with fewer, bigger statements

● Check for common query mistakes
  ● joins in the application layer
  ● pulling too much data and discarding it
  ● huge OFFSETs
  ● unanchored text searches

Page 109: Db Server Health Check

Check Transaction Management

● Are transactions being used for loops?
  ● batches of inserts or updates can be 75% faster if wrapped in a transaction
● Are transactions aborted properly?
  ● on error
  ● on timeout
  ● transactions being held open while non-database activity runs
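
The two questions above can be illustrated together: wrap the whole loop in one transaction, and make sure an error rolls the batch back instead of leaving the transaction open. The sketch uses Python's built-in sqlite3 module purely as a stand-in so it runs without a server; it is not PostgreSQL, and the `log` table and `insert_batch` helper are invented for the example.

```python
import sqlite3


def insert_batch(conn, messages):
    """Insert all rows in one transaction; roll everything back on error."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO log (msg) VALUES (?)", [(m,) for m in messages]
        )


def count_rows(conn):
    return conn.execute("SELECT count(*) FROM log").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (msg TEXT NOT NULL)")

insert_batch(conn, ["a", "b", "c"])      # one commit for three rows

try:
    insert_batch(conn, ["d", None])      # None violates NOT NULL
except sqlite3.IntegrityError:
    pass                                 # the failed batch was rolled back

print(count_rows(conn))
```

With a PostgreSQL driver the same shape applies: begin once, run the loop, commit once, and roll back in the error path.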

Page 110: Db Server Health Check

Common Ailments of the Database Server

Page 111: Db Server Health Check

Check for them, monitor for them

● ailments could throw off your response time targets
  ● database could even “hit the wall”
● check for them during health check
  ● and during each checkup
● add daily/continuous monitors for them
  ● Nagios check_postgres.pl has checks for many of these things

Page 112: Db Server Health Check

Database Growth

● Checkup:
  ● check both total database size and largest table(s) size daily or weekly
● Symptoms:
  ● database grows faster than expected
  ● some tables grow continuously and rapidly
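
Once sizes are recorded daily, a simple projection shows whether growth is faster than expected. The sketch below assumes linear growth between the first and last measurement, which is a simplification; the sizes, the capacity and the `days_until_full` helper are all hypothetical.

```python
def days_until_full(daily_sizes_gb, capacity_gb):
    """Linear projection of days left, from one size measurement per day.

    Returns None if the database is not growing.
    """
    if len(daily_sizes_gb) < 2:
        raise ValueError("need at least two measurements")
    daily_growth = (daily_sizes_gb[-1] - daily_sizes_gb[0]) / (len(daily_sizes_gb) - 1)
    if daily_growth <= 0:
        return None
    return (capacity_gb - daily_sizes_gb[-1]) / daily_growth


# Hypothetical measurements: database size in GB on five consecutive days.
sizes = [100, 102, 104, 106, 108]
print(days_until_full(sizes, capacity_gb=200))
```

Running the same projection per table helps spot the "append forever" tables before they dominate the database.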

Page 113: Db Server Health Check

Database Growth

● Caused By:
  ● faster than expected increase in usage
  ● “append forever” tables
  ● Database Bloat
● Leads to:
  ● slower seq scans and index scans
  ● swapping & temp files
  ● slower backups

Page 114: Db Server Health Check

Database Growth

● Treatment:
  ● check for Bloat
  ● find largest tables and make them smaller
    – expire data
    – partitioning
  ● horizontal scaling (if possible)
  ● get better storage & more RAM, sooner

Page 115: Db Server Health Check

Database Bloat

-[ RECORD 1 ]+-----
schemaname   | public
tablename    | user_log
tbloat       | 3.4
wastedpages  | 2356903
wastedbytes  | 19307749376
wastedsize   | 18 GB
iname        | user_log_accttime_idx
ituples      | 941451584
ipages       | 9743581
iotta        | 40130146
ibloat       | 0.2
wastedipages | 0
wastedibytes | 0
wastedisize  | 0 bytes
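
The wastedbytes and wastedsize figures in this record follow from wastedpages, assuming PostgreSQL's default 8 kB page size. A short Python check of that arithmetic (the helper names are illustrative):

```python
PAGE_SIZE = 8192  # bytes; PostgreSQL's default block size (assumed here)


def wasted_bytes(wasted_pages, page_size=PAGE_SIZE):
    """Convert a count of wasted pages into bytes."""
    return wasted_pages * page_size


def pretty_gb(num_bytes):
    """Format a byte count as whole gigabytes (1 GB = 1024**3 bytes)."""
    return f"{num_bytes / 1024 ** 3:.0f} GB"


bloat = wasted_bytes(2356903)   # wastedpages from the record above
print(bloat, pretty_gb(bloat))
```

So the user_log table carries roughly 18 GB of dead space, which is why bloat shows up as heavy I/O and slow scans.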

Page 116: Db Server Health Check

Database Bloat

● Caused by:
  ● Autovacuum not keeping up
    – or not enough manual vacuum
    – often on specific tables only
  ● FSM set wrong (before 8.4)
  ● Idle In Transaction
● Leads To:
  ● slow response times
  ● unpredictable response times
  ● heavy I/O

Page 117: Db Server Health Check

Database Bloat

● Treatment:
  ● make autovacuum more aggressive
    – on specific tables with bloat
  ● fix max_fsm_relations/max_fsm_pages
  ● check when tables are getting vacuumed
  ● check for Idle In Transaction

Page 118: Db Server Health Check

Memory Usage Growth

00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:00        0        0      100        0        0      100        0        0
02:00:00        0        0      100        0        0      100        0        0
03:00:00        0        0      100        0        0      100        0        0
04:00:00        0        0      100        0        0      100        0        0

00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:00     3788      115       98        0        0      100        0        0
02:00:00    21566      420       78        0        0      100        0        0
03:00:00   455721     1791       59        0        0      100        0        0
04:00:00      908        6       96        0        0      100        0        0

Page 119: Db Server Health Check

Memory Usage Growth

● Caused by:
  ● Database Growth or Bloat
  ● work_mem limit too high
  ● bad queries
● Leads To:
  ● database out of cache
    – slow response times
  ● OOM Errors (OOM Killer)

Page 120: Db Server Health Check

Memory Usage Growth

● Treatment
  ● Look at ways to shrink queries, DB
    – partitioning
    – data expiration
  ● lower work_mem limit
  ● refactor bad queries
  ● Or just buy more RAM

Page 121: Db Server Health Check

Idle Connections

select datname, usename, count(*)
from pg_stat_activity
where current_query = '<IDLE>'
group by datname, usename;

 datname | usename | count
---------+---------+-------
 track   | www     |   318

Page 122: Db Server Health Check

Idle Connections

● Caused by:
  ● poor session management in application
  ● wrong connection pool settings
● Leads to:
  ● memory usage for connections
  ● slower response times
  ● out-of-connections at peak load

Page 123: Db Server Health Check

Idle Connections

● Treatment:
  ● refactor application
  ● reconfigure connection pool
    – or add one

Page 124: Db Server Health Check

Idle In Transaction

select datname, usename,
    max(now() - xact_start) as max_time,
    count(*)
from pg_stat_activity
where current_query ~* '<IDLE> in transaction'
group by datname, usename;

 datname | usename | max_time      | count
---------+---------+---------------+-------
 track   | admin   | 00:00:00.0217 |     1
 track   | www     | 01:03:06.0709 |     7

Page 125: Db Server Health Check

Idle In Transaction

● Caused by:
  ● poor transaction control by application
  ● abandoned sessions not being terminated fast enough
● Leads To:
  ● locking problems
  ● database bloat
  ● out of connections

Page 126: Db Server Health Check

Idle In Transaction

● Treatment
  ● refactor application
  ● change driver/ORM settings for transactions
  ● change session timeouts & keepalives on pool, driver, database

Page 127: Db Server Health Check

Longer Running Queries

● Detection:
  ● log slow queries to PostgreSQL log
  ● do daily or weekly report (pgfouine)
● Symptoms:
  ● number of long-running queries in log increasing
  ● slowest queries getting slower

Page 128: Db Server Health Check

Longer Running Queries

● Caused by:
  ● database growth
  ● poorly-written queries
  ● wrong indexes
  ● out-of-date stats
● Leads to:
  ● out-of-CPU
  ● out-of-connections

Page 129: Db Server Health Check

Longer Running Queries

● Treatments:
  ● refactor queries
  ● update indexes
  ● make Autoanalyze more aggressive
  ● control database growth

Page 130: Db Server Health Check

Too Many Queries

Page 131: Db Server Health Check

Too Many Queries

● Caused By:
  ● joins in middleware
  ● not caching
  ● poll cycles without delays
  ● other application code issues
● Leads To:
  ● out-of-CPU
  ● out-of-connections

Page 132: Db Server Health Check

Too Many Queries

● Treatment:
  ● characterize queries using logging
  ● refactor application

Page 133: Db Server Health Check

Locking

● Detection:
  ● log_lock_waits
  ● scan activity log for deadlock warnings
  ● query pg_stat_activity and pg_locks
● Symptoms:
  ● deadlock error messages
  ● number and time of lock_waits getting larger

Page 134: Db Server Health Check

Locking

● Caused by:
  ● long-running operations with exclusive locks
  ● inconsistent foreign key updates
  ● poorly planned runtime DDL
● Leads to:
  ● poor response times
  ● timeouts
  ● deadlock errors

Page 135: Db Server Health Check

Locking

● Treatment
  ● analyze locks
  ● refactor operations taking locks
    – establish a canonical order of updates for long transactions
    – use pessimistic locks with NOWAIT
  ● rely on cascade for FK updates
    – not on middleware code
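
The "canonical order of updates" idea can be shown with an in-process analogy: if every worker takes its locks in the same order (here, by ascending id), two workers touching the same pair of rows cannot deadlock each other. This is a sketch using Python threads, not database row locks; `Account` and `transfer` are invented for the example.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move amount between two distinct accounts, locking in id order."""
    first, second = sorted((src, dst), key=lambda acct: acct.id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


a = Account(1, 100)
b = Account(2, 100)


def many_transfers(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)


# Two workers move money in opposite directions at the same time.
workers = [
    threading.Thread(target=many_transfers, args=(a, b)),
    threading.Thread(target=many_transfers, args=(b, a)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(a.balance, b.balance)
```

In a database the equivalent is to update rows in a consistent order, for example by primary key, in every long transaction that touches them.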

Page 136: Db Server Health Check

Temp File Usage

● Detection:
  ● log_temp_files = 100kB
  ● scan logs for temp files weekly or daily
● Symptoms:
  ● temp file usage getting more frequent
  ● queries using temp files getting longer
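
Scanning the logs for temp files can be scripted. The sketch below assumes plain-text log lines containing the "temporary file: path ..., size ..." message that log_temp_files produces; the sample lines are made up, and csvlog output would need CSV parsing first.

```python
import re

# Matches the message PostgreSQL writes when log_temp_files is enabled.
TEMP_FILE_RE = re.compile(r'temporary file: path "([^"]+)", size (\d+)')


def summarize_temp_files(log_lines):
    """Return (number of temp files, total bytes) found in the log lines."""
    count = 0
    total_bytes = 0
    for line in log_lines:
        match = TEMP_FILE_RE.search(line)
        if match:
            count += 1
            total_bytes += int(match.group(2))
    return count, total_bytes


# Hypothetical log excerpt.
sample_log = [
    'LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4321.0", size 104857600',
    'LOG:  duration: 5123.456 ms  statement: SELECT ...',
    'LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4321.1", size 52428800',
]

count, total_bytes = summarize_temp_files(sample_log)
print(count, total_bytes // (1024 * 1024), "MB")
```

Comparing these two numbers week over week gives the "more frequent" and "getting longer" trends listed under Symptoms.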

Page 137: Db Server Health Check

Temp File Usage

● Caused by:
  ● Sorts, hashes & aggregates too big for work_mem
● Leads to:
  ● slow response times
  ● timeouts

Page 138: Db Server Health Check

Temp File Usage

● Treatment
  ● find swapping queries via logs
  ● set work_mem higher for that ROLE, or
  ● refactor them to need less memory, or
  ● buy more RAM

Page 139: Db Server Health Check

All healthy now?

See you in six months!

Page 140: Db Server Health Check

Q&A

● Josh Berkus
  ● [email protected]
  ● it.toolbox.com/blogs/database-soup
● PostgreSQL Experts
  ● www.pgexperts.com
  ● pgCon Sponsor
● Also see:
  ● Load Testing (tomorrow)
  ● Testing BOF (Friday)

Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons Attribution license, except for 3rd-party images, which are property of their respective owners.