PostgreSQL Performance Workshop

DataOps Barcelona, June 20, 2019

Tomas Vondra, @fuzzycz

Agenda

- quick intro into PostgreSQL architecture

- basic configuration parameters

- MVCC, VACUUM, CHECKPOINTS

- understanding query plans

- indexes

- partitioning

- parallelism

PostgreSQL architecture & basic config

[Diagram: data files and WAL, the page cache (file system / kernel), shared buffers (PostgreSQL), client statements (SELECT, DELETE), and the background processes CHECKPOINTER, AUTOVACUUM, and BGWRITER]

Memory Budget

- Shared Memory

- shared_buffers, wal_buffers, ...

- ignore effective_cache_size (it is only a planner hint and allocates no memory)

- Private memory

- work_mem, maintenance_work_mem, ...

- depends on complexity of queries

- plan memory/cache

- additional overhead proportional to max_connections

max_connections

– You only have a limited number of CPU cores anyway

– Good value is ~2x CPU cores

– Many idle connections are a recipe for trouble

– Use a connection pool instead (pgBouncer)

shared_buffers

– The often quoted “25% of RAM” rule is obsolete

– No new rule of thumb - very workload dependent

– Start with smaller value, tune using pg_buffercache

- Can you fit “active set” into RAM? Good, use that.

- Otherwise rather use smaller values.

- page cache is “elastic”, unlike shared_buffers

work_mem

work_mem = 4MB

- Defaults mostly OK for OLTP workloads

- Be careful about OOM

- Increase only for users doing analytics
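The OOM warning is easier to judge with numbers. The following is a rough sketch, not an exact accounting: it assumes every connection runs a query at the same moment and that each sort or hash node uses its full work_mem. The connection count (100), the 256MB analytics value, and the function name are illustrative assumptions, not values from the slides.

```python
MB = 1024 * 1024


def worst_case_work_mem(connections: int, work_mem_bytes: int,
                        nodes_per_query: int = 1) -> int:
    """Upper bound on sort/hash memory if every connection runs a query
    with `nodes_per_query` memory-hungry plan nodes at the same time."""
    return connections * nodes_per_query * work_mem_bytes


# Assumed OLTP setup: 100 connections, default work_mem = 4MB, one sort each.
oltp = worst_case_work_mem(100, 4 * MB)

# Assumed analytics value set globally: 256MB, two sort/hash nodes per query.
analytics = worst_case_work_mem(100, 256 * MB, nodes_per_query=2)

print(f"OLTP worst case:      {oltp / MB:.0f} MB")
print(f"analytics worst case: {analytics / MB:.0f} MB")
```

Because work_mem applies per sort or hash operation rather than per connection, raising it only for the analytics role or session keeps the worst case bounded.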

maintenance_work_mem

maintenance_work_mem = 64MB

- similar to work_mem, but for maintenance operations

- CREATE INDEX, VACUUM, …

- Going beyond 1GB usually does not help

- Smaller values sometimes better (L2/L3 cache)

effective_io_concurrency = 1

- prefetching for some access patterns

- worth increasing, esp. for SSDs / RAID arrays

- default is pretty poor (0 would be better)

Storage

WAL on a separate filesystem

- Good idea, separates I/O requests

- Independent device is even better

RAID with write cache (BBWC)

- Typically 1GB memory

- Best used for pg_wal (absorbs fsync)

- Can help with bursts of random writes

- Useless for random reads

Checkpoints

https://www.2ndquadrant.com/en/blog/basics-of-tuning-checkpoints/

Write-Ahead Log (WAL)

- WAL is the PostgreSQL transaction log

- changes written to WAL first

- in case of crash we can repeat them

- conceptually “infinite” log of changes

- WAL is written in 16MB segments by default

- other sizes possible (specified at initdb)

- segments allow removing old parts

- files re-used after 2 checkpoints

WAL example

[Sequence of diagrams, each showing shared buffers, data files, and WAL:]

- initial state

- UPDATE

- COMMIT, fsync

- after a crash

- recovery after a crash (RECOVERY)

- after a while

- CHECKPOINT / flush data to durable storage

- CHECKPOINT / truncate WAL

Buffers may get evicted in 3 ways

- During a checkpoint (good)

- various optimizations possible (sort, combine, …)

- By a backend process (bad)

- shared buffers are full, queries have to evict buffers

- dirty buffers mean writes (random I/O)

- writes add latency for clients

- By the background writer (mostly good)

- regularly scans shared buffers, evicts buffers

- makes sure the previous case does not happen

CHECKPOINT

- writes dirty buffers to durable storage

- recovery applies WAL since last checkpoint

- allows bounded recovery time

- limits amount of WAL retained

- negative impact on performance (writes)

- spread checkpoints minimize spikes

CHECKPOINT

- Configuration until 9.4

checkpoint_timeout = 5min

checkpoint_segments = 3

- Configuration since 9.5

checkpoint_timeout = 5min

min_wal_size = 80MB

max_wal_size = 1GB

- Defaults are quite low (for larger DB)

- Consider failover to standby for fast recovery

- See pg_stat_bgwriter for tuning data

CHECKPOINT tuning

- aim for checkpoints triggered by timeout

- pick checkpoint_timeout = 30min (or similar)

- tune max_wal_size to be high enough

- there will always be non-timeout checkpoints

- batch loads at night

- shutdowns
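The tuning advice above can be checked with two small calculations. This is a sketch under assumptions: the counters are taken to be the checkpoints_timed and checkpoints_req columns of pg_stat_bgwriter (as in the PostgreSQL releases current at the time of this talk), the WAL rate is something you measure yourself, and the headroom factor of 2.0 is an illustrative safety margin, not a value from the slides.

```python
MB = 1024 * 1024


def timed_checkpoint_ratio(checkpoints_timed: int, checkpoints_req: int) -> float:
    """Fraction of checkpoints triggered by checkpoint_timeout."""
    total = checkpoints_timed + checkpoints_req
    if total == 0:
        return 0.0
    return checkpoints_timed / total


def estimate_max_wal_size(wal_bytes_per_sec: float,
                          checkpoint_timeout_sec: int,
                          headroom: float = 2.0) -> float:
    """WAL written during one timeout interval, times an assumed headroom."""
    return wal_bytes_per_sec * checkpoint_timeout_sec * headroom


# Hypothetical counters: 95 timed checkpoints, 5 requested ones.
print(f"timed ratio: {timed_checkpoint_ratio(95, 5):.2f}")

# Hypothetical load: 1 MB/s of WAL with checkpoint_timeout = 30min.
size = estimate_max_wal_size(1 * MB, 30 * 60)
print(f"max_wal_size estimate: {size / MB:.0f} MB")
```

A ratio close to 1.0 suggests max_wal_size is high enough; a low ratio means checkpoints are being forced by WAL volume.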

Spread checkpoints

- spread the writes over a longer time period

- gives OS a chance to do writes in background

- when we do fsync, it’ll be cheap (hopefully)

- throttled by time until next checkpoint

- default: complete writes half-way through

- checkpoint_completion_target = 0.5

- if you increase checkpoint_timeout, maybe use 0.9
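To see what checkpoint_completion_target means in practice, the write window can be worked out directly. This is a minimal sketch: the 4 GB of dirty data is a made-up figure for illustration, and the function names are not part of PostgreSQL.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def checkpoint_write_window(timeout_sec: float, completion_target: float) -> float:
    """Seconds over which checkpoint writes are spread."""
    if not 0 < completion_target <= 1:
        raise ValueError("completion_target must be in (0, 1]")
    return timeout_sec * completion_target


def required_write_rate(dirty_bytes: float, timeout_sec: float,
                        completion_target: float) -> float:
    """Average bytes/s needed to finish the checkpoint writes in the window."""
    return dirty_bytes / checkpoint_write_window(timeout_sec, completion_target)


# Defaults: 5min timeout, target 0.5. Tuned: 30min timeout, target 0.9.
for timeout, target in ((300, 0.5), (1800, 0.9)):
    window = checkpoint_write_window(timeout, target)
    rate = required_write_rate(4 * GB, timeout, target)  # assumed 4 GB dirty
    print(f"timeout={timeout}s target={target}: "
          f"window {window:.0f}s, {rate / MB:.1f} MB/s")
```

The same amount of dirty data needs a much lower write rate with the longer timeout and higher target, which is the point of spreading checkpoints.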

background writer

- Useful when checkpoints can’t keep up

- Generating large amounts of dirty buffers

- Auto-tuning behaviour

- Watches how many buffers were needed

- Evicts bgwriter_lru_multiplier times that number

- But only up to bgwriter_lru_maxpages pages per round

- Defaults throttle to ~4MB/s

bgwriter_delay = 200ms

bgwriter_lru_maxpages = 100

bgwriter_lru_multiplier = 2.0
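The ~4MB/s figure follows from the default settings. A short sketch of the arithmetic, assuming the standard 8kB block size and ignoring the time the writes themselves take (so it is an upper bound):

```python
BLOCK_SIZE = 8192  # assumed default PostgreSQL block size in bytes


def bgwriter_max_bytes_per_sec(delay_ms: float = 200,
                               lru_maxpages: int = 100,
                               block_size: int = BLOCK_SIZE) -> float:
    """Upper bound on background writer throughput.

    One round every `delay_ms`, at most `lru_maxpages` pages per round.
    """
    rounds_per_sec = 1000 / delay_ms
    return rounds_per_sec * lru_maxpages * block_size


default_rate = bgwriter_max_bytes_per_sec()
print(f"default bgwriter limit: {default_rate / 1e6:.1f} MB/s")
```

Raising bgwriter_lru_maxpages or lowering bgwriter_delay scales this limit linearly.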

Concurrency

- designed to allow high concurrency

- plus: high performance and application usability

- minus: requires maintenance (garbage collection, …)

Concurrency

BEGIN;

UPDATE table SET col = value WHERE keycol = 7;

-- write lock on the row

COMMIT;

BEGIN;

SELECT col FROM table WHERE keycol = 7;

-- reads through the lock

-- (no waiting)

COMMIT;

MVCC

- Multi-Version Concurrency Control

- Postgres stores multiple versions of a row

- Only one row version is visible to any observer

- Later row versions have the latest data

- A time-consistent view of the database (“snapshot”) is always available for each session

Row visibility

- write transactions are assigned a full transaction ID

- Each row has “hidden” fields with visibility information

- xmin: ID of the transaction that added the row

- xmax: ID of the transaction that deleted/updated the row

- additional fields to make it work correctly

- +12 bytes in total to each row header

- Data changes are not visible until commit

MVCC example

CREATE TABLE t (val INT);

-- transaction XID 111
INSERT INTO t VALUES (1);

-- contents of the table

 xmin | xmax | val
------+------+-----
  111 |      |   1

MVCC example

-- transaction XID 222
UPDATE t SET val = 2;

 xmin | xmax | val
------+------+-----
  111 |  222 |   1
  222 |      |   2

MVCC example

-- transaction XID 333
DELETE FROM t;

 xmin | xmax | val
------+------+-----
  111 |  222 |   1
  222 |  333 |   2
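The three steps of the example can be replayed with a toy model. This is a deliberately simplified sketch, not how PostgreSQL implements visibility: it assumes every transaction commits immediately, and it treats a snapshot as a single XID that sees all transactions up to and including itself. The class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowVersion:
    xmin: int             # XID that created this version
    xmax: Optional[int]   # XID that deleted/updated it, None if still live
    val: int


def insert(table: List[RowVersion], xid: int, val: int) -> None:
    table.append(RowVersion(xmin=xid, xmax=None, val=val))


def delete(table: List[RowVersion], xid: int) -> None:
    # DELETE only sets xmax; the old version stays in the table as a dead row.
    for row in table:
        if row.xmax is None:
            row.xmax = xid


def update(table: List[RowVersion], xid: int, val: int) -> None:
    # UPDATE = mark the old version with xmax, add a new version with xmin.
    delete(table, xid)
    insert(table, xid, val)


def select(table: List[RowVersion], snapshot_xid: int) -> List[int]:
    # Simplified rule: created at or before the snapshot, not yet deleted by it.
    return [r.val for r in table
            if r.xmin <= snapshot_xid
            and (r.xmax is None or r.xmax > snapshot_xid)]


t: List[RowVersion] = []
insert(t, 111, 1)
update(t, 222, 2)
delete(t, 333)

for snap in (111, 222, 333):
    print(snap, select(t, snap))
```

After the three statements the table still holds two row versions even though no snapshot at or after XID 333 sees any of them; those are the dead rows that VACUUM later removes.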

MVCC

- SQL Actions

- INSERT sets xmin

- DELETE sets xmax

- UPDATE sets xmax, creates new version with xmin

- SELECT ... FOR SHARE/UPDATE sets xmax

- Updates make the table (and indexes) grow!

- UPDATE, DELETE, and ROLLBACK create “dead rows”

- having too many dead rows is called “bloat”

VACUUM

- Removes dead rows

- Can’t remove rows still potentially visible to someone

- long-running statements

- long-running write transactions

- serializable transactions (even read-only)

- Two-phase commits and idle transactions can cause problems in the database (they hold back cleanup too)

VACUUM phases

Phase 1 – Scans the table

- reduces dead rows down to just their row pointer

- makes list of row pointers to remove

Phase 2 – Scans all of the indexes

- uses the list of rows to remove dead index pointers

Phase 3 – Scans the table again

- uses the list of rows to remove

- removes row pointers

Phase 4 – Truncates empty blocks at end of table

VACUUM vs. VACUUM FULL

VACUUM

- Truncates space at end of table only

- Attempts to get AccessExclusiveLock

- If it doesn’t get lock, skips truncation step

VACUUM FULL

- Repacks table, minimizing space usage

- Similar to what CLUSTER does

- Creates a table copy (needs disk space)

Heap-Only Tuples (HOT)

- UPDATE creates a new row version

- we can proceed without updating indexes if

- no indexed columns were modified

- there’s enough space on the same page

- in such case we only create HOT chain

- minimizes bloat due to frequent UPDATEs

- both index and table bloat

- VACUUM still required for aborted INSERTs

Heap-Only Tuples (HOT)

● Do you really need all the indexes?

● Do you need to update all the columns?

● pg_stat_all_tables

○ n_tup_upd

○ n_tup_hot_upd
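The two counters above are most useful as a ratio. A minimal sketch, assuming you have already read n_tup_upd and n_tup_hot_upd for a table from pg_stat_all_tables; the sample numbers are made up.

```python
from typing import Optional


def hot_update_ratio(n_tup_upd: int, n_tup_hot_upd: int) -> Optional[float]:
    """Fraction of updates that were HOT; None if the table saw no updates."""
    if n_tup_upd == 0:
        return None
    return n_tup_hot_upd / n_tup_upd


# Hypothetical counters for two tables.
print(hot_update_ratio(1000, 900))   # most updates avoided index maintenance
print(hot_update_ratio(1000, 50))    # few HOT updates: check indexes / free space
```

A low ratio on a frequently updated table is a hint to revisit the two questions on this slide: unneeded indexes and updates that touch indexed columns.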

Manual vs. Automatic cleanup

Manual

- VACUUM command

- vacuumdb utility

- --jobs=N (since 9.5)

Automatic

- HOT

- autovacuum

Autovacuum

- Runs VACUUM automatically

- based on monitoring DML activity on tables

- number of inserted / updated / deleted rows

- Also does ANALYZE

- collect statistics about data distribution

- crucial for good query plans

- Runs 24x7 – no time scheduling options

- Cancels itself if an action would block behind it

Autovacuum Parameters

autovacuum = on

autovacuum_naptime = 1min

autovacuum_max_workers = 3

log_autovacuum_min_duration = -1

- up to 3 autovacuum jobs, on different dbs

- maybe log long autovacuum runs

- not a good idea to disable autovacuum

- maybe run it more often (smaller chunks)

Autovacuum parameters

threshold + ( rows * scale_factor )

autovacuum_vacuum_threshold = 50

autovacuum_vacuum_scale_factor = 0.2

autovacuum_analyze_threshold = 50

autovacuum_analyze_scale_factor = 0.1

- Can be set at system or table level

- Can set parameters for toast table separately

- don’t do that (at least not initially)

Autovacuum Formulas

threshold = 50

- prevents frequent operations on tiny tables

scale_factor = 10% (analyze) or 20% (vacuum)

- consider lowering on huge tables

- will trigger more frequently, but less intrusive

If it hurts, do it more often.

Autovac Parameter Tuning

- a bit of bloat / free space is good (ready for new data)

- 20% of 1GB table - 200MB - meh

- 20% of 1TB table - 200GB - probably way too much

- lower the thresholds

autovacuum_vacuum_threshold = 1000

autovacuum_vacuum_scale_factor = 0.01
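The effect of these settings is easiest to see by plugging table sizes into the formula threshold + (rows * scale_factor). A small sketch comparing the defaults with the tuned values from this slide; the row counts are arbitrary examples.

```python
def autovacuum_trigger(rows: int, threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    """Number of changed rows needed before autovacuum processes the table."""
    return threshold + rows * scale_factor


for rows in (1_000, 1_000_000, 1_000_000_000):
    default = autovacuum_trigger(rows)
    tuned = autovacuum_trigger(rows, threshold=1000, scale_factor=0.01)
    print(f"{rows:>13,} rows: default {default:>13,.0f}   tuned {tuned:>12,.0f}")
```

On small tables the higher threshold dominates and cleanup runs less often; on huge tables the lower scale factor makes cleanup start far earlier, in smaller chunks.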

(Auto)vacuum Throttling

- cleanup can consume a lot of resources (CPU and I/O)

- we need to throttle it, so that it does not affect users

- do a bit of work and sleep() for a while

vacuum_cost_page_hit = 1

vacuum_cost_page_miss = 10

vacuum_cost_page_dirty = 20

autovacuum_vacuum_cost_delay = 20ms

autovacuum_vacuum_cost_limit = 200

(Auto)vacuum Throttling

- that means up to ~8MB/s reads, ~4MB/s of writes

- not sufficient for larger databases

autovacuum_vacuum_cost_delay = 10ms

autovacuum_vacuum_cost_limit = 1000

- ~80MB/s reads, ~40MB/s of writes
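The throughput figures on these two slides can be derived from the cost parameters. The sketch below assumes the standard 8kB block size and ignores the time spent doing the actual work between sleeps, so the results are upper bounds; the function name is illustrative.

```python
BLOCK_SIZE = 8192  # assumed default PostgreSQL block size in bytes


def vacuum_max_bytes_per_sec(cost_limit: int, cost_delay_ms: float,
                             page_cost: int,
                             block_size: int = BLOCK_SIZE) -> float:
    """Upper bound on vacuum throughput for pages of a given cost.

    Vacuum accumulates cost until it reaches cost_limit, then sleeps
    for cost_delay_ms.
    """
    pages_per_round = cost_limit / page_cost
    rounds_per_sec = 1000 / cost_delay_ms
    return pages_per_round * rounds_per_sec * block_size


PAGE_MISS, PAGE_DIRTY = 10, 20  # vacuum_cost_page_miss / vacuum_cost_page_dirty

for label, limit, delay in (("default", 200, 20), ("tuned", 1000, 10)):
    reads = vacuum_max_bytes_per_sec(limit, delay, PAGE_MISS)
    writes = vacuum_max_bytes_per_sec(limit, delay, PAGE_DIRTY)
    print(f"{label}: reads {reads / 1e6:.1f} MB/s, writes {writes / 1e6:.1f} MB/s")
```

Both knobs scale the limit linearly: five times the cost limit and half the delay give ten times the throughput.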

Monitoring VACUUM and HOT

SELECT * FROM pg_stat_user_tables WHERE relname = 'mytab';

SELECT * FROM pg_stat_progress_vacuum;

https://www.2ndquadrant.com/en/blog/autovacuum-tuning-basics/

Questions?
