PostgreSQL Performance Workshop
DataOps Barcelona, June 20, 2019
Tomas Vondra, [email protected]
[email protected] @fuzzycz
Agenda
- quick intro into PostgreSQL architecture
- basic configuration parameters
- MVCC, VACUUM, CHECKPOINTS
- understanding query plans
- indexes
- partitioning
- parallelism
PostgreSQL architecture & basic config
[Figure: architecture diagram showing data files and WAL, the page cache (file system / kernel), shared buffers (PostgreSQL), client statements (SELECT, DELETE), and the CHECKPOINTER, AUTOVACUUM and BGWRITER processes]
Memory Budget
- Shared Memory
- shared_buffers, wal_buffers, ...
- ignore effective_cache_size (planner hint only, allocates no memory)
- Private memory
- work_mem, maintenance_work_mem, ...
- depends on complexity of queries
- plan memory/cache
- additional overhead proportional to max_connections
max_connections
– You only have a limited number of CPU cores anyway
– A good value is ~2x the number of CPU cores
– Many idle connections are a recipe for trouble
– Use a connection pool instead (pgBouncer)
shared_buffers
– The often quoted “25% of RAM” rule is obsolete
– No new rule of thumb - very workload dependent
– Start with smaller value, tune using pg_buffercache
- Can you fit “active set” into RAM? Good, use that.
- Otherwise rather use smaller values.
- page cache is “elastic”, unlike shared_buffers
work_mem
work_mem = 4MB
- Defaults mostly OK for OLTP workloads
- Be careful about OOM
- Increase only for users doing analytics
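A rough worst-case estimate shows why "be careful about OOM" matters. work_mem is a limit per sort or hash operation, not per connection, so a single query can use it several times, and all connections can do so at once. The sketch below is illustrative only: the function name, the connection count and the number of memory-hungry nodes per query are assumptions, and it ignores parallel workers and other private memory.

```python
def worst_case_work_mem_mb(max_connections, work_mem_mb, nodes_per_query):
    """Upper bound on sort/hash memory if every connection runs a query
    with `nodes_per_query` memory-hungry plan nodes at the same time.
    `nodes_per_query` is a hypothetical workload parameter."""
    return max_connections * work_mem_mb * nodes_per_query


# 100 connections, default work_mem = 4MB, 2 sort/hash nodes per query
oltp_mb = worst_case_work_mem_mb(100, 4, 2)

# same connections after raising work_mem to 256MB for everyone
analytics_mb = worst_case_work_mem_mb(100, 256, 2)

print(f"default: {oltp_mb} MB, raised globally: {analytics_mb} MB")
```

The second figure is why the slide suggests raising work_mem only for the users doing analytics rather than globally.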
maintenance_work_mem
maintenance_work_mem = 64MB
- similar to work_mem, but for maintenance operations
- CREATE INDEX, VACUUM, …
- Going beyond 1GB usually does not help
- Smaller values sometimes better (L2/L3 cache)
effective_io_concurrency = 1
- prefetching for some access patterns
- worth increasing, esp. for SSDs / RAID arrays
- default is pretty poor (0 would be better)
Storage
WAL on a separate filesystem
- Good idea, separates I/O requests
- Independent device is even better
RAID with write cache (BBWC)
- Typically 1GB memory
- Best used for pg_wal (absorbs fsync)
- Can help with bursts of random writes
- Useless for random reads
Checkpoints
https://www.2ndquadrant.com/en/blog/basics-of-tuning-checkpoints/
Write-Ahead Log (WAL)
- WAL is the Postgres transaction log
- changes written to WAL first
- in case of crash we can repeat them
- conceptually “infinite” log of changes
- WAL is written in 16MB segments by default
- other sizes possible (specified at initdb)
- segments allow removing old parts
- files re-used after 2 checkpoints
WAL example
[Figure sequence: each step shows shared buffers, data files and WAL]
- initial state
- UPDATE
- COMMIT (fsync)
- after a crash
- recovery after a crash (RECOVERY)
- after a while
- CHECKPOINT / flush data to durable storage
- CHECKPOINT / truncate WAL
Buffers may get evicted in 3 ways
- During a checkpoint (good)
- various optimizations possible (sort, combine, …)
- By a backend process (bad)
- shared buffers are full, queries have to evict buffers
- dirty buffers mean writes (random I/O)
- writes add latency for clients
- By the background writer (mostly good)
- regularly scans shared buffers, evicts buffers
- makes sure the previous case does not happen
CHECKPOINT
- writes dirty buffers to durable storage
- recovery applies WAL since last checkpoint
- allows bounded recovery time
- limits amount of WAL retained
- negative impact on performance (writes)
- spread checkpoints minimize spikes
CHECKPOINT
- Configuration until 9.4
checkpoint_timeout = 5min
checkpoint_segments = 3
- Configuration since 9.5
checkpoint_timeout = 5min
min_wal_size = 80MB
max_wal_size = 1GB
- Defaults are quite low (for larger DB)
- Consider failover to standby for fast recovery
- See pg_stat_bgwriter for tuning data
CHECKPOINT tuning
- aim for checkpoints triggered by timeout
- pick checkpoint_timeout = 30min (or similar)
- tune max_wal_size to be high enough
- there will always be non-timeout checkpoints
- batch loads at night
- shutdowns
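One way to pick a starting point for max_wal_size, assuming you have measured how much WAL the system writes per hour, is to make sure a full checkpoint_timeout worth of WAL fits with some headroom. This is a sketch, not a rule from the slides: the function name, the 2x headroom factor and the sample WAL rate are all assumptions.

```python
def suggest_max_wal_size_gb(wal_gb_per_hour, checkpoint_timeout_min, headroom=2.0):
    """Estimate max_wal_size so that checkpoints are triggered by the
    timeout rather than by WAL volume.
    `headroom` is a hypothetical safety factor for write bursts."""
    wal_per_checkpoint_gb = wal_gb_per_hour * checkpoint_timeout_min / 60.0
    return wal_per_checkpoint_gb * headroom


# assumed measurement: 20 GB of WAL per hour, checkpoint_timeout = 30min
suggested = suggest_max_wal_size_gb(20, 30)
print(f"max_wal_size of about {suggested:.0f}GB")
```

If checkpoints are still being requested by WAL volume after this, the measured WAL rate was probably taken during a quiet period.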
Spread checkpoints
- spread the writes over a longer time period
- gives OS a chance to do writes in background
- when we do fsync, it’ll be cheap (hopefully)
- throttled by time until next checkpoint
- default: complete writes half-way through
- checkpoint_completion_target = 0.5
- if you increase checkpoint_timeout, maybe use 0.9
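The effect of checkpoint_completion_target is easiest to see as a time window. The sketch below assumes checkpoints are triggered by the timeout and simply multiplies the two settings; the function name is made up for illustration.

```python
def checkpoint_write_window_s(checkpoint_timeout_s, completion_target):
    """Seconds over which checkpoint writes are spread, assuming the
    checkpoint was triggered by checkpoint_timeout."""
    return checkpoint_timeout_s * completion_target


default_window = checkpoint_write_window_s(5 * 60, 0.5)   # 5min, target 0.5
tuned_window = checkpoint_write_window_s(30 * 60, 0.9)    # 30min, target 0.9

print(f"default: {default_window / 60:.1f} min, tuned: {tuned_window / 60:.1f} min")
```

With the longer timeout and the higher target, the same kind of checkpoint has roughly ten times longer to trickle its writes out.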
background writer
- Useful when checkpoints can’t keep up
- Generating large amounts of dirty buffers
- Auto-tuning behaviour
- Watches how many buffers were needed
- Evicts "lru_multiplier" times that number
- But only up to "lru_maxpages" pages per round
- Defaults throttle to ~4MB/s
bgwriter_delay = 200ms
bgwriter_lru_maxpages = 100
bgwriter_lru_multiplier = 2.0
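The "~4MB/s" figure follows directly from the defaults above: at most bgwriter_lru_maxpages pages per round, one round every bgwriter_delay. A minimal sketch of that arithmetic, assuming the standard 8kB page size and ignoring the time the writes themselves take:

```python
BLOCK_SIZE_KB = 8  # default PostgreSQL page size (assumed)


def bgwriter_max_mb_per_s(delay_ms, lru_maxpages):
    """Upper bound on background writer throughput in MB/s."""
    rounds_per_s = 1000 / delay_ms
    return rounds_per_s * lru_maxpages * BLOCK_SIZE_KB / 1024


default_rate = bgwriter_max_mb_per_s(200, 100)
print(f"{default_rate:.1f} MB/s")
```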
MVCC
Concurrency
- designed to allow high concurrency
- plus: high performance and application usability
- minus: requires maintenance (garbage collection, …)
Concurrency
BEGIN;
UPDATE table SET col = value WHERE keycol = 7;
-- write lock on the row
COMMIT;
BEGIN;
SELECT col FROM table WHERE keycol = 7;
-- reads thru the lock
-- (no waiting)
COMMIT;
MVCC
- Multi-Version Concurrency Control
- Postgres stores multiple versions of each row
- Only one row version is visible to any given observer
- Later row versions have latest data
- A time-consistent view of the database (“snapshot”) is always available to each session
Row visibility
- write transactions are assigned a full transaction ID
- Each row has “hidden” fields with visibility information
- xmin: ID of the transaction that added the row
- xmax: ID of the transaction that deleted/updated the row
- additional fields to make it work correctly
- +12 bytes in total to each row header
- Data changes are not visible until commit
MVCC example
CREATE TABLE t (val INT);
-- transaction XID 111
INSERT INTO t VALUES (1);
-- contents of the table
 xmin | xmax | val
------|------|-----
  111 |      |   1
MVCC example
-- transaction XID 222
UPDATE t SET val = 2;
-- contents of the table
 xmin | xmax | val
------|------|-----
  111 |  222 |   1
  222 |      |   2
MVCC example
-- transaction XID 333
DELETE FROM t;
-- contents of the table
 xmin | xmax | val
------|------|-----
  111 |  222 |   1
  222 |  333 |   2
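The three steps above can be replayed with a toy visibility check. This is a deliberately simplified model, not how PostgreSQL implements it: a snapshot is reduced to a single XID, whereas real snapshots also track in-progress transactions, a transaction's own changes, and XID wraparound. The function and variable names are made up for illustration.

```python
def visible_rows(rows, snapshot_xid, committed):
    """Return the values visible to a snapshot taken right after
    transaction `snapshot_xid`.

    rows: list of (xmin, xmax, val); xmax is None if the version was
    never deleted or updated. committed: set of committed XIDs.
    """
    def took_effect(xid):
        # toy rule: committed, and not newer than the snapshot
        return xid is not None and xid in committed and xid <= snapshot_xid

    return [val for xmin, xmax, val in rows
            if took_effect(xmin) and not took_effect(xmax)]


# final table contents from the example: two row versions, both dead
table = [(111, 222, 1), (222, 333, 2)]
committed = {111, 222, 333}

print(visible_rows(table, 150, committed))  # snapshot between 111 and 222
print(visible_rows(table, 250, committed))  # snapshot between 222 and 333
print(visible_rows(table, 400, committed))  # snapshot after the DELETE
```

Both row versions stay in the table after the DELETE. Once no snapshot can see them any more they are dead rows, and removing them is VACUUM's job.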
MVCC
- SQL Actions
- INSERT sets xmin
- DELETE sets xmax
- UPDATE sets xmax, creates new version with xmin
- SELECT ... FOR SHARE/UPDATE sets xmax
- Updates make the table (and indexes) grow!
- UPDATE, DELETE, and ROLLBACK create “dead rows”
- having too many dead rows is called “bloat”
VACUUM
- Removes dead rows
- Can’t remove rows still potentially visible to someone
- long-running statements
- long-running write transactions
- serializable transactions (even read-only)
- Two-phase commits and idle transactions can cause the same problem (they hold back cleanup)
VACUUM phases
Phase 1 – Scans the table
- reduces dead rows down to just their row pointer
- makes list of row pointers to remove
Phase 2 – Scans all of the indexes
- uses the list of rows to remove dead index pointers
Phase 3 – Scans the table again
- uses the list of rows to remove
- removes row pointers
Phase 4 – Truncates empty blocks at end of table
VACUUM vs. VACUUM FULL
VACUUM
- Truncates space at end of table only
- Attempts to get AccessExclusiveLock
- If it doesn’t get lock, skips truncation step
VACUUM FULL
- Repacks table, minimizing space usage
- Similar to what CLUSTER does
- Creates a table copy (needs disk space)
Heap-Only Tuples (HOT)
- UPDATE creates a new row version
- we can proceed without updating indexes if
- no indexed columns were modified
- there’s enough space on the same page
- in such case we only create HOT chain
- minimizes bloat due to frequent UPDATEs
- both index and table bloat
- VACUUM still required for aborted INSERTs
Heap-Only Tuples (HOT)
● Do you really need all the indexes?
● Do you need to update all the columns?
● pg_stat_all_tables
○ n_tup_upd
○ n_tup_hot_upd
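The two counters are most useful as a ratio. A small sketch, assuming you have already read n_tup_upd and n_tup_hot_upd for one table from pg_stat_all_tables; the function name and the sample numbers are hypothetical.

```python
def hot_update_ratio(n_tup_upd, n_tup_hot_upd):
    """Fraction of updates that were HOT, or None if there were no updates."""
    if n_tup_upd == 0:
        return None
    return n_tup_hot_upd / n_tup_upd


# hypothetical counters for one table
ratio = hot_update_ratio(n_tup_upd=20000, n_tup_hot_upd=5000)
print(f"{ratio:.0%} of updates were HOT")
```

A low ratio on a frequently updated table is a hint to revisit the two questions above: whether every index is needed, and whether the updates really have to touch indexed columns.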
Manual vs. Automatic cleanup
Manual
- VACUUM command
- vacuumdb utility
- --jobs=N (since 9.5)
Automatic
- HOT
- autovacuum
Autovacuum
- Runs VACUUM automatically
- based on monitoring DML activity on tables
- number of inserted / updated / deleted rows
- Also does ANALYZE
- collect statistics about data distribution
- crucial for good query plans
- Runs 24x7 – no time scheduling options
- Cancels itself if an action would block behind it
Autovacuum Parameters
autovacuum = on
autovacuum_naptime = 1min
autovacuum_max_workers = 3
log_autovacuum_min_duration = -1
- up to 3 autovacuum jobs, on different dbs
- maybe log long autovacuum runs
- not a good idea to disable autovacuum
- maybe run it more often (smaller chunks)
Autovacuum parameters
threshold + ( rows * scale_factor )
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_threshold = 50
autovacuum_analyze_scale_factor = 0.1
- Can be set at system or table level
- Can set parameters for toast table separately
- don’t do that (at least not initially)
Autovacuum Formulas
threshold = 50
- prevents frequent operations on tiny tables
scale_factor = 10% (analyze) or 20% (vacuum)
- consider lowering on huge tables
- will trigger more frequently, but less intrusive
If it hurts, do it more often.
Autovac Parameter Tuning
- a bit of bloat / free space is good (ready for new data)
- 20% of 1GB table - 200MB - meh
- 20% of 1TB table - 200GB - probably way too much
- lower the thresholds
autovacuum_vacuum_threshold = 1000
autovacuum_vacuum_scale_factor = 0.01
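Plugging numbers into the formula threshold + (rows * scale_factor) shows how much the lowered settings matter on a big table. The sketch below is plain arithmetic; the function name and the row counts are assumptions chosen for illustration.

```python
def autovacuum_trigger_rows(reltuples, threshold=50, scale_factor=0.2):
    """Number of dead rows after which autovacuum vacuums the table,
    per the formula threshold + (rows * scale_factor)."""
    return threshold + reltuples * scale_factor


small_default = autovacuum_trigger_rows(10_000)
huge_default = autovacuum_trigger_rows(1_000_000_000)
huge_tuned = autovacuum_trigger_rows(1_000_000_000,
                                     threshold=1000, scale_factor=0.01)

print(f"10k rows, defaults: {small_default:,.0f} dead rows")
print(f"1B rows, defaults:  {huge_default:,.0f} dead rows")
print(f"1B rows, tuned:     {huge_tuned:,.0f} dead rows")
```

With the defaults, the large table accumulates about 200 million dead rows before cleanup starts. With the lowered values, cleanup starts after about 10 million, so each run has far less to do.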
(Auto)vacuum Throttling
- cleanup can consume a lot of resources (CPU and I/O)
- we need to throttle it, so that it does not affect users
- do a bit of work and sleep() for a while
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
autovacuum_vacuum_cost_delay = 20ms
autovacuum_vacuum_cost_limit = 200
(Auto)vacuum Throttling
- that means up to ~8MB/s reads, ~4MB/s of writes
- not sufficient for larger databases
autovacuum_vacuum_cost_delay = 10ms
autovacuum_vacuum_cost_limit = 1000
- ~80MB/s reads, ~40MB/s of writes
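The throughput figures can be derived from the cost parameters: each round spends up to cost_limit "credits", then sleeps for cost_delay. A minimal sketch of that arithmetic, using the values from the slides, assuming 8kB pages, and counting only the sleep time (the work itself is treated as instantaneous):

```python
BLOCK_SIZE_KB = 8  # default PostgreSQL page size (assumed)


def vacuum_mb_per_s(cost_delay_ms, cost_limit, cost_per_page):
    """Upper bound on vacuum throughput in MB/s for one kind of page access."""
    rounds_per_s = 1000 / cost_delay_ms
    pages_per_s = rounds_per_s * cost_limit / cost_per_page
    return pages_per_s * BLOCK_SIZE_KB / 1024


PAGE_MISS, PAGE_DIRTY = 10, 20  # vacuum_cost_page_miss / _dirty from the slide

default_read = vacuum_mb_per_s(20, 200, PAGE_MISS)
default_write = vacuum_mb_per_s(20, 200, PAGE_DIRTY)
tuned_read = vacuum_mb_per_s(10, 1000, PAGE_MISS)
tuned_write = vacuum_mb_per_s(10, 1000, PAGE_DIRTY)

print(f"defaults: {default_read:.1f} MB/s reads, {default_write:.1f} MB/s writes")
print(f"tuned:    {tuned_read:.1f} MB/s reads, {tuned_write:.1f} MB/s writes")
```

Halving the delay and raising the limit five times multiplies the budget by ten, which is where the jump from ~8/4 MB/s to ~80/40 MB/s comes from.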
Monitoring VACUUM and HOT
SELECT * FROM pg_stat_user_tables WHERE relname = 'mytab';
SELECT * FROM pg_stat_progress_vacuum;
https://www.2ndquadrant.com/en/blog/autovacuum-tuning-basics/
Questions?