PostgreSQL Under Load: Practical Tuning on General-Purpose VMs
A practical, production‑minded checklist for tuning PostgreSQL 13–16 on general‑purpose VMs: OS/FS basics, key config, autovacuum, checkpoints, pooling, and query/index fixes.
You don’t need exotic hardware to make Postgres fast. Most gains come from a clean baseline, sane OS settings, and a handful of parameters matched to your workload. This guide is a practical, production‑minded checklist for teams running PostgreSQL 13–16 on general‑purpose virtual machines.
Ground Rules
Measure first, change second. Always capture a baseline before tuning:
- QPS and latency percentiles (p50/p95/p99)
- CPU steal/iowait, load average
- Buffer/cache hit ratios, WAL write rate
- Top queries by time and I/O
Useful tools: pg_stat_statements, auto_explain, Prometheus exporter + Grafana, and (PG16+) pg_stat_io.
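On PG16+, pg_stat_io breaks I/O down by backend type and context; a quick look, using only the built-in view (nothing custom):

-- PG16+: which backends are doing the reads/writes?
SELECT backend_type, object, context, reads, writes, hits, evictions
FROM pg_stat_io
WHERE COALESCE(reads, 0) + COALESCE(writes, 0) > 0
ORDER BY COALESCE(reads, 0) + COALESCE(writes, 0) DESC;

Note that pg_stat_statements needs shared_preload_libraries (see the logging block below) plus CREATE EXTENSION pg_stat_statements; in each database you want to inspect.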
VM & OS Baseline
CPU & memory. Watch CPU steal time; if it’s >2–3% sustained, move or resize. On big VMs (≥16 vCPU/64 GB), keep NUMA simple and set THP off.
Storage & filesystem. Prefer NVMe/SSD. Use ext4 or xfs with noatime; consider discard=async for SSD TRIM.
# /etc/fstab example
UUID=... /var/lib/postgresql ext4 rw,noatime,discard 0 2
I/O scheduler. NVMe defaults to none (good). On SSD/SAS, mq-deadline is a solid choice.
cat /sys/block/nvme0n1/queue/scheduler
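To change the scheduler, a quick sketch; the device name sda and the rule file path are examples, adjust for your disks:

# One-off change, lost on reboot
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# Persistent variant, e.g. /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"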
Kernel sysctl (sane defaults). Create /etc/sysctl.d/99-postgres.conf:
vm.swappiness = 1
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
vm.overcommit_memory = 2
vm.overcommit_ratio = 95
kernel.sched_migration_cost_ns = 5000000
sudo sysctl --system
Data Layout
Keep WAL and data on the same fast SSD/NVMe unless you truly have a separate fast device for WAL. Spreading to slow disks usually hurts.
Postgres Configuration: the 80/20 Set
A starting point for a mixed read/write OLTP system on a VM with 8 vCPU / 32 GB RAM / NVMe. Adjust proportionally to your RAM and IOPS.
Memory
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 1GB
huge_pages = try
Notes: work_mem is allocated per sort/hash node, so a single query can consume it several times over; start conservative. Gains from a very large shared_buffers taper off beyond ~25–30% of RAM.
WAL, checkpoints & durability
wal_level = replica
wal_compression = on
max_wal_size = 8GB
min_wal_size = 2GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
synchronous_commit = on
If you can tolerate losing the last few transactions on crash for lower latency, test synchronous_commit = off.
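You don’t have to change it cluster-wide; a sketch of scoping it to a role or a single transaction (batch_worker and audit_log are hypothetical names):

-- Per role: only this worker trades durability for latency
ALTER ROLE batch_worker SET synchronous_commit = off;

-- Or per transaction
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO audit_log (payload) VALUES ('...');
COMMIT;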
Autovacuum
autovacuum = on
autovacuum_max_workers = 5
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02
autovacuum_vacuum_cost_limit = 3000
autovacuum_vacuum_cost_delay = 2ms
Use per‑table settings for hot tables (lower scale factors). Partition time‑series data.
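A minimal range-partitioning sketch for a time-series table (the events schema is hypothetical; in practice pg_partman or a scheduled job creates the monthly partitions for you):

CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; create the next one ahead of time
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Autovacuum then works on small, recent partitions instead of one huge heap, and old data can be removed by detaching or dropping a partition.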
Connections & pooling
max_connections = 200
Use PgBouncer (transaction pooling) for many short‑lived connections.
Planner & I/O hints
random_page_cost = 1.1
effective_io_concurrency = 200
default_statistics_target = 200
On SATA SSD, a random_page_cost of roughly 1.3–1.5 is more realistic. Raise the statistics target for skewed columns, as sketched below.
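A per-column sketch (orders.status is a hypothetical skewed column):

-- Collect more detailed statistics for one skewed column, then refresh them
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;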
Logging
log_min_duration_statement = 200ms
log_checkpoints = on
log_autovacuum_min_duration = 250ms
log_error_verbosity = default
log_line_prefix = '%m [%p] %u@%d %a %h '
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '250ms'
auto_explain.log_analyze = true
auto_explain.sample_rate = 0.05
Fix Queries, Indexes, Schema
No parameter beats a missing index. Weekly loop:
SELECT (total_exec_time/1000)::numeric(12,2) AS total_s,
       calls,
       (mean_exec_time)::numeric(12,2) AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
- EXPLAIN ANALYZE top offenders; look for Seq Scan on large tables and add selective/covering indexes (see the sketch after this list).
- Trim SELECT lists; avoid accidental DISTINCT/ORDER BY.
- Use partitioning for time‑series.
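A covering-index sketch for the Seq Scan case above (table, columns, and index name are examples; CONCURRENTLY avoids blocking writes but cannot run inside a transaction block):

-- Hypothetical index: key columns match the filter/sort, INCLUDE covers the SELECT list
CREATE INDEX CONCURRENTLY idx_orders_customer_created
    ON orders (customer_id, created_at DESC)
    INCLUDE (status, total_amount);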
PgBouncer
Run PgBouncer close to Postgres. For web workloads use transaction pooling.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 4000
default_pool_size = 50
server_reset_query = DISCARD ALL
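Point the application at port 6432 instead of 5432; a quick smoke test (connection values are examples):

psql "host=127.0.0.1 port=6432 dbname=appdb user=app"

Keep in mind that with transaction pooling, session state (SET, server-side prepared statements, advisory locks) generally doesn’t survive across transactions; keep those out of the app or route such connections through session pooling.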
Vacuum, Bloat, Freezing
- Watch n_dead_tup and autovacuum logs (dead-tuple query below).
- Lower per‑table scale factors on hot tables; consider fillfactor < 100.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);
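To see which tables autovacuum is falling behind on, a query sketch against the standard pg_stat_user_tables view:

SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;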
WAL, Replication, HA
- Keep wal_level = replica; monitor replay lag (query below).
- Enable wal_compression = on for write‑heavy workloads.
- Use synchronous replication only when you must; expect higher latency.
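On the primary, a quick lag check using the standard pg_stat_replication view:

SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;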
Backups & Restore Drills
- Daily backups with at least 7‑day retention (base + WAL).
- Quarterly restore drills; keep an off‑platform copy.
Load Testing That Looks Like Prod
- Warm up caches (5–10 min), then hold a steady state for 15–30 min (pgbench sketch below).
- Track p50/p95/p99, CPU (user/system/steal), iowait, WAL/s, checkpoints.
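A pgbench sketch of the warm-up/steady-state pattern (scale factor and client counts are examples; the built-in TPC-B-like workload is only a rough stand-in, so replay your own statements with -f where you can):

pgbench -i -s 200 appdb                  # initialize test tables at scale factor 200
pgbench -c 32 -j 4 -T 600 -P 30 appdb    # 10 min warm-up, progress every 30 s
pgbench -c 32 -j 4 -T 1800 -P 30 appdb   # 30 min steady-state run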
Example Configs by VM Size
4 vCPU / 16 GB RAM / NVMe (OLTP)
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 32MB
maintenance_work_mem = 512MB
max_wal_size = 4GB
checkpoint_timeout = 15min
autovacuum_max_workers = 4
8 vCPU / 32 GB RAM / NVMe (OLTP mixed)
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 1GB
max_wal_size = 8GB
checkpoint_timeout = 15min
autovacuum_max_workers = 5
effective_io_concurrency = 200
random_page_cost = 1.1
16 vCPU / 64 GB RAM / NVMe (heavier writes)
shared_buffers = 16GB
effective_cache_size = 48GB
work_mem = 96MB
maintenance_work_mem = 2GB
max_wal_size = 16GB
checkpoint_timeout = 15min          # consider up to 20min for heavy writes
checkpoint_completion_target = 0.9
autovacuum_max_workers = 6
effective_io_concurrency = 256
random_page_cost = 1.1
Quick Troubleshooting Map
- Latency spikes every 5–15 min → increase max_wal_size, set checkpoint_completion_target = 0.9, verify disk IOPS.
- High CPU, low QPS → too many connections thrashing; add PgBouncer; fix top queries.
- Many heap fetches / slow reads → missing covering indexes; verify the cache hit ratio.
- Autovacuum behind → lower per‑table scale factors; increase workers/cost limit.
Monitoring Essentials
- DB: TPS, active vs. idle-in-transaction sessions, locks, top queries, cache hit ratio (query below), WAL bytes/s, checkpoints, autovacuum activity.
- Host: CPU (user/system/steal), RAM, I/O latency/throughput, network.
- SLOs: p95/p99 of key queries or API endpoints.
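A quick cache-hit-ratio check across all databases (standard pg_stat_database; on an OLTP workload you generally want this to stay above roughly 0.99):

SELECT sum(blks_hit)::float
       / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;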
Wrap‑Up
PostgreSQL scales well on plain VMs when you keep the OS/FS simple, set a realistic memory split, manage WAL/checkpoints and autovacuum, pool connections, and fix the queries and indexes that dominate time.
Need a second pair of eyes? We can review your metrics and config, propose a right‑sized VM shape, or set up PgBouncer/monitoring as part of a managed engagement (DevOps from €35/h; onboarding may be free in some cases).
Next Steps
Ready to put this into practice? Start small, measure, and iterate.