PostgreSQL Under Load: Practical Tuning on General-Purpose VMs
A practical, production‑minded checklist for tuning PostgreSQL 13–16 on general‑purpose VMs: OS/FS basics, key config, autovacuum, checkpoints, pooling, and query/index fixes.
You don’t need exotic hardware to make Postgres fast. Most gains come from a clean baseline, sane OS settings, and a handful of parameters matched to your workload. This guide is a practical, production‑minded checklist for teams running PostgreSQL 13–16 on general‑purpose virtual machines.
Ground Rules
Measure first, change second. Always capture a baseline before tuning:
- QPS and latency percentiles (p50/p95/p99)
- CPU steal/iowait, load average
- Buffer/cache hit ratios, WAL write rate
- Top queries by time and I/O
Useful tools: pg_stat_statements, auto_explain, Prometheus exporter + Grafana, and (PG16+) pg_stat_io.
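On PG16+, pg_stat_io breaks I/O down by backend type and context; a quick look, using only the built-in view (nothing custom):

-- PG16+: which backends are doing the reads/writes?
SELECT backend_type, object, context, reads, writes, hits, evictions
FROM pg_stat_io
WHERE COALESCE(reads, 0) + COALESCE(writes, 0) > 0
ORDER BY COALESCE(reads, 0) + COALESCE(writes, 0) DESC;

Note that pg_stat_statements needs shared_preload_libraries (see the logging block below) plus CREATE EXTENSION pg_stat_statements; in each database you want to inspect.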
VM & OS Baseline
CPU & memory. Watch CPU steal time; if it’s >2–3% sustained, move or resize. On big VMs (≥16 vCPU/64 GB), keep NUMA simple and set THP off.
Storage & filesystem. Prefer NVMe/SSD. Use ext4 or xfs with noatime; consider discard=async for SSD TRIM.
# /etc/fstab example
UUID=... /var/lib/postgresql ext4 rw,noatime,discard 0 2
I/O scheduler. NVMe defaults to none (good). On SSD/SAS, mq-deadline is a solid choice.
cat /sys/block/nvme0n1/queue/scheduler
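To change the scheduler, a quick sketch; the device name sda and the rule file path are examples, adjust for your disks:

# One-off change, lost on reboot
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# Persistent variant, e.g. /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"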
Kernel sysctl (sane defaults). Create /etc/sysctl.d/99-postgres.conf:
vm.swappiness = 1
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
vm.overcommit_memory = 2
vm.overcommit_ratio = 95
kernel.sched_migration_cost_ns = 5000000
sudo sysctl --system
Data Layout
Keep WAL and data on the same fast SSD/NVMe unless you truly have a separate fast device for WAL. Spreading to slow disks usually hurts.
Postgres Configuration: the 80/20 Set
A starting point for a mixed read/write OLTP system on a VM with 8 vCPU / 32 GB RAM / NVMe. Adjust proportionally to your RAM and IOPS.
Memory
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 1GB
huge_pages = try
Notes: work_mem is allocated per sort/hash node, so a single query can consume it several times over; start conservative. Gains from a very large shared_buffers taper off beyond ~25–30% of RAM.
WAL, checkpoints & durability
wal_level = replica
wal_compression = on
max_wal_size = 8GB
min_wal_size = 2GB
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
synchronous_commit = on
If you can tolerate losing the last few transactions on crash for lower latency, test synchronous_commit = off.
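You don’t have to change it cluster-wide; a sketch of scoping it to a role or a single transaction (batch_worker and audit_log are hypothetical names):

-- Per role: only this worker trades durability for latency
ALTER ROLE batch_worker SET synchronous_commit = off;

-- Or per transaction
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO audit_log (payload) VALUES ('...');
COMMIT;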
Autovacuum
autovacuum = on
autovacuum_max_workers = 5
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02
autovacuum_vacuum_cost_limit = 3000
autovacuum_vacuum_cost_delay = 2ms
Use per‑table settings for hot tables (lower scale factors). Partition time‑series data.
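A minimal range-partitioning sketch for a time-series table (the events schema is hypothetical; in practice pg_partman or a scheduled job creates the monthly partitions for you):

CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per month; create the next one ahead of time
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Autovacuum then works on small, recent partitions instead of one huge heap, and old data can be removed by detaching or dropping a partition.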
Connections & pooling
max_connections = 200
Use PgBouncer (transaction pooling) for many short‑lived connections.
Planner & I/O hints
random_page_cost = 1.1
effective_io_concurrency = 200
default_statistics_target = 200
On SATA SSD, a random_page_cost of roughly 1.3–1.5 is more realistic. Raise the statistics target for skewed columns, as sketched below.
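A per-column sketch (orders.status is a hypothetical skewed column):

-- Collect more detailed statistics for one skewed column, then refresh them
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;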
Logging
log_min_duration_statement = 200ms
log_checkpoints = on
log_autovacuum_min_duration = 250ms
log_error_verbosity = default
log_line_prefix = '%m [%p] %u@%d %a %h '
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '250ms'
auto_explain.log_analyze = true
auto_explain.sample_rate = 0.05
Fix Queries, Indexes, Schema
No parameter beats a missing index. Weekly loop:
SELECT (total_exec_time/1000)::numeric(12,2) AS total_s,
       calls,
       (mean_exec_time)::numeric(12,2) AS mean_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
- EXPLAIN ANALYZE top offenders; look for Seq Scan on large tables and add selective/covering indexes (see the sketch after this list).
- Trim SELECT lists; avoid accidental DISTINCT/ORDER BY.
- Use partitioning for time‑series.
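A covering-index sketch for the Seq Scan case above (table, columns, and index name are examples; CONCURRENTLY avoids blocking writes but cannot run inside a transaction block):

-- Hypothetical index: key columns match the filter/sort, INCLUDE covers the SELECT list
CREATE INDEX CONCURRENTLY idx_orders_customer_created
    ON orders (customer_id, created_at DESC)
    INCLUDE (status, total_amount);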
PgBouncer
Run PgBouncer close to Postgres. For web workloads use transaction pooling.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
auth_type = md5
pool_mode = transaction
max_client_conn = 4000
default_pool_size = 50
server_reset_query = DISCARD ALL
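Point the application at port 6432 instead of 5432; a quick smoke test (connection values are examples):

psql "host=127.0.0.1 port=6432 dbname=appdb user=app"

Keep in mind that with transaction pooling, session state (SET, server-side prepared statements, advisory locks) generally doesn’t survive across transactions; keep those out of the app or route such connections through session pooling.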
Vacuum, Bloat, Freezing
- Watch n_dead_tup and autovacuum logs (dead-tuple query below).
- Lower per‑table scale factors on hot tables; consider fillfactor < 100.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);
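To see which tables autovacuum is falling behind on, a query sketch against the standard pg_stat_user_tables view:

SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;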
WAL, Replication, HA
- Keep wal_level = replica; monitor replay lag (query below).
- Enable wal_compression = on for write‑heavy workloads.
- Use synchronous replication only when you must; expect higher latency.
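On the primary, a quick lag check using the standard pg_stat_replication view:

SELECT application_name, state, sync_state,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;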
Backups & Restore Drills
- Daily backups with at least 7‑day retention (base + WAL).
- Quarterly restore drills; keep an off‑platform copy.
Load Testing That Looks Like Prod
- Warm up caches (5–10 min), then hold a steady state for 15–30 min (pgbench sketch below).
- Track p50/p95/p99, CPU (user/system/steal), iowait, WAL/s, checkpoints.
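A pgbench sketch of the warm-up/steady-state pattern (scale factor and client counts are examples; the built-in TPC-B-like workload is only a rough stand-in, so replay your own statements with -f where you can):

pgbench -i -s 200 appdb                  # initialize test tables at scale factor 200
pgbench -c 32 -j 4 -T 600 -P 30 appdb    # 10 min warm-up, progress every 30 s
pgbench -c 32 -j 4 -T 1800 -P 30 appdb   # 30 min steady-state run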
Example Configs by VM Size
4 vCPU / 16 GB RAM / NVMe (OLTP)
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 32MB
maintenance_work_mem = 512MB
max_wal_size = 4GB
checkpoint_timeout = 15min
autovacuum_max_workers = 4
8 vCPU / 32 GB RAM / NVMe (OLTP mixed)
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 64MB
maintenance_work_mem = 1GB
max_wal_size = 8GB
checkpoint_timeout = 15min
autovacuum_max_workers = 5
effective_io_concurrency = 200
random_page_cost = 1.1
16 vCPU / 64 GB RAM / NVMe (heavier writes)
shared_buffers = 16GB
effective_cache_size = 48GB
work_mem = 96MB
maintenance_work_mem = 2GB
max_wal_size = 16GB
checkpoint_timeout = 15min          # consider up to 20min for heavy writes
checkpoint_completion_target = 0.9
autovacuum_max_workers = 6
effective_io_concurrency = 256
random_page_cost = 1.1
Quick Troubleshooting Map
- Latency spikes every 5–15 min → increase max_wal_size, set checkpoint_completion_target = 0.9, verify disk IOPS.
- High CPU, low QPS → too many connections thrashing; add PgBouncer; fix top queries.
- Many heap fetches / slow reads → missing covering indexes; verify the cache hit ratio.
- Autovacuum behind → lower per‑table scale factors; increase workers/cost limit.
Monitoring Essentials
- DB: TPS, active vs. idle-in-transaction sessions, locks, top queries, cache hit ratio (query below), WAL bytes/s, checkpoints, autovacuum activity.
- Host: CPU (user/system/steal), RAM, I/O latency/throughput, network.
- SLOs: p95/p99 of key queries or API endpoints.
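A quick cache-hit-ratio check across all databases (standard pg_stat_database; on an OLTP workload you generally want this to stay above roughly 0.99):

SELECT sum(blks_hit)::float
       / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;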
Wrap‑Up
PostgreSQL scales well on plain VMs when you keep the OS/FS simple, set a realistic memory split, manage WAL/checkpoints and autovacuum, pool connections, and fix the queries and indexes that dominate time.
Need a second pair of eyes? We can review your metrics and config, propose a right‑sized VM shape, or set up PgBouncer/monitoring as part of a managed engagement (DevOps from €35/h; onboarding may be free in some cases).
Next Steps
Ready to put this into practice? Start small, measure, and iterate.