Metrics & Monitoring
What Forgeon tracks for your databases, how to read the charts, and sensible alert recipes per engine.
Forgeon’s mdx-service collects runtime and engine-level metrics for every managed database. You’ll find charts, percentiles, and alert hooks in one place.
Where to find metrics
- Open Project → Databases → [your database] → Metrics.
- Pick a time range (last 15m, 1h, 24h, 7d).
- Toggle percentiles (p50, p90, p95, p99) for latency when available.
- Click a chart to inspect raw samples and series labels (env, project, instance).
Core metrics (all engines)
- CPU usage — instantaneous and smoothed; sustained > 80% suggests scaling or query tuning.
- Memory usage — RSS and engine caches; watch for steady climbs + OOM resets.
- Disk space & I/O — used %, read/write throughput, IOPS; alert at 85% used.
- Network — ingress/egress throughput; spikes often correlate with bulk loads.
- Connections / clients — current, peak, and limit; alert near 80% of max.
- Queries per second (QPS) — overall throughput; pair with latency percentiles.
- Errors — non-2xx or engine error counters; investigate sustained increases.
- Latency — p50/p95/p99 where the engine exposes it; rising tails signal contention.
If any chart is flat at zero, the engine may not expose that metric for your version/tier. You’ll still see core host metrics.
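If you pull samples out of a chart (or via the exporter, below) and want to sanity-check them against the thresholds above, here is a minimal sketch; the sample values and field names are placeholders, not a Forgeon API:

```python
# Hypothetical samples read off the Metrics page; thresholds mirror the
# guidance above (80% CPU, 85% disk, 80% of the connection limit).
samples = {
    "cpu_pct": 72.0,        # smoothed CPU usage
    "disk_used_pct": 86.5,  # % of provisioned disk used
    "conn_current": 190,    # current client connections
    "conn_max": 200,        # engine connection limit
}

def core_warnings(s: dict) -> list[str]:
    """Return a warning for each core threshold that is breached."""
    out = []
    if s["cpu_pct"] > 80:
        out.append(f"CPU at {s['cpu_pct']:.0f}% (> 80%): scale up or tune queries")
    if s["disk_used_pct"] > 85:
        out.append(f"disk at {s['disk_used_pct']:.0f}% (> 85%): provision more storage")
    if s["conn_current"] / s["conn_max"] > 0.80:
        out.append("connections above 80% of the limit: raise the limit or add pooling")
    return out

for warning in core_warnings(samples):
    print(warning)
```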
Engine-specific insights
Postgres
- Replication lag — bytes or seconds behind primary (WAL/LSN).
- Buffer cache hit ratio — should be very high (> 99% for read-heavy).
- Deadlocks & locks — count and wait time; frequent spikes → query contention.
- Checkpoints / autovacuum — frequency and duration; long autovacuums can throttle.
Alert ideas
- p95 query latency > 250 ms for 5 min (APIs), or > 1 s (batch).
- Replication lag > 30 s for 5 min.
- Connections > 80% of max_connections.
- Disk usage > 85%.
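To cross-check these numbers outside the dashboard, here is a minimal sketch using psycopg2. The DSN is a placeholder, and the lag query assumes it runs on a replica (it returns NULL on the primary):

```python
import psycopg2

# Placeholder DSN; use your Forgeon connection string.
conn = psycopg2.connect("postgresql://user:pass@host:5432/dbname")

with conn, conn.cursor() as cur:
    # Seconds behind the primary, measured on a replica.
    cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
    lag_seconds = cur.fetchone()[0]

    # Cluster-wide buffer cache hit ratio; read-heavy workloads should stay > 0.99.
    cur.execute("""
        SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0)
        FROM pg_stat_database
    """)
    hit_ratio = cur.fetchone()[0]

    # Connection headroom vs. max_connections (alert near 80%).
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    used = cur.fetchone()[0]
    cur.execute("SHOW max_connections")
    max_conn = int(cur.fetchone()[0])

print(f"replication lag: {lag_seconds}s, cache hit ratio: {hit_ratio:.4f}, "
      f"connections: {used}/{max_conn}")
```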
MySQL / MariaDB
- Threads connected and threads running.
- InnoDB buffer pool hit ratio (> 99% is ideal).
- Row lock time and deadlocks.
- Slow queries (if slow log is enabled).
Alert ideas
- p95 latency > 250 ms for 5 min.
- Buffer pool hit ratio < 95% for 10 min.
- Threads connected > 80% of max.
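The same counters come from SHOW GLOBAL STATUS; here is a sketch with PyMySQL (connection details are placeholders):

```python
import pymysql

# Placeholder credentials; use your Forgeon connection details.
conn = pymysql.connect(host="host", user="user", password="pass")

def global_status(cur, name: str) -> float:
    cur.execute("SHOW GLOBAL STATUS WHERE Variable_name = %s", (name,))
    return float(cur.fetchone()[1])

with conn.cursor() as cur:
    # Buffer pool hit ratio: 1 - (disk reads / read requests); > 0.99 is ideal.
    read_requests = global_status(cur, "Innodb_buffer_pool_read_requests")
    disk_reads = global_status(cur, "Innodb_buffer_pool_reads")
    hit_ratio = 1 - disk_reads / read_requests if read_requests else 0.0

    # Connection headroom vs. max_connections (alert near 80%).
    threads = global_status(cur, "Threads_connected")
    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    max_conn = float(cur.fetchone()[1])

print(f"buffer pool hit ratio: {hit_ratio:.4f}, "
      f"connections: {threads:.0f}/{max_conn:.0f}")
conn.close()
```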
Redis / Valkey
- Ops/sec and instantaneous ops/sec.
- Hit ratio, evicted keys, blocked clients.
- Used memory vs limit; fragmentation ratio.
Alert ideas
- Evictions > 0 for 3 min.
- Used memory > 85% of max.
- p95 latency > 10 ms (in-VPC), > 50 ms (cross-region).
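All of these figures are exposed by INFO; a minimal sketch with redis-py (host, port, and password are placeholders):

```python
import redis

# Placeholder connection details; use your Forgeon host, port, and password.
r = redis.Redis(host="host", port=6379, password="pass")
info = r.info()

# Hit ratio from keyspace hits/misses; evictions > 0 usually means memory pressure.
hits, misses = info["keyspace_hits"], info["keyspace_misses"]
hit_ratio = hits / (hits + misses) if hits + misses else 1.0
print(f"hit ratio: {hit_ratio:.4f}, evicted keys: {info['evicted_keys']}, "
      f"blocked clients: {info['blocked_clients']}")

# Memory vs. configured limit (alert above 85% of maxmemory).
used, limit = info["used_memory"], info["maxmemory"]
if limit:
    print(f"memory: {100 * used / limit:.0f}% of maxmemory")
else:
    print(f"memory: {used} bytes used (no maxmemory set)")
print(f"fragmentation ratio: {info['mem_fragmentation_ratio']}")
```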
ClickHouse
- Query duration percentiles.
- Insert throughput and parts per table.
- Background merges and rejected queries.
- Disk IO on data volumes.
Alert ideas
- Rejected queries > 0 for 2 min.
- Merges backlog growing for 10+ min.
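ClickHouse surfaces parts and merges in system tables; here is a sketch that queries them over the HTTP interface with requests (endpoint and credentials are placeholders):

```python
import requests

# Placeholder endpoint and credentials for the ClickHouse HTTP interface.
URL = "https://host:8443"
AUTH = ("user", "pass")

def query(sql: str) -> str:
    resp = requests.get(URL, params={"query": sql}, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.text.strip()

# Active parts per table; a steadily rising count suggests merges are falling behind.
print(query(
    "SELECT database, table, count() AS parts "
    "FROM system.parts WHERE active GROUP BY database, table "
    "ORDER BY parts DESC LIMIT 10 FORMAT PrettyCompact"
))

# Background merges currently running.
print(query("SELECT count() FROM system.merges"))
```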
Cassandra
- Read/write latency (p50/p95).
- Pending compactions, tombstones scanned.
- GC pauses and heap usage.
Alert ideas
- Pending compactions rising with latency.
- GC pause spikes > 200 ms, especially when they correlate with QPS.
QuestDB
- Ingest throughput (rows/sec).
- Commit lag and page cache usage.
- CPU IO wait under sustained ingestion.
OpenSearch
- Cluster health (green/yellow/red).
- Indexing/search latency.
- Shard counts and queue sizes.
- JVM heap (young/old gen).
Alert ideas
- Heap > 75% for 10 min.
- Search latency p95 > 500 ms.
- Cluster state ≠ green for 3 min.
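Cluster health and JVM heap are available straight from the REST API; a minimal sketch (endpoint and credentials are placeholders):

```python
import requests

# Placeholder endpoint and credentials for the OpenSearch REST API.
BASE = "https://host:9200"
AUTH = ("admin", "pass")

health = requests.get(f"{BASE}/_cluster/health", auth=AUTH, timeout=10).json()
print(f"cluster status: {health['status']}")  # green / yellow / red
print(f"active shards: {health['active_shards']}, "
      f"unassigned: {health['unassigned_shards']}")

# Per-node JVM heap; alert when heap stays above ~75%.
nodes = requests.get(f"{BASE}/_nodes/stats/jvm", auth=AUTH, timeout=10).json()
for node_id, node in nodes["nodes"].items():
    print(f"{node['name']}: heap {node['jvm']['mem']['heap_used_percent']}%")
```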
Qdrant
- Vectors count and segments.
- RAM usage per collection.
- Index rebuilds / compactions.
Alert ideas
- RAM usage near container limit.
- Index rebuild duration spikes.
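Collection-level counters can be read from Qdrant's REST API; a minimal sketch (URL, API key, and collection name are placeholders):

```python
import requests

# Placeholder endpoint, API key, and collection name.
BASE = "https://host:6333"
HEADERS = {"api-key": "your-api-key"}
COLLECTION = "my_collection"

resp = requests.get(f"{BASE}/collections/{COLLECTION}", headers=HEADERS, timeout=10)
resp.raise_for_status()
info = resp.json()["result"]

print(f"status: {info['status']}")            # green / yellow / red
print(f"points: {info['points_count']}")      # stored points (vectors + payloads)
print(f"segments: {info['segments_count']}")  # watch for runaway segment counts
```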
Trino / Dremio
- Queued / running / failed queries.
- Worker availability and task splits.
- Coordinator CPU and heap.
Alert ideas
- Queued queries > N for 5 min.
- Worker down events.
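For Trino, queued/running/failed counts and worker state live in the system.runtime tables; here is a sketch with the trino Python client (coordinator host and user are placeholders; Dremio exposes similar data through its own API):

```python
import trino

# Placeholder coordinator host and user.
conn = trino.dbapi.connect(host="host", port=8080, user="metrics")
cur = conn.cursor()

# Queries grouped by state (QUEUED / RUNNING / FINISHED / FAILED).
cur.execute("SELECT state, count(*) FROM system.runtime.queries GROUP BY state")
for state, n in cur.fetchall():
    print(f"{state}: {n}")

# Worker availability: the coordinator plus each active worker node.
cur.execute("SELECT node_id, coordinator, state FROM system.runtime.nodes")
for node_id, is_coordinator, state in cur.fetchall():
    print(f"{node_id} coordinator={is_coordinator} state={state}")
```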
FerretDB
- Connections and ops/sec by verb.
- Backend Postgres latency if proxied.
- Error counters (e.g., unsupported operations).
Recommended alerts (copy-ready)
Create alerts in Project → Databases → Alerts. Good starting points:
$alertdb.cpu.p95 > 80% for 10m
$alertdb.mem.used > 85% for 10m
$alertdb.disk.used_pct > 85% for 5m
$alertdb.conn.used_pct > 80% for 5m
$alertdb.latency.p95 > SLO for 5m
Then add engine-specific ones from the sections above.
Correlate with logs and deploys
- Open Runtime Logs alongside metrics to spot slow queries or memory spikes during deploys.
- For Postgres/MySQL, enable slow query logging in the database settings to capture offenders.
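Slow query logging is usually toggled from the database settings; if you prefer to set the Postgres threshold directly with SQL, here is a minimal sketch (the 250 ms value is an example, not a Forgeon default, and the role must be allowed to run ALTER SYSTEM):

```python
import psycopg2

# Placeholder DSN; requires a role permitted to run ALTER SYSTEM.
conn = psycopg2.connect("postgresql://user:pass@host:5432/dbname")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block

with conn.cursor() as cur:
    # Log any statement slower than 250 ms (example threshold).
    cur.execute("ALTER SYSTEM SET log_min_duration_statement = '250ms'")
    cur.execute("SELECT pg_reload_conf()")  # apply without a restart
```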
Retention & scraping
- Scrape interval: 15s (bursty engines may sample faster internally).
- Downsampled views: 1m, 5m, 1h.
- Retention: 14 days by default (raise in paid tiers).
Need longer retention or cross-project dashboards? Enable the Metrics Exporter in Settings to stream OpenMetrics to your Grafana/Prometheus.
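Prometheus can scrape that endpoint directly; if you want to spot-check it from code, here is a sketch that parses the OpenMetrics output with prometheus_client (the exporter URL is a placeholder):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Placeholder exporter URL; use the endpoint shown in Settings → Metrics Exporter.
EXPORTER_URL = "https://metrics.example.com/metrics"

body = requests.get(EXPORTER_URL, timeout=10).text

for family in text_string_to_metric_families(body):
    for sample in family.samples:
        # Each sample carries a name, labels (env, project, instance), and a value.
        print(sample.name, sample.labels, sample.value)
```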
Reading the tea leaves (how to interpret)
- Rising p95 latency with flat CPU → lock contention or IO waits (indexing, missing indexes, row locks).
- Rising CPU + QPS with steady latency → healthy scale-up scenario.
- Rising memory with flat QPS → leaks, cache growth, or unbounded buffers.
- Disk at ~85% and growing → provision more storage before autovacuum or index merges stall and turn it into an emergency.