Metrics & Monitoring

What Forgeon tracks for your databases, how to read the charts, and sensible alert recipes per engine.


Forgeon’s mdx-service collects runtime and engine-level metrics for every managed database. You’ll find charts, percentiles, and alert hooks in one place.

Where to find metrics

  • Open Project → Databases → [your database] → Metrics.
  • Pick a time range (last 15m, 1h, 24h, 7d).
  • Toggle percentiles (p50, p90, p95, p99) for latency when available.
  • Click a chart to inspect raw samples and series labels (env, project, instance).

Core metrics (all engines)

  • CPU usage — instantaneous and smoothed; sustained > 80% suggests scaling or query tuning.
  • Memory usage — RSS and engine caches; watch for steady climbs + OOM resets.
  • Disk space & I/O — used %, read/write throughput, IOPS; alert at 85% used.
  • Network — ingress/egress throughput; spikes often correlate with bulk loads.
  • Connections / clients — current, peak, and limit; alert near 80% of max.
  • Queries per second (QPS) — overall throughput; pair with latency percentiles.
  • Errors — non-2xx or engine error counters; investigate sustained increases.
  • Latency — p50/p95/p99 where the engine exposes it; rising tails signal contention.

If any chart is flat at zero, the engine may not expose that metric for your version/tier. You’ll still see core host metrics.
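
A percentile is just a rank in the sorted samples, so you can sanity-check the charted p50/p95/p99 against raw samples yourself. A minimal sketch in Python (the latency values are made up for illustration):

  # Sketch: reproduce latency percentiles from raw samples (values are illustrative).
  import statistics

  latencies_ms = [12, 14, 15, 18, 22, 25, 31, 40, 95, 240]  # samples from one scrape window

  # statistics.quantiles returns the 1st..99th percentile cut points when n=100.
  pcts = statistics.quantiles(latencies_ms, n=100)
  print("p50:", pcts[49], "p90:", pcts[89], "p95:", pcts[94], "p99:", pcts[98])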

Engine-specific insights

Postgres

  • Replication lag — bytes or seconds behind primary (WAL/LSN).
  • Buffer cache hit ratio — should be very high (> 99% for read-heavy workloads).
  • Deadlocks & locks — count and wait time; frequent spikes → query contention.
  • Checkpoints / autovacuum — frequency and duration; long autovacuum runs can throttle foreground queries.

Alert ideas

  • p95 query latency > 250 ms for 5 min (APIs), or > 1 s (batch).
  • Replication lag > 30 s for 5 min.
  • Connections > 80% of max_connections.
  • Disk usage > 85%.
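
To spot-check the Postgres numbers above against the charts, the same figures are available from the engine's system views. A minimal sketch using psycopg2 (the connection string is a placeholder for your Forgeon credentials):

  # Sketch: read replication lag and buffer cache hit ratio from Postgres system views.
  import psycopg2

  conn = psycopg2.connect("postgresql://user:pass@host:5432/dbname")  # placeholder DSN
  with conn.cursor() as cur:
      # Bytes each standby is behind (run this on the primary).
      cur.execute("""
          SELECT application_name,
                 pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
          FROM pg_stat_replication
      """)
      print(cur.fetchall())

      # Cluster-wide buffer cache hit ratio; read-heavy workloads should stay above ~0.99.
      cur.execute("""
          SELECT sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0)
          FROM pg_stat_database
      """)
      print(cur.fetchone())
  conn.close()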

MySQL / MariaDB

  • Threads connected and threads running.
  • InnoDB buffer pool hit ratio (> 99% is ideal).
  • Row lock time and deadlocks.
  • Slow queries (if slow log is enabled).

Alert ideas

  • p95 latency > 250 ms for 5 min.
  • Buffer pool hit ratio < 95% for 10 min.
  • Threads connected > 80% of max.
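
The buffer pool hit ratio is derived from two counters in SHOW GLOBAL STATUS, so you can verify the chart yourself. A minimal sketch with pymysql (host and credentials are placeholders):

  # Sketch: derive the InnoDB buffer pool hit ratio from SHOW GLOBAL STATUS.
  import pymysql

  conn = pymysql.connect(host="host", user="user", password="pass")  # placeholders
  with conn.cursor() as cur:
      cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'")
      status = {name: int(value) for name, value in cur.fetchall()}
  conn.close()

  # read_requests = logical reads; reads = requests that missed the pool and went to disk.
  requests = status["Innodb_buffer_pool_read_requests"]
  misses = status["Innodb_buffer_pool_reads"]
  print("buffer pool hit ratio:", 1 - misses / requests)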

Redis / Valkey

  • Ops/sec and instantaneous ops/sec.
  • Hit ratio, evicted keys, blocked clients.
  • Used memory vs limit; fragmentation ratio.

Alert ideas

  • Evictions > 0 for 3 min.
  • Used memory > 85% of max.
  • p95 latency > 10 ms (in-VPC), > 50 ms (cross-region).
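
Most of these Redis numbers come straight from INFO, which is handy when you want to check a chart against the live server. A minimal sketch with redis-py (host and password are placeholders):

  # Sketch: pull hit ratio, evictions, and memory counters from INFO.
  import redis

  r = redis.Redis(host="host", port=6379, password="pass")  # placeholders
  info = r.info()

  hits, misses = info["keyspace_hits"], info["keyspace_misses"]
  print("hit ratio:", hits / ((hits + misses) or 1))
  print("evicted keys:", info["evicted_keys"])
  print("used memory / maxmemory:", info["used_memory"], "/", info.get("maxmemory", 0))
  print("fragmentation ratio:", info["mem_fragmentation_ratio"])
  print("blocked clients:", info["blocked_clients"])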

ClickHouse

  • Query duration percentiles.
  • Insert throughput and parts per table.
  • Background merges and rejected queries.
  • Disk IO on data volumes.

Alert ideas

  • Rejected queries > 0 for 2 min.
  • Merges backlog growing for 10+ min.
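
Parts-per-table is the usual first thing to check when merges fall behind; ClickHouse exposes it in system.parts, reachable over the HTTP interface. A minimal sketch (URL and credentials are placeholders):

  # Sketch: list tables with the most active parts via ClickHouse's HTTP interface (port 8123).
  import requests

  query = """
      SELECT table, count() AS active_parts
      FROM system.parts
      WHERE active
      GROUP BY table
      ORDER BY active_parts DESC
      LIMIT 10
      FORMAT TSVWithNames
  """
  resp = requests.get("http://host:8123/", params={"query": query}, auth=("default", "pass"))
  print(resp.text)  # tables with many small parts are merge-backlog candidates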

Cassandra

  • Read/write latency (p50/p95).
  • Pending compactions, tombstones scanned.
  • GC pauses and heap usage.

Alert ideas

  • Pending compactions rising with latency.
  • GC pause spikes > 200 ms that correlate with QPS.

QuestDB

  • Ingest throughput (rows/sec).
  • Commit lag and page cache usage.
  • CPU IO wait under sustained ingestion.

OpenSearch

  • Cluster health (green/yellow/red).
  • Indexing/search latency.
  • Shard counts and queue sizes.
  • JVM heap (young/old gen).

Alert ideas

  • Heap > 75% for 10 min.
  • Search latency p95 > 500 ms.
  • Cluster state ≠ green for 3 min.
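
Cluster health and JVM heap are available from standard OpenSearch APIs if you want to confirm what the charts show. A minimal sketch (endpoint and credentials are placeholders; adjust TLS verification to your setup):

  # Sketch: read cluster status and per-node heap usage from the REST API.
  import requests

  base = "https://host:9200"   # placeholder endpoint
  auth = ("admin", "pass")     # placeholder credentials

  health = requests.get(f"{base}/_cluster/health", auth=auth).json()
  print("status:", health["status"], "| unassigned shards:", health["unassigned_shards"])

  stats = requests.get(f"{base}/_nodes/stats/jvm", auth=auth).json()
  for node in stats["nodes"].values():
      print(node["name"], "heap used %:", node["jvm"]["mem"]["heap_used_percent"])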

Qdrant

  • Vectors count and segments.
  • RAM usage per collection.
  • Index rebuilds / compactions.

Alert ideas

  • RAM usage near container limit.
  • Index rebuild duration spikes.
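
Qdrant reports per-collection counters over its REST API, which is an easy cross-check against the charts. A minimal sketch (URL and collection name are placeholders; exact field names vary by Qdrant version):

  # Sketch: fetch point/segment counts for one collection from Qdrant's REST API (port 6333).
  import requests

  resp = requests.get("http://host:6333/collections/my_collection")  # placeholders
  info = resp.json()["result"]
  print({k: info.get(k) for k in ("points_count", "segments_count", "status")})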

Trino / Dremio

  • Queued / running / failed queries.
  • Worker availability and task splits.
  • Coordinator CPU and heap.

Alert ideas

  • Queued queries > N for 5 min.
  • Worker down events.
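
For Trino, the queued/running/failed series map onto query states in system.runtime.queries (Dremio has its own system tables). A minimal sketch with the trino Python client (connection details are placeholders):

  # Sketch: count Trino queries per state, e.g. QUEUED / RUNNING / FAILED.
  from trino.dbapi import connect

  conn = connect(host="host", port=8080, user="monitor")  # placeholders
  cur = conn.cursor()
  cur.execute("SELECT state, count(*) FROM system.runtime.queries GROUP BY state")
  print(cur.fetchall())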

FerretDB

  • Connections and ops/sec by verb.
  • Backend Postgres latency (when FerretDB is backed by Postgres).
  • Errors (unsupported ops) counters.

Alerts

Create alerts in Project → Databases → Alerts. Good starting points:

Safe defaults

  db.cpu.p95 > 80% for 10m
  db.mem.used > 85% for 10m
  db.disk.used_pct > 85% for 5m
  db.conn.used_pct > 80% for 5m
  db.latency.p95 > SLO for 5m

Then add engine-specific ones from the sections above.

Correlate with logs and deploys

  • Open Runtime Logs alongside metrics to spot slow queries or memory spikes during deploys.
  • For Postgres/MySQL, enable slow query logging in the database settings to capture offenders.
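
In Postgres, the relevant parameter is log_min_duration_statement. If your plan allows direct parameter changes, a minimal sketch looks like this (the 250 ms threshold is an example; on managed instances the settings UI is the safer route):

  # Sketch: log any Postgres statement slower than 250 ms (requires sufficient privileges).
  import psycopg2

  conn = psycopg2.connect("postgresql://user:pass@host:5432/dbname")  # placeholder DSN
  conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
  with conn.cursor() as cur:
      cur.execute("ALTER SYSTEM SET log_min_duration_statement = '250ms'")
      cur.execute("SELECT pg_reload_conf()")
  conn.close()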

Retention & scraping

  • Scrape interval: 15s (bursty engines may sample faster internally).
  • Downsampled views: 1m, 5m, 1h.
  • Retention: 14 days by default (raise in paid tiers).

Need longer retention or cross-project dashboards? Enable the Metrics Exporter in Settings to stream OpenMetrics to your Grafana/Prometheus.
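
Once the exporter is on, the feed is plain OpenMetrics text, so anything that speaks Prometheus can consume it. A minimal sketch that reads it directly (the endpoint URL and metric names are placeholders that depend on your exporter configuration):

  # Sketch: scrape the exporter endpoint and print latency-related series.
  import requests
  from prometheus_client.parser import text_string_to_metric_families

  body = requests.get("https://metrics.example.com/openmetrics", timeout=10).text  # placeholder URL
  for family in text_string_to_metric_families(body):
      for sample in family.samples:
          if "latency" in sample.name:
              print(sample.name, sample.labels, sample.value)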

Reading the tea leaves (how to interpret)

  • Rising p95 latency with flat CPU → lock contention or IO waits (indexing, missing indexes, row locks).
  • Rising CPU + QPS with steady latency → healthy scale-up scenario.
  • Rising memory with flat QPS → leaks, cache growth, or unbounded buffers.
  • Disk ~85% and growing → provision more storage before emergency autovacuum or index merges stall for lack of space.

Next steps

  • Tune & alert