Metrics & Monitoring

What Forgeon tracks for your databases, how to read the charts, and sensible alert recipes per engine.


Forgeon’s mdx-service collects runtime and engine-level metrics for every managed database. You’ll find charts, percentiles, and alert hooks in one place.

Where to find metrics

  • Open Project → Databases → [your database] → Metrics.
  • Pick a time range (last 15m, 1h, 24h, 7d).
  • Toggle percentiles (p50, p90, p95, p99) for latency when available.
  • Click a chart to inspect raw samples and series labels (env, project, instance).

Core metrics (all engines)

  • CPU usage — instantaneous and smoothed; sustained > 80% suggests scaling or query tuning.
  • Memory usage — RSS and engine caches; watch for steady climbs + OOM resets.
  • Disk space & I/O — used %, read/write throughput, IOPS; alert at 85% used.
  • Network — ingress/egress throughput; spikes often correlate with bulk loads.
  • Connections / clients — current, peak, and limit; alert near 80% of max.
  • Queries per second (QPS) — overall throughput; pair with latency percentiles.
  • Errors — non-2xx or engine error counters; investigate sustained increases.
  • Latency — p50/p95/p99 where the engine exposes it; rising tails signal contention.

If any chart is flat at zero, the engine may not expose that metric for your version/tier. You’ll still see core host metrics.
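
A percentile is just a rank in the sorted samples, so you can sanity-check the charted p50/p95/p99 against raw samples yourself. A minimal sketch in Python (the latency values are made up for illustration):

  # Sketch: reproduce latency percentiles from raw samples (values are illustrative).
  import statistics

  latencies_ms = [12, 14, 15, 18, 22, 25, 31, 40, 95, 240]  # samples from one scrape window

  # statistics.quantiles returns the 1st..99th percentile cut points when n=100.
  pcts = statistics.quantiles(latencies_ms, n=100)
  print("p50:", pcts[49], "p90:", pcts[89], "p95:", pcts[94], "p99:", pcts[98])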

Engine-specific insights

Postgres

  • Replication lag — bytes or seconds behind primary (WAL/LSN).
  • Buffer cache hit ratio — should be very high (> 99% for read-heavy workloads).
  • Deadlocks & locks — count and wait time; frequent spikes → query contention.
  • Checkpoints / autovacuum — frequency and duration; long autovacuum runs can throttle foreground queries.

Alert ideas

  • p95 query latency > 250 ms for 5 min (APIs), or > 1 s (batch).
  • Replication lag > 30 s for 5 min.
  • Connections > 80% of max_connections.
  • Disk usage > 85%.
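
To spot-check the Postgres numbers above against the charts, the same figures are available from the engine's system views. A minimal sketch using psycopg2 (the connection string is a placeholder for your Forgeon credentials):

  # Sketch: read replication lag and buffer cache hit ratio from Postgres system views.
  import psycopg2

  conn = psycopg2.connect("postgresql://user:pass@host:5432/dbname")  # placeholder DSN
  with conn.cursor() as cur:
      # Bytes each standby is behind (run this on the primary).
      cur.execute("""
          SELECT application_name,
                 pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
          FROM pg_stat_replication
      """)
      print(cur.fetchall())

      # Cluster-wide buffer cache hit ratio; read-heavy workloads should stay above ~0.99.
      cur.execute("""
          SELECT sum(blks_hit)::float / nullif(sum(blks_hit) + sum(blks_read), 0)
          FROM pg_stat_database
      """)
      print(cur.fetchone())
  conn.close()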

MySQL / MariaDB

  • Threads connected and threads running.
  • InnoDB buffer pool hit ratio (> 99% is ideal).
  • Row lock time and deadlocks.
  • Slow queries (if slow log is enabled).

Alert ideas

  • p95 latency > 250 ms for 5 min.
  • Buffer pool hit ratio < 95% for 10 min.
  • Threads connected > 80% of max.
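
The buffer pool hit ratio is derived from two counters in SHOW GLOBAL STATUS, so you can verify the chart yourself. A minimal sketch with pymysql (host and credentials are placeholders):

  # Sketch: derive the InnoDB buffer pool hit ratio from SHOW GLOBAL STATUS.
  import pymysql

  conn = pymysql.connect(host="host", user="user", password="pass")  # placeholders
  with conn.cursor() as cur:
      cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'")
      status = {name: int(value) for name, value in cur.fetchall()}
  conn.close()

  # read_requests = logical reads; reads = requests that missed the pool and went to disk.
  requests = status["Innodb_buffer_pool_read_requests"]
  misses = status["Innodb_buffer_pool_reads"]
  print("buffer pool hit ratio:", 1 - misses / requests)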

Redis / Valkey

  • Ops/sec and instantaneous ops/sec.
  • Hit ratio, evicted keys, blocked clients.
  • Used memory vs limit; fragmentation ratio.

Alert ideas

  • Evictions > 0 for 3 min.
  • Used memory > 85% of max.
  • p95 latency > 10 ms (in-VPC), > 50 ms (cross-region).
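
Most of these Redis numbers come straight from INFO, which is handy when you want to check a chart against the live server. A minimal sketch with redis-py (host and password are placeholders):

  # Sketch: pull hit ratio, evictions, and memory counters from INFO.
  import redis

  r = redis.Redis(host="host", port=6379, password="pass")  # placeholders
  info = r.info()

  hits, misses = info["keyspace_hits"], info["keyspace_misses"]
  print("hit ratio:", hits / ((hits + misses) or 1))
  print("evicted keys:", info["evicted_keys"])
  print("used memory / maxmemory:", info["used_memory"], "/", info.get("maxmemory", 0))
  print("fragmentation ratio:", info["mem_fragmentation_ratio"])
  print("blocked clients:", info["blocked_clients"])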

ClickHouse

  • Query duration percentiles.
  • Insert throughput and parts per table.
  • Background merges and rejected queries.
  • Disk IO on data volumes.

Alert ideas

  • Rejected queries > 0 for 2 min.
  • Merges backlog growing for 10+ min.
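
Parts-per-table is the usual first thing to check when merges fall behind; ClickHouse exposes it in system.parts, reachable over the HTTP interface. A minimal sketch (URL and credentials are placeholders):

  # Sketch: list tables with the most active parts via ClickHouse's HTTP interface (port 8123).
  import requests

  query = """
      SELECT table, count() AS active_parts
      FROM system.parts
      WHERE active
      GROUP BY table
      ORDER BY active_parts DESC
      LIMIT 10
      FORMAT TSVWithNames
  """
  resp = requests.get("http://host:8123/", params={"query": query}, auth=("default", "pass"))
  print(resp.text)  # tables with many small parts are merge-backlog candidates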

Cassandra

  • Read/write latency (p50/p95).
  • Pending compactions, tombstones scanned.
  • GC pauses and heap usage.

Alert ideas

  • Pending compactions rising with latency.
  • GC pause spikes > 200 ms that correlate with QPS.

QuestDB

  • Ingest throughput (rows/sec).
  • Commit lag and page cache usage.
  • CPU IO wait under sustained ingestion.

OpenSearch

  • Cluster health (green/yellow/red).
  • Indexing/search latency.
  • Shard counts and queue sizes.
  • JVM heap (young/old gen).

Alert ideas

  • Heap > 75% for 10 min.
  • Search latency p95 > 500 ms.
  • Cluster state ≠ green for 3 min.
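
Cluster health and JVM heap are available from standard OpenSearch APIs if you want to confirm what the charts show. A minimal sketch (endpoint and credentials are placeholders; adjust TLS verification to your setup):

  # Sketch: read cluster status and per-node heap usage from the REST API.
  import requests

  base = "https://host:9200"   # placeholder endpoint
  auth = ("admin", "pass")     # placeholder credentials

  health = requests.get(f"{base}/_cluster/health", auth=auth).json()
  print("status:", health["status"], "| unassigned shards:", health["unassigned_shards"])

  stats = requests.get(f"{base}/_nodes/stats/jvm", auth=auth).json()
  for node in stats["nodes"].values():
      print(node["name"], "heap used %:", node["jvm"]["mem"]["heap_used_percent"])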

Qdrant

  • Vectors count and segments.
  • RAM usage per collection.
  • Index rebuilds / compactions.

Alert ideas

  • RAM usage near container limit.
  • Index rebuild duration spikes.
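
Qdrant reports per-collection counters over its REST API, which is an easy cross-check against the charts. A minimal sketch (URL and collection name are placeholders; exact field names vary by Qdrant version):

  # Sketch: fetch point/segment counts for one collection from Qdrant's REST API (port 6333).
  import requests

  resp = requests.get("http://host:6333/collections/my_collection")  # placeholders
  info = resp.json()["result"]
  print({k: info.get(k) for k in ("points_count", "segments_count", "status")})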

Trino / Dremio

  • Queued / running / failed queries.
  • Worker availability and task splits.
  • Coordinator CPU and heap.

Alert ideas

  • Queued queries > N for 5 min.
  • Worker down events.
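
For Trino, the queued/running/failed series map onto query states in system.runtime.queries (Dremio has its own system tables). A minimal sketch with the trino Python client (connection details are placeholders):

  # Sketch: count Trino queries per state, e.g. QUEUED / RUNNING / FAILED.
  from trino.dbapi import connect

  conn = connect(host="host", port=8080, user="monitor")  # placeholders
  cur = conn.cursor()
  cur.execute("SELECT state, count(*) FROM system.runtime.queries GROUP BY state")
  print(cur.fetchall())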

FerretDB

  • Connections and ops/sec by verb.
  • Backend Postgres latency (when FerretDB is backed by Postgres).
  • Errors (unsupported ops) counters.

Alerts

Create alerts in Project → Databases → Alerts. Good starting points:

Safe defaults

  db.cpu.p95 > 80% for 10m
  db.mem.used > 85% for 10m
  db.disk.used_pct > 85% for 5m
  db.conn.used_pct > 80% for 5m
  db.latency.p95 > SLO for 5m

Then add engine-specific ones from the sections above.

Correlate with logs and deploys

  • Open Runtime Logs alongside metrics to spot slow queries or memory spikes during deploys.
  • For Postgres/MySQL, enable slow query logging in the database settings to capture offenders.
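
In Postgres, the relevant parameter is log_min_duration_statement. If your plan allows direct parameter changes, a minimal sketch looks like this (the 250 ms threshold is an example; on managed instances the settings UI is the safer route):

  # Sketch: log any Postgres statement slower than 250 ms (requires sufficient privileges).
  import psycopg2

  conn = psycopg2.connect("postgresql://user:pass@host:5432/dbname")  # placeholder DSN
  conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
  with conn.cursor() as cur:
      cur.execute("ALTER SYSTEM SET log_min_duration_statement = '250ms'")
      cur.execute("SELECT pg_reload_conf()")
  conn.close()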

Retention & scraping

  • Scrape interval: 15s (bursty engines may sample faster internally).
  • Downsampled views: 1m, 5m, 1h.
  • Retention: 14 days by default (raise in paid tiers).

Need longer retention or cross-project dashboards? Enable the Metrics Exporter in Settings to stream OpenMetrics to your Grafana/Prometheus.
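
Once the exporter is on, the feed is plain OpenMetrics text, so anything that speaks Prometheus can consume it. A minimal sketch that reads it directly (the endpoint URL and metric names are placeholders that depend on your exporter configuration):

  # Sketch: scrape the exporter endpoint and print latency-related series.
  import requests
  from prometheus_client.parser import text_string_to_metric_families

  body = requests.get("https://metrics.example.com/openmetrics", timeout=10).text  # placeholder URL
  for family in text_string_to_metric_families(body):
      for sample in family.samples:
          if "latency" in sample.name:
              print(sample.name, sample.labels, sample.value)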

Reading the tea leaves (how to interpret)

  • Rising p95 latency with flat CPU → lock contention or IO waits (indexing, missing indexes, row locks).
  • Rising CPU + QPS with steady latency → healthy scale-up scenario.
  • Rising memory with flat QPS → leaks, cache growth, or unbounded buffers.
  • Disk ~85% and growing → provision more storage before emergency autovacuum or index merges stall for lack of space.

Next steps

  • Tune & alert