Metrics & Monitoring

Monitor your running services—CPU, memory, requests, latency percentiles, errors, and language-specific signals—with alerting and autoscaling hooks.


Forgeon tracks your service instances in real time: CPU, memory, disk, network, requests/second, latency percentiles, and error rate—plus language-specific hints (Node event-loop lag, Go GC, JVM pauses, etc.). Use these charts to debug deploys, size resources, and trigger alerts or autoscaling.

Where to find it

  • Project → Services → [web/worker/cron] → Metrics
  • Pick a time range (15m · 1h · 24h · 7d)
  • Toggle p50/p90/p95/p99 for latency where available
  • Click a chart to inspect samples and filter by instance or region

Core runtime signals (all services)

  • CPU usage — instantaneous & averaged. Sustained > 80% under load → scale up/out or tune hot paths.
  • Memory usage — RSS/working set. Steady climbs without drops → leaks or unbounded caches.
  • Disk — used %, read/write throughput, IOPS. Alert at 85% used to avoid noisy evictions.
  • Network — in/out throughput & connections. Pair spikes with request charts.
  • Requests/second (RPS/QPS) — per instance and aggregated.
  • Latency — p50/p95/p99. Rising tails (p95/p99) signal contention or saturation.
  • Error rate — 4xx/5xx or task failures. Spikes after a deploy? Rollback candidates.
  • Health checks — readiness/liveness results; flapping indicates overload or slow startups.
  • Restart count — crash loops or OOM kills; correlate with memory spikes.

If a chart is flat at zero, the runtime may not expose that metric for your language/version. Core host metrics still appear.

Language-specific insights

Node.js

  • Event-loop lag — if it sits above 100–200 ms during traffic, something is blocking the loop (synchronous work or large JSON parsing/serialization); see the measurement sketch after this list.
  • Heap used & GC pauses — frequent long pauses + high heap → revisit allocations or enable pooling.
  • Open handles — creeping count after requests → leaks (sockets, timers, file descriptors).
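
A minimal sketch (plain Node.js/TypeScript, no dependencies) of sampling event-loop lag and heap usage in-process to cross-check these charts; the 10-second interval and log format are illustrative, not anything Forgeon requires:

  import { monitorEventLoopDelay } from "node:perf_hooks";

  // Track event-loop delay; the histogram reports nanoseconds.
  const loopDelay = monitorEventLoopDelay({ resolution: 20 });
  loopDelay.enable();

  setInterval(() => {
    const lagP95Ms = loopDelay.percentile(95) / 1e6;             // ns -> ms
    const heapUsedMb = process.memoryUsage().heapUsed / 2 ** 20; // bytes -> MiB
    console.log(`event_loop_lag_p95_ms=${lagP95Ms.toFixed(1)} heap_used_mb=${heapUsedMb.toFixed(1)}`);
    loopDelay.reset();
  }, 10_000).unref(); // unref: sampling never keeps the process alive

If the logged p95 regularly exceeds the event-loop alert threshold below, look for synchronous work on the request path.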

Alert ideas

  • p95 latency > 350 ms for 5m
  • Event-loop lag p95 > 150 ms for 5m
  • RSS > 85% of memory limit for 10m

Go

  • GC pause (stop-the-world) and live heap — a sawtooth is healthy; a rising baseline → leaks.
  • Goroutines — sudden growth with stable RPS may indicate starvation/unbounded work.

Alert ideas

  • p95 latency > 300 ms for 5m
  • GC pause p95 > 40 ms for 5m
  • RSS > 85% limit for 10m

Python (Django/Flask/FastAPI)

  • Worker concurrency (Gunicorn/Uvicorn) vs CPU usage. Too many sync workers increase context switching.
  • Queue wait (if the ASGI server exposes it) — rising wait → add workers or profile handlers.

Alert ideas

  • p95 latency > 400 ms for 5m
  • Error rate > 2% for 3m
  • RSS > 85% limit for 10m

JVM (Spring, etc.)

  • GC (G1/ZGC) pauses & Old Gen occupancy; long pauses or promotion failures → tune heap sizing/GC settings.
  • Thread count & blocked threads.

Alert ideas

  • GC pause p95 > 80 ms for 5m
  • Old Gen > 85% for 10m
  • p99 latency > 1 s for 5m

Workers / Cron

  • Job duration percentiles, success/failure counts, queue backlog (if emitted) — see the instrumentation sketch after this list if your worker doesn't emit these yet.
  • Next run and schedule drift.
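
A sketch (Node.js/TypeScript, assuming the prom-client npm package) of recording job duration and outcomes; the metric names and buckets are hypothetical, and how the values reach your dashboards (a scrape endpoint, your own Prometheus, etc.) depends on your setup:

  import { Counter, Histogram } from "prom-client";

  // Hypothetical metric names; tune the buckets to your job SLO.
  const jobDuration = new Histogram({
    name: "worker_job_duration_seconds",
    help: "Job duration in seconds",
    buckets: [0.1, 0.5, 1, 5, 30, 120],
  });
  const jobResults = new Counter({
    name: "worker_job_results_total",
    help: "Completed jobs by outcome",
    labelNames: ["status"],
  });

  export async function runInstrumented(job: () => Promise<void>): Promise<void> {
    const stopTimer = jobDuration.startTimer(); // observes elapsed seconds when called
    try {
      await job();
      jobResults.inc({ status: "success" });
    } catch (err) {
      jobResults.inc({ status: "failure" });
      throw err; // let the scheduler still see the failure
    } finally {
      stopTimer();
    }
  }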

Alert ideas

  • Failure rate > 1% for 10m
  • p95 job duration > SLO for 10m
  • Backlog age rising for 10m

Autoscaling signals

You can scale per service using one or more signals:

  • CPU target — e.g., keep CPU ~60% across instances.
  • Memory target — e.g., keep RSS < 75%.
  • RPS target — e.g., 200 req/instance.
  • Latency target — e.g., p95 < 300 ms.
  • Custom — expose a counter/gauge and point the autoscaler at it (see the sketch below).

Start with CPU 60% and a latency ceiling; add memory guardrails for languages with big heaps.
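
For the custom signal, a minimal sketch (Node.js/TypeScript with the prom-client package) of publishing a backlog gauge over a plain /metrics endpoint; the metric name, port, and how the autoscaler is pointed at it are assumptions to adapt to your service:

  import { createServer } from "node:http";
  import { Gauge, collectDefaultMetrics, register } from "prom-client";

  collectDefaultMetrics(); // optional: process CPU, heap, and event-loop metrics

  // Hypothetical gauge a custom autoscaling rule could target.
  const backlogDepth = new Gauge({
    name: "queue_backlog_depth",
    help: "Jobs waiting to be processed",
  });

  // Placeholder; replace with your real queue client.
  async function getQueueDepth(): Promise<number> {
    return 0;
  }

  // Refresh the gauge periodically without keeping the process alive for it.
  setInterval(async () => backlogDepth.set(await getQueueDepth()), 10_000).unref();

  // Serve all registered metrics in Prometheus text format on an assumed port.
  createServer(async (_req, res) => {
    res.setHeader("Content-Type", register.contentType);
    res.end(await register.metrics());
  }).listen(9100);

A gauge fits autoscaling better than a counter here: the scaler needs the current backlog level, not a cumulative total.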

Create alerts in Service → Alerts.

Safe defaults

  alert runtime.cpu.p95 > 80% for 10m
  alert runtime.mem.used_pct > 85% for 10m
  alert http.latency.p95 > 300ms for 5m
  alert http.error_rate > 2% for 3m
  alert deploy.healthcheck.failures > 0 for 5m

Correlate with logs & deploys

  • Open Runtime Logs beside metrics to catch the exact error around a spike.
  • Pin a deployment; charts will highlight the before/after window.
  • If p95 doubles after a deploy, capture logs + metrics and consider Rollback.

Health checks that work

  • Add a /readyz endpoint that returns 200 only when dependencies are live (DB, cache, migrations done).
  • Keep /healthz lightweight for liveness (always 200 unless the process is truly broken).
  • Set a grace period long enough for your framework’s cold start.

Example: readiness contract

  /readyz  → 200 only when the app is ready to serve
  /healthz → 200 if the process is alive
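
A minimal sketch of that contract in Node.js/TypeScript with no framework; the dependency probes are placeholders to replace with your real DB/cache checks:

  import { createServer } from "node:http";

  // Placeholder readiness probe: swap in real checks (DB ping, cache ping,
  // "migrations applied" flag) and keep each one on a short timeout.
  async function dependenciesReady(): Promise<boolean> {
    try {
      // e.g. await db.query("SELECT 1"); await cache.ping();
      return true;
    } catch {
      return false;
    }
  }

  createServer(async (req, res) => {
    if (req.url === "/healthz") {
      res.writeHead(200).end("ok"); // liveness: cheap, no dependency calls
    } else if (req.url === "/readyz") {
      const ready = await dependenciesReady();
      res.writeHead(ready ? 200 : 503).end(); // readiness: fail until deps are live
    } else {
      res.writeHead(404).end();
    }
  }).listen(Number(process.env.PORT ?? 8080));

Keeping /healthz free of dependency calls avoids restart loops when a downstream service blips; /readyz is the endpoint that should flip to 503.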

Retention & export

  • Scrape interval: 15s (with engine-specific rollups)
  • Views: 1m, 5m, 1h downsampling
  • Retention: 7–30 days depending on plan (see Plans & Pricing)
  • Export to your own Prometheus/Grafana via Metrics Exporter.

Need longer retention or org-wide dashboards? Enable Metrics Exporter in Settings and stream OpenMetrics.

Reading the charts (quick heuristics)

  • p95 ↑ while CPU flat → lock contention, blocking work, or IO waits.
  • CPU ↑ with latency flat → healthy scaling opportunity.
  • Memory ↑ steadily → leak; watch restart/GC behavior.
  • Error ↑ after deploy → misconfig, DB migrations, or env differences—consider rollback.