Metrics & Monitoring

Monitor your running services—CPU, memory, requests, latency percentiles, errors, and language-specific signals—with alerting and autoscaling hooks.


Forgeon tracks your service instances in real time: CPU, memory, disk, network, requests/second, latency percentiles, and error rate—plus language-specific hints (Node event-loop lag, Go GC, JVM pauses, etc.). Use these charts to debug deploys, size resources, and trigger alerts or autoscaling.

Where to find it

  • Project → Services → [web/worker/cron] → Metrics
  • Pick a time range (15m · 1h · 24h · 7d)
  • Toggle p50/p90/p95/p99 for latency where available
  • Click a chart to inspect samples and filter by instance or region

Core runtime signals (all services)

  • CPU usage — instantaneous & averaged. Sustained > 80% under load → scale up/out or tune hot paths.
  • Memory usage — RSS/working set. Steady climbs without drops → leaks or unbounded caches.
  • Disk — used %, read/write throughput, IOPS. Alert at 85% used to avoid noisy evictions.
  • Network — in/out throughput & connections. Pair spikes with request charts.
  • Requests/second (RPS/QPS) — per instance and aggregated.
  • Latency — p50/p95/p99. Rising tails (p95/p99) signal contention or saturation.
  • Error rate — 4xx/5xx or task failures. Spikes after a deploy? Rollback candidates.
  • Health checks — readiness/liveness results; flapping indicates overload or slow startups.
  • Restart count — crash loops or OOM kills; correlate with memory spikes.

If a chart is flat at zero, the runtime may not expose that metric for your language/version. Core host metrics still appear.

Language-specific insights

Node.js

  • Event-loop lag — if it sits above 100–200 ms during traffic, something is blocking the loop (synchronous work or large JSON parsing/serialization); see the measurement sketch after this list.
  • Heap used & GC pauses — frequent long pauses + high heap → revisit allocations or enable pooling.
  • Open handles — creeping count after requests → leaks (sockets, timers, file descriptors).
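
A minimal sketch (plain Node.js/TypeScript, no dependencies) of sampling event-loop lag and heap usage in-process to cross-check these charts; the 10-second interval and log format are illustrative, not anything Forgeon requires:

  import { monitorEventLoopDelay } from "node:perf_hooks";

  // Track event-loop delay; the histogram reports nanoseconds.
  const loopDelay = monitorEventLoopDelay({ resolution: 20 });
  loopDelay.enable();

  setInterval(() => {
    const lagP95Ms = loopDelay.percentile(95) / 1e6;             // ns -> ms
    const heapUsedMb = process.memoryUsage().heapUsed / 2 ** 20; // bytes -> MiB
    console.log(`event_loop_lag_p95_ms=${lagP95Ms.toFixed(1)} heap_used_mb=${heapUsedMb.toFixed(1)}`);
    loopDelay.reset();
  }, 10_000).unref(); // unref: sampling never keeps the process alive

If the logged p95 regularly exceeds the event-loop alert threshold below, look for synchronous work on the request path.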

Alert ideas

  • p95 latency > 350 ms for 5m
  • Event-loop lag p95 > 150 ms for 5m
  • RSS > 85% of memory limit for 10m

Go

  • GC pause (stop-the-world) and live heap — a sawtooth is healthy; a rising baseline → leaks.
  • Goroutines — sudden growth with stable RPS may indicate starvation/unbounded work.

Alert ideas

  • p95 latency > 300 ms for 5m
  • GC pause p95 > 40 ms for 5m
  • RSS > 85% limit for 10m

Python (Django/Flask/FastAPI)

  • Worker concurrency (Gunicorn/Uvicorn) vs CPU usage. Too many sync workers increase context switching.
  • Queue wait (if the ASGI server exposes it) — rising wait → add workers or profile handlers.

Alert ideas

  • p95 latency > 400 ms for 5m
  • Error rate > 2% for 3m
  • RSS > 85% limit for 10m

JVM (Spring, etc.)

  • GC (G1/ZGC) pauses & Old Gen occupancy; long pauses or promotion failures → tune heap sizing/GC settings.
  • Thread count & blocked threads.

Alert ideas

  • GC pause p95 > 80 ms for 5m
  • Old Gen > 85% for 10m
  • p99 latency > 1 s for 5m

Workers / Cron

  • Job duration percentiles, success/failure counts, queue backlog (if emitted) — see the instrumentation sketch after this list if your worker doesn't emit these yet.
  • Next run and schedule drift.
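
A sketch (Node.js/TypeScript, assuming the prom-client npm package) of recording job duration and outcomes; the metric names and buckets are hypothetical, and how the values reach your dashboards (a scrape endpoint, your own Prometheus, etc.) depends on your setup:

  import { Counter, Histogram } from "prom-client";

  // Hypothetical metric names; tune the buckets to your job SLO.
  const jobDuration = new Histogram({
    name: "worker_job_duration_seconds",
    help: "Job duration in seconds",
    buckets: [0.1, 0.5, 1, 5, 30, 120],
  });
  const jobResults = new Counter({
    name: "worker_job_results_total",
    help: "Completed jobs by outcome",
    labelNames: ["status"],
  });

  export async function runInstrumented(job: () => Promise<void>): Promise<void> {
    const stopTimer = jobDuration.startTimer(); // observes elapsed seconds when called
    try {
      await job();
      jobResults.inc({ status: "success" });
    } catch (err) {
      jobResults.inc({ status: "failure" });
      throw err; // let the scheduler still see the failure
    } finally {
      stopTimer();
    }
  }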

Alert ideas

  • Failure rate > 1% for 10m
  • p95 job duration > SLO for 10m
  • Backlog age rising for 10m

Autoscaling signals

You can scale per service using one or more signals:

  • CPU target — e.g., keep CPU ~60% across instances.
  • Memory target — e.g., keep RSS < 75%.
  • RPS target — e.g., 200 req/instance.
  • Latency target — e.g., p95 < 300 ms.
  • Custom — expose a counter/gauge and point the autoscaler at it (see the sketch below).

Start with CPU 60% and a latency ceiling; add memory guardrails for languages with big heaps.
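
For the custom signal, a minimal sketch (Node.js/TypeScript with the prom-client package) of publishing a backlog gauge over a plain /metrics endpoint; the metric name, port, and how the autoscaler is pointed at it are assumptions to adapt to your service:

  import { createServer } from "node:http";
  import { Gauge, collectDefaultMetrics, register } from "prom-client";

  collectDefaultMetrics(); // optional: process CPU, heap, and event-loop metrics

  // Hypothetical gauge a custom autoscaling rule could target.
  const backlogDepth = new Gauge({
    name: "queue_backlog_depth",
    help: "Jobs waiting to be processed",
  });

  // Placeholder; replace with your real queue client.
  async function getQueueDepth(): Promise<number> {
    return 0;
  }

  // Refresh the gauge periodically without keeping the process alive for it.
  setInterval(async () => backlogDepth.set(await getQueueDepth()), 10_000).unref();

  // Serve all registered metrics in Prometheus text format on an assumed port.
  createServer(async (_req, res) => {
    res.setHeader("Content-Type", register.contentType);
    res.end(await register.metrics());
  }).listen(9100);

A gauge fits autoscaling better than a counter here: the scaler needs the current backlog level, not a cumulative total.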

Create alerts in Service → Alerts.

Safe defaults

  alert runtime.cpu.p95 > 80% for 10m
  alert runtime.mem.used_pct > 85% for 10m
  alert http.latency.p95 > 300ms for 5m
  alert http.error_rate > 2% for 3m
  alert deploy.healthcheck.failures > 0 for 5m

Correlate with logs & deploys

  • Open Runtime Logs beside metrics to catch the exact error around a spike.
  • Pin a deployment; charts will highlight the before/after window.
  • If p95 doubles after a deploy, capture logs + metrics and consider Rollback.

Health checks that work

  • Add a /readyz endpoint that returns 200 only when dependencies are live (DB, cache, migrations done).
  • Keep /healthz lightweight for liveness (always 200 unless the process is truly broken).
  • Set a grace period long enough for your framework’s cold start.

Example: readiness contract

  /readyz  → 200 only when the app is ready to serve
  /healthz → 200 if the process is alive
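
A minimal sketch of that contract in Node.js/TypeScript with no framework; the dependency probes are placeholders to replace with your real DB/cache checks:

  import { createServer } from "node:http";

  // Placeholder readiness probe: swap in real checks (DB ping, cache ping,
  // "migrations applied" flag) and keep each one on a short timeout.
  async function dependenciesReady(): Promise<boolean> {
    try {
      // e.g. await db.query("SELECT 1"); await cache.ping();
      return true;
    } catch {
      return false;
    }
  }

  createServer(async (req, res) => {
    if (req.url === "/healthz") {
      res.writeHead(200).end("ok"); // liveness: cheap, no dependency calls
    } else if (req.url === "/readyz") {
      const ready = await dependenciesReady();
      res.writeHead(ready ? 200 : 503).end(); // readiness: fail until deps are live
    } else {
      res.writeHead(404).end();
    }
  }).listen(Number(process.env.PORT ?? 8080));

Keeping /healthz free of dependency calls avoids restart loops when a downstream service blips; /readyz is the endpoint that should flip to 503.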

Retention & export

  • Scrape interval: 15s (with engine-specific rollups)
  • Views: 1m, 5m, 1h downsampling
  • Retention: 7–30 days depending on plan (see Plans & Pricing)
  • Export to your own Prometheus/Grafana via Metrics Exporter.

Need longer retention or org-wide dashboards? Enable Metrics Exporter in Settings and stream OpenMetrics.

Reading the charts (quick heuristics)

  • p95 ↑ while CPU flat → lock contention, blocking work, or IO waits.
  • CPU ↑ with latency flat → healthy scaling opportunity.
  • Memory ↑ steadily → leak; watch restart/GC behavior.
  • Error ↑ after deploy → misconfig, DB migrations, or env differences—consider rollback.