Metrics & Monitoring
Monitor your running services—CPU, memory, requests, latency percentiles, errors, and language-specific signals—with alerting and autoscaling hooks.
Forgeon tracks your service instances in real time: CPU, memory, disk, network, requests/second, latency percentiles, and error rate—plus language-specific hints (Node event-loop lag, Go GC, JVM pauses, etc.). Use these charts to debug deploys, size resources, and trigger alerts or autoscaling.
Where to find it
- Project → Services → [web/worker/cron] → Metrics
- Pick a time range (15m · 1h · 24h · 7d)
- Toggle p50/p90/p95/p99 for latency where available
- Click a chart to inspect samples and filter by instance or region
Core runtime signals (all services)
- CPU usage — instantaneous & averaged. Sustained > 80% under load → scale up/out or tune hot paths.
- Memory usage — RSS/working set. Steady climbs without drops → leaks or unbounded caches (see the RSS sketch after this list).
- Disk — used %, read/write throughput, IOPS. Alert at 85% used to leave headroom before disk-pressure evictions.
- Network — in/out throughput & connections. Pair spikes with request charts.
- Requests/second (RPS/QPS) — per instance and aggregated.
- Latency — p50/p95/p99. Rising tails (p95/p99) signal contention or saturation.
- Error rate — 4xx/5xx or task failures. Spikes after a deploy? Rollback candidates.
- Health checks — readiness/liveness results; flapping indicates overload or slow startups.
- Restart count — crash loops or OOM kills; correlate with memory spikes.
If a chart is flat at zero, the runtime may not expose that metric for your language/version. Core host metrics still appear.
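Several alert ideas below compare RSS against the instance's memory limit. If you want the same number from inside the app (to log it or export it as a custom metric), here is a minimal Node.js sketch, assuming a containerized runtime with cgroup v2 (the /sys/fs/cgroup path is an assumption about the environment, not a Forgeon API):

```ts
// Sketch: report RSS as a percentage of the container memory limit.
// Assumes cgroup v2; falls back gracefully when no limit is visible.
import { readFileSync } from "node:fs";

function memoryLimitBytes(): number | null {
  try {
    const raw = readFileSync("/sys/fs/cgroup/memory.max", "utf8").trim();
    return raw === "max" ? null : Number(raw); // "max" means no limit is set
  } catch {
    return null; // not running under cgroup v2
  }
}

setInterval(() => {
  const limit = memoryLimitBytes();
  const rss = process.memoryUsage().rss;
  if (limit) {
    // The "RSS > 85% of memory limit" alerts below correspond to this
    // percentage staying above 85 for the stated duration.
    console.log(`rss=${(rss / 1e6).toFixed(0)}MB (${((rss / limit) * 100).toFixed(1)}% of limit)`);
  } else {
    console.log(`rss=${(rss / 1e6).toFixed(0)}MB (no cgroup limit detected)`);
  }
}, 30_000).unref(); // don't keep the process alive just for this sampler
```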
Language-specific insights
Node.js
- Event-loop lag — if > 100–200 ms during traffic, you’re blocking the loop (sync work or parsing/serializing giant JSON payloads); see the sketch after the alert ideas below.
- Heap used & GC pauses — frequent long pauses + high heap → revisit allocations or enable pooling.
- Open handles — creeping count after requests → leaks (sockets, timers, file descriptors).
Alert ideas
- p95 latency > 350 ms for 5m
- Event-loop lag p95 > 150 ms for 5m
- RSS > 85% of memory limit for 10m
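Event-loop lag can also be sampled from inside the process with Node’s built-in perf_hooks histogram, which is handy for confirming the threshold above before you alert on it. A minimal sketch:

```ts
// Sketch: sample event-loop delay with Node's built-in perf_hooks.
// The histogram reports values in nanoseconds.
import { monitorEventLoopDelay } from "node:perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

setInterval(() => {
  const toMs = (ns: number) => (ns / 1e6).toFixed(1);
  console.log(
    `event-loop lag: mean=${toMs(histogram.mean)}ms ` +
      `p95=${toMs(histogram.percentile(95))}ms max=${toMs(histogram.max)}ms`
  );
  histogram.reset(); // start a fresh window for the next sample
}, 10_000).unref();
```

If p95 here hovers near the 150 ms threshold, the usual fixes are moving synchronous work to worker_threads or streaming large payloads instead of buffering them.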
Go
- GC pause (STW) and heap live — sawtooth is healthy; rising baseline → leaks.
- Goroutines — sudden growth at stable RPS usually means goroutines are blocking (leaking) or work is being spawned without bounds.
Alert ideas
- p95 latency > 300 ms for 5m
- GC pause p95 > 40 ms for 5m
- RSS > 85% limit for 10m
Python (Django/Flask/FastAPI)
- Worker concurrency — Gunicorn/Uvicorn workers vs. CPU usage; too many sync workers increase context switching.
- Queue wait (if ASGI queue exposed) — rising wait → add workers or profile handlers.
Alert ideas
- p95 latency > 400 ms for 5m
- Error rate > 2% for 3m
- RSS > 85% limit for 10m
JVM (Spring, etc.)
- GC (G1/ZGC) pauses & Old Gen occupancy; long pauses or promotion failures → tune heap size/GC settings.
- Thread count & blocked threads.
Alert ideas
- GC pause p95 > 80 ms for 5m
- Old Gen > 85% for 10m
- p99 latency > 1 s for 5m
Workers / Cron
- Job duration percentiles, success/failure counts, queue backlog (if emitted; see the sketch after the alert ideas below).
- Next run and schedule drift.
Alert ideas
- Failure rate > 1% for 10m
- p95 job duration > SLO for 10m
- Backlog age rising for 10m
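Job duration and backlog only show up if the worker emits them. Here is a minimal sketch of instrumenting a job runner, assuming prom-client as the metrics library (an illustrative choice, not a Forgeon requirement; the metric names and the instrumented() wrapper are made up for the example):

```ts
// Sketch: emit job duration, outcome counts, and backlog as custom metrics.
// The queue client and job handlers are yours; everything named here is illustrative.
import { Histogram, Counter, Gauge } from "prom-client";

const jobDuration = new Histogram({
  name: "worker_job_duration_seconds",
  help: "Job duration in seconds",
  labelNames: ["job"],
  buckets: [0.1, 0.5, 1, 5, 30, 120], // tune to your SLO
});
const jobOutcomes = new Counter({
  name: "worker_job_outcomes_total",
  help: "Job completions by outcome",
  labelNames: ["job", "outcome"],
});
const backlog = new Gauge({
  name: "worker_queue_backlog",
  help: "Jobs waiting in the queue",
});

async function instrumented(job: string, fn: () => Promise<void>): Promise<void> {
  const end = jobDuration.startTimer({ job });
  try {
    await fn();
    jobOutcomes.inc({ job, outcome: "success" });
  } catch (err) {
    jobOutcomes.inc({ job, outcome: "failure" });
    throw err;
  } finally {
    end(); // records elapsed seconds into the histogram
  }
}

// Usage, with your own queue client:
//   await instrumented("send-email", () => sendEmailBatch());
//   setInterval(() => backlog.set(queue.size()), 15_000);
```

Expose these through a /metrics endpoint (see the sketch under Autoscaling signals below) so the charts and alerts can pick them up.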
Autoscaling signals
You can scale per service using one or more signals:
- CPU target — e.g., keep CPU ~60% across instances.
- Memory target — e.g., keep RSS below ~75% of the memory limit.
- RPS target — e.g., ~200 requests/second per instance.
- Latency target — e.g., p95 < 300 ms.
- Custom — expose a counter/gauge and point the autoscaler at it (see the sketch below).
Start with CPU 60% and a latency ceiling; add memory guardrails for languages with big heaps.
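For the custom signal, the service only has to expose a counter or gauge that can be read. A minimal sketch, assuming Express and prom-client (both illustrative choices; "app_queue_depth" is a made-up signal), serving a Prometheus-style /metrics endpoint:

```ts
// Sketch: expose a custom gauge for autoscaling or export.
// Express and prom-client are illustrative; "app_queue_depth" is a made-up metric.
import express from "express";
import { Gauge, collectDefaultMetrics, register } from "prom-client";

collectDefaultMetrics(); // CPU, heap, event-loop metrics for free

const queueDepth = new Gauge({
  name: "app_queue_depth",
  help: "Items waiting to be processed",
});

// Placeholder value — replace with the real signal from your app.
setInterval(() => queueDepth.set(Math.floor(Math.random() * 100)), 5_000).unref();

const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
app.listen(Number(process.env.PORT ?? 9100));
```

Whether the autoscaler scrapes this endpoint or ingests the metric another way depends on your Metrics Exporter setup; the service-side instrumentation looks the same either way.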
Recommended alerts (copy-ready)
The per-language Alert ideas above are ready to copy; create them in Service → Alerts.
Correlate with logs & deploys
- Open Runtime Logs beside metrics to catch the exact error around a spike.
- Pin a deployment; charts will highlight the before/after window.
- If p95 doubles after a deploy, capture logs + metrics and consider Rollback.
Health checks that work
- Add a `/readyz` endpoint that returns 200 only when dependencies are live (DB, cache, migrations done), as sketched below.
- Keep `/healthz` lightweight for liveness (always 200 unless the process is truly broken).
- Set a grace period long enough for your framework’s cold start.
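A minimal sketch of the two endpoints in an Express app (checkDatabase and checkCache stand in for whatever dependency pings you already have):

```ts
// Sketch: split liveness from readiness. /healthz stays trivial; /readyz
// checks dependencies. checkDatabase/checkCache are placeholder functions.
import express from "express";

async function checkDatabase(): Promise<boolean> {
  // e.g. SELECT 1 against your pool, with a short timeout
  return true;
}
async function checkCache(): Promise<boolean> {
  // e.g. PING against Redis, with a short timeout
  return true;
}

const app = express();

// Liveness: answer 200 unless the process itself is wedged.
app.get("/healthz", (_req, res) => res.status(200).send("ok"));

// Readiness: only 200 when dependencies are reachable and migrations are done.
app.get("/readyz", async (_req, res) => {
  const [db, cache] = await Promise.all([checkDatabase(), checkCache()]);
  if (db && cache) return res.status(200).send("ready");
  res.status(503).json({ db, cache }); // 503 keeps traffic away until ready
});

app.listen(Number(process.env.PORT ?? 3000));
```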
Retention & export
- Scrape interval: 15s (with engine-specific rollups)
- Views: 1m, 5m, 1h downsampling
- Retention: 7–30 days depending on plan (see Plans & Pricing)
- Export to your own Prometheus/Grafana via Metrics Exporter.
Need longer retention or org-wide dashboards? Enable Metrics Exporter in Settings and stream OpenMetrics.
Reading the charts (quick heuristics)
- p95 ↑ while CPU flat → lock contention, blocking work, or IO waits.
- CPU ↑ with latency flat → healthy scaling opportunity.
- Memory ↑ steadily → leak; watch restart/GC behavior.
- Error ↑ after deploy → misconfig, DB migrations, or env differences—consider rollback.