Who this guide is for #
This guide is for early‑stage, cost‑conscious product engineering teams, especially those shipping one product without a dedicated SRE/observability function.
You want to:
- know when production is down or degraded
- debug incidents quickly
- keep tooling and bills under control
The goal is not “perfect observability”. The goal is useful signals at a sustainable cost.
What you actually need early on #
Early on, you don’t need a full telemetry platform with every feature enabled. You need to reliably answer three questions:
- Is the product working for users? (availability)
- Is it getting worse? (latency + error rate)
- What changed? (deploys + logs)
Distributed tracing and advanced dashboards are great — but they’re phase two.
Minimal metrics to track (per service) #
Keep it small and consistent. For most HTTP services, start with these golden signals:
- Latency: p95/p99 for a few key endpoints
- Errors: 5xx rate, plus a small set of business‑critical app errors
- Traffic: requests per second (or requests per minute)
- Saturation: CPU, memory, disk (and DB connections if relevant)
- Availability: health checks / uptime checks
Tip: avoid high‑cardinality labels (e.g., user_id, request_id) in metrics — they will explode cost in most systems.
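To make the percentile targets concrete, here is a minimal sketch of tracking p95/p99 latency per endpoint in pure Python. It is an illustration of the idea, not a metrics library: real systems use pre-aggregated histograms. The class and endpoint names are examples, and note that samples are keyed by the route template, never the raw URL, to keep cardinality bounded.

```python
import statistics
from collections import defaultdict


class LatencyTracker:
    """Toy per-endpoint latency tracker (sketch, not production code)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, endpoint: str, latency_ms: float) -> None:
        # Key by the route template ("/orders/{id}"), never the raw path
        # ("/orders/12345"), so the label set stays low-cardinality.
        self._samples[endpoint].append(latency_ms)

    def percentile(self, endpoint: str, p: int) -> float:
        data = sorted(self._samples[endpoint])
        # 99 cut points, one per percentile; cuts[94] is p95, cuts[98] is p99.
        cuts = statistics.quantiles(data, n=100, method="inclusive")
        return cuts[p - 1]
```

A real backend (Prometheus, a hosted vendor, etc.) does this with histogram buckets instead of raw samples, but the labeling discipline is the same.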
Minimal logging setup #
Logs are your “black box recorder”. The goal is: searchable, structured, and cheap.
Prefer structured logs #
Use JSON logs where possible, and include a small set of fields consistently:
- service, env, version (commit SHA or build)
- level, message
- trace_id / request_id (optional but helpful)
- user_id only if you truly need it (and consider hashing)
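A field set like this can be emitted with the standard library alone. The sketch below assumes Python's built-in logging module; the service/env/version values are placeholder examples you would inject from your build:

```python
import json
import logging
import time

# Assumed example values; in practice inject these from your build/deploy.
SERVICE = "api"
ENV = "production"
VERSION = "abc1234"  # commit SHA or build number


class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object with a fixed field set."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "service": SERVICE,
            "env": ENV,
            "version": VERSION,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # trace_id is optional: attach it per-call via logger.info(..., extra={...})
        if hasattr(record, "trace_id"):
            entry["trace_id"] = record.trace_id
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter())`; most hosted log sinks will then index these fields automatically.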
Keep logs in one place #
- Send logs to one primary log sink (don’t scatter them)
- Set a short retention by default (7–30 days) and increase only for real reasons
If logs are too expensive, reduce cost in this order:
- retention
- verbosity
- ingest volume (e.g., drop noisy logs; keep debug disabled in production)
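The last lever, ingest volume, can be enforced in code before logs ever reach the sink. This is a sketch using a standard logging filter; the 10% sample rate is an arbitrary assumption you would tune:

```python
import logging
import random


class ProdVolumeFilter(logging.Filter):
    """Drops DEBUG records and samples INFO records (sketch; tune the rate)."""

    def __init__(self, info_sample_rate: float = 0.1):
        super().__init__()
        self.info_sample_rate = info_sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno <= logging.DEBUG:
            return False  # never ship debug logs to the production sink
        if record.levelno == logging.INFO:
            return random.random() < self.info_sample_rate  # keep ~10%
        return True  # warnings and errors always pass
```

The same idea applies at the collector/agent layer if you would rather not touch application code.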
Alerts that won’t burn you out #
Aim for 5–10 alerts that matter. If you have more, you’re probably alerting on symptoms instead of user impact.
Availability #
- service health check failing
- error rate spike for critical endpoints
Performance #
- p95 latency above threshold for a sustained period
Resource exhaustion #
- disk almost full
- memory pressure / OOM restarts
- DB connection pool saturation (if it bites you)
Practical rules #
- Alert on user impact, not on every internal metric.
- Prefer “sustained for N minutes” to reduce flapping.
- Every alert should have an owner and a clear action.
- Route alerts to one place (PagerDuty/Opsgenie/Slack/email) and keep escalation simple.
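The "sustained for N minutes" rule is worth making precise, since it is what most monitoring tools call a pending/for duration. A minimal sketch of the logic, with the threshold and window as example values:

```python
class SustainedAlert:
    """Fires only when a value stays above `threshold` for `duration_s`.

    A sketch of the 'sustained for N minutes' rule; real alerting backends
    (Prometheus `for:`, vendor alert policies) implement the same idea.
    """

    def __init__(self, threshold: float, duration_s: float):
        self.threshold = threshold
        self.duration_s = duration_s
        self._breach_start = None  # timestamp when the current breach began

    def observe(self, value: float, now: float) -> bool:
        if value <= self.threshold:
            self._breach_start = None  # breach ended: reset the clock
            return False
        if self._breach_start is None:
            self._breach_start = now  # breach begins
        # Fire only once the breach has lasted the full window.
        return now - self._breach_start >= self.duration_s
```

For example, a 5% error-rate threshold sustained for 5 minutes would be `SustainedAlert(0.05, 300)`: a single bad scrape resets nothing and fires nothing.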
Budget-friendly implementation options #
You can do “budget observability” in two broad ways.
Option A: OSS collectors + hosted backend #
- Run lightweight collectors/agents (metrics + logs)
- Ship to a hosted backend (lower ops load)
Good when you want flexibility and a fast start without running stateful observability services yourself.
Option B: fully hosted all‑in‑one #
- One service covers metrics, logs (and maybe traces)
Good when you want minimal setup and you can keep ingest under control.
In both cases: make sure you have one primary place to look during an incident.
Cost control tips (the stuff that saves real money) #
- Retention: don’t store everything forever.
- Sampling/filters: drop noisy logs; avoid debug in prod.
- Cardinality discipline: avoid unique IDs in metric labels.
- Budget alerts: set spend alerts for observability itself.
- Regular cleanup: once a month, remove unused dashboards and noisy alerts.
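Cardinality discipline is easiest to enforce mechanically, by normalizing labels before they reach the metrics client. A small sketch, assuming numeric IDs and UUID-like segments are the main offenders (adjust the pattern for your URL scheme):

```python
import re

# Assumption: path segments that are all digits, or long hex/UUID-like
# strings, are unique IDs and should be collapsed into a template.
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F-]{32,})(?=/|$)")


def normalize_route(path: str) -> str:
    """Collapse IDs in URL paths so metric labels stay low-cardinality."""
    return _ID_SEGMENT.sub("/{id}", path)
```

Run every path through this (or use your web framework's route template directly) before using it as a metric label; one endpoint then produces one label value instead of millions.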
When to add tracing and “fancy dashboards” #
Add more advanced tooling when:
- you have multiple services and incidents involve unclear call chains
- latency regressions aren’t explainable via metrics/logs
- debugging takes too long because you can’t connect events across systems
Start with a narrow tracing rollout: one critical request path, one or two services, sampled.
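For the "sampled" part, the usual approach is deterministic head-based sampling: the keep/drop decision is a pure function of the trace ID, so every service in the call chain makes the same choice without coordination. A sketch (real tracing SDKs ship configurable samplers that do this for you):

```python
import hashlib


def should_sample(trace_id: str, sample_percent: float = 5.0) -> bool:
    """Deterministic head-based sampling decision (sketch).

    Hashing the trace_id into 10,000 buckets means the same trace is
    always kept or always dropped, across every service that sees it.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10000
    return bucket < sample_percent * 100
```

At 5% you keep roughly 1 in 20 traces end-to-end, which is usually plenty to see the shape of a critical path without tracing-sized bills.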
Minimal implementation plan (2–4 weeks) #
- Collect basic system + HTTP metrics for production.
- Set up one log sink and enable structured logs.
- Create a “golden signals” dashboard (availability, latency, errors, traffic).
- Add 5–10 alerts (availability, error rate, latency, disk/memory).
- Run one tabletop incident: verify you can find the cause using your signals.
Next steps #
If you want to build this quickly without over‑engineering it, use your current setup as input and iterate:
- start with the smallest set of signals
- make alerts meaningful
- keep bills predictable
If you’d like help designing or setting up a minimal observability stack (dashboards + alerts + sane cost controls), see Services.