System Design Space

Updated: March 24, 2026 at 3:23 PM

Observability & Monitoring Design


Practical design of an observability platform: logs, metrics, distributed tracing, SLO-based alerting, runbooks and feedback loop for production.

Observability does not start with charts. It starts with the question of whether the team can explain degradation before users give up.

Logs, metrics, tracing, SLO-based alerting, runbooks, and feedback loops are presented as one diagnostic platform that helps the team move from symptom to cause instead of merely collecting signals.

In design reviews, the chapter is especially useful for discussing signal quality, cardinality cost, alert fatigue, and which telemetry actually reduces investigation time.

Practical value of this chapter

Design in practice

Turn guidance on observability architecture and SLO-driven monitoring design into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make trade-offs explicit for observability architecture and SLO-driven monitoring design: release speed, automation level, observability cost, and operational complexity.

Core Source

Site Reliability Engineering

Four golden signals, SLI/SLO and approaches to observability of production systems.


Observability & Monitoring Design answers a practical question: how to build an observability system that supports decision-making during an incident instead of merely accumulating dashboards. Observability design covers signals (logs/metrics/traces), a pipeline for their delivery, SLO-aware alerting, runbooks, and an operational improvement cycle after each failure.

Four pillars of production observability

Logs

Events and context for incident investigation.

  • Structured JSON format instead of free-form text.
  • Correlation fields: trace_id, span_id, request_id, user_id.
  • Explicit separation of levels: info/warn/error/fatal and business errors.

Metrics

Numerical time series for SLO, capacity and early degradation detection.

  • Basic frameworks: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors).
  • Controlling the cardinality of labels (especially user_id, path, tenant).
  • Separate metrics for the business funnel and the technical layer.
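The RED counters and the cardinality concern above can be sketched with a minimal in-memory recorder. This is an illustrative sketch only; the `RedMetrics` name and overflow-bucket behavior are assumptions, not a real library (a production system would use a Prometheus or OpenTelemetry client).

```python
from collections import defaultdict

class RedMetrics:
    """Minimal in-memory RED (Rate, Errors, Duration) recorder.

    Illustrative only. The key idea: cap label cardinality so that
    unbounded values (user_id, raw URL paths) cannot explode the
    number of time series and the storage bill.
    """

    def __init__(self, max_label_values=100):
        self.max_label_values = max_label_values
        self.requests = defaultdict(int)    # endpoint -> request count
        self.errors = defaultdict(int)      # endpoint -> error count
        self.durations = defaultdict(list)  # endpoint -> [seconds]
        self._seen = set()

    def _label(self, endpoint):
        # New label values beyond the cap fall into an "other" bucket
        # instead of creating a fresh time series.
        if endpoint not in self._seen:
            if len(self._seen) >= self.max_label_values:
                return "other"
            self._seen.add(endpoint)
        return endpoint

    def observe(self, endpoint, duration_s, is_error):
        label = self._label(endpoint)
        self.requests[label] += 1
        if is_error:
            self.errors[label] += 1
        self.durations[label].append(duration_s)

    def error_rate(self, endpoint):
        total = self.requests.get(endpoint, 0)
        return self.errors.get(endpoint, 0) / total if total else 0.0

m = RedMetrics(max_label_values=2)
m.observe("/checkout", 0.120, is_error=False)
m.observe("/checkout", 1.840, is_error=True)
m.observe("/search", 0.050, is_error=False)
m.observe("/user/12345", 0.030, is_error=False)  # over the cap -> "other"
print(m.error_rate("/checkout"))  # 0.5
```

The overflow bucket trades per-value detail for a bounded series count, which is usually the right default for high-cardinality labels.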

Distributed Tracing

End-to-end request path through services, queues and databases.

  • Context propagation across sync/async boundaries (HTTP, gRPC, Kafka).
  • Sampling strategy: tail sampling for rare errors and high-latency requests.
  • A combination of trace + logs + metrics for quick root-cause analysis.
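The tail-sampling bullet above can be sketched as a keep/drop decision made after a trace completes, so rare errors and slow requests are never lost. Thresholds and data shapes here are illustrative assumptions; real collectors (e.g. the OpenTelemetry Collector) implement this as a pipeline processor.

```python
import random

LATENCY_THRESHOLD_MS = 1000
BASELINE_SAMPLE_RATE = 0.01  # keep 1% of healthy traces

def keep_trace(spans, rng=random.random):
    """Tail-sampling decision for a finished trace.

    spans: list of dicts with 'duration_ms' and 'error' keys.
    The decision runs after the whole trace is available, so the
    interesting tail (errors, slow requests) is always kept.
    """
    has_error = any(s["error"] for s in spans)
    total_ms = sum(s["duration_ms"] for s in spans)
    if has_error or total_ms >= LATENCY_THRESHOLD_MS:
        return True                      # always keep the interesting tail
    return rng() < BASELINE_SAMPLE_RATE  # probabilistic baseline

slow = [{"duration_ms": 900, "error": False}, {"duration_ms": 400, "error": False}]
failed = [{"duration_ms": 20, "error": True}]
healthy = [{"duration_ms": 30, "error": False}]

print(keep_trace(slow))                      # True: total 1300 ms
print(keep_trace(failed))                    # True: contains an error
print(keep_trace(healthy, rng=lambda: 0.5))  # False: not sampled
```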

Alerting

Rules that page only for deviations that actually matter.

  • Alerts driven by SLO/error-budget burn rate, not just CPU/RAM thresholds.
  • Severity separation (page/ticket/info) and clear escalation routes.
  • Each page-alert is accompanied by a runbook link.

Deep dive into logs, metrics, traces and alerting

Logs: from forensic analysis to operational diagnostics

Logs are needed to explain a specific event: who caused it, what went wrong, at what step and with what context.

Must-have

  • Unified JSON format and stable schema version.
  • Correlation by trace_id/span_id/request_id for joining log events with traces.
  • Explicit domain fields: operation, tenant, entity_id, outcome, error_code.

Solution design

  • Do info/debug sampling at the ingest level, not at the application level.
  • Mask PII/secrets before events are sent to the log pipeline.
  • Separate security/audit logs from product logs for different retention and access.
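The PII-masking bullet can be sketched as a small sanitizer applied before events leave the service. The sensitive-key set, regex, and field names are illustrative assumptions; a production pipeline would typically repeat this defense in the collector.

```python
import json
import re

# Hypothetical deny-list of sensitive keys and a simple email pattern.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(value):
    """Recursively mask sensitive keys and email-shaped strings."""
    if isinstance(value, dict):
        return {
            k: "***" if k.lower() in SENSITIVE_KEYS else mask(v)
            for k, v in value.items()
        }
    if isinstance(value, str):
        return EMAIL_RE.sub("***@***", value)
    return value

event = {
    "operation": "CreateOrder",
    "user": {"email": "jane@acme.io", "password": "hunter2"},
    "note": "contact jane@acme.io on failure",
}
print(json.dumps(mask(event)))
```

Masking at the source is the safer default: once a secret reaches the log backend, retention and access controls are the only line of defense.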

Common mistakes

  • Raw stack traces without operation context or input parameters.
  • Free-form lines without machine-parseable fields.
  • Logging every event with no budget for storage and query latency.

Example of a structured log event

{
  "ts": "2026-02-15T06:00:00Z",
  "level": "error",
  "service": "checkout-api",
  "operation": "CreateOrder",
  "trace_id": "6c5f...",
  "request_id": "req-1942",
  "tenant": "acme",
  "error_code": "PAYMENT_PROVIDER_TIMEOUT",
  "duration_ms": 1840
}
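An event in this shape can be emitted with the stdlib `logging` module and a small JSON formatter. This is a minimal sketch; the `fields` convention for passing extra attributes is an assumption, not a standard, and a production formatter would emit UTC timestamps.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            # Note: formatTime uses local time by default; production
            # code should format in UTC to match the "Z" suffix.
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "checkout-api",
        }
        # Domain fields (operation, trace_id, ...) arrive via
        # logger.error(..., extra={"fields": {...}}).
        event.update(getattr(record, "fields", {}))
        event["message"] = record.getMessage()
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)

log.error(
    "payment provider timed out",
    extra={"fields": {
        "operation": "CreateOrder",
        "trace_id": "6c5f...",
        "error_code": "PAYMENT_PROVIDER_TIMEOUT",
        "duration_ms": 1840,
    }},
)
```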

Reference pipeline: from signal to reaction

1. Instrumentation

Code and infrastructure publish telemetry via the OpenTelemetry SDK and collector.

2. Collection & Transport

Agents/collectors normalize signals and deliver them to storage backends without loss.

3. Storage

Separate backends for logs/metrics/traces with retention policies based on cost.

4. Query & Correlation

Unified navigation between metrics, traces, and logs via shared identifiers.

5. Alert & Response

Alerts launch an on-call process: triage -> mitigation -> postmortem.
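Step 4 of the pipeline can be sketched as a join on shared identifiers: given a suspicious trace, pull every log event carrying the same trace_id so spans and events appear side by side. Data shapes here are illustrative.

```python
def logs_for_trace(trace_id, log_events):
    """Correlate log events with a trace via the shared trace_id."""
    return [e for e in log_events if e.get("trace_id") == trace_id]

logs = [
    {"trace_id": "6c5f", "level": "error", "error_code": "PAYMENT_PROVIDER_TIMEOUT"},
    {"trace_id": "9a01", "level": "info", "error_code": None},
    {"trace_id": "6c5f", "level": "warn", "error_code": "RETRY_SCHEDULED"},
]

matched = logs_for_trace("6c5f", logs)
print([e["error_code"] for e in matched])
# ['PAYMENT_PROVIDER_TIMEOUT', 'RETRY_SCHEDULED']
```

In real backends this join is done by the query layer, which is exactly why the correlation fields from the logging section must be present on every signal.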

Practice

Troubleshooting Example

Step-by-step diagnosis of an incident using RED + logs and hypotheses.


Alerting without noise: rule design

5-15 minutes

Fast burn-rate

Quickly catch critical outage and reduce MTTR.

High burn-rate error budget in a short window.

1-6 hours

Slow burn-rate

Record long-term quality degradation until complete failure.

Steady deviation of latency or availability from SLO.

15-60 minutes

Business anomaly

See the impact on the product before customers contact you.

Drop in conversion / surge in unsuccessful payments / drop in key funnel.
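The fast/slow split above can be expressed as a multiwindow burn-rate check, in the spirit of the SRE workbook: an alert fires only when both the long and the short window exceed the threshold, so it pages quickly during an outage but stops paging soon after recovery. Window sizes and thresholds here are illustrative assumptions.

```python
ERROR_BUDGET = 0.001  # for a 99.9% SLO

def burn(error_ratio):
    """Burn rate: how many times faster than sustainable we consume budget."""
    return error_ratio / ERROR_BUDGET

def should_alert(long_window_ratio, short_window_ratio, threshold):
    # Both windows must burn above the threshold: the long window
    # proves the problem is real, the short window proves it is
    # still happening (avoids paging on an already-recovered blip).
    return (burn(long_window_ratio) >= threshold
            and burn(short_window_ratio) >= threshold)

# Fast-burn page: e.g. 1h long window + 5m short window, threshold 14.4.
print(should_alert(0.02, 0.03, 14.4))    # True: active outage
print(should_alert(0.02, 0.0001, 14.4))  # False: already recovering
# Slow-burn ticket: e.g. 6h long window + 30m short window, threshold 6.
print(should_alert(0.007, 0.008, 6))     # True: sustained degradation
```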

The minimum team agreement: every page alert must have an owner, a runbook, and a postmortem action item. Without this, alerting quickly degrades into noise.

