System Design Space

Updated: February 21, 2026 at 11:59 PM

Observability & Monitoring Design

Practical design of an observability platform: logs, metrics, distributed tracing, SLO-based alerting, runbooks and feedback loop for production.

Core Source

Site Reliability Engineering

Four golden signals, SLI/SLO and approaches to observability of production systems.

Observability & Monitoring Design answers a practical question: how do you build an observability system that helps you make decisions during an incident, rather than just accumulating dashboards? Observability design covers the signals (logs/metrics/traces), a pipeline for delivering them, SLO-aware alerting, runbooks, and an operational improvement cycle after each failure.

Four pillars of production observability

Logs

Events and context for incident investigation.

  • Structured JSON format instead of free-form text.
  • Correlation fields: trace_id, span_id, request_id, user_id.
  • Explicit separation of levels: info/warn/error/fatal and business errors.

Metrics

Numerical time series for SLO, capacity and early degradation detection.

  • Basic frameworks: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors).
  • Controlling the cardinality of labels (especially user_id, path, tenant).
  • Separate metrics for the business funnel and the technical layer.
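The RED counters above can be sketched with a stdlib-only class (the class and method names are illustrative, not taken from any metrics library; a production system would keep a histogram rather than a raw latency list):

```python
from collections import defaultdict

class RedMetrics:
    """Minimal sketch of the RED framework: Rate, Errors, Duration per operation."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: total requests per operation
        self.errors = defaultdict(int)      # Errors: failed requests per operation
        self.durations = defaultdict(list)  # Duration: latencies in milliseconds

    def observe(self, operation, duration_ms, ok):
        self.requests[operation] += 1
        if not ok:
            self.errors[operation] += 1
        self.durations[operation].append(duration_ms)

    def p99(self, operation):
        """99th-percentile latency for one operation."""
        latencies = sorted(self.durations[operation])
        return latencies[max(0, int(len(latencies) * 0.99) - 1)]

metrics = RedMetrics()
for i in range(100):
    # Every 25th request fails; latency grows from 10 to 109 ms.
    metrics.observe("CreateOrder", duration_ms=10 + i, ok=(i % 25 != 0))
```

Note that the key is the operation name only: adding user_id or a raw URL path as a label would explode cardinality, which is exactly the pitfall the list above warns about.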

Distributed Tracing

End-to-end request path through services, queues and databases.

  • Context propagation across sync/async boundaries (HTTP, gRPC, Kafka).
  • Sampling strategy: tail sampling for rare errors and high-latency requests.
  • A combination of trace + logs + metrics for quick root-cause analysis.
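A tail-sampling decision can be sketched as a predicate applied once the whole trace has been collected: always keep traces with errors or high latency, and only a small random fraction of the rest (the thresholds, field names and 1% base rate are illustrative assumptions, not a specific collector's API):

```python
import random

def keep_trace(spans, latency_threshold_ms=500.0, base_rate=0.01):
    """Tail-sampling decision, made after the full trace is available."""
    # Always keep traces that contain at least one failed span.
    if any(span.get("error") for span in spans):
        return True
    # Always keep slow traces: root-span duration over the threshold.
    if spans[0]["duration_ms"] > latency_threshold_ms:
        return True
    # Keep a small random fraction of healthy traces as a baseline.
    return random.random() < base_rate

# A trace with a payment-provider timeout is always kept.
slow_failed = [
    {"name": "CreateOrder", "duration_ms": 1840, "error": False},
    {"name": "ChargeCard", "duration_ms": 1700, "error": True},
]
```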

Alerting

Rules that wake someone up only for deviations that really matter.

  • Alerts driven by SLO / error-budget burn rate, not just CPU/RAM.
  • Severity separation (page/ticket/info) and clear escalation routes.
  • Each page-alert is accompanied by a runbook link.
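Burn rate is the ratio between the observed error rate and the rate the SLO allows: at burn rate 1 the error budget lasts exactly the SLO period. A minimal severity-routing sketch (the 14.4 and 3.0 thresholds follow the commonly cited multiwindow burn-rate guidance from the SRE books, and are assumptions here, not universal constants):

```python
def burn_rate(error_ratio, slo):
    """How many times faster than allowed the error budget is being spent."""
    error_budget = 1.0 - slo          # SLO 99.9% -> 0.1% budget
    return error_ratio / error_budget

def route(error_ratio, slo=0.999):
    """Map burn rate to the page/ticket/info severities described above."""
    rate = burn_rate(error_ratio, slo)
    if rate >= 14.4:   # budget gone in ~2 days of a 30-day window: wake someone
        return "page"
    if rate >= 3.0:    # sustained degradation: fix during working hours
        return "ticket"
    return "info"
```

For example, a 1% error ratio against a 99.9% SLO is a 10x burn rate.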

Deep dive into logs, metrics, traces and alerting

Logs: from forensic analysis to operational diagnostics

Logs are needed to explain a specific event: who caused it, what went wrong, at what step and with what context.

Must-have

  • Unified JSON format and stable schema version.
  • Correlation by trace_id/span_id/request_id to join log events with traces.
  • Explicit domain fields: operation, tenant, entity_id, outcome, error_code.

Solution design

  • Do info/debug sampling at the ingest level, not at the application level.
  • Mask PII and secrets before they are sent to the log pipeline.
  • Separate security/audit logs from product logs for different retention and access.
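Masking before shipping can be sketched as a small filter over the event dict (the field names and the bearer-token pattern are illustrative; many teams run this step in the collector rather than in the application):

```python
import re

SENSITIVE_KEYS = {"password", "token", "card_number", "email"}
# Also redact bearer tokens embedded in free-text message fields.
BEARER_RE = re.compile(r"Bearer\s+\S+")

def mask_event(event):
    """Return a copy of a log event with sensitive values redacted."""
    masked = {}
    for key, value in event.items():
        if key in SENSITIVE_KEYS:
            masked[key] = "***"
        elif isinstance(value, str):
            masked[key] = BEARER_RE.sub("Bearer ***", value)
        else:
            masked[key] = value
    return masked
```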

Common mistakes

  • Raw stacktrace without operation context and input parameters.
  • Free-form text lines without machine-parseable fields.
  • Logging every event with no budget for storage volume and query latency.

Example of a structured log event

{
  "ts": "2026-02-15T06:00:00Z",
  "level": "error",
  "service": "checkout-api",
  "operation": "CreateOrder",
  "trace_id": "6c5f...",
  "request_id": "req-1942",
  "tenant": "acme",
  "error_code": "PAYMENT_PROVIDER_TIMEOUT",
  "duration_ms": 1840
}
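Emitting an event in this shape can be sketched with nothing but the standard library (a real service would pull trace_id/request_id from the active trace context instead of passing them by hand; the values below are illustrative):

```python
import json
import logging
import sys
from datetime import datetime, timezone

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout-api")

def log_event(level, operation, trace_id, request_id, **fields):
    """Serialize one structured log event as a single JSON line."""
    event = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "service": "checkout-api",
        "operation": operation,
        "trace_id": trace_id,
        "request_id": request_id,
        **fields,
    }
    line = json.dumps(event)
    logger.log(getattr(logging, level.upper(), logging.INFO), line)
    return line

line = log_event("error", "CreateOrder", "6c5f", "req-1942",
                 tenant="acme", error_code="PAYMENT_PROVIDER_TIMEOUT",
                 duration_ms=1840)
```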

Reference pipeline: from signal to reaction

1. Instrumentation

Code and infrastructure publish telemetry via the OpenTelemetry SDK and collector.

2. Collection & Transport

Agents and collectors normalize signals and deliver them to storage backends without loss.

3. Storage

Separate backends for logs/metrics/traces with retention policies based on cost.

4. Query & Correlation

Unified navigation between metric graphs, traces and logs via shared identifiers.

5. Alert & Response

Alerts launch an on-call process: triage -> mitigation -> postmortem.

Practice

Troubleshooting Example

Step-by-step diagnosis of an incident using RED metrics, logs and hypothesis-driven debugging.

Alerting without noise: rule design

Fast burn-rate (window: 5-15 minutes)

Quickly catch a critical outage and reduce MTTR.

Trigger: high error-budget burn rate in a short window.

Slow burn-rate (window: 1-6 hours)

Detect long-term quality degradation before it turns into a complete failure.

Trigger: sustained deviation of latency or availability from the SLO.

Business anomaly (window: 15-60 minutes)

See the impact on the product before customers contact you.

Trigger: a drop in conversion, a surge in failed payments, or a dip in a key funnel metric.

The team's minimum agreement: every page alert must have an owner, a runbook and a postmortem action item. Without this, alerting quickly turns into noise.

© 2026 Alexander Polomodov