System Design Space

Updated: May 13, 2026 at 11:30 AM

Observability & Monitoring Design

medium

Practical observability-platform design: logs, metrics, distributed tracing, SLO-based alerts, diagnostic dashboards, runbooks, and incident investigation.

Observability does not start with charts. It starts with the question of whether the team can explain degradation before users give up.

Logs, metrics, traces, SLO-based alerts, runbooks, and feedback loops are presented as one diagnostic platform for moving from symptom to cause.

In architecture reviews, the chapter helps discuss signal quality, high-cardinality cost, alert fatigue, and which telemetry actually reduces investigation time.

Practical value of this chapter

Design in practice

Design the telemetry path: instrumentation, collection, storage, correlation, alerting, and response.

Decision quality

Evaluate the stack through signal quality, high-cardinality cost, investigation time, and alert usefulness for on-call.

Interview articulation

Show the path from symptom to cause: metric, trace, log, hypothesis, and mitigation.

Trade-off framing

Make the cost of detail, retention windows, trace sampling, and alert noise explicit.

Source

Site Reliability Engineering

Four golden signals, SLI/SLO, and approaches to observability for production systems.


Observability & Monitoring Design answers a practical question: how do we build an observability system that helps teams make decisions during incidents instead of only accumulating dashboards? This chapter explains logs, metrics, distributed tracing, service objectives, error budgets, burn-rate alerts, alerting rules, runbooks, and the improvement loop after incidents.

Typical observability platform

Platform map

How signals move from the application to incident investigation and on-call action.

A typical platform separates collection, transport, storage, and response so teams can evolve each layer independently.

Flow: Services and clients → (collect) → Collectors → (transport) → Transport → (write) → Stores → (analyze) → Analysis and response

Services and clients: where signals originate

  • Services: APIs, queues, background jobs
  • Clients: web, mobile, edge events

Collectors: normalize and enrich

  • OpenTelemetry Collector: receive, filter, route
  • Agents: nodes, containers, sidecars

Transport: buffering and burst protection

  • Queue / bus: Kafka, Pub/Sub, event stream
  • Ingestion policies: sampling, retention, limits

Stores: different read models

  • Metrics store: time series and PromQL
  • Logs store: event and context search
  • Trace store: spans, dependencies, latency

Analysis and response: decision and feedback

  • Dashboards: SLO, RED/USE, business signals
  • Alert rules: pages and tickets
  • On-call: runbook, action, postmortem

Four pillars of observability

Logs

Events and context for incident investigation.

  • Structured JSON logging instead of free-form text.
  • Correlation fields: trace_id, span_id, request_id, user_id.
  • Clear log levels: info, warn, error, fatal, plus business errors.

Metrics

Numerical time series for SLOs, capacity planning, and early degradation detection.

  • Baseline models: RED metrics and USE metrics.
  • Control label cardinality, especially user_id, path, and tenant.
  • Business metrics next to technical signals.
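
As a minimal sketch of label-cardinality control, the Python below collapses unbounded path segments into a placeholder and estimates the worst-case series count as the product of distinct label values. The function names, the `:id` placeholder, and the label counts are assumptions made for illustration, not part of any metrics library.

```python
import re

# Hypothetical rule: numeric and UUID-like path segments are unbounded identifiers.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Replace identifier segments with ':id' so the path label stays bounded."""
    segments = path.split("?")[0].strip("/").split("/")
    cleaned = [":id" if _ID_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(cleaned)


def series_count(distinct_values: dict[str, int]) -> int:
    """Worst-case number of time series: product of distinct values per label."""
    total = 1
    for count in distinct_values.values():
        total *= count
    return total


print(normalize_path("/orders/1942/items/7?expand=true"))  # /orders/:id/items/:id

# Assumed label counts: adding user_id multiplies the series count by 100,000.
bounded = {"method": 5, "path": 40, "status": 6}
print(series_count(bounded))                           # 1200
print(series_count({**bounded, "user_id": 100_000}))   # 120000000
```

The multiplication is the reason user_id usually belongs in logs or trace attributes rather than in metric labels.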

Distributed tracing

The end-to-end request path across services, queues, and databases.

  • Context propagation across synchronous and asynchronous boundaries: HTTP, gRPC, Kafka.
  • Tail sampling for rare errors and high-latency requests.
  • Correlation between traces, logs, and metrics to find root cause faster.
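
The tail-sampling idea can be sketched as a decision made after a trace has completed: keep every trace with an error or high latency, and keep only a small deterministic share of the rest. The `TraceSummary` type, the 1000 ms threshold, and the 5% baseline rate are illustrative assumptions, not values from a specific tracing backend.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary built once all spans of a trace have arrived."""
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: TraceSummary, slow_ms: float = 1000.0,
               baseline_rate: float = 0.05) -> bool:
    """Keep all failed or slow traces; sample the rest at baseline_rate."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Hash the trace_id so every collector instance makes the same decision.
    digest = hashlib.sha256(trace.trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < baseline_rate
```

Deciding from a hash of trace_id instead of a random number keeps the outcome stable when the same trace is evaluated more than once.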

Alerting

Rules that wake the team only for deviations that matter.

  • Alerts based on SLOs and error-budget burn rate, not only CPU/RAM.
  • Severity paths: page, ticket, or informational signal.
  • Every page alert points to a runbook.
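
To make burn rate concrete: it is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below computes it and routes the result to a severity path. The thresholds (14.4 for a page, 1.0 for a ticket) and the function names are assumptions for illustration; a team would tune them to its own windows.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo_target)."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def route_alert(fast_burn: float, slow_burn: float,
                page_threshold: float = 14.4,
                ticket_threshold: float = 1.0) -> str:
    """Map burn rates from a short and a long window to a severity path."""
    if fast_burn >= page_threshold:
        return "page"      # budget is burning fast: wake someone up
    if slow_burn >= ticket_threshold:
        return "ticket"    # sustained degradation: handle in working hours
    return "info"


# 144 failed requests out of 10,000 against a 99.9% SLO is a burn rate near 14.4.
rate = burn_rate(errors=144, total=10_000, slo_target=0.999)
print(round(rate, 1), route_alert(fast_burn=20.0, slow_burn=2.0))
```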

Deep dive: logs, metrics, traces, and alerting

Logs: from event analysis to operational diagnostics

Logs explain a specific event: who triggered the operation, what went wrong, at which step, and with what context.

Must-have

  • A shared JSON format and a stable record schema version.
  • Correlation through trace_id, span_id, and request_id so logs connect to traces.
  • Machine-readable fields such as operation, tenant, entity_id, outcome, and error_code.

Solution design

  • Sample info and debug events at the ingestion layer rather than inside the application.
  • Mask PII and secrets before sending records into the log pipeline.
  • Separate audit logs from product logs because retention and access rules differ.
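
A minimal sketch of masking before a record enters the log pipeline: redact values under sensitive keys and scrub email addresses from string fields. The key list, the email pattern, and the placeholder strings are assumptions; a real deployment would maintain its own rules.

```python
import re

# Hypothetical list of keys whose values must never leave the process.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_record(record: dict) -> dict:
    """Return a copy of the record with secrets redacted and emails scrubbed."""
    masked = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            masked[key] = "[REDACTED]"
        elif isinstance(value, dict):
            masked[key] = mask_record(value)  # recurse into nested context
        elif isinstance(value, str):
            masked[key] = EMAIL.sub("[EMAIL]", value)
        else:
            masked[key] = value
    return masked


print(mask_record({
    "message": "login failed for ann@example.com",
    "password": "hunter2",
    "ctx": {"token": "abc", "attempt": 3},
}))
```

Running the mask in the application or its local agent, before transport, keeps raw values out of queues and stores.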

Common mistakes

  • Raw stack traces without operation context or input parameters.
  • Free-form lines without machine-readable fields.
  • Logging every event without a budget for storage and query latency.

Structured log event example

{
  "ts": "2026-02-15T06:00:00Z",
  "level": "error",
  "service": "checkout-api",
  "operation": "CreateOrder",
  "trace_id": "6c5f...",
  "request_id": "req-1942",
  "tenant": "acme",
  "error_code": "PAYMENT_PROVIDER_TIMEOUT",
  "duration_ms": 1840
}
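
An event shaped like the one above could be produced with Python's standard logging module and a small JSON formatter. This is a sketch under assumptions: the `JsonFormatter` class and its `CONTEXT_FIELDS` list are hypothetical and simply mirror the fields in the example, and context is passed through the standard `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, copied from the record when present.
    CONTEXT_FIELDS = ("service", "operation", "trace_id", "request_id",
                      "tenant", "error_code", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)


logger = logging.getLogger("checkout-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error(
    "payment provider timed out",
    extra={"service": "checkout-api", "operation": "CreateOrder",
           "request_id": "req-1942", "tenant": "acme",
           "error_code": "PAYMENT_PROVIDER_TIMEOUT", "duration_ms": 1840},
)
```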

Telemetry pipeline: from signal to response

1. Instrumentation

Code and infrastructure publish telemetry signals through the OpenTelemetry SDK and collector.
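
Instrumentation also has to carry trace context across process boundaries. The sketch below shows the idea with the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags): a service keeps the incoming trace ID and issues a new span ID for its outgoing calls. The helper names are hypothetical, and in practice the OpenTelemetry SDK's propagators normally do this work, so treat it as an illustration of the mechanism rather than a replacement for the SDK.

```python
import re
import secrets
from typing import Optional

# traceparent: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace with the sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: Optional[str]) -> str:
    """Continue the incoming trace with a fresh span id, or start a new trace."""
    match = _TRACEPARENT.match(incoming or "")
    if not match or set(match.group(1)) == {"0"}:  # an all-zero trace id is invalid
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing_headers = {"traceparent": child_traceparent(incoming)}
print(outgoing_headers)
```

The same value can travel as a message header on a queue, which is how the trace survives an asynchronous hop.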

2. Collection and transport

Agents and collectors normalize signals and deliver them to storage, buffering bursts so data is not dropped in transit.

3. Storage

Separate stores for logs, metrics, and traces, with retention policies shaped by cost.
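
A back-of-the-envelope sketch of how retention drives cost: at steady state the retained volume is the daily ingest multiplied by the retention window. Every number below (daily volume, unit price, retention windows) is a made-up assumption chosen only to show the arithmetic.

```python
def monthly_storage_cost(daily_gb: float, retention_days: int,
                         price_per_gb_month: float) -> float:
    """Steady-state cost: retained volume (daily_gb * retention_days) times unit price."""
    return daily_gb * retention_days * price_per_gb_month


# Hypothetical log stream: 200 GB/day at 0.10 currency units per GB-month.
short = monthly_storage_cost(daily_gb=200, retention_days=14, price_per_gb_month=0.10)
long_ = monthly_storage_cost(daily_gb=200, retention_days=90, price_per_gb_month=0.10)
print(f"14 days: {short:.0f}, 90 days: {long_:.0f}")
```

Because cost grows linearly with the window, a common compromise is a short window for verbose logs and a longer one for aggregated metrics.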

4. Query and correlation

Unified navigation between chart, trace, and logs through shared identifiers.
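
The role of shared identifiers can be sketched as a join on trace_id: given log events and spans exported from separate stores, group them per trace so an investigator can move from a slow span to the matching error log. The record shapes and function names are assumptions for illustration.

```python
from collections import defaultdict


def correlate(logs: list[dict], spans: list[dict]) -> dict[str, dict]:
    """Group log events and spans by their shared trace_id."""
    by_trace: dict[str, dict] = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        if "trace_id" in entry:  # logs without a trace_id cannot be joined
            by_trace[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    return dict(by_trace)


def slowest_span(bundle: dict):
    """Return the span with the highest duration_ms, or None if there are none."""
    return max(bundle["spans"], key=lambda s: s["duration_ms"], default=None)


logs = [{"trace_id": "t1", "level": "error", "error_code": "PAYMENT_PROVIDER_TIMEOUT"}]
spans = [
    {"trace_id": "t1", "name": "CreateOrder", "duration_ms": 1840},
    {"trace_id": "t1", "name": "db.insert", "duration_ms": 12},
]
bundle = correlate(logs, spans)["t1"]
print(slowest_span(bundle)["name"], bundle["logs"][0]["error_code"])
```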

5. Alert and response

Alerts start the on-call process: triage, mitigation, and postmortem.

Practice

Troubleshooting Interview Example

Step-by-step incident diagnosis with RED metrics, logs, and hypotheses.


Alerting without noise: rule design

Fast budget burn (window: 5-15 minutes)

Goal: catch critical downtime quickly and reduce MTTR.

Signal: a high error-budget burn rate over a short window.

Slow budget burn (window: 1-6 hours)

Goal: detect long-running quality degradation before it turns into a full outage.

Signal: sustained deviation of latency or availability from the SLO.

Business anomaly (window: 15-60 minutes)

Goal: see business impact before customers report it.

Signal: a conversion drop, a spike in failed payments, or a failure in a critical journey.

The minimum team agreement: every page alert has an owner, a runbook, and a follow-up action item. Without that, alerting quickly turns into noise.
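
Part of that agreement can be checked mechanically before rules are deployed. The sketch below flags page-severity rules that lack an owner or a runbook link. The `AlertRule` shape and its field names are hypothetical; the follow-up action item is a process step after an incident, so it is not checked here.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Hypothetical in-repo representation of an alert rule."""
    name: str
    severity: str          # "page", "ticket", or "info"
    owner: str = ""
    runbook_url: str = ""


def lint_rules(rules: list[AlertRule]) -> list[str]:
    """Return one message per page alert that is missing an owner or a runbook."""
    problems = []
    for rule in rules:
        if rule.severity != "page":
            continue  # only page alerts must meet the minimum agreement
        if not rule.owner:
            problems.append(f"{rule.name}: page alert has no owner")
        if not rule.runbook_url:
            problems.append(f"{rule.name}: page alert has no runbook")
    return problems


rules = [
    AlertRule("checkout-fast-burn", "page", owner="payments-oncall"),
    AlertRule("disk-usage-trend", "ticket"),
]
print(lint_rules(rules))  # ['checkout-fast-burn: page alert has no runbook']
```

Running a check like this in CI is one way to keep the agreement from eroding as rules accumulate.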
