Core Source
Site Reliability Engineering
The four golden signals, SLI/SLO, and approaches to observability for production systems.
Observability & Monitoring Design answers a practical question: how to build an observability system that helps you make decisions during an incident, rather than merely accumulating dashboards. The design covers signals (logs/metrics/traces), a pipeline for delivering them, SLO-aware alerting, runbooks, and an operational improvement cycle after every failure.
Four pillars of production observability
Logs
Events and context for incident investigation.
- Structured JSON format instead of free-form text.
- Correlation fields: trace_id, span_id, request_id, user_id.
- Explicit separation of levels: info/warn/error/fatal and business errors.
Metrics
Numerical time series for SLO, capacity and early degradation detection.
- Basic frameworks: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors).
- Controlling the cardinality of labels (especially user_id, path, tenant).
- Separate metrics for the business funnel and the technical layer.
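The RED framework above can be sketched in a few lines. This is a minimal, hypothetical example that computes Rate, Errors, and Duration (p99) from raw request records in one window; real systems would use a metrics library and a time-series backend instead, and the `Request` type and field names here are illustrative assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    path: str          # pre-normalized to a route template to cap label cardinality
    error: bool
    duration_ms: float

def red_summary(requests: list[Request], window_s: float) -> dict:
    """Compute RED (Rate, Errors, Duration) for one window of requests."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    p99_index = min(n - 1, math.ceil(0.99 * n) - 1)  # nearest-rank percentile
    return {
        "rate_rps": n / window_s,
        "error_ratio": sum(r.error for r in requests) / n,
        "p99_ms": durations[p99_index],
    }
```

Normalizing `path` to a route template (e.g. `/orders/{id}` instead of `/orders/1942`) is what keeps label cardinality bounded.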
Distributed Tracing
End-to-end request path through services, queues and databases.
- Context propagation across sync/async boundaries (HTTP, gRPC, Kafka).
- Sampling strategy: tail sampling for rare errors and high-latency requests.
- A combination of trace + logs + metrics for quick root-cause analysis.
Alerting
Rules that fire only on genuinely important deviations.
- Alerts driven by SLO/error-budget burn rate, not just CPU/RAM.
- Severity separation (page/ticket/info) and clear escalation routes.
- Every page-level alert links to a runbook.
Deep dive into logs, metrics, traces and alerting
Logs: from forensic analysis to operational diagnostics
Logs exist to explain a specific event: who triggered it, what went wrong, at which step, and with what context.
Must-have
- Unified JSON format and stable schema version.
- Correlation via trace_id/span_id/request_id for joining logs with traces.
- Explicit domain fields: operation, tenant, entity_id, outcome, error_code.
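The must-haves above can be sketched with a stdlib JSON formatter. This is a minimal illustration, assuming correlation fields are passed via `extra=` on the logging call; the hard-coded `service` name and the chosen field set are assumptions, and a real service would take them from configuration.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with stable, machine-parseable fields."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "checkout-api",        # usually injected from config
            "message": record.getMessage(),
        }
        # Correlation/domain fields arrive via `extra=` on the logging call.
        for key in ("trace_id", "request_id", "operation", "error_code"):
            value = getattr(record, key, None)
            if value is not None:
                event[key] = value
        return json.dumps(event)
```

A call like `logger.error("payment timed out", extra={"trace_id": "6c5f", "operation": "CreateOrder"})` then produces one parseable JSON line per event.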
Solution design
- Sample info/debug logs at the ingest level, not in the application.
- Mask PII/secrets before events enter the log pipeline.
- Separate security/audit logs from product logs for different retention and access.
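Masking PII before events leave the service can be sketched as a pre-send filter. The patterns here are deliberately naive illustrations (a real pipeline would use Luhn checks for card numbers and a broader detector set), and the `<email>`/`<card>` placeholders are assumptions.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b\d{13,19}\b")  # naive PAN detector; real pipelines add Luhn checks

def mask_pii(event: dict) -> dict:
    """Mask obvious PII in string fields before the event leaves the service."""
    masked = {}
    for key, value in event.items():
        if isinstance(value, str):
            value = EMAIL.sub("<email>", value)
            value = CARD.sub("<card>", value)
        masked[key] = value
    return masked
```

Running the filter inside the service (rather than in the log backend) means raw PII never crosses the pipeline boundary, which simplifies retention and access decisions downstream.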
Common mistakes
- A raw stacktrace without operation context or input parameters.
- Free-form strings without machine-parseable fields.
- Logging every event with no budget for storage volume and query latency.
Example of a structured log event
{
"ts": "2026-02-15T06:00:00Z",
"level": "error",
"service": "checkout-api",
"operation": "CreateOrder",
"trace_id": "6c5f...",
"request_id": "req-1942",
"tenant": "acme",
"error_code": "PAYMENT_PROVIDER_TIMEOUT",
"duration_ms": 1840
}
Reference pipeline: from signal to reaction
1. Instrumentation
Code and infrastructure publish telemetry via the OpenTelemetry SDK/collector.
2. Collection & Transport
Agents/collectors normalize signals and deliver them to storage without loss.
3. Storage
Separate backends for logs/metrics/traces with retention policies based on cost.
4. Query & Correlation
Unified navigation between graphs, traces and logs using shared identifiers.
5. Alert & Response
Alerts trigger the on-call process: triage -> mitigation -> postmortem.
Practice
Troubleshooting Example
Step-by-step diagnosis of an incident using RED + logs and hypotheses.
Alerting without noise: rule design
Fast burn-rate (5-15 minute window)
Quickly catch critical outages and reduce MTTR.
Trigger: a high error-budget burn rate in a short window.
Slow burn-rate (1-6 hour window)
Catch long-term quality degradation before it becomes a full outage.
Trigger: sustained deviation of latency or availability from the SLO.
Business anomaly (15-60 minute window)
See product impact before customers report it.
Trigger: a drop in conversion, a surge in failed payments, or a drop in a key funnel.
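The fast burn-rate rule can be sketched numerically. Burn rate is the observed error ratio divided by the allowed error ratio (1 - SLO): a burn rate of 1.0 spends exactly the whole budget over the SLO period. The multi-window check (both a short and a long window must exceed the threshold) follows the pattern popularized by the SRE Workbook; the exact windows, the 99.9% SLO, and the 14.4x threshold below are illustrative assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    budget = 1.0 - slo            # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(short_ratio: float, long_ratio: float, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Fast-burn page: both the short (e.g. 5m) and long (e.g. 1h) windows must
    exceed the threshold, so a brief blip cannot page on its own."""
    return (burn_rate(short_ratio, slo) >= threshold
            and burn_rate(long_ratio, slo) >= threshold)
```

With a 99.9% SLO, a 2% error ratio is a burn rate of 20, which pages; the same 2% in the short window with a healthy long window does not, which is exactly the noise suppression the card above describes.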
The minimum team agreement: every page-level alert must have an owner, a runbook, and a postmortem action item. Without this, alerting quickly degrades into noise.
Next to this topic
Why do we need reliability and SRE?
Map of the entire SRE section: SLO, incidents, observability and releases.
Troubleshooting Interview
Practice diagnosing production incidents and working with hypotheses.
The Site Reliability Workbook
SLO/SLI, alerting and incident response in operational practice.
Prometheus: The Documentary
The history of the Prometheus ecosystem and the evolution of cloud-native monitoring.
eBPF: The Documentary
How eBPF extends observability in networks, kernel and security practices.
