Observability does not start with charts. It starts with the question of whether the team can explain degradation before users give up.
The chapter presents logs, metrics, tracing, SLO-based alerting, runbooks, and feedback loops as a single diagnostic platform that helps the team move from symptom to cause instead of merely collecting signals.
In design reviews, the chapter is especially useful for discussing signal quality, cardinality cost, alert fatigue, and which telemetry actually reduces investigation time.
Practical value of this chapter
Design in practice
Turn guidance on observability architecture and SLO-driven monitoring design into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make trade-offs explicit for observability architecture and SLO-driven monitoring design: release speed, automation level, observability cost, and operational complexity.
Core Source
Site Reliability Engineering
Four golden signals, SLI/SLO, and approaches to observability for production systems.
Observability & Monitoring Design answers a practical question: how to build an observability system that helps make decisions during an incident rather than merely accumulating dashboards. Observability design covers signals (logs/metrics/traces), a pipeline for their delivery, SLO-aware alerting, runbooks, and an operational improvement cycle after each failure.
Four pillars of production observability
Logs
Events and context for the incident investigation.
- Structured JSON format instead of free-form text.
- Correlation fields: trace_id, span_id, request_id, user_id.
- Explicit separation of levels: info/warn/error/fatal and business errors.
Metrics
Numerical time series for SLO, capacity and early degradation detection.
- Basic frameworks: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors).
- Controlling the cardinality of labels (especially user_id, path, tenant).
- Separate metrics for the business funnel and the technical layer.
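A minimal sketch of label-cardinality control, independent of any particular metrics library: high-cardinality path segments are collapsed into a route template before the path is used as a label. The regex and the `{id}` placeholder are illustrative.

```python
import re

# Matches segments that would explode label cardinality:
# plain numeric IDs and UUIDs.
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$"
)

def normalize_path(path: str) -> str:
    """Replace ID-like path segments with a placeholder so that
    /orders/12345 and /orders/67890 map to one time series."""
    parts = path.split("/")
    return "/".join("{id}" if ID_SEGMENT.match(p) else p for p in parts)

print(normalize_path("/orders/12345/items/987"))  # /orders/{id}/items/{id}
```

The same idea applies to any label whose raw values are unbounded (user_id, tenant, free-form paths): either drop the label or map it into a small, fixed value set before recording.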
Distributed Tracing
End-to-end request path through services, queues and databases.
- Context propagation across sync/async boundaries (HTTP, gRPC, Kafka).
- Sampling strategy: tail sampling for rare errors and high-latency requests.
- A combination of trace + logs + metrics for quick root-cause analysis.
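The tail-sampling strategy above can be sketched as a per-trace keep/drop decision made after the trace is complete. The data shape, threshold, and baseline rate here are illustrative assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class TraceSummary:
    # Hypothetical per-trace summary available once all spans have arrived
    has_error: bool
    max_latency_ms: float

def keep_trace(t: TraceSummary,
               latency_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.01,
               rng=random.random) -> bool:
    """Tail-sampling policy: always keep error traces and slow traces,
    plus a small random share of the rest as a healthy-traffic baseline."""
    if t.has_error or t.max_latency_ms >= latency_threshold_ms:
        return True
    return rng() < baseline_rate
```

The key property is that the decision happens after the fact, so rare errors and latency outliers are never lost, unlike with head-based sampling.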
Alerting
Rules that page people only on deviations that actually matter.
- Alerts driven by SLO/error-budget burn rate, not just CPU/RAM.
- Severity separation (page/ticket/info) and clear escalation routes.
- Each page-alert is accompanied by a runbook link.
Deep dive into logs, metrics, traces and alerting
Logs: from forensic analysis to operational diagnostics
Logs are needed to explain a specific event: who caused it, what went wrong, at what step and with what context.
Must-have
- Unified JSON format and stable schema version.
- Correlation by trace_id/span_id/request_id to join log lines with traces.
- Explicit domain fields: operation, tenant, entity_id, outcome, error_code.
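A minimal sketch of an emitter for such events. The correlation and domain field names follow the must-have list above; everything else is illustrative.

```python
import json
import datetime

def log_event(level: str, service: str, operation: str, *,
              trace_id=None, span_id=None, request_id=None,
              **domain_fields) -> str:
    """Build one structured JSON log line with a stable schema:
    timestamp, level, service, operation, correlation IDs, then
    domain fields (tenant, entity_id, outcome, error_code, ...)."""
    event = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "operation": operation,
        "trace_id": trace_id,
        "span_id": span_id,
        "request_id": request_id,
        **domain_fields,
    }
    return json.dumps(event)

print(log_event("error", "checkout-api", "CreateOrder",
                trace_id="6c5f", request_id="req-1942",
                error_code="PAYMENT_PROVIDER_TIMEOUT"))
```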
Solution design
- Do info/debug sampling at the ingest level, not at the application level.
- Mask PII/secrets before events are sent to the log pipeline.
- Separate security/audit logs from product logs for different retention and access.
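A sketch of pre-pipeline masking. The sensitive-field list is a placeholder for whatever your data-classification policy defines; real deployments often also need value-pattern matching (card numbers, tokens) rather than key names alone.

```python
# Hypothetical deny-list of field names; in practice this comes
# from a data-classification policy, not a hardcoded set.
SENSITIVE_KEYS = {"password", "card_number", "email", "ssn", "token"}

def mask_pii(event: dict) -> dict:
    """Return a copy of the event with sensitive values replaced,
    recursing into nested objects so nothing leaks via sub-fields."""
    masked = {}
    for key, value in event.items():
        if key in SENSITIVE_KEYS:
            masked[key] = "***"
        elif isinstance(value, dict):
            masked[key] = mask_pii(value)
        else:
            masked[key] = value
    return masked
```

Running this in the application (or in the collector, before export) keeps raw PII out of every downstream store, which is much cheaper than scrubbing it after ingestion.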
Common mistakes
- Raw stacktrace without operation context and input parameters.
- Free-form lines without machine-parse friendly fields.
- Logging every event with no budget for storage cost and query latency.
Example of a structured log-event
{
"ts": "2026-02-15T06:00:00Z",
"level": "error",
"service": "checkout-api",
"operation": "CreateOrder",
"trace_id": "6c5f...",
"request_id": "req-1942",
"tenant": "acme",
"error_code": "PAYMENT_PROVIDER_TIMEOUT",
"duration_ms": 1840
}
Reference pipeline: from signal to reaction
1. Instrumentation
Code and infrastructure publish telemetry via the OpenTelemetry SDK/collector.
2. Collection & Transport
Agents/collectors normalize signals and deliver them to storage without loss.
3. Storage
Separate backends for logs/metrics/traces with retention policies based on cost.
4. Query & Correlation
Unified navigation between a chart, a trace, and the logs via shared identifiers.
5. Alert & Response
Alerts launch an on-call process: triage -> mitigation -> postmortem.
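Step 4 of the pipeline can be sketched as a join of log events to a trace by the shared trace_id, which is exactly what a unified observability UI does under the hood. The event shape is illustrative.

```python
def logs_for_trace(trace_id: str, log_events: list[dict]) -> list[dict]:
    """Return the log events belonging to one distributed trace,
    ordered by timestamp so the incident timeline reads top to bottom."""
    hits = [e for e in log_events if e.get("trace_id") == trace_id]
    return sorted(hits, key=lambda e: e["ts"])

events = [
    {"ts": "2026-02-15T06:00:02Z", "trace_id": "6c5f", "operation": "ChargeCard"},
    {"ts": "2026-02-15T06:00:00Z", "trace_id": "6c5f", "operation": "CreateOrder"},
    {"ts": "2026-02-15T06:00:01Z", "trace_id": "9a01", "operation": "Login"},
]
print([e["operation"] for e in logs_for_trace("6c5f", events)])
```

This is why the correlation fields from the logs section are non-negotiable: without trace_id on every line, this join, and therefore fast root-cause analysis, is impossible.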
Practice
Troubleshooting Example
Step-by-step diagnosis of an incident using RED + logs and hypotheses.
Alerting without noise: rule design
- Fast burn rate (window: 5-15 minutes) — catch a critical outage quickly and reduce MTTR. Trigger: high error-budget burn rate in a short window.
- Slow burn rate (window: 1-6 hours) — detect long-term quality degradation before it becomes a complete failure. Trigger: steady deviation of latency or availability from the SLO.
- Business anomaly (window: 15-60 minutes) — see the product impact before customers contact you. Trigger: drop in conversion, surge in failed payments, or drop in a key funnel.
The minimum team agreement: every page alert must have an owner, a runbook, and a postmortem action item. Without this, alerting quickly degrades into noise.
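The fast burn-rate rule can be sketched as a multiwindow check. The 14.4 threshold is an assumption, not a universal constant: it is the commonly cited value for spending 2% of a 30-day error budget within one hour, and should be tuned to your SLO window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget burns.
    A rate of 1.0 spends exactly the whole budget over the SLO window."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(err_long_window: float, err_short_window: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multiwindow rule: page only when BOTH the long window (e.g. 1h)
    and the short window (e.g. 5m) burn fast. The short window stops
    paging on an old spike that has already recovered."""
    return (burn_rate(err_long_window, slo_target) >= threshold
            and burn_rate(err_short_window, slo_target) >= threshold)
```

For a 99.9% SLO, an error ratio of 2% burns at rate 20, so `should_page(0.02, 0.02, 0.999)` pages; if the last 5 minutes are already clean, it stays silent.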
References
Related chapters
- Why do we need reliability and SRE? - Map of the entire SRE section: SLO, incidents, observability and releases.
- Troubleshooting Interview - Practice diagnosing production incidents and working with hypotheses.
- The Site Reliability Workbook - SLO/SLI, alerting and incident response in operational practice.
- Prometheus: The Documentary - History of the Prometheus ecosystem and the evolution of cloud-native monitoring.
- eBPF: The Documentary - How eBPF extends observability in networks, kernel and security practices.
