Observability does not start with charts. It starts with the question of whether the team can explain degradation before users give up.
Logs, metrics, traces, SLO-based alerts, runbooks, and feedback loops are presented as one diagnostic platform for moving from symptom to cause.
In architecture reviews, the chapter helps discuss signal quality, high-cardinality cost, alert fatigue, and which telemetry actually reduces investigation time.
Practical value of this chapter
Design in practice
Design the telemetry path: instrumentation, collection, storage, correlation, alerting, and response.
Decision quality
Evaluate the stack through signal quality, high-cardinality cost, investigation time, and alert usefulness for on-call.
Interview articulation
Show the path from symptom to cause: metric, trace, log, hypothesis, and mitigation.
Trade-off framing
Make the cost of detail, retention windows, trace sampling, and alert noise explicit.
Source
Site Reliability Engineering
Four golden signals, SLI/SLO, and approaches to observability for production systems.
Observability & Monitoring Design answers a practical question: how do we build an observability system that helps teams make decisions during incidents instead of only accumulating dashboards? This chapter explains logs, metrics, distributed tracing, service objectives, error budgets, burn-rate alerts, alerting rules, runbooks, and the improvement loop after incidents.
Typical observability platform
Platform map
How signals move from the application to incident investigation and on-call action.
A typical platform separates collection, transport, storage, and response so teams can evolve each layer independently.
Services and clients: where signals originate
- Services: APIs, queues, background jobs
- Clients: web, mobile, edge events
Collectors: normalize and enrich
- OpenTelemetry Collector: receive, filter, route
- Agents: nodes, containers, sidecars
Transport: buffering and burst protection
- Queue / bus: Kafka, Pub/Sub, event stream
- Ingestion policies: sampling, retention, limits
Stores: different read models
- Metrics store: time series and PromQL
- Logs store: event and context search
- Trace store: spans, dependencies, latency
Analysis and response: decision and feedback
- Dashboards: SLO, RED/USE, business signals
- Alert rules: pages and tickets
- On-call: runbook, action, postmortem
Four pillars of observability
Logs
Events and context for incident investigation.
- Structured JSON logging instead of free-form text.
- Correlation fields: trace_id, span_id, request_id, user_id.
- Clear log levels: info, warn, error, fatal, plus business errors.
Metrics
Numerical time series for SLOs, capacity planning, and early degradation detection.
- Baseline models: RED metrics and USE metrics.
- Control label cardinality, especially for labels such as user_id, path, and tenant, where each distinct value creates a new time series.
- Business metrics next to technical signals.
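One way to keep label cardinality bounded is to normalize labels before a metric is recorded. The sketch below is illustrative only: the allow-list, the `:id` placeholder, and the helper names `normalize_route` and `metric_labels` are assumptions, not part of any particular metrics library.

```python
import re

# Hypothetical allow-list: only labels with a small, bounded value set survive.
ALLOWED_LABELS = {"service", "method", "route", "status_class"}

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_route(path: str) -> str:
    """Drop the query string and replace numeric or UUID segments with ':id'."""
    segments = []
    for segment in path.split("?")[0].split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


def metric_labels(raw: dict) -> dict:
    """Turn raw request attributes into a bounded label set."""
    labels = dict(raw)
    if "path" in labels:
        # Raw paths are unbounded; route templates are not.
        labels["route"] = normalize_route(labels.pop("path"))
    if "status" in labels:
        # 200, 201, 204 collapse into "2xx", and so on.
        labels["status_class"] = f"{int(labels.pop('status')) // 100}xx"
    # Anything not explicitly allowed (user_id, tenant, ...) is dropped.
    return {key: value for key, value in labels.items() if key in ALLOWED_LABELS}
```

Under these assumptions, per-user or per-tenant detail stays in logs and traces, where high cardinality is cheaper to store and query than in a time-series database.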
Distributed tracing
The end-to-end request path across services, queues, and databases.
- Context propagation across synchronous and asynchronous boundaries: HTTP, gRPC, Kafka.
- Tail sampling for rare errors and high-latency requests.
- Correlation between traces, logs, and metrics to find root cause faster.
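The tail-sampling idea can be sketched as a decision made only after all spans of a trace have arrived. This is a simplified illustration, not the OpenTelemetry Collector's implementation: the `Span` type, the 1000 ms threshold, and the 1% baseline rate are assumptions, and the longest span is used as a stand-in for end-to-end trace latency.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=1000.0, baseline_rate=0.01,
               rng=random.random):
    """Decide whether a completed trace should be stored.

    Traces with an error or a slow span are always kept; everything else
    is kept with probability `baseline_rate` to preserve a healthy baseline.
    """
    if any(span.is_error for span in spans):
        return True
    slowest = max((span.duration_ms for span in spans), default=0.0)
    if slowest >= latency_threshold_ms:
        return True
    return rng() < baseline_rate
```

The `rng` parameter exists only so the decision can be made deterministic in tests. The trade-off is that the sampler has to buffer spans until the trace is complete, which costs memory in the collection layer.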
Alerting
Rules that wake the team only for deviations that matter.
- Alerts based on SLOs and error-budget burn rate, not only CPU/RAM.
- Severity paths: page, ticket, or informational signal.
- Every page alert points to a runbook.
Deep dive: logs, metrics, traces, and alerting
Logs: from event analysis to operational diagnostics
Logs explain a specific event: who triggered the operation, what went wrong, at which step, and with what context.
Must-have
- A shared JSON format and a stable record schema version.
- Correlation through trace_id, span_id, and request ID so logs connect to traces.
- Machine-readable fields such as operation, tenant, entity_id, outcome, and error_code.
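A minimal sketch of these requirements using Python's standard `logging` module is shown below. The `CONTEXT_FIELDS` tuple, the `schema_version` field, and the `build_logger` helper are assumptions made for illustration; a real schema would be agreed on by the team.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical context fields; values are passed through logging's `extra`.
CONTEXT_FIELDS = (
    "service", "operation", "trace_id", "span_id", "request_id",
    "tenant", "entity_id", "outcome", "error_code", "duration_ms",
)


class JsonFormatter(logging.Formatter):
    SCHEMA_VERSION = 1  # assumed versioning scheme for the record schema

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "schema_version": self.SCHEMA_VERSION,
            # ISO 8601 timestamp in UTC (rendered with a +00:00 offset).
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Copy only the known context fields that were attached to the record.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes one JSON object per line."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("checkout-api").error("payment provider timed out", extra={"trace_id": "...", "error_code": "PAYMENT_PROVIDER_TIMEOUT"})` would then emit a single JSON line that carries the correlation fields alongside the message.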
Solution design
- Sample info and debug events at the ingestion layer rather than inside the application.
- Mask PII and secrets before sending records into the log pipeline.
- Separate audit logs from product logs because retention and access rules differ.
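Masking before the record leaves the process might look like the sketch below. The key list and the two regular expressions are deliberately simple assumptions; real PII detection usually needs a reviewed, domain-specific rule set.

```python
import re

# Hypothetical set of keys whose values are never allowed into the pipeline.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "email"}

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_CARD = re.compile(r"\b\d{13,19}\b")  # crude match for card-like digit runs


def mask_value(value):
    """Scrub e-mail addresses and card-like numbers from free-text values."""
    if isinstance(value, str):
        value = _EMAIL.sub("[email]", value)
        value = _CARD.sub("[card]", value)
    return value


def mask_event(event: dict) -> dict:
    """Return a copy of the log event that is safe to ship."""
    masked = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            masked[key] = "[redacted]"
        elif isinstance(value, dict):
            masked[key] = mask_event(value)  # recurse into nested context
        else:
            masked[key] = mask_value(value)
    return masked
```

Returning a copy rather than mutating the input keeps the original object usable by the application while guaranteeing that only the masked version reaches the log pipeline.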
Common mistakes
- Raw stack traces without operation context or input parameters.
- Free-form lines without machine-readable fields.
- Logging every event without a budget for storage and query latency.
Structured log event example
{
  "ts": "2026-02-15T06:00:00Z",
  "level": "error",
  "service": "checkout-api",
  "operation": "CreateOrder",
  "trace_id": "6c5f...",
  "request_id": "req-1942",
  "tenant": "acme",
  "error_code": "PAYMENT_PROVIDER_TIMEOUT",
  "duration_ms": 1840
}
Telemetry pipeline: from signal to response
1. Instrumentation
Code and infrastructure publish telemetry signals through the OpenTelemetry SDK and collector.
2. Collection and transport
Agents and collectors normalize signals and deliver them to storage, buffering bursts so that spikes in traffic do not cause data loss.
3. Storage
Separate stores for logs, metrics, and traces, with retention policies shaped by cost.
4. Query and correlation
Unified navigation between chart, trace, and logs through shared identifiers.
5. Alert and response
Alerts start the on-call process: triage, mitigation, and postmortem.
Practice
Troubleshooting Interview Example
Step-by-step incident diagnosis with RED metrics, logs, and hypotheses.
Alerting without noise: rule design
Fast budget burn (window: 5-15 minutes)
- Goal: catch critical downtime quickly and reduce MTTR.
- Trigger: high error-budget burn rate over a short window.
Slow budget burn (window: 1-6 hours)
- Goal: detect long-running quality degradation before it turns into a full outage.
- Trigger: sustained deviation of latency or availability from the SLO.
Business anomaly (window: 15-60 minutes)
- Goal: see business impact before customers report it.
- Trigger: conversion drop, spike in failed payments, or failure in a critical journey.
The minimum team agreement: every page alert has an owner, a runbook, and a follow-up action item. Without that, alerting quickly turns into noise.
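Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). With a 99.9% SLO the budget is 0.1%, so a window with 1.5% failed requests burns budget about 15 times faster than the SLO allows. The sketch below evaluates fast and slow rules on that basis; the window lengths, thresholds, and the `BurnRule` type are illustrative assumptions to be tuned per service, not recommended values.

```python
from dataclasses import dataclass


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how many times faster than budgeted the errors are being spent."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


@dataclass(frozen=True)
class BurnRule:
    name: str
    window_minutes: int
    threshold: float  # assumed value, tune per service
    severity: str     # "page" or "ticket"


# Hypothetical rules: a short window pages, a long window opens a ticket.
RULES = [
    BurnRule("fast-burn", window_minutes=10, threshold=10.0, severity="page"),
    BurnRule("slow-burn", window_minutes=180, threshold=2.0, severity="ticket"),
]


def firing(rules, counts, slo_target):
    """Return (name, severity) for every rule whose burn rate meets its threshold.

    `counts` maps window_minutes -> (errors, total) observed in that window.
    """
    fired = []
    for rule in rules:
        errors, total = counts[rule.window_minutes]
        if burn_rate(errors, total, slo_target) >= rule.threshold:
            fired.append((rule.name, rule.severity))
    return fired
```

Tying severity to the rule keeps the routing decision (page versus ticket) in the same place as the threshold, so a reviewer can judge both at once.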
References
Related chapters
- Why do we need reliability and SRE? - A map of the SRE section: SLOs, incidents, observability, and safe releases.
- Troubleshooting Interviews - Practice diagnosing production incidents and working with hypotheses.
- The Site Reliability Workbook - SLOs, SLIs, alerting rules, and incident response in operational practice.
- Prometheus: The Documentary - The history of the Prometheus ecosystem and monitoring for cloud-native systems.
- eBPF: The Documentary - How eBPF extends observability in networking, the kernel, and security practices.
