Core Source
Site Reliability Engineering
The four golden signals, SLI/SLO, and approaches to observability for production systems.
Observability & Monitoring Design answers a practical question: how to build an observability system that helps you make decisions during an incident, rather than merely accumulating dashboards. The design covers signals (logs/metrics/traces), a pipeline for delivering them, SLO-aware alerting, runbooks, and an operational improvement cycle after every failure.
Four pillars of production observability
Logs
Events and context for incident investigation.
- Structured JSON format instead of free-form text.
- Correlation fields: trace_id, span_id, request_id, user_id.
- Explicit separation of levels: info/warn/error/fatal and business errors.
Metrics
Numerical time series for SLO, capacity and early degradation detection.
- Basic frameworks: RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors).
- Controlling the cardinality of labels (especially user_id, path, tenant).
- Separate metrics for the business funnel and the technical layer.
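The RED framework above can be sketched in a few lines. This is a minimal, hypothetical example that computes Rate, Errors, and Duration (p99) from raw request records in one window; real systems would use a metrics library and a time-series backend instead, and the `Request` type and field names here are illustrative assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    path: str          # pre-normalized to a route template to cap label cardinality
    error: bool
    duration_ms: float

def red_summary(requests: list[Request], window_s: float) -> dict:
    """Compute RED (Rate, Errors, Duration) for one window of requests."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    p99_index = min(n - 1, math.ceil(0.99 * n) - 1)  # nearest-rank percentile
    return {
        "rate_rps": n / window_s,
        "error_ratio": sum(r.error for r in requests) / n,
        "p99_ms": durations[p99_index],
    }
```

Normalizing `path` to a route template (e.g. `/orders/{id}` instead of `/orders/1942`) is what keeps label cardinality bounded.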
Distributed Tracing
End-to-end request path through services, queues and databases.
- Context propagation across sync/async boundaries (HTTP, gRPC, Kafka).
- Sampling strategy: tail sampling for rare errors and high-latency requests.
- A combination of trace + logs + metrics for quick root-cause analysis.
Alerting
Rules that fire only on genuinely important deviations.
- Alerts driven by SLO/error-budget burn rate, not just CPU/RAM.
- Severity separation (page/ticket/info) and clear escalation routes.
- Every page-level alert links to a runbook.
Deep dive into logs, metrics, traces and alerting
Logs: from forensic analysis to operational diagnostics
Logs exist to explain a specific event: who triggered it, what went wrong, at which step, and with what context.
Must-have
- Unified JSON format and stable schema version.
- Correlation via trace_id/span_id/request_id for joining logs with traces.
- Explicit domain fields: operation, tenant, entity_id, outcome, error_code.
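The must-haves above can be sketched with a stdlib JSON formatter. This is a minimal illustration, assuming correlation fields are passed via `extra=` on the logging call; the hard-coded `service` name and the chosen field set are assumptions, and a real service would take them from configuration.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with stable, machine-parseable fields."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "checkout-api",        # usually injected from config
            "message": record.getMessage(),
        }
        # Correlation/domain fields arrive via `extra=` on the logging call.
        for key in ("trace_id", "request_id", "operation", "error_code"):
            value = getattr(record, key, None)
            if value is not None:
                event[key] = value
        return json.dumps(event)
```

A call like `logger.error("payment timed out", extra={"trace_id": "6c5f", "operation": "CreateOrder"})` then produces one parseable JSON line per event.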
Solution design
- Sample info/debug logs at the ingest level, not in the application.
- Mask PII/secrets before events enter the log pipeline.
- Separate security/audit logs from product logs for different retention and access.
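Masking PII before events leave the service can be sketched as a pre-send filter. The patterns here are deliberately naive illustrations (a real pipeline would use Luhn checks for card numbers and a broader detector set), and the `<email>`/`<card>` placeholders are assumptions.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b\d{13,19}\b")  # naive PAN detector; real pipelines add Luhn checks

def mask_pii(event: dict) -> dict:
    """Mask obvious PII in string fields before the event leaves the service."""
    masked = {}
    for key, value in event.items():
        if isinstance(value, str):
            value = EMAIL.sub("<email>", value)
            value = CARD.sub("<card>", value)
        masked[key] = value
    return masked
```

Running the filter inside the service (rather than in the log backend) means raw PII never crosses the pipeline boundary, which simplifies retention and access decisions downstream.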
Common mistakes
- A raw stacktrace without operation context or input parameters.
- Free-form strings without machine-parseable fields.
- Logging every event with no budget for storage volume and query latency.
Example of a structured log event
{
"ts": "2026-02-15T06:00:00Z",
"level": "error",
"service": "checkout-api",
"operation": "CreateOrder",
"trace_id": "6c5f...",
"request_id": "req-1942",
"tenant": "acme",
"error_code": "PAYMENT_PROVIDER_TIMEOUT",
"duration_ms": 1840
}
Reference pipeline: from signal to reaction
1. Instrumentation
Code and infrastructure publish telemetry via the OpenTelemetry SDK/collector.
2. Collection & Transport
Agents/collectors normalize signals and deliver them to storage without loss.
3. Storage
Separate backends for logs/metrics/traces with retention policies based on cost.
4. Query & Correlation
Unified navigation between graphs, traces and logs using shared identifiers.
5. Alert & Response
Alerts trigger the on-call process: triage -> mitigation -> postmortem.
Practice
Troubleshooting Example
Step-by-step diagnosis of an incident using RED + logs and hypotheses.
Alerting without noise: rule design
Fast burn-rate (5-15 minute window)
Quickly catch critical outages and reduce MTTR.
Trigger: a high error-budget burn rate in a short window.
Slow burn-rate (1-6 hour window)
Catch long-term quality degradation before it becomes a full outage.
Trigger: sustained deviation of latency or availability from the SLO.
Business anomaly (15-60 minute window)
See product impact before customers report it.
Trigger: a drop in conversion, a surge in failed payments, or a drop in a key funnel.
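The fast burn-rate rule can be sketched numerically. Burn rate is the observed error ratio divided by the allowed error ratio (1 - SLO): a burn rate of 1.0 spends exactly the whole budget over the SLO period. The multi-window check (both a short and a long window must exceed the threshold) follows the pattern popularized by the SRE Workbook; the exact windows, the 99.9% SLO, and the 14.4x threshold below are illustrative assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    budget = 1.0 - slo            # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(short_ratio: float, long_ratio: float, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Fast-burn page: both the short (e.g. 5m) and long (e.g. 1h) windows must
    exceed the threshold, so a brief blip cannot page on its own."""
    return (burn_rate(short_ratio, slo) >= threshold
            and burn_rate(long_ratio, slo) >= threshold)
```

With a 99.9% SLO, a 2% error ratio is a burn rate of 20, which pages; the same 2% in the short window with a healthy long window does not, which is exactly the noise suppression the card above describes.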
The minimum team agreement: every page-level alert must have an owner, a runbook, and a postmortem action item. Without this, alerting quickly degrades into noise.
Next to this topic
Why do we need reliability and SRE?
Map of the entire SRE section: SLO, incidents, observability and releases.
Troubleshooting Interview
Practice diagnosing production incidents and working with hypotheses.
The Site Reliability Workbook
SLO/SLI, alerting and incident response in operational practice.
Prometheus: The Documentary
The history of the Prometheus ecosystem and the evolution of cloud-native monitoring.
eBPF: The Documentary
How eBPF extends observability in networks, kernel and security practices.
