SRE and Operational Reliability

17 chapters

This page lists all chapters in this theme.

Why do we need reliability and SRE? [easy]

Introductory chapter on SLO/SLI, error budgets, observability, safe releases, incidents, and improvement loops.

SLI / SLO / SLA and Error Budgets [medium]

Practical walkthrough of SLI/SLO/SLA: choosing service indicators, calculating error budgets, reading burn rate, and tying SLOs to alerting and release policy.

Incident Management as an Engineering Discipline [medium]

How to structure incident management: on-call duty, escalation, postmortems, MTTD/MTTA/MTTR metrics, and the improvement loop after incidents.

Root Cause: Backend Bugs as SRE Training [medium]

Review of Hussein Nasser's book about real backend bugs: system-wide slowdowns, HTTP/1.1 and HTTP/2, load balancers, resource exhaustion, state corruption, and why investigations matter for SREs.

Site Reliability Engineering (short summary) [medium]

How Google turns reliability into an engineering discipline: SLOs, error budgets, toil, on-call, monitoring, and postmortems.

The Site Reliability Workbook (short summary) [hard]

Practical continuation of the SRE Book: implementing SLOs, alerting, incident process, postmortems, on-call practice, and toil reduction.

Release It! (short summary) [medium]

Resilience patterns from Michael Nygard: timeouts, circuit breakers, bulkheads, load shedding, and protection against cascading failures.

Grokking Continuous Delivery (short summary) [easy]

A practical introduction to CI/CD by Christie Wilson: delivery pipelines, version control, safe deployment, and DORA metrics.

Observability & Monitoring Design [medium]

Practical observability-platform design: logs, metrics, distributed tracing, SLO-based alerts, diagnostic dashboards, runbooks, and incident investigation.

Distributed tracing in microservices (Jaeger, Tempo) [medium]

Practical distributed tracing in microservices: Jaeger, Tempo, OpenTelemetry, write and read paths, sampling, trace storage, and latency investigation.

Performance Engineering [medium]

A systematic approach to performance: latency, throughput, profiling, load testing, capacity planning, and performance budgets.

Chaos Engineering: Gremlin, Litmus, Chaos Monkey [medium]

A practical approach to safe chaos experiments: blast radius, stop conditions, Gremlin, Litmus, Chaos Monkey, and resilience validation.

Engineering Reliable Mobile Applications (short summary) [medium]

Google's playbook for reliable mobile apps: client telemetry, staged rollout, feature flags, version support, and client impact on backend load.

Prometheus: The Documentary [medium]

The history of Prometheus: SoundCloud, the pull model, PromQL, Alertmanager, CNCF, and the path to a monitoring standard.

eBPF: The Documentary [hard]

The history of eBPF: Linux kernel programmability, verifier, JIT, Cilium, observability, networking use cases, and runtime protection.

AI, DevOps, and Kubernetes: Kelsey Hightower on What's Next [medium]

Interview with Kelsey Hightower about platform engineering, Kubernetes maturity, API contracts, AI guardrails, engineering culture, and team skills.

Technoshow “Dropped”: episode 1 [medium]

Blameless analysis of a two-week T-Bank data-platform incident: metadata loss, recovery through Kafka and data contracts, data SLOs, and engineering/management takeaways.