SRE and Operational Reliability
17 chapters
This page contains all chapters in this theme. Open chapters in sequence or use this page as a section map.
Why do we need reliability and SRE?
Original Content (easy). Introductory chapter: reliability, fault tolerance, releases, observability and incident management.
SLI / SLO / SLA and Error Budgets
Original Content (medium). Practical walkthrough of SLI/SLO/SLA: why they matter, how to read burn rate, and how to calculate the error budget with interactive calculators.
Incident Management as an Engineering Discipline
Original Content (medium). How to structure incident response as a discipline: on-call model, escalation policy, blameless postmortems and maturity metrics.
Site Reliability Engineering (short summary)
Book Summary (medium). How Google manages production: SLOs, error budgets, toil, on-call, postmortems and the four golden signals.
The Site Reliability Workbook (short summary)
Book Summary (hard). Practical continuation of the SRE Book: SLOs in practice, alerting, incident response and case studies from Google.
Release It! (short summary)
Book Summary (medium). Resilience patterns from Michael Nygard: timeouts, circuit breakers, bulkheads and protection against cascading failures.
Grokking Continuous Delivery (short summary)
Book Summary (easy). A practical introduction to CI/CD from Christie Wilson: pipelines, version control, secure deployment and DORA metrics.
Observability & Monitoring Design
Original Content (medium). Practical design of an observability platform: logs, metrics, distributed tracing, SLO-based alerting, runbooks and a feedback loop from production.
Distributed tracing in microservices (Jaeger, Tempo)
Original Content (medium). Practical distributed tracing in microservices: tracing architecture, Jaeger and Tempo, write/read path, sampling strategy, and operational trade-offs.
Performance Engineering
Original Content (medium). A systematic approach to performance: latency optimization, profiling, capacity planning and performance budgets in production.
Chaos Engineering: Gremlin, Litmus, Chaos Monkey
Original Content (medium). A practical guide to chaos engineering: how to design safe experiments and when to choose Gremlin, Litmus, or Chaos Monkey.
Engineering Reliable Mobile Applications (short summary)
Book Summary (medium). Mobile SRE from Google: staged rollouts, feature flags, client telemetry and impact on the backend.
Evolution of SRE: implementation of an AI assistant at T-Bank
Original Content (hard). Analysis of Ivan Yurchenko’s report on the platformization of incident management, an SRE AI assistant, LogAnalyzer and response quality metrics.
Prometheus: The Documentary
Documentary (medium). History of Prometheus: SoundCloud, PromQL and the path to a standard for cloud-native monitoring.
eBPF: The Documentary
Documentary (hard). Unlocking The Kernel: how a Linux kernel extension technology changed networking, security and observability.
AI, DevOps, and Kubernetes: Kelsey Hightower on What's Next
Documentary (medium). Interview with Kelsey Hightower about platform engineering, the evolution of DevOps, the maturity of Kubernetes, the role of API contracts, AI guardrails and the importance of soft skills.
Technoshow “Dropped”: episode 1
Documentary (medium). Blameless analysis of a two-week incident in the T-Bank data platform: loss of metadata, recovery via Kafka/contracts and practical conclusions on SRE for data.