System Design Space

SRE and Operational Reliability

17 chapters

This page contains all chapters in this theme. Open chapters in sequence or use this page as a section map.

1

Why do we need reliability and SRE?

Original Content · easy

Introductory chapter: reliability, fault tolerance, releases, observability and incident management.

2

SLI / SLO / SLA and Error Budgets

Original Content · medium

Practical walkthrough of SLI/SLO/SLA: why they matter, how to read burn rate, and how to calculate budget with interactive calculators.

3

Incident Management as an Engineering Discipline

Original Content · medium

How to structure incident response as a discipline: on-call model, escalation policy, blameless postmortems and maturity metrics.

4

Site Reliability Engineering (short summary)

Book Summary · medium

How Google manages production: SLOs, error budgets, toil, on-call, postmortems, and the four golden signals.

5

The Site Reliability Workbook (short summary)

Book Summary · hard

A practical continuation of the SRE Book: SLOs in practice, alerting, incident response, and case studies from Google.

6

Release It! (short summary)

Book Summary · medium

Resilience patterns from Michael Nygard: timeouts, circuit breakers, bulkheads, and protection against cascading failures.

7

Grokking Continuous Delivery (short summary)

Book Summary · easy

A practical introduction to CI/CD from Christie Wilson: pipelines, version control, secure deployment and DORA metrics.

8

Observability & Monitoring Design

Original Content · medium

Practical design of an observability platform: logs, metrics, distributed tracing, SLO-based alerting, runbooks, and a feedback loop from production.

9

Distributed tracing in microservices (Jaeger, Tempo)

Original Content · medium

Practical distributed tracing in microservices: tracing architecture, Jaeger and Tempo, write/read path, sampling strategy, and operational trade-offs.

10

Performance Engineering

Original Content · medium

A systematic approach to performance: latency optimization, profiling, capacity planning, and performance budgets in production.

11

Chaos Engineering: Gremlin, Litmus, Chaos Monkey

Original Content · medium

A practical guide to chaos engineering: how to design safe experiments and when to choose Gremlin, Litmus, or Chaos Monkey.

12

Engineering Reliable Mobile Applications (short summary)

Book Summary · medium

Mobile SRE from Google: staged rollouts, feature flags, client telemetry, and the impact of mobile clients on the backend.

13

Evolution of SRE: implementing an AI assistant at T-Bank

Original Content · hard

Analysis of Ivan Yurchenko’s talk on the platformization of incident management, the SRE AI assistant, LogAnalyzer, and response quality metrics.

14

Prometheus: The Documentary

Documentary · medium

The history of Prometheus: SoundCloud, PromQL, and the path to becoming a standard for cloud-native monitoring.

15

eBPF: The Documentary

Documentary · hard

Unlocking the Kernel: how a Linux kernel extension technology changed networking, security, and observability.

16

AI, DevOps, and Kubernetes: Kelsey Hightower on What's Next

Documentary · medium

Interview with Kelsey Hightower about platform engineering, the evolution of DevOps, the maturity of Kubernetes, the role of API contracts, AI guardrails, and the importance of soft skills.

17

Technoshow “Dropped”: episode 1

Documentary · medium

Blameless analysis of a two-week incident in the T-Bank data platform: loss of metadata, recovery via Kafka/contracts, and practical conclusions on SRE for data.
