System Design Space

Updated: February 21, 2026 at 11:59 PM

The Site Reliability Workbook (short summary)

Free version

SRE Workbook from Google

The full text of the book is available for free from Google at:

sre.google

The Site Reliability Workbook

Authors: Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
Publisher: O'Reilly Media, 2018
Length: 506 pages

Practical continuation of the SRE Book: SLO in practice, alerting, incident response and case studies from Google.

First book

Site Reliability Engineering

Review of the original SRE Book from Google

Read the review

Link to the original SRE Book

SRE Book (2016)

  • SRE Philosophy and Principles
  • Google experience from the inside
  • Theoretical foundation
  • "Why SRE Works"

SRE Workbook (2018)

  • How-To Guides
  • Templates and checklists
  • Case studies from different companies
  • "How to implement SRE"

Key themes of the book

SLO in practice

Step-by-step guide to choosing SLIs, setting SLOs and working with error budgets. How to document SLOs and communicate them to stakeholders.
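A minimal sketch of the error-budget arithmetic behind this workflow, assuming an event-based SLI (good requests out of total requests). The function name `error_budget` and the sample numbers are illustrative assumptions, not taken from the book.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error-budget usage for one SLO window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # The budget is the share of events that may fail without breaking the SLO.
    allowed_bad = (1.0 - slo_target) * total_events
    return {
        "sli": 1.0 - bad_events / total_events,
        "allowed_bad_events": allowed_bad,
        "remaining_bad_events": allowed_bad - bad_events,
        "budget_consumed": bad_events / allowed_bad,
    }


# Assumed sample window: 1,000,000 requests, 400 of them failed, 99.9% SLO.
report = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=400)
print(f"SLI: {report['sli']:.4%}, budget consumed: {report['budget_consumed']:.1%}")
```

A `budget_consumed` value above 1.0 means the budget for the window is exhausted, which is the point where an error budget policy would normally apply.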

Alerting

How to create alerts that actually matter. The fight against alert fatigue and the principles of actionable alerting.
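One common way to make alerts actionable is to page on how fast the error budget is being spent rather than on raw error counts. The sketch below assumes a burn-rate check over a long and a short window. The function names and the default threshold of 14.4 are illustrative assumptions: 14.4 is the rate at which 2% of a 30-day budget is spent in one hour (0.02 × 720 hours).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole budget exactly over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# Assumed sample values: 2% errors in both windows against a 99.9% SLO.
print(should_page(0.02, 0.02, slo_target=0.999))    # ongoing fast burn
print(should_page(0.02, 0.0005, slo_target=0.999))  # already recovered
```

Requiring both windows to exceed the threshold is a design choice that trades a little detection delay for fewer pages about problems that have already stopped.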

Incident Response

Structured incident response process: roles (Incident Commander, Ops Lead), communication, escalation.

Postmortem Culture

Blameless postmortem templates, how to debrief incidents, track action items and share lessons learned.

Toil Elimination

How to measure toil, prioritize automation and convince management to allocate time to eliminating routine work.
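As a rough illustration of toil measurement, the sketch below assumes the team logs hours per toil category and wants the overall toil share plus a ranking of automation candidates. The function name `toil_report`, the categories and the hours are all hypothetical.

```python
from collections import defaultdict


def toil_report(entries, total_hours: float):
    """Return (toil fraction, categories ranked by hours spent).

    entries: iterable of (category, hours) pairs logged by the team.
    total_hours: all working hours in the same period.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")

    by_category = defaultdict(float)
    for category, hours in entries:
        by_category[category] += hours

    toil_hours = sum(by_category.values())
    # Largest categories first: these are the first candidates for automation.
    ranked = sorted(by_category.items(), key=lambda item: item[1], reverse=True)
    return toil_hours / total_hours, ranked


# Assumed sample log for one engineer-week of 40 hours.
log = [("manual deploys", 6), ("ticket triage", 4), ("manual deploys", 2)]
fraction, ranked = toil_report(log, total_hours=40)
print(f"Toil share: {fraction:.0%}")
for category, hours in ranked:
    print(f"  {category}: {hours:g} h")
```

Tracking the share over several periods gives a concrete number to bring to management when asking for automation time.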

On-Call

Healthy on-call practices: scheduling, handoffs, compensation and burnout prevention.

Book structure

Part I

Foundations

How SRE has evolved since the first book. SLO in detail: SLI selection, error budget calculator, SLO document.

Part II

Practices

Monitoring and alerting. On-call. Incident management. Postmortems. Reliability testing (Chaos Engineering).

Part III

Processes

Organizational change management. SRE team models. Training and onboarding. Communication patterns.

Case Studies

Industry examples

Real stories of SRE implementation in different companies: startups, enterprises and companies outside the tech sector.

Practical tools from the book

SLO Document Template

Structure of the SLO document:

  • Service overview — description of the service
  • SLIs — metrics and measurement methods
  • SLOs — target values
  • Error budget - exhaustion policies
  • Rationale — rationale for choice
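The structure above can be sketched as a small data model with a consistency check. This is only an illustration: the class names, field names and sample values are assumptions, and the book's template is a prose document rather than code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SLI:
    name: str
    measurement: str  # how and where the indicator is measured


@dataclass
class SLO:
    sli_name: str
    target: float     # fraction, e.g. 0.999
    window_days: int


@dataclass
class SLODocument:
    service_overview: str
    slis: list[SLI]
    slos: list[SLO]
    error_budget_policy: str
    rationale: str

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means no issues found."""
        known = {sli.name for sli in self.slis}
        issues = []
        for slo in self.slos:
            if slo.sli_name not in known:
                issues.append(f"SLO references unknown SLI: {slo.sli_name}")
            if not 0.0 < slo.target < 1.0:
                issues.append(f"SLO target out of range for {slo.sli_name}")
        if not self.error_budget_policy.strip():
            issues.append("error budget policy is empty")
        return issues


# Hypothetical example document.
doc = SLODocument(
    service_overview="Checkout API for the web shop",
    slis=[SLI("availability", "share of non-5xx responses at the load balancer")],
    slos=[SLO("availability", target=0.999, window_days=28)],
    error_budget_policy="Freeze feature releases when the budget is exhausted",
    rationale="Matches observed performance over the last quarter",
)
print(doc.problems())
```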

Incident Command System

Incident roles:

  • Incident Commander (IC) — coordinates the response
  • Operations Lead — technical actions
  • Communications Lead — external communication
  • Planning Lead — documentation and handoffs

Postmortem Template

Sections of a postmortem document:

  • Summary — brief description of the incident
  • Impact — who was affected and how
  • Timeline — chronology of events
  • Root cause - systemic reasons
  • Action items — specific steps with owners
  • Lessons learned - what went well and what went poorly

Application in System Design interviews

Useful Concepts

  • SLO-driven architecture
  • Structured incident response
  • Alerting best practices
  • Chaos Engineering approaches
  • Toil measurement frameworks

Questions where it is useful

  • “How to define an SLO for a service?”
  • “How to respond to incidents?”
  • “Which alerts should I set up?”
  • “How to test reliability?”
  • “How to organize on-call?”

Related book

Building Secure and Reliable Systems

Security + reliability from Google

Read the review

Main conclusions

An SLO is not just a metric but a decision-making tool
Structured incident response reduces MTTR
Blameless postmortems are the key to a learning culture
Toil needs to be measured and systematically eliminated
On-call must be sustainable, otherwise it leads to burnout
SRE is a cultural change, not just a technology

© 2026 Alexander Polomodov