Free version
SRE Workbook from Google
The full text of the book is available online for free from Google
The Site Reliability Workbook
Authors: Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
Publisher: O'Reilly Media, 2018
Length: 506 pages
A practical follow-up to the SRE Book: SLOs in practice, alerting, incident response, and case studies from Google and other companies.
First book
Site Reliability Engineering
Review of the original SRE Book from Google
Link to the original SRE Book
SRE Book (2016)
- SRE Philosophy and Principles
- Google experience from the inside
- Theoretical foundation
- "Why SRE Works"
SRE Workbook (2018)
- How-To Guides
- Templates and checklists
- Case studies from different companies
- "How to implement SRE"
Key themes of the book
SLO in practice
A step-by-step guide to choosing SLIs, setting SLOs, and working with error budgets. How to document SLOs and communicate them to stakeholders.
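The arithmetic behind an error budget is easy to sketch. The example below is a minimal illustration, not code from the book: the function name, the event-based SLI, and the sample numbers are all assumptions chosen for clarity.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize error-budget consumption for one SLO window (hypothetical helper)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # The budget is the share of events the SLO allows to fail.
    allowed_bad = (1.0 - slo_target) * total_events
    # The SLI here is the fraction of good events in the window.
    sli = (total_events - bad_events) / total_events

    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_consumed": bad_events / allowed_bad,
        "slo_met": sli >= slo_target,
    }


# Assumed example: a 99.9% SLO, one million requests, 400 failures.
report = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=400)
print(report)
```

With these sample inputs, roughly 40% of the budget is spent and the SLO is still met.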
Alerting
How to create alerts that actually matter: fighting alert fatigue and applying the principles of actionable alerting.
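One common way to make alerts actionable is to page on how fast the error budget is burning rather than on raw error counts. The sketch below assumes a 30-day SLO window and a two-window check. The function names and sample values are illustrative. The threshold is derived in the code from the share of budget you are willing to lose before paging, not taken as a fixed constant.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def paging_threshold(budget_fraction: float, slo_window_hours: float,
                     alert_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn fast: the short window confirms it is still happening."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# Assumption: spending 2% of a 30-day budget within 1 hour is worth a page.
threshold = paging_threshold(budget_fraction=0.02, slo_window_hours=30 * 24,
                             alert_window_hours=1)

ongoing = should_page(0.02, 0.02, slo_target=0.999, threshold=threshold)
recovered = should_page(0.02, 0.001, slo_target=0.999, threshold=threshold)
print(threshold, ongoing, recovered)
```

The second call shows why the short window helps against alert fatigue: once the error ratio has dropped back, the alert stops firing even though the long window still looks bad.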
Incident Response
Structured incident response process: roles (Incident Commander, Ops Lead), communication, escalation.
Postmortem Culture
Blameless postmortem templates, how to debrief incidents, track action items and share lessons learned.
Toil Elimination
How to measure toil, prioritize automation, and convince management to allocate time for eliminating routine work.
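A simple way to make the case to management is to put numbers on toil. The following is a rough sketch under stated assumptions: the task list, the time estimates, and the "payback in weeks" ranking are invented for illustration and are not a framework from the book.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    minutes_per_occurrence: float
    occurrences_per_week: float
    automation_cost_hours: float  # estimated one-off effort to automate

    @property
    def weekly_hours(self) -> float:
        return self.minutes_per_occurrence * self.occurrences_per_week / 60.0

    @property
    def payback_weeks(self) -> float:
        # Weeks until the automation effort is repaid by the time saved.
        return self.automation_cost_hours / self.weekly_hours


# Hypothetical tasks and estimates.
tasks = [
    ToilTask("manual certificate rotation", 30, 4, 16),
    ToilTask("restart stuck batch job", 10, 12, 6),
    ToilTask("quota change tickets", 15, 20, 40),
]

team_hours_per_week = 3 * 40  # assumed: three engineers, 40 hours each
toil_fraction = sum(t.weekly_hours for t in tasks) / team_hours_per_week

# Automate first whatever pays back fastest.
ranked = sorted(tasks, key=lambda t: t.payback_weeks)
for task in ranked:
    print(f"{task.name}: {task.weekly_hours:.1f} h/week, payback {task.payback_weeks:.1f} weeks")
print(f"toil share of team time: {toil_fraction:.1%}")
```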
On-Call
Healthy on-call practices: scheduling, handoffs, compensation and burnout prevention.
Book structure
Foundations
How SRE has evolved since the first book. SLO in detail: SLI selection, error budget calculator, SLO document.
Practices
Monitoring and alerting. On-call. Incident management. Postmortems. Reliability testing (Chaos Engineering).
Processes
Organizational change management. SRE team models. Training and onboarding. Communication patterns.
Industry examples
Real stories of SRE adoption at different companies: startups, enterprises, and companies outside the tech sector.
Practical tools from the book
SLO Document Template
Structure of the SLO document:
- Service overview: what the service does
- SLIs: metrics and how they are measured
- SLOs: target values
- Error budget: policy for when the budget is exhausted
- Rationale: why these SLIs and targets were chosen
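The structure above maps naturally onto a small data model that can be checked for gaps before the document is circulated. This is a minimal sketch. The class names, field names, and validation rules are assumptions made for illustration, not a schema defined by the book.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    sli: str          # name of the SLI this target applies to
    target: float     # e.g. 0.999
    window_days: int  # e.g. 28


@dataclass
class SLODocument:
    service_overview: str
    slis: dict[str, str]  # SLI name -> how it is measured
    slos: list[SLO]
    error_budget_policy: str
    rationale: str

    def problems(self) -> list[str]:
        """List empty sections and SLOs that point at an undefined SLI."""
        issues = [f"empty section: {name}"
                  for name, value in vars(self).items() if not value]
        issues += [f"SLO references undefined SLI: {slo.sli}"
                   for slo in self.slos if slo.sli not in self.slis]
        return issues


# Hypothetical draft with two deliberate gaps.
doc = SLODocument(
    service_overview="Checkout API for the web shop",
    slis={"availability": "share of non-5xx responses at the load balancer"},
    slos=[SLO("availability", 0.999, 28), SLO("latency", 0.95, 28)],
    error_budget_policy="Freeze feature launches when the budget is exhausted",
    rationale="",
)
issues = doc.problems()
print(issues)
```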
Incident Command System
Incident roles:
- Incident Commander (IC) — coordinates the response
- Operations Lead — technical actions
- Communications Lead — external communication
- Planning Lead — documentation and handoffs
Postmortem Template
Sections of a postmortem document:
- Summary: brief description of the incident
- Impact: who was affected and how
- Timeline: chronology of events
- Root cause: systemic reasons
- Action items: specific steps with owners
- Lessons learned: what went well and what went poorly
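The sections listed above can be turned into a reusable skeleton, together with a check that every action item has an owner. This is an illustrative sketch. The helper names, the `TODO` placeholder, and the sample incident are assumptions, not the book's template text.

```python
from dataclasses import dataclass

SECTIONS = ["Summary", "Impact", "Timeline", "Root cause",
            "Action items", "Lessons learned"]


@dataclass
class ActionItem:
    description: str
    owner: str = ""


def render_skeleton(title: str) -> str:
    """Return an empty postmortem with one placeholder per section."""
    lines = [f"Postmortem: {title}"]
    for section in SECTIONS:
        lines += ["", section, "TODO"]
    return "\n".join(lines)


def unowned(items: list[ActionItem]) -> list[str]:
    """Action items without an owner tend not to get done; flag them."""
    return [item.description for item in items if not item.owner.strip()]


# Hypothetical incident and action items.
skeleton = render_skeleton("checkout outage")
items = [
    ActionItem("Add canary stage to deploy pipeline", owner="alice"),
    ActionItem("Alert on queue depth"),
]
missing_owner = unowned(items)
print(skeleton)
print(missing_owner)
```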
Application in System Design interviews
Useful Concepts
- SLO-driven architecture
- Structured incident response
- Alerting best practices
- Chaos Engineering approaches
- Toil measurement frameworks
Questions where it is useful
- “How do you define SLOs for a service?”
- “How do you respond to incidents?”
- “Which alerts should you set up?”
- “How do you test reliability?”
- “How do you organize on-call?”
Related book
Building Secure and Reliable Systems
Security + reliability from Google
