Free version
SRE Book from Google
The full text of the book is available for free on Google's SRE website.
Site Reliability Engineering
Authors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
Publisher: O'Reilly Media, 2016
Length: 552 pages
How Google runs production systems: SLOs, error budgets, toil, on-call, postmortems, and the four golden signals.
Key SRE Concepts
SLI / SLO / SLA
SLI (Service Level Indicator): a specific, measurable metric of service quality (latency, availability, error rate).
SLO (Service Level Objective): the target value for an SLI (for example, 99.9% availability).
SLA (Service Level Agreement): a contract with users that defines the consequences of missing the SLO.
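The relationship between an SLI and an SLO can be sketched in a few lines. This is a minimal illustration, not code from the book: the function names and request counts are hypothetical, and it assumes availability is measured as the fraction of successful requests.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the measured SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical numbers: 500 failed requests out of one million.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets a 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLA is not modeled here because it is a contract rather than a measurement; in practice it is usually set looser than the internal SLO.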
Error Budget
The amount of unreliability the SLO allows: if the SLO is 99.9%, the error budget is 0.1%. While budget remains, the team can take risks and roll out new features. Once the budget is exhausted, the focus shifts to reliability.
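The arithmetic behind an error budget is simple enough to work through directly. The sketch below assumes a 30-day window and a request-based budget; the function names and sample numbers are illustrative, not taken from the book.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests (or time) that is allowed to fail."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime in the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the budget still unspent (negative means overspent)."""
    allowed_failures = error_budget(slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        # Degenerate case: a 100% SLO or no traffic leaves no budget to spend.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


print(f"{allowed_downtime_minutes(0.999):.1f} minutes of downtime per 30 days")
remaining = budget_remaining(0.999, good=999_500, total=1_000_000)
print("ship features" if remaining > 0 else "focus on reliability")
```

With a 99.9% SLO this works out to about 43.2 minutes of downtime per 30 days, and 500 failures out of one million requests spend roughly half of the budget.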
Toil
Repetitive manual work that brings no lasting value: restarting services, manual scaling, responding to alerts by hand. SREs should automate toil away and keep it to no more than 50% of their time.
Postmortem Culture
Blameless postmortems: incident analysis that does not look for someone to blame. The focus is on systemic causes and preventing recurrence. Each postmortem documents the timeline, root cause, and action items.
Book structure
Introduction
What is SRE and how is it different from DevOps? How Google came to this model. Google production environment: Borg, monitoring, networking.
Principles
SLO and error budgets. Eliminating toil. Monitoring distributed systems. Release engineering. Simplicity.
Practices
Practical alerting. On-call. Effective troubleshooting. Emergency response. Postmortem culture. Tracking outages. Testing for reliability. Software engineering in SRE.
Management
Accelerating SREs to on-call. Dealing with interrupts. Operational overload. Communication and collaboration.
Important practices from the book
Monitoring & Alerting
Four golden signals:
- Latency: time to serve a request, tracked separately for successful and failed requests
- Traffic: demand on the system, such as requests per second
- Errors: rate of requests that fail
- Saturation: how full the service is, that is, how close its most constrained resources are to their limits
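As a rough illustration of the four signals above, the sketch below derives them from a batch of request records. Everything here is an assumption made for the example: the `Request` record, the use of the median as the latency summary, and CPU utilization as the stand-in for saturation. A real system would typically read these from its monitoring stack.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(
    requests: List[Request], window_seconds: float, cpu_utilization: float
) -> Dict[str, Optional[float]]:
    """Summarize one observation window into the four golden signals."""
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    failed_latencies = [r.latency_ms for r in requests if not r.ok]
    return {
        # Latency is reported separately for successful and failed requests.
        "latency_ok_p50_ms": median(ok_latencies) if ok_latencies else None,
        "latency_failed_p50_ms": median(failed_latencies) if failed_latencies else None,
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_seconds,
        # Errors: fraction of requests that failed.
        "error_rate": len(failed_latencies) / len(requests) if requests else 0.0,
        # Saturation: passed in, since it comes from resource metrics.
        "saturation": cpu_utilization,
    }


sample = [Request(100, True), Request(200, True), Request(300, True), Request(50, False)]
print(golden_signals(sample, window_seconds=2, cpu_utilization=0.7))
```

Failed requests are kept out of the success latency because fast failures would otherwise make the service look quicker than it is.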
On-Call
Principles of healthy on-call:
- No more than 25% of an SRE's time spent on-call
- At most 2 incidents per shift (more than that means overtime)
- Clear runbooks for common problems
- Mandatory handoff between shifts
Release Engineering
How Google deploys:
- Hermetic builds: reproducible builds that do not depend on what is installed on the build machine
- Canary releases: gradual rollout to a small share of traffic first
- Feature flags for risk control
- Automatic rollback when SLOs degrade
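The combination of canary rollout and automatic rollback can be sketched as a loop over traffic stages. This is a simplified model under stated assumptions, not Google's actual release tooling: `observe_error_rate` is a hypothetical callback standing in for a monitoring query, and the stage percentages and threshold are made up.

```python
from typing import Callable, Sequence, Tuple


def canary_rollout(
    stages: Sequence[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[bool, int]:
    """Advance through traffic percentages; roll back on SLO degradation.

    Returns (succeeded, final_percent). A rollback ends at 0 percent.
    """
    current = 0
    for percent in stages:
        current = percent  # shift this share of traffic to the new version
        if observe_error_rate(current) > max_error_rate:
            # Error rate breached the threshold: roll back automatically.
            return False, 0
    return True, current


# Hypothetical release that starts failing once it receives 50% of traffic.
def flaky(percent: int) -> float:
    return 0.0005 if percent < 50 else 0.02


print(canary_rollout([1, 10, 50, 100], flaky, max_error_rate=0.001))
```

Starting at a small percentage limits how many users see a bad release before the rollback triggers.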
Applying it in a System Design interview
Useful Concepts
- Defining SLOs while clarifying requirements
- Error budget as a metric for trade-offs
- Four golden signals for monitoring
- Graceful degradation
- Circuit breaker pattern
- Canary deployments
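Of the concepts listed above, the circuit breaker is the one most often sketched on a whiteboard. The version below is a minimal, single-threaded illustration with assumed parameter names and defaults (`failure_threshold`, `reset_timeout`). It is not taken from the book or from any particular library.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the unhealthy dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open is also a simple form of graceful degradation: the caller can return a cached or default response instead of waiting on a dependency that is known to be down.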
Questions where this is useful
- “How will you monitor the system?”
- “What SLOs would you set?”
- “How would you handle failures?”
- “How would you deploy without downtime?”
- “What would you do if the system is overloaded?”
Related books from Google
The Site Reliability Workbook
Google, 2018
A practical continuation of the SRE Book with specific examples, templates and case studies.
Building Secure and Reliable Systems
Google, 2020
How to combine security and reliability. Secure development practices from Google.
