System Design Space

Updated: February 21, 2026 at 11:59 PM

Site Reliability Engineering (short summary)

Free version

SRE Book from Google

The full text of the book is available online for free from Google.

sre.google

Site Reliability Engineering

Editors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
Publisher: O'Reilly Media, 2016
Length: 552 pages

How Google manages production: SLOs, error budgets, toil, on-call, postmortems, and the four golden signals.

Key SRE Concepts

SLI / SLO / SLA

SLI (Service Level Indicator) - a specific, measurable metric of service quality (latency, availability, error rate).

SLO (Service Level Objective) - the target value for an SLI (for example, 99.9% availability).

SLA (Service Level Agreement) - a contract that defines the consequences of violating the SLO.
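
As a minimal sketch of how an SLI relates to an SLO, the snippet below computes an availability SLI from request counts and compares it with a target. The function names and the sample numbers are illustrative assumptions, not taken from the book.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must reach the target value."""
    return sli >= slo_target


# Hypothetical numbers for one measurement window.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA would sit on top of this check as a business agreement; it is not something the code itself can express.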

Error Budget

The error budget is the amount of unreliability the SLO allows: with a 99.9% SLO, the error budget is 0.1%. While budget remains, the team can take risks and roll out new features. Once the budget is exhausted, the focus shifts to reliability.
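
To make the arithmetic concrete, here is a rough sketch that converts an availability SLO into allowed downtime and tracks how much of a request-based budget is left. The 30-day window, the function names, and the sample request counts are assumptions chosen for illustration. With a 99.9% SLO over 30 days, the budget works out to about 43 minutes of downtime.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime within the window for a given availability SLO."""
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the request-based error budget that is still unspent.

    A negative result means the budget has been overspent.
    """
    allowed_failures = total * (1.0 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed / allowed_failures


print(error_budget(0.999, timedelta(days=30)))           # allowed downtime
print(budget_remaining(0.999, total=1_000_000, failed=250))  # share of budget left
```

A release policy could then gate risky rollouts on `budget_remaining` staying above zero, which is one way to express the "take risks until the budget is exhausted" rule.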

Toil

Toil is routine manual work that brings no long-term value: restarting services, scaling by hand, responding to alerts manually. SREs should automate toil away and spend no more than 50% of their time on it.
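
One possible way to check the 50% ceiling is to log time by category and compute the toil share. The sketch below assumes a simple list of (category, hours) pairs; the category labels and the sample week are hypothetical.

```python
TOIL_CAP = 0.5  # the 50% ceiling described above

# Assumption: which categories count as toil is a team decision; these labels are hypothetical.
TOIL_CATEGORIES = {"manual_restart", "manual_scaling", "alert_response"}


def toil_fraction(entries):
    """Return the share of logged hours spent on toil.

    entries is an iterable of (category, hours) pairs.
    """
    total = 0.0
    toil = 0.0
    for category, hours in entries:
        total += hours
        if category in TOIL_CATEGORIES:
            toil += hours
    return toil / total if total else 0.0


# Hypothetical time log for one week.
week = [
    ("manual_restart", 6),
    ("alert_response", 10),
    ("automation_project", 20),
    ("manual_scaling", 4),
]
share = toil_fraction(week)
print(f"toil share: {share:.0%}, over cap: {share > TOIL_CAP}")
```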

Postmortem Culture

Blameless postmortems analyze incidents without looking for someone to blame. They focus on systemic causes and on preventing recurrence, and they document the timeline, root cause, and action items.

Book structure

Part I

Introduction

What SRE is and how it differs from DevOps. How Google arrived at this model. Google's production environment: Borg, monitoring, networking.

Part II

Principles

SLOs and error budgets. Eliminating toil. Monitoring distributed systems. Release engineering. Simplicity.

Part III

Practices

Practical alerting. On-call. Effective troubleshooting. Emergency response. Postmortem culture. Tracking outages. Testing for reliability. Software engineering in SRE.

Part IV

Management

Accelerating SREs to on-call. Dealing with interrupts. Operational overload. Communication and collaboration.

Important practices from the book

Monitoring & Alerting

Four golden signals:

  • Latency - response time, tracked separately for successful and failed requests
  • Traffic - the volume of requests hitting the system
  • Errors - the share of requests that fail
  • Saturation - how full the system's resources are
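
The four signals can be sketched as a small aggregation over request records from one time window. This is only an illustration: the `Request` record, the nearest-rank percentile helper, and the choice of a single utilization number for saturation are all assumptions, not a prescribed implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list; None if it is empty."""
    if not sorted_values:
        return None
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests, window_s, saturation):
    """Summarize one window of traffic into the four golden signals.

    saturation is passed in as-is (for example CPU or queue utilization, 0..1),
    because it comes from resource metrics rather than from request logs.
    """
    ok_latencies = sorted(r.latency_ms for r in requests if r.ok)
    failed_latencies = sorted(r.latency_ms for r in requests if not r.ok)
    return {
        "latency_p99_ok_ms": percentile(ok_latencies, 99),
        "latency_p99_failed_ms": percentile(failed_latencies, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed_latencies) / len(requests) if requests else 0.0,
        "saturation": saturation,
    }


# Hypothetical two-second window with four requests.
sample = [Request(10, True), Request(20, True), Request(30, True), Request(500, False)]
print(golden_signals(sample, window_s=2, saturation=0.7))
```

Latency is reported separately for successful and failed requests so that fast failures do not make the service look healthier than it is.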

On-Call

Principles of healthy on-call:

  • No more than 25% of an SRE's time spent on call
  • At most 2 incidents per shift (more than that overloads the engineer)
  • Clear runbooks for common problems
  • Mandatory handoff between shifts

Release Engineering

How Google deploys:

  • Hermetic builds - reproducible builds
  • Canary releases - gradual rollout
  • Feature flags for risk control
  • Automatic rollback when SLOs degrade
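
A canary rollout with automatic rollback can be sketched as a loop over traffic stages. This is not Google's release tooling: the stage percentages, the callback names, and the error-rate threshold are assumptions made for the example, and a real system would also wait and collect metrics at each stage.

```python
from typing import Callable, Sequence


def canary_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    observed_error_rate: Callable[[], float],
    max_error_rate: float,
) -> bool:
    """Shift traffic to the new version stage by stage.

    Returns True if every stage stayed healthy, or False after rolling
    back to 0% because the error rate exceeded the allowed maximum.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if observed_error_rate() > max_error_rate:
            set_traffic_percent(0)  # automatic rollback
            return False
    return True


# Usage with stand-ins: the traffic switch just records calls, and the
# error rates are canned values where the third stage degrades.
history = []
rates = iter([0.0005, 0.0008, 0.02])
ok = canary_rollout([1, 10, 50, 100], history.append, lambda: next(rates), max_error_rate=0.001)
print(ok, history)
```

Passing the traffic switch and the metric source in as callables keeps the rollout logic testable without a real load balancer or monitoring backend.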

Applying the book in a system design interview

Useful Concepts

  • Defining SLOs while clarifying requirements
  • Error budget as a metric for trade-offs
  • Four golden signals for monitoring
  • Graceful degradation
  • Circuit breaker pattern
  • Canary deployments
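
Two of the concepts above, the circuit breaker and graceful degradation, combine naturally: when a dependency keeps failing, stop calling it for a while and serve a fallback instead. Below is a minimal single-threaded sketch; the class, its parameters, and the thresholds are assumptions for illustration rather than any particular library's API.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        """Run func, or return fallback() while the circuit is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: fail fast and degrade gracefully
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage: a dependency that always fails, with a cached value as the fallback.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0)


def fetch_recommendations():
    raise ConnectionError("recommendation service is down")  # hypothetical dependency


for _ in range(3):
    print(breaker.call(fetch_recommendations, fallback=lambda: ["cached", "defaults"]))
```

The injectable `clock` is a design choice that lets the timeout behavior be tested without sleeping.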

Questions where it will be useful

  • “How will you monitor the system?”
  • “What SLOs would you set?”
  • “How do you handle failures?”
  • “How do you deploy without downtime?”
  • “What do you do when the system is overloaded?”

Related books from Google

The Site Reliability Workbook

Google, 2018

A practical continuation of the SRE Book with specific examples, templates and case studies.

Building Secure and Reliable Systems

Google, 2020

How to combine security and reliability. Secure development practices from Google.

Main conclusions

SRE is the application of software engineering to operational problems
Error budget is a key tool for balancing speed and reliability
Toil needs to be measured and automated
Blameless postmortems improve culture and systems
Monitoring must be actionable
Simplicity is the most important principle of reliable systems

© 2026 Alexander Polomodov