Updated: March 24, 2026 at 3:23 PM

Site Reliability Engineering (short summary)

Google's SRE book matters less for its vocabulary than for its model, in which reliability becomes a shared economic problem across engineering, product, and operations.

It pulls together SLOs, error budgets, toil reduction, on-call, postmortems, and the four golden signals into a coherent way of running production through measurements and rules instead of the instincts of whoever is on duty.

For interviews, it provides a strong frame for discussing service objectives, operational load, automation boundaries, and the cost of failure in large systems.

Practical value of this chapter

Design in practice

Turn Google's core SRE principles into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate an architecture by its SLOs, error budget, MTTR, and critical-path resilience rather than by feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make the trade-offs in applying SRE principles explicit: release speed, automation level, observability cost, and operational complexity.

Free version

SRE Book from Google

The full text of the book is available for free from Google at sre.google.

Site Reliability Engineering

Authors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
Publisher: O'Reilly Media, 2016
Length: 552 pages

How Google runs production: SLOs, error budgets, toil, on-call, postmortems, and the four golden signals.

Key SRE Concepts

SLI / SLO / SLA

SLI (Service Level Indicator): a specific, measurable metric of service quality (latency, availability, error rate).

SLO (Service Level Objective): the target value for an SLI (for example, 99.9% availability).

SLA (Service Level Agreement): a contract that defines consequences for violating the SLO.
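
The relationship between an SLI and an SLO can be illustrated with a minimal sketch. It assumes availability is measured as the fraction of successful requests in a window; the `Window` type, the function names, and the sample numbers are hypothetical and do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    total_requests: int
    good_requests: int


def availability_sli(w: Window) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if w.total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return w.good_requests / w.total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must be at or above the target."""
    return sli >= slo_target


window = Window(total_requests=1_000_000, good_requests=999_200)
sli = availability_sli(window)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA would sit on top of this check as a contractual rule, typically with a looser target than the internal SLO.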

Error Budget

The error budget is the allowable amount of unreliability: if the SLO is 99.9%, the error budget is 0.1%. While budget remains, the team can take risks and roll out new features. Once the budget is exhausted, the focus shifts to reliability.
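
The arithmetic behind an error budget can be sketched as follows. The 30-day window, the function names, and the request counts are illustrative assumptions, not values prescribed by the book.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Error budget as a fraction: 1 - SLO (0.999 gives roughly 0.001)."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget allows over the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = error_budget_fraction(slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def can_ship_features(remaining: float) -> bool:
    """Policy from the text: take release risk only while budget remains."""
    return remaining > 0


print(f"99.9% over 30 days allows {allowed_downtime_minutes(0.999):.1f} minutes of downtime")
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=400)
print(f"Budget remaining: {remaining:.0%}, ship features: {can_ship_features(remaining)}")
```

With a 99.9% SLO, a 30-day window allows about 43.2 minutes of full downtime, which is why each extra "nine" sharply reduces room for risky releases.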

Toil

Routine manual work that brings no long-term value: restarting services, scaling by hand, responding to alerts manually. SREs should automate toil away and spend no more than 50% of their time on it.
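
Measuring toil against the 50% limit can be as simple as the sketch below. The category names and the hours are made up for illustration; how a team actually tracks its time is an assumption here.

```python
TOIL_LIMIT = 0.5  # the 50% ceiling on toil mentioned above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours that fall into categories the team labels as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


# Hypothetical week for one engineer, in hours.
week = {"restarts": 6, "manual_scaling": 4, "alert_response": 12, "project_work": 18}
fraction = toil_fraction(week, {"restarts", "manual_scaling", "alert_response"})
print(f"Toil: {fraction:.0%}, over limit: {fraction > TOIL_LIMIT}")
```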

Postmortem Culture

Blameless postmortems analyze incidents without looking for someone to blame. The focus is on systemic causes and preventing recurrence. Each postmortem documents the timeline, root cause, and action items.

Book structure

Part I

Introduction

What SRE is and how it differs from DevOps. How Google arrived at this model. Google's production environment: Borg, monitoring, networking.

Part II

Principles

SLO and error budgets. Eliminating toil. Monitoring distributed systems. Release engineering. Simplicity.

Part III

Practices

Practical alerting. On-call. Effective troubleshooting. Emergency response. Postmortem culture. Tracking outages. Testing for reliability. Software engineering in SRE.

Part IV

Management

Accelerating SREs to on-call. Dealing with interrupts. Operational overload. Communication and collaboration.

Important practices from the book

Monitoring & Alerting

Four golden signals:

  • Latency: response time (tracked separately for successful and failed requests)
  • Traffic: volume of requests to the system
  • Errors: percentage of unsuccessful requests
  • Saturation: how close resources are to their limits
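
The four signals above can be derived from raw request records, as in this minimal sketch. The `Request` shape, the nearest-rank percentile, and the use of CPU utilization as the saturation measure are assumptions made for illustration; real systems usually read these values from a metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_seconds: float, cpu_utilization: float) -> dict[str, float]:
    """Compute latency, traffic, errors, and saturation for one window."""
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    failed_latencies = [r.latency_ms for r in requests if not r.ok]
    total = len(requests)
    return {
        # Latency is split by outcome, since fast failures can hide slow successes.
        "latency_p99_ok_ms": percentile(ok_latencies, 99),
        "latency_p99_failed_ms": percentile(failed_latencies, 99),
        "traffic_rps": total / window_seconds,
        "error_rate": len(failed_latencies) / total if total else 0.0,
        # Assumption: saturation is approximated by CPU utilization (0.0 to 1.0).
        "saturation": cpu_utilization,
    }


sample = [Request(100.0, True)] * 98 + [Request(5.0, False)] * 2
print(golden_signals(sample, window_seconds=10, cpu_utilization=0.72))
```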

On-Call

Principles of healthy on-call:

  • No more than 25% of SRE time on on-call
  • Maximum 2 incidents per shift (beyond that, the work spills into overtime)
  • Clear runbooks for common problems
  • Mandatory handoff between shifts

Release Engineering

How Google deploys:

  • Hermetic builds: builds that are reproducible
  • Canary releases: gradual rollout to a small share of traffic first
  • Feature flags for risk control
  • Automatic rollback when SLOs degrade
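
A canary rollout with automatic rollback can be sketched as a loop over traffic steps. The step percentages, the `observe_error_rate` callback, and the rule of comparing the canary's error rate with the SLO's allowed error rate are hypothetical choices for illustration, not Google's actual release tooling.

```python
from typing import Callable

# Hypothetical rollout stages, as percent of traffic sent to the new version.
ROLLOUT_STEPS = [1, 5, 25, 100]


def run_rollout(observe_error_rate: Callable[[int], float], slo_error_rate: float) -> tuple[int, bool]:
    """Advance through the stages; roll back as soon as the error rate breaches the SLO.

    Returns (final_percent, rolled_back).
    """
    for percent in ROLLOUT_STEPS:
        rate = observe_error_rate(percent)
        if rate > slo_error_rate:
            # Rollback: send all traffic back to the previous version.
            return 0, True
    return 100, False


# A healthy release stays under the 0.1% error rate at every stage.
print(run_rollout(lambda percent: 0.0005, slo_error_rate=0.001))

# A release that degrades once it receives 25% of traffic is rolled back.
print(run_rollout(lambda percent: 0.0005 if percent < 25 else 0.01, slo_error_rate=0.001))
```

Stopping at the first breach limits the blast radius to the share of traffic the canary was serving at that stage.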

Application in System Design interviews

Useful Concepts

  • Defining SLOs while clarifying requirements
  • Error budget as a metric for trade-offs
  • Four golden signals for monitoring
  • Graceful degradation
  • Circuit breaker pattern
  • Canary deployments

Questions where it will be useful

  • “How will you monitor the system?”
  • “What SLOs would you set?”
  • “How do you handle failures?”
  • “How do you deploy without downtime?”
  • “What do you do when the system is overloaded?”

Main conclusions

SRE is the application of software engineering to operational problems
Error budget is a key tool for balancing speed and reliability
Toil needs to be measured and automated
Blameless postmortems improve culture and systems
Monitoring must be actionable
Simplicity is the most important principle of reliable systems
