Site Reliability Engineering (short summary)

The value of Google's SRE book lies less in its vocabulary than in its model: reliability becomes a shared economic problem across engineering, product, and operations.

It ties SLOs, error budgets, toil reduction, on-call, postmortems, and the four golden signals into a coherent production operating model.

For interviews, it gives you a strong frame for discussing service objectives, operational load, automation boundaries, and the cost of failure in large systems.

Practical value of this chapter

Design in practice

Design services around measurable goals: SLIs, SLOs, error budgets, the four golden signals, and clear response rules.

Decision quality

Evaluate architecture through toil, on-call cost, release risk, and the team's ability to learn from incidents.

Interview articulation

Explain how the chosen SLO shapes monitoring, release policy, rollback, and the priority of engineering improvements.

Trade-off framing

Make the balance explicit between delivery speed, reliability, operational load, and the cost of downtime.

Free version

SRE Book from Google

The full text of the book is available for free on Google's SRE site

sre.google

Site Reliability Engineering

Editors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
Publisher: O'Reilly Media, 2016
Length: 552 pages

How Google turns reliability into an engineering discipline: SLOs, error budgets, toil, on-call, monitoring, and postmortems.

This chapter treats SRE as an engineering model for reliability: SLI/SLO/SLA, error budgets, toil, the four golden signals, on-call duty, incident response, postmortems, and release engineering.

Key SRE Ideas

SLI / SLO / SLA

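An SLI (service level indicator) is a measured property of the service, such as latency, availability, or error rate.

An SLO (service level objective) is the target for that indicator over a stated window, for example 99.9% availability over 30 days.

An SLA (service level agreement) is an external agreement that attaches commitments and consequences to missing the target.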
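The relationship between the three terms can be sketched in a few lines. This is a minimal illustration, not a monitoring implementation: the `RequestCounts` type, the sample counters, and the 99.5% SLA figure are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCounts:
    """Hypothetical counters collected over one measurement window."""
    total: int
    good: int  # requests that met the success criterion


def availability_sli(counts: RequestCounts) -> float:
    """SLI: the measured fraction of good requests (1.0 when there is no traffic)."""
    if counts.total == 0:
        return 1.0
    return counts.good / counts.total


def meets_target(sli: float, target: float) -> bool:
    """Compare the measured SLI with a target (an internal SLO or an external SLA)."""
    return sli >= target


counts = RequestCounts(total=1_000_000, good=999_200)
sli = availability_sli(counts)
print(f"SLI = {sli:.2%}")
print("meets 99.9% SLO:", meets_target(sli, 0.999))
print("meets 99.5% SLA:", meets_target(sli, 0.995))  # assumed, looser external target
```

The SLA in the example is deliberately looser than the SLO, so the team notices a miss internally before any contractual consequence applies.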

Error Budget

An error budget is the unreliability the SLO permits: one minus the SLO target over the measurement window. It shows how much risk a team can spend without violating the SLO. While budget remains, the team can move faster; once it is spent, the priority shifts back to reliability.
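As a worked example, the arithmetic behind an error budget might look like the sketch below. The 30-day window, the request counts, and the `release_policy` rule are illustrative assumptions; real policies are negotiated per team.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in the window: (1 - SLO) * window length in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of a request-based error budget that is still unspent.

    A negative value means the SLO has already been violated.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def release_policy(remaining: float) -> str:
    """Hypothetical policy: ship while budget remains, otherwise freeze."""
    return "ship features" if remaining > 0 else "freeze and fix reliability"


print(f"99.9% over 30 days allows {error_budget_minutes(0.999):.1f} minutes of downtime")
remaining = budget_remaining(0.999, total=1_000_000, failed=400)
print(f"budget remaining: {remaining:.0%} -> {release_policy(remaining)}")
```

With 1,000,000 requests and a 99.9% target, about 1,000 failures are allowed, so 400 failures leave roughly 60% of the budget.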

Toil

Toil is manual, repetitive, automatable work that creates no long-term value and grows with the size of the service: restarts, manual scaling, and routine alert handling. SRE teams measure and automate it so it does not crowd out engineering work.
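Measuring toil can be as simple as tagging logged hours. The sketch below assumes a hypothetical `WorkEntry` time log and uses the book's stated goal of keeping toil below 50% of each SRE's time as the threshold; the sample week is invented.

```python
from collections import Counter
from typing import Iterable, NamedTuple

TOIL_CAP = 0.5  # goal from the book: toil below 50% of each SRE's time


class WorkEntry(NamedTuple):
    """Hypothetical time-log record."""
    category: str
    hours: float
    is_toil: bool


def toil_fraction(entries: Iterable[WorkEntry]) -> float:
    """Share of logged hours spent on toil (0.0 when nothing is logged)."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total if total else 0.0


def automation_candidates(entries: Iterable[WorkEntry], top_n: int = 3):
    """Toil categories ranked by hours consumed, largest first."""
    hours = Counter()
    for e in entries:
        if e.is_toil:
            hours[e.category] += e.hours
    return hours.most_common(top_n)


week = [
    WorkEntry("manual restarts", 6.0, True),
    WorkEntry("manual scaling", 4.0, True),
    WorkEntry("routine alert handling", 12.0, True),
    WorkEntry("automation project", 18.0, False),
]
fraction = toil_fraction(week)
print(f"toil: {fraction:.0%} (cap {TOIL_CAP:.0%}), over cap: {fraction > TOIL_CAP}")
print("automate first:", automation_candidates(week, top_n=2))
```

Ranking toil by hours gives a simple, if crude, ordering for which task to automate first.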

Postmortem Culture

Blameless postmortems capture the timeline, root cause, and follow-up actions. The goal is to improve the system, not punish the person closest to the failure.

Book Structure

Part I

Introduction

What SRE is, how it differs from DevOps, and why Google tied engineering and operations together through measurable reliability goals. The section also frames Google's environment: Borg, monitoring, and networking.

Part II

Principles

SLOs, error budgets, eliminating toil, monitoring distributed systems, release engineering, and simplicity.

Part III

Practices

Practical alerting, on-call duty, troubleshooting, emergency response, postmortems, outage tracking, reliability testing, and software engineering in SRE.

Part IV

Management

Preparing engineers for on-call, handling interrupts, operational load, communication, and collaboration between teams.

Practices from the Book

Monitoring and Alerting

The four golden signals describe service health from the user's point of view:

Latency is response time, tracked separately for successful and failed requests.
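Traffic is the demand placed on the system, for example HTTP requests per second.
Errors are the rate of requests that fail, whether explicitly (an HTTP 500), implicitly (a success response with wrong content), or by policy (a response slower than the committed limit).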
Saturation shows how close the system is to its resource limits.
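The four signals can be derived from very little raw data. The following is a minimal sketch, assuming a hypothetical `Request` record and a single capacity number for saturation; production systems would compute these from time-series metrics rather than in-memory lists.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record."""
    latency_ms: float
    ok: bool


def percentile(values: Sequence[float], p: float) -> Optional[float]:
    """Nearest-rank percentile; None for an empty sequence."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: Sequence[Request], window_s: float,
                   used_capacity: float, total_capacity: float) -> dict:
    """Compute latency, traffic, errors, and saturation for one window."""
    ok = [r.latency_ms for r in requests if r.ok]
    failed = [r.latency_ms for r in requests if not r.ok]
    return {
        # Latency is kept separate for successful and failed requests.
        "latency_p99_ok_ms": percentile(ok, 99),
        "latency_p99_failed_ms": percentile(failed, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": used_capacity / total_capacity,
    }


requests = [Request(10.0 * i, True) for i in range(1, 9)]
requests += [Request(5.0, False), Request(2000.0, False)]
signals = golden_signals(requests, window_s=10, used_capacity=3.0, total_capacity=4.0)
for name, value in signals.items():
    print(f"{name}: {value}")
```

Splitting latency by outcome matters because fast failures would otherwise pull the overall percentile down and hide a problem.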

On-Call

Healthy on-call protects both the service and the team:

No more than 25% of SRE time should be spent on on-call duty.
A shift should not regularly become a continuous stream of incidents; the book works from a maximum of two incidents per 12-hour shift.
Common problems should have clear runbooks.
Context must be handed off between shifts.

Release Engineering

Release engineering reduces the risk of every change:

Hermetic builds make releases reproducible.
Canary releases expose changes gradually.
Feature flags control risk without requiring another deployment.
Rollback should be fast when an SLO starts to degrade.
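A canary gate tied to the SLO could be sketched as follows. The `Cohort` type, the thresholds (`slo_error_rate`, `tolerance`, `min_requests`), and the three-way decision are assumptions for illustration, not a procedure prescribed by the book.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    WAIT = "wait"
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Cohort:
    """Hypothetical request counters for one group of servers."""
    total: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.total if self.total else 0.0


def evaluate_canary(baseline: Cohort, canary: Cohort,
                    slo_error_rate: float = 0.001,
                    tolerance: float = 1.5,
                    min_requests: int = 1000) -> Decision:
    """Decide whether to widen the rollout, roll back, or keep observing.

    The canary is rolled back when its error rate exceeds both the rate
    the SLO allows and the baseline rate scaled by `tolerance`.
    """
    if canary.total < min_requests:
        return Decision.WAIT  # too little traffic to judge
    limit = max(slo_error_rate, baseline.error_rate * tolerance)
    return Decision.ROLLBACK if canary.error_rate > limit else Decision.PROMOTE


baseline = Cohort(total=100_000, errors=50)
print(evaluate_canary(baseline, Cohort(total=200, errors=0)))
print(evaluate_canary(baseline, Cohort(total=2_000, errors=1)))
print(evaluate_canary(baseline, Cohort(total=2_000, errors=10)))
```

Comparing against the baseline as well as the SLO keeps the gate from blaming the canary for errors that the stable version is already producing.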

Applying It in System Design Interviews

Useful concepts

Define the SLO while clarifying requirements.
Use the error budget as a metric for the speed-versus-reliability trade-off.
Use the four golden signals as the starting point for monitoring.
Design graceful degradation.
Limit cascading failures with circuit breakers.
Roll out changes through canary releases.
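When an interviewer asks how a circuit breaker limits cascading failures, a small sketch helps. The one below is a simplified, single-threaded illustration; the class name, thresholds, and injected `clock` are assumptions, and a production breaker would also need locking and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout_s:
            return "half-open"  # allow one trial call
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast; dependency looks unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if state == "half-open" or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky():
    raise ConnectionError("backend down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=30.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except (ConnectionError, CircuitOpenError) as exc:
        print(type(exc).__name__, "-", breaker.state)
```

After two consecutive failures the third call is rejected immediately, which is what stops callers from piling load onto a dependency that is already struggling.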

Where it helps

How would you monitor the system?
Which SLOs would you set?
How does the system behave during failures?
How do you release changes without downtime?
What happens under overload?

Main Takeaways

SRE applies software engineering to operational problems.

Error budgets help balance release speed and reliability.

Toil needs to be measured and automated.

Blameless postmortems improve both culture and architecture.

Monitoring should produce actionable alerts, not noise.

Simplicity is one of the most important principles of reliable systems.

Related chapters

The Site Reliability Workbook (short summary) - A practical continuation of the SRE Book: rolling out SLOs, alerting, incident response, and operational playbooks.
Building Secure and Reliable Systems (short summary) - Shows how to combine security and reliability requirements in one engineering model.
Release It! (short summary) - Complements SRE with resilience practices: circuit breakers, bulkheads, and graceful degradation.
eBPF: The Documentary - Extends SRE observability with kernel-level telemetry and diagnostics.

Where to find the book

Original: Site Reliability Engineering (oreilly.com)

Translated: Site Reliability Engineering. Надежность и безотказность как в Google (piter.com)