Building Secure and Reliable Systems (short summary)

This book is valuable because it refuses to split security and reliability into separate rooms: in a real system, they fail together.

The chapter shows how Google's practices tie Zero Trust, defense in depth, secure SDLC, incident response, and security culture into one operating model where protection and resilience reinforce each other.

In interviews, it gives you a strong frame for discussing layered defense, incident learning, and why sound security architecture is inseparable from operational maturity.

Practical value of this chapter

Design in practice

Design protection and reliability together: trust boundaries, least privilege, defense in depth, and secure defaults.

Decision quality

Validate not only that controls exist, but also how the system detects incidents, limits blast radius, and restores trust.

Interview articulation

Frame the answer as a chain: threat, control, resilience, observability, response, and learning after the incident.

Trade-off framing

Make protection costs explicit: latency added by checks, operating complexity, release speed, and team usability.

Official website

Free version

The book is available for free on the Google SRE website.

Building Secure and Reliable Systems

Authors: Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield
Publisher: O'Reilly Media, Inc.
Length: 555 pages

Google SRE practices for combining Zero Trust, defense in depth, secure SDLC, incident response, and security culture into one security-and-reliability model.

Building Secure and Reliable Systems is Google's book on treating security and reliability as one engineering discipline. The authors show that these qualities cannot be handed to separate teams and handled in isolation: a security failure often becomes an availability failure, and weak operations weaken protection.

This chapter frames the book through security-by-design, Zero Trust, defense in depth, secure SDLC, incident response, security culture, and blast-radius reduction.

Book structure

Part I

Introduction

Where security and reliability overlap, why culture matters, and how adversarial thinking changes design.

Part II

Designing Systems

Design principles: least privilege, defense in depth, and secure defaults.

Part III

Implementing Systems

Secure code, testing, code review, and dependency governance.

Part IV

Maintaining Systems

Incident response, recovery, post-incident reviews, and learning from failure.

Part V

Organization and Culture

Teams, training, security culture, and clear ownership for system protection.

Security and reliability: a shared approach

Connection

SRE Book

Core SRE practices: SLOs, error budgets, and toil reduction.

Why are these disciplines connected?

Shared goals:

Protect the system from internal failures and external attacks.
Reduce the blast radius of incidents.
Detect, assess, and respond quickly.
Restore service and user trust after failure.

Shared practices:

Defense in depth.
Least privilege.
Monitoring, alerting, and access audit.
Operational response playbooks.

Key idea

From the outside, an attack and an outage look the same: the service degrades, data is at risk, and users lose trust. When protection and operational resilience are designed by separate teams to separate criteria, the seam between them becomes the weakest spot. It is cheaper to design both properties into one architecture.

Secure design principles

Least privilege

Least privilege means every user, service, or process gets only the rights required for the task at hand.

Examples:

Service accounts with minimal IAM roles.
Network policies that deny by default and allow only explicit paths.
Short-lived credentials instead of long-lived keys.
Just-in-time access for privileged operations.
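The first and third examples can be sketched together. This is a minimal illustration, not the book's implementation: the role table, the `Credential` type, and the 15-minute lifetime are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical role table: each role lists only the actions it needs.
ROLE_PERMISSIONS = {
    "report-reader": frozenset({"reports:read"}),
    "report-writer": frozenset({"reports:read", "reports:write"}),
}


@dataclass(frozen=True)
class Credential:
    subject: str
    role: str
    expires_at: float  # Unix timestamp; short-lived by design


def issue(subject: str, role: str, now: float, ttl_seconds: float = 900.0) -> Credential:
    """Issue a credential that expires ttl_seconds after `now` (assumed default: 15 minutes)."""
    return Credential(subject, role, now + ttl_seconds)


def is_allowed(cred: Credential, action: str, now: float) -> bool:
    """Deny unless the credential is unexpired and its role lists the action."""
    if now >= cred.expires_at:
        return False  # expired credentials grant nothing
    # Unknown roles map to an empty permission set, so they are denied too.
    return action in ROLE_PERMISSIONS.get(cred.role, frozenset())
```

An unknown role and an expired credential both fall through to a denial, so the only way to gain a right is to be listed for it explicitly.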

Defense in depth

Any single barrier will be breached eventually; the only question is when. Defense in depth stacks independent layers so that a breach of one does not open the whole system but runs into the next.

Perimeter

WAF, DDoS protection, and rate limiting.

Application

Input validation, AuthN/AuthZ, and encryption.

Data

Encryption, access audit, and backups.
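A rough sketch of independent layers, assuming a hypothetical request shape and hard-coded layer data. Each check knows nothing about the others, so bypassing one still leaves the rest in place.

```python
from typing import Callable, List, Optional, Tuple

Request = dict  # hypothetical shape: {"ip": ..., "token": ..., "payload": ...}
Check = Callable[[Request], Optional[str]]  # returns a rejection reason or None

BLOCKED_IPS = {"203.0.113.7"}   # perimeter data (documentation-range address)
VALID_TOKENS = {"token-alice"}  # application-layer data (illustrative only)


def perimeter(req: Request) -> Optional[str]:
    return "blocked source" if req.get("ip") in BLOCKED_IPS else None


def authenticate(req: Request) -> Optional[str]:
    return None if req.get("token") in VALID_TOKENS else "unauthenticated"


def validate_input(req: Request) -> Optional[str]:
    payload = req.get("payload")
    ok = isinstance(payload, str) and len(payload) <= 256
    return None if ok else "invalid payload"


LAYERS: List[Check] = [perimeter, authenticate, validate_input]


def admit(req: Request) -> Tuple[bool, Optional[str]]:
    """Run every layer in order; the first rejection stops the request."""
    for layer in LAYERS:
        reason = layer(req)
        if reason is not None:
            return False, reason
    return True, None
```

A request with a stolen token from an unblocked address passes the perimeter but is still stopped by `authenticate`.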

Secure defaults

Most systems run with their default values, because defaults rarely get changed. Secure defaults put the burden on the right side: weakening protection takes a deliberate decision, not a forgotten flag.

Weak:

Public object storage buckets by default.
Ports open to 0.0.0.0/0 without a clear reason.
Weak password rules and optional MFA.

Better:

Private buckets and explicit public access.
Default-deny network access with targeted exceptions.
Required MFA and strong authentication rules.
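One way to encode this in configuration, sketched with a hypothetical `BucketConfig` type. The field names are assumptions and do not mirror any real cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketConfig:
    """Hypothetical storage-bucket settings; every default is the safe one."""
    name: str
    public: bool = False
    require_mfa_for_delete: bool = True
    public_access_reason: str = ""

    def __post_init__(self) -> None:
        # Weakening a default must come with a recorded justification.
        if self.public and not self.public_access_reason.strip():
            raise ValueError("public buckets need an explicit public_access_reason")
```

`BucketConfig("logs")` is private with no extra effort, while `BucketConfig("site", public=True)` raises until someone writes down why the bucket has to be public.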

Fail securely

When the check that grants rights is itself down, the system has two options: let everyone through or no one. For access control the right one is fail-closed — losing availability is cheaper than a door left quietly open.

// Unsafe: fail-open. An auth-service outage lets everyone in.
if (authService.isDown()) {
  return allowAccess();
}

// Safer: fail-closed. An auth-service outage blocks access.
if (authService.isDown()) {
  // Page the on-call engineer, then deny.
  return denyAccess();
}
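The same idea as a runnable sketch. The `check` callable, the `AuthServiceUnavailable` exception, and the `alert` hook are assumptions standing in for a real auth client and paging system.

```python
class AuthServiceUnavailable(Exception):
    """Raised by the hypothetical auth client when the service cannot answer."""


def authorize(check, user: str, resource: str, alert=lambda message: None) -> bool:
    """Return True only on an explicit allow; an unavailable check denies and alerts."""
    try:
        # Anything other than a literal True (None, 0, "yes") counts as a denial.
        return check(user, resource) is True
    except AuthServiceUnavailable as exc:
        alert(f"auth unavailable, denying {user} -> {resource}: {exc}")
        return False
```

Comparing with `is True` matters here: an ambiguous answer from the check is treated as a denial, not as a truthy pass.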

Zero Trust Architecture

Zero Trust principles

The old model trusted the perimeter: once you were inside the network, you counted as friendly. One compromised service turned that assumption into free lateral movement. Zero Trust drops the assumption itself: every request is verified again, internal traffic included.

1. Verify every request explicitly

Authentication and authorization consider identity, location, device, service, and data classification.

2. Grant the least access needed

Use just-in-time access, just-enough administration, short-lived credentials, and risk-aware policies.

3. Assume compromise

Reduce blast radius through segmentation, end-to-end encryption, and continuous monitoring.
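A minimal sketch of explicit per-request verification. The signal names and the three classification labels are assumptions chosen for the example; a real policy would weigh more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    identity_verified: bool      # e.g. a valid token or client certificate
    device_compliant: bool       # e.g. a managed, patched device
    from_internal_network: bool  # recorded, but deliberately never trusted
    data_classification: str     # "public", "internal", or "restricted"


def decide(ctx: RequestContext) -> bool:
    """Evaluate each request on its own signals; network location earns nothing."""
    if not ctx.identity_verified:
        return False
    if ctx.data_classification == "public":
        return True
    if ctx.data_classification in ("internal", "restricted"):
        return ctx.device_compliant
    return False  # unknown classification: deny
```

Note that `from_internal_network` never appears in the decision: being inside the network does not substitute for a verified identity.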

Service-to-service authentication

[Diagram: service-to-service authentication. Components: Service A (client), Service B (resource), identity (SPIRE / CA), policy (OPA / Cedar). Flow: mTLS + identity + policy. Outcomes: verification passed and access granted, or policy rejected the request.]
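The policy step of the mTLS + identity + policy flow can be sketched as a lookup on the caller's verified identity. The policy table, the `spiffe://example.org/...` identifiers, and the function name are hypothetical; in practice the decision would usually be delegated to a policy engine, not hard-coded.

```python
from typing import Optional

# Hypothetical policy table: (caller identity, HTTP method, path prefix) that may pass.
POLICY = {
    ("spiffe://example.org/service-a", "GET", "/orders"),
}


def is_request_allowed(peer_id: Optional[str], method: str, path: str) -> bool:
    """peer_id is assumed to come from the already-verified mTLS peer certificate."""
    if peer_id is None:
        return False  # no verified identity, no access
    return any(
        peer_id == caller
        and method == allowed_method
        and (path == prefix or path.startswith(prefix + "/"))
        for caller, allowed_method, prefix in POLICY
    )
```

The prefix match requires a `/` boundary, so a rule for `/orders` does not accidentally cover `/orders-admin`.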

Secure development lifecycle

Security checks at every stage

Stage	Checks	Tools
Design	Threat modeling and security architecture review.	STRIDE, attack trees
Code	Secure coding, code review, and SAST.	Semgrep, CodeQL
Build	Dependency scanning, SBOM, and artifact content checks.	Snyk, Dependabot
Test	DAST, fuzzing, and validation of critical abuse cases.	OWASP ZAP, Burp Suite
Deploy	Container, IaC, and admission-policy checks before release.	Trivy, Checkov
Operate	Monitoring, security logging, and incident response.	SIEM, SOAR
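The pre-release stages in the table only help if their findings can actually stop a release. A minimal sketch of such a gate follows, assuming a hypothetical `Finding` record and a four-level severity scale; it does not model the output format of any tool named in the table.

```python
from dataclasses import dataclass

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Finding:
    stage: str      # "design", "code", "build", "test", or "deploy"
    severity: str   # a key of SEVERITY_RANK
    resolved: bool = False


def blocking_findings(findings, threshold: str = "high"):
    """Return unresolved findings at or above the threshold severity."""
    limit = SEVERITY_RANK[threshold]
    return [
        f for f in findings
        if not f.resolved and SEVERITY_RANK[f.severity] >= limit
    ]


def may_release(findings, threshold: str = "high") -> bool:
    """A release passes the gate only when nothing blocking remains."""
    return not blocking_findings(findings, threshold)
```

The threshold is the trade-off knob: lowering it to `"medium"` raises assurance and slows releases.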

Threat modeling

A threat found on the whiteboard is fixed by editing the diagram. The same threat in production is fixed by an incident. Analysis at design time exists to move the cost of the mistake to the stage where it is still cheap.

STRIDE:

Spoofing: pretending to be another identity.
Tampering: unauthorized modification.
Repudiation: denying an action later.
Information disclosure: exposing data.
Denial of service: making the system unavailable.
Elevation of privilege: gaining higher privileges.
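One lightweight way to apply STRIDE is to walk every element of the design through all six questions. The sketch below generates that checklist; the question wording and the element names in the usage note are illustrative assumptions.

```python
# One prompt per STRIDE category, paraphrasing the definitions above.
STRIDE = {
    "Spoofing": "Can someone pretend to be another identity here?",
    "Tampering": "Can data or code be modified without authorization?",
    "Repudiation": "Can someone deny an action because nothing records it?",
    "Information disclosure": "Can data be exposed to the wrong party?",
    "Denial of service": "Can this be made unavailable?",
    "Elevation of privilege": "Can someone gain higher privileges?",
}


def checklist(elements):
    """Pair every design element with every STRIDE question."""
    return [
        (element, threat, question)
        for element in elements
        for threat, question in STRIDE.items()
    ]
```

For example, `checklist(["login API", "session store"])` yields twelve rows, six per element, which the team can answer or mark as not applicable during the design review.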

Software supply chain security

Your own code gets read in review; everyone else's arrives through dependencies, the artifact registry, and the delivery pipeline. Compromise any of those links and malicious code reaches production signed and verified.

Practices:

SBOMs.
Signed artifacts and verified builds.
Dependency pinning and lock files.
Private artifact registries.
SLSA levels where critical artifacts need stronger guarantees.
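Dependency pinning comes down to refusing any artifact whose digest differs from the one recorded in the lock file. A minimal sketch using Python's standard `hashlib` follows; the function names are assumptions. This is digest pinning only, not signature verification, which additionally proves who produced the artifact.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Accept the artifact only if its digest equals the pinned value."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.lower())
```

A single changed byte in the artifact produces a different digest, so a tampered download fails the check even if it arrives from the expected registry.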

Incident response

Connection

Release It!

Resilience patterns: circuit breaker, bulkhead, and timeouts.

Security incident lifecycle

Detection

Monitoring, alerting rules, and anomaly detection. Mean time to detect remains a critical metric.

Triage

Assess severity, scope, impact, and the response team needed for the incident.

Containment

Isolate affected systems, block malicious traffic, and revoke compromised credentials.

Eradication

Remove the root cause, patch vulnerabilities, and clean up malware.

Recovery

Restore services, verify integrity, and monitor for signs of repeat attacks.

Post-incident review

Run a blameless review, capture lessons learned, and improve the process.
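The lifecycle above can be modeled as an ordered set of phases, together with the detection metric it mentions. This is a simplified sketch: real incidents often loop back (for example from recovery to containment), and the `Phase` enum and function names are assumptions.

```python
from enum import Enum


class Phase(Enum):
    DETECTION = 1
    TRIAGE = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    POST_INCIDENT_REVIEW = 6


def next_phase(current: Phase) -> Phase:
    """Advance one step; the review is the final phase."""
    if current is Phase.POST_INCIDENT_REVIEW:
        raise ValueError("incident is already in its final phase")
    return Phase(current.value + 1)


def mean_time_to_detect(incidents) -> float:
    """incidents: (started_at, detected_at) pairs in seconds since a common epoch."""
    gaps = [detected - started for started, detected in incidents]
    if not gaps:
        raise ValueError("need at least one incident")
    return sum(gaps) / len(gaps)
```

For two incidents detected after 60 and 300 seconds, `mean_time_to_detect([(0, 60), (100, 400)])` returns `180.0`.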

Security culture

Security champions inside teams

The security team does not scale to every pull request. A security champion moves the responsibility closer to the code, where the decisions are made every day.

Review code and architecture changes through a risk lens.
Join threat-modeling work with the product team.
Teach secure development practices to peers.
Connect the product team with security specialists.

Blameless learning culture

Punish the mistake and next time people stay quiet about it until it turns into an incident. Security culture keeps the focus on fixing the system rather than finding someone to blame, so problems surface early.

Encourage vulnerability and unsafe-configuration reports.
Run incident reviews without blame.
Keep facts and decisions transparent during incidents.
Turn each incident into architecture and process improvements.

Comparison with other books

Book	Focus	Connection
SRE Book	Reliability, SLO/SLI	Core operational discipline.
Release It!	Resilience patterns	How systems tolerate failure and load.
DDIA	Distributed systems	Data models, consistency, and fault tolerance.
This book	Security and reliability	How protection, operations, and culture reinforce one another.

Applying it in a system design interview

Practice

API Gateway

Authentication and authorization at the API Gateway layer.

1. Authentication and authorization

Mention Zero Trust, mTLS between services, JWT/OAuth for users, and RBAC/ABAC for access decisions.

2. Data protection

Discuss encryption at rest and in transit, key management, data classification, and PII handling.

3. Blast-radius reduction

Show microservice isolation, network segmentation, failure domains, and rate limiting.

4. Observability and investigation

Use security logging, audit trails, anomaly detection, and distributed tracing for forensics.
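In an interview it helps to show one of these controls concretely. Rate limiting, listed under blast-radius reduction, can be sketched as a token bucket. The class below is an illustrative assumption, not a specific gateway's implementation, and it takes the current time as a parameter so the behavior stays deterministic.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, then a steady `refill_per_second` rate."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token per admitted request
            return True
        return False
```

Giving each client or tenant its own bucket keeps one noisy or compromised caller from exhausting capacity for everyone else.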

Key takeaways

✓ Security and reliability are one discipline: split them across separate teams and the seam between them becomes the weak spot. Shared practices: defense in depth, least privilege, and secure failure behavior.
✓ Zero Trust: trust nothing by default and verify even internal traffic.
✓ Secure defaults: protection is enabled from the start, and weakening it is explicit.
✓ Shift-left security: checks start in design, code, and build stages rather than after release.
✓ Blameless learning: the goal is to improve the system, process, and team learning.

Related chapters

Site Reliability Engineering (short summary) - Provides the reliability practices, SLI/SLO thinking, and error-budget discipline that this book combines with security engineering.
SLI / SLO / SLA and Error Budgets - Helps formalize reliability with measurable targets and manage risk during production changes.
Zero Trust: a modern approach to architectural security - Extends this chapter's ideas around mTLS, explicit verification of every request, and blast-radius reduction.
Supply Chain Security - Complements secure SDLC with dependency integrity, artifact signing, and software supply-chain controls.
Release It! (short summary) - Adds resilience-under-failure patterns that align with incident response and recovery discussed here.

Where to find the book

Original

oreilly.com

Building Secure and Reliable Systems

Translated

piter.com

Безопасные и надежные системы: Лучшие практики проектирования, внедрения и обслуживания как в Google