Free version
The book is available for free on the Google SRE website.
Building Secure and Reliable Systems
Authors: Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield
Publisher: O'Reilly Media, Inc.
Length: 555 pages
Google practices: Zero Trust, defense in depth, secure SDLC, incident response and security culture.
"Building Secure and Reliable Systems" is a book from the Google team that combines security and reliability practices into a single approach. The authors show that the two disciplines do not conflict; applied correctly, they strengthen each other.
Book structure
Introduction
The intersection of security and reliability, the role of culture, adversarial thinking.
Designing Systems
Design principles, least privilege, defense in depth, secure by default.
Implementing Systems
Secure code, testing, code review, dependency management.
Maintaining Systems
Incident response, recovery, post-mortems, continuous improvement.
Organization and Culture
Team building, security culture, training and awareness.
Security + Reliability: A Common Approach
Related book
SRE Book
Basic SRE practices: SLO, error budgets, toil reduction.
Why are Security and Reliability related?
Shared goals:
- Protecting the system from failures (internal and external)
- Minimizing the blast radius of incidents
- Fast detection and response
- Recovery from failures
Shared practices:
- Defense in depth
- Least privilege
- Monitoring and alerting
- Incident response playbooks
Key insight
"Security and reliability failures often look similar from a systems perspective: both result in unavailable or degraded systems, data loss, and loss of user trust."
Secure Design Principles
Least Privilege
Grant only the minimum rights required to perform a task. The principle applies at every level: users, services, processes, and network policies.
Examples:
- Service accounts with minimal IAM roles
- Network policies: deny-all by default
- Temporary credentials instead of long-lived keys
- Just-in-time access for privileged operations
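The examples above can be sketched in a few lines. This is a minimal illustration, not an implementation from the book: the `Permission` values, the `AccessGrant` shape, and the `issueGrant` and `isAllowed` helpers are all hypothetical names. It shows a grant that lists only the permissions a task needs and expires on its own, so access is denied by default.

```typescript
// Hypothetical types for illustration; not taken from the book or any IAM product.
type Permission = "orders:read" | "orders:write" | "users:admin";

interface AccessGrant {
  principal: string;         // user or service account
  permissions: Permission[]; // only what the task needs
  expiresAtMs: number;       // temporary: the grant expires on its own
}

// Issue a just-in-time grant limited to the requested permissions and a TTL.
function issueGrant(
  principal: string,
  permissions: Permission[],
  nowMs: number,
  ttlMs: number,
): AccessGrant {
  return { principal, permissions: permissions.slice(), expiresAtMs: nowMs + ttlMs };
}

// Deny by default: allow only if the grant is unexpired
// and explicitly lists the permission.
function isAllowed(grant: AccessGrant, needed: Permission, nowMs: number): boolean {
  if (nowMs >= grant.expiresAtMs) return false;
  return grant.permissions.indexOf(needed) !== -1;
}

// A reporting service gets read access for 15 minutes, starting at time 0.
const reportGrant = issueGrant("report-service", ["orders:read"], 0, 15 * 60 * 1000);
console.log(isAllowed(reportGrant, "orders:read", 60000));          // true
console.log(isAllowed(reportGrant, "orders:write", 60000));         // false: never granted
console.log(isAllowed(reportGrant, "orders:read", 16 * 60 * 1000)); // false: expired
```

Passing the current time in as a parameter keeps the sketch deterministic; a real system would read a trusted clock.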
Defense in Depth
Multi-layered protection: if one layer is breached, the next one can still stop the attack.
Perimeter
WAF, DDoS protection, rate limiting
Application
Input validation, AuthN/AuthZ, encryption
Data
Encryption at rest, access logging, backups
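As a rough sketch of the layering idea, the code below runs a call through several independent checks, so a request that slips past one layer can still be stopped by the next. Every name here (`InboundCall`, `ProtectionLayer`, `evaluateLayers`, the demo token, the example addresses) is an assumption made for illustration; the data layer is left out to keep the example short.

```typescript
// All names here are hypothetical and chosen for illustration.
interface InboundCall {
  sourceIp: string;
  token: string | null;
  resourceId: string;
}

interface Verdict {
  allowed: boolean;
  rejectedBy: string | null; // name of the layer that stopped the call
  reason: string | null;
}

interface ProtectionLayer {
  layerName: string;
  // Returns a rejection reason, or null when this layer has no objection.
  check: (call: InboundCall) => string | null;
}

const BLOCKED_IPS = ["203.0.113.7"]; // address from a documentation-only range

const perimeterLayer: ProtectionLayer = {
  layerName: "perimeter",
  check: (call) => (BLOCKED_IPS.indexOf(call.sourceIp) !== -1 ? "blocked source address" : null),
};

const authenticationLayer: ProtectionLayer = {
  layerName: "application:authn",
  // Placeholder check; a real system would verify a signed token.
  check: (call) => (call.token === "demo-token" ? null : "missing or invalid token"),
};

const inputValidationLayer: ProtectionLayer = {
  layerName: "application:validation",
  check: (call) => (/^[a-z0-9-]{1,64}$/.test(call.resourceId) ? null : "malformed resource id"),
};

// Every layer runs its own check; the first objection stops the call.
function evaluateLayers(layers: ProtectionLayer[], call: InboundCall): Verdict {
  for (const layer of layers) {
    const reason = layer.check(call);
    if (reason !== null) {
      return { allowed: false, rejectedBy: layer.layerName, reason };
    }
  }
  return { allowed: true, rejectedBy: null, reason: null };
}

const allLayers = [perimeterLayer, authenticationLayer, inputValidationLayer];

// Passes the perimeter and authentication, but input validation catches it.
const injectionAttempt = evaluateLayers(allLayers, {
  sourceIp: "198.51.100.20",
  token: "demo-token",
  resourceId: "42; DROP TABLE orders",
});
console.log(injectionAttempt.rejectedBy); // application:validation
```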
Secure by Default
Secure configuration out of the box. Users should have to weaken protection explicitly, not enable it explicitly.
❌ Bad:
- Public S3 buckets by default
- Ports open to 0.0.0.0/0
- Weak password policies
✓ Good:
- Private buckets; public access must be granted explicitly
- Deny-all network policies
- MFA required, strong passwords
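One way to picture secure defaults is a configuration factory whose baseline is the safe setting, so any weakening has to be written down by the caller and can be audited. The `BucketConfig` fields and both helper functions below are hypothetical and do not correspond to any cloud provider's API.

```typescript
// Hypothetical configuration shape; not a real cloud provider API.
interface BucketConfig {
  publicRead: boolean;
  encryptAtRest: boolean;
  requireMfaForDelete: boolean;
}

// The baseline is the secure choice for every setting.
const SECURE_BUCKET_DEFAULTS: BucketConfig = {
  publicRead: false,
  encryptAtRest: true,
  requireMfaForDelete: true,
};

// Callers get the secure baseline unless they explicitly override a field.
function createBucketConfig(overrides: Partial<BucketConfig> = {}): BucketConfig {
  return { ...SECURE_BUCKET_DEFAULTS, ...overrides };
}

// List the settings that are weaker than the baseline, e.g. for review or audit.
function weakenedSettings(config: BucketConfig): string[] {
  const weakened: string[] = [];
  if (config.publicRead) weakened.push("publicRead");
  if (!config.encryptAtRest) weakened.push("encryptAtRest");
  if (!config.requireMfaForDelete) weakened.push("requireMfaForDelete");
  return weakened;
}

const defaultBucket = createBucketConfig();
console.log(defaultBucket.publicRead); // false

const websiteBucket = createBucketConfig({ publicRead: true }); // explicit, visible in code review
console.log(weakenedSettings(websiteBucket).join(",")); // publicRead
```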
Fail Securely
When a failure occurs, the system should fall back to a safe (closed) state, not an open one.
// ❌ Fail open (unsafe)
if (authService.isDown()) {
  return allowAccess(); // Lets everyone through when the auth service is down
}

// ✓ Fail closed (safe)
if (authService.isDown()) {
  // Also alert the on-call engineer before returning
  return denyAccess(); // Blocks access when the auth service is down
}
Zero Trust Architecture
Zero Trust Principles
"Never trust, always verify" - even internal traffic must be authenticated and authorized.
1. Verify explicitly
Authentication and authorization of each request based on all available data: identity, location, device, service, data classification.
2. Use least privilege access
Just-in-time and just-enough access. Temporary credentials, risk-based adaptive policies.
3. Assume breach
Minimizing blast radius through segmentation, end-to-end encryption, continuous monitoring.
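The first two principles can be sketched as a per-request policy decision. This is an illustrative assumption, not Google's policy engine: the `AccessContext` fields, the `data-steward` role, and the `evaluateAccess` function are invented for the example. The point it demonstrates is that network location alone never grants access.

```typescript
// Hypothetical policy inputs; field names are assumptions for illustration.
type Sensitivity = "public" | "internal" | "restricted";

interface AccessContext {
  identityVerified: boolean;    // e.g. a valid client certificate or user token
  deviceCompliant: boolean;
  fromInternalNetwork: boolean; // recorded for logging, never used to grant trust
  roles: string[];
  resourceSensitivity: Sensitivity;
}

function evaluateAccess(ctx: AccessContext): "allow" | "deny" {
  // Verify explicitly: an unverified identity is denied, even on the internal network.
  if (!ctx.identityVerified) return "deny";
  // Non-public data additionally requires a compliant device.
  if (ctx.resourceSensitivity !== "public" && !ctx.deviceCompliant) return "deny";
  // Least privilege: restricted data requires an explicit role.
  if (ctx.resourceSensitivity === "restricted" && ctx.roles.indexOf("data-steward") === -1) {
    return "deny";
  }
  return "allow";
}

const internalButUnverified: AccessContext = {
  identityVerified: false,
  deviceCompliant: true,
  fromInternalNetwork: true,
  roles: ["data-steward"],
  resourceSensitivity: "internal",
};
console.log(evaluateAccess(internalButUnverified)); // deny
```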
Service-to-Service Authentication
Secure Development Lifecycle
Security at every stage
| Stage | Security Activities | Tools |
|---|---|---|
| Design | Threat modeling, security review | STRIDE, Attack trees |
| Code | Secure coding, SAST | Semgrep, CodeQL |
| Build | Dependency scanning, SBOM | Snyk, Dependabot |
| Test | DAST, fuzzing, pen testing | OWASP ZAP, Burp Suite |
| Deploy | Container scanning, IaC security | Trivy, Checkov |
| Operate | Monitoring, incident response | SIEM, SOAR |
Threat Modeling
Systematic threat analysis at the design stage.
STRIDE Framework:
- Spoofing: impersonating another identity
- Tampering: modifying data without authorization
- Repudiation: denying that an action was performed
- Information disclosure: leaking data
- Denial of service: making the service unavailable
- Elevation of privilege: gaining rights beyond those granted
Supply Chain Security
Protection against attacks that come through dependencies and the build pipeline.
Practices:
- SBOM (Software Bill of Materials)
- Signed artifacts and verified builds
- Dependency pinning and lock files
- Private artifact registries
- SLSA framework compliance
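Dependency pinning can be illustrated with a small check that compares what a build actually resolved against what the lock file pinned. This is a sketch under stated assumptions: the record shapes, the `findPinViolations` function, the package names, and the placeholder digest strings are all hypothetical, and the digests are assumed to have been computed elsewhere.

```typescript
// Hypothetical record shapes; digests are assumed to be computed elsewhere.
interface PinnedDependency {
  pkg: string;
  version: string;
  sha256: string;
}

interface ResolvedArtifact {
  pkg: string;
  version: string;
  sha256: string;
}

// Report every resolved artifact that is unpinned or differs from its pin.
function findPinViolations(
  lockfile: PinnedDependency[],
  resolved: ResolvedArtifact[],
): string[] {
  const violations: string[] = [];
  for (const artifact of resolved) {
    const pin = lockfile.filter((entry) => entry.pkg === artifact.pkg)[0];
    const label = `${artifact.pkg}@${artifact.version}`;
    if (pin === undefined) {
      violations.push(`${label}: not pinned in lock file`);
    } else if (pin.version !== artifact.version) {
      violations.push(`${label}: version differs from pinned ${pin.version}`);
    } else if (pin.sha256 !== artifact.sha256) {
      violations.push(`${label}: digest differs from lock file`);
    }
  }
  return violations;
}

// Placeholder package names and digests, for illustration only.
const demoLockfile: PinnedDependency[] = [
  { pkg: "demo-http-client", version: "1.3.0", sha256: "placeholder-digest-a" },
];
const demoResolved: ResolvedArtifact[] = [
  { pkg: "demo-http-client", version: "1.3.0", sha256: "placeholder-digest-b" },
  { pkg: "demo-unlisted", version: "0.0.1", sha256: "placeholder-digest-c" },
];
console.log(findPinViolations(demoLockfile, demoResolved).length); // 2
```

A build pipeline following this idea would fail the build when the returned list is non-empty instead of only logging it.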
Incident Response
Related book
Release It!
Resilience patterns: Circuit Breaker, Bulkhead, Timeouts.
Security Incident Lifecycle
Detection
Monitoring, alerts, anomaly detection. Mean time to detect (MTTD) is a critical metric.
Triage
Assessing severity, scope, and impact. Assembling the response team.
Containment
Isolating affected systems, blocking malicious traffic, revoking compromised credentials.
Eradication
Eliminating root cause, patching vulnerabilities, removing malware.
Recovery
Service restoration, integrity verification, monitoring for repeat attacks.
Post-Incident Review
Blameless postmortem, lessons learned, process improvement.
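The lifecycle above reads naturally as an ordered sequence of phases, and MTTD is simple arithmetic over incident timestamps. The sketch below models both; the type names, the `NEXT_PHASE` table, and the helper functions are assumptions made for illustration, and MTTD is taken here to be the average of detection time minus start time.

```typescript
// Phase names follow the lifecycle above; the identifiers are hypothetical.
type IncidentPhase =
  | "detection"
  | "triage"
  | "containment"
  | "eradication"
  | "recovery"
  | "post-incident-review";

// Each phase leads to exactly one next phase; the review is the last one.
const NEXT_PHASE: Record<IncidentPhase, IncidentPhase | null> = {
  "detection": "triage",
  "triage": "containment",
  "containment": "eradication",
  "eradication": "recovery",
  "recovery": "post-incident-review",
  "post-incident-review": null,
};

// List the given phase and every phase that still follows it.
function remainingPhases(from: IncidentPhase): IncidentPhase[] {
  const phases: IncidentPhase[] = [];
  let current: IncidentPhase | null = from;
  while (current !== null) {
    phases.push(current);
    current = NEXT_PHASE[current];
  }
  return phases;
}

interface IncidentTimestamps {
  startedAtMs: number;  // when the incident actually began
  detectedAtMs: number; // when monitoring or a person noticed it
}

// Mean time to detect, in milliseconds; null when there is no data.
function meanTimeToDetectMs(incidents: IncidentTimestamps[]): number | null {
  if (incidents.length === 0) return null;
  let total = 0;
  for (const incident of incidents) {
    total += incident.detectedAtMs - incident.startedAtMs;
  }
  return total / incidents.length;
}

console.log(remainingPhases("containment").join(" -> "));
// containment -> eradication -> recovery -> post-incident-review

// Two incidents detected after 10 and 30 minutes: MTTD is 20 minutes.
console.log(meanTimeToDetectMs([
  { startedAtMs: 0, detectedAtMs: 600000 },
  { startedAtMs: 0, detectedAtMs: 1800000 },
])); // 1200000
```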
Security Culture
Security Champions
Dedicated representatives on each team who promote security practices.
- Conduct security reviews of the team's code
- Participate in threat modeling
- Train colleagues on best practices
- Act as a liaison between their team and the security team
Blameless Culture
Focus on improving the system rather than punishing people for mistakes.
- Encouraging the reporting of vulnerabilities
- Postmortems without blame
- Incident transparency
- Continuous improvement mindset
Comparison with other books
| Book | Focus | Connection |
|---|---|---|
| SRE Book | Reliability, SLO/SLI | Basic Reliability Practices |
| Release It! | Stability patterns | Patterns of Resilience |
| DDIA | Distributed systems | Distributed systems theory |
| This Book | Security + Reliability | Integration of both practices |
Applying This in a System Design Interview
Practice
API Gateway
Implementation of authentication and authorization at the gateway level.
1. Authentication & Authorization
Mention Zero Trust, mTLS between services, JWT/OAuth for users, RBAC/ABAC for authorization.
2. Data Protection
Encryption at rest and in transit, key management (KMS), data classification, PII handling.
3. Blast Radius Reduction
Microservice isolation, network segmentation, failure domains, rate limiting.
4. Observability
Security logging, audit trails, anomaly detection, distributed tracing for forensics.
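Rate limiting, mentioned under blast radius reduction, is a common detail to be asked about. A minimal token-bucket sketch follows; the `TokenBucket` class and its parameters are hypothetical, the clock is passed in explicitly to keep the example deterministic, and a real gateway would typically keep one bucket per client or API key.

```typescript
// Hypothetical token-bucket limiter; names and parameters are assumptions.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends one token if the request may proceed.
  tryConsume(nowMs: number): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
      // Refill in proportion to elapsed time, never above capacity.
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // over the limit: reject, e.g. with HTTP 429
  }
}

// Burst of 2 requests, then 1 request per second.
const clientBucket = new TokenBucket(2, 1, 0);
console.log(clientBucket.tryConsume(0));    // true
console.log(clientBucket.tryConsume(0));    // true
console.log(clientBucket.tryConsume(0));    // false: burst exhausted
console.log(clientBucket.tryConsume(1000)); // true: one token refilled after 1 s
```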
Key Takeaways
- ✓ Security and reliability reinforce each other through shared practices: defense in depth, least privilege, failing securely
- ✓ Zero Trust: never trust, always verify, even for internal traffic
- ✓ Secure by Default: secure configuration out of the box; any weakening must be explicit
- ✓ Shift Left Security: integrate security in the early stages of development
- ✓ Blameless Culture: focus on improving the system, not on punishment
