Release It! Design and Deploy Production-Ready Software
Authors: Michael T. Nygard
Publisher: Pragmatic Bookshelf, 2018 (2nd Edition)
Length: 376 pages
Resilience patterns from Michael Nygard: timeouts, circuit breakers, bulkheads, and protection against cascading failures.
Stability Antipatterns
Nygard begins by describing the “cracks” in systems, the patterns that lead to cascading failures:
Integration Points
Every integration with an external system is a potential point of failure. Network calls may hang, return garbage, or simply not respond. Without protection, one slow service breaks the entire chain.
Blocked Threads
The most common killer of systems under load. Synchronous calls without timeouts block threads until the thread pool is exhausted. New requests can no longer be processed, and the system hangs.
Cascading Failures
The failure of one component sets off a chain reaction. One service begins to respond slowly → the caller holds its connections open while it waits → the caller's pool is exhausted → the failure spreads until the entire cluster is down.
Unbounded Result Sets
A query without LIMIT returns a million records. The result is an out-of-memory error, a long GC pause, or a timeout, and the service is dead. Always limit your results and use pagination.
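A minimal sketch of bounded, paginated reads, using Python's standard `sqlite3` module. The `items` table, the `fetch_page` helper, and the page size of 100 are assumptions made for illustration and do not come from the book. It uses keyset pagination (a cursor on `id`) so every query is capped by `LIMIT`:

```python
import sqlite3

PAGE_SIZE = 100  # hypothetical cap; tune to the workload


def fetch_page(conn, after_id=0, page_size=PAGE_SIZE):
    """Return at most page_size rows with id > after_id, plus the next cursor."""
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A full page means there may be more rows; a short page means we are done.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


def count_all_in_pages(conn):
    """Walk the whole table without ever holding more than one page in memory."""
    total, cursor = 0, 0
    while cursor is not None:
        rows, cursor = fetch_page(conn, after_id=cursor)
        total += len(rows)
    return total
```

The caller never sees more than `page_size` rows at once, so memory use stays bounded however large the table grows.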
Stability Patterns
Timeouts
First line of defense. Every external call must have a timeout. Without a timeout, one frozen service will kill the entire system.
- Connection timeout - time to establish a connection
- Read timeout - time to wait for a response
- Total timeout for the entire operation
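The three timeouts listed above can be sketched with Python's standard `socket` module. The function name `read_with_timeouts` and the default values are assumptions for illustration. In practice the numbers should come from the latency budget of the calling service:

```python
import socket
import time


def read_with_timeouts(host, port, connect_timeout=1.0, read_timeout=1.0,
                       total_timeout=3.0):
    """Read until the peer closes, enforcing connect, read, and total timeouts."""
    deadline = time.monotonic() + total_timeout
    # Connection timeout: bounds the time spent establishing the connection.
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        chunks = []
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # Total timeout: bounds the whole operation, even if each
                # individual read keeps arriving just in time.
                raise TimeoutError("total timeout exceeded")
            # Read timeout: bounds each wait for data, never past the deadline.
            sock.settimeout(min(read_timeout, remaining))
            data = sock.recv(4096)  # raises socket.timeout if nothing arrives
            if not data:
                return b"".join(chunks)
            chunks.append(data)
```

The total timeout matters because a peer that trickles one byte per second never trips the read timeout on its own.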
Circuit Breaker
Automatically disable broken dependencies. If a service is failing constantly, there is no point in continuing to call it: the extra load only makes the situation worse.
When the error threshold is exceeded, the circuit “opens” and calls immediately return a fallback without touching the dependency. After a waiting period, the breaker lets a trial call through to check whether the dependency has recovered.
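A minimal, single-threaded sketch of the closed, open, and half-open behavior described above. The class name, the consecutive-failure threshold, and the `reset_timeout` parameter are assumptions for illustration. Production libraries typically add locking, sliding error-rate windows, and metrics:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for tests
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: skip the dependency entirely.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Reset timeout elapsed: half-open, let this one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the failing dependency receives no traffic at all, which gives it room to recover.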
Bulkheads
Isolation into compartments, like the bulkheads of a ship. If one compartment floods, the others keep working.
- Separate thread pools for different types of requests
- Separate connection pools for different dependencies
- Isolation of critical from non-critical workloads
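One lightweight way to sketch a bulkhead is a per-dependency concurrency cap built on `threading.BoundedSemaphore`. This is a stand-in for the separate thread or connection pools listed above, not the book's implementation, and the `Bulkhead` class and its names are hypothetical:

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Non-blocking acquire: a full compartment rejects work immediately
        # instead of letting callers pile up behind it.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.name + " bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per dependency: a slow reporting backend can use up
# its own slots, but it cannot take slots away from payments.
payments = Bulkhead("payments", max_concurrent=10)
reports = Bulkhead("reports", max_concurrent=2)
```

A call such as `reports.run(build_report, user_id)` (where `build_report` is whatever function talks to that dependency) fails with `BulkheadFullError` once two report calls are already in flight, while `payments` is unaffected.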
Retry with Backoff
Retry failed calls with exponential backoff. But be careful: implemented carelessly, retries turn into a DDoS attack against your own service.
- Exponential backoff: 1s → 2s → 4s → 8s
- Jitter to prevent thundering herd
- Maximum number of attempts
- Retry is only for idempotent operations!
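The four points above can be sketched as a small retry helper. The function names, the choice of "full jitter" (a random delay between zero and the exponential ceiling), and the set of retryable exceptions are assumptions for illustration. The caller is responsible for passing only idempotent operations:

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, max_attempts=5, cap=30.0, rng=random.random):
    """Yield the sleep before each retry: jitter over an exponentially growing ceiling."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * factor ** attempt)  # 1s, 2s, 4s, 8s with the defaults
        yield rng() * ceiling                         # full jitter spreads clients out


def retry(operation, retryable=(ConnectionError, TimeoutError), sleep=time.sleep,
          **backoff):
    """Call an IDEMPOTENT operation, retrying transient errors with backoff."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return operation()
        except retryable:
            delay = next(delays, None)
            if delay is None:
                raise  # maximum number of attempts reached
            sleep(delay)
```

Errors outside `retryable` propagate immediately, so a validation failure is never retried.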
Additional patterns
Shed Load
When overloaded, it is better to reject some requests than to fail completely. Load shedding is a deliberate denial of service to preserve the system.
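A minimal sketch of load shedding with a bounded queue from Python's standard library. The `LoadShedder` class, the `handle` function, and the HTTP-style status codes are assumptions for illustration:

```python
import queue


class LoadShedder:
    def __init__(self, max_pending):
        if max_pending < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the purpose.
            raise ValueError("max_pending must be at least 1")
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Return True if the request was queued, False if it was shed."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self):
        """Called by a worker to take the next queued request."""
        return self._pending.get_nowait()


def handle(shedder, request):
    if shedder.submit(request):
        return 202, "accepted"
    # Rejecting early is cheap; the requests already queued still get served.
    return 503, "overloaded, retry later"
```

The queue bound caps how much work the service has promised to do, so latency for accepted requests stays predictable under overload.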
Fail Fast
If you know that a request cannot be fulfilled, refuse it immediately rather than wasting resources on it. Check preconditions at the entry point.
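A small sketch of checking preconditions at the entry point. The order-handling scenario and every name in it (`handle_order`, `stock`, `payment_is_up`, `process`) are hypothetical and serve only to show the shape of the pattern:

```python
class PreconditionError(Exception):
    """Raised before any expensive work when a request cannot succeed."""


def handle_order(order, stock, payment_is_up, process):
    # Cheap checks first, before any resources are committed.
    items = order.get("items") or {}
    if not items:
        raise PreconditionError("order has no items")
    short = sorted(sku for sku, qty in items.items() if stock.get(sku, 0) < qty)
    if short:
        raise PreconditionError("insufficient stock: " + ", ".join(short))
    if not payment_is_up():
        raise PreconditionError("payment dependency is unavailable")
    # Only now do the expensive part.
    return process(order)
```

Here `payment_is_up` could simply report the state of a circuit breaker, so the check costs no network call.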
Handshaking
The server informs the client that it is ready to accept requests. Allows graceful startup and controlled shutdown.
Steady State
The system should run indefinitely without manual intervention. That requires automatic log rotation, cache eviction, and purging of old data.
Book structure
Create Stability
Stories of real disasters. Stability antipatterns. Stability patterns: timeouts, circuit breakers, bulkheads.
Design for Production
Networking, security, availability. Administration, monitoring, logging. Deployment and infrastructure.
Deliver Your System
Continuous deployment, version control, environments. Configuration management, runtime control.
Solve Systemic Problems
Chaos engineering, adaptation. Organizational change, systems evolution, complexity management.
Application in System Design Interviews
When to use
- “How do you handle dependency failures?”
- “What happens when the system is overloaded?”
- “How do you prevent cascading failures?”
- “How do you degrade gracefully?”
- “What are the SLOs, and how do you meet them?”
Key interview patterns
- Circuit breaker for external calls
- Timeouts at all integration points
- Bulkheads for load isolation
- Rate limiting and load shedding
- Retry with exponential backoff
