System Design Space

Updated: February 21, 2026 at 11:59 PM

Release It! (short summary)

Release It! Design and Deploy Production-Ready Software

Author: Michael T. Nygard
Publisher: Pragmatic Bookshelf, 2018 (2nd Edition)
Length: 376 pages

Resilience patterns from Michael Nygard: timeouts, circuit breakers, bulkheads, and protection against cascading failures.

Stability Antipatterns

Nygard begins by describing the “cracks” in systems: the antipatterns that lead to cascading failures.

Integration Points

Every integration with an external system is a potential point of failure. Network calls may hang, return garbage, or simply not respond. Without protection, one slow service breaks the entire chain.

Blocked Threads

The most common killer of systems under load. Synchronous calls without timeouts block threads; once the pool is exhausted, new requests are no longer processed and the system hangs.

Cascading Failures

The failure of one component causes a chain reaction. One service begins to respond slowly → the caller holds its connections open while waiting → the caller's pool is exhausted → the entire cluster goes down.

Unbounded Result Sets

A query without LIMIT can return a million records. The result is an out-of-memory error, a long GC pause, or a timeout, and the service is dead. Always limit result sizes and use pagination.
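
One way to bound result sets is keyset pagination, sketched below with Python's built-in sqlite3 module. The table `items`, its columns, and the helper name `iter_pages` are hypothetical and chosen only for illustration; the book does not prescribe this particular approach.

```python
import sqlite3


def iter_pages(conn, page_size=100):
    """Yield rows in pages of at most page_size, never the whole table."""
    last_id = 0  # assumes positive integer ids (hypothetical schema)
    while True:
        rows = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Continue after the last id seen instead of using a growing OFFSET.
        last_id = rows[-1][0]
```

Each query returns at most `page_size` rows, so memory use stays bounded no matter how large the table grows.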

Stability Patterns

Timeouts

First line of defense. Every external call must have a timeout. Without a timeout, one frozen service will kill the entire system.

  • Connection timeout - time to establish a connection
  • Read timeout - time to wait for a response
  • Total timeout for the entire operation
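
The three timeouts above can be combined roughly as follows. This is a minimal sketch using only the standard socket module; the names `Deadline` and `fetch` and the default values are assumptions made for the example, not something taken from the book.

```python
import socket
import time


class Deadline:
    """Tracks the total time budget for one operation."""

    def __init__(self, total_timeout):
        self._expires_at = time.monotonic() + total_timeout

    def remaining(self):
        left = self._expires_at - time.monotonic()
        if left <= 0:
            raise TimeoutError("total timeout exceeded")
        return left


def fetch(host, port, payload, connect_timeout=1.0, read_timeout=2.0,
          total_timeout=5.0):
    deadline = Deadline(total_timeout)
    # Connection timeout, capped by whatever is left of the total budget.
    sock = socket.create_connection(
        (host, port), timeout=min(connect_timeout, deadline.remaining()))
    with sock:
        sock.sendall(payload)
        chunks = []
        while True:
            # Read timeout, re-armed before every recv and capped by the
            # total budget so many slow reads cannot add up indefinitely.
            sock.settimeout(min(read_timeout, deadline.remaining()))
            chunk = sock.recv(4096)
            if not chunk:  # peer closed the connection
                return b"".join(chunks)
            chunks.append(chunk)
```

The total timeout matters because a peer that trickles one byte just before each read timeout would otherwise keep the caller busy forever.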

Circuit Breaker

Automatically disable broken dependencies. If a service keeps failing, there is no point in continuing to call it; doing so only makes the situation worse.

States: Closed → Open → Half-Open

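When the error threshold is exceeded, the circuit “opens” and calls immediately return a fallback instead of reaching the dependency. After a cooldown the breaker becomes half-open and lets a trial request through: success closes the circuit, failure opens it again.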
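
A minimal, single-threaded sketch of the state machine is shown below. The class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are assumptions made for illustration, and a production version would also need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the cooldown can be tested
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is disabled")
            self.state = "half-open"  # cooldown over: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call reopens immediately; otherwise wait for
        # the threshold of consecutive failures.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Callers would typically catch `CircuitOpenError` and return a fallback, such as cached data or a default response.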

Bulkheads

Isolation into compartments, as on a ship. If one compartment is flooded, the others continue to work.

  • Separate thread pools for different types of requests
  • Separate connection pools for different dependencies
  • Isolation of critical traffic from non-critical traffic
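
As a rough illustration, a bulkhead can be as small as a per-dependency concurrency limit. The sketch below uses a semaphore rather than separate thread pools; the names `Bulkhead` and `BulkheadFullError` are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Do not wait for a slot: a full compartment rejects immediately,
        # so callers of one dependency cannot pile up and starve others.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency, a slow payments service can exhaust only its own slots, while calls guarded by a different bulkhead keep running.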

Retry with Backoff

Retry failed calls with exponential backoff. But be careful: implemented carelessly, retries turn into a DDoS attack against your own service.

  • Exponential backoff: 1s → 2s → 4s → 8s
  • Jitter to prevent thundering herd
  • Maximum number of attempts
  • Retry is only for idempotent operations!
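
The rules above might be sketched as follows. The helper name `retry`, its defaults, and the choice of "full jitter" (sleeping a random fraction of the exponential cap) are assumptions for this example; it is intended only for idempotent operations.

```python
import random
import time


def retry(func, max_attempts=4, base_delay=1.0, max_delay=8.0,
          retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, rng=random.random):
    """Call func, retrying transient errors with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # The cap doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the cap so that many
            # clients do not retry in lockstep (thundering herd).
            sleep(rng() * cap)
```

Only the exception types listed in `retry_on` are retried; anything else, such as a validation error, propagates at once.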

Additional patterns

Shed Load

When overloaded, it is better to reject some requests than to fail completely. Load shedding is a deliberate denial of service to preserve the system.
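
A simple form of load shedding is a bounded queue that rejects new work once it is full. In this sketch the class name `LoadShedder` and the use of HTTP-style status codes 202 and 503 are illustrative assumptions.

```python
import queue


class LoadShedder:
    """Accepts work up to a fixed backlog and rejects the rest."""

    def __init__(self, max_queued):
        self._queue = queue.Queue(maxsize=max_queued)

    def submit(self, request):
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            return 503  # shed: tell the client to back off and retry later
        return 202  # accepted for processing

    def take(self):
        # Called by a worker to pick up the next request; frees one slot.
        return self._queue.get_nowait()
```

Rejecting at the door keeps latency bounded for the requests that are accepted, instead of letting every request time out in an ever-growing queue.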

Fail Fast

If you know a request cannot be fulfilled, reject it immediately instead of wasting resources on it. Check preconditions at the entry point.
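
As a small sketch of this idea, a handler can validate input and check dependency availability before starting any expensive work. The function `handle_transfer`, its parameters, and the `PreconditionFailed` exception are hypothetical names used only for this example.

```python
class PreconditionFailed(Exception):
    """Raised before any expensive work has started."""


def handle_transfer(request, ledger_available, do_transfer):
    # Cheap checks first: bad input is rejected without touching resources.
    if request.get("amount", 0) <= 0:
        raise PreconditionFailed("amount must be positive")
    # If a required dependency is known to be down (for example, its
    # circuit breaker is open), fail now rather than halfway through.
    if not ledger_available:
        raise PreconditionFailed("ledger is unavailable")
    return do_transfer(request)
```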

Handshaking

The server informs the client that it is ready to accept requests. Allows graceful startup and controlled shutdown.

Steady State

The system should run indefinitely without manual intervention. That requires automatic log rotation, cache eviction, and purging of old data.
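
For log rotation specifically, Python's standard library offers `logging.handlers.RotatingFileHandler`. The sketch below is one possible setup; the logger name, the size limit, and the helper name `build_rotating_logger` are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=10_000_000, backup_count=5):
    """Return a logger whose disk usage is bounded by rotation."""
    logger = logging.getLogger("app.steady_state")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Once the file is about to exceed max_bytes it is renamed to path.1,
    # older backups shift up, and anything beyond backup_count is deleted.
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Disk usage is then capped at roughly `max_bytes * (backup_count + 1)`, with no operator having to clean up by hand.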

Book structure

Part I

Create Stability

Stories of real disasters. Stability antipatterns. Stability patterns: timeouts, circuit breakers, bulkheads.

Part II

Design for Production

Networking, security, availability. Administration, monitoring, logging. Deployment and infrastructure.

Related chapter

Grokking Continuous Delivery

CI/CD, secure deployments and DORA metrics for the Deliver Your System part.

Part III

Deliver Your System

Continuous deployment, version control, environments. Configuration management, runtime control.

Part IV

Solve Systemic Problems

Chaos engineering, adaptation. Organizational change, systems evolution, complexity management.

Applying it in a System Design interview

When to use

  • “How do you handle dependency failures?”
  • “What happens under overload?”
  • “How do you prevent cascading failures?”
  • “How do you degrade gracefully?”
  • “What are the SLOs, and how do you meet them?”

Key interview patterns

  • Circuit breaker for external calls
  • Timeouts at all integration points
  • Bulkheads for load isolation
  • Rate limiting and load shedding
  • Retry with exponential backoff

Main conclusions

  • Every integration is a potential point of failure. Protect all integration points.
  • Timeouts are required. Without them, one frozen service will kill the entire system.
  • A circuit breaker prevents cascading failures and allows the system to recover.
  • Bulkheads isolate failures, preventing them from spreading.
  • It is better to reject some requests (load shedding) than to fail completely.
  • Production-ready ≠ feature-complete. Stability matters more than functionality.

© 2026 Alexander Polomodov