System Design Space

Updated: April 14, 2026 at 6:20 PM

Fault Tolerance Patterns: Circuit Breaker, Bulkhead, Retry

medium

How to combine timeouts, circuit breakers, bulkheads, retry budgets, and fallback behavior so degraded dependencies do not turn into cascading failures.

System resilience is defined not by how many patterns appear on the slide, but by how the system behaves once one degraded dependency starts pulling others down with it.

The chapter ties circuit breakers, bulkheads, retries, fail-fast behavior, graceful degradation, fallback, and async recovery into one blast-radius-control model, where the critical question is not only what to enable, but when those mechanisms start making things worse.

In interviews and design reviews, it is useful because it lets you discuss retry storms, resource contention, overload behavior, and manual override paths as real failure semantics rather than as a stack of resilience buzzwords.

Practical value of this chapter

Error Budget

Tie resilience patterns to SLO and error-budget policy so reliability is managed, not just stated.

Failure Isolation

Combine circuit breakers, bulkheads, and timeout policy to contain blast radius across dependencies.

Controlled Degradation

Design fallback levels and feature shedding so user experience remains predictable during incidents.

Decision Rationale

Demonstrate not only the patterns, but also their thresholds, trigger conditions, and effectiveness metrics.

Classic

Release It!

Michael Nygard’s production stories show how the same resilience mechanisms behave in real incidents, not just in diagrams.

Resilience patterns matter because a healthy system must stay useful after one dependency has already degraded. This chapter ties timeouts, circuit breakers, bulkheads, retry policy, fallback, and controlled degradation into one operating model for containing the damage of a bad dependency day.

In practice, fault tolerance is never a single switch. Circuit breakers stop traffic from piling onto a degraded dependency, bulkheads keep one workload from consuming shared resources, and retries only make sense when they are paired with explicit timeouts, backoff, and jitter.

During an incident, the system often has to fail fast instead of waiting forever, switch to a controlled degradation mode, and choose a fallback path that limits blast radius without exhausting the error budget behind the SLO.

That only works when retried operations are idempotent and no noisy neighbor can starve the critical user path of shared capacity.

Circuit Breaker

When a degraded dependency keeps receiving full traffic, queues grow, latency worsens, and the failure starts spreading to neighboring services. Circuit Breaker cuts that loop before a local issue turns into systemic overload.

Closed

Requests keep flowing to the dependency while the system only collects error and latency signals.

Open

New calls are rejected immediately so queues do not grow and resources are not wasted on an already degraded path.

Half-Open

A small number of probe requests checks whether the dependency has recovered before normal traffic is restored.
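
The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation: the names `CircuitBreaker` and `CircuitOpenError`, the consecutive-failure threshold, and the fixed reset timeout are all assumptions made for the example. It reacts to errors only, while a real breaker would usually also track latency and use a rolling window.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the dependency."""


class CircuitBreaker:
    # Hypothetical defaults; real thresholds should come from observed traffic.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: no queueing, no waiting on a degraded path.
                raise CircuitOpenError("dependency circuit is open")
            self.state = "half_open"  # timeout elapsed, let a probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

Injecting the clock keeps the sketch testable without real sleeps. A concurrent version would also need a lock and an explicit cap on how many probes run at once in the half-open state.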

Bulkhead isolation

When all features share the same resource pools, one noisy workload can consume the available connections or worker slots and break even the critical path. Bulkhead isolation limits that damage at subsystem boundaries and helps the service stay useful in a partial outage.

Separate thread pools and connection pools for critical and non-critical operations.
Isolate queues and concurrency limits for each dependency independently.
Set resource quotas per service or tenant so one noisy neighbor cannot overload the whole environment.
Separate control-plane actions from the serving path so operational commands still work during a partial outage.
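
One way to picture a per-workload concurrency limit is a semaphore-based compartment that rejects work when it is full instead of queueing it. The sketch below is an assumption-laden illustration: `Bulkhead`, `BulkheadFullError`, the compartment names, and the slot counts are hypothetical and not taken from any particular library.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical split: the critical path gets its own, larger compartment,
# so a flood of recommendation calls cannot take checkout slots.
checkout = Bulkhead("checkout", max_concurrent=32)
recommendations = Bulkhead("recommendations", max_concurrent=4)
```

Rejecting immediately is a design choice: a full non-critical compartment then surfaces as a fast, countable error rather than as hidden queueing that holds threads and connections.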

Important

Consistency and idempotency

Retries without idempotent contracts often create business duplicates and make recovery harder than the original failure.

Retry patterns

Transient failures are inevitable in distributed systems, and retries do help recover a successful result without user intervention. But without explicit limits, pauses, and stopping conditions, retries become an overload mechanism of their own.

Retry only transient failures; deterministic failures such as validation or authorization errors will fail again and only add load.

Use exponential backoff and jitter so retry waves do not synchronize.

Cap the retry budget and align it with the request timeout, so that all attempts plus their backoff pauses fit inside the caller's deadline.

Retries are safe only for idempotent operations; otherwise they create duplicates and side effects.

When a dependency is overloaded, fail fast and degrade intentionally instead of keeping callers waiting on additional attempts.
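
The rules above can be combined into one small retry helper. This is a sketch under stated assumptions: `TransientError` is a hypothetical marker exception for failures worth retrying, the default limits are arbitrary, and the jitter variant shown is "full jitter" (a uniformly random delay up to the exponential cap). It does not check idempotency; the caller is assumed to pass only operations that are safe to repeat.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def call_with_retry(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                    deadline=5.0, sleep=time.sleep, clock=time.monotonic):
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempt budget exhausted
            # Full jitter: a random pause up to the exponential cap,
            # so clients that failed together do not retry together.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, cap)
            if clock() - start + delay > deadline:
                raise  # the pause alone would overrun the caller's deadline
            sleep(delay)
```

Any exception other than `TransientError` propagates on the first attempt, which matches the rule that non-transient failures are not retried. The deadline check only bounds the waiting time between attempts, so each call made through `fn` still needs its own timeout.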

Fallbacks and controlled degradation

Controlled degradation

Disable secondary features such as recommendations or enrichment while keeping the primary user action available.

Last known good cache

Return the last valid version of the data when the source is unavailable, and make the freshness trade-off explicit.

Queue and async recovery

Accept critical operations into a queue and complete them asynchronously once the dependent path recovers.

Safe default response

For non-critical paths, a bounded fallback response, such as an empty list or a static placeholder, is often better than turning the whole scenario into a hard error.
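
As an illustration of the last-known-good option, the sketch below wraps a fetch function and serves the previous value when the source fails, but only within an explicit staleness limit. `LastKnownGood`, `max_stale_seconds`, and the `(value, is_stale)` return shape are assumptions made for this example; the point is that the freshness trade-off is visible to the caller rather than hidden.

```python
import time


class LastKnownGood:
    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._value = None
        self._stored_at = None  # None until the first successful fetch

    def get(self):
        """Return (value, is_stale); re-raise if no acceptable copy exists."""
        try:
            value = self._fetch()
        except Exception:
            if self._stored_at is None:
                raise  # nothing cached yet, so there is no fallback
            age = self._clock() - self._stored_at
            if age > self._max_stale:
                raise  # cached copy is older than the agreed limit
            return self._value, True
        self._value, self._stored_at = value, self._clock()
        return value, False
```

Returning the staleness flag lets the caller decide how to present old data, for example by labeling it in the UI or skipping actions that need fresh values.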

Practical checklist

Every external call has an explicit timeout, retry policy, and circuit-breaker thresholds.

Degradation metrics are defined: availability, latency, queue depth, and rejected-request share.

There is an operational procedure for forced-open mode and safe recovery of the dependency.

Tests cover cascading dependency failure and fallback behavior.

Resilience parameters are reviewed after incidents and postmortems.

Common anti-pattern: aggressive retry without concurrency limits or jitter.
