System Design Space

Updated: March 24, 2026 at 5:36 PM

Fault Tolerance Patterns: Circuit Breaker, Bulkhead, Retry

Difficulty: medium

Practical analysis of resilience patterns for distributed systems: how to limit cascading failures and manage service degradation.

System resilience is defined not by how many patterns appear on the slide, but by how the system behaves once one degraded dependency starts pulling others down with it.

The chapter ties circuit breakers, bulkheads, retries, fail-fast behavior, graceful degradation, fallback, and async recovery into one blast-radius-control model, where the critical question is not only what to enable, but when those mechanisms start making things worse.

In interviews and design reviews, it is useful because it lets you discuss retry storms, resource contention, overload behavior, and manual override paths as real failure semantics rather than as a stack of resilience buzzwords.

Practical value of this chapter

Failure budget

Tie resilience patterns to SLO and error-budget policy so reliability is managed, not just stated.

Failure isolation

Combine circuit breakers, bulkheads, and timeout policy to contain blast radius across dependencies.

Gradual degradation

Design fallback levels and feature shedding so user experience remains predictable during incidents.

Interview robustness

Demonstrate not only the patterns, but also their thresholds, trigger conditions, and effectiveness metrics.

Classic: Release It! (Michael T. Nygard) — the book that popularized circuit breakers and bulkheads. This chapter gives a modern practical frame for the same resilience principles.

Resilience patterns ensure that a dependency failure does not become a cascading outage across the entire system. `Circuit Breaker`, `Bulkhead`, and `Retry` must work as one set: limit blast radius, preserve controlled degradation, and give the system a chance to recover without breaking critical user journeys.

Circuit Breaker

Motivation: when a degraded dependency keeps receiving full traffic, queues grow, latency worsens, and the failure spreads to other services. Circuit Breaker interrupts this feedback loop and protects the system from cascading overload.

Closed

Traffic flows normally while the system measures errors and latency.

Open

Requests to a degraded dependency are rejected quickly to avoid queue buildup and resource exhaustion.

Half-Open

Probe requests verify recovery before switching back to Closed.
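The three states above can be sketched as a minimal state machine. This is an illustrative Python sketch, not a library API; the class and parameter names (`CircuitBreaker`, `failure_threshold`, `open_cooldown`) mirror the policy described here but are otherwise assumptions.

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=3, open_cooldown=2.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.open_cooldown = open_cooldown          # seconds to reject traffic while open
        self.state = "closed"
        self.fail_streak = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.open_cooldown:
                self.state = "half-open"  # allow one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.fail_streak += 1
            # A failed probe, or hitting the threshold, (re)opens the breaker.
            if self.state == "half-open" or self.fail_streak >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.fail_streak = 0
        self.state = "closed"  # any success closes the breaker
        return result
```

In production you would typically track a rolling error rate and latency rather than a raw failure streak, but the state transitions are the same.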

Circuit Breaker visualization

[Interactive simulator: requests CB-101–CB-104 to /payments pass through a breaker with policy failure threshold = 3 and open cooldown = 2 cycles; panels track state (closed/open/half-open), fail streak, and success/failure/rejected counters.]

Bulkhead isolation

Motivation: when all features share the same resource pools, one noisy workload can consume threads/connections and break critical user paths. Bulkhead isolates resources and limits blast radius at subsystem boundaries.

Separate thread pools/connection pools for critical and non-critical operations.
Queue isolation and concurrency limits per dependency.
Resource quotas per service/tenant so a noisy neighbor cannot take down the whole platform.
Control plane and data plane isolation so ops commands still work during partial outages.
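One common way to implement a per-lane bulkhead is a non-blocking semaphore that rejects excess calls instead of queueing them. A minimal Python sketch; the `Bulkhead` class and the lane sizes are illustrative assumptions:

```python
import threading

class Bulkhead:
    """Per-lane concurrency limit; excess calls are rejected (fail fast)."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        # Non-blocking acquire: if the lane is full, reject immediately
        # rather than letting callers pile up in a shared queue.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting instead of queueing")
        try:
            return fn()
        finally:
            self._sem.release()

critical = Bulkhead(max_concurrent=8)    # e.g. checkout, payments
background = Bulkhead(max_concurrent=2)  # e.g. enrichment, exports
```

Because the two lanes hold separate semaphores, a burst of background work can only exhaust its own capacity; the critical lane keeps serving.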

Bulkhead visualization

[Interactive simulator: incoming requests BH-201–BH-204 are tagged critical or background and routed to isolated lanes; each lane has its own queue (capacity 3), in-flight limit, and completed/dropped counters, and the pools never share capacity.]

Important: see the chapter Data Consistency and Idempotency Patterns — retries without idempotency often create business duplicates and inconsistent state.

Retry patterns

Motivation: transient failures are inevitable in distributed systems, and retries can turn them into successes without user action. But without limits, backoff, and jitter, retries themselves become a source of overload, so the retry policy must be strictly controlled.

Retry only transient failures; deterministic failures should not be retried.

Exponential backoff + jitter is mandatory to avoid synchronized retry storms.

Retry budget must be capped and aligned with timeout budget.

Idempotency is mandatory, otherwise retries create duplicates and side effects.

During dependency overload, fail fast with graceful degradation instead of waiting forever.
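The rules above combine into a small retry helper: cap attempts, back off exponentially with full jitter, and retry only errors classified as transient. A Python sketch under those assumptions (the exception classes chosen as retryable are illustrative):

```python
import random
import time

def retry(fn, *, max_attempts=4, base_delay=0.1, max_delay=2.0,
          retryable=(TimeoutError, ConnectionError)):
    """Exponential backoff with full jitter; deterministic errors propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^(attempt-1))]
            # so concurrent clients do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            time.sleep(delay)
```

Note what is deliberately absent: a `ValueError` or HTTP 4xx-style error is not in `retryable`, so it fails on the first attempt — retrying deterministic failures only adds load.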

Retry visualization

[Interactive simulator: operations RT-301–RT-304 (CreateOrder, ReserveInventory, SendWebhook, UpdateLedger) are retried up to 4 attempts under an exponential backoff + jitter policy with a required idempotency key; an attempt timeline shows each retry.]

Retry without an idempotency key can create duplicates and inconsistent side effects.

Fallback strategies

Graceful degradation

Disable secondary features (recommendations, enrichment) while preserving core user flow.

Stale cache

Return the last valid data when source is unavailable, with explicit freshness metadata.

Queue + async recovery

Accept critical operations into a queue and finish processing asynchronously after recovery.

Static/default response

Serve a safe fallback response instead of a hard error for non-critical paths.
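The stale-cache strategy, for example, can be sketched as a wrapper that serves the last known value with explicit freshness metadata when the source fails. The function name and the in-process cache are illustrative assumptions:

```python
import time

CACHE = {}  # key -> (value, fetched_at); stand-in for a real cache

def get_with_stale_fallback(key, fetch, max_stale=300.0):
    """Try the source; on failure serve the last value, marked as stale."""
    try:
        value = fetch()
        CACHE[key] = (value, time.time())
        return {"value": value, "stale": False, "age_s": 0.0}
    except Exception:
        if key in CACHE:
            value, fetched_at = CACHE[key]
            age = time.time() - fetched_at
            if age <= max_stale:
                # Explicit freshness metadata lets callers and UIs
                # distinguish degraded data from live data.
                return {"value": value, "stale": True, "age_s": age}
        raise  # no usable fallback: propagate so the caller can degrade further
```

The `stale` flag and `age_s` field matter as much as the value: downstream consumers must be able to tell degraded responses from fresh ones.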

Practical checklist

Each external call has timeout, retry policy, and circuit breaker thresholds.

Degradation KPIs are defined: availability, latency, backlog, drop rate.

Runbook exists for forced-open breaker and manual override.

Tests cover cascading dependency failure and fallback behavior.

Resilience parameters are periodically reviewed using incident/postmortem data.

Common anti-pattern: aggressive retry without concurrency limits and jitter.
