System Design Space

Updated: February 21, 2026 at 11:59 PM

Fault Tolerance Patterns: Circuit Breaker, Bulkhead, Retry


Practical analysis of resilience patterns for distributed systems: how to limit cascading failures and manage service degradation.

Classic: Release It!

This chapter gives a modern practical frame for the same resilience principles.

Resilience patterns ensure that a dependency failure does not become a cascading outage across the entire system. `Circuit Breaker`, `Bulkhead`, and `Retry` must work as one set: limit blast radius, preserve controlled degradation, and give the system a chance to recover without breaking critical user journeys.

Circuit Breaker

Motivation: when a degraded dependency keeps receiving full traffic, queues grow, latency worsens, and the failure spreads to other services. Circuit Breaker interrupts this feedback loop and protects the system from cascading overload.

Closed

Traffic flows normally while the system measures errors and latency.

Open

Requests to a degraded dependency are rejected quickly to avoid queue buildup and resource exhaustion.

Half-Open

Probe requests verify recovery before switching back to Closed.
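The three states above can be sketched as a minimal in-process breaker. This is an illustrative sketch, not the chapter's implementation: class and field names (`CircuitBreaker`, `fail_streak`) and the time-based cooldown are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal state machine: Closed -> Open -> Half-Open -> Closed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=2.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = "closed"
        self.fail_streak = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "half_open"  # let one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A successful call (including a half-open probe) closes the breaker.
        self.fail_streak = 0
        self.state = "closed"

    def _on_failure(self):
        self.fail_streak += 1
        # A failed probe reopens immediately; otherwise open on threshold.
        if self.state == "half_open" or self.fail_streak >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

A production breaker would also track latency and use a sliding error-rate window rather than a simple consecutive-failure streak.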

Circuit Breaker visualization

Interactive simulator (omitted here): requests to /payments pass through a breaker with failure threshold = 3 and open cooldown = 2 cycles, while counters track successes, failures, and rejected calls across the Closed/Open/Half-Open states.

Bulkhead isolation

Motivation: when all features share the same resource pools, one noisy workload can consume threads/connections and break critical user paths. Bulkhead isolates resources and limits blast radius at subsystem boundaries.

Separate thread pools/connection pools for critical and non-critical operations.
Queue isolation and concurrency limits per dependency.
Resource quotas per service/tenant so a noisy neighbor cannot take down the whole platform.
Control plane and data plane isolation so ops commands still work during partial outages.
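The pool-isolation idea can be sketched with per-lane executors and a bounded admission semaphore, so overload is shed instead of queueing without limit. Names (`BoundedLane`, `max_pending`) and the specific limits are illustrative assumptions, not from the chapter.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedLane:
    """One bulkhead lane: its own thread pool plus a bounded admission
    semaphore that counts queued + in-flight work."""

    def __init__(self, max_workers, max_pending):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.slots = threading.BoundedSemaphore(max_pending)

    def submit(self, fn, *args):
        # Reject immediately when the lane is saturated (load shedding).
        if not self.slots.acquire(blocking=False):
            raise RuntimeError("lane full: shedding load")

        def wrapped():
            try:
                return fn(*args)
            finally:
                self.slots.release()

        return self.pool.submit(wrapped)


# Isolated lanes: a noisy background workload cannot starve critical requests.
critical_lane = BoundedLane(max_workers=4, max_pending=8)
background_lane = BoundedLane(max_workers=2, max_pending=4)
```

Saturating the background lane leaves the critical lane's workers and queue slots untouched, which is exactly the blast-radius limit the pattern is after.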

Bulkhead visualization

Interactive simulator (omitted here): critical and background requests arrive in a shared incoming queue and are routed to isolated lanes, each with its own bounded queue (capacity 3), in-flight limit, and completed/dropped counters.

Important: Data Consistency and Idempotency Patterns

Retries without idempotency often create business duplicates and inconsistent state.

Retry patterns

Motivation: transient failures are inevitable in distributed systems, and retry helps recover success without user action. But without limits, backoff, and jitter, retries become an overload source, so policy must be strictly controlled.

Retry only transient failures; deterministic failures should not be retried.

Exponential backoff + jitter is mandatory to avoid synchronized retry storms.

Retry budget must be capped and aligned with timeout budget.

Idempotency is mandatory, otherwise retries create duplicates and side effects.

During dependency overload, fail fast with graceful degradation instead of waiting forever.
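The rules above can be condensed into a small retry helper. This is a sketch under assumptions: the function name, the choice of full jitter, and the set of retryable exception types are illustrative, not the chapter's.

```python
import random
import time


def retry(func, *, max_attempts=4, base_delay=0.1, max_delay=2.0,
          retryable=(TimeoutError, ConnectionError)):
    """Retry only transient failures, with exponential backoff + full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # full jitter: sleep a random amount up to the exponential cap,
            # which desynchronizes clients and avoids retry storms
            cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, cap))
```

Deterministic failures (for example, validation errors) fall outside `retryable` and propagate on the first attempt, matching the "retry only transient failures" rule.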

Retry visualization

Interactive simulator (omitted here): operations such as CreateOrder, ReserveInventory, SendWebhook, and UpdateLedger retry up to 4 attempts each under an exponential backoff + jitter policy with a required idempotency key; an attempt timeline tracks succeeded and failed operations and total attempts.

Retry without an idempotency key can create duplicates and inconsistent side effects.
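A minimal sketch of the idempotency-key mechanism: the handler returns the stored result for a repeated key instead of re-executing the side effect. The in-memory dict stands in for what would be a durable store with a TTL in a real system; all names here are illustrative.

```python
import uuid

# In-memory idempotency store; production systems use a durable store + TTL.
_processed = {}


def create_order(idempotency_key, payload):
    """Execute the side effect at most once per key; replays return the
    original result instead of creating a duplicate order."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    order = {"order_id": str(uuid.uuid4()), "payload": payload}  # side effect
    _processed[idempotency_key] = order
    return order
```

With this in place, a client-side retry that resends the same key is harmless: it observes the first attempt's result.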

Fallback strategies

Graceful degradation

Disable secondary features (recommendations, enrichment) while preserving core user flow.

Stale cache

Return the last valid data when source is unavailable, with explicit freshness metadata.

Queue + async recovery

Accept critical operations into a queue and finish processing asynchronously after recovery.

Static/default response

Serve a safe fallback response instead of a hard error for non-critical paths.
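The stale-cache strategy above can be sketched as a wrapper that serves the last valid value with explicit freshness metadata when the source fails. Class and field names (`StaleCache`, `age_seconds`) are illustrative assumptions.

```python
import time


class StaleCache:
    """Serve the last valid value, tagged with its age, when the source
    is unavailable; only fail if nothing was ever cached."""

    def __init__(self, source, ttl_seconds=30.0):
        self.source = source
        self.ttl = ttl_seconds
        self.value = None
        self.fetched_at = None

    def get(self):
        now = time.monotonic()
        if self.fetched_at is None or now - self.fetched_at > self.ttl:
            try:
                self.value = self.source()
                self.fetched_at = now
            except Exception:
                if self.value is None:
                    raise  # nothing cached yet: cannot degrade gracefully
        age = now - self.fetched_at
        # explicit freshness metadata lets the caller decide how to render
        return {"value": self.value, "age_seconds": age, "stale": age > self.ttl}
```

The explicit `stale` flag matters: the caller (or UI) can label degraded data instead of silently presenting it as fresh.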

Practical checklist

Each external call has timeout, retry policy, and circuit breaker thresholds.

Degradation KPIs are defined: availability, latency, backlog, drop rate.

Runbook exists for forced-open breaker and manual override.

Tests cover cascading dependency failure and fallback behavior.

Resilience parameters are periodically reviewed using incident/postmortem data.

Common anti-pattern: aggressive retry without concurrency limits and jitter.


© 2026 Alexander Polomodov