Distributed correctness does not fail only on obvious outages. It also fails on retries, stale reads, and partially completed operations.
This chapter ties together consistency models, read-your-writes, idempotency keys, consumer deduplication, the transactional outbox, and saga compensation into one design frame. The key question in that frame is which invariants can be relaxed and which must survive every retry and replay.
For system design interviews, this is powerful because it lets you discuss correctness in terms of concrete safeguards instead of hiding behind the marketing phrase "exactly-once", which explains almost nothing about failure behavior.
Practical value of this chapter
Consistency level
Select consistency model per use case: strong, bounded staleness, or eventual with compensation strategy.
Idempotent contracts
Design APIs and consumers with idempotency keys, dedupe stores, and explicit retry behavior.
Failure scenarios
Model race conditions, duplicate delivery, and partial commits through explicit failure timelines.
Interview precision
Show how correctness is preserved in distributed systems when exactly-once cannot be guaranteed end to end.
Theory
CAP Theorem
Consistency is always a trade-off against availability and latency.
Consistency and idempotency patterns let systems safely survive retries, redelivery, and partial failures. Core principle: you cannot rely only on "perfect delivery" guarantees; the system must stay correct under repeated and out-of-order events.
Related
Jepsen consistency models
Empirical analysis of consistency models and real DB behavior under failures.
Consistency Models
Strong consistency
Use for: financial operations, critical invariants, and workflows with a high cost of error.
Trade-off: higher latency and cost, and lower availability under network partitions.
Read-your-writes / session consistency
Use for: user-facing flows where users must immediately see their own updates.
Trade-off: requires session routing or sticky reads, plus cache validation discipline.
Eventual consistency
Use for: catalogs, recommendations, analytical views, and asynchronous integrations.
Trade-off: temporary divergence appears, so a UX and business handling policy is required.
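The read-your-writes model above can be illustrated with a small routing sketch. It assumes a toy setup in which every write bumps a version counter on the primary and each session remembers the newest version it wrote; `Store`, `SessionRouter`, and `min_version` are hypothetical names for illustration, not part of any particular database API.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """A toy versioned key-value store; `version` counts applied writes."""
    data: dict = field(default_factory=dict)
    version: int = 0


class SessionRouter:
    """Routes reads so a session never sees state older than its own writes."""

    def __init__(self, primary: Store, replica: Store):
        self.primary = primary
        self.replica = replica

    def write(self, session: dict, key: str, value: str) -> None:
        self.primary.data[key] = value
        self.primary.version += 1
        # Remember the newest version this session has written.
        session["min_version"] = self.primary.version

    def read(self, session: dict, key: str):
        # Use the replica only if it has caught up to this session's writes.
        if self.replica.version >= session.get("min_version", 0):
            return self.replica.data.get(key), "replica"
        return self.primary.data.get(key), "primary"

    def replicate(self) -> None:
        # Stand-in for asynchronous replication catching up.
        self.replica.data = dict(self.primary.data)
        self.replica.version = self.primary.version
```

In this sketch a session that has just written falls back to the primary until the replica catches up, while other sessions keep reading the (possibly stale) replica. That is the eventual-consistency behavior the last entry describes.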
Idempotency Patterns
Idempotency Key for synchronous APIs
POST/command operations: payments, order creation, invoice issuance, workflow execution.
How to implement
- Client sends `Idempotency-Key`; server stores key + request fingerprint + final response.
- On retry with the same key and payload, return original result instead of creating a new operation.
- Choose key TTL by business risk (often 24-72 hours for financial operations).
Risk: a reused key may arrive with a different payload. Return a conflict error in that case; otherwise hidden duplicates appear.
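The steps above can be sketched as a minimal in-memory store. This is an illustration under stated assumptions, not a production design: `IdempotencyStore`, `IdempotencyConflict`, and the `handler` callback are hypothetical names, the fingerprint is assumed to be a SHA-256 hash of the canonical JSON payload, and the sketch does not handle two requests with the same key that are in flight at the same time.

```python
import hashlib
import json
import time


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different payload."""


class IdempotencyStore:
    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._records = {}  # key -> (fingerprint, response, expires_at)

    @staticmethod
    def fingerprint(payload: dict) -> str:
        # Canonical JSON so that key order does not change the fingerprint.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, key: str, payload: dict, handler):
        now = self._clock()
        fp = self.fingerprint(payload)
        record = self._records.get(key)
        if record is not None and record[2] > now:
            stored_fp, stored_response, _ = record
            if stored_fp != fp:
                # Same key, different payload: reject instead of guessing.
                raise IdempotencyConflict(key)
            return stored_response  # replay the original result
        # First time we see this key (or the old record expired): run once.
        response = handler(payload)
        self._records[key] = (fp, response, now + self._ttl)
        return response
```

A retry with the same key and payload returns the stored response without calling `handler` again. Once the TTL passes, the key is treated as new, which is why the TTL has to cover the window in which clients may still retry.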
Practical guardrails
- Idempotency protects against duplicate delivery, but does not replace concurrency and invariant controls.
- For critical commands, store not only processed flag but also canonical response/reason code for replay.
- Monitor retry hit-rate, dedupe reject rate, and conflict-resolution latency.
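The first two guardrails can be illustrated with a small sketch. Two distinct commands with different keys are not duplicates, so deduplication alone would let both through; the balance invariant still has to be checked atomically. The `Account` class and its outcome dictionaries are hypothetical, and a `threading.Lock` stands in for whatever concurrency control the real datastore provides.

```python
import threading


class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self._lock = threading.Lock()
        self._results = {}  # idempotency key -> canonical outcome

    def withdraw(self, key: str, amount: int) -> dict:
        # Dedupe lookup, invariant check, and state change happen under one
        # lock, so concurrent commands cannot interleave between them.
        with self._lock:
            if key in self._results:
                return self._results[key]  # replay, including rejections
            if amount > self.balance:
                outcome = {"status": "rejected", "reason": "insufficient_funds"}
            else:
                self.balance -= amount
                outcome = {"status": "ok", "balance": self.balance}
            # Store the full outcome with its reason code, not just a flag.
            self._results[key] = outcome
            return outcome
```

Storing the rejection together with its reason code means a retried command receives the same answer it got the first time, even if the balance has changed since.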
Validation
Testing Distributed Systems
Idempotency must be validated with duplicate/out-of-order scenarios.
Usage Scenarios
Payment API
A retry after a timeout, without an idempotent contract, often results in a double charge.
Failure path: Client (timeout + retry) -> Payment API (no idempotency key) -> DB (duplicate charge) -> User (double debit)
Resilient path: Client (Idempotency-Key) -> API (dedupe + unique constraint) -> Ledger (single transaction) -> API (status replay)
What happens
- Idempotency key maps repeated retries of one client intent to a single business operation.
- Even with redelivery, server returns the same result instead of creating a new transaction.
- Unique constraint and status endpoint close race conditions between retries.
Risk: Key TTL and key scope must match the business time window of the operation.
Scenario must remain correct under retries, redelivery, and out-of-order delivery.
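The resilient path can be sketched with Python's standard `sqlite3` module, using the primary key on the idempotency key as the unique constraint. The `charges` table, its columns, and the `charge` function are assumptions made for illustration; the payload fingerprint check is left out to keep the sketch short.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE charges (
            idempotency_key TEXT PRIMARY KEY,
            account_id      TEXT NOT NULL,
            amount_cents    INTEGER NOT NULL,
            status          TEXT NOT NULL
        )
        """
    )
    return conn


def charge(conn: sqlite3.Connection, key: str, account_id: str, amount_cents: int) -> dict:
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "INSERT INTO charges VALUES (?, ?, ?, 'captured')",
                (key, account_id, amount_cents),
            )
        created = True
    except sqlite3.IntegrityError:
        # The key already exists: this is a retry, not a new charge.
        created = False
    # Status replay: always answer from the stored row.
    row = conn.execute(
        "SELECT account_id, amount_cents, status FROM charges WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    return {"created": created, "account_id": row[0], "amount_cents": row[1], "status": row[2]}
```

Because the constraint is enforced by the database, two racing retries cannot both insert a row. The loser gets an integrity error and replays the stored status.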
Practical Checklist
- For every critical command, duplicate-request behavior is explicitly defined.
- Events have stable unique identity and consumer deduplication strategy.
- Consistency model is chosen intentionally and reflected in API contracts/documentation.
- Reconciliation processes exist to detect and repair divergence.
- Team tests retries, redelivery, and out-of-order delivery in integration tests.
Common anti-pattern: assuming the platform guarantees once-only delivery and skipping idempotency.
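A consumer that does not rely on once-only delivery might look like the following sketch. It assumes each event carries a stable `event_id` and a per-entity `version`; `Event` and `OrderProjection` are hypothetical names. In a real system the dedupe record and the state change would need to be committed in the same transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str   # stable unique identity, used for deduplication
    entity_id: str
    version: int    # per-entity sequence, used to reject stale updates
    status: str


class OrderProjection:
    def __init__(self):
        self.seen_ids = set()
        self.state = {}  # entity_id -> (version, status)

    def handle(self, event: Event) -> str:
        if event.event_id in self.seen_ids:
            return "duplicate"  # redelivery: already processed
        self.seen_ids.add(event.event_id)
        current = self.state.get(event.entity_id)
        if current is not None and current[0] >= event.version:
            return "stale"  # out-of-order: a newer version is already applied
        self.state[event.entity_id] = (event.version, event.status)
        return "applied"
```

Delivering the same events in any order, with any number of repeats, leaves the projection in the state of the highest version seen.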
References
Related chapters
- CAP Theorem - Why a system must trade consistency against availability when a network partition occurs.
- PACELC Theorem - How the latency/consistency trade-off appears even when there is no partition.
- Event-Driven Architecture - Where idempotency and consistency are especially critical for event flows.
- Resilience Patterns - Retries require idempotent contracts; otherwise they create business duplicates.
- Testing Distributed Systems - How to test duplicate/out-of-order/partial-failure scenarios.
