Distributed transactions become painful exactly where the business wants atomicity but the architecture has already split across multiple services and stores.
In real engineering work, this chapter helps choose between 2PC, 3PC, Saga, and transactional outbox by domain boundaries, acceptable failure behavior, blocking risk, and coordination cost.
In interviews and design conversations, it is especially useful when you need to speak plainly about timeout semantics, partial commit, compensations, and idempotency requirements.
Practical value of this chapter
Design in practice
Helps choose transaction patterns by domain boundaries and acceptable failure behavior.
Decision quality
Compares 2PC, 3PC, and Saga by latency, locking impact, and operational complexity.
Interview articulation
Provides a clear narrative for coordinator, participants, commit point, and recovery.
Risk and trade-offs
Makes blocking, partial-commit, timeout, and idempotency trade-offs explicit.
Context
Consistency and idempotency
Distributed transactions are one way to ensure consistency, but not the only one.
Distributed Transactions (2PC/3PC) are needed when a business invariant requires a coordinated change across several independent resources. The cost is extra latency, blocking, and complex recovery after partial failures.
When is a distributed transaction needed?
- One business operation must atomically change several independent resources or services.
- Even temporary divergence is unacceptable for this class of operation.
- A partially committed operation would create serious financial, regulatory, or user-facing risk.
How two-phase commit (2PC) works
2PC: two-phase commit
Prepare -> votes -> global decision (commit/abort)
The coordinator collects participant votes and makes one global commit/abort decision for the whole transaction.
Strengths
- Simple coordination model that is easy to reason about.
- Clearly separates preparation from the final decision.
Risks
- Blocking is possible if the coordinator fails at the wrong time.
- Highly sensitive to timeout/retry tuning.
Protocol Steps
Current Command
Click Start to play the protocol step-by-step.
Coordinator
Waiting to start
Coordinator commands: 0
Participants
3 participants
Active step: 0 / 8
Order
participant A
Waiting for commands
Involved in steps: 0
Payment
participant B
Waiting for commands
Involved in steps: 0
Inventory
participant C
Waiting for commands
Involved in steps: 0
How three-phase commit (3PC) works
3PC: three-phase commit
CanCommit -> PreCommit -> DoCommit
Adds an intermediate PreCommit phase to reduce blocking risk when the coordinator has issues.
Strengths
- Reduces the chance of getting stuck in an uncertain state.
- Explicitly separates intent from final commit.
Risks
- More network rounds and a more complex state machine.
- Requires very careful timeout and recovery tuning.
Protocol Steps
Current Command
Click Start to play the protocol step-by-step.
Coordinator
Waiting to start
Coordinator commands: 0
Participants
3 participants
Active step: 0 / 12
Order
participant A
Waiting for commands
Involved in steps: 0
Payment
participant B
Waiting for commands
Involved in steps: 0
Inventory
participant C
Waiting for commands
Involved in steps: 0
Alternative
Event-Driven Architecture
In many scenarios, Saga plus transactional outbox is a better fit than global 2PC/3PC.
Trade-offs and alternatives
2PC is easy to explain and adopt, but participants can block if the coordinator fails after prepare.
3PC reduces the chance of getting stuck, but adds a network round and a more complex protocol state machine.
Both approaches are sensitive to network partitions, timeouts, and correct recovery after partial failures.
In a microservice architecture, a shared atomic transaction across services is often too expensive and fragile.
Saga (orchestration/choreography)
Breaks the operation into local steps with compensating actions instead of a global synchronized commit.
Transactional outbox
Writes the outgoing event next to the business change in the local database, then publishes it safely later.
Idempotent commands + reconciliation
Safe retries and background reconciliation reduce the impact of partial failures.
Domain redesign
Sometimes it is cheaper to move aggregate boundaries and remove the cross-service atomicity requirement.
Practical checklist
- It is explicitly defined where strict atomicity is needed and where eventual consistency is acceptable.
- There is a coordinator recovery strategy and durable transaction log.
- Timeout policies are tested under partition and delay scenarios.
- All participants handle repeated COMMIT and ABORT commands safely.
- There is a business mechanism for compensation and manual resolution of disputed cases.
Frequent anti-pattern: introducing 2PC between services without evaluating blocking, retry behavior, and recovery cost.
References
Related chapters
- Consistency and idempotency - Idempotency is required to safely handle retries and recovery after failures.
- Event-Driven Architecture - Saga and asynchronous coordination as a practical alternative to distributed XA.
- Fault Tolerance Patterns - Timeout, retry, and bulkhead strategies define how a transaction behaves under failures.
- Leader Election: patterns and implementations - The Coordinator pattern often relies on leadership coordination.
- Testing Distributed Systems - Test partial commit, timeout, and recovery scenarios before production.
