Saga and Compensating Transactions

This chapter on Saga matters because it shows a third path to cross-service consistency: not an atomic distributed transaction through a blocking coordinator, but a sequence of local transactions T1..Tn, each with a semantic compensation C1..Cn.

In practice it is the engineering basis of long business processes over a database-per-service architecture: order placement, booking, payments — where 2PC is unacceptable because of locking, and consistency moves into the application layer via event choreography or an orchestrator state machine (Temporal, Camunda, Step Functions).

In interviews and design discussions it gives you the language to name the honest cost: a Saga trades ACID atomicity for eventual consistency and loses isolation — hence dirty reads and lost updates that must be mitigated with semantic locks, idempotency, and a transactional outbox.

Practical value of this chapter

Design in practice

Teaches you to split a long cross-service transaction into local steps with compensations and place the pivot as the point of no return.

Decision quality

Helps choose between choreography and orchestration by flow complexity and compare Saga with 2PC by locking, isolation, and observability.

Interview articulation

Provides a clear narrative: T1..Tn, reverse-order compensations C1..Cn, semantic lock, and transactional outbox versus dual write.

Risk and trade-offs

Makes explicit the absence of isolation, already-visible effects, the idempotency requirement, and the boundary where a consistent DB is better.

Contrasting neighbor chapter

Distributed transactions: 2PC and 3PC

The neighbor chapter achieves multi-service atomicity through a blocking coordinator and two-phase commit. This chapter gives up both atomicity and distributed locks.

Compare the approaches

The neighbor chapter on distributed transactions solves "all or nothing" across multiple stores with two-phase commit (2PC): a coordinator asks every participant "are you ready to commit?", and only on a unanimous "yes" does it broadcast the commit. The result is true atomicity with ACID guarantees. The price is that every participant holds locks from the prepare phase to commit, and a stuck coordinator leaves everyone in doubt.

This chapter is about the other pole. A Saga gives up the single atomic transaction entirely. A long business operation is split into a sequence of local transactions T1…Tn — each one committed in its own service immediately and independently. If a step fails along the way, the Saga does not roll back databases (those commits are already visible); instead it runs compensating transactions C1…Cn that semantically undo what was done before. No distributed locks, no single coordinator with the power to stall everyone.

The trade is honest: instead of instant atomicity a Saga gives you eventual consistency — the system passes through intermediate states where some steps are already committed while compensations have not yet run. In return it scales without locks, survives long (hours-to-days) business processes, and does not collapse because of one stuck node. The notion of a saga was introduced by Hector Garcia-Molina and Kenneth Salem back in 1987 — long before microservices.

A Saga is needed where a business transaction crosses several services with their own databases and holding 2PC distributed locks is unacceptable — because of scale, geography, or process duration. You trade ACID atomicity for a sequence of local commits with compensations, accepting the absence of isolation as a deliberate cost.

Why not 2PC

The cost of atomicity

2PC delivers atomicity but holds locks on every participant for the whole round and stalls if the coordinator fails — which scales poorly to long business processes.


The problem: consistency without a distributed transaction

In a database-per-service architecture, order, inventory, and payments each have their own store. The business wants "place order" to be atomic: either all three steps happen or none do. The obvious answer is to wrap everything in one distributed transaction via 2PC. But at scale and for long processes that answer breaks down along three lines.

Locks for the whole round

Between prepare and commit each participant holds its resources locked. The more services and the longer the round, the higher the contention: concurrent transactions queue up and throughput drops.

Blocking coordinator

If the coordinator dies after prepare but before commit, participants stay in-doubt: resources locked, decision unknown. 3PC softens this but does not remove the locking.

Long business processes

A hold, a credit check, or a delivery may take minutes, hours, or days. Keeping transactional locks for all that time is impossible: they would choke the databases long before the process finishes.

The key shift in thinking: a Saga does not try to preserve the illusion of one atomic transaction. It accepts that intermediate states are visible to others and moves the responsibility for "rollback" from the DB engine into application logic. What 2PC hides behind locks, a Saga exposes as explicit compensations — and pays for it with the loss of isolation.

Original source

Sagas (1987)

Garcia-Molina and Salem introduced the saga for long-lived transactions: split one long transaction into a chain of short ones so locks are not held for hours.


What a Saga is: T1…Tn and compensations C1…Cn

In the paper "Sagas" (Garcia-Molina, Salem, ACM SIGMOD, 1987), a saga is defined as a way to split a long-lived transaction (LLT) into a sequence of short local transactions T₁, T₂, …, T_n, each committed immediately. For every T_i there is a compensating transaction C_i that semantically undoes its effect. A saga guarantees one of two outcomes:

Full success

The entire sequence T₁ … T_n ran. The business operation is complete, each service committed its part locally. This is the saga's "forward" path.

Compensated rollback

T₁ … T_j ran, step T_{j+1} failed. Then compensations C_j … C₁ run in reverse order. The result is semantically equivalent to "as if nothing happened."

Forward path and compensated rollback (failure at T3)

Top — a successful saga T1→T2→T3. Bottom — a failure at T3: compensations C2 and C1 run in reverse order, returning the system to a business-equivalent of the initial state.
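The two outcomes can be sketched in a few lines of Python. This is a minimal in-process illustration, not a production executor: `Step`, `run_saga`, and the order/stock/payment step names are hypothetical, and the sketch assumes compensations themselves never fail (a real saga would retry them until they succeed).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # T_i: local transaction, commits immediately
    compensate: Callable[[], None]  # C_i: semantic undo of T_i


def run_saga(steps: List[Step]) -> bool:
    """Run T1..Tn; on failure run the compensations of completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # The failed step committed nothing, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")  # hypothetical failure at T3


steps = [
    Step("order", lambda: log.append("T1 create order"), lambda: log.append("C1 cancel order")),
    Step("stock", lambda: log.append("T2 reserve stock"), lambda: log.append("C2 release stock")),
    Step("payment", fail_payment, lambda: log.append("C3 refund")),
]
ok = run_saga(steps)
```

With the failure injected at T3, `ok` is `False` and `log` records T1, T2, C2, C1 in that order; C3 never runs because T3 did not commit.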

This is not ACID atomicity but saga atomicity: a guarantee that the system will not hang "halfway" forever; it either reaches the end or compensates back. An important classification (described by Chris Richardson and used in the Microsoft Azure Architecture Center's pattern guidance): steps fall into compensable (undoable via C_i), pivot (the point of no return: once it commits, the saga must run to completion and earlier steps are no longer compensated), and retryable (idempotent, run after the pivot until they succeed).

Neighbor chapter

Workflow orchestration

As complexity grows, the Saga orchestrator becomes a durable process engine: a state machine, step history, timers, retries — exactly what Temporal and Camunda provide.


Choreography versus orchestration

A saga has to be coordinated somehow: someone must decide which step is next and when to trigger compensations. There are two polar ways to coordinate, and the choice between them trades coupling for observability.

Choreography: reacting to events

There is no central conductor. Each service, after committing its local transaction, publishes an event, and other subscribers react with their own steps. The saga's logic is spread across the participants.

Pro: no single point of failure, loose coupling, simple for short flows. Con: as the number of steps grows it is hard to see who reacts to what; risk of cyclic dependencies; difficult to test and observe the whole process.
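A choreographed flow can be sketched with an in-memory event bus. Everything here is an assumption made for illustration: the `Bus` class stands in for a real broker (it is synchronous and keeps nothing durable), and the topic names and the `card_ok` flag are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], None]


class Bus:
    """In-memory stand-in for a broker: synchronous, no persistence, no retries."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.trace: List[str] = []  # topics in publication order, for inspection

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        self.trace.append(topic)
        for handler in self.handlers[topic]:
            handler(event)


bus = Bus()
orders: Dict[str, str] = {}
stock: Dict[str, int] = {"sku-1": 2}


def place_order(order_id: str, sku: str, card_ok: bool) -> None:
    orders[order_id] = "PENDING"  # T1, order service
    bus.publish("OrderCreated", {"order_id": order_id, "sku": sku, "card_ok": card_ok})


def reserve_stock(e: Event) -> None:  # T2, inventory service
    stock[e["sku"]] -= 1
    bus.publish("StockReserved", e)


def charge(e: Event) -> None:  # T3, payment service
    bus.publish("PaymentCompleted" if e["card_ok"] else "PaymentFailed", e)


def release_stock(e: Event) -> None:  # C2, inventory service
    stock[e["sku"]] += 1
    bus.publish("StockReleased", e)


def confirm_order(e: Event) -> None:
    orders[e["order_id"]] = "CONFIRMED"


def cancel_order(e: Event) -> None:  # C1, order service
    orders[e["order_id"]] = "CANCELLED"


bus.subscribe("OrderCreated", reserve_stock)
bus.subscribe("StockReserved", charge)
bus.subscribe("PaymentCompleted", confirm_order)
bus.subscribe("PaymentFailed", release_stock)
bus.subscribe("StockReleased", cancel_order)

place_order("o1", "sku-1", card_ok=True)
place_order("o2", "sku-1", card_ok=False)
```

No function in the sketch knows the whole flow: the saga exists only as the chain of subscriptions, which is exactly why it becomes hard to follow as steps are added.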

Orchestration: a central conductor

An orchestrator appears — a state machine that holds the saga's state, sends participants commands ("do T_i"), waits for replies, and on failure triggers the compensations itself in the right order.

Pro: process logic in one place, excellent observability, easy to add steps, no cycles. Con: the orchestrator is a potential point of failure and a hub of coupling; it needs a separate service/engine.
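The orchestrator's state machine can be sketched as follows. This is a toy under stated assumptions: the `Orchestrator` class, its states, and the command names are hypothetical; `send` stands in for a command/reply exchange with a participant; and the history is kept in memory, whereas a real engine would persist it. Compensations are assumed to succeed eventually, otherwise `run` would loop.

```python
from enum import Enum
from typing import Callable, List, Tuple


class State(Enum):
    RUNNING = "RUNNING"
    COMPENSATING = "COMPENSATING"
    COMPLETED = "COMPLETED"
    COMPENSATED = "COMPENSATED"


class Orchestrator:
    def __init__(self, steps: List[Tuple[str, str]], send: Callable[[str], bool]) -> None:
        self.steps = steps            # (forward command, compensating command) per step
        self.send = send              # delivers a command, returns True on success
        self.state = State.RUNNING
        self.position = 0             # number of forward steps that have succeeded
        self.history: List[str] = []  # a real engine would store this durably

    def tick(self) -> None:
        """Perform one state transition."""
        if self.state is State.RUNNING:
            if self.position == len(self.steps):
                self.state = State.COMPLETED
                return
            command, _ = self.steps[self.position]
            ok = self.send(command)
            self.history.append(f"{command}:{'ok' if ok else 'failed'}")
            if ok:
                self.position += 1
            else:
                self.state = State.COMPENSATING
        elif self.state is State.COMPENSATING:
            if self.position == 0:
                self.state = State.COMPENSATED
                return
            _, undo = self.steps[self.position - 1]
            if self.send(undo):       # a failed compensation is retried on the next tick
                self.position -= 1
                self.history.append(f"{undo}:ok")

    def run(self) -> State:
        while self.state in (State.RUNNING, State.COMPENSATING):
            self.tick()
        return self.state


def participant(command: str) -> bool:
    return command != "charge_card"   # hypothetical: the payment step always fails


saga = Orchestrator(
    [("create_order", "cancel_order"), ("reserve_stock", "release_stock"), ("charge_card", "refund_card")],
    participant,
)
final = saga.run()
```

Because state, position, and history live in one object, the whole process can be inspected at any moment, which is the observability argument made above.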

A practical rule: choreography is good for 2–4 steps with simple logic where loose coupling matters most. As soon as you have branching, conditional compensations, and a need to "see the whole process," orchestration wins, and it is usually implemented not by hand but on a durable engine from the neighbor chapter on workflow orchestration.

Compensations: a semantic undo, not a DB rollback

A compensation is not a database ROLLBACK. The local transaction T_i is already committed and visible to others; the only way to "undo" it is a new transaction C_i that does the opposite in meaning. Hence several hard requirements on compensations.

Semantic, not exact, undo

C_i returns the system not to a byte-exact copy of the past but to a business-equivalent state. Money was charged — the compensation issues a refund, it does not "rewind" the balance: both entries remain in history. Inventory was reserved — the compensation releases the reservation.
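A minimal sketch of the charge/refund case, assuming an append-only ledger (the `Ledger` and `Entry` names and the cent amounts are hypothetical). The compensation adds a new entry instead of deleting the old one.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    saga_id: str
    kind: str    # "charge" or "refund"
    amount: int  # cents; negative means money left the customer's account


class Ledger:
    def __init__(self) -> None:
        self.entries: List[Entry] = []  # append-only: nothing is ever removed

    def charge(self, saga_id: str, amount: int) -> None:  # T_i
        self.entries.append(Entry(saga_id, "charge", -amount))

    def refund(self, saga_id: str, amount: int) -> None:  # C_i
        self.entries.append(Entry(saga_id, "refund", amount))

    def balance(self) -> int:
        return sum(e.amount for e in self.entries)


ledger = Ledger()
ledger.charge("saga-1", 2500)
ledger.refund("saga-1", 2500)
```

After the refund the balance is back to zero, but the history holds two entries: the state is business-equivalent to the start, not identical to it.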

Idempotency

Because of retries and unreliable delivery, both T_i and C_i may arrive twice. Applying them again must yield the same result as applying once — otherwise you get a double refund or a double charge. Idempotency here is not optional; it is a correctness condition.

Commutativity with the forward step

Because of delivery races, a compensation may arrive before the forward step it undoes, or before that step's acknowledgment. A good design makes C_i robust to this: "cancel reservation #X" works correctly even if the reservation has not yet been recorded, and a late "reserve #X" then has no effect; otherwise you must add waiting and buffering.
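One way to make a cancel safe when it overtakes its reserve is to record a tombstone. The sketch below is an assumption about how that could look, not a prescribed design; `ReservationStore` and its status values are hypothetical.

```python
from typing import Dict


class ReservationStore:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.status: Dict[str, str] = {}  # reservation id -> "RESERVED" | "CANCELLED"

    def reserve(self, reservation_id: str) -> bool:
        state = self.status.get(reservation_id)
        if state == "CANCELLED":
            return False  # the cancel arrived first: the late reserve is a no-op
        if state == "RESERVED":
            return True   # duplicate delivery of the same reserve
        if self.stock == 0:
            return False
        self.stock -= 1
        self.status[reservation_id] = "RESERVED"
        return True

    def cancel(self, reservation_id: str) -> None:
        if self.status.get(reservation_id) == "RESERVED":
            self.stock += 1
        # Tombstone: written even when nothing was reserved yet.
        self.status[reservation_id] = "CANCELLED"


store = ReservationStore(stock=5)
store.cancel("r1")                  # compensation overtakes the forward step
late = store.reserve("r1")          # forward step arrives afterwards
store.reserve("r2")
store.cancel("r2")
store.cancel("r2")                  # repeated compensation
```

The same tombstone also makes `cancel` idempotent: the second cancel of `r2` finds the status already `CANCELLED` and does not return stock twice.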

The already-visible-effects problem

The hardest part: a step's effect may have already "leaked" outside. The email to the customer is sent, goods have left the warehouse, money has gone to a counterparty. There is no clean compensation — you need business measures: a notification, reverse logistics, manual escalation. So irreversible steps are made the pivot and placed as late as possible.

Hence the practical step order: first everything compensable (holds, reservations, checks), then the single pivot (the point of no return — e.g. actual shipping), and after it only retryable idempotent steps that are guaranteed to run to completion. That keeps the zone where a rollback might be needed reversible.
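The ordering rule is simple enough to check mechanically. The following sketch assumes each step is tagged with its kind; the `Kind` enum and `is_valid_plan` are hypothetical helpers, and the example plan (reserve, hold, ship, receipt) is only an illustration.

```python
from enum import IntEnum
from typing import List


class Kind(IntEnum):
    COMPENSABLE = 1
    PIVOT = 2
    RETRYABLE = 3


def is_valid_plan(kinds: List[Kind]) -> bool:
    """True if the plan reads: compensable steps, at most one pivot, retryable steps."""
    if kinds.count(Kind.PIVOT) > 1:
        return False
    # Kinds must never go "backwards", e.g. a compensable step after a retryable one.
    return all(earlier <= later for earlier, later in zip(kinds, kinds[1:]))


plan = [
    Kind.COMPENSABLE,  # reserve stock
    Kind.COMPENSABLE,  # place a payment hold
    Kind.PIVOT,        # ship the goods
    Kind.RETRYABLE,    # send the receipt
]
valid = is_valid_plan(plan)
```

A check like this could run when the saga definition is loaded, so that a compensable step placed after the pivot is rejected before any order is processed.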

What is missing

Isolation (ACID-I)

2PC plus locks provide isolation; a Saga provides none at all. Intermediate states are visible to concurrent sagas — hence dirty reads and lost updates.


Isolation anomalies: what a Saga does not give

A Saga sacrifices the "I" in ACID: there is no isolation between sagas. Because each step commits immediately, its intermediate result is visible to other transactions before the saga has finished or compensated. This produces the classic anomalies — the same ones seen at weak isolation levels in databases.

Dirty read

Saga B reads data that saga A has already written locally but will later undo with a compensation. B made a decision based on something that "will not exist."

Lost update

A and B concurrently read and overwrite the same record without seeing each other's changes. One of the updates is silently clobbered.

Non-repeatable read

Different steps of one saga read the same entity and get different values, because another saga changed it between the reads.

Countermeasures (per Microsoft / Richardson)

Semantic lock: a step sets an "in progress" flag (e.g. order status PENDING), so other sagas know the data is not yet final.
Commutative updates: design operations so the order of application does not change the result (increment instead of read-then-write).
Pessimistic view: reorder steps so updates land in the retryable phase after the pivot — then dirty reads cannot occur.
Reread value: before writing, check that the data has not changed since it was read; if it did, abort and restart the step.
By value: pick the mechanism by the business risk of the request — a saga for cheap operations, a consistent transaction for expensive, critical ones.
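Two of these countermeasures, the semantic lock and reread value, can be sketched on a single record. The `Order` fields, the `REVISION_PENDING` status, and the function names are hypothetical. In a real service each check-and-write would be one conditional statement in the database (for example an update guarded by the version column), not separate Python operations.

```python
from dataclasses import dataclass


class Locked(Exception):
    """Another saga holds the semantic lock on this record."""


class StaleRead(Exception):
    """The record changed since it was read."""


@dataclass
class Order:
    status: str = "APPROVED"
    total: int = 0
    version: int = 0


def acquire(order: Order) -> None:
    # Semantic lock: an application-level flag saying "not final yet", not a DB lock.
    if order.status.endswith("_PENDING"):
        raise Locked(order.status)
    order.status = "REVISION_PENDING"
    order.version += 1


def release(order: Order, final_status: str) -> None:
    order.status = final_status
    order.version += 1


def write_if_unchanged(order: Order, read_version: int, new_total: int) -> None:
    # Reread value: compare the version seen at read time with the current one.
    if order.version != read_version:
        raise StaleRead(f"expected v{read_version}, found v{order.version}")
    order.total = new_total
    order.version += 1
```

A saga that catches `Locked` can wait, retry, or fail fast; one that catches `StaleRead` rereads the record and repeats the step. Which reaction is right depends on the business case.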

Reliable delivery: outbox, CDC, and idempotency keys

A saga has an insidious gap: a step must both commit data to its own DB and send an event/command to the next one. Without a distributed transaction this cannot be done atomically, because the DB and the message broker are different systems (the classic dual write). Crash in between and either the data exists but the event is lost (the saga hangs), or the event was sent but the data was never committed. The transactional outbox, paired with CDC or a polling relay, solves this.

Transactional outbox + CDC

The event is written in the same local transaction as the business data — into a dedicated outbox table. Since it is one commit, either both the data and the event record exist or neither does. The atomicity of the dual write is reduced to an ordinary local transaction.

A separate relay then reads the outbox and publishes events to the broker — via change data capture (reading the transaction log, e.g. Debezium) or a polling publisher.
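The pattern can be sketched with SQLite from the Python standard library and a polling publisher. The table layout, the `create_order` and `relay_once` names, and the `publish` callback are assumptions for illustration; a CDC-based relay would read the transaction log instead of polling.

```python
import json
import sqlite3
from typing import Any, Callable, Dict, List, Tuple


def init(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_order(conn: sqlite3.Connection, order_id: str) -> None:
    # One local transaction: the business row and the event row commit together
    # or roll back together ("with conn" commits on success, rolls back on error).
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, Dict[str, Any]], None]) -> int:
    """Polling publisher: send unpublished events in order, then mark them."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, json.loads(payload))
        # A crash between publish and this update causes a re-send: at-least-once.
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


conn = sqlite3.connect(":memory:")
init(conn)
create_order(conn, "o1")
sent: List[Tuple[str, Dict[str, Any]]] = []
first = relay_once(conn, lambda topic, payload: sent.append((topic, payload)))
second = relay_once(conn, lambda topic, payload: sent.append((topic, payload)))
```

The first relay pass publishes the one pending event and the second finds nothing to do. Note that the relay can still publish a message twice after a crash, so consumers need to tolerate duplicates.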

Exactly-once effect via idempotency

Brokers typically deliver at-least-once: an event may arrive again. End-to-end exactly-once delivery across a broker and a consumer's own database is not something to rely on, but what you actually need is an exactly-once effect. You get it with idempotency keys.

Each event carries a unique key (e.g. sagaId + stepId). The consumer records each processed key in the same local transaction as the step's effect and on a repeat simply ignores the duplicate. Retries become safe: the effect is applied exactly once, even if the message is delivered many times.
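A minimal sketch of such a consumer, assuming the key is the pair (sagaId, stepId). `PaymentConsumer` and its fields are hypothetical, and the in-memory set stands in for a table of processed keys that would be written in the same local transaction as the balance change.

```python
from typing import Dict, Set, Tuple


class PaymentConsumer:
    def __init__(self) -> None:
        self.processed: Set[Tuple[str, str]] = set()  # idempotency keys already applied
        self.balance: Dict[str, int] = {}

    def handle(self, saga_id: str, step_id: str, account: str, amount: int) -> bool:
        """Apply a charge once per (saga_id, step_id); return False for a duplicate."""
        key = (saga_id, step_id)
        if key in self.processed:
            return False  # duplicate delivery: acknowledge and do nothing
        # In a real service these two writes share one local transaction.
        self.balance[account] = self.balance.get(account, 0) - amount
        self.processed.add(key)
        return True


consumer = PaymentConsumer()
applied_first = consumer.handle("saga-1", "charge", "acct-7", 2500)
applied_again = consumer.handle("saga-1", "charge", "acct-7", 2500)  # redelivery
```

The redelivered message is accepted but changes nothing, so the account is charged once.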

Together, outbox and idempotency close the saga's reliability: the outbox guarantees an event will not be lost (its commit is atomic with the data), and idempotency keys guarantee a repeat does no harm. Without that pair, any network will eventually either hang the saga or run a step twice.

Transport

Inter-service communication

A choreographed saga lives on events in a broker such as Kafka; an orchestrated one on commands from the engine to participants. Both are inter-service communication patterns from the neighbor chapter.


Tools: orchestrators and event-driven

A saga is almost never written "from scratch": coordination, durable state, timers, and retries are provided by off-the-shelf engines. Roughly, the tools split into two camps — durable orchestrators and event buses.

Tool	Style	How it implements a saga
Temporal	Orchestration (code)	Durable execution: the workflow code is replayed deterministically from an event history; compensations are written as explicit undo steps, and the engine guarantees they run on failure.
Camunda / Zeebe	Orchestration (BPMN)	A saga as a BPMN process with compensation boundary events: you model the forward steps and their attached compensations visually, and the engine executes the state machine.
AWS Step Functions	Orchestration (state machine)	A standard workflow as a saga orchestrator: steps call services, and on error Catch transitions trigger compensating tasks (Revert Payment, Revert Inventory).
Apache Kafka	Choreography (events)	Transport for an event-driven saga: each service publishes an event to a topic, the others react; durability and per-partition ordering are held by the log itself, with no coordinator.

The boundary is simple: durable orchestrators (Temporal, Camunda/Zeebe, Step Functions) give observability, explicit compensations, and execution history, which makes them the choice for complex sagas. Event-driven coordination via Kafka gives loose coupling and no single point of failure, which makes it the choice for simple choreographed flows. More on the engines themselves is in the neighbor chapter on workflow orchestration.

Trade-offs: when to use a Saga and when a consistent database

A Saga fits when

A business transaction crosses several services with their own databases — 2PC is unavailable or costly.
The process is long (minutes/hours/days) and holding locks all that time is impossible.
The domain tolerates eventual consistency and intermediate visible states.
Each step has a meaningful semantic compensation (refund, release a reservation, cancel a hold).

A consistent DB / 2PC is better when

All of the operation's data lives in one database — use a plain local ACID transaction, do not overcomplicate.
You need strong consistency and isolation: no one may see an intermediate state.
Steps are irreversible and compensations are meaningless, while the cost of error is too high (by value).
The team is not ready to operate compensations, idempotency, an outbox, and an orchestrator.

Common mistakes

Treating a compensation as a DB rollback. C_i is a separate transaction; it undoes semantically, not by "rewinding," and cannot remove already-visible effects.
Forgetting idempotency. Without idempotency keys, at-least-once delivery produces double charges and double compensations.
Ignoring isolation anomalies. A Saga gives no "I" from ACID; without semantic locks and commutative updates you get dirty reads and lost updates.
Reaching for a Saga where one DB is enough. If the data is in a single store, a distributed saga is needless complexity with no benefit.
Not delivering events atomically. A dual write without an outbox will eventually hang the saga on a lost event.

Key takeaways

A Saga (Garcia-Molina, Salem, 1987) splits a long transaction into a chain of local transactions T1..Tn, each with a compensation C1..Cn, and delivers eventual consistency instead of ACID atomicity.
Unlike the neighboring 2PC, a Saga holds no distributed locks and does not depend on a blocking coordinator — so it scales to long cross-service processes.
Choreography (events, loose coupling) versus orchestration (a central conductor, observability) — choose by the flow's complexity.
A compensation is a semantic undo, not a ROLLBACK: it must be idempotent, ideally commutative, and is powerless against already-visible effects (hence the pivot).
A Saga sacrifices isolation: dirty reads and lost updates are mitigated by semantic locks, commutative updates, rereading, and choosing by value.
Reliability rests on transactional outbox + CDC (the event will not be lost) and idempotency keys (a repeat does no harm); the broker gives at-least-once, the keys give an exactly-once effect.

Sources and further reading

Source map: Garcia-Molina/Salem is the original source for T1..Tn and compensations; microservices.io covers modern choreography/orchestration and transactional outbox; Azure and AWS provide cloud pattern guidance. A Saga compensation is a business-equivalent reverse step, not an automatic rollback; an exactly-once effect additionally requires idempotency, outbox/CDC, and an explicit retry model.

Garcia-Molina, Salem. Sagas (ACM SIGMOD, 1987): the original source on T1..Tn and compensations
Chris Richardson. Pattern: Saga (choreography and orchestration of local transactions)
Chris Richardson. Pattern: Transactional outbox (reliable event publishing)
Microsoft Azure Architecture Center. Saga design pattern (isolation anomalies and countermeasures)
AWS Prescriptive Guidance. Saga orchestration on Step Functions (T1..T3, compensations C1..C2)

Related chapters

Distributed transactions: 2PC and 3PC - The contrasting neighbor approach: atomicity through a blocking coordinator and two-/three-phase commit instead of a sequence of local transactions with compensations.
Microservices integration: overview - Places Saga on the broader integration map: where database-per-service makes a distributed transaction impossible and consistency moves into the application layer.
Inter-service communication patterns - Shows the transport a Saga rides on: events, commands, queues, and asynchronous delivery between the steps of a long transaction.
Workflow orchestration patterns - Expands the Saga orchestrator into a full process engine: state machines, durable execution, timers, and execution history in Temporal/Camunda.