System Design Space

Updated: May 1, 2026 at 8:17 AM

Event-Driven Architecture: Event Sourcing, CQRS, Saga

medium

Practical guide to event contracts, Event Sourcing, CQRS, Saga coordination, DLQ handling, and safe evolution of asynchronous flows in distributed systems.

Event-driven architecture matters not because loose coupling sounds attractive, but because it lets teams separate the moment of decision, the publication of a fact, and downstream reactions across time.

The chapter breaks Event Sourcing, CQRS, and Saga into distinct engineering choices, so it becomes clear where asynchrony truly simplifies the system and where it brings schema evolution, replay complexity, convergence lag, and higher operational cost.

In engineering discussions, this helps you reason calmly about orchestration versus choreography, event boundaries, stuck workflows, and the real price of asynchrony instead of collapsing everything into 'let's add a broker and it will be more flexible.'

Practical value of this chapter

Event Contracts

Design events as stable business facts with compatible schema evolution instead of accidental transport packets.

Flow Coordination

Choose orchestration versus choreography deliberately and define where an explicit control loop is required.

Replay and DLQ

Plan replay, DLQ handling, and handler idempotency so failures stay recoverable instead of becoming data loss.

Decision Rationale

Explain when event-driven flow truly lowers coupling and when it only adds lag and operational cost.

Reference

Martin Fowler: Event Sourcing

A classic text on why teams store history as events and where that trade-off really pays off.


Event-Driven Architecture (EDA) matters not because a broker magically makes systems flexible, but because it separates the moment of decision from the moment of reaction. That freedom improves scalability and decoupling, yet it only pays off when event contracts, replay strategy, and failure recovery are designed deliberately.

Core principles of event-driven architecture

This chapter separates Event Sourcing, CQRS, and Saga into three distinct tools rather than treating them as one mandatory package. They solve different problems: durable history, independent read and write scaling, and coordination of long-running business flows.

Using these tools well means agreeing up front on event contracts, schema evolution, snapshots, auditability, orchestration versus choreography, and what the system should do under at-least-once delivery, consumer lag, and dead-letter-queue recovery. Those decisions belong inside each bounded context instead of being copied mechanically across the whole platform.

Events record facts

An event should describe something that already happened in the domain and should not be rewritten later.
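
As a minimal sketch of this principle, an event can be modeled as an immutable value named in the past tense. The `OrderPlaced` type, its fields, and the helper functions below are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class OrderPlaced:
    """A domain fact named in the past tense (hypothetical example type)."""
    order_id: str
    total_cents: int
    event_id: str          # unique identity, useful later for deduplication
    occurred_at: datetime  # when the fact happened, not when it was consumed


def order_placed(order_id: str, total_cents: int) -> OrderPlaced:
    """Create the event with a fresh identity and a UTC timestamp."""
    return OrderPlaced(order_id, total_cents, str(uuid4()), datetime.now(timezone.utc))


def try_rewrite(event: OrderPlaced) -> bool:
    """Attempt to mutate a recorded fact; return True only if it succeeded."""
    try:
        event.total_cents = 0  # frozen dataclasses reject attribute assignment
        return True
    except FrozenInstanceError:
        return False
```

Under this model, a correction would be expressed as a new event appended after the original rather than as an edit to the stored one.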

Async flow reduces coupling

Producers and consumers can evolve independently as long as the event contract stays stable.
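
One way to keep a contract stable while it evolves is to upcast old payloads at read time and ignore fields the consumer does not need. The sketch below assumes a hypothetical `schema_version` field and an assumed default currency for version 1 events; both are illustration choices, not something the chapter prescribes.

```python
from typing import Any

CURRENT_VERSION = 2  # assumed current schema version for this example


def upcast(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the payload in the current shape; stored history is untouched."""
    event = dict(raw)
    version = event.get("schema_version", 1)  # events without a version count as v1
    if version == 1:
        # Assumption: v2 added "currency", and v1 events were all in USD.
        event.setdefault("currency", "USD")
        version = 2
    event["schema_version"] = version
    return event


def handle_order_placed(raw: dict[str, Any]) -> str:
    """Tolerant reader: use only the fields this consumer needs, ignore the rest."""
    event = upcast(raw)
    return f'{event["order_id"]}:{event["total_cents"]}:{event["currency"]}'
```

Because the consumer reads only the fields it needs, a producer can add optional fields without breaking it, while removing or renaming a field remains a breaking change.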

Convergence is not immediate

There is often a lag between the write side and the final user-facing view, and product behavior must account for it.

Data flow visualizations

Below are three characteristic scenarios: a base event pipeline, CQRS read and write split, and step coordination inside Saga.


Base event pipeline

1. Command API: receives the business command.
2. Domain Aggregate: validates invariants and makes a decision.
3. Event Store: appends events in order.
4. Broker or log: fans events out to consumers.
5. Consumers: build projections, integrations, and side effects.
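
The pipeline above can be sketched in memory to show how the stages hand off to each other. Every class and function name here (`EventStore`, `Broker`, `Account`, `handle_withdraw`) is hypothetical. The sketch also publishes immediately after appending, which is not atomic; a real system would need something like an outbox or log tailing to close that gap.

```python
from collections import defaultdict
from typing import Callable

Event = dict  # illustrative shape: {"type": ..., "account_id": ..., "amount": ...}


class EventStore:
    """Append-only, ordered log of events."""
    def __init__(self) -> None:
        self._log: list[Event] = []

    def append(self, event: Event) -> int:
        self._log.append(event)
        return len(self._log) - 1  # position of the event in the log

    def read_all(self) -> list[Event]:
        return list(self._log)


class Broker:
    """Fans each published event out to every subscriber of its type."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers[event["type"]]:
            handler(event)


class Account:
    """Aggregate: checks the no-overdraft invariant and decides which event to emit."""
    def __init__(self, account_id: str, balance: int = 0) -> None:
        self.account_id = account_id
        self.balance = balance

    def withdraw(self, amount: int) -> Event:
        if amount <= 0 or amount > self.balance:
            raise ValueError("withdrawal rejected")
        return {"type": "FundsWithdrawn", "account_id": self.account_id, "amount": amount}

    def apply(self, event: Event) -> None:
        self.balance -= event["amount"]


def handle_withdraw(account: Account, amount: int, store: EventStore, broker: Broker) -> int:
    """Command handler: decide, record the fact, then fan it out."""
    event = account.withdraw(amount)  # step 2: validate and decide
    account.apply(event)
    position = store.append(event)    # step 3: append in order
    broker.publish(event)             # step 4: fan out to consumers
    return position
```

A rejected command raises before anything is appended, so the log only ever contains facts that passed the aggregate's invariants.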

CQRS: write path and read path


Write path

1. Client: sends a POST or PUT command.
2. Command model: applies business rules and invariants.
3. Event stream: captures domain facts.
4. Projector: updates the read model.

Read path

1. Client: sends a GET query.
2. Query API: serves read-only responses.
3. Read model: stores a denormalized projection.

Saga: coordination styles

This flow shows the orchestration style: a dedicated coordinator decides which step should run next and when compensations must begin.

All key commands and decisions pass through a single coordinator.

1. Order service → Orchestrator: OrderCreated
2. Orchestrator → Payment service: ReserveFunds
3. Payment service → Orchestrator: PaymentReserved
4. Orchestrator → Inventory service: ReserveStock
5. Inventory service → Orchestrator: InventoryReserved
6. Orchestrator → Shipping service: CreateShipment
7. Shipping service → Orchestrator: ShipmentCreated
8. Orchestrator → Order service: OrderCompleted

Key idea: one coordinator makes decisions and drives service calls.

Related chapter

Microservice Patterns

A dedicated chapter on integration patterns, Saga, and the limits of distributed transactions.


Event Sourcing, CQRS, and Saga

Event Sourcing

System state is rebuilt from a sequence of events rather than stored only as the latest snapshot.

When to use

  • You need a full audit trail and reproducible change history.
  • You need to rebuild projections or derive new read models from historical facts.
  • Your domain is naturally expressed as events that happened over time.

Trade-offs

  • Schema evolution and old-history migrations become harder.
  • You need snapshots and replay strategy to keep recovery and startup fast.
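
To make the replay and snapshot trade-off concrete, here is a minimal sketch that folds events into state and resumes from a snapshot instead of from the start of history. The event types `Deposited` and `Withdrawn` and the `Snapshot` shape are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    balance: int
    version: int  # number of events already folded into balance


def apply(balance: int, event: dict) -> int:
    """Pure transition function: current state plus one event gives the next state."""
    if event["type"] == "Deposited":
        return balance + event["amount"]
    if event["type"] == "Withdrawn":
        return balance - event["amount"]
    return balance  # unknown event types leave the state unchanged


def rebuild(events: list[dict], snapshot: Optional[Snapshot] = None) -> Snapshot:
    """Replay only the events after the snapshot, or the full history if there is none."""
    balance = snapshot.balance if snapshot else 0
    start = snapshot.version if snapshot else 0
    for event in events[start:]:
        balance = apply(balance, event)
    return Snapshot(balance, len(events))
```

Replaying from a snapshot yields the same state as replaying everything, as long as `apply` stays deterministic and the snapshot was produced by the same logic.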

CQRS

Write path and read path are split so each side can be shaped around its own workload and data model.

When to use

  • Read load and write load differ significantly.
  • You need read-optimized projections for fast queries.
  • The system has grown enough that reads and writes should scale independently.

Trade-offs

  • Operational complexity and component count increase.
  • The read model usually converges with the write side with some lag rather than instantly.
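
The convergence lag can be made visible with a projector that tracks a checkpoint into the event stream. This is a sketch under simplifying assumptions: the stream is an in-memory list, and `ReadModel`, `project`, and `lag` are hypothetical names.

```python
class ReadModel:
    """Denormalized view: order count per customer, plus the last applied position."""
    def __init__(self) -> None:
        self.orders_per_customer: dict[str, int] = {}
        self.checkpoint = 0  # number of stream events already projected


def project(stream: list[dict], view: ReadModel, batch_size: int = 100) -> int:
    """Apply up to batch_size unseen events to the view; return how many were applied."""
    batch = stream[view.checkpoint:view.checkpoint + batch_size]
    for event in batch:
        if event["type"] == "OrderPlaced":
            customer = event["customer_id"]
            view.orders_per_customer[customer] = view.orders_per_customer.get(customer, 0) + 1
    view.checkpoint += len(batch)
    return len(batch)


def lag(stream: list[dict], view: ReadModel) -> int:
    """Events written but not yet visible to readers."""
    return len(stream) - view.checkpoint
```

Until `lag` reaches zero, a query against the view can return a stale answer, which is the behavior the product has to tolerate or mask.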

Saga

A distributed business operation is broken into local steps and compensations instead of one shared transaction.

When to use

  • One business flow touches multiple services or data stores.
  • 2PC is unavailable or too expensive in latency and coupling.
  • You need controlled recovery from partially completed work through compensating actions.

Trade-offs

  • Every step and every compensation must survive retries without producing a duplicate effect.
  • Long-running and partially completed flows are harder to debug and observe.
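
A Saga's core mechanic, local steps with compensations run in reverse on failure, can be sketched in a few lines. `Step` and `run_saga` are hypothetical names, and the sketch is synchronous and in-process; a real Saga would persist its progress so that it can resume after a crash.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    action: Callable[[], None]
    compensation: Callable[[], None]  # undoes the action's business effect


def run_saga(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order; on the first failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    trace: list[str] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            trace.append(f"{step.name}:failed")
            for done in reversed(completed):
                done.compensation()
                trace.append(f"{done.name}:compensated")
            return False, trace
        completed.append(step)
        trace.append(f"{step.name}:ok")
    return True, trace
```

The failed step itself is not compensated here, on the assumption that a failed local transaction left no effect; if a step can fail after a partial effect, its compensation must be safe to run in that case too.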

Related book

Software Architecture: The Hard Parts

A strong deep dive into engineering trade-offs, distributed workflows, and real Saga practice.


Saga: coordination styles

Orchestration makes the workflow more explicit and centralizes decisions, while choreography lowers coupling further but makes tracing and debugging more demanding.

Choreography

Services subscribe to events and trigger the next step themselves without a central coordinator.

Pros

  • Lower coupling between services
  • No single control point that automatically becomes a bottleneck

Risks

  • Harder to trace the full business flow end to end
  • Easy to drift into event spaghetti without explicit boundaries
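
A small in-memory sketch shows what choreography looks like in code: each service reacts to one event and emits the next, and no component holds the whole flow. The `Bus` class and `wire_services` function are hypothetical; the event names are reused from the order example earlier in the chapter.

```python
from collections import defaultdict, deque
from typing import Callable


class Bus:
    """Toy event bus. The history list exists only so the demo can be inspected."""
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._queue: deque = deque()
        self.history: list[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self._queue.append((event_type, payload))

    def drain(self) -> None:
        """Deliver queued events until no handler publishes anything new."""
        while self._queue:
            event_type, payload = self._queue.popleft()
            self.history.append(event_type)
            for handler in self._handlers[event_type]:
                handler(payload)


def wire_services(bus: Bus) -> None:
    """Each subscription stands in for one service reacting to one upstream event."""
    bus.subscribe("OrderCreated", lambda p: bus.publish("PaymentReserved", p))
    bus.subscribe("PaymentReserved", lambda p: bus.publish("InventoryReserved", p))
    bus.subscribe("InventoryReserved", lambda p: bus.publish("ShipmentCreated", p))
```

In a real deployment there is no shared `history` list, which is why end-to-end tracing in choreography usually depends on correlation identifiers carried in every event.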

Orchestration

A dedicated coordinator explicitly decides which step runs next and when compensations should start.

Pros

  • The whole workflow is easier to observe
  • Business rules and transitions are easier to verify

Risks

  • A central hotspot of complexity appears
  • The orchestration layer must be scaled and protected separately

Choosing the right pattern

Need | Pick | Why
You need full history and state reconstruction | Event Sourcing | Change history becomes primary data instead of a side log.
Reads and writes live under different SLAs and traffic profiles | CQRS | You can optimize and scale write and read paths independently.
One business operation spans multiple services without a shared transaction | Saga | Local steps plus compensations give you controlled completion and recovery.
The system is still a simple CRUD application with low integration complexity | Do not force EDA | Operational complexity may easily exceed the business value.

Common mistakes

  • Adopting Event Sourcing, CQRS, and Saga together before naming the concrete problem each one solves.
  • Publishing technical noise as events instead of domain facts.
  • Skipping consumer idempotency under at-least-once delivery.
  • Leaving schema evolution without a backward-compatibility policy.
  • Ignoring DLQ growth, processing lag, duplicate rate, and Saga completion time.
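
Consumer idempotency under at-least-once delivery can be sketched as a deduplication check on a message identifier. The `event_id` field and the `IdempotentConsumer` class are assumptions for illustration. In a real handler, the processed-ID record and the business effect would need to be committed together, for example in one database transaction; otherwise a crash between the two reintroduces the duplicate.

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by its event_id."""
    def __init__(self) -> None:
        self.processed_ids: set[str] = set()
        self.balance = 0
        self.duplicates = 0  # feeds a duplicate-rate metric

    def handle(self, message: dict) -> bool:
        """Apply the message once; return False for a redelivered duplicate."""
        if message["event_id"] in self.processed_ids:
            self.duplicates += 1
            return False
        self.balance += message["amount"]
        self.processed_ids.add(message["event_id"])
        return True
```

Counting skipped duplicates gives a direct input for the duplicate-rate metric mentioned in the list above.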

Related chapter

Resilience Patterns

DLQ complements retries, timeouts, and failure isolation in asynchronous processing loops.


Dead Letter Queue (DLQ)

DLQ exists not to hide failures, but to isolate problematic messages safely, preserve them for investigation, and replay them in a controlled way once the root cause is fixed. It becomes essential when handlers touch unstable dependencies, messy data, or expensive side effects.

When to send to DLQ

When a message exhausts retries, violates the event contract, or keeps failing because of the data itself.

What to store

Message identity, retry count, last error, source, failure time, and a reference to the payload.

What to do next

Triage the root cause, fix it, and replay messages back into the main flow with rate control and idempotency checks.

Practical DLQ checklist

  • Use a dedicated DLQ for each critical flow so domains and priorities do not get mixed together.
  • Store the failure reason, retry count, original topic or queue, and a payload reference for investigation.
  • Separate transient failures from persistent ones: not every message should land in DLQ after the same number of retries.
  • Define how messages return from DLQ into the main flow after the root cause is fixed, either manually or automatically.
  • Track DLQ growth rate and agree on a target time for triaging poison messages.
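
The checklist can be sketched as a consumer loop that retries transient failures, sends persistent ones straight to the DLQ, and supports rate-limited replay. All names here (`PermanentError`, `DeadLetter`, `consume`, `replay`) are hypothetical, the DLQ is a plain list, and backoff between retries is omitted to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


class PermanentError(Exception):
    """A failure that retries cannot fix, such as a contract violation."""


@dataclass
class DeadLetter:
    message_id: str
    source: str       # original topic or queue
    retry_count: int
    last_error: str
    failed_at: datetime
    payload: dict     # a real system might store a reference instead


def consume(message: dict, handler: Callable[[dict], None], dlq: list,
            source: str, max_attempts: int = 3) -> bool:
    """Try the handler; dead-letter the message once it cannot succeed."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handler(message)
            return True
        except PermanentError as exc:
            error = exc  # persistent failure: no point in retrying
            break
        except Exception as exc:
            error = exc  # treated as transient: retry until attempts run out
            if attempts >= max_attempts:
                break
    dlq.append(DeadLetter(message["id"], source, attempts - 1, repr(error),
                          datetime.now(timezone.utc), message))
    return False


def replay(dlq: list, handler: Callable[[dict], None], limit: int) -> int:
    """Re-run at most `limit` dead letters; messages that fail again stay in the DLQ."""
    replayed = 0
    for letter in list(dlq[:limit]):
        try:
            handler(letter.payload)
        except Exception:
            continue
        dlq.remove(letter)
        replayed += 1
    return replayed
```

Replay is only safe when the handler is idempotent, because a message may have partially succeeded before it was dead-lettered.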

Checks before rollout

1. Lock down event contracts and a versioning policy.
2. Ensure idempotency on both producers and consumers.
3. Configure DLQ, retry rules, and the replay process.
4. Monitor handler lag, replay time, and Saga completion behavior.

A practical rollout path: start with one or two critical flows, define the contract, observability, and retry rules first, and only then expand event-driven architecture into the rest of the system.

Contract and operational discipline first, architecture expansion second.
