Event-Driven Architecture: Event Sourcing, CQRS, Saga

Event-driven architecture matters not because loose coupling sounds attractive, but because it lets teams separate the moment of decision, the publication of a fact, and downstream reactions across time.

The chapter breaks Event Sourcing, CQRS, and Saga into distinct engineering choices, so it becomes clear where asynchrony truly simplifies the system and where it brings schema evolution, replay complexity, convergence lag, and higher operational cost.

In engineering discussions, this helps you reason calmly about orchestration versus choreography, event boundaries, stuck workflows, and the real price of asynchrony instead of collapsing everything into 'let's add a broker and it will be more flexible.'

Practical value of this chapter

Event Contracts

Design events as stable business facts with compatible schema evolution instead of accidental transport packets.

Flow Coordination

Choose orchestration versus choreography deliberately and define where an explicit control loop is required.

Replay and DLQ

Plan replay, DLQ handling, and handler idempotency so failures stay recoverable instead of becoming data loss.

Decision Rationale

Explain when event-driven flow truly lowers coupling and when it only adds lag and operational cost.

Reference

Martin Fowler: Event Sourcing

A classic text on why teams store history as events and where that trade really pays off.


Event-Driven Architecture (EDA) pays off not because of the broker itself, but because the moment of decision and the moment of reaction no longer have to coincide in time. That decoupling buys scalability and loose coupling, but the price is specific: event contracts, replay strategy, and failure recovery have to be designed up front, or the same decoupling turns into a source of hidden failures.

Core principles of event-driven architecture

This chapter separates Event Sourcing, CQRS, and Saga into three distinct tools rather than treating them as one mandatory package. They solve different problems: durable history, independent read and write scaling, and coordination of long-running business flows.

Before picking a tool, settle the basic vocabulary; otherwise the discussion drifts into terminology instead of reaching a decision. Agree on event contracts, schema evolution, snapshots, auditability, orchestration versus choreography, and what the system should do under at-least-once delivery, consumer lag, and dead-letter-queue recovery. Make those decisions inside each bounded context rather than copying them mechanically across the whole platform.

Events record facts

An event should describe something that already happened in the domain and should not be rewritten later.

Async flow reduces coupling

Producers and consumers can evolve independently as long as the event contract stays stable.
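As a minimal sketch of what a stable contract with compatible schema evolution might look like, the snippet below wraps each event in a versioned envelope and upgrades old payloads on read. The envelope fields, the `OrderPlaced` event, and the assumed v1-to-v2 change (a new `currency` field) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope; field names are illustrative, not a standard."""
    event_id: str
    event_type: str
    schema_version: int
    occurred_at: str  # ISO 8601 timestamp
    payload: dict[str, Any]


def _order_placed_v1_to_v2(payload: dict[str, Any]) -> dict[str, Any]:
    # Assumed change: v2 added "currency"; old events get a default value.
    return {**payload, "currency": payload.get("currency", "USD")}


# One upcaster per (event type, source version); each lifts a payload by one version.
UPCASTERS: dict[tuple[str, int], Callable[[dict[str, Any]], dict[str, Any]]] = {
    ("OrderPlaced", 1): _order_placed_v1_to_v2,
}
LATEST_VERSION = {"OrderPlaced": 2}


def upcast(event: EventEnvelope) -> EventEnvelope:
    """Return a copy of the event at the latest known schema version.

    The stored event is never modified; consumers only see the upgraded copy.
    """
    version, payload = event.schema_version, event.payload
    latest = LATEST_VERSION.get(event.event_type, version)
    while version < latest:
        payload = UPCASTERS[(event.event_type, version)](payload)
        version += 1
    return EventEnvelope(
        event.event_id, event.event_type, version, event.occurred_at, payload
    )
```

With this shape, consumers are written against the latest version only, while history stays untouched, which is consistent with the rule that events are not rewritten later.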

Convergence is not immediate

There is a lag between the write side and the final user-facing view. If the product and the API do not account for it explicitly, a user sees their own order unchanged and concludes the system lost their data.
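One way an API might account for this lag explicitly is to return a version from the write and let the read side report whether it has caught up to that version. The sketch below assumes a single monotonically increasing version per stream; `ReadModel`, `QueryResult`, and `min_version` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReadModel:
    """Hypothetical projection that remembers the last event version it applied."""
    applied_version: int = 0
    orders: dict[str, str] = field(default_factory=dict)

    def apply(self, version: int, order_id: str, status: str) -> None:
        self.orders[order_id] = status
        self.applied_version = version


@dataclass(frozen=True)
class QueryResult:
    status: Optional[str]
    is_fresh: bool  # False means the caller's own write is not reflected yet


def get_order(read_model: ReadModel, order_id: str, min_version: int = 0) -> QueryResult:
    """Answer from the projection and say whether it has reached `min_version`.

    `min_version` is the version the write side returned to this client.
    """
    return QueryResult(
        status=read_model.orders.get(order_id),
        is_fresh=read_model.applied_version >= min_version,
    )
```

A client that receives `is_fresh=False` can show a "still processing" state or retry, rather than presenting stale data as final.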

Data flow visualizations

Below are three characteristic scenarios: a base event pipeline, CQRS read and write split, and step coordination inside Saga.


Base event pipeline

Command API

Receives the business command

Domain Aggregate

Validates invariants and makes a decision

Event Store

Appends events in order

Broker or log

Fans events out to consumers

Consumers

Build projections, integrations, and side effects
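The five stages above can be sketched in a few lines. This is an in-memory illustration only: the `Account` aggregate, the `FundsWithdrawn` event, and the synchronous fan-out standing in for a broker are assumptions, not a reference design.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    name: str
    data: dict


class EventStore:
    """Append-only log with a synchronous fan-out that stands in for a broker."""

    def __init__(self) -> None:
        self._log: list[Event] = []
        self._subscribers: list[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def append(self, event: Event) -> None:
        self._log.append(event)            # events are appended in order
        for handler in self._subscribers:  # then fanned out to consumers
            handler(event)

    def read_all(self) -> list[Event]:
        return list(self._log)


class Account:
    """Hypothetical aggregate: validates invariants and decides on an event."""

    def __init__(self, balance: int = 0) -> None:
        self.balance = balance

    def withdraw(self, amount: int) -> Event:
        if amount <= 0 or amount > self.balance:
            raise ValueError("invalid withdrawal")
        return Event("FundsWithdrawn", {"amount": amount})

    def apply(self, event: Event) -> None:
        if event.name == "FundsWithdrawn":
            self.balance -= event.data["amount"]


def handle_withdraw(account: Account, store: EventStore, amount: int) -> None:
    """Command handler: decide first, then record the fact."""
    event = account.withdraw(amount)  # raises before anything is stored
    account.apply(event)
    store.append(event)
```

A rejected command leaves no trace in the log, so consumers only ever react to facts that the aggregate accepted.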

CQRS: write path and read path


Write path

Client

Sends a POST or PUT command

Command model

Applies business rules and invariants

Event stream

Captures domain facts

Projector

Updates the read model

Read path

Client

Sends a GET query

Query API

Serves read-only responses

Read model

Stores a denormalized projection
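A minimal sketch of the two paths, assuming an in-memory event stream and a projector that tracks its own position. `OrderCommandModel`, `CustomerTotalsProjector`, and the `OrderPlaced` event shape are hypothetical.

```python
class OrderCommandModel:
    """Write side: applies business rules and records domain facts."""

    def __init__(self) -> None:
        self.events: list[dict] = []  # the event stream
        self._known_orders: set[str] = set()

    def place_order(self, order_id: str, customer: str, total: int) -> None:
        if order_id in self._known_orders:
            raise ValueError("order already exists")
        if total <= 0:
            raise ValueError("total must be positive")
        self._known_orders.add(order_id)
        self.events.append(
            {"type": "OrderPlaced", "order_id": order_id,
             "customer": customer, "total": total}
        )


class CustomerTotalsProjector:
    """Read side: folds events into a denormalized per-customer summary."""

    def __init__(self) -> None:
        self.position = 0  # how many events have been applied so far
        self.read_model: dict[str, dict[str, int]] = {}

    def catch_up(self, events: list[dict]) -> None:
        # Only events after the stored position are applied, so calling
        # catch_up repeatedly does not double-count.
        for event in events[self.position:]:
            if event["type"] == "OrderPlaced":
                row = self.read_model.setdefault(
                    event["customer"], {"orders": 0, "spent": 0}
                )
                row["orders"] += 1
                row["spent"] += event["total"]
            self.position += 1


def query_customer(projector: CustomerTotalsProjector, customer: str) -> dict[str, int]:
    """Query API: read-only, never touches the command model."""
    return dict(projector.read_model.get(customer, {"orders": 0, "spent": 0}))
```

Until `catch_up` runs, queries return the old view; that gap is the convergence lag the read path has to tolerate.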

Saga: coordination styles

The flow below shows the orchestrated variant: a dedicated coordinator decides which step runs next and when compensations must begin, and all key commands and decisions pass through it.

Order service

OrderCreated

Orchestrator

ReserveFunds

Payment service

PaymentReserved

Orchestrator

ReserveStock

Inventory service

InventoryReserved

Orchestrator

CreateShipment

Shipping service

ShipmentCreated

Orchestrator

OrderCompleted

Order service

Key idea: one coordinator makes decisions and drives service calls.
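The orchestrated flow above might be sketched as a list of steps, each paired with a compensation, and a coordinator that undoes completed steps in reverse order on failure. `SagaStep`, `run_saga`, and the `failed_step` context key are hypothetical names, and the sketch leaves out persistence, timeouts, and retries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SagaStep:
    name: str
    action: Callable[[dict], None]      # local step in one service
    compensate: Callable[[dict], None]  # undoes the step if a later one fails


def run_saga(steps: list[SagaStep], context: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns True if every step succeeded, False if the saga was rolled back.
    The failing step itself is not compensated because it never completed.
    """
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action(context)
        except Exception:
            context["failed_step"] = step.name
            for done in reversed(completed):
                done.compensate(context)
            return False
        completed.append(step)
    return True
```

For the order flow, the steps would be `ReserveFunds`, `ReserveStock`, and `CreateShipment`; if `ReserveStock` raises, only the funds reservation is compensated. A production orchestrator would also persist saga state so that it can resume after a crash.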

Related chapter

Microservice Patterns

A dedicated chapter on integration patterns, Saga, and the limits of distributed transactions.


Event Sourcing, CQRS, and Saga

Event Sourcing

System state is rebuilt from a sequence of events rather than stored only as the latest snapshot.

When to use

You need a full audit trail and reproducible change history.
You need to rebuild projections or derive new read models from historical facts.
Your domain is naturally expressed as events that happened over time.

Trade-offs

Schema evolution and old-history migrations become harder.
You need snapshots and replay strategy to keep recovery and startup fast.
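To make the replay and snapshot trade-off concrete, here is a small sketch that rebuilds a balance by folding events and optionally starts from a snapshot instead of the beginning of history. The `Deposited` and `Withdrawn` event names and the tuple encoding are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    version: int  # number of events already folded into the state
    balance: int


def apply(balance: int, event: tuple[str, int]) -> int:
    """Pure state transition: current state plus one event gives the next state."""
    kind, amount = event
    if kind == "Deposited":
        return balance + amount
    if kind == "Withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def rebuild(events: list[tuple[str, int]], snapshot: Optional[Snapshot] = None) -> Snapshot:
    """Fold events into state, skipping those already covered by the snapshot."""
    start = snapshot.version if snapshot else 0
    balance = snapshot.balance if snapshot else 0
    for event in events[start:]:
        balance = apply(balance, event)
    return Snapshot(version=len(events), balance=balance)
```

Replaying from a snapshot must give the same result as replaying from the start; checking that equality is a cheap test to run whenever the `apply` logic changes.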

CQRS

Write path and read path are split so each side can be shaped around its own workload and data model.

When to use

Read load and write load differ significantly.
You need read-optimized projections for fast queries.
The system has grown enough that reads and writes should scale independently.

Trade-offs

Operational complexity and component count increase.
The read model usually converges with the write side with some lag rather than instantly.

Saga

A distributed business operation is broken into local steps and compensations instead of one shared transaction.

When to use

One business flow touches multiple services or data stores.
2PC is unavailable or too expensive in latency and coupling.
You need controlled recovery from partially completed work through compensating actions.

Trade-offs

Every step and every compensation must survive retries without a double effect.
Long-running and partially completed flows are harder to debug and observe.

Related book

Software Architecture: The Hard Parts

A strong deep dive into engineering trade-offs, distributed workflows, and real Saga practice.


Saga: coordination styles

Orchestration makes the workflow more explicit and centralizes decisions, while choreography lowers coupling further but makes tracing and debugging more demanding.

Choreography

Services subscribe to events and trigger the next step themselves without a central coordinator.

Pros

Lower coupling between services
No single control point that automatically becomes a bottleneck

Risks

Harder to trace the full business flow end to end
Easy to drift into event spaghetti without explicit boundaries
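For contrast with a central coordinator, the sketch below wires the same order flow as choreography: each service reacts to one event and publishes the next. The in-memory `EventBus` is a stand-in for a real broker, and each lambda represents a whole service; both are simplifications for illustration.

```python
from collections import defaultdict, deque
from typing import Callable


class EventBus:
    """Toy in-process bus; `trace` records the order in which events were delivered."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._queue: deque[tuple[str, dict]] = deque()
        self.trace: list[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> None:
        self._queue.append((event_type, data))

    def run(self) -> None:
        while self._queue:
            event_type, data = self._queue.popleft()
            self.trace.append(event_type)
            for handler in self._handlers[event_type]:
                handler(data)


def wire_services(bus: EventBus) -> None:
    # Each subscription stands for one service reacting to the previous fact.
    bus.subscribe("OrderCreated", lambda d: bus.publish("PaymentReserved", d))
    bus.subscribe("PaymentReserved", lambda d: bus.publish("InventoryReserved", d))
    bus.subscribe("InventoryReserved", lambda d: bus.publish("ShipmentCreated", d))
    bus.subscribe("ShipmentCreated", lambda d: bus.publish("OrderCompleted", d))
```

No single piece of code describes the whole flow; it only becomes visible in `trace`. That is the tracing cost listed under Risks, and it is why choreographed systems usually need correlation IDs and end-to-end tracing.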

Orchestration

A dedicated coordinator explicitly decides which step runs next and when compensations should start.

Pros

The whole workflow is easier to observe
Business rules and transitions are easier to verify

Risks

A central hotspot of complexity appears
The orchestration layer must be scaled and protected separately

Choosing the right pattern

Need	Pick	Why
You need full history and state reconstruction	Event Sourcing	Change history becomes primary data instead of a side log.
Reads and writes live under different SLAs and traffic profiles	CQRS	You can optimize and scale write and read paths independently.
One business operation spans multiple services without a shared transaction	Saga	Local steps plus compensations give you controlled completion and recovery.
The system is still a simple CRUD application with low integration complexity	Do not force EDA	Operational complexity may easily exceed the business value.

Common mistakes

Adopting Event Sourcing, CQRS, and Saga together before naming the concrete problem each one solves.
Publishing technical noise as events instead of domain facts.
Skipping consumer idempotency under at-least-once delivery.
Leaving schema evolution without a backward-compatibility policy.
Ignoring DLQ growth, processing lag, duplicate rate, and Saga completion time.
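Of the mistakes above, missing consumer idempotency is the easiest to illustrate. The sketch below deduplicates by event ID and counts duplicates so the duplicate rate can be monitored; it assumes every event carries a stable unique ID and keeps the processed set in memory, which a real consumer would replace with durable storage.

```python
from typing import Any, Callable


class IdempotentConsumer:
    """Wraps a handler so that redelivered events have no second effect."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._processed_ids: set[str] = set()  # assumption: durable in production
        self.duplicates = 0                    # feeds a duplicate-rate metric

    def handle(self, event_id: str, payload: Any) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._processed_ids:
            self.duplicates += 1
            return False
        self._handler(payload)
        # Recorded only after success, so a failed attempt can be retried.
        self._processed_ids.add(event_id)
        return True
```

In a real system, the side effect and the processed-ID record should be committed atomically (for example, in one database transaction); otherwise a crash between the two can still cause a double effect.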

Related chapter

Resilience Patterns

DLQ complements retries, timeouts, and failure isolation in asynchronous processing loops.


Dead Letter Queue (DLQ)

DLQ exists not to hide failures, but to isolate problematic messages safely, preserve them for investigation, and replay them in a controlled way once the root cause is fixed. It becomes essential when handlers touch unstable dependencies, messy data, or expensive side effects.

When to send to DLQ

A message has exhausted its retries, violates the event contract, or keeps failing because of the data itself. Retrying it further is pointless and only holds up the partition.

What to store

Message identity, retry count, last error, source, failure time, and a reference to the payload.

What to do next

Triage the root cause, fix it, and replay messages back into the main flow with rate control and idempotency checks.

Practical DLQ checklist

Use a dedicated DLQ for each critical flow so domains and priorities do not get mixed together.
Store the failure reason, retry count, original topic or queue, and a payload reference for investigation.
Separate transient failures from persistent ones: not every message should land in DLQ after the same number of retries.
Define how messages return from DLQ into the main flow after the root cause is fixed, either manually or automatically.
Track DLQ growth rate and agree on a target time for triaging poison messages.
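Several checklist items can be sketched together: separating transient from persistent failures, storing diagnostic metadata, and replaying with a rate limit. The exception classes, the `DeadLetter` fields, and the message keys (`id`, `source`, `payload_ref`) are hypothetical, and the DLQ is a plain list here rather than a real queue. Backoff between retries is omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


class TransientError(Exception):
    """Failure that may succeed on retry, e.g. a timeout."""


class PermanentError(Exception):
    """Failure caused by the message itself, e.g. a contract violation."""


@dataclass
class DeadLetter:
    message_id: str
    source: str        # original topic or queue
    payload_ref: str   # reference to the payload, not the payload itself
    retry_count: int
    last_error: str
    failed_at: datetime


def process(message: dict, handler: Callable[[dict], None],
            dlq: list[DeadLetter], max_retries: int = 3) -> bool:
    """Try the handler; retry transient errors, dead-letter everything else."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handler(message)
            return True
        except PermanentError as exc:
            error = exc  # retrying cannot help, go straight to the DLQ
            break
        except TransientError as exc:
            error = exc
            if attempts > max_retries:
                break
    dlq.append(DeadLetter(
        message_id=message["id"], source=message["source"],
        payload_ref=message["payload_ref"], retry_count=attempts - 1,
        last_error=str(error), failed_at=datetime.now(timezone.utc),
    ))
    return False


def replay(dlq: list[DeadLetter], handler: Callable[[dict], None], limit: int) -> int:
    """Re-process at most `limit` dead letters; keep the ones that still fail."""
    batch, rest = dlq[:limit], dlq[limit:]
    still_failing = []
    for dead in batch:
        try:
            handler({"id": dead.message_id, "source": dead.source,
                     "payload_ref": dead.payload_ref})
        except Exception:
            still_failing.append(dead)
    dlq[:] = still_failing + rest
    return len(batch) - len(still_failing)
```

The `limit` argument is a crude form of rate control; the handler passed to `replay` still has to be idempotent, because a dead-lettered message may have been partially processed before it failed.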

Checks before rollout

1. Lock down event contracts and a versioning policy.

2. Ensure idempotency on both producers and consumers.

3. Configure DLQ, retry rules, and the replay process.

4. Monitor handler lag, replay time, and Saga completion behavior.

A practical rollout path: start with one or two critical flows, define the contract, observability, and retry rules first, and only then expand event-driven architecture into the rest of the system.

Contract and operational discipline first, architecture expansion second.

Related chapters

Consistency and idempotency - helps design duplicate-event handling, outbox and inbox flow, and correctness under at-least-once delivery.
Fault tolerance patterns: Circuit Breaker, Bulkhead, Retry - extends event-driven systems with retry rules, timeouts, and controlled degradation for async handlers.
Distributed Message Queue - shows practical queue architecture, partitioning, ordering guarantees, and throughput limits in event flows.
Inter-service communication patterns - compares synchronous and asynchronous integration and clarifies where event-driven flow beats direct RPC.
Design principles for scalable systems - frames latency-versus-throughput trade-offs and backpressure techniques that matter in event-driven systems.
Kafka: The Definitive Guide, 2nd Edition (short summary) - provides the foundation for log-based event streaming, partitioning, delivery semantics, and Kafka operations.
Microservice Patterns (short summary) - goes deeper into Saga, distributed transactions, and integration patterns for microservice architectures.