DDIA matters because it turns distributed systems from a pile of legends and terminology into one coherent engineering language about data, failure, and growth.
In real work, the book helps teams reason about data models, replication, partitioning, transactions, and stream processing as connected design choices rather than unrelated chapters from different toolchains.
In interviews, reviews, and architecture discussions, it is especially useful because it lets you talk through real growth effects such as rebalancing, backpressure, schema evolution, and the price of consistency instead of relying on templates.
Practical value of this chapter
Design in practice
Systematizes practical storage and processing patterns for production-grade systems.
Decision quality
Improves replication/partitioning/indexing choice by workload profile.
Interview articulation
Helps express trade-offs in reliability, latency, consistency, and operability terms.
Risk and trade-offs
Focuses on growth and failure effects: backpressure, rebalancing, and schema evolution.
Designing Data-Intensive Applications
Author: Martin Kleppmann
Publisher: O'Reilly Media, 2017 (1st Edition), 2025 (2nd Edition)
Length: 616 pages
Analysis of the book by Martin Kleppmann: data models, replication, partitioning, transactions, batch and stream processing.
Primary source
Official page of Designing Data-Intensive Applications by Martin Kleppmann.
Book structure
The book is divided into three parts, each widening the scope of discussion, from a single machine to globally distributed systems:
Part I: Basics
Data models, storage, encoding. How data is represented and written to disk.
Part II: Distributed Data
Replication, partitioning, transactions, consensus. Scaling across multiple machines.
Part III: Derived Data
Batch and stream processing. Construction of data processing pipelines.
Part I: Data Systems Fundamentals
Chapters 1-2: Reliability, Scalability, and Data Models
Three pillars of the system:
- Reliability — the system keeps working correctly even when faults occur
- Scalability — the ability to handle growing load
- Maintainability — ease of operation and change
Data models:
- Relational — tables, SQL, ACID
- Document — JSON, nesting, flexibility
- Graph — nodes and edges, relationships of arbitrary complexity
Chapter 3: Data Storage and Retrieval
One of the key chapters of the book: how data is physically stored on disk.
LSM-Tree (Log-Structured Merge)
- Optimized for writes
- Used in Cassandra, RocksDB, LevelDB
- Memtable → SSTable → Compaction
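The memtable → SSTable → compaction write path can be sketched in a few lines of Python. This is a hypothetical toy, not any real engine's code: writes land in an in-memory memtable, a full memtable is flushed as an immutable sorted segment (SSTable), and compaction merges segments, keeping only the newest value per key.

```python
class TinyLSM:
    """Toy LSM-tree write path: memtable -> SSTable -> compaction."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}             # in-memory write buffer
        self.sstables = []             # newest-first list of flushed segments
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # SSTable = immutable list of (key, value) pairs sorted by key
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # check the memtable first, then segments from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for segment in self.sstables:
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        # merge all segments, keeping only the newest value per key
        merged = {}
        for segment in reversed(self.sstables):   # oldest first
            merged.update(dict(segment))
        self.sstables = [sorted(merged.items())]
```

Real engines add a write-ahead log for crash recovery and Bloom filters to skip segments that cannot contain a key; the sketch omits both.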
B-Tree
- Optimized for reads
- Used in PostgreSQL, MySQL, Oracle
- Fixed-size pages, update-in-place
Chapter 4: Encoding and Schema Evolution
How to serialize data and ensure backward/forward compatibility:
JSON/XML
Human-readable, but verbose
Thrift/Protocol Buffers
Binary, with an explicit schema
Avro
Schema evolution, Hadoop-friendly
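Avro's schema-resolution idea — the reader's schema supplies defaults for fields missing from old records, and unknown writer fields are ignored — can be sketched with a plain-dict stand-in. The `decode` helper and `READER_SCHEMA` below are hypothetical illustrations, not the Avro API:

```python
# Reader schema: field name -> default value (None marks a required field).
# "email" was added in a later schema version, so old records lack it.
READER_SCHEMA = {
    "user_id": None,     # required: no default
    "name": "",
    "email": "",         # new field with a default: backward compatible
}

def decode(record, schema=READER_SCHEMA):
    """Resolve a written record against the reader's schema."""
    decoded = {}
    for field, default in schema.items():
        if field in record:
            decoded[field] = record[field]
        elif default is not None:
            decoded[field] = default     # old record: fill in the default
        else:
            raise ValueError(f"missing required field: {field}")
    return decoded                       # unknown writer fields are dropped
```

Defaults give backward compatibility (new reader, old data); ignoring unknown fields gives forward compatibility (old reader, new data).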
Part II: Distributed Data
Chapter 5: Replication
Single-Leader
- A single leader accepts all writes
- Simple model
- Problem: single point of failure
Multi-Leader
- Multiple leaders accept writes
- Suited to multi-datacenter deployments
- Problem: Write conflicts
Leaderless
- All nodes are equal (Dynamo-style)
- Quorum reads/writes
- W + R > N for consistency
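The W + R > N rule can be checked exhaustively for small clusters: if every possible write quorum intersects every possible read quorum, a read is guaranteed to contact at least one node that holds the latest write. A brute-force sketch (the function name is illustrative):

```python
from itertools import combinations

def quorums_always_overlap(n, w, r):
    """True if every W-node write quorum intersects every R-node read quorum."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))
```

For the common N=3 configuration, W=2 and R=2 satisfies the rule, while W=1 and R=2 does not: a read can hit the two replicas that missed the write.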
Chapter 6: Sharding
Partitioning strategies:
- By key — hash(key) mod N
- By range — time-series or geographic data
- Consistent Hashing - minimizing rebalancing
Problems:
- Hot spots - uneven load
- Scatter-gather — requests to all shards
- Rebalancing — data redistribution
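The rebalancing difference between hash(key) mod N and consistent hashing can be shown in a sketch. The names (`mod_n_owner`, `ring_owner`) are illustrative, and the ring uses one point per node with no virtual nodes. With mod N, adding a node remaps most keys to arbitrary new owners; on a ring, any key that moves can only move to the newly added node.

```python
import hashlib

def h(value):
    # stable hash (Python's built-in hash() is randomized per process)
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

def mod_n_owner(key, n):
    """hash mod N: changing N remaps almost every key."""
    return h(key) % n

def ring_owner(key, nodes):
    """Consistent hashing: each node owns the arc ending at its position."""
    ring = sorted((h(f"node-{node}"), node) for node in nodes)
    key_pos = h(key)
    for pos, node in ring:
        if key_pos <= pos:
            return node
    return ring[0][1]   # wrap around to the first node on the ring
```

Growing from 4 to 5 nodes, mod N typically remaps the majority of keys, while on the ring roughly 1/N of keys move in expectation, and every one of them moves to the new node. Real systems add virtual nodes to even out arc sizes.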
Chapter 7: Transactions
An in-depth discussion of ACID and isolation levels is one of the strongest parts of the book:
| Isolation level | Protects against | Does not protect against |
|---|---|---|
| Read Committed | Dirty reads, dirty writes | Non-repeatable reads |
| Snapshot Isolation | Non-repeatable reads | Write skew |
| Serializable | All anomalies | — |
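Write skew, the anomaly snapshot isolation fails to prevent, is easy to demonstrate with the book's on-call doctors example: two transactions each check the invariant against the same snapshot, both checks pass, and both commit. A simulation sketch, not real database code:

```python
def request_leave(snapshot, doctor):
    """One transaction: check the invariant against its snapshot, then write."""
    on_call = [d for d, s in snapshot.items() if s == "on-call"]
    if len(on_call) >= 2:          # "someone else is still on call"
        return {doctor: "off"}     # this transaction's write set
    return {}                      # invariant would break: do nothing

db = {"alice": "on-call", "bob": "on-call"}
snapshot_a = dict(db)              # both transactions read the same snapshot
snapshot_b = dict(db)
writes_a = request_leave(snapshot_a, "alice")
writes_b = request_leave(snapshot_b, "bob")
db.update(writes_a)                # both commit: write sets are disjoint,
db.update(writes_b)                # so snapshot isolation sees no conflict
```

The result violates the invariant "at least one doctor stays on call". Serializable isolation (or an explicit `SELECT ... FOR UPDATE` lock on the rows read) would abort one of the two transactions.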
Chapters 8-9: The Trouble with Distributed Systems; Consistency and Consensus
What can go wrong:
- Network partitions
- Asymmetric failures
- Clock problems (clock skew)
- Byzantine faults
Consensus algorithms and fundamental results:
- Paxos — the classic algorithm, notoriously hard to understand
- Raft — designed for understandability; used in etcd
- Zab — used by ZooKeeper
- FLP impossibility — no deterministic consensus in a fully asynchronous system with even one faulty node
Part III: Derived Data
Chapter 10: Batch Processing
Unix philosophy:
Kleppmann draws a parallel between Unix pipes and modern batch processing:
```shell
cat log.txt | grep ERROR | sort | uniq -c
```

MapReduce and its evolution:
- MapReduce - simple model, lots of I/O
- Spark — in-memory, DAG execution
- Flink — unified batch/stream
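The MapReduce dataflow mirrors that Unix pipeline: map emits key-value pairs, shuffle groups them by key, reduce aggregates each group. A minimal word-count sketch (function names are illustrative):

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit (word, 1) for every word, like grep feeding the pipeline
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # shuffle: group values by key, like sort bringing equal lines together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each group, like uniq -c counting runs
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["error warn error", "warn error"])))
```

In a real cluster each phase runs in parallel across machines and the shuffle moves data over the network; Spark and Flink keep the same model but chain stages without writing every intermediate result to disk.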
Chapter 11: Stream Processing
Real-time data processing is a key topic for modern systems:
Message Brokers
Kafka, RabbitMQ, Pulsar
Event Sourcing
Immutable event log as a source of truth
Change Data Capture
Debezium, Maxwell
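Event sourcing can be sketched in a few lines: the append-only event log is the source of truth, and current state is a pure function of the log, obtained by replaying it from the beginning. The bank-account events below are illustrative:

```python
def apply(state, event):
    """Pure state transition: returns a new state, never mutates in place."""
    balances = dict(state)
    account = event["account"]
    if event["type"] == "deposit":
        balances[account] = balances.get(account, 0) + event["amount"]
    elif event["type"] == "withdraw":
        balances[account] = balances.get(account, 0) - event["amount"]
    return balances

def replay(log):
    """Derive current state by folding the event log from the start."""
    state = {}
    for event in log:
        state = apply(state, event)
    return state

log = [
    {"type": "deposit",  "account": "a", "amount": 100},
    {"type": "withdraw", "account": "a", "amount": 30},
]
```

Because events are immutable, the same log can be replayed into different derived views (balances, audit trails, caches), which is exactly the bridge to change data capture.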
Chapter 12: The Future of Data Systems
Kleppmann concludes the book with philosophical reflections on how to build correct, sustainable and ethical data systems. He discusses:
- Composition of services and data flow
- End-to-end correctness guarantees
- Ethical aspects of data processing
Key Concepts for System Design Interview
DDIA does not contain ready-made solutions to problems, but it provides a deep understanding that allows you to confidently answer the questions “why?”:
Selecting a Database
Understanding trade-offs between SQL and NoSQL, LSM vs B-Tree
Replication Strategies
When to use synchronous vs asynchronous replication
Partitioning
Selecting partition key, avoiding hot spots
Isolation levels
Explaining anomalies and preventing them
Exactly-once semantics
Idempotency and deduplication in stream processing
Consensus
Understanding Raft/Paxos for distributed locking
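The exactly-once point deserves a sketch: in practice it is built from at-least-once delivery plus idempotent processing, where the consumer deduplicates by message ID so redelivery after a failure does not double-apply an effect. A toy consumer (names are illustrative; a real system would persist the seen-ID set atomically with the effect):

```python
class IdempotentConsumer:
    """Deduplicates at-least-once deliveries by message ID."""

    def __init__(self):
        self.seen = set()    # in production: a durable store, not a set
        self.total = 0       # the side effect being protected

    def handle(self, message_id, amount):
        if message_id in self.seen:
            return False     # duplicate delivery: already applied, skip
        self.seen.add(message_id)
        self.total += amount
        return True
```

The same idea underlies idempotent producers and transactional sinks in streaming systems: the effect plus the dedup record must commit together, or a crash between them reintroduces the duplicate.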
📚 Verdict
✅ Strengths
- Deep understanding of the “why”, not just the “how”
- Great visualizations and examples
- Covering the entire stack: from bytes to business logic
- Lots of references to real systems
- Honest discussion of trade-offs
⚠️ Caveats
- Large book (~600 pages)
- No ready-made interview solutions
- Takes time to digest
- Some sections may be too academic
🎯 Recommendation:
DDIA is a must-read for any engineer working with distributed systems. For interview preparation, pair it with practice-oriented books (Alex Xu, Stanley Chiang): DDIA supplies the understanding of "why", and the practical books supply the "how".
Related chapters
- Why distributed systems and consistency matter - Section entry map and context where DDIA concepts become concrete architecture decisions.
- CAP theorem - Foundational availability-consistency trade-off under partition explored in depth in DDIA.
- PACELC theorem - Extension of CAP for normal operation: latency vs consistency trade-offs in production systems.
- Consensus: Paxos and Raft - Practical continuation of DDIA's consensus chapters on leader election and replicated logs.
- Jepsen and consistency models - Validation of consistency guarantees and real anomaly patterns in distributed databases under faults.
- Testing distributed systems - How to verify DDIA-style system correctness under network faults and partial failures.
- Replication and sharding: growth strategies - Operational application of DDIA concepts around replication, partitioning, and rebalancing.
- Consistency and idempotency: practical patterns - Bridge from DDIA consistency theory to implementation-level production patterns.
- Why understanding storage systems matters - Storage and database landscape that complements DDIA's model-driven reasoning.
- Database Internals: A Deep Dive (short summary) - Low-level perspective on engines and data structures as a practical continuation of DDIA.
- Distributed Systems: Principles and Paradigms (short summary) - Theoretical distributed-systems foundation that complements DDIA's engineering perspective.
