Designing Data-Intensive Applications
Authors: Martin Kleppmann
Publisher: O'Reilly Media, 2017 (1st Edition), 2025 (2nd Edition)
Length: 616 pages
Analysis of the book by Martin Kleppmann: data models, replication, partitioning, transactions, batch and stream processing.
Primary source
Official page of Designing Data-Intensive Applications by Martin Kleppmann.
Book structure
The book is divided into three parts, each widening the scope of discussion: from a single machine to globally distributed systems:
Part I: Foundations of Data Systems
Data models, storage, encoding. How data is represented and written to disk.
Part II: Distributed Data
Replication, partitioning, transactions, consensus. Scaling across multiple machines.
Part III: Derived Data
Batch and stream processing. Construction of data processing pipelines.
Part I: Foundations of Data Systems
Chapters 1-2: Reliability, Scalability, and Data Models
Three pillars of a well-designed system:
- Reliability — the system continues to work correctly even when faults occur
- Scalability — the ability to cope with growing load
- Maintainability — ease of operation and of making changes
Data models:
- Relational — tables, SQL, ACID
- Document — JSON, nesting, schema flexibility
- Graph — nodes and edges, relationships of arbitrary complexity
Chapter 3: Data Storage and Retrieval
One of the key chapters of the book, covering how data is physically stored on disk.
LSM-Tree (Log-Structured Merge)
- Optimized for writes
- Used in Cassandra, RocksDB, LevelDB
- Memtable → SSTable → Compaction
B-Tree
- Optimized for reading
- Used in PostgreSQL, MySQL, Oracle
- Fixed size pages, update-in-place
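The LSM-tree write path above (memtable → SSTable) can be illustrated with a toy Python sketch. This is a deliberately simplified model for intuition, not how RocksDB or Cassandra are actually implemented: real engines add a write-ahead log, bloom filters, binary search within SSTables, and background compaction.

```python
class ToyLSM:
    """Toy LSM-tree: in-memory memtable + sorted immutable SSTables.
    No WAL, no compaction, no bloom filters — illustration only."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # flushed, sorted, immutable runs (newest last)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: dump the memtable as a sorted, immutable SSTable.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then SSTables, newest to oldest
            for k, v in table:            # real engines binary-search here
                if k == key:
                    return v
        return None

db = ToyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # second put triggers a flush to an SSTable
db.put("a", 3)                   # newer value shadows the flushed one
print(db.get("a"))               # → 3
print(db.get("b"))               # → 2 (read from the SSTable)
```

Note how writes are always sequential appends (fast), while a read may have to consult several SSTables — exactly the write-optimized trade-off the chapter describes.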
Chapter 4: Encoding and Schema Evolution
How to serialize data and ensure backward/forward compatibility:
JSON/XML
Human readable, large size
Thrift/Protocol Buffers
Binary, schema-based
Avro
Schema evolution, Hadoop-friendly
Part II: Distributed Data
Chapter 5: Replication
Single-Leader
- A single leader accepts all writes
- Simple model
- Problem: single point of failure
Multi-Leader
- Multiple leaders accept writes
- Useful for multi-datacenter deployments
- Problem: Write conflicts
Leaderless
- All nodes are equal (Dynamo-style)
- Quorum reads/writes
- W + R > N for consistency
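The W + R > N condition guarantees that every read quorum intersects every write quorum, so at least one of the replicas a reader contacts has seen the latest write. A small Python simulation (a sketch with made-up node and version structures, not any real client library):

```python
import random

def write(replicas, w, key, value, version):
    """Write to w randomly chosen replicas; the other n-w lag behind."""
    for node in random.sample(range(len(replicas)), w):
        replicas[node][key] = (value, version)

def read(replicas, r, key):
    """Read from r replicas, keep the value with the highest version."""
    answers = [replicas[node].get(key)
               for node in random.sample(range(len(replicas)), r)]
    return max((a for a in answers if a), key=lambda a: a[1], default=None)

n, w, r = 5, 3, 3                 # w + r = 6 > n = 5, so quorums overlap
replicas = [{} for _ in range(n)]
write(replicas, w, "k", "v1", version=1)
print(read(replicas, r, "k"))     # always ('v1', 1): overlap is guaranteed
```

With w = r = 1 on the same 5 nodes, the read could easily hit only stale replicas — which is exactly the trade-off Dynamo-style systems let you tune per request.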
Chapter 6: Sharding
Partitioning strategies:
- By key hash — hash(key) mod N
- By key range — time-series or geographic data
- Consistent hashing — minimizes data movement during rebalancing
Problems:
- Hot spots — uneven load across partitions
- Scatter/gather — queries that must fan out to all shards
- Rebalancing — data redistribution
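The rebalancing problem with naive hash(key) mod N partitioning is easy to demonstrate: adding a single node changes the modulus, so most keys map to a different partition and must be moved. A quick Python check (md5 is used here only as a convenient uniform hash):

```python
import hashlib

def bucket(key, n):
    """Naive partitioning: hash(key) mod N."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user:{i}" for i in range(10_000)]
before = {k: bucket(k, 4) for k in keys}
after = {k: bucket(k, 5) for k in keys}   # grow the cluster: 4 → 5 nodes
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 80%, not the ideal ~20%
```

Consistent hashing (or fixed-count virtual partitions, as the book recommends) brings the moved fraction down to about 1/N, touching only the data that actually belongs on the new node.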
Chapter 7: Transactions
An in-depth discussion of ACID and isolation levels is one of the strongest parts of the book:
| Isolation level | Protects against | Does not protect against |
|---|---|---|
| Read Committed | Dirty reads, dirty writes | Non-repeatable reads, phantoms, write skew |
| Snapshot Isolation | Non-repeatable reads, read skew | Write skew, phantoms |
| Serializable | All anomalies | — |
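Write skew is the subtlest of these anomalies, and the book illustrates it with an on-call doctors example. The sketch below simulates it in plain Python: two transactions each read a consistent snapshot, each passes its check, and each updates a *different* row, so no write-write conflict is ever detected. The dict-based "database" and snapshots are a toy model, not real MVCC.

```python
# Invariant we want to preserve: at least one doctor stays on call.
on_call = {"alice": True, "bob": True}

# Under snapshot isolation, both transactions read from snapshots
# taken before either one commits.
snapshot_t1 = dict(on_call)
snapshot_t2 = dict(on_call)

if sum(snapshot_t1.values()) >= 2:   # T1: "someone else is still on call"
    on_call["alice"] = False         # Alice goes off call
if sum(snapshot_t2.values()) >= 2:   # T2: same check, against its own snapshot
    on_call["bob"] = False           # Bob goes off call

print(on_call)  # {'alice': False, 'bob': False} — the invariant is broken
```

Serializable isolation (or an explicit lock such as `SELECT ... FOR UPDATE` on the rows read) would force one transaction to wait or abort, preserving the invariant.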
Chapters 8-9: The Trouble with Distributed Systems, and Consensus
What can go wrong:
- Network partitions
- Asymmetric failures
- Unreliable clocks (clock skew)
- Byzantine faults
Consensus algorithms:
- Paxos — the classic, notoriously hard to understand
- Raft — designed for understandability, used in etcd
- Zab — used by ZooKeeper
A related theoretical result is the FLP impossibility theorem: in a fully asynchronous system, no deterministic consensus algorithm is guaranteed to terminate if even one node may crash.
Part III: Derived Data
Chapter 10: Batch Processing
Unix philosophy:
Kleppmann draws a parallel between Unix pipes and modern batch processing:
    cat log.txt | grep ERROR | sort | uniq -c

MapReduce and its evolution:
- MapReduce - simple model, lots of I/O
- Spark — in-memory, DAG execution
- Flink — unified batch/stream
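The MapReduce programming model itself is small enough to sketch in a few lines of Python: a map phase emits key-value pairs, the framework shuffles (groups) them by key, and a reduce phase aggregates each group. This toy in-memory version, with the classic word-count mapper, omits everything that makes real MapReduce interesting (distribution, disk spills, fault tolerance):

```python
from collections import defaultdict

def word_count_mapper(line):
    """map: emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

def mapreduce(lines, mapper, reducer):
    # Shuffle: group intermediate values by key (the framework's job).
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce: one reducer call per key, over all of that key's values.
    return {key: reducer(values) for key, values in groups.items()}

log = ["ERROR disk full", "INFO ok", "ERROR disk full"]
print(mapreduce(log, word_count_mapper, sum))
# {'ERROR': 2, 'disk': 2, 'full': 2, 'INFO': 1, 'ok': 1}
```

Note the structural resemblance to the Unix pipeline above: map ≈ per-line transformation, shuffle ≈ `sort`, reduce ≈ `uniq -c`.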
Chapter 11: Stream Processing
Real-time data processing is a key topic for modern systems:
Message Brokers
Kafka, RabbitMQ, Pulsar
Event Sourcing
Immutable event log as a source of truth
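The core idea of event sourcing — current state is just a fold over an immutable, append-only event log — fits in a short sketch. The bank-account event types below are an illustrative invention, not an API from the book:

```python
# Event sourcing sketch: state is derived by replaying the event log.
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def apply(balance, event):
    """Pure function: old state + event -> new state."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrawn":
        return balance - event["amount"]
    return balance  # ignore unknown event types

balance = 0
for event in events:   # replaying the log always rebuilds the same state
    balance = apply(balance, event)
print(balance)  # 75
```

Because the log is the source of truth, you can rebuild the state from scratch, derive new read models later, and keep a full audit history for free.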
Change Data Capture
Debezium, Maxwell
Chapter 12: The Future of Data Systems
Kleppmann concludes the book with philosophical reflections on how to build correct, robust, and ethical data systems. He discusses:
- Composition of services and data flow
- End-to-end correctness guarantees
- Ethical aspects of data processing
Key Concepts for System Design Interview
DDIA does not contain ready-made solutions to interview problems, but it provides the deep understanding needed to confidently answer the "why" questions:
Selecting a Database
Understanding trade-offs between SQL and NoSQL, LSM vs B-Tree
Replication Strategies
When to use synchronous vs asynchronous replication
Partitioning
Selecting partition key, avoiding hot spots
Isolation levels
Explaining anomalies and preventing them
Exactly-once semantics
Idempotency and deduplication in stream processing
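In practice, "exactly-once" is usually approximated as at-least-once delivery plus idempotent processing: the broker may redeliver a message, but a handler that deduplicates by event ID produces the same result anyway. A minimal sketch (the event shape and in-memory ID set are assumptions for illustration; in production the set of processed IDs lives in durable storage, updated atomically with the effect):

```python
processed_ids = set()   # durable in a real system, in memory here
total = 0

def handle(event):
    """Idempotent handler: redeliveries of the same event id are no-ops."""
    global total
    if event["id"] in processed_ids:
        return          # duplicate delivery — already processed, skip
    processed_ids.add(event["id"])
    total += event["amount"]

for event in [{"id": 1, "amount": 10},
              {"id": 2, "amount": 5},
              {"id": 1, "amount": 10}]:   # id=1 redelivered by the broker
    handle(event)
print(total)  # 15, not 25: the duplicate had no effect
```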
Consensus
Understanding Raft/Paxos for distributed locking
📚 Verdict
✅ Strengths
- Deep understanding of the “why”, not just the “how”
- Great visualizations and examples
- Covering the entire stack: from bytes to business logic
- Lots of references to real systems
- Honest discussion of trade-offs
⚠️ Caveats
- Large book (~600 pages)
- No ready-made interview solutions
- Takes time to digest
- Some sections may be too academic
🎯 Recommendation:
DDIA is a must-read for any engineer working with distributed systems. For interview preparation, pair it with practical books (Alex Xu, Stanley Chiang): DDIA gives you the "why", and the practical books give you the "how".
