System Design Space
Knowledge graphSettings

Updated: May 4, 2026 at 4:20 PM

Database Internals: A Deep Dive (short summary)

hard

A book on database internals is not about academic prestige. It matters because without that layer, teams make architectural decisions from surface signals and vendor labels.

In engineering practice, it helps you see how B-trees, LSM trees, transactions, replication, and consensus shape the write path, read amplification, crash recovery, and concurrency behavior of a system.

In interviews and architecture discussions, this material is especially valuable as a differentiator because it lets you explain not only what to choose, but why a mechanism behaves the way it does.

Practical value of this chapter

Storage-engine literacy

Deep B-tree and LSM-tree understanding improves architectural choices for read paths, write paths, and workload behavior.

Isolation intuition

Internal transaction mechanics make isolation-level and concurrency decisions explicit and defensible.

Replication and consensus

Tie replication models directly to availability targets, read freshness, and recovery requirements.

Interview deep dive

Use internals-level explanation as a differentiator: explain not only what to choose, but why it works.

Related chapter

PostgreSQL from the inside

Deep dive into MVCC, WAL, locks and PostgreSQL indexes from Egor Rogov.

Читать обзор

Database Internals

Authors: Alex Petrov
Publisher: O'Reilly Media, Inc.
Length: 370 pages

Analysis of Alex Petrov's book on storage engines, B-trees, LSM trees, transactions, replication, consensus, recovery, and physical data layout.

Original
Translated

The book connects storage engines, database pages, B-trees, LSM trees, write-ahead logs, two-phase locking, replicated state machines, and physical I/O into one engineering picture: how low-level storage, transactions, and distributed protocols become the observable behavior of a real DBMS.

Detailed analysis

Code of Architecture

Detailed analysis of the first part from Alexander and the Code of Architecture club

Read the analysis

Part I: storage engines

B-trees and their variants

Data structures
B-tree
Basic structure with keys in internal nodes
B+ tree
Data lives in leaves, and leaves are linked together
B* tree
Denser node fill and fewer page splits
Bw-tree
Lock-free B-tree variant for memory-resident workloads
Disk optimizations
Page Organization
Page structure: header, cells, pointers
Buffer Pool
Caching pages in memory
Split & Merge
Splitting and merging nodes with overflow/underflow
Overflow Pages
Store large values away from the main page

Key insight: B-trees work well for reads and in-place updates, which makes them a natural fit for OLTP workloads. PostgreSQL and MySQL InnoDB use B+ trees for indexes.

Detailed analysis

Code of Architecture

Detailed analysis of the LSM-tree chapter from Alexander and the Code of Architecture club

Read the analysis

LSM trees

Components
Memtable
In-memory buffer for fresh writes, often backed by a red-black tree
SSTable
Sorted immutable file on disk
WAL
Write-ahead log used for crash recovery
Compaction
Size-tiered
Groups files by size and reduces write amplification
Leveled
Caps each level size and improves read cost
FIFO
Useful for time-series data and TTL-driven expiration
Reading optimizations
Bloom Filter
Probabilistic key presence check
Sparse Index
Indexes blocks inside SSTables
Block Cache
Caches frequently read data blocks

Key insight: LSM trees optimize writes through sequential I/O, but they rely on compaction to keep read cost under control. Cassandra, RocksDB, LevelDB, and HBase all use this family of structures.

B-Tree vs LSM tree: choosing a storage structure

B-Tree architecture

[10 | 20 | 30]
[3|5|8]
[12|15|18]
[22|25|28]
Leaves contain pointers to data
✓ Advantages
  • Fast reads: O(log N)
  • Efficient range queries
  • In-place updates
✗ Drawbacks
  • Write amplification
  • Random I/O on writes
Used in:
PostgreSQLMySQL InnoDBOracleSQL ServerSQLite

Transaction Processing

Concurrency control
2PL
Lock acquisition phase → lock release phase
MVCC
Row versions and consistent snapshots
OCC
Write-conflict validation before commit
SSI
Serializable Snapshot Isolation
Recovery
WAL
Records changes before they are applied
ARIES
Analysis, redo, and undo after a crash
Checkpointing
Periodically records a safe recovery point
Shadow paging
Updates page copies and atomically swaps the pointer

Part II: distributed systems

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on replication and partitioning from Alexander and the Code of Architecture club

Read the analysis

Replication and partitioning

Replication
Single leader
One node accepts writes, replicas serve reads
Multi-leader
Multiple nodes accept writes, so conflicts must be resolved
Leaderless
Reads and writes are acknowledged by quorum
Chain replication
Nodes form a chain to preserve strict update order
Partitioning
Range partitioning
Simple key navigation, but hot ranges can appear
Hash partitioning
More even distribution, but range queries become harder
Consistent hashing
Moves less data when nodes are added or removed
Compound partitioning
Combines hash and range strategies

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on consensus protocols from Alexander and the Code of Architecture club

Read the analysis

Consensus protocols

Paxos
  • Classical Lamport algorithm
  • Prepare → Promise → Accept
  • Difficult to implement
  • Multi-Paxos reduces the cost of repeated decisions
Raft
  • Consensus with an explicit leader
  • Leader election and log replication
  • etcd, Consul, CockroachDB
  • Easier to explain and implement
Zab
  • ZooKeeper atomic broadcast
  • Leader and backup-node model
  • FIFO ordering guarantees
  • Optimized for writes

Distributed transactions

Atomic commit protocols
Two-phase commit (2PC)
Prepare → vote → commit or abort. It can block if the coordinator fails at the wrong moment.
Three-phase commit (3PC)
Adds a pre-commit phase for non-blocking recovery.
Alternative approaches
Saga pattern
A chain of local transactions with compensating actions.
Calvin / deterministic DBMS
Preorders transactions and applies that order like a replicated state machine.

Low-level details

File formats

Slotted pages
Variable-length records are addressed through a slot directory
Cell layout
Key, value, and pointer layout inside a page
Checksums
CRC32/xxHash help detect corruption
Alignment
Alignment reduces unnecessary I/O

Disk I/O optimization

Sequential vs random
Sequential access is usually cheaper than random access
Direct I/O
Bypassing OS file cache
Read-ahead
Preloads adjacent blocks before they are requested
Write coalescing
Combines small writes into larger I/O operations

Examples from real DBMSs

PostgreSQL
Storage: B+ tree, heap tables, and indexes
MVCC, WAL, TOAST for large values
MySQL InnoDB
Storage: B+ tree and clustered index
MVCC, doublewrite buffer, change buffer
RocksDB
Storage: LSM tree
Column families, leveled and universal compaction
Cassandra
Storage: LSM tree
Leaderless replication, tunable consistency
MongoDB
Storage: WiredTiger (B-tree + LSM)
Document model, sharding, replica sets
CockroachDB
Storage: RocksDB + Raft
Serializable transactions, geo-partitioning

Takeaways and recommendations

Strengths

  • Deep explanation of DBMS internals
  • Practical comparison of B-trees and LSM trees
  • Detailed treatment of consensus protocols
  • Examples from real production DBMSs
  • Physical disk storage explained clearly

Who should read it?

  • Engineers who work with databases below the SQL surface
  • Storage-engine and data-infrastructure developers
  • People who want to reason about DBMS trade-offs
  • Candidates for Staff+ roles on database teams
  • Researchers and practitioners in storage systems

Verdict: Database Internals bridges the gap between high-level system design books and academic papers. If DDIA explains which properties a system needs, Petrov shows how those properties emerge from pages, logs, indexes, replication, and commit protocols. It is a strong book for anyone who wants to understand DBMS behavior at the mechanism level.

Related chapters

Where to find the book

Enable tracking in Settings