System Design Space
Knowledge graphSettings

Updated: February 21, 2026 at 11:59 PM

Database Internals: A Deep Dive (short summary)

hard

Related chapter

PostgreSQL from the inside

Deep dive into MVCC, WAL, locks and PostgreSQL indexes from Egor Rogov.

Читать обзор

Database Internals

Authors: Alex Petrov
Publisher: O'Reilly Media, Inc.
Length: 370 pages

Analysis of the book by Alex Petrov: B-Trees, LSM-Trees, transactions, replication, consensus and the internal structure of the DBMS.

Database Internals - original coverOriginal
Database Internals - translated editionTranslated

Detailed analysis

Code of Architecture

Detailed analysis of the first part from Alexander and the Code of Architecture club

Read the analysis

Part I: Storage Engines

B-Trees and their variants

Data structures
B-Tree
Basic structure with keys at nodes
B+ Tree
Data in leaves only, linked list
B* Tree
Node Filling Optimization
Bw-Tree
Lock-free B-Tree for in-memory
Disk optimizations
Page Organization
Page structure: header, cells, pointers
Buffer Pool
Caching pages in memory
Split & Merge
Splitting and merging nodes with overflow/underflow
Overflow Pages
Storing Large Values Separately

Key insight: B-Trees are optimized for reads and in-place updates, making them ideal for OLTP workloads. PostgreSQL and MySQL InnoDB use B+ Tree for indexes.

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on LSM-Tree from Alexander and the Code of Architecture club

Read the analysis

LSM-Trees (Log-Structured Merge Trees)

Components
MemTable
In-memory write buffer (usually Red-Black Tree)
SSTable
Sorted String Table - immutable files on disk
WAL
Write-Ahead Log for durability
Compaction
Size-Tiered
Grouping by size, less write amplification
Leveled
Levels with a size limit, better read
FIFO
For time-series data with TTL
Reading optimizations
Bloom Filter
Probabilistic key presence check
Sparse Index
Index to SSTable blocks
Block Cache
Caching Data Blocks

Key insight: LSM-Trees are optimized for writing (sequential I/O), but require compaction to maintain read performance. Used in Cassandra, RocksDB, LevelDB, HBase.

B-Tree vs LSM-Tree: choosing a data structure

B-Tree architecture

[10 | 20 | 30]
[3|5|8]
[12|15|18]
[22|25|28]
Leaves contain pointers to data
✓ Advantages
  • Fast reads: O(log N)
  • Efficient range queries
  • In-place updates
✗ Drawbacks
  • Write amplification
  • Random I/O on writes
Used in:
PostgreSQLMySQL InnoDBOracleSQL ServerSQLite

Transaction Processing

Competition management
2PL (Two-Phase Locking)
Growing phase → Shrinking phase
MVCC
Multi-Version Concurrency Control, snapshot isolation
OCC
Optimistic Concurrency Control, validation on commit
SSI
Serializable Snapshot Isolation
Recovery
WAL (Write-Ahead Log)
Log of changes before application
ARIES
Analysis, Redo, Undo - recovery algorithm
Checkpointing
Periodic state saving
Shadow Paging
Copy-on-write pages

Part II: Distributed Systems

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on replication and partitioning from Alexander and the Code of Architecture club

Read the analysis

Replication & Partitioning

Replication
Single-Leader
One master for writing, replicas for reading
Multi-Leader
Multiple masters, recording conflicts
Leaderless
Quorum-based read/write (Dynamo-style)
Chain Replication
Chain of nodes for strong consistency
Partitioning
Range Partitioning
By key range, hotspots risk
Hash Partitioning
Uniform distribution, but range queries are more difficult
Consistent Hashing
Minimizing rebalancing when adding nodes
Compound Partitioning
hash + range combination

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on consensus protocols from Alexander and the Code of Architecture club

Read the analysis

Consensus Protocols

Paxos
  • Classical Lamport algorithm
  • Prepare → Promise → Accept
  • Difficult to implement
  • Multi-Paxos for the leader
Raft
  • Understandable consensus
  • Leader election + Log replication
  • etcd, Consul, CockroachDB
  • Easier to implement
Zab
  • Zookeeper Atomic Broadcast
  • Primary-backup model
  • FIFO ordering guarantees
  • Optimized for writes

Distributed Transactions

Atomic Commit Protocols
Two-Phase Commit (2PC)
Prepare → Vote → Commit/Abort. Blocking when the coordinator falls.
Three-Phase Commit (3PC)
Adds a pre-commit phase for non-blocking recovery.
Alternative approaches
Saga Pattern
A chain of local transactions with compensating actions.
Calvin / Deterministic DB
Deterministic transaction order, replicated state machine.

Low Level Details

File formats

Slotted Pages
Flexible placement of variable length records
Cell Layout
Key-Value-Pointer structures in pages
Checksums
CRC32/xxHash for corruption detection
Alignment
Alignment for I/O optimization

Disk I/O optimization

Sequential vs Random
Sequential Read/Write Preference
Direct I/O
Bypassing OS file cache
Read-Ahead
Block Read Ahead
Write Coalescing
Grouping records for batch I/O

Examples from real DBMSs

PostgreSQL
Storage: B+ Tree (Heap + Indexes)
MVCC, WAL, TOAST for large values
MySQL InnoDB
Storage: B+ Tree (Clustered Index)
MVCC, Double Write Buffer, Change Buffer
RocksDB
Storage: LSM-Tree
Column Families, Leveled/Universal Compaction
Cassandra
Storage: LSM-Tree
Leaderless Replication, Tunable Consistency
MongoDB
Storage: WiredTiger (B-Tree + LSM)
Document model, Sharding, Replica Sets
CockroachDB
Storage: RocksDB + Raft
Serializable Transactions, Geo-partitioning

Results and recommendations

Strengths

  • Deep analysis of the internal structure of databases
  • Comparison of B-Tree vs LSM-Tree with trade-offs
  • Detailed analysis of consensus algorithms
  • Examples from real production systems
  • Physical disk storage explained

Who is it suitable for?

  • Database engineers
  • For storage system developers
  • For those who want to understand trade-offs of different DBMSs
  • Preparation for Staff+ positions in DB companies
  • Storage Researchers

Verdict: Database Internals is a unique book that bridges the gap between high-level systems design books and academic works. If DDIA explains What do, then Petrov explains How this is implemented internally. A must for anyone who wants to understand databases at a deep level.

Where to find the book

Enable tracking in Settings

System Design Space

© 2026 Alexander Polomodov