Database Internals: A Deep Dive (short summary)

A book on database internals is not about academic prestige. It matters because without that layer, teams make architectural decisions from surface signals and vendor labels.

In engineering practice, it helps you see how B-trees, LSM trees, transactions, replication, and consensus shape the write path, read amplification, crash recovery, and concurrency behavior of a system.

In interviews and architecture discussions, this material is especially valuable as a differentiator because it lets you explain not only what to choose, but why a mechanism behaves the way it does.

Practical value of this chapter

Storage-engine literacy

Deep B-tree and LSM-tree understanding improves architectural choices for read paths, write paths, and workload behavior.

Isolation intuition

Internal transaction mechanics make isolation-level and concurrency decisions explicit and defensible.

Replication and consensus

Tie replication models directly to availability targets, read freshness, and recovery requirements.

Interview deep dive

Use internals-level explanation as a differentiator: explain not only what to choose, but why it works.

Decision frame and editorial focus

Chapter focus

storage engines, transactions, and replication inside databases

Workload profile

Read through mechanism: write path, read path, isolation, recovery, replication, consensus, and concurrent behavior.

Good fit

The deep dive is useful when you need to explain not just which DBMS to choose, but why its latency, reliability, or limits exist.

Boundary and risk

Do not turn the chapter into a book recap: the value is the bridge from internals to system design and operations.

Connect next

Connect takeaways to concrete engine overviews, DDIA, replication/sharding, and performance diagnosis.

Related chapter

PostgreSQL from the inside

Deep dive into MVCC, WAL, locks and PostgreSQL indexes from Egor Rogov.

Читать обзор

Database Internals

Authors: Alex Petrov
Publisher: O'Reilly Media, Inc.
Length: 370 pages

Analysis of Alex Petrov's book on storage engines, B-trees, LSM trees, transactions, replication, consensus, recovery, and physical data layout.

Original

Translated

The book connects storage engines, database pages, B-trees, LSM trees, write-ahead logs, two-phase locking, replicated state machines, and physical I/O into one engineering picture. While a DBMS is running these mechanisms stay invisible; they surface in the cost of a query, in behavior under load, and in what exactly breaks during a failure. The book shows how low-level storage, transactions, and distributed protocols add up to the observable behavior of a real system.

Detailed analysis

Code of Architecture

Detailed analysis of the first part from Alexander and the Code of Architecture club

Read the analysis

Part I: storage engines

B-trees and their variants

Data structures

B-tree

Basic structure with keys in internal nodes

B+ tree

Data lives in leaves, and leaves are linked together

B* tree

Denser node fill and fewer page splits

Bw-tree

Lock-free B-tree variant for memory-resident workloads

Disk optimizations

Page Organization

Page structure: header, cells, pointers

Buffer Pool

Caching pages in memory

Split & Merge

Splitting and merging nodes with overflow/underflow

Overflow Pages

Store large values away from the main page

Key insight: B-trees pay for in-place updates with random disk I/O but give predictable reads, which is why they are chosen for OLTP workloads. PostgreSQL and MySQL InnoDB use B+ trees for indexes.

Detailed analysis

Code of Architecture

Detailed analysis of the LSM-tree chapter from Alexander and the Code of Architecture club

Read the analysis

LSM trees

Components

Memtable

In-memory buffer for fresh writes, often backed by a red-black tree

SSTable

Sorted immutable file on disk

WAL

Write-ahead log used for crash recovery

Compaction

Size-tiered

Groups files by size and reduces write amplification

Leveled

Caps each level size and improves read cost

FIFO

Useful for time-series data and TTL-driven expiration

Reading optimizations

Bloom Filter

Probabilistic key presence check

Sparse Index

Indexes blocks inside SSTables

Block Cache

Caches frequently read data blocks

Key insight: LSM trees optimize writes through sequential I/O, but they rely on compaction to keep read cost under control. Cassandra, RocksDB, LevelDB, and HBase all use this family of structures.

B-Tree vs LSM tree: choosing a storage structure

B-Tree architecture

[10 | 20 | 30]

[3|5|8]

[12|15|18]

[22|25|28]

Leaves contain pointers to data

✓ Advantages

Fast reads: O(log N)
Efficient range queries
In-place updates

✗ Drawbacks

Write amplification
Random I/O on writes

Used in:

PostgreSQLMySQL InnoDBOracleSQL ServerSQLite

Transaction Processing

Concurrency control

2PL

Lock acquisition phase → lock release phase

MVCC

Row versions and consistent snapshots

OCC

Optimistic concurrency control: write-conflict validation before commit

SSI

Serializable Snapshot Isolation

Recovery

WAL

Records changes before they are applied

ARIES

Analysis, redo, and undo after a crash

Checkpointing

Periodically records a safe recovery point

Shadow paging

Updates page copies and atomically swaps the pointer

Part II: distributed systems

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on replication and partitioning from Alexander and the Code of Architecture club

Read the analysis

Replication and partitioning

Replication

Single leader

One node accepts writes, replicas serve reads

Multi-leader

Multiple nodes accept writes, so conflicts must be resolved

Leaderless

Reads and writes are acknowledged by quorum

Chain replication

Nodes form a chain to preserve strict update order

Partitioning

Range partitioning

Simple key navigation, but hot ranges can appear

Hash partitioning

More even distribution, but range queries become harder

Consistent hashing

Moves less data when nodes are added or removed

Compound partitioning

Combines hash and range strategies

Detailed analysis

Code of Architecture

Detailed analysis of the chapter on consensus protocols from Alexander and the Code of Architecture club

Read the analysis

Consensus protocols

Paxos

Classical Lamport algorithm
Prepare → Promise → Accept
Difficult to implement
Multi-Paxos reduces the cost of repeated decisions

Raft

Consensus with an explicit leader
Leader election and log replication
etcd, Consul, CockroachDB
Easier to explain and implement

Zab

ZooKeeper atomic broadcast
Leader and backup-node model
First-in, first-out (FIFO) ordering guarantees
Optimized for writes

Distributed transactions

Atomic commit protocols

Two-phase commit (2PC)

Prepare → vote → commit or abort. It can block if the coordinator fails at the wrong moment.

Three-phase commit (3PC)

Adds a pre-commit phase for non-blocking recovery.

Alternative approaches

Saga pattern

A chain of local transactions with compensating actions.

Calvin / deterministic DBMS

Preorders transactions and applies that order like a replicated state machine.

Low-level details

File formats

Slotted pages

Variable-length records are addressed through a slot directory

Cell layout

Key, value, and pointer layout inside a page

Checksums

CRC32/xxHash help detect corruption

Alignment

Alignment reduces unnecessary I/O

Disk I/O optimization

Sequential vs random

Sequential access is usually cheaper than random access

Direct I/O

Bypassing OS file cache

Read-ahead

Preloads adjacent blocks before they are requested

Write coalescing

Combines small writes into larger I/O operations

Primary source

Introducing Pebble

Cockroach Labs on replacing RocksDB with Pebble: the motivation, compatibility guarantees, and the plan to drop RocksDB entirely

Read the post

Examples from real DBMSs

PostgreSQL

Storage: B+ tree, heap tables, and indexes

MVCC, WAL, TOAST for large values

MySQL InnoDB

Storage: B+ tree and clustered index

MVCC, doublewrite buffer, change buffer

RocksDB

Storage: LSM tree

Column families, leveled and universal compaction

Cassandra

Storage: LSM tree

Leaderless replication, tunable consistency

MongoDB

Storage: WiredTiger (B-tree), the default engine since 3.2 (December 2015)

Document model, sharding, replica sets; WiredTiger's LSM mode never left the experimental stage and was removed from the codebase in December 2024

CockroachDB

Storage: Pebble, a Go LSM engine that grew out of RocksDB's design

Raft replication, serializable transactions, geo-partitioning; Pebble became the default in v20.2 (November 2020) and the sole engine in v21.1 (May 2021)

Takeaways and recommendations

Strengths

Deep explanation of DBMS internals
Practical comparison of B-trees and LSM trees
Detailed treatment of consensus protocols
Examples from real production DBMSs
Physical disk storage explained clearly

Who should read it?

Engineers who work with databases below the SQL surface
Storage-engine and data-infrastructure developers
People who want to reason about DBMS trade-offs
Candidates for Staff+ roles on database teams
Researchers and practitioners in storage systems

Verdict: Database Internals bridges the gap between high-level system design books and academic papers. If DDIA explains which properties a system needs, Petrov shows how those properties emerge from pages, logs, indexes, replication, and commit protocols. It is a strong book for anyone who wants to understand DBMS behavior at the mechanism level.

Related chapters

PostgreSQL from the inside (short summary) - Connects Petrov's general mechanisms to PostgreSQL's concrete implementation: MVCC, WAL, locks, and indexes.
Designing Data-Intensive Applications, 2nd Edition (short summary) - Shows how DDIA's system-level ideas map to the internals of storage engines, replication, and consensus.
Why understand storage systems? - A storage-section map for deciding where database internals should influence the class of storage you choose.
Database Selection Framework - A selection framework where B-tree, LSM-tree, and write-ahead-log behavior become practical decision criteria.
Replication and sharding - Operational continuation of the book topics: read/write paths, fault tolerance, rebalancing, and data-scale growth.

Where to find the book

Original

oreilly.com

Database Internals

Translated

piter.com

Движки хранения