Jepsen matters because it tests what a system actually does under failure, not what the documentation promises. In distributed systems, that is the real moment of truth.
In practice, this chapter helps teams define testable consistency properties for concrete workloads and avoid taking vendor claims on trust where the cost of being wrong is high.
In interviews and architecture discussions, it is especially useful when you need to show the gap between claimed and actual guarantees under network faults, delay, and coordination loss.
Practical value of this chapter
Design in practice
Promotes guarantee validation before incidents instead of trusting vendor claims.
Decision quality
Shows how to define testable consistency properties for real workloads.
Interview articulation
Strengthens answers with practical linearizability/serializability testing strategies.
Risk and trade-offs
Exposes gaps between claimed and actual guarantees under network failures.
Official website
Jepsen.io
A project that tests distributed systems for correctness under failure.
Jepsen is an independent distributed-systems analysis and testing project created by Kyle Kingsbury, also known as Aphyr. It has uncovered critical correctness bugs in dozens of popular databases and has become a de facto standard for validating consistency claims.
Foundation
TCP protocol
Jepsen injects network faults and partitions at the packet level, which services experience as stalled or broken TCP connections.
What is Jepsen?
Testing tool
Jepsen is a Clojure library for testing distributed systems. It generates load, injects network partitions, kills processes, shifts clocks, and checks whether the stated guarantees still hold.
Report series
Each analysis is published as a detailed report: test setup, detected anomalies, vendor response, and follow-up fixes. The reports became required reading for distributed-system architects.
Related chapter
CAP theorem
A fundamental limitation of distributed systems.
Why Jepsen matters
Testing marketing claims
Many databases promise strong consistency or ACID semantics, but do not always preserve those guarantees in practice. Jepsen has shown confirmed-write loss in MongoDB, dirty reads in RethinkDB, and data loss in Redis Cluster even without process crashes.
A shared language for guarantees
The project made the hierarchy of consistency models easier to reason about and separated transaction isolation in databases from linearizability in distributed operations.
Better systems
Public reports lead vendors to fix bugs. CockroachDB, TiDB, and YugabyteDB, for example, worked closely with Jepsen to substantiate their isolation and consistency guarantees.
Source
Jepsen: Consistency Models
Interactive consistency-model hierarchy.
Consistency-model hierarchy
Jepsen collects consistency models into a hierarchy and shows where two traditions meet: transaction isolation in relational databases and linearizability for distributed operations.
Two branches: transaction serializability and linearizability for distributed operations
Source: Jepsen.io
Unavailable during partition: unavailable during network faults. Nodes pause operations to preserve safety guarantees.
Sticky available: available on healthy nodes if clients keep working with the same servers.
Available on healthy nodes: available on all healthy nodes, even during full network partitions.
Key insight
Serializable comes from transactional SQL systems (transaction isolation). Linearizable comes from distributed systems (atomic reads/writes). They converge at the top in Strict Serializable, the strictest consistency model.
About Jepsen: Jepsen runs failure-oriented tests for distributed databases and validates their stated consistency guarantees. Many popular systems (Cassandra, MongoDB, CockroachDB, Redis) have gone through Jepsen analysis.
Related chapter
PACELC theorem
Trade-offs between latency and consistency.
Two branches of consistency
Transaction serializability
This branch comes from relational databases and describes transaction isolation levels, from reading uncommitted data to serializable execution.
Focus:
How transactions interact and which anomalies are allowed: dirty reads, phantom reads, and lost updates.
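As an illustration of one of these anomalies, the following minimal sketch reproduces a lost update. It assumes a toy model that is not part of the original text: each transaction reads a balance and then writes back the balance plus a deposit, and the names `run_interleaved` and `run_serial` are hypothetical.

```python
from itertools import permutations


def run_interleaved(initial, amounts):
    """Worst-case interleaving: every transaction reads before any of them writes."""
    snapshots = [initial for _ in amounts]   # all reads observe the initial balance
    balance = initial
    for snapshot, amount in zip(snapshots, amounts):
        balance = snapshot + amount          # each write overwrites the previous one
    return balance


def run_serial(initial, amounts):
    """Serial execution: each transaction reads the result of the one before it."""
    balance = initial
    for amount in amounts:
        balance = balance + amount
    return balance


initial, amounts = 100, [10, 20]
interleaved = run_interleaved(initial, amounts)
serial_outcomes = {run_serial(initial, order) for order in permutations(amounts)}

# The interleaved result matches no serial order, so one deposit was lost.
lost_update = interleaved not in serial_outcomes
```

Comparing an observed outcome with the set of outcomes that serial executions could produce is the basic idea behind checking isolation anomalies; real checkers work on recorded histories rather than on final values alone.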
Operation linearizability
This branch comes from distributed systems and describes atomic reads and writes across multiple nodes.
Focus:
Whether a distributed system behaves like a single node, where each operation takes effect atomically at one point between its invocation and its response.
Strict serializability = linearizability + serializability
Strict serializability sits at the top of the hierarchy. It combines both models: transactions execute serializably and respect real-time operation order. Google Spanner, for example, provides this guarantee (under the name external consistency) with the help of TrueTime.
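The difference between the two guarantees can be sketched with a brute-force check. This is an illustrative toy under stated assumptions, not Jepsen's checker: the `Txn` record, its `start` and `end` timestamps, and the function names are hypothetical, and trying every permutation only works for tiny histories.

```python
from itertools import permutations
from typing import NamedTuple


class Txn(NamedTuple):
    name: str
    start: float   # wall-clock time the transaction began
    end: float     # wall-clock time it was acknowledged
    ops: tuple     # ("w", key, value) or ("r", key, observed_value)


def replays(order, initial):
    """True if running the transactions one by one reproduces every observed read."""
    state = dict(initial)
    for txn in order:
        for kind, key, value in txn.ops:
            if kind == "w":
                state[key] = value
            elif state.get(key) != value:
                return False
    return True


def respects_real_time(order):
    """True if every transaction that finished before another began also precedes it."""
    position = {txn.name: index for index, txn in enumerate(order)}
    return all(position[a.name] < position[b.name]
               for a in order for b in order if a.end < b.start)


def is_serializable(txns, initial, strict=False):
    return any(replays(order, initial) and (not strict or respects_real_time(order))
               for order in permutations(txns))


# T1 writes x=1 and is acknowledged; T2 starts later but still reads the old value.
history = [
    Txn("T1", 0.0, 1.0, (("w", "x", 1),)),
    Txn("T2", 2.0, 3.0, (("r", "x", 0),)),
]
plain = is_serializable(history, {"x": 0})
strict = is_serializable(history, {"x": 0}, strict=True)
```

Here `plain` is true because the order T2, T1 explains the read, while `strict` is false because that order contradicts real time: T1 had already finished when T2 began.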
Key consistency models
Linearizability
Availability: unavailable during partition
Every operation appears to take effect instantaneously at some point between its invocation and its response, and all observers see the same sequence of operations. The strictest model for single operations.
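A minimal sketch of what checking this property means for a single register follows. It is a brute-force search under simplifying assumptions (every operation completed, one integer register, a handful of operations); the `Op` record and the `linearizable` function are hypothetical names, and production checkers use far more efficient algorithms.

```python
from typing import NamedTuple


class Op(NamedTuple):
    process: str
    kind: str        # "write" or "read"
    value: int       # value written, or value the read returned
    invoke: float    # time the client issued the call
    complete: float  # time the client received the response


def linearizable(ops, initial=0):
    """Search for an order that respects real time and register semantics."""
    def search(remaining, state):
        if not remaining:
            return True
        for op in remaining:
            # op cannot go next if another pending op finished before op was invoked
            if any(o.complete < op.invoke for o in remaining if o is not op):
                continue
            if op.kind == "read" and op.value != state:
                continue
            next_state = op.value if op.kind == "write" else state
            rest = [o for o in remaining if o is not op]
            if search(rest, next_state):
                return True
        return False

    return search(list(ops), initial)


# The read overlaps the write, so it may legally observe the new value.
overlapping = [Op("p1", "write", 1, 0.0, 3.0), Op("p2", "read", 1, 1.0, 2.0)]
# The read starts after the write was acknowledged but returns the old value.
stale = [Op("p1", "write", 1, 0.0, 1.0), Op("p2", "read", 0, 2.0, 3.0)]

overlapping_ok = linearizable(overlapping)
stale_ok = linearizable(stale)
```

The second history is the classic stale read: no placement of the two operations between their invocation and response times can explain a read of 0 after an acknowledged write of 1.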
Serializability
Availability: unavailable during partition
Transactions behave as if they executed sequentially in some order, but that order does not have to match real time. The strongest SQL isolation level.
Causal consistency
Availability: sticky available
Causally related operations are observed in the correct order. If event A happened before B, the system should not expose B without its cause. Achievable in AP systems.
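One common way to track "happened before" is a vector clock. The sketch below is illustrative only and assumes clocks are plain dictionaries mapping a node name to a counter; the function names and the post/reply scenario are hypothetical, and the text above does not say which mechanism any particular system uses.

```python
def happened_before(a, b):
    """True if clock a is causally before clock b (a <= b everywhere, < somewhere)."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a, b):
    """True if neither clock is causally before the other."""
    return not happened_before(a, b) and not happened_before(b, a)


def can_deliver(event_clock, sender, seen):
    """Causal delivery rule: the event is the sender's next one, and every
    other event it depends on has already been seen by this replica."""
    nodes = set(event_clock) | set(seen)
    return all(
        (event_clock.get(n, 0) == seen.get(n, 0) + 1) if n == sender
        else (event_clock.get(n, 0) <= seen.get(n, 0))
        for n in nodes
    )


post = {"alice": 1}                 # Alice publishes a post
reply = {"alice": 1, "bob": 1}      # Bob replies after seeing the post
unrelated = {"carol": 1}            # Carol writes something independently

fresh_replica = {}                  # has seen nothing yet
after_post = {"alice": 1}           # has already applied the post
```

With these values, `can_deliver(reply, "bob", fresh_replica)` is false because the replica has not yet seen the post the reply depends on, while `can_deliver(reply, "bob", after_post)` is true.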
Eventual consistency
Availability: available on healthy nodes
If no new writes arrive, all replicas eventually converge. The model does not promise that any particular read observes the latest value. The weakest useful guarantee.
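Convergence can be sketched with a last-writer-wins register, one of several possible merge strategies. The example assumes each replica holds a `(timestamp, node_id, value)` tuple with unique timestamp and node pairs, and that replicas exchange state in random pairs; the names `merge` and `gossip_until_converged` are hypothetical.

```python
import random


def merge(a, b):
    """Last-writer-wins: keep the version with the larger (timestamp, node_id) tag."""
    return max(a, b)


def gossip_until_converged(replicas, rng):
    """Exchange state between random pairs of replicas until all of them agree."""
    replicas = list(replicas)
    rounds = 0
    while len(set(replicas)) > 1:
        i, j = rng.sample(range(len(replicas)), 2)
        replicas[i] = replicas[j] = merge(replicas[i], replicas[j])
        rounds += 1
    return replicas, rounds


# Three replicas that each accepted a different write while disconnected.
initial = [(1, "n1", "a"), (3, "n2", "c"), (2, "n3", "b")]

# A read from the first replica before gossip still returns the stale value "a".
stale_read = initial[0][2]

final, rounds = gossip_until_converged(initial, random.Random(0))
```

Whatever order the pairs are chosen in, every replica ends up with the version carrying the highest timestamp. The cost of this simplicity is that the other concurrent writes are silently discarded.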
Notable Jepsen findings
| System | Claim | Observed behavior | Status |
|---|---|---|---|
| MongoDB | Durable writes | Confirmed writes could be lost | Fixed |
| Cassandra | LWT atomicity | Lost and duplicated operations | Fixed |
| Redis Cluster | Consistency | Acknowledged writes lost during a partition, without process crashes | By design |
| etcd | Linearizability | Confirmed ✓ | Verified |
| CockroachDB | Serializability | Confirmed ✓ | Verified |
| TiDB | Snapshot isolation | Anomalies found | Fixed |
Full list of reports: jepsen.io/analyses
How Jepsen testing works
Setup
Deploy a cluster on N nodes
Load
Run reads, writes, and CAS operations
Nemesis
Partitions, process kills, and clock shifts
History
Record every call, response, and error
Check
Compare the history with the chosen model
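Nemesis is the failure-injection component. It breaks connectivity between nodes, kills processes, and shifts clocks. If a system claims linearizability, the history recorded during those faults must still be linearizable: there must be some ordering of the operations that is consistent with both real time and the results clients observed.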
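The five stages can be sketched end to end with a toy model. Everything here is a hypothetical stand-in written in Python for illustration: `ToyCluster` is a made-up two-node register with asynchronous replication, and `run_test` and `check` do not mirror the API of the real Clojure framework.

```python
class ToyCluster:
    """Setup: a primary that acknowledges writes, then replicates asynchronously."""

    def __init__(self):
        self.primary = 0
        self.replica = 0
        self.partitioned = False

    def write(self, value):
        self.primary = value
        if not self.partitioned:        # replication only succeeds while the link is up
            self.replica = value
        return "ok"                     # acknowledged even if the replica missed it

    def read(self, node):
        return self.primary if node == "primary" else self.replica


def run_test(schedule):
    """Load + nemesis + history: run steps in order and record every client call."""
    cluster, history = ToyCluster(), []
    for action, arg in schedule:
        if action == "partition":       # nemesis cuts the replication link
            cluster.partitioned = True
        elif action == "heal":          # nemesis restores it; the replica catches up
            cluster.partitioned = False
            cluster.replica = cluster.primary
        elif action == "write":
            history.append(("write", arg, cluster.write(arg)))
        elif action == "read":
            history.append(("read", arg, cluster.read(arg)))
    return history


def check(history, initial=0):
    """Check: with one sequential client, a read must return the latest acked write."""
    expected, violations = initial, []
    for index, (kind, arg, result) in enumerate(history):
        if kind == "write" and result == "ok":
            expected = arg
        elif kind == "read" and result != expected:
            violations.append((index, arg, result, expected))
    return violations


faulty = run_test([("write", 1), ("read", "replica"), ("partition", None),
                   ("write", 2), ("read", "primary"), ("read", "replica")])
healthy = run_test([("write", 1), ("write", 2), ("read", "replica")])
```

Calling `check(faulty)` reports the replica read that returned 1 after the write of 2 had been acknowledged, while `check(healthy)` reports nothing. The same workload passes without the nemesis, which is why correctness has to be tested under faults.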
Practical conclusions
1. Do not trust claims without evidence
Strong consistency, ACID, and linearizability are precise guarantees, not marketing adjectives. Check Jepsen reports and vendor documentation for concrete limitations.
2. Understand the cost of a model
Stricter consistency models have a price: unavailability during network partitions under CAP or higher latency under PACELC. Choose the model from application requirements.
3. Test under failure
Correctness is not established in ideal conditions; it is tested during failures. Use chaos-engineering tools such as Jepsen, Chaos Monkey, and Toxiproxy to observe actual system behavior.
4. Separate isolation from consistency
Serializable isolation in a database is not the same as linearizable consistency in a distributed system. The first is about transactions; the second is about individual operations. Full correctness needs both sides: strict serializability.
What to study next
Jepsen consistency models
Interactive hierarchy with definitions and relationships between guarantees.
Jepsen reports
Analyses of tested systems and vendor responses to observed anomalies.
GitHub: jepsen-io/jepsen
Framework source code for custom distributed-system tests.
DDIA Book
Chapter 9, "Consistency and Consensus", gives a deeper treatment of models and consensus.
Jepsen is your ally
Before choosing a database for a critical system, check Jepsen reports. If a system is not listed, that does not prove it is reliable; it only means no one has tested it publicly. Absence of evidence of bugs is not evidence of their absence.
Related chapters
- Why distributed systems and consistency matter - Section context for why consistency guarantees need failure-time validation, not just documentation.
- CAP theorem - The baseline availability-versus-consistency choice under network partition that Jepsen exposes in real systems.
- PACELC theorem - The CAP extension for normal operation, where latency and consistency shape database behavior before a partition.
- Consensus: Paxos and Raft - Mechanisms for strong guarantees through quorums, replicated logs, and leader-oriented protocols.
- Leslie Lamport: causality, Paxos, and engineering mindset - Causality and happens-before reasoning needed to understand Jepsen consistency models.
- Testing distributed systems - Fault injection and chaos experiments for reproducing distributed-system anomalies.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - A deep reference on consistency, replication, and consensus that supports Jepsen-style validation.
- Distributed Systems, 4th Edition (short summary) - Theoretical background on failure models and distributed algorithms behind Jepsen reports.
- Cassandra: The Definitive Guide (short summary) - A practical example of tunable consistency and fix cycles validated by public Jepsen tests.
- MongoDB: document model, replication, and consistency - How replica-set guarantees and write concerns evolved after public Jepsen feedback.
