Consensus: Paxos and Raft

Consensus is best seen not as a badge of engineering sophistication, but as an expensive tool for cases where the system truly needs one agreed history of state.

In practice, this chapter helps separate situations where Raft or Paxos are justified from those where the team only buys extra latency and complexity without a critical correctness gain.

In interviews and architecture discussions, the argument is strongest when you discuss not only algorithm names but also the cost of a quorum: write latency, recovery behavior, leader failover, and debugging overhead.

Practical value of this chapter

Design in practice

Clarifies where consensus is truly required versus where simpler guarantees are enough.

Decision quality

Helps choose Raft/Paxos with awareness of workload shape and fault-tolerance goals.

Interview articulation

Enables concise explanation of quorum, leader term, and commit behavior.

Risk and trade-offs

Highlights consensus cost: write latency, recovery complexity, and debugging overhead.

Related book

Designing Data‑Intensive Applications

The consensus and replication chapters connect command logs, quorums, and the practical trade-offs of fault-tolerant storage.


In this chapter, consensus ties quorum, linearizability, leader election, and replication into one decision-making loop. Paxos and Raft organize their roles, messages, and command logs differently, but both protect the system from contradictory decisions during network partitions and partial failures.

For Paxos, the vocabulary that matters from here on is the proposer, the acceptor, the learner, and the proposal number. For Raft, it is the replicated log, the term, and the commit index.

The practical cost of this loop shows up in latency, throughput, and failover time, in sensitivity to clock skew, clock drift, and timeout settings, and in the state-machine model, which requires every replica to execute the same commands in the same order.

Consensus is needed when several nodes must make the same decision about system state, even while messages are delayed, nodes fail, and the network temporarily splits. Without it, a system cannot reliably elect a leader, fix the order of writes, or maintain one history of critical changes.

Foundation

TCP protocol

Consensus uses message exchange, but correctness comes from quorums, round numbers, and commit rules rather than from transport alone.


Where consensus is actually needed

Leader election

There must be exactly one coordinator in the cluster: two active leaders at once mean two diverging sources of truth and lost writes.

Cluster metadata

Membership, configuration, schema, and routing rules change rarely, but they cannot change out of step — the order of changes must be the same for every node.

Linearizable writes

Where strict ordering matters, critical operations must appear to pass through one sequential history rather than several parallel ones.

Tested in practice

Jepsen and consistency models

Jepsen tests have repeatedly found clusters whose quorums effectively stopped intersecting — and the systems happily committed contradictory histories.


Quorum arithmetic: why 2f + 1

Every protocol in this chapter rests on the same argument. Suppose the cluster has N nodes, and a decision counts as made once a quorum of Q nodes confirms it. Two requirements squeeze Q from both sides.

Liveness: Q ≤ N − f

A quorum must still assemble after f nodes fail. If a decision needs more than N − f confirmations, then f failures are enough to halt writes: the live nodes would wait on the dead ones indefinitely.

Safety: 2Q > N

Any two quorums must intersect in at least one node. Otherwise, during a network partition, two disjoint groups can commit different decisions — which is exactly split-brain.

Combine the two: N − f ≥ Q > N/2, hence N > 2f. The smallest cluster that survives f failures is N = 2f + 1 nodes with a quorum of Q = f + 1. The node in the intersection of two quorums acts as a witness: any future quorum necessarily contains someone who saw the earlier decision, so a new leader physically cannot miss anything that was committed. All of the safety in Paxos, Raft, and ZAB hangs on this intersection.

Nodes (N)	Quorum	Tolerates failures (f)	Comment
2	2	0	Worse than one node: an extra point of failure with no extra resilience
3	2	1	The smallest fault-tolerant cluster — the standard for etcd and Consul
4	3	1	Same resilience as three nodes, but the quorum is larger and slower
5	3	2	Survives a node failure even during maintenance on another
7	4	3	The practical ceiling: every write already waits for four confirmations
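The table rows follow directly from the two inequalities. A minimal sketch, assuming plain majority quorums (the function names are illustrative and not taken from any library):

```python
def majority_quorum(n: int) -> int:
    """Smallest Q with 2Q > N, so any two quorums share a node."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest f with Q <= N - f, so a quorum still assembles after f crashes."""
    return n - majority_quorum(n)


for n in (2, 3, 4, 5, 7):
    print(n, majority_quorum(n), tolerated_failures(n))
```

Running it for N = 3 and N = 4 shows the same f with a larger quorum, which is the arithmetic behind preferring odd cluster sizes.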

Why an even size is wasted money

Four nodes tolerate exactly as many failures as three (f = 1), but the majority grew from two to three: every write waits for an extra confirmation, and there are more machines that can break. An even-sized cluster pays for reliability it never receives, which is why ZooKeeper and etcd recommend 3 or 5 nodes: three when one spare failure is enough, five when the cluster must survive an outage during planned maintenance.

Read and write quorums: R + W > N

The intersection idea generalizes: if writes are confirmed by W nodes and reads poll R nodes, then with R + W > N every read quorum is guaranteed to include at least one node that holds the latest confirmed write. Raft and Paxos use the symmetric case W = R = f + 1, while leaderless stores such as Cassandra let you tune R and W per workload.

Important: quorum reads and writes by themselves are not consensus. Without round numbers and a commit rule, the intersection guarantees visibility of a fresh copy, not a single linearizable history of operations.
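The R + W > N rule can be checked by brute force on small clusters. The sketch below enumerates every possible read set and write set; it illustrates only the intersection property, not a consensus protocol, and the function name is hypothetical:

```python
from itertools import combinations


def quorums_always_intersect(n: int, r: int, w: int) -> bool:
    """True if every read set of size r shares a node with every write set of size w."""
    nodes = range(n)
    return all(
        set(read) & set(write)
        for read in combinations(nodes, r)
        for write in combinations(nodes, w)
    )


print(quorums_always_intersect(5, 3, 3))  # R + W = 6 > 5
print(quorums_always_intersect(5, 2, 3))  # R + W = 5, not greater than N
```

For small N the brute-force answer agrees with the inequality in every case, which is just the pigeonhole argument spelled out.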

Safety and liveness: two kinds of guarantees

Properties of a distributed algorithm are traditionally split into two classes, and for consensus this split is not a formality but the central design choice: which class of guarantees the protocol is prepared to give up when the network misbehaves.

Safety: "nothing bad ever happens"

A safety violation is visible on a finite prefix of the execution and cannot be repaired after the fact. For consensus it means three properties:

Agreement — no two correct nodes ever decide different values for the same log slot.
Validity — the decided value was actually proposed by someone; the protocol did not invent it.
Integrity — a node decides at most once and never changes its decision.

Liveness: "something good eventually happens"

A liveness violation can never be demonstrated by any finite prefix — one can always object that "one more step and it would have decided". For consensus it means:

Termination — every correct node eventually decides.
In practice: elections finish, writes commit, and the cluster does not hang in endless re-elections.

What Raft and Paxos promise always — and what only in fair weather

Safety is unconditional in both: under arbitrary delays, losses, duplicates, and reordering of messages, a committed entry never rolls back and two leaders never share the same term. The fine print is the failure model: at most f crash failures, no Byzantine nodes, and disks that do not lose data after fsync.

Liveness is conditional: it rests on the network behaving well for long enough. The Raft authors state it as a timing requirement, broadcastTime ≪ electionTimeout ≪ MTBF: a round trip with acknowledgment takes 0.5–20 ms, election timeouts are chosen in the 10–500 ms range, and node failures happen once in weeks or months. While the inequality holds, the cluster keeps a stable leader; when the network breaks it, the cluster stops accepting writes, but it never starts contradicting itself.
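The timing requirement can be turned into a rough configuration check. In this sketch, "much less than" is approximated as "smaller by at least a factor of 10"; that factor and the function name are assumptions chosen for illustration, not values prescribed by the protocol:

```python
def timing_requirement_holds(broadcast_ms: float,
                             election_timeout_ms: float,
                             mtbf_ms: float,
                             factor: float = 10.0) -> bool:
    """Approximate 'a << b' as 'a * factor <= b' (the factor is an assumption)."""
    return (broadcast_ms * factor <= election_timeout_ms
            and election_timeout_ms * factor <= mtbf_ms)


MONTH_MS = 30 * 24 * 3600 * 1000

print(timing_requirement_holds(5, 150, MONTH_MS))   # True
print(timing_requirement_holds(20, 50, MONTH_MS))   # False: timeout too close to broadcast time
```

A check like this says nothing about safety; it only flags settings under which spurious elections become likely.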

Paper

FLP, JACM 1985

The original paper by Fischer, Lynch, and Paterson on the impossibility of deterministic consensus in an asynchronous system.


FLP: why consensus cannot be guaranteed

In 1985 Michael Fischer, Nancy Lynch, and Michael Paterson proved the result that drew the boundaries of the whole field and later earned the Dijkstra Prize: in a fully asynchronous system — with no upper bound on message delays and no synchronized clocks — no deterministic consensus protocol can guarantee termination if even a single process may silently stop.

"Every protocol for this problem has the possibility of nontermination, even with only one faulty process."

What exactly is proven

Only liveness takes the hit. For any deterministic protocol there exists an "unlucky" message delivery schedule under which the system stays undecided forever. Safety remains achievable — it is the guarantee of termination that is impossible.

Why it holds

In the asynchronous model a failed node is indistinguishable from a very slow one. Wait for it and you may wait forever; decide without it and it may come back and vote the other way. The proof shows that an adversary controlling message delivery can always keep the protocol in a state where the outcome is still open.

What the theorem does not say

FLP does not say consensus "fails in practice" — it forbids a worst-case guarantee. Real networks are asynchronous only some of the time, and the engineering question is different: how to make the bad periods rare and the behavior during them safe.

How practical systems sidestep FLP

1. Timeouts = partial synchrony

Dwork, Lynch, and Stockmeyer (1988) formalized the partially synchronous model: after some unknown stabilization point the network starts delivering messages within a bounded time. In this model consensus is solvable — and this is exactly where Raft and Paxos live: safety under any network behavior, liveness once things stabilize. Every timeout in Raft is a bet that the stable period has already begun.

2. Randomization

Ben-Or (1983) showed that randomness escapes the deterministic ban: a coin-flipping protocol terminates with probability 1. The direct engineering echo of this idea is Raft's randomized election timeouts: the random spread breaks the symmetry in which candidates split the vote forever.
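The effect of the random spread can be illustrated with a small simulation. It assumes each node draws its timeout uniformly from [base, base + spread] and treats two earliest timers firing within a short window (roughly one broadcast time) as a likely split vote. All names and numbers are illustrative:

```python
import random


def election_timeout_ms(base_ms: float, spread_ms: float, rng: random.Random) -> float:
    """Draw one election timeout from [base_ms, base_ms + spread_ms]."""
    return base_ms + rng.uniform(0, spread_ms)


def collision_rate(nodes: int, base_ms: float, spread_ms: float,
                   window_ms: float, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of trials in which the two earliest timers fire within window_ms."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(trials):
        timers = sorted(election_timeout_ms(base_ms, spread_ms, rng) for _ in range(nodes))
        if timers[1] - timers[0] < window_ms:
            collisions += 1
    return collisions / trials


print(collision_rate(5, 150, 0, 5))     # identical timeouts: every trial collides
print(collision_rate(5, 150, 150, 5))   # randomized timeouts: collisions become rare
```

The simulation ignores retries: in Raft a split vote simply leads to another randomized timeout, so repeated collisions become progressively less likely.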

3. Leader election as a failure detector

Theorists identified the minimal addition to the asynchronous model that makes consensus solvable: a mechanism that eventually points everyone at one live leader (the Ω failure detector of Chandra and Toueg). Leader election in Raft, Multi-Paxos, and ZAB is the engineering realization of that abstraction. Its liveness is conditional for the same reason: under an unstable network, re-elections can run forever, but the protocol will never allow two leaders with the same term number.

How Paxos chooses one value

Paxos has one job: choose a single value when several nodes propose their own and some of them may fail mid-vote. Lamport's algorithm does it in two moves — in the prepare phase participants promise not to accept older proposals, in the accept phase a quorum fixes the chosen value. Those promises are exactly what stops two competing proposers from committing different values.

Paxos: node message flow

The scenarios show how a quorum chooses one value and what happens when proposals compete.

Single value

Paxos chooses a value through quorum intersection

The proposer runs two phases: it first collects promises, then asks a quorum to accept the value.

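Steps of the scenario:

1. Prepare(n) is sent to a quorum. The proposer chooses a round number and sends Prepare(n) to acceptors.
2. Acceptors reply with Promise.
3. Accept(n, v) proposes the value.
4. The quorum stores Accepted.
5. Learners observe the decision.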

What it protects

Phase quorums intersect, so two different values cannot both be safely chosen.

Main risk

Without a stable leader, competing proposers may burn many network rounds.

What to watch

Proposal numbers, quorum size, and the value returned in Promise.

Implementation notes

• Prepare and Accept act like two network safety barriers.
• Promise must return the latest accepted value or safety breaks.
• A learner does not choose the value; it observes the quorum result.
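The two phases and the Promise rule can be sketched as an in-memory model. This is a simplified illustration, not a production implementation: it assumes unique proposal numbers, synchronous calls instead of network messages, and no durable storage. The names Acceptor and propose are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                 # highest proposal number promised so far
    accepted_n: int = -1               # number of the last accepted proposal
    accepted_v: Optional[str] = None   # value of the last accepted proposal

    def on_prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to reject proposals numbered below n."""
        if n <= self.promised:
            return None                # already promised an equal or higher number
        self.promised = n
        return (self.accepted_n, self.accepted_v)

    def on_accept(self, n: int, v: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n < self.promised:
            return False
        self.promised = n
        self.accepted_n, self.accepted_v = n, v
        return True


def propose(reachable: List[Acceptor], cluster_size: int,
            n: int, value: str) -> Optional[str]:
    """Run Prepare and Accept; return the chosen value, or None without a quorum."""
    quorum = cluster_size // 2 + 1
    promises = [p for p in (a.on_prepare(n) for a in reachable) if p is not None]
    if len(promises) < quorum:
        return None
    # Adopt the value of the highest-numbered accepted proposal, if any exists.
    prev_n, prev_v = max(promises, key=lambda p: p[0])
    chosen = prev_v if prev_n >= 0 else value
    acks = sum(a.on_accept(n, chosen) for a in reachable)
    return chosen if acks >= quorum else None


a, b, c = Acceptor(), Acceptor(), Acceptor()
first = propose([a, b], 3, n=1, value="X")      # chosen on {a, b}
second = propose([b, c], 3, n=2, value="Y")     # learns "X" through b and keeps it
stale = propose([a, b, c], 3, n=1, value="Z")   # number too old, no promises
print(first, second, stale)                     # X X None
```

The second proposer wanted "Y" but ends up re-proposing "X": acceptor b sits in both quorums and reports the earlier accepted value in its Promise.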

Multi-Paxos: reducing the number of rounds

When the leader is stable, the prepare phase does not need to run for every entry. The leader secures a proposal number once and then sends new values directly into the accept round.

What it gives you

Fewer network rounds per write
Higher throughput while the leader remains stable
Clearer progress when several clients compete to write

Multi-Paxos: message flow

The scenarios show the short write path with a stable leader and a safe leader change.

Steady write

Multi-Paxos shortens the write path with a stable leader

The leader has already completed the prepare phase, so each new entry goes directly through the accept round.

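Steps of the scenario:

1. The leader owns a proposal number. The prepare phase has already run, and the quorum recognizes the leader for the current number.
2. The client sends a command to the leader.
3. Accept(n, vᵢ) goes to a quorum.
4. The quorum returns Accepted.
5. The result is published to learners.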

What it protects

Paxos safety is preserved, while the common write path avoids another Prepare.

Main risk

If the leader stalls or loses the majority, the steady path stops making progress.

What to watch

Leader stability, write latency, and the share of writes that require a new Prepare.

Implementation notes

• Multi-Paxos does not change Paxos safety rules; it optimizes repeated writes.
• The client path stays shorter while the leader is not competing with another proposer.
• Reads still need rules that do not bypass the current quorum.
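The saving is easy to quantify with back-of-the-envelope arithmetic. The sketch assumes one prepare round covers all later slots and that every leader change costs exactly one extra prepare round; both are simplifications, and the function names are illustrative:

```python
def basic_paxos_rounds(entries: int) -> int:
    """Prepare + Accept for every entry."""
    return 2 * entries


def multi_paxos_rounds(entries: int, leader_changes: int = 0) -> int:
    """One Prepare per leadership period, then one Accept per entry."""
    return (1 + leader_changes) + entries


print(basic_paxos_rounds(1000))       # 2000
print(multi_paxos_rounds(1000))       # 1001
print(multi_paxos_rounds(1000, 3))    # 1004
```

The fewer leader changes occur, the closer the cost gets to one quorum round per write, which is why the share of writes needing a new Prepare is worth monitoring.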

How Raft makes consensus easier to reason about

Understandability here is the design goal, not decoration: Paxos is notoriously easy to implement incorrectly. Raft draws hard lines between three mechanisms (leader election, command-log replication, and membership changes), and that is what makes its behavior easier to explain to a team, implement without hidden defects, and debug when the cluster misbehaves.

Raft: node interactions

The scenarios show leader election, entry commit, and stale-leader step-down.

Election + commit

Raft elects a leader and commits through a majority

A candidate first wins a majority of votes, then the leader replicates a command and advances the commit index.

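Steps of the scenario:

1. Election timeout fires. A follower stops seeing leader heartbeats, increments its term, and becomes a candidate.
2. The candidate sends RequestVote.
3. A majority grants votes.
4. The leader replicates a client command.
5. The commit index advances.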

What it protects

A leader appears only after majority votes, and an entry commits only after majority acknowledgment.

Main risk

Poor timeout settings trigger unnecessary elections and delay commits.

What to watch

Term number, vote majority, replica lag, and commit-index movement.

Implementation notes

• Raft separates leader election from log replication to make behavior easier to verify.
• A client command is safe after majority acknowledgment, not just after one leader writes it locally.
• Replicas apply entries to the state machine in log order.
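How a leader turns acknowledgments into a commit index can be sketched as follows. The sketch follows the rule from the Raft paper that a leader commits by counting replicas only for entries of its current term; the data layout (a list of match indexes that includes the leader, and a list of entry terms) is an assumption made for illustration:

```python
from typing import List


def advance_commit_index(match_index: List[int], log_terms: List[int],
                         current_term: int, commit_index: int) -> int:
    """Return the new commit index for the leader.

    match_index: highest replicated log index per server, leader included.
    log_terms[i]: term of log entry i + 1 (Raft log indexes start at 1).
    """
    quorum = len(match_index) // 2 + 1
    # The quorum-th largest match index is stored on a majority of servers.
    candidate = sorted(match_index, reverse=True)[quorum - 1]
    # Only entries from the leader's current term are committed by counting replicas.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index


print(advance_commit_index([5, 5, 4, 3, 2], [1, 1, 2, 2, 2], current_term=2, commit_index=3))  # 4
print(advance_commit_index([5, 5, 4, 3, 2], [1, 1, 2, 2, 2], current_term=3, commit_index=3))  # 3
```

In the second call the majority-replicated entry belongs to an older term, so the leader waits until an entry of its own term reaches a majority; earlier entries then commit along with it.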

Paxos and Raft: engineering comparison

Paxos

Strong theoretical foundation and compact formal model
Harder to explain and implement without mistakes
Often hidden behind Multi-Paxos or derived algorithms in real products

Raft

Explicit model for leaders, terms, and log replication
Easier to explain to a team and operate in practice
Used in etcd, Consul, CockroachDB, and other systems

Dissertation

Consensus: Bridging Theory and Practice

Ongaro's dissertation on Raft: single-server membership changes, read optimizations, and implementation lessons.


Raft, Multi-Paxos, and ZAB: three schools of leader-based consensus

Three families dominate production today: Raft, Multi-Paxos, and ZAB (ZooKeeper's atomic broadcast protocol). All three maintain a replicated log through majority quorums, but they answer three engineering questions differently: how strong the leader is, how the cluster recovers after the leader changes, and how to change the membership safely.

Aspect	Raft	Multi-Paxos	ZAB
Role of the leader	Strong leader: all writes go through it, and the log flows only from the leader outward	An optimization: the protocol stays correct with several competing proposers — only performance suffers	Primary-backup: the primary generates ordered state changes; the epoch plays the role of the term
Elections and recovery	Votes go only to a candidate with a log at least as complete, so the new leader already holds everything committed	A node with gaps in its log can become leader: the prepare phase recovers unfinished slots from a quorum	Discovery and synchronization phases: the leader gathers the most complete history and aligns a quorum before broadcasting
Membership changes	Joint consensus (a double quorum across two configurations) or single-server changes from the dissertation	The configuration is an ordinary log entry that takes effect α slots later	Before ZooKeeper 3.5.0 — restarts with a new config; since 3.5.0 — dynamic reconfig with no downtime
Where it runs	etcd (and through it Kubernetes), Consul, CockroachDB, TiKV, Kafka KRaft	Chubby, Spanner, and other internal Google systems	ZooKeeper — and on top of it HBase, Hadoop, and Kafka before 4.0
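The Raft election rule from the table ("votes go only to a candidate with a log at least as complete") can be sketched as a comparison of the last log entries. This is a simplified voter without persistence or RPC plumbing; the function names and the (term, index) tuple layout are assumptions for illustration:

```python
from typing import Optional, Tuple

LogPosition = Tuple[int, int]  # (term of last entry, index of last entry)


def log_is_up_to_date(candidate_last: LogPosition, my_last: LogPosition) -> bool:
    """Later last term wins; with equal terms the longer log wins."""
    return candidate_last >= my_last


def should_grant_vote(current_term: int, voted_for: Optional[str],
                      cand_term: int, cand_id: str,
                      candidate_last: LogPosition, my_last: LogPosition) -> bool:
    if cand_term < current_term:
        return False                  # stale candidate
    if cand_term > current_term:
        voted_for = None              # a newer term clears the previous vote
    if voted_for not in (None, cand_id):
        return False                  # at most one vote per term
    return log_is_up_to_date(candidate_last, my_last)


print(should_grant_vote(2, None, 3, "B", (2, 5), (2, 5)))  # True
print(should_grant_vote(2, None, 3, "B", (2, 4), (2, 5)))  # False: candidate log is shorter
```

Because a committed entry lives on a majority and a winner needs votes from a majority, at least one voter holds that entry and refuses candidates that lack it.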

Joint consensus: why two configurations

You cannot switch a cluster from the old membership to the new one in a single step: during the transition, a majority of the old configuration and a majority of the new one may fail to intersect and elect two leaders. That is why the 2014 Raft paper takes the cluster through a transitional configuration C(old,new), where every decision requires a majority in both groups at once.
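A small example makes the danger concrete. The sketch below uses a hypothetical three-node cluster growing to five nodes and shows two disjoint majorities under a direct switch, then the same acknowledgment sets under the joint rule. The helper names are illustrative:

```python
from typing import Set


def has_majority(acks: Set[str], members: Set[str]) -> bool:
    """True if acks contains more than half of members."""
    return 2 * len(acks & members) > len(members)


def joint_quorum(acks: Set[str], old: Set[str], new: Set[str]) -> bool:
    """C(old,new): a decision needs a majority of both configurations."""
    return has_majority(acks, old) and has_majority(acks, new)


old = {"A", "B", "C"}
new = {"A", "B", "C", "D", "E"}
left, right = {"A", "B"}, {"C", "D", "E"}

# Direct switch: each side has a majority of "its" configuration, with no overlap.
print(has_majority(left, old), has_majority(right, new), left & right)

# Joint consensus: neither side alone can decide.
print(joint_quorum(left, old, new), joint_quorum(right, old, new))
print(joint_quorum({"A", "B", "D"}, old, new))
```

Under the joint rule any two deciding sets must overlap, because each contains a majority of the old configuration.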

Single-server changes: simpler, with a backstory

Ongaro's dissertation proposes changing membership one node at a time: quorums of adjacent configurations intersect automatically, so no double quorum is needed. The community later found a subtle safety flaw in the scheme, and the rule was tightened: a leader must first commit an entry from its current term and only then start a membership change. The etcd-raft library ended up supporting both schemes.

Foundation

Clock synchronization

Lease reads are exactly as correct as the bound on clock drift: a direct dependency on the chapter about time.


Practical consequences: the price in RTTs and where to draw the line

What one write costs

With a stable leader: the client request to the leader, then one quorum RTT for replication plus an fsync of the log on a quorum of nodes — the disk sits on the critical path of every write.
Basic Paxos without a pinned leader spends two quorum rounds (prepare + accept) per value — which is exactly why every practical system pins a leader.
A leader change pauses writes for the election timeout plus the election itself; in a geo-distributed cluster every quorum round adds tens of milliseconds of cross-region latency.
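The steady-state cost can be estimated with a simple model. It assumes the leader writes to its own disk in parallel with replication, every node has the same fsync time, and an acknowledgment arrives one round trip plus one fsync after the request is sent. The function name and all numbers are illustrative assumptions:

```python
from typing import List


def commit_latency_ms(follower_rtt_ms: List[float], fsync_ms: float) -> float:
    """Time from the leader receiving a command to the entry being committed."""
    cluster_size = len(follower_rtt_ms) + 1          # followers plus the leader
    quorum = cluster_size // 2 + 1
    needed_followers = quorum - 1                    # the leader counts toward the quorum
    acks = sorted(rtt + fsync_ms for rtt in follower_rtt_ms)
    follower_part = acks[needed_followers - 1] if needed_followers > 0 else 0.0
    return max(fsync_ms, follower_part)


print(commit_latency_ms([1, 2, 80, 90], fsync_ms=2))   # 4: two nearby followers are enough
print(commit_latency_ms([70, 80], fsync_ms=2))         # 72: the nearest follower is in another region
```

The slowest members of the cluster do not matter for latency as long as a quorum is close; the picture changes only when fewer than a quorum of nodes are nearby.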

Three ways to read from a consensus cluster

Reading from the leader as-is

The leader answers from its own state with no extra checks.

Cost: 0 RTT inside the cluster.

The leader may already be deposed without knowing it: a stale leader returns old data and breaks linearizability. Acceptable only where a slightly stale read does no harm.

Read index

The leader records its commit index, confirms its leadership with a quorum round of heartbeats, waits for the log to apply up to that index, and replies.

Cost: 1 quorum RTT, but no disk write.

Linearizable and noticeably cheaper than a write, yet every read still touches the quorum. This is how linearizable reads work in etcd by default.

Lease read

While a lease renewed by heartbeats is in force, the leader answers locally without asking the quorum.

Cost: 0 RTT per read.

Correctness rests on bounded clock drift: drifting clocks turn into stale reads. This is how TiKV and Spanner work — the latter backs its leases with hardware TrueTime.
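The conditions behind the two linearizable read paths can be written down as small predicates. This is a sketch under stated assumptions: heartbeat acknowledgments are counted from followers only, the lease expiry is tracked on the leader's local clock, and max_drift_ms is an operator-supplied bound. None of the names come from a real library:

```python
def read_index_ready(read_index: int, applied_index: int,
                     heartbeat_acks: int, cluster_size: int) -> bool:
    """Read index: leadership confirmed by a quorum and state applied up to read_index."""
    quorum = cluster_size // 2 + 1
    leadership_confirmed = heartbeat_acks + 1 >= quorum   # +1 for the leader itself
    return leadership_confirmed and applied_index >= read_index


def lease_read_allowed(now_ms: float, lease_expiry_ms: float, max_drift_ms: float) -> bool:
    """Lease read: answer locally only while the lease holds even under worst-case drift."""
    return now_ms + max_drift_ms < lease_expiry_ms


print(read_index_ready(read_index=10, applied_index=10, heartbeat_acks=1, cluster_size=3))  # True
print(lease_read_allowed(now_ms=1060, lease_expiry_ms=1100, max_drift_ms=50))               # False
```

The lease check is only as trustworthy as max_drift_ms: if the real drift exceeds the configured bound, the predicate returns True while another leader may already exist.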

Why "consensus on every write" does not scale

The leader is a single node: its CPU, disk, and network cap the throughput of the whole group, and every write carries a quorum RTT and an fsync. The log inside one group cannot be sharded, so growing load runs straight into the leader. Mature architectures draw the line in three ways:

Consensus for metadata only. Chubby holds configuration and master elections for GFS and Bigtable, ZooKeeper does it for HBase and for Kafka before 4.0, etcd does it for Kubernetes. The data itself bypasses consensus.
Many small groups. Spanner keeps a Paxos group per tablet, CockroachDB and TiKV a Raft group per key range: scale comes from sharding the groups, not from growing a single one.
Different guarantees for different data. In Kafka with KRaft, cluster metadata goes through Raft, while the messages themselves are replicated by the cheaper in-sync-replica mechanism.

What to remember

Consensus is not needed for every write, only where the system needs one agreed decision.
A cluster of 2f + 1 nodes tolerates f failures because majority quorums are guaranteed to intersect; 3 and 5 nodes are the standard, and an even size adds no resilience.
Protocols guarantee safety under any network behavior, and liveness only during sufficiently stable periods: FLP forbids asking a deterministic algorithm for more.
Paxos shows the fundamental mechanics of value selection, while Raft makes a similar idea easier to implement.
The cost of consensus is a quorum, extra network rounds, write latency, and harder failure recovery.

When consensus can hurt

Consensus improves correctness for critical state, but it makes the write path slower and more complex. If a domain can tolerate temporary divergence, asynchronous replication, or idempotent repair, a simpler mechanism is often more reliable in operation.

Sources and further reading

Fischer, Lynch, Paterson: Impossibility of Distributed Consensus (JACM, 1985)
Dwork, Lynch, Stockmeyer: Consensus in the Presence of Partial Synchrony (JACM, 1988)
Ongaro, Ousterhout: In Search of an Understandable Consensus Algorithm (USENIX ATC, 2014)
Ongaro: Consensus: Bridging Theory and Practice (PhD, Stanford, 2014)
Chandra, Griesemer, Redstone: Paxos Made Live (PODC, 2007)
Corbett et al.: Spanner: Google's Globally-Distributed Database (OSDI, 2012)
Junqueira, Reed, Serafini: Zab: High-performance Broadcast (DSN, 2011)
ZooKeeper Dynamic Reconfiguration (since 3.5.0)
TiKV: How TiKV Uses Lease Read
Apache Kafka 4.0: KRaft replaces ZooKeeper
raft-dev: the bug in single-server membership changes and its fix

Related chapters

Why distributed systems and consistency are needed - Section map with baseline failure models, coordination challenges, and consistency boundaries.
CAP theorem - Explains why a protocol must choose what to protect first when the network partitions.
PACELC theorem - You pay the cost of coordination not only during incidents: PACELC shows that even in normal operation consistency is traded for latency.
Clock Synchronization in Distributed Systems - How skew, clock drift, and timeouts shape the stability of leader-based protocols.
Leader Election: patterns and implementations - Practical failover, lease, and split-brain protection patterns built on top of Raft and other coordination mechanisms.
Distributed transactions: two-phase and three-phase commit - How cross-service operation coordination differs from consensus over replicated state.
Jepsen and consistency models - How to validate consistency guarantees and find real violations in clustered systems.
Designing Data-Intensive Applications, 2nd Edition (short summary) - Key source on replication, command logs, consensus, and distributed-system trade-offs.
Distributed Systems, 4th Edition (short summary) - Classic theoretical foundation for distributed algorithms and failure models.
Leslie Lamport: causality, Paxos, and engineering mindset - Historical and practical context for the ideas that grew into the Paxos family.
