Consensus is best seen not as a badge of engineering sophistication, but as an expensive tool for cases where the system truly needs one agreed history of state.
In practice, this chapter helps separate situations where Raft or Paxos are justified from those where the team only buys extra latency and complexity without a critical correctness gain.
In interviews and architecture discussions, it is strongest when you talk not only about algorithm names, but also about quorum cost: write latency, recovery behavior, leader failover, and debugging overhead.
Practical value of this chapter
Design in practice
Clarifies where consensus is truly required versus where simpler guarantees are enough.
Decision quality
Helps choose Raft/Paxos with awareness of workload shape and fault-tolerance goals.
Interview articulation
Enables concise explanation of quorum, leader term, and commit behavior.
Risk and trade-offs
Highlights consensus cost: write latency, recovery complexity, and debugging overhead.
Related book
Designing Data‑Intensive Applications
The consensus and replication chapters connect command logs, quorums, and the practical trade-offs of fault-tolerant storage.
Consensus is needed when several nodes must make the same decision about system state, even while messages are delayed, nodes fail, and the network temporarily splits. Without it, a system cannot reliably elect a leader, fix the order of writes, or maintain one history of critical changes.
Foundation
TCP protocol
Consensus uses message exchange, but correctness comes from quorums, round numbers, and commit rules rather than from transport alone.
Where consensus is actually needed
Leader election
The cluster must appoint one coordinator and avoid two active leaders at the same time.
Cluster metadata
Membership, configuration, schema, and routing rules must change in one agreed order.
Linearizable writes
Critical operations must appear to pass through one sequential history.
How Paxos chooses one value
Paxos is Lamport's classic algorithm for choosing a single value under contention and failure. It splits the work into preparation and acceptance: participants first promise not to accept older proposals, and then a quorum accepts the selected value.
Paxos: node message flow
The scenarios show how a quorum chooses one value and what happens when proposals compete.
Single value
Paxos chooses a value through quorum intersection
The proposer runs two phases: it first collects promises, then asks a quorum to accept the value.
Active step
Prepare(n) is sent to a quorum
The proposer chooses a round number and sends Prepare(n) to acceptors.
Node interaction view
What it protects
Phase quorums intersect, so two different values cannot both be safely chosen.
Main risk
Without a stable leader, competing proposers may burn many network rounds.
What to watch
Proposal numbers, quorum size, and the value returned in Promise.
Implementation notes
- •Prepare and Accept act like two network safety barriers.
- •Promise must return the latest accepted value or safety breaks.
- •A learner does not choose the value; it observes the quorum result.
Multi-Paxos: reducing the number of rounds
When the leader is stable, the prepare phase does not need to run for every entry. The leader secures a proposal number once and then sends new values directly into the accept round.
What it gives you
- Fewer network rounds per write
- Higher throughput while the leader remains stable
- Clearer progress when several clients compete to write
Multi-Paxos: message flow
The scenarios show the short write path with a stable leader and a safe leader change.
Steady write
Multi-Paxos shortens the write path with a stable leader
The leader has already completed the prepare phase, so each new entry goes directly through the accept round.
Active step
The leader owns a proposal number
The prepare phase has already run, and the quorum recognizes the leader for the current number.
Node interaction view
What it protects
Paxos safety is preserved, while the common write path avoids another Prepare.
Main risk
If the leader stalls or loses the majority, the steady path stops making progress.
What to watch
Leader stability, write latency, and the share of writes that require a new Prepare.
Implementation notes
- •Multi-Paxos does not change Paxos safety rules; it optimizes repeated writes.
- •The client path stays shorter while the leader is not competing with another proposer.
- •Reads still need rules that do not bypass the current quorum.
How Raft makes consensus easier to reason about
Raft was designed as an understandable consensus protocol. It explicitly separates leader election, log replication, and membership changes, which makes the protocol easier to explain, implement, and debug.
Raft: node interactions
The scenarios show leader election, entry commit, and stale-leader step-down.
Election + commit
Raft elects a leader and commits through a majority
A candidate first wins a majority of votes, then the leader replicates a command and advances the commit index.
Active step
Election timeout fires
A follower stops seeing leader heartbeats, increments its term, and becomes a candidate.
Node interaction view
What it protects
A leader appears only after majority votes, and an entry commits only after majority acknowledgment.
Main risk
Poor timeout settings trigger unnecessary elections and delay commits.
What to watch
Term number, vote majority, replica lag, and commit-index movement.
Implementation notes
- •Raft separates leader election from log replication to make behavior easier to verify.
- •A client command is safe after majority acknowledgment, not just after one leader writes it locally.
- •Replicas apply entries to the state machine in log order.
Paxos and Raft: engineering comparison
Paxos
- Strong theoretical foundation and compact formal model
- Harder to explain and implement without mistakes
- Often hidden behind Multi-Paxos or derived algorithms in real products
Raft
- Explicit model for leaders, terms, and log replication
- Easier to explain to a team and operate in practice
- Used in etcd, Consul, CockroachDB, and other systems
What to remember
- Consensus is not needed for every write, only where the system needs one agreed decision.
- Paxos shows the fundamental mechanics of value selection, while Raft makes a similar idea easier to implement.
- The cost of consensus is a quorum, extra network rounds, write latency, and harder failure recovery.
When consensus can hurt
Consensus improves correctness for critical state, but it makes the write path slower and more complex. If a domain can tolerate temporary divergence, asynchronous replication, or idempotent repair, a simpler mechanism is often more reliable in operation.
Related chapters
- Why distributed systems and consistency are needed - Section map with baseline failure models, coordination challenges, and consistency boundaries.
- CAP theorem - Explains why a protocol must choose what to protect first when the network partitions.
- PACELC theorem - Shows the cost of coordination both during incidents and during normal operation.
- Clock Synchronization in Distributed Systems - How skew, clock drift, and timeouts shape the stability of leader-based protocols.
- Leader Election: patterns and implementations - Practical failover, lease, and split-brain protection patterns built on top of Raft and other coordination mechanisms.
- Distributed transactions: 2PC and 3PC - How cross-service operation coordination differs from consensus over replicated state.
- Jepsen and consistency models - How to validate consistency guarantees and find real violations in clustered systems.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - Key source on replication, command logs, consensus, and distributed-system trade-offs.
- Distributed Systems, 4th Edition (short summary) - Classic theoretical foundation for distributed algorithms and failure models.
- Leslie Lamport: causality, Paxos, and engineering mindset - Historical and practical context for the ideas that grew into the Paxos family.
