System Design Space
Knowledge graphSettings

Updated: April 30, 2026 at 10:03 PM

Consensus: Paxos and Raft

expert

How systems choose one agreed value: quorums, replicated logs, Paxos, Multi-Paxos, Raft, and the practical cost of consensus.

Consensus is best seen not as a badge of engineering sophistication, but as an expensive tool for cases where the system truly needs one agreed history of state.

In practice, this chapter helps separate situations where Raft or Paxos are justified from those where the team only buys extra latency and complexity without a critical correctness gain.

In interviews and architecture discussions, it is strongest when you talk not only about algorithm names, but also about quorum cost: write latency, recovery behavior, leader failover, and debugging overhead.

Practical value of this chapter

Design in practice

Clarifies where consensus is truly required versus where simpler guarantees are enough.

Decision quality

Helps choose Raft/Paxos with awareness of workload shape and fault-tolerance goals.

Interview articulation

Enables concise explanation of quorum, leader term, and commit behavior.

Risk and trade-offs

Highlights consensus cost: write latency, recovery complexity, and debugging overhead.

Related book

Designing Data‑Intensive Applications

The consensus and replication chapters connect command logs, quorums, and the practical trade-offs of fault-tolerant storage.

Read review

Consensus is needed when several nodes must make the same decision about system state, even while messages are delayed, nodes fail, and the network temporarily splits. Without it, a system cannot reliably elect a leader, fix the order of writes, or maintain one history of critical changes.

Foundation

TCP protocol

Consensus uses message exchange, but correctness comes from quorums, round numbers, and commit rules rather than from transport alone.

Читать обзор

Where consensus is actually needed

Leader election

The cluster must appoint one coordinator and avoid two active leaders at the same time.

Cluster metadata

Membership, configuration, schema, and routing rules must change in one agreed order.

Linearizable writes

Critical operations must appear to pass through one sequential history.

How Paxos chooses one value

Paxos is Lamport's classic algorithm for choosing a single value under contention and failure. It splits the work into preparation and acceptance: participants first promise not to accept older proposals, and then a quorum accepts the selected value.

Paxos: node message flow

The scenarios show how a quorum chooses one value and what happens when proposals compete.

Single value

Paxos chooses a value through quorum intersection

The proposer runs two phases: it first collects promises, then asks a quorum to accept the value.

Interactive replayStep 1/5

Active step

Prepare(n) is sent to a quorum

The proposer chooses a round number and sends Prepare(n) to acceptors.

Node interaction view

Proposern = 7A1acceptorA2acceptorA3acceptorLearnervalue = vquorum 2/3Prepare(n)

What it protects

Phase quorums intersect, so two different values cannot both be safely chosen.

Main risk

Without a stable leader, competing proposers may burn many network rounds.

What to watch

Proposal numbers, quorum size, and the value returned in Promise.

Implementation notes

  • Prepare and Accept act like two network safety barriers.
  • Promise must return the latest accepted value or safety breaks.
  • A learner does not choose the value; it observes the quorum result.

Multi-Paxos: reducing the number of rounds

When the leader is stable, the prepare phase does not need to run for every entry. The leader secures a proposal number once and then sends new values directly into the accept round.

What it gives you

  • Fewer network rounds per write
  • Higher throughput while the leader remains stable
  • Clearer progress when several clients compete to write

Multi-Paxos: message flow

The scenarios show the short write path with a stable leader and a safe leader change.

Steady write

Multi-Paxos shortens the write path with a stable leader

The leader has already completed the prepare phase, so each new entry goes directly through the accept round.

Interactive replayStep 1/5

Active step

The leader owns a proposal number

The prepare phase has already run, and the quorum recognizes the leader for the current number.

Node interaction view

ClientcommandLeadernumber nA1replicaA2replicaA3replicaLearnerslot iPrepare already done

What it protects

Paxos safety is preserved, while the common write path avoids another Prepare.

Main risk

If the leader stalls or loses the majority, the steady path stops making progress.

What to watch

Leader stability, write latency, and the share of writes that require a new Prepare.

Implementation notes

  • Multi-Paxos does not change Paxos safety rules; it optimizes repeated writes.
  • The client path stays shorter while the leader is not competing with another proposer.
  • Reads still need rules that do not bypass the current quorum.

How Raft makes consensus easier to reason about

Raft was designed as an understandable consensus protocol. It explicitly separates leader election, log replication, and membership changes, which makes the protocol easier to explain, implement, and debug.

Raft: node interactions

The scenarios show leader election, entry commit, and stale-leader step-down.

Election + commit

Raft elects a leader and commits through a majority

A candidate first wins a majority of votes, then the leader replicates a command and advances the commit index.

Interactive replayStep 1/5

Active step

Election timeout fires

A follower stops seeing leader heartbeats, increments its term, and becomes a candidate.

Node interaction view

ClientcommandCandidateterm 4Follower AlogFollower BlogFollower Clogtimeout2/3 votes

What it protects

A leader appears only after majority votes, and an entry commits only after majority acknowledgment.

Main risk

Poor timeout settings trigger unnecessary elections and delay commits.

What to watch

Term number, vote majority, replica lag, and commit-index movement.

Implementation notes

  • Raft separates leader election from log replication to make behavior easier to verify.
  • A client command is safe after majority acknowledgment, not just after one leader writes it locally.
  • Replicas apply entries to the state machine in log order.

Paxos and Raft: engineering comparison

Paxos

  • Strong theoretical foundation and compact formal model
  • Harder to explain and implement without mistakes
  • Often hidden behind Multi-Paxos or derived algorithms in real products

Raft

  • Explicit model for leaders, terms, and log replication
  • Easier to explain to a team and operate in practice
  • Used in etcd, Consul, CockroachDB, and other systems

What to remember

  • Consensus is not needed for every write, only where the system needs one agreed decision.
  • Paxos shows the fundamental mechanics of value selection, while Raft makes a similar idea easier to implement.
  • The cost of consensus is a quorum, extra network rounds, write latency, and harder failure recovery.

When consensus can hurt

Consensus improves correctness for critical state, but it makes the write path slower and more complex. If a domain can tolerate temporary divergence, asynchronous replication, or idempotent repair, a simpler mechanism is often more reliable in operation.

Related chapters

Enable tracking in Settings