Context
Consensus: Paxos and Raft
Leader election is an application layer on top of consensus/coordination mechanisms.
Leader election is needed when the system requires a single active coordinator. In practice, choosing a leader does not solve the problem by itself unless split-brain, fencing, and correct failover semantics are also addressed.
When you need a leader
- Single-writer operations (for example, scheduler, allocator, ownership map).
- Control of background tasks, where only one active coordinator should work.
- Failover of stateful components and management of leader-only actions.
- Distributed lock/lease scenarios in the control plane.
Election mechanisms
Lease-based election
The leader holds the lease and regularly renews it. When the lease expires, another candidate can become the leader.
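Risk: without fencing tokens, clock skew and network jitter can cause split-brain, because an old leader may still believe its lease is valid after a new leader has taken over.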
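
The lease mechanics can be sketched as follows. This is a minimal single-process illustration, not a production implementation: `LeaseStore`, `Lease`, and `try_acquire` are hypothetical names, the store is in-memory, and the clock is injected so that expiry can be simulated. The sketch assumes that a fencing token is bumped on every change of leadership.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float  # seconds on the store's clock
    token: int         # fencing token, bumped each time the lease is handed over


class LeaseStore:
    """In-memory stand-in for a shared lease record (hypothetical)."""

    def __init__(self, ttl: float, clock: Callable[[], float]) -> None:
        self.ttl = ttl
        self.clock = clock
        self.lease: Optional[Lease] = None

    def try_acquire(self, candidate: str) -> Optional[Lease]:
        now = self.clock()
        current = self.lease
        if current is not None and current.expires_at > now:
            if current.holder != candidate:
                return None  # lease is still held by another node
            # The current holder renews: extend expiry, keep the same token.
            current.expires_at = now + self.ttl
            return current
        # Lease is free or expired: grant it with a larger token.
        next_token = 1 if current is None else current.token + 1
        self.lease = Lease(candidate, now + self.ttl, next_token)
        return self.lease


# Simulated time, so the example does not depend on the wall clock.
now = [0.0]
store = LeaseStore(ttl=10.0, clock=lambda: now[0])
first = store.try_acquire("node-a")      # granted, token 1
blocked = store.try_acquire("node-b")    # None: node-a still holds the lease
now[0] = 11.0                            # node-a failed to renew in time
second = store.try_acquire("node-b")     # granted, token 2
```

Note that `first` still exists on node-a after the handover. Nothing in the lease itself stops node-a from acting on it, which is why the token has to be checked downstream.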
Consensus-based election (Raft/Paxos)
The leader is selected through quorum and term/version semantics. This is the most reliable path for critical systems.
Trade-off: implementation complexity and the operational burden of keeping the quorum healthy.
Coordination-service election
Leader election via ZooKeeper/etcd/Consul primitives (ephemeral nodes, compare-and-swap, distributed locks).
Trade-off: dependence on the availability of the coordination service and on correctly configured timeouts.
Time
Clock Synchronization
Lease-based leadership without time discipline often leads to split-brain.
Practical implementations
Raft
Election timeout + majority vote + term. The de facto standard for many control plane systems.
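
A rough sketch of how these three pieces fit together, assuming a simplified model: the names `Voter`, `handle_request_vote`, `election_timeout`, and `run_election` are hypothetical, the timeout range is illustrative, and the check that the candidate's log is at least as up to date as the voter's (which real Raft requires) is deliberately omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_request_vote(self, candidate_id: str, candidate_term: int) -> bool:
        """Grant at most one vote per term; a higher term resets the vote."""
        if candidate_term < self.current_term:
            return False  # request from a stale term
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def election_timeout(base_ms: float = 150.0, spread_ms: float = 150.0) -> float:
    """Randomized timeout, so that nodes rarely become candidates simultaneously."""
    return random.uniform(base_ms, base_ms + spread_ms)


def run_election(candidate_id: str, term: int, peers: List[Voter]) -> bool:
    """The candidate votes for itself and needs a strict majority of the cluster."""
    votes = 1 + sum(p.handle_request_vote(candidate_id, term) for p in peers)
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2


peers = [Voter(), Voter()]
won_a = run_election("node-a", term=1, peers=peers)        # both peers grant their vote
lost_b = run_election("node-b", term=1, peers=peers)       # peers already voted in term 1
won_b_later = run_election("node-b", term=2, peers=peers)  # a new term resets the votes
```

The one-vote-per-term rule is what makes two leaders in the same term impossible: two candidates cannot both collect a strict majority from the same set of voters.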
ZooKeeper
Ephemeral sequential znode pattern: the znode with the lowest sequence number (the oldest one) becomes the leader.
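
The selection rule in the standard recipe is that the znode with the lowest sequence number, that is, the one created earliest, leads. The sketch below shows only that rule as pure functions over a list of child names; it does not talk to ZooKeeper. The function names and the `n_` prefix are illustrative assumptions. The sketch relies on the fact that ZooKeeper appends a 10-digit zero-padded counter to sequential znode names.

```python
from typing import List, Optional


def sequence_number(znode_name: str) -> int:
    """Extract the 10-digit zero-padded counter that ends a sequential znode name."""
    return int(znode_name[-10:])


def pick_leader(children: List[str]) -> Optional[str]:
    """The znode with the lowest sequence number is the leader."""
    return min(children, key=sequence_number, default=None)


def znode_to_watch(children: List[str], me: str) -> Optional[str]:
    """Each non-leader watches only its immediate predecessor, which avoids
    waking every candidate when the leader's znode disappears."""
    ordered = sorted(children, key=sequence_number)
    index = ordered.index(me)
    return ordered[index - 1] if index > 0 else None


children = ["n_0000000012", "n_0000000007", "n_0000000009"]
leader = pick_leader(children)                       # "n_0000000007"
watched = znode_to_watch(children, "n_0000000012")   # "n_0000000009"
```

When the watched znode is deleted, the watcher lists the children again and repeats the check, since its predecessor may have disappeared without the leader changing.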
etcd
Leases + Compare-And-Swap + lock API. Often used for leadership in cloud-native systems.
Kubernetes
Lease objects in the coordination API for controller leader election.
Split-brain protection
Fencing tokens: each new leader receives a monotonically increasing token to protect the downstream write path.
Leader-only operations check term/token before executing side effects.
Stale leader detection: heartbeat + session expiry + fast revoke.
Read-only/degraded mode when quorum is lost instead of unsafe dual-writer behavior.
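
A minimal sketch of the first and last points, under stated assumptions: `FencedStore`, `StaleLeaderError`, and `can_accept_writes` are hypothetical names, and the store is an in-memory stand-in for whatever downstream resource the leader writes to. The important detail is that the check happens on the resource side, not inside the leader.

```python
from typing import Dict


class StaleLeaderError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStore:
    """Downstream resource that remembers the highest fencing token it has seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data: Dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            # A newer leader has already written: reject the old one.
            raise StaleLeaderError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


def can_accept_writes(reachable_members: int, cluster_size: int) -> bool:
    """Without a strict majority, the node should fall back to read-only mode."""
    return reachable_members > cluster_size // 2


store = FencedStore()
store.write(1, "owner", "node-a")      # old leader, token 1
store.write(2, "owner", "node-b")      # new leader, token 2
try:
    store.write(1, "owner", "node-a")  # old leader wakes up after a long pause
except StaleLeaderError:
    pass                               # the stale write is rejected
```

A node that sees 2 of 3 members can keep writing, while a node that sees 2 of 4 cannot, because the other half of the cluster might have elected its own leader.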
Practical checklist
- The system explicitly defines leader-only and follower-safe operations.
- There is a guaranteed way to prevent split-brain side effects (fencing/version checks).
- Election timeouts are consistent with real network latency and GC pause characteristics.
- There are tests for partition, delayed packets, process pause/restart.
- Failover is checked regularly through game day scenarios.
A frequent anti-pattern: there is an election but no fencing, so the system can still perform dual writes.
