
Updated: March 25, 2026 at 3:00 AM

Clock Synchronization in Distributed Systems

medium

Practice of time synchronization: physical vs logical clocks, NTP/PTP, clock skew impact and architectural protection against time drift.

Time rarely fails gracefully in a distributed system. It usually leaks quietly into leases, TTLs, aggregation windows, and event ordering until the team realizes that the clocks, not the business logic, were the real source of breakage.

In practice, this chapter helps decide when physical time is enough, when logical time is necessary, and where skew must be handled through architectural invariants rather than optimism about perfect NTP.

In interviews and engineering discussions, it is especially useful when you need to show how time drift damages correctness and SLA in seemingly harmless mechanisms such as deduplication, expiration, and leader leases.

Practical value of this chapter

Design in practice

Helps account for clock skew in idempotency, ordering, and event deduplication.

Decision quality

Provides criteria for physical vs logical time and bounded-uncertainty choices.

Interview articulation

Supports clear explanation of why time is not a global truth in distributed systems.

Risk and trade-offs

Highlights skew-sensitive areas: TTL logic, leader leases, and windowed aggregation.

Context

Distributed Systems: Overview

Clock semantics are the foundation for consistency, coordination and observability in distributed systems.

Clock synchronization is not only about “the exact time on the servers”; it is an architectural factor that affects consistency, retry/timeout behavior, and even security. The more distributed the system, the higher the cost of wrong assumptions about time.

Why this is important

  • Event ordering and correct replay in event-driven systems.
  • TTL/lease mechanics in cache, lock services and service discovery.
  • Correct deadlines and timeout budgets in RPC/queue processing.
  • Security: token expiration, replay windows, and anti-replay checks.
  • Audit and incident investigation, where the sequence of actions matters.

Time models

Physical clocks

Real time (UTC, synchronized via NTP/PTP). Needed for business-time and compliance logic, but subject to skew between nodes and drift over time.

Logical clocks

Lamport and vector clocks capture causal order without any assumptions about wall-clock accuracy.
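As a minimal sketch of the Lamport variant: each node keeps a counter, increments it on every local event or send, and on receive jumps past the timestamp carried by the message. The class and method names here are illustrative, not taken from any particular library.

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time


node_a, node_b = LamportClock(), LamportClock()
sent = node_a.tick()              # node A stamps an outgoing message
received = node_b.receive(sent)   # node B moves past the sender's stamp
```

The receive event always gets a larger timestamp than the send that caused it, regardless of what the wall clocks on the two nodes say. A Lamport clock does not detect concurrency; that is what vector clocks add.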

Hybrid logical clocks (HLC)

A combination of physical and logical time; useful for distributed databases and snapshot operations.
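A hedged sketch of the commonly described HLC update rules follows. A timestamp is a pair (l, c): l tracks the largest physical time seen so far, and c is a logical counter that breaks ties when physical time does not advance. The class name, the millisecond resolution, and the injectable `physical` function are assumptions made for illustration.

```python
import time
from typing import Callable, Tuple


def wall_ms() -> int:
    """Wall-clock time in milliseconds."""
    return time.time_ns() // 1_000_000


class HybridLogicalClock:
    """Illustrative HLC; `physical` is injectable so skew can be simulated."""

    def __init__(self, physical: Callable[[], int] = wall_ms) -> None:
        self.physical = physical
        self.l = 0  # largest physical time observed so far
        self.c = 0  # logical counter used when l does not advance

    def now(self) -> Tuple[int, int]:
        """Timestamp a local event or an outgoing message."""
        pt = self.physical()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote: Tuple[int, int]) -> Tuple[int, int]:
        """Merge the timestamp of an incoming message."""
        rl, rc = remote
        pt = self.physical()
        new_l = max(self.l, rl, pt)
        if new_l == self.l and new_l == rl:
            new_c = max(self.c, rc) + 1
        elif new_l == self.l:
            new_c = self.c + 1
        elif new_l == rl:
            new_c = rc + 1
        else:
            new_c = 0
        self.l, self.c = new_l, new_c
        return (self.l, self.c)


# Two nodes with frozen, skewed physical clocks (values in ms).
fast = HybridLogicalClock(physical=lambda: 1_000)
slow = HybridLogicalClock(physical=lambda: 900)
sent = fast.now()
received = slow.update(sent)
later = slow.now()
```

Even though the receiving node's physical clock is 100 ms behind, its timestamps compare as later than the message it received, so tuple comparison preserves causal order while staying close to physical time.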

Related

Consensus

Leader timeouts and lease-based mechanics depend on correct time behavior.

Synchronization approaches

NTP

When: Basic standard for most general purpose systems.

Restrictions: Accuracy is typically on the order of milliseconds; you need to monitor offset and jitter and configure fallback to multiple time sources.
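To make "offset" concrete, here is the standard four-timestamp calculation used by NTP-style exchanges, written as a small sketch. It assumes the network delay is symmetric, which is exactly the assumption that limits accuracy in practice. The function name and sample values are illustrative.

```python
from typing import Tuple


def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float) -> Tuple[float, float]:
    """
    t0: client send time     (client clock)
    t1: server receive time  (server clock)
    t2: server send time     (server clock)
    t3: client receive time  (client clock)
    Returns (offset, round_trip_delay) in the same unit as the inputs.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Example in milliseconds: the client is 50 ms behind the server,
# each network hop takes 10 ms, and the server spends 1 ms processing.
offset, delay = ntp_offset_and_delay(t0=0, t1=60, t2=61, t3=21)
```

In this example the computed offset is 50 ms and the round-trip delay is 20 ms. If the two directions were not equally fast, the offset estimate would be wrong by up to half the asymmetry.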

PTP

When: When sub-millisecond accuracy is needed, for example in trading, telecom, or industrial systems.

Restrictions: Requires network and hardware support; more difficult to operate.

Application-level ordering

When: If wall-clock is unreliable for business invariants, use sequence/causal ordering in the application.

Restrictions: You can't rely entirely on timestamps for strict ordering of operations.

Related

Leslie Lamport: Causality, Paxos and Engineering Thinking

Interview on causality, logical clocks, and Lamport's engineering approach to distributed systems.

Design patterns

Use a monotonic clock to measure durations, and the wall clock only for display/business time.
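A minimal Python sketch of this split: `time.monotonic()` is never adjusted backwards by time synchronization, so it is safe for measuring elapsed time, while `time.time()` is kept only for a human-readable timestamp. The helper name `timed` is illustrative.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed_seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(lambda: sum(range(1000)))

# Wall-clock time is recorded for display or business purposes only,
# never subtracted to compute a duration.
finished_at = time.time()
```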

For critical write paths, assign a server-side timestamp or sequence number instead of trusting client-supplied time.
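One way to sketch this for a single writer process: the server stamps every accepted write with a monotonically increasing sequence number and its own receive time. `WriteLog` and `Stamped` are hypothetical names, and a real system would need a durable, replicated counter rather than an in-memory one.

```python
import itertools
import threading
import time
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stamped:
    seq: int            # authoritative order, assigned by the server
    received_at: float  # server wall-clock time, informational only
    payload: str


class WriteLog:
    """In-memory write log that assigns the ordering on the server side."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next_seq = itertools.count(1)
        self.entries: List[Stamped] = []

    def append(self, payload: str) -> Stamped:
        with self._lock:
            entry = Stamped(next(self._next_seq), time.time(), payload)
            self.entries.append(entry)
            return entry


log = WriteLog()
first = log.append("debit 10")
second = log.append("credit 10")
```

Ordering decisions then compare `seq`, so a wall-clock jump between the two writes cannot reorder them.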

Add an uncertainty window when comparing timestamps from different nodes.
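A small sketch of such a comparison, assuming you have an upper bound on how far any two node clocks can differ (`max_skew_ms` here; the bound itself must come from your own monitoring). The function returns a definite answer only when the gap between timestamps exceeds that bound.

```python
from typing import Optional


def happened_before(a_ms: int, b_ms: int, max_skew_ms: int) -> Optional[bool]:
    """
    Compare timestamps taken on different nodes.
    True  -> a is definitely before b
    False -> a is definitely after b
    None  -> the gap is within the skew bound, so the order is unknown
    """
    if a_ms + max_skew_ms < b_ms:
        return True
    if b_ms + max_skew_ms < a_ms:
        return False
    return None


clear_order = happened_before(1_000, 1_300, max_skew_ms=250)
ambiguous = happened_before(1_000, 1_100, max_skew_ms=250)
reversed_order = happened_before(1_300, 1_000, max_skew_ms=250)
```

Callers have to decide what to do with the "unknown" case: wait out the uncertainty, fall back to a sequence number, or treat the events as concurrent.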

Monitor clock offset and alert on it; remove nodes with large drift from the quorum.
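The check itself can be as simple as the sketch below. It assumes offsets per node are already collected by some monitoring agent; the node names, the 100 ms limit, and the function name are all hypothetical.

```python
from typing import Dict, Set, Tuple


def split_by_offset(offsets_ms: Dict[str, float], limit_ms: float) -> Tuple[Set[str], Set[str]]:
    """Return (healthy, drifting) node sets based on absolute clock offset."""
    healthy = {node for node, off in offsets_ms.items() if abs(off) <= limit_ms}
    drifting = set(offsets_ms) - healthy
    return healthy, drifting


measured = {"node-a": 3.2, "node-b": -1.1, "node-c": 480.0}
healthy, drifting = split_by_offset(measured, limit_ms=100.0)
```

Nodes in `drifting` would trigger an alert and be excluded from time-sensitive roles until their offset returns within the limit.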

Don't make security dependent on client time alone.
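As an illustration, a token validity check can be written so that it takes only the server's own clock as input, with a small leeway for skew between the issuing and validating servers. The function name, field names, and the 30-second leeway are assumptions for this sketch, not a recommendation for a specific value.

```python
def token_is_valid(issued_at: int, expires_at: int, server_now: int, leeway_s: int = 30) -> bool:
    """
    Validate a token's time window using the validating server's clock.
    Any time value reported by the client is deliberately not an input.
    All values are Unix timestamps in seconds.
    """
    not_from_future = issued_at - leeway_s <= server_now
    not_expired = server_now < expires_at + leeway_s
    return not_from_future and not_expired


fresh = token_is_valid(issued_at=1_000, expires_at=1_600, server_now=1_300)
expired = token_is_valid(issued_at=1_000, expires_at=1_600, server_now=1_700)
slightly_early = token_is_valid(issued_at=1_000, expires_at=1_600, server_now=990)
```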

Practical checklist

  • Time offset/jitter metrics are visible for all production nodes.
  • There is a runbook in case of massive clock drift and time-source failure.
  • Timestamp logic is tested with artificial skew in integration/chaos tests.
  • Services do not use wall-clock for SLA timeout measurements.
  • Critical transactions have an independent ordering mechanism in addition to the wall-clock.
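Testing with artificial skew is easiest when the code under test takes its clock as a parameter. The sketch below wraps a base clock with a fixed offset and shows a naive TTL check expiring an entry early on a node whose clock runs ahead. `SkewedClock` and `is_expired` are hypothetical names used only for this example.

```python
from typing import Callable


class SkewedClock:
    """Wraps a base clock and adds a fixed skew, for use in tests."""

    def __init__(self, base: Callable[[], int], skew_s: int) -> None:
        self.base = base
        self.skew_s = skew_s

    def __call__(self) -> int:
        return self.base() + self.skew_s


def is_expired(created_at: int, ttl_s: int, now: Callable[[], int]) -> bool:
    """Naive TTL check that trusts whatever clock it is given."""
    return now() - created_at >= ttl_s


def true_time() -> int:
    return 1_000  # frozen "real" time for a deterministic test


ahead = SkewedClock(true_time, skew_s=120)

# The entry was just created with a 60-second TTL.
expired_on_honest_node = is_expired(created_at=1_000, ttl_s=60, now=true_time)
expired_on_skewed_node = is_expired(created_at=1_000, ttl_s=60, now=ahead)
```

The same entry is still valid on the correctly synchronized node and already expired on the skewed one, which is the kind of divergence an integration or chaos test should surface.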

Frequent anti-pattern: using a wall-clock timestamp as the only source of event order.
