Context
Distributed Systems: Overview
Clock semantics are the foundation for consistency, coordination and observability in distributed systems.
Clock synchronization is not only about having the exact time on servers; it is an architectural factor that affects consistency, retry/timeout behavior, and even security. The more distributed the system, the higher the cost of wrong assumptions about time.
Why this matters
- Event ordering and correct replay in event-driven systems.
- TTL/lease mechanics in caches, lock services, and service discovery.
- Correct deadlines and timeout budgets in RPC/queue processing.
- Security: token expiration, replay windows, and anti-replay checks.
- Audit and incident investigation, where the sequence of actions matters.
Time models
Physical clocks
Real-world time (UTC, synchronized via NTP/PTP). Needed for business-time and compliance logic, but subject to skew and drift between nodes.
Logical clocks
Lamport and vector clocks capture causal (happens-before) order without assuming wall-clock accuracy.
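A minimal sketch of a Lamport clock, assuming a single-threaded node. The class name `LamportClock` and its method names are illustrative and do not come from any particular library.

```python
class LamportClock:
    """Scalar logical clock: a counter that only moves forward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event or just before sending a message."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, remote_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
sent = a.tick()              # node A sends a message stamped 1
b.tick()                     # node B handles a local event (time 1)
b.tick()                     # and another one (time 2)
received = b.receive(sent)   # B moves past both its own time and A's stamp
print(sent, received)        # 1 3
```

The receive event always gets a larger timestamp than the send event, regardless of what the wall clocks on the two nodes say. The converse does not hold: a larger Lamport timestamp alone does not prove that one event caused another.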
Hybrid logical clocks (HLC)
A combination of physical and logical time; useful for distributed databases and snapshot operations.
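The following is a sketch of the commonly described HLC update rules, not the implementation of any specific database. The names `HybridLogicalClock`, `now`, `update`, and `wall_ms` are assumptions made for illustration; the physical clock is injectable so the behavior can be checked deterministically.

```python
import time
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (physical milliseconds, logical counter)


def wall_ms() -> int:
    """Wall-clock time in milliseconds since the Unix epoch."""
    return time.time_ns() // 1_000_000


class HybridLogicalClock:
    def __init__(self, physical: Callable[[], int] = wall_ms) -> None:
        self.physical = physical
        self.l = 0  # highest physical time seen so far
        self.c = 0  # logical counter that breaks ties within the same l

    def now(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        pt = self.physical()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1  # wall clock stalled or went backwards
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge the timestamp of an incoming message."""
        pt = self.physical()
        remote_l, remote_c = remote
        new_l = max(self.l, remote_l, pt)
        if new_l == self.l and new_l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0  # local physical time is ahead of everything seen
        self.l = new_l
        return (self.l, self.c)


clock = HybridLogicalClock(physical=lambda: 100)  # frozen wall clock for the demo
first = clock.now()               # (100, 0)
second = clock.now()              # (100, 1): same millisecond, counter grows
merged = clock.update((250, 4))   # (250, 5): remote node is ahead
print(first, second, merged)
```

Timestamps are plain tuples, so they compare lexicographically and `first < second < merged` holds even though the local wall clock never moved. A production version would typically also bound how far ahead a remote timestamp is allowed to be.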
Related
Consensus
Leader timeouts and lease-based mechanics depend on correct time behavior.
Synchronization approaches
NTP
When: The default choice for most general-purpose systems.
Restrictions: Accuracy is typically on the order of milliseconds; offset/jitter monitoring and fallback to multiple time sources are required.
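As a worked example of what "offset" means here, the sketch below applies the standard NTP four-timestamp calculation. The function name and the sample numbers are illustrative; timestamps are integers in milliseconds to keep the arithmetic exact.

```python
from typing import Tuple


def ntp_offset_and_delay(t0: int, t1: int, t2: int, t3: int) -> Tuple[float, int]:
    """Estimate clock offset and round-trip delay from one NTP exchange.

    t0: client clock when the request was sent
    t1: server clock when the request arrived
    t2: server clock when the reply was sent
    t3: client clock when the reply arrived
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2   # how far the client is behind the server
    delay = (t3 - t0) - (t2 - t1)          # network round trip, minus server processing
    return offset, delay


# Client is 50 ms behind, each network leg takes 10 ms, server needs 2 ms.
offset_ms, delay_ms = ntp_offset_and_delay(t0=100_000, t1=100_060, t2=100_062, t3=100_022)
print(offset_ms, delay_ms)  # 50.0 20
```

The estimate is exact only when the two network legs take the same time; asymmetric paths shift the computed offset, which is one reason to monitor jitter alongside offset.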
PTP
When: Sub-millisecond accuracy is needed, for example in trading, telecom, or industrial systems.
Restrictions: Requires network and hardware support; harder to operate.
Application-level ordering
When: If the wall clock is too unreliable to uphold business invariants, use sequence numbers or causal ordering in the application.
Restrictions: Timestamps alone cannot be relied on for strict ordering of operations.
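One way to get causal ordering in the application is a vector clock per replica. The sketch below is a minimal illustration with hypothetical helper names (`increment`, `merge`, `compare`); it shows how two updates can be classified as ordered or concurrent without looking at any wall clock.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node id -> number of events seen from that node


def increment(vc: VectorClock, node: str) -> VectorClock:
    """Return a copy of vc with the local node's counter advanced."""
    out = dict(vc)
    out[node] = out.get(node, 0) + 1
    return out


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, applied when a node receives a message."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify a relative to b: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = increment({}, "a")    # {'a': 1}
v2 = increment(v1, "b")    # {'a': 1, 'b': 1}: node b saw v1, then wrote
v3 = increment(v1, "a")    # {'a': 2}: node a wrote again without seeing v2
print(compare(v1, v2), compare(v2, v3))  # before concurrent
```

A "concurrent" result means neither update saw the other, so the application needs an explicit conflict rule instead of trusting whichever timestamp looks newer.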
Related
Leslie Lamport: Causality, Paxos and Engineering Thinking
Interview on causality, logical clocks, and Lamport's engineering approach to distributed systems.
Design patterns
Use a monotonic clock to measure durations, and the wall clock only for display and business time.
For critical write paths, assign a server-side timestamp or sequence number.
Add an uncertainty window when comparing timestamps from different nodes.
Monitor and alert on clock offset; remove nodes with large drift from the quorum.
Don't make security depend on client time alone.
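Two of these patterns fit in a few lines of code. The sketch below assumes a hypothetical `Deadline` helper for timeout budgets and a `definitely_before` check, where `max_skew` is an assumed upper bound on the offset between any two node clocks (for example, taken from offset monitoring).

```python
import time


class Deadline:
    """Timeout budget measured on the monotonic clock.

    time.monotonic() is not affected by NTP steps or manual clock
    changes, so the budget cannot suddenly shrink or grow.
    """

    def __init__(self, budget_seconds: float) -> None:
        self._expires_at = time.monotonic() + budget_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - time.monotonic())

    def expired(self) -> bool:
        return self.remaining() == 0.0


def definitely_before(ts_a: float, ts_b: float, max_skew: float) -> bool:
    """True only if event a precedes event b even under worst-case skew.

    ts_a and ts_b are wall-clock timestamps taken on different nodes.
    """
    return ts_a + max_skew < ts_b


deadline = Deadline(budget_seconds=60.0)
print(deadline.expired())                           # False
print(definitely_before(10.000, 10.003, 0.005))     # False: inside the uncertainty window
print(definitely_before(10.000, 10.020, 0.005))     # True: gap exceeds the skew bound
```

When `definitely_before` returns False in both directions, the order of the two events is unknown from timestamps alone and has to come from another mechanism, such as a sequence number.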
Practical checklist
- Time offset/jitter metrics are visible for all production nodes.
- There is a runbook in case of massive clock drift and time-source failure.
- Timestamp logic is tested with artificial skew in integration/chaos tests.
- Services do not use the wall clock to measure SLA timeouts.
- Critical transactions have an independent ordering mechanism in addition to the wall-clock.
Frequent anti-pattern: using a wall-clock timestamp as the only source of event order.
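A small test with artificial skew makes this anti-pattern visible. The sketch below uses a hypothetical `SkewedClock` test double and a `Write` record carrying both a local wall-clock timestamp and a server-assigned sequence number; all names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    wall_clock: float  # timestamp from the writer's local clock
    seq: int           # hypothetical server-assigned sequence number


class SkewedClock:
    """Test double: reports the true time shifted by a fixed skew."""

    def __init__(self, skew_seconds: float) -> None:
        self.skew = skew_seconds

    def now(self, true_time: float) -> float:
        return true_time + self.skew


node_a = SkewedClock(skew_seconds=2.0)   # runs two seconds fast
node_b = SkewedClock(skew_seconds=0.0)   # accurate

# In true time, "old" is written at t=100 and "new" at t=101.
first = Write("old", node_a.now(100.0), seq=1)    # stamped 102.0
second = Write("new", node_b.now(101.0), seq=2)   # stamped 101.0

winner_by_wall_clock = max([first, second], key=lambda w: w.wall_clock)
winner_by_sequence = max([first, second], key=lambda w: w.seq)
print(winner_by_wall_clock.value, winner_by_sequence.value)  # old new
```

With last-write-wins on the wall clock, the stale value from the fast node silently overwrites the newer one; the independent sequence number preserves the real order.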
