Context
Frontend Architecture Overview
The collaborative editor case study shows the limits of frontend architecture in real-time multiplayer scenarios.
Designing a Google Docs-style collaborative editor requires accurate synchronization between clients, a predictable UX during network delays, and correct conflict resolution that never loses user edits.
Problem & Context
Functional requirements
- Collaborative editing of one document by several users in real time.
- Show cursors/highlights of other participants and presence indicators.
- Undo/redo on the client without destroying the global consistency of the document.
- Offline editing, with changes synchronized once the network is restored.
Non-functional requirements
- Latency for applying a remote change to the UI: preferably < 200ms.
- High resilience to temporary network partitions and reconnect storms.
- Preservation of the operation history and protection against loss of user input.
- Scalability across active documents and number of participants per document.
Scale assumptions
- Active documents: 5M+/day. Most documents are small, but there are heavy collaborative sessions.
- Concurrent editors per doc: 1-50 typical, 200 peak. The architecture should also work correctly in asymmetric sessions with a large number of participants.
- Ops throughput: 10k-50k ops/s global. Peak windows are usually associated with school/work hours across time zones.
- Reconnect burst: 3x baseline. After a local network failure, some clients try to send buffered operations simultaneously.
Related
Consistency & Idempotency
Collaborative editing depends directly on a correct consistency model and on safe handling of replayed operations.
Architecture
Realtime gateway
WebSocket/WebTransport layer for bidirectional delivery of operations and presence events.
Collaboration engine
Applies OT/CRDT rules, validates the document version, serializes operations and distributes them to subscribers.
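The accept path of such an engine can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes one in-memory document and insert-only operations, and all names (`CollabEngine`, `submit`, `baseVersion`) are hypothetical. To stay short, it rejects operations created against an old version instead of transforming them; an OT-based engine would transform them against the operations the client missed.

```typescript
// Hypothetical shape: a text insert tagged with the version the client last saw.
interface ClientOp {
  clientId: string;
  baseVersion: number; // document version the op was created against
  pos: number;
  text: string;
}

interface CommittedOp extends ClientOp {
  version: number; // server-assigned position in the total order
}

type Subscriber = (op: CommittedOp) => void;

type SubmitResult =
  | { ok: true; op: CommittedOp }
  | { ok: false; reason: "stale" | "out_of_range"; currentVersion: number };

class CollabEngine {
  private doc = "";
  private version = 0;
  private subscribers: Subscriber[] = [];

  subscribe(fn: Subscriber): void {
    this.subscribers.push(fn);
  }

  // Validate, serialize (assign the next version), apply, then fan out.
  submit(op: ClientOp): SubmitResult {
    if (op.baseVersion !== this.version) {
      // Simplification: the client must rebase and resubmit.
      return { ok: false, reason: "stale", currentVersion: this.version };
    }
    if (op.pos < 0 || op.pos > this.doc.length) {
      return { ok: false, reason: "out_of_range", currentVersion: this.version };
    }
    this.doc = this.doc.slice(0, op.pos) + op.text + this.doc.slice(op.pos);
    this.version += 1;
    const committed: CommittedOp = {
      clientId: op.clientId,
      baseVersion: op.baseVersion,
      pos: op.pos,
      text: op.text,
      version: this.version,
    };
    this.subscribers.forEach((fn) => fn(committed));
    return { ok: true, op: committed };
  }

  snapshot(): { text: string; version: number } {
    return { text: this.doc, version: this.version };
  }
}
```

Because every accepted operation receives exactly one version number, all subscribers observe the same total order regardless of arrival timing.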
Document state store
Stores snapshot + operation log; supports recovery, replay and point-in-time reconstruction.
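Point-in-time reconstruction from a snapshot plus an operation log could look like the sketch below. The `Snapshot` and `LoggedOp` shapes and the `reconstruct` function are assumptions made for illustration; the original does not specify a storage format. The sketch assumes server versions are contiguous integers.

```typescript
// Hypothetical log entry: plain-text insert or delete with a server version.
type LoggedOp =
  | { version: number; kind: "insert"; pos: number; text: string }
  | { version: number; kind: "delete"; pos: number; length: number };

interface Snapshot {
  version: number; // last op version folded into `text`
  text: string;
}

function applyOp(text: string, op: LoggedOp): string {
  if (op.kind === "insert") {
    return text.slice(0, op.pos) + op.text + text.slice(op.pos);
  }
  return text.slice(0, op.pos) + text.slice(op.pos + op.length);
}

// Rebuild the document as of `targetVersion` by replaying the log
// entries that come after the snapshot, in version order.
function reconstruct(
  snapshot: Snapshot,
  log: LoggedOp[],
  targetVersion: number
): Snapshot {
  if (targetVersion < snapshot.version) {
    throw new Error("target version predates the snapshot");
  }
  let text = snapshot.text;
  let version = snapshot.version;
  const ordered = log.slice().sort((a, b) => a.version - b.version);
  for (const op of ordered) {
    if (op.version <= snapshot.version) continue; // already in the snapshot
    if (op.version > targetVersion) break;
    if (op.version !== version + 1) {
      throw new Error(`gap in log at version ${version + 1}`);
    }
    text = applyOp(text, op);
    version = op.version;
  }
  return { version, text };
}
```

The replay cost grows with the number of log entries after the snapshot, which is the storage-versus-recovery balance that snapshot frequency controls.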
Client sync module
Local buffer of unconfirmed operations, ack tracking and reconciliation after reconnect.
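One possible shape for the local buffer and ack tracking is sketched below. It assumes client-assigned sequence numbers and cumulative acknowledgements (an ack for `n` confirms every operation up to `n`); both are illustrative assumptions, and `PendingBuffer` is a hypothetical name rather than part of any particular library.

```typescript
interface PendingOp<T> {
  seq: number; // client-local sequence number; can double as an idempotency key
  payload: T;
}

class PendingBuffer<T> {
  private nextSeq = 1;
  private pending: PendingOp<T>[] = [];

  // Record a locally applied operation and return the envelope to send.
  enqueue(payload: T): PendingOp<T> {
    const op: PendingOp<T> = { seq: this.nextSeq, payload };
    this.nextSeq += 1;
    this.pending.push(op);
    return op;
  }

  // Assumed cumulative ack: confirms every op with seq <= ackedSeq.
  // Duplicate or late acks are harmless because they remove nothing new.
  ack(ackedSeq: number): void {
    this.pending = this.pending.filter((op) => op.seq > ackedSeq);
  }

  // After a reconnect, resend everything still unconfirmed, in order.
  unacked(): PendingOp<T>[] {
    return this.pending.slice();
  }

  size(): number {
    return this.pending.length;
  }
}
```

On reconnect the client would resend `unacked()` in order; the server can use the sequence number to discard operations it has already applied, which keeps replays idempotent.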
Deep dives
OT vs CRDT
OT is usually easier to integrate into a centralized server pipeline; CRDT is a better fit for peer-to-peer and offline scenarios, but increases metadata overhead.
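To make the OT side concrete, here is a minimal sketch of a transform for two concurrent inserts. It covers inserts only (no deletes) and breaks ties at equal positions by comparing a `siteId`; the `Insert` type, the `siteId` field and the tie-break rule are assumptions chosen for illustration.

```typescript
interface Insert {
  pos: number;
  text: string;
  siteId: string; // hypothetical per-client id used only to break ties
}

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so it can be applied after `against` has already been applied.
// `op` shifts right when `against` inserted before it, or at the same
// position with a smaller siteId (a deterministic tie-break).
function transformInsert(op: Insert, against: Insert): Insert {
  const shifts =
    against.pos < op.pos ||
    (against.pos === op.pos && against.siteId < op.siteId);
  if (!shifts) return op;
  return { pos: op.pos + against.text.length, text: op.text, siteId: op.siteId };
}

// Two clients edit "abc" concurrently at the same position.
const base = "abc";
const opA: Insert = { pos: 1, text: "X", siteId: "A" };
const opB: Insert = { pos: 1, text: "Y", siteId: "B" };

// Replica 1 applies A first, then B transformed against A.
const replica1 = applyInsert(applyInsert(base, opA), transformInsert(opB, opA));
// Replica 2 applies B first, then A transformed against B.
const replica2 = applyInsert(applyInsert(base, opB), transformInsert(opA, opB));

console.log(replica1, replica2); // both print "aXYbc"
```

Both replicas reach the same text even though they applied the operations in different orders. A CRDT would reach the same guarantee by attaching a unique, ordered identifier to every character instead of transforming positions, which is where the extra metadata comes from.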
Ordering and causality
The server must deterministically serialize concurrent operations into a single order. The client transforms its pending operations against that order and tracks document versions so that replicas do not diverge.
Offline-first sync
Local operations are applied optimistically, then sent in batches. On conflict, the client rebases its pending operations onto the latest server state and re-validates the document.
Presence channel separation
Presence events (cursor/mouse/typing) are separated from document operations: you can drop them aggressively without losing document data.
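Because presence is lossy by design, a latest-wins buffer is one simple way to drop events aggressively. The sketch below keeps only the newest event per user between flushes; `PresenceCoalescer`, the event fields and the timestamp-based ordering are illustrative assumptions.

```typescript
interface PresenceEvent {
  userId: string;
  cursor: number; // hypothetical cursor offset in the document
  sentAt: number; // sender timestamp in ms, used to discard late arrivals
}

class PresenceCoalescer {
  private latest: { [userId: string]: PresenceEvent } = {};

  // Keep only the newest event per user; older or late events are dropped.
  push(ev: PresenceEvent): void {
    const prev = this.latest[ev.userId];
    if (prev === undefined || ev.sentAt >= prev.sentAt) {
      this.latest[ev.userId] = ev;
    }
  }

  // Called on a timer or animation frame: emits at most one event per user.
  flush(): PresenceEvent[] {
    const out = Object.keys(this.latest).map((id) => this.latest[id]);
    this.latest = {};
    return out;
  }
}
```

Document operations must never pass through a buffer like this, since every one of them has to be delivered and applied in order.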
Trade-offs
Strong server-side coordination simplifies consistency, but increases dependence on the central collab engine.
A full operation log is convenient for auditing and debugging, but can be expensive in terms of storage and replay latency.
More frequent snapshots speed up recovery, but increase write overhead on storage.
Aggressive compression of operations saves bandwidth, but complicates debugging and diagnosing incidents.
