File Sync (Dropbox / Google Drive)

File synchronization is not "upload a file to the cloud" — it is keeping one and the same view of a file tree across several devices at once: laptop, phone, web. Each device may edit files offline and then reconnect, and the sync engine must merge those changes carefully, move as little data as possible, and never lose anything silently.

The boundary with the adjacent case matters: storing the bytes is the job of object storage (immutable blocks, content addressing, cheap scaling), while here we design the sync engine itself. The unit of transfer is a block, not a file: a file is represented as a list of block hashes, delta sync sends only the changed blocks, and deduplication reuses identical blocks across versions and files.

The subtlest parts are metadata and conflicts. The tree, versions, and change log live in a strongly consistent metadata service kept separate from the block store, and operations run in the order "blocks first, commit second." On concurrent edits from different devices the system never loses data silently — it keeps a conflicted copy; automatic merging of edits is the realm of CRDTs and collaborative editing.

Durability

Replication and recovery must survive node, rack, and zone-level failures.

Metadata Control

Metadata/data separation and placement strategy shape cluster operability.

Hotspot Risk

Address skew with shard balancing, compaction strategy, and locality-aware routing.

Operating Cost

Compare storage tiers, network egress, and resilience overhead explicitly.

Boundary

Object storage is next door

This case is about syncing files across devices. Storing the actual blocks belongs to object storage, covered separately.

File synchronization is not “upload a file to the cloud.” It is keeping a consistent picture of the same file tree across several devices at once: laptop, phone, web. Each device may edit files offline and then reconnect — and the system must merge those changes carefully, move as little data over the network as possible, and never silently lose anything.

Draw the boundary up front. Storing the bytes is the job of object storage: immutable objects, content addressing, cheap scaling. Here we design the sync engine: what the unit of transfer is, how to compare device states, how to notify about changes, and what to do on concurrent edits. We use the block store as a backend and solve “bring N devices to one state.”

What we sync and which scenarios matter

Multi-device: one state on laptop, phone, and web; a new device must pull the current picture.
Offline edits: a file is edited without network; changes accumulate locally and are applied on reconnect.
Large files, small edits: when one paragraph changes in a hundred-megabyte document, sending the whole file is wasteful.
Sharing: a shared folder owned by several users with different access rights.

Functional requirements

Upload and download alone are not yet synchronization. The system has to detect what changed on each device, expose a change stream, and support shared access. Below is a working API outline and the reliability guarantees it must provide.

Core API

POST /files/upload — upload new or changed blocks
POST /metadata/commit — commit a new file state (block list and version)
GET /changes?cursor=… — change stream from a given cursor
GET /blocks/:hash — download a block by its hash
POST /share — grant folder access to another user

Sync reliability

Each device keeps a local change log and catches up to the server
An interrupted upload resumes instead of restarting
A conflict keeps both versions instead of silent loss
State converges: after a pause all devices reach one picture

Non-functional requirements and scale estimates

For a sync engine, peak QPS matters less than minimal traffic, correct merging, and predictable metadata consistency. The numbers below are order-of-magnitude anchors for back-of-the-envelope reasoning, not measured figures for any specific product; in an interview, state them as assumptions.

Requirement	Anchor	Why it matters
Minimal traffic	Transfer only changed blocks	Most edits touch a small part of a file — sending it whole is uneconomical
Reliability	Committed state is never lost	A crash between block upload and metadata commit must not corrupt the tree
Consistency	All devices converge	After a pause devices must reach the same view of the tree
Notification latency	Seconds from edit to push	Users expect an edit on one device to show up on another quickly

Estimating delta savings (order of magnitude)

Assume a user has a few devices, each holding tens of thousands of files, with only a small share changing daily.
If an edit typically touches 1–2 blocks out of dozens, delta sync cuts traffic by one to two orders of magnitude versus full re-upload.
On top of that comes deduplication: identical blocks (attachments, template copies) are stored and transferred once.
These numbers illustrate the estimation method — they are not measured figures of any specific service.
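
The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope helper: the function name delta_savings, the 4 MB block size, and the file sizes are assumptions chosen for illustration, and the calculation ignores metadata overhead, compression, and deduplication.

```python
import math


def delta_savings(file_size_mb: float, block_size_mb: float, changed_blocks: int) -> float:
    """Return how many times less data delta sync sends than a full re-upload.

    Assumes every changed block is sent at full block size (an upper bound)
    and ignores metadata, compression, and deduplication.
    """
    if changed_blocks < 1:
        raise ValueError("at least one block must change for an upload to happen")
    total_blocks = math.ceil(file_size_mb / block_size_mb)
    sent_blocks = min(changed_blocks, total_blocks)
    return file_size_mb / (sent_blocks * block_size_mb)


# Hypothetical inputs: 4 MB blocks, edits touching one or two blocks.
small_edit = delta_savings(file_size_mb=100, block_size_mb=4, changed_blocks=1)  # 25.0
wider_edit = delta_savings(file_size_mb=100, block_size_mb=4, changed_blocks=2)  # 12.5
large_file = delta_savings(file_size_mb=400, block_size_mb=4, changed_blocks=1)  # 100.0
```

With these assumed inputs the saving falls between roughly one and two orders of magnitude, which is the range stated above.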

Chunking and deduplication

The unit of transfer and storage is a block, not a file. A file is represented as an ordered list of block hashes (a blocklist). Then “what changed” is the diff of two lists, and identical blocks are reused across versions, files, and even users.

Backend

Blocks live in object storage

A block addressed by its content hash is an immutable object. Deduplication maps naturally onto content-addressable storage.

Fixed-size blocks

A file is cut into fixed-size chunks (for example, 4 MB at Dropbox).
Simple and fast; works well for uploading and resuming large files.
Weak spot — the shift problem: inserting a byte at the start moves every boundary, so nearly all blocks get new hashes.
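
The blocklist idea and the shift problem can be seen in a small sketch. The block size of 4 bytes is an assumption made purely so the example stays readable, and the helper names blocklist and new_blocks are hypothetical; SHA-256 is used because the Dropbox writeup cited in the references mentions SHA-256 blocklists.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; deliberately tiny for readability (real blocks are megabytes)


def blocklist(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Cut data into fixed-size blocks and return the ordered list of SHA-256 hashes."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def new_blocks(old: list[str], new: list[str]) -> list[str]:
    """Hashes present in the new blocklist but absent from the old one, without repeats."""
    known = set(old)
    return [h for h in dict.fromkeys(new) if h not in known]


original = b"aaaabbbbccccdddd"
overwritten = b"aaaaXbbbccccdddd"  # one byte replaced in place, length unchanged
shifted = b"Xaaaabbbbccccdddd"     # one byte inserted at the start

changed_in_place = new_blocks(blocklist(original), blocklist(overwritten))
changed_by_insert = new_blocks(blocklist(original), blocklist(shifted))
```

An in-place overwrite changes one block out of four, while a one-byte insertion at the start changes all five resulting blocks, because every boundary moved.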

Variable-length blocks (content-defined chunking)

Block boundaries are chosen by content: a Rabin fingerprint is computed over a rolling window of bytes, and a cut point is placed wherever the fingerprint satisfies a fixed condition (for example, its low-order bits equal a chosen value).
Inserting a byte changes boundaries only locally — neighbouring blocks keep their hashes and dedup stays high. This technique comes from LBFS (SOSP 2001).
The cost is more client-side computation and less predictable block size.
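
A minimal sketch of content-defined chunking follows. It is not the LBFS algorithm: a simple polynomial rolling hash stands in for a true Rabin fingerprint, the constants (WINDOW, BASE, MOD, DIVISOR) are arbitrary assumptions, and the minimum and maximum chunk sizes that production chunkers usually enforce are omitted so that boundaries depend only on the bytes inside the window.

```python
import hashlib

WINDOW = 16                       # bytes covered by the rolling hash
BASE = 257
MOD = 1_000_000_007
DIVISOR = 64                      # on random-looking data, roughly 1 position in 64 is a cut
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data at positions where the rolling hash of the last WINDOW bytes matches."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) % MOD  # drop the byte leaving the window
        h = (h * BASE + byte) % MOD                 # add the incoming byte
        if i + 1 >= WINDOW and h % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing chunk without a cut point
    return chunks


# Deterministic pseudo-random test data built from SHA-256 digests.
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(256))
shifted = b"X" + data             # insert one byte at the very start

original_chunks = cdc_chunks(data)
shifted_chunks = cdc_chunks(shifted)
known = set(original_chunks)
new_after_insert = [c for c in shifted_chunks if c not in known]
```

Because a cut depends only on the window contents, the inserted byte can disturb at most the chunks before the first original cut point; every later chunk is byte-identical and keeps its hash.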

Cross-user dedup privacy

Global dedup (one block for everyone) saves storage but opens a leak channel: a “block already exists” answer reveals whether someone owns a specific file. Dedup is therefore often scoped to a user/folder, and the existence check is kept unobservable from the client side.

Metadata versus data

The key architectural split: data blocks live in one service, structure in another. The metadata service holds the file tree, versions, permissions, and indexes; the block store holds the bytes themselves, addressed by hash. A file is a metadata record that references a block list.

Metadata service

Folder/file tree, paths, names, sizes, modification times.
File versions and their blocklist; stable identifiers that survive rename and move.
A per-user change log and a cursor by which a device catches up to the current state.
Needs strong consistency: the tree must not diverge between client and server.

Block store

Immutable blocks addressed by content hash.
Built on object storage — the adjacent case — with cheap scaling and replication.
Durability over latency: the store is tuned so that a block, once written and replicated, survives failures, even at the cost of slower writes.
Block deletion is lazy: reference-counted garbage collection rather than immediate removal.

This split yields an important invariant: upload blocks first, commit metadata second. If the commit never lands, the stray blocks are unreferenced and garbage collection reclaims them; the reverse order would produce references to data that was never uploaded. Dropbox's streaming-sync writeup describes a refinement of this scheme: a temporary intermediate metadata state that lets a downloading client begin pulling blocks before the final commit.
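
The ordering invariant can be illustrated with an in-memory sketch. All class and function names here (BlockStore, MetadataService, collect_garbage) are hypothetical. For brevity the collector finds live blocks by scanning metadata rather than maintaining reference counts, and it has no grace period for uploads still in flight, which a real system would need.

```python
import hashlib


class BlockStore:
    """In-memory stand-in for a content-addressed block store."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blocks[digest] = data  # idempotent: same content, same key
        return digest


class MetadataService:
    """Accepts a commit only if every referenced block is already stored."""

    def __init__(self, store: BlockStore) -> None:
        self.store = store
        self.files: dict[str, list[str]] = {}

    def commit(self, file_id: str, blocklist: list[str]) -> None:
        missing = [h for h in blocklist if h not in self.store.blocks]
        if missing:
            raise ValueError(f"commit rejected, {len(missing)} block(s) not uploaded")
        self.files[file_id] = list(blocklist)


def collect_garbage(store: BlockStore, meta: MetadataService) -> int:
    """Delete blocks that no committed blocklist references; return how many were removed."""
    live = {h for blocklist in meta.files.values() for h in blocklist}
    orphans = [h for h in store.blocks if h not in live]
    for h in orphans:
        del store.blocks[h]
    return len(orphans)


store = BlockStore()
meta = MetadataService(store)
committed = store.put(b"block that gets committed")
stray = store.put(b"block whose commit never lands")
meta.commit("file-1", [committed])
removed = collect_garbage(store, meta)
```

The stray block costs only temporary storage, whereas a commit that names a block nobody uploaded is rejected outright.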

Delta sync: what we send on an edit

On save, the client recomputes the file's blocklist and diffs it against the previous version. Only blocks not yet on the server are transferred; the rest is writing a new version into metadata. Here is the write path.

The client cuts the file into blocks and hashes them (fixed 4 MB or content-defined).
It asks the server which hashes already exist and uploads only the missing blocks.
Uploads can be compressed and resumed: if interrupted, continue from untransferred blocks.
After blocks are accepted, the client issues a commit of the new version into metadata — the file's pointer flips atomically to the new blocklist.
The old version stays available through history; blocks are reused across versions.

The downloading side is symmetric: given a new blocklist, it pulls only the blocks missing from its client cache. Often these are the same blocks already present in other files — no re-transfer needed.
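
The write path above can be sketched end to end against an in-memory server. SyncServer, its methods, and sync_file are hypothetical names, not the API listed earlier; the 4-byte block size is an assumption for readability, and the existence check is not scoped per user here, so the privacy concern around cross-user dedup is out of scope for this sketch.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; tiny for readability


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class SyncServer:
    """In-memory stand-in for the block store plus the metadata service."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}
        self.history: dict[str, list[list[str]]] = {}  # file_id -> blocklist per version

    def missing(self, hashes: list[str]) -> list[str]:
        """Which of these hashes are not stored yet (each reported once)."""
        return [h for h in dict.fromkeys(hashes) if h not in self.blocks]

    def upload(self, data: bytes) -> None:
        self.blocks[sha(data)] = data

    def commit(self, file_id: str, blocklist: list[str]) -> int:
        if any(h not in self.blocks for h in blocklist):
            raise ValueError("blocklist references blocks that were never uploaded")
        self.history.setdefault(file_id, []).append(list(blocklist))
        return len(self.history[file_id])  # new version number


def sync_file(server: SyncServer, file_id: str, data: bytes) -> int:
    """Upload only the blocks the server lacks, then commit; return blocks sent."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    by_hash = {sha(b): b for b in blocks}
    blocklist = [sha(b) for b in blocks]
    to_send = server.missing(blocklist)
    for h in to_send:
        server.upload(by_hash[h])
    server.commit(file_id, blocklist)  # blocks first, commit second
    return len(to_send)


server = SyncServer()
first_upload = sync_file(server, "doc", b"aaaabbbbcccc")   # all three blocks are new
second_upload = sync_file(server, "doc", b"aaaaBBBBcccc")  # only the middle block changed
```

The second save transfers a single block, and the earlier version stays reachable through the stored history.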

Change notification and convergence

The client should not poll the server in a tight loop. The usual approach is a lightweight notification channel (for example, long-poll or a persistent connection): the server responds when the user has a new change. On the signal, the client fetches the delta by cursor.

Change cursor

Each device stores its position in the log — the cursor. A /changes?cursor=… call returns everything new and advances the cursor.
Resumes cheaply after downtime: the device just catches up on the log tail.
The notification service only wakes the client; the delta is fetched in a separate request.
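
A cursor over an append-only log can be sketched as follows. Representing the cursor as a plain integer offset is an assumption made for simplicity (real services typically hand out opaque cursor tokens), and ChangeLog and Device are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Server-side append-only log of (file_id, version) changes."""
    entries: list = field(default_factory=list)

    def append(self, file_id: str, version: int) -> None:
        self.entries.append((file_id, version))

    def changes(self, cursor: int):
        """Return everything after the cursor, plus the advanced cursor."""
        return self.entries[cursor:], len(self.entries)


@dataclass
class Device:
    cursor: int = 0
    state: dict = field(default_factory=dict)  # file_id -> latest known version

    def catch_up(self, log: ChangeLog) -> int:
        delta, self.cursor = log.changes(self.cursor)
        for file_id, version in delta:
            self.state[file_id] = version
        return len(delta)


log = ChangeLog()
log.append("a.txt", 1)
log.append("b.txt", 1)

laptop, phone = Device(), Device()
laptop_first = laptop.catch_up(log)   # laptop is online and reads two entries
log.append("a.txt", 2)
laptop_second = laptop.catch_up(log)  # only the tail of the log
phone_first = phone.catch_up(log)     # phone was offline and reads everything
```

Both devices end in the same state even though they read the log in different batches, which is the convergence property discussed next.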

Convergence

The server is the source of truth for change order; clients reach one state by applying the log.
Across devices, eventual consistency is acceptable: a brief divergence is fine, guaranteed convergence is what matters.
Metadata within one account is still strongly consistent.

Conflicts on concurrent edits

Two devices edit the same file offline and reconnect with different versions descended from a common ancestor. The guiding principle: never lose data silently. The simplest workable answer to conflict resolution is the conflicted copy.

Conflicted copies (file sync)

The first edit to arrive is committed; the second, whose ancestor is stale, is saved as a separate “… (conflicted copy)” file.
This is a deliberate refusal to auto-merge arbitrary binary files — in general they cannot be merged correctly.
Last-write-wins is simpler but silently drops one of the edits — a bad default for files.
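
One way to detect the stale ancestor is to have each commit carry the version it was based on. The sketch below assumes integer versions per file and a hypothetical FileTree class; the naming pattern for the copy is illustrative and not taken from any specific product.

```python
class FileTree:
    """Maps file name to (version, content); a commit must name its base version."""

    def __init__(self) -> None:
        self.files: dict[str, tuple[int, bytes]] = {}

    def commit(self, name: str, base_version: int, content: bytes, device: str) -> str:
        current_version = self.files[name][0] if name in self.files else 0
        if base_version == current_version:
            # The edit descends from the latest version: accept it in place.
            self.files[name] = (current_version + 1, content)
            return name
        # Stale ancestor: keep the edit under a separate name instead of dropping it.
        copy_name = f"{name} ({device}'s conflicted copy)"
        suffix = 1
        while copy_name in self.files:  # never overwrite an earlier conflicted copy
            suffix += 1
            copy_name = f"{name} ({device}'s conflicted copy {suffix})"
        self.files[copy_name] = (1, content)
        return copy_name


tree = FileTree()
tree.commit("notes.txt", 0, b"shared ancestor", "laptop")               # version 1
laptop_result = tree.commit("notes.txt", 1, b"laptop edit", "laptop")   # arrives first
phone_result = tree.commit("notes.txt", 1, b"phone edit", "phone")      # stale ancestor
```

Both edits survive: the laptop's edit becomes version 2 of notes.txt, and the phone's edit lands in a separate conflicted-copy file for the user to reconcile.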

When merging is required (collaborative)

For collaborative document editing a conflicted copy is wrong — edits must merge automatically.
Then CRDTs or operational transformation run on top of the file layer, giving a convergent merge at the document-structure level.
This is a different case: a file sync engine and a collaborative editing engine solve different problems and often coexist.

Deep dive

CRDTs and collaborative editing

Where conflict resolution goes when edits must merge automatically instead of producing conflicted copies.

Deep dives

Metadata consistency

Stable file identifiers matter more than paths: moving a folder must not look like “delete and recreate.” In the writeup of its sync-engine rewrite (Nucleus), Dropbox highlights globally unique identifiers and strong consistency between client and server views as the foundation of correct synchronization.
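
A small sketch of an identifier-keyed tree shows why a move stays cheap. The Node and Tree classes and the literal identifiers are hypothetical; a real system would generate globally unique IDs and also guard against cycles and name collisions, which this sketch does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    parent_id: Optional[str]
    name: str


class Tree:
    """Nodes are keyed by stable ID; paths are derived by walking parent links."""

    def __init__(self) -> None:
        self.nodes = {"root": Node("root", None, "")}

    def add(self, node_id: str, parent_id: str, name: str) -> None:
        self.nodes[node_id] = Node(node_id, parent_id, name)

    def move(self, node_id: str, new_parent_id: str, new_name: Optional[str] = None) -> None:
        # A move or rename updates one record; descendants are untouched.
        node = self.nodes[node_id]
        node.parent_id = new_parent_id
        if new_name is not None:
            node.name = new_name

    def path(self, node_id: str) -> str:
        parts = []
        node = self.nodes[node_id]
        while node.parent_id is not None:
            parts.append(node.name)
            node = self.nodes[node.parent_id]
        return "/" + "/".join(reversed(parts))


tree = Tree()
tree.add("id-1", "root", "work")
tree.add("id-2", "id-1", "report.txt")
tree.add("id-3", "root", "archive")

path_before = tree.path("id-2")
tree.move("id-1", "id-3")  # move the whole "work" folder under "archive"
path_after = tree.path("id-2")
```

The file keeps its identifier and its parent link; only the folder's single record changed, so history and block references attached to the ID survive the move.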

Client cache

A local cache of blocks and metadata lets the client open files offline and avoid re-downloading what it already has. The cache is size-bounded and evicts rarely used blocks; selective sync keeps only part of the tree local.
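
A size-bounded block cache can be sketched with least-recently-used eviction. LRU is an assumption here, one common way to approximate "rarely used"; the BlockCache class and its byte budget are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class BlockCache:
    """Keeps blocks by hash up to a byte budget, evicting the least recently used."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._blocks = OrderedDict()  # block hash -> bytes, oldest first

    def get(self, block_hash: str) -> Optional[bytes]:
        data = self._blocks.get(block_hash)
        if data is not None:
            self._blocks.move_to_end(block_hash)  # mark as recently used
        return data

    def put(self, block_hash: str, data: bytes) -> None:
        if block_hash in self._blocks:
            self._blocks.move_to_end(block_hash)  # same hash means same content
            return
        self._blocks[block_hash] = data
        self.used_bytes += len(data)
        while self.used_bytes > self.capacity_bytes and self._blocks:
            _, evicted = self._blocks.popitem(last=False)  # drop the oldest entry
            self.used_bytes -= len(evicted)


cache = BlockCache(capacity_bytes=10)
cache.put("hash-a", b"12345")
cache.put("hash-b", b"12345")
cache.get("hash-a")            # touch "hash-a" so "hash-b" becomes the oldest
cache.put("hash-c", b"12345")  # over budget: "hash-b" is evicted
```

An evicted block is not lost, since the block store still holds it; the client simply downloads it again by hash when needed.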

Encryption

Blocks are encrypted in transit and on server disk. End-to-end encryption (keys only with the user) strengthens privacy but largely rules out cross-user dedup and server-side merging — an explicit trade-off between privacy and savings.

Sharing and permissions

A shared folder is a metadata record with access control: members, roles (read/write), permission inheritance. The shared folder's change log is delivered to all members, and revoking access must take effect immediately in their view of the tree.

Trade-offs and common mistakes

File as the unit of transfer: sending the whole file on any edit instantly loses the main benefit of delta sync.
Silent loss on conflict: applying last-write-wins to files and not keeping the second version.
Conflating metadata and data: storing the tree and the bytes in one service with one set of consistency requirements.
Wrong order: committing metadata before uploading blocks and ending up with references to missing data.
Polling instead of notifications: tight polling rather than a notification channel — wasted traffic and latency.
Ignoring dedup privacy: global dedup without regard to leaking the fact of file ownership.

What to make explicit in interviews

What the unit of transfer is: file, fixed block, or content-defined block — and why.
How metadata and data are split, and the operation order (blocks first, then commit).
How a device learns about changes: a notification channel plus a cursor over the log.
How offline-edit conflicts are resolved and where the line to collaborative editing runs.
Where dedup happens and how to avoid creating a leak channel for file ownership.

References

Source map: Dropbox 2014 supports streaming sync, blocks, and blocklists; Dropbox Nucleus supports globally unique identifiers and strong consistency in the sync engine; LBFS supports content-defined chunking and deduplication; ByteByteGo provides the interview framing. Block sizes, operation order, and conflict policies described in this chapter should not be copied blindly to every file-sync product.

Dropbox Engineering, "Streaming File Synchronization": 4 MB blocks, SHA-256 blocklists, prefetch and streaming sync (dropbox.tech, 2014)
Sujay Jayakar, "Rewriting the heart of our sync engine" (Nucleus): globally unique identifiers, strong consistency, testing concurrent sync (dropbox.tech, 2020)
MIT PDOS, "LBFS: A Low-bandwidth Network File System": content-defined chunking and cross-file deduplication (Muthitacharoen, Chen, Mazières, SOSP 2001)
Alex Xu, "System Design Interview" (ByteByteGo): interview walkthrough of file storage and sync (Google Drive / Dropbox)

Related chapters

Object Storage - Adjacent case: the backend for file blocks — immutable objects, content addressing, cheap storage scaling.
CRDTs and collaborative editing - Where conflict resolution goes when edits must merge automatically instead of producing conflicted copies.
CDN - Serving block downloads closer to the user and offloading read traffic from the block store.
Key-Value Database - Under the metadata service: file-tree indexes, versions, and a per-device change cursor.