A distributed file system matters when a cluster must store very large files and sustain heavy aggregate read and write throughput, not when it needs to serve millions of tiny random requests.
The chapter ties together the namespace, block placement, replication, write path, read path, and failure recovery into one working cluster architecture.
For interviews and architecture reviews, the case is useful because it forces a clear explanation of the control plane, the direct data path, read locality, and how replicas are spread across failure domains.
Cluster Throughput
DFS shines when large sequential reads and writes can be spread across many nodes and network links instead of bottlenecking on one machine.
Metadata and Journal
The namespace, block map, and edit journal must stay compact, reliable, and fast to recover after a master failure.
Hot Blocks
Popular files create load skew quickly unless you plan for nearest-replica reads, extra read copies, and steady background balancing.
Failure Recovery
Losing a disk, a node, or a whole rack must not stop the system; heartbeat handling, replica repair, and master recovery need a clear operating path.
Paper
The Google File System
Google's classic paper on designing a distributed file system for storing and processing large-scale data.
Distributed file systems matter when a cluster must store very large files and sustain heavy read and write throughput, not when it needs to serve millions of tiny random requests. Systems like GFS and HDFS split data into large blocks, keep several copies across the cluster, and separate metadata from the byte-serving path so the design can scale horizontally and survive routine hardware failures.
Requirements
For this kind of storage system, raw capacity is only one part of the story. You also need predictable throughput across the whole cluster and a clear availability story under partial failure.
Functional
Store very large files on top of a cluster of commodity servers.
Support append-heavy writes and large sequential reads for batch and analytics workloads.
Replicate blocks across nodes and recover automatically after failures.
Manage metadata: directory hierarchy, file-to-block mapping, and replica placement.
Run background work such as block rebalancing, replica repair, and cleanup of orphaned data.
Non-functional
Durability: 11+ nines in practice
Losing individual disks or nodes should not lead to user-visible data loss.
Throughput: High aggregate I/O
The system is optimized for large sequential reads and writes rather than tiny random I/O.
Scalability: Linear horizontal growth
Adding a DataNode should increase both cluster capacity and available bandwidth.
Availability: Partial-failure operation
The system should degrade predictably and remain available for the main read and write paths.
High-Level Architecture
Architecture Overview
High-level DFS layout in the spirit of GFS and HDFSControl Plane
Data Plane
The key pattern is to separate the control plane from the data path. The master owns placement, coordination, and recovery, while clients move bytes directly to and from DataNodes. Critical metadata changes are written to the journal first so the master can recover safely after a crash.
Write and Read Path
On writes, the client gets a lease and pushes the block through a replica chain. On reads, the system chooses the best healthy replica based on locality and rack placement so it avoids unnecessary cross-rack traffic.
Read and Write Path Explorer
Switch between flows and replay how a DFS write or read moves through the system.
Write path: key details
- The client asks the master to allocate a new block and choose the target DataNodes.
- The master selects nodes based on rack layout, free space, and current load.
- The client streams data into the chain DataNode1 -> DataNode2 -> DataNode3.
- Acknowledgements travel back along the same chain; the block is committed only after every replica is written.
- The master records the new block version and placement in the journal and checkpoint state.
Read path: key details
- The client gets the block map and the list of healthy replicas from the master.
- For each block, the system chooses the nearest healthy replica.
- Reads go directly from the DataNode, without proxying payload bytes through the master.
- If checksum verification fails, the client switches to another replica.
- The master stays on the metadata path only, which keeps it out of the main byte-serving bottleneck.
Failure Handling
A good failure plan must cover more than a dead node. It also needs safe recovery from checkpointed state and a response to hotspots on popular files and blocks.
- If a DataNode stops responding, the master marks it unavailable and starts replica repair.
- A corrupted block is detected by checksum validation and rebuilt from a healthy replica.
- After a master crash, the system recovers from the journal and checkpoint, then reapplies unfinished edits.
- During a network partition, HA schemes with active and standby master roles plus quorum logic keep the control path safe.
- Popular files and blocks often need extra read replicas, caches, and background balancing.
GFS vs HDFS
| Aspect | GFS | HDFS |
|---|---|---|
| Primary scenario | Large-scale data processing inside the Google ecosystem | Hadoop-based analytics and enterprise data platforms |
| Block size | Large chunks, historically 64 MB | Large blocks, typically 128 MB and above |
| Consistency model | Looser semantics tuned for append-heavy workloads | Classic write-once-read-many behavior |
| Master HA | Google-internal high-availability strategies | NameNode HA with active/standby roles and journal quorum |
What to discuss in an interview
Good answers explain how placement spans failure domains, why locality matters for both reads and cluster cost, and where DFS complements object storage instead of replacing it.
- Why large block size reduces metadata pressure and increases aggregate throughput.
- How placement policy balances locality, failure domains, and network cost.
- Why single-master designs were historically simpler and how to evolve them toward HA.
- How to validate integrity with checksums, background scrubbing, and automated repair.
- Where the boundary sits between DFS and object storage inside a data platform.
Key takeaways
- DFS shines when the system needs to store very large files and spread large sequential reads and writes across a cluster.
- The core architectural move is to separate metadata management from the byte-serving path so clients talk to DataNodes directly.
- Most of the operational complexity sits in replica placement, failure recovery, and hotspot mitigation.
Reference
HDFS Architecture Guide
Official HDFS architecture description: NameNode, DataNode, replication, and fault tolerance.
Related chapters
- Object Storage (S3) - helps compare DFS and object storage by access model, consistency guarantees, and operational trade-offs.
- Big Data - adds platform context where GFS and HDFS act as the core storage layer for batch and analytics pipelines.
- DDIA - strengthens the foundations around replication, partitioning, and fault tolerance in distributed storage systems.
- Database Internals - goes deeper into storage-engine internals and the split between the metadata layer and the data path.
