Paper
The Google File System
Google's classic paper on designing a distributed file system for big data.
A distributed file system is a basic building block for data platforms. The idea behind GFS/HDFS: split large files into large blocks, replicate them across the cluster, and keep metadata separate from the data path. This simplifies horizontal scaling and makes the system resilient to constant hardware failures.
Requirements
Functional
Storing very large files (gigabytes to terabytes) on a cluster of commodity servers.
Sequential append/read operations for batch and analytical workloads.
Block replication between nodes and automatic recovery from failures.
File system metadata: namespace, mapping file -> blocks -> replicas.
Background processes: rebalancing, re-replication, garbage collection of orphan blocks.
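The file -> blocks -> replicas mapping from the functional requirements can be sketched as a small in-memory model. This is a minimal illustration; the class and field names (`Namespace`, `BlockInfo`, `allocate_block`) are hypothetical, not the actual GFS/HDFS structures.

```python
from dataclasses import dataclass, field

@dataclass
class BlockInfo:
    block_id: int
    version: int              # bumped when a new write is granted on the block
    replicas: list[str]       # DataNode addresses holding a copy

@dataclass
class FileEntry:
    path: str
    blocks: list[BlockInfo] = field(default_factory=list)

class Namespace:
    """In-memory file -> blocks -> replicas mapping kept by the master."""
    def __init__(self):
        self.files: dict[str, FileEntry] = {}
        self._next_block_id = 0

    def create(self, path: str) -> FileEntry:
        entry = FileEntry(path)
        self.files[path] = entry
        return entry

    def allocate_block(self, path: str, replicas: list[str]) -> BlockInfo:
        block = BlockInfo(self._next_block_id, version=1, replicas=replicas)
        self._next_block_id += 1
        self.files[path].blocks.append(block)
        return block

ns = Namespace()
ns.create("/logs/events.log")
b = ns.allocate_block("/logs/events.log", ["dn1:50010", "dn2:50010", "dn3:50010"])
print(b.block_id, b.replicas)
```

Because this map lives entirely in the master's memory, its size scales with the number of blocks, which is one reason block sizes are kept large.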
Non-functional
Durability: 11+ nines in practice.
The loss of individual disks or nodes must not lead to data loss.
Throughput: High aggregate I/O
Optimized for large sequential reads/writes rather than small random I/O.
Scalability: Linear horizontal
Adding a DataNode should increase both capacity and bandwidth.
Availability: Operation during partial failures
The system degrades gracefully under partial failures but remains available for the majority of workloads.
High-Level Architecture
Architecture Map
[Diagram: DFS high-level structure (GFS/HDFS style), showing the control plane (master/NameNode) and the data plane (DataNodes).]
Key pattern: the metadata plane and the data plane are separated. The master controls placement and the namespace, but the bytes themselves flow directly between the client and the data nodes.
Read/Write Flow
Write Path: operational notes
- The client asks the master to allocate a new block and a list of target DataNodes.
- The master selects nodes according to the placement policy (rack-aware, free space, load).
- The client streams the chunk through the pipeline DataNode1 -> DataNode2 -> DataNode3.
- ACKs are returned back along the chain; the block is marked committed after all replicas have been written successfully.
- Metadata about the block and its version is recorded in the master's journal (edit log) and checkpoint.
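The pipeline and chained-ACK steps above can be sketched as follows. This is a toy model: each "DataNode" is just a dict, and real systems stream data in packets with leases and version checks rather than in one call.

```python
def write_block(data: bytes, pipeline: list[dict]) -> bool:
    """Push data through DataNode1 -> DataNode2 -> ... and collect ACKs.

    Returns True (commit) only if every replica persisted the block;
    if any node in the chain is down, the write fails and the client
    would go back to the master for a fresh pipeline."""
    for node in pipeline:
        if node.get("down"):
            return False          # pipeline breaks; no commit
        node.setdefault("blocks", []).append(data)
    return True                   # ACKs propagated back up the chain

dn1, dn2, dn3 = {}, {}, {}
committed = write_block(b"chunk-0", [dn1, dn2, dn3])
print(committed)  # True: block committed after all replicas wrote it
```

The key property the sketch preserves: commit is declared only after the whole chain acknowledges, so a reader never sees a block the pipeline only partially wrote.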
Read Path: operational notes
- The client receives a map of blocks and a list of available replicas from the master.
- For each block, the closest replica is selected (rack/locality aware).
- Reads go directly from the DataNode, without proxying the payload through the master.
- On a checksum mismatch, the client switches to another replica.
- The master participates only in the metadata plane, which keeps it off the data-path bottleneck.
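The checksum-then-failover behavior on the read path can be sketched with CRC32. The storage layout (a dict of `block_id -> (data, crc)`) is purely illustrative.

```python
import zlib

def store_block(node: dict, block_id: int, data: bytes) -> None:
    node[block_id] = (data, zlib.crc32(data))    # data kept with its checksum

def read_block(block_id: int, replicas: list[dict]) -> bytes:
    """Try the closest replica first; on checksum mismatch, fail over."""
    for node in replicas:                        # replicas ordered by locality
        data, stored_crc = node[block_id]
        if zlib.crc32(data) == stored_crc:
            return data
        # corrupted copy: skip it and try the next replica
    raise IOError(f"all replicas corrupted for block {block_id}")

good, bad = {}, {}
store_block(good, 7, b"payload")
bad[7] = (b"payXoad", zlib.crc32(b"payload"))    # simulate bit rot on one replica
print(read_block(7, [bad, good]))                # falls back to the healthy copy
```

In real systems the corrupted replica would also be reported back so that background repair can re-replicate from a healthy copy.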
Failure Handling
- DataNode timeout: master marks the node dead and schedules re-replication of its blocks.
- Corrupted block: detected by checksum; repair is launched from a healthy replica.
- Master crash: recovery from journal + checkpoint, then replay edits.
- Network partition: quorum/HA mechanics for active/standby master (in modern implementations).
- Hot file/block: read replicas and cache tiers reduce hotspots on popular data.
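The first failure case above (DataNode timeout triggering re-replication) can be sketched as a scan the master runs over heartbeats and the block map. Names and the timeout value are illustrative; real systems use much longer timeouts with grace periods.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; purely illustrative

def find_under_replicated(last_heartbeat: dict[str, float],
                          block_map: dict[int, list[str]],
                          now: float,
                          target: int = 3) -> dict[int, int]:
    """Return {block_id: missing_replica_count} after declaring nodes dead.

    A node is dead if its last heartbeat is older than the timeout;
    every block whose live replica count drops below the target gets
    queued for re-replication."""
    dead = {n for n, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT}
    pending = {}
    for block_id, nodes in block_map.items():
        alive = [n for n in nodes if n not in dead]
        if len(alive) < target:
            pending[block_id] = target - len(alive)
    return pending

now = 100.0
heartbeats = {"dn1": 99.0, "dn2": 50.0, "dn3": 98.0}   # dn2 has timed out
blocks = {1: ["dn1", "dn2", "dn3"], 2: ["dn1", "dn3"]}
print(find_under_replicated(heartbeats, blocks, now))
```

Block 1 loses a replica to the dead node, and block 2 was already under-replicated, so both end up in the repair queue.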
GFS vs HDFS (in short)
| Aspect | GFS | HDFS |
|---|---|---|
| Main use-case | Big data processing within the Google ecosystem | Hadoop analytics ecosystem in enterprise |
| Block size | Large chunks (classically 64MB) | Large blocks (usually 128MB+) |
| Consistency model | Relaxed, append-heavy semantics | Write-once-read-many (classical) |
| Master HA | Internal strategies (shadow masters for read-only access) | NameNode HA (active/standby + journal quorum) |
What to discuss in an interview
- Why a large block size reduces metadata overhead and increases throughput.
- How the placement policy balances locality, fault domains, and network cost.
- Why a single master is historically simpler, and how to evolve it toward HA.
- How to verify integrity: checksums, scrubbing, background repair.
- Where the boundary lies between a DFS and object storage, and how they combine in a data platform.
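The first interview point (block size vs. metadata overhead) comes down to simple arithmetic. The sketch below assumes roughly 150 bytes of master memory per block object, a commonly cited HDFS rule of thumb; the exact figure varies by version.

```python
PB = 10**15

def metadata_bytes(total_bytes: int, block_size: int,
                   bytes_per_block: int = 150) -> int:
    """Estimate master memory for the block map at a given block size."""
    num_blocks = total_bytes // block_size
    return num_blocks * bytes_per_block

for mb in (64, 128, 256):
    cost = metadata_bytes(1 * PB, mb * 2**20)
    print(f"{mb:>3} MB blocks -> {cost / 2**30:.2f} GiB of master metadata for 1 PB")
```

Doubling the block size halves the number of block objects and thus the master's memory footprint, which is why GFS chose 64 MB chunks and HDFS defaults have grown to 128 MB and beyond.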
Reference
HDFS Architecture Guide
Official HDFS design description: NameNode, DataNode, replication, and fault tolerance.
