System Design Space
Knowledge graphSettings

Updated: April 9, 2026 at 12:52 PM

Distributed File System (GFS/HDFS)

medium

Classic task: store large files across a cluster, separate metadata from the data path, and design for replication, failure recovery, and high aggregate I/O.

A distributed file system matters when a cluster must store very large files and sustain heavy aggregate read and write throughput, not when it needs to serve millions of tiny random requests.

The chapter ties together the namespace, block placement, replication, write path, read path, and failure recovery into one working cluster architecture.

For interviews and architecture reviews, the case is useful because it forces a clear explanation of the control plane, the direct data path, read locality, and how replicas are spread across failure domains.

Cluster Throughput

DFS shines when large sequential reads and writes can be spread across many nodes and network links instead of bottlenecking on one machine.

Metadata and Journal

The namespace, block map, and edit journal must stay compact, reliable, and fast to recover after a master failure.

Hot Blocks

Popular files create load skew quickly unless you plan for nearest-replica reads, extra read copies, and steady background balancing.

Failure Recovery

Losing a disk, a node, or a whole rack must not stop the system; heartbeat handling, replica repair, and master recovery need a clear operating path.

Paper

The Google File System

Google's classic paper on designing a distributed file system for storing and processing large-scale data.

Open publication

Distributed file systems matter when a cluster must store very large files and sustain heavy read and write throughput, not when it needs to serve millions of tiny random requests. Systems like GFS and HDFS split data into large blocks, keep several copies across the cluster, and separate metadata from the byte-serving path so the design can scale horizontally and survive routine hardware failures.

Requirements

For this kind of storage system, raw capacity is only one part of the story. You also need predictable throughput across the whole cluster and a clear availability story under partial failure.

Functional

Store very large files on top of a cluster of commodity servers.

Support append-heavy writes and large sequential reads for batch and analytics workloads.

Replicate blocks across nodes and recover automatically after failures.

Manage metadata: directory hierarchy, file-to-block mapping, and replica placement.

Run background work such as block rebalancing, replica repair, and cleanup of orphaned data.

Non-functional

Durability: 11+ nines in practice

Losing individual disks or nodes should not lead to user-visible data loss.

Throughput: High aggregate I/O

The system is optimized for large sequential reads and writes rather than tiny random I/O.

Scalability: Linear horizontal growth

Adding a DataNode should increase both cluster capacity and available bandwidth.

Availability: Partial-failure operation

The system should degrade predictably and remain available for the main read and write paths.

High-Level Architecture

Architecture Overview

High-level DFS layout in the spirit of GFS and HDFS

Control Plane

Master
namespace and block map
Metadata Journal
edits and checkpoints
Replication Manager
replica repair
Cluster Balancer
hotspot rebalancing

Data Plane

Client
block read and write
DataNode A
block replicas
DataNode B
block replicas
DataNode C
block replicas
DFS overview: the metadata layer is separated from the byte path, with background repair and balancing around it.
Metadata is separated from the byte path
Clients talk to DataNodes directly
Background repair and balancing stay active

The key pattern is to separate the control plane from the data path. The master owns placement, coordination, and recovery, while clients move bytes directly to and from DataNodes. Critical metadata changes are written to the journal first so the master can recover safely after a crash.

Write and Read Path

On writes, the client gets a lease and pushes the block through a replica chain. On reads, the system chooses the best healthy replica based on locality and rack placement so it avoids unnecessary cross-rack traffic.

Read and Write Path Explorer

Switch between flows and replay how a DFS write or read moves through the system.

1
Client
request block placement
2
Master
placement and lease
3
Primary DataNode
write stream
4
Replica Chain
DN2 -> DN3 with ACK back
5
Journal Commit
metadata and version
Write path: press Play to step through block placement, replica streaming, and metadata commit.

Write path: key details

  • The client asks the master to allocate a new block and choose the target DataNodes.
  • The master selects nodes based on rack layout, free space, and current load.
  • The client streams data into the chain DataNode1 -> DataNode2 -> DataNode3.
  • Acknowledgements travel back along the same chain; the block is committed only after every replica is written.
  • The master records the new block version and placement in the journal and checkpoint state.

Read path: key details

  • The client gets the block map and the list of healthy replicas from the master.
  • For each block, the system chooses the nearest healthy replica.
  • Reads go directly from the DataNode, without proxying payload bytes through the master.
  • If checksum verification fails, the client switches to another replica.
  • The master stays on the metadata path only, which keeps it out of the main byte-serving bottleneck.

Failure Handling

A good failure plan must cover more than a dead node. It also needs safe recovery from checkpointed state and a response to hotspots on popular files and blocks.

  • If a DataNode stops responding, the master marks it unavailable and starts replica repair.
  • A corrupted block is detected by checksum validation and rebuilt from a healthy replica.
  • After a master crash, the system recovers from the journal and checkpoint, then reapplies unfinished edits.
  • During a network partition, HA schemes with active and standby master roles plus quorum logic keep the control path safe.
  • Popular files and blocks often need extra read replicas, caches, and background balancing.

GFS vs HDFS

AspectGFSHDFS
Primary scenarioLarge-scale data processing inside the Google ecosystemHadoop-based analytics and enterprise data platforms
Block sizeLarge chunks, historically 64 MBLarge blocks, typically 128 MB and above
Consistency modelLooser semantics tuned for append-heavy workloadsClassic write-once-read-many behavior
Master HAGoogle-internal high-availability strategiesNameNode HA with active/standby roles and journal quorum

What to discuss in an interview

Good answers explain how placement spans failure domains, why locality matters for both reads and cluster cost, and where DFS complements object storage instead of replacing it.

  • Why large block size reduces metadata pressure and increases aggregate throughput.
  • How placement policy balances locality, failure domains, and network cost.
  • Why single-master designs were historically simpler and how to evolve them toward HA.
  • How to validate integrity with checksums, background scrubbing, and automated repair.
  • Where the boundary sits between DFS and object storage inside a data platform.

Key takeaways

  • DFS shines when the system needs to store very large files and spread large sequential reads and writes across a cluster.
  • The core architectural move is to separate metadata management from the byte-serving path so clients talk to DataNodes directly.
  • Most of the operational complexity sits in replica placement, failure recovery, and hotspot mitigation.

Reference

HDFS Architecture Guide

Official HDFS architecture description: NameNode, DataNode, replication, and fault tolerance.

Open documentation

Related chapters

  • Object Storage (S3) - helps compare DFS and object storage by access model, consistency guarantees, and operational trade-offs.
  • Big Data - adds platform context where GFS and HDFS act as the core storage layer for batch and analytics pipelines.
  • DDIA - strengthens the foundations around replication, partitioning, and fault tolerance in distributed storage systems.
  • Database Internals - goes deeper into storage-engine internals and the split between the metadata layer and the data path.

Enable tracking in Settings