System Design Space

Updated: March 2, 2026 at 9:20 AM

Distributed File System (GFS/HDFS)


Classic task: master/data nodes, block replication, metadata management, recovery and high throughput I/O.

Paper

The Google File System

Google's classic work on designing a distributed file system for big data.


A Distributed File System is a basic building block for data platforms. The idea behind GFS/HDFS: split large files into large blocks, replicate them across the cluster, and keep metadata separate from the data path. This simplifies horizontal scaling and makes the system resilient to frequent hardware failures.
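The block-splitting idea above can be sketched in a few lines. The 128 MB default below is HDFS-style; the helper name is illustrative, not a real API:

```python
# Sketch: how a DFS splits a file into fixed-size blocks.
# BLOCK_SIZE mirrors the HDFS-style default; split_into_blocks is hypothetical.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size of each block a file of file_size bytes occupies."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB tail block.
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))  # 3
```

Each block is then replicated independently, which is what lets the cluster place and repair them without touching whole files.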

Requirements

Functional

Storing very large files (gigabytes/terabytes) on top of a cluster of commodity servers.

Sequential append/read operations for batch and analytical workloads.

Block replication between nodes and automatic recovery from failures.

File system metadata: namespace, mapping file -> blocks -> replicas.

Background processes: rebalancing, re-replication, garbage collection of orphan blocks.
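The metadata requirement (namespace, file -> blocks -> replicas) can be sketched as two small records on the master. The class and field names here are illustrative, not the actual GFS/HDFS structures:

```python
# Sketch of master metadata: namespace plus file -> blocks -> replica locations.
# Block/INode and the "dn-*" node names are hypothetical, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Block:
    block_id: int
    replicas: list[str] = field(default_factory=list)  # DataNode addresses

@dataclass
class INode:
    path: str
    blocks: list[Block] = field(default_factory=list)

namespace: dict[str, INode] = {}
f = INode("/logs/2026-03-02.log",
          [Block(1, ["dn-a", "dn-b", "dn-c"]), Block(2, ["dn-b", "dn-c", "dn-d"])])
namespace[f.path] = f

# Resolving a read: path -> block list -> replica locations; no data bytes touched.
print(namespace["/logs/2026-03-02.log"].blocks[0].replicas)
```

Note that resolution stays entirely in the metadata plane; the client contacts DataNodes afterwards for the bytes.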

Non-functional

Durability: 11+ nines (in practice)

The loss of individual disks/nodes should not lead to data loss.

Throughput: High aggregate I/O

Optimized for large sequential reads/writes rather than small random I/O.

Scalability: Linear horizontal

Adding a DataNode should increase both capacity and bandwidth.

Availability: Operation during partial failures

The system degrades gracefully under partial failures but remains available for most scenarios.
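The durability claim can be motivated with a back-of-envelope model: a block is lost only if every replica dies before re-replication finishes. All numbers below (annual failure rate, repair window) are assumptions for illustration, not measured values:

```python
# Back-of-envelope durability for replication factor 3.
# afr and repair_hours are assumed inputs, not real fleet statistics.
afr = 0.02                      # assumed annual disk failure rate (~2%)
repair_hours = 1.0              # assumed time to re-replicate a lost copy
p_fail_in_window = afr * repair_hours / (365 * 24)

replicas = 3
# One replica fails, and both remaining replicas also die inside the repair window:
p_block_loss = afr * p_fail_in_window ** (replicas - 1)
print(f"per-block annual loss probability ~ {p_block_loss:.2e}")
```

Even with crude inputs the loss probability lands many orders of magnitude below a single disk's failure rate, which is why fast background re-replication matters as much as the replica count.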

High-Level Architecture

Architecture Map

DFS high-level structure (GFS/HDFS style)

Control Plane

  • Master: namespace + block map
  • Metadata Journal: edits + checkpoints
  • Replication Manager: repair replicas
  • Cluster Balancer: hotspot rebalance

Data Plane

  • Client: append/read API
  • DataNode A: block replicas
  • DataNode B: block replicas
  • DataNode C: block replicas

DFS overview: the metadata plane is separated from the data plane, with background control processes running alongside.

  • Metadata and data paths are separated
  • Direct client-to-DataNode I/O
  • Background repair and balancing

Key pattern: the metadata plane and the data plane are separated. The master controls placement and the namespace, but the bytes themselves flow directly between the client and the data nodes.

Read/Write Flow

Read/Write Path Explorer


  1. Client: request block allocation
  2. Master: placement + lease
  3. Primary DataNode: stream chunk
  4. Replica Pipeline: DN2 -> DN3 + ack chain
  5. Journal Commit: metadata + version

Write path: block write pipeline followed by the metadata commit.

Write Path: operational notes

  • The client asks the master to allocate a new block and provide a list of target DataNodes.
  • Master selects nodes according to placement policy (rack-aware, free space, load).
  • The client streams the chunk into the pipeline DataNode1 -> DataNode2 -> DataNode3.
  • ACKs are returned back along the chain; the block is marked committed only after all replicas have been written successfully.
  • Metadata for the block and its version is recorded in the master's journal and periodic checkpoints.
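The pipeline-and-ack flow above can be modeled as a tiny recursion: each node stores the chunk, forwards it downstream, and the ACK propagates back up. This is a toy illustration of the flow, not the real GFS/HDFS protocol:

```python
# Toy model of the replica pipeline: client -> DN1 -> DN2 -> DN3, ACKs back.
# pipeline_write and the dn* dicts are hypothetical stand-ins for DataNodes.

def pipeline_write(chunk: bytes, nodes: list[dict]) -> bool:
    """Write chunk through nodes[0] -> nodes[1] -> ...; True once all ACK."""
    if not nodes:
        return True  # end of chain: the furthest replica ACKs first
    head, rest = nodes[0], nodes[1:]
    head.setdefault("blocks", []).append(chunk)   # store the chunk locally
    return pipeline_write(chunk, rest)            # forward, then ACK back up

dn1, dn2, dn3 = {}, {}, {}
committed = pipeline_write(b"chunk-0", [dn1, dn2, dn3])
print(committed, len(dn2["blocks"]))  # True 1
```

The key property the model shows: the client sends the bytes once, and commit only succeeds after the whole chain has stored the chunk.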

Read Path: operational notes

  • The client receives a block map and a list of available replicas from the master.
  • For each block, the closest replica is selected (rack/locality aware).
  • Reading goes directly from the DataNode, without proxying the payload through the master.
  • When checksum mismatch occurs, the client switches to another replica.
  • Master participates only in the metadata plane, which reduces the bottleneck.
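The read-side logic (nearest replica first, checksum fallback on mismatch) can be sketched as follows. `zlib.crc32` stands in for the real per-block checksums, and `read_block` is an illustrative helper, not an actual client API:

```python
# Read-path sketch: try the nearest replica, verify the checksum,
# fall back to the next replica on mismatch.
import zlib

def read_block(replicas: list[tuple[str, bytes]], expected_crc: int) -> bytes:
    """replicas: (location, data) pairs, already sorted nearest-first."""
    for location, data in replicas:
        if zlib.crc32(data) == expected_crc:
            return data  # healthy replica wins
        # corrupted copy: skip it and try the next replica
    raise IOError("all replicas failed checksum verification")

good = b"block-data"
crc = zlib.crc32(good)
replicas = [("same-rack-dn", b"corrupted!"), ("remote-dn", good)]
print(read_block(replicas, crc) == good)  # True
```

A real client would also report the corrupted replica back to the master so background repair can replace it.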

Failure Handling

  • DataNode timeout: master marks the node dead and schedules re-replication of its blocks.
  • Corrupted block: detected by checksum; repair is launched from a healthy replica.
  • Master crash: recovery from journal + checkpoint, then replay edits.
  • Network partition: quorum/HA mechanics for active/standby master (in modern implementations).
  • Hot file/block: read replicas and cache tiers reduce hotspots on popular data.
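The first bullet, re-replication after a DataNode death, amounts to a scan for under-replicated blocks plus a placement decision. A minimal sketch, with an illustrative target factor, node names, and a naive "first live candidate" placement instead of a real rack-aware policy:

```python
# Failure-handling sketch: when a DataNode is declared dead, find every block
# that lost a replica and schedule a copy onto a healthy node.
# TARGET_REPLICAS, handle_dead_node, and the "dn-*" names are hypothetical.
TARGET_REPLICAS = 3

def handle_dead_node(dead: str, block_map: dict[int, set[str]],
                     live_nodes: set[str]) -> list[tuple[int, str]]:
    plan = []
    for block_id, holders in block_map.items():
        holders.discard(dead)
        if len(holders) < TARGET_REPLICAS:
            # naive placement: any live node not already holding the block
            candidates = live_nodes - holders
            if candidates:
                target = sorted(candidates)[0]
                plan.append((block_id, target))
                holders.add(target)
    return plan

block_map = {1: {"dn-a", "dn-b", "dn-c"}, 2: {"dn-a", "dn-d", "dn-e"}}
plan = handle_dead_node("dn-a", block_map, {"dn-b", "dn-c", "dn-d", "dn-e", "dn-f"})
print(plan)  # each affected block gets one new replica scheduled
```

A production system would additionally rate-limit this repair traffic and respect fault domains when choosing targets.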

GFS vs HDFS (in short)

Aspect            | GFS                                             | HDFS
Main use-case     | Big data processing within the Google ecosystem | Hadoop analytics ecosystem in enterprise
Block size        | Large chunks (classically 64MB)                 | Large blocks (usually 128MB+)
Consistency model | Relaxed, append-heavy semantics                 | Write-once-read-many (classical)
Master HA         | Google-internal HA strategies                   | NameNode HA (active/standby + journal quorum)

What to discuss in an interview

  • Why large block size reduces metadata overhead and increases throughput.
  • How placement policy balances locality, fault domains and network cost.
  • Why single-master is historically simpler, and how to evolve to HA.
  • How to check integrity: checksums, scrubbing, background repair.
  • Where the boundary between a DFS and object storage lies, and how they combine in a data platform.
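The first interview point, why large blocks shrink metadata overhead, is easy to quantify: the master keeps an entry per block, so for the same data volume a bigger block means far fewer entries. A quick calculation (the helper is illustrative):

```python
# Why large blocks matter: metadata entries per petabyte at various block sizes.
# block_count is a hypothetical helper; 1 PB taken as 10^15 bytes.
PB = 10 ** 15

def block_count(data_bytes: int, block_size_mb: int) -> int:
    block = block_size_mb * 1024 * 1024
    return -(-data_bytes // block)  # ceiling division

for mb in (4, 64, 128):
    print(f"{mb:>4} MB blocks -> {block_count(PB, mb):,} entries for 1 PB")
```

Going from 4 MB to 128 MB blocks cuts the master's per-petabyte metadata by about 32x, which is the core reason GFS/HDFS chose large blocks despite the cost of internal fragmentation for small files.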

Reference

HDFS Architecture Guide

Official HDFS Design Description: NameNode, DataNode, replication and fault tolerance.



© 2026 Alexander Polomodov