Paper
The Google File System
Google's classic paper on designing a distributed file system for big data.
A distributed file system is a basic building block for data platforms. The idea behind GFS/HDFS: split large files into large blocks, replicate them across the cluster, and keep metadata separate from the data path. This simplifies horizontal scaling and makes the system resilient to constant hardware failures.
Requirements
Functional
Storing very large files (gigabytes to terabytes) on a cluster of commodity servers.
Sequential append/read operations for batch and analytical workloads.
Block replication between nodes and automatic recovery from failures.
File system metadata: namespace, mapping file -> blocks -> replicas.
Background processes: rebalancing, re-replication, garbage collection of orphan blocks.
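The file -> blocks -> replicas mapping from the functional requirements can be sketched as a small in-memory model. This is a minimal illustration; the class and field names (`Namespace`, `BlockInfo`, `allocate_block`) are hypothetical, not the actual GFS/HDFS structures.

```python
from dataclasses import dataclass, field

@dataclass
class BlockInfo:
    block_id: int
    version: int              # bumped when a new write is granted on the block
    replicas: list[str]       # DataNode addresses holding a copy

@dataclass
class FileEntry:
    path: str
    blocks: list[BlockInfo] = field(default_factory=list)

class Namespace:
    """In-memory file -> blocks -> replicas mapping kept by the master."""
    def __init__(self):
        self.files: dict[str, FileEntry] = {}
        self._next_block_id = 0

    def create(self, path: str) -> FileEntry:
        entry = FileEntry(path)
        self.files[path] = entry
        return entry

    def allocate_block(self, path: str, replicas: list[str]) -> BlockInfo:
        block = BlockInfo(self._next_block_id, version=1, replicas=replicas)
        self._next_block_id += 1
        self.files[path].blocks.append(block)
        return block

ns = Namespace()
ns.create("/logs/events.log")
b = ns.allocate_block("/logs/events.log", ["dn1:50010", "dn2:50010", "dn3:50010"])
print(b.block_id, b.replicas)
```

Because this map lives entirely in the master's memory, its size scales with the number of blocks, which is one reason block sizes are kept large.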
Non-functional
Durability: 11+ nines in practice.
The loss of individual disks or nodes must not lead to data loss.
Throughput: High aggregate I/O
Optimized for large sequential reads/writes rather than small random I/O.
Scalability: Linear horizontal
Adding a DataNode should increase both capacity and bandwidth.
Availability: Operation during partial failures
The system degrades gracefully under partial failures but remains available for the majority of workloads.
High-Level Architecture
Architecture Map
[Diagram: DFS high-level structure (GFS/HDFS style), showing the control plane (master/NameNode) and the data plane (DataNodes).]
Key pattern: the metadata plane and the data plane are separated. The master controls placement and the namespace, but the bytes themselves flow directly between the client and the data nodes.
Read/Write Flow
Write Path: operational notes
- The client asks the master to allocate a new block and a list of target DataNodes.
- The master selects nodes according to the placement policy (rack-aware, free space, load).
- The client streams the chunk through the pipeline DataNode1 -> DataNode2 -> DataNode3.
- ACKs are returned back along the chain; the block is marked committed after all replicas have been written successfully.
- Metadata about the block and its version is recorded in the master's journal (edit log) and checkpoint.
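The pipeline and chained-ACK steps above can be sketched as follows. This is a toy model: each "DataNode" is just a dict, and real systems stream data in packets with leases and version checks rather than in one call.

```python
def write_block(data: bytes, pipeline: list[dict]) -> bool:
    """Push data through DataNode1 -> DataNode2 -> ... and collect ACKs.

    Returns True (commit) only if every replica persisted the block;
    if any node in the chain is down, the write fails and the client
    would go back to the master for a fresh pipeline."""
    for node in pipeline:
        if node.get("down"):
            return False          # pipeline breaks; no commit
        node.setdefault("blocks", []).append(data)
    return True                   # ACKs propagated back up the chain

dn1, dn2, dn3 = {}, {}, {}
committed = write_block(b"chunk-0", [dn1, dn2, dn3])
print(committed)  # True: block committed after all replicas wrote it
```

The key property the sketch preserves: commit is declared only after the whole chain acknowledges, so a reader never sees a block the pipeline only partially wrote.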
Read Path: operational notes
- The client receives a map of blocks and a list of available replicas from the master.
- For each block, the closest replica is selected (rack/locality aware).
- Reads go directly from the DataNode, without proxying the payload through the master.
- On a checksum mismatch, the client switches to another replica.
- The master participates only in the metadata plane, which keeps it off the data-path bottleneck.
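The checksum-then-failover behavior on the read path can be sketched with CRC32. The storage layout (a dict of `block_id -> (data, crc)`) is purely illustrative.

```python
import zlib

def store_block(node: dict, block_id: int, data: bytes) -> None:
    node[block_id] = (data, zlib.crc32(data))    # data kept with its checksum

def read_block(block_id: int, replicas: list[dict]) -> bytes:
    """Try the closest replica first; on checksum mismatch, fail over."""
    for node in replicas:                        # replicas ordered by locality
        data, stored_crc = node[block_id]
        if zlib.crc32(data) == stored_crc:
            return data
        # corrupted copy: skip it and try the next replica
    raise IOError(f"all replicas corrupted for block {block_id}")

good, bad = {}, {}
store_block(good, 7, b"payload")
bad[7] = (b"payXoad", zlib.crc32(b"payload"))    # simulate bit rot on one replica
print(read_block(7, [bad, good]))                # falls back to the healthy copy
```

In real systems the corrupted replica would also be reported back so that background repair can re-replicate from a healthy copy.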
Failure Handling
- DataNode timeout: master marks the node dead and schedules re-replication of its blocks.
- Corrupted block: detected by checksum; repair is launched from a healthy replica.
- Master crash: recovery from journal + checkpoint, then replay edits.
- Network partition: quorum/HA mechanics for active/standby master (in modern implementations).
- Hot file/block: read replicas and cache tiers reduce hotspots on popular data.
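The first failure case above (DataNode timeout triggering re-replication) can be sketched as a scan the master runs over heartbeats and the block map. Names and the timeout value are illustrative; real systems use much longer timeouts with grace periods.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; purely illustrative

def find_under_replicated(last_heartbeat: dict[str, float],
                          block_map: dict[int, list[str]],
                          now: float,
                          target: int = 3) -> dict[int, int]:
    """Return {block_id: missing_replica_count} after declaring nodes dead.

    A node is dead if its last heartbeat is older than the timeout;
    every block whose live replica count drops below the target gets
    queued for re-replication."""
    dead = {n for n, ts in last_heartbeat.items() if now - ts > HEARTBEAT_TIMEOUT}
    pending = {}
    for block_id, nodes in block_map.items():
        alive = [n for n in nodes if n not in dead]
        if len(alive) < target:
            pending[block_id] = target - len(alive)
    return pending

now = 100.0
heartbeats = {"dn1": 99.0, "dn2": 50.0, "dn3": 98.0}   # dn2 has timed out
blocks = {1: ["dn1", "dn2", "dn3"], 2: ["dn1", "dn3"]}
print(find_under_replicated(heartbeats, blocks, now))
```

Block 1 loses a replica to the dead node, and block 2 was already under-replicated, so both end up in the repair queue.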
GFS vs HDFS (in short)
| Aspect | GFS | HDFS |
|---|---|---|
| Main use-case | Big data processing within the Google ecosystem | Hadoop analytics ecosystem in enterprise |
| Block size | Large chunks (classically 64MB) | Large blocks (usually 128MB+) |
| Consistency model | Relaxed, append-heavy semantics | Write-once-read-many (classical) |
| Master HA | Internal strategies (shadow masters for read-only access) | NameNode HA (active/standby + journal quorum) |
What to discuss in an interview
- Why a large block size reduces metadata overhead and increases throughput.
- How the placement policy balances locality, fault domains, and network cost.
- Why a single master is historically simpler, and how to evolve it toward HA.
- How to verify integrity: checksums, scrubbing, background repair.
- Where the boundary lies between a DFS and object storage, and how they combine in a data platform.
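The first interview point (block size vs. metadata overhead) comes down to simple arithmetic. The sketch below assumes roughly 150 bytes of master memory per block object, a commonly cited HDFS rule of thumb; the exact figure varies by version.

```python
PB = 10**15

def metadata_bytes(total_bytes: int, block_size: int,
                   bytes_per_block: int = 150) -> int:
    """Estimate master memory for the block map at a given block size."""
    num_blocks = total_bytes // block_size
    return num_blocks * bytes_per_block

for mb in (64, 128, 256):
    cost = metadata_bytes(1 * PB, mb * 2**20)
    print(f"{mb:>3} MB blocks -> {cost / 2**30:.2f} GiB of master metadata for 1 PB")
```

Doubling the block size halves the number of block objects and thus the master's memory footprint, which is why GFS chose 64 MB chunks and HDFS defaults have grown to 128 MB and beyond.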
Reference
HDFS Architecture Guide
Official HDFS design description: NameNode, DataNode, replication, and fault tolerance.
