Object storage looks like a simple file API, but underneath it is an architecture problem about data durability, a massive namespace, and background repair work.
The case ties together the metadata layer, raw-byte placement, multipart upload, the choice between replication and erasure coding, and the guarantees behind PUT, GET, and LIST.
For interviews and architecture reviews, it is valuable because it quickly shows whether you understand the gap between a clean client interface and the real cost of S3-scale storage.
Data Durability
The real question is not just how many copies exist, but how quickly the system detects corruption, repairs it, and spreads data across independent failure domains.
Metadata Layer
Metadata decides where each object lives and how LIST behaves, which is why it often becomes the real bottleneck in the design.
Hot Buckets
Skewed traffic across buckets and prefixes creates hotspots quickly unless catalog partitioning and secondary indexes are designed up front.
Storage Cost
Replication, erasure coding, storage classes, and network egress shape the bill almost as much as the disks themselves.
Object storage keeps files, images, videos, and backups as independent objects behind an HTTP API. Instead of a hierarchical file tree, it relies on a flat namespace and is designed around extreme data durability, predictable storage economics, and effectively unbounded scale.
Source
System Design Interview Vol. 2
The Design S3 chapter walks through the write path, metadata layer, and why the system only looks simple from the outside.
Object storage examples
- Amazon S3: the reference interface for the industry, known for very high data durability.
- Google Cloud Storage: tightly integrated with analytics and ML services across Google Cloud.
- Azure Blob Storage: offers several storage classes from hot to archive.
- MinIO: an S3-compatible open-source option for private deployments and on-premise environments.
- Ceph: a distributed platform that supports block, file, and object storage in one stack.
Functional Requirements
Core operations
- PUT /bucket/object - upload an object
- GET /bucket/object - read or download an object
- DELETE /bucket/object - delete an object
- LIST /bucket?prefix= - list objects by prefix
Advanced capabilities
- Upload large objects in multiple parts
- Keep multiple versions of the same object
- Move data between storage classes automatically
- Grant temporary access through pre-signed URLs
Non-Functional Requirements
In an S3-style system you need to agree not only on scale, but also on target availability, expected throughput, and how close to zero data-loss probability the system must get in practice.
| Requirement | Target | Why it matters |
|---|---|---|
| Data durability | 99.999999999% (11 nines) | Losing user data is unacceptable even when hardware fails. |
| Availability | 99.99% | Applications should be able to reach objects almost all the time. |
| Scale | Exabytes and beyond | The system must grow without a full redesign of the storage layer. |
| Throughput | Tbps+ | Massive parallel uploads and downloads should remain routine. |
| Object size | Up to 5 TB (like S3) | The system should support both tiny files and very large archives. |
High-Level Architecture
Theory
DDIA: Storage Engines
Useful context for understanding why metadata and raw object bytes live under different storage rules.
Architecture Diagram
Write and read paths in object storage
Metadata Service
This layer stores the object name, size, content-type, checksum, versions, access rules, and the pointer to physical data placement. It is responsible for locating an object by key and serving LIST operations, which is why it often becomes the hardest operational layer in the system.
Example metadata record:
{
  "bucket_id": "uuid",
  "object_key": "/photos/2024/vacation.jpg",
  "object_id": "uuid",
  "size": 4500000,
  "content_type": "image/jpeg",
  "checksum": "sha256:abc123...",
  "version_id": "v3",
  "created_at": "2024-01-15T10:30:00Z",
  "storage_class": "STANDARD",
  "replicas": ["node1", "node2", "node3"]
}
Data Layer
Raw object bytes live separately from metadata. That split lets you scale the catalog and the storage fleet independently, and it allows different placement strategies for hot and cold data.
Storage approaches:
- Replication: multiple copies across nodes or AZs
- Erasure coding: lower storage overhead at the cost of more complex repair
- Tiered storage: hot, infrequent-access, and archive layers
Practical optimizations:
- Sequential writes for large objects
- Batching for tiny files
- Background cleanup and compaction for sparse segments
Object Upload Flow
Upload Flow
A simplified object write path
Multipart upload for large objects
Multipart upload matters for large files because clients can send parts in parallel, resume after failure, and avoid retransmitting chunks that have already been stored.
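Before looking at the wire protocol, it may help to see the client-side bookkeeping that makes parallelism and resume possible. The sketch below is a minimal illustration, not the S3 client: `Part`, `plan_parts`, and `remaining_parts` are hypothetical names, and the 100 MiB part size is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    part_number: int  # 1-based part index
    offset: int       # byte offset into the source object
    length: int       # number of bytes in this part


def plan_parts(object_size: int, part_size: int) -> list[Part]:
    """Split an object into fixed-size parts; the last part may be shorter."""
    if object_size <= 0 or part_size <= 0:
        raise ValueError("object_size and part_size must be positive")
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append(Part(number, offset, length))
        offset += length
        number += 1
    return parts


def remaining_parts(plan: list[Part], uploaded: set[int]) -> list[Part]:
    """Parts still to send after an interruption; stored parts are skipped."""
    return [p for p in plan if p.part_number not in uploaded]


MIB = 1024 * 1024
plan = plan_parts(object_size=250 * MIB, part_size=100 * MIB)
todo = remaining_parts(plan, uploaded={1})  # part 1 survived the interruption
```

Because every part is addressed by number and byte range, parts can be sent by independent workers in any order, and a retry only needs the set of part numbers the server has already acknowledged.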
# 1. Initialization
POST /bucket/object?uploads → upload_id

# 2. Upload parts (in parallel)
PUT /bucket/object?uploadId=X&partNumber=1 → ETag1
PUT /bucket/object?uploadId=X&partNumber=2 → ETag2
...

# 3. Completion
POST /bucket/object?uploadId=X
{
  "parts": [
    {"partNumber": 1, "ETag": "..."},
    {"partNumber": 2, "ETag": "..."}
  ]
}
Data Durability and Placement
Deeper
Database Internals
Helpful when you need to talk about replication, repair, and the real cost of each extra storage guarantee.
In practice, object storage often combines replication for fast recovery with erasure coding for lower storage cost on large volumes of colder data.
Replication
The same object is stored as several copies across different nodes, racks, or availability zones.
Advantages:
- Simple write and read path
- Fast recovery after a node failure
- Low compute overhead
Drawbacks:
- Roughly 3x storage overhead with three copies
- Expensive for very large archival datasets
Erasure Coding
Data is split into k data fragments and m parity fragments. With a code such as Reed-Solomon, the object can be rebuilt from any k of the k + m fragments, so it survives the loss of up to m fragments.
Advantages:
- Much lower storage overhead than triple replication
- A good fit for archive and infrequently accessed data
- Preserves high durability without the full cost of extra copies
Drawbacks:
- Higher CPU cost during write, read, and repair
- A more complex recovery path after failures
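To make the k + m idea concrete, here is a minimal sketch using a single XOR parity fragment (m = 1), the simplest possible erasure code. Production systems typically use Reed-Solomon or similar codes that tolerate more than one loss; the function names and the 10 + 4 layout in the usage lines are illustrative assumptions, not taken from any specific system.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k zero-padded fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytes(frag_len)  # all zeros to start
    for fragment in fragments:
        parity = xor_bytes(parity, fragment)
    return fragments + [parity]


def rebuild(fragments: list, missing: int) -> bytes:
    """Recover the single missing fragment by XOR-ing all survivors."""
    survivors = [f for i, f in enumerate(fragments) if i != missing]
    out = bytes(len(survivors[0]))
    for fragment in survivors:
        out = xor_bytes(out, fragment)
    return out


def storage_overhead(k: int, m: int) -> float:
    """Bytes stored per byte of user data for a k + m layout."""
    return (k + m) / k


stored = encode(b"object storage", k=4)
lost = stored[2]
damaged = list(stored)
damaged[2] = None                    # simulate a lost disk
recovered = rebuild(damaged, missing=2)

xor_overhead = storage_overhead(4, 1)    # 1.25x for this toy layout
wide_overhead = storage_overhead(10, 4)  # 1.4x for a hypothetical 10 + 4 layout
```

The repair path shows the trade-off from the list above: rebuilding one fragment requires reading every surviving fragment, whereas a replicated system copies a single intact replica.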
Where the “11 nines” come from
The number is not magic. It comes from spreading copies across independent failure domains, repairing damage quickly, and continuously checking the integrity of stored fragments.
# Assume:
# - Annual Failure Rate (AFR) of one disk = 2%
# - 3 copies in different failure domains
P(loss of one disk) = 0.02
P(loss of two disks at the same time) = 0.02 × 0.02 = 0.0004
P(loss of three disks before repair) = 0.0004 × 0.02 = 0.000008
# Then add:
# - different AZs
# - fast repair
# - scrubbing and checksum validation
# → very high data durability
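The arithmetic above multiplies annual failure rates, which ignores how short the repair window usually is. The sketch below is a deliberately rough model that adds that window, assuming independent failures and a constant failure rate; the function names and the 24-hour and 6-hour repair times are illustrative assumptions, not measured values.

```python
HOURS_PER_YEAR = 24 * 365


def p_fail_within(afr: float, hours: float) -> float:
    """Chance one disk fails inside a window, assuming a constant failure rate."""
    return afr * hours / HOURS_PER_YEAR


def naive_annual_loss(afr: float, copies: int) -> float:
    """The calculation from the text: every copy fails in the same year."""
    return afr ** copies


def annual_loss_with_repair(afr: float, copies: int, repair_hours: float) -> float:
    """One copy fails during the year, then all others fail before repair ends."""
    first_failure = copies * afr  # any of the copies may fail first
    others_in_window = p_fail_within(afr, repair_hours) ** (copies - 1)
    return first_failure * others_in_window


naive = naive_annual_loss(0.02, 3)
slow_repair = annual_loss_with_repair(0.02, 3, repair_hours=24)
fast_repair = annual_loss_with_repair(0.02, 3, repair_hours=6)
```

Under these assumptions the naive figure is 8 in a million per year, a 24-hour repair window brings the estimate to the order of 1e-10, and a 6-hour window to the order of 1e-11. The model leaves out correlated failures and silent corruption, which is why failure-domain isolation and scrubbing still matter.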
Metadata Sharding
Once the object count reaches the billions, metadata becomes the main bottleneck. Sharding removes pressure from that layer, but it immediately complicates LIST operations and the consistency model between the main catalog and any secondary indexes.
1. Shard by bucket
The simplest choice: keep all objects from one bucket on a single shard. The downside is that popular buckets quickly turn into hotspots.
2. Shard by object key hash
hash(bucket_id + object_key) % N spreads load well, but LIST needs a fan-out to every shard and a merge step on the way back.
3. Hybrid approach
The primary catalog is partitioned one way, while prefix queries use a separate index. This is the pattern many production systems eventually adopt.
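A minimal sketch of option 2, assuming a SHA-256 based placement function. `shard_for`, the bucket name, and the shard count of 16 are hypothetical; a stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib


def shard_for(bucket_id: str, object_key: str, num_shards: int) -> int:
    """Stable shard index derived from a hash of bucket and key."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    material = f"{bucket_id}\0{object_key}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = ["/photos/2024/a.jpg", "/photos/2024/b.jpg", "/photos/2024/c.jpg"]
placement = {key: shard_for("bucket-1", key, 16) for key in keys}
```

Keys that share a prefix get unrelated shard indexes, which is exactly what spreads write load and also what forces LIST to visit every shard.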
Why LIST is harder than it looks
With hash-based partitioning, LIST /bucket?prefix=/photos/ cannot know in advance which shard owns the relevant range. The coordinator has to query every shard and then merge partial results.
What helps:
- A dedicated index for prefix queries
- Range-based partitioning for ordered listings
- Denormalizing into a separate catalog table
What it costs:
- Extra storage for indexes
- More work to keep the catalog and indexes in sync
- A more complex write path for metadata updates
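The fan-out and merge described above can be sketched as a scatter-gather over in-memory shards. This is a simplified illustration with hypothetical function names; a real coordinator would issue the per-shard queries over the network, in parallel, with continuation tokens for paging.

```python
import heapq
from itertools import islice


def list_on_shard(shard_keys: list[str], prefix: str) -> list[str]:
    """What a single shard returns: its matching keys in sorted order."""
    return sorted(k for k in shard_keys if k.startswith(prefix))


def fan_out_list(shards: list[list[str]], prefix: str, max_keys: int) -> list[str]:
    """Query every shard, then k-way merge the sorted partial results."""
    partials = [list_on_shard(keys, prefix) for keys in shards]
    return list(islice(heapq.merge(*partials), max_keys))


shards = [
    ["/photos/2024/b.jpg", "/docs/readme.txt"],
    ["/photos/2023/a.jpg", "/photos/2024/c.jpg"],
    ["/logs/app.log", "/photos/2024/a.jpg"],
]
page = fan_out_list(shards, prefix="/photos/", max_keys=3)
```

Even when the caller asks for only three keys, every shard is queried, so the cost of one LIST page grows with the shard count rather than with the page size.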
Security Considerations
Access control
- IAM Policies: permissions at the user or role level
- Bucket Policies: resource-level access rules
- ACLs: object-level permissions when finer control is needed
- Pre-signed URLs: temporary access without permanent credential sharing
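To show why a pre-signed URL can grant access without sharing credentials, here is a minimal HMAC-based sketch. It is a hypothetical scheme for illustration only, not AWS Signature Version 4: the parameter names `expires` and `signature`, the payload layout, and both function names are assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _sign(secret: bytes, method: str, path: str, expires_at: int) -> str:
    payload = f"{method}\n{path}\n{expires_at}".encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def presign(secret: bytes, method: str, path: str, expires_at: int) -> str:
    """Return a URL that authorizes one method on one path until expires_at."""
    signature = _sign(secret, method, path, expires_at)
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"{path}?{query}"


def verify(secret: bytes, method: str, url: str, now: int) -> bool:
    """Server-side check: the signature must match and the URL must not be expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    expected = _sign(secret, method, parsed.path, expires_at)
    return hmac.compare_digest(expected, signature)  # constant-time comparison


url = presign(b"server-secret", "GET", "/bucket/photos/cat.jpg", expires_at=1000)
```

The secret never leaves the service. The client holds only a signature bound to one method, one path, and one expiry time, so changing any of them invalidates the URL.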
Encryption
- SSE-S3: server-side encryption with provider-managed keys
- SSE-KMS: server-side encryption with keys held in a key management service, where the customer can control and audit key usage
- SSE-C: server-side encryption with keys supplied by the client
- Client-side: encrypt data before it is sent to storage
Storage Classes
Connection
CDN Integration
Object storage often serves as the origin behind a CDN, where object versioning, predictable reads, and cost control matter together.
| Class | Latency | Cost | Typical use |
|---|---|---|---|
| Standard (Hot) | milliseconds | $$$ | Frequent access and live application data |
| Infrequent Access | milliseconds | $$ | Backups and objects read only occasionally |
| Archive (Glacier) | hours | $ | Long-term retention and compliance archives |
| Deep Archive | 12+ hours | ¢ | Data that is almost never read but still must exist |
Lifecycle policies
{
  "rules": [
    {
      "filter": {"prefix": "logs/"},
      "transitions": [
        {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
        {"days": 90, "storageClass": "GLACIER"}
      ],
      "expiration": {"days": 365}
    }
  ]
}
Interview Questions
1. How do you get to 11 nines of durability?
Spread copies or coded fragments across independent AZs, validate checksums, repair damaged data quickly, and treat background repair as a first-class production concern.
2. How do you speed up uploads for very large files?
Upload in parts, allow parallel transfers, resume after interruptions, and give the client a pre-signed URL whenever direct-to-storage upload is the cleanest path.
3. When do you choose replication over erasure coding?
Replication fits hot data and fast recovery. Erasure coding is usually the better choice for colder data where storage efficiency matters more than extra compute during repair.
4. How do you implement versioning safely?
Each PUT writes a new version with a unique version_id, while DELETE writes a delete marker instead of immediately removing historical bytes.
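A minimal in-memory sketch of that behavior follows. `VersionedBucket` and its sequential `v1`, `v2` identifiers are assumptions made for readability; real systems typically issue opaque unique version IDs and persist the history in the metadata layer.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    version_id: str
    data: Optional[bytes]  # None means this entry is a delete marker


class VersionedBucket:
    def __init__(self) -> None:
        self._versions: dict[str, list[Version]] = {}
        self._counter = itertools.count(1)

    def _append(self, key: str, data: Optional[bytes]) -> str:
        version = Version(f"v{next(self._counter)}", data)
        self._versions.setdefault(key, []).append(version)
        return version.version_id

    def put(self, key: str, data: bytes) -> str:
        """Every PUT appends a new version; nothing is overwritten."""
        return self._append(key, data)

    def delete(self, key: str) -> str:
        """DELETE appends a marker; historical bytes stay in place."""
        return self._append(key, None)

    def get(self, key: str, version_id: Optional[str] = None) -> Optional[bytes]:
        """Latest version by default, or a specific one when version_id is given."""
        history = self._versions.get(key, [])
        if version_id is None:
            return history[-1].data if history else None
        for version in history:
            if version.version_id == version_id:
                return version.data
        return None


bucket = VersionedBucket()
v1 = bucket.put("a.txt", b"one")
v2 = bucket.put("a.txt", b"two")
marker = bucket.delete("a.txt")
```

After the delete, a plain read returns nothing, while a read with `v1` or `v2` still returns the original bytes, which is what makes accidental deletes recoverable.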
5. What happens to abandoned parts and deleted objects?
You need background garbage collection to find incomplete uploads, remove unreferenced blocks, compact sparse segments, and keep cleanup from hurting the read path.
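One way to picture that cleanup is a mark-and-sweep pass over block references. The sketch below is a simplified illustration with hypothetical names (`Upload`, `collect_garbage`) and an arbitrary age threshold; it shows only the selection logic, not throttling or deletion.

```python
from dataclasses import dataclass, field


@dataclass
class Upload:
    started_at: float                       # seconds, same clock as `now`
    block_ids: set[str] = field(default_factory=set)


def collect_garbage(
    stored_blocks: set[str],
    object_blocks: dict[str, set[str]],
    uploads: dict[str, Upload],
    now: float,
    max_upload_age: float,
) -> tuple[set[str], set[str]]:
    """Return (stale upload ids, block ids that nothing live references)."""
    stale = {
        upload_id
        for upload_id, upload in uploads.items()
        if now - upload.started_at > max_upload_age
    }
    # Mark: blocks referenced by committed objects or by still-fresh uploads.
    live: set[str] = set()
    for blocks in object_blocks.values():
        live |= blocks
    for upload_id, upload in uploads.items():
        if upload_id not in stale:
            live |= upload.block_ids
    # Sweep: everything stored but not marked.
    return stale, stored_blocks - live


stale_uploads, garbage = collect_garbage(
    stored_blocks={"b1", "b2", "b3", "b4"},
    object_blocks={"photos/a.jpg": {"b1"}},
    uploads={"u-old": Upload(0.0, {"b2"}), "u-new": Upload(900.0, {"b3"})},
    now=1000.0,
    max_upload_age=500.0,
)
```

Blocks belonging to a fresh in-progress upload are treated as live, so the collector cannot race with a client that is still sending parts.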
Key Takeaways
- ✓ Metadata and raw data must be separated: that split is what lets the catalog and the byte-storage fleet scale independently.
- ✓ Durability comes from several mechanisms working together: copy placement, background repair, checksums, and failure-domain isolation all matter.
- ✓ Storage classes change the economics: hot and archival data should not be priced or placed the same way.
- ✓ LIST and metadata often hurt more than PUT/GET: hotspots, indexing, and consistency concerns usually show up there first.
- ✓ Multipart upload is about resilience, not API decoration: it keeps large uploads practical when clients fail or connections drop.
Related chapters
- Content Delivery Network (CDN) - shows how object storage serves as the origin layer behind global delivery and cache distribution.
- System Design Interview: An Insider's Guide (short summary) - provides the classic S3-style walkthrough focused on scale, data durability, and the write path.
- Database Internals: A Deep Dive (short summary) - helps explain metadata-store internals and storage-engine trade-offs behind the object catalog.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - reinforces the distributed-systems foundations behind replication, consistency, and repair.
- Distributed File System (GFS/HDFS) - adds a closely related storage case about data placement, repair, and cluster behavior under load.
- Acing the System Design Interview (short summary) - helps package object storage into a clean interview answer with API choices, critical path, risks, and trade-offs.
- System design case studies examples - puts object storage in the wider case-study map and makes cross-domain comparisons easier.
