Object Storage (S3) — System Design Space

Object storage looks like a simple file API, but underneath it is an architecture problem about data durability, a massive namespace, and background repair work.

The case ties together the metadata layer, raw-byte placement, multipart upload, the choice between replication and erasure coding, and the guarantees behind PUT, GET, and LIST.

For interviews and architecture reviews, it is valuable because it quickly shows whether you understand the gap between a clean client interface and the real cost of S3-scale storage.

Data Durability

The real question is not just how many copies exist, but how quickly the system detects corruption, repairs it, and spreads data across independent failure domains.

Metadata Layer

Metadata decides where each object lives and how LIST behaves, which is why it often becomes the real bottleneck in the design.

Hot Buckets

Skewed traffic across buckets and prefixes creates hotspots quickly unless catalog partitioning and secondary indexes are designed up front.

Storage Cost

Replication, erasure coding, storage classes, and network egress shape the bill almost as much as the disks themselves.

Object storage keeps files, images, videos, and backups as independent objects behind an HTTP API. There is no familiar directory tree; in its place sits a flat namespace, and the whole system is built around three requirements that pull against each other — extreme data durability, predictable storage economics, and effectively unbounded scale.

Source

System Design Interview Vol. 2

The Design S3 chapter walks through the write path, metadata layer, and why the system only looks simple from the outside.

Read the review

Object storage examples

Amazon S3: the reference interface for the industry, known for very high data durability.
Google Cloud Storage: tightly integrated with analytics and ML services across Google Cloud.
Azure Blob Storage: offers several storage classes from hot to archive.
MinIO: an S3-compatible open-source option for private deployments and on-premise environments.
Ceph: a distributed platform that supports block, file, and object storage in one stack.

Functional Requirements

Core operations

PUT /bucket/object — upload an object
GET /bucket/object — read or download an object
DELETE /bucket/object — delete an object
LIST /bucket?prefix= — list objects by prefix

Advanced capabilities

Upload large objects in multiple parts
Keep multiple versions of the same object
Move data between storage classes automatically
Grant temporary access through pre-signed URLs

Non-Functional Requirements

Scale is only the first number here. Before any design, pin down target availability, expected throughput on the read and write path, and how close to zero data-loss probability the system must get in practice — without those numbers, every choice about copies and shards stays a guess.

Requirement	Target	Why it matters
Data durability	99.999999999% (11 nines)	Losing user data is unacceptable even when hardware fails.
Availability	99.99%	Applications should be able to reach objects almost all the time.
Scale	Exabytes and beyond	The system must grow without a full redesign of the storage layer.
Throughput	Tbps+	Massive parallel uploads and downloads should remain routine.
Object size	Up to 5 TB (like S3)	The system should support both tiny files and very large archives.

High-Level Architecture

Theory

DDIA: Storage Engines

Useful context for understanding why metadata and raw object bytes live under different storage rules.

Read the review

Architecture Diagram

Write and read paths in object storage

Ingress

Clients

Web · Mobile · SDK

Load Balancer

Edge routing

API Gateway

Auth · Rate limits · Routing

Services

Metadata Service

Keys · ACL · Versions

Object Service

Placement · Replicas

Bucket Service

Namespace · Policies

Storage

Metadata DB

Sharded KV

Data Store

Disks · Erasure

Bucket DB

Configs

Metadata Service

This layer stores the object name, size, content-type, checksum, versions, access rules, and the pointer to physical data placement. It is responsible for locating an object by key and serving LIST operations, which is why it often becomes the hardest operational layer in the system.

Example metadata record:

{
  "bucket_id": "uuid",
  "object_key": "/photos/2024/vacation.jpg",
  "object_id": "uuid",
  "size": 4500000,
  "content_type": "image/jpeg",
  "checksum": "sha256:abc123...",
  "version_id": "v3",
  "created_at": "2024-01-15T10:30:00Z",
  "storage_class": "STANDARD",
  "replicas": ["node1", "node2", "node3"]
}
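
The checksum and size fields in the record above are what let a reader or a background scrubber decide whether the stored bytes are still intact. A minimal sketch of that check, assuming a SHA-256 checksum in the "sha256:<hex>" form used in the example; the function names are illustrative and not part of any real API:

```python
import hashlib


def compute_checksum(data: bytes) -> str:
    """Return a checksum string in the same 'sha256:<hex>' form as the record."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_object(record: dict, data: bytes) -> bool:
    """Compare the bytes read from storage against the metadata record."""
    if record["size"] != len(data):
        return False  # truncated or padded object
    return record["checksum"] == compute_checksum(data)


payload = b"hello object storage"
record = {"size": len(payload), "checksum": compute_checksum(payload)}
print(verify_object(record, payload))         # intact bytes
print(verify_object(record, payload + b"!"))  # corrupted bytes
```

The same comparison can run on every GET and in a periodic scrub, so silent corruption is found before the last healthy copy is gone.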

Data Layer

Raw object bytes live separately from metadata. That split lets you scale the catalog and the storage fleet independently, and it allows different placement strategies for hot and cold data.

Storage approaches:

Replication: multiple copies across nodes or AZs
Erasure coding: lower storage overhead at the cost of more complex repair
Tiered storage: hot, infrequent-access, and archive layers

Practical optimizations:

Sequential writes for large objects
Batching for tiny files
Background cleanup and compaction for sparse segments
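
Batching tiny files and compacting sparse segments can be pictured with an append-only segment that packs many small objects into one blob and tracks them by offset. This is a simplified in-memory sketch; the Segment class and its methods are hypothetical and leave out persistence, concurrency, and checksums:

```python
class Segment:
    """Append-only segment that packs many small objects into one blob."""

    def __init__(self):
        self.buffer = bytearray()
        self.index = {}  # object_id -> (offset, length)

    def append(self, object_id: str, data: bytes) -> tuple[int, int]:
        """Write sequentially at the end and remember where the object lives."""
        offset = len(self.buffer)
        self.buffer.extend(data)
        self.index[object_id] = (offset, len(data))
        return self.index[object_id]

    def read(self, object_id: str) -> bytes:
        offset, length = self.index[object_id]
        return bytes(self.buffer[offset:offset + length])

    def delete(self, object_id: str) -> None:
        """Deleting only drops the index entry; the bytes become garbage."""
        self.index.pop(object_id)

    def live_ratio(self) -> float:
        """Share of the segment still referenced; low values suggest compaction."""
        live = sum(length for _, length in self.index.values())
        return live / len(self.buffer) if self.buffer else 1.0


segment = Segment()
segment.append("thumb-1", b"xx")
segment.append("thumb-2", b"yyy")
segment.delete("thumb-1")
print(segment.read("thumb-2"), segment.live_ratio())
```

A compaction job would copy the live entries of a sparse segment into a fresh one and then drop the old blob.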

Object Upload Flow

Upload Flow

A simplified object write path

Client

PUT /bucket/object

API Gateway

Access and quota checks

Object Service

Placement decision

Data Layer

Write and replicas

Metadata DB

Commit metadata

Response

200 OK and ETag
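
The ordering in this flow matters: bytes are written to the data layer first, and metadata is committed only after enough replicas acknowledge. A rough in-memory sketch of that ordering follows. InMemoryNode, put_object, and the min_acks threshold are assumptions made for illustration, and the ETag here is simply a SHA-256 hex digest:

```python
import hashlib


class InMemoryNode:
    """Stand-in for a storage node; a real node would write to disk."""

    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.blobs = name, healthy, {}

    def write(self, object_id, data):
        if not self.healthy:
            raise IOError(f"{self.name} unavailable")
        self.blobs[object_id] = data


def put_object(nodes, metadata_db, bucket, key, data, min_acks=2):
    """Write bytes to replicas first, then commit metadata, then return the ETag."""
    etag = hashlib.sha256(data).hexdigest()
    object_id = f"{bucket}/{key}#{etag[:12]}"
    acked = []
    for node in nodes:
        try:
            node.write(object_id, data)
            acked.append(node.name)
        except IOError:
            continue  # a real system would retry or pick another node
    if min_acks > len(acked):
        # Nothing is committed, so readers never see a half-written object.
        raise RuntimeError("not enough replicas acknowledged the write")
    metadata_db[(bucket, key)] = {
        "object_id": object_id,
        "size": len(data),
        "checksum": "sha256:" + etag,
        "replicas": acked,
    }
    return etag


nodes = [InMemoryNode("node1"), InMemoryNode("node2", healthy=False), InMemoryNode("node3")]
catalog = {}
print(put_object(nodes, catalog, "photos", "vacation.jpg", b"jpeg bytes"))
print(catalog[("photos", "vacation.jpg")]["replicas"])
```

A failed write can leave orphaned bytes on some nodes, which is one reason background garbage collection is needed, but it never leaves metadata that points at missing data.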

Multipart upload for large objects

Multipart upload matters for large files because clients can send parts in parallel, resume after failure, and avoid retransmitting chunks that have already been stored.

# 1. Initialization
POST /bucket/object?uploads → upload_id

# 2. Upload parts (in parallel)
PUT /bucket/object?uploadId=X&partNumber=1 → ETag1
PUT /bucket/object?uploadId=X&partNumber=2 → ETag2
...

# 3. Completion
POST /bucket/object?uploadId=X
{
  "parts": [
    {"partNumber": 1, "ETag": "..."},
    {"partNumber": 2, "ETag": "..."}
  ]
}
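
The three steps above can be sketched in memory to show why retries and out-of-order parts are safe. The MultipartUpload class, the split_parts helper, and the tiny part size are illustrative assumptions; the ETag is a SHA-256 hex digest here, which is not necessarily what a real service returns:

```python
import hashlib

PART_SIZE = 5  # bytes, tiny on purpose; real part sizes are in the megabytes


def split_parts(payload: bytes, part_size: int = PART_SIZE) -> dict:
    """Map part numbers (starting at 1) to byte slices of the payload."""
    return {
        i // part_size + 1: payload[i:i + part_size]
        for i in range(0, len(payload), part_size)
    }


class MultipartUpload:
    def __init__(self):
        self.parts = {}  # part_number -> bytes

    def upload_part(self, part_number: int, data: bytes) -> str:
        """Store one part; re-sending the same part number overwrites it."""
        self.parts[part_number] = data
        return hashlib.sha256(data).hexdigest()

    def complete(self, manifest: list) -> bytes:
        """Check every listed part against its ETag and join them in order."""
        assembled = bytearray()
        for entry in sorted(manifest, key=lambda p: p["partNumber"]):
            data = self.parts.get(entry["partNumber"])
            if data is None or hashlib.sha256(data).hexdigest() != entry["ETag"]:
                raise ValueError(f"part {entry['partNumber']} missing or changed")
            assembled.extend(data)
        return bytes(assembled)


payload = b"hello object storage"
upload = MultipartUpload()
# Parts may arrive in any order, for example from parallel workers.
manifest = [
    {"partNumber": number, "ETag": upload.upload_part(number, chunk)}
    for number, chunk in reversed(list(split_parts(payload).items()))
]
print(upload.complete(manifest) == payload)
```

Because each part is addressed by its number, a client that loses its connection only re-sends the parts it has no ETag for.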

Data Durability and Placement

Deeper

Database Internals

Helpful when you need to talk about replication, repair, and the real cost of each extra storage guarantee.

Read the review

In practice, object storage often combines replication for fast recovery with erasure coding for lower storage cost on large volumes of colder data.

Replication

The same object is stored as several copies across different nodes, racks, or availability zones.

Advantages:

Simple write and read path
Fast recovery after a node failure
Low compute overhead

Drawbacks:

Roughly 3x storage overhead with three copies
Expensive for very large archival datasets

Erasure Coding

Data is split into k data fragments and m parity fragments, so the object can be rebuilt even after some disks are lost.

Advantages:

Much lower storage overhead than triple replication
A good fit for archive and infrequently accessed data
Preserves high durability without the full cost of extra copies

Drawbacks:

Higher CPU cost during write, read, and repair
A more complex recovery path after failures
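
To make the k and m idea concrete, here is the simplest possible case: k data fragments plus a single XOR parity fragment (m = 1), which survives the loss of any one fragment. This is a deliberate simplification; production systems generally use codes such as Reed-Solomon to tolerate more than one lost fragment, and the function names below are illustrative:

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: list, missing_index: int) -> bytes:
    """Rebuild one lost fragment by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(fragments) if i != missing_index]
    return reduce(xor_bytes, survivors)


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (k + m) / k


fragments = encode(b"object-storage!!", k=4)
damaged = list(fragments)
damaged[2] = None  # simulate a lost disk
print(recover(damaged, 2) == fragments[2])
print(storage_overhead(4, 1), storage_overhead(10, 4), storage_overhead(1, 2))
```

The last line compares a 4+1 layout, a 10+4 layout, and three full copies (written as k = 1, m = 2). The repair cost is also visible: rebuilding one fragment requires reading every surviving fragment, whereas replication only copies one replica.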

Where the “11 nines” come from

The number is not magic. It comes from spreading copies across independent failure domains, repairing damage quickly, and continuously checking the integrity of stored fragments.

# Assume:
# - Annual Failure Rate (AFR) of one disk = 2%
# - 3 copies in different failure domains
# - failures are independent and nothing is repaired during the year

P(one given disk fails within the year) = 0.02
P(two given disks fail within the year) = 0.02 × 0.02 = 0.0004
P(all three fail within the year) = 0.0004 × 0.02 = 0.000008

# That is only about 5 nines. The rest comes from:
# - different AZs
# - fast repair (the exposure window shrinks from a year to hours)
# - scrubbing and checksum validation
# → very high data durability
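
The effect of fast repair can be estimated with a back-of-envelope model: data is lost only if, after one copy fails, the remaining copies also fail before the repair finishes. The sketch below assumes independent failures, a constant failure rate, and a repair window much shorter than a year; the formula is a rough approximation introduced here for illustration, not a vendor's published model:

```python
import math

HOURS_PER_YEAR = 24 * 365


def annual_loss_probability(afr: float, copies: int, repair_hours: float) -> float:
    """Approximate chance per year of losing every copy of one object.

    Any of the copies may fail first (copies * afr); the others must then
    all fail inside the repair window. Only meaningful for short windows.
    """
    p_window = afr * repair_hours / HOURS_PER_YEAR
    return copies * afr * p_window ** (copies - 1)


def nines(p_loss: float) -> float:
    """Express a loss probability as a count of durability nines."""
    return -math.log10(p_loss)


for hours in (72, 24, 1):
    p = annual_loss_probability(afr=0.02, copies=3, repair_hours=hours)
    print(f"repair in {hours:>2}h -> loss probability {p:.2e}, about {nines(p):.1f} nines")
```

Under these assumptions, shortening the repair window moves the estimate far more than buying slightly better disks, which is why background repair is treated as part of the durability design.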

Metadata Sharding

Once the object count reaches the billions, metadata becomes the main bottleneck. Sharding removes pressure from that layer, but it immediately complicates LIST operations and the consistency model between the main catalog and any secondary indexes.

1. Shard by bucket

The simplest choice: keep all objects from one bucket on a single shard. The downside is that popular buckets quickly turn into hotspots.

2. Shard by object key hash

hash(bucket_id + object_key) % N spreads load well, but LIST needs a fan-out to every shard and a merge step on the way back.

3. Hybrid approach

The primary catalog is partitioned one way, while prefix queries use a separate index. This is the pattern many production systems eventually adopt.

Why LIST is harder than it looks

With hash-based partitioning, LIST /bucket?prefix=/photos/ cannot know in advance which shard owns the relevant range. The coordinator has to query every shard and then merge partial results.
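
A small sketch of that fan-out follows. It assumes four in-memory shards holding keys of one bucket; shard_for, put_key, and list_prefix are hypothetical names, and SHA-256 stands in for whatever hash function a real catalog would use:

```python
import hashlib
import heapq
import itertools

NUM_SHARDS = 4


def shard_for(bucket_id: str, object_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a stable hash of bucket and key."""
    digest = hashlib.sha256(f"{bucket_id}/{object_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put_key(shards: list, bucket_id: str, object_key: str) -> None:
    shards[shard_for(bucket_id, object_key, len(shards))].append(object_key)


def list_prefix(shards: list, prefix: str, limit: int = 1000) -> list:
    """Ask every shard for matching keys, then merge the sorted partial results."""
    partials = [sorted(k for k in shard if k.startswith(prefix)) for shard in shards]
    return list(itertools.islice(heapq.merge(*partials), limit))


shards = [[] for _ in range(NUM_SHARDS)]
for key in ["/photos/b.jpg", "/docs/x.pdf", "/photos/a.jpg", "/photos/c.jpg"]:
    put_key(shards, "bucket-1", key)
print(list_prefix(shards, "/photos/"))
```

Point lookups touch exactly one shard, but every LIST touches all of them, so listing latency follows the slowest shard.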

What helps:

A dedicated index for prefix queries
Range-based partitioning for ordered listings
Denormalizing into a separate catalog table

What it costs:

Extra storage for indexes
More work to keep the catalog and indexes in sync
A more complex write path for metadata updates
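
Range-based partitioning, one of the options listed above, keeps keys ordered so that a prefix query touches only the shards whose range overlaps the prefix. The sketch below is a minimal in-memory version; RangePartitionedIndex and its split points are hypothetical, and the "\uffff" upper bound assumes keys contain no characters above that code point:

```python
import bisect


class RangePartitionedIndex:
    """Ordered key index split into shards at fixed split points."""

    def __init__(self, split_points):
        # Shard i holds keys in [split_points[i-1], split_points[i]).
        self.split_points = sorted(split_points)
        self.shards = [[] for _ in range(len(self.split_points) + 1)]

    def shard_index(self, key: str) -> int:
        return bisect.bisect_right(self.split_points, key)

    def insert(self, key: str) -> None:
        bisect.insort(self.shards[self.shard_index(key)], key)

    def list_prefix(self, prefix: str):
        """Return matching keys in order plus the shard numbers that were read."""
        first = self.shard_index(prefix)
        last = self.shard_index(prefix + "\uffff")
        touched = list(range(first, last + 1))
        results = [k for i in touched for k in self.shards[i] if k.startswith(prefix)]
        return results, touched


index = RangePartitionedIndex(["/logs/", "/photos/2024/"])
for key in ["/docs/a.txt", "/logs/app.log", "/photos/2023/x.jpg", "/photos/2024/a.jpg"]:
    index.insert(key)
print(index.list_prefix("/docs/"))
print(index.list_prefix("/photos/"))
```

The trade-off is the one named above: ordered ranges make listings cheap, but a popular prefix lands on few shards, so split points have to move as traffic shifts.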

Security Considerations

Access control

IAM Policies: permissions at the user or role level
Bucket Policies: resource-level access rules
ACLs: object-level permissions when finer control is needed
Pre-signed URLs: temporary access without permanent credential sharing
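
The pre-signed URL idea can be illustrated with a plain HMAC over the method, path, and expiry time. This is a simplified scheme written for this example, not the signing algorithm of any real provider; SECRET_KEY, presign, and verify are hypothetical names:

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical server-side key, never sent to clients


def _signature(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(method: str, path: str, ttl_seconds: int, now=None) -> str:
    """Return a URL that grants one method on one path until it expires."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(method, path, expires)})
    return f"{path}?{query}"


def verify(method: str, url: str, now=None) -> bool:
    """Recompute the signature on the server and reject expired or altered URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if (now if now is not None else time.time()) > expires:
        return False
    expected = _signature(method, parsed.path, expires)
    return hmac.compare_digest(signature.encode(), expected.encode())


url = presign("GET", "/bucket/photo.jpg", ttl_seconds=60, now=1000)
print(verify("GET", url, now=1030), verify("GET", url, now=1061), verify("PUT", url, now=1030))
```

The client never sees the secret, and the URL stops working on its own once the expiry passes, which is what makes it safer than sharing long-lived credentials.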

Encryption

SSE-S3: server-side encryption with provider-managed keys
SSE-KMS: server-side encryption with keys held in a key management service, which can be customer-managed
SSE-C: server-side encryption with keys supplied by the client
Client-side: encrypt data before it is sent to storage

Storage Classes

Connection

CDN Integration

Object storage often serves as the origin behind a CDN, where object versioning, predictable reads, and cost control matter together.

Read the review

Class	Latency	Cost	Typical use
Standard (Hot)	milliseconds	$$$	Frequent access and live application data
Infrequent Access	milliseconds	$$	Backups and objects read only occasionally
Archive (Glacier)	hours	$	Long-term retention and compliance archives
Deep Archive	12+ hours	¢	Data that is almost never read but still must exist

Lifecycle policies

{
  "rules": [
    {
      "filter": {"prefix": "logs/"},
      "transitions": [
        {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
        {"days": 90, "storageClass": "GLACIER"}
      ],
      "expiration": {"days": 365}
    }
  ]
}
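
A background job could evaluate a policy like the one above roughly as follows. This is a sketch under simple assumptions: object age is given in whole days, expiration wins over transitions, and the lifecycle_action function and its return values are made up for illustration:

```python
POLICY = {
    "rules": [
        {
            "filter": {"prefix": "logs/"},
            "transitions": [
                {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
                {"days": 90, "storageClass": "GLACIER"},
            ],
            "expiration": {"days": 365},
        }
    ]
}


def lifecycle_action(policy, object_key, age_days, current_class="STANDARD"):
    """Return ('expire', None), ('transition', target_class) or ('none', None)."""
    for rule in policy["rules"]:
        if not object_key.startswith(rule["filter"]["prefix"]):
            continue
        expiration = rule.get("expiration")
        if expiration and age_days >= expiration["days"]:
            return ("expire", None)
        target = None
        # The latest transition whose threshold has passed decides the class.
        for transition in sorted(rule.get("transitions", []), key=lambda t: t["days"]):
            if age_days >= transition["days"]:
                target = transition["storageClass"]
        if target and target != current_class:
            return ("transition", target)
    return ("none", None)


for age in (10, 45, 120, 400):
    print(age, lifecycle_action(POLICY, "logs/app.log", age))
```

In a real system this evaluation would run as a low-priority scan over the metadata catalog, so it does not compete with the read and write paths.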

Interview Questions

1. How do you get to 11 nines of durability?

Spread copies or coded fragments across independent AZs, validate checksums, repair damaged data quickly, and treat background repair as a first-class production concern.

2. How do you speed up uploads for very large files?

Upload in parts, allow parallel transfers, resume after interruptions, and give the client a pre-signed URL whenever direct-to-storage upload is the cleanest path.

3. When do you choose replication over erasure coding?

Replication fits hot data and fast recovery. Erasure coding is usually the better choice for colder data where storage efficiency matters more than extra compute during repair.

4. How do you implement versioning safely?

Each PUT writes a new version with a unique version_id, while DELETE writes a delete marker instead of immediately removing historical bytes.
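
A minimal in-memory sketch of that behavior, where a delete marker is just another version without a payload. The VersionedBucket class and the v1, v2 numbering are illustrative assumptions; a real system would persist versions in the metadata catalog:

```python
import itertools


class VersionedBucket:
    def __init__(self):
        self._versions = {}  # key -> list of (version_id, data or None)
        self._counter = itertools.count(1)

    def put(self, key, data):
        """Every write appends a new version instead of overwriting."""
        version_id = f"v{next(self._counter)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def delete(self, key):
        """A delete marker is a version with no payload; old bytes remain."""
        return self.put(key, None)

    def get(self, key, version_id=None):
        history = self._versions.get(key, [])
        if version_id is None:
            if not history or history[-1][1] is None:
                raise KeyError(key)  # missing, or hidden by a delete marker
            return history[-1][1]
        for vid, data in history:
            if vid == version_id and data is not None:
                return data
        raise KeyError((key, version_id))


bucket = VersionedBucket()
first = bucket.put("report.pdf", b"draft")
bucket.put("report.pdf", b"final")
bucket.delete("report.pdf")
print(bucket.get("report.pdf", version_id=first))
```

After the delete, a plain GET fails, but the older version is still readable by its version_id until a lifecycle rule or garbage collection removes it.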

5. What happens to abandoned parts and deleted objects?

You need background garbage collection to find incomplete uploads, remove unreferenced blocks, compact sparse segments, and keep cleanup from hurting the read path.
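
Two of those garbage-collection tasks can be sketched as simple set operations. The data shapes below (a dict of upload start times, metadata records with a "blocks" list) are assumptions made for this example, not a real catalog schema:

```python
def find_abandoned_uploads(uploads: dict, now: float, max_age_seconds: float) -> list:
    """Return IDs of multipart uploads that were started long ago and never completed."""
    return sorted(
        upload_id
        for upload_id, started_at in uploads.items()
        if now - started_at > max_age_seconds
    )


def sweep_unreferenced(stored_blocks, metadata_records) -> list:
    """Mark every block that metadata still points at, then report the rest."""
    referenced = {block for record in metadata_records for block in record["blocks"]}
    return sorted(set(stored_blocks) - referenced)


uploads = {"upload-a": 1_000, "upload-b": 90_000}
print(find_abandoned_uploads(uploads, now=100_000, max_age_seconds=86_400))

records = [{"key": "a.jpg", "blocks": ["blk-1", "blk-2"]}]
print(sweep_unreferenced(["blk-1", "blk-2", "blk-3"], records))
```

In practice the sweep usually skips very recent blocks, because a write that has stored its bytes but not yet committed metadata would otherwise look unreferenced.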

Key Takeaways

✓ Metadata and raw data must be separated: that split is what lets the catalog and the byte-storage fleet scale independently.
✓ Durability comes from several mechanisms working together: copy placement, background repair, checksums, and failure-domain isolation all matter.
✓ Storage classes change the economics: hot and archival data should not be priced or placed the same way.
✓ LIST and metadata often hurt more than PUT/GET: hotspots, indexing, and consistency concerns usually show up there first.
✓ Multipart upload is about resilience, not API decoration: it keeps large uploads practical when clients fail or connections drop.

References

Amazon Web Services, Data protection in Amazon S3 (AWS S3 User Guide)
Amazon Web Services, Understanding and managing Amazon S3 storage classes (AWS S3 User Guide)
Amazon Web Services, Uploading and copying objects using multipart upload (AWS S3 User Guide)

Related chapters

Content Delivery Network (CDN) - shows how object storage serves as the origin layer behind global delivery and cache distribution.
System Design Interview: An Insider's Guide (short summary) - provides the classic S3-style walkthrough focused on scale, data durability, and the write path.
Database Internals: A Deep Dive (short summary) - helps explain metadata-store internals and storage-engine trade-offs behind the object catalog.
Designing Data-Intensive Applications, 2nd Edition (short summary) - reinforces the distributed-systems foundations behind replication, consistency, and repair.
Distributed File System (GFS/HDFS) - adds a closely related storage case about data placement, repair, and cluster behavior under load.
Acing the System Design Interview (short summary) - helps package object storage into a clean interview answer with API choices, critical path, risks, and trade-offs.
System design case studies examples - puts object storage in the wider case-study map and makes cross-domain comparisons easier.