System Design Space

Updated: April 30, 2026 at 7:40 AM

Object Storage (S3)

Difficulty: medium

Classic task: separate metadata from data, design for extreme durability, choose the right storage scheme, and support large-object uploads in parts.

Object storage looks like a simple file API, but underneath it is an architecture problem about data durability, a massive namespace, and background repair work.

The case ties together the metadata layer, raw-byte placement, multipart upload, the choice between replication and erasure coding, and the guarantees behind PUT, GET, and LIST.

For interviews and architecture reviews, it is valuable because it quickly shows whether you understand the gap between a clean client interface and the real cost of S3-scale storage.

Data Durability

The real question is not just how many copies exist, but how quickly the system detects corruption, repairs it, and spreads data across independent failure domains.

Metadata Layer

Metadata decides where each object lives and how LIST behaves, which is why it often becomes the real bottleneck in the design.

Hot Buckets

Skewed traffic across buckets and prefixes creates hotspots quickly unless catalog partitioning and secondary indexes are designed up front.

Storage Cost

Replication, erasure coding, storage classes, and network egress shape the bill almost as much as the disks themselves.

Object storage keeps files, images, videos, and backups as independent objects behind an HTTP API. Instead of a hierarchical file tree, it relies on a flat namespace and is designed around extreme data durability, predictable storage economics, and effectively unbounded scale.

Source

System Design Interview Vol. 2

The Design S3 chapter walks through the write path, metadata layer, and why the system only looks simple from the outside.

Object storage examples

  • Amazon S3: the reference interface for the industry, known for very high data durability.
  • Google Cloud Storage: tightly integrated with analytics and ML services across Google Cloud.
  • Azure Blob Storage: offers several storage classes from hot to archive.
  • MinIO: an S3-compatible open-source option for private deployments and on-premise environments.
  • Ceph: a distributed platform that supports block, file, and object storage in one stack.

Functional Requirements

Core operations

  • PUT /bucket/object — upload an object
  • GET /bucket/object — read or download an object
  • DELETE /bucket/object — delete an object
  • LIST /bucket?prefix= — list objects by prefix

Advanced capabilities

  • Upload large objects in multiple parts
  • Keep multiple versions of the same object
  • Move data between storage classes automatically
  • Grant temporary access through pre-signed URLs

Non-Functional Requirements

In an S3-style system you need to agree not only on scale, but also on target availability, expected throughput, and how close to zero the probability of data loss must get in practice.

Requirement | Target | Why it matters
Data durability | 99.999999999% (11 nines) | Losing user data is unacceptable even when hardware fails.
Availability | 99.99% | Applications should be able to reach objects almost all the time.
Scale | Exabytes and beyond | The system must grow without a full redesign of the storage layer.
Throughput | Tbps+ | Massive parallel uploads and downloads should remain routine.
Object size | Up to 5 TB (like S3) | The system should support both tiny files and very large archives.
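
To make the availability and durability targets above more tangible, the following sketch converts them into a yearly downtime budget and an expected number of lost objects. The fleet size of one trillion objects is an illustrative assumption, not a figure from the source, and the loss estimate assumes objects fail independently.

```python
# Rough arithmetic for the availability and durability targets.
# The object count used in the example call is an illustrative assumption.

MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed unavailability per year for a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_objects_lost_per_year(durability: float, object_count: int) -> float:
    """Expected yearly losses if every object is lost independently."""
    return (1.0 - durability) * object_count


print(downtime_minutes_per_year(0.9999))
print(expected_objects_lost_per_year(0.99999999999, 10**12))
```

Under these assumptions, 99.99% availability leaves a budget of roughly 52.6 minutes per year, and 11 nines corresponds to about ten lost objects per year out of a trillion.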

High-Level Architecture

Theory

DDIA: Storage Engines

Useful context for understanding why metadata and raw object bytes live under different storage rules.

Architecture Diagram

Write and read paths in object storage
  • Clients: Web · Mobile · SDK
  • Load Balancer: Edge routing
  • API Gateway: Auth · Rate limits · Routing
  • Metadata Service: Keys · ACL · Versions
  • Object Service: Placement · Replicas
  • Bucket Service: Namespace · Policies
  • Metadata DB: Sharded KV
  • Data Store: Disks · Erasure
  • Bucket DB: Configs

Metadata Service

This layer stores the object name, size, content-type, checksum, versions, access rules, and the pointer to physical data placement. It is responsible for locating an object by key and serving LIST operations, which is why it often becomes the hardest operational layer in the system.

Example metadata record:

{
  "bucket_id": "uuid",
  "object_key": "/photos/2024/vacation.jpg",
  "object_id": "uuid",
  "size": 4500000,
  "content_type": "image/jpeg",
  "checksum": "sha256:abc123...",
  "version_id": "v3",
  "created_at": "2024-01-15T10:30:00Z",
  "storage_class": "STANDARD",
  "replicas": ["node1", "node2", "node3"]
}

Data Layer

Raw object bytes live separately from metadata. That split lets you scale the catalog and the storage fleet independently, and it allows different placement strategies for hot and cold data.

Storage approaches:

  • Replication: multiple copies across nodes or AZs
  • Erasure coding: lower storage overhead at the cost of more complex repair
  • Tiered storage: hot, infrequent-access, and archive layers

Practical optimizations:

  • Sequential writes for large objects
  • Batching for tiny files
  • Background cleanup and compaction for sparse segments
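
Batching for tiny files can be pictured as packing many small objects into one append-only segment and recording only an offset and a length per object. The `Segment` class below is a hypothetical in-memory sketch of that idea, not a description of any particular product's on-disk format.

```python
class Segment:
    """Hypothetical append-only segment that packs many small objects."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def append(self, payload: bytes) -> tuple[int, int]:
        """Append the payload sequentially and return its (offset, length)."""
        offset = len(self._buf)
        self._buf += payload
        return offset, len(payload)

    def read(self, offset: int, length: int) -> bytes:
        """Read one object back from its recorded location."""
        return bytes(self._buf[offset:offset + length])


segment = Segment()
thumb_loc = segment.append(b"tiny-thumbnail")
icon_loc = segment.append(b"icon")
print(segment.read(*icon_loc))
```

The metadata record would then point at a segment plus `(offset, length)` instead of a dedicated file, which is also why deleted objects leave holes that compaction later has to reclaim.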

Object Upload Flow

A simplified object write path
1. Client: PUT /bucket/object
2. API Gateway: access and quota checks
3. Object Service: placement decision
4. Data Layer: write and replicas
5. Metadata DB: commit metadata
6. Response: 200 OK and ETag
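
The ordering of the write path matters: bytes are stored before metadata is committed, so a crash can leave orphaned bytes for cleanup but never a key that points at nothing. The sketch below illustrates that ordering with hypothetical names (`DataNode`, `put_object`); the trivial placement rule and the SHA-256 based ETag are assumptions made for the example.

```python
import hashlib
import uuid


class DataNode:
    """Hypothetical storage node that maps object_id to raw bytes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: dict[str, bytes] = {}


def put_object(nodes: list, metadata: dict, bucket: str, key: str,
               body: bytes, copies: int = 3) -> str:
    """Write the bytes to `copies` nodes, then commit metadata, then return the ETag."""
    if len(nodes) < copies:
        raise ValueError("not enough nodes for the requested copy count")
    object_id = str(uuid.uuid4())
    etag = hashlib.sha256(body).hexdigest()  # assumption: ETag is a SHA-256 hex digest
    # Placement decision: here simply the first `copies` nodes.
    targets = nodes[:copies]
    # Data layer: every replica is written before metadata is touched.
    for node in targets:
        node.blobs[object_id] = body
    # Metadata commit comes last, so readers never see a key without data.
    metadata[(bucket, key)] = {
        "object_id": object_id,
        "etag": etag,
        "replicas": [node.name for node in targets],
    }
    return etag
```

A real service would pick targets across failure domains and wait for a write quorum; both are left out here to keep the ordering visible.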

Multipart upload for large objects

Multipart upload matters for large files because clients can send parts in parallel, resume after failure, and avoid retransmitting chunks that have already been stored.

# 1. Initialization
POST /bucket/object?uploads → upload_id

# 2. Upload parts (in parallel)
PUT /bucket/object?uploadId=X&partNumber=1 → ETag1
PUT /bucket/object?uploadId=X&partNumber=2 → ETag2
...

# 3. Completion
POST /bucket/object?uploadId=X
{
  "parts": [
    {"partNumber": 1, "ETag": "..."},
    {"partNumber": 2, "ETag": "..."}
  ]
}
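
The three steps above can be modeled in a few lines. `MultipartStore` and `split_parts` are hypothetical names, the store is in memory only, and the per-part ETag is assumed to be a SHA-256 hex digest for the sake of the example.

```python
import hashlib
import uuid


def split_parts(payload: bytes, part_size: int) -> list:
    """Cut a payload into fixed-size parts; the last part may be shorter."""
    return [payload[i:i + part_size] for i in range(0, len(payload), part_size)]


class MultipartStore:
    """Hypothetical in-memory model of initiate / upload part / complete."""

    def __init__(self) -> None:
        self._uploads: dict = {}   # upload_id -> {part_number: bytes}
        self.objects: dict = {}    # key -> assembled bytes

    def initiate(self) -> str:
        upload_id = str(uuid.uuid4())
        self._uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        # Re-sending a part overwrites it, which makes client retries safe.
        self._uploads[upload_id][part_number] = data
        return hashlib.sha256(data).hexdigest()

    def complete(self, upload_id: str, key: str, parts: list) -> None:
        """`parts` is a list of (part_number, etag) pairs sent by the client."""
        stored = self._uploads[upload_id]
        chunks = []
        for part_number, etag in sorted(parts):
            data = stored.get(part_number)
            if data is None or hashlib.sha256(data).hexdigest() != etag:
                raise ValueError(f"missing or mismatched part {part_number}")
            chunks.append(data)
        self.objects[key] = b"".join(chunks)
        del self._uploads[upload_id]
```

Parts can arrive in any order because the completion request, not the arrival order, defines how the object is assembled.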

Data Durability and Placement

Deeper

Database Internals

Helpful when you need to talk about replication, repair, and the real cost of each extra storage guarantee.

In practice, object storage often combines replication for fast recovery with erasure coding for lower storage cost on large volumes of colder data.

Replication

The same object is stored as several copies across different nodes, racks, or availability zones.

Advantages:

  • Simple write and read path
  • Fast recovery after a node failure
  • Low compute overhead

Drawbacks:

  • Roughly 3x raw storage per logical byte with three copies (200% overhead)
  • Expensive for very large archival datasets

Erasure Coding

Data is split into k data fragments and m parity fragments. With a code such as Reed-Solomon, any k of the k + m fragments are enough to rebuild the object, so it survives the loss of up to m fragments.

Advantages:

  • Much lower storage overhead than triple replication
  • A good fit for archive and infrequently accessed data
  • Preserves high durability without the full cost of extra copies

Drawbacks:

  • Higher CPU cost during write, read, and repair
  • A more complex recovery path after failures
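
The simplest possible erasure code is a single XOR parity fragment (m = 1), which is enough to show the mechanics of encode and rebuild. The sketch below is a teaching example with hypothetical function names; production systems typically use Reed-Solomon style codes that tolerate several lost fragments.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split into k equal fragments (zero-padded) plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]


def rebuild(frags: list) -> list:
    """Recover at most one missing fragment (None) by XOR-ing the survivors."""
    frags = list(frags)
    missing = [i for i, frag in enumerate(frags) if frag is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    if missing:
        survivors = [frag for frag in frags if frag is not None]
        frags[missing[0]] = reduce(xor_bytes, survivors)
    return frags


def storage_factor(k: int, m: int) -> float:
    """Raw bytes stored per logical byte for a k + m scheme."""
    return (k + m) / k
```

For comparison with triple replication, a 10 + 4 layout (an example configuration, not one named in the source) stores 1.4 bytes per logical byte while tolerating four lost fragments.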

Where the “11 nines” come from

The number is not magic. It comes from spreading copies across independent failure domains, repairing damage quickly, and continuously checking the integrity of stored fragments.

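# Assume:
# - Annual Failure Rate (AFR) of one disk = 2%
# - 3 copies in different failure domains
# - failures are independent and nothing is repaired during the year

P(one disk fails within the year) = 0.02
P(two given disks fail within the year) = 0.02 × 0.02 = 0.0004
P(all three disks fail within the year) = 0.0004 × 0.02 = 0.000008

# 0.000008 is only about five nines of annual durability (99.9992%).
# The remaining nines come from:
# - different AZs
# - fast repair (a lost copy is exposed for hours, not for a year)
# - scrubbing and checksum validation
# → very high data durability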
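
The effect of fast repair can be estimated with the same 2% AFR and three copies. The model below is a deliberate simplification and an assumption of this sketch, not a vendor formula: one copy fails at some point during the year, and data is lost only if every remaining copy also fails inside the repair window.

```python
HOURS_PER_YEAR = 24 * 365


def naive_annual_loss(afr: float, copies: int = 3) -> float:
    """All copies fail sometime in the same year, with no repair at all."""
    return afr ** copies


def annual_loss_with_repair(afr: float, repair_hours: float, copies: int = 3) -> float:
    """Approximate yearly loss probability when failed copies are rebuilt.

    Simplified model: any one of the copies fails during the year, and each
    remaining copy then independently fails before the rebuild completes.
    """
    p_window = afr * repair_hours / HOURS_PER_YEAR
    return copies * afr * p_window ** (copies - 1)


print(naive_annual_loss(0.02))
print(annual_loss_with_repair(0.02, repair_hours=24))
print(annual_loss_with_repair(0.02, repair_hours=1))
```

Under these assumptions, a 24 hour repair window already lowers the yearly loss probability to the order of 1e-10, and shortening the window lowers it quadratically, which is why repair speed matters as much as the copy count.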

Metadata Sharding

Once the object count reaches the billions, metadata becomes the main bottleneck. Sharding removes pressure from that layer, but it immediately complicates LIST operations and the consistency model between the main catalog and any secondary indexes.

1. Shard by bucket

The simplest choice: keep all objects from one bucket on a single shard. The downside is that popular buckets quickly turn into hotspots.

2. Shard by object key hash

hash(bucket_id + object_key) % N spreads load well, but LIST needs a fan-out to every shard and a merge step on the way back.
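
A minimal version of that shard function is sketched below. Two details are choices of this example rather than of the source: a stable hash (SHA-256) is used because Python's built-in `hash()` is salted per process, and a separator is added so that different bucket and key pairs cannot concatenate to the same string.

```python
import hashlib


def shard_for(bucket_id: str, object_key: str, num_shards: int) -> int:
    """Map (bucket_id, object_key) to a shard index in [0, num_shards)."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from hashing alike.
    material = f"{bucket_id}\0{object_key}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("bucket-1", "/photos/2024/vacation.jpg", 16))
```

Plain modulo placement remaps most keys when N changes, so a production catalog would more likely use consistent hashing or a fixed number of virtual shards.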

3. Hybrid approach

The primary catalog is partitioned one way, while prefix queries use a separate index. This is the pattern many production systems eventually adopt.

Why LIST is harder than it looks

With hash-based partitioning, LIST /bucket?prefix=/photos/ cannot know in advance which shard owns the relevant range. The coordinator has to query every shard and then merge partial results.
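
The fan-out and merge step can be sketched with `heapq.merge`, assuming every shard can return its matching keys in sorted order. The shard contents and the `list_prefix` name are hypothetical.

```python
import heapq
from itertools import islice


def list_prefix(shards: list, prefix: str, limit: int = 1000) -> list:
    """Query every shard for the prefix, then merge the sorted partial results."""
    # Each shard is a sorted list of keys; filtering preserves that order.
    partials = [[key for key in shard if key.startswith(prefix)] for shard in shards]
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    return list(islice(heapq.merge(*partials), limit))


shards = [
    ["/docs/a.txt", "/photos/b.jpg"],
    ["/photos/a.jpg", "/videos/z.mp4"],
    ["/photos/c.jpg"],
]
print(list_prefix(shards, "/photos/"))
```

Every shard is contacted even when it holds no matching keys, which is exactly the cost the coordinator pays on each LIST call.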

What helps:

  • A dedicated index for prefix queries
  • Range-based partitioning for ordered listings
  • Denormalizing into a separate catalog table

What it costs:

  • Extra storage for indexes
  • More work to keep the catalog and indexes in sync
  • A more complex write path for metadata updates
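
A dedicated prefix index can be as simple as an ordered list of keys per bucket: a prefix query becomes one binary search followed by a short sequential scan. `PrefixIndex` below is a hypothetical in-memory sketch; a real index would live in a range-partitioned store and would have to be updated together with the primary catalog.

```python
import bisect


class PrefixIndex:
    """Hypothetical ordered key index for one bucket."""

    def __init__(self) -> None:
        self._keys: list = []

    def add(self, key: str) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def remove(self, key: str) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]

    def list(self, prefix: str, limit: int = 1000) -> list:
        """Keys sharing a prefix are contiguous in sorted order."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:start + limit]:
            if not key.startswith(prefix):
                break
            result.append(key)
        return result
```

The extra `add` and `remove` calls on every metadata write are the synchronization cost listed above.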

Security Considerations

Access control

  • IAM Policies: permissions at the user or role level
  • Bucket Policies: resource-level access rules
  • ACLs: object-level permissions when finer control is needed
  • Pre-signed URLs: temporary access without permanent credential sharing
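
The idea behind a pre-signed URL is that the service signs the method, path, and expiry with a secret, and later recomputes the signature instead of looking up credentials. The sketch below uses a plain HMAC-SHA256 scheme with hypothetical query parameter names (`expires`, `signature`); it is a simplified illustration and not the AWS Signature Version 4 format.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def _sign(secret: bytes, method: str, path: str, expires_at: int) -> str:
    message = f"{method}\n{path}\n{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def presign(secret: bytes, method: str, path: str, expires_at: int) -> str:
    """Return a URL path that grants `method` on `path` until `expires_at` (Unix seconds)."""
    query = urlencode({"expires": expires_at,
                       "signature": _sign(secret, method, path, expires_at)})
    return f"{path}?{query}"


def verify(secret: bytes, method: str, url: str, now: Optional[float] = None) -> bool:
    """Check the expiry and the signature of a pre-signed URL."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires_at:
        return False
    expected = _sign(secret, method, parsed.path, expires_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Because the method is part of the signed message, a URL issued for GET cannot be replayed as a PUT or DELETE.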

Encryption

  • SSE-S3: server-side encryption with provider-managed keys
  • SSE-KMS: server-side encryption with keys held in a key management service, which the customer can create and control
  • SSE-C: server-side encryption with keys supplied by the client
  • Client-side: encrypt data before it is sent to storage

Storage Classes

Connection

CDN Integration

Object storage often serves as the origin behind a CDN, where object versioning, predictable reads, and cost control matter together.

Class | Latency | Cost | Typical use
Standard (Hot) | milliseconds | $$$ | Frequent access and live application data
Infrequent Access | milliseconds | $$ | Backups and objects read only occasionally
Archive (Glacier) | hours | $ | Long-term retention and compliance archives
Deep Archive | 12+ hours | ¢ | Data that is almost never read but still must exist

Lifecycle policies

{
  "rules": [
    {
      "filter": {"prefix": "logs/"},
      "transitions": [
        {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
        {"days": 90, "storageClass": "GLACIER"}
      ],
      "expiration": {"days": 365}
    }
  ]
}
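
A background worker could evaluate such a rule per object roughly as follows. The `evaluate` function and the `"EXPIRE"` return value are hypothetical; the rule structure mirrors the policy above.

```python
from typing import Optional

RULE = {
    "filter": {"prefix": "logs/"},
    "transitions": [
        {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
        {"days": 90, "storageClass": "GLACIER"},
    ],
    "expiration": {"days": 365},
}


def evaluate(rule: dict, key: str, age_days: int) -> Optional[str]:
    """Return "EXPIRE", a target storage class, or None if nothing applies yet."""
    if not key.startswith(rule["filter"]["prefix"]):
        return None
    expiration = rule.get("expiration")
    if expiration and age_days >= expiration["days"]:
        return "EXPIRE"
    target = None
    # The latest transition whose threshold has passed wins.
    for transition in sorted(rule.get("transitions", []), key=lambda t: t["days"]):
        if age_days >= transition["days"]:
            target = transition["storageClass"]
    return target


for age in (10, 45, 120, 400):
    print(age, evaluate(RULE, "logs/app.log", age))
```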

Interview Questions

1. How do you get to 11 nines of durability?

Spread copies or coded fragments across independent AZs, validate checksums, repair damaged data quickly, and treat background repair as a first-class production concern.

2. How do you speed up uploads for very large files?

Upload in parts, allow parallel transfers, resume after interruptions, and give the client a pre-signed URL whenever direct-to-storage upload is the cleanest path.

3. When do you choose replication over erasure coding?

Replication fits hot data and fast recovery. Erasure coding is usually the better choice for colder data where storage efficiency matters more than extra compute during repair.

4. How do you implement versioning safely?

Each PUT writes a new version with a unique version_id, while DELETE writes a delete marker instead of immediately removing historical bytes.
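
The behavior described in this answer can be modeled in memory as an append-only history per key, where a delete marker is just a version without a body. `VersionedBucket` and the sequential `v1`, `v2` identifiers are hypothetical simplifications.

```python
import itertools
from typing import Optional


class VersionedBucket:
    """Hypothetical in-memory bucket with versioning enabled."""

    def __init__(self) -> None:
        # key -> list of (version_id, body); a body of None is a delete marker.
        self._history: dict = {}
        self._ids = itertools.count(1)

    def _append(self, key: str, body: Optional[bytes]) -> str:
        version_id = f"v{next(self._ids)}"
        self._history.setdefault(key, []).append((version_id, body))
        return version_id

    def put(self, key: str, body: bytes) -> str:
        return self._append(key, body)

    def delete(self, key: str) -> str:
        # Historical bytes stay in place; only a marker is added on top.
        return self._append(key, None)

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        history = self._history.get(key, [])
        if version_id is None:
            candidates = history[-1:]  # latest version, if any
        else:
            candidates = [v for v in history if v[0] == version_id]
        if not candidates or candidates[0][1] is None:
            raise KeyError(key)
        return candidates[0][1]
```

After a delete, a plain GET fails as if the object were gone, while a GET with an explicit older `version_id` still returns the stored bytes.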

5. What happens to abandoned parts and deleted objects?

You need background garbage collection to find incomplete uploads, remove unreferenced blocks, compact sparse segments, and keep cleanup from hurting the read path.

Key Takeaways

  • Metadata and raw data must be separated — that split is what lets the catalog and the byte-storage fleet scale independently.
  • Durability comes from several mechanisms working together — copy placement, background repair, checksums, and failure-domain isolation all matter.
  • Storage classes change the economics — hot and archival data should not be priced or placed the same way.
  • LIST and metadata often hurt more than PUT/GET — hotspots, indexing, and consistency concerns usually show up there first.
  • Multipart upload is about resilience, not API decoration — it keeps large uploads practical when clients fail or connections drop.
