System Design Space

Updated: February 21, 2026 at 8:00 PM

Object Storage (S3)


Classic task: metadata/data separation, durability (11 nines), erasure coding, multipart upload.

Object Storage is a distributed storage system for unstructured data (files, images, videos, backups) with access via HTTP API. Unlike file systems, object storage operates with a flat namespace and provides virtually unlimited scalability.

Source

System Design Interview Vol. 2

Chapter 'Design S3' with a detailed analysis of the object storage architecture.


Object Storage Examples

  • Amazon S3: industry standard, 11 nines durability
  • Google Cloud Storage: integration with BigQuery, ML services
  • Azure Blob Storage: Hot/Cool/Archive tiers
  • MinIO: S3-compatible open-source solution
  • Ceph: distributed storage system (block, file, object)

Functional Requirements

Core API

  • PUT /bucket/object — upload an object
  • GET /bucket/object — download an object
  • DELETE /bucket/object — delete an object
  • LIST /bucket?prefix= — list objects by prefix

Advanced Features

  • Multipart upload (for large files)
  • Versioning (object version history)
  • Lifecycle policies (auto-delete, archiving)
  • Pre-signed URLs (temporary access)

Non-functional Requirements

Requirement  | Target value             | Rationale
Durability   | 99.999999999% (11 nines) | No data should ever be lost
Availability | 99.99%                   | Data must be accessible almost always
Scalability  | Exabytes+                | Petabytes of data storage and beyond
Throughput   | Tbps+                    | Parallel uploads/downloads
Object size  | Up to 5 TB (S3)          | Support for large files

High-level architecture

Theory

DDIA: Storage Engines

Deep dive into B-Trees, LSM-Trees and storage design.


Architecture Map

High-level data flow:

  • Clients (Web · Mobile · SDK)
  • Load Balancer — edge routing
  • API Gateway — auth, rate limiting, routing
  • Metadata Service (keys, ACLs, versions) → Metadata DB (sharded KV)
  • Object Service (placement, replicas) → Data Store (disks, erasure coding)
  • Bucket Service (namespace, policies) → Bucket DB (configs)

Metadata Service

Stores object metadata: name, size, content-type, checksum, versions, ACL. This is a critical component: without metadata, an object cannot be located.

Metadata structure (object_id is the pointer to the data in the Data Store):

{
  "bucket_id": "uuid",
  "object_key": "/photos/2024/vacation.jpg",
  "object_id": "uuid",
  "size": 4500000,
  "content_type": "image/jpeg",
  "checksum": "sha256:abc123...",
  "version_id": "v3",
  "created_at": "2024-01-15T10:30:00Z",
  "storage_class": "STANDARD",
  "replicas": ["node1", "node2", "node3"]
}

Data Store

The actual object data store. Objects are split into chunks and distributed across multiple disks/servers with replication.

Storage Strategies:

  • Replication: 3 copies on different nodes
  • Erasure Coding: a (k, m) scheme that trades CPU for storage savings
  • Tiered Storage: Hot → Warm → Cold

Optimizations:

  • Append-only writes for sequential disk I/O
  • Batching small objects into larger segments
  • Compaction for garbage collection
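One way to implement the placement step is rendezvous (highest-random-weight) hashing: every node gets a deterministic score per object, and the top-n nodes hold the replicas. This is a sketch under assumed node names, not necessarily what any particular system uses.

```python
import hashlib

# Illustrative node pool; in practice this comes from cluster membership.
NODES = ["node1", "node2", "node3", "node4", "node5"]

def score(node: str, object_id: str) -> int:
    """Deterministic pseudo-random score for a (node, object) pair."""
    digest = hashlib.sha256(f"{node}:{object_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def pick_replicas(object_id: str, nodes=NODES, n=3):
    """The n highest-scoring nodes hold the object's replicas."""
    return sorted(nodes, key=lambda node: score(node, object_id), reverse=True)[:n]
```

The useful property: removing a node only relocates objects that actually lived on it; everything else keeps its placement, which keeps re-replication traffic proportional to the failure.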

Upload Flow

Simple upload path:

  1. Client — PUT /bucket/object
  2. API Gateway — auth + quotas
  3. Object Service — placement strategy
  4. Data Store — write + replicate
  5. Metadata DB — commit metadata
  6. Response — 200 OK + ETag
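The six steps above, condensed into a toy in-memory sketch (the DataStore/MetadataDB classes are illustrative stand-ins, not a real implementation). The ordering matters: data is made durable before metadata is committed, so a crash in between leaves only orphaned chunks for GC, never metadata pointing at missing data.

```python
import hashlib
import uuid

class DataStore:
    """Step 4: persist the bytes to `replicas` nodes before acknowledging."""
    def __init__(self):
        self.chunks = {}
    def write(self, object_id: str, data: bytes, replicas: int = 3):
        for node in range(replicas):
            self.chunks[(object_id, node)] = data

class MetadataDB:
    """Step 5: commit metadata only after the data is durable."""
    def __init__(self):
        self.records = {}
    def commit(self, bucket: str, key: str, record: dict):
        self.records[(bucket, key)] = record

def put_object(bucket, key, data, store, meta):
    object_id = str(uuid.uuid4())          # step 3: placement / identity
    etag = hashlib.md5(data).hexdigest()   # content fingerprint for the ETag
    store.write(object_id, data)           # step 4: data first
    meta.commit(bucket, key, {"object_id": object_id,
                              "size": len(data), "etag": etag})  # step 5: then metadata
    return 200, etag                       # step 6: 200 OK + ETag
```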

Multipart Upload (large files)

For files larger than 5 GB, multipart upload is required (S3 recommends it for anything over roughly 100 MB). It allows uploads to be parallelized and resumed after failures.

#1. Initialization
POST /bucket/object?uploads → upload_id

#2. Uploading parts (in parallel)
PUT /bucket/object?uploadId=X&partNumber=1 → ETag1
PUT /bucket/object?uploadId=X&partNumber=2 → ETag2
...

#3. Completion
POST /bucket/object?uploadId=X
{
  "parts": [
    {"partNumber": 1, "ETag": "..."},
    {"partNumber": 2, "ETag": "..."}
  ]
}
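A client-side sketch of the three phases with parts uploaded in parallel. The server side is an in-memory stand-in, and the 5-byte part size is purely for illustration: S3's real minimum part size is 5 MB.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5  # bytes, for illustration only; S3's minimum part size is 5 MB

class MultipartUpload:
    """In-memory stand-in for the server side of the three multipart calls."""
    def __init__(self):  # phase 1: POST ?uploads → upload_id
        self.parts = {}
    def upload_part(self, part_number: int, data: bytes) -> str:
        self.parts[part_number] = data
        return hashlib.md5(data).hexdigest()  # per-part ETag
    def complete(self, etags: dict[int, str]) -> bytes:
        # Phase 3: verify every listed ETag, then stitch parts in order.
        for n, etag in etags.items():
            assert hashlib.md5(self.parts[n]).hexdigest() == etag
        return b"".join(self.parts[n] for n in sorted(etags))

def multipart_put(data: bytes) -> bytes:
    upload = MultipartUpload()
    chunks = [data[i:i + PART_SIZE] for i in range(0, len(data), PART_SIZE)]
    # Phase 2: upload parts concurrently; each returns its ETag.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = pool.map(upload.upload_part, range(1, len(chunks) + 1), chunks)
        etags = dict(enumerate(results, start=1))
    return upload.complete(etags)
```

A failed part can simply be re-uploaded under the same partNumber before completion, which is what makes the protocol resumable.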

Durability & Replication

Deeper

Database Internals

A detailed analysis of replication and consensus in distributed systems.


Replication

Creating multiple copies of data on different nodes/data centers.

Advantages:

  • Ease of implementation
  • Fast recovery
  • Low CPU load

Drawbacks:

  • 3x storage overhead
  • High cost

Erasure Coding

Data is split into k data blocks plus m parity blocks; the original can be reconstructed from any k of the k + m blocks.

Advantages:

  • ~1.5x overhead (vs 3x for replication)
  • Storage savings on cold data
  • High durability

Drawbacks:

  • High CPU load when writing/reading
  • Complex recovery
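The simplest erasure code is XOR parity, i.e. the (k=2, m=1) case. Production systems use Reed-Solomon codes with larger k and m (e.g. 10+4), but the reconstruction idea is the same: any one lost block is recoverable from the rest.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes):
    """Split into two data blocks and compute one XOR parity block."""
    half = (len(data) + 1) // 2
    d1, d2 = data[:half], data[half:].ljust(half, b"\0")
    return d1, d2, xor_bytes(d1, d2)

def reconstruct(d1, d2, parity):
    """Any single lost block (passed as None) is recovered by XOR-ing the other two."""
    if d1 is None:
        d1 = xor_bytes(d2, parity)
    if d2 is None:
        d2 = xor_bytes(d1, parity)
    return d1, d2
```

Here three blocks store two blocks' worth of data, i.e. 1.5x overhead, while still surviving the loss of any one block; replication needs 3x overhead for the same guarantee.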

How to achieve many nines of durability

How is 99.999999999% durability achieved with three replicas?

# Assumptions (independent failures):
# - Annual Failure Rate (AFR) of one disk = 2%
# - 3 replicas in independent failure domains

P(losing 1 disk in a year)               = 0.02
P(losing 2 specific disks in a year)     = 0.02 × 0.02 = 0.0004
P(losing all 3 replicas before recovery) = 0.0004 × 0.02 = 0.000008

# Add separate AZs, fast re-replication, and monitoring: the window in
# which concurrent failures can accumulate shrinks dramatically
# → 99.999999999% durability
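The same arithmetic as a quick calculation, plus the effect of fast recovery; the one-day rebuild window is an assumed number for illustration.

```python
import math

afr = 0.02                             # annual failure rate of one disk
p_lose_all_three = afr ** 3            # naive: all 3 fail within a year, no repair
durability = 1 - p_lose_all_three
nines = -math.log10(p_lose_all_three)  # ~5 nines from unrepaired triple replication

# Fast re-replication shrinks the window in which the 2nd and 3rd
# failures must land. Assume a lost replica is rebuilt within ~1 day:
window = 1 / 365
p_loss_with_repair = afr * (afr * window) * (afr * window)  # ~6e-11 per year
```

Naive triple replication without repair gives only about five nines; it is the short repair window (plus AZ separation and scrubbing) that pushes the figure toward eleven.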

Metadata Sharding

Metadata is the bottleneck of the system. With billions of objects, metadata sharding is needed.

1. Sharding by Bucket

A simple option: all objects in the bucket are on one shard. The problem is hot buckets.

2. Sharding by Object Key Hash

hash(bucket_id + object_key) % N — uniform distribution, but LIST operations require scatter-gather.
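Hash-based routing and the scatter-gather LIST it forces can be sketched as follows; the shard count and in-memory layout are illustrative.

```python
import hashlib

N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]  # each shard: (bucket, key) -> metadata

def shard_for(bucket_id: str, object_key: str) -> int:
    """hash(bucket_id + object_key) % N — uniform spread of keys across shards."""
    digest = hashlib.sha256(f"{bucket_id}{object_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS

def put_meta(bucket_id: str, object_key: str, meta: dict):
    shards[shard_for(bucket_id, object_key)][(bucket_id, object_key)] = meta

def list_prefix(bucket_id: str, prefix: str, limit: int = 1000):
    # Scatter: any shard may hold matching keys, so all N must be queried.
    matches = []
    for shard in shards:
        matches.extend(k for (b, k) in shard if b == bucket_id and k.startswith(prefix))
    # Gather: merge and sort to present a single ordered listing.
    return sorted(matches)[:limit]
```

PUT/GET touch exactly one shard, but every LIST fans out to all N, which is why the next sections add a secondary index for prefix queries.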

3. Hybrid Approach

Sharding by bucket + secondary index for prefix queries. Used in production systems.

LIST operations problem

With hash-based sharding, LIST /bucket?prefix=/photos/ requires a request to all shards (scatter-gather).

Solutions:

  • Separate index for prefix queries
  • Range-based sharding for ordered listing
  • Denormalization into a separate table

Trade-offs:

  • Additional storage overhead
  • Consistency between indexes
  • Write-path complexity

Security Considerations

Access Control

  • IAM Policies: user/role-based access
  • Bucket Policies: resource-based rules
  • ACLs: object-level permissions
  • Pre-signed URLs: temporary access

Encryption

  • SSE-S3: server-side, S3-managed keys
  • SSE-KMS: server-side, keys managed in AWS KMS (auditing, rotation)
  • SSE-C: server-side, customer-provided keys
  • Client-side: encrypt before upload

Storage Classes

Connection

CDN Integration

Object Storage is often used as the origin for CDN.

Storage Class     | Latency   | Cost | Use Case
Standard (Hot)    | ms        | $$$  | Frequent access, production data
Infrequent Access | ms        | $$   | Rare access, backups
Archive (Glacier) | hours     | $    | Long-term storage, compliance
Deep Archive      | 12+ hours | ¢    | Archives, rarely needed data

Lifecycle Policies

{
  "rules": [
    {
      "filter": {"prefix": "logs/"},
      "transitions": [
        {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
        {"days": 90, "storageClass": "GLACIER"}
      ],
      "expiration": {"days": 365}
    }
  ]
}
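A hypothetical evaluator for such rules: given an object's key and age, it returns the storage class the lifecycle job should have moved it to (function and field names mirror the JSON above but the evaluator itself is illustrative).

```python
RULES = [{
    "filter": {"prefix": "logs/"},
    "transitions": [
        {"days": 30, "storageClass": "INFREQUENT_ACCESS"},
        {"days": 90, "storageClass": "GLACIER"},
    ],
    "expiration": {"days": 365},
}]

def resolve_class(key: str, age_days: int, rules=RULES) -> str:
    """Apply matching lifecycle rules; the latest passed transition wins."""
    storage_class = "STANDARD"
    for rule in rules:
        if not key.startswith(rule["filter"]["prefix"]):
            continue
        if age_days >= rule["expiration"]["days"]:
            return "EXPIRED"  # the lifecycle job deletes the object
        for t in sorted(rule["transitions"], key=lambda t: t["days"]):
            if age_days >= t["days"]:
                storage_class = t["storageClass"]
    return storage_class
```

In a real system this runs as a background batch job over metadata, not on the request path.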

Interview Questions

1. How to achieve 11 nines durability?

Replication in different AZs, erasure coding, checksums, fast recovery, disk health monitoring, periodic scrubbing.

2. How to optimize the upload of large files?

Multipart upload with parallel parts, resume in case of failures, pre-signed URLs for direct upload to storage.

3. Replication vs Erasure Coding?

Replication (3x overhead) for hot data with frequent access. Erasure coding (1.5x) for cold data, where CPU overhead is acceptable.

4. How to implement versioning?

Each PUT creates a new version with a unique version_id. Metadata stores version history. DELETE creates a delete marker.
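A minimal sketch of this versioning scheme (class and method names are illustrative): PUT appends a version, DELETE appends a delete marker, GET resolves either the latest live version or a specific version_id.

```python
import itertools

class VersionedBucket:
    def __init__(self):
        self.versions = {}   # key -> list of (version_id, data or None)
        self._counter = itertools.count(1)

    def put(self, key: str, data) -> str:
        """Every PUT appends a new version with a unique version_id."""
        version_id = f"v{next(self._counter)}"
        self.versions.setdefault(key, []).append((version_id, data))
        return version_id

    def delete(self, key: str) -> str:
        """DELETE writes a delete marker; history is preserved underneath."""
        return self.put(key, None)

    def get(self, key: str, version_id=None):
        history = self.versions.get(key, [])
        if version_id is None:
            return history[-1][1] if history else None  # None if top is a marker
        return dict(history).get(version_id)
```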

5. How does garbage collection work?

Background compaction: merging small files, removing unreferenced data, defragmenting disks. Mark-and-sweep or reference counting.
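The mark-and-sweep part can be sketched in a few lines: metadata records are the GC roots, and any chunk no record points at is garbage (the dict-based layout is illustrative).

```python
def sweep_unreferenced(metadata: dict, chunks: dict) -> dict:
    """Keep only chunks referenced by some metadata record."""
    live = {record["object_id"] for record in metadata.values()}      # mark phase
    return {cid: data for cid, data in chunks.items() if cid in live}  # sweep phase
```

In production this runs incrementally and must tolerate in-flight uploads (a chunk written before its metadata commit must not be swept), which is usually handled with grace periods or write timestamps.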

Key Findings

  • Separating metadata and data — the key architectural pattern for scaling
  • Durability through replication — 3+ copies in different failure domains
  • Erasure coding for cold data — storage savings while maintaining durability
  • Tiered storage — automatic data movement according to lifecycle policies
  • Multipart upload — for large files, with parallelism and resumability



© 2026 Alexander Polomodov