Object Storage is a distributed storage system for unstructured data (files, images, videos, backups) with access via HTTP API. Unlike file systems, object storage operates with a flat namespace and provides virtually unlimited scalability.
Source
System Design Interview Vol. 2
Chapter 'Design S3' with a detailed analysis of the object storage architecture.
Object Storage Examples
- Amazon S3: industry standard, 11 nines durability
- Google Cloud Storage: integration with BigQuery, ML services
- Azure Blob Storage: Hot/Cool/Archive tiers
- MinIO: S3-compatible open-source solution
- Ceph: distributed storage system (block, file, object)
Functional Requirements
Core API
- PUT /bucket/object — upload an object
- GET /bucket/object — download an object
- DELETE /bucket/object — delete an object
- LIST /bucket?prefix= — list objects by prefix
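The flat namespace behind this API can be sketched in a few lines: keys are plain strings, there is no directory tree, and LIST is just a prefix filter over the key space. A minimal in-memory model (illustrative, not a real client or server):

```python
class Bucket:
    """In-memory model of one bucket with a flat key namespace."""

    def __init__(self):
        self._objects = {}              # key -> bytes; no directory hierarchy

    def put(self, key, data):           # PUT /bucket/object
        self._objects[key] = bytes(data)

    def get(self, key):                 # GET /bucket/object
        return self._objects[key]

    def delete(self, key):              # DELETE /bucket/object
        self._objects.pop(key, None)    # deleting a missing key is a no-op

    def list(self, prefix=""):          # LIST /bucket?prefix=
        return sorted(k for k in self._objects if k.startswith(prefix))
```

Note that "/photos/2024/" in a key is just characters: prefix listing simulates folders without the storage system ever maintaining a directory structure.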
Advanced Features
- Multipart upload (for large files)
- Versioning (object version history)
- Lifecycle policies (auto-delete, archiving)
- Pre-signed URLs (temporary access)
Non-Functional Requirements
| Requirement | Target value | Rationale |
|---|---|---|
| Durability | 99.999999999% (11 nines) | No data should be lost |
| Availability | 99.99% | Data must be accessible almost all the time |
| Scalability | Exabytes+ | Data volume grows continuously, from petabytes toward exabytes |
| Throughput | Tbps+ | Parallel uploads/downloads |
| Object size | Up to 5 TB (S3) | Large file support |
High-level architecture
Theory
DDIA: Storage Engines
Deep dive into B-Trees, LSM-Trees and storage design.
Architecture Map
High-level data flow
Metadata Service
Stores object metadata: name, size, content-type, checksum, versions, ACL. This is a critical component—without metadata, the object cannot be found.
Metadata structure:
{
"bucket_id": "uuid",
"object_key": "/photos/2024/vacation.jpg",
"object_id": "uuid", // pointer to data
  "size": 4500000,
"content_type": "image/jpeg",
"checksum": "sha256:abc123...",
"version_id": "v3",
"created_at": "2024-01-15T10:30:00Z",
"storage_class": "STANDARD",
"replicas": ["node1", "node2", "node3"]
}
Data Store
The actual object data store. Objects are split into chunks and distributed across multiple disks/servers with replication.
Storage Strategies:
- Replication: 3 copies on different nodes
- Erasure Coding: (k, m) scheme to cut storage overhead
- Tiered Storage: Hot → Warm → Cold
Optimizations:
- Append-only writes for sequential I/O
- Batching small objects
- Compaction for garbage collection
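The split-and-replicate idea above can be sketched briefly: cut the object into fixed-size chunks and assign each chunk to three distinct nodes. The hash-based placement here is purely illustrative (real systems use placement algorithms like CRUSH or consistent hashing with failure-domain awareness):

```python
import hashlib

CHUNK_SIZE = 4      # tiny for the example; real systems use multi-MB chunks
NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical cluster

def split(data, chunk_size=CHUNK_SIZE):
    """Cut an object into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place(chunk_id, replicas=3):
    """Pick `replicas` distinct nodes for a chunk via a hash of its id."""
    h = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]
```

Consecutive nodes in a ring are chosen only for brevity; in production the replicas would be forced into different failure domains (racks, AZs).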
Upload Flow
Upload Flow
Simple upload path
Multipart Upload (large files)
For files larger than 5 GB (the single-PUT limit in S3), multipart upload is used. It allows uploads to be parallelized and resumed after failures.
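The flow has three phases: initiate, upload parts in any order, complete by assembling in part-number order. A minimal in-memory sketch of the server side (names are illustrative, not a real implementation):

```python
import uuid

class MultipartUpload:
    def __init__(self):
        self.uploads = {}                       # upload_id -> {part_number: data}

    def initiate(self):                         # step 1: issue an upload_id
        upload_id = str(uuid.uuid4())
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # step 2: parts may arrive in any order; re-uploading a part
        # simply overwrites it, which makes retries safe
        self.uploads[upload_id][part_number] = data

    def complete(self, upload_id):              # step 3: assemble in order
        parts = self.uploads.pop(upload_id)
        return b"".join(parts[n] for n in sorted(parts))
```

Because completion sorts by part number, clients can push parts from many connections concurrently and retry any failed part in isolation.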
1. Initialization
POST /bucket/object?uploads → upload_id
2. Upload parts (in parallel)
PUT /bucket/object?uploadId=X&partNumber=1 → ETag1
PUT /bucket/object?uploadId=X&partNumber=2 → ETag2
...
3. Completion
POST /bucket/object?uploadId=X
{
"parts": [
{"partNumber": 1, "ETag": "..."},
{"partNumber": 2, "ETag": "..."}
]
}
Durability & Replication
Deeper
Database Internals
A detailed analysis of replication and consensus in distributed systems.
Replication
Creating multiple copies of data on different nodes/data centers.
Advantages:
- Ease of implementation
- Fast recovery
- Low CPU load
Drawbacks:
- 3x storage overhead
- High cost
Erasure Coding
Splitting data into k parts + m parity blocks. Reconstruction from any k parts.
Advantages:
- 1.5x overhead (vs 3x for replication)
- Cost savings on cold data
- High durability
Drawbacks:
- High CPU load when writing/reading
- Complex recovery
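The simplest erasure code makes the (k, m) idea concrete: k data blocks plus m = 1 XOR parity block (RAID-5 style), so any single lost block can be rebuilt by XOR-ing the survivors. Real systems use Reed-Solomon codes for m > 1; this toy version is only a sketch:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode(data_blocks):
    """(k, 1) scheme: append one XOR parity block to k data blocks."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover(surviving_blocks):
    """Rebuild the single missing block: it equals the XOR of all survivors."""
    return xor_blocks(surviving_blocks)
```

With k = 3 and m = 1 the overhead is 4/3 ≈ 1.33x instead of 3x, at the price of extra CPU on every write and a full reconstruction read on failure.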
How to get many nines of durability
How is 99.999999999% durability achieved with three replicas?
# Assume:
# - Annual Failure Rate (AFR) of one disk = 2%
# - 3 replicas in different failure domains
P(loss of 1 disk)                  = 0.02
P(loss of 2 disks simultaneously)  = 0.02 × 0.02 = 0.0004
P(loss of 3 disks before recovery) = 0.0004 × 0.02 = 0.000008
# Add: separate AZs, fast recovery, monitoring
# → 99.999999999% durability
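The estimate above is deliberately rough; the key refinement is the repair window: data is lost only if the remaining replicas fail *before* the first failed one is rebuilt. A naive model assuming independent failures and a hypothetical 24-hour rebuild time:

```python
import math

afr = 0.02                 # annual failure rate of one disk (assumed)
repair_window = 24 / 8760  # 24-hour rebuild, as a fraction of a year (assumed)

# One replica fails during the year; data is lost only if the other two
# replicas also fail within the repair window.
p_fail_in_window = afr * repair_window
p_data_loss = afr * p_fail_in_window ** 2   # roughly 6e-11 in this model

nines = -math.log10(p_data_loss)            # about 10 nines here
print(f"P(loss) ≈ {p_data_loss:.2e}, ≈ {nines:.1f} nines")
```

This crude model ignores correlated failures (shared rack, firmware, power), which is exactly why real systems spread replicas across AZs and scrub data continuously.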
Metadata Sharding
Metadata is the bottleneck of the system. With billions of objects, metadata sharding is needed.
1. Sharding by Bucket
A simple option: all objects in the bucket are on one shard. The problem is hot buckets.
2. Sharding by Object Key Hash
hash(bucket_id + object_key) % N — uniform distribution, but LIST operations require scatter-gather.
3. Hybrid Approach
Sharding by bucket + secondary index for prefix queries. Used in production systems.
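Option 2 and its cost can be shown in a few lines: a point lookup hashes to exactly one shard, but a prefix LIST must fan out to every shard and merge. A sketch with hypothetical names:

```python
import hashlib

N_SHARDS = 4

def shard_for(bucket_id, object_key):
    """Route a (bucket, key) pair to one metadata shard by hash."""
    h = hashlib.sha256(f"{bucket_id}/{object_key}".encode()).hexdigest()
    return int(h, 16) % N_SHARDS

def list_prefix(shards, bucket_id, prefix):
    """Scatter-gather LIST: query every shard, then merge and sort."""
    results = []
    for shard in shards:            # one round-trip per shard in real life
        results.extend(
            key for (b, key) in shard
            if b == bucket_id and key.startswith(prefix)
        )
    return sorted(results)
```

The sorted merge at the end is what makes scatter-gather expensive: latency is bounded by the slowest shard, and every LIST touches all N of them.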
LIST operations problem
With hash-based sharding, LIST /bucket?prefix=/photos/ requires a request to all shards (scatter-gather).
Solutions:
- Separate index for prefix queries
- Range-based sharding for ordered listing
- Denormalization into a separate table
Trade-offs:
- Additional storage overhead
- Consistency between indexes
- More complex write operations
Security Considerations
Access Control
- IAM Policies: user/role-based access
- Bucket Policies: resource-based rules
- ACLs: object-level permissions
- Pre-signed URLs: temporary access
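Pre-signed URLs deserve a sketch: the server signs the path and an expiry with a secret, so anyone holding the URL has access until it expires, with no credentials exchanged. This is a simplified illustrative scheme, not the real AWS Signature V4 algorithm:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"   # hypothetical; never leaves the server

def presign(path, ttl_seconds, now=None):
    """Return a URL valid for ttl_seconds from `now`."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    base = f"{path}?expires={expires}"
    sig = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return f"{base}&signature={sig}"

def verify(url, now=None):
    """Check the signature and the expiry; reject any tampering."""
    base, sig = url.rsplit("&signature=", 1)
    expires = int(base.rsplit("expires=", 1)[1])
    expected = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and \
        (now if now is not None else time.time()) < expires
```

Because the expiry is inside the signed string, extending it invalidates the signature; `compare_digest` avoids timing side channels.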
Encryption
- SSE-S3: server-side encryption with S3-managed keys
- SSE-KMS: keys managed via KMS, customer-controlled
- SSE-C: customer-provided keys
- Client-side: encrypt before upload
Storage Classes
Connection
CDN Integration
Object Storage is often used as the origin for a CDN.
| Storage Class | Latency | Cost | Use Case |
|---|---|---|---|
| Standard (Hot) | ms | $$$ | Frequent access, production data |
| Infrequent Access | ms | $$ | Rare access, backups |
| Archive (Glacier) | hours | $ | Long-term storage, compliance |
| Deep Archive | 12+ hours | ¢ | Archives, rarely needed data |
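Moving objects between the tiers in the table is driven by lifecycle rules keyed on object age. A minimal evaluator (thresholds are illustrative, in the spirit of the policy JSON below):

```python
def storage_class(age_days):
    """Pick a storage class from object age; thresholds are illustrative."""
    if age_days < 30:
        return "STANDARD"            # hot: frequent access
    if age_days < 90:
        return "INFREQUENT_ACCESS"   # warm: rare access
    if age_days < 365:
        return "GLACIER"             # cold: archival
    return "EXPIRED"                 # eligible for deletion
```

In production this runs as a background job that scans metadata, compares each object's age against its bucket's rules, and enqueues transition or delete operations.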
Lifecycle Policies
{
"rules": [
{
"filter": {"prefix": "logs/"},
"transitions": [
{"days": 30, "storageClass": "INFREQUENT_ACCESS"},
{"days": 90, "storageClass": "GLACIER"}
],
"expiration": {"days": 365}
}
]
}
Interview Questions
1. How to achieve 11 nines durability?
Replication in different AZs, erasure coding, checksums, fast recovery, disk health monitoring, periodic scrubbing.
2. How to optimize uploads of large files?
Multipart upload with parallel parts, resume in case of failures, pre-signed URLs for direct upload to storage.
3. Replication vs Erasure Coding?
Replication (3x overhead) for hot data with frequent access. Erasure coding (1.5x) for cold data, where CPU overhead is acceptable.
4. How to implement versioning?
Each PUT creates a new version with a unique version_id. Metadata stores version history. DELETE creates a delete marker.
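The versioning answer above can be sketched as an append-only version list per key, where a delete marker is just a version with no data (names are illustrative):

```python
import itertools

class VersionedKey:
    """Version history for one object key; a None payload is a delete marker."""

    def __init__(self):
        self._versions = []                 # list of (version_id, data or None)
        self._counter = itertools.count(1)

    def put(self, data):                    # every PUT appends a new version
        vid = f"v{next(self._counter)}"
        self._versions.append((vid, data))
        return vid

    def delete(self):                       # DELETE appends a delete marker
        self._versions.append((f"v{next(self._counter)}", None))

    def get(self):                          # latest version, unless deleted
        if not self._versions or self._versions[-1][1] is None:
            raise KeyError("object deleted or absent")
        return self._versions[-1][1]

    def get_version(self, vid):             # any historical version by id
        return dict(self._versions)[vid]
```

Because DELETE only appends a marker, older versions stay retrievable until a lifecycle rule or explicit purge removes them.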
5. How does garbage collection work?
Background compaction: merging small files, removing unreferenced data, defragmenting disks. Mark-and-sweep or reference counting.
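The mark-and-sweep variant from the answer above fits in a few lines: mark every chunk referenced by live metadata, then sweep away the rest. A sketch over plain dicts (structure is illustrative):

```python
def sweep(chunk_store, metadata):
    """Delete chunks no live object references.

    chunk_store: dict of chunk_id -> bytes
    metadata:    dict of object_key -> {"chunks": [chunk_id, ...]}
    """
    # mark: collect every chunk id reachable from live metadata
    live = {cid for obj in metadata.values() for cid in obj["chunks"]}
    # sweep: drop everything else (list() lets us mutate while iterating)
    for chunk_id in list(chunk_store):
        if chunk_id not in live:
            del chunk_store[chunk_id]
```

In a real system the pass must be careful about in-flight uploads (chunks written but not yet referenced), typically by sweeping only chunks older than some grace period.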
Key Findings
- ✓ Separating metadata and data — the key architectural pattern for scaling
- ✓ Durability through replication — 3+ copies in different failure domains
- ✓ Erasure coding for cold data — saves storage while maintaining durability
- ✓ Tiered storage — automatic data movement via lifecycle policies
- ✓ Multipart upload — large files with parallelism and resumability
