Social Media Infrastructure — System Design Space

A social platform looks like one product from the outside, but underneath it is a set of tightly connected systems: feed serving, publishing, the social graph, notifications, moderation, and operability controls.

The case helps draw the boundary between shared platform capabilities and domain-specific services, showing where a common infrastructure layer is justified and where separate pipelines stay healthier.

For interviews and engineering discussions, this case is useful because it moves the conversation from isolated services to platform-wide resilience: how to limit coupling, survive spikes, and justify platform investments.

Hybrid Fanout

One distribution strategy does not fit every user: ordinary accounts and celebrity-scale authors need different balances between precomputation, read-time assembly, and caching.

Fault Isolation

Ranking, moderation, and notifications should never take down the baseline feed path, which is why those dependencies need to be isolated before the incident begins.

Graceful Degradation

During spikes and partial failures, the platform has to simplify the feed and shed secondary features while keeping the main user journey alive.

Platform SLOs

Reliability is defined less by one healthy service and more by whether the feed opens on time, posts are acknowledged quickly, and the team notices degradation early.

Social media infrastructure is not a conversation about one feature — it is about what breaks first: the product is read around the clock, constantly written to, and personalized, and every so often a single author draws a spike large enough to overload downstream services. In interviews, this is where you show whether you can hold an architecture together with clear fault isolation, graceful degradation, observability, and reliable control loops in a product where reads outnumber writes by orders of magnitude — and where the feed read is the first thing to degrade.

Source

Acing the System Design Interview

Chapter 14: an infrastructure-wide view of a social platform with a strong reliability focus.

Where this pattern matters most

Twitter/X: massive post fanout, ranking, and protection from celebrity-driven spikes.
Instagram / Threads: media-heavy feeds where moderation and notifications must degrade in a controlled way rather than take the whole serving path down.
TikTok: personalization is expensive, yet the feed still has to serve fast — otherwise the cost of model quality turns into user-visible latency.
LinkedIn / Reddit: here you have to balance relevance, freshness, and predictable serving without sacrificing platform stability for any one of the three.

Functional requirements

Core APIs and user journeys

POST /posts - publish content
GET /feed - fetch the personalized feed
POST /interactions - likes, comments, and reposts
POST /relationships - follow and unfollow

Platform capabilities

Hybrid feed distribution for regular users and very high-follower accounts
Moderation and policy checks before content enters serving paths
Separate notification and ranking paths with controlled fallback behavior
Operational support: incident runbooks, release guardrails, and safe replay flows

Non-functional requirements

Requirement	Target	Why it matters
Feed-open latency (p95)	< 250ms	A core retention journey that users feel immediately.
Publish acknowledgement latency (p95)	< 400ms	Creators need fast feedback after posting.
Platform availability	99.95%	The product is part of users’ daily routine, so outages are highly visible.
High-follower spike handling	No cascade failures	One post from a major account can create extreme fanout pressure.
Error-budget governance	SLO-driven releases	Release speed has to stay aligned with production risk.

The core trade-off here runs between personalization depth and end-user latency: the more signals ranking weighs, the more expensive it is to open the feed. And the system should be judged on the end-to-end availability of the user journey, not on individual services looking healthy by their own internal metrics.
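
The availability and error-budget rows in the table can be made concrete with a little arithmetic. The sketch below converts the 99.95% target into allowed downtime; the 30-day window and the function names are assumptions for illustration, not part of the original requirements.

```python
# Sketch: turn an availability target into an error budget.
# The 99.95% target comes from the requirements table; the 30-day
# window is an assumption chosen for illustration.

def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the downtime allowed within the window, in minutes."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def budget_remaining(availability_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent fraction of the budget (negative once overspent)."""
    budget = error_budget_minutes(availability_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.9995)
    print(f"Allowed downtime per 30 days: {budget:.1f} minutes")
    print(f"Budget left after a 10-minute incident: {budget_remaining(0.9995, 10.0):.0%}")
```

Under these assumptions the budget is roughly 21.6 minutes per 30 days, so a single 10-minute incident spends close to half of it. That is the kind of number an SLO-driven release policy can act on.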

High-Level Architecture

Theory

Twitter/X

Practical feed case: distribution strategy, cache topology, ranking, and scaling trade-offs.

Diagram: publish path, feed serving, and the operability control loop

Core services (product services and the social graph)

Post Service - create and update content
Timeline Service - feed assembly
Ranking Service - personalization
Graph Service - follows and edges

This topology combines content publishing, feed serving, and the control loop that keeps the platform stable.

The architecture separates the user-facing data path from the control loop that owns SLOs, observability, and degraded modes. That separation limits blast radius, reduces coupling between services, and keeps the platform steadier during spikes.

Write and Read Paths

How publishing moves through the platform and how the feed is served under heavy read load.

Client Post - create content
Gateway + Auth - validate request
Post Service - durable commit
Async Fanout - timeline + moderation
User Signals - feed + notifications

Write path: the post request is validated, committed to durable storage, and then propagated through asynchronous fanout into timeline, notification, and moderation paths.

Write path checkpoints

• Durable post commit happens before downstream fanout.
• Moderation and notifications are typically asynchronous and isolated from core feed availability.
• Celebrity posts require controlled fanout to avoid consumer overload.
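
The ordering in these checkpoints can be sketched in a few lines. This is a minimal in-memory illustration, not the platform's actual implementation: `PostStore`, `PublishPath`, and the three queues are hypothetical names, and a dictionary stands in for durable storage.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class PostStore:
    """In-memory stand-in for durable storage (hypothetical)."""
    posts: dict = field(default_factory=dict)

    def commit(self, post_id: str, author_id: str, body: str) -> None:
        self.posts[post_id] = {"author_id": author_id, "body": body}


@dataclass
class PublishPath:
    store: PostStore
    # One queue per downstream path, so a slow consumer on one path
    # does not block the others.
    timeline_q: queue.Queue = field(default_factory=queue.Queue)
    moderation_q: queue.Queue = field(default_factory=queue.Queue)
    notification_q: queue.Queue = field(default_factory=queue.Queue)

    def publish(self, author_id: str, body: str) -> str:
        if not body.strip():
            raise ValueError("empty post")            # request validation
        post_id = uuid.uuid4().hex
        self.store.commit(post_id, author_id, body)   # 1. durable commit first
        event = {"post_id": post_id, "author_id": author_id}
        for q in (self.timeline_q, self.moderation_q, self.notification_q):
            q.put_nowait(event)                       # 2. enqueue without waiting on consumers
        return post_id                                # 3. acknowledge the client
```

The acknowledgement depends only on the commit and three non-blocking enqueues, which is what keeps publish latency independent of downstream consumers. The sketch does not model whether timeline insertion should wait for a moderation verdict; a real system has to make that decision explicitly.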

One strategy is not enough. For regular users, fanout-on-write is cheaper — the feed is assembled ahead of time and opens fast. But for high-follower accounts that same write turns into an avalanche of updates, so for them and for hot feed segments the system switches to fanout-on-read.
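
A toy version of that switch is sketched below, assuming a simple follower-count cutoff. `HybridFeed` and `CELEBRITY_THRESHOLD` are hypothetical names, and the threshold of 3 is only there to keep the example small; a production system would choose and tune its own criteria.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, tiny on purpose


class HybridFeed:
    def __init__(self) -> None:
        self.followers = defaultdict(set)      # author -> users following them
        self.following = defaultdict(set)      # user -> authors they follow
        self.timelines = defaultdict(list)     # user -> precomputed (ts, post_id)
        self.author_posts = defaultdict(list)  # author -> own (ts, post_id)

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def publish(self, author: str, post_id: str, ts: int) -> None:
        self.author_posts[author].append((ts, post_id))
        if not self.is_celebrity(author):
            # Fanout-on-write: push into each follower's precomputed timeline.
            for user in self.followers[author]:
                self.timelines[user].append((ts, post_id))

    def read_feed(self, user: str, limit: int = 10) -> list:
        # A set removes duplicates if an author crossed the threshold
        # after some posts had already been pushed.
        merged = set(self.timelines[user])
        # Fanout-on-read: pull celebrity posts only when the feed is opened.
        for author in self.following[user]:
            if self.is_celebrity(author):
                merged.update(self.author_posts[author])
        return [post_id for _, post_id in sorted(merged, reverse=True)[:limit]]
```

The write cost for a celebrity post stays constant regardless of follower count, while the read path pays for a merge. Caching of the celebrity post lists, which the read path would need at scale, is left out of the sketch.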

Resilience and operations

Deeper

Observability and Monitoring

User-journey SLOs, traces, error budgets, and operational decision loops.

SLO contract

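Platform paths should be tied to user-facing SLOs and SLIs rather than isolated service metrics. In the shorthand below, "+" means the SLO is evaluated over all of the listed indicators together, not that their values are summed: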

feed_open_slo = latency_p95 + error_rate + freshness
publish_slo   = ack_latency + durability + moderation_delay

Error budgets keep risky releases under control.
Trace coverage shortens root-cause analysis.
Golden signals show latency, traffic, errors, and saturation.
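
One way to read the feed-open contract is as a conjunction of checks over a window of requests. The sketch below assumes that reading. Only the 250 ms latency target comes from the requirements table; the error-rate and freshness thresholds, the freshness proxy, and all names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FeedOpenSample:
    latency_ms: float         # time to open the feed
    ok: bool                  # False if the request failed
    newest_item_age_s: float  # freshness proxy (assumption): age of the newest post shown


def p95(values: list) -> float:
    """Nearest-rank 95th percentile; integer math avoids float rounding."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_feed_open_slo(samples: list,
                           max_p95_ms: float = 250.0,       # from the requirements table
                           max_error_rate: float = 0.001,   # assumed threshold
                           max_p95_age_s: float = 60.0) -> dict:  # assumed threshold
    """Check each indicator separately; the SLO is met only if all pass."""
    latency_ok = p95([s.latency_ms for s in samples]) < max_p95_ms
    errors_ok = sum(not s.ok for s in samples) / len(samples) <= max_error_rate
    freshness_ok = p95([s.newest_item_age_s for s in samples]) <= max_p95_age_s
    return {
        "latency": latency_ok,
        "errors": errors_ok,
        "freshness": freshness_ok,
        "met": latency_ok and errors_ok and freshness_ok,
    }
```

Reporting each indicator alongside the overall verdict matters in practice: a window can fail on freshness alone while latency and error rate look healthy, which is exactly the case that service-level dashboards tend to miss.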

Degradation strategy

Bulkheads and circuit breakers isolate ranking and moderation from the baseline feed path.
Fallback serving switches to simpler ordering when personalization degrades.
Load shedding turns off secondary features under saturation.
Progressive rollout and critical-journey prioritization help the platform absorb incidents.
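
The first two items can be illustrated together: a circuit breaker around the ranking call, with a simpler ordering served when the breaker is open or the call fails. This is a minimal sketch, not a production breaker. The names are hypothetical, and reverse-chronological order is an assumed choice of "simpler ordering".

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def serve_feed(candidates: list, rank, breaker: CircuitBreaker):
    """candidates: list of (timestamp, post_id). Returns (post_ids, mode)."""
    if breaker.allow():
        try:
            ranked = rank(candidates)
            breaker.record_success()
            return ranked, "personalized"
        except Exception:
            breaker.record_failure()
    # Fallback: newest first, with no dependency on the ranking service.
    fallback = [post_id for _, post_id in sorted(candidates, reverse=True)]
    return fallback, "chronological"
```

Once the breaker opens, the feed path stops calling ranking at all until the cool-down expires, so a ranking outage costs personalization quality rather than feed availability. Returning the mode alongside the result lets the caller emit a metric for how often degraded serving is in effect.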

Risks and typical mistakes

Large celebrity blast radius: unrestricted fanout overloads downstream services.
Over-coupled ranking: an ML outage should not take down the baseline feed.
Weak moderation integration: forbidden content can leak into serving paths before policies are applied.
No degraded mode: partial dependency failures turn into a full outage too easily.
Metric blindness: services look healthy while the user journey is already degrading.

What to cover in an interview

Where service boundaries sit and which paths must survive neighboring failures.
Why the system combines fanout-on-write and fanout-on-read for different user segments.
What degraded feed mode looks like when ranking or moderation is impaired.
Which SLOs and SLIs actually govern release decisions and incident response.

References

Akos Lada, Meihong Wang, Tak Yan - News Feed ranking, powered by machine learning (Engineering at Meta, 2021)
Engineering at Meta - How Meta built the infrastructure for Threads (2023)
Google - Site Reliability Engineering, Ch. 4: Service Level Objectives (SLIs/SLOs/SLAs)
Netflix - Hystrix: How it Works (bulkhead, circuit breaker, fallback)

Related chapters

Twitter/X - Practical social-feed case: large-scale fanout, ranking, cache design, and spike handling.
Event-Driven Architecture - How to decouple publishing, feed assembly, and notifications through events so you can scale and fail them independently.
Resilience patterns - Failure isolation and graceful degradation — so one dependency outage does not turn into a user-facing incident.
Observability and monitoring design - The SLOs and traces without which healthy service dashboards hide a degrading user journey.
SRE Book - Error budgets as a mechanism: they decide when to ship new work and when to slow down for reliability.
Notification System - An adjacent engagement pipeline with the same logic: asynchronous delivery and degraded modes planned in advance.