System Design Space

Updated: April 11, 2026 at 7:35 PM

Social Media Infrastructure

Difficulty: medium

Classic task: feed serving, publishing, failure isolation, graceful degradation, and operability of a social platform.

A social platform looks like one product from the outside, but underneath it is a set of tightly connected systems: feed serving, publishing, the social graph, notifications, moderation, and operability controls.

The case helps draw the boundary between shared platform capabilities and domain-specific services, showing where a common infrastructure layer is justified and where separate pipelines stay healthier.

For interviews and engineering discussions, this case is useful because it moves the conversation from isolated services to platform-wide resilience: how to limit coupling, survive spikes, and justify platform investments.

Hybrid Fanout

One distribution strategy does not fit every user: ordinary accounts and celebrity-scale authors need different balances between precomputation, read-time assembly, and caching.

Fault Isolation

Ranking, moderation, and notifications should never take down the baseline feed path, which is why those dependencies need to be isolated before the incident begins.

Graceful Degradation

During spikes and partial failures, the platform has to simplify the feed and shed secondary features while keeping the main user journey alive.
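As a rough illustration of shedding secondary features under load, the sketch below keeps only the features whose priority fits the current saturation level. The feature names, priority tiers, and saturation thresholds are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    priority: int  # 0 = critical user journey, higher = more optional


# Hypothetical feature catalogue; a real platform would define its own tiers.
FEATURES = [
    Feature("feed_baseline", 0),
    Feature("publish", 0),
    Feature("notifications", 1),
    Feature("personalized_ranking", 2),
    Feature("suggested_accounts", 3),
]


def enabled_features(saturation: float) -> set[str]:
    """Return the features to keep on for a saturation level in [0, 1].

    The thresholds are assumptions for illustration, not recommended values.
    """
    if saturation < 0.70:
        max_priority = 3
    elif saturation < 0.85:
        max_priority = 2
    elif saturation < 0.95:
        max_priority = 1
    else:
        max_priority = 0
    return {f.name for f in FEATURES if f.priority <= max_priority}
```

At very high saturation only the priority-0 journeys (opening the feed and publishing) stay on, which mirrors the idea of keeping the main user journey alive.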

Platform SLOs

Reliability is defined less by one healthy service and more by whether the feed opens on time, posts are acknowledged quickly, and the team notices degradation early.

Social media infrastructure is a case about keeping the whole platform stable, not about designing one isolated feature. In interviews, this is where you show whether you can build a large consumer system with clear fault isolation, graceful degradation, observability, and reliable control loops for a product dominated by read traffic.

Source

Acing the System Design Interview

Chapter 14: an infrastructure-wide view of a social platform with a strong reliability focus.

Read the review

Where this pattern matters most

  • Twitter/X: massive post fanout, ranking, and protection from celebrity-driven spikes.
  • Instagram / Threads: media-heavy feeds with moderation, notifications, and degraded serving modes.
  • TikTok: very fast feed serving on top of expensive personalization.
  • LinkedIn / Reddit: balancing relevance, freshness, and stable platform SLOs.

Functional requirements

Core APIs and user journeys

  • POST /posts - publish content
  • GET /feed - fetch the personalized feed
  • POST /interactions - likes, comments, and reposts
  • POST /relationships - follow and unfollow

Platform capabilities

  • Hybrid feed distribution for regular users and very high-follower accounts
  • Moderation and policy checks before content enters serving paths
  • Separate notification and ranking paths with controlled fallback behavior
  • Operational support: incident runbooks, release guardrails, and safe replay flows

Non-functional requirements

Requirement | Target | Why it matters
Feed-open latency (p95) | < 250 ms | A core retention journey that users feel immediately.
Publish acknowledgement latency (p95) | < 400 ms | Creators need fast feedback after posting.
Platform availability | 99.95% | The product is part of users’ daily routine, so outages are highly visible.
High-follower spike handling | No cascade failures | One post from a major account can create extreme fanout pressure.
Error-budget governance | SLO-driven releases | Release speed has to stay aligned with production risk.

The core trade-off here is between personalization depth and end-user latency, while overall platform availability matters more than a single service looking healthy in isolation.
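To make the 99.95% availability target concrete, the arithmetic below converts it into an error budget and uses that budget as a simple release gate. The 30-day window and the 20% reserve are assumptions for illustration; the source only states the target and that releases are SLO-driven.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a target over a rolling window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def release_allowed(
    downtime_minutes: float,
    availability_target: float,
    window_days: int = 30,
    reserve_fraction: float = 0.2,  # hypothetical safety reserve
) -> bool:
    """Permit a risky release only while enough error budget remains."""
    budget = error_budget_minutes(availability_target, window_days)
    remaining = budget - downtime_minutes
    return remaining > budget * reserve_fraction


# 99.95% over 30 days leaves about 21.6 minutes of downtime.
print(round(error_budget_minutes(0.9995), 1))
```

With 5 minutes of downtime already spent, this gate would still allow a release; with 20 minutes spent, it would not.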

High-Level Architecture

Theory

Twitter/X

Practical feed case: distribution strategy, cache topology, ranking, and scaling trade-offs.

Read the review

publish path, feed serving, and the operability control loop

This topology combines content publishing, feed serving, and the control loop that keeps the platform stable.

  • Client Apps: web and mobile
  • API Gateway: auth and routing
  • Post Service: create and update content
  • Timeline Service: feed assembly
  • Ranking Service: personalization
  • Graph Service: follows and edges
  • Event Bus: fanout backbone
  • Feed Cache: hot feed slices
  • Post Store: durable source of truth
  • Graph Store: social edges
  • Notifications: push and email
  • Moderation: policy checks
  • Observability: logs, metrics, traces
  • SLO Controller: degradation policy

The architecture separates the user-facing data path from the control loop that owns SLOs, observability, and degraded modes. That separation limits blast radius, reduces coupling between services, and keeps the platform steadier during spikes.

Write and Read Paths

How publishing moves through the platform and how the feed is served under heavy read load.

Write path: the post request is validated, committed to durable storage, and then propagated through asynchronous fanout into timeline, notification, and moderation paths.

  • Layer 1, Client Post (create content): The user publishes content from a mobile or web client.
  • Layer 2, Gateway + Auth (validate request): The gateway checks auth and quotas, then routes the request to the post service.
  • Layer 3, Post Service (durable commit): The post is committed to durable storage and an event is produced.
  • Layer 4, Async Fanout (timeline + moderation): The event fans out into timeline build, moderation, and notification pipelines.
  • Layer 5, User Signals (feed + notifications): Followers receive timeline updates and notifications without blocking the publish ACK.

Write path checkpoints

  • Durable post commit happens before downstream fanout.
  • Moderation and notifications are typically asynchronous and isolated from core feed availability.
  • Celebrity posts require controlled fanout to avoid consumer overload.
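The ordering described in the checkpoints above (commit first, then emit an event, then acknowledge without waiting for consumers) can be sketched with in-memory stand-ins. Here `post_store`, `event_bus`, `publish_post`, and `drain_events` are hypothetical names; a real system would use a durable database and a message broker instead of a dict and a local queue.

```python
import queue
import uuid

post_store: dict[str, dict] = {}               # stand-in for the durable Post Store
event_bus: "queue.Queue[dict]" = queue.Queue()  # stand-in for the Event Bus


def publish_post(author_id: str, text: str) -> dict:
    """Validate, commit, emit an event, and ACK without waiting for fanout."""
    if not text.strip():
        raise ValueError("empty post")
    post_id = uuid.uuid4().hex
    # 1. Durable commit happens before any downstream work.
    post_store[post_id] = {"author_id": author_id, "text": text}
    # 2. Emit an event for timeline, moderation, and notification consumers.
    event_bus.put({"type": "post_created", "post_id": post_id, "author_id": author_id})
    # 3. Acknowledge immediately; consumers run asynchronously.
    return {"post_id": post_id, "status": "accepted"}


def drain_events(handlers) -> int:
    """Deliver each queued event to every handler, isolating handler failures."""
    delivered = 0
    while True:
        try:
            event = event_bus.get_nowait()
        except queue.Empty:
            return delivered
        for handler in handlers:
            try:
                handler(event)
            except Exception:
                # A failing consumer must not block the others; a real system
                # would retry or route the event to a dead-letter queue.
                pass
        delivered += 1
```

Because each handler is wrapped separately, a failing notification or moderation consumer in this sketch does not stop the timeline consumer from seeing the event.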

In practice, the system often mixes fanout-on-write and fanout-on-read so it can serve most users fast without forcing the same strategy onto celebrity-scale accounts.
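A minimal sketch of this hybrid approach, assuming a simple follower-count threshold decides which strategy an author gets: regular authors are pushed into follower timelines at write time, and high-follower authors are pulled and merged at read time. The threshold value and all function names are hypothetical, and the in-memory dicts stand in for the graph store, feed cache, and post store.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical and tiny, so the example stays small

followers = defaultdict(set)      # author -> users who follow the author
following = defaultdict(set)      # user -> authors the user follows
timelines = defaultdict(list)     # user -> precomputed (ts, post_id) entries
author_posts = defaultdict(list)  # author -> (ts, post_id) entries


def follow(user: str, author: str) -> None:
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author: str) -> bool:
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def on_post(author: str, post_id: str, ts: int) -> None:
    """Fanout-on-write for regular authors; celebrities are only recorded."""
    author_posts[author].append((ts, post_id))
    if not is_celebrity(author):
        for follower in followers[author]:
            timelines[follower].append((ts, post_id))


def read_feed(user: str, limit: int = 10) -> list[str]:
    """Merge the precomputed timeline with celebrity posts pulled at read time."""
    items = set(timelines[user])
    for author in following[user]:
        if is_celebrity(author):
            items.update(author_posts[author])
    # The set removes duplicates if an author crossed the threshold after
    # some of their posts had already been pushed.
    newest_first = sorted(items, reverse=True)
    return [post_id for _, post_id in newest_first[:limit]]
```

The write cost for a celebrity post stays constant in this sketch, while the read path pays a small merge cost only for users who follow such accounts.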

Resilience and operations

Deeper

Observability and Monitoring

User-journey SLOs, traces, error budgets, and operational decision loops.

Read the review

SLO contract

Platform paths should be tied to user-facing SLOs and SLIs rather than isolated service metrics:

feed_open_slo = latency_p95 + error_rate + freshness
publish_slo   = ack_latency + durability + moderation_delay

  • Error budgets keep risky releases under control.
  • Trace coverage shortens root-cause analysis.
  • Golden signals show latency, traffic, errors, and saturation.
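One way to read the `feed_open_slo` composition above is as a conjunction of per-signal checks over a measurement window. The sketch below assumes that reading. The 250 ms p95 threshold comes from the non-functional requirements; the error-rate and freshness thresholds, and the function names, are assumptions for illustration.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def feed_open_slo(
    latencies_ms: list[float],
    errors: int,
    requests: int,
    freshness_s: float,
    *,
    max_p95_ms: float = 250.0,      # from the requirements table
    max_error_rate: float = 0.001,  # hypothetical threshold
    max_freshness_s: float = 60.0,  # hypothetical threshold
) -> dict[str, bool]:
    """Evaluate each SLI and the combined user-journey SLO."""
    checks = {
        "latency_p95": p95(latencies_ms) < max_p95_ms,
        "error_rate": (errors / requests) <= max_error_rate,
        "freshness": freshness_s <= max_freshness_s,
    }
    checks["slo_met"] = all(checks.values())
    return checks
```

Reporting the individual checks next to the combined result makes it easier to see which signal pushed the user journey out of its SLO.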

Degradation strategy

  • Bulkheads and circuit breakers isolate ranking and moderation from the baseline feed path.
  • Fallback serving switches to simpler ordering when personalization degrades.
  • Load shedding turns off secondary features under saturation.
  • Progressive rollout and critical-journey prioritization help the platform absorb incidents.
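The first two points above can be sketched together: a small circuit breaker guards the ranking call, and the feed falls back to reverse-chronological order when ranking fails or the breaker is open. The class, its parameters, and `serve_feed` are hypothetical, and the breaker is deliberately simplified (no concurrency handling and no limit on half-open trial calls).

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures and allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def serve_feed(posts, rank, breaker):
    """Return (feed, mode); posts are dicts with 'id' and 'ts' keys."""
    if breaker.allow():
        try:
            ranked = rank(posts)
            breaker.record_success()
            return ranked, "personalized"
        except Exception:
            breaker.record_failure()
    # Fallback: simpler ordering that does not depend on the ranking service.
    return sorted(posts, key=lambda p: p["ts"], reverse=True), "chronological"
```

While the breaker is open, the ranking dependency is not called at all, so a slow or failing ranker cannot add latency to the baseline feed path.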

Risks and typical mistakes

  • Large celebrity blast radius: unrestricted fanout overloads downstream services.
  • Over-coupled ranking: an ML outage should not take down the baseline feed.
  • Weak moderation integration: forbidden content can leak into serving paths before policies are applied.
  • No degraded mode: partial dependency failures turn into a full outage too easily.
  • Metric blindness: services look healthy while the user journey is already degrading.

What to cover in an interview

  • Where service boundaries sit and which paths must survive neighboring failures.
  • Why the system combines fanout-on-write and fanout-on-read for different user segments.
  • What degraded feed mode looks like when ranking or moderation is impaired.
  • Which SLOs and SLIs actually govern release decisions and incident response.

Related chapters

  • Twitter/X - Practical social-feed case: large-scale fanout, ranking, cache design, and spike handling.
  • Event-Driven Architecture - Asynchronous event flows for publishing, feed assembly, and service decoupling.
  • Resilience patterns - Failure isolation, protective patterns, and graceful degradation under partial outages.
  • Observability and monitoring design - User-journey SLOs and SLIs, trace correlation, and operational decision loops.
  • SRE Book - Error budgets, reliable release practices, and operational risk management.
  • Notification System - Adjacent engagement pipeline with asynchronous delivery and controlled degraded modes.
