A social platform looks like one product from the outside, but underneath it is a set of tightly connected systems: feed serving, publishing, the social graph, notifications, moderation, and operability controls.
The case helps draw the boundary between shared platform capabilities and domain-specific services, showing where a common infrastructure layer is justified and where separate pipelines stay healthier.
For interviews and engineering discussions, this case is useful because it moves the conversation from isolated services to platform-wide resilience: how to limit coupling, survive spikes, and justify platform investments.
Hybrid Fanout
One distribution strategy does not fit every user: ordinary accounts and celebrity-scale authors need different balances between precomputation, read-time assembly, and caching.
Fault Isolation
Ranking, moderation, and notifications should never take down the baseline feed path, which is why those dependencies need to be isolated before the incident begins.
Graceful Degradation
During spikes and partial failures, the platform has to simplify the feed and shed secondary features while keeping the main user journey alive.
Platform SLOs
Reliability is defined less by one healthy service and more by whether the feed opens on time, posts are acknowledged quickly, and the team notices degradation early.
Social media infrastructure is a case about keeping the whole platform stable, not about designing one isolated feature. In interviews, this is where you show whether you can build a large consumer system with clear fault isolation, graceful degradation, observability, and reliable control loops for a product dominated by read traffic.
Source
Acing the System Design Interview
Chapter 14: an infrastructure-wide view of a social platform with a strong reliability focus.
Where this pattern matters most
- Twitter/X: massive post fanout, ranking, and protection from celebrity-driven spikes.
- Instagram / Threads: media-heavy feeds with moderation, notifications, and degraded serving modes.
- TikTok: very fast feed serving on top of expensive personalization.
- LinkedIn / Reddit: balancing relevance, freshness, and stable platform SLOs.
Functional requirements
Core APIs and user journeys
- POST /posts - publish content
- GET /feed - fetch the personalized feed
- POST /interactions - likes, comments, and reposts
- POST /relationships - follow and unfollow
Platform capabilities
- Hybrid feed distribution for regular users and very high-follower accounts
- Moderation and policy checks before content enters serving paths
- Separate notification and ranking paths with controlled fallback behavior
- Operational support: incident runbooks, release guardrails, and safe replay flows
Non-functional requirements
| Requirement | Target | Why it matters |
|---|---|---|
| Feed-open latency (p95) | < 250ms | A core retention journey that users feel immediately. |
| Publish acknowledgement latency (p95) | < 400ms | Creators need fast feedback after posting. |
| Platform availability | 99.95% | The product is part of users’ daily routine, so outages are highly visible. |
| High-follower spike handling | No cascade failures | One post from a major account can create extreme fanout pressure. |
| Error-budget governance | SLO-driven releases | Release speed has to stay aligned with production risk. |
The core trade-off here is between personalization depth and end-user latency, while overall platform availability matters more than a single service looking healthy in isolation.
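To make the availability and error-budget rows concrete, here is a minimal arithmetic sketch. The 99.95% target comes from the table above; the 30-day window, the `min_remaining` threshold of 25%, and all function names are illustrative assumptions, not values from the source.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Downtime budget in minutes for an availability target over a rolling window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def budget_remaining(availability_target: float, bad_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once it is exhausted)."""
    budget = allowed_downtime_minutes(availability_target, window_days)
    return 1.0 - bad_minutes / budget


def release_allowed(remaining_fraction: float, min_remaining: float = 0.25) -> bool:
    """Hypothetical release gate: block risky rollouts when little budget is left."""
    return remaining_fraction >= min_remaining


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.9995)
    print(f"30-day budget at 99.95%: {budget:.1f} minutes")  # about 21.6 minutes
    print(release_allowed(budget_remaining(0.9995, bad_minutes=10.8)))
```

Under these assumptions, a 99.95% target leaves roughly 21.6 minutes of unavailability per 30 days, which is why a single badly handled spike can consume most of the budget.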
High-Level Architecture
Diagram: publish path, feed serving, and the operability control loop.
This topology combines content publishing, feed serving, and the control loop that keeps the platform stable.
The architecture separates the user-facing data path from the control loop that owns SLOs, observability, and degraded modes. That separation limits blast radius, reduces coupling between services, and keeps the platform steadier during spikes.
Write and Read Paths
How publishing moves through the platform and how the feed is served under heavy read load.
Write path: the post request is validated, committed to durable storage, and then propagated through asynchronous fanout into timeline, notification, and moderation paths.
- Layer 1, Client Post (create content): The user publishes content from a mobile or web client.
- Layer 2, Gateway + Auth (validate request): The gateway checks auth and quotas, then routes the request to the post service.
- Layer 3, Post Service (durable commit): The post is committed to durable storage and an event is produced.
- Layer 4, Async Fanout (timeline + moderation): The event fans out into timeline build, moderation, and notification pipelines.
- Layer 5, User Signals (feed + notifications): Followers receive timeline updates and notifications without blocking the publish ACK.
Write path checkpoints
- Durable post commit happens before downstream fanout.
- Moderation and notifications are typically asynchronous and isolated from core feed availability.
- Celebrity posts require controlled fanout to avoid consumer overload.
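The ordering in the checkpoints above (commit first, fan out later) can be sketched as follows. This is a single-process illustration under stated assumptions: the dict stands in for durable storage, `queue.Queue` stands in for an event log or broker, and `PostService`, `PostCreated`, and `drain` are hypothetical names.

```python
import queue
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class PostCreated:
    post_id: str
    author_id: str


class PostService:
    def __init__(self) -> None:
        self.store: dict = {}                       # stand-in for durable storage
        self.events: queue.Queue = queue.Queue()    # stand-in for an event broker

    def publish(self, author_id: str, text: str) -> str:
        post_id = str(uuid.uuid4())
        # 1. Durable commit happens before anything downstream is notified.
        self.store[post_id] = {"author_id": author_id, "text": text}
        # 2. Emit an event; timeline, moderation, and notifications consume it later.
        self.events.put(PostCreated(post_id, author_id))
        # 3. Acknowledge without waiting for any consumer.
        return post_id


def drain(events: queue.Queue, consumers: list) -> int:
    """Deliver queued events to every consumer, isolating consumer failures."""
    processed = 0
    while not events.empty():
        event = events.get()
        for consumer in consumers:
            try:
                consumer(event)
            except Exception:
                # Sketch only: a real pipeline would retry or dead-letter here.
                pass
        processed += 1
    return processed
```

The point of the sketch is that `publish` returns as soon as the commit and the event emission succeed, so a slow or failing moderation or notification consumer cannot delay the publish ACK.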
In practice, the system often mixes fanout-on-write and fanout-on-read so it can serve most users fast without forcing the same strategy onto celebrity-scale accounts.
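A hybrid of fanout-on-write and fanout-on-read might look like the in-memory sketch below. The follower threshold, the `HybridFeed` class, and its data layout are assumptions made for illustration; a production system would use caches and persistent stores rather than Python dicts.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # hypothetical cutoff, tuned per platform


class HybridFeed:
    def __init__(self, threshold: int = CELEBRITY_THRESHOLD) -> None:
        self.threshold = threshold
        self.followers = defaultdict(set)      # author -> follower ids
        self.following = defaultdict(set)      # user -> followed author ids
        self.timelines = defaultdict(list)     # user -> precomputed [(ts, post_id)]
        self.author_posts = defaultdict(list)  # author -> [(ts, post_id)]

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) >= self.threshold

    def publish(self, author: str, post_id: str, ts: int) -> None:
        self.author_posts[author].append((ts, post_id))
        if self.is_celebrity(author):
            return  # fanout-on-read: followers pull these posts at read time
        for follower in self.followers[author]:
            # fanout-on-write: push into each follower's precomputed timeline
            self.timelines[follower].append((ts, post_id))

    def read_feed(self, user: str, limit: int = 20) -> list:
        # A set removes duplicates if an author crossed the threshold after
        # some of their posts had already been pushed.
        items = set(self.timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                items.update(self.author_posts[author])
        newest_first = sorted(items, reverse=True)
        return [post_id for _, post_id in newest_first[:limit]]
```

Ordinary authors pay the cost at write time so reads stay cheap, while celebrity posts are merged at read time so one publish does not trigger millions of timeline writes.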
Resilience and operations
SLO contract
Platform paths should be tied to user-facing SLOs and SLIs rather than isolated service metrics:
feed_open_slo = latency_p95 + error_rate + freshness
publish_slo = ack_latency + durability + moderation_delay
- Error budgets keep risky releases under control.
- Trace coverage shortens root-cause analysis.
- Golden signals show latency, traffic, errors, and saturation.
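One way to read the feed_open_slo composition is as a conjunction of SLI checks: the journey meets its SLO only when every component is within target. The sketch below assumes that reading. The 250 ms latency target is taken from the non-functional requirements table; the error-rate and staleness thresholds, and all names, are hypothetical.

```python
import math
from dataclasses import dataclass


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass(frozen=True)
class FeedOpenTargets:
    latency_p95_ms: float = 250.0   # from the requirements table
    max_error_rate: float = 0.001   # assumed threshold
    max_staleness_s: float = 60.0   # assumed freshness threshold


def evaluate_feed_open(latencies_ms: list, errors: int, requests: int,
                       staleness_s: float,
                       targets: FeedOpenTargets = FeedOpenTargets()):
    """Return (slo_met, per-SLI results) for one evaluation window."""
    checks = {
        "latency_p95": percentile(latencies_ms, 95) < targets.latency_p95_ms,
        "error_rate": errors / requests <= targets.max_error_rate,
        "freshness": staleness_s <= targets.max_staleness_s,
    }
    return all(checks.values()), checks
```

Returning the per-SLI results alongside the overall verdict makes it visible which dimension is burning the error budget, for example a feed that is fast and error-free but stale.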
Degradation strategy
- Bulkheads and circuit breakers isolate ranking and moderation from the baseline feed path.
- Fallback serving switches to simpler ordering when personalization degrades.
- Load shedding turns off secondary features under saturation.
- Progressive rollout and critical-journey prioritization help the platform absorb incidents.
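The first two items above, a circuit breaker around ranking plus fallback to simpler ordering, can be sketched together. This is a simplified breaker under stated assumptions: the thresholds, the `CircuitBreaker` and `serve_feed` names, and the reverse-chronological fallback are illustrative choices, not the source's implementation.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def serve_feed(posts: list, rank, breaker: CircuitBreaker):
    """posts are dicts with 'id' and 'ts'; rank is the personalization call."""
    if breaker.allow():
        try:
            ranked = rank(posts)
            breaker.record_success()
            return ranked, "ranked"
        except Exception:
            breaker.record_failure()
    # Degraded mode: simple reverse-chronological ordering, no ranking dependency.
    fallback = sorted(posts, key=lambda p: p["ts"], reverse=True)
    return fallback, "fallback"
```

While the breaker is open, `serve_feed` does not call the ranking dependency at all, so a struggling ranker gets room to recover and the feed still opens with a simpler ordering.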
Risks and typical mistakes
- Large celebrity blast radius: unrestricted fanout overloads downstream services.
- Over-coupled ranking: when ranking sits on the critical path, an ML outage takes down the baseline feed.
- Weak moderation integration: forbidden content can leak into serving paths before policies are applied.
- No degraded mode: partial dependency failures turn into a full outage too easily.
- Metric blindness: services look healthy while the user journey is already degrading.
What to cover in an interview
- Where service boundaries sit and which paths must survive neighboring failures.
- Why the system combines fanout-on-write and fanout-on-read for different user segments.
- What degraded feed mode looks like when ranking or moderation is impaired.
- Which SLOs and SLIs actually govern release decisions and incident response.
Related chapters
- Twitter/X - Practical social-feed case: large-scale fanout, ranking, cache design, and spike handling.
- Event-Driven Architecture - Asynchronous event flows for publishing, feed assembly, and service decoupling.
- Resilience patterns - Failure isolation, protective patterns, and graceful degradation under partial outages.
- Observability and monitoring design - User-journey SLOs and SLIs, trace correlation, and operational decision loops.
- SRE Book - Error budgets, reliable release practices, and operational risk management.
- Notification System - Adjacent engagement pipeline with asynchronous delivery and controlled degraded modes.
