An A/B testing platform is valuable not because it has a dashboard, but because it can assign variants correctly, record exposure cleanly, and protect causal signal from noisy data.
The chapter ties together assignment, configuration delivery, event capture, analytics, and guardrail checks into one system that teams can actually trust.
For interviews and architecture discussions, this case is useful because it forces you to explain the fast decision path, the event pipeline, SRM detection, and safe rollout rules as separate concerns.
Variant Assignment
The assignment decision sits on the critical request path, so it must be deterministic, fast, and independent of node-local state.
Exposure Events
The platform has to record the moment a variant was shown, or product metrics quickly lose their causal link to the experiment.
Metric Integrity
SRM, duplicates, dropped events, and early peeking break conclusions more often than the statistics formula itself.
Rollout Safety
You need staged rollout, guardrails, and a fast rollback path so the experiment does not quietly turn into an incident.
Problem statement
A/B testing is useful only when the system can assign a variant consistently, tie user actions back to that assignment, and surface results without corrupting the data. That means the platform needs both a tiny live decision path inside the product and a separate analytics path that teams can actually trust.
Functional Requirements
In practice, an experimentation platform is more than a flag switch. It needs to pick the right audience, control traffic exposure safely, and connect product outcomes back to the assigned variant with enough fidelity to support real product decisions.
Experiment management
- Create experiments with control and treatment variants
- Define audience rules, guardrails, and launch policies
- Set runtime and stop conditions
- Choose primary and secondary metrics
Variant assignment
- Deterministic bucketing for users and experiments
- Stable assignment across repeated requests
- Support traffic splits such as 1%, 5%, and 50%
- Ramp experiments up and roll them back safely
Data collection
- Log exposure and downstream product events
- Attach every event to the experiment and variant
- Aggregate CTR, conversion, retention, and guardrail metrics
Analysis of results
- Calculate statistical significance and confidence intervals
- Break down results by segment and guardrails
- Provide operational dashboards and final reports
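As an illustration of the first bullet, the sketch below computes a confidence interval and a two-sided p-value for the difference between two conversion rates. It assumes a simple two-proportion z-test with a normal approximation; the function name and the input counts are hypothetical, and a real analysis engine may use a different test.

```python
import math
from statistics import NormalDist


def diff_in_proportions(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Return (difference, confidence interval, two-sided p-value)
    for treatment minus control conversion rate."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Unpooled standard error for the confidence interval.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z * se, diff + z * se)

    # Pooled standard error for the test of "no difference".
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se0))
    return diff, ci, p_value


# Hypothetical counts: 10.0% vs 11.0% conversion on 10,000 users each.
diff, ci, p_value = diff_in_proportions(1000, 10_000, 1100, 10_000)
```

With these made-up counts the interval excludes zero, but the result only means something if the exposure data behind the counts is clean.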
Non-Functional Requirements
The critical path answers one question: which variant should the product show right now? That path needs extremely low latency, stable assignment, and a separate event loop that can absorb large volumes without polluting the product-serving path.
Low latency
Variant lookup should stay under 10 ms so it does not change the user journey
Stable assignment
A user should keep seeing the same variant for the lifetime of the experiment
Event scale
The analytics loop must absorb billions of events per day without dropping data or slowing the product-serving path
High-Level Architecture
The platform usually splits into two paths. The hot path returns a variant during a live product request. The cold path collects events, recomputes metrics, and prepares reports. Keeping those paths separate is useful because they have very different latency budgets and very different failure costs.
Main components
Experiment Management Service
Stores experiment config, audience rules, guardrails, and traffic allocations
Variant Assignment Service
Computes the user bucket and returns a decision on the critical path
Event Ingestion Pipeline
Ingests exposure, click, and conversion events with experiment metadata attached
Analysis Engine
Computes metrics, confidence intervals, and final statistical output
The design separates a tiny decision path from a much heavier analytics loop.
A failure in the first hurts UX. A failure in the second damages trust in the result.
C4 visualization
The same platform is shown below at three C4 levels: external context, platform containers, and the internals of the assignment service. For the modeling approach itself, see the chapter on the C4 model.
L1 — System Context
Shows who requests a variant, who consumes the results, and where experiment events are exported.
Randomization Algorithms
Variant assignment should not depend on local server memory or on which node happened to receive the request. That is why the default approach is usually stateless: compute the answer from the user id, experiment id, and bucket configuration.
Hash and Partition (HP)
    variant = Hash(UserID + ExperimentID) % 100
    if variant < 50: return "Control"
    else: return "Treatment"
- Does not require a separate state store for assignment
- Deterministic: the same input always yields the same output
- Keeps experiments independent through the ExperimentID
- Scales horizontally with little coordination
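A minimal runnable sketch of hash-and-partition follows. It assumes SHA-256 as the hash and a ":" delimiter between the two ids (so the ids themselves must not contain ":"); the function names are illustrative and not part of any specific platform.

```python
import hashlib


def bucket(user_id: str, experiment_id: str, buckets: int = 100) -> int:
    """Map (experiment, user) to a stable bucket in [0, buckets)."""
    # The delimiter avoids collisions such as ("ab", "c") vs ("a", "bc").
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % buckets


def assign(user_id: str, experiment_id: str, control_percent: int = 50) -> str:
    """Return the variant for a user; the same inputs always give the same answer."""
    if bucket(user_id, experiment_id) < control_percent:
        return "Control"
    return "Treatment"


variant = assign("user-42", "checkout-button-color")
```

A cryptographic hash is not strictly required, but Python's built-in `hash()` is unsuitable here because string hashes are salted per process, which would break stability across nodes and restarts.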
Pseudorandom with Caching (PwC)
Pick a random variant once, then cache the result for later requests.
- Server-side mode needs a database or distributed cache
- Client-side mode usually relies on cookies or local storage
- Requires additional storage
- Can break stable assignment when cookies or cached rows disappear
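For comparison, here is a small sketch of the pseudorandom-with-caching approach. The in-memory dictionary stands in for the database, distributed cache, or cookie mentioned above; the class and method names are assumptions made for illustration.

```python
import random


class CachedAssigner:
    """Pick a variant at random on first sight, then reuse the stored choice."""

    def __init__(self, variants=("Control", "Treatment"), rng=None):
        self._variants = tuple(variants)
        self._rng = rng or random.Random()
        self._store = {}  # stand-in for Redis, a database table, or a cookie

    def assign(self, user_id: str, experiment_id: str) -> str:
        key = (experiment_id, user_id)
        if key not in self._store:
            self._store[key] = self._rng.choice(self._variants)
        return self._store[key]

    def evict(self, user_id: str, experiment_id: str) -> None:
        """Simulate a cleared cookie or an expired cache row."""
        self._store.pop((experiment_id, user_id), None)


assigner = CachedAssigner()
first = assigner.assign("user-42", "exp-a")
again = assigner.assign("user-42", "exp-a")   # same as `first` while cached
assigner.evict("user-42", "exp-a")
after_loss = assigner.assign("user-42", "exp-a")  # may differ from `first`
```

The last call shows the weakness listed above: once the stored row is gone, nothing ties the user back to the original variant.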
Variant Assignment Methods
The choice between server-side and client-side assignment depends on where experiment logic lives and how expensive an extra network hop is on the product path.
Server-side Assignment
- ✓ Safer because the logic stays off the client
- ✓ Works well for backend and API experiments
- △ Needs a very fast service or embedded library
- ✗ Adds a network hop to the critical path when assignment runs as a remote service
Client-side Assignment
- ✓ Works well for fast UI changes
- ✓ Avoids a round-trip on every render
- △ Requires configuration to be delivered in advance
- ✗ Exposes part of the logic to end users
Optimization: distribute configuration closer to the product
Experiment configuration is often pushed into Redis, edge nodes, or a CDN layer so that the SDK or local library can compute the variant close to the product while the rollout rules still stay centrally managed.
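One way to picture local evaluation is sketched below: the SDK holds a configuration snapshot and decides without a network call. The `ExperimentConfig` fields, the country-based audience rule, and the use of two separate hash salts are assumptions for this example, not a description of a specific product.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    experiment_id: str
    enabled: bool                  # central kill switch for fast rollback
    exposure_percent: int          # share of eligible users in the test, 0..100
    allowed_countries: frozenset   # simplistic audience rule


def _bucket(key: str, buckets: int = 100) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def decide(cfg: ExperimentConfig, user_id: str, country: str) -> Optional[str]:
    """Return a variant, or None when the user is not in the experiment."""
    if not cfg.enabled or country not in cfg.allowed_countries:
        return None
    # Exposure and variant use different salts, so raising exposure_percent
    # adds new users without moving existing users between variants.
    if _bucket(f"{cfg.experiment_id}:exposure:{user_id}") >= cfg.exposure_percent:
        return None
    in_control = _bucket(f"{cfg.experiment_id}:variant:{user_id}") < 50
    return "Control" if in_control else "Treatment"


snapshot = ExperimentConfig("new-checkout", True, 5, frozenset({"DE", "FR"}))
variant = decide(snapshot, "user-42", "DE")
```

Ramping from 5% to 50% then only requires shipping a new snapshot, and setting `enabled` to `False` acts as the rollback path.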
Data Pipeline
The analytics loop deserves its own pipeline because its workload is very different from the live decision path: high throughput, continuous event intake, and stream processing for fast metrics and guardrail signals.
Client Events → Kafka → Flink/Spark → ClickHouse → Reports
Ingestion
Events land in Kafka or a similar log. Each record typically includes user_id, experiment_id, variant, timestamp, and the event payload.
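The record described above could look like the following sketch. The `event_type` and `event_id` fields, the JSON encoding, and the choice of `user_id` as the partition key are assumptions added for illustration; no Kafka client is used here, only the serialization step.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class ExperimentEvent:
    user_id: str
    experiment_id: str
    variant: str
    event_type: str   # e.g. "exposure", "click", "conversion" (assumed values)
    payload: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    # A unique id lets downstream jobs drop duplicates after retries.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def to_record(event: ExperimentEvent) -> Tuple[bytes, bytes]:
    """Serialize an event into a (key, value) pair for a log such as Kafka."""
    key = event.user_id.encode("utf-8")  # keeps one user's events in order
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return key, value


key, value = to_record(
    ExperimentEvent("user-42", "new-checkout", "Treatment", "exposure", {})
)
```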
Processing
Flink powers near-real-time metrics and guardrail checks, while Spark or batch jobs handle heavier aggregates and deeper analysis.
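The core logic of such a job can be sketched in plain Python, independent of Flink or Spark: drop duplicate events, keep each user's first exposure, and count a conversion only if it happened after that exposure. The dictionary field names (`event_id`, `event_type`, numeric `timestamp`) are assumptions for this example.

```python
from collections import defaultdict


def conversion_by_variant(events):
    """Return {variant: {"exposed": n, "converted": m}} counted in unique users."""
    seen_ids = set()
    first_exposure = {}   # user_id -> (timestamp, variant)
    conversions = []

    for e in events:
        if e["event_id"] in seen_ids:
            continue  # duplicate delivery, e.g. after a client retry
        seen_ids.add(e["event_id"])
        if e["event_type"] == "exposure":
            prev = first_exposure.get(e["user_id"])
            if prev is None or e["timestamp"] < prev[0]:
                first_exposure[e["user_id"]] = (e["timestamp"], e["variant"])
        elif e["event_type"] == "conversion":
            conversions.append(e)

    exposed = defaultdict(set)
    for user, (_, variant) in first_exposure.items():
        exposed[variant].add(user)

    converted = defaultdict(set)
    for e in conversions:
        exposure = first_exposure.get(e["user_id"])
        # Ignore conversions from unexposed users or from before the exposure.
        if exposure is not None and e["timestamp"] >= exposure[0]:
            converted[exposure[1]].add(e["user_id"])

    return {
        variant: {"exposed": len(users), "converted": len(converted[variant])}
        for variant, users in exposed.items()
    }
```

A streaming engine would hold the same state per user key instead of in local dictionaries, but the attribution rule stays the same.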
Storage and reporting
An OLAP store such as ClickHouse or Pinot serves fast analytical queries, and dashboards or reports sit on top of that layer.
Parallel experiments
Experiments rarely live in isolation. If one test changes the interface and another changes ranking, the platform needs a clear rule for which tests may overlap and which ones must be isolated.
Problem
Experiment A changes search, experiment B changes a button. If the same user lands in both treatments, it becomes much harder to explain what actually moved conversion.
Solution: domains and layers
Group experiments into layers by domain, such as UI, backend, or ranking. Inside one layer, tests are usually mutually exclusive: a user can be in at most one of them. Across layers, tests can overlap, because each layer buckets users independently.
Layering example
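A minimal sketch of layered assignment, assuming each layer splits its 100 buckets into non-overlapping ranges owned by individual experiments. The layer names, experiment names, and bucket ranges are made up for illustration.

```python
import hashlib

# layer -> list of (experiment_id, first_bucket, end_bucket_exclusive)
LAYERS = {
    "ui": [("button_color", 0, 50), ("new_header", 50, 100)],
    "ranking": [("ranker_v2", 0, 50)],  # buckets 50..99 stay unallocated
}


def _bucket(key: str, buckets: int = 100) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def active_experiments(user_id: str) -> dict:
    """Return {layer: (experiment_id, variant)} for one user."""
    result = {}
    for layer, slots in LAYERS.items():
        # The layer name salts the hash, so layers are independent of each other.
        layer_bucket = _bucket(f"layer:{layer}:{user_id}")
        for experiment_id, start, end in slots:
            if start <= layer_bucket < end:
                in_control = _bucket(f"exp:{experiment_id}:{user_id}") < 50
                result[layer] = (experiment_id, "Control" if in_control else "Treatment")
                break  # at most one experiment per layer
    return result


print(active_experiments("user-42"))
```

A user gets at most one experiment per layer, while the UI and ranking layers overlap freely.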
Common pitfalls
The most expensive failures do not happen in the launch UI. They happen in the data. If the platform loses exposure events or drifts into a sample ratio mismatch, the dashboard can look convincing while the conclusion is still wrong.
Sample Ratio Mismatch (SRM)
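The system expects a 50/50 split but observes a ratio that deviates more than chance would explain, for example 52/48 across a large sample. Common causes include bot traffic, redirect issues, missing clients, or broken filters. SRM should be checked with a goodness-of-fit test, typically chi-squared, before any metric interpretation.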
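Whether 52/48 is a problem depends on the sample size, which a chi-squared goodness-of-fit test makes explicit. The sketch below handles the two-variant case using only the standard library (with one degree of freedom, the p-value reduces to a complementary error function). The function name and the example counts are illustrative.

```python
import math


def srm_p_value(control: int, treatment: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-variant split."""
    total = control + treatment
    expected_c = total * expected_control_share
    expected_t = total - expected_c
    chi2 = ((control - expected_c) ** 2 / expected_c
            + (treatment - expected_t) ** 2 / expected_t)
    # For 1 degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


small = srm_p_value(52, 48)            # about 0.69: consistent with chance
large = srm_p_value(52_000, 48_000)    # far below 0.001: investigate first
```

Teams often alert on a strict threshold such as p < 0.001, since an SRM alarm should stop the analysis rather than be one more noisy signal; the exact cutoff is a policy choice.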
Peeking Problem
Teams look at the graph too early and stop as soon as the result looks significant, which inflates the false-positive rate. Fixed sample sizes or properly configured sequential testing reduce that risk.
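A fixed sample size can be estimated before launch. The sketch below uses the standard normal-approximation formula for comparing two proportions; the baseline rate, minimum detectable effect, and defaults for alpha and power are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in each variant to detect an absolute lift of mde_abs."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical target: detect a lift from 10% to 11% conversion.
n = sample_size_per_variant(baseline=0.10, mde_abs=0.01)
```

For this example the answer is on the order of fifteen thousand users per variant, and halving the detectable effect roughly quadruples it, which is why the stop condition should be fixed before the first look at the data.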
Network Effects
In social products, users influence each other, so a change shown to one group can indirectly affect another. Cluster-based randomization can be safer than user-level assignment.
Multiple Testing
The more metrics a team inspects at once, the easier it becomes to find a false win by chance. Pick a primary metric and apply corrections when necessary.
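One common correction is the Holm-Bonferroni procedure, sketched below as an example of what "apply corrections" can mean; the text does not prescribe a specific method, and the metric names and p-values are invented.

```python
def holm_bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Return {metric: is_significant} after Holm-Bonferroni correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    significant = {name: False for name in p_values}
    for i, (name, p) in enumerate(ordered):
        # The smallest p-value faces alpha/m, the next alpha/(m-1), and so on.
        if p > alpha / (m - i):
            break  # this metric and all larger p-values are not significant
        significant[name] = True
    return significant


result = holm_bonferroni({"ctr": 0.001, "conversion": 0.04, "retention": 0.03})
# ctr survives the correction; conversion and retention do not,
# even though each is below 0.05 on its own.
```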
Key Takeaways
Hash-based assignment gives deterministic bucketing without a separate state store and scales cleanly.
Configuration distribution keeps the decision path small by moving rules closer to the product.
Layered isolation lets teams run multiple experiments without blending their effects into one noisy result.
A separate event pipeline keeps heavy analytics traffic away from the live assignment path.
Statistical discipline matters more than a polished dashboard: SRM checks, sample sizing, and peeking controls come first.
OLAP storage and reporting separate deep analytical queries from the product-serving path.
This chapter is based on the public interview "System Design Interview: A/B Testing Platform" and Ron Kohavi's article "Trustworthy Online Controlled Experiments".
Related chapters
- C4 Model: Context, Containers, Components, and Code - helps break the experimentation platform into clean layers, from system context to containers and critical components.
- Troubleshooting Interviews - adds practical incident-debugging framing for SRM, metric regressions, and failures in the event pipeline.
- Ad Click Event Aggregator - provides a neighboring data case about event intake, aggregation, and large-scale stream processing.
- System design case studies examples - places the experimentation platform next to other design cases and makes the repeated architecture patterns easier to compare.
- Engineering Reliable Mobile Applications (short summary) - adds the operational angle: staged rollouts, guardrails, observability, and release safety.
