A/B Testing Platform — System Design Space

An A/B testing platform is valuable not because it has a dashboard, but because it can assign variants correctly, record exposure cleanly, and protect causal signal from noisy data.

The chapter ties together assignment, configuration delivery, event capture, analytics, and guardrail checks into one system that teams can actually trust.

For interviews and architecture discussions, this case is useful because it forces a separate explanation of the fast decision path, the event pipeline, SRM, and safe rollout rules.

Variant Assignment

The assignment decision sits on the critical request path, so it must be deterministic, fast, and independent of node-local state.

Exposure Events

The platform has to record the moment a variant was shown, or product metrics quickly lose their causal link to the experiment.

Metric Integrity

SRM, duplicates, dropped events, and early peeking break conclusions more often than the statistics formula itself.

Rollout Safety

You need staged rollout, guardrails, and a fast rollback path so the experiment does not quietly turn into an incident.

Problem statement

A/B testing is only as useful as it is trustworthy. And it earns trust only when the system assigns a variant consistently, ties every user action back to that assignment without losing it, and surfaces results without corrupting the data. That is the real tension of the problem: one platform has to carry a fast decision path inside the product and a separate analytics path whose numbers the team actually believes.

Functional Requirements

A flag turns a feature on, and that is where the simple tool ends. An experimentation platform also has to pick the right audience, control traffic exposure safely, and connect product outcomes back to the assigned variant without breaking the causal link. That link is easy to break, and once it is, even clean-looking numbers prove nothing.

Experiment management

Create experiments with control and treatment variants
Define audience rules, guardrails, and launch policies
Set runtime and stop conditions
Choose primary and secondary metrics

Variant assignment

Deterministic bucketing for users and experiments
Stable assignment across repeated requests
Support traffic splits such as 1%, 5%, and 50%
Ramp experiments up and roll them back safely

Data collection

Log exposure and downstream product events
Attach every event to the experiment and variant
Aggregate CTR, conversion, retention, and guardrail metrics

Analysis of results

Calculate statistical significance and confidence intervals
Break down results by segment and by guardrail metric
Provide operational dashboards and final reports
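As a rough illustration of the "statistical significance and confidence intervals" requirement, the sketch below compares conversion rates between control and treatment with a two-proportion z-test. It assumes a simple binary conversion metric and a normal approximation. The function name and parameters are hypothetical, and a real analysis engine would likely handle more metric types and edge cases such as zero conversions.

```python
from math import sqrt
from statistics import NormalDist


def conversion_diff_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Return (difference, (ci_low, ci_high), p_value) for treatment minus control.

    conv_* are conversion counts and n_* are exposed users per variant.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Unpooled standard error is used for the confidence interval.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z * se, diff + z * se)

    # Pooled standard error is used for the two-sided test of "no difference".
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_null))
    return diff, ci, p_value


if __name__ == "__main__":
    diff, (low, high), p = conversion_diff_ci(100, 1000, 150, 1000)
    print(f"lift={diff:.3f} ci=({low:.3f}, {high:.3f}) p={p:.4f}")
```

The interval and the p-value answer slightly different questions, which is why the sketch computes two standard errors: the interval describes the plausible size of the lift, while the test asks whether any lift is distinguishable from zero.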

Non-Functional Requirements

The critical path answers one question: which variant should the product show right now? That path needs extremely low latency and must not lose assignment consistency. At the same time, the platform has to absorb a separate high-volume event loop. Those requirements pull the system in opposite directions, which is exactly why they are split into two paths.

Critical

Low latency

Variant lookup should stay under 10 ms so it adds no perceptible delay to the user journey

Important

Stable assignment

A user should keep seeing the same variant for the lifetime of the experiment

Scale

Event scale

The analytics loop must handle billions of daily events without dropping data or falling behind

High-Level Architecture

The platform usually splits into two paths. The hot path returns a variant during a live product request. The cold path collects events, recomputes metrics, and prepares reports. Keeping those paths separate is useful because they have very different latency budgets and very different failure costs.

Main components

Experiment Management Service

Stores experiment config, audience rules, guardrails, and traffic allocations

Variant Assignment Service

Computes the user bucket and returns a decision on the critical path

Event Ingestion Pipeline

Ingests exposure, click, and conversion events with experiment metadata attached

Analysis Engine

Computes metrics, confidence intervals, and final statistical output

The design separates a tiny decision path from a much heavier analytics loop.
A failure in the first hurts UX. A failure in the second damages trust in the result.

C4 visualization

The same platform is shown below at three C4 levels: external context, platform containers, and the internals of the assignment service. For the modeling approach itself, see the chapter on the C4 model.

L1 — System Context

Shows who requests a variant, who consumes the results, and where experiment events are exported.

Randomization Algorithms

Variant assignment should not depend on local server memory or on which node happened to receive the request. That is why the default approach is usually stateless: compute the answer from the user id, experiment id, and bucket configuration.

Hash and Partition (HP)

Recommended

bucket = Hash(UserID + ":" + ExperimentID) % 100
if bucket < 50: return "Control"
else: return "Treatment"

Does not require a separate state store for assignment
Deterministic: the same input always yields the same output
Keeps experiments independent: including the ExperimentID in the hash input gives each experiment its own bucketing
Scales horizontally with little coordination
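The pseudocode above can be sketched as runnable Python. This is a minimal illustration, not a reference implementation: the function names, the choice of SHA-256, and the `treatment_percent` parameter are assumptions. A cryptographic hash is used here because Python's built-in `hash()` for strings is salted per process, so it would not give the same bucket on every node.

```python
import hashlib


def bucket_for(user_id: str, experiment_id: str, buckets: int = 100) -> int:
    """Map a (user, experiment) pair to a stable bucket in [0, buckets)."""
    # The ":" delimiter keeps ("ab", "c") and ("a", "bc") from producing the same input.
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % buckets


def assign_variant(user_id: str, experiment_id: str, treatment_percent: int = 50) -> str:
    """Return "Treatment" for the top treatment_percent buckets, else "Control"."""
    bucket = bucket_for(user_id, experiment_id)
    return "Control" if bucket < 100 - treatment_percent else "Treatment"


if __name__ == "__main__":
    print(assign_variant("user-42", "checkout-button"))
```

Because the treatment range only grows downward from bucket 99 as `treatment_percent` increases, a user who is in treatment at 5% stays in treatment at 50%, which keeps assignment stable during a ramp.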

Pseudorandom with Caching (PwC)

Alternative

Pick a random variant once, then cache the result for later requests.

Server-side mode needs a database or distributed cache
Client-side mode usually relies on cookies or local storage
Requires additional storage
Can break stable assignment when cookies or cached rows disappear

Variant Assignment Methods

The choice between server-side and client-side assignment depends on where experiment logic lives and how expensive an extra network hop is on the product path.

Server-side Assignment

✓ Safer because the logic stays off the client
✓ Works well for backend and API experiments
△ Needs a very fast service or embedded library
✗ Adds a network hop to the critical path when assignment runs as a remote service

Client-side Assignment

✓ Works well for fast UI changes
✓ Avoids a round-trip on every render
△ Requires configuration to be delivered in advance
✗ Exposes part of the logic to end users

Optimization: distribute configuration closer to the product

Experiment configuration is often pushed into Redis, edge nodes, or a CDN layer so that the SDK or local library can compute the variant close to the product while the rollout rules still stay centrally managed.
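One way to picture this optimization is a small in-process assigner that keeps a snapshot of the experiment configuration and refreshes it on a timer. The sketch below is hypothetical: `LocalAssigner`, the `fetch_config` callback, and the config shape (experiment id mapped to treatment percent) are assumptions, and the callback stands in for whatever actually reads from Redis, an edge node, or a CDN.

```python
import hashlib
import time
from typing import Callable, Dict, Optional


class LocalAssigner:
    """Computes variants in process from a periodically refreshed config snapshot."""

    def __init__(self, fetch_config: Callable[[], Dict[str, int]],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_config      # hypothetical loader for centrally managed rules
        self._ttl = ttl_seconds
        self._clock = clock
        self._config: Dict[str, int] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._ttl:
            return
        try:
            self._config = self._fetch()
        except Exception:
            # Keep serving the last known snapshot if the config source is down.
            pass
        # Record the attempt either way so a failing source is retried once per TTL.
        self._loaded_at = now

    def assign(self, user_id: str, experiment_id: str) -> str:
        self._refresh_if_stale()
        treatment_percent = self._config.get(experiment_id)
        if treatment_percent is None:
            return "Control"  # unknown or removed experiments fall back to control
        digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return "Treatment" if bucket >= 100 - treatment_percent else "Control"
```

The refresh interval is the trade-off to watch in a design like this: a longer TTL means fewer config reads, but also a slower rollback when an experiment has to be turned off.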

Data Pipeline

The analytics loop deserves its own pipeline because its workload is very different from the live decision path: high throughput, continuous event intake, and stream processing for fast metrics and guardrail signals.

Client Events → Kafka → Flink/Spark → ClickHouse → Reports

Ingestion

Events land in Kafka or a similar log. Each record typically includes user_id, experiment_id, variant, timestamp, and the event payload.
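The record described above could be modeled roughly as follows. The field names `user_id`, `experiment_id`, `variant`, and the timestamp come from the text. The `event_type` and `event_id` fields, the class name, and the `deduplicate` helper are assumptions added for illustration, with `event_id` giving downstream jobs something to deduplicate on. No Kafka client is used here; the sketch only builds the bytes a producer would send.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ExperimentEvent:
    user_id: str
    experiment_id: str
    variant: str
    event_type: str  # for example "exposure", "click", or "conversion"
    payload: dict = field(default_factory=dict)
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def partition_key(self) -> bytes:
        # Keying by user keeps one user's events in order within a partition.
        return self.user_id.encode("utf-8")

    def to_json(self) -> bytes:
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


def deduplicate(events: Iterable[ExperimentEvent]) -> Iterator[ExperimentEvent]:
    """Drop repeated deliveries of the same event, keeping the first occurrence."""
    seen = set()
    for event in events:
        if event.event_id not in seen:
            seen.add(event.event_id)
            yield event
```

An in-memory `seen` set only works for a small batch. A streaming job would more plausibly keep this state per key with an expiry window, which is outside the scope of this sketch.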

Processing

Flink powers near-real-time metrics and guardrail checks, while Spark or batch jobs handle heavier aggregates and deeper analysis.

Storage and reporting

An OLAP store such as ClickHouse or Pinot serves fast analytical queries, and dashboards or reports sit on top of that layer.

Parallel experiments

Experiments rarely live in isolation. If one test changes the interface and another changes ranking, the platform needs a clear rule for which tests may overlap and which ones must be isolated.

Problem

Experiment A changes search, experiment B changes a button. If the same user lands in both treatments, it becomes much harder to explain what actually moved conversion.

Solution: domains and layers

Group experiments into layers by domain, such as UI, backend, or ranking. Inside one layer, tests are usually mutually exclusive: a user lands in at most one of them. Across layers, tests can run independently.

Layering example

Layer: UI → [Button Color Test, Layout Test] (mutually exclusive)

Layer: Search → [Ranking Algorithm Test] (independent)

Layer: Recommendations → [ML Model A/B Test] (independent)
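The layering example above might be implemented along these lines. This is a sketch under assumptions: the `LAYERS` table, the experiment ids, and the 50/50 split of the UI layer are invented for illustration. Each layer hashes the user with its own layer id, and each experiment owns a half-open bucket range inside its layer.

```python
import hashlib
from typing import Dict

# Hypothetical config: experiment id -> [start, end) bucket range within its layer.
LAYERS = {
    "ui": {"button_color_test": (0, 50), "layout_test": (50, 100)},
    "search": {"ranking_algorithm_test": (0, 100)},
    "recommendations": {"ml_model_ab_test": (0, 100)},
}


def layer_bucket(user_id: str, layer_id: str, buckets: int = 100) -> int:
    """Stable bucket for a user within one layer; layers hash independently."""
    digest = hashlib.sha256(f"{user_id}:{layer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def experiments_for(user_id: str) -> Dict[str, str]:
    """Return at most one experiment per layer for this user."""
    active = {}
    for layer_id, ranges in LAYERS.items():
        bucket = layer_bucket(user_id, layer_id)
        for experiment_id, (start, end) in ranges.items():
            if start <= bucket < end:
                active[layer_id] = experiment_id
                break  # ranges do not overlap, so the first match is the only one
    return active


if __name__ == "__main__":
    print(experiments_for("user-42"))
```

Mutual exclusion inside a layer comes from the non-overlapping ranges, while independence across layers comes from hashing with a different layer id. Choosing control or treatment inside the selected experiment would be a separate hash on the experiment id.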

Common pitfalls

The most expensive failures do not happen in the launch UI. They happen in the data. If the platform loses exposure events or drifts into a sample ratio mismatch, the dashboard can look convincing while the conclusion is still wrong.

Sample Ratio Mismatch (SRM)

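The system expects 50/50 traffic but sees 52/48 instead. Whether that gap is a real mismatch depends on the sample size, which is why SRM is usually detected with a chi-squared goodness-of-fit test on the assignment counts rather than by eye. Common causes include bot traffic, redirect issues, missing clients, or broken filters. SRM should be checked before any metric interpretation.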
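An SRM check can be sketched as a chi-squared goodness-of-fit test on the observed assignment counts. The function name and the alert threshold mentioned afterwards are assumptions. With two variants there is one degree of freedom, so the p-value can be computed with the standard library alone.

```python
from math import erfc, sqrt


def srm_p_value(control_count: int, treatment_count: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-squared test that observed counts match the planned split."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_count - expected_control) ** 2 / expected_control
            + (treatment_count - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, the chi-squared tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    print(srm_p_value(52, 48))      # same 52/48 ratio, tiny sample
    print(srm_p_value(5200, 4800))  # same 52/48 ratio, larger sample
```

The same 52/48 ratio is unremarkable with 100 users and highly suspicious with 10,000. Teams often alert on a strict threshold such as p < 0.001 so the check does not fire on ordinary noise, though the exact cutoff is a policy choice.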

Peeking Problem

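Teams look at the graph too early and decide before enough data has accumulated. Each extra look at an unfinished fixed-horizon test is another chance to cross the significance threshold by luck, so the real false-positive rate climbs above the nominal one. Fixed sample sizes or properly configured sequential testing reduce that risk.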
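For the fixed-sample-size approach, the required number of users per variant can be estimated before launch. The sketch below uses the common normal-approximation formula for comparing two proportions. The function name, the default significance level of 0.05, and the default power of 0.8 are assumptions, and the lift is expressed in absolute terms.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, min_detectable_lift: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect an absolute lift in a rate."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Detecting a move from 10% to 11% conversion.
    print(sample_size_per_variant(0.10, 0.01))
```

Halving the detectable lift roughly quadruples the required sample, which is usually the strongest argument for agreeing on the minimum effect worth detecting before the experiment starts.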

Network Effects

In social products, users influence each other, so a change shown to one group can indirectly affect another. Cluster-based randomization can be safer than user-level assignment.

Multiple Testing

The more metrics a team inspects at once, the easier it becomes to find a false win by chance. Pick a primary metric up front and apply a multiple-comparison correction, such as Bonferroni, Holm, or Benjamini-Hochberg, to the secondary ones.
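As one example of such a correction, the sketch below applies the Holm-Bonferroni step-down procedure to a set of per-metric p-values. The function name and the metric names in the usage example are hypothetical, and Holm is only one of several reasonable choices.

```python
from typing import Dict


def holm_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return, per metric, whether it stays significant after Holm's correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    significant = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value is compared with alpha / m, the next with alpha / (m - 1), ...
        if p <= alpha / (m - rank):
            significant[name] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant


if __name__ == "__main__":
    print(holm_bonferroni({"ctr": 0.004, "retention": 0.03, "latency": 0.04}))
```

In the usage example all three p-values are below 0.05 when read in isolation, but only the smallest one survives the correction.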

Key Takeaways

① Hash-based assignment gives deterministic bucketing without a separate state store and scales cleanly.
② Configuration distribution keeps the decision path small by moving rules closer to the product.
③ Layered isolation lets teams run multiple experiments without blending their effects into one noisy result.
④ A separate event pipeline keeps heavy analytics traffic away from the live assignment path.
⑤ Statistical discipline matters more than a polished dashboard: SRM checks, sample sizing, and peeking controls come first.
⑥ OLAP storage and reporting separate deep analytical queries from the product-serving path.

This chapter is based on the public interview “System Design Interview: A/B Testing Platform” and Ron Kohavi’s article “Trustworthy Online Controlled Experiments”.

Related chapters

C4 Model: Context, Containers, Components, and Code - helps break the experimentation platform into clean layers, from system context to containers and critical components.
Troubleshooting Interviews - adds practical incident-debugging framing for SRM, metric regressions, and failures in the event pipeline.
Ad Click Event Aggregator - provides a neighboring data case about event intake, aggregation, and large-scale stream processing.
System design case studies examples - places the experimentation platform next to other design cases and makes the repeated architecture patterns easier to compare.
Engineering Reliable Mobile Applications (short summary) - adds the operational angle: staged rollouts, guardrails, observability, and release safety.