System Design Space

Updated: March 13, 2026 at 3:30 PM

Feature Store & Model Serving

Difficulty: hard

Classic case on designing the Feature Store + Model Serving pair for an ML platform with online/offline parity, point-in-time correctness, and training-serving skew controls.

This Theme 3 chapter focuses on AI/ML workflows, feature pipelines, and rollout control. The goal is not only to propose a working design, but also to explain behavior under scale and failure pressure.

Use a stable structure: requirements -> architecture -> critical deep dive -> evolution. This makes the solution clear, defensible, and interview-ready.

Offline/Online Parity

Keep feature semantics consistent across training and serving paths.

Rollout Safety

Canary, shadow, rollback, and drift alerting are baseline architecture requirements.

Data Quality

Use guardrails for freshness, lineage, and training-serving skew prevention.

Platform Efficiency

Balance pipeline cost, feature-store footprint, and inference latency.

Case-Solving Playbook

  1. Phase 1: Define feature contract. Align feature schema and semantics across offline and online paths.
  2. Phase 2: Build rollout policy. Specify canary/shadow/rollback and model-quality observability.
  3. Phase 3: Cover data quality risks. Add freshness, lineage, and drift-skew guardrails end-to-end.
  4. Phase 4: Optimize cost envelope. Balance inference SLA with feature/model pipeline operating cost.

Context

Related chapter: Machine Learning System Design. A core ML architecture overview that helps structure Feature Store decisions in interviews.

Designing a Feature Store is a classic ML System Design case. Interviewers expect you to show how you keep training and inference in parity, how you enforce freshness SLAs, and how the system degrades when the online serving path fails.

Chapter scope boundaries

Covered in this chapter

  • Feature contracts and data plane: registry, offline/online stores, materialization, and retrieval APIs.
  • Control of online/offline parity, point-in-time correctness, and training-serving skew.
  • Reliability of the inference feature path: latency/freshness SLOs, fallback modes, and operational guardrails.

Not covered here

  • Training orchestration, model selection, and experiment lifecycle management.
  • Model-registry release governance: approval policy, canary/shadow rollout logic, and rollback decisions.
  • End-to-end retraining loop, drift-driven release cadence, and ownership of the full ML delivery lifecycle.

End-to-end lifecycle and release governance are covered in ML Ops Pipeline.

Problem & Context

The product runs multiple ML use cases (personalization, fraud, risk scoring), but teams compute features in separate pipelines and end up with inconsistent data between training and production. The goal is to design the Feature Store as a platform layer with one contract and clear SLAs.

Functional requirements

  • Single feature registry: owner, schema, entity keys, source-of-truth, and transformation version.
  • Offline retrieval for train/validation with strict point-in-time correctness and no data leakage.
  • Online retrieval for inference with low-latency access and stable API contracts.
  • Materialization from batch/stream pipelines into online store with explicit freshness control.
  • Backfill and history replay when feature logic changes, without ad-hoc scripts.
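The registry contract behind these requirements is easiest to reason about as an immutable record per feature. A minimal sketch in Python; field names such as `transformation_version` are illustrative, not a specific product's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureContract:
    """One registry entry: the contract a feature honors in both paths."""
    name: str                    # e.g. "user_txn_count_7d"
    owner: str                   # team accountable for the SLA
    entity_keys: tuple           # join keys, e.g. ("user_id",)
    dtype: str                   # serialized type, e.g. "int64"
    source_of_truth: str         # upstream dataset or stream
    transformation_version: str  # pinned feature-logic version
    freshness_sla_seconds: int   # max allowed staleness online
    ttl_seconds: int             # online-store retention

contract = FeatureContract(
    name="user_txn_count_7d",
    owner="risk-features",
    entity_keys=("user_id",),
    dtype="int64",
    source_of_truth="payments.transactions",
    transformation_version="v3",
    freshness_sla_seconds=300,
    ttl_seconds=7 * 24 * 3600,
)
```

Freezing the record makes "changing a feature" an explicit new version rather than a silent in-place edit, which is what later enables clean backfills and rollbacks.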

Non-functional requirements

  • Online read latency: p95 < 30 ms. Otherwise the Feature Store becomes a bottleneck in user-facing inference paths.
  • Availability: 99.95%+. Online store downtime directly blocks models in critical product workflows.
  • Freshness SLA: <= 5 minutes for hot features. Stale features quickly degrade ranking, personalization, and fraud quality.
  • Skew detection: 0 critical skew without alert. Training-serving skew must be caught before user-visible degradation.

Scale & Capacity assumptions

  • Inference traffic: 40k-120k RPS. Peak online-store pressure in recommendation and fraud scenarios.
  • Feature vectors: 50-300 features/request. Requires batched entity fetch and efficient serialization contracts.
  • Entity cardinality: 100M+ users/devices/items. High cardinality impacts shard strategy and online index size.
  • Streaming ingress: 1M-3M events/s. Needs backpressure control and idempotent materialization logic.
  • Daily offline snapshots: 2-8 TB/day. Backfill and point-in-time joins require deliberate partition/layout strategy.
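These assumptions translate into a quick back-of-envelope check: at the stated upper bounds, fetching features one key at a time is untenable, which is why batched entity fetch is listed as a requirement rather than an optimization:

```python
# Back-of-envelope at the upper bounds of the stated assumptions.
peak_rps = 120_000          # inference traffic, upper bound
features_per_request = 300  # feature vector size, upper bound

# One store read per feature vs. one batched entity fetch per request.
naive_reads_per_sec = peak_rps * features_per_request  # 36,000,000 reads/s
batched_reads_per_sec = peak_rps                       # 120,000 fetches/s
```

A 300x difference in online-store QPS is the gap between a commodity key-value deployment and one that needs aggressive sharding just to stand still.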

Related chapter: ETL/ELT Architecture. Feature Store relies on mature batch/stream pipelines and robust orchestration.

Architecture

Feature Store architecture should clearly separate ingestion, offline training, and online serving paths. This makes training-serving skew control and SLA ownership explicit.

Feature Store Architecture

Core components:

  • Event Sources: product events, CRM, billing, clicks, logs.
  • Batch ETL/ELT: daily/hourly pipelines and backfills.
  • Stream Processing: near real-time transforms with watermarking.
  • Offline Store: historical snapshots for train/validation.
  • Feature Registry: schemas, owners, versions, SLA, lineage.
  • Materialization Service: online-store upserts, dedup, conflict policy.
  • Online Store: low-latency key-value for inference.
  • Serving SDK / Gateway: stable feature API contract for models.
  • Observability: skew checks, freshness SLA, null-rate alerts.

SLA budgets:

  • Latency budget: p95 < 30 ms
  • Freshness budget: <= 5 minutes (hot features)
  • Replay window: 30-90 days

Layer responsibilities

Feature Registry

Feature catalog with schema/versioning, owner, SLA, lineage, and production readiness status.

Ingestion & Transform

Batch ETL/ELT plus stream processing. Feature computations are implemented as reusable transformation contracts.

Offline Store

Historical feature storage for training, replay, and reproducible dataset generation.

Online Store

Low-latency key-value reads for inference with TTL, selective invalidation, and hot-key protection.

Materialization Service

Moves computed features into online store and controls watermarking, late events, and exactly-once semantics where feasible.

Serving SDK / Gateway

Unified feature-fetch API that pins request schema and shields clients from internal storage changes.
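A hypothetical SDK surface, assuming a simple key-value backing store: one batched call per request, pinned to a registry schema version so internal storage changes stay invisible to model clients. Names here are illustrative:

```python
from typing import Mapping, Sequence

class FeatureClient:
    """Thin serving-SDK sketch over an entity-keyed feature store."""

    def __init__(self, store: Mapping[tuple, Mapping[str, object]], schema_version: str):
        self._store = store                  # (entity_key) -> {feature: value}
        self.schema_version = schema_version  # pinned registry schema

    def get_online_features(self, entity_key: tuple, feature_names: Sequence[str]) -> dict:
        """One batched fetch; missing features surface as None so the
        caller's fallback logic, not the SDK, decides what to do."""
        row = self._store.get(entity_key, {})
        return {name: row.get(name) for name in feature_names}

client = FeatureClient(store={("u1",): {"txn_count_7d": 4}}, schema_version="v3")
client.get_online_features(("u1",), ["txn_count_7d", "avg_amount"])
# -> {"txn_count_7d": 4, "avg_amount": None}
```

Returning None instead of raising keeps a single missing feature from failing the whole inference request.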

Quality & Observability

Freshness, availability, skew, null-rate, and latency metrics with alerts and business-critical dashboards.

Feature contract strategy

Lock entity keys, transformation version, TTL, freshness SLA, and owner in the registry. This reduces hidden skew during updates and makes rollback faster when model quality drops.

Deep dives

Point-in-time correctness

Training datasets must include only feature values available at event time. This requires event-time joins and explicit time-travel rules.
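The point-in-time rule can be sketched as a backward as-of join. A minimal pandas example with illustrative column names: for each label, take the latest feature value visible at or before event time, so future values never leak into training:

```python
import pandas as pd

# Label events: what the model must predict, stamped at event time.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2026-03-01 10:00", "2026-03-01 12:00", "2026-03-01 11:00"]),
    "label": [0, 1, 0],
}).sort_values("event_time")

# Feature snapshots: each value becomes visible only at feature_time.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2026-03-01 09:00", "2026-03-01 11:30", "2026-03-01 11:30"]),
    "txn_count_7d": [3, 4, 7],
}).sort_values("feature_time")

# Backward as-of join per user: latest feature value at or before
# event_time. User 2's label at 11:00 gets NaN because its only
# snapshot (11:30) lies in the future relative to the label.
train = pd.merge_asof(
    labels, features,
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",
)
```

The NaN row is the correct outcome: filling it with the 11:30 value would be exactly the data leakage this rule exists to prevent.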

Materialization consistency

Batch and stream paths often overlap. You need idempotent upserts, deduplication, and deterministic conflict resolution by version/time.
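A minimal in-memory sketch of such an upsert, with last-writer-wins resolved deterministically by (version, event_time); replayed records become no-ops, which is what makes retries safe:

```python
from typing import Dict, Tuple

OnlineRow = Dict[str, object]
store: Dict[Tuple[str, str], OnlineRow] = {}  # (entity_id, feature) -> row

def upsert(entity_id: str, feature: str, value, version: int, event_time: float) -> bool:
    """Idempotent materialization write. Returns True if applied,
    False if the incoming record is a duplicate or stale."""
    key = (entity_id, feature)
    current = store.get(key)
    incoming = (version, event_time)
    # Deterministic conflict resolution: compare (version, event_time)
    # lexicographically; ties and older records are dropped.
    if current is not None and (current["version"], current["event_time"]) >= incoming:
        return False
    store[key] = {"value": value, "version": version, "event_time": event_time}
    return True
```

Because the comparison is total and side-effect-free on rejection, batch and stream writers can race on the same key without coordination and still converge to the same row.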

Training-serving skew control

Compare feature distributions between offline training snapshots and online production traffic; define skew budgets and trigger rollback when exceeded.
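One common distribution-comparison metric for this is the Population Stability Index. A self-contained sketch; the equal-width binning and the 0.2 alert threshold are conventional choices, not a fixed standard:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population Stability Index between a training snapshot
    (expected) and online traffic (actual) for one feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Smooth empty bins so the log term is always defined.
        return [(c + 1e-6) / (len(xs) + bins * 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Common rule of thumb: PSI > 0.2 signals material skew worth alerting on.
```

Running this per feature against a rolling window of serving traffic gives the skew budget the chapter asks for: identical distributions score near zero, a shifted feature scores well above the threshold.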

Online degradation plan

When Feature Store fails, inference should degrade gracefully: cached features, reduced feature set, or rule-based baseline.
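That fallback ladder can be sketched as a small wrapper around the online read; the cache TTL and the default values are illustrative placeholders:

```python
import time

CACHE: dict = {}                 # entity -> (fetched_at, last-known-good vector)
DEFAULTS = {"txn_count_7d": 0}   # conservative neutral defaults

def fetch_with_fallback(entity_id, fetch_online, max_cache_age_s: float = 600):
    """Try the online store; fall back to a recent cached vector,
    then to neutral defaults. Returns (features, mode) so callers
    can log which degradation tier actually served the request."""
    try:
        vec = fetch_online(entity_id)
        CACHE[entity_id] = (time.time(), vec)
        return vec, "online"
    except Exception:
        cached = CACHE.get(entity_id)
        if cached and time.time() - cached[0] <= max_cache_age_s:
            return cached[1], "cache"
        return dict(DEFAULTS), "defaults"
```

Surfacing the mode in metrics is what turns "the model quietly ran on defaults for an hour" into an actionable alert instead of a KPI mystery.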

Trade-offs

Strongly normalized feature definitions reduce duplication but slow down local experiment velocity.

Streaming-first improves freshness but significantly increases operational complexity and on-call cost.

One global Feature Store simplifies governance but increases blast radius during materialization failures.

Aggressive TTL lowers stale risk but increases compute pressure and cache churn.

Recommendations

  • Start with a limited set of high-impact features and explicit ownership per feature.
  • Version transformations as code and enforce schema/skew checks in CI/CD.
  • Break SLA into ingest, materialization, and online-read budgets to localize degradation faster.
  • Design fallback paths before production rollout, not after the first incident.

Common mistakes

  • Feature logic is duplicated between notebooks and production code without shared registry/versioning.
  • Training snapshots are built without point-in-time rules, causing hidden data leakage.
  • Online store is updated without freshness/skew monitoring, so degradation appears only in business KPIs.
  • Model is shipped without fail-open/fail-safe strategy for Feature Store outages.


© 2026 Alexander Polomodov