This Theme 3 chapter focuses on AI/ML workflows, feature pipelines, and rollout control. The goal is not only to propose a working design, but also to explain behavior under scale and failure pressure.
Use a stable structure: requirements -> architecture -> critical deep dive -> evolution. This makes the solution clear, defensible, and interview-ready.
Offline/Online Parity
Keep feature semantics consistent across training and serving paths.
Rollout Safety
Canary, shadow, rollback, and drift alerting are baseline architecture requirements.
Data Quality
Use guardrails for freshness, lineage, and training-serving skew prevention.
Platform Efficiency
Balance pipeline cost, feature-store footprint, and inference latency.
Case-Solving Playbook
Define feature contract
Phase 1: Align feature schema and semantics across offline and online paths.
Build rollout policy
Phase 2: Specify canary/shadow/rollback and model-quality observability.
Cover data quality risks
Phase 3: Add freshness, lineage, and drift-skew guardrails end-to-end.
Optimize cost envelope
Phase 4: Balance inference SLA with feature/model pipeline operating cost.
Context
Machine Learning System Design
Core ML architecture overview that helps structure Feature Store decisions in interviews.
"Design a Feature Store" is a classic ML System Design case. Interviewers expect you to show how you keep training and inference in parity, how you enforce freshness SLAs, and how the system degrades when the online serving path fails.
Chapter scope boundaries
Covered in this chapter
- Feature contracts and data plane: registry, offline/online stores, materialization, and retrieval APIs.
- Control of online/offline parity, point-in-time correctness, and training-serving skew.
- Reliability of the inference feature path: latency/freshness SLOs, fallback modes, and operational guardrails.
Not covered here
- Training orchestration, model selection, and experiment lifecycle management.
- Model-registry release governance: approval policy, canary/shadow rollout logic, and rollback decisions.
- End-to-end retraining loop, drift-driven release cadence, and ownership of the full ML delivery lifecycle.
End-to-end lifecycle and release governance are covered in ML Ops Pipeline.
Problem & Context
Functional requirements
- Single feature registry: owner, schema, entity keys, source-of-truth, and transformation version.
- Offline retrieval for train/validation with strict point-in-time correctness and no data leakage.
- Online retrieval for inference with low-latency access and stable API contracts.
- Materialization from batch/stream pipelines into online store with explicit freshness control.
- Backfill and history replay when feature logic changes, without ad-hoc scripts.
Non-functional requirements
- Online read latency: p95 < 30 ms. Otherwise Feature Store becomes a bottleneck in user-facing inference paths.
- Availability: 99.95%+. Online store downtime directly blocks models in critical product workflows.
- Freshness SLA: <= 5 minutes for hot features. Stale features quickly degrade ranking, personalization, and fraud quality.
- Skew detection: zero critical skew events without an alert. Training-serving skew must be caught before user-visible degradation.
Scale & Capacity assumptions
Inference traffic
40k-120k RPS
Peak online-store pressure in recommendation and fraud scenarios.
Feature vectors
50-300 features/request
Requires batched entity fetch and efficient serialization contracts.
Entity cardinality
100M+ users/devices/items
High cardinality impacts shard strategy and online index size.
Streaming ingress
1M-3M events/s
Needs backpressure control and idempotent materialization logic.
Daily offline snapshots
2-8 TB/day
Backfill and point-in-time joins require deliberate partition/layout strategy.
Related chapter
ETL/ELT Architecture
Feature Store relies on mature batch/stream pipelines and robust orchestration.
Architecture
Feature Store architecture should clearly separate ingestion, offline training, and online serving paths. This makes training-serving skew control and SLA ownership explicit.
Feature Store Architecture
Event Sources
Product events, CRM, billing, clicks, logs
Batch ETL/ELT
Daily/hourly pipelines and backfills
Stream Processing
Near real-time transforms with watermarking
Offline Store
Historical snapshots for train/validation
Feature Registry
Schemas, owners, versions, SLA, lineage
Materialization Service
Online-store upserts, dedup, conflict policy
Online Store
Low-latency key-value for inference
Serving SDK / Gateway
Stable feature API contract for models
Layer responsibilities
Feature Registry
Feature catalog with schema/versioning, owner, SLA, lineage, and production readiness status.
Ingestion & Transform
Batch ETL/ELT plus stream processing. Feature computations are implemented as reusable transformation contracts.
Offline Store
Historical feature storage for training, replay, and reproducible dataset generation.
Online Store
Low-latency key-value reads for inference with TTL, selective invalidation, and hot-key protection.
Materialization Service
Moves computed features into online store and controls watermarking, late events, and exactly-once semantics where feasible.
Serving SDK / Gateway
Unified feature-fetch API that pins request schema and shields clients from internal storage changes.
Quality & Observability
Freshness, availability, skew, null-rate, and latency metrics with alerts and business-critical dashboards.
Feature contract strategy
Lock entity keys, transformation version, TTL, freshness SLA, and owner in the registry. This reduces hidden skew during updates and makes rollback faster when model quality drops.
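The contract fields listed above can be sketched as a single registry entry. A minimal illustration with a hypothetical feature name and owner (all field names are assumptions, not a specific registry's schema):

```python
from dataclasses import dataclass

# Hypothetical registry entry pinning the fields the contract strategy names:
# entity keys, transformation version, TTL, freshness SLA, and owner.
@dataclass(frozen=True)
class FeatureContract:
    name: str
    entity_keys: tuple           # e.g. ("user_id",)
    transformation_version: str  # pinned version of the transform code
    ttl_seconds: int             # online-store TTL
    freshness_sla_seconds: int   # max acceptable staleness for hot features
    owner: str                   # accountable team

contract = FeatureContract(
    name="user_txn_count_7d",
    entity_keys=("user_id",),
    transformation_version="v3",
    ttl_seconds=3600,
    freshness_sla_seconds=300,   # the 5-minute hot-feature SLA from the requirements
    owner="fraud-features-team",
)
```

Freezing the dataclass mirrors the intent: a contract changes only via an explicit new version, never by in-place mutation.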
Deep dives
Point-in-time correctness
Training datasets must include only feature values available at event time. This requires event-time joins and explicit time-travel rules.
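A minimal point-in-time lookup sketch in plain Python (the feature history and timestamps are made up for illustration): each training event may only see the latest feature value computed at or before the event time, never a later one.

```python
from bisect import bisect_right

# Feature history per entity: (timestamp, value) pairs sorted by timestamp.
history = {
    "user_1": [(5, 2), (15, 7), (25, 11)],  # (day, txn_count_7d)
}

def point_in_time_value(entity, event_ts):
    """Return the latest feature value available at or before event_ts."""
    rows = history.get(entity, [])
    idx = bisect_right(rows, (event_ts, float("inf"))) - 1
    return rows[idx][1] if idx >= 0 else None

print(point_in_time_value("user_1", 10))  # -> 2  (value from day 5)
print(point_in_time_value("user_1", 20))  # -> 7  (value from day 15)
print(point_in_time_value("user_1", 3))   # -> None (no value existed yet)
```

Joining on the *latest* value instead (a plain key join) would silently leak future information into training, which is exactly the skew this rule prevents.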
Materialization consistency
Batch and stream paths often overlap. You need idempotent upserts, deduplication, and deterministic conflict resolution by version/time.
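The upsert rule above can be sketched as a conditional write keyed on event time (the key format and in-memory dict standing in for the online store are assumptions):

```python
# Idempotent upsert sketch: batch and stream writers may race on the same
# key; a write lands only if its event timestamp is at least as new as the
# stored one, so replays and duplicates are harmless.
online_store = {}  # key -> (event_ts, value); stands in for the online KV store

def upsert(key, event_ts, value):
    current = online_store.get(key)
    if current is None or event_ts >= current[0]:
        online_store[key] = (event_ts, value)
        return True   # write accepted
    return False      # stale write from a lagging pipeline: dropped

upsert("user_1:txn_count_7d", 100, 7)   # stream path writes first
upsert("user_1:txn_count_7d", 90, 5)    # late batch write: rejected
upsert("user_1:txn_count_7d", 100, 7)   # duplicate replay: idempotent no-op
print(online_store["user_1:txn_count_7d"])  # -> (100, 7)
```

In a real store this compare-and-set must be atomic per key (e.g. a conditional write), otherwise concurrent writers reintroduce the race.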
Training-serving skew control
Compare feature distributions between offline training snapshots and online production traffic; define skew budgets and trigger rollback when exceeded.
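One common way to compare those distributions is the Population Stability Index; a minimal sketch with made-up bin shares (the 0.2 alert threshold is a widely used rule of thumb, not a universal constant):

```python
import math

# PSI sketch for training-serving skew: compare offline (training) and
# online (serving) bin shares of one feature, binned identically.
def psi(expected, actual, eps=1e-6):
    score = 0.0
    for p, q in zip(expected, actual):
        p, q = max(p, eps), max(q, eps)  # guard against empty bins
        score += (p - q) * math.log(p / q)
    return score

train_dist = [0.10, 0.40, 0.30, 0.20]  # offline snapshot bin shares
serve_dist = [0.12, 0.38, 0.31, 0.19]  # online traffic bin shares

score = psi(train_dist, serve_dist)
print(f"PSI = {score:.4f}")
if score > 0.2:  # skew budget exceeded
    print("skew budget exceeded: alert and consider rollback")
```

The skew budget from the text maps directly onto the threshold: below it, drift is logged; above it, the rollback path is triggered.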
Online degradation plan
When Feature Store fails, inference should degrade gracefully: cached features, reduced feature set, or rule-based baseline.
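The degradation ladder above can be sketched as an ordered fallback chain (function and variable names are hypothetical; dicts stand in for the online store and local cache):

```python
# Degradation ladder for the online feature path:
# live store -> local cache (possibly stale) -> defaults for a reduced feature set.
def fetch_features(entity_id, store, cache, defaults):
    try:
        return store[entity_id], "live"
    except KeyError:
        pass
    if entity_id in cache:
        return cache[entity_id], "cached"  # stale but usable
    return defaults, "fallback"            # reduced set / rule-based baseline

store, cache = {}, {"user_1": {"txn_count_7d": 7}}  # simulate store outage
defaults = {"txn_count_7d": 0}

print(fetch_features("user_1", store, cache, defaults))  # cached path
print(fetch_features("user_2", store, cache, defaults))  # fallback path
```

Returning the path taken alongside the features lets the model (and monitoring) know it is scoring on degraded inputs, which matters for both alerting and offline analysis.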
Trade-offs
Strongly normalized feature definitions reduce duplication but slow down local experiment velocity.
Streaming-first improves freshness but significantly increases operational complexity and on-call cost.
One global Feature Store simplifies governance but increases blast radius during materialization failures.
Aggressive TTL lowers stale risk but increases compute pressure and cache churn.
Recommendations
- Start with a limited set of high-impact features and explicit ownership per feature.
- Version transformations as code and enforce schema/skew checks in CI/CD.
- Break SLA into ingest, materialization, and online-read budgets to localize degradation faster.
- Design fallback paths before production rollout, not after the first incident.
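The SLA-budget recommendation above can be made concrete with a stage split; a minimal sketch where the per-stage budgets are illustrative assumptions that sum to the 5-minute hot-feature SLA:

```python
# Hypothetical split of the 5-minute freshness SLA into stage budgets so a
# breach can be localized to ingest, materialization, or online read.
budgets = {"ingest": 180, "materialize": 100, "online_read_buffer": 20}  # seconds
assert sum(budgets.values()) <= 300  # must fit the <= 5 min hot-feature SLA

measured = {"ingest": 150, "materialize": 130, "online_read_buffer": 5}
breaches = [stage for stage in budgets if measured[stage] > budgets[stage]]
print(breaches)  # degradation localized to the breaching stage(s)
```

With per-stage budgets, an end-to-end freshness miss points directly at one owner instead of triggering a cross-team investigation.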
Common mistakes
- Feature logic is duplicated between notebooks and production code without shared registry/versioning.
- Training snapshots are built without point-in-time rules, causing hidden data leakage.
- Online store is updated without freshness/skew monitoring, so degradation appears only in business KPIs.
- Model is shipped without fail-open/fail-safe strategy for Feature Store outages.
References
- Feast documentation - Open-source reference for registry, offline/online stores, and materialization jobs.
- Hopsworks Feature Store docs - Approach to feature groups, training datasets, and online serving in one platform.
- Tecton docs - Production patterns for feature engineering, realtime transformations, and serving.
- Google Cloud MLOps architecture guide - System-level view of ML delivery pipelines and operational controls.
Related chapters
- How the System Design task section is structured - Entry map of the case-studies section and the shared framework this case follows.
- Machine Learning System Design (short summary) - End-to-end ML lifecycle view where Feature Store connects data and serving layers.
- AI Engineering (short summary) - Production AI practices for evaluation, deployment, governance, and operations.
- ETL/ELT Architecture - Foundation for batch pipelines, backfills, and orchestration of feature computation.
- Designing Event-Driven Systems (short summary) - Streaming ingestion and delivery semantics for near real-time feature updates.
- Data Governance & Compliance - PII control, lineage, and auditability requirements for sensitive feature pipelines.
- Observability & Monitoring Design - Freshness, skew, and latency metrics as part of reliability engineering.
- ML Platform at T-Bank - A practical platform-engineering case for ML workflows at enterprise scale.
