This Theme 3 chapter focuses on AI/ML workflows, feature pipelines, and rollout control. The goal is not only to propose a working design, but also to explain behavior under scale and failure pressure.
Use a stable structure: requirements -> architecture -> critical deep dive -> evolution. This makes the solution clear, defensible, and interview-ready.
Offline/Online Parity
Keep feature semantics consistent across training and serving paths.
Rollout Safety
Canary, shadow, rollback, and drift alerting are baseline architecture requirements.
Data Quality
Use guardrails for freshness, lineage, and training-serving skew prevention.
Platform Efficiency
Balance pipeline cost, feature-store footprint, and inference latency.
Case-Solving Playbook
Define feature contract
Phase 1: Align feature schema and semantics across offline and online paths.
Build rollout policy
Phase 2: Specify canary/shadow/rollback and model-quality observability.
Cover data quality risks
Phase 3: Add freshness, lineage, and drift-skew guardrails end-to-end.
Optimize cost envelope
Phase 4: Balance inference SLA with feature/model pipeline operating cost.
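The Phase 1 feature contract can be sketched as a small versioned schema shared by offline and online paths. This is a minimal illustration, not a standard API: the field names and the `freshness_sla_s` budget are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureContract:
    """Single source of truth for a feature's semantics across
    training (offline) and serving (online) paths."""
    name: str
    version: int
    dtype: str              # logical type, e.g. "float64"
    default: float          # value served when the feature is missing
    freshness_sla_s: int    # max acceptable staleness online, in seconds

    def qualified_name(self) -> str:
        # Versioned key, so offline joins and online lookups
        # can never silently read different definitions.
        return f"{self.name}:v{self.version}"

ctr_7d = FeatureContract(
    name="user_ctr_7d", version=2, dtype="float64",
    default=0.0, freshness_sla_s=300,
)
print(ctr_7d.qualified_name())  # -> user_ctr_7d:v2
```

Freezing the dataclass and versioning the key makes contract changes explicit release events rather than silent drift between notebooks and serving code.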
Related chapter: Machine Learning System Design - a framework for ML case studies covering requirements, metrics, data, and operational risks.
ML Ops Pipeline is a system-design case about moving models from experimentation into stable production operations. Interviewers expect you to design the end-to-end lifecycle: data, training, release, serving, monitoring, and safe degradation under failures.
Chapter scope boundaries
Covered in this chapter
- End-to-end lifecycle: ingest -> training/eval -> registry/release -> serving -> monitoring -> retraining.
- Release governance: quality gates, rollout policy (canary/shadow/A-B), and rollback readiness.
- Operating model: SLOs, ownership, runbooks, and response to drift/quality incidents.
Not covered here
- Low-level feature-registry schema design and API-level schema evolution mechanics.
- Detailed online/offline retrieval contracts, key design, TTL strategy, and hot-key mitigation in the online store.
- Deep internals of feature materialization jobs and batch/stream conflict resolution.
Detailed runtime design of Feature Store and serving contracts is covered in Feature Store & Model Serving.
Functional requirements
- Build one end-to-end pipeline from raw events/data to production inference.
- Support reproducible training with versioned datasets, feature definitions, and model artifacts.
- Enable controlled model rollout (canary/shadow/A-B) with safe rollback.
- Implement a feedback loop with online metrics, drift signals, and retraining triggers.
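The controlled-rollout requirement is usually implemented with deterministic, hash-based traffic splitting, so a given user consistently hits the same model during a canary. The sketch below is illustrative: `route_request`, the bucket count, and the shadow flag are assumed names, not a fixed API.

```python
import hashlib

def bucket(user_id: str, buckets: int = 100) -> int:
    """Stable bucket in [0, buckets): the same user always lands in the
    same bucket, so canary exposure is consistent across requests."""
    h = hashlib.sha256(user_id.encode()).hexdigest()
    return int(h, 16) % buckets

def route_request(user_id: str, canary_pct: int = 5) -> dict:
    """Serve the canary model to a small, sticky slice of traffic;
    shadow mode additionally scores requests with the candidate model
    and logs results without serving them to users."""
    b = bucket(user_id)
    return {
        "serve": "canary" if b < canary_pct else "prod",
        "shadow": True,  # log-only scoring by the candidate model
    }

print(route_request("user-42"))
```

Sticky bucketing matters for rollback analysis: if canary users churn between models on every request, per-cohort quality metrics become uninterpretable.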
Non-functional requirements
- p95 online inference latency below 150 ms for user-facing paths.
- Feature freshness SLA of 1-5 minutes for critical behavioral signals.
- 99.95% inference availability with graceful degradation to a fallback baseline.
- Full auditability: lineage for data, features, models, and rollout decisions.
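The 99.95% availability target translates into a concrete monthly error budget, a number worth quoting in an interview:

```python
def monthly_error_budget_minutes(availability: float, days: int = 30) -> float:
    """Allowed downtime per month for a given availability target."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability)

print(round(monthly_error_budget_minutes(0.9995), 1))  # -> 21.6 minutes/month
```

Roughly 22 minutes of downtime per month is the entire budget for deploys, incidents, and dependency failures combined, which is why rollback readiness and fallback serving are listed as hard requirements.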
Scale and assumptions
| Parameter | Assumption | Why it matters |
|---|---|---|
| DAU | 8M | Large product with continuous user events and realtime personalization. |
| Peak inference QPS | 120k | Traffic is spread across multiple user surfaces: feed, search, and recommendations. |
| Feature updates | 1.5B/day | Event streams require near real-time materialization into online feature stores. |
| Model retraining cadence | daily + emergency retrains | Models must adapt to seasonality, campaigns, and distribution shifts. |
| Peak artifact size | 2-8 GB/model | Needs robust storage, delivery, and rollback policies for model artifacts. |
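The table's assumptions support a quick back-of-envelope check on online feature-store write load. The peak-to-average factor of 3 below is an assumption, not from the table:

```python
FEATURE_UPDATES_PER_DAY = 1.5e9
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # assumed diurnal peak-to-average ratio

avg_writes_per_s = FEATURE_UPDATES_PER_DAY / SECONDS_PER_DAY
peak_writes_per_s = avg_writes_per_s * PEAK_FACTOR

print(f"avg:  ~{avg_writes_per_s:,.0f} writes/s")   # ~17,361
print(f"peak: ~{peak_writes_per_s:,.0f} writes/s")  # ~52,083
```

Tens of thousands of writes per second is well beyond comfortable single-node territory, which motivates a partitioned online store and stream materialization rather than batch-only publication.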
High-Level Architecture
Stage 1: Data & Feature Pipelines
Batch + streaming ingestion, quality checks, point-in-time joins, and feature publication to offline/online stores.
Stage 2: Training & Validation
Train/eval orchestration, experiment tracking, reproducible datasets, and model-quality guardrails.
Stage 3: Registry & Release Management
Model registry with stage transitions (staging -> canary -> prod), approval policy, and rollback-ready packages.
Stage 4: Serving & Monitoring
Online inference API, fallback policy, latency/error/freshness SLOs, and drift monitoring with auto-alerting.
Typical flow: events and source data enter ingestion, features are published to offline/online stores, orchestrators run train/eval jobs, registry controls versions and rollout, and serving closes the feedback loop with online metrics and drift signals.
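Stage 4's drift monitoring typically compares a serving-time feature distribution against its training baseline; one common statistic is the Population Stability Index (PSI) over fixed bins. A minimal sketch follows: the bins and the 0.2 alert threshold are widely used conventions, not fixed rules.

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions.
    Inputs are per-bin proportions, each summing to 1. Rule of thumb:
    PSI > 0.2 suggests significant drift worth an alert."""
    eps = 1e-6  # guard against log(0) on empty bins
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # feature distribution at training time
shifted  = [0.10, 0.20, 0.30, 0.40]   # feature distribution in serving logs

print(round(psi(baseline, baseline), 4))  # -> 0.0 (no drift)
print(round(psi(baseline, shifted), 4))   # > 0.2, would trigger a drift alert
```

Running this per feature, and not just on model outputs, catches the upstream breakages that output-only monitoring misses.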
Deep Dives and trade-offs
Freshness vs reproducibility
Faster feature/model refresh improves adaptation to new signals, but increases reproducibility risk and makes regression analysis harder.
Batch simplicity vs streaming responsiveness
Batch pipelines are cheaper and easier to operate but lose on freshness. Streaming lowers lag at the cost of significantly higher operational complexity.
Single model vs multi-model routing
A single general model is easier to manage but often lower quality. Segment routing improves quality but increases versioning and rollout complexity.
Strict guardrails vs release speed
Hard quality gates reduce incident risk but slow down delivery. A practical balance is achieved with risk-tier policies and automated checks.
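The risk-tier balance can be expressed as a small policy table mapping model tiers to required release gates. Tier names, metric thresholds, and check names below are assumptions for illustration only:

```python
# Hypothetical risk-tier policy: higher tiers require stricter gates.
GATE_POLICY = {
    "tier1_revenue_critical": {
        "min_auc_delta": 0.0,   # no offline regression tolerated
        "required": ["shadow_7d", "canary_5pct", "manual_approval"],
    },
    "tier2_user_facing": {
        "min_auc_delta": -0.002,
        "required": ["canary_5pct"],
    },
    "tier3_internal": {
        "min_auc_delta": -0.01,
        "required": [],          # automated checks only
    },
}

def can_release(tier: str, auc_delta: float, passed: set[str]) -> bool:
    """A release proceeds only if both the metric gate and all
    required process gates for the model's tier have passed."""
    policy = GATE_POLICY[tier]
    return (auc_delta >= policy["min_auc_delta"]
            and set(policy["required"]) <= passed)

print(can_release("tier1_revenue_critical", 0.001,
                  {"shadow_7d", "canary_5pct", "manual_approval"}))  # True
print(can_release("tier3_internal", -0.005, set()))                  # True
```

Encoding the policy as data keeps release speed for low-risk models while making tier-1 gates non-negotiable and auditable.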
Common anti-patterns
- Feature logic is duplicated between training notebooks and production code without shared registry/versioning.
- Model rollout happens without canary/shadow checks and without fallback, making incidents immediately user-visible.
- No point-in-time controls: training leaks future signals and production quality drops sharply.
- Drift monitoring is applied only to model output, without monitoring input feature distributions.
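The point-in-time anti-pattern is avoided by joining each training label only with the latest feature value observed at or before the label's timestamp. A minimal stdlib sketch of the idea (real pipelines push this into the feature store or SQL layer):

```python
import bisect

def point_in_time_join(feature_events, labels):
    """For each (entity, label_ts) pick the latest feature value with
    event_ts <= label_ts, never a future value (no leakage).
    feature_events: dict entity -> time-sorted list of (event_ts, value).
    labels: iterable of (entity, label_ts, label)."""
    rows = []
    for entity, label_ts, label in labels:
        events = feature_events.get(entity, [])
        # Index of the rightmost event with event_ts <= label_ts.
        i = bisect.bisect_right(events, (label_ts, float("inf"))) - 1
        value = events[i][1] if i >= 0 else None  # None -> contract default
        rows.append((entity, label_ts, value, label))
    return rows

events = {"u1": [(100, 0.2), (200, 0.5), (300, 0.9)]}
labels = [("u1", 250, 1), ("u1", 50, 0)]
print(point_in_time_join(events, labels))
# -> [('u1', 250, 0.5, 1), ('u1', 50, None, 0)]
```

Note the second row: the label at ts=50 predates any feature event, so the join yields no value rather than leaking the future 0.2.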
Recommendations
- Define pipeline contracts explicitly: schema, ownership, SLO, rollback procedure, and runbook for each stage.
- Maintain a single lineage graph: source data -> features -> model version -> release decision.
- Prepare at least two degradation modes: fallback model and rule-based baseline.
- Enforce budget-aware inference with latency/cost constraints and critical-surface prioritization.
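The two degradation modes recommended above can be wired as an ordered fallback chain. This is an illustrative sketch under assumed names: the scorer signatures and the rule-based floor are inventions for the example.

```python
def primary_model(features: dict) -> float:
    raise TimeoutError("inference backend unavailable")  # simulate an outage

def fallback_model(features: dict) -> float:
    return 0.3  # smaller cached baseline model, always loadable locally

def rule_baseline(features: dict) -> float:
    return 0.1  # last-resort heuristic (e.g. popularity prior), no model

def predict_with_degradation(features: dict) -> tuple[float, str]:
    """Try each scorer in order and record which mode served the
    request, so the fallback rate can itself be monitored as an SLO."""
    for mode, scorer in [("primary", primary_model),
                         ("fallback", fallback_model),
                         ("rules", rule_baseline)]:
        try:
            return scorer(features), mode
        except Exception:
            continue
    return 0.0, "static_default"

score, mode = predict_with_degradation({"user_ctr_7d": 0.4})
print(score, mode)  # -> 0.3 fallback
```

Returning the serving mode alongside the score is the key design choice: a rising fallback rate is often the earliest incident signal, well before user-facing quality metrics move.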
Interview prompts to cover
- How does your design prevent training-serving skew and data leakage?
- Which quality gates can block rollout, and which canary signals are acceptable?
- How does the architecture evolve under 10x inference QPS growth?
- Which end-to-end SLOs do you monitor: data lag, feature freshness, model quality, latency, fallback rate?
Related chapters
- Feature Store & Model Serving - Deep dive on offline/online parity, serving contracts, and operational guardrails.
- Recommendation System - Applied ML case with candidate generation, ranking, and production quality constraints.
- Data Pipeline / ETL / ELT Architecture - Foundation for ingestion, backfills, orchestration, and data quality controls.
- Observability & Monitoring Design - SLO monitoring, alerting, and incident response patterns for production systems.
- ML Platform at T-Bank - Real platform-engineering case about ML workflow evolution in a large company.
- Precision and recall basics - Metrics foundation for rollout decisions and post-release quality interpretation.
