System Design Space

ML Engineering

14 chapters

This page contains all chapters in this theme. Open chapters in sequence or use this page as a section map.

1

ML Engineering: Designing Models, Pipelines, and the Production Loop

Original Content · easy

Introductory map of ML Engineering: metrics, data, training, serving, platform ownership, and production operations around models.

2

Machine Learning System Design (short summary)

Book Summary · hard

Practical guide by Babushkin and Kravchenko: problem analysis, metrics, working with data, common mistakes, and preparation for ML interviews.

3

Precision and Recall at Your Fingertips

Original Content · easy

A simple explanation of precision, recall, threshold choice, ROC AUC, and PR AUC built around the story of Vasya and the wolf.

4

ML Lifecycle: From Data and Training to Production and Feedback Loops

Original Content · medium

A practical map of the ML system lifecycle: data contracts, training, quality checks, model registry, release flow, monitoring, and retraining.

5

Model Release, Calibration, and Experiment Loops

Original Content · medium

How to release ML models safely: calibration, threshold tuning, shadow mode, canary release, A/B experiments, and rollback.

6

Model Serving and Inference Architecture

Original Content · medium

How to design the live inference path for ML and LLM systems: online, batch, and stream modes, autoscaling, CPU/GPU routing, degraded behavior, and latency-cost trade-offs.

7

Human-in-the-Loop, Data Quality, and the Operational AI Loop

Original Content · medium

The operating loop of ML systems: feedback capture, annotation workflows, data quality, error analysis, drift investigation, and retraining triggers.

8

ML Ops Pipeline

Case Study · hard

Case study on the MLOps loop: data, features, training, model registry, rollout, live inference, and drift monitoring as one engineering system.

9

Feature Store & Model Serving

Case Study · hard

Case study on feature stores and model serving: preserving one meaning of features across training and runtime, keeping point-in-time correctness, and controlling training-serving skew.

10

The History of Google TPUs and Their Evolution

Original Content · medium

How Google moved from TPU v1 for inference to Ironwood: architectural trade-offs, compute economics, and what distinguishes the TPU approach from GPU-centric designs.

11

The History of NVIDIA AI Accelerators

Original Content · medium

How NVIDIA moved from programmable GPUs and CUDA to Tensor Cores, DGX, H100, Blackwell, and rack-scale AI infrastructure: architectural inflection points, ecosystem leverage, and compute economics.

12

ML Platform at T-Bank: A Common Good, or Better Off Without It?

Original Content · medium

Analysis of an interview about the evolution of the ML platform at T-Bank: how teams moved from manual SSH workflows to platform engineering, shared data flows, and mature model operations.

13

Fraud / Risk Scoring ML System

Case Study · hard

Practical ML case: real-time scoring, review operations, delayed labels, threshold tuning, drift analysis, and the next calibration cycle.

14

Ranking and Recommendation Architecture for ML Systems

Original Content · medium

How to design a recommendation loop: candidate generation, ranking, policy layers, freshness, feedback, and the next training cycle.
