ML Engineering
14 chapters
This page contains all chapters in this theme. Open chapters in sequence or use this page as a section map.
ML Engineering: Designing Models, Pipelines, and the Production Loop
Original Content (easy): Introductory map of ML Engineering: metrics, data, training, serving, platform ownership, and production operations around models.
Machine Learning System Design (short summary)
Book Summary (hard): Practical guide by Babushkin and Kravchenko: problem analysis, metrics, working with data, common mistakes, and preparation for ML interviews.
Precision and recall at your fingertips
Original Content (easy): A simple explanation of precision, recall, threshold choice, ROC AUC, and PR AUC built around the story of Vasya and the wolf.
ML Lifecycle: From Data and Training to Production and Feedback Loops
Original Content (medium): A practical map of the ML system lifecycle: data contracts, training, quality checks, model registry, release flow, monitoring, and retraining.
Model Release, Calibration, and Experiment Loops
Original Content (medium): How to release ML models safely: calibration, threshold tuning, shadow mode, canary release, A/B experiments, and rollback.
Model Serving and Inference Architecture
Original Content (medium): How to design the live inference path for ML and LLM systems: online, batch, and stream modes, autoscaling, CPU/GPU routing, degraded behavior, and latency-cost trade-offs.
Human-in-the-Loop, Data Quality, and the Operational AI Loop
Original Content (medium): The operating loop of ML systems: feedback capture, annotation workflows, data quality, error analysis, drift investigation, and retraining triggers.
ML Ops Pipeline
Case Study (hard): Case study on the MLOps loop: data, features, training, model registry, rollout, live inference, and drift monitoring as one engineering system.
Feature Store & Model Serving
Case Study (hard): Case study on feature stores and model serving: preserving one meaning of features across training and runtime, keeping point-in-time correctness, and controlling training-serving skew.
The history of Google TPUs and their evolution
Original Content (medium): How Google moved from TPU v1 for inference to Ironwood: architectural trade-offs, compute economics, and what distinguishes the TPU approach from GPU-centric designs.
The History of NVIDIA AI Accelerators
Original Content (medium): How NVIDIA moved from programmable GPUs and CUDA to Tensor Cores, DGX, H100, Blackwell, and rack-scale AI infrastructure: architectural inflection points, ecosystem leverage, and compute economics.
ML platform in T-Bank: the common good or better not needed
Original Content (medium): Analysis of an interview about the evolution of the ML platform at T-Bank: how teams moved from manual SSH workflows to platform engineering, shared data flows, and mature model operations.
Fraud / Risk Scoring ML System
Case Study (hard): Practical ML case: realtime scoring, review operations, delayed labels, threshold tuning, drift analysis, and the next calibration cycle.
Ranking and Recommendation Architecture for ML Systems
Original Content (medium): How to design a recommendation loop: candidate generation, ranking, policy layers, freshness, feedback, and the next training cycle.