A platform view of ML matters most when the challenge is not one model, but how dozens of teams build and run models at scale.
The chapter shows how data workflows, engineering experience, and standardized release practices come together into one operating model.
For interviews, it gives strong material on platform ownership, organizational design, and maturity in ML operations.
Practical value of this chapter
Platform value
See an ML platform as a product for engineers, not just a bag of tools.
Organizational design
Understand how team structure and platform practices affect model delivery speed.
Operational maturity
See how standardization reduces chaos around releasing and running ML systems.
Interview material
Get a real platform story instead of abstract theory.
Source
Yellow AI Club Talks
An interview about how T-Bank evolved its ML platform from an SSH-driven workflow into a mature platform product.
The ML platform at T-Bank is framed here as an infrastructure product that helps teams move from manual SSH-based work toward a platform model. The platform takes on compute, backups, observability, and repeatable workflows so engineers can focus on models and product value instead of constant manual operations.
The goal is not to hide complexity behind magic, but to provide understandable self-service, strong developer experience, and reproducibility for many different ML teams working at the same time.
Who participated in the interview
Host
Daniil Gavrilov
Head of a research team at T-Bank.
Guest
Mikhail Chebakov
Head of ML platform development at T-Bank.
Platform evolution
Early stage
SSH clusters and manual management
Teams worked directly on servers over SSH. That felt transparent and controllable, but it did not scale well and made experiments harder to reproduce.
First platform step
Simple orchestrator
A job-scheduling and resource-allocation layer was added. It improved server utilization and reduced the amount of manual work.
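As a rough illustration of what such a layer does, here is a minimal first-fit scheduler sketch. It is not T-Bank's orchestrator: the `Node` class, the `schedule` function, and the GPU-only resource model are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node that tracks only free GPUs."""
    name: str
    free_gpus: int
    jobs: list = field(default_factory=list)


def schedule(jobs, nodes):
    """Place each (job_name, gpus) pair on the first node with enough free GPUs.

    Returns the names of jobs that could not be placed and must wait.
    """
    pending = []
    for job_name, gpus in jobs:
        for node in nodes:
            if node.free_gpus >= gpus:
                node.free_gpus -= gpus
                node.jobs.append(job_name)
                break
        else:
            # No node had enough capacity for this job.
            pending.append(job_name)
    return pending


nodes = [Node("node-a", free_gpus=4), Node("node-b", free_gpus=8)]
waiting = schedule([("train", 4), ("eval", 2), ("big-train", 8)], nodes)
```

Even this naive policy replaces manual "who is using which server" coordination with a queue, which is where the utilization gain comes from.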
Mature stage
ML platform as a product
The focus shifted to platform primitives for data and workflows, self-service, and standard paths for building, releasing, and operating models.
Three key domains of an ML platform
1. Engineering experience
Interactive work for an engineer with a short feedback cycle.
Fast experiments, easy environment setup, predictable tooling behavior.
2. Production pipelines
Automation of robust ML processes with an emphasis on repeatability and safe delivery.
Standard pipelines, versioned artifacts, quality checks.
3. Deployment and operations
A reliable live environment where ML systems create measurable product and business value.
Service objectives, monitoring, degradation modes, cost and capacity management.
Most important function: data management
A critical element was the ability to create working folders/dataspaces accessible from anywhere in the cluster, with automatic backups.
This reduces the risk of losing experimental artifacts, simplifies working with non-standard data, and helps move work between compute environments.
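The idea of automatic backups for a working folder can be sketched with a timestamped copy. This is only an assumption about the general mechanism, not the platform's actual implementation; the `snapshot` function and the directory layout are hypothetical.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(dataspace: Path, backup_root: Path) -> Path:
    """Copy a dataspace into a new timestamped directory under backup_root.

    Hypothetical layout: <backup_root>/<dataspace name>/<UTC timestamp>/
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_root / dataspace.name / stamp
    # copytree creates the target directory and any missing parents.
    shutil.copytree(dataspace, target)
    return target
```

A real platform would run something like this on a schedule and use incremental storage snapshots rather than full copies, but the contract for the engineer is the same: work written to the dataspace is recoverable without any manual step.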
Why teams resist migrating from SSH
A sense of complete control
The SSH approach feels clear and transparent: the engineer sees the environment directly and can quickly adapt familiar tools.
The hidden cost of this approach
At scale, this leads to reproducibility problems, data loss, and growing complexity from coordinating many manually run scripts.
Platform design principles
Making the right path simple
The platform should steer users toward good defaults: reproducibility, logging, backups, and safe releases.
Making the wrong path difficult
If a scenario creates risk, such as data loss, unreproducible runs, or manual operations, the platform should make that path harder or block it.
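One way to picture "making the wrong path difficult" is a submission check that rejects risky job specs before they run. The sketch below is an illustration under stated assumptions: the spec fields (`image_digest`, `code_commit`, `output_dir`) and the `/dataspaces/` prefix are hypothetical, not the platform's real schema.

```python
# Hypothetical prefix of storage that is covered by automatic backups.
BACKED_UP_PREFIX = "/dataspaces/"


def validate_job(spec: dict) -> list[str]:
    """Return the reasons a job spec should be rejected (empty list means OK)."""
    problems = []
    if not spec.get("image_digest"):
        problems.append("pin the container image by digest so the run is reproducible")
    if not spec.get("code_commit"):
        problems.append("record the code commit the job was built from")
    if not str(spec.get("output_dir", "")).startswith(BACKED_UP_PREFIX):
        problems.append(f"write outputs under {BACKED_UP_PREFIX} so they are backed up")
    return problems


good = validate_job({
    "image_digest": "sha256:abc",
    "code_commit": "1a2b3c",
    "output_dir": "/dataspaces/team/run1",
})
bad = validate_job({"output_dir": "/tmp/run1"})
```

Returning readable reasons rather than a bare failure ties this principle to the next one: a blocked path should still explain what the engineer needs to change.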
UX is as important as architecture
Technical flexibility alone does not make a system usable: capabilities should be easy to find and understand without reading long manuals.
How to measure the effectiveness of an ML platform
Speeding up experiments is only part of a platform's value. It also has to support predictable rollouts, a short feedback loop, and well-understood latency in live scenarios.
- Basic product metrics: how many engineers and teams use the platform and keep coming back to it.
- Regular satisfaction surveys across different ML domains.
- Whether the platform team itself actively uses the platform, rather than only building it for other teams.
- Joint development with product teams instead of building the platform in isolation.
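The first metric in the list, whether engineers keep coming back, can be computed as a simple returning-user rate. This is a minimal sketch; the function name and the choice of comparing two consecutive periods are assumptions, not a definition from the interview.

```python
def returning_user_rate(previous_period: set[str], current_period: set[str]) -> float:
    """Share of last period's active users who were active again this period."""
    if not previous_period:
        return 0.0
    return len(previous_period & current_period) / len(previous_period)


# Two of the four engineers active last month came back this month.
rate = returning_user_rate({"ann", "bob", "cat", "dan"}, {"ann", "bob", "eve"})
# rate == 0.5
```

Tracking this per ML domain, alongside the satisfaction surveys, helps show whether a drop comes from one team's unmet requirements or from the platform as a whole.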
Variety of ML domains
The platform has to support domains with very different requirements for data, compute, latency, and reproducibility. One universal abstraction for every domain does not work here.
Practical checklist
- Separate the engineer's interactive workflow from production pipelines, but connect them with one shared artifact contract.
- Design for cross-cluster portability and backups of working data from the start.
- Define a default path for training, inference, and monitoring, then treat non-standard cases as extensions.
- Test the usability of new features with real teams before broad rollout to reduce resistance to leaving the SSH model behind.
- Evaluate the platform not only by reliability, but also by delivery speed and reproducibility.
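The "shared artifact contract" from the first checklist item could look like the sketch below: one record type that both interactive work and production pipelines must produce. All field names here are hypothetical and chosen only to illustrate the idea of a contract that enforces traceability.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelArtifact:
    """Hypothetical contract shared by notebooks and production pipelines."""
    name: str
    version: str
    uri: str            # where the weights live, e.g. a path inside a dataspace
    code_commit: str    # source revision that produced the artifact
    data_snapshot: str  # identifier of the training data version
    metrics: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject artifacts that cannot be traced back to code and data.
        for attr in ("name", "version", "uri", "code_commit", "data_snapshot"):
            if not getattr(self, attr):
                raise ValueError(f"artifact field '{attr}' must not be empty")


artifact = ModelArtifact(
    name="scoring",
    version="1.0.0",
    uri="/dataspaces/risk/scoring/1.0.0",
    code_commit="1a2b3c",
    data_snapshot="snap-01",
)
```

Because the contract is the same on both sides, a model trained interactively can be promoted to a production pipeline without being repackaged, and an artifact missing its code or data reference fails at creation time rather than at release.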
References
Related chapters
- Brief overview of the T-Bank data platform - How data flows and data management work at bank scale.
- Evolution of T-Bank Architecture - How the bank moved from off-the-shelf solutions toward its own platform practices.
- ML System Design (short summary) - How to design an ML system end to end, from signals and metrics to production release.
- AI Engineering (short summary) - How to build AI applications, integrations, and live operational workflows.
- Hands-On Large Language Models (short summary) - A practical foundation for LLM systems, data, and operational patterns.
- ML Lifecycle: From Data and Training to Production and Feedback Loops - A core chapter on the ML lifecycle that the platform needs to support as one integrated product.
- Human-in-the-Loop, Data Quality, and the Operational AI Loop - Shows how manual review and the feedback loop become part of the platform's day-to-day operating model.
- Fraud / Risk Scoring ML System - An applied ML case where latency, feature data, and manual review requirements become especially visible.
