Source
Yellow AI Club Talks
Interview about the philosophy, evolution and practical compromises of building an ML platform at T-Bank.
The ML platform at T-Bank is treated as an infrastructure product: almost invisible in teams' daily work, yet critical for scaling ML to production. Key idea: encapsulate operational complexity (resources, resilience, monitoring, reproducibility) so engineers can focus on models and product value.
Who participated in the interview
Host
Daniil Gavrilov
Head of the Research team (T-Bank).
Guest
Mikhail Chebakov
Head of ML platform development (T-Bank).
Platform evolution
Early stage
SSH clusters and manual management
Teams worked directly on the servers via SSH. This gave full control, but it scaled poorly and made experiments hard to reproduce.
First platform step
Simple orchestrator
A task-scheduling and resource-allocation layer was introduced, which increased server utilization and reduced the share of manual operations.
Mature stage
ML platform as a product
The focus shifted to data and workflow primitives, self-service, and standardized paths for developing, productionizing, and operating models.
Three Key Domains of an ML Platform
1. Engineering experience
Interactive work by an individual engineer with the shortest possible feedback loop.
Fast experiments, easy environment launches, predictable UX.
2. Production pipelines
Automated, robust ML processes with a focus on repeatability.
Standardized pipelines, artifact versioning, quality control.
3. Deployment and Operation
A reliable runtime environment where ML solutions deliver measurable business value.
SLOs, monitoring, graceful degradation, cost and capacity management.
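As an illustration of the SLO item in the third domain: the relationship between an availability target and the allowed downtime (the error budget) is simple arithmetic. The function name and 30-day window below are illustrative, not from the interview.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime per window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.9% availability target leaves roughly 43.2 minutes of downtime
# per 30-day window; 99.99% leaves about 4.3 minutes.
print(round(error_budget_minutes(0.999), 1))
print(round(error_budget_minutes(0.9999), 1))
```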
Most important function: data management
A critical element was the ability to create working folders (dataspaces) accessible from anywhere in the cluster, with automatic backups.
This reduces the risk of losing experiment artifacts, simplifies handling non-standard data, and makes it easier to move work between compute environments.
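A minimal sketch of the idea, assuming nothing about T-Bank's actual implementation: the `Dataspace` class, the mirroring to a backup root, and the restore-on-read behavior are hypothetical illustrations of "working folders with automatic backups".

```python
import shutil
import tempfile
from pathlib import Path

class Dataspace:
    """Hypothetical working folder: visible from any node, every write mirrored."""

    def __init__(self, root: Path, backup_root: Path):
        self.root = root
        self.backup_root = backup_root
        root.mkdir(parents=True, exist_ok=True)
        backup_root.mkdir(parents=True, exist_ok=True)

    def write(self, relpath: str, data: bytes) -> None:
        target = self.root / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        # Automatic backup: mirror the file so the experiment survives node loss.
        backup = self.backup_root / relpath
        backup.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, backup)

    def read(self, relpath: str) -> bytes:
        primary = self.root / relpath
        if primary.exists():
            return primary.read_bytes()
        # Fall back to the backup if the primary copy was lost.
        return (self.backup_root / relpath).read_bytes()

# Usage: write an artifact, then recover it even after the primary copy is gone.
tmp = Path(tempfile.mkdtemp())
ds = Dataspace(tmp / "primary", tmp / "backup")
ds.write("exp1/metrics.json", b'{"auc": 0.91}')
(tmp / "primary" / "exp1" / "metrics.json").unlink()  # simulate node loss
print(ds.read("exp1/metrics.json"))
```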
Why teams resist migrating from SSH
Feeling of complete control
The SSH approach is clear and transparent: the engineer sees the environment directly and can quickly adopt open-source tools.
The hidden cost of this approach
At scale, this leads to reproducibility problems, data loss, and the overhead of maintaining many manual scripts.
Platform Design Principles
Make the right path easy
The platform should guide users toward good default practices: reproducibility, logging, backups, and secure deployments.
Make the wrong path hard
If a scenario creates risks (data loss, non-reproducible runs, manual toil), the platform should make that path harder or block it entirely.
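The "make the wrong path hard" principle can be sketched as a submission-time check that rejects risky jobs with actionable messages; the job fields below (`image_digest`, `output_dataspace`) are invented for illustration.

```python
def validate_job(job: dict) -> list[str]:
    """Return reasons to reject a risky job instead of silently running it."""
    errors = []
    if not job.get("image_digest"):
        errors.append("container image must be pinned by digest (reproducible runs)")
    if not job.get("output_dataspace"):
        errors.append("outputs must go to a backed-up dataspace (no data loss)")
    return errors

# A job missing both safeguards is blocked with actionable messages.
print(validate_job({"entrypoint": "train.py"}))
# A compliant job passes with no errors.
print(validate_job({"image_digest": "sha256:...", "output_dataspace": "ds://team/exp"}))
```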
UX is as important as architecture
A technically flexible solution is not automatically user-friendly: features must be discoverable and understandable without reading lengthy documentation.
How to measure the effectiveness of an ML platform
- Basic adoption metrics: number of users, teams, retention.
- Periodic surveys and satisfaction measurements across different ML areas.
- Dogfooding: the platform team uses the platform itself.
- Co-development with product teams instead of platform isolation.
Diversity of ML areas
The platform simultaneously supports areas with very different requirements for data, hardware, latency, and reproducibility. A one-size-fits-all abstraction with no domain awareness does not work here.
Practical checklist
- Separate the interactive DevEx environment from the production pipeline, but connect them with a single artifact contract.
- Design for cross-cluster portability and backup of production data right away.
- Establish a golden path for standard tasks (training, inference, monitoring), and design non-standard scenarios as extensions.
- Test the UX of new features on real teams before mass rollout to reduce resistance to migration from the SSH approach.
- Evaluate the platform not only by uptime, but also by the speed of ML delivery and reproducibility of results.
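The first checklist item mentions a single artifact contract between the interactive and production sides; a hypothetical minimal version (the `Artifact` dataclass, URI scheme, and in-memory store are all invented for illustration) could look like:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """The contract: whatever DevEx produces, production consumes by this handle."""
    name: str
    version: str  # content hash, so identical bytes always get the same version
    uri: str

def publish(name: str, payload: bytes, store: dict) -> Artifact:
    """Interactive side: register bytes and get a stable, versioned handle."""
    version = hashlib.sha256(payload).hexdigest()[:12]
    uri = f"artifact://{name}/{version}"
    store[uri] = payload
    return Artifact(name, version, uri)

def fetch(artifact: Artifact, store: dict) -> bytes:
    """Production side: resolve the immutable handle, never a mutable path."""
    return store[artifact.uri]

store: dict = {}
handle = publish("churn-model", b"weights-v1", store)
print(handle.uri)
print(fetch(handle, store))
```

Content-addressed versions make reruns idempotent: re-publishing identical bytes yields the same handle, which is what makes a pipeline run repeatable.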
