Source
Yellow AI Club Talks
Interview about the philosophy, evolution and practical compromises of building an ML platform at T-Bank.
The ML platform at T-Bank is treated as an infrastructure product: almost invisible in teams' daily work, yet critical for scaling ML to production. Key idea: encapsulate operational complexity (resources, resilience, monitoring, reproducibility) so engineers can focus on models and product value.
Who participated in the interview
Host
Daniil Gavrilov
Head of the Research team (T-Bank).
Guest
Mikhail Chebakov
Head of ML platform development (T-Bank).
Platform evolution
Early stage
SSH clusters and manual management
Teams worked directly on the servers via SSH. This gave full control, but it scaled poorly and made experiments hard to reproduce.
First platform step
Simple orchestrator
A task-scheduling and resource-allocation layer was introduced, which increased server utilization and reduced the share of manual operations.
Mature stage
ML platform as a product
The focus shifted to data and workflow primitives, self-service, and standardized paths for developing, productionizing, and operating models.
Three Key Domains of an ML Platform
1. Engineering experience
Interactive work by an individual engineer with the shortest possible feedback loop.
Fast experiments, easy environment launches, predictable UX.
2. Production pipelines
Automated, robust ML processes with a focus on repeatability.
Standardized pipelines, artifact versioning, quality control.
3. Deployment and Operation
A reliable runtime environment where ML solutions deliver measurable business value.
SLOs, monitoring, graceful degradation, cost and capacity management.
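As an illustration of the SLO item in the third domain: the relationship between an availability target and the allowed downtime (the error budget) is simple arithmetic. The function name and 30-day window below are illustrative, not from the interview.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime per window for a given availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.9% availability target leaves roughly 43.2 minutes of downtime
# per 30-day window; 99.99% leaves about 4.3 minutes.
print(round(error_budget_minutes(0.999), 1))
print(round(error_budget_minutes(0.9999), 1))
```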
Most important function: data management
A critical element was the ability to create working folders (dataspaces) accessible from anywhere in the cluster, with automatic backups.
This reduces the risk of losing experiment artifacts, simplifies handling non-standard data, and makes it easier to move work between compute environments.
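A minimal sketch of the idea, assuming nothing about T-Bank's actual implementation: the `Dataspace` class, the mirroring to a backup root, and the restore-on-read behavior are hypothetical illustrations of "working folders with automatic backups".

```python
import shutil
import tempfile
from pathlib import Path

class Dataspace:
    """Hypothetical working folder: visible from any node, every write mirrored."""

    def __init__(self, root: Path, backup_root: Path):
        self.root = root
        self.backup_root = backup_root
        root.mkdir(parents=True, exist_ok=True)
        backup_root.mkdir(parents=True, exist_ok=True)

    def write(self, relpath: str, data: bytes) -> None:
        target = self.root / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        # Automatic backup: mirror the file so the experiment survives node loss.
        backup = self.backup_root / relpath
        backup.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(target, backup)

    def read(self, relpath: str) -> bytes:
        primary = self.root / relpath
        if primary.exists():
            return primary.read_bytes()
        # Fall back to the backup if the primary copy was lost.
        return (self.backup_root / relpath).read_bytes()

# Usage: write an artifact, then recover it even after the primary copy is gone.
tmp = Path(tempfile.mkdtemp())
ds = Dataspace(tmp / "primary", tmp / "backup")
ds.write("exp1/metrics.json", b'{"auc": 0.91}')
(tmp / "primary" / "exp1" / "metrics.json").unlink()  # simulate node loss
print(ds.read("exp1/metrics.json"))
```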
Why teams resist migrating from SSH
Feeling of complete control
The SSH approach is clear and transparent: the engineer sees the environment directly and can quickly adopt open-source tools.
The hidden cost of this approach
At scale, this leads to reproducibility problems, data loss, and the overhead of maintaining many manual scripts.
Platform Design Principles
Make the right path easy
The platform should guide users toward good default practices: reproducibility, logging, backups, and secure deployments.
Make the wrong path hard
If a scenario creates risks (data loss, non-reproducible runs, manual toil), the platform should make that path harder or block it entirely.
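The "make the wrong path hard" principle can be sketched as a submission-time check that rejects risky jobs with actionable messages; the job fields below (`image_digest`, `output_dataspace`) are invented for illustration.

```python
def validate_job(job: dict) -> list[str]:
    """Return reasons to reject a risky job instead of silently running it."""
    errors = []
    if not job.get("image_digest"):
        errors.append("container image must be pinned by digest (reproducible runs)")
    if not job.get("output_dataspace"):
        errors.append("outputs must go to a backed-up dataspace (no data loss)")
    return errors

# A job missing both safeguards is blocked with actionable messages.
print(validate_job({"entrypoint": "train.py"}))
# A compliant job passes with no errors.
print(validate_job({"image_digest": "sha256:...", "output_dataspace": "ds://team/exp"}))
```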
UX is as important as architecture
A technically flexible solution is not automatically user-friendly: features must be discoverable and understandable without reading lengthy documentation.
How to measure the effectiveness of an ML platform
- Basic adoption metrics: number of users, teams, retention.
- Periodic surveys and satisfaction measurements across different ML areas.
- Dogfooding: the platform team uses the platform itself.
- Co-development with product teams instead of platform isolation.
Diversity of ML areas
The platform simultaneously supports areas with very different requirements for data, hardware, latency, and reproducibility. A one-size-fits-all abstraction with no domain awareness does not work here.
Practical checklist
- Separate the interactive DevEx environment from the production pipeline, but connect them with a single artifact contract.
- Design for cross-cluster portability and backup of production data right away.
- Establish a golden path for standard tasks (training, inference, monitoring), and design non-standard scenarios as extensions.
- Test the UX of new features on real teams before mass rollout to reduce resistance to migration from the SSH approach.
- Evaluate the platform not only by uptime, but also by the speed of ML delivery and reproducibility of results.
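The first checklist item mentions a single artifact contract between the interactive and production sides; a hypothetical minimal version (the `Artifact` dataclass, URI scheme, and in-memory store are all invented for illustration) could look like:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """The contract: whatever DevEx produces, production consumes by this handle."""
    name: str
    version: str  # content hash, so identical bytes always get the same version
    uri: str

def publish(name: str, payload: bytes, store: dict) -> Artifact:
    """Interactive side: register bytes and get a stable, versioned handle."""
    version = hashlib.sha256(payload).hexdigest()[:12]
    uri = f"artifact://{name}/{version}"
    store[uri] = payload
    return Artifact(name, version, uri)

def fetch(artifact: Artifact, store: dict) -> bytes:
    """Production side: resolve the immutable handle, never a mutable path."""
    return store[artifact.uri]

store: dict = {}
handle = publish("churn-model", b"weights-v1", store)
print(handle.uri)
print(fetch(handle, store))
```

Content-addressed versions make reruns idempotent: re-publishing identical bytes yields the same handle, which is what makes a pipeline run repeatable.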
