System Design Space

Updated: April 5, 2026 at 6:39 PM

ML platform at T-Bank: a common good, or better off without one?

medium

Analysis of an interview about the evolution of the ML platform at T-Bank: how teams moved from manual SSH workflows to platform engineering, shared data flows, and mature model operations.

A platform view of ML matters most when the challenge is not one model, but how dozens of teams build and run models at scale.

The chapter shows how data workflows, engineering experience, and standardized release practices come together into one operating model.

For interviews, it gives strong material on platform ownership, organizational design, and maturity in ML operations.

Practical value of this chapter

Platform value

See an ML platform as a product for engineers, not just a bag of tools.

Organizational design

Understand how team structure and platform practices affect model delivery speed.

Operational maturity

See how standardization reduces chaos around releasing and running ML systems.

Interview material

Get a real platform story instead of abstract theory.

Source

Yellow AI Club Talks

An interview about how T-Bank evolved its ML platform from an SSH-driven workflow into a mature platform product.


The ML platform at T-Bank is framed here as an infrastructure product that helps teams move from manual SSH-based work toward a platform model. The platform takes on compute, backups, observability, and repeatable workflows so engineers can focus on models and product value instead of constant manual operations.

The goal is not to hide complexity behind magic, but to provide understandable self-service, strong developer experience, and reproducibility for many different ML teams working at the same time.

Who participated in the interview

Host

Daniil Gavrilov

Head of a research team at T-Bank.

Guest

Mikhail Chebakov

Head of ML platform development at T-Bank.

Platform evolution

Early stage

SSH clusters and manual management

Teams worked directly on servers over SSH. That felt transparent and controllable, but it did not scale well and made experiments harder to reproduce.

First platform step

Simple orchestrator

A task-scheduling and resource-allocation layer appeared. It improved server utilization and reduced the amount of manual work.
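To make the idea of a scheduling and resource-allocation layer concrete, here is a minimal sketch. The interview does not describe T-Bank's actual orchestrator, so everything below is an assumption for illustration: the `Server` and `Task` types, the `schedule` function, and the first-fit placement policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical compute node with a pool of free GPUs."""
    name: str
    free_gpus: int


@dataclass
class Task:
    """Hypothetical training job that needs a fixed number of GPUs."""
    name: str
    gpus: int


def schedule(tasks, servers):
    """First-fit placement: put each task on the first server that has room.

    Returns (placement, pending), where placement maps task name to server
    name and pending lists the tasks that did not fit anywhere.
    """
    placement = {}
    pending = []
    for task in tasks:
        for server in servers:
            if server.free_gpus >= task.gpus:
                server.free_gpus -= task.gpus  # reserve the GPUs
                placement[task.name] = server.name
                break
        else:
            # No server had enough free GPUs; the task waits in the queue.
            pending.append(task.name)
    return placement, pending


servers = [Server("node-a", 4), Server("node-b", 8)]
tasks = [Task("train-1", 4), Task("train-2", 6), Task("train-3", 4)]
placement, pending = schedule(tasks, servers)
```

Even a policy this simple replaces "who is logged in to which server" with an explicit queue, which is where the utilization gain over manual SSH allocation would come from.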

Mature stage

ML platform as a product

The focus shifted to platform primitives for data and workflows, self-service, and standard paths for building, releasing, and operating models.

Three key domains of an ML platform

1. Engineering experience

Interactive work for an engineer with a short feedback cycle.

Fast experiments, easy environment setup, predictable tooling behavior.

2. Production pipelines

Automation of robust ML processes with an emphasis on repeatability and safe delivery.

Standard pipelines, versioned artifacts, quality checks.

3. Deployment and operations

A reliable live environment where ML systems create measurable product and business value.

Service objectives, monitoring, degradation modes, cost and capacity management.

Experiments → Pipelines → Product value

Most important function: data management

A critical element was the ability to create working folders/dataspaces accessible from anywhere in the cluster, with automatic backups.

This reduces the risk of losing experimental artifacts, simplifies the processing of non-standard data, and helps move work between compute environments.
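A dataspace with automatic backups can be pictured roughly as follows. This is a local-filesystem sketch under stated assumptions, not the platform's implementation: the `Dataspace` class, its `snapshot` and `restore` methods, and the numbered snapshot layout are hypothetical, and a real platform would sit on shared or distributed storage and trigger snapshots on a schedule.

```python
import shutil
from pathlib import Path


class Dataspace:
    """Hypothetical working folder with numbered full-copy snapshots."""

    def __init__(self, root: Path, backup_root: Path, name: str):
        self.path = root / name
        self.backup_dir = backup_root / name
        self.path.mkdir(parents=True, exist_ok=True)
        self.backup_dir.mkdir(parents=True, exist_ok=True)

    def snapshot(self) -> Path:
        """Copy the current contents into the next numbered snapshot."""
        index = len(list(self.backup_dir.iterdir()))
        target = self.backup_dir / f"snapshot-{index:04d}"
        shutil.copytree(self.path, target)
        return target

    def restore(self, snapshot: Path) -> None:
        """Replace the working folder with the contents of a snapshot."""
        shutil.rmtree(self.path)
        shutil.copytree(snapshot, self.path)
```

With something like this in place, an engineer who overwrites a checkpoint by mistake can call `restore` on the last snapshot instead of rerunning the experiment.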

Why teams resist migrating from SSH

A sense of complete control

The SSH approach feels clear and transparent: the engineer sees the environment directly and can quickly adapt familiar tools.

The hidden cost of this approach

At scale, this leads to problems with reproducibility, lost data, and the overhead of maintaining and running many manual scripts.

Platform design principles

Making the right path simple

The platform should steer users toward good defaults: reproducibility, logging, backups, and safe releases.

Making the wrong path difficult

If a scenario creates risk, such as data loss, unreproducible runs, or manual operations, the platform should make that path harder or block it.
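The two principles above can be sketched as a job specification whose defaults are the safe choices, plus a validator that blocks risky settings unless the user opts in explicitly. All names here are hypothetical and chosen for illustration: `JobSpec`, `RiskyJobError`, `validate`, the `allow_risky` flag, and the `dataspace://` prefix are assumptions, not part of any real platform API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical job description; every default is the safe option."""
    command: str
    seed: int = 0                       # fixed seed for reproducible runs
    log_to_platform: bool = True        # logs are collected by default
    backup_outputs: bool = True         # outputs are backed up by default
    output_dir: str = "dataspace://default"


class RiskyJobError(ValueError):
    """Raised when a job leaves the safe path without an explicit opt-in."""


def validate(spec: JobSpec, allow_risky: bool = False) -> list:
    """Return the list of risks; raise unless the user opted in to them."""
    problems = []
    if not spec.backup_outputs:
        problems.append("outputs are not backed up")
    if not spec.output_dir.startswith("dataspace://"):
        problems.append("output_dir is outside a managed dataspace")
    if problems and not allow_risky:
        raise RiskyJobError("; ".join(problems))
    return problems


# The right path needs no extra configuration.
validate(JobSpec(command="python train.py"))
```

The design choice is that the safe path costs one line, while the risky path requires the user to state intent through `allow_risky=True`, which also leaves a record of what was skipped.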

UX is as important as architecture

Technical flexibility alone does not make a system usable: capabilities should be easy to find and understand without reading long manuals.

How to measure the effectiveness of an ML platform

A platform matters not only when it speeds up experiments. It also has to support predictable rollouts, a short feedback loop, and predictable latency in production scenarios.

  • Basic product metrics: how many engineers and teams use the platform and keep coming back to it.
  • Regular satisfaction surveys across different ML domains.
  • Whether the platform team itself actively uses the platform, rather than only building it for other teams.
  • Joint development with product teams instead of building the platform in isolation.
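The first metric in the list, usage and return rate, can be computed from a simple usage log. The sketch below assumes a hypothetical event format of `(user, team, period)` tuples and a hypothetical `adoption_metrics` function; the interview does not specify how T-Bank collects or defines these numbers.

```python
from collections import defaultdict


def adoption_metrics(events, period, previous_period):
    """Compute active users, active teams, and period-over-period retention.

    events: iterable of (user, team, period_label) tuples, one per usage
    event. Retention is the share of the previous period's users who were
    active again in the current period, or None if there were none.
    """
    users_by_period = defaultdict(set)
    teams_by_period = defaultdict(set)
    for user, team, label in events:
        users_by_period[label].add(user)
        teams_by_period[label].add(team)

    current = users_by_period[period]
    previous = users_by_period[previous_period]
    retention = len(current & previous) / len(previous) if previous else None
    return {
        "active_users": len(current),
        "active_teams": len(teams_by_period[period]),
        "retention": retention,
    }


events = [
    ("ann", "nlp", "2026-02"),
    ("bob", "cv", "2026-02"),
    ("ann", "nlp", "2026-03"),
    ("cat", "nlp", "2026-03"),
]
metrics = adoption_metrics(events, "2026-03", "2026-02")
# metrics == {"active_users": 2, "active_teams": 1, "retention": 0.5}
```

Numbers like these only cover the first bullet; surveys and joint development still need qualitative input from the teams.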

Variety of ML domains

The platform has to support domains with very different requirements for data, compute, latency, and reproducibility. One universal abstraction for every domain does not work here.

Research workloads
Recommendation systems
Computer vision
Image generation
LLM systems
Applied NLP
Fraud detection
Risk scoring
Speech recognition
Speech synthesis

Practical checklist

  • Separate the engineer's interactive workflow from production pipelines, but connect them with one shared artifact contract.
  • Design cross-cluster portability and backup of working data from the start.
  • Define a default path for training, inference, and monitoring, then treat non-standard cases as extensions.
  • Test the usability of new features with real teams before broad rollout to reduce resistance to leaving the SSH model behind.
  • Evaluate the platform not only by reliability, but also by delivery speed and reproducibility.
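One way to read the "shared artifact contract" from the first checklist item is a small manifest that both the interactive workflow and the production pipeline produce and verify. The sketch below is an assumption about what such a contract could contain; `ArtifactManifest`, `build_manifest`, `verify`, `to_json`, and the chosen fields are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArtifactManifest:
    """Hypothetical contract describing one model artifact."""
    name: str
    version: str
    sha256: str          # content hash of the artifact bytes
    code_revision: str   # revision of the code that produced it
    params: dict         # training parameters needed to reproduce it


def build_manifest(name, version, payload: bytes, code_revision, params):
    """Create a manifest for an artifact, hashing its content."""
    digest = hashlib.sha256(payload).hexdigest()
    return ArtifactManifest(name, version, digest, code_revision, dict(params))


def verify(manifest: ArtifactManifest, payload: bytes) -> bool:
    """Check that the payload matches the hash recorded in the manifest."""
    return hashlib.sha256(payload).hexdigest() == manifest.sha256


def to_json(manifest: ArtifactManifest) -> str:
    """Serialize the manifest so notebooks and pipelines can exchange it."""
    return json.dumps(asdict(manifest), sort_keys=True)


# An experiment writes the manifest; a pipeline verifies it before release.
manifest = build_manifest("scoring", "1.0.0", b"weights", "abc123", {"lr": 0.1})
assert verify(manifest, b"weights")
```

Because both sides depend only on the manifest format, the interactive tooling and the production pipeline can evolve separately while still agreeing on what a releasable artifact is.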
