System Design Space

Updated: February 21, 2026 at 11:59 PM

The ML platform at T-Bank: a common good, or better off without it?


Analysis of an interview about the development of the ML platform at T-Bank: from SSH-managed clusters to platform engineering, data workflows, and production operation.

Source

Yellow AI Club Talks

Interview about the philosophy, evolution and practical compromises of building an ML platform at T-Bank.


The ML platform at T-Bank is treated as an infrastructure product that should be almost invisible in teams' daily work, yet is critical for scaling ML to production. Key idea: encapsulate operational complexity (resources, resiliency, monitoring, reproducibility) so that engineers can focus on models and product value.

Who participated in the interview

Host

Daniil Gavrilov

Head of the Research team (T-Bank).

Guest

Mikhail Chebakov

Head of ML platform development (T-Bank).

Platform evolution

Early stage

SSH clusters and manual management

Teams worked directly on the servers via SSH. This gave a sense of control, but scaled poorly and made experiments hard to reproduce.

First platform step

Simple orchestrator

A task-scheduling and resource-allocation layer appeared, which increased server utilization and reduced the share of manual operations.
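At its core, such an orchestration layer matches a queue of jobs against free capacity. The sketch below is purely illustrative (the names, the GPU-only resource model, and the first-fit policy are assumptions, not T-Bank's actual scheduler):

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    gpus_total: int
    gpus_used: int = 0

    @property
    def gpus_free(self) -> int:
        return self.gpus_total - self.gpus_used

@dataclass
class Job:
    name: str
    gpus: int

def schedule(jobs, servers):
    """First-fit assignment of jobs to servers; unplaced jobs stay queued."""
    placements, queued = {}, []
    for job in sorted(jobs, key=lambda j: -j.gpus):  # place biggest jobs first
        target = next((s for s in servers if s.gpus_free >= job.gpus), None)
        if target is None:
            queued.append(job.name)
        else:
            target.gpus_used += job.gpus
            placements[job.name] = target.name
    return placements, queued

servers = [Server("gpu-a", 8), Server("gpu-b", 4)]
jobs = [Job("train-llm", 8), Job("finetune", 4), Job("eval", 2)]
placements, queued = schedule(jobs, servers)
print(placements)  # {'train-llm': 'gpu-a', 'finetune': 'gpu-b'}
print(queued)      # ['eval'] — waits until capacity frees up
```

Even this toy version shows the utilization win over manual SSH assignment: jobs queue instead of idling on a half-full server. Real orchestrators add priorities, preemption, and gang scheduling on top.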

Mature stage

ML platform as a product

The focus shifted to data and workflow primitives, self-service, and standardized paths for developing, deploying, and operating models.

Three Key Domains of an ML Platform

1. Engineering experience

Interactive work of one engineer with a minimal feedback cycle.

Fast experiments, convenient launch of environments, predictable UX.

2. Production pipelines

Automated, robust ML processes with a focus on reproducibility.

Standardized pipelines, versioning of artifacts, quality control.
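One concrete way to implement artifact versioning is content addressing: an artifact's identity is the hash of its bytes, so identical content always resolves to the same version and duplicates deduplicate for free. A minimal sketch (the store layout and function names are assumptions for illustration, not T-Bank's system):

```python
import hashlib
import json
import pathlib
import tempfile

STORE = pathlib.Path(tempfile.mkdtemp())  # stand-in for a shared artifact store

def put_artifact(data: bytes, metadata: dict) -> str:
    """Store bytes under their SHA-256 digest; identical content dedups."""
    digest = hashlib.sha256(data).hexdigest()
    path = STORE / digest[:2] / digest          # shard by digest prefix
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    # Keep provenance (which pipeline/run produced this) next to the blob.
    (path.parent / f"{digest}.meta.json").write_text(json.dumps(metadata))
    return digest

def get_artifact(digest: str) -> bytes:
    return (STORE / digest[:2] / digest).read_bytes()

ref = put_artifact(b"model-weights-v1", {"pipeline": "train", "run": 42})
assert get_artifact(ref) == b"model-weights-v1"
```

Pipelines then pass digests between stages instead of mutable paths, which is what makes a run repeatable: the same digest always yields the same bytes.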

3. Deployment and Operation

A reliable runtime environment where ML solutions deliver measurable business value.

SLOs, monitoring, graceful degradation, cost and capacity management.
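The SLO part reduces to error-budget arithmetic: a target availability over a window implies a fixed allowance of unavailability, and operations spend against it. A small illustration (the 99.9% target and downtime figures are made up):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """How much budget is left after the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 30.0), 1))  # 13.2 minutes left
```

When the remaining budget approaches zero, the platform team shifts from feature work to reliability work, which is the usual contract behind "cost and capacity management".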

Experiments → Pipelines → Production value

Most important function: data management

A critical element was the ability to create working folders/dataspaces accessible from anywhere in the cluster, with automatic backups.

This reduces the risk of losing experimental artifacts, simplifies handling of non-standard data, and makes it easier to move work between compute environments.
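A hypothetical sketch of such a dataspace primitive: a named folder that resolves to the same path on every node, plus a snapshot hook for backups. Everything here (the class, the path layout, the copy-based snapshot) is an illustrative assumption, not the actual implementation:

```python
import pathlib
import shutil
import tempfile
import time

class Dataspace:
    """Illustrative working folder with simple copy-based snapshots."""

    def __init__(self, root: pathlib.Path, name: str):
        self.path = root / "dataspaces" / name   # same path from any node
        self.backups = root / "backups" / name
        self.path.mkdir(parents=True, exist_ok=True)
        self.backups.mkdir(parents=True, exist_ok=True)

    def snapshot(self) -> pathlib.Path:
        """Copy the whole dataspace into a timestamped backup directory."""
        dest = self.backups / time.strftime("%Y%m%dT%H%M%S")
        shutil.copytree(self.path, dest)
        return dest

ws = Dataspace(pathlib.Path(tempfile.mkdtemp()), "experiment-7")
(ws.path / "metrics.json").write_text('{"auc": 0.91}')
snap = ws.snapshot()
assert (snap / "metrics.json").read_text() == '{"auc": 0.91}'
```

A production version would back the path with a distributed filesystem and use incremental snapshots, but the contract is the same: one stable path, automatic backups behind it.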

Why teams resist migrating from SSH

Feeling of complete control

The SSH approach is clear and transparent: the engineer sees the environment directly and can quickly adopt open-source tools.

The hidden cost of this approach

At scale, this leads to reproducibility problems, data loss, and the operational overhead of maintaining many manual scripts.

Platform Design Principles

Making the right path simple

The platform should guide the user to good default practices: reproducibility, logging, backup and secure deployments.

Making the wrong path difficult

If a scenario creates risks (data loss, non-reproducible runs, manual toil), the platform should make that path harder or block it outright.
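In practice, "making the wrong path difficult" often takes the form of admission checks that reject risky job specs before they run. An illustrative guardrail (the spec fields and rules are invented for this sketch):

```python
def validate_job(spec: dict) -> list[str]:
    """Return a list of violations; an empty list means the job may run."""
    errors = []
    if not spec.get("backup_enabled", False):
        errors.append("backups are disabled: risk of losing artifacts")
    if not spec.get("image"):  # no pinned image => unreproducible run
        errors.append("no container image pinned: run is not reproducible")
    if spec.get("run_as_root"):
        errors.append("running as root is blocked in production")
    return errors

spec = {"image": "ml/train:1.4", "backup_enabled": False}
print(validate_job(spec))  # ['backups are disabled: risk of losing artifacts']
```

The complement is equally important: the golden-path templates should produce specs that pass these checks by default, so that the safe route is also the path of least resistance.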

UX is as important as architecture

A technically flexible solution is not automatically user-friendly: features must be discoverable and understandable without reading lengthy documentation.

How to measure the effectiveness of an ML platform

  • Basic adoption metrics: number of users, teams, retention.
  • Periodic surveys and measurements of satisfaction in various ML areas.
  • Dogfooding: using the platform by the platform team itself.
  • Co-development with product teams instead of platform isolation.
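A common concrete form of the adoption metrics above is cohort retention: of the teams active in one period, what fraction is still active in the next. A minimal sketch (the team sets are fabricated for illustration):

```python
def retention(active_prev: set[str], active_now: set[str]) -> float:
    """Share of previously active teams that are still active now."""
    if not active_prev:
        return 0.0
    return len(active_prev & active_now) / len(active_prev)

# Hypothetical monthly activity snapshots.
march = {"recsys", "cv", "nlp", "antifraud"}
april = {"recsys", "nlp", "antifraud", "speech"}
print(retention(march, april))  # 0.75
```

Retention complements raw user counts because it separates genuine stickiness from one-off trials; a platform with growing sign-ups but falling retention is failing its users, which surveys and dogfooding then help explain.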

Variety of ML directions

The platform simultaneously supports areas with different requirements for data, hardware, latency and reproducibility. A universal abstraction without domain-awareness does not work here.

R&D
RecSys
CV
Image generation
LLM
Applied NLP
Antifraud
Risk scoring
Speech recognition
Speech synthesis

Practical checklist

  • Separate the interactive DevEx environment from the production pipeline, but connect them with a single artifact contract.
  • Design for cross-cluster portability and backup of production data right away.
  • Establish a golden path for standard tasks (training, inference, monitoring), and design non-standard scenarios as extensions.
  • Test the UX of new features on real teams before mass rollout to reduce resistance to migration from the SSH approach.
  • Evaluate the platform not only by uptime, but also by the speed of ML delivery and reproducibility of results.



© 2026 Alexander Polomodov