System Design Space

Updated: February 21, 2026 at 11:59 PM

Evolution of SRE: implementing an AI assistant at T-Bank


Analysis of Ivan Yurchenko's talk on the platformization of incident management, the SRE AI assistant, LogAnalyzer, and response quality metrics.

Source

Talk by Ivan Yurchenko

Platform Engineering Night: the evolution of incident management and the implementation of an AI assistant in SRE processes.


This talk traces the evolution of SRE at T-Bank through the transition from platformization and automation to robotization with the help of AI. The talk by Ivan Yurchenko (FineDog Growth) was published on April 28, 2025, and focuses on introducing assistants into the full incident lifecycle: from detection to post-incident analysis.

Context of the talk

Speaker

Ivan Yurchenko

Head of FineDog Growth at T-Bank.

Conference

Platform Engineering Night

Talk on the implementation of AI assistants in the practice of SRE teams.

Talk date

April 28, 2025

Date the talk was published on YouTube.

The evolution of incident management tools

1. Platformization

Combining disparate tools into a single incident management framework with a common model of context and responsibility.

2. Automation

Speeding up routine actions: collecting facts, routing, preparing artifacts for diagnostics and post-analysis.

3. Robotization

AI assistants move from offering hints to supporting decisions and detecting anomalies before an incident reaches its critical phase.

Incident lifecycle and the role of AI

Detection

Detecting deviations and collecting primary signals from the observability stack.

Clustering and prioritization of signals, noise filtering.
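
The talk summary does not describe how signals are clustered or filtered, so the following is only a minimal sketch of the idea under stated assumptions. The `Signal` fields, the `SEVERITY_RANK` table, and the rule "a single non-critical signal is noise" are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity ranking: a lower number means more urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Signal:
    service: str
    check: str
    severity: str


def cluster_and_prioritize(signals, min_count=2):
    """Group signals by (service, check), drop noisy singletons, sort by urgency."""
    groups = defaultdict(list)
    for signal in signals:
        groups[(signal.service, signal.check)].append(signal)

    clusters = []
    for key, members in groups.items():
        worst = min(SEVERITY_RANK[m.severity] for m in members)
        # Noise filter (assumption): a lone non-critical signal is discarded.
        if len(members) < min_count and worst > 0:
            continue
        clusters.append({"key": key, "count": len(members), "rank": worst})

    # Most severe clusters first, then the largest ones.
    clusters.sort(key=lambda c: (c["rank"], -c["count"]))
    return clusters
```

In a real observability pipeline the grouping key would likely be richer (labels, time windows, topology), but the shape stays the same: group, filter, rank.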

Incident response

Diagnostics, context collection, team synchronization and hypothesis selection.

Runbook suggestions, search for similar cases, assistance with communication.

Post-incident analysis

Recording causes, solutions, preventive actions and updating the knowledge base.

Automatic generation of postmortem drafts and structuring of conclusions.

AI projects in incident management

Summarizer

The system aggregates events, communications and incident facts, then generates draft post-analyses to speed up the RCA process.

  • Reduces manual toil when preparing post-analyses.
  • Helps identify recurring incident patterns.
  • Increases the consistency of the structure of postmortem documents.

LogAnalyzer

The tool analyzes logs, searches for related incidents and visualizes anomalies to speed up diagnosis.

  • Logs are downloaded from Sage every 5 minutes.
  • Next, preprocessing and text segmentation are performed.
  • TF-IDF and transformers are used for vectorization.
  • Anomalies are displayed in 3D space.
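
The preprocessing and TF-IDF steps above can be sketched with the standard library alone. This is not LogAnalyzer's implementation: the number masking in `tokenize`, the smoothed IDF formula, and the nearest-neighbour anomaly score are assumptions made for illustration, and the transformer embeddings and 3D visualization mentioned in the talk are left out.

```python
import math
import re
from collections import Counter


def tokenize(line):
    # Preprocessing (assumption): lowercase and mask numbers so that
    # "took 12 ms" and "took 15 ms" produce the same tokens.
    masked = re.sub(r"\d+", "<num>", line.lower())
    return re.findall(r"[a-z]+|<num>", masked)


def tfidf_vectors(lines):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per log line."""
    docs = [Counter(tokenize(line)) for line in lines]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct token once.
    df = Counter(tok for doc in docs for tok in doc)
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vectors = []
    for doc in docs:
        vec = {tok: tf * idf[tok] for tok, tf in doc.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({tok: v / norm for tok, v in vec.items()})
    return vectors


def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())


def anomaly_scores(lines):
    """Score each line by 1 minus its similarity to the closest other line."""
    vecs = tfidf_vectors(lines)
    scores = []
    for i, vec in enumerate(vecs):
        sims = [cosine(vec, other) for j, other in enumerate(vecs) if j != i]
        scores.append(1.0 - max(sims, default=0.0))
    return scores
```

A line that resembles nothing else in the batch gets a score near 1, while repeated routine messages score near 0. The pairwise comparison is quadratic, so a batch pulled every 5 minutes would need sampling or an index at production volumes.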

SRE assistant: key scenarios

  • Integration with the Time corporate messenger, so that scenarios can be started directly from on-call channels.
  • Working with incidents: context, status, draft post-analysis and follow-up actions.
  • Searching and retrieving data from internal knowledge bases using the RAG approach.
  • Managing on-call duty and operational requests without leaving the messenger.
  • Orchestration of requests to bots and LLM agents in one user interface.
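
The retrieval half of the RAG scenario can be illustrated with a toy example. The talk summary does not say which retriever, embedding model, or LLM the assistant uses, so everything here is an assumption: documents are ranked by plain token overlap (Jaccard similarity) instead of embeddings, the document titles are invented, and the LLM call itself is omitted.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Rank knowledge-base entries by Jaccard overlap with the query."""
    q = tokens(query)
    scored = []
    for title, body in documents.items():
        d = tokens(title + " " + body)
        union = q | d
        overlap = len(q & d) / len(union) if union else 0.0
        scored.append((overlap, title))
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for score, title in scored[:top_k] if score > 0]


def build_prompt(query, documents, top_k=2):
    """Assemble the context-grounded prompt that would be sent to an LLM."""
    titles = retrieve(query, documents, top_k)
    context = "\n".join(f"[{t}] {documents[t]}" for t in titles)
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"
```

Restricting the prompt to retrieved passages is what lets answers be traced back to knowledge-base entries, which matters when answer quality is later labeled by hand.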

Quality and efficiency metrics

SRE assistant

Precision

0.54

Recall

0.43

Estimated from manual labeling of approximately 600 queries.

LogAnalyzer

Precision

0.64

Recall

0.85

High recall is important to reduce the risk of missing anomalies.
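
For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both from confusion counts and derives an F1 score from a precision/recall pair. The F1 values are not reported in the talk; they are only what follows arithmetically from the published numbers, and the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Derived from the reported figures (not stated in the talk itself).
assistant_f1 = round(f1(0.54, 0.43), 3)
log_analyzer_f1 = round(f1(0.64, 0.85), 3)
```

For an anomaly detector, a false negative is a missed anomaly, which is why LogAnalyzer's recall of 0.85 can matter more than its lower precision of 0.64.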

Development prospects

  • New iterations of the SRE assistant with improved response quality and scenario coverage.
  • Improved metrics: separate tracking of hallucinations, completeness and period errors.
  • Strengthening the anomaly detection loop and tighter integration with incident workflow.
  • Continued exchange of practices with the professional platform/SRE engineering community.

Practical checklist

  • Start AI in incident management with narrow high-ROI scenarios: summarization, context search, triage.
  • Define the quality contract in advance: precision/recall, completeness of response, acceptable level of hallucinations.
  • Integrate the assistant into existing on-call channels (messenger, tickets, runbooks), rather than into an isolated UI.
  • Design observability for the assistant itself: which prompts are used and why a given suggestion is accepted.
  • Use co-development with SRE teams so that golden paths are useful in real incidents, not just in demos.
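
One way to make the "quality contract" item concrete is to encode the thresholds and check each evaluation run against them. This is a sketch under assumptions: the `QualityContract` fields and every threshold value below are hypothetical, and the talk does not prescribe any particular numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityContract:
    min_precision: float
    min_recall: float
    max_hallucination_rate: float


def violations(contract, measured):
    """Return the names of the metrics that break the contract."""
    failed = []
    if measured["precision"] < contract.min_precision:
        failed.append("precision")
    if measured["recall"] < contract.min_recall:
        failed.append("recall")
    if measured["hallucination_rate"] > contract.max_hallucination_rate:
        failed.append("hallucination_rate")
    return failed


# Hypothetical thresholds, agreed on before rollout.
contract = QualityContract(
    min_precision=0.5, min_recall=0.5, max_hallucination_rate=0.1
)
```

A check like this can run after each labeling round, so a regression in any one metric blocks a release instead of being noticed during an incident.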



© 2026 Alexander Polomodov