Evolution of SRE: implementing an AI assistant at T-Bank

AI automation in SRE is useful only up to the point where it starts creating false confidence.

The T-Bank example shows how an incident-management platform, an AI assistant, LogAnalyzer, and quality metrics form a new operating loop where routine analysis moves to the machine, but the demands on trust, explainability, and escalation get much stricter.

In architecture discussions, the chapter gives you room to talk about autonomy boundaries, recommendation quality, failure of the assistant itself, and the real cost of bringing AI into the on-call workflow.

Practical value of this chapter

Design in practice

Turn the talk's guidance on the evolution of SRE automation and AI assistants in operations into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

State the trade-offs of SRE automation and AI assistants in operations explicitly: release speed, automation level, observability cost, and operational complexity.

Source

Talk by Ivan Yurchenko

Platform Engineering Night: the evolution of incident management and the introduction of an AI assistant into SRE processes.

This talk traces the evolution of SRE at T-Bank from platformization and automation to robotization with the help of AI. The talk by Ivan Yurchenko (FineDog Growth) was published on April 28, 2025 and focuses on introducing assistants into the full incident lifecycle, from detection to post-incident analysis.

Context of the talk

Speaker

Ivan Yurchenko

Head of FineDog Growth at T-Bank.

Conference

Platform Engineering Night

A talk on introducing AI assistants into the practice of SRE teams.

Publication date

April 28, 2025

The date the talk was published on YouTube.

The evolution of incident management tools

1. Platformization

Combining disparate tools into a single incident management framework with a common model of context and responsibility.

2. Automation

Speeding up routine actions: collecting facts, routing, preparing artifacts for diagnostics and post-analysis.

3. Robotization

AI assistants move from giving hints to supporting decisions and detecting anomalies before an incident reaches its critical phase.

Incident lifecycle and the role of AI

Detection

Detecting deviations and collecting initial signals from the observability stack.

Clustering and prioritization of signals, noise filtering.

Incident response

Diagnostics, context collection, team synchronization, and hypothesis selection.

Runbook suggestions, search for similar cases, and help with communication.

Post-incident analysis

Recording causes, fixes, and preventive actions, and updating the knowledge base.

Automatic generation of postmortem drafts and structuring of conclusions.
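The detection-phase role described above (clustering, prioritization, noise filtering) can be illustrated with a minimal sketch. It is not the mechanism from the talk: the alert fields (`service`, `name`, `severity`), the severity ranking, and the `min_info_count` noise threshold are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical severity ordering: lower rank means more urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def cluster_alerts(alerts, min_info_count=2):
    """Group alerts by (service, name), drop rare info-only noise, sort by urgency."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["service"], alert["name"])].append(alert)

    clusters = []
    for (service, name), members in groups.items():
        # The cluster inherits the most severe level seen among its members.
        worst = min(members, key=lambda a: SEVERITY_RANK[a["severity"]])["severity"]
        # Noise filter (assumption): info-only clusters that fire rarely are dropped.
        if worst == "info" and len(members) < min_info_count:
            continue
        clusters.append(
            {"service": service, "name": name, "count": len(members), "severity": worst}
        )

    # Most severe first; within the same severity, larger clusters first.
    clusters.sort(key=lambda c: (SEVERITY_RANK[c["severity"]], -c["count"]))
    return clusters


sample_alerts = [
    {"service": "payments", "name": "HighLatency", "severity": "warning"},
    {"service": "payments", "name": "HighLatency", "severity": "critical"},
    {"service": "auth", "name": "DiskUsage", "severity": "info"},
    {"service": "payments", "name": "HighLatency", "severity": "warning"},
    {"service": "search", "name": "ErrorRate", "severity": "warning"},
]
for cluster in cluster_alerts(sample_alerts):
    print(cluster)
```

Grouping by an exact key is the simplest possible form of clustering; a production system would more likely group by similarity of alert text or topology, which this sketch does not attempt.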

AI projects in incident management

Summarizer

The system aggregates events, communications and incident facts, then generates draft post-analyses to speed up the RCA process.

Reduces manual toil when preparing post-analyses.
Helps identify recurring incident patterns.
Increases the consistency of the structure of postmortem documents.

LogAnalyzer

The tool analyzes logs, searches for related incidents and visualizes anomalies to speed up diagnosis.

Logs are pulled from Sage every 5 minutes.
The text is then preprocessed and segmented.
TF-IDF and transformer models are used for vectorization.
Anomalies are visualized in 3D space.
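The vectorization step can be sketched in a few lines. The example below covers only the TF-IDF part, in pure Python, and scores each log line by its distance to the nearest other line in the same window. The scoring rule, the number masking in `tokenize`, and all function names are assumptions for illustration; the talk does not describe how LogAnalyzer scores anomalies, and the transformer embeddings and 3D projection are not shown.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase and mask digits so lines differing only in ids or timings match."""
    masked = re.sub(r"\d+", "<num>", line.lower())
    return re.findall(r"[a-z_<>]+", masked)


def tfidf_vectors(lines):
    """Return one sparse TF-IDF vector (dict token -> weight) per log line."""
    docs = [Counter(tokenize(line)) for line in lines]
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1
        vectors.append(
            {
                # Smoothed IDF, so a token present in every line keeps a nonzero weight.
                tok: (cnt / total) * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
                for tok, cnt in doc.items()
            }
        )
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def anomaly_scores(lines):
    """Score = 1 - similarity to the most similar other line (assumed rule)."""
    vectors = tfidf_vectors(lines)
    scores = []
    for i, vec in enumerate(vectors):
        nearest = max(
            (cosine(vec, other) for j, other in enumerate(vectors) if j != i),
            default=0.0,
        )
        scores.append(max(0.0, 1.0 - nearest))
    return scores


window = [
    "GET /api/orders 200 in 12ms",
    "GET /api/orders 200 in 15ms",
    "GET /api/orders 200 in 11ms",
    "OutOfMemoryError: heap space exhausted in worker 7",
]
for line, score in zip(window, anomaly_scores(window)):
    print(f"{score:.2f}  {line}")
```

In this window the three request lines collapse to the same token sequence after number masking, so they score near zero, while the out-of-memory line shares only a couple of common tokens with them and stands out.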

SRE assistant: key scenarios

Integration with the Time corporate messenger, so scenarios can be started directly from on-call channels.
Working with incidents: context, status, draft post-analysis and follow-up actions.
Searching and retrieving data from internal knowledge bases using the RAG approach.
Managing on-call duty and operational requests without leaving the messenger.
Orchestration of requests to bots and LLM agents in one user interface.
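The knowledge-base scenario can be illustrated with a minimal retrieval sketch in the spirit of RAG: pick the most relevant documents for a question and place them into the prompt as context. Everything here is an assumption for illustration: the document ids, the token-overlap (Jaccard) ranking standing in for a real embedding search, and the prompt wording. No LLM call is made.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return up to k document ids ranked by Jaccard overlap with the query."""
    query_tokens = tokens(query)
    scored = []
    for doc_id, text in documents.items():
        doc_tokens = tokens(text)
        union = query_tokens | doc_tokens
        score = len(query_tokens & doc_tokens) / len(union) if union else 0.0
        scored.append((score, doc_id))
    # Highest score first; ties are broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Assemble a prompt that restricts the answer to the retrieved context."""
    context = "\n".join(
        f"[{doc_id}] {documents[doc_id]}" for doc_id in retrieve(query, documents, k)
    )
    return (
        "Answer using only the context below and cite source ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge-base entries.
knowledge_base = {
    "runbook-db": "Restart the postgres replica when replication lag exceeds threshold",
    "runbook-cache": "Flush redis cache after deploy",
    "oncall": "On-call schedule and escalation contacts",
}
print(build_prompt("postgres replication lag is growing", knowledge_base, k=1))
```

Asking the model to cite source ids is one simple way to make answers checkable by the engineer on duty, which matters when the assistant's answers are later labeled for precision and recall.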

Quality and efficiency metrics

SRE assistant

Precision

0.54

Recall

0.43

Estimated from manual labeling of approximately 600 queries.

LogAnalyzer

Precision

0.64

Recall

0.85

High recall is important to reduce the risk of missing anomalies.
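To make the reported numbers easier to compare, the sketch below shows how precision and recall follow from labeled counts and derives an F1 score for each tool. The F1 values are computed here from the reported precision and recall; they were not reported in the talk. The counts in the `precision_recall` example are hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, only to show the calculation.
print(precision_recall(true_positives=8, false_positives=2, false_negatives=8))

# Precision and recall as reported in the talk; F1 is derived here.
reported = {
    "SRE assistant": (0.54, 0.43),
    "LogAnalyzer": (0.64, 0.85),
}
for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
# prints "SRE assistant: F1 = 0.479" and "LogAnalyzer: F1 = 0.730"
```

The two profiles differ in kind: LogAnalyzer trades precision for recall, which suits anomaly detection where a missed anomaly is costly, while the assistant's figures are low on both axes.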

Development prospects

New iterations of the SRE assistant with improved response quality and scenario coverage.
Improved metrics: separate tracking of hallucinations, completeness, and period errors.
Strengthening the anomaly detection loop and tighter integration with incident workflow.
Continued exchange of practices with the professional platform/SRE engineering community.

Practical checklist

Start AI in incident management with narrow high-ROI scenarios: summarization, context search, triage.
Define the quality contract in advance: precision/recall, completeness of response, acceptable level of hallucinations.
Integrate the assistant into existing on-call channels (messenger, tickets, runbooks), rather than into an isolated UI.
Design observability for the assistant itself: which prompts are used and why a given suggestion is accepted.
Use co-development with SRE teams so that golden paths are useful in real incidents, not just in demos.
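Two checklist items, the quality contract and observability for the assistant, can be sketched together. This is an illustrative sketch, not a description of T-Bank's tooling: the `QualityContract` fields, the threshold values, the hallucination rate, and the audit record layout are all assumptions. Only the precision and recall values come from the metrics reported above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityContract:
    """Thresholds agreed on before rollout (all values are hypothetical)."""
    min_precision: float
    min_recall: float
    max_hallucination_rate: float


def contract_violations(contract, measured):
    """Return a human-readable list of violated thresholds (empty if none)."""
    problems = []
    if measured["precision"] < contract.min_precision:
        problems.append(
            f"precision {measured['precision']:.2f} < {contract.min_precision:.2f}"
        )
    if measured["recall"] < contract.min_recall:
        problems.append(f"recall {measured['recall']:.2f} < {contract.min_recall:.2f}")
    if measured["hallucination_rate"] > contract.max_hallucination_rate:
        problems.append(
            f"hallucination rate {measured['hallucination_rate']:.2f} "
            f"> {contract.max_hallucination_rate:.2f}"
        )
    return problems


def audit_record(prompt_id, suggestion, accepted, reason):
    """Serialize one assistant suggestion and the engineer's decision as a JSON line."""
    return json.dumps(
        {
            "prompt_id": prompt_id,
            "suggestion": suggestion,
            "accepted": accepted,
            "reason": reason,
        },
        sort_keys=True,
    )


contract = QualityContract(min_precision=0.60, min_recall=0.50, max_hallucination_rate=0.05)
# Precision and recall are the reported SRE assistant values;
# the hallucination rate is a placeholder, not a measured figure.
measured = {"precision": 0.54, "recall": 0.43, "hallucination_rate": 0.02}
for problem in contract_violations(contract, measured):
    print("contract violated:", problem)

print(audit_record("triage-v3", "restart replica", True, "matched runbook step"))
```

Keeping the contract as data rather than prose makes it possible to check it automatically on every relabeling run, and the audit line gives later reviews the "why was this accepted" trail that the checklist asks for.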

References

Related chapters

Why do we need reliability and SRE? - The foundation of SLO, incident response and reliability practices.
Observability & Monitoring Design - Signals, alerts and runbooks for production incidents.
Technoshow “Dropped”: episode 1 - Practical incident case in the T-Bank data platform.
ML platform in T-Bank: the common good or better not needed - Platform engineering compromises and DevEx for ML areas.
AI in SDLC: the path from assistants to agents by Alexander Polomodov - Context for the evolution of AI tools in engineering processes.