Source
Talk by Ivan Yurchenko
Platform Engineering Night: evolution of incident management and implementation of AI assistant in SRE processes.
This talk traces the evolution of SRE at T-Bank, from platformization and automation to robotization with the help of AI. The talk by Ivan Yurchenko (FineDog Growth), published on April 28, 2025, focuses on introducing assistants across the full incident lifecycle, from detection to post-analysis.
Context of the speech
Speaker
Ivan Yurchenko
Head of FineDog Growth at T-Bank.
Conference
Platform Engineering Night
Talk on adopting AI assistants in the practice of SRE teams.
Talk date
April 28, 2025
Date the talk was published on YouTube.
The evolution of incident management tools
1. Platformization
Combining disparate tools into a single incident management framework with a common model of context and responsibility.
2. Automation
Speeding up routine actions: collecting facts, routing, preparing artifacts for diagnostics and post-analysis.
3. Robotization
AI assistants move from giving hints to supporting decisions and detecting anomalies before an incident reaches its critical phase.
Incident lifecycle and the role of AI
Detection
Detecting deviations and collecting primary signals from the observability stack.
Clustering and prioritization of signals, noise filtering.
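The clustering, prioritization and noise-filtering step can be illustrated with a minimal sketch. The talk does not describe the actual algorithm, so everything here is an assumption: the `Signal` fields, the 5-minute grouping window, the 1..5 severity scale and the noise rule (a lone, low-severity signal is dropped) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    service: str
    check: str
    severity: int     # hypothetical scale: 1 (low) .. 5 (critical)
    timestamp: float  # seconds


def cluster_signals(signals, window_s=300.0):
    """Group signals by (service, check); a gap longer than window_s starts a new cluster."""
    by_key = defaultdict(list)
    for s in sorted(signals, key=lambda s: s.timestamp):
        by_key[(s.service, s.check)].append(s)

    clusters = []
    for group in by_key.values():
        current = [group[0]]
        for s in group[1:]:
            if s.timestamp - current[-1].timestamp <= window_s:
                current.append(s)
            else:
                clusters.append(current)
                current = [s]
        clusters.append(current)
    return clusters


def prioritize(clusters, min_size=2, min_severity=3):
    """Drop clusters that look like noise, then rank the rest by severity and size."""
    kept = [
        c for c in clusters
        if len(c) >= min_size or max(s.severity for s in c) >= min_severity
    ]
    return sorted(
        kept,
        key=lambda c: (max(s.severity for s in c), len(c)),
        reverse=True,
    )
```

Grouping by a fixed key is the simplest possible choice; a production system would more likely cluster on richer context such as topology or alert text.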
Dealing with an incident
Diagnostics, context gathering, team synchronization and hypothesis selection.
Runbook suggestions, search for similar cases, assistance with communication.
Post-analysis
Recording root causes, fixes and preventive actions, and updating the knowledge base.
Automatic generation of postmortem drafts and structuring of conclusions.
AI projects in incident management
Summarizer
The system aggregates events, communications and incident facts, then generates draft post-analyses to speed up the RCA process.
- Reduces manual toil when preparing post-analyses.
- Helps identify recurring incident patterns.
- Increases the consistency of the structure of postmortem documents.
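A minimal sketch of the aggregation half of such a summarizer, assuming events arrive from several sources with ISO-8601 timestamps in the same timezone and format. The `Event` fields, the source names and the section list are hypothetical, and the model call that would fill in the narrative sections is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Event:
    ts: str      # ISO-8601 timestamp; sortable as text if all share one format
    source: str  # hypothetical source names, e.g. "alert", "chat", "deploy"
    text: str


# A fixed section list keeps every draft structurally consistent.
SECTIONS = ("Summary", "Timeline", "Root cause", "Resolution", "Action items")


def build_draft(title, events):
    """Merge events into one timeline and render a fixed-structure draft."""
    timeline = sorted(events, key=lambda e: e.ts)
    lines = [f"Postmortem draft: {title}", ""]
    for section in SECTIONS:
        lines.append(section)
        if section == "Timeline":
            lines.extend(f"- {e.ts} [{e.source}] {e.text}" for e in timeline)
        else:
            lines.append("- TODO: to be filled in by the model or the on-call engineer")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Keeping the skeleton deterministic and letting the model fill only the free-text sections is one way to get the consistent document structure mentioned above.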
LogAnalyzer
The tool analyzes logs, searches for related incidents and visualizes anomalies to speed up diagnosis.
- Logs are downloaded from Sage every 5 minutes.
- Next, preprocessing and text segmentation are performed.
- TF-IDF and transformers are used for vectorization.
- Anomalies are displayed in 3D space.
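The vectorization step can be sketched with a hand-rolled TF-IDF in pure Python. This covers only the TF-IDF half of the pipeline; the transformer embeddings and the 3D projection are omitted. The anomaly score used here (cosine distance from the mean vector) is an assumption for illustration, not the method described in the talk, and the digit masking in `tokenize` is likewise a hypothetical preprocessing choice.

```python
import math
import re
from collections import Counter


def tokenize(line):
    # Mask digits so that "took 12ms" and "took 15ms" produce the same tokens.
    masked = re.sub(r"\d+", "<num>", line.lower())
    return re.findall(r"[a-z_]+|<num>", masked)


def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF avoids division by zero and keeps all weights positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def anomaly_scores(vectors):
    """Cosine distance of each vector from the centroid of all vectors."""
    centroid = Counter()
    for vec in vectors:
        for t, v in vec.items():
            centroid[t] += v / len(vectors)
    c_norm = math.sqrt(sum(v * v for v in centroid.values())) or 1.0
    return [
        1.0 - sum(v * centroid[t] for t, v in vec.items()) / c_norm
        for vec in vectors
    ]
```

Log lines that share vocabulary with the bulk of the batch score close to 0, while a line with unseen tokens scores much higher and would be a candidate for highlighting.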
SRE assistant: key scenarios
- Integration with the corporate messenger Time, so that scenarios can be started from on-call channels.
- Working with incidents: context, status, draft post-analysis and follow-up actions.
- Searching and retrieving data from internal knowledge bases using the RAG approach.
- Managing on-call duty and operational requests without leaving the messenger.
- Orchestration of requests to bots and LLM agents in one user interface.
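The RAG scenario from the list above can be sketched as retrieve-then-prompt. This is a toy illustration under stated assumptions: token-overlap (Jaccard) scoring stands in for whatever embedding search the real assistant uses, the document shape (`id`, `text`) and the prompt wording are hypothetical, and the call to the language model is left out.

```python
import re


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return up to k documents with the highest non-zero Jaccard overlap."""
    q = tokens(query)

    def score(doc):
        d = tokens(doc["text"])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    ranked = sorted(documents, key=score, reverse=True)
    return [d for d in ranked[:k] if score(d) > 0]


def build_prompt(query, passages):
    """Assemble a prompt that grounds the answer in the retrieved passages."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below and cite passage ids.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )
```

Asking the model to cite passage ids makes it easier to check answers against the knowledge base during manual labeling.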
Quality and efficiency metrics
SRE assistant
Precision: 0.54
Recall: 0.43
Estimated from manual labeling of approximately 600 queries.
LogAnalyzer
Precision: 0.64
Recall: 0.85
High recall is important to reduce the risk of missing anomalies.
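For reference, a short sketch of how precision and recall are computed from labeled counts, and how an F-beta score combines them. Only the precision and recall values in `reported` come from the talk; the F-scores printed by this snippet are derived here and were not reported by the speaker. A beta above 1 weights recall more heavily, which matches the preference for not missing anomalies.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the talk.
reported = {"SRE assistant": (0.54, 0.43), "LogAnalyzer": (0.64, 0.85)}

for name, (p, r) in reported.items():
    print(f"{name}: F1={f_beta(p, r):.2f}, F2={f_beta(p, r, beta=2.0):.2f}")
```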
Development prospects
- New iterations of the SRE assistant with improved response quality and scenario coverage.
- Improved metrics: separate tracking of hallucinations, completeness and period errors.
- Strengthening the anomaly detection loop and tighter integration with incident workflow.
- Continued exchange of practices with the professional platform/SRE engineering community.
Practical checklist
- Start applying AI to incident management with narrow, high-ROI scenarios: summarization, context search, triage.
- Define the quality contract in advance: precision/recall, completeness of response, acceptable level of hallucinations.
- Integrate the assistant into existing on-call channels (messenger, tickets, runbooks), rather than into an isolated UI.
- Design observability for the assistant itself: which prompts are used and why a given suggestion is accepted.
- Co-develop with SRE teams so that golden paths prove useful in real incidents, not just in demos.
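The "quality contract" item from the checklist can be made concrete as an explicit set of thresholds checked before each release of the assistant. This is a sketch only: the `QualityContract` fields and any threshold values passed to it are hypothetical and would have to be agreed with the SRE teams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityContract:
    # All thresholds are hypothetical and set by the owning team.
    min_precision: float
    min_recall: float
    max_hallucination_rate: float


def contract_violations(contract, measured):
    """Return the names of violated thresholds; an empty list means the contract holds."""
    checks = [
        ("precision", measured["precision"] >= contract.min_precision),
        ("recall", measured["recall"] >= contract.min_recall),
        (
            "hallucination_rate",
            measured["hallucination_rate"] <= contract.max_hallucination_rate,
        ),
    ]
    return [name for name, ok in checks if not ok]
```

Returning the list of violated thresholds rather than a single boolean makes it clear which part of the contract blocked a release.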
References
- YouTube: Evolution of SRE: implementation of an AI assistant in T-Bank
- Telegram: post #3598 (book_cube)
- Conference Platform Engineering Night
- FineDog: T-Bank incident management platform
- Sage: observability platform of T-Bank
- Telegram: AI and Platform Engineering (#3490)
- Telegram: AI assistant for code (#3515)
- Telegram: AI assistants when working with code (#3518)
- Telegram: review of reliability processes in T-Bank (#3556)
