A strong incident review matters not because of the drama, but because it turns a painful failure into a better model for design decisions.
The analysis of a two-week T-Bank data-platform incident shows how metadata loss, Kafka-based recovery, and data contracts expose real platform fragility.
For retrospectives and design reviews, the chapter helps teams discuss blameless analysis, metadata survivability, contract quality, and the limits of automation through a real prolonged failure.
Practical value of this chapter
Design in practice
Turn guidance on data-platform incident review and metadata recovery into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make trade-offs explicit for data-platform incident review and metadata recovery: release speed, automation level, observability cost, and operational complexity.
Technoshow “Dropped”: episode 1
Live broadcast with an honest blameless review of a two-week data-platform incident: from the failure mechanics to recovery, architecture changes, and organizational takeaways.
Source
Telegram: Book Cube
The original episode review with an emphasis on SRE practices and management implications.
Release context
The episode is framed as a blameless postmortem: it focuses on failure mechanics, recovery, architecture changes, and process changes rather than personal blame. That makes it useful for engineers and managers who need a shared language for risk and reliability.
Guest and context
Alexander Krasheninnikov
- Practitioner in data platforms, ETL, and DWH.
- Guest of the first episode, talking about a real two-week incident.
- Focus on engineering and organizational insights following service recovery.
Incident chain
Metadata loss and read degradation
Metadata loss led to Trino failures and cascading degradation for data consumers.
Recovery was not enough
There was no backup for the required point in time, and CDC-based metadata recovery was too slow for business consumers.
Critical incident and change of approach
The team moved to a new Kafka-based architecture with data contracts, schema unification, and parallel ingestion.
Reconciliation, discrepancy fixes, and recovery
Then came data reconciliation, discrepancy fixes, partial service recovery, and historical backfill from backup copies.
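The reconciliation step can be illustrated with a minimal sketch. It is not the team's actual tooling: the row layout (primary key in the first field), the `row_digest` and `reconcile` helpers, and the in-memory comparison are all assumptions made for illustration. The idea is to compare a source extract with the rebuilt target by key and by a digest of the remaining fields, so that missing, unexpected, and diverged rows are reported separately.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# Assumption: each row is a tuple of strings and the first field is the primary key.
Row = Tuple[str, ...]


def row_digest(row: Row) -> str:
    """Return a stable digest of the non-key fields of a row."""
    payload = "\x1f".join(row[1:]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def reconcile(source: Iterable[Row], target: Iterable[Row]) -> Dict[str, List[str]]:
    """Compare two datasets by key and content digest (hypothetical helper)."""
    src = {row[0]: row_digest(row) for row in source}
    tgt = {row[0]: row_digest(row) for row in target}
    return {
        # Keys present in the source but absent from the rebuilt target.
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        # Keys present only in the target, e.g. replay artifacts.
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        # Keys present on both sides whose content differs.
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


if __name__ == "__main__":
    source_rows = [("1", "alice"), ("2", "bob"), ("3", "carol")]
    target_rows = [("1", "alice"), ("2", "bobby"), ("4", "dave")]
    print(reconcile(source_rows, target_rows))
```

On real volumes the same comparison would normally run per partition or on control samples rather than on full tables held in memory.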
Related chapter
Data Pipeline / ETL / ELT Architecture
Data-pipeline design: orchestration, recovery, and data quality.
What matters to engineers
- CDC pipelines are convenient in normal operations, but they need recovery scenarios and data-integrity checks designed upfront.
- The Outbox pattern lowers the risk of losing events between services, but it does not replace schema and contract-version discipline.
- Data contracts and schema unification are critical to keeping ETL/DWH paths resilient during failover.
- Pipeline reliability needs dedicated data observability: connectors, consumer lag, and replication failures must be visible before users complain.
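As a rough illustration of contract and version discipline, the sketch below checks an event against a declared contract before it is accepted into a pipeline. The `Contract` class, the `violations` function, and the event layout (`schema_version` plus `payload`) are hypothetical and are not taken from the episode or from any specific schema-registry product.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Contract:
    """Hypothetical data contract: a name, a version, and required typed fields."""
    name: str
    version: int
    fields: Dict[str, type]  # required field name -> expected Python type


def violations(contract: Contract, event: Dict[str, Any]) -> List[str]:
    """Return human-readable contract violations; an empty list means the event passes."""
    problems: List[str] = []
    got_version = event.get("schema_version")
    if got_version != contract.version:
        problems.append(f"schema_version: expected {contract.version}, got {got_version}")
    payload = event.get("payload")
    if not isinstance(payload, dict):
        problems.append("payload: missing or not an object")
        return problems
    for field, expected in contract.fields.items():
        if field not in payload:
            problems.append(f"{field}: missing")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


if __name__ == "__main__":
    orders_v2 = Contract(name="orders", version=2, fields={"order_id": str, "amount": int})
    event = {"schema_version": 1, "payload": {"order_id": "A-1"}}
    for problem in violations(orders_v2, event):
        print(problem)
```

Rejecting or quarantining events at this boundary keeps a malformed producer from silently corrupting downstream ETL/DWH paths. A production setup would more likely rely on a schema registry and a serialization format with explicit compatibility rules.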
Primary source
Google SRE: Postmortem Culture
Basic principles of blameless culture and learning from failures.
Management perspective
- A blameless postmortem reduces internal turbulence and moves the conversation toward systemic improvements instead of blame.
- A shared risk language, including incident severity, error budgets, and release freezes, helps align business and engineering expectations.
- Data SLOs should be phrased in user terms first, then decomposed into technical metrics and alerts.
- A data-platform incident is not only a technical failure. It is also a management signal about process maturity.
Primary source
Google SRE Workbook: Implementing SLOs
How to translate user expectations into measurable goals and alerts.
SLI/SLO for the data platform
Data speed
freshness, end-to-end latency, consumer lag
Integrity
duplicates, missing records, discrepancies in control samples
Connector stability
error rate, restart rate, time to recover
Example user-facing SLO: “data in the mart is no more than X minutes old 99.9% of the time,” after which the goal is decomposed into alerts, risk, and a recovery plan.
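A freshness SLO of this shape can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions: data age is sampled periodically in minutes, and the 15-minute threshold, the 99.9% target, and the function names are example values rather than figures from the episode. It computes the share of samples that meet the threshold and the fraction of the error budget still unspent.

```python
from typing import Sequence


def freshness_compliance(ages_minutes: Sequence[float], max_age_minutes: float) -> float:
    """Fraction of samples in which the data was no older than the threshold."""
    if not ages_minutes:
        raise ValueError("at least one sample is required")
    good = sum(1 for age in ages_minutes if age <= max_age_minutes)
    return good / len(ages_minutes)


def error_budget_remaining(compliance: float, target: float) -> float:
    """Share of the error budget left: 1.0 is untouched, 0.0 or less is exhausted."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_bad = 1.0 - target      # budget: tolerated share of stale samples
    actual_bad = 1.0 - compliance   # observed share of stale samples
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Assumption: one sample per minute; 2 of 10,000 samples exceeded the threshold.
    samples = [5.0] * 9998 + [45.0, 60.0]
    compliance = freshness_compliance(samples, max_age_minutes=15.0)
    budget = error_budget_remaining(compliance, target=0.999)
    print(f"compliance={compliance:.4%}, budget_remaining={budget:.1%}")
```

An alert on how quickly the remaining budget shrinks is usually more actionable than an alert on a single stale sample, which matches the decomposition into alerts, risk, and a recovery plan described above.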
References
Related chapters
- Observability & Monitoring Design - Covers metrics, alerts, and diagnostics that matter during prolonged data-platform incidents.
- Site Reliability Engineering - Foundational SRE practices: blameless postmortems, error budgets, and reliability management.
- The Site Reliability Workbook - Hands-on patterns for SLI/SLO design and operational rollout in product teams.
- Data Pipeline / ETL / ELT Architecture - Data pipeline architecture, recovery patterns, and platform-level data quality controls.
- Consistency and idempotency - Helps reduce duplicates, state divergence, and replay-related failures in recovery flows.

