Site Reliability Engineering
SLOs, error budgets, and operational culture from Google.
Reliability and fault tolerance are not only about uptime numbers. They are also about how you organize operations, plan releases, monitor the system, and respond to incidents properly. This section helps you make your service predictable in the face of growth and constant change.
Why does an engineer need this knowledge?
Reliability as a product
Users do not remember the architecture; they remember whether the service worked reliably when they needed it.
Operations = daily success
Operations are the processes, access rights, on-call duties, and routines that keep the system running.
Releases without fear
Checkpoints, feature flags, and safe rollouts reduce both stress and cost.
CI/CD as a value stream
Understanding the CI/CD pipeline as a value stream determines how quickly and easily changes, including incident fixes, reach users.
Observability instead of guesswork
Metrics, logs and traces provide a clear picture of what is happening and why.
Incidents as growth
Post-mortems and follow-up improvements turn failures into systemic lessons.
Section map: key directions
SRE and SLO
Error budgets, balance of speed and stability.
Continuous releases
CI/CD, verifiable rollouts, and crisis-handling practices.
Observability stack
System signals and tools that reveal them.
Incidents and safety
Runbooks, post-mortems, and disciplined response.
Reliability on the client
Mobile releases, feature flags and telemetry.
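Feature flags and gradual rollouts come up in several of the directions above. As a rough illustration, a percentage rollout can be sketched in Python by hashing the user into a stable bucket. The function name `in_rollout`, the flag name, and the 10,000-bucket granularity are assumptions made for this sketch, not an API from any particular feature-flag product.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Decide whether a user falls inside a percentage rollout.

    Hypothetical helper: the user is hashed into one of 10,000 buckets,
    so the same user always gets the same answer for the same flag.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # Include the flag name in the hash so different flags
    # do not always select the same subset of users.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(in_rollout("new-checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the flag at 10%")
```

Because the bucket depends only on the flag and the user, raising the percentage keeps already-enabled users enabled, and setting it to 0 acts as an immediate rollback.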
What will this section give you in practice?
- Ability to formulate SLOs/SLAs and manage error budgets.
- Ability to organize safe releases and rollbacks.
- Understanding how to build observability: metrics, logs, traces and alerts.
- A process for handling incidents: on-call, runbooks, post-mortems and improvements.
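To make the error-budget idea from the list above concrete, here is a minimal Python sketch of the arithmetic. It assumes a request-based availability SLO and a 30-day window; the function names and the sample numbers are illustrative assumptions, not values from this section.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a request-based error budget.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of requests that the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "consumed_fraction": consumed,
    }


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Translate an availability SLO into minutes of downtime per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 of them failed.
    budget = error_budget(0.999, 1_000_000, 250)
    print(f"allowed failures: {budget['allowed_failures']:.0f}")
    print(f"budget consumed:  {budget['consumed_fraction']:.0%}")
    print(f"downtime allowed: {allowed_downtime_minutes(0.999):.1f} min per 30 days")
```

A common way to use such a number is as a release gate: while budget remains, the team ships at normal speed; once it is spent, the focus shifts to stability work.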
If you need a quick entry point, start with the SRE Book and Grokking Continuous Delivery.
