Mobile reliability is difficult because a large share of failure lives on the device, on the network, and in the release channel rather than in the data center.
The chapter shows how staged rollout, feature flags, client telemetry, and backend-impact awareness form a distinct mobile SRE practice.
In interviews, it helps you discuss mobile release risk, client-side observability, and why the client-server model changes the reliability playbook.
Practical value of this chapter
Design in practice
Turn guidance on mobile reliability, client telemetry, and safe app releases into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.
Decision quality
Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.
Interview articulation
Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.
Trade-off framing
Make trade-offs explicit for mobile reliability, client telemetry, and safe app releases: release speed, automation level, observability cost, and operational complexity.
Source
Brief overview in Russian
My book review on Tell Me About Tech
Engineering Reliable Mobile Applications
Authors: Kristine Chen, Venkat Patnala, Devin Carraway, Pranjal Deo
Publisher: O'Reilly Media, 2019
Length: 35 pages
Google's playbook for reliable mobile apps: client telemetry, staged rollout, feature flags, version support, and client impact on backend load.
The chapter approaches mobile SRE through app reliability: client telemetry, service level indicators (SLIs), service level objectives (SLOs), staged rollout, feature flags, the lack of a normal rollback path, support for older versions, and the client's impact on backend load. The key idea is simple: in a mobile client-server product, failures are not confined to the server. Part of the failure surface lives on the device, in the network, in the release channel, and in the app version.
What makes mobile reliability different
SRE Book
Site Reliability Engineering
The basics of SLI, SLO, SLA, and error budgets
SLI and SLO for mobile applications
Server logs are not enough for a mobile app. Users may experience crashes, weak connectivity, stale versions, wrong configuration, or slow devices. Measurement starts with client telemetry and crash-rate data.
SLI
A service level indicator defines what the team measures: successful sessions, crashes, latency of a critical action, data freshness, or synchronization quality.
SLO
A service level objective sets the acceptable risk for a user journey. For mobile, that objective should account for device class, app version, and network quality.
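As a minimal sketch of a segmented SLI, the Python below computes a crash-free session rate per (app version, device class) and lists the segments that fall below an objective. The `Session` fields, the device-class labels, the sample counts, and the 0.99 target are all assumptions made for illustration; they do not come from the book.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    app_version: str
    device_class: str  # hypothetical labels such as "low_end" or "high_end"
    crashed: bool


def crash_free_rate_by_segment(sessions):
    """Return {(app_version, device_class): crash-free session ratio}."""
    totals = defaultdict(int)
    crash_free = defaultdict(int)
    for s in sessions:
        key = (s.app_version, s.device_class)
        totals[key] += 1
        if not s.crashed:
            crash_free[key] += 1
    return {key: crash_free[key] / totals[key] for key in totals}


def segments_below_slo(rates, slo_target):
    """List the segments whose SLI falls below the objective."""
    return sorted(key for key, rate in rates.items() if rate < slo_target)


# Invented sample: 9,000 high-end sessions (9 crashes),
# 1,000 low-end sessions (30 crashes).
sessions = (
    [Session("5.2.0", "high_end", False)] * 8991
    + [Session("5.2.0", "high_end", True)] * 9
    + [Session("5.2.0", "low_end", False)] * 970
    + [Session("5.2.0", "low_end", True)] * 30
)

rates = crash_free_rate_by_segment(sessions)
overall = sum(not s.crashed for s in sessions) / len(sessions)

print(overall)                                      # aggregate across all devices
print(segments_below_slo(rates, slo_target=0.99))   # segments that miss the target
```

With this sample the aggregate rate stays above the target while the low-end segment misses it, which is why the objective is evaluated per segment rather than only in aggregate.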
Monitoring a mobile application
In server environments, the effect of a change usually shows up in metrics quickly. In mobile, signals arrive more slowly: users update gradually, apps may send data in batches, and some devices stay offline for a long time. Each metric should therefore record not only the error itself but also the app version, configuration, and release channel that produced it.
Low-latency error ratios
Design metrics with trustworthy denominators so the team can distinguish a real regression from normal traffic fluctuation after a release.
Configuration state as a dimension
Include configuration state in client metrics so you can see which devices have received a fix, flag, or experiment variant.
White-box monitoring
Signals from instrumented application code: errors, operation duration, configuration state, and user journey quality.
Black-box monitoring
Checks that mimic a user from the outside: opening a scenario, making a network request, polling periodically, or running a synthetic probe.
These approaches complement each other: internal signals explain the cause, while external behavior shows what users see.
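The points about trustworthy denominators and configuration state as a dimension can be sketched as follows. This is an illustrative example, not the authors' implementation: the event shape, the `cfg-41` and `cfg-42` identifiers, and the 500-request threshold are assumptions.

```python
from collections import Counter

MIN_REQUESTS = 500  # assumed threshold below which a ratio is treated as noise


def error_ratios(events, min_requests=MIN_REQUESTS):
    """Compute error ratios per (app_version, config_id) cohort.

    events: iterable of (app_version, config_id, is_error) tuples.
    Returns {(app_version, config_id): ratio or None}, where None marks a
    cohort whose denominator is still too small to trust.
    """
    requests = Counter()
    errors = Counter()
    for app_version, config_id, is_error in events:
        key = (app_version, config_id)
        requests[key] += 1
        if is_error:
            errors[key] += 1
    return {
        key: (errors[key] / total if total >= min_requests else None)
        for key, total in requests.items()
    }


# Invented telemetry: an established cohort and a freshly released one.
events = (
    [("5.2.0", "cfg-41", False)] * 990
    + [("5.2.0", "cfg-41", True)] * 10
    + [("5.3.0", "cfg-42", False)] * 18
    + [("5.3.0", "cfg-42", True)] * 2
)

ratios = error_ratios(events)
for cohort, ratio in sorted(ratios.items()):
    print(cohort, "insufficient data" if ratio is None else ratio)
```

The new cohort has a raw error ratio of 10%, but with only 20 requests the sketch refuses to report it. Keying on the configuration identifier also makes it visible which cohorts have actually received a given fix or flag.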
CI/CD
Grokking Continuous Delivery
Continuous delivery practices
Change management
Change management is critical for mobile apps: rollback is almost impossible, and some post-release failures cannot be fixed instantly. In the worst case, a bad binary release can brick devices.
Staged rollout for a mobile release
Employees, testers, and dogfooding cohorts use the version first.
In mobile, a normal rollback is rarely available. Fixes usually arrive through a new version or by disabling behavior with a feature flag.
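One way to picture a staged rollout is deterministic bucketing: each device maps to a stable bucket, and raising the percentage only ever adds devices. The sketch below is a hypothetical server-side illustration. App stores typically manage binary rollout percentages themselves, and the function names, the `release_id` salt, and the bucket count are assumptions.

```python
import hashlib


def rollout_bucket(device_id: str, release_id: str) -> int:
    """Map a device to a stable bucket in [0, 10000) for one release."""
    digest = hashlib.sha256(f"{release_id}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_in_rollout(device_id: str, release_id: str, percent: float) -> bool:
    """True if the device falls inside the current rollout percentage."""
    return rollout_bucket(device_id, release_id) < percent * 100


devices = [f"device-{i}" for i in range(1000)]
at_1_percent = {d for d in devices if is_in_rollout(d, "5.3.0", 1)}
at_10_percent = {d for d in devices if is_in_rollout(d, "5.3.0", 10)}

# Widening the stage keeps every device that was already included.
print(at_1_percent <= at_10_percent)
```

Salting the hash with the release identifier means a different set of devices goes first in each release, so the same users are not always the earliest cohort.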
Case
A/B platform design
Experimentation architecture for web and mobile applications
Feature flags and A/B testing
The mobile ecosystem is heterogeneous: CPU, memory, network bandwidth, OS version, and device quality all vary. If you read metrics immediately after release, the data may be biased because early adopters with newer devices often upgrade first.
Google recommendation
Separate app release from behavior rollout. Ship the binary ahead of time, then enable behavior through feature flags and A/B experiments.
Verify that turning a flag off does not break the application and actually restores safe behavior.
If an upgrade leaves irreversible side effects, plan a control group and a placebo scenario so the experiment remains valid.
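The recommendation to verify that turning a flag off restores safe behavior can be exercised with a small check like the one below. It is a sketch under assumptions: the flag name `new_sync_engine`, the `SAFE_DEFAULTS` table, and the `sync` function are hypothetical and stand in for real application code.

```python
SAFE_DEFAULTS = {"new_sync_engine": False}  # hypothetical flag name


def flag_enabled(name, remote_config):
    """Resolve a flag, falling back to the safe default on bad or missing data."""
    value = (remote_config or {}).get(name)
    if isinstance(value, bool):
        return value
    return SAFE_DEFAULTS.get(name, False)


def sync(items, remote_config):
    """Use the new code path only when the flag is explicitly switched on."""
    if flag_enabled("new_sync_engine", remote_config):
        return {"engine": "new", "synced": len(items)}
    return {"engine": "legacy", "synced": len(items)}


# Flag on, flag off, malformed value, and no configuration at all.
print(sync([1, 2, 3], {"new_sync_engine": True}))
print(sync([1, 2, 3], {"new_sync_engine": False}))
print(sync([1, 2, 3], {"new_sync_engine": "yes"}))
print(sync([1, 2, 3], None))
```

Only the first call takes the new path. A missing or malformed configuration falls back to the legacy behavior, which is the property a pre-release test would want to assert.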
Supporting older versions
Frequent releases create a long tail of app versions on user devices. Teams need a support policy that states which versions still receive backend compatibility, critical fixes, and clear upgrade messaging.
Support horizon
The support horizon should be explicit, for example one or two years. Otherwise the team keeps old API contracts, configurations, and client behavior alive forever.
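An explicit support horizon can be expressed as a simple policy check. The sketch below assumes a one-year horizon measured from each version's release date, which is a simplification; the version numbers and dates are invented for illustration.

```python
from datetime import date, timedelta

SUPPORT_HORIZON = timedelta(days=365)  # assumed policy: one year

# Hypothetical release dates, for illustration only.
RELEASE_DATES = {
    "4.0.0": date(2017, 3, 1),
    "5.0.0": date(2018, 6, 1),
    "5.2.0": date(2019, 1, 15),
}


def support_status(app_version, today):
    """Classify a client version against the support horizon."""
    released = RELEASE_DATES.get(app_version)
    if released is None:
        return "unknown"
    if today - released <= SUPPORT_HORIZON:
        return "supported"
    return "upgrade_required"


today = date(2019, 5, 1)
for version in sorted(RELEASE_DATES):
    print(version, support_status(version, today))
```

A backend could use a result like `upgrade_required` to return a clear upgrade message instead of silently keeping an old API contract alive.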
Resilience
Release It!
Patterns that protect systems from cascading failures
Impact on the backend
Client-side changes can sharply alter backend load. A new caching or synchronization policy can increase request volume by an order of magnitude and trigger failures in dependent services.
Before release, test how the client change affects server limits, retries, queues, and dependency degradation.
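A back-of-the-envelope estimate shows how a client-side policy change turns into backend load. The numbers below (1.2 million active clients, a sync interval shortened from 10 minutes to 1 minute, and a capacity of 8,000 requests per second) are assumptions chosen to make the arithmetic easy to follow.

```python
def steady_state_rps(active_clients, sync_interval_seconds):
    """Average requests per second if every client syncs once per interval."""
    return active_clients / sync_interval_seconds


ACTIVE_CLIENTS = 1_200_000   # assumed fleet size
CAPACITY_RPS = 8_000         # assumed backend limit

before = steady_state_rps(ACTIVE_CLIENTS, sync_interval_seconds=600)
after = steady_state_rps(ACTIVE_CLIENTS, sync_interval_seconds=60)

print(before)                 # load with the old policy
print(after)                  # load with the new policy
print(after / before)         # multiplication factor
print(after <= CAPACITY_RPS)  # does the new load fit the assumed capacity?
```

Under these assumptions the shorter interval multiplies steady-state load by ten and exceeds capacity, before accounting for retries or synchronized bursts.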
Hope is not a mobile strategy
The authors highlight several practices from Google's experience:
Design
Design mobile apps to tolerate unexpected input, control-plane mistakes, stale configuration, and unreliable networks. Changes should be enabled gradually and measured with real signals.
Monitor
Measure critical user interactions, responsiveness, data freshness, and crash rate. Success criteria should reflect user expectations, not only server health.
Release
Enable behavior through feature flags so the team can evaluate it with experiments, limit exposure, and turn off risk without shipping a new binary.
Understand
Account for backend impact: the client can create a thundering herd or an expensive shift in load. Design safe feedback patterns between the app and backend services.
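One common client-side pattern for avoiding a thundering herd is exponential backoff with jitter, so that clients that failed at the same moment do not retry at the same moment. The book excerpt here does not prescribe a specific algorithm; the sketch below uses the "full jitter" variant, with base and cap values chosen as assumptions.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full-jitter delay in seconds for a zero-based retry attempt.

    The upper bound doubles with each attempt up to `cap`; the actual delay
    is drawn uniformly between 0 and that bound.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Seeded generator so the example is repeatable; real clients would not seed.
rng = random.Random(42)
delays = [backoff_delay(attempt, rng=rng) for attempt in range(6)]
print([round(d, 2) for d in delays])
```

Because each client draws its own random delay, retries after an outage spread out over time instead of arriving as a single spike.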
Key takeaways
Related chapters
- Site Reliability Engineering - Covers the SRE basics behind this chapter: SLOs, error budgets, toil, on-call work, and postmortems.
- The Site Reliability Workbook - Turns SRE ideas into practical templates for alerting, incident response, and operating processes.
- Building Secure and Reliable Systems - Shows how reliability and security requirements come together in production-grade systems.
