Source
Brief overview in Russian
My book review on Tell Me About Tech
Engineering Reliable Mobile Applications
Authors: Kristine Chen, Venkat Patnala, Devin Carraway, Pranjal Deo
Publisher: O'Reilly Media, 2019
Length: 35 pages
Mobile SRE from Google: staged rollout, feature flags, client telemetry and impact on the backend.
Original
Features of mobile applications for SRE
Features of Mobile SRE
SRE Book
Site Reliability Engineering
Basics of SLI/SLO/SLA and error budgets
Measuring indicators
For mobile apps, availability is a hard question to answer. Server logs alone are not enough: you need client telemetry to measure what users actually experience and to get visibility. Even without extensive telemetry, you can fall back on crash statistics.
SLI (Service Level Indicators)
An SLI pins down what we measure and how. The client is instrumented to send the relevant events to the backend, where the indicators are calculated. A single event can feed into several different SLIs.
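As a minimal sketch of how the same client events can feed more than one SLI, assume each event records the user action, whether it succeeded, and how long it took. The `ClientEvent` shape and both function names are hypothetical and not taken from the book.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClientEvent:
    # Hypothetical event shape; the field names are illustrative only.
    action: str       # user interaction, e.g. "open_inbox"
    ok: bool          # whether it succeeded from the user's point of view
    latency_ms: int   # how long the user waited for the result


def availability_sli(events: Iterable[ClientEvent], action: str) -> Optional[float]:
    """Fraction of attempts at `action` that succeeded."""
    relevant = [e for e in events if e.action == action]
    if not relevant:
        return None  # no data means "unknown", not "100% available"
    return sum(1 for e in relevant if e.ok) / len(relevant)


def latency_sli(events: Iterable[ClientEvent], action: str,
                threshold_ms: int) -> Optional[float]:
    """Fraction of attempts at `action` that succeeded within `threshold_ms`."""
    relevant = [e for e in events if e.action == action]
    if not relevant:
        return None
    fast = sum(1 for e in relevant if e.ok and e.latency_ms <= threshold_ms)
    return fast / len(relevant)
```

Both functions read the same event stream; only the definition of a "good" event differs.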
SLO (Service Level Objectives)
With high-quality SLIs, we can set certain SLO levels that we strive for. It is important to take into account the specifics of mobile devices.
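One way to relate an SLO target to an error budget is shown below as a small sketch. The function name and the example numbers are assumptions made for illustration.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent.

    slo_target is the desired fraction of good events, e.g. 0.99.
    """
    if total == 0:
        return 1.0  # nothing measured yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

For example, with a 99% target and 50 bad events out of 10,000, roughly half of the budget is still available.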
Real-time monitoring
SRE teams love real-time monitoring. But in the mobile world, time to resolution is longer, because changes are delivered in polling mode. After a change is pushed, it may take time for client metrics to stabilize.
Low-latency error ratios
Design metrics as ratios with high-confidence denominators, so that normal traffic fluctuations are controlled for. This lets you observe the effect of a change immediately after it is pushed.
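A minimal sketch of such a ratio metric follows. The `min_attempts` cutoff is an assumed stand-in for "high-confidence denominator"; a real system would pick it from its own traffic.

```python
from typing import Optional


def error_ratio(errors: int, attempts: int,
                min_attempts: int = 1000) -> Optional[float]:
    """Errors per attempt, or None when the denominator is too small to trust.

    Dividing by attempts (rather than alerting on raw error counts) keeps the
    metric stable when traffic rises or falls during the day.
    """
    if attempts < min_attempts:
        return None  # too few samples: report "unknown" instead of noise
    return errors / attempts
```

With this shape, 30 errors in 2,000 attempts yields 0.015, while 3 errors in 200 attempts yields no value at all rather than a noisy one.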
Configuration state as dimension
Telemetry metrics should carry the configuration state as a dimension. This lets you filter the telemetry down to devices that have already received the desired fix.
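The idea can be sketched as grouping telemetry by the configuration version each device reported. The tuple layout and the version labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_ratio_by_config(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Error ratio per configuration version.

    Each event is a (config_version, ok) pair, where config_version is the
    configuration the device was running when it emitted the event.
    """
    totals = defaultdict(lambda: [0, 0])  # version -> [errors, attempts]
    for version, ok in events:
        totals[version][1] += 1
        if not ok:
            totals[version][0] += 1
    return {v: errs / n for v, (errs, n) in totals.items()}
```

Comparing the ratio for the fixed configuration against the previous one shows whether the fix works, without waiting for every device to pick it up.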
White-Box Monitoring
Metrics that publish data about the internal workings of an application. Requires code instrumentation.
Black-Box Monitoring
Checking the external, visible behavior of the application, for example with periodic probes.
The two approaches are complementary: only together do they give a reasonably reliable picture of the application's state.
CI/CD
Grokking Continuous Delivery
Continuous Delivery Practices
Change management
Following change management best practices is critical: rollback is almost impossible, and some problems found in production are unrecoverable (for example, “bricked” devices).
Staged Rollout / Phased Releases
Internal testers and dogfooding
Unlike server-side deployment, the mobile world only allows rolling forward: a "rollback" means shipping a new version
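App stores usually manage the rollout percentage for binaries themselves. For changes gated on your own side, a staged rollout can be sketched as deterministic bucketing by device. The function names and the `salt` parameter below are assumptions, not an API from the book.

```python
import hashlib


def rollout_bucket(device_id: str, salt: str) -> int:
    """Map a device to a stable bucket in 0..9999 (steps of 0.01%)."""
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_rollout(device_id: str, salt: str, percent: float) -> bool:
    """True if the device falls inside the first `percent` of buckets.

    Because the bucket is stable, raising `percent` only adds devices;
    nobody who already has the change loses it.
    """
    return rollout_bucket(device_id, salt) < percent * 100
```

Using a different `salt` per rollout keeps the same early devices from always being the first to receive every change.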
Case
A/B platform design
Experimentation system architecture for web and mobile applications
Feature Flags and A/B testing
Mobile applications operate in a very diverse ecosystem, where parameters such as CPU, memory and network bandwidth differ from device to device. If you look only at metrics collected right after a release, the data can be skewed: new versions are installed first by enthusiasts with powerful devices.
Google Recommendation
Separate the release of a new application version from the launch of new behavior. Trigger behavioral changes through A/B tests using feature flags.
It is important to test that rolling a flag back does not break the application
An upgrade may have side effects that cannot be undone, so for the experiment to stay valid the old application can be given a “placebo” treatment
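The recommendation above can be sketched as a flag store with safe built-in defaults, plus a check that turning a flag on and then off again restores the old behavior. The `FeatureFlags` class, the `new_checkout` flag and `checkout_screen` are all hypothetical names.

```python
class FeatureFlags:
    """Boolean flags with compiled-in defaults and remote overrides."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def apply_remote_config(self, config):
        # Ignore unknown keys and non-boolean values instead of crashing:
        # a malformed config must never take the application down.
        for key, value in config.items():
            if key in self._defaults and isinstance(value, bool):
                self._overrides[key] = value

    def is_enabled(self, name):
        return self._overrides.get(name, self._defaults.get(name, False))


def checkout_screen(flags):
    # New behavior ships in the binary but stays dark until the flag is on.
    if flags.is_enabled("new_checkout"):
        return "new checkout"
    return "legacy checkout"


flags = FeatureFlags({"new_checkout": False})
flags.apply_remote_config({"new_checkout": True})    # launch
flags.apply_remote_config({"new_checkout": False})   # roll back
print(checkout_screen(flags))
```

Exercising the launch and rollback sequence in a test is one way to check that a flag rollback does not leave the application in a broken state.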
Support for older versions
Frequent releases lead to a long tail of old versions on users' devices. A clear support policy is required.
Support horizon
Support for older versions should have a clear horizon - for example, one or two years. Otherwise, maintaining the entire zoo of old versions will be too expensive and ineffective.
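A support horizon can be expressed as a simple date check. The one-year value and the function name below are assumptions chosen to match the example in the text.

```python
from datetime import date, timedelta

# Assumption: a one-year horizon; the text suggests one or two years.
SUPPORT_HORIZON = timedelta(days=365)


def is_supported(release_date: date, today: date,
                 horizon: timedelta = SUPPORT_HORIZON) -> bool:
    """True while a version is still inside the support horizon."""
    return today - release_date <= horizon
```

A backend could use a check like this to decide when to ask clients on older versions to upgrade.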
Sustainability
Release It!
Protection patterns against cascade failures
Impact on backend services
Changes to client code can have significant consequences on the server side. For example, changing the caching policy can increase the number of requests by an order of magnitude, which can lead to denial of service to backend systems.
It is important to understand how client-side changes alter the usage patterns of the services the application depends on, and to verify before publishing that these changes will not be fatal for the backend.
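A back-of-the-envelope estimate shows why a caching change matters. The sketch assumes every active device refetches once per cache lifetime; the device count and TTL values are made up for illustration.

```python
def steady_state_qps(active_devices: int, cache_ttl_seconds: float) -> float:
    """Approximate backend request rate if each device refetches once per TTL."""
    return active_devices / cache_ttl_seconds


before = steady_state_qps(1_000_000, 600)  # 10-minute client cache
after = steady_state_qps(1_000_000, 60)    # cache shortened to 1 minute
print(f"{before:.0f} qps -> {after:.0f} qps")
```

Under these assumptions, shortening the cache lifetime tenfold raises the steady-state load tenfold, which is the order-of-magnitude jump the text warns about.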
SRE: Hope Is Not a Mobile Strategy
The authors highlight the following best practices from Google's experience:
Design
Design mobile applications to be robust to unexpected input, able to recover from management errors, and able to roll out changes in a controlled, metric-driven way.
Monitor
Monitor the application in production, measuring critical user interactions and key health metrics (responsiveness, data freshness, crashes). Success criteria should be directly related to user expectations.
Release
Roll out changes carefully via feature flags, so that they can be evaluated through experimentation and rolled back independently of binary releases.
Understand
Understand and prepare for the impact of the application on servers. Prevent known problematic patterns (e.g. thundering herd). Establish development and release practices that avoid problematic feedback patterns between applications and services.
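One common client-side guard against a thundering herd is exponential backoff with random jitter, so that devices retrying after an outage do not all hit the server at the same moment. This is a generic sketch of that technique, not a method described in the book. The parameter values are assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with every attempt up to `cap`; the actual delay
    is drawn uniformly from [0, upper] ("full jitter") to spread clients out.
    """
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, upper)
```

The same idea applies to periodic syncs: adding a random offset to each device's schedule avoids synchronized spikes on the backend.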
Related content from Google
Site Reliability Engineering
Google SRE Book
Basics of SRE practices: SLO, error budgets, toil, on-call and postmortems.
The Site Reliability Workbook
Google SRE Workbook
Practical continuation with specific examples and case studies.
Building Secure and Reliable Systems
Google, 2020
How to combine security and reliability in production systems.
