Progressive Delivery: Canary, Blue-Green, and Feature Flags

A big all-at-once release is a bet on luck: either you get away with it, or you have an incident across the whole audience. Progressive delivery turns that bet into a controlled experiment, revealing risk gradually and under metric observation.

This chapter covers three mechanisms of safe rollout: blue-green with an instant environment switch, canary with a traffic share that grows through metric gates, and feature flags that separate deploy from release. On top of them sits automated rollback tied to SLO and error-budget burn rate.

Neighboring chapters on SLI/SLO/SLA and incident discipline provide the signals and the failure response; this one is about avoiding failure in the first place, and about implementing that in Kubernetes with Argo Rollouts and Flagger.

Practical value of this chapter

Design in practice

Turn the guidance on safe change rollout (canary, blue-green, feature flags, and automated SLO rollback) into concrete operational decisions: alert interfaces, runbook boundaries, and rollback strategy.

Decision quality

Evaluate architecture via SLO, error budget, MTTR, and critical-path resilience rather than feature completeness alone.

Interview articulation

Frame answers around the reliability lifecycle: degradation signal, response, root-cause isolation, recovery, and prevention loop.

Trade-off framing

Make the trade-offs of safe change rollout explicit: release speed, automation level, observability cost, and operational complexity.

Source

Danilo Sato — CanaryRelease (martinfowler.com)

The classic definition of a canary release: a new version is first seen by a small subset of users, and only on healthy signals does its share grow.


Progressive delivery is a set of practices for safely shipping changes to production where risk is revealed gradually and under observation, not with a single switch. This chapter covers blue-green deployment, canary release, feature flags, and automated rollback on SLO. The neighboring chapter SLI / SLO / SLA and error budgets provides the signals and budget the stop criteria rely on, while incident management discipline describes what to do when a rollout breaks through the guardrails anyway. This chapter is about not getting to an incident in the first place.

Why: release risk and separating deploy from release

A big all-at-once release is a bet: either you got lucky, or you have an incident across the whole audience. Progressive delivery turns the bet into a controlled experiment. The first step is to separate two notions that are often conflated.

Deploy

A new artifact lands on the servers and is running, but that does not mean anyone uses it yet. It is a technical fact: the binary is installed, the process is alive, the smoke test passed.

Release

The change becomes visible to users and starts serving their traffic. This is where risk appears, and this is exactly what we want to reveal gradually rather than with a single switch.

Why separate them

As long as deploy and release are one step, any rollback means rolling back the binary. Split them and you gain room to maneuver: the code is already in production, kept dark, and turning it on for an audience is a separate, controlled step. That is the essence of progressive delivery.

Three strategies for revealing change

Blue-Green

Environment switch

Two full environments, blue and green. Traffic is switched from one to the other instantly and rolled back just as fast. The cost is duplicated capacity.

Canary

Percentage of traffic

The new version first gets 1-5% of traffic, metrics are compared against a baseline, and the share grows in steps only while signals stay healthy. Rollback means lowering the weight to zero.

Feature flags

Switches in code

The change hides behind a flag and is turned on without a redeploy — by percentage of audience, by segment, or via an instant kill switch.
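A percentage rollout with a kill switch can be sketched in a few lines. This is a minimal illustration, not the API of any particular flag service: the flag dictionary layout, the `bucket` and `is_enabled` helpers, and the flag name are all hypothetical. Hashing the flag name together with the user ID keeps each user in the same bucket across requests, so raising the percentage only adds users and never flips existing ones off.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: dict, user_id: str, segment: str = "") -> bool:
    """Evaluate a flag: kill switch first, then segment, then percentage."""
    if flag.get("killed", False):
        return False  # the kill switch overrides everything else
    if segment and segment in flag.get("segments", ()):
        return True  # explicitly targeted segment, e.g. internal staff
    return bucket(flag["name"], user_id) < flag.get("percent", 0)


# Hypothetical flag definition: 10% of users plus everyone in "staff".
flag = {"name": "new-checkout", "percent": 10, "segments": {"staff"}, "killed": False}
print(is_enabled(flag, "user-42"))
print(is_enabled(flag, "user-42", segment="staff"))  # True
```

Setting `killed` to true turns the feature off for everyone, including targeted segments, without a redeploy.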

How a canary opens up: from 5% to 100%

Between steps sits an automated gate: the weight grows only while signals stay healthy. Any deviation returns traffic to the stable version. The specific step numbers are an example, not a standard.

Blue-green: instant switch and the cost of duplication

Blue-green deployment keeps two full, identical environments. At any moment one is live (say blue), while the new release is prepared and given its final tests in green. When green is ready, the router switches traffic to it; if something goes wrong, it switches back to blue. This achieves zero-downtime deployment with near-instant rollback. The price is duplicated capacity: around the switch you need two full copies running. Blue-green answers the question "can we roll back quickly?", but on its own it reveals the change in one jump, to everyone at once, which is why it is often combined with a canary phase.
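The switch itself is little more than a pointer flip, which is why rollback is so fast. The sketch below models it with a hypothetical `BlueGreenRouter` class; in a real system the pointer would be a load balancer target, a DNS record, or a Service selector rather than a Python attribute.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Two environments plus a pointer to the one serving live traffic."""
    versions: dict  # environment name -> deployed version
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        """Deploy: install the new version where no user traffic goes."""
        self.versions[self.idle] = version

    def switch(self) -> str:
        """Release (or rollback): flip the pointer to the other environment."""
        self.live = self.idle
        return self.live

    def route(self) -> str:
        """Return the version that serves a request right now."""
        return self.versions[self.live]


router = BlueGreenRouter({"blue": "v1", "green": "v1"})
router.deploy_to_idle("v2")   # green now runs v2, users still see v1
router.switch()               # release: all traffic moves to green
print(router.route())         # v2
router.switch()               # rollback: the same operation in reverse
print(router.route())         # v1
```

Note that `deploy_to_idle` changes nothing for users; only `switch` does, and it moves the whole audience at once.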

Canary: percentage of traffic and automated analysis

A canary release is a risk-reduction technique: the new version first gets a small share of traffic, its behavior is observed, and the share grows gradually. The name comes from mining, where a canary was an early indicator of dangerous gas. The key part of a mature canary is the automated analysis that compares the new version against a baseline of the same configuration, not against a historical average.

1. Deploy the artifact and run smoke tests on canary instances before any live traffic is routed.
2. Send a small traffic weight (for example 5%) to the new version through traffic splitting in the load balancer or service mesh.
3. Compare canary metrics against the baseline: error rate, latency percentiles, resource saturation. This is automated canary analysis.
4. On healthy signals, raise the weight in steps (5 → 25 → 50 → 100%); on deviation, automatically roll the weight back to zero.
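The stepping logic above can be sketched as a small loop. The function name `run_canary` and its two callbacks are hypothetical: `set_weight` stands in for whatever changes the traffic split (load balancer or mesh), and `gate_passes` stands in for the automated analysis. The step values are the example from the text, not a standard.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    set_weight: Callable[[int], None],
    gate_passes: Callable[[int], bool],
) -> bool:
    """Raise the canary weight step by step; drop it to 0 on the first failed gate.

    Returns True if the rollout reached the last step, False if it was rolled back.
    """
    for weight in steps:
        set_weight(weight)
        if not gate_passes(weight):
            set_weight(0)  # automatic rollback: all traffic back to stable
            return False
    return True


# Demo: record every weight change; the gate starts failing at 50%.
history = []
promoted = run_canary([5, 25, 50, 100], history.append, lambda w: w < 50)
print(promoted, history)  # False [5, 25, 50, 0]
```

A real gate would wait for an observation window and query metrics before answering; here it is a pure function so the control flow stays visible.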

Feature flags: Hodgson's four categories

Pete Hodgson's Feature Toggles article on martinfowler.com sorts toggles into four categories by dynamism and lifetime. Their main value is separating deploy from release: code can be shipped early, turned on through a gradual rollout by percentage of audience or by segment, and turned off just as quickly via a kill switch.

Release toggles

Hide not-yet-finished code in trunk-based development so an incomplete feature does not block shipping everything else. They live days to weeks and should be removed once fully rolled out.

Ops toggles

Operational switches: a kill switch for a heavy feature under load, graceful degradation of a non-critical path. Some of them are long-lived and operated by on-call.

Experiment toggles

Split users into cohorts for A/B testing. The decision is made on statistics, not on a simple works-or-not observation.

Permissioning toggles

Open a feature to specific groups: beta testers, internal staff, premium tier. Often long-lived and tied to business rules.

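The debt risk of flags is real. Hodgson's advice is to treat flags as inventory with a carrying cost and keep that inventory as low as possible: add a removal task to the backlog when the flag is created, give the flag an expiration date, enforce that date with a "time bomb" (a test that fails once the flag outlives it), or cap the number of flags in the system.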
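An expiration check and a flag limit are easy to automate, for example as a test in CI. The sketch below assumes a hypothetical in-code registry `FLAGS` that maps each flag name to a removal date (or `None` for deliberately long-lived toggles); the flag names, dates, and the limit of 20 are made up for illustration.

```python
from datetime import date

# Hypothetical registry: flag name -> removal date (None = long-lived toggle).
FLAGS = {
    "new-checkout": date(2024, 3, 1),
    "heavy-report-kill-switch": None,
}
MAX_FLAGS = 20  # illustrative cap on the number of flags in the system


def expired_flags(flags: dict, today: date) -> list:
    """Names of flags whose removal date has already passed."""
    return sorted(
        name for name, expires in flags.items()
        if expires is not None and expires < today
    )


def check_flag_hygiene(flags: dict, today: date, max_flags: int = MAX_FLAGS) -> list:
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    if len(flags) > max_flags:
        problems.append(f"{len(flags)} flags exceed the limit of {max_flags}")
    for name in expired_flags(flags, today):
        problems.append(f"flag '{name}' is past its removal date")
    return problems


print(check_flag_hygiene(FLAGS, date(2024, 4, 1)))
# ["flag 'new-checkout' is past its removal date"]
```

Run as a failing test, this is the "time bomb": the build breaks until the expired flag is removed or its date is consciously extended.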

Automated rollback on SLO and burn rate

The stop criterion for a rollout is not "it seems bad" but a measurable signal. The most reliable one ties the decision to SLO and error-budget burn rate: if the canary spends the budget many times faster than allowed, the step fails and the weight automatically drops to zero.

Tie the stop condition to error-budget burn rate: a fast burn rate (for example 14.4x over an hour) means immediate rollback, not "let's wait a bit longer".

Compare the canary against a baseline of the same traffic, not against a historical average — otherwise daily fluctuations produce false alarms.

Define explicit guard metrics: 5xx rate, p99 latency, retry growth. Any one crossing its threshold marks the analysis as failed.

Rollback must be automatic and fast. A manual "should we roll back?" decision at 3 a.m. is exactly the risk that progressive delivery is meant to remove.
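To make the burn-rate criterion concrete: burn rate is the observed error rate divided by the error rate the SLO allows. The sketch below is a minimal calculation, assuming a request-based availability SLO and a 30-day SLO period (both assumptions, as are the function names). Under those assumptions the 14.4x figure above means that one hour at that pace consumes 14.4 × 1 / 720 = 2% of the monthly budget.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if requests == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_rate = 1.0 - slo_target
    return (errors / requests) / allowed_error_rate


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the whole error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


def should_rollback(errors: int, requests: int, slo_target: float,
                    max_burn_rate: float = 14.4) -> bool:
    """Stop criterion: fail the step once the burn rate reaches the threshold."""
    return burn_rate(errors, requests, slo_target) >= max_burn_rate


# SLO 99.9%: the budget allows 0.1% errors. A canary at 2% errors burns ~20x.
print(should_rollback(errors=200, requests=10_000, slo_target=0.999))  # True
print(should_rollback(errors=5, requests=10_000, slo_target=0.999))    # False
print(round(budget_consumed(14.4, window_hours=1), 4))                 # 0.02
```

The threshold and window belong in the error-budget policy; the code only turns that policy into a yes-or-no answer the rollout controller can act on.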

Progressive delivery in Kubernetes

In Kubernetes the standard Deployment offers only the Recreate and RollingUpdate strategies, with no metric analysis and no weighted traffic control. That is why progressive delivery uses specialized controllers.

Argo Rollouts

The controller replaces the standard Deployment with a Rollout resource that supports canary and blue-green strategies. An AnalysisTemplate defines the metric check; the AnalysisRun created from it determines whether the controller promotes the step or aborts and rolls back.

Flagger

An operator on top of a service mesh and ingress (Istio, Linkerd, NGINX, and others): it gradually shifts traffic, runs conformance tests, and decides on metrics — canary, blue-green, A/B rollout.

Metric providers

Signals for the analysis come from Prometheus, Datadog, New Relic, Graphite, InfluxDB, and others. Thresholds on them decide whether a step is successful or failed.

Precise traffic splitting for canaries and blue-green is convenient to implement at the service mesh layer: it sets weights between versions and collects metrics through a sidecar, which is covered in detail in the neighboring service mesh chapter.

A/B experiments vs progressive delivery: where the line is

The mechanism is shared — traffic and flag management — but the goals differ. Confusing them leads to bad decisions: an experiment cannot be aborted on a guard metric, and a canary cannot be held for a week to reach significance.

Progressive delivery answers the engineering question "is this change safe?". An A/B test answers the product question "which variant is better on a business metric?".

A canary lives minutes to hours and ends in promotion or rollback. An experiment runs days to weeks until statistical significance is reached.

A canary watches reliability guard metrics (errors, latency). An experiment watches business metrics (conversion, retention).

Because the goals differ, one flag should not serve as both a kill switch and an experiment: the two decisions get confused.

Trade-offs and common mistakes

Zombie flags: release toggles left in the code after full rollout. Each one is a branch that must be tested and kept in mind. Flags are inventory with a carrying cost.

A canary without enough traffic: at 1% of a low-traffic service the metrics are statistically noisy, and the analysis either stays silent or raises false alarms.

Manual rollback instead of automatic: while a human investigates, the error budget keeps burning. A rollback should be cheaper than deliberating about it.

A canary without a baseline: comparing the new version against history instead of a parallel baseline yields false conclusions under daily and weekly load swings.
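Two of the mistakes above, too little traffic and no parallel baseline, can be guarded against in the analysis step itself. The sketch below is one possible shape for that check, not the algorithm of any specific tool; the `Sample` type, the `analyze` function, and all thresholds (`min_requests`, `max_ratio`, `abs_floor`) are hypothetical and would need tuning per system.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Request and error counts collected over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def analyze(canary: Sample, baseline: Sample, min_requests: int = 500,
            max_ratio: float = 2.0, abs_floor: float = 0.001) -> str:
    """Compare the canary with a parallel baseline; return pass, fail, or inconclusive."""
    # Too little traffic on either side: the numbers are noise, so do not decide.
    if canary.requests < min_requests or baseline.requests < min_requests:
        return "inconclusive"
    # Allow the canary up to max_ratio times the baseline error rate, with an
    # absolute floor so a near-zero baseline does not fail on a single error.
    allowed = max(baseline.error_rate * max_ratio, abs_floor)
    return "fail" if canary.error_rate > allowed else "pass"


print(analyze(Sample(100, 5), Sample(10_000, 10)))   # inconclusive
print(analyze(Sample(1000, 30), Sample(1000, 10)))   # fail
print(analyze(Sample(1000, 12), Sample(1000, 10)))   # pass
```

Because both samples come from the same window, a daily or weekly load swing moves the baseline and the canary together and does not trigger a false alarm. How to treat "inconclusive" (hold the step, extend the window, or roll back) is a policy decision.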

Recommendations

Separate deploy from release: ship code dark behind a flag and turn it on as a separate, controlled step.

Automate the metric gate: raising the canary weight must depend on signals, not on "an engineer looked and it seems fine".

Tie stop criteria to an error-budget policy and burn rate, not to arbitrary thresholds.

Create flags with a removal date: a removal task in the backlog, an expiration, or a limit on the number of flags in the system. Flag hygiene is part of the Definition of Done.

References

Source map: Fowler/Hodgson/Sato anchor the terminology for feature flags, blue-green, and canary releases; Argo Rollouts documents progressive delivery in Kubernetes; the SRE Book anchors release-engineering discipline. Canary stop thresholds, traffic steps, and metric sets are system-specific policy, not a universal recipe.

Related chapters

SLI / SLO / SLA and Error Budgets - Provides the metrics and error budget that automated canary analysis and rollout stop criteria rely on.
Incident Management Discipline - Explains what to do when a rollout breaks through the guardrails anyway: response, roles, and the post-incident review.
Observability & Monitoring Design - Covers the metrics, logs, and traces — the signals without which canary analysis and metric gates are blind.
Service Mesh Architecture - Describes network-level traffic management through which canaries and blue-green get precise traffic splitting.