System Design Space

Updated: June 24, 2026 at 1:28 PM

Backpressure and Flow Control

expert

How backpressure keeps a fast producer from drowning a slow consumer: bounded queues as a pressure signal, the four reactions to overload (block, shed, degrade, scale), the pull/credit scheme and Reactive Streams request(n), TCP flow and congestion control with Little's Law, service-level concurrency limits, and backpressure in Kafka, Flink, and gRPC/HTTP/2.

Any pipeline is a producer, a consumer, and a buffer between them. As long as the producer is no faster on average than the consumer, the buffer breathes. But once their speeds diverge, an unbounded queue does not lose data; it accumulates latency: response time grows, memory grows, and the system dies with OOM or chokes on bufferbloat.

Backpressure is the "slow down" signal that flows upstream along the pipeline: from an overloaded consumer back to a fast producer. A bounded queue makes overload explicit and local, and from there a system has exactly four reactions: block (pay with latency), shed the excess (pay with loss), degrade quality, or scale out (pay with money).

The same credits and windows recur at every level — TCP rwnd and slow start, request(n) from Reactive Streams, credit-based flow control in Flink, HTTP/2 windows, consumer lag in Kafka, and Netflix's adaptive concurrency limits. The chapter ties these to the neighboring chapters on resilience and distributed queues, and rests on observability: without queue and lag metrics, overload is only visible after the fact.

Practical value of this chapter

Bound queues instead of growing the buffer

A large buffer only postpones OOM and worsens the latency tail. Replace an unbounded queue with a bounded one: overflow is itself the backpressure signal. By Little's Law (L = λ × W), capping queue length L caps the average latency W at a fixed throughput λ.

Choose your payment currency deliberately

Under overload you cannot optimize everything at once. Block where data must not be lost (ETL, logs), drop where freshness matters and the latency budget is alive (metrics, requests near a timeout). Drop early and by priority: head drop of stale items beats tail drop of fresh ones.

Prefer pull and the credit scheme

The pull model (Kafka poll, request(n) from Reactive Streams) gives backpressure by construction: no request, no send. Credits and windows are one idea at every level: TCP rwnd, HTTP/2 windows, Flink credits. Push without a separate backpressure mechanism silently accumulates the queue to failure.

At the service level: adaptive limits and observability

A fixed RPS limit does not know current capacity; a concurrency limit is closer to the essence of saturation. Netflix concurrency-limits tunes it from rising latency (Vegas, Gradient2), like cwnd in TCP. Everything rests on metrics: queue lengths, consumer lag, pool saturation, and the fraction of shed load.

Neighbor chapter

Resilience patterns

Timeouts, circuit breakers, and bulkheads protect the system once overload has already happened. Backpressure acts earlier — it stops a fast producer from drowning a slow consumer.

Compare the approaches

The neighbor chapter on resilience collects defensive patterns for when something has already gone wrong: a timeout cuts off hung calls, a circuit breaker stops hammering a downed dependency, and a bulkhead keeps one overloaded pool from dragging the others down with it. That is a reaction to failure.

This chapter is about the layer below and before. Backpressure is the "slow down" signal that flows upstream along the pipeline: from an overloaded consumer back to a fast producer. The goal is to keep a mismatch in their speeds from turning into an endlessly growing queue, rising latency, and eventually OOM. Overload is better prevented than later quenched with circuit breakers.

The natural buffer between producer and consumer is the message queue from the neighboring chapter on distributed queues. That is exactly where you see why a bounded queue and consumer lag are not a bug but a pressure mechanism: a full queue is itself the signal that it is time to slow the input or shed the excess.

Backpressure and flow control is about an honest answer to "what do we do when more data arrives than we can process". There are exactly four options: slow down the source, shed some load, degrade quality, or scale out. What you cannot do is pretend the buffer is infinite.

Where the buffer lives

Distributed message queue

The queue is the pipeline's main buffer. The queues chapter shows how bounded capacity and consumer lag become a pressure signal.

The problem: speed mismatch and the "just add a buffer" myth

Any pipeline is a producer, a consumer, and a buffer between them. As long as the producer's average rate stays below the consumer's rate, the buffer breathes: it fills and drains. Trouble begins when the producer is steadily faster. Then the buffer grows monotonically, and the only question is what bursts first.

Unbounded queue → rising latency

If the queue is unbounded, it does not lose data — it accumulates time. Every new item waits for all the previous ones, and the latency from arrival to processing grows linearly with queue length. The system formally "works," but answers ever more slowly.
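The linear growth is easy to check with arithmetic. The sketch below assumes a steady arrival rate, a steady service rate, and FIFO order; the function names and the rates (120 and 100 items per second) are hypothetical values chosen for illustration.

```python
def backlog_after(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Items waiting after `seconds` when arrivals steadily outpace service."""
    return max(0.0, (arrival_rate - service_rate) * seconds)


def wait_for_new_item(backlog: float, service_rate: float) -> float:
    """Seconds a newly arrived item waits behind the existing backlog (FIFO)."""
    return backlog / service_rate


# Hypothetical rates: 120 items/s arrive, 100 items/s are processed.
for t in (10, 60, 600):
    backlog = backlog_after(t, arrival_rate=120, service_rate=100)
    wait = wait_for_new_item(backlog, service_rate=100)
    print(f"after {t}s: backlog={backlog:.0f} items, new item waits {wait:.0f}s")
```

Under these assumptions a 20% excess of speed turns into a two-minute wait within ten minutes, while the system still formally "works."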

Memory → OOM and cascade

A growing queue is growing memory. Sooner or later the process hits its limit and dies with OOM, losing the whole buffer at once. The restart meets the same load and dies again — overload turns into a crash loop.

Bufferbloat → useless freshness

Large buffers all along the path (bufferbloat) keep stale data in flight. By the time a request is processed, the client has often already given up on a timeout: we spend resources on answers nobody is waiting for.

Why "just add a buffer" does not work: a buffer smooths bursts but does not cure a steady excess of speed. If the producer is faster on average, a buffer of any size only postpones the failure and increases the latency up to that point. An infinite buffer does not exist, and a large finite one is just a deferred OOM with a worse latency tail. The cure is not buffer size but a signal back: backpressure.

Backpressure: a bounded queue as the pressure mechanism

Backpressure is not a separate component but a property of the protocol between adjacent links of a pipeline. The idea is simple: the consumer tells the producer how much it can accept, and the producer does not send more. The "slow down" signal propagates up the chain — from the slowest link to the fastest.

A bounded queue is the signal

Replace an unbounded queue with a bounded one, and backpressure appears on its own. When the queue is full, an attempt to enqueue one more item must either block until space frees up or be rejected. Either outcome is pressure: the producer learns that the consumer cannot keep up at the moment of overflow, not after minutes of memory growth.
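A minimal sketch of the rejecting variant, using Python's standard queue module. The helper name offer and the capacity of 3 are assumptions made for the example, not recommendations.

```python
import queue


def offer(q: queue.Queue, item) -> bool:
    """Try to enqueue without waiting; False means the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


buffer = queue.Queue(maxsize=3)  # capacity 3 is an arbitrary illustrative value
results = [offer(buffer, i) for i in range(5)]
print(results)  # the last two offers are refused once the queue holds 3 items
```

The False return value is the pressure signal: the producer finds out at the exact item that did not fit, and can decide to wait, drop, or push the refusal further upstream.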

Pressure flows against the data

Data flows down (producer → consumer), the capacity signal flows up (consumer → producer). If the consumer slows, its input queue fills, the stage feeding it blocks on writes and in turn stops reading its own input, and link by link the pressure reaches the very first producer. Ideally it then either slows down at the source or honestly rejects at the entrance.

Key idea: backpressure makes overload explicit and local. Instead of "memory is growing somewhere and latency drifts across the whole system," you get a specific overflowing link and a specific signal at the entrance. From there it is an engineering decision — what to do with the rejected load — rather than a fight with hidden degradation.

Four strategies under overload

When the queue is full, a system has exactly four principled reactions. They are not mutually exclusive — real systems combine them across layers — but each pays in its own currency: latency, loss, quality, or money.

A full queue: four exits

A fast producer has hit the bounded queue of a slow consumer. What to do with the excess — and what it costs.

[Figure: Four strategies under overload. A fast producer feeds a full queue in front of a slow consumer, and the backpressure signal "slow down" flows back to the producer. Exits: 1. Block (producer waits, pay with latency); 2. Drop (shed excess, pay with loss); 3. Degrade (sample or coarsen, pay with accuracy); 4. Scale (add consumers, pay with money).]

1. Block / throttle

The producer waits until space frees up in the queue. Nothing is lost, and pressure is honestly passed upstream. The price is rising latency and the risk that blocking spreads to those who should not block (e.g. a thread serving other requests). Good inside a process and for hand-rolled pipelines, dangerous at the boundary with external clients.
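Inside a single process, blocking can be sketched with a bounded queue.Queue and a blocking put. The function name run_blocking_pipeline, the capacity of 2, and the 1 ms of simulated work are assumptions for illustration.

```python
import queue
import threading
import time


def run_blocking_pipeline(n_items: int, capacity: int) -> tuple[list[int], int]:
    """Push n_items through a bounded queue; return (consumed items, max depth seen)."""
    q: queue.Queue = queue.Queue(maxsize=capacity)
    consumed: list[int] = []
    max_depth = 0

    def consumer() -> None:
        while True:
            item = q.get()
            if item is None:       # sentinel: the producer is done
                return
            time.sleep(0.001)      # simulated slow work
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        q.put(i)                   # blocks while the queue is full
        max_depth = max(max_depth, q.qsize())
    q.put(None)
    worker.join()
    return consumed, max_depth


consumed, max_depth = run_blocking_pipeline(n_items=50, capacity=2)
print(len(consumed), max_depth)
```

Nothing is lost and the queue never exceeds its capacity, but the producer loop now runs at the consumer's pace. That is the latency being paid.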

2. Shed (load shedding)

The excess is rejected. Variants differ in what to drop: tail drop (reject new items at the entrance), head drop (discard the oldest, already-stale ones), by priority (sacrifice unimportant traffic first). Loss in exchange for bounded latency and a live service. Rejecting cheaply and early beats rejecting after expensive work.
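The difference between tail drop and head drop fits in a few lines. This is a sketch with hypothetical helper names, using a plain deque as the buffer.

```python
from collections import deque


def tail_drop(buffer: deque, item, capacity: int) -> bool:
    """Reject the new item when full; returns True if the item was accepted."""
    if len(buffer) >= capacity:
        return False
    buffer.append(item)
    return True


def head_drop(buffer: deque, item, capacity: int):
    """Always accept the new item; when full, evict and return the oldest one."""
    evicted = buffer.popleft() if len(buffer) >= capacity else None
    buffer.append(item)
    return evicted


tail: deque = deque()
head: deque = deque()
for i in range(5):
    tail_drop(tail, i, capacity=3)
    head_drop(head, i, capacity=3)

print(list(tail))  # oldest items survive
print(list(head))  # freshest items survive
```

With the same input, tail drop keeps the three oldest items and head drop keeps the three newest, which is why head drop suits data that goes stale.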

3. Sample / degrade

A special case of shedding: process not everything but a representative sample, or coarsen the answer. Metrics and traces are sampled, analytics run on part of the stream, search returns an incomplete but fast result. Accuracy drops, but the service stays within the latency budget. Fits where an approximate answer on time beats an exact one late.
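One simple way to degrade is deterministic sampling: keep one item in k, with k derived from how far the arrival rate exceeds capacity. The sketch below assumes both rates are known; sampling_step and sample_every are hypothetical names.

```python
import math


def sampling_step(arrival_rate: float, capacity: float) -> int:
    """Keep one item in this many so the sampled rate fits within capacity."""
    return max(1, math.ceil(arrival_rate / capacity))


def sample_every(stream, keep_one_in: int):
    """Yield every keep_one_in-th item of the stream, starting with the first."""
    for index, item in enumerate(stream):
        if index % keep_one_in == 0:
            yield item


step = sampling_step(arrival_rate=1000, capacity=300)  # hypothetical rates
sampled = list(sample_every(range(12), step))
print(step, sampled)
```

Real systems often sample by trace ID or by hash rather than by position, so that related events are kept or dropped together; positional sampling is used here only because it is the shortest to show.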

4. Scale out

Add consumers and raise total throughput. The radical cure, but not instant: autoscaling reacts in tens of seconds or minutes, while overload arrives in seconds. So scaling is a strategy for the next wave of load, and block or drop covers the here and now until new capacity spins up.

Pull in action

Distributed message queue

The consumer's pull model from the queues chapter is backpressure by construction: the consumer decides when and how much to fetch.

Pull vs push: the credit scheme and Reactive Streams

Where the "slow down" signal comes from depends on who controls the pace. In the push model the producer sends as soon as it is ready — and backpressure has to be bolted on separately. In the pull model the consumer requests data itself, and pressure is built in by construction: no request, no send.

Push: fast, but risky

The producer does not ask permission. That is minimal latency and maximal throughput while the consumer keeps up. As soon as it falls behind, the difference settles in the buffer — and without a separate backpressure mechanism, push silently accumulates the queue to failure.

Pull / credit scheme: pressure by construction

The consumer issues the producer a "credit": a request for N items. The producer sends no more than N and waits for a new credit. This is exactly request(n): demand drives the flow, and the number of items in flight never exceeds the credit the consumer has granted.

Reactive Streams: a standard for async backpressure

The Reactive Streams specification (version 1.0, adopted into JDK 9 as java.util.concurrent.Flow) formalizes the credit scheme through four interfaces. Publisher produces elements on subscriber demand, Subscriber receives them, Subscription mediates one publisher-subscriber pair, and Processor combines both roles in the middle of a pipeline.

The heart of the mechanism is the Subscription.request(long n) method. By the spec a subscriber must signal demand via request(n) to receive onNext at all; a publisher may not send more than requested, and demand is additive across calls. So an unbounded stream becomes an on-demand flow with bounded buffers across any async boundary — exactly the "non-blocking back pressure" the spec sets as its goal.
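The request(n) contract can be sketched in a few lines. The class below is a hypothetical, synchronous Python stand-in, not the Reactive Streams API: it only models additive demand and the rule that no more than the requested number of items is delivered. Raising ValueError on a non-positive request is a simplification of the sketch.

```python
class CreditSubscription:
    """Hypothetical stand-in for a Subscription: emits only against demand."""

    def __init__(self, source, on_next):
        self._source = iter(source)
        self._on_next = on_next
        self._demand = 0          # outstanding credits
        self._emitting = False

    def request(self, n: int) -> None:
        if n <= 0:
            raise ValueError("demand must be positive")
        self._demand += n         # demand is additive across calls
        if self._emitting:        # re-entrant call from on_next: the outer loop continues
            return
        self._emitting = True
        try:
            while self._demand > 0:
                try:
                    item = next(self._source)
                except StopIteration:
                    return        # source exhausted
                self._demand -= 1
                self._on_next(item)
        finally:
            self._emitting = False


received: list[int] = []
sub = CreditSubscription(range(100), received.append)
sub.request(2)
sub.request(3)
print(received)  # five items delivered, although the source holds 100
```

The source has 100 items ready, but only five cross the boundary, because only five credits were issued.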

Network foundation

Load balancing

TCP already solves the fast-sender, slow-receiver problem with a window and credits. The same ideas resurface at the application and service level.

Foundations: TCP, credits, and Little's Law

Backpressure was not invented in reactive libraries — it is a rediscovery of techniques that are decades old. Two foundations are worth keeping in mind: how TCP regulates flow, and how Little's Law ties queue capacity to latency.

TCP flow control: the receive window

In TCP the receiver advertises rwnd in every segment: how many bytes it is still ready to accept. The sender may not keep more than rwnd of unacknowledged data "in flight." This is window-based flow control, akin to a credit scheme: when the receiver's buffer fills up, the advertised window drops to zero and the sender stops. That is exactly the "slow down" signal needed against a slow receiver.
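The window rule can be modeled as a toy sender that tracks bytes in flight. This is an illustrative sketch, not TCP: the class name WindowedSender and its methods are hypothetical, and sequence numbers, retransmission, and window probes are left out.

```python
class WindowedSender:
    """Toy sender that respects a receiver-advertised window, in bytes."""

    def __init__(self, rwnd: int):
        self.rwnd = rwnd
        self.in_flight = 0        # sent but not yet acknowledged

    def sendable(self) -> int:
        return max(0, self.rwnd - self.in_flight)

    def send(self, nbytes: int) -> int:
        """Send as much as the window allows; returns the bytes actually sent."""
        sent = min(nbytes, self.sendable())
        self.in_flight += sent
        return sent

    def on_ack(self, acked: int, advertised_rwnd: int) -> None:
        """An ACK frees in-flight bytes and carries the receiver's new window."""
        self.in_flight -= min(acked, self.in_flight)
        self.rwnd = advertised_rwnd


sender = WindowedSender(rwnd=1000)
first = sender.send(1500)                   # only the window's worth goes out
sender.on_ack(400, advertised_rwnd=0)       # receiver buffer is full: zero window
stalled = sender.send(100)                  # nothing may be sent
sender.on_ack(600, advertised_rwnd=800)     # receiver drained, window reopens
resumed = sender.send(1500)
print(first, stalled, resumed)
```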

TCP congestion control

Flow control protects the receiver, while congestion control protects the network between them. Per RFC 5681 the sender keeps a second window, cwnd, and the pace is bounded by the minimum of cwnd and rwnd. Slow start grows cwnd, probing for capacity, while a packet loss — a congestion signal — cuts the window. Same logic: probe for more, back off under overload.
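The probe-and-back-off logic can be sketched as a per-round-trip update. This is a deliberately simplified model counted in whole segments, not a faithful RFC 5681 implementation: it ignores the difference between a timeout and fast retransmit, and the function names are hypothetical.

```python
def next_cwnd(cwnd: int, ssthresh: int, loss: bool) -> tuple[int, int]:
    """One round-trip step, in whole segments; returns (cwnd, ssthresh)."""
    if loss:
        ssthresh = max(cwnd // 2, 2)
        return ssthresh, ssthresh                 # simplified: resume from the halved window
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh  # slow start: double per round trip
    return cwnd + 1, ssthresh                     # congestion avoidance: +1 per round trip


def send_window(cwnd: int, rwnd: int) -> int:
    """The sender is bounded by the smaller of the two windows."""
    return min(cwnd, rwnd)


cwnd, ssthresh = 1, 16
trace = []
for loss in (False, False, False, False, False, True, False):
    cwnd, ssthresh = next_cwnd(cwnd, ssthresh, loss)
    trace.append(cwnd)
print(trace)
```

The trace grows exponentially, switches to linear growth at the threshold, and is cut in half on the loss signal.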

Little's Law: L = λ × W, where L is the average number of items in the system, λ the arrival rate, and W the average time spent. The direct engineering consequence: at a fixed throughput λ, capping queue length L caps the average latency at W = L / λ, and vice versa. So a bounded queue is not just OOM protection but a way to set an upper bound on latency: for a predictable tail, bound L and shed the excess.
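Little's Law can be turned into a sizing rule in both directions. The sketch below assumes a steady throughput; the function names and the numbers are illustrative.

```python
import math


def max_average_wait(queue_capacity: int, throughput_per_s: float) -> float:
    """W = L / λ: average time in the system when the queue sits at its cap."""
    return queue_capacity / throughput_per_s


def capacity_for_wait(target_wait_s: float, throughput_per_s: float) -> int:
    """L = λ × W: the largest queue that keeps the average wait within target."""
    return math.floor(target_wait_s * throughput_per_s)


# Hypothetical numbers: a 500-slot queue drained at 100 items/s.
print(max_average_wait(500, 100.0))     # seconds
# A 250 ms budget at 200 items/s.
print(capacity_for_wait(0.25, 200.0))   # slots
```

Read the second function as a design check: if the latency budget is 250 ms and the consumer handles 200 items per second, any queue deeper than 50 slots only stores work that will be late.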

Neighbor chapter

Timeouts and bulkhead

At the service level backpressure meets resilience: concurrency limits, timeouts, and bulkheads are pressure and protection at once.

At the service level: limits, timeouts, priorities

Between microservices there is no shared buffer with request(n), but the idea is the same: bound the input so a slow service does not drag its callers down. Here backpressure is expressed through limits and timeouts.

Rate limiting vs concurrency limit

A rate limiter caps the flow by frequency (requests per second): token bucket, leaky bucket, sliding window. But a fixed RPS limit does not know what load the system can take right now. A concurrency limit (how many requests are in flight at once) is closer to the truth: by Little's Law the number in flight equals rate times latency, so it rises as soon as the service slows down, even when the request rate has not changed.
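A fixed concurrency limit is little more than a non-blocking semaphore. The sketch below uses a hypothetical ConcurrencyLimiter class built on the standard threading module, plus a one-line Little's Law check; the limit of 2 and the rates are illustrative.

```python
import threading


class ConcurrencyLimiter:
    """Admit at most `limit` requests at once; reject the rest immediately."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def expected_in_flight(rate_per_s: float, latency_s: float) -> float:
    """Little's Law: concurrency = arrival rate × time in the system."""
    return rate_per_s * latency_s


limiter = ConcurrencyLimiter(limit=2)
admitted = [limiter.try_acquire() for _ in range(3)]  # third request is rejected
limiter.release()                                     # one request finishes
admitted_after_release = limiter.try_acquire()

print(admitted, admitted_after_release)
print(expected_in_flight(200, 0.25), expected_in_flight(200, 1.0))
```

The last line shows why concurrency tracks saturation: at an unchanged 200 requests per second, a latency increase from 0.25 s to 1 s quadruples the number of requests in flight.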

Adaptive limits: learn from TCP

The Netflix concurrency-limits library carries congestion-control logic over to services: the concurrency limit is not set by hand but discovered from rising latency. The Vegas (latency-based) and Gradient2 (divergence of moving averages of latency) algorithms catch the moment a queue starts to build and lower the limit — like cwnd on packet loss. This is backpressure that finds its own capacity.
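The gradient idea can be sketched as a single update rule. The function below is loosely inspired by that approach and is not the library's code or API: the name adjust_limit, the clamping bounds, and the headroom term are all assumptions made for the illustration.

```python
def adjust_limit(limit: float, rtt_no_load_ms: float, rtt_observed_ms: float,
                 headroom: float = 2.0, min_limit: float = 1.0,
                 max_limit: float = 200.0) -> float:
    """Shrink the limit when observed latency rises above the no-load latency."""
    # gradient is 1.0 when latency is at its no-load level and falls toward 0.5
    # as latency grows; the floor keeps one bad sample from collapsing the limit.
    gradient = max(0.5, min(1.0, rtt_no_load_ms / rtt_observed_ms))
    # headroom lets the limit creep upward and probe for capacity when healthy.
    new_limit = limit * gradient + headroom
    return max(min_limit, min(max_limit, new_limit))


print(adjust_limit(100, rtt_no_load_ms=50, rtt_observed_ms=50))   # healthy: probes upward
print(adjust_limit(100, rtt_no_load_ms=50, rtt_observed_ms=100))  # latency doubled: backs off
```

A production algorithm also smooths the latency samples and re-estimates the no-load latency over time; both are omitted here.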

Timeouts and bulkhead

A timeout is head drop in time: a request that has hung longer than its budget is discarded, freeing space and keeping bufferbloat from accumulating stale calls. A bulkhead bounds each resource pool separately, so overload of one dependency does not eat the threads needed by others. Both come from the neighboring chapter on resilience.
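Discarding stale work at dequeue time can be sketched as follows. The sketch assumes each pending request carries an absolute deadline; the function name and the tuple layout are hypothetical.

```python
from collections import deque


def next_live_request(pending: deque, now: float):
    """Pop requests in FIFO order, discarding those whose deadline has passed.

    Returns (request_id or None, number of expired requests dropped).
    """
    dropped = 0
    while pending:
        request_id, deadline = pending.popleft()
        if deadline > now:
            return request_id, dropped
        dropped += 1
    return None, dropped


# Hypothetical queue of (request_id, absolute deadline in seconds).
pending = deque([("a", 1.0), ("b", 2.0), ("c", 5.0)])
result = next_live_request(pending, now=3.0)
print(result)  # "a" and "b" expired while waiting, "c" is still worth serving
```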

Priority queues

When shedding is unavoidable, it matters that you shed the right thing. Priority queues and traffic classes let you sacrifice background and retriable work under pressure while preserving user and payment requests. Load shedding without priorities hits the important and the trivial alike; with priorities it becomes managed degradation.
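Priority shedding can be sketched as a bounded buffer that evicts its least important entry when full. The class PriorityShedder, the convention that a higher number means more important, and the request names are assumptions for the example.

```python
import heapq


class PriorityShedder:
    """Bounded buffer that sheds the least important item when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []   # min-heap: the lowest-priority entry sits at the root
        self._seq = 0     # tie-breaker so that items themselves are never compared

    def offer(self, priority: int, item):
        """Add an item; returns the item that was shed, or None if nothing was."""
        entry = (priority, self._seq, item)
        self._seq += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return None
        if priority <= self._heap[0][0]:
            return item   # the newcomer is no more important than anything held
        return heapq.heapreplace(self._heap, entry)[2]

    def held(self) -> list:
        """Items currently held, most important first."""
        return [item for _, _, item in sorted(self._heap, reverse=True)]


shedder = PriorityShedder(capacity=2)
shed = [
    shedder.offer(1, "batch-report"),
    shedder.offer(5, "payment"),
    shedder.offer(3, "user-search"),   # evicts the background job
    shedder.offer(0, "prefetch"),      # rejected on arrival
]
print(shed, shedder.held())
```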

A practical order of defense at a service's entrance: first a timeout (do not hold the hopeless), then a concurrency limit (do not admit more than you can take), then priority shedding (if it is still too much — sacrifice the unimportant), and only then scaling (for the next wave of load). The circuit breaker from the neighbor chapter sits even further out — for when the dependency has already failed.

Streaming: Kafka, Flink, gRPC / HTTP/2

In streaming systems backpressure is not optional but load-bearing: data flows continuously, and without backpressure any slow operator would blow up the pipeline. Three examples show different implementations of one idea.

| System | Flow mechanism | Overload signal |
| --- | --- | --- |
| Apache Kafka | Pull model: the consumer polls the broker itself and commits offsets; producer and consumer are decoupled through the log. | Growing consumer lag: the gap between the log end and the committed offset. Lag is the pressure metric. |
| Apache Flink | Credit-based flow control in the network stack: the receiver announces credits to the sender (1 buffer = 1 credit); the sender sends no more than the available credits. | Credits drop to zero and the sender stops; backpressure hits the overloaded logical channel precisely, without blocking its neighbors on the shared TCP connection. |
| gRPC / HTTP/2 | HTTP/2 windowed flow control: every stream and the connection have a window; the receiver sends WINDOW_UPDATE as it reads. The same credit idea as the TCP window. | The stream window reaches zero: the sender cannot send DATA frames until the receiver drains and updates the window. |

Note the common motif: credits and windows. Flink credits, HTTP/2 windows, TCP rwnd, and Reactive Streams request(n) are one and the same credit scheme at different levels. Kafka stands apart: it does not pass credits but decouples the sides through a log and makes backpressure observable through consumer lag — pressure does not block the producer but is visible as the lag grows.
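Consumer lag itself is simple arithmetic over offsets. The sketch below works on plain dictionaries of partition to offset and does not call any Kafka client; the function name and the offsets are hypothetical, and treating a partition with no committed offset as starting from 0 is an assumption of the sketch.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: max(0, end - committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for three partitions of one topic.
log_end = {0: 1500, 1: 900, 2: 40}
committed = {0: 1200, 1: 900}        # partition 2 has no commit yet

lag = consumer_lag(log_end, committed)
print(lag, sum(lag.values()))
```

A single lag value says little; the signal is the trend. Lag that keeps growing across samples means the consumer group is steadily slower than the producers.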

Neighbor chapter

Observability and monitoring

Backpressure lives on metrics: queue lengths, consumer lag, pool saturation. Without them, overload is visible only after the fact.

Trade-offs: latency vs loss vs cost

Backpressure does not abolish overload — it turns it from unmanaged into managed. But for that you have to deliberately choose what to pay with. Three currencies compete, and you cannot optimize all three at once.

Block → pay with latency

Nothing is lost, but the producer waits. Fits inside a process and where data must not be lost (payments, logs). Dangerous at the user boundary: blocking turns into a hung request and may propagate upstream in unexpected ways.

Drop → pay with loss

Latency is bounded, the service stays alive, but some data is lost. Fits metrics, telemetry, non-critical requests — where a fresh approximate result beats an exact late one. Drop early (at the entrance) and by priority.

Scale → pay with money

Neither loss nor latency — but more expensive and not instant. Scaling covers steady growth, not second-long spikes: while new capacity spins up you still need block or drop as a buffer in time.

Where to drop, where to block — and how to see it

  • Block where data must not be lost and latency is acceptable: ETL, transaction logs, internal pipelines.
  • Drop where freshness matters and the latency budget is alive: user requests near a timeout, metrics, telemetry — a fast rejection beats a slow success.
  • Drop early and cheaply: rejecting at the entrance is cheaper than after expensive work; head drop of stale items beats tail drop of fresh ones.
  • Observability is mandatory: queue lengths, consumer lag, pool saturation, the fraction of shed load. Backpressure without metrics is overload you only see through rising latency and OOM.

Key takeaways

  • A producer/consumer speed mismatch is inevitable; an unbounded queue does not lose data, it accumulates latency — all the way to OOM and bufferbloat.
  • "Just add a buffer" does not cure a steady excess of speed: a large buffer only postpones failure and worsens the latency tail.
  • Backpressure is the "slow down" signal flowing upstream; a bounded queue makes it explicit and local.
  • Under overload there are four reactions: block (pay with latency), shed (pay with loss), degrade, or scale out (pay with money).
  • The pull model and the request(n) credit scheme give backpressure by construction; Reactive Streams formalizes it through Publisher/Subscriber/Subscription.
  • Credits and windows are one idea at every level: TCP rwnd, HTTP/2 windows, Flink credits; Kafka makes pressure observable through consumer lag.
  • At the service level backpressure is concurrency limits (adaptive, as at Netflix), timeouts, bulkheads, and priority shedding — all resting on observability.

Sources and further reading

Source map: Reactive Streams supports the non-blocking back-pressure contract and request(n); Flink supports credit-based flow control in a streaming system; Netflix concurrency-limits supports adaptive concurrency limits; RFC 5681 supports TCP congestion control. Queue thresholds, window sizes, drop/degrade policy, and concurrency limits are per-system operating parameters, not universal constants.

Related chapters

  • Resilience patterns - The neighbor chapter on how a system holds up under failure: timeouts, circuit breakers, bulkheads, and retries. Backpressure is the missing layer that keeps overload from happening in the first place.
  • Distributed message queue - A queue is the main buffer between producer and consumer. This is where a bounded queue and consumer lag turn into a natural "slow down" signal.
  • Load balancing - Scaling and spreading requests is one of the four reactions to overload. A balancer spreads the flow, but without backpressure it just drowns the consumers more evenly.
  • Observability and monitoring - Backpressure lives on metrics: queue lengths, consumer lag, saturation. Without observability, overload is visible only after the fact — through rising latency and OOM.
