Load Balancing — System Design Space

Load balancing stops being just a proxy in front of a service as soon as degradation, rollout, and failover become real concerns.

This chapter ties together the L4 vs L7 choice, instance health management, gradual traffic ramp-up, global routing, and service mesh behavior into one operational picture.

In interviews and architecture discussions, that framing helps you explain not only how traffic is distributed, but also how the system survives releases, failures, and growth.

Practical value of this chapter

Request Path

Treat balancing as the full request path: entry point, instance health, timeouts, and degradation behavior.

Layer Choice

Separate L4 and L7 responsibilities by traffic shape, routing needs, and the cost of extra logic.

Operational Stability

Plan gradual traffic ramp-up, connection draining, and failover behavior so releases do not turn into incidents.

Interview Framing

Tie business requirements to balancing choices and explain which failures the design is supposed to survive.

Request route

Traffic Path Through Balancing Layers

Load balancing is a stack of decisions: the global entry point selects a region, local balancers distribute requests, and the service mesh refines internal routing.

Global entry (region and failover)

DNS, GSLB, or anycast selects the nearest or preferred entry point.

L4 / L7 (where to send the request)

A transport or application balancer chooses the pool by protocol, route, and health.

Instance health (whether traffic is safe)

Health checks, gradual ramp-up, and connection draining protect against sharp drops.

Service mesh (internal control)

Inside the cluster, routing policies, retries, and mTLS apply to service-to-service traffic.

Main idea

A good balancing design describes not only the normal route, but also behavior during releases, degradation, and failover.

Reference

Envoy LB Overview

Detailed guidance on load balancing policies, outlier detection, and locality-aware traffic steering.


Load balancing matters not because it distributes requests, but because it keeps the request path predictable during degradation, rollout, and failover. Spreading requests across nodes is only half the job. The rest is decided up front: who makes the routing decision, how the system notices a degrading instance, and what happens to traffic while the infrastructure underneath it changes. Leave those answers unwritten and the first rollout or dependency failure turns into a user-visible incident.

Decision Point

Choose where routing decisions happen: L4 transport or L7 application layer.

Health Policy

Combine active + passive checks to detect both hard failures and soft degradation.

Graceful Rollout

Use slow start and connection draining in every release, not just during incidents.

Global Steering

Separate global region steering from local balancing inside each selected region.

A practical 4-step playbook

Step 1: Lock in your L4/L7 model

For each ingress flow, decide whether it needs L7 policy control or should prioritize minimal L4 latency and overhead.

Step 2: Define a health contract

Define active checks, passive signals, and ejection thresholds so degraded instances leave rotation before errors spread.

Step 3: Enforce rollout-safe mechanics

Apply slow start, readiness gating, and connection draining so deployments do not break long-lived traffic.

Step 4: Split global and local balancing

Use DNS GSLB or Anycast for inter-region routing, and a regional LB or mesh for fine-grained control inside a region.

L4 vs L7: choosing the balancing layer

What this is about: choosing whether routing decisions are made at the transport layer (L4) or at the application layer (L7).

Why this comes first: this is the base decision for the rest of the chapter. Pick a layer and you lock in the routing rules you can write, the cost of handling each request, how much of the traffic you can see, and the resilience of the whole setup. Reversing it later is expensive: you are not changing a setting, you are moving where the decision gets made at all.

Best-fit scenarios: L4 usually fits stateful TCP services such as DB/cache/broker workloads, while L7 fits HTTP/gRPC APIs with canaries, path routing, and richer routing policy.

Criteria	L4	L7
OSI layer	L4 (TCP/UDP): routing by IP and port without looking into HTTP semantics.	L7 (HTTP/gRPC): routing by path, host, headers, and cookies.
Routing flexibility	Lower: usually simple strategies such as hash, leastconn, or round-robin with little request awareness.	Higher: canaries, A/B flows, sticky sessions, rate limits, and richer policy-based routing.
Performance profile	Lower CPU and latency overhead, strong for high-throughput TCP traffic.	More per-request logic, but much finer control over how traffic is handled.
Typical use cases	Databases, Redis, MQTT, binary protocols, and TCP paths where minimal latency matters most.	API gateways, web apps, gRPC services, and product-level routing policies.
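
To make the table concrete, here is a minimal Python sketch of the two decision styles. It is not modeled on any particular proxy: the pool names, addresses, and the `X-Canary` header are hypothetical. `pick_l4` sees only connection metadata, while `pick_l7_pool` sees the parsed request.

```python
import hashlib

# Hypothetical pools, for illustration only.
POOLS = {
    "api": ["10.0.2.11:8080", "10.0.2.12:8080"],
    "api-canary": ["10.0.2.21:8080"],
    "static": ["10.0.2.31:8080"],
}


def pick_l4(src_ip: str, src_port: int, backends: list[str]) -> str:
    """L4 style: choose a backend from connection metadata only."""
    key = f"{src_ip}:{src_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def pick_l7_pool(path: str, headers: dict[str, str]) -> str:
    """L7 style: choose a pool from request content (path and headers)."""
    if path.startswith("/static/"):
        return "static"
    if path.startswith("/api/") and headers.get("X-Canary") == "1":
        return "api-canary"
    return "api"
```

The L4 function cannot express the canary rule at all, because it never sees a path or header. That is the flexibility gap the table describes, and the extra parsing in the L7 function is where the per-request cost comes from.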

L4 example (HAProxy, TCP)

Best for PostgreSQL, Redis, and other stateful TCP services where HTTP-aware routing is irrelevant.

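frontend ft_postgres
  bind *:5432
  # Forward raw TCP; no HTTP parsing happens on this path.
  mode tcp
  default_backend bk_postgres

backend bk_postgres
  mode tcp
  # Assumes pg-1 and pg-2 can serve the same traffic (for example, read replicas).
  balance leastconn
  option tcp-check
  # Probe every 2s; 3 failed checks mark a server down, 2 passed checks bring it back.
  default-server inter 2s fall 3 rise 2
  server pg-1 10.0.1.11:5432 check
  server pg-2 10.0.1.12:5432 check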

L7 example (Nginx, HTTP)

Reach for this when the route depends on path, headers, and product rules, as in API gateways and other per-request policy decisions.

upstream api_pool {
  least_conn;
  # Passive checks: 3 failed attempts within 10s take a server out of rotation for 10s.
  server 10.0.2.11:8080 max_fails=3 fail_timeout=10s;
  server 10.0.2.12:8080 max_fails=3 fail_timeout=10s;
}

server {
  listen 80;

  location /api/ {
    proxy_pass http://api_pool;
    proxy_set_header X-Request-Id $request_id;
  }

  location /static/ {
    proxy_pass http://static-service:8080;
  }
}

Health checks and safe instance removal

What this is about: backend instance lifecycle in the load balancer: when to add instances, when to eject them, and how to remove them safely.

Why this matters: even the best algorithm fails if traffic still goes to degraded or not-yet-warmed instances. Lifecycle handling is what keeps 5xx and timeout spikes down during deploys and failover.

Best-fit scenarios: autoscaling on Kubernetes, rolling deploys, blue/green deploys, and long-lived connections (WebSocket/gRPC streams) where abrupt shutdown causes user-visible failures.

Active health checks

The load balancer itself probes health endpoints or TCP handshakes on a schedule. You tune the interval, timeout, and rise/fall thresholds so that failing instances leave rotation after a bounded number of missed probes.
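
The rise/fall thresholds can be pictured as a small state machine. The sketch below is an illustration in Python, not any balancer's actual implementation; the class and field names are hypothetical, and the defaults mirror the `fall 3 rise 2` values used in this chapter's HAProxy examples.

```python
from dataclasses import dataclass


@dataclass
class ActiveHealth:
    """Tracks one backend using rise/fall thresholds."""

    fall: int = 3          # consecutive failed probes before marking DOWN
    rise: int = 2          # consecutive passed probes before marking UP
    healthy: bool = True
    streak: int = 0        # consecutive results that disagree with the current state

    def record_probe(self, ok: bool) -> bool:
        """Feed one probe result and return the resulting health state."""
        if ok == self.healthy:
            self.streak = 0            # result agrees with current state: reset
            return self.healthy
        self.streak += 1
        threshold = self.rise if ok else self.fall
        if self.streak >= threshold:
            self.healthy = ok          # enough consecutive evidence: flip state
            self.streak = 0
        return self.healthy
```

A single failed probe never ejects an instance here, which is the point of the thresholds: they trade detection speed for resistance to flapping.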

Passive health checks

Degradation is inferred from live traffic: 5xx, timeouts, resets, and growing latency. Useful when an endpoint is technically up but already overloaded in practice.
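
A minimal sketch of passive ejection, assuming consecutive 5xx responses are the only signal. Real balancers also weigh timeouts, resets, and latency, and the class name and defaults here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassiveHealth:
    """Ejects a backend after too many consecutive failed live requests."""

    max_consecutive_errors: int = 5
    base_ejection_seconds: float = 30.0
    consecutive_errors: int = 0
    ejected_until: float = 0.0

    def record_response(self, status: int, now: float) -> None:
        """Observe one real response; any 5xx counts as a failure."""
        if status >= 500:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                # Take the backend out of rotation for a fixed window.
                self.ejected_until = now + self.base_ejection_seconds
                self.consecutive_errors = 0
        else:
            self.consecutive_errors = 0

    def in_rotation(self, now: float) -> bool:
        """True when the backend may receive traffic at time `now`."""
        return now >= self.ejected_until
```

Unlike an active probe, this reacts only to what real callers experience, so it catches an instance whose health endpoint still answers 200 while its request path is failing.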

Grace period / slow start

New pods receive only a limited share of traffic at first. This reduces cold-start spikes and avoids instant ejection while caches, JIT, and connection pools are still warming up.
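
One way to picture slow start is a weight that ramps up over a warm-up window. The sketch below assumes a linear ramp and a 30-second window; both are illustrative choices, and the function name is hypothetical.

```python
def slow_start_weight(
    full_weight: int,
    seconds_in_rotation: float,
    window_seconds: float = 30.0,
) -> int:
    """Effective weight of an instance that joined `seconds_in_rotation` ago."""
    if seconds_in_rotation >= window_seconds:
        return full_weight
    fraction = max(seconds_in_rotation, 0.0) / window_seconds
    # Never return 0, otherwise the instance would get no traffic and never warm up.
    return max(1, int(full_weight * fraction))
```

Halfway through the window the instance carries half its normal weight, so a cold cache or empty connection pool is exposed to proportionally less traffic.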

Connection draining

When removing an instance from rotation, stop sending new connections but let in-flight traffic finish. This lowers client-visible errors during deploys and failover.
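
Draining reduces to two rules: refuse new work once draining starts, and stop only when nothing is in flight. A minimal sketch with hypothetical names follows; a production setup would also cap the wait with a drain timeout.

```python
class DrainingBackend:
    """Tracks in-flight requests so an instance can leave rotation cleanly."""

    def __init__(self) -> None:
        self.draining = False
        self.in_flight = 0

    def try_accept(self) -> bool:
        """Admit a new request unless the backend is draining."""
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Mark one in-flight request as completed."""
        self.in_flight -= 1

    def start_draining(self) -> None:
        """Stop admitting new requests; existing ones keep running."""
        self.draining = True

    def safe_to_stop(self) -> bool:
        """True once draining has started and nothing is still in flight."""
        return self.draining and self.in_flight == 0
```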

HAProxy: slow start and health policy

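backend app
  balance leastconn
  # Active check: GET /healthz must return 200.
  option httpchk GET /healthz
  http-check expect status 200
  # Probe every 2s; 3 failures eject, 2 successes readmit.
  # slowstart ramps a recovered server back to full weight over 30s.
  default-server inter 2s fall 3 rise 2 slowstart 30s
  server app-1 10.0.3.11:8080 check
  server app-2 10.0.3.12:8080 check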

Kubernetes: readiness and graceful shutdown

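spec:
  # Must cover the preStop sleep plus the application's own shutdown time.
  terminationGracePeriodSeconds: 40
  containers:
    - name: app
      # The pod receives Service traffic only while this probe passes.
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
      lifecycle:
        preStop:
          exec:
            # Delay SIGTERM so endpoint removal can reach load balancers first.
            command: ["/bin/sh", "-c", "sleep 20"]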

Global load balancing across regions (GSLB)

What this is about: balancing across regions and data centers, not just across instances inside one cluster.

Why this matters: local LB does not solve global latency or regional outages. Without GSLB, users may still be routed to a distant or already degraded region.

Best-fit scenarios: multi-region products, global B2C traffic, strict RTO/RPO requirements, regional compliance constraints, and active disaster recovery.

DNS-based GSLB

Authoritative DNS selects a region by latency, geography, weight, and health, then returns that region's endpoint.

Pros: Simple integration and a strong baseline for multi-region routing.

Limitations: Reaction speed is bounded by TTL and resolver caching behavior, which can delay failover.

When to use: Web/API traffic where failover on the order of the record TTL (seconds to tens of seconds with short TTLs, longer if resolvers cache aggressively) is acceptable.
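
The TTL limitation can be estimated with simple arithmetic. The sketch below is a rough model that assumes resolvers honor the TTL exactly and that a region is marked down after a fixed number of failed checks. The function and parameter names are hypothetical.

```python
def dns_failover_window_seconds(
    check_interval: float,
    fall_threshold: int,
    record_ttl: float,
) -> float:
    """Rough upper bound on how long clients may keep hitting a failed region."""
    detection = check_interval * fall_threshold   # time until the region is marked down
    cache_expiry = record_ttl                     # time until cached answers expire
    return detection + cache_expiry
```

With 10-second checks, a threshold of 3, and a 60-second TTL, the estimate is 90 seconds. Raising the TTL to 300 seconds pushes it to 330 seconds, which is why long TTLs and fast failover do not mix.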

Anycast

The same IP is announced from multiple PoPs/regions; BGP directs traffic to the topologically closest edge.

Pros: Fast global distribution and strong resilience at the edge.

Limitations: BGP, not request content or DNS answers, decides where traffic lands, so there is little L7 control. It requires mature networking, observability, and careful anti-flap operations.

When to use: Edge/L4 ingress, DNS, DDoS-resilient front doors, globally distributed APIs with minimal RTT.

Service mesh in Kubernetes

What this is about: balancing service-to-service traffic in microservices, where each internal RPC call becomes its own balancing decision.

Why this appears here: after L4/L7, health policy, and GSLB, the next step is showing how the same principles scale inside Kubernetes via sidecar proxies and control-plane-managed policy.

Best-fit scenarios: dozens or hundreds of services that need unified traffic policy, canary and traffic splitting, mTLS, and centrally managed retries, outlier detection, and locality failover.

In a mesh, balancing runs inside sidecar proxies, typically Envoy. Istio publishes endpoints and policy through the control plane, while the data plane applies load balancing, retries, outlier detection, and locality failover on each service call.

Istio DestinationRule

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments.default.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
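
Envoy documents its least-request policy, for equally weighted hosts, as a power-of-two-choices selection: sample two endpoints and pick the one with fewer active requests. The Python sketch below illustrates that idea only. It is not Envoy's code, the function name is hypothetical, and it samples two distinct endpoints for simplicity.

```python
import random


def pick_least_request(active_requests: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct endpoints at random and return the less loaded one."""
    endpoints = list(active_requests)
    if len(endpoints) == 1:
        return endpoints[0]
    first, second = rng.sample(endpoints, 2)
    return first if active_requests[first] <= active_requests[second] else second
```

Comparing only two candidates avoids scanning every endpoint on each call while still steering traffic away from the busiest instances.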

Istio locality failover

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-locality
spec:
  host: checkout.default.svc.cluster.local
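  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east1
            to: us-central1
    # Locality failover takes effect only when outlier detection is configured.
    # These values mirror the payments example; tune them per service.
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s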
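
The failover rule can be read as "prefer the caller's region, fall back to the mapped region when no local endpoint is healthy". The sketch below is a deliberately simplified, all-or-nothing version in Python with hypothetical names. Real proxies typically shift traffic gradually as the healthy share of local endpoints drops.

```python
def endpoints_for_call(
    caller_region: str,
    endpoints_by_region: dict[str, list[tuple[str, bool]]],
    failover: dict[str, str],
) -> list[str]:
    """Return healthy endpoints in the caller's region, else in its failover region."""

    def healthy(region: str) -> list[str]:
        # Each endpoint is an (address, is_healthy) pair.
        return [addr for addr, ok in endpoints_by_region.get(region, []) if ok]

    local = healthy(caller_region)
    if local:
        return local
    fallback_region = failover.get(caller_region)
    return healthy(fallback_region) if fallback_region else []
```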

Recommendations

Start with L7 for product HTTP APIs, but keep an L4 path for stateful TCP services (DB/cache).
Combine active and passive health checks: active catches hard failures, passive catches degraded behavior.
Apply grace period and connection draining in every rollout, not only during incidents.
For global routing, split responsibilities: DNS GSLB for region choice, local LB/mesh for intra-region balancing.

Common mistakes

Using round-robin by default without validating traffic shape, p99, and connection duration.
Treating Kubernetes readiness/liveness as a complete substitute for L7 passive health checks.
Setting long DNS TTLs while expecting fast failover in multi-region incidents.
Skipping connection draining during deploys and causing 5xx spikes on long-lived requests.
