Rate Limiter — System Design Space

A rate limiter looks like a counter problem until burst traffic, shared state, and limiter failure turn it into a fairness and survivability problem.

The chapter helps compare Token Bucket, Leaky Bucket, and Sliding Window, choose the enforcement point, store quota state, and reason about failure behavior.

For interviews and architecture discussions, it quickly reveals the quality of engineering judgment around abuse prevention, critical-path placement, and the cost of a bad default policy.

Control Plane

Focus on policy, limits, routing, and stable edge behavior under variable load.

Data Path

Keep latency and throughput predictable while traffic and burst pressure increase.

Failure Modes

Cover fail-open/fail-close behavior, graceful degradation, and safe fallback paths.

Ops Ready

Show monitoring for saturation, retry storms, and practical operational guardrails.

Rate Limiter is a classic case about protecting a public API and distributing shared capacity fairly. On the surface it looks like a simple request counter, but in practice you have to decide where the limit should be enforced, how state is shared across instances, and what happens when the limiter itself starts failing or becomes unavailable.

In an interview the problem quickly runs into trade-offs: how tightly to hold the limit, how much latency to add on the critical path, and how to behave when the protective layer itself fails. Every one of these questions carries a cost.

Chapter 4

Alex Xu: Rate Limiter

Detailed breakdown of algorithms and architectural decisions in the book

Читать обзор

Why rate limiting matters

What happens without it

DDoS attacks — service overload with malicious traffic
Cascading failures — one client overloads the entire service
Overuse of resources — uncontrolled API consumption
Uneven load — “noisy neighbors” eat capacity others were counting on

What rate limiting gives you

Service protection — overload prevention
Fair distribution — quotas for each client
Saving money — control over the use of paid APIs
Predictability — stable performance

Typical Requirements

Functional

FR1
Limit requests by IP address, user ID, or API key
FR2
Customizable limits for different endpoints
FR3
Informative response when the limit is exceeded (429)
FR4
Operate correctly in a distributed environment

Non-functional

NFR1
Low latency — <1ms overhead per request
NFR2
High availability — limiter failure must not equal service failure
NFR3
Limit accuracy (up to ±5% error is acceptable in a distributed setup)
NFR4
Horizontal scaling

Rate limiting algorithms

Each classic algorithm makes different trade-offs around accuracy, memory usage, and burst behavior:

Interactive visualization

1. Token Bucket

A bucket is refilled at a constant rate, and each request consumes one token.

Advantages

Simple implementation
Allows burst traffic
Memory efficient (2 numbers per user)

Drawbacks

Hard to choose bucket size
Race conditions in distributed setup

# Parameters:

bucket_size = 10 # max tokens

refill_rate = 2 # tokens/sec

Algorithm visualization

Tokens:

5.0 / 5

Refill: 1 token/sec

t = 0.0s

Request history:

Press "Start" and send requests

Architecture choices

Related case

API Gateway

Full API Gateway breakdown: routing, security, BFF patterns, and technology comparison.

Read the case

Where should the rate limiter live?

Client-side

The client only controls itself, so the limit is easy to bypass. Fine for trimming extra requests early, not as actual protection.

API Gateway

The most common central enforcement point in front of application services. AWS API Gateway, Kong, Envoy.

In-service

Flexible on rules, but the cost is duplicated logic in every service and harder counter coordination.

Zhiyong Tan

Acing SDI: Rate Limiter

Comparison of deployment options for rate limiting in real systems

Читать обзор

Implementation with Redis

The limit counter is read and updated on every request, so the store has to be fast and safe under concurrent access. Redis covers both: data in memory and atomic operations.

Sliding Window Counter on Redis

-- Lua script for atomic updates
local key = KEYS[1]
local window = tonumber(ARGV[1]) -- window size in seconds
local limit = tonumber(ARGV[2]) -- request limit
local now = tonumber(ARGV[3]) -- current timestamp

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count requests in the current window
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add a new request
    redis.call('ZADD', key, now, now .. '-' .. math.random())
    redis.call('EXPIRE', key, window)
    return 1 -- allowed
else
    return 0 -- rejected
end

Race-condition protection

Problem: Concurrent requests read the same count.

Solution: Lua scripts execute atomically in Redis.

Alternative: Redis MULTI/EXEC or Redlock for distributed locking.

Synchronization model

Centralized: One Redis cluster serves all application instances.

Eventual consistency: Local counters + periodic synchronization.

Trade-off: Accuracy versus latency.

HTTP response headers

# If the request is successful:

X-RateLimit-Limit: 100

X-RateLimit-Remaining: 42

X-RateLimit-Reset: 1640995200

# If the limit is exceeded:

HTTP/1.1 429 Too Many Requests

Retry-After: 30

Deep dive: multi-tier rate limiting

Real systems usually enforce limits at several levels at once:

Limit levels

Global: 10M requests/day for the entire service
Per-user: 1000 requests/hour per user
Per-endpoint: 100 requests/minute on `/api/search`
Per-IP: protection against overload from a single source

Graceful Degradation

Soft limit: Warning in headers
Hard limit: 429 + Retry-After
Redis down: Fail-open (skip) or fail-close?
Monitoring: Alert for traffic anomalies

What to emphasize in an interview

1
Start with the requirements: what traffic profile do you expect, how much error is acceptable, and do you need distributed enforcement?
2
Select the algorithm: Token Bucket works well for bursts, while Sliding Window Counter is a better fit when you need a more accurate view of current load.
3
Make the trade-offs explicit: memory versus accuracy, latency versus stricter coordination between instances.
4
Redis + Lua: atomic operations remove counter-update races.
5
Failure behavior: what should happen if the rate limiter itself becomes unavailable?

For additional practice and implementation comparison, review System Design Primer: Rate Limiter and Cloudflare Rate Limiting Rules.

Related chapters

System Design Primer (short summary) - provides a baseline framework for requirements, constraints, and bottleneck analysis in rate-limiter design.
System Design Interview - Alex Xu - contains the classic interview breakdown of rate limiting algorithms and practical trade-offs.
Acing the System Design Interview - adds practical deployment options for rate limiting in distributed systems.
API Gateway - shows the primary production layer where rate limiting is typically enforced centrally.
API Security Patterns - connects rate limiting with broader API protection: auth, abuse control, and policy enforcement.
Caching Strategies - helps choose storage and caching approaches for counters without adding too much latency to the critical path.
Resilience Patterns - expands on failure behavior: when to allow requests, when to block them, and how to degrade safely during limiter outages.
Notification System - demonstrates how multi-tier limits protect downstream systems from burst traffic.
Web Crawler - illustrates why controlled request pacing is required for safe and polite external crawling.
Payment System - adds a critical-systems context where limits protect expensive and sensitive financial flows.