Web Crawler — System Design Space

A web crawler is not about downloading the whole internet. It is about deciding what to crawl now, what to postpone, and how not to overwhelm someone else’s site.

The chapter ties together the URL frontier, prioritization, host-level pacing, recrawling, and the path from a raw page to a search index.

For interviews and engineering discussions, this case is useful because it tests whether you can design for an effectively unbounded input space and a highly heterogeneous external world.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Source

System Design Interview

Web Crawler is one of the core cases for discussing the frontier, politeness, and scaling.

A web crawler is a long-running system that has to expand coverage without turning someone else’s site into your next incident. That is why the architecture is built around a frontier, per-domain politeness, and recrawling pages at different cadences.

Requirements

Functional

Accept a seed set of URLs and gradually expand coverage over time.

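Respect robots.txt rules, the Crawl-delay directive where a site sets one (a widely used extension that is not part of the Robots Exclusion Protocol standard), and per-host pacing limits.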

Extract content and links, normalize URLs, and remove duplicates.

Support different recrawl cadences for fast-changing and stable pages.

Feed the result into the search indexing pipeline.
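
The robots.txt requirement above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not a production policy layer: the robots.txt body, the `example-crawler` user agent, and the `load_policy` helper are assumptions made for the example, and a real crawler would fetch and cache the file per host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it per host and cache it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # assumed crawler name, for illustration only


def load_policy(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)

# Allowed path: no Disallow rule matches it.
allowed = policy.can_fetch(USER_AGENT, "https://example.com/docs/page")
# Blocked path: it falls under "Disallow: /private/".
blocked = policy.can_fetch(USER_AGENT, "https://example.com/private/data")
# Crawl-delay in seconds, or None when the site does not set one.
delay_s = policy.crawl_delay(USER_AGENT)
```

The scheduler would consult `can_fetch` before leasing a URL and feed `delay_s` into the per-host pacing limit.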

Non-functional

Scale: billions of URLs

Frontier and storage must scale horizontally.

Freshness: minutes to hours

Important and frequently changing pages should be revisited noticeably faster.

Politeness: fairness across domains

One host must not be flooded with a dense burst of requests.

Reliability: resume after failures

Queues and checkpointing should not lose crawl progress.

High-Level Architecture

Web Crawler: System Map (frontier, scheduling, fetch, parse, and storage flow)

Control Plane

Seed Sources (sitemaps and seed lists) -> URL Frontier (priority queue) -> Scheduler (host budgets) -> Policy Layer (robots and pacing)

Crawl Event Log: fetch outcomes and retries

Data Plane

Web Sites (HTTP content) -> Fetcher Pool (parallel workers) -> Parser and Extraction (links and metadata)

Dedup Layer (seen URLs) -> Content Store (documents and snapshots) -> Indexing Pipeline (search ingestion)

New URLs from the parser and deduplication layer are fed back into the frontier.

The crawler is split into a control plane and a data plane: one handles priorities and pacing, the other handles fetch, parse, storage, and indexing.

As in many distributed system designs, it helps to separate the control plane from the data plane. That lets you scale prioritization and host-level pacing independently from the fetch and parse work itself.

Crawl Flow

Crawl flow (fetch path): URL Frontier (pick next candidate) -> Scheduler (budget and lease) -> Fetcher Pool (HTTP request and response) -> Parser and Dedup (links and canonicalization) -> Store and Indexing (persist and publish events)

It is important to separate URL canonicalization from document deduplication: the first removes redundant address variants, while the second keeps the same content from being pushed into storage and indexing more than once. Recrawling is then a constant balance between freshness and network cost.
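
To make the distinction concrete, here is a minimal sketch of both steps. The normalization rules (lowercasing scheme and host, dropping default ports, fragments, `utm_*` parameters, and trailing slashes) are assumptions chosen for illustration; real crawlers tune them per site, since stripping a trailing slash is not always safe. The function names and the in-memory `seen_checksums` set are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # assumed list of tracking-parameter prefixes
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url: str) -> str:
    """Collapse redundant address variants into one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Assumption: a trailing slash does not change the resource.
    path = parts.path.rstrip("/") or "/"
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))  # stable parameter order
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped


seen_checksums = set()  # stand-in for a durable dedup store


def is_new_document(body: bytes) -> bool:
    """Return True the first time this exact content is seen."""
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen_checksums:
        return False
    seen_checksums.add(digest)
    return True
```

An exact hash only catches byte-identical documents. Pages that differ by a timestamp or an ad block need a near-duplicate technique, which this sketch does not cover.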

Fetch Path: notes

The frontier issues URLs based on priority, crawl depth, and fairness across hosts.
The scheduler checks the host budget and enforces crawl-delay before dispatching work.
The fetcher retrieves the page, and the parser extracts content and links.
After canonicalization, newly discovered URLs return to the frontier.
The document and metadata move into the indexing pipeline.
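
The first two steps above can be sketched as a priority heap combined with a per-host cooldown. This is a simplified, single-process illustration under stated assumptions: the `Frontier` class, its method names, and the single fixed `crawl_delay_s` for every host are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import heapq
import itertools
from urllib.parse import urlsplit


class Frontier:
    """Priority frontier that skips hosts still inside their crawl delay."""

    def __init__(self, crawl_delay_s: float = 1.0):
        self._heap = []                 # (priority, seq, url); lower value is served first
        self._seq = itertools.count()   # tie-breaker that keeps insertion order
        self._next_allowed = {}         # host -> earliest time of the next request
        self._crawl_delay_s = crawl_delay_s

    def add(self, url: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), url))

    def next_url(self, now: float):
        """Return the best URL whose host is ready, or None if all hosts are cooling down."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).hostname
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self._crawl_delay_s
                chosen = item[2]
                break
            deferred.append(item)       # host not ready yet; try the next candidate
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen


frontier = Frontier(crawl_delay_s=2.0)
frontier.add("https://a.example/1", priority=0)
frontier.add("https://a.example/2", priority=1)
frontier.add("https://b.example/1", priority=5)

first = frontier.next_url(now=0.0)    # best priority overall
second = frontier.next_url(now=0.5)   # a.example is cooling down, so b.example is served
third = frontier.next_url(now=1.0)    # only a.example remains and it is not ready
fourth = frontier.next_url(now=2.0)   # a.example's delay has elapsed
```

Scanning past blocked hosts is linear in the worst case. Larger designs usually keep one queue per host and a separate heap of hosts ordered by their next allowed time.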

Recrawl Path: notes

A freshness scorer selects recrawl candidates from the stale window and importance signals.
Fast-changing pages are revisited more often; stable ones move on a slower, more batched schedule.
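Conditional requests (If-None-Match with a stored ETag, If-Modified-Since with a stored Last-Modified value) let unchanged pages answer with 304 Not Modified, which saves bandwidth and parsing CPU.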
The crawl result is written to the crawl event log for retries and observability.
The URL returns to the frontier with an updated next_fetch_at.
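
One possible way to combine conditional requests with an adaptive schedule is sketched below. The `PageState` record, the interval bounds, and the doubling and halving rule are assumptions for illustration, not a prescribed policy. The sketch also treats every 200 response as a change, whereas a real crawler would compare content checksums first.

```python
from dataclasses import dataclass
from typing import Optional

MIN_INTERVAL_S = 15 * 60          # assumed lower bound on the recrawl interval
MAX_INTERVAL_S = 30 * 24 * 3600   # assumed upper bound on the recrawl interval


@dataclass
class PageState:
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    interval_s: float = 24 * 3600
    next_fetch_at: float = 0.0


def conditional_headers(state: PageState) -> dict:
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def record_result(state, status, now, etag=None, last_modified=None):
    """Update validators and schedule the next visit based on the fetch status."""
    if status == 304:
        # Unchanged: back off, up to the maximum interval.
        state.interval_s = min(state.interval_s * 2, MAX_INTERVAL_S)
    elif status == 200:
        # Assumed changed: revisit sooner and store the new validators.
        state.interval_s = max(state.interval_s / 2, MIN_INTERVAL_S)
        state.etag = etag
        state.last_modified = last_modified
    # Other statuses keep the current interval; retries are handled elsewhere.
    state.next_fetch_at = now + state.interval_s


page = PageState(url="https://example.com/news", etag='"v1"', interval_s=3600)
request_headers = conditional_headers(page)
record_result(page, status=304, now=1000.0)
```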

Simplified Data Model

frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
host_budget: host_key, tokens, refill_rate, crawl_delay_ms
seen_urls: url_hash, first_seen_at, last_fetch_status
page_store: page_id, url, content_pointer, fetched_at, checksum
crawl_events: event types fetch_started, fetch_succeeded, fetch_failed, retry_scheduled
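
The `host_budget` row reads naturally as a token bucket. The sketch below is one way to interpret those fields. `capacity`, `last_refill_at`, and `last_fetch_at` are extra fields assumed for the example and are not part of the model above, and the clock is passed in explicitly to keep the logic deterministic.

```python
from dataclasses import dataclass


@dataclass
class HostBudget:
    host_key: str
    tokens: float
    refill_rate: float                     # tokens added per second
    capacity: float = 1.0                  # assumed field: maximum burst size
    crawl_delay_ms: int = 0
    last_refill_at: float = 0.0            # assumed field: time of the last refill
    last_fetch_at: float = float("-inf")   # assumed field: time of the last granted fetch

    def try_acquire(self, now: float) -> bool:
        """Grant one request if both the token bucket and the crawl delay allow it."""
        elapsed = max(0.0, now - self.last_refill_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill_at = now
        if (now - self.last_fetch_at) * 1000 < self.crawl_delay_ms:
            return False                   # minimum spacing between requests not met
        if self.tokens < 1.0:
            return False                   # budget exhausted; wait for a refill
        self.tokens -= 1.0
        self.last_fetch_at = now
        return True


budget = HostBudget(
    host_key="example.com",
    tokens=2.0,
    refill_rate=0.5,
    capacity=2.0,
    crawl_delay_ms=1000,
)
grants = [budget.try_acquire(t) for t in (0.0, 0.5, 1.0, 2.0, 3.0, 4.0)]
```

The token bucket bounds the average request rate and burst size, while `crawl_delay_ms` enforces a minimum gap between consecutive requests to the same host.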

Reliability and Common Mistakes

Crawler reliability is not just about queues. It also depends on checkpoints, per-host circuit breakers, and adaptive retry backoff that reacts to the kind of failure you hit.

Working patterns

Idempotent processing: revisiting the same URL does not corrupt state.
Checkpoint the frontier and queue offsets for restart without losing progress.
Per-host circuit breaker for problem domains.
Adaptive retry policy: exponential backoff + jitter by error type.
Frontier sharding by host hash to reduce lock contention.
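
Two of the patterns above, retry backoff by error type and sharding by host hash, can be sketched briefly. The error classes and numbers in `RETRY_POLICY` are illustrative assumptions, as are the function names. The jitter variant shown draws a uniform delay between zero and the exponential ceiling.

```python
import hashlib
import random

# Assumed policy table: error type -> (base delay s, max delay s, max attempts).
RETRY_POLICY = {
    "timeout": (2.0, 300.0, 5),
    "http_5xx": (10.0, 3600.0, 4),
    "http_429": (60.0, 6 * 3600.0, 6),
}


def retry_delay(error_type: str, attempt: int, rng=random):
    """Return a jittered delay in seconds, or None when the fetch should not be retried."""
    policy = RETRY_POLICY.get(error_type)
    if policy is None:
        return None                       # unlisted errors are treated as permanent
    base, cap, max_attempts = policy
    if attempt >= max_attempts:
        return None                       # give up after the configured attempts
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)      # spread retries so hosts are not hit in sync


def frontier_shard(host_key: str, shard_count: int) -> int:
    """Map a host to a shard with a stable hash so all its URLs share one queue."""
    digest = hashlib.sha256(host_key.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


rng = random.Random(42)                   # seeded only to make the example repeatable
delay = retry_delay("timeout", attempt=3, rng=rng)
shard = frontier_shard("example.com", shard_count=16)
```

A cryptographic digest is used here instead of Python's built-in `hash()`, because string hashes are randomized per process and would assign hosts to different shards after a restart.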

Common mistakes

Global FIFO without per-host limits: quickly leads to bans and blocks.
Skipping URL canonicalization (utm_* tracking parameters, trailing slashes, letter case) causes a duplicate explosion.
Recrawling everything at the same cadence without importance or freshness scoring.
Keeping crawl state and fetched content only in in-memory queues, with no durable layer behind them.
Ignoring robots.txt and legal/compliance restrictions.

RFC

Robots Exclusion Protocol

Official specification of robots.txt and rules for crawler interaction with sites.

The key trade-off is throughput versus politeness and legal limits. In interviews, it is worth showing how you cap per-domain speed, plan recrawls, and stay resilient when external sources fail.

Related chapters

Search System - Shows how web crawling feeds indexing, query processing, and ranking.
Distributed File System - Helps discuss durable storage for page snapshots, crawl artifacts, and failure recovery.
Rate Limiter - Deepens the discussion around per-host pacing, retry backoff, and protecting external sources from overload.
CDN - Helps reason about traffic geography and network latency that shape crawling and recrawl strategy.