Updated: April 9, 2026 at 2:10 PM

Web Crawler

medium

Classic task: URL frontier management, robots.txt compliance, per-host pacing, deduplication, recrawling, and feeding pages into the search index.

A web crawler is not about downloading the whole internet. It is about deciding what to crawl now, what to postpone, and how not to overwhelm someone else’s site.

The chapter ties together the URL frontier, prioritization, host-level pacing, recrawling, and the path from a raw page to a search index.

For interviews and engineering discussions, this case is useful because it tests whether you can design for an effectively unbounded input space and a highly heterogeneous external world.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Source

System Design Interview

Web Crawler is one of the core cases for discussing the frontier, politeness, and scaling.

A web crawler is a long-running system that has to expand coverage without turning someone else’s site into your next incident. That is why the architecture is built around a frontier, per-domain politeness, and recrawling pages at different cadences.

Requirements

Functional

Accept a seed set of URLs and gradually expand coverage over time.

Respect robots.txt, crawl-delay, and per-host pacing limits.

Extract content and links, normalize URLs, and remove duplicates.

Support different recrawl cadences for fast-changing and stable pages.

Feed the result into the search indexing pipeline.

Non-functional

Scale: billions of URLs

Frontier and storage must scale horizontally.

Freshness: minutes to hours

Important and frequently changing pages should be revisited noticeably faster.

Politeness: fairness across domains

One host must not be flooded with a dense burst of requests.

Reliability: resume after failures

Queues and checkpointing should not lose crawl progress.

High-Level Architecture

Web Crawler: System Map (frontier, scheduling, fetch, parse, and storage flow)

Control Plane

  • Seed Sources -> URL Frontier -> Scheduler -> Policy Layer (priorities and pacing)
  • Crawl Event Log (fetch outcomes and retries)

Data Plane

  • Web Sites -> Fetchers -> Parser (fetch and parse path)
  • Dedup -> Content Store -> Indexing (storage and publication)

The crawler is split into a control plane and a data plane: one handles priorities and pacing, the other handles fetch, parse, storage, and indexing.

As in many distributed systems, separating the two planes lets you scale prioritization and host-level pacing independently of the fetch and parse work itself.

Crawl Flow

  1. URL Frontier: pick next candidate
  2. Scheduler: budget and lease
  3. Fetcher Pool: HTTP request and response
  4. Parser and Dedup: links and canonicalization
  5. Store and Indexing: persist and publish events

It is important to separate URL canonicalization from document deduplication: the first removes redundant address variants, while the second keeps the same content from being pushed into storage and indexing more than once. Recrawling is then a constant balance between freshness and network cost.
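
The two steps can be sketched separately. This is a minimal illustration, not a complete normalizer: the function names, the `utm_` prefix list, and the trailing-slash policy are assumptions made for the example, and the sketch ignores IPv6 literals and userinfo in URLs.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters with these prefixes carry no content meaning.
TRACKING_PREFIXES = ("utm_",)


def canonicalize_url(url: str) -> str:
    """Collapse address variants of the same page into one form."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    default_port = {"http": 80, "https": 443}.get(scheme)
    if port and port != default_port:
        host = f"{host}:{port}"
    # Assumption: a trailing slash does not change the resource.
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def content_checksum(body: str) -> str:
    """Fingerprint page content so identical documents are stored once."""
    normalized = " ".join(body.split())  # collapse whitespace differences
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


canonical = canonicalize_url("HTTP://Example.com:80/a/?utm_source=x&b=2&a=1#top")
```

An exact checksum only catches byte-identical content after whitespace normalization; near-duplicate detection would need a different technique, which this sketch does not attempt.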

Fetch Path: notes

  • The frontier issues URLs based on priority, crawl depth, and fairness across hosts.
  • The scheduler checks the host budget and enforces crawl-delay before dispatching work.
  • The fetcher retrieves the page, and the parser extracts content and links.
  • After canonicalization, newly discovered URLs return to the frontier.
  • The document and metadata move into the indexing pipeline.
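
The interplay between priority and per-host fairness in the first two notes can be sketched with an in-memory structure. The class name `PoliteFrontier`, the fixed per-host delay, and the explicit `now` argument are assumptions made for the example; a production frontier would be durable and sharded, and would also account for crawl depth.

```python
import heapq
from collections import defaultdict


class PoliteFrontier:
    """Per-host priority queues plus a heap of hosts ordered by ready time."""

    def __init__(self, crawl_delay_s: float = 1.0):
        self.crawl_delay_s = crawl_delay_s
        self.queues = defaultdict(list)  # host -> heap of (-priority, seq, url)
        self.ready_heap = []             # (ready_at, host)
        self.scheduled = set()           # hosts currently present in ready_heap
        self.next_allowed = {}           # host -> earliest next fetch time
        self.seq = 0                     # tie-breaker that keeps insertion order

    def add(self, host: str, url: str, priority: int, now: float) -> None:
        heapq.heappush(self.queues[host], (-priority, self.seq, url))
        self.seq += 1
        if host not in self.scheduled:
            ready_at = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready_heap, (ready_at, host))
            self.scheduled.add(host)

    def pop(self, now: float):
        """Return the next URL whose host may be fetched now, else None."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_heap)
        _, _, url = heapq.heappop(self.queues[host])
        self.next_allowed[host] = now + self.crawl_delay_s
        if self.queues[host]:
            heapq.heappush(self.ready_heap, (self.next_allowed[host], host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
        return url


frontier = PoliteFrontier(crawl_delay_s=2.0)
frontier.add("a.example", "https://a.example/low", priority=1, now=0.0)
frontier.add("a.example", "https://a.example/high", priority=5, now=0.0)
frontier.add("b.example", "https://b.example/", priority=1, now=0.0)
order = [frontier.pop(0.0), frontier.pop(0.0), frontier.pop(0.0), frontier.pop(2.0)]
```

In this run the high-priority URL for `a.example` goes first, `b.example` fills the gap while `a.example` is cooling down, and the third `pop` returns `None` because no host is ready yet.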

Recrawl Path: notes

  • A freshness scorer selects recrawl candidates using staleness (time since the last fetch) and importance signals.
  • Fast-changing pages are revisited more often; stable ones move on a slower, more batched schedule.
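  • Conditional requests (If-None-Match with the stored ETag, If-Modified-Since with Last-Modified) let the server answer 304 Not Modified, which saves bandwidth and parsing CPU.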
  • The crawl result is written to the crawl event log for retries and observability.
  • The URL returns to the frontier with an updated next_fetch_at.
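
One way to express the recrawl notes in code is an adaptive interval: halve it when the page changed, double it when it did not. The halving and doubling factors, the interval bounds, and the function names are assumptions for the example, not a prescribed policy.

```python
from typing import Optional, Tuple

MIN_INTERVAL_S = 5 * 60          # assumption: floor for fast-changing pages
MAX_INTERVAL_S = 30 * 24 * 3600  # assumption: ceiling for stable pages


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build request headers that allow the server to reply 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def next_interval(current_s: float, changed: bool) -> float:
    """Shorten the interval after a change, lengthen it otherwise."""
    factor = 0.5 if changed else 2.0
    return max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, current_s * factor))


def reschedule(
    now: float,
    interval_s: float,
    status: int,
    old_checksum: str,
    new_checksum: Optional[str],
) -> Tuple[float, float]:
    """Return (next_fetch_at, new_interval) for a finished recrawl."""
    # A 304 or an identical checksum both count as "unchanged".
    changed = status == 200 and new_checksum != old_checksum
    interval = next_interval(interval_s, changed)
    return now + interval, interval


unchanged = reschedule(1000.0, 3600.0, 304, "abc", None)
changed = reschedule(0.0, 3600.0, 200, "abc", "def")
```

Importance signals are left out here; they could scale the bounds or the factor per page class.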

Simplified Data Model

  • frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
  • host_budget: host_key, tokens, refill_rate, crawl_delay_ms
  • seen_urls: url_hash, first_seen_at, last_fetch_status
  • page_store: page_id, url, content_pointer, fetched_at, checksum
  • crawl_events: one record per event, with event types fetch_started, fetch_succeeded, fetch_failed, retry_scheduled
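
The `host_budget` row reads naturally as a token bucket combined with a minimum gap between requests. The sketch below uses the field names from the model; `capacity`, `last_refill_at`, `last_fetch_at`, and the `try_acquire` method are assumptions added to make the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class HostBudget:
    host_key: str
    tokens: float
    refill_rate: float      # tokens added per second
    crawl_delay_ms: int     # minimum gap between two fetches
    capacity: float = 1.0                 # assumption: burst ceiling
    last_refill_at: float = 0.0           # assumption: bookkeeping field
    last_fetch_at: float = float("-inf")  # assumption: bookkeeping field

    def try_acquire(self, now: float) -> bool:
        """Spend one token if both the bucket and the crawl delay allow it."""
        elapsed = max(0.0, now - self.last_refill_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill_at = now
        if now - self.last_fetch_at < self.crawl_delay_ms / 1000.0:
            return False  # too soon after the previous fetch
        if self.tokens < 1.0:
            return False  # budget exhausted, wait for refill
        self.tokens -= 1.0
        self.last_fetch_at = now
        return True


budget = HostBudget(
    "example.com", tokens=2.0, refill_rate=0.5, crawl_delay_ms=1000, capacity=2.0
)
decisions = [budget.try_acquire(t) for t in (0.0, 0.5, 1.0, 2.0, 3.0)]
```

The second attempt is refused by the crawl delay and the last one by the empty bucket, which shows the two limits acting independently.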

Reliability and Common Mistakes

Crawler reliability is not just about queues. It also depends on checkpoints, per-host circuit breakers, and adaptive retry backoff that reacts to the kind of failure you hit.

Working patterns

  • Idempotent processing: revisiting the same URL does not corrupt state.
  • Checkpoint the frontier and queue offsets for restart without losing progress.
  • Per-host circuit breaker for problem domains.
  • Adaptive retry policy: exponential backoff + jitter by error type.
  • Frontier sharding by host hash to reduce lock contention.
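
Two of these patterns are small enough to sketch directly: backoff with jitter keyed by error type, and shard selection by host hash. The error type labels, base delays, cap, and function names are all assumptions for illustration.

```python
import hashlib
import random
from typing import Optional

# Assumption: base delays in seconds per error class.
BASE_DELAY_S = {"timeout": 5.0, "http_5xx": 30.0, "http_429": 60.0}
# Assumption: these outcomes are final and should not be retried.
NON_RETRYABLE = {"http_404", "robots_disallowed"}


def retry_delay(
    error_type: str, attempt: int, rng=random, cap_s: float = 3600.0
) -> Optional[float]:
    """Exponential backoff with full jitter; None means do not retry."""
    if error_type in NON_RETRYABLE:
        return None
    base = BASE_DELAY_S.get(error_type, 10.0)
    ceiling = min(cap_s, base * (2 ** attempt))
    # Full jitter spreads retries so failed hosts are not hit in lockstep.
    return rng.uniform(0.0, ceiling)


def shard_for_host(host_key: str, num_shards: int) -> int:
    """Map a host to a frontier shard with a stable hash."""
    digest = hashlib.sha256(host_key.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


delay = retry_delay("timeout", attempt=3, rng=random.Random(1))
shard = shard_for_host("Example.com", 16)
```

Because every URL of a host lands on the same shard, per-host pacing state stays local to one worker. A plain modulo reshuffles most hosts when `num_shards` changes; consistent hashing is a common alternative if resharding is frequent.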

Common mistakes

  • Global FIFO without per-host limits: quickly leads to bans and blocks.
  • Missing URL canonicalization (utm parameters, trailing slashes, letter case): causes a duplicate explosion.
  • Recrawling everything at the same cadence without importance or freshness scoring.
  • Keeping all crawl state only in in-memory queues, with no durable layer behind them.
  • Ignoring robots.txt and legal/compliance restrictions.
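
For the last point, Python's standard library already ships a robots.txt parser. The sketch below parses an inline example file rather than fetching one; the rules, the `ExampleBot` user agent, and the `check` helper are assumptions. Note that `Crawl-delay` is a widely used but non-standard directive that is not defined by the Robots Exclusion Protocol RFC.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def check(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay_seconds_or_None) for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


private_result = check(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report")
public_result = check(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post")
```

In a crawler the parsed rules would be cached per host and refreshed periodically, and the reported delay would feed the per-host budget.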

RFC

Robots Exclusion Protocol (RFC 9309)

Official specification of robots.txt and rules for crawler interaction with sites.

The key trade-off is throughput versus politeness and legal limits. In interviews, it is worth showing how you cap per-domain speed, plan recrawls, and stay resilient when external sources fail.

Related chapters

  • Search System - Shows how web crawling feeds indexing, query processing, and ranking.
  • Distributed File System - Helps discuss durable storage for page snapshots, crawl artifacts, and failure recovery.
  • Rate Limiter - Deepens the discussion around per-host pacing, retry backoff, and protecting external sources from overload.
  • CDN - Helps reason about traffic geography and network latency that shape crawling and recrawl strategy.
