System Design Space

Updated: March 2, 2026 at 9:17 AM

Web Crawler

Difficulty: mid

Classic task: distributed URL frontier, politeness/robots.txt, deduplication, re-crawl and ingestion into the search index.

Source

System Design Interview

The Web Crawler is one of the classic cases for discussing the frontier, politeness, and scaling.


A web crawler solves two problems at once: maximizing coverage of the web while not overloading other people's services. The architecture is therefore built around three ideas: a distributed frontier, strict per-domain politeness, and incremental content updates.

Requirements

Functional

Accept a seed set of URLs and continually expand coverage.

Respect robots.txt, Crawl-delay directives, and per-host rate limits.

Extract content and links, normalize URLs and remove duplicates.

Support re-crawl with different frequencies for hot and cold pages.

Export data to the indexing pipeline (search layer).
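The normalization requirement can be sketched concretely. Below is a minimal URL canonicalizer; the tracking-parameter list and the exact rules are illustrative assumptions, and real crawlers maintain much richer rule sets:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Tracking parameters to drop during canonicalization
# (an illustrative, incomplete list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content"}

def canonicalize(url: str) -> str:
    """Normalize a URL so that trivially different forms hash identically."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports (http:80, https:443).
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname
    # Drop the fragment and tracking parameters; sort the query for stability.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    ))
    # Collapse the trailing slash on non-root paths.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, query, ""))
```

Hashing the canonical form (rather than the raw URL) is what keeps the seen_urls set from exploding with duplicates.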

Non-functional

Scale: billions of URLs

Frontier and storage must scale horizontally.

Freshness: minutes-hours

Popular pages should be crawled quickly.

Politeness: per-domain fairness

A single host must not be flooded with a burst of requests.

Reliability: resume after failures

Queues and checkpointing should not lose crawl progress.
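Per-domain fairness is commonly enforced with a token bucket per host. A minimal sketch, mirroring the tokens/refill_rate fields of the host_budget record in the data model below (the class and method names are hypothetical, not from the original design):

```python
import time

class HostBucket:
    """Per-host token bucket: one token per request, refilled continuously
    at refill_rate tokens/second. A sketch, not a production limiter."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity
        self.refill_rate = refill_rate        # tokens per second
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        """Return True and consume a token if the host has budget left."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A refill_rate of 0.5, for example, limits the crawler to one request per host every two seconds, which is how a Crawl-delay of 2 maps onto the bucket.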

High-Level Architecture

Web Crawler: High-Level Map

frontier + scheduling + fetch/parse/storage pipeline

Control Plane

Seed Sources -> URL Frontier -> Scheduler -> Policy Engine
prioritization + politeness
Crawl Event Log
fetch outcomes + retries

Data Plane

Web Sites -> Fetchers -> Parser
fetch + parse path
Dedup -> Content Store -> Indexing
storage + publication

The crawler is split into a control plane (frontier, policy, scheduling) and a data plane (fetch, parse, storage, indexing).

As in other distributed designs, the key pattern is the explicit separation of the control plane and the data plane. This lets you scale the scheduling/politeness logic independently of the fetch pipeline itself.

Crawl Flow


1. URL Frontier: pick next candidate
2. Scheduler: budget + lease
3. Fetcher Pool: HTTP request/response
4. Parser + Dedup: links + canonicalization
5. Store + Indexing: persist + publish events
The fetch path covers the full cycle from the frontier to indexing publication.

Fetch Path: operational notes

  • Frontier issues URLs based on priority score and per-host fairness.
  • Scheduler applies host budget and crawl-delay before issuing a lease.
  • The fetcher requests the page; the parser extracts content and links.
  • After canonicalization, new URLs are returned to the frontier loop.
  • The document and its metadata are published to the indexing pipeline.
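The five fetch-path steps can be sketched as a single loop. Every collaborator here (frontier, scheduler, fetcher, parser, dedup, store, index) is a hypothetical interface standing in for a component from the diagram, not an actual API:

```python
def crawl_once(frontier, scheduler, fetcher, parser, dedup, store, index):
    """One pass through the fetch path; collaborators are assumed interfaces."""
    url = frontier.pop()                       # 1. pick next candidate by priority
    lease = scheduler.lease(url)               # 2. host budget + crawl-delay
    if lease is None:
        frontier.push(url, delay=True)         #    host over budget: requeue later
        return
    response = fetcher.fetch(url)              # 3. HTTP request/response
    doc, links = parser.parse(response)        # 4. extract content + outlinks
    for link in links:
        canonical = dedup.canonicalize(link)
        if not dedup.seen(canonical):          #    seen_urls check
            frontier.push(canonical)           #    back into the frontier loop
    store.save(doc)                            # 5. persist content
    index.publish(doc)                         #    publish to indexing pipeline
```

The loop is deliberately idempotent-friendly: re-running it for the same URL only re-derives state that dedup and the store already hold.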

Recrawl Path: operational notes

  • Freshness scorer selects URLs to re-crawl based on the stale window.
  • Hot pages are re-crawled more often; cold pages less often and in batch mode.
  • Conditional fetch (ETag/Last-Modified) reduces traffic and CPU.
  • The crawl result is written to the crawl event log for retry/observability.
  • The URL is returned to the frontier with an updated next_fetch_at.
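Two pieces of the recrawl path can be sketched directly: conditional-fetch headers and adaptive next_fetch_at scheduling. The interval bounds and the halving/doubling policy are illustrative assumptions, not the only reasonable choices:

```python
from datetime import datetime, timedelta

# Illustrative bounds: hot pages may be revisited within minutes,
# cold pages at most once a day.
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(days=1)

def conditional_headers(etag, last_modified):
    """Headers for a conditional GET; a 304 response means the content is
    unchanged, so the body need not be downloaded or re-parsed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def schedule_recrawl(fetched_at, interval, changed):
    """Adaptive recrawl: halve the interval when the page changed, double it
    when it did not, clamped to [MIN_INTERVAL, MAX_INTERVAL].
    Returns (next_fetch_at, new_interval)."""
    interval = interval / 2 if changed else interval * 2
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return fetched_at + interval, interval
```

Pages that keep changing converge to the hot bound; pages that stop changing drift toward the cold bound, which is the freshness scorer's job in miniature.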

Data Model (simplified)

  • frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
  • host_budget: host_key, tokens, refill_rate, crawl_delay_ms
  • seen_urls: url_hash, first_seen_at, last_fetch_status
  • page_store: page_id, url, content_pointer, fetched_at, checksum
  • crawl_events: fetch_started/fetch_succeeded/fetch_failed/retry_scheduled
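The frontier_queue ordering can be sketched as a min-heap keyed by (next_fetch_at, -priority): due URLs come out first, and higher priority wins among equally due entries. This is an in-memory illustration only; a real frontier is sharded by host hash and checkpointed to durable storage:

```python
import heapq

class Frontier:
    """In-memory sketch of frontier_queue ordering (not durable, not sharded)."""

    def __init__(self):
        self._heap = []

    def push(self, url_hash, priority, next_fetch_at):
        # Negate priority: heapq is a min-heap, and higher priority should win.
        heapq.heappush(self._heap, (next_fetch_at, -priority, url_hash))

    def pop_due(self, now):
        """Return the best due URL hash, or None if nothing is due yet."""
        if self._heap and self._heap[0][0] <= now:
            _, _, url_hash = heapq.heappop(self._heap)
            return url_hash
        return None
```

Keeping next_fetch_at as the primary key is what lets the same structure serve both fresh discoveries and scheduled re-crawls.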

Reliability and anti-patterns

Production patterns

  • Idempotent fetch pipeline: URL reprocessing does not break state.
  • Checkpoint frontier and queue offsets for lossless restart.
  • Per-host circuit breaker for problem domains.
  • Adaptive retry policy: exponential backoff + jitter by error type.
  • Frontier sharding by host hash to reduce lock contention.
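The "exponential backoff + jitter" pattern above is often implemented as full jitter: sample uniformly from zero up to an exponentially growing cap. The base and cap values here are illustrative; in practice they would be tuned per error type:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, rng=None):
    """Full-jitter backoff: uniform sample from [0, min(cap, base * 2**attempt)].
    Pass a seeded random.Random for deterministic tests."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The jitter matters as much as the exponent: without it, all fetchers that failed against one host retry at the same instant and hammer it again in lockstep.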

Common mistakes

  • Global FIFO without per-host limits: quickly leads to bans and blocks.
  • Missing URL canonicalization (utm parameters, trailing slash, case) -> duplicate explosion.
  • Re-crawling every page at the same rate, without importance/freshness scoring.
  • Storing all content only in RAM queues without a durable layer.
  • Ignoring robots.txt and legal/compliance restrictions.

RFC

Robots Exclusion Protocol

The official specification of robots.txt (RFC 9309) and the rules for crawler interaction with sites.
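Python's standard library ships a parser for this protocol. A short example with urllib.robotparser; the robots.txt body and the user-agent name are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A small robots.txt; in production you would fetch and cache one per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("mybot", "https://example.com/page"))       # allowed path
print(rp.can_fetch("mybot", "https://example.com/private/x"))  # disallowed path
print(rp.crawl_delay("mybot"))                                 # Crawl-delay value
```

RobotFileParser.parse() accepts the file's lines directly, so politeness rules can be unit-tested without any network access.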


Key trade-off: maximum throughput vs. politeness and legal compliance. In an interview, it is important to show how you rate-limit per domain and remain resilient to failures.



© 2026 Alexander Polomodov