
Updated: March 25, 2026 at 4:52 AM

Web Crawler

Difficulty: medium

Classic task: distributed URL frontier, politeness/robots.txt, deduplication, re-crawl and ingestion into the search index.

A web crawler is not a problem about downloading the whole internet. It is about frontier scheduling, politeness, deduplication, and deciding what to crawl again and when.

This chapter breaks down the URL frontier, distributed queues, per-domain rate limits, the parsing pipeline, and the path from raw pages to an index.

For interviews and engineering discussions, this case is useful because it tests whether you can design for an effectively unbounded input space and a highly heterogeneous external world.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Source

System Design Interview

The web crawler is one of the classic cases for discussing the frontier, politeness, and scaling.


A web crawler solves two problems at once: maximizing coverage of the web while not overloading other people's services. The architecture is therefore built around three ideas: a distributed frontier, strict per-domain politeness, and incremental content updates.

Requirements

Functional

Accept an initial set of seed URLs and continuously expand coverage.

Comply with robots.txt, crawl-delay and per-host rate limits.

Extract content and links, normalize URLs and remove duplicates.

Support re-crawl at different frequencies for hot and cold pages.

Export data to the indexing pipeline (search layer).
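
The robots.txt requirement can be handled with Python's standard-library parser. A minimal sketch; the host, user-agent name, and robots.txt body are illustrative, not from the chapter:

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that the crawler fetched earlier."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

# Hypothetical robots.txt for an example host
robots = build_robots_checker(
    "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
)
allowed = robots.can_fetch("mybot", "https://example.com/public/page")
delay = robots.crawl_delay("mybot")  # seconds to pause between fetches
```

In a real crawler the parsed rules would be cached per host and re-fetched on a TTL, since robots.txt itself changes over time.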

Non-functional

Scale: billions of URLs

Frontier and storage must scale horizontally.

Freshness: minutes-hours

Popular pages should be crawled quickly.

Politeness: per-domain fairness

A single host must never be flooded with a burst of requests.

Reliability: resume after failures

Queues and checkpointing should not lose crawl progress.

High-Level Architecture

Web Crawler: High-Level Map

frontier + scheduling + fetch/parse/storage pipeline

Control Plane

Seed Sources -> URL Frontier -> Scheduler -> Policy Engine
prioritization + politeness
Crawl Event Log
fetch outcomes + retries

Data Plane

Web Sites -> Fetchers -> Parser
fetch + parse path
Dedup -> Content Store -> Indexing
storage + publication

The crawler is split into a control plane (frontier, policy, scheduling) and a data plane (fetch, parse, storage, indexing).

As in other distributed cases, the key pattern is explicit separation of control plane and data plane. This allows you to scale scheduling/politeness logic and the fetch pipeline independently.

Crawl Flow

From frontier to indexing publication, a single fetch cycle runs through five stages:

1. URL Frontier: pick the next candidate.
2. Scheduler: apply the host budget and crawl-delay, then issue a lease.
3. Fetcher Pool: HTTP request/response.
4. Parser + Dedup: extract links and canonicalize.
5. Store + Indexing: persist the document and publish events.

Fetch Path: operational notes

  • Frontier issues URLs based on priority score and per-host fairness.
  • Scheduler applies host budget and crawl-delay before issuing a lease.
  • The fetcher requests a page; the parser extracts content and links.
  • After canonicalization, newly discovered URLs are returned to the frontier.
  • The document and its metadata are published to the indexing pipeline.
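
The frontier/scheduler interaction above can be sketched as a priority heap guarded by per-host token buckets. Class names, capacities, and refill rates here are illustrative, not from the chapter:

```python
import heapq
import time

class HostBudget:
    """Token bucket for one host, refilled at refill_rate tokens/sec."""
    def __init__(self, capacity: float = 1.0, refill_rate: float = 0.5):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class Frontier:
    """Max-priority frontier (heap of negated priorities) with per-host fairness."""
    def __init__(self):
        self.heap = []
        self.budgets = {}

    def push(self, url: str, host: str, priority: int) -> None:
        heapq.heappush(self.heap, (-priority, url, host))

    def next_url(self):
        """Lease the highest-priority URL whose host budget allows a fetch."""
        deferred, leased = [], None
        while self.heap:
            neg_prio, url, host = heapq.heappop(self.heap)
            budget = self.budgets.setdefault(host, HostBudget())
            if budget.try_acquire():
                leased = (url, host)
                break
            deferred.append((neg_prio, url, host))  # host over budget, retry later
        for item in deferred:
            heapq.heappush(self.heap, item)
        return leased
```

A skipped host's URLs stay in the heap, so a hot domain cannot starve the rest of the frontier; a production frontier would shard this structure by host hash, as noted later in the chapter.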

Recrawl Path: operational notes

  • Freshness scorer selects URLs to re-crawl based on the stale window.
  • Hot pages are re-crawled more often; cold pages less often and in batch mode.
  • Conditional fetch (ETag/Last-Modified) reduces traffic and CPU.
  • The crawl result is written to the crawl event log for retry/observability.
  • The URL is returned to the frontier with the next_fetch_at updated.
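
The conditional-fetch step can be sketched as follows. Field names like etag and next_fetch_at follow the chapter's data model; the rest (function names, the stale window default) is illustrative:

```python
import time

def conditional_headers(page_meta: dict) -> dict:
    """Build validator headers so an unchanged page answers 304 Not Modified."""
    headers = {}
    if page_meta.get("etag"):
        headers["If-None-Match"] = page_meta["etag"]
    if page_meta.get("last_modified"):
        headers["If-Modified-Since"] = page_meta["last_modified"]
    return headers

def on_fetch_result(status: int, page_meta: dict,
                    stale_window_s: int = 3600) -> str:
    """A 304 skips parsing entirely and just pushes next_fetch_at forward."""
    if status == 304:
        page_meta["next_fetch_at"] = time.time() + stale_window_s
        return "unchanged"
    return "reparse"
```

On a 304 the crawler spends only one round-trip and no parse CPU, which is what makes frequent re-crawl of hot pages affordable.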

Data Model (simplified)

  • frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
  • host_budget: host_key, tokens, refill_rate, crawl_delay_ms
  • seen_urls: url_hash, first_seen_at, last_fetch_status
  • page_store: page_id, url, content_pointer, fetched_at, checksum
  • crawl_events: fetch_started/fetch_succeeded/fetch_failed/retry_scheduled
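
The same model can be written down as plain records. A sketch using dataclasses; the field types are assumptions, since the chapter lists only field names:

```python
from dataclasses import dataclass

@dataclass
class FrontierEntry:
    """One row of frontier_queue."""
    url_hash: str
    priority: int
    next_fetch_at: float  # epoch seconds; earliest allowed fetch time
    host_key: str
    depth: int

@dataclass
class HostBudgetRow:
    """One row of host_budget; tokens refill at refill_rate per second."""
    host_key: str
    tokens: float
    refill_rate: float
    crawl_delay_ms: int

@dataclass
class SeenUrl:
    """One row of seen_urls, consulted for dedup decisions."""
    url_hash: str
    first_seen_at: float
    last_fetch_status: int
```

Keying frontier_queue and seen_urls by url_hash rather than the raw URL keeps rows fixed-width and cheap to shard.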

Reliability and anti-patterns

Production patterns

  • Idempotent fetch pipeline: URL reprocessing does not break state.
  • Checkpoint frontier and queue offsets for lossless restart.
  • Per-host circuit breaker for problem domains.
  • Adaptive retry policy: exponential backoff + jitter by error type.
  • Frontier sharding by host hash to reduce lock contention.
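
The adaptive retry policy above can be sketched as full-jitter exponential backoff; the base and cap values are illustrative defaults, not recommendations from the chapter:

```python
import random

def backoff_with_jitter(attempt: int, base_s: float = 1.0,
                        cap_s: float = 300.0) -> float:
    """Full jitter: sleep a uniform amount in [0, min(cap, base * 2^attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The jitter matters as much as the exponent: without it, a burst of failures against one host retries in lockstep and re-creates the original overload.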

Common mistakes

  • A global FIFO without per-host limits quickly leads to bans and blocks.
  • Missing URL canonicalization (utm parameters, trailing slashes, letter case) causes an explosion of duplicates.
  • Re-crawling everything at the same rate, with no importance/freshness scoring.
  • Storing content only in in-memory queues, with no durable layer.
  • Ignoring robots.txt and legal/compliance restrictions.
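
The canonicalization mistake in the list above is cheap to avoid. A minimal sketch; the tracking-parameter set is illustrative, and real crawlers maintain much longer deny-lists:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative deny-list of tracking parameters
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content"}

def canonicalize(url: str) -> str:
    """Lowercase scheme/host, drop tracking params and fragments,
    trim the trailing slash, and sort the remaining query params."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        urlencode(sorted(query)),
        "",  # fragments never reach the server, so drop them
    ))
```

Hashing the canonical form (rather than the raw URL) is what makes the seen_urls dedup table effective.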

RFC

Robots Exclusion Protocol

The official specification of robots.txt (RFC 9309) and the rules for how crawlers interact with sites.


Key trade-off: maximum throughput vs. politeness and legal compliance. In an interview, it is important to show how you limit request rates per domain and remain resilient to failures.

Related chapters

  • Search System - shows how the crawler pipeline feeds indexing, query processing, and ranking layers.
  • Distributed File System - adds durable storage patterns for crawl artifacts, snapshots, and failure recovery.
  • Rate Limiter - deepens politeness controls: per-host pacing, retry backoff, and source protection.
  • CDN - helps reason about edge traffic behavior that influences crawling strategy and freshness planning.
