A web crawler is not about downloading the whole internet. It is about deciding what to crawl now, what to postpone, and how not to overwhelm someone else’s site.
The chapter ties together the URL frontier, prioritization, host-level pacing, recrawling, and the path from a raw page to a search index.
For interviews and engineering discussions, this case is useful because it tests whether you can design for an effectively unbounded input space and a highly heterogeneous external world.
Pipeline Thinking
Ingestion, partitioning, deduplication, and stage latency drive system behavior.
Serving Layer
Index and cache-locality decisions directly shape user-facing query latency.
Consistency Window
Explicitly define where eventual consistency is acceptable and where it is not.
Cost vs Freshness
Balance update frequency with compute/storage cost and operational complexity.
Source
System Design Interview
Web Crawler is one of the core cases for discussing the frontier, politeness, and scaling.
A web crawler is a long-running system that has to expand coverage without turning someone else’s site into your next incident. That is why the architecture is built around a URL frontier, per-domain politeness, and recrawling pages at different cadences.
Requirements
Functional
Accept a seed set of URLs and gradually expand coverage over time.
Respect robots.txt, crawl-delay, and per-host pacing limits.
Extract content and links, normalize URLs, and remove duplicates.
Support different recrawl cadences for fast-changing and stable pages.
Feed the result into the search indexing pipeline.
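The robots.txt requirement above can be illustrated with Python's standard `urllib.robotparser` module. This is a minimal sketch, not a production policy layer: the robots.txt body, the `example-crawler` user agent, and the example.com URLs are all made up for illustration, and a real crawler would fetch and cache the file per host.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_body: str) -> RobotFileParser:
    """Parse an already-downloaded robots.txt body into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "example-crawler" is an assumed user-agent token.
allowed = rules.can_fetch("example-crawler", "https://example.com/docs/page")
blocked = rules.can_fetch("example-crawler", "https://example.com/private/x")
delay = rules.crawl_delay("example-crawler")  # seconds, or None if not declared
```

The scheduler would treat `delay` as a lower bound on the gap between requests to that host and drop any URL for which `can_fetch` returns False before it reaches the fetcher.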
Non-functional
Scale: billions of URLs
Frontier and storage must scale horizontally.
Freshness: minutes to hours
Important and frequently changing pages should be revisited noticeably faster.
Politeness: fairness across domains
One host must not be flooded with a dense burst of requests.
Reliability: resume after failures
Queues and checkpointing should not lose crawl progress.
High-Level Architecture
Figure: Web Crawler system map (frontier, scheduling, fetch, parse, and storage flow), divided into a Control Plane and a Data Plane.
The crawler is split into a control plane and a data plane: the control plane handles priorities and host-level pacing, while the data plane handles fetch, parse, storage, and indexing.
As in many distributed systems, this separation lets you scale prioritization and pacing independently from the fetch and parse work itself.
Crawl Flow
It is important to separate URL canonicalization from document deduplication: the first removes redundant address variants, while the second keeps the same content from being pushed into storage and indexing more than once. Recrawling is then a constant balance between freshness and network cost.
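To make the distinction concrete, here is a minimal sketch of both steps. The normalization rules (dropping `utm_` parameters, stripping trailing slashes, sorting query parameters, removing fragments) are assumptions chosen for illustration; real sites sometimes treat these variants as different pages, so the rule set needs tuning. The function names are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking-parameter prefixes; extend per crawl policy.
TRACKING_PREFIXES = ("utm_",)
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Collapse redundant address variants of the same URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme default.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")  # assumption: trailing slash is not significant
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped


def content_checksum(body: bytes) -> str:
    """Exact-match document fingerprint used for content deduplication."""
    return hashlib.sha256(body).hexdigest()
```

Canonicalization runs before a URL enters the frontier; the checksum is compared after the fetch, so two different URLs serving identical bytes are stored and indexed once. An exact hash misses near-duplicates, which would need a similarity fingerprint instead.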
Fetch Path: notes
- The frontier issues URLs based on priority, crawl depth, and fairness across hosts.
- The scheduler checks the host budget and enforces crawl-delay before dispatching work.
- The fetcher retrieves the page, and the parser extracts content and links.
- After canonicalization, newly discovered URLs return to the frontier.
- The document and metadata move into the indexing pipeline.
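The first two steps of the fetch path can be sketched as a priority queue combined with a per-host "ready at" timestamp. This is a simplified, single-process illustration: the `Frontier` class, its method names, and the fixed one-second host delay are assumptions, and lower priority numbers are treated as more urgent.

```python
import heapq
import itertools
from urllib.parse import urlsplit


class Frontier:
    """Priority frontier that skips hosts still inside their pacing window."""

    def __init__(self, host_delay_s: float = 1.0):
        self._heap = []                # entries: (priority, seq, url, depth)
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority
        self._host_ready_at = {}       # host -> earliest allowed next fetch
        self._host_delay_s = host_delay_s

    def push(self, url: str, priority: int, depth: int = 0) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), url, depth))

    def pop_ready(self, now_s: float):
        """Return the best URL whose host may be fetched now, else None."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).hostname
            if self._host_ready_at.get(host, 0.0) <= now_s:
                # Reserve the host until its pacing window has passed.
                self._host_ready_at[host] = now_s + self._host_delay_s
                chosen = item[2]
                break
            deferred.append(item)
        # Put back everything that was skipped because its host was busy.
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen
```

The linear scan over busy hosts is acceptable for a sketch but not at scale; a larger design would more likely keep one queue per host and a separate heap of hosts ordered by their next allowed fetch time.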
Recrawl Path: notes
- A freshness scorer selects recrawl candidates based on how stale each page is and on importance signals.
- Fast-changing pages are revisited more often; stable ones move on a slower, more batched schedule.
- Conditional fetch (ETag/Last-Modified) reduces traffic and CPU.
- The crawl result is written to the crawl event log for retries and observability.
- The URL returns to the frontier with an updated next_fetch_at.
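Two pieces of the recrawl path lend themselves to a short sketch: building conditional request headers from stored validators, and adapting the recrawl interval to observed change. The halve-on-change, double-on-unchanged rule and the 5-minute and 7-day bounds are illustrative assumptions, not a recommended policy; the function names are hypothetical.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build HTTP validators so an unchanged page can answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def next_interval_s(
    current_s: float,
    changed: bool,
    min_s: float = 300.0,             # assumed floor: 5 minutes
    max_s: float = 7 * 24 * 3600.0,   # assumed ceiling: 7 days
) -> float:
    """Shorten the interval when the page changed, lengthen it when it did not."""
    proposed = current_s / 2 if changed else current_s * 2
    return max(min_s, min(max_s, proposed))


def next_fetch_at(now_s: float, current_interval_s: float, changed: bool) -> float:
    """Timestamp to write back to the frontier entry."""
    return now_s + next_interval_s(current_interval_s, changed)
```

Here `changed` would be derived from the fetch result, for example a non-304 response whose checksum differs from the stored one. Importance signals could scale the bounds so that high-value pages never drift to the slowest cadence.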
Simplified Data Model
- frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
- host_budget: host_key, tokens, refill_rate, crawl_delay_ms
- seen_urls: url_hash, first_seen_at, last_fetch_status
- page_store: page_id, url, content_pointer, fetched_at, checksum
- crawl_events: fetch_started/fetch_succeeded/fetch_failed/retry_scheduled
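One way to read the host_budget record is as a token bucket combined with a minimum gap between requests. The sketch below uses the fields from the model (host_key, tokens, refill_rate, crawl_delay_ms); the `capacity`, `last_refill_s`, and `last_fetch_s` fields and the `try_acquire` method are assumptions added to make the example work.

```python
from dataclasses import dataclass


@dataclass
class HostBudget:
    host_key: str
    tokens: float
    refill_rate: float                    # tokens added per second
    crawl_delay_ms: int                   # minimum gap between two fetches
    capacity: float = 1.0                 # assumed burst limit (not in the model)
    last_refill_s: float = 0.0            # assumed bookkeeping field
    last_fetch_s: float = float("-inf")   # assumed bookkeeping field

    def try_acquire(self, now_s: float) -> bool:
        """Spend one token if both the bucket and the crawl delay allow it."""
        elapsed = max(0.0, now_s - self.last_refill_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill_s = now_s
        if (now_s - self.last_fetch_s) * 1000 < self.crawl_delay_ms:
            return False  # still inside the host's crawl-delay window
        if self.tokens < 1.0:
            return False  # budget exhausted until the bucket refills
        self.tokens -= 1.0
        self.last_fetch_s = now_s
        return True
```

The crawl delay caps the instantaneous request rate, while the bucket caps the sustained rate, so a host with a generous crawl delay can still be held to a lower long-run budget.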
Reliability and Common Mistakes
Crawler reliability is not just about queues. It also depends on checkpoints, per-host circuit breakers, and adaptive retry backoff that reacts to the kind of failure you hit.
Working patterns
- Idempotent processing: revisiting the same URL does not corrupt state.
- Checkpoint the frontier and queue offsets for restart without losing progress.
- Per-host circuit breaker for problem domains.
- Adaptive retry policy: exponential backoff + jitter by error type.
- Frontier sharding by host hash to reduce lock contention.
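Two of the patterns above, the adaptive retry policy and frontier sharding by host hash, can be sketched in a few lines. The error categories, base delays, and one-hour cap are made-up values for illustration, and the jitter variant shown is "full jitter" (a uniform draw between zero and the exponential ceiling), which is one option among several.

```python
import hashlib
import random
from typing import Optional

# Assumed error taxonomy and base delays in seconds; tune per crawl policy.
BASE_DELAY_S = {"timeout": 5.0, "http_5xx": 30.0, "http_429": 60.0}
NON_RETRYABLE = {"http_404", "robots_disallowed"}


def retry_delay(
    error_kind: str,
    attempt: int,
    rng: random.Random,
    cap_s: float = 3600.0,
) -> Optional[float]:
    """Exponential backoff with full jitter; None means do not retry."""
    if error_kind in NON_RETRYABLE:
        return None
    base = BASE_DELAY_S.get(error_kind, 10.0)
    ceiling = min(cap_s, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def shard_for_host(host_key: str, num_shards: int) -> int:
    """Stable host-to-shard mapping so one host's URLs share one frontier shard."""
    digest = hashlib.sha256(host_key.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Keeping all URLs of a host on one shard means pacing state for that host lives in one place. A plain modulo mapping reshuffles most hosts when `num_shards` changes, so a design that expects resharding would more likely use consistent hashing.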
Common mistakes
- Global FIFO without per-host limits: quickly leads to bans and blocks.
- Lack of URL canonicalization (utm parameters, trailing slashes, letter case) causes a duplicate explosion.
- Recrawling everything at the same cadence without importance or freshness scoring.
- Storing all content only in RAM queues without a durable layer.
- Ignoring robots.txt and legal/compliance restrictions.
RFC
Robots Exclusion Protocol
Official specification of robots.txt and rules for crawler interaction with sites.
The key trade-off is throughput versus politeness and legal limits. In interviews, it is worth showing how you cap per-domain speed, plan recrawls, and stay resilient when external sources fail.
Related chapters
- Search System - Shows how web crawling feeds indexing, query processing, and ranking.
- Distributed File System - Helps discuss durable storage for page snapshots, crawl artifacts, and failure recovery.
- Rate Limiter - Deepens the discussion around per-host pacing, retry backoff, and protecting external sources from overload.
- CDN - Helps reason about traffic geography and network latency that shape crawling and recrawl strategy.
