A web crawler is not really about downloading the whole internet. It is about frontier scheduling, politeness, deduplication, and deciding what to crawl again and when.
This chapter breaks down the URL frontier, distributed queues, per-domain rate limits, the parsing pipeline, and the path from raw pages to an index.
For interviews and engineering discussions, this case is useful because it tests whether you can design for an effectively unbounded input space and a highly heterogeneous external world.
Pipeline Thinking
Ingestion, partitioning, deduplication, and stage latency drive system behavior.
Serving Layer
Index and cache-locality decisions directly shape user-facing query latency.
Consistency Window
Explicitly define where eventual consistency is acceptable and where it is not.
Cost vs Freshness
Balance update frequency with compute/storage cost and operational complexity.
Source
System Design Interview
The web crawler is one of the classic cases for discussing the frontier, politeness, and scaling.
A web crawler solves two problems at once: maximizing coverage of the web while not overloading other people's services. The architecture is therefore built around three ideas: a distributed frontier, strict per-domain politeness, and incremental content updates.
Requirements
Functional
Accept a seed set of URLs and continuously expand coverage from discovered links.
Comply with robots.txt, crawl-delay directives, and per-host rate limits.
Extract content and links, normalize URLs, and remove duplicates.
Support re-crawl with different frequencies for hot and cold pages.
Export data to the indexing pipeline (search layer); a minimal loop tying these requirements together is sketched below.
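The following is a minimal, single-process sketch of that loop: seed, fetch, extract links, deduplicate, re-enqueue. A real crawler replaces each piece with a distributed service; the `fetch` and `extract_links` callables and all names here are illustrative, not a specific library's API.

```python
from collections import deque

def crawl(seeds: list[str], fetch, extract_links, max_pages: int = 100):
    """fetch(url) -> html; extract_links(html) -> iterable of absolute URLs."""
    frontier = deque(seeds)          # in production: a distributed frontier
    seen = set(seeds)                # in production: a sharded dedup store
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue                 # real code: classify the error and retry
        pages.append((url, html))    # real code: publish to the indexing pipeline
        for link in extract_links(html):
            if link not in seen:     # dedup before enqueue
                seen.add(link)
                frontier.append(link)
    return pages
```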
Non-functional
Scale: billions of URLs
Frontier and storage must scale horizontally.
Freshness: minutes to hours
Popular pages should be re-crawled quickly.
Politeness: per-domain fairness
No single host may be flooded with a burst of requests.
Reliability: resume after failures
Queues and checkpointing should not lose crawl progress.
High-Level Architecture
Web Crawler: High-Level Map
The crawler is split into a control plane (frontier, policy, scheduling) and a data plane (fetch, parse, storage, indexing).
As in other distributed cases, the key pattern is explicit separation of control plane and data plane. This allows you to scale scheduling/politeness logic and the fetch pipeline independently.
Crawl Flow
Fetch Path: operational notes
- Frontier issues URLs based on priority score and per-host fairness.
- Scheduler applies the host budget and crawl-delay before issuing a lease (see the leasing sketch after this list).
- The fetcher downloads the page; the parser extracts content and links.
- After canonicalization, newly discovered URLs are fed back into the frontier.
- The document and metadata are published to the indexing pipeline.
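A minimal sketch of the leasing step, combining a priority frontier with a per-host token bucket and crawl-delay. The names (FrontierEntry, HostBudget, lease_next) are illustrative assumptions, not a specific system's API.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class FrontierEntry:
    priority: float                      # lower value = crawl sooner
    url: str = field(compare=False)
    host_key: str = field(compare=False)

class HostBudget:
    """Token bucket plus crawl-delay for a single host."""
    def __init__(self, refill_rate: float, crawl_delay_s: float):
        self.tokens = 1.0
        self.refill_rate = refill_rate   # tokens per second
        self.crawl_delay_s = crawl_delay_s
        self.last_issue = float("-inf")
        self.last_refill = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at one burst token.
        self.tokens = min(1.0, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0 and now - self.last_issue >= self.crawl_delay_s:
            self.tokens -= 1.0
            self.last_issue = now
            return True
        return False

def lease_next(frontier: list, budgets: dict):
    """Pop the best-priority URL whose host budget allows a fetch now.

    Entries for throttled hosts are set aside and pushed back, so one
    slow host cannot block the rest of the frontier.
    """
    deferred, leased = [], None
    while frontier:
        entry = heapq.heappop(frontier)
        budget = budgets.setdefault(entry.host_key, HostBudget(1.0, 1.0))
        if budget.try_acquire():
            leased = entry
            break
        deferred.append(entry)
    for entry in deferred:
        heapq.heappush(frontier, entry)
    return leased
```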
Recrawl Path: operational notes
- Freshness scorer selects URLs to re-crawl based on the stale window.
- Hot pages are re-crawled more often; cold pages less often and in batches.
- Conditional fetch (ETag/Last-Modified) reduces traffic and CPU (see the sketch after this list).
- The crawl result is written to the crawl event log for retry/observability.
- The URL is returned to the frontier with an updated next_fetch_at.
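A minimal sketch of the conditional re-crawl step using stdlib urllib, plus one common heuristic for adapting next_fetch_at: halve the interval when the content changed, double it on 304. The bounds and the halving/doubling rule are illustrative assumptions.

```python
import urllib.error
import urllib.request
from datetime import datetime, timedelta, timezone

def conditional_fetch(url: str, etag: str | None, last_modified: str | None):
    """Return (status, body, new_etag, new_last_modified)."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return (resp.status, resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:              # not modified: skip parse and store
            return (304, None, etag, last_modified)
        raise

def next_fetch_at(interval: timedelta, changed: bool):
    """Shrink the re-crawl interval for changing pages, grow it otherwise."""
    if changed:
        interval = max(timedelta(minutes=5), interval / 2)
    else:
        interval = min(timedelta(days=30), interval * 2)
    return datetime.now(timezone.utc) + interval, interval
```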
Data Model (simplified)
- frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
- host_budget: host_key, tokens, refill_rate, crawl_delay_ms
- seen_urls: url_hash, first_seen_at, last_fetch_status
- page_store: page_id, url, content_pointer, fetched_at, checksum
- crawl_events: fetch_started / fetch_succeeded / fetch_failed / retry_scheduled
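A minimal sketch of how seen_urls backs deduplication, assuming URLs are canonicalized first. Hashing the canonical URL (SHA-256 here; real systems often put a Bloom filter in front of the store) gives a fixed-size key shared by the frontier and the page store. The names are illustrative.

```python
import hashlib

def url_hash(canonical_url: str) -> str:
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()

seen_urls: set = set()            # in production: a sharded KV store

def enqueue_if_new(canonical_url: str, frontier: list) -> bool:
    """Add the URL to the frontier only on first sight."""
    h = url_hash(canonical_url)
    if h in seen_urls:
        return False
    seen_urls.add(h)
    frontier.append(canonical_url)
    return True
```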
Reliability and anti-patterns
Production patterns
- Idempotent fetch pipeline: URL reprocessing does not break state.
- Checkpoint frontier and queue offsets for lossless restart.
- Per-host circuit breaker for problem domains.
- Adaptive retry policy: exponential backoff with jitter, tuned by error type (sketched after this list).
- Frontier sharding by host hash to reduce lock contention.
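A minimal sketch of such a retry policy: exponential backoff with full jitter, with base delays that differ by error class. The error classes, base delays, and caps are illustrative assumptions.

```python
import random

BASE_DELAY_S = {
    "timeout":  2.0,     # transient network issue: retry soon
    "http_5xx": 10.0,    # server-side error: back off harder
    "http_429": 60.0,    # explicit throttling: respect the host
}
MAX_DELAY_S = 3600.0
MAX_ATTEMPTS = 5

def retry_delay(error_class: str, attempt: int):
    """Seconds to wait before retry `attempt`, or None to give up."""
    if attempt >= MAX_ATTEMPTS or error_class not in BASE_DELAY_S:
        return None                       # permanent errors are not retried
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S[error_class] * (2 ** attempt))
    return random.uniform(0, ceiling)     # full jitter avoids retry storms
```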
Common mistakes
- Global FIFO without per-host limits: quickly leads to bans and blocks.
- Missing URL canonicalization (utm parameters, trailing slash, case) leads to an explosion of duplicates; a canonicalization sketch follows this list.
- Re-crawling everything at the same rate without importance/freshness scoring.
- Storing all content only in RAM queues without a durable layer.
- Ignoring robots.txt and legal/compliance restrictions.
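A minimal canonicalization sketch: lowercase scheme and host, drop default ports, strip tracking parameters and fragments, normalize the trailing slash, and sort query parameters. The tracking-parameter list is an illustrative subset.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Keep the port only when it is not the scheme default.
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")           # normalize the trailing slash
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    return urlunsplit((scheme, host, path, query, ""))  # fragment dropped

# e.g. canonicalize("HTTP://Example.com:80/a/?utm_source=x&b=2")
#      -> "http://example.com/a?b=2"
```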
RFC
Robots Exclusion Protocol (RFC 9309)
The official specification of robots.txt and the rules for crawler interaction with sites.
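Python ships a stdlib parser for this protocol; a minimal gate might look like the following (the agent name is illustrative, and in production the parsed rules would be cached per host with a TTL rather than re-fetched per URL).

```python
from urllib.robotparser import RobotFileParser

def robots_check(host: str, path: str, agent: str = "MyCrawler"):
    """Return (allowed, crawl_delay_seconds_or_None) for one URL."""
    rp = RobotFileParser()
    rp.set_url(f"https://{host}/robots.txt")
    rp.read()                                   # fetch and parse robots.txt
    allowed = rp.can_fetch(agent, f"https://{host}{path}")
    delay = rp.crawl_delay(agent)               # None if no Crawl-delay rule
    return allowed, delay
```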
Key trade-off: maximum throughput vs politeness/legal compliance. In an interview, it is important to show how you limit request rates per domain and remain resilient to failures.
Related chapters
- Search System - shows how the crawler pipeline feeds indexing, query processing, and ranking layers.
- Distributed File System - adds durable storage patterns for crawl artifacts, snapshots, and failure recovery.
- Rate Limiter - deepens politeness controls: per-host pacing, retry backoff, and source protection.
- CDN - helps reason about edge traffic behavior that influences crawling strategy and freshness planning.
