System Design Interview
The web crawler is one of the classic system design cases for discussing the frontier, politeness, and scaling.
A web crawler solves two problems at once: maximizing coverage of the Internet without overloading other people's services. The architecture is therefore built around three ideas: a distributed frontier, strict per-domain politeness, and incremental content updates.
Requirements
Functional
Accept a starting set of seed URLs and regularly expand coverage from it.
Comply with robots.txt, crawl-delay and per-host rate limits.
Extract content and links, normalize URLs and remove duplicates.
Support re-crawl with different frequencies for hot and cold pages.
Export data to the indexing pipeline (search layer).
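The robots.txt requirement can be checked with Python's standard-library parser. A minimal sketch; the rules and the "mybot" agent name are purely illustrative:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt body; in production this is fetched per host and cached.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("mybot", "https://example.com/public/page"))   # True
print(parser.can_fetch("mybot", "https://example.com/private/page"))  # False
print(parser.crawl_delay("mybot"))  # 2
```

The `crawl_delay` value feeds directly into the per-host scheduler, and the verdicts should be cached with a TTL rather than re-parsed on every fetch.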
Non-functional
- Scale: billions of URLs — the frontier and storage must scale horizontally.
- Freshness: minutes to hours — popular pages should be re-crawled quickly.
- Politeness: per-domain fairness — no single host may be flooded with requests.
- Reliability: resume after failures — queues and checkpointing must not lose crawl progress.
High-Level Architecture
The crawler is split into a control plane (frontier, policy, scheduling) and a data plane (fetch, parse, storage, indexing).
As in other distributed systems cases, the key pattern is an explicit separation of the control plane and the data plane. This lets you scale the scheduling/politeness logic independently of the fetch pipeline itself.
Crawl Flow
Fetch Path: operational notes
- Frontier issues URLs based on priority score and per-host fairness.
- Scheduler applies host budget and crawl-delay before issuing a lease.
- The fetcher downloads the page; the parser extracts content and links.
- After canonicalization, the new URLs are returned to the frontier loop.
- The document and metadata are published in the indexing pipeline.
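The host-budget and crawl-delay step above can be sketched as a per-host token bucket. The `HostBudget` class and its defaults are illustrative, not a fixed design:

```python
import time
from dataclasses import dataclass, field

@dataclass
class HostBudget:
    """Token bucket per host: refilled at refill_rate tokens/sec, capped at capacity."""
    refill_rate: float = 1.0   # allowed requests per second for this host
    capacity: float = 2.0      # burst allowance
    tokens: float = 2.0
    last_refill: float = field(default_factory=time.monotonic)

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted; URL goes back to the frontier

budgets: dict[str, HostBudget] = {}

def lease(host_key: str) -> bool:
    """Grant a fetch lease only if the host's bucket has a token."""
    return budgets.setdefault(host_key, HostBudget()).try_acquire()
```

In a distributed setup the bucket state lives in the `host_budget` store rather than in process memory, but the refill arithmetic is the same.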
Recrawl Path: operational notes
- Freshness scorer selects URLs to re-crawl based on the stale window.
- Hot pages are re-crawled more often; cold pages less often, in batch mode.
- Conditional fetch (ETag/Last-Modified) reduces traffic and CPU.
- The crawl result is written to the crawl event log for retry/observability.
- The URL is returned to the frontier with an updated next_fetch_at.
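The conditional-fetch step can be sketched as follows; `PageMeta` and the returned action names are assumed for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageMeta:
    """Validators saved from the previous successful crawl of this URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def conditional_headers(meta: PageMeta) -> dict:
    """Build If-None-Match / If-Modified-Since from the stored validators."""
    headers = {}
    if meta.etag:
        headers["If-None-Match"] = meta.etag
    if meta.last_modified:
        headers["If-Modified-Since"] = meta.last_modified
    return headers

def handle_response(status: int) -> str:
    """Map the HTTP status to the recrawl action."""
    if status == 304:
        return "unchanged"   # skip parse/store, just push next_fetch_at forward
    if 200 <= status < 300:
        return "reparse"     # content changed: full parse + store + link extraction
    return "retry"           # goes through the retry/backoff policy
```

A 304 response saves both bandwidth and parser CPU, which is exactly why the validators are worth keeping in the page metadata.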
Data Model (simplified)
frontier_queue: url_hash, priority, next_fetch_at, host_key, depth
host_budget: host_key, tokens, refill_rate, crawl_delay_ms
seen_urls: url_hash, first_seen_at, last_fetch_status
page_store: page_id, url, content_pointer, fetched_at, checksum
crawl_events: fetch_started / fetch_succeeded / fetch_failed / retry_scheduled
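A minimal in-memory sketch of frontier_queue ordering, using a heap keyed by (next_fetch_at, -priority) so due URLs come out first and ties go to higher priority. Field names follow the model above; the rest is illustrative:

```python
import heapq

# Heap entries: (next_fetch_at, -priority, url_hash, host_key)
frontier: list[tuple[float, float, str, str]] = []

def push(url_hash: str, host_key: str, priority: float, next_fetch_at: float) -> None:
    heapq.heappush(frontier, (next_fetch_at, -priority, url_hash, host_key))

def pop_due(now: float):
    """Pop the highest-priority due entry, or return None if nothing is due yet."""
    if frontier and frontier[0][0] <= now:
        _, _, url_hash, host_key = heapq.heappop(frontier)
        return url_hash, host_key
    return None

push("u1", "example.com", priority=0.9, next_fetch_at=0.0)
push("u2", "example.org", priority=0.5, next_fetch_at=0.0)
print(pop_due(now=1.0))  # ('u1', 'example.com') — higher priority pops first
```

At scale this heap is sharded (e.g. by host hash) and persisted, but the ordering invariant is the same.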
Reliability and anti-patterns
Production patterns
- Idempotent fetch pipeline: URL reprocessing does not break state.
- Checkpoint frontier and queue offsets for lossless restart.
- Per-host circuit breaker for problem domains.
- Adaptive retry policy: exponential backoff + jitter by error type.
- Frontier sharding by host hash to reduce lock contention.
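The adaptive retry policy above can be sketched as full-jitter exponential backoff; the retryable status set and limits are illustrative assumptions:

```python
import random

RETRYABLE = {429, 500, 502, 503, 504}  # illustrative set of transient statuses

def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only transient errors, up to a bounded number of attempts."""
    return status in RETRYABLE and attempt < max_attempts

def retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The jitter spreads retries from many workers over time, so a recovering host is not hit by a synchronized retry wave.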
Common mistakes
- A global FIFO without per-host limits: quickly leads to bans and blocks.
- Missing URL canonicalization (utm params, trailing slash, case) -> a duplicate explosion.
- Re-crawling everything at the same rate, without importance/freshness scoring.
- Storing all state only in in-memory queues, without a durable layer.
- Ignoring robots.txt and legal/compliance restrictions.
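URL canonicalization, whose absence is flagged above, can be sketched with the standard library. The exact rule set here (which tracking params to drop, default ports, trailing slash) is an assumption; real crawlers tune it per source:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of tracking parameters to strip.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def canonicalize(url: str) -> str:
    """Lowercase scheme/host, drop tracking params, default port, trailing slash, fragment."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    default_port = {"http": ":80", "https": ":443"}.get(scheme)
    if default_port and netloc.endswith(default_port):
        netloc = netloc[: -len(default_port)]
    # Sort the surviving query params so equivalent URLs hash identically.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS))
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, query, ""))

print(canonicalize("HTTP://Example.COM:80/a/?utm_source=x&b=2"))
# http://example.com/a?b=2
```

The canonical form is what gets hashed into url_hash for the seen_urls check, so every rule here directly controls how many duplicates the frontier absorbs.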
RFC
Robots Exclusion Protocol
Official specification of robots.txt and rules for crawler interaction with sites.
Key trade-off: maximum throughput vs. politeness and legal compliance. In an interview, it is important to show how you rate-limit per domain and stay resilient under failures.
