System Design Space
Knowledge graphSettings

Updated: April 30, 2026 at 7:40 AM

Search System (Google/Elasticsearch)

medium

Classic search-system case: web crawling, an inverted index, BM25 and PageRank ranking, and distributed query execution.

A search system rarely breaks at document write time. It breaks where indexing, ranking, freshness, and the user-facing query path meet.

This chapter breaks down document ingestion, the inverted index, shard routing, ranking cascades, autocomplete, and the consistency window between writes and searchable results.

For interviews and engineering discussions, it is valuable because it forces you to reason about relevance, latency, and operating cost at the same time, not just storage.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Why Search Is Hard

Designing a search system is one of the richest architecture problems in a system-design interview. It combines crawling, indexing, query processing, ranking, freshness, and large-scale distributed execution in a single user-facing path.

Core Subsystems

Web Crawler

Collects pages and decides what to revisit

Indexer

Turns documents into an inverted index

Ranker

Selects and orders the best results

Functional Requirements

Core Features

  • Full-text search of billions of documents
  • Autocompletion and query suggestions
  • Faceted search and filtering
  • Relevant ranking of results
  • Spell correction and synonym processing

Additional features

  • Search by images and videos
  • Personalization of results
  • Localization and language models
  • Real-time indexing of new pages

Non-functional requirements

Search feels fast only when the end-to-end path stays inside a tight tail-latency budget while still handling hundreds of thousands of queries per second.

<200ms
Tail-latency budget for search queries
10B+
Indexed documents
100K QPS
Peak Request Load
99.99%
Availability target

Crawling and Recrawl Planning

Connection

Alex Xu: Web Crawler

A detailed analysis of the crawler architecture in the book System Design Interview

Читать обзор

A crawler does more than fetch pages. It decides what to visit next, when to revisit a page, and how to avoid overloading other sites while processing billions of URLs every day.

Crawler Architecture

Select a stage to highlight key components

Priority Queue
Politeness Manager
Freshness Scorer
Worker Pool
DNS Resolver
robots.txt Cache
Parser (HTML)
Duplicate Detector
URL Extractor

URL Frontier and Crawl Policy

The URL frontier controls priority, revisit cadence, and domain-level load:

  • Priority queue — important pages first
  • Politeness — cap requests per second for each domain
  • Freshness — revisit frequency for each page

Duplicate Detection

Duplicate handling prevents wasted storage and ranking noise:

  • URL deduplication — Bloom Filter for seen URLs
  • Content Hash — SimHash/MinHash
  • Near-duplicate detection — LSH for similar pages

💡 Politeness and robots.txt

Each crawler must comply with robots.txt and avoid overloading origin servers. Google uses separate queues per domain with explicit request-rate controls. A typical interval between requests to the same domain is 1-10 seconds.

Inverted Index

Theory

Database Internals

Deep Dive into Data Structures for Indexing

Читать обзор

The Inverted Index is the fundamental data structure for full-text search. Instead of storing “document → words,” it stores “word → documents.”

Inverted Index Structure

Forward Index (traditional):
┌─────────────┬──────────── ──────────────────────────┐
│ Doc ID │ Terms │
├─────────────┼──────────── ──────────────────────────┤
│ doc_1 │ "system", "design", "interview" │
│ doc_2 │ "design", "patterns", "book" │
│ doc_3 │ "system", "architecture", "book" │
└─────────────┴──────────── ──────────────────────────┘

Inverted Index:
┌─────────────────┬───────────── ────────────────────────────────┐
│ Term │ Posting List │
├─────────────────┼───────────── ────────────────────────────────┤
│ "system" │ [(doc_1, pos:[0]), (doc_3, pos:[0])] │
│ "design" │ [(doc_1, pos:[1]), (doc_2, pos:[0])] │
│ "interview" │ [(doc_1, pos:[2])] │
│ "patterns" │ [(doc_2, pos:[1])] │
│ "book" │ [(doc_2, pos:[2]), (doc_3, pos:[2])] │
│ "architecture" │ [(doc_3, pos:[1])] │
└─────────────────┴───────────── ────────────────────────────────┘

Posting List Entry:
┌─────────────────────────────── ───────────────────────────────┐
│ doc_id: uint64 │
│ term_frequency: uint16 │
│ positions: [uint32, ...] // for phrase queries │
│ field_lengths: {title: 5, body: 1000, ...} │
└─────────────────────────────── ───────────────────────────────┘

Index Components

  • Dictionary (Lexicon)

    A dictionary of all unique terms is stored in memory

  • Posting Lists

    Lists documents for each term on disk

  • Skip Lists

    Speeding up the intersection of posting lists

Storage optimizations

  • Delta Encoding

    Storing doc_id differences instead of absolute values

  • Variable Byte Encoding

    Compress small numbers into fewer bytes

  • Block Compression

    PFOR, Simple9, Roaring Bitmaps

Text Processing Pipeline

Select a stage to inspect the transformation output

Input

"The System Design Interview is AMAZING!"

Output

["The","System","Design","Interview","is","AMAZING!"]

Stage Summary

Split text into individual tokens

Query Processing

Query handling starts with parsing and expansion, then quickly narrows the candidate set before the heavier ranking stages take over.

Query Processing Flow

Switch stages to inspect processing details

Current Stage

Query Parser

Tokens, operators, and query fields

Example

Input: "system design patterns"Tokens: ["system","design","patterns"]Operator: ANDFields: title?, body?

Latency Budget

Parser + expansion: < 20ms, retrieval: < 50ms, ranking: < 100ms

⚡ Optimization: the WAND algorithm

WAND (Weighted AND) speeds up Top-K retrieval by avoiding a full scan of every candidate document:

  • Each term keeps an upper bound for the score it can still contribute
  • Documents that cannot possibly enter the Top-K are skipped early
  • In practice, this often gives a multi-x speedup over a naive scan

Ranking and Relevance

Search quality depends on how well the system orders candidates after retrieval. Modern engines therefore use a cascade: cheap lexical scores first, then richer graph signals and heavier neural models.

BM25 Formula

Basic formula for text ranking:

BM25(D, Q) = Σ IDF(qi) × 
  (f(qi, D) × (k1 + 1)) / 
  (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

Where:
  f(qi, D) = term frequency in document
  |D| = document length
  avgdl = average document length
  k1 = 1.2 (term saturation)
  b = 0.75 (length normalization)

BM25 is an improved lexical-ranking formula that extends TF-IDF with saturation and length normalization

PageRank (simplified)

Assessing page authority based on incoming links:

PR(A) = (1-d) + d × Σ PR(Ti)/C(Ti)

Where:
  d = damping factor (0.85)
  Ti = pages linking to A
  C(Ti) = outbound links from Ti

Iterative computation:
  1. Initialize all PR = 1/N
  2. Iterate until convergence
  3. Normalize results

The classic Google signal for measuring link authority

Multi-Stage Ranking Pipeline

Switch stages to inspect ranking details

Current Stage

Stage 1: Candidate Generation

Inverted index lookup + BM25

Output

~10,000 candidates

Latency Budget

≈ 50ms

Highlights

  • BM25 scoring
  • Top-10K retrieval
  • Low latency
100+
Ranking signals in Google
BERT
Neural re-ranking models
LTR
Learning-to-rank pipelines

Distributed Search Architecture

Theory

DDIA: Partitioning

Data partitioning for scaling

Читать обзор

A search engine at Google scale needs a distributed layout where sharding affects fanout cost, rebalancing behavior, and the latency budget of every query.

Sharding Strategies

Document-based Sharding
Shard 1: docs 1-1M
Shard 2: docs 1M-2M
Shard 3: docs 2M-3M
...

Query → all shards → merge results

Pros:
+ Simplicity
+ Easy document ingestion
Cons:
- Every query hits all shards
Term-based Sharding
Shard 1: terms A-F
Shard 2: terms G-M
Shard 3: terms N-Z
...

Query → only the necessary shards

Pros:
+ Fewer shards per request
Cons:
- Hotspots on popular terms
- More complex document writes

Distributed Query Processing

Highlight routing, shard work, or final merge

Load Balancer

Distributes incoming traffic

Query Router

Parses the query and fans it out

Aggregator

Merges Top-K and applies final ranking

Shard 1

Replicas A/B/C

Shard 2

Replicas A/B/C

Shard 3

Replicas A/B/C

1. Router parses the query
2. Parallel fanout to all shards
3. Merge Top-K and apply final rerank

Latency Budget

Router: < 10msShard fanout: < 100msMerge: < 20ms

💡 Elasticsearch architecture

Elasticsearch uses document-based sharding with automatic routing:

shard = hash(document_id) % number_of_primary_shards

• Primary shards — serve writes and reads
• Replica shards — increase read capacity and provide failover
• Typical configuration: 5 primary × 2 replicas = 15 shards

Near-Real-Time Indexing

News and social products live or die by how quickly new documents become searchable. That forces the write path to balance freshness, refresh cost, and durable persistence.

Tiered Index Architecture

Switch tiers to inspect their role and characteristics

Active Tier

Real-time Tier

In-memory index for hot data

Storage

RAM

Freshness

Seconds

Focus

Writes + freshness

Latency

< 1s to searchable

Query Fanout

Query → all tiers in parallel → merge → deduplicate → return

Write Path

  • 1.The document lands in Kafka
  • 2.An indexing consumer processes it
  • 3.The update goes into an in-memory buffer
  • 4.Refresh makes it searchable
  • 5.Flush persists it durably on disk

How Elasticsearch refresh works

  • refresh_interval: 1s by default
  • Creates a new searchable segment in memory
  • Keeps the system near real time rather than strictly synchronous
  • Trade-off: refresh cost versus freshness

Autocomplete and Suggestions

Autocomplete has an even tighter latency target than full search, so it usually relies on prefix trees, cached hot prefixes, and precomputed popular completions.

Trie-based Autocomplete

Select a prefix to inspect trie path and suggestions

Trie Path

rootsys
Prefix length

3 chars

Latency target

< 50ms

Note

Top results are stored in the prefix node

Top Suggestions

  • system design
  • system
  • system administrator
  • system metrics

Top-K Cache

Trie nodes store top suggestions (Top-10) to avoid traversing the whole subtree.

Optimizations

  • Prefix caching — cache of popular prefixes
  • Compressed Trie — merge one-child chains to save space
  • Top-K precomputation — store the best completions directly in trie nodes

Data Collection

  • Query logs aggregated over time
  • Filtering offensive requests
  • Personalization from user history

What to cover in an interview

Topics you should definitely cover

  • Inverted index versus forward index and the trade-offs between them
  • BM25 and TF-IDF, plus what each score is actually capturing
  • Sharding strategy — document vs term
  • How the latency budget is split across retrieval and ranking stages
  • Near-real-time indexing versus batch refresh

Common follow-up questions

  • How to process phrase queries?
  • How to implement fuzzy search?
  • How to add personalization?
  • How to deal with spam in results?
  • How to measure search quality?

Key Takeaways

1

Inverted Index

The core structure that makes large-scale full-text retrieval practical

2

Ranking Cascade

Progressively heavier stages balance relevance quality against latency

3

Document Sharding

Usually the most straightforward way to scale search in practice

4

Tiered Storage

Separate real-time, recent, and historical layers keep fresh content searchable

5

Polite Crawling

robots.txt compliance and per-domain limits are part of the design, not an afterthought

Related chapters

Enable tracking in Settings