Search System (Google/Elasticsearch)

A search system rarely breaks at document write time. It breaks where indexing, ranking, freshness, and the user-facing query path meet.

This chapter breaks down document ingestion, the inverted index, shard routing, ranking cascades, autocomplete, and the consistency window between writes and searchable results.

For interviews and engineering discussions, it is valuable because it forces you to reason about relevance, latency, and operating cost at the same time, not just storage.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Why Search Is Hard

Search rarely fails on a single subsystem — it fails at the seams between them. A single user-facing path pulls together crawling, indexing, query processing, ranking, freshness, and large-scale distributed execution. That is exactly why interviewers like the problem: it quickly shows whether an engineer can hold several trade-offs in their head at once.

Core Subsystems

Web Crawler

Collects pages and decides what to revisit

Indexer

Turns documents into an inverted index

Ranker

Selects and orders the best results

Functional Requirements

Core Features

Full-text search of billions of documents
Autocompletion and query suggestions
Faceted search and filtering
Relevant ranking of results
Spell correction and synonym processing

Additional features

Search by images and videos
Personalization of results
Localization and language models
Real-time indexing of new pages

Non-functional requirements

Search feels fast only when the end-to-end path stays inside a tight tail-latency budget while still handling hundreds of thousands of queries per second.

<200ms

Tail-latency budget for search queries

10B+

Indexed documents

100K QPS

Peak Request Load

99.99%

Availability target

Crawling and Recrawl Planning

Connection

Alex Xu: Web Crawler

A detailed analysis of the crawler architecture in the book System Design Interview

Читать обзор

A crawler holds three tense constraints at once: what to visit next, when to revisit a page, and how to avoid overloading other sites. At Google scale that means billions of URLs a day, and pushing on any one constraint costs the other two — chasing freshness eats into politeness, and politeness slows down coverage.

Crawler Architecture

Select a stage to highlight key components

Priority Queue

Politeness Manager

Freshness Scorer

Worker Pool

DNS Resolver

robots.txt Cache

Parser (HTML)

Duplicate Detector

URL Extractor

URL Frontier and Crawl Policy

The URL frontier decides what the crawler sees first and how gently it treats other people's servers:

Priority queue — important pages first
Politeness — cap requests per second for each domain
Freshness — revisit frequency for each page

Duplicate Detection

Duplicate handling prevents wasted storage and ranking noise:

URL deduplication — Bloom Filter for seen URLs
Content Hash — SimHash/MinHash
Near-duplicate detection — LSH for similar pages

💡 Politeness and robots.txt

Each crawler must comply with robots.txt and avoid overloading origin servers. Google uses separate queues per domain with explicit request-rate controls. A typical interval between requests to the same domain is 1-10 seconds.

Inverted Index

Theory

Database Internals

Deep Dive into Data Structures for Indexing

Читать обзор

Scanning every document on each query is impossible, so search flips the data ahead of time. The inverted index stores not “document → words” but the reverse, “word → documents” — and that lets a query pull a candidate list straight from its terms without touching the rest of the collection.

Inverted Index Structure

Forward Index (traditional):
┌─────────────┬──────────── ──────────────────────────┐
│ Doc ID │ Terms │
├─────────────┼──────────── ──────────────────────────┤
│ doc_1 │ "system", "design", "interview" │
│ doc_2 │ "design", "patterns", "book" │
│ doc_3 │ "system", "architecture", "book" │
└─────────────┴──────────── ──────────────────────────┘

Inverted Index:
┌─────────────────┬───────────── ────────────────────────────────┐
│ Term │ Posting List │
├─────────────────┼───────────── ────────────────────────────────┤
│ "system" │ [(doc_1, pos:[0]), (doc_3, pos:[0])] │
│ "design" │ [(doc_1, pos:[1]), (doc_2, pos:[0])] │
│ "interview" │ [(doc_1, pos:[2])] │
│ "patterns" │ [(doc_2, pos:[1])] │
│ "book" │ [(doc_2, pos:[2]), (doc_3, pos:[2])] │
│ "architecture" │ [(doc_3, pos:[1])] │
└─────────────────┴───────────── ────────────────────────────────┘

Posting List Entry:
┌─────────────────────────────── ───────────────────────────────┐
│ doc_id: uint64 │
│ term_frequency: uint16 │
│ positions: [uint32, ...] // for phrase queries │
│ field_lengths: {title: 5, body: 1000, ...} │
└─────────────────────────────── ───────────────────────────────┘

Index Components

Dictionary (Lexicon)
A dictionary of all unique terms is stored in memory
Posting Lists
Lists documents for each term on disk
Skip Lists
Speeding up the intersection of posting lists

Storage optimizations

Delta Encoding
Storing doc_id differences instead of absolute values
Variable Byte Encoding
Compress small numbers into fewer bytes
Block Compression
PFOR, Simple9, Roaring Bitmaps

Text Processing Pipeline

Select a stage to inspect the transformation output

Input

"The System Design Interview is AMAZING!"

Output

["The","System","Design","Interview","is","AMAZING!"]

Stage Summary

Split text into individual tokens

Query Processing

Query handling starts with parsing and expansion, then quickly narrows the candidate set before the heavier ranking stages take over.

Query Processing Flow

Switch stages to inspect processing details

Current Stage

Query Parser

Tokens, operators, and query fields

Example

Input: "system design patterns"Tokens: ["system","design","patterns"]Operator: ANDFields: title?, body?

Latency Budget

Parser + expansion: < 20ms, retrieval: < 50ms, ranking: < 100ms

⚡ Optimization: the WAND algorithm

WAND (Weighted AND) speeds up Top-K retrieval by avoiding a full scan of every candidate document:

Each term keeps an upper bound for the score it can still contribute
Documents that cannot possibly enter the Top-K are skipped early
In practice, this often gives a multi-x speedup over a naive scan

Ranking and Relevance

A complete index still does not make the results good: the user only looks at the top rows, and ordering decides everything. Modern engines therefore use a cascade: cheap lexical scores first, then richer graph signals and heavier neural models.

BM25 Formula

The cheap first layer that picks candidates before the heavy models run:

BM25(D, Q) = Σ IDF(qi) × 
  (f(qi, D) × (k1 + 1)) / 
  (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

Where:
  f(qi, D) = term frequency in document
  |D| = document length
  avgdl = average document length
  k1 = 1.2 (term saturation)
  b = 0.75 (length normalization)

BM25 is an improved lexical-ranking formula that extends TF-IDF with saturation and length normalization

PageRank (simplified)

Assessing page authority based on incoming links:

PR(A) = (1-d) + d × Σ PR(Ti)/C(Ti)

Where:
  d = damping factor (0.85)
  Ti = pages linking to A
  C(Ti) = outbound links from Ti

Iterative computation:
  1. Initialize all PR = 1/N
  2. Iterate until convergence
  3. Normalize results

The classic Google signal for measuring link authority

Multi-Stage Ranking Pipeline

Switch stages to inspect ranking details

Current Stage

Stage 1: Candidate Generation

Inverted index lookup + BM25

Output

~10,000 candidates

Latency Budget

≈ 50ms

Highlights

BM25 scoring
Top-10K retrieval
Low latency

100+

Ranking signals in Google

BERT

Neural re-ranking models

LTR

Learning-to-rank pipelines

Distributed Search Architecture

Theory

DDIA: Partitioning

Data partitioning: how to cut the index so it grows with the load

Читать обзор

A search engine at Google scale needs a distributed layout where sharding affects fanout cost, rebalancing behavior, and the latency budget of every query.

Sharding Strategies

Document-based Sharding

Shard 1: docs 1-1M
Shard 2: docs 1M-2M
Shard 3: docs 2M-3M
...

Query → all shards → merge results

Pros:
+ Simplicity
+ Easy document ingestion
Cons:
- Every query hits all shards

Term-based Sharding

Shard 1: terms A-F
Shard 2: terms G-M
Shard 3: terms N-Z
...

Query → only the necessary shards

Pros:
+ Fewer shards per request
Cons:
- Hotspots on popular terms
- More complex document writes

Distributed Query Processing

Request path: ingress, shard fan-out, local Top-K, and final merge

Ingress and routing

Load Balancer

Receives the search request and sends it to an available query router.

Query Router

Parses the query, selects shards, and sets the end-to-end deadline.

Fan-out

Parallel shards

Each shard computes local candidates from its slice of the index.

Shard 1

replicas A/B/CTop-K

Shard 2

replicas A/B/CTop-K

Shard 3

replicas A/B/CTop-K

Fan-in

Merge and response

Top-K Aggregator

Collects local lists, removes duplicates, and applies final reranking.

User Response

Returns the sorted result list with search metadata.

Latency Budget

Routing: < 10msShard fan-out: < 100msMerge and rerank: < 20ms

💡 Elasticsearch architecture

Elasticsearch uses document-based sharding with automatic routing:

shard = hash(document_id) % number_of_primary_shards

• Primary shards — serve writes and reads
• Replica shards — increase read capacity and provide failover
• Typical configuration: 5 primary × 2 replicas = 15 shards

Near-Real-Time Indexing

News and social products live or die by how quickly new documents become searchable. That forces the write path to balance freshness, refresh cost, and durable persistence.

Tiered Index Architecture

Switch tiers to inspect their role and characteristics

Active Tier

Real-time Tier

In-memory index for hot data

Storage

RAM

Freshness

Seconds

Focus

Writes + freshness

Latency

< 1s to searchable

Query Fanout

Query → all tiers in parallel → merge → deduplicate → return

Write Path

1.The document lands in Kafka
2.An indexing consumer processes it
3.The update goes into an in-memory buffer
4.Refresh makes it searchable
5.Flush persists it durably on disk

How Elasticsearch refresh works

refresh_interval: 1s by default
Creates a new searchable segment in memory
Keeps the system near real time rather than strictly synchronous
Trade-off: refresh cost versus freshness

Autocomplete and Suggestions

Autocomplete has an even tighter latency target than full search, so it usually relies on prefix trees, cached hot prefixes, and precomputed popular completions.

Trie-based Autocomplete

Select a prefix to inspect trie path and suggestions

Trie Path

rootsys

Prefix length

3 chars

Latency target

< 50ms

Note

Top results are stored in the prefix node

Top Suggestions

system design
system
system administrator
system metrics

Top-K Cache

Trie nodes store top suggestions (Top-10) to avoid traversing the whole subtree.

Optimizations

Prefix caching — cache of popular prefixes
Compressed Trie — merge one-child chains to save space
Top-K precomputation — store the best completions directly in trie nodes

Data Collection

Query logs aggregated over time
Filtering offensive requests
Personalization from user history

What to cover in an interview

Topics you should definitely cover

Inverted index versus forward index and the trade-offs between them
BM25 and TF-IDF, plus what each score is actually capturing
Sharding strategy — document vs term
How the latency budget is split across retrieval and ranking stages
Near-real-time indexing versus batch refresh

Common follow-up questions

How to process phrase queries?
How to implement fuzzy search?
How to add personalization?
How to deal with spam in results?
How to measure search quality?

Key Takeaways

Inverted Index

The foundation of search: without it, a query over billions of documents becomes a full scan

Ranking Cascade

Progressively heavier stages balance relevance quality against latency

Document Sharding

Usually the most straightforward way to scale search in practice

Tiered Storage

Separate real-time, recent, and historical layers keep fresh content searchable

Polite Crawling

robots.txt compliance and per-domain limits are part of the design, not an afterthought

References

Sergey Brin, Lawrence Page — The Anatomy of a Large-Scale Hypertextual Web Search Engine (Google Research, 1998)Elastic — How full-text search works: inverted index, analysis, and BM25 (Elastic Docs)Shane Connelly — Practical BM25, Part 2: The BM25 Algorithm and its Variables (Elastic Blog, 2018)Apache Lucene — Index File Formats: term dictionary, postings lists, doc values (Lucene 9.9 Docs)

Related chapters

System Design Interview: An Insider's Guide (short summary) - provides the classic interview flow for Search System: requirements, scale assumptions, architecture, and trade-offs.
Database Internals: A Deep Dive (short summary) - goes deeper into index data structures, posting-list compression, and storage internals used in search engines.
Designing Data-Intensive Applications, 2nd Edition (short summary) - strengthens distributed-data fundamentals around replication, consistency, and fault tolerance in search clusters.
Web Crawler - covers the ingestion side of search: URL frontier, politeness policies, deduplication, and recrawl strategy.
Hacking the System Design Interview (short summary) - helps fold a complex search design into a clear interview conversation with honestly named trade-offs.
System design case studies examples - puts the search-system case into broader context and makes it easier to compare with other architecture problems.