A search system rarely breaks at document write time. It breaks where indexing, ranking, freshness, and the user-facing query path meet.
This chapter breaks down document ingestion, the inverted index, shard routing, ranking cascades, autocomplete, and the consistency window between writes and searchable results.
For interviews and engineering discussions, it is valuable because it forces you to reason about relevance, latency, and operating cost at the same time, not just storage.
Pipeline Thinking
Ingestion, partitioning, deduplication, and stage latency drive system behavior.
Serving Layer
Index and cache-locality decisions directly shape user-facing query latency.
Consistency Window
Explicitly define where eventual consistency is acceptable and where it is not.
Cost vs Freshness
Balance update frequency with compute/storage cost and operational complexity.
Why Search Is Hard
Designing a search system is one of the richest architecture problems in a system-design interview. It combines crawling, indexing, query processing, ranking, freshness, and large-scale distributed execution in a single user-facing path.
Core Subsystems
Collects pages and decides what to revisit
Turns documents into an inverted index
Selects and orders the best results
Functional Requirements
Core Features
- Full-text search of billions of documents
- Autocompletion and query suggestions
- Faceted search and filtering
- Relevant ranking of results
- Spell correction and synonym processing
Additional features
- Search by images and videos
- Personalization of results
- Localization and language models
- Real-time indexing of new pages
Non-functional requirements
Search feels fast only when the end-to-end path stays inside a tight tail-latency budget while still handling hundreds of thousands of queries per second.
Crawling and Recrawl Planning
Connection
Alex Xu: Web Crawler
A detailed analysis of the crawler architecture in the book System Design Interview
A crawler does more than fetch pages. It decides what to visit next, when to revisit a page, and how to avoid overloading other sites while processing billions of URLs every day.
Crawler Architecture
Select a stage to highlight key components
URL Frontier and Crawl Policy
The URL frontier controls priority, revisit cadence, and domain-level load:
- Priority queue — important pages first
- Politeness — cap requests per second for each domain
- Freshness — revisit frequency for each page
Duplicate Detection
Duplicate handling prevents wasted storage and ranking noise:
- URL deduplication — Bloom Filter for seen URLs
- Content Hash — SimHash/MinHash
- Near-duplicate detection — LSH for similar pages
💡 Politeness and robots.txt
Each crawler must comply with robots.txt and avoid overloading origin servers. Google uses separate queues per domain with explicit request-rate controls. A typical interval between requests to the same domain is 1-10 seconds.
Inverted Index
Theory
Database Internals
Deep Dive into Data Structures for Indexing
The Inverted Index is the fundamental data structure for full-text search. Instead of storing “document → words,” it stores “word → documents.”
Inverted Index Structure
Forward Index (traditional):
┌─────────────┬──────────── ──────────────────────────┐
│ Doc ID │ Terms │
├─────────────┼──────────── ──────────────────────────┤
│ doc_1 │ "system", "design", "interview" │
│ doc_2 │ "design", "patterns", "book" │
│ doc_3 │ "system", "architecture", "book" │
└─────────────┴──────────── ──────────────────────────┘
Inverted Index:
┌─────────────────┬───────────── ────────────────────────────────┐
│ Term │ Posting List │
├─────────────────┼───────────── ────────────────────────────────┤
│ "system" │ [(doc_1, pos:[0]), (doc_3, pos:[0])] │
│ "design" │ [(doc_1, pos:[1]), (doc_2, pos:[0])] │
│ "interview" │ [(doc_1, pos:[2])] │
│ "patterns" │ [(doc_2, pos:[1])] │
│ "book" │ [(doc_2, pos:[2]), (doc_3, pos:[2])] │
│ "architecture" │ [(doc_3, pos:[1])] │
└─────────────────┴───────────── ────────────────────────────────┘
Posting List Entry:
┌─────────────────────────────── ───────────────────────────────┐
│ doc_id: uint64 │
│ term_frequency: uint16 │
│ positions: [uint32, ...] // for phrase queries │
│ field_lengths: {title: 5, body: 1000, ...} │
└─────────────────────────────── ───────────────────────────────┘Index Components
- Dictionary (Lexicon)
A dictionary of all unique terms is stored in memory
- Posting Lists
Lists documents for each term on disk
- Skip Lists
Speeding up the intersection of posting lists
Storage optimizations
- Delta Encoding
Storing doc_id differences instead of absolute values
- Variable Byte Encoding
Compress small numbers into fewer bytes
- Block Compression
PFOR, Simple9, Roaring Bitmaps
Text Processing Pipeline
Select a stage to inspect the transformation output
Input
"The System Design Interview is AMAZING!"Output
["The","System","Design","Interview","is","AMAZING!"]Stage Summary
Split text into individual tokens
Query Processing
Query handling starts with parsing and expansion, then quickly narrows the candidate set before the heavier ranking stages take over.
Query Processing Flow
Switch stages to inspect processing details
Current Stage
Query Parser
Tokens, operators, and query fields
Example
Input: "system design patterns"Tokens: ["system","design","patterns"]Operator: ANDFields: title?, body?Latency Budget
Parser + expansion: < 20ms, retrieval: < 50ms, ranking: < 100ms
⚡ Optimization: the WAND algorithm
WAND (Weighted AND) speeds up Top-K retrieval by avoiding a full scan of every candidate document:
- Each term keeps an upper bound for the score it can still contribute
- Documents that cannot possibly enter the Top-K are skipped early
- In practice, this often gives a multi-x speedup over a naive scan
Ranking and Relevance
Search quality depends on how well the system orders candidates after retrieval. Modern engines therefore use a cascade: cheap lexical scores first, then richer graph signals and heavier neural models.
BM25 Formula
Basic formula for text ranking:
BM25(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl)) Where: f(qi, D) = term frequency in document |D| = document length avgdl = average document length k1 = 1.2 (term saturation) b = 0.75 (length normalization)
BM25 is an improved lexical-ranking formula that extends TF-IDF with saturation and length normalization
PageRank (simplified)
Assessing page authority based on incoming links:
PR(A) = (1-d) + d × Σ PR(Ti)/C(Ti) Where: d = damping factor (0.85) Ti = pages linking to A C(Ti) = outbound links from Ti Iterative computation: 1. Initialize all PR = 1/N 2. Iterate until convergence 3. Normalize results
The classic Google signal for measuring link authority
Multi-Stage Ranking Pipeline
Switch stages to inspect ranking details
Current Stage
Stage 1: Candidate Generation
Inverted index lookup + BM25
Output
~10,000 candidates
Latency Budget
≈ 50ms
Highlights
- BM25 scoring
- Top-10K retrieval
- Low latency
Distributed Search Architecture
Theory
DDIA: Partitioning
Data partitioning for scaling
A search engine at Google scale needs a distributed layout where sharding affects fanout cost, rebalancing behavior, and the latency budget of every query.
Sharding Strategies
Document-based Sharding
Shard 1: docs 1-1M Shard 2: docs 1M-2M Shard 3: docs 2M-3M ... Query → all shards → merge results Pros: + Simplicity + Easy document ingestion Cons: - Every query hits all shards
Term-based Sharding
Shard 1: terms A-F Shard 2: terms G-M Shard 3: terms N-Z ... Query → only the necessary shards Pros: + Fewer shards per request Cons: - Hotspots on popular terms - More complex document writes
Distributed Query Processing
Highlight routing, shard work, or final merge
Load Balancer
Distributes incoming traffic
Query Router
Parses the query and fans it out
Aggregator
Merges Top-K and applies final ranking
Shard 1
Replicas A/B/C
Shard 2
Replicas A/B/C
Shard 3
Replicas A/B/C
Latency Budget
💡 Elasticsearch architecture
Elasticsearch uses document-based sharding with automatic routing:
shard = hash(document_id) % number_of_primary_shards • Primary shards — serve writes and reads • Replica shards — increase read capacity and provide failover • Typical configuration: 5 primary × 2 replicas = 15 shards
Near-Real-Time Indexing
News and social products live or die by how quickly new documents become searchable. That forces the write path to balance freshness, refresh cost, and durable persistence.
Tiered Index Architecture
Switch tiers to inspect their role and characteristics
Active Tier
Real-time Tier
In-memory index for hot data
Storage
RAM
Freshness
Seconds
Focus
Writes + freshness
Latency
< 1s to searchable
Query Fanout
Query → all tiers in parallel → merge → deduplicate → return
Write Path
- 1.The document lands in Kafka
- 2.An indexing consumer processes it
- 3.The update goes into an in-memory buffer
- 4.Refresh makes it searchable
- 5.Flush persists it durably on disk
How Elasticsearch refresh works
- refresh_interval: 1s by default
- Creates a new searchable segment in memory
- Keeps the system near real time rather than strictly synchronous
- Trade-off: refresh cost versus freshness
Autocomplete and Suggestions
Autocomplete has an even tighter latency target than full search, so it usually relies on prefix trees, cached hot prefixes, and precomputed popular completions.
Trie-based Autocomplete
Select a prefix to inspect trie path and suggestions
Trie Path
3 chars
< 50ms
Top results are stored in the prefix node
Top Suggestions
- system design
- system
- system administrator
- system metrics
Top-K Cache
Trie nodes store top suggestions (Top-10) to avoid traversing the whole subtree.
Optimizations
- Prefix caching — cache of popular prefixes
- Compressed Trie — merge one-child chains to save space
- Top-K precomputation — store the best completions directly in trie nodes
Data Collection
- Query logs aggregated over time
- Filtering offensive requests
- Personalization from user history
What to cover in an interview
Topics you should definitely cover
- Inverted index versus forward index and the trade-offs between them
- BM25 and TF-IDF, plus what each score is actually capturing
- Sharding strategy — document vs term
- How the latency budget is split across retrieval and ranking stages
- Near-real-time indexing versus batch refresh
Common follow-up questions
- How to process phrase queries?
- How to implement fuzzy search?
- How to add personalization?
- How to deal with spam in results?
- How to measure search quality?
Key Takeaways
Inverted Index
The core structure that makes large-scale full-text retrieval practical
Ranking Cascade
Progressively heavier stages balance relevance quality against latency
Document Sharding
Usually the most straightforward way to scale search in practice
Tiered Storage
Separate real-time, recent, and historical layers keep fresh content searchable
Polite Crawling
robots.txt compliance and per-domain limits are part of the design, not an afterthought
Related chapters
- System Design Interview: An Insider's Guide (short summary) - provides the classic interview flow for Search System: requirements, scale assumptions, architecture, and trade-offs.
- Database Internals: A Deep Dive (short summary) - goes deeper into index data structures, posting-list compression, and storage internals used in search engines.
- Designing Data-Intensive Applications, 2nd Edition (short summary) - strengthens distributed-data fundamentals around replication, consistency, and fault tolerance in search clusters.
- Web Crawler - covers the ingestion side of search: URL frontier, politeness policies, deduplication, and recrawl strategy.
- Hacking the System Design Interview (short summary) - helps present complex search architecture clearly in interview format with explicit trade-off communication.
- System design case studies examples - puts the search-system case into broader context and makes it easier to compare with other architecture problems.
