Introduction
Designing a search engine is one of the most complex tasks in a System Design interview. It covers the entire spectrum: from data collection (crawling) through indexing and ranking to scaling to petabytes of information.
Key Components of a Search Engine
- Collection and crawling of web pages
- Building an inverted index
- Ranking results
Functional Requirements
Core Features
- Full-text search of billions of documents
- Autocompletion and query suggestions
- Faceted search and filtering
- Relevant ranking of results
- Spell correction and synonym processing
Additional Features
- Search by images and videos
- Personalization of results
- Localization and language models
- Real-time indexing of new pages
Non-Functional Requirements
- Low latency: an end-to-end query budget of well under 200ms
- Scale: billions of documents, petabytes of index data
- High availability: replication and failover across shards
- Freshness: seconds from publication to searchability for real-time content
Web Crawling
Related Reading
Alex Xu: Web Crawler
A detailed analysis of the crawler architecture in the book System Design Interview
A web crawler is an automated system that fetches and parses web pages. At Google's scale, this means processing billions of URLs every day.
Crawler Architecture
URL Frontier
URL queue for crawling with prioritization and load control on domains:
- Priority Queue - important pages first
- Politeness — no more than N requests/sec per domain
- Freshness — frequency of page re-crawling
Duplicate Detection
Duplicate detection to save resources:
- URL Deduplication - Bloom Filter for URL
- Content Hash — SimHash/MinHash
- Near-Duplicate - LSH for similar pages
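The URL-deduplication idea can be sketched with a minimal Bloom filter; the sizes and the double-hashing scheme here are illustrative choices, not production values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL deduplication (illustrative sizes)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url: str):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # False => definitely never seen; True => probably seen (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
seen.add("https://example.com/a")
seen.might_contain("https://example.com/a")  # True
seen.might_contain("https://example.com/b")  # False (with high probability)
```

The filter answers "definitely new" exactly and "already seen" probabilistically, which is the right trade-off for a frontier that cannot afford to store billions of raw URLs in memory.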
💡 Politeness and robots.txt
Each crawler must comply with robots.txt and not overload the servers. Google uses separate queues for each domain with rate limiting. The typical interval between requests to one domain is 1-10 seconds.
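The per-domain rate limiting can be sketched as a scheduler that tracks the last fetch time per domain; a single fixed `min_delay_sec` is an assumption for illustration (a real crawler would derive the delay from robots.txt or adapt it per server):

```python
from collections import defaultdict

class PolitenessScheduler:
    """Enforces a minimum delay between requests to the same domain.

    A minimal sketch: real crawlers keep a separate queue per domain;
    here we only track the last-fetch timestamp.
    """

    def __init__(self, min_delay_sec: float = 1.0):
        self.min_delay = min_delay_sec
        self.last_fetch = defaultdict(float)  # domain -> last fetch time

    def wait_time(self, domain: str, now: float) -> float:
        """Seconds to wait before `domain` may be fetched again (0 if ready)."""
        elapsed = now - self.last_fetch[domain]
        return max(0.0, self.min_delay - elapsed)

    def record_fetch(self, domain: str, now: float) -> None:
        self.last_fetch[domain] = now

scheduler = PolitenessScheduler(min_delay_sec=2.0)
scheduler.record_fetch("example.com", now=100.0)
scheduler.wait_time("example.com", now=101.0)  # 1.0 - still too soon
scheduler.wait_time("example.com", now=103.0)  # 0.0 - ready
scheduler.wait_time("other.org", now=101.0)    # 0.0 - never fetched
```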
Inverted Index
Theory
Database Internals
Deep Dive into Data Structures for Indexing
The Inverted Index is the fundamental data structure for full-text search. Instead of storing “document → words,” it stores “word → documents.”
Inverted Index Structure
Forward Index (traditional):

┌──────────┬───────────────────────────────────┐
│ Doc ID   │ Terms                             │
├──────────┼───────────────────────────────────┤
│ doc_1    │ "system", "design", "interview"   │
│ doc_2    │ "design", "patterns", "book"      │
│ doc_3    │ "system", "architecture", "book"  │
└──────────┴───────────────────────────────────┘

Inverted Index:

┌────────────────┬───────────────────────────────────────┐
│ Term           │ Posting List                          │
├────────────────┼───────────────────────────────────────┤
│ "system"       │ [(doc_1, pos:[0]), (doc_3, pos:[0])]  │
│ "design"       │ [(doc_1, pos:[1]), (doc_2, pos:[0])]  │
│ "interview"    │ [(doc_1, pos:[2])]                    │
│ "patterns"     │ [(doc_2, pos:[1])]                    │
│ "book"         │ [(doc_2, pos:[2]), (doc_3, pos:[2])]  │
│ "architecture" │ [(doc_3, pos:[1])]                    │
└────────────────┴───────────────────────────────────────┘
Posting List Entry:

┌────────────────────────────────────────────────────┐
│ doc_id: uint64                                     │
│ term_frequency: uint16                             │
│ positions: [uint32, ...]     // for phrase queries │
│ field_lengths: {title: 5, body: 1000, ...}         │
└────────────────────────────────────────────────────┘

Index Components
- Dictionary (Lexicon)
A dictionary of all unique terms is stored in memory
- Posting Lists
Lists documents for each term on disk
- Skip Lists
Speeding up the intersection of posting lists
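To make these structures concrete, here is a toy build of the inverted index from the table above, plus the merge-based intersection of two sorted posting lists, the operation that skip lists accelerate in real engines:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: [terms]} -> {term: [(doc_id, [positions]), ...]} sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(postings.items()) for term, postings in index.items()}

def intersect(p1, p2):
    """Merge-intersect two doc_id-sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            out.append(p1[i][0])
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1
        else:
            j += 1
    return out

docs = {
    "doc_1": ["system", "design", "interview"],
    "doc_2": ["design", "patterns", "book"],
    "doc_3": ["system", "architecture", "book"],
}
index = build_inverted_index(docs)
intersect(index["system"], index["design"])  # ['doc_1']
intersect(index["book"], index["design"])    # ['doc_2']
```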
Storage optimizations
- Delta Encoding
Storing doc_id differences instead of absolute values
- Variable Byte Encoding
Compress small numbers into fewer bytes
- Block Compression
PFOR, Simple9, Roaring Bitmaps
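Delta and variable-byte encoding can be sketched together; this is a simplified scheme (real engines layer the block codecs above on top of the same gap idea):

```python
def delta_encode(doc_ids):
    """Store gaps between sorted doc_ids instead of absolute values."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(numbers):
    """Variable-byte: 7 data bits per byte; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                    # terminator byte
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

doc_ids = [1000, 1003, 1100, 5000]
gaps = delta_encode(doc_ids)          # [1000, 3, 97, 3900]
encoded = vbyte_encode(gaps)          # 6 bytes vs 16 for four fixed 32-bit ints
vbyte_decode(encoded) == gaps         # True: lossless round-trip
```

Small gaps compress into single bytes, which is why delta encoding comes first: the gaps between consecutive doc_ids in a dense posting list are far smaller than the absolute ids.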
Text Processing Pipeline
Tokenization (the first stage): split text into individual tokens.

Input:  "The System Design Interview is AMAZING!"
Output: ["The", "System", "Design", "Interview", "is", "AMAZING!"]
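A minimal analyzer sketch: tokenization plus the lowercasing and stopword-removal stages that typically follow it (the stopword list here is a tiny illustrative subset, and stemming is omitted):

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "and"}  # illustrative, not exhaustive

def analyze(text):
    """Tokenize -> lowercase -> remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

analyze("The System Design Interview is AMAZING!")
# ['system', 'design', 'interview', 'amazing']
```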
Query Processing
Request processing includes parsing, expansion, and efficient traversal of posting lists.
Query Processing Flow
Query Parser
Extracts tokens, operators, and query fields
Example

Input:    "system design patterns"
Tokens:   ["system", "design", "patterns"]
Operator: AND
Fields:   title, body

Latency Budget
Parser + expansion: < 20ms, retrieval: < 50ms, ranking: < 100ms
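The parser stage can be sketched as a toy function; the `field:` prefix syntax and the title/body default fields are assumptions for illustration, not a real query grammar:

```python
def parse_query(raw, default_fields=("title", "body")):
    """Minimal parser: lowercase tokens, implicit AND, optional field: prefixes."""
    tokens, fields = [], set()
    for part in raw.split():
        if ":" in part:
            field, term = part.split(":", 1)  # e.g. "title:System"
            fields.add(field)
            tokens.append(term.lower())
        else:
            tokens.append(part.lower())
    return {
        "tokens": tokens,
        "operator": "AND",                          # implicit AND between terms
        "fields": sorted(fields) or list(default_fields),
    }

parse_query("system design patterns")
# {'tokens': ['system', 'design', 'patterns'], 'operator': 'AND', 'fields': ['title', 'body']}
```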
⚡ Optimization: WAND Algorithm
WAND (Weighted AND) - an algorithm for efficient Top-K retrieval without completely scanning all documents:
- Each term has an upper bound score
- Skipping documents that cannot be included in Top-K
- Up to 10x speedup compared to naive approach
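The pruning idea can be sketched as follows; note this is a simplified upper-bound check over candidate documents, not the full WAND pivoting algorithm over sorted posting cursors:

```python
import heapq

def top_k_with_upper_bounds(postings, upper_bounds, k):
    """Top-K retrieval with upper-bound pruning (simplified WAND idea).

    postings:     {term: {doc_id: score}}
    upper_bounds: {term: maximum score this term can contribute}
    Documents whose optimistic score cannot beat the current k-th best
    score are skipped without full scoring.
    """
    # Collect which query terms each candidate document contains.
    doc_terms = {}
    for term, docs in postings.items():
        for doc_id in docs:
            doc_terms.setdefault(doc_id, []).append(term)

    heap = []  # min-heap of (score, doc_id), size <= k
    for doc_id, terms in doc_terms.items():
        threshold = heap[0][0] if len(heap) == k else 0.0
        if sum(upper_bounds[t] for t in terms) <= threshold:
            continue  # optimistic bound cannot enter Top-K: skip full scoring
        score = sum(postings[t][doc_id] for t in terms)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

postings = {
    "system": {"d1": 2.0, "d3": 1.5},
    "design": {"d1": 1.75, "d2": 2.25},
}
upper_bounds = {t: max(scores.values()) for t, scores in postings.items()}
top_k_with_upper_bounds(postings, upper_bounds, k=2)  # [(3.75, 'd1'), (2.25, 'd2')]
```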
Ranking and Relevance
Ranking is a key component that determines search quality. Modern systems use a multi-level approach: from simple formulas to ML models.
BM25 Formula
Basic formula for text ranking:
BM25(D, Q) = Σ IDF(qᵢ) × ( f(qᵢ, D) × (k₁ + 1) ) / ( f(qᵢ, D) + k₁ × (1 − b + b × |D| / avgdl) )

Where:
  f(qᵢ, D) = frequency of term qᵢ in document D
  |D|      = document length
  avgdl    = average document length in the collection
  k₁       = 1.2 (term-frequency saturation)
  b        = 0.75 (length normalization)
BM25 is an improved version of TF-IDF with term saturation and document-length normalization
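The formula translates directly into code. A toy scoring function over the three-document corpus from the index example (the smoothed-IDF variant used here is a common choice and an assumption, not mandated by the formula above):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 per the formula above; corpus is a list of docs, each a term list."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)            # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc_terms.count(q)                          # f(q, D)
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [
    ["system", "design", "interview"],
    ["design", "patterns", "book"],
    ["system", "architecture", "book"],
]
bm25_score(["system", "design"], corpus[0], corpus)  # highest: matches both terms
bm25_score(["system", "design"], corpus[1], corpus)  # lower: matches only "design"
```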
PageRank (simplified)
Assessing page authority based on incoming links:
PR(A) = (1 − d) + d × Σ PR(Tᵢ) / C(Tᵢ)

Where:
  d     = damping factor (typically 0.85)
  Tᵢ    = pages linking to A
  C(Tᵢ) = number of outbound links from Tᵢ

Iterative computation:
1. Initialize all PR = 1/N
2. Iterate until convergence
3. Normalize results
Classic Google algorithm for web ranking
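The iterative computation follows directly from the formula; this sketch uses the non-normalized form given above and a fixed iteration count instead of a convergence test:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank: PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)).

    links: {page: [outbound pages]}; every page must have at least one
    outbound link in this simplified sketch (no dangling-node handling).
    """
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}  # step 1: initialize PR = 1/N
    for _ in range(iterations):                # step 2: iterate to convergence
        new_pr = {}
        for p in pages:
            inbound = (pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * sum(inbound)
        pr = new_pr
    return pr

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
ranks = pagerank(links)
# C accumulates the most authority: it is linked by both A and B
```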
Multi-Stage Ranking Pipeline
Stage 1: Candidate Generation
Inverted index lookup + BM25
Output: ~10,000 candidates
Latency budget: ≈ 50ms
Highlights
- BM25 scoring
- Top-10K retrieval
- Low latency
Distributed Architecture
Theory
DDIA: Partitioning
Data partitioning for scaling
A search engine the size of Google requires a distributed architecture with index sharding and replication for fault tolerance.
Sharding Strategies
Document-based Sharding
Shard 1: docs 1-1M
Shard 2: docs 1M-2M
Shard 3: docs 2M-3M
...

Query → all shards → merge results

Pros:
+ Simple to implement
+ Easy to add new documents
Cons:
- Every query fans out to all shards
Term-based Sharding
Shard 1: terms A-F
Shard 2: terms G-M
Shard 3: terms N-Z
...

Query → only the shards holding the query terms

Pros:
+ Fewer shards touched per query
Cons:
- Hotspots on popular terms
- Adding a document touches many shards
Distributed Query Processing
- Load Balancer - distributes incoming traffic
- Query Router - parsing + fanout to shards
- Aggregator - merge + re-rank of shard results
- Shard 1 - replicas A/B/C
- Shard 2 - replicas A/B/C
- Shard 3 - replicas A/B/C
💡 Elasticsearch Architecture
Elasticsearch uses document-based sharding with automatic routing:
shard = hash(document_id) % number_of_primary_shards

- Primary shards - handle writes and reads
- Replica shards - read-only copies for failover
- Typical configuration: 5 primaries × 2 replicas = 15 shards total
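The routing formula can be sketched in a few lines; note that Elasticsearch itself hashes a routing value with murmur3, so the md5 used here is only to keep the sketch self-contained:

```python
import hashlib

def route_to_shard(document_id: str, number_of_primary_shards: int = 5) -> int:
    """shard = hash(document_id) % number_of_primary_shards (sketch)."""
    digest = hashlib.md5(document_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % number_of_primary_shards

# A reasonable hash spreads documents across all primary shards.
shards = {route_to_shard(f"doc_{i}") for i in range(1000)}
sorted(shards)  # all 5 shards receive documents
```

Because the shard is a pure function of the document id, any node can route both writes and reads without a lookup table; the cost is that changing the primary-shard count requires reindexing.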
Real-time Indexing
For searching news and social networks, indexing speed is critical—seconds from publication to appearance in search.
Tiered Index Architecture
Real-time Tier (active)
- In-memory index (hot data)
- Storage: RAM
- Freshness: seconds
- Focus: writes + freshness
- Latency: < 1s to searchable
Query Fanout
Query → all tiers in parallel → merge → deduplicate → return
Write Path
1. The document is sent to Kafka
2. An indexer consumer processes it
3. The write lands in an in-memory buffer
4. Refresh → searchable
5. Flush → durable on disk
Elasticsearch Refresh
- refresh_interval: 1s (default)
- Creates a new segment in memory
- Near real-time search
- Trade-off: refresh cost vs freshness
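The buffer/refresh mechanics can be illustrated with a toy in-memory index; Kafka ingestion and the durable flush to disk are deliberately omitted, and the substring-based search stands in for a real segment index:

```python
class NearRealTimeIndex:
    """Toy near-real-time index: writes buffer up, refresh makes them searchable."""

    def __init__(self):
        self.buffer = []    # writes land here first (not yet searchable)
        self.segments = []  # refreshed segments, visible to search

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Like Elasticsearch's refresh: open the buffer as a new segment.
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []

    def search(self, term):
        # Substring match stands in for a real per-segment inverted index.
        return [d for seg in self.segments for d in seg if term in d["text"]]

idx = NearRealTimeIndex()
idx.index({"id": 1, "text": "breaking news"})
idx.search("news")  # [] - written but not yet refreshed
idx.refresh()
idx.search("news")  # [{'id': 1, 'text': 'breaking news'}]
```

The gap between `index()` and `refresh()` is exactly the "refresh cost vs freshness" trade-off above: refreshing more often shrinks the invisible window but creates more small segments.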
Autocomplete and Typeahead
Autocompletion requires even lower latency (<50ms) and works based on prefix trees and precomputed popular queries.
Trie-based Autocomplete
Traversing the trie for a 3-character prefix takes only three node hops, and the top results are stored directly in the prefix node, keeping responses under 50ms.
Top Suggestions
- system design
- system
- system administrator
- system metrics
Top-K Cache
Trie nodes store top suggestions (Top-10) to avoid traversing the whole subtree.
Optimizations
- Prefix caching — cache of popular prefixes
- Compressed Trie - merging one-child nodes
- Top-K precomputation - the top 10 suggestions are stored in each node
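The Top-K precomputation can be sketched as a trie that maintains a sorted suggestion cache at every node on each insert (popularity values here are illustrative):

```python
class TrieNode:
    __slots__ = ("children", "top_k")

    def __init__(self):
        self.children = {}
        self.top_k = []  # precomputed (popularity, query) pairs, best first

class AutocompleteTrie:
    """Prefix trie with a Top-K cache at every node (sketch)."""

    def __init__(self, k=10):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, popularity):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Update this node's cache: replace any old entry, keep best k.
            entries = [e for e in node.top_k if e[1] != query]
            entries.append((popularity, query))
            node.top_k = sorted(entries, reverse=True)[: self.k]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # O(len(prefix)) lookup: no subtree traversal at query time.
        return [q for _, q in node.top_k]

trie = AutocompleteTrie(k=3)
for query, popularity in [
    ("system design", 100),
    ("system", 80),
    ("system administrator", 60),
    ("system metrics", 40),
]:
    trie.insert(query, popularity)

trie.suggest("sys")  # ['system design', 'system', 'system administrator']
```

Paying the update cost at insert time is what keeps query latency within the < 50ms budget: the suggest path never walks a subtree.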
Data Collection
- Query logs → aggregation over time windows
- Filtering of offensive queries
- Personalization (user history)
Interview tips
What must be discussed
- Inverted index vs forward index — trade-offs
- BM25/TF-IDF - understanding the formula
- Sharding strategy — document vs term
- Latency budget allocation
- Near real-time vs batch indexing
Frequent follow-up questions
- How to process phrase queries?
- How to implement fuzzy search?
- How to add personalization?
- How to deal with spam in results?
- How to measure search quality?
Key Takeaways
Inverted Index
Fundamental data structure for efficient full-text search
Multi-stage Ranking
Cascade from simple formulas to heavy ML models to balance latency and quality
Document Sharding
Preferred strategy for most systems - easier to scale
Tiered Storage
Division into real-time, recent and historical tiers for fresh content
Politeness in Crawling
Mandatory compliance with robots.txt and rate limiting
