Search autocomplete (typeahead) is a panel of top-N query candidates that refreshes as you type, and it shows up almost everywhere: search boxes, browser address bars, messenger fields, and store product search. It looks like a small feature, but behind it sits a standalone architecture with an extremely tight per-keystroke latency budget.
Unlike the neighboring full-text search case, here we rank query candidates by prefix rather than documents by query. The core is a prefix structure (trie, compressed trie, or FST) with precomputed top-K at nodes, a hot read path through an edge cache, and a heavy offline rebuild of the dictionary from aggregated query frequencies.
In interviews this case is valued for its clean online/offline split, clear memory-versus-latency trade-offs, and a wealth of practical details: debounce, sharding by prefix, freshness, personalization, fuzzy typo handling, and log privacy. Working through it gives you a ready framework for reasoning about any suggestion system.
Pipeline Thinking
Ingestion, partitioning, deduplication, and stage latency drive system behavior.
Serving Layer
Index and cache-locality decisions directly shape user-facing query latency.
Consistency Window
Explicitly define where eventual consistency is acceptable and where it is not.
Cost vs Freshness
Balance update frequency with compute/storage cost and operational complexity.
Problem Framing and Boundaries
The user hasn't finished typing yet, but the system already has to guess what they want. Search autocomplete (typeahead) is a panel of top-N query candidates that refreshes as you type: the prefix "sys" is entered, and "system design", "system administrator", and so on appear at once. Suggestions save keystrokes and help users phrase their intent; they show up in search boxes, browser address bars, the "To:" fields of messengers, and product search in stores. The hard part here is not the output — it is that you must answer on every keystroke, faster than the person types.
Boundary with the search system
The neighboring case, "Search System", is about full-text search: inverted index, document ranking, and retrieval over an already-submitted query. THIS case is about what happens before submission: prefix search, extremely low per-keystroke latency, and ranking of query candidates (not documents). Suggestions guess what the user is most likely looking for; search answers the query they finally pick.
What is NOT in scope
Retrieval and result ranking belong to search-system-case
"Did you mean…" after a search is a separate subsystem
Vector matching by meaning is a different class of problem
Functional Requirements
Core Features
- Top-N (usually 5–10) suggestions for the typed prefix
- Rank candidates by popularity (query frequency)
- Account for freshness: a surge of interest in a new query reaches suggestions
- Highlight the matched prefix in the suggestion list
- Filter spam, abuse, and unsafe content
Extended Features
- Personalization by user history and context
- Typo tolerance (fuzzy / edit distance)
- Multilingual support and different keyboard layouts
- Matching on words inside the query, not just at the start of the string
Non-functional Requirements
A suggestion must appear before the next keystroke, or it is useless. So the key metric is per-keystroke p99 latency, not the time of a full search. Meanwhile the load measured in QPS is several times higher than search itself: every character of the query triggers its own suggestion call.
Why latency matters more than in search
Suggestions compete with typing speed. If the user types faster than answers arrive, the panel "flickers" with stale options. So the per-keystroke budget is measured in tens of milliseconds (with a p99 ceiling around 100 ms), not hundreds, and availability is built on graceful degradation: if the suggestion service is down, the input box still works — the user just sees no panel. Suggestions need not be strongly consistent: a small delay in updating frequencies is unnoticeable.
Scale Estimates
Connection
CDN and edge cache
Where to absorb most suggestion traffic closer to the user
The numbers below are order-of-magnitude estimates for an interview, not real data from any specific company. The main nonlinear multiplier is query length: each keystroke after the first characters generates a separate suggestion call.
Sizing assumptions
DAU, searches per day, effective calls after debounce, and the peak multiplier are inputs to the interview model. Change them for the product and recompute QPS, memory, and egress; the figures below are not promises about Google, Amazon, or any other specific service.
Back-of-the-envelope load (estimate)
Given (orders of magnitude for an interview):
DAU = 100M users
searches per user/day = 5
searches/day = 500M
average query length = ~20 characters
Keystrokes → suggestion requests:
we usually don't call on every character, but debounce (every N chars)
assume ~6 effective suggestion requests per search
suggestion requests/day = 500M × 6 = 3B
Average suggestion QPS:
3B / 86,400 s ≈ 35,000 QPS
Peak QPS (×3) ≈ 100,000 QPS
Takeaway: suggestions are a hotter path than search itself,
so edge caching and top-K precomputation are critical.All values are marked as estimates. In an interview the reasoning matters more than "exact" figures.
Dictionary size and memory
- Unique queries — tens to hundreds of millions after trimming the "long tail"
- Raw dictionary — at ~20 bytes per query, that is a few gigabytes of text
- Prefix structure — a compressed trie/FST shrinks this several-fold by sharing prefixes
Response throughput
- Response — 10 suggestions × ~30 bytes ≈ 300 bytes plus overhead
- Egress — 100K QPS × ~0.5 KB ≈ 50 MB/s outbound (estimate)
- Cache — a high hit rate on short prefixes drastically reduces backend load
Prefix Data Structures
Connection
Inverted index vs trie
Search relies on an inverted index; suggestions rely on a prefix structure
The base structure for prefix search is the trie (prefix tree): a shared path from the root describes a shared prefix, and the queries themselves live at nodes and leaves. The key idea for speed is to store a ready list of best suggestions (top-K) at the node, so you don't traverse the whole subtree on every keystroke.
How top-K at nodes speeds up the answer
Trie with precomputed top-K at nodes:
(root)
│
s ──► top-K: [system design, system, ...]
│
y ──► top-K: [system design, system, ...]
│
s ──► top-K: [system design, system admin, ...]
(prefix "sys")
Query for "sys":
1. walk nodes s → y → s (O(prefix length))
2. return the ready top-K (O(K), no subtree traversal)
Without top-K at nodes you'd have to traverse the entire
subtree and sort candidates on every keystroke.Storage options
- Plain trie
Simple but memory-hungry: many single-child nodes
- Compressed trie (radix tree)
Merges single-child chains — less memory, same prefix lookup
- FST (finite state transducer)
A minimized automaton that shares both prefixes and suffixes. This is exactly what Lucene and Elasticsearch use for the completion suggester
Memory ↔ latency trade-off
- Top-K at nodes
Memory grows (K options per node), but the answer is ~O(prefix length)
- Compute on the fly
Saves memory but requires subtree traversal and sorting — costlier in latency
- FST in memory
Very compact but higher CPU while walking the automaton — the classic RAM-for-CPU trade
💡 FST in Lucene/Elasticsearch
The completion suggester in Elasticsearch is built on a weighted FST kept entirely in memory. Per Elastic's 2013 published benchmark, on ~2.1M Wikipedia titles the completion suggester answered in well under a millisecond versus several milliseconds for a plain prefix query. The trade is direct: an FST saves RAM and gives fast prefix lookup, but is costlier to build and updated per segment rather than one record at a time.
Architecture: Online and Offline
Connection
Sharding the structure
How to spread the prefix structure and frequency counters across shards
The main architectural idea is to split the hot read path from the heavy update path. On reads we only walk a ready structure and return top-K. Updates (collecting logs, aggregating frequencies, rebuilding the structure) are an offline process that periodically rolls out a new version of the structure.
Trie-based Autocomplete
Select a prefix to inspect trie path and suggestions
Trie Path
3 chars
< 50ms
Top results are stored in the prefix node
Top Suggestions
- system design
- system
- system administrator
- system metrics
Top-K Cache
Trie nodes store top suggestions (Top-10) to avoid traversing the whole subtree.
Read path (online, hot)
- 1.Keystroke → request to the nearest edge node / cache (CDN)
- 2.Cache miss → the suggestion service picks the right shard by prefix
- 3.Walk down the trie/FST to the prefix node
- 4.Return the precomputed top-K (optionally re-rank per user)
- 5.Write the result to cache with a short TTL
Update path (offline, heavy)
- 1.Collect query logs into a queue / log stream
- 2.Aggregate frequencies over a window (hour/day) with decay for old data
- 3.Filtering: spam, abuse, rare "long tail"
- 4.Offline rebuild of the trie/FST with ready top-K at nodes
- 5.Atomic rollout of the new structure version to read nodes
Sharding by prefix
The structure is split by ranges of leading characters: each node holds its slice of the dictionary. A router uses the first 1–2 characters to send the request to the right shard. This caps per-node memory and spreads load, but creates "hot" shards on popular letters — they are balanced with extra replicas.
shard(prefix) = route_by_first_chars(prefix) prefixes a–f → shard 1 (+ replicas) prefixes g–m → shard 2 (+ replicas) prefixes n–z → shard 3 (+ replicas) Hot letters get more replicas; very short prefixes are served from the edge cache.
Ranking Query Candidates
Unlike search, here we rank query candidates, not documents. The base signal is query frequency; on top of it we layer freshness, personalization, and context. From a short prefix you have to guess intent before the user finishes typing it — so the signals are tuned to surface the likely candidate in the top rows from the first keystrokes.
Ranking signals
- Frequency
How often a query is typed — the candidate's main weight
- Recency
Time decay lifts what is popular now, not a year ago
- Personalization
User history and segment shift the order
- Context
Language, region, device, time of day
Defense against spam and bursts
- Frequency threshold
Rare queries and typos never enter the dictionary
- Block-lists and classifiers
Abuse, unsafe, and adult content are filtered before output
- Anti-manipulation
An anomalous spike of one query from a narrow bot group is damped
- Fresh-tail control
A new query is promoted cautiously so noise isn't sold as a trend
⚡ A simple weighting formula (for interviews)
score(q) = w1 × log(freq(q))
+ w2 × recency_decay(q)
+ w3 × personalization(q, user)
− w4 × spam_penalty(q)
freq — aggregated query frequency
recency — exponential time decay
personal — closeness to user history
spam — penalty for anomalies and block-listsThe formula is illustrative. In practice weights are tuned offline, and a heavier ML model is applied only as an optional re-ranker over the short top-K.
Typos and Fuzziness
Exact prefix search does not forgive typos: "sytsem" will not find "system". Fuzziness is added deliberately, because it is more expensive — it is a trade-off between suggestion quality and the latency budget.
Edit distance
Allow 1–2 edits (insertion, deletion, substitution). On an FST this is implemented via a Levenshtein automaton — but fuzzy expands the search space, so it is usually enabled only for short or rarely-matching prefixes.
N-grams
The query is split into n-grams (e.g. of 3 characters) and matched by intersection. This catches typos and in-string matches, but needs a separate index and produces noisier candidates than a strict prefix.
When to enable fuzzy
Try the exact prefix first; enable fuzzy as a fallback when there are too few exact candidates. That keeps latency low in the typical case while still handling typos.
💡 Fuzzy in the completion suggester
Elasticsearch's completion suggester supports fuzzy queries with a configurable edit distance directly over the FST: edits are allowed starting from a given position (so the start of the word doesn't "drift") and their number is capped. This is a typical trick — keep the start of the prefix strict, and allow typo tolerance only further in.
Deep Dives
Caching and delivery
- Edge cache for short prefixes — the most frequent query starts are served from the CDN, never reaching the backend
- Client-side debounce — don't send a request on every keystroke; wait for a pause in typing
- Incremental delivery — the client extends the shown list rather than redrawing it entirely
- Discarding stale answers — a late response for an old prefix is ignored
Data, languages, privacy
- Deduplication — normalize case, spaces, and equivalent variants of the same query
- Multilingual — separate structures per language/region, accounting for layouts and transliteration
- Log privacy — aggregate frequencies across many users, k-anonymity, drop the rare personal tail
- Counter storage — frequencies live in a key-value store sharded by query
Trade-offs and Common Mistakes
Topics you should definitely cover
- Boundary: query suggestions versus document retrieval
- Memory ↔ latency trade: top-K at nodes versus compute on the fly
- Separating online reads from offline structure rebuilds
- Data freshness versus rebuild and rollout cost
- Graceful degradation: what to show if the suggestion service is down
Common follow-up questions
- How to deliver fresh trends within hours, not days?
- How to add personalization without breaking the cache?
- How to handle typos without a latency spike?
- How to filter abuse and unsafe content?
- How to defend against bots inflating frequencies?
Common interview mistakes
- Confusing suggestions with search and designing an inverted index instead of a prefix structure
- Updating the trie in real time on every keystroke instead of an offline rebuild
- Forgetting edge caching and debounce — and inflating backend QPS
- Ignoring spam filtering and query-log privacy
Key Takeaways
Suggestions ≠ search
Typeahead ranks query candidates by prefix, not documents by query
Prefix structure
A trie/compressed trie/FST with precomputed top-K at nodes answers in roughly the prefix length
Online versus offline
The hot path only reads; frequencies are aggregated and the structure is rebuilt offline
Latency is the key metric
Per-keystroke p99 is measured in tens of milliseconds; edge cache and debounce absorb most load
Quality and safety
Freshness, personalization, spam defense, and log privacy are part of the design, not an add-on
References
Related chapters
- Search System - the neighboring case on full-text search: inverted index, document ranking, and retrieval — where typeahead stops, at query submission.
- CDN (Content Delivery Network) - explains edge caching and geographic distribution that keep per-keystroke suggestion latency low.
- Distributed Key-Value Store - covers sharding and replicas needed to store prefix structures and query-frequency counters.
- Ranking and Recommendation Architecture - goes deeper into personalization and ML signals that extend simple frequency-based candidate ranking.
