Search Autocomplete (Typeahead) — System Design Space

Search autocomplete (typeahead) is a panel of top-N query candidates that refreshes as you type, and it shows up almost everywhere: search boxes, browser address bars, messenger fields, and store product search. It looks like a small feature, but behind it sits a standalone architecture with an extremely tight per-keystroke latency budget.

Unlike the neighboring full-text search case, here we rank query candidates by prefix rather than documents by query. The core is a prefix structure (trie, compressed trie, or FST) with precomputed top-K at nodes, a hot read path through an edge cache, and a heavy offline rebuild of the dictionary from aggregated query frequencies.

In interviews this case is valued for its clean online/offline split, clear memory-versus-latency trade-offs, and a wealth of practical details: debounce, sharding by prefix, freshness, personalization, fuzzy typo handling, and log privacy. Working through it gives you a ready framework for reasoning about any suggestion system.

Pipeline Thinking

Ingestion, partitioning, deduplication, and stage latency drive system behavior.

Serving Layer

Index and cache-locality decisions directly shape user-facing query latency.

Consistency Window

Explicitly define where eventual consistency is acceptable and where it is not.

Cost vs Freshness

Balance update frequency with compute/storage cost and operational complexity.

Problem Framing and Boundaries

The user hasn't finished typing yet, but the system already has to guess what they want. Search autocomplete (typeahead) is a panel of top-N query candidates that refreshes as you type: the prefix "sys" is entered, and "system design", "system administrator", and so on appear at once. Suggestions save keystrokes and help users phrase their intent; they show up in search boxes, browser address bars, the "To:" fields of messengers, and product search in stores. The hard part here is not the output — it is that you must answer on every keystroke, faster than the person types.

Boundary with the search system

The neighboring case, "Search System", is about full-text search: inverted index, document ranking, and retrieval over an already-submitted query. THIS case is about what happens before submission: prefix search, extremely low per-keystroke latency, and ranking of query candidates (not documents). Suggestions guess what the user is most likely looking for; search answers the query they finally pick.

What is NOT in scope

Document retrieval

Retrieval and result ranking belong to search-system-case

Result-page spell check

"Did you mean…" after a search is a separate subsystem

Semantic search

Vector matching by meaning is a different class of problem

Functional Requirements

Core Features

Top-N (usually 5–10) suggestions for the typed prefix
Rank candidates by popularity (query frequency)
Account for freshness: a surge of interest in a new query reaches suggestions
Highlight the matched prefix in the suggestion list
Filter spam, abuse, and unsafe content

Extended Features

Personalization by user history and context
Typo tolerance (fuzzy / edit distance)
Multilingual support and different keyboard layouts
Matching on words inside the query, not just at the start of the string

Non-functional Requirements

A suggestion must appear before the next keystroke, or it is useless. So the key metric is per-keystroke p99 latency, not the time of a full search. Meanwhile the load measured in QPS is several times higher than search itself: every character of the query triggers its own suggestion call.

<100ms

p99 per keystroke (estimate)

~5–8×

QPS relative to search itself (estimate)

99.9%+

Availability (graceful degradation)

hours

Acceptable dictionary freshness (estimate)

Why latency matters more than in search

Suggestions compete with typing speed. If the user types faster than answers arrive, the panel "flickers" with stale options. So the per-keystroke budget is measured in tens of milliseconds (with a p99 ceiling around 100 ms), not hundreds, and availability is built on graceful degradation: if the suggestion service is down, the input box still works — the user just sees no panel. Suggestions need not be strongly consistent: a small delay in updating frequencies is unnoticeable.

Scale Estimates

Connection

CDN and edge cache

Where to absorb most suggestion traffic closer to the user

Читать обзор

The numbers below are order-of-magnitude estimates for an interview, not real data from any specific company. The main nonlinear multiplier is query length: each keystroke after the first characters generates a separate suggestion call.

Sizing assumptions

DAU, searches per day, effective calls after debounce, and the peak multiplier are inputs to the interview model. Change them for the product and recompute QPS, memory, and egress; the figures below are not promises about Google, Amazon, or any other specific service.

Back-of-the-envelope load (estimate)

Given (orders of magnitude for an interview):
  DAU                          = 100M users
  searches per user/day        = 5
  searches/day                 = 500M
  average query length         = ~20 characters

Keystrokes → suggestion requests:
  we usually don't call on every character, but debounce (every N chars)
  assume ~6 effective suggestion requests per search
  suggestion requests/day      = 500M × 6 = 3B

Average suggestion QPS:
  3B / 86,400 s                ≈ 35,000 QPS
Peak QPS (×3)                  ≈ 100,000 QPS

Takeaway: suggestions are a hotter path than search itself,
          so edge caching and top-K precomputation are critical.

All values are marked as estimates. In an interview the reasoning matters more than "exact" figures.

Dictionary size and memory

Unique queries — tens to hundreds of millions after trimming the "long tail"
Raw dictionary — at ~20 bytes per query, that is a few gigabytes of text
Prefix structure — a compressed trie/FST shrinks this several-fold by sharing prefixes

Response throughput

Response — 10 suggestions × ~30 bytes ≈ 300 bytes plus overhead
Egress — 100K QPS × ~0.5 KB ≈ 50 MB/s outbound (estimate)
Cache — a high hit rate on short prefixes drastically reduces backend load

Prefix Data Structures

Connection

Inverted index vs trie

Search relies on an inverted index; suggestions rely on a prefix structure

Читать обзор

The base structure for prefix search is the trie (prefix tree): a shared path from the root describes a shared prefix, and the queries themselves live at nodes and leaves. The key idea for speed is to store a ready list of best suggestions (top-K) at the node, so you don't traverse the whole subtree on every keystroke.

How top-K at nodes speeds up the answer

Trie with precomputed top-K at nodes:

         (root)
           │
           s ──► top-K: [system design, system, ...]
           │
           y ──► top-K: [system design, system, ...]
           │
           s ──► top-K: [system design, system admin, ...]
                 (prefix "sys")

Query for "sys":
  1. walk nodes s → y → s          (O(prefix length))
  2. return the ready top-K        (O(K), no subtree traversal)

Without top-K at nodes you'd have to traverse the entire
subtree and sort candidates on every keystroke.

Storage options

Plain trie
Simple but memory-hungry: many single-child nodes
Compressed trie (radix tree)
Merges single-child chains — less memory, same prefix lookup
FST (finite state transducer)
A minimized automaton that shares both prefixes and suffixes. This is exactly what Lucene and Elasticsearch use for the completion suggester

Memory ↔ latency trade-off

Top-K at nodes
Memory grows (K options per node), but the answer is ~O(prefix length)
Compute on the fly
Saves memory but requires subtree traversal and sorting — costlier in latency
FST in memory
Very compact but higher CPU while walking the automaton — the classic RAM-for-CPU trade

💡 FST in Lucene/Elasticsearch

The completion suggester in Elasticsearch is built on a weighted FST kept entirely in memory. Per Elastic's 2013 published benchmark, on ~2.1M Wikipedia titles the completion suggester answered in well under a millisecond versus several milliseconds for a plain prefix query. The trade is direct: an FST saves RAM and gives fast prefix lookup, but is costlier to build and updated per segment rather than one record at a time.

Architecture: Online and Offline

Connection

Sharding the structure

How to spread the prefix structure and frequency counters across shards

Читать обзор

The main architectural idea is to split the hot read path from the heavy update path. On reads we only walk a ready structure and return top-K. Updates (collecting logs, aggregating frequencies, rebuilding the structure) are an offline process that periodically rolls out a new version of the structure.

Trie-based Autocomplete

Select a prefix to inspect trie path and suggestions

Trie Path

rootsys

Prefix length

3 chars

Latency target

< 50ms

Note

Top results are stored in the prefix node

Top Suggestions

system design
system
system administrator
system metrics

Top-K Cache

Trie nodes store top suggestions (Top-10) to avoid traversing the whole subtree.

Read path (online, hot)

1.Keystroke → request to the nearest edge node / cache (CDN)
2.Cache miss → the suggestion service picks the right shard by prefix
3.Walk down the trie/FST to the prefix node
4.Return the precomputed top-K (optionally re-rank per user)
5.Write the result to cache with a short TTL

Update path (offline, heavy)

1.Collect query logs into a queue / log stream
2.Aggregate frequencies over a window (hour/day) with decay for old data
3.Filtering: spam, abuse, rare "long tail"
4.Offline rebuild of the trie/FST with ready top-K at nodes
5.Atomic rollout of the new structure version to read nodes

Sharding by prefix

The structure is split by ranges of leading characters: each node holds its slice of the dictionary. A router uses the first 1–2 characters to send the request to the right shard. This caps per-node memory and spreads load, but creates "hot" shards on popular letters — they are balanced with extra replicas.

shard(prefix) = route_by_first_chars(prefix)

  prefixes a–f  → shard 1 (+ replicas)
  prefixes g–m  → shard 2 (+ replicas)
  prefixes n–z  → shard 3 (+ replicas)

Hot letters get more replicas;
very short prefixes are served from the edge cache.

Ranking Query Candidates

Unlike search, here we rank query candidates, not documents. The base signal is query frequency; on top of it we layer freshness, personalization, and context. From a short prefix you have to guess intent before the user finishes typing it — so the signals are tuned to surface the likely candidate in the top rows from the first keystrokes.

Ranking signals

Frequency
How often a query is typed — the candidate's main weight
Recency
Time decay lifts what is popular now, not a year ago
Personalization
User history and segment shift the order
Context
Language, region, device, time of day

Defense against spam and bursts

Frequency threshold
Rare queries and typos never enter the dictionary
Block-lists and classifiers
Abuse, unsafe, and adult content are filtered before output
Anti-manipulation
An anomalous spike of one query from a narrow bot group is damped
Fresh-tail control
A new query is promoted cautiously so noise isn't sold as a trend

⚡ A simple weighting formula (for interviews)

score(q) = w1 × log(freq(q))
         + w2 × recency_decay(q)
         + w3 × personalization(q, user)
         − w4 × spam_penalty(q)

freq        — aggregated query frequency
recency     — exponential time decay
personal    — closeness to user history
spam        — penalty for anomalies and block-lists

The formula is illustrative. In practice weights are tuned offline, and a heavier ML model is applied only as an optional re-ranker over the short top-K.

Typos and Fuzziness

Exact prefix search does not forgive typos: "sytsem" will not find "system". Fuzziness is added deliberately, because it is more expensive — it is a trade-off between suggestion quality and the latency budget.

Edit distance

Allow 1–2 edits (insertion, deletion, substitution). On an FST this is implemented via a Levenshtein automaton — but fuzzy expands the search space, so it is usually enabled only for short or rarely-matching prefixes.

N-grams

The query is split into n-grams (e.g. of 3 characters) and matched by intersection. This catches typos and in-string matches, but needs a separate index and produces noisier candidates than a strict prefix.

When to enable fuzzy

Try the exact prefix first; enable fuzzy as a fallback when there are too few exact candidates. That keeps latency low in the typical case while still handling typos.

💡 Fuzzy in the completion suggester

Elasticsearch's completion suggester supports fuzzy queries with a configurable edit distance directly over the FST: edits are allowed starting from a given position (so the start of the word doesn't "drift") and their number is capped. This is a typical trick — keep the start of the prefix strict, and allow typo tolerance only further in.

Deep Dives

Caching and delivery

Edge cache for short prefixes — the most frequent query starts are served from the CDN, never reaching the backend
Client-side debounce — don't send a request on every keystroke; wait for a pause in typing
Incremental delivery — the client extends the shown list rather than redrawing it entirely
Discarding stale answers — a late response for an old prefix is ignored

Data, languages, privacy

Deduplication — normalize case, spaces, and equivalent variants of the same query
Multilingual — separate structures per language/region, accounting for layouts and transliteration
Log privacy — aggregate frequencies across many users, k-anonymity, drop the rare personal tail
Counter storage — frequencies live in a key-value store sharded by query

Trade-offs and Common Mistakes

Topics you should definitely cover

Boundary: query suggestions versus document retrieval
Memory ↔ latency trade: top-K at nodes versus compute on the fly
Separating online reads from offline structure rebuilds
Data freshness versus rebuild and rollout cost
Graceful degradation: what to show if the suggestion service is down

Common follow-up questions

How to deliver fresh trends within hours, not days?
How to add personalization without breaking the cache?
How to handle typos without a latency spike?
How to filter abuse and unsafe content?
How to defend against bots inflating frequencies?

Common interview mistakes

Confusing suggestions with search and designing an inverted index instead of a prefix structure
Updating the trie in real time on every keystroke instead of an offline rebuild
Forgetting edge caching and debounce — and inflating backend QPS
Ignoring spam filtering and query-log privacy

Key Takeaways

Suggestions ≠ search

Typeahead ranks query candidates by prefix, not documents by query

Prefix structure

A trie/compressed trie/FST with precomputed top-K at nodes answers in roughly the prefix length

Online versus offline

The hot path only reads; frequencies are aggregated and the structure is rebuilt offline

Latency is the key metric

Per-keystroke p99 is measured in tens of milliseconds; edge cache and debounce absorb most load

Quality and safety

Freshness, personalization, spam defense, and log privacy are part of the design, not an add-on

References

Elastic — Suggesters: completion suggester, prefix and fuzzy suggestions (Elasticsearch Docs)Alexander Reelsen — You complete me: FST in the completion suggester and latency benchmarks (Elastic Blog, 2013)Apache Lucene — suggest/document: SuggestField, PrefixCompletionQuery, NRTSuggester over a weighted FST (Lucene 9.9 Docs)Mike McCandless — Using Finite State Transducers in Lucene: FST internals and compactness Andrew Gallant — Index 1,600,000,000 Keys with Automata and Rust: FST, prefix and fuzzy queries

Related chapters

Search System - the neighboring case on full-text search: inverted index, document ranking, and retrieval — where typeahead stops, at query submission.
CDN (Content Delivery Network) - explains edge caching and geographic distribution that keep per-keystroke suggestion latency low.
Distributed Key-Value Store - covers sharding and replicas needed to store prefix structures and query-frequency counters.
Ranking and Recommendation Architecture - goes deeper into personalization and ML signals that extend simple frequency-based candidate ranking.