System Design Space
Knowledge graphSettings

Updated: June 21, 2026 at 8:08 PM

Vector Search and Approximate Nearest Neighbors (ANN)

medium

The vector search layer in depth: ANN index families (IVF, HNSW), compression (PQ, IVF-PQ, ScaNN), distance metrics, recall/latency/memory trade-offs, hybrid search with BM25 and RRF, metadata filtering, scaling, and systems (FAISS, pgvector, Milvus, Qdrant, Weaviate).

Vector search matters not because we have embeddings, but because exact nearest-neighbor search (kNN) does not scale: cost grows linearly with collection size and dimensionality. Hence approximate search (ANN) — a deliberate trade of accuracy for speed and memory.

This chapter is about the vector search layer itself: index families (IVF, HNSW), compression (PQ, IVF-PQ, ScaNN), the recall/latency/memory quality metrics, and hybrid search. The neighboring RAG chapter uses vector search as a component, and the LLM inference chapter is about the model's own work.

That lens helps discuss index choice, quantization, metadata filtering, and operations (segments, rebuilds, sharding) as one engineering trade-off rather than the tuning of a single library.

Practical value of this chapter

ANN as a trade-off, not magic

Exact kNN scans every vector: cost grows linearly with count and dimensionality, so it does not survive at scale. ANN deliberately gives up some recall for speed and memory. Any index choice is a point on the recall↔latency↔memory surface: improving two corners of the triangle usually costs the third.

IVF and HNSW: two ways to narrow

IVF partitions the space with k-means into nlist cells and scans only the nprobe nearest — fast to build, memory-efficient, but lossy near cell boundaries. HNSW (Malkov & Yashunin, 2016) builds a multi-layer small-world graph: upper layers give long hops, the bottom layer the precise approximation; M, efConstruction, efSearch give the best recall/latency pair at the cost of RAM and build time.

Memory compression: PQ, IVF-PQ, ScaNN

A raw float32 vector takes 4*d bytes. Product Quantization (Jégou, Douze, Schmid, 2011) splits the vector into m subvectors, encodes each with a codebook byte, and estimates distances from tables. IVF-PQ is the billion-scale workhorse. ScaNN (Guo et al., 2019) uses an anisotropic loss to better preserve inner-product ranking. Recall is always the price, so approximate selection is followed by exact reranking.

Hybrid search and honest recall

The dense path (semantics) captures meaning, sparse BM25 catches exact matches and codes; they are fused via RRF by position. Metadata filtering conflicts with index structure: pre-filtering is precise but costly, post-filtering is fast but risks an empty result. recall@k must be measured honestly — against an exact scan on the same query set and metric — or a fast index silently drops relevant results.

Related chapter

GenAI RAG System Architecture

There vector search is a component inside RAG; here we dissect the search layer itself.

Читать обзор

Embeddings turn text, images, and behavior into points in a high-dimensional space where geometric proximity means semantic similarity. Retrieval reduces to finding a query's nearest neighbors. Exact search (kNN) scans every vector and does not scale: cost grows linearly with collection size and dimensionality. Hence approximate nearest neighbor search (ANN) — a deliberate trade of accuracy for speed and memory.

Here we dissect the vector search layer itself: which ANN index to choose, how to balance recall, latency, and memory, and where to turn on hybrid search. Where the boundary with adjacent chapters runs: the neighboring RAG chapter uses vector search as a component (hybrid search, reranking, context assembly), while LLM inference optimization owns the model's own work. What stays on this layer: indexes, distance metrics, and operating the search.

Distance metrics

Cosine similarity

Compares the angle between vectors and ignores their magnitude. It is the default for text embeddings: vector length carries little meaning, only orientation in the space matters.

Inner product (dot product)

Accounts for both angle and magnitude. Common in recommendations and MIPS (maximum inner product search), where the norm encodes popularity or confidence. On normalized vectors the ranking matches cosine.

Euclidean distance (L2)

Geometric distance between points. Natural for images and numeric features. On L2-normalized vectors, L2 and cosine produce the same neighbor order, so metric choice and normalization must stay aligned.

Rule: the index metric, the embedding training, and normalization must stay aligned. A mismatch is a common and silent cause of poor ranking.

IVF vs HNSW: how the search works

On the left, IVF partitions the space into cells (k-means) and only scans the nprobe cells nearest to the query. On the right, HNSW descends through graph layers: upper layers give long-range hops, the bottom layer gives the precise approximation.

IVF (inverted file) versus HNSW (multi-layer graph)IVF — cells and nprobequerynprobe=2: scan only the highlighted cellsHNSW — graph layersL2L1L0descent from upper layers to the exact neighbor on L0

Index families

Flat index (exact)

Linear scan over all vectors. Gives recall = 1.0 and serves as the quality baseline, but cost grows linearly with collection size and dimensionality. Fine up to hundreds of thousands of vectors or as an honest baseline for recall measurement.

IVF (inverted file)

The space is partitioned into cells via k-means over nlist centroids. A query only scans the nprobe nearest cells. Larger nprobe means higher recall and latency. Fast to build and memory-efficient, but quality drops near cell boundaries.

HNSW (multi-layer graph)

A navigable small-world graph with a layer hierarchy (Malkov & Yashunin, 2016). Parameters: M (edges per node), efConstruction (search width at build), and efSearch (search width at query). Best recall/latency in memory, but the graph is RAM-heavy and slower to build.

Vector compression

Product Quantization (PQ)

The vector is split into m subvectors, each encoded by the nearest centroid from its own codebook (typically 256 per subspace = 1 byte). Jégou, Douze, Schmid (2011): a 4*d-byte vector compresses to m bytes, with distances estimated from precomputed tables.

IVF-PQ

A combination: IVF coarsely selects candidate cells, PQ stores residuals in compressed form. This is the workhorse for billion-scale collections — an order of magnitude less memory than Flat or HNSW, at the cost of some recall.

Scalar Quantization (SQ)

Each float32 component is encoded as int8 (or smaller). Simpler than PQ, ~4x compression, small recall loss. Often the first memory-saving step before more aggressive PQ.

ScaNN (anisotropic quantization)

Guo et al. (2019): the loss penalizes the parallel component of a datapoint's residual more than the orthogonal one, which better preserves inner-product ranking (MIPS). State-of-the-art on public benchmarks under heavy compression.

Quality and operational metrics

Recall@k

target >= 0.95

Fraction of true nearest neighbors the approximate search returns. It is only honest when measured against an exact scan on the same query set and the same distance metric.

QPS and latency

p50 / p99

Measure throughput together with tail latency: an average hides degradation, and the recall vs p99 slider is tuned through the search parameters.

Memory per vector

bytes / vector

A raw float32 vector of dimension d takes 4*d bytes. The HNSW graph adds edge overhead; quantization cuts memory several-fold at the cost of recall.

Build time

build time

HNSW and IVF-PQ build noticeably slower than Flat. Rebuild time and cost decide how often the index can be rebuilt under a stream of updates.

Hybrid search and filtering

  • Dense path: semantic search over the ANN index captures meaning and synonyms, but degrades on exact matches, codes, SKUs, and rare terms.
  • Sparse path: BM25 over an inverted index is precise on keywords and exact matches, but does not understand paraphrasing and synonyms.
  • Fusion: Reciprocal Rank Fusion (RRF) merges two ranked lists by position rather than by scores on incomparable scales. Fusion is often followed by reranking — already the territory of the neighboring RAG chapter.
  • Metadata filtering: pre-filtering narrows candidates before search (precise, but breaks the graph/cell structure and can starve the search), post-filtering filters afterward (fast, but risks returning few or zero results under a strict filter).

Updates and scale

  • Inserts: HNSW supports incremental insertion, while IVF-PQ usually relies on trained centroids and copes worse with distribution drift in new data.
  • Deletes: in many indexes deletion is soft (tombstone) — the vector is marked deleted but stays in the graph or list until a rebuild, so memory and recall degrade over time.
  • Rebuilds and segments: systems keep fresh data in small growing segments and periodically merge and rebuild the index in the background so inserts do not block search.
  • Sharding splits the collection across nodes (scatter-gather over shards), while replication raises availability and read QPS.

Systems: what to choose

FAISS

A library from Facebook AI Research, not a service. The reference set of indexes (Flat, IVF, HNSW, PQ, IVF-PQ) with GPU acceleration. Embed it in your own serving when you need full control and have no requirements for transactions or metadata.

pgvector

A PostgreSQL extension: the vector type and distance operators (cosine, L2, inner product), HNSW and IVFFlat indexes. When vectors live next to relational data, a separate system is extra operational load, and pgvector removes that choice.

Milvus / Qdrant / Weaviate

Purpose-built vector databases with sharding, replication, metadata filtering, and hybrid search out of the box. The choice for high QPS, large collections, and product requirements around filters and updates.

Key trade-offs

  • Recall, latency, and memory form a triangle: improving two corners usually costs the third. Any index choice is a point on this trade-off surface.
  • HNSW gives the best recall/latency pair but is RAM-heavy and slow to build; at billion scale, memory often forces the choice toward IVF-PQ.
  • Quantization (PQ/SQ/ScaNN) saves memory several-fold but lowers recall; the typical pattern is approximate candidate selection over compressed codes followed by exact reranking over raw vectors.
  • Strict metadata filtering conflicts with the ANN index structure: pre-filtering is precise but expensive, post-filtering is fast but can return an empty result set.

Common mistakes

Chasing QPS without measuring recall@k against an exact scan — it is easy to get a fast index that silently drops relevant results.
Metric and normalization mismatch: an L2 index, cosine-trained embeddings, and forgotten normalization yield meaningless ranking.
Treating the index as static: without a delete, rebuild, and segment strategy, recall and memory degrade as updates flow in.
Relying on dense-only search where exact matches matter (codes, names, SKUs) instead of a hybrid scheme with BM25 and fusion.

Recommendations

Start from a Flat baseline on a representative query set and measure recall@k honestly against it while tuning nprobe / efSearch.
Pick the index family by scale: HNSW when RAM is plentiful and latency SLOs are strict, IVF-PQ at billions of vectors with limited memory.
For text search, default to considering a hybrid scheme (dense + BM25 + RRF) rather than the vector path alone.
Design the layer as a live system: segments, background rebuilds, sharding, and replication are part of the architecture, not a deployment detail.

References

Source map: HNSW comes from Malkov/Yashunin; product quantization from Jégou/Douze/Schmid; ScaNN from anisotropic vector quantization; FAISS and pgvector document concrete implementations. Any claim about recall, p99, RAM, and build speed should be re-benchmarked on your own dataset, embedding dimension, filters, and hardware.

Related chapters

  • GenAI RAG System Architecture - The neighboring chapter: there vector search is one component inside RAG, alongside hybrid search, reranking, and context assembly for the LLM.
  • LLM Inference Optimization - About the model's own inference: prefill/decode, KV-cache, batching, and weight quantization — a layer parallel to vector search.
  • Search System Design - Classic search with an inverted index and ranking — the basis of the sparse side of hybrid search.
  • Ranking and Recommendation Architecture - ANN search as the candidate-generation stage in recommendations: similar users and items by embeddings.

Enable tracking in Settings