System Design Space

Updated: March 15, 2026 at 9:08 PM

GenAI/RAG System Architecture


Original chapter about production GenAI/RAG architecture: ingestion, retrieval, orchestration, guardrails, evaluation, and latency/cost trade-offs.

This Theme 13 chapter focuses on RAG architecture, retrieval quality, and context control.

In real AI/ML system design, this material helps connect model choices to platform architecture: data pipeline, inference path, operating cost, and reliability requirements.

For system design interviews, the chapter provides a clear vocabulary for ML trade-offs: model quality, latency/throughput, data safety, observability, and production evolution.

Practical value of this chapter

Design in practice

Translate guidance on RAG architecture, retrieval quality, and context control into architecture decisions for data flow, model serving, and quality control points.

Decision quality

Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.

Interview articulation

Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.

Trade-off framing

Make trade-offs explicit for RAG architecture, retrieval quality, and context control: experiment speed, quality, explainability, resource budget, and maintenance complexity.

Primary source

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)

Foundational work that formalized the retrieval-augmented generation approach.


GenAI/RAG System Architecture is not a single service around an LLM, but a set of connected planes: data ingestion, retrieval, generation orchestration, guardrails, and operational quality evaluation. The system becomes production-ready only when these planes are designed as one contract for latency, quality, and cost.

Below is a practical blueprint you can use as a baseline for an enterprise AI assistant, knowledge copilot, or customer support bot.

Reference GenAI/RAG architecture

Knowledge ingestion

  • Collect sources (docs, runbooks, tickets, wiki) and assign ownership for each source.
  • Build text cleanup and normalization before indexing: remove duplicates, noise, and stale versions.
  • Use document versioning and incremental reindexing so search quality does not regress after each deploy.
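The dedup-and-version rule above can be sketched as a small ingestion step. This is a minimal illustration with hypothetical names (`Chunk`, `dedupe_latest`); a real pipeline would also handle chunk splitting, language detection, and source ACL metadata.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    source_id: str
    version: int
    text: str

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivial formatting
    # changes do not defeat deduplication.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe_latest(chunks: list[Chunk]) -> list[Chunk]:
    """Keep only the newest version per source, then drop exact duplicates."""
    latest: dict[str, Chunk] = {}
    for c in chunks:
        if c.source_id not in latest or c.version > latest[c.source_id].version:
            latest[c.source_id] = c
    seen: set[str] = set()
    result: list[Chunk] = []
    for c in latest.values():
        h = content_hash(c.text)
        if h not in seen:   # skip stale copies of the same content
            seen.add(h)
            result.append(c)
    return result
```

Running this step incrementally (only on changed sources) is what keeps reindexing cheap and prevents quality regressions after each deploy.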

Retrieval plane

  • Combine dense and lexical retrieval to reduce misses for critical fragments.
  • Keep metadata filters (tenant, product, language, ACL) as a mandatory query contract.
  • Introduce reranking after measurement: it is often the best quality boost at moderate cost.
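One common way to combine dense and lexical results is Reciprocal Rank Fusion (RRF), which needs only the two ranked lists, not comparable scores. A minimal sketch (function names are illustrative; the metadata/ACL filter is assumed to run before either retriever):

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge two ranked lists of doc ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so docs ranked highly by BOTH retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A reranker would then rescore only the fused top-k, which is why it adds quality at moderate cost: it sees few candidates but reads them deeply.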

Generation orchestration

  • Treat prompt templates as code: system prompt, policy block, retrieved context, and user intent.
  • Set token and latency budgets before model invocation; otherwise tail latency quickly violates the SLO.
  • Add fallback paths: response cache, smaller model, or graceful partial degradation during overload.
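Treating the prompt as a contract with a token budget might look like the sketch below. All names are illustrative, and `count_tokens` is passed in because the real counter depends on the model's tokenizer:

```python
def assemble_prompt(system: str, policy: str, context_chunks: list[str],
                    question: str, max_context_tokens: int,
                    count_tokens) -> str:
    """Fill the prompt slots, trimming context to fit the token budget.

    context_chunks are assumed pre-sorted by relevance, so trimming
    drops the least relevant material first.
    """
    kept: list[str] = []
    used = 0
    for chunk in context_chunks:
        t = count_tokens(chunk)
        if used + t > max_context_tokens:
            break
        kept.append(chunk)
        used += t
    # Fixed slot order: system prompt, policy block, retrieved context, user intent.
    return "\n\n".join([system, policy, "\n---\n".join(kept), question])
```

Enforcing the budget here, before invocation, is what keeps generation latency bounded: the model never sees more context than the latency budget allows.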

Guardrails and governance

  • Validate request and response for PII/secrets, policy violations, and prompt injection.
  • Apply authorization at retrieval time, not after answer generation.
  • Log guardrail decisions with reason codes for investigations and audits.
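A guardrail decision with reason codes can be a plain, auditable record. The regexes below are deliberately crude placeholders; production systems use dedicated PII detectors and injection classifiers, but the shape of the logged decision is the point:

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def guardrail_check(text: str) -> dict:
    """Return an auditable decision record with machine-readable reason codes."""
    reasons = [f"pii:{name}" for name, pat in PII_PATTERNS.items()
               if pat.search(text)]
    return {"allowed": not reasons, "reason_codes": reasons}
```

Logging the full record (not just a boolean) is what makes later investigations and audits possible: you can answer *why* a request was blocked months after the fact.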

Evaluation and operations

  • Separate offline eval (retrieval/generation quality) from online eval (task success, CSAT, containment rate).
  • Track cost per resolved task, not only cost per 1K tokens.
  • Maintain replay sets and regression checks for safe updates of embeddings, prompts, and models.
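The "cost per resolved task" metric above is a simple ratio, but writing it down makes the difference from token cost concrete. A hypothetical sketch (parameter names are assumptions):

```python
def cost_per_resolved_task(token_cost: float, infra_cost: float,
                           tasks_attempted: int, containment_rate: float) -> float:
    """Total spend divided by tasks the assistant actually resolved.

    containment_rate: share of attempted tasks resolved without human handoff.
    """
    resolved = tasks_attempted * containment_rate
    if resolved == 0:
        raise ValueError("no resolved tasks in this window")
    return (token_cost + infra_cost) / resolved
```

A cheap model with low containment can easily lose to a pricier model on this metric, which is exactly the trade-off that cost-per-1K-tokens hides.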

Request path: from question to grounded answer

1) Intent and policy pre-check

~30-80 ms

Normalize the request, detect use-case intent, and run policy checks before retrieval. This is where out-of-domain requests can be filtered early.

2) Retrieval + rerank

~80-250 ms

Fetch relevant context with ACL and user context constraints. Reranking improves top-k precision before generation.

3) Prompt assembly + generation

~300-1500 ms

Build the final prompt contract, invoke the LLM, and enforce max token and stop conditions.

4) Post-check + response shaping

~40-120 ms

Run response checks (safety/compliance), add citations, and format output for the target UI client.
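Summing the upper ends of the four stage budgets shows how tight the end-to-end envelope is. A minimal sanity check, using the ranges from the steps above and a 2-second target:

```python
# Upper ends of the per-stage latency ranges listed above, in milliseconds.
STAGE_BUDGETS_MS = {
    "pre_check": 80,
    "retrieval_rerank": 250,
    "generation": 1500,
    "post_check": 120,
}

def within_slo(slo_ms: int = 2000) -> bool:
    """True if the worst-case sum of stage budgets still fits the SLO."""
    return sum(STAGE_BUDGETS_MS.values()) <= slo_ms
```

At 1950 ms worst case against a 2000 ms target, there is almost no slack, which is why the fallback paths (cache, smaller model, partial degradation) exist: any stage running hot immediately threatens the tail.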

SLO and capacity baseline

Latency

P95 < 2.0s

Break budget into retrieval, model inference, and post-processing components.

Quality

Grounded answer rate > 90%

Measure grounding against retrieved context, not only fluency.

Economics

Cost/task within target range

Control costs via model routing, caching, and context size limits.
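Measuring grounding against retrieved context can start with a crude lexical check: what share of an answer's tokens appear in the context it cites. This is only a proxy (production systems use NLI models or LLM judges), and all names below are illustrative:

```python
def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Crude lexical grounding check: share of answer tokens found in context."""
    toks = set(answer.lower().split())
    ctx = set(context.lower().split())
    return len(toks & ctx) / max(len(toks), 1) >= threshold

def grounded_answer_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (answer, retrieved_context). Share of answers passing the check."""
    return sum(is_grounded(a, c) for a, c in pairs) / len(pairs)
```

Tracking this rate per segment (tenant, product, language) catches regressions that a single global number averages away.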

Recommendations

  • Design RAG as two systems: data platform (ingestion/index) and online serving (retrieval/generation).
  • Treat retrieval observability as first-class: hit@k, miss reasons, latency, and segmented quality.
  • Stabilize contracts: chunk schema, query filters, prompt slots, and client-facing response format.
  • Before model rollout, run shadow traffic and offline replay on a benchmark task set.

Common pitfalls

  • Evaluating only BLEU/ROUGE without product metrics and grounding checks.
  • Indexing raw data without cleanup, deduplication, and source version control.
  • Trying to fix quality only with prompts while ignoring retrieval quality and data freshness.
  • Applying access control after generation instead of before context retrieval.

Launch mini-checklist

  1. A source catalog exists and each knowledge domain has data owners.
  2. Each use-case has defined quality metrics, latency SLO, and cost guardrails.
  3. Policy checks are enabled at both input and output, with auditable decision logs.
  4. Canary/shadow rollout and replay regression tests are configured.
  5. Fallback paths exist for retrieval, reranker, and LLM provider failures.
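Item 5 above can be sketched as an ordered fallback chain: cache first, then providers in priority order, then graceful degradation. A minimal illustration with hypothetical names:

```python
def call_with_fallback(prompt: str, providers: list, cache: dict) -> str:
    """Try the response cache, then each provider in order; degrade gracefully.

    providers: callables (e.g. primary LLM, smaller backup model) that take
    the prompt and return an answer, raising on failure.
    """
    if prompt in cache:
        return cache[prompt]
    for call in providers:
        try:
            answer = call(prompt)
            cache[prompt] = answer   # populate cache for future overload
            return answer
        except Exception:
            continue                 # provider down: fall through to the next
    # Last resort: partial degradation instead of a hard error.
    return "Sorry, I can't answer right now. Please try again."
```

In production the chain would also emit metrics per fallback level, so an SRE can see when traffic is silently running on the backup model.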


© 2026 Alexander Polomodov