System Design Space

Updated: May 30, 2026 at 12:08 PM

GenAI/RAG System Architecture

medium

Original chapter about production RAG architecture: ingestion, retrieval, answer orchestration, guardrails, evaluation, and SLO-versus-cost trade-offs.

Most RAG systems do not break inside the model. They break at the seams between ingestion, retrieval, answer orchestration, and safety checks.

The chapter breaks the live loop into parts and shows how knowledge ingestion, ranking, guardrails, evaluation, and cost control together determine whether an answer is truly useful.

For interviews and architecture discussions, it is useful because it frames RAG not as a quick prototype, but as a system with SLOs, failure modes, and operational trade-offs.

Practical value of this chapter

Design in practice

Turn guidance on production RAG architecture, retrieval quality, and knowledge control into concrete decisions about data flow, model serving, and quality checkpoints.

Decision quality

Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.

Interview articulation

Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.

Trade-off framing

Make the trade-offs in production RAG architecture, retrieval quality, and knowledge control explicit by weighing experiment speed, quality, explainability, resource budget, and maintenance complexity against each other.

Primary source

Foundational RAG paper (2020)

The paper that formalized RAG as generation grounded in retrieved context.


A GenAI/RAG system architecture is not a single service around an LLM, but a set of connected planes: data ingestion, retrieval, generation orchestration, guardrails, and operational quality evaluation. The system becomes fit for live use only when these planes are designed against one shared contract for latency, quality, and cost.

Below is a practical blueprint you can use as a baseline for an enterprise AI assistant, internal knowledge assistant, or customer support bot.

Reference GenAI/RAG architecture

The diagram shows the RAG path by layers: from knowledge ingestion and indexing to answer generation, guardrails, and fallback behavior.

  1. Knowledge sources and ingestion: documentation, on-call instructions, support tickets, data owners.
  2. Cleaning, chunking, and index: deduplication, normalization, chunks, document versions.
  3. Retrieval and access filters: semantic search, lexical search, ACL, query filters.
  4. Re-ranking and context assembly: reranker, top-k, context blocks, context limit.
  5. Generation and response shaping: system instructions, LLM, citations, client format.
  6. Guardrails and fallback: PII checks, policy checks, fallback, audit.
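
To make the retrieval, access-filter, and context-assembly layers concrete, here is a minimal sketch. It assumes chunks carry ACL metadata from ingestion and uses a toy term-overlap score as a stand-in for real semantic or lexical search. The names `Chunk`, `retrieve`, and `assemble_context` are hypothetical, not taken from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # ACL metadata attached at ingestion (assumed)
    doc_version: str


def lexical_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query terms present in the chunk."""
    q_terms = set(query.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & set(text.lower().split())) / len(q_terms)


def retrieve(query: str, chunks: list, user_groups: set, top_k: int = 3) -> list:
    # The ACL filter runs before scoring, so restricted chunks never
    # become ranking candidates.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    scored = [(lexical_score(query, c.text), c) for c in visible]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def assemble_context(chunks: list, max_chars: int = 500) -> str:
    """Build citable context blocks until the context limit is reached."""
    blocks, used = [], 0
    for c in chunks:
        block = f"[{c.chunk_id} v{c.doc_version}] {c.text}"
        if used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    return "\n".join(blocks)
```

Each context block keeps the chunk identifier and document version so the generation layer can emit citations and the audit layer can trace which source version produced an answer.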

What to keep under control

It helps to view the RAG path not only as a chain of services, but as a balance of retrieval quality, response latency, cost, and rollout safety.

Retrieval quality

hit rate, miss reasons, citation coverage, grounded-answer rate

Live constraints

p95 latency, context window, ACL correctness, provider timeout

Safe rollout

replay set, shadow rollout, freshness, regression checks
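
As one possible way to compute two of the retrieval-quality signals listed above, the sketch below derives hit rate and a miss-reason breakdown from labeled replay records. The record shape (`retrieved`, `relevant`, `miss_reason`) is an assumption made for illustration.

```python
from collections import Counter


def _is_hit(record: dict) -> bool:
    # A hit means at least one retrieved chunk is labeled relevant.
    return bool(set(record["retrieved"]) & set(record["relevant"]))


def hit_rate(records: list) -> float:
    """Share of queries with at least one relevant chunk retrieved."""
    if not records:
        return 0.0
    return sum(_is_hit(r) for r in records) / len(records)


def miss_reasons(records: list) -> Counter:
    """Count labeled reasons for the queries that missed."""
    return Counter(
        r.get("miss_reason", "unlabeled") for r in records if not _is_hit(r)
    )
```

Segmenting these counts by knowledge domain or query type tends to show whether misses come from stale indexes, overly strict filters, or poor chunking.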

Request path: from question to grounded answer

The diagram below shows the synchronous RAG path: from early request checks through retrieval and context assembly to a cited answer with final validation and response shaping.

How a question flows through the RAG path

The synchronous path from user request to cited answer

Step budget: ~30-80 ms

1. Question intake and pre-checks

The system normalizes the request, identifies the scenario, filters out out-of-domain questions, and runs early policy checks before retrieval.
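
A minimal sketch of this intake step, assuming a keyword-based domain check and a single regex policy rule. Both `ALLOWED_TOPICS` and `BLOCKED_PATTERNS` are hypothetical placeholders; a production system would more likely use a classifier and a dedicated policy engine.

```python
import re
from dataclasses import dataclass

ALLOWED_TOPICS = {"billing", "deploy", "oncall"}            # hypothetical domain terms
BLOCKED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]   # hypothetical SSN-like rule


@dataclass
class IntakeResult:
    accepted: bool
    normalized: str
    reason: str


def pre_check(raw: str) -> IntakeResult:
    # Normalize: trim, lowercase, collapse whitespace.
    normalized = " ".join(raw.strip().lower().split())
    if not normalized:
        return IntakeResult(False, normalized, "empty")
    # Early policy check runs before any retrieval work is spent.
    if any(p.search(normalized) for p in BLOCKED_PATTERNS):
        return IntakeResult(False, normalized, "policy")
    # Crude out-of-domain filter based on known topic words.
    if not (set(re.findall(r"[a-z0-9]+", normalized)) & ALLOWED_TOPICS):
        return IntakeResult(False, normalized, "out_of_domain")
    return IntakeResult(True, normalized, "ok")
```

Rejecting a request here is cheap compared with rejecting it after retrieval and generation, which is why the step sits at the front of the path.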

Grounded online answer path

  • This path is tightly constrained by latency.
  • Retrieval quality influences the outcome as much as the model itself.
  • Access checks and guardrails work both before and after generation.
Latency budget, ACL, citations, fallback

SLO and capacity baseline

Latency

P95 < 2.0s

Break the budget into retrieval, model inference, and post-processing components.
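
One way to sketch such a breakdown is shown here. Only the 2.0 s total and the pre-check figure (the upper end of the ~30-80 ms step budget) come from this chapter; every other per-stage number is an illustrative assumption. Summing per-stage p95 values is only a planning approximation, since end-to-end p95 is not the exact sum of stage p95s.

```python
TOTAL_P95_BUDGET_MS = 2000  # from the SLO: P95 < 2.0s

# Per-stage allocations; all values except pre_checks are assumptions.
STAGE_BUDGET_MS = {
    "pre_checks": 80,
    "retrieval": 300,
    "rerank": 200,
    "generation": 1200,
    "post_processing": 150,
}


def headroom_ms(stage_budget: dict, total_ms: int) -> int:
    """Unallocated budget left after summing the stage allocations."""
    return total_ms - sum(stage_budget.values())


def stages_over_budget(observed_p95_ms: dict, stage_budget: dict) -> list:
    """Names of stages whose observed p95 exceeds their allocation."""
    return sorted(
        stage
        for stage, value in observed_p95_ms.items()
        if value > stage_budget.get(stage, 0)
    )
```

Keeping some unallocated headroom gives room for network jitter and provider timeouts without breaching the SLO.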

Quality

Grounded-answer rate > 90%

Measure whether the answer is actually grounded in retrieved context, not only whether it sounds fluent.
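
As a rough illustration of the metric, the sketch below treats an answer as grounded when every sentence shares enough tokens with the retrieved context. This token-overlap rule and the `min_overlap` threshold are assumptions chosen for simplicity; real evaluations typically rely on entailment models, LLM judges, or human review instead.

```python
import re


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_grounded(answer: str, context: str, min_overlap: float = 0.6) -> bool:
    """Crude proxy: each answer sentence must overlap with the context."""
    ctx = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return False
    for sentence in sentences:
        toks = _tokens(sentence)
        if toks and len(toks & ctx) / len(toks) < min_overlap:
            return False
    return True


def grounded_answer_rate(samples: list, min_overlap: float = 0.6) -> float:
    """samples: list of (answer, retrieved_context) pairs."""
    if not samples:
        return 0.0
    grounded = sum(is_grounded(a, c, min_overlap) for a, c in samples)
    return grounded / len(samples)
```

Even a crude proxy like this separates fluent but unsupported answers from supported ones, which is the distinction the SLO is meant to capture.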

Economics

Cost per task within the target range

Control costs via model routing, caching, and context size limits.
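
The effect of context limits and caching on cost per task can be sketched as follows. The prices are hypothetical units per 1,000 tokens, and the model assumes a cache hit costs nothing, which is a simplification.

```python
def cap_context(chunk_token_counts: list, max_context_tokens: int) -> list:
    """Keep chunks in ranked order until the context token limit is reached."""
    kept, used = [], 0
    for n in chunk_token_counts:
        if used + n > max_context_tokens:
            break
        kept.append(n)
        used += n
    return kept


def cost_per_task(
    prompt_tokens: int,
    completion_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Expected cost of one task; cached answers are assumed to be free."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    uncached = (
        prompt_tokens / 1000 * price_in_per_1k
        + completion_tokens / 1000 * price_out_per_1k
    )
    return uncached * (1.0 - cache_hit_rate)
```

Model routing fits the same formula: sending simpler requests to a cheaper model changes the two price parameters for that share of traffic.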

Recommendations

  • Design RAG as two connected systems: a knowledge platform (ingestion and index) and a live answer path (retrieval and generation).
  • Treat retrieval observability as first-class: hit rate, miss reasons, latency, and segmented quality.
  • Stabilize contracts: chunk structure, query filters, context blocks, and the client-facing response format.
  • Before rolling out a new model, run both shadow traffic and historical replay on a benchmark task set.
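
The last recommendation can be sketched as a simple regression gate over replay metrics. The metric names and the `max_drop` tolerance are assumptions; the sketch also assumes higher values are better for every metric.

```python
def regression_report(baseline: dict, candidate: dict, max_drop: float = 0.02) -> dict:
    """Return metrics where the candidate is worse than baseline by more than max_drop.

    Both inputs map metric name -> value measured on the same replay set.
    A metric missing from the candidate is treated as 0.0.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric, 0.0)
        if base_value - cand_value > max_drop:
            regressions[metric] = (base_value, cand_value)
    return regressions


def can_promote(baseline: dict, candidate: dict, max_drop: float = 0.02) -> bool:
    """Gate a rollout: promote only when no tracked metric regressed."""
    return not regression_report(baseline, candidate, max_drop)
```

Running the same gate on shadow traffic and on the historical replay set catches regressions that only appear on one of the two distributions.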

Common pitfalls

  • Evaluating only with BLEU/ROUGE, without product metrics or grounding checks.
  • Indexing raw data without cleanup, deduplication, and source version control.
  • Trying to fix quality only through prompt wording while ignoring retrieval quality and data freshness.
  • Applying access control after generation instead of before context retrieval.

Launch mini-checklist

  1. A source catalog exists and each knowledge domain has data owners.
  2. Each use case has defined quality metrics, a latency SLO, and cost guardrails.
  3. Rule checks are enabled on both input and output, with auditable decision logs.
  4. Canary and shadow rollout paths are configured together with replay-based regression tests.
  5. Fallback paths exist for retrieval, reranker, and LLM provider failures.
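
Item 5 of the checklist could look roughly like this. The three callables and the status strings are hypothetical; the sketch only shows how each failure maps to a distinct degraded behavior rather than a generic error.

```python
def answer_with_fallbacks(query, retrieve, rerank, generate) -> dict:
    """Run the answer path with one fallback per failing dependency.

    retrieve(query) -> list of chunks
    rerank(query, chunks) -> list of chunks
    generate(query, chunks) -> answer text
    """
    try:
        candidates = retrieve(query)
    except Exception:
        # Without retrieval there is nothing to ground on: refuse to answer.
        return {"answer": None, "sources": [], "status": "retrieval_unavailable"}

    status = "ok"
    try:
        ranked = rerank(query, candidates)
    except Exception:
        # Degrade gracefully: keep the original retrieval order.
        ranked = candidates
        status = "degraded_no_rerank"

    try:
        text = generate(query, ranked)
    except Exception:
        # Return the sources so the client can still show relevant documents.
        return {"answer": None, "sources": ranked, "status": "llm_unavailable"}

    return {"answer": text, "sources": ranked, "status": status}
```

Returning an explicit status makes each fallback visible in audit logs and lets the client choose how to present a degraded answer.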
