System Design Space

Updated: March 15, 2026 at 9:08 PM

GenAI/RAG System Architecture


Original chapter about production GenAI/RAG architecture: ingestion, retrieval, orchestration, guardrails, evaluation, and latency/cost trade-offs.

This Theme 13 chapter focuses on RAG architecture, retrieval quality, and context control.

In real AI/ML system design, this material helps connect model choices to platform architecture: data pipeline, inference path, operating cost, and reliability requirements.

For system design interviews, the chapter provides a clear vocabulary for ML trade-offs: model quality, latency/throughput, data safety, observability, and production evolution.

Practical value of this chapter

Design in practice

Translate guidance on RAG architecture, retrieval quality, and context control into architecture decisions for data flow, model serving, and quality control points.

Decision quality

Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.

Interview articulation

Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.

Trade-off framing

Make trade-offs explicit for RAG architecture, retrieval quality, and context control: experiment speed, quality, explainability, resource budget, and maintenance complexity.

Primary source

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)

Foundational work that formalized the retrieval-augmented generation approach.


GenAI/RAG System Architecture is not a single service around an LLM, but a set of connected planes: data ingestion, retrieval, generation orchestration, guardrails, and operational quality evaluation. The system becomes production-ready only when these planes are designed as one contract for latency, quality, and cost.

Below is a practical blueprint you can use as a baseline for an enterprise AI assistant, knowledge copilot, or customer support bot.

Reference GenAI/RAG architecture

Knowledge ingestion

  • Collect sources (docs, runbooks, tickets, wiki) and assign ownership for each source.
  • Build text cleanup and normalization before indexing: remove duplicates, noise, and stale versions.
  • Use document versioning and incremental reindexing so search quality does not regress after each deploy.
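The dedup-and-version rule above can be sketched as a small ingestion step. This is a minimal illustration with hypothetical names (`Chunk`, `dedupe_latest`); a real pipeline would also handle chunk splitting, language detection, and source ACL metadata.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    source_id: str
    version: int
    text: str

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivial formatting
    # changes do not defeat deduplication.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe_latest(chunks: list[Chunk]) -> list[Chunk]:
    """Keep only the newest version per source, then drop exact duplicates."""
    latest: dict[str, Chunk] = {}
    for c in chunks:
        if c.source_id not in latest or c.version > latest[c.source_id].version:
            latest[c.source_id] = c
    seen: set[str] = set()
    result: list[Chunk] = []
    for c in latest.values():
        h = content_hash(c.text)
        if h not in seen:   # skip stale copies of the same content
            seen.add(h)
            result.append(c)
    return result
```

Running this step incrementally (only on changed sources) is what keeps reindexing cheap and prevents quality regressions after each deploy.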

Retrieval plane

  • Combine dense and lexical retrieval to reduce misses for critical fragments.
  • Keep metadata filters (tenant, product, language, ACL) as a mandatory query contract.
  • Introduce reranking after measurement: it is often the best quality boost at moderate cost.
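One common way to combine dense and lexical results is Reciprocal Rank Fusion (RRF), which needs only the two ranked lists, not comparable scores. A minimal sketch (function names are illustrative; the metadata/ACL filter is assumed to run before either retriever):

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge two ranked lists of doc ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so docs ranked highly by BOTH retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A reranker would then rescore only the fused top-k, which is why it adds quality at moderate cost: it sees few candidates but reads them deeply.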

Generation orchestration

  • Treat prompt templates as code: system prompt, policy block, retrieved context, and user intent.
  • Set token and latency budgets before model invocation; otherwise tail latency quickly violates the SLO.
  • Add fallback paths: response cache, smaller model, or graceful partial degradation during overload.
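Treating the prompt as a contract with a token budget might look like the sketch below. All names are illustrative, and `count_tokens` is passed in because the real counter depends on the model's tokenizer:

```python
def assemble_prompt(system: str, policy: str, context_chunks: list[str],
                    question: str, max_context_tokens: int,
                    count_tokens) -> str:
    """Fill the prompt slots, trimming context to fit the token budget.

    context_chunks are assumed pre-sorted by relevance, so trimming
    drops the least relevant material first.
    """
    kept: list[str] = []
    used = 0
    for chunk in context_chunks:
        t = count_tokens(chunk)
        if used + t > max_context_tokens:
            break
        kept.append(chunk)
        used += t
    # Fixed slot order: system prompt, policy block, retrieved context, user intent.
    return "\n\n".join([system, policy, "\n---\n".join(kept), question])
```

Enforcing the budget here, before invocation, is what keeps generation latency bounded: the model never sees more context than the latency budget allows.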

Guardrails and governance

  • Validate request and response for PII/secrets, policy violations, and prompt injection.
  • Apply authorization at retrieval time, not after answer generation.
  • Log guardrail decisions with reason codes for investigations and audits.
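A guardrail decision with reason codes can be a plain, auditable record. The regexes below are deliberately crude placeholders; production systems use dedicated PII detectors and injection classifiers, but the shape of the logged decision is the point:

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def guardrail_check(text: str) -> dict:
    """Return an auditable decision record with machine-readable reason codes."""
    reasons = [f"pii:{name}" for name, pat in PII_PATTERNS.items()
               if pat.search(text)]
    return {"allowed": not reasons, "reason_codes": reasons}
```

Logging the full record (not just a boolean) is what makes later investigations and audits possible: you can answer *why* a request was blocked months after the fact.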

Evaluation and operations

  • Separate offline eval (retrieval/generation quality) from online eval (task success, CSAT, containment rate).
  • Track cost per resolved task, not only cost per 1K tokens.
  • Maintain replay sets and regression checks for safe updates of embeddings, prompts, and models.
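The "cost per resolved task" metric above is a simple ratio, but writing it down makes the difference from token cost concrete. A hypothetical sketch (parameter names are assumptions):

```python
def cost_per_resolved_task(token_cost: float, infra_cost: float,
                           tasks_attempted: int, containment_rate: float) -> float:
    """Total spend divided by tasks the assistant actually resolved.

    containment_rate: share of attempted tasks resolved without human handoff.
    """
    resolved = tasks_attempted * containment_rate
    if resolved == 0:
        raise ValueError("no resolved tasks in this window")
    return (token_cost + infra_cost) / resolved
```

A cheap model with low containment can easily lose to a pricier model on this metric, which is exactly the trade-off that cost-per-1K-tokens hides.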

Request path: from question to grounded answer

1) Intent and policy pre-check

~30-80 ms

Normalize the request, detect use-case intent, and run policy checks before retrieval. This is where out-of-domain requests can be filtered early.

2) Retrieval + rerank

~80-250 ms

Fetch relevant context with ACL and user context constraints. Reranking improves top-k precision before generation.

3) Prompt assembly + generation

~300-1500 ms

Build the final prompt contract, invoke the LLM, and enforce max token and stop conditions.

4) Post-check + response shaping

~40-120 ms

Run response checks (safety/compliance), add citations, and format output for the target UI client.
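Summing the upper ends of the four stage budgets shows how tight the end-to-end envelope is. A minimal sanity check, using the ranges from the steps above and a 2-second target:

```python
# Upper ends of the per-stage latency ranges listed above, in milliseconds.
STAGE_BUDGETS_MS = {
    "pre_check": 80,
    "retrieval_rerank": 250,
    "generation": 1500,
    "post_check": 120,
}

def within_slo(slo_ms: int = 2000) -> bool:
    """True if the worst-case sum of stage budgets still fits the SLO."""
    return sum(STAGE_BUDGETS_MS.values()) <= slo_ms
```

At 1950 ms worst case against a 2000 ms target, there is almost no slack, which is why the fallback paths (cache, smaller model, partial degradation) exist: any stage running hot immediately threatens the tail.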

SLO and capacity baseline

Latency

P95 < 2.0s

Break budget into retrieval, model inference, and post-processing components.

Quality

Grounded answer rate > 90%

Measure grounding against retrieved context, not only fluency.

Economics

Cost/task within target range

Control costs via model routing, caching, and context size limits.
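Measuring grounding against retrieved context can start with a crude lexical check: what share of an answer's tokens appear in the context it cites. This is only a proxy (production systems use NLI models or LLM judges), and all names below are illustrative:

```python
def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Crude lexical grounding check: share of answer tokens found in context."""
    toks = set(answer.lower().split())
    ctx = set(context.lower().split())
    return len(toks & ctx) / max(len(toks), 1) >= threshold

def grounded_answer_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs: (answer, retrieved_context). Share of answers passing the check."""
    return sum(is_grounded(a, c) for a, c in pairs) / len(pairs)
```

Tracking this rate per segment (tenant, product, language) catches regressions that a single global number averages away.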

Recommendations

  • Design RAG as two systems: data platform (ingestion/index) and online serving (retrieval/generation).
  • Treat retrieval observability as first-class: hit@k, miss reasons, latency, and segmented quality.
  • Stabilize contracts: chunk schema, query filters, prompt slots, and client-facing response format.
  • Before model rollout, run shadow traffic and offline replay on a benchmark task set.

Common pitfalls

  • Evaluating only BLEU/ROUGE without product metrics and grounding checks.
  • Indexing raw data without cleanup, deduplication, and source version control.
  • Trying to fix quality only with prompts while ignoring retrieval quality and data freshness.
  • Applying access control after generation instead of before context retrieval.

Launch mini-checklist

  1. A source catalog exists and each knowledge domain has data owners.
  2. Each use-case has defined quality metrics, latency SLO, and cost guardrails.
  3. Policy checks are enabled at both input and output, with auditable decision logs.
  4. Canary/shadow rollout and replay regression tests are configured.
  5. Fallback paths exist for retrieval, reranker, and LLM provider failures.
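Item 5 above can be sketched as an ordered fallback chain: cache first, then providers in priority order, then graceful degradation. A minimal illustration with hypothetical names:

```python
def call_with_fallback(prompt: str, providers: list, cache: dict) -> str:
    """Try the response cache, then each provider in order; degrade gracefully.

    providers: callables (e.g. primary LLM, smaller backup model) that take
    the prompt and return an answer, raising on failure.
    """
    if prompt in cache:
        return cache[prompt]
    for call in providers:
        try:
            answer = call(prompt)
            cache[prompt] = answer   # populate cache for future overload
            return answer
        except Exception:
            continue                 # provider down: fall through to the next
    # Last resort: partial degradation instead of a hard error.
    return "Sorry, I can't answer right now. Please try again."
```

In production the chain would also emit metrics per fallback level, so an SRE can see when traffic is silently running on the backup model.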


© 2026 Alexander Polomodov