Most RAG systems do not break inside the model. They break at the seams between ingestion, retrieval, answer orchestration, and safety checks.
The chapter breaks the live loop into parts and shows how knowledge ingestion, ranking, guardrails, evaluation, and cost control together determine whether an answer is truly useful.
This framing is useful in interviews and architecture discussions because it treats RAG not as a quick prototype but as a system with SLOs, failure modes, and operational trade-offs.
Practical value of this chapter
Design in practice
Translate guidance on production RAG architecture, retrieval quality, and knowledge control into architecture decisions for data flow, model serving, and quality control points.
Decision quality
Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.
Interview articulation
Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.
Trade-off framing
Make trade-offs explicit for production RAG architecture, retrieval quality, and knowledge control: experiment speed, quality, explainability, resource budget, and maintenance complexity.
Primary source
Foundational RAG paper (2020)
The paper that formalized RAG as generation grounded in retrieved context.
A GenAI/RAG system architecture is not a single service wrapped around an LLM but a set of connected planes: data ingestion, retrieval, generation orchestration, guardrails, and operational quality evaluation. The system becomes fit for live use only when these planes are designed against one shared contract for latency, quality, and cost.
Below is a practical blueprint you can use as a baseline for an enterprise AI assistant, internal knowledge assistant, or customer support bot.
Reference GenAI/RAG architecture
The diagram shows the RAG path by layers: from knowledge ingestion and indexing to answer generation, guardrails, and fallback behavior.
What to keep under control
It helps to view the RAG path not only as a chain of services, but as a balance of retrieval quality, response latency, cost, and rollout safety.
- Retrieval quality
- Live constraints
- Safe rollout
Request path: from question to grounded answer
The diagram below shows the synchronous RAG path: from early request checks through retrieval and context assembly to a cited answer with final validation and response shaping.
How a question flows through the RAG path
The synchronous path from user request to cited answer
1. Question intake and pre-checks
Step budget: ~30-80 ms
The system normalizes the request, identifies the scenario, filters out out-of-domain questions, and runs early policy checks before retrieval.
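The intake step described above can be sketched as a small function that runs before any retrieval work is spent. This is a minimal illustration, not the chapter's implementation: the keyword set `IN_DOMAIN_TERMS`, the pattern list `BLOCKED_PATTERNS`, and the `PreCheckResult` type are all hypothetical, and a real system would more likely use trained classifiers than keyword matching. Scenario identification is left out for brevity.

```python
import re
from dataclasses import dataclass

# Hypothetical domain vocabulary and policy patterns, for illustration only.
IN_DOMAIN_TERMS = {"invoice", "refund", "subscription", "password"}
BLOCKED_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]


@dataclass
class PreCheckResult:
    normalized: str
    allowed: bool
    reason: str


def pre_check(question: str) -> PreCheckResult:
    # Normalize: collapse runs of whitespace and trim the ends.
    normalized = re.sub(r"\s+", " ", question).strip()
    lowered = normalized.lower()
    if not normalized:
        return PreCheckResult(normalized, False, "empty")
    # Early policy check, before retrieval is invoked.
    if any(p.search(lowered) for p in BLOCKED_PATTERNS):
        return PreCheckResult(normalized, False, "policy")
    # Out-of-domain filter: require at least one known domain term.
    tokens = set(re.findall(r"[a-z]+", lowered))
    if not tokens & IN_DOMAIN_TERMS:
        return PreCheckResult(normalized, False, "out_of_domain")
    return PreCheckResult(normalized, True, "ok")
```

Returning a reason alongside the decision keeps rejections auditable and lets the caller choose a different response for policy blocks than for out-of-domain questions.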
Grounded online answer path
- This path is tightly constrained by latency.
- Retrieval quality influences the outcome as much as the model itself.
- Access checks and guardrails work both before and after generation.
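The three points above can be read as a request skeleton in which guardrails bracket generation and the user's access scope travels with the retrieval call. The sketch below assumes hypothetical callables (`input_guard`, `retrieve`, `generate`, `output_guard`) and a hypothetical dict-based response shape; none of these names come from the chapter.

```python
def answer_question(question, user_groups, input_guard, retrieve, generate, output_guard):
    """Run one synchronous request; every callable here is a hypothetical stub."""
    refusal = {"answer": None, "citations": []}
    # Guardrail before retrieval: cheap checks run first.
    if not input_guard(question):
        return {**refusal, "status": "rejected_input"}
    # The access scope is passed into retrieval rather than applied afterwards.
    chunks = retrieve(question, user_groups)
    if not chunks:
        return {**refusal, "status": "no_context"}
    draft = generate(question, chunks)
    # Guardrail after generation, e.g. citations must point at retrieved chunks.
    if not output_guard(draft, chunks):
        return {**refusal, "status": "blocked_output"}
    return {"status": "ok", "answer": draft["text"], "citations": draft["citations"]}


def citations_are_retrieved(draft, chunks):
    # One possible output check: at least one citation, all from this request's context.
    retrieved_ids = {c["id"] for c in chunks}
    return bool(draft["citations"]) and set(draft["citations"]) <= retrieved_ids
```

Passing the stages in as callables keeps the skeleton easy to test with stubs and makes each stage a natural place to attach its own latency measurement.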
SLO and capacity baseline
Latency
P95 < 2.0s
Break the latency budget into retrieval, model inference, and post-processing components.
Quality
Grounded-answer rate > 90%
Measure whether the answer is actually grounded in retrieved context, not only whether it sounds fluent.
Economics
Cost/task within target range
Control costs via model routing, caching, and context size limits.
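One way to make these targets checkable is to compute them from per-request logs. The sketch below assumes a hypothetical log record with per-stage timings and a boolean `grounded` flag produced by some upstream grounding check. The 400/1400/200 ms split in `BUDGET_MS` is an illustrative assumption, not a figure from this chapter; only the 2.0 s P95 and the 90% grounded-answer targets come from the baseline above.

```python
# Hypothetical per-stage split of the 2.0 s P95 target (values are assumptions).
BUDGET_MS = {"retrieval": 400, "inference": 1400, "post_processing": 200}
P95_TARGET_MS = 2000
GROUNDED_TARGET = 0.90


def p95(values):
    # Nearest-rank percentile, using integer arithmetic for ceil(0.95 * n).
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_report(requests):
    # End-to-end latency is the sum of the logged stage timings.
    latencies = [sum(r["stage_ms"].values()) for r in requests]
    grounded_rate = sum(1 for r in requests if r["grounded"]) / len(requests)
    # Stages that exceeded their share of the budget on at least one request.
    over_budget = sorted({
        stage
        for r in requests
        for stage, ms in r["stage_ms"].items()
        if ms > BUDGET_MS[stage]
    })
    p95_ms = p95(latencies)
    return {
        "p95_ms": p95_ms,
        "latency_ok": p95_ms < P95_TARGET_MS,
        "grounded_rate": grounded_rate,
        "quality_ok": grounded_rate > GROUNDED_TARGET,
        "stages_over_budget": over_budget,
    }
```

Reporting which stages exceeded their share, rather than only the end-to-end number, shows whether a regression belongs to retrieval, inference, or post-processing.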
Recommendations
- Design RAG as two connected systems: a knowledge platform (ingestion and index) and a live answer path (retrieval and generation).
- Treat retrieval observability as first-class: hit rate, miss reasons, latency, and segmented quality.
- Stabilize contracts: chunk structure, query filters, context blocks, and the client-facing response format.
- Before rolling out a new model, run both shadow traffic and historical replay on a benchmark task set.
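As an illustration of the retrieval observability recommendation, the sketch below aggregates retrieval events into a per-segment hit rate and a count of miss reasons. The event fields (`segment`, `hit`, `miss_reason`) are hypothetical; the chapter does not prescribe a logging schema.

```python
from collections import Counter, defaultdict


def summarize_retrieval(events):
    """Aggregate hypothetical retrieval events into hit rate per segment and miss reasons."""
    by_segment = defaultdict(lambda: {"total": 0, "hits": 0})
    miss_reasons = Counter()
    for event in events:
        segment = by_segment[event["segment"]]
        segment["total"] += 1
        if event["hit"]:
            segment["hits"] += 1
        else:
            # Record why nothing useful was retrieved, when the reason is known.
            miss_reasons[event.get("miss_reason", "unknown")] += 1
    hit_rate = {name: s["hits"] / s["total"] for name, s in by_segment.items()}
    return hit_rate, dict(miss_reasons)
```

Segmenting the hit rate (for example by product area or language) can expose a weak knowledge domain that an overall average would hide.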
Common pitfalls
- Evaluating only with BLEU/ROUGE, without product metrics or grounding checks.
- Indexing raw data without cleanup, deduplication, and source version control.
- Trying to fix quality only through prompt wording while ignoring retrieval quality and data freshness.
- Applying access control after generation instead of before context retrieval.
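The last pitfall can be illustrated with a filter that removes chunks the user may not see before any ranking or context assembly happens, so restricted text never reaches the model. The `Chunk` type and its fields are hypothetical. This in-memory version is only a sketch; a production system would typically push the same filter into the index query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document
    score: float               # similarity score from the retriever


def retrieve_for_user(candidates, user_groups, top_k=3):
    # Filter first: drop chunks the user cannot access.
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    # Only then rank and truncate, so top_k is filled with permitted chunks.
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

Filtering before truncation also avoids a subtler failure: if `top_k` were applied first, restricted chunks could crowd out permitted ones and leave the user with an empty context.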
Launch mini-checklist
- A source catalog exists and each knowledge domain has data owners.
- Each use case has defined quality metrics, a latency SLO, and cost guardrails.
- Rule checks are enabled on both input and output, with auditable decision logs.
- Canary and shadow rollout paths are configured together with replay-based regression tests.
- Fallback paths exist for retrieval, reranker, and LLM provider failures.
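The final checklist item can be sketched as an ordered chain of handlers that falls through to a static response when every option fails. The function and handler names are hypothetical. Catching `Exception` broadly is a simplification for the sketch; real code would catch the specific timeout and provider errors it expects.

```python
def with_fallbacks(steps, request, static_response="Sorry, I cannot answer right now."):
    """Try each (name, handler) in order; return (name, result, errors)."""
    errors = []
    for name, handler in steps:
        try:
            return name, handler(request), errors
        except Exception as exc:  # simplification: narrow this in real code
            # Keep a record of each failure for logs and alerting.
            errors.append((name, type(exc).__name__))
    # Every handler failed: degrade to a static, safe response.
    return "static_fallback", static_response, errors
```

The same pattern applies at each layer: a reranker failure can fall back to the retriever's original order, and an LLM provider failure can fall back to a secondary provider before the static response.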
References
Related chapters
- AI Engineering (short summary) - An engineering frame for AI applications: evaluation, release management, and day-two operations.
- Hands-On Large Language Models (short summary) - A foundation for embeddings, retrieval, and the internals of LLM-based systems.
- Prompt Engineering for LLMs (short summary) - Prompt contracts and context engineering practices for RAG.
- Enterprise AI Copilot - An applied GenAI case with ACL-aware retrieval, citations, guardrails, and tenant isolation.
- Evaluation and Observability for AI Systems - How to measure groundedness, retrieval quality, and system behavior after rollout.
- Generative AI System Design Interview (short summary) - Adds the interview frame for RAG: requirements, data, answer quality, safety, and post-launch monitoring.
- An Illustrated Guide to AI Agents (short summary) - Next step after RAG: tool use, planning, and orchestration.
- Data Governance & Compliance - PII control, lineage, and regulatory requirements for knowledge bases.
