The cost of an LLM application is not a single per-token price. It depends on model size, context length, the KV cache, and whether the model runs behind a hosted API or is self-hosted and billed in GPU-hours and utilization. The most underestimated line item is long prompts and RAG context, which quietly multiply input tokens on every request.
Three levers control this: routing between models (cheap→expensive cascade, complexity classifier, task-aware router), caching (exact, semantic by embedding, provider prompt caching), and token reduction (prompt compression, context trimming, output limits). Each lever trades off the three vertices of the quality / cost / latency triangle.
An LLM gateway ties it all together: a single routing point, cost accounting per request and per tenant, fallbacks, and end-to-end cost observability. Without measuring cost per resolved task, any saving stays a hypothesis. The most common ways to overpay are an expensive default model and a cache with no invalidation.
Practical value of this chapter
What makes up the bill
Input and output tokens (output is pricier), model size, context length and KV cache, hosted API vs self-hosted with GPU-hours and utilization. The hidden line item is long prompts and RAG context multiplying input tokens on every request.
Model routing
A cheap→expensive cascade by confidence (FrugalGPT, Chen, Zaharia, Zou, 2023), a complexity classifier, and a task-aware router. Every choice is a quality / cost / latency trade-off: a shift toward one vertex is usually paid for at one or both of the others.
Caching and token reduction
Exact response cache, semantic cache by embedding in a vector database, and provider prompt caching (Anthropic prices cache reads at roughly 0.1x the base input rate). Add to that prompt compression, context trimming, and a hard output cap. The main risk is staleness without invalidation.
LLM gateway and control
A single routing point, cost accounting per request and per tenant, fallbacks and cost-based strategies (LiteLLM), end-to-end cost observability. Cheap tasks move to batch and async; quotas are set not only by request count but by spend.
Related chapter
LLM Inference Optimization
That chapter covers the engine internals and per-token cost; this one is the application economics on top.
This chapter is about the economics of LLM applications: what makes up the cost and how to route requests between models. It deliberately does not duplicate three neighboring topics. The inference engine internals (KV cache, batching, quantization) are covered in "LLM Inference Optimization." Context assembly and answer quality belong to "GenAI RAG System Architecture." The broader picture of engineering LLM applications is in "AI Engineering Overview." Here there is only one question: how to keep the quality you need within budget, in both money and latency.
Most LLM products lose not on quality but on unit economics: the expensive model is the default on all traffic, there is no cache or it has no invalidation, and cost is never measured per request. What follows is what actually makes up the bill and which levers control it.
What makes up the cost
Input and output tokens
Providers bill input and output tokens separately, and output is usually several times more expensive. A long system prompt is billed again on every request, while a verbose answer hits the most expensive line of the bill directly.
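The arithmetic can be sketched as follows. The prices in `Tariff` are hypothetical round numbers chosen for illustration, not any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tariff:
    input_per_mtok: float   # USD per 1M input tokens (illustrative value)
    output_per_mtok: float  # USD per 1M output tokens (illustrative value)


def request_cost(tariff: Tariff, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under separate input/output pricing."""
    return (input_tokens * tariff.input_per_mtok
            + output_tokens * tariff.output_per_mtok) / 1_000_000


tariff = Tariff(input_per_mtok=3.0, output_per_mtok=15.0)
bare = request_cost(tariff, input_tokens=500, output_tokens=300)
# Same question and answer, plus a 6,000-token system prompt and RAG context.
with_long_prompt = request_cost(tariff, input_tokens=6_500, output_tokens=300)
print(f"bare: ${bare:.4f}  with long prompt: ${with_long_prompt:.4f}")
```

Under these assumed prices the answer costs the same in both cases; the difference comes entirely from input tokens that are re-sent on every call.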
Model size and context
A larger model makes every token cost more. A long context adds to that: attention compute grows quadratically with sequence length, and KV-cache memory grows linearly. How that cache is built inside the engine is covered in the neighboring chapter on inference optimization.
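A rough sizing sketch of the two growth rates, assuming a standard transformer with full self-attention. The model shape below (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example, not a specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Keys and values (factor 2) for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch


def attention_pairs(seq_len):
    """Query-key pairs scored by full self-attention over one sequence."""
    return seq_len * seq_len


shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical model shape
for seq_len in (4_096, 8_192, 16_384):
    gib = kv_cache_bytes(seq_len, **shape) / 2**30
    print(f"{seq_len:>6} tokens: KV cache {gib:.1f} GiB, "
          f"attention pairs {attention_pairs(seq_len):,}")
```

Doubling the context doubles the cache and quadruples the number of attention pairs.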
Hosted API vs self-hosted
A hosted tariff is a simple per-token price with no capital cost. Self-hosting turns the bill into GPU-hours, where throughput and utilization decide everything: an idle accelerator costs the same as a busy one.
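A sketch of why utilization decides everything on the self-hosted path. The GPU-hour price, peak throughput, and hosted price are assumed numbers; real throughput also depends on batch size and sequence lengths.

```python
def self_hosted_cost_per_mtok(gpu_hour_usd, peak_tokens_per_s, utilization):
    """USD per 1M tokens when the accelerator is busy `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_s * 3600 * utilization
    return gpu_hour_usd / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_hour_usd, peak_tokens_per_s, hosted_usd_per_mtok):
    """Utilization at which self-hosting matches the hosted per-token price.

    A result above 1.0 means self-hosting cannot break even on price alone.
    """
    full_load = self_hosted_cost_per_mtok(gpu_hour_usd, peak_tokens_per_s, 1.0)
    return full_load / hosted_usd_per_mtok


# Assumed inputs: $2.50 per GPU-hour, 2,500 tokens/s at full load.
busy = self_hosted_cost_per_mtok(2.50, 2_500, utilization=0.6)
idle = self_hosted_cost_per_mtok(2.50, 2_500, utilization=0.1)
threshold = break_even_utilization(2.50, 2_500, hosted_usd_per_mtok=1.00)
print(f"60% busy: ${busy:.2f}/Mtok, 10% busy: ${idle:.2f}/Mtok, "
      f"break-even at {threshold:.0%} utilization")
```

The hourly bill is fixed, so the per-token price scales inversely with utilization.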
The hidden line item is almost always the same — long prompts and RAG context: they quietly multiply input tokens on every request, and they are the easiest to underestimate when sizing cost.
Cascade and routing: what it looks like
The diagram shows a cascade route: a lightweight complexity classifier sends simple requests to the cheap model, while hard requests and low-confidence answers escalate to the expensive one. A semantic cache removes part of the traffic before the model is even reached.
Economics metrics
$ / 1K tok
Cost per 1K tokens
The provider's base pricing unit. Useful for comparing models, but it says nothing about how many tokens your scenario actually spends.
$ / request
Cost per request
Accounts for the real length of the prompt, RAG context, and answer. This is where the hidden cost of long prompts and bloated context shows up.
$ / task
Cost per resolved task
The metric the business sees: how much it costs to drive a request to a useful result, including retries, fallbacks, and switches between models.
$ / quality
Cost of quality
The price gap between an expensive and a cheap model per unit of quality gained. Often the expensive model is justified only on a narrow class of hard requests.
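The last three metrics can be computed from a plain attempt log. The sketch below assumes a hypothetical log format (one dict per model call with `task`, `cost`, and `resolved` fields) and made-up quality scores in percentage points.

```python
def cost_per_request(attempts):
    """Average spend per model call."""
    return sum(a["cost"] for a in attempts) / len(attempts)


def cost_per_resolved_task(attempts):
    """Total spend, including retries and escalations, per resolved task."""
    resolved = {a["task"] for a in attempts if a["resolved"]}
    if not resolved:
        return float("inf")
    return sum(a["cost"] for a in attempts) / len(resolved)


def cost_per_quality_point(cheap_cost, strong_cost, cheap_quality, strong_quality):
    """Extra USD per request paid for each quality point the strong model adds."""
    gain = strong_quality - cheap_quality
    if gain <= 0:
        return float("inf")
    return (strong_cost - cheap_cost) / gain


attempts = [
    {"task": "t1", "cost": 0.002, "resolved": True},
    {"task": "t2", "cost": 0.002, "resolved": False},  # cheap model failed
    {"task": "t2", "cost": 0.030, "resolved": True},   # escalation succeeded
    {"task": "t3", "cost": 0.002, "resolved": False},  # never resolved
]
per_request = cost_per_request(attempts)
per_task = cost_per_resolved_task(attempts)
simple = cost_per_quality_point(0.002, 0.030, cheap_quality=92, strong_quality=93)
hard = cost_per_quality_point(0.002, 0.030, cheap_quality=55, strong_quality=85)
```

In this toy log the cost per resolved task is twice the cost per request, and a quality point on simple requests costs thirty times more than on hard ones, which is the case for routing rather than a single default model.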
Model routing
Model cascade
A cheap model answers first; on low confidence or a failed check, the request escalates to a more expensive one. The approach is formalized in FrugalGPT (Chen, Zaharia, Zou, 2023), where an LLM cascade is presented as a way to cut inference cost while preserving quality.
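A minimal cascade sketch. It is not the FrugalGPT implementation: here each model is assumed to return its own confidence score, and the two stub models, their costs, and the 0.8 threshold are hypothetical.

```python
from typing import Callable, NamedTuple, Sequence


class Answer(NamedTuple):
    text: str
    confidence: float  # assumed to come from a scorer or self-check, in [0, 1]
    cost: float        # USD spent on this call


Model = Callable[[str], Answer]


def cascade(prompt: str, models: Sequence[Model], threshold: float = 0.8):
    """Try models from cheapest to most expensive; stop at the first confident answer.

    Returns the final answer and the total cost of all calls made.
    """
    if not models:
        raise ValueError("cascade needs at least one model")
    total = 0.0
    for model in models:
        answer = model(prompt)
        total += answer.cost
        if answer.confidence >= threshold:
            break
    return answer, total


def cheap_model(prompt: str) -> Answer:  # stub: confident only on short prompts
    confident = len(prompt.split()) <= 8
    return Answer("cheap answer", 0.9 if confident else 0.4, 0.001)


def strong_model(prompt: str) -> Answer:  # stub
    return Answer("strong answer", 0.95, 0.020)


easy, easy_cost = cascade("What is 2 + 2?", [cheap_model, strong_model])
hard, hard_cost = cascade(
    "Explain the trade-offs between optimistic and pessimistic locking here",
    [cheap_model, strong_model],
)
```

Note that an escalated request pays for both calls, in money and in latency.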
Complexity classifier
A lightweight classifier or heuristic scores the request up front and sends simple cases to the cheap model and hard ones straight to the strong model, avoiding wasted escalation.
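As an illustration of the heuristic variant, the sketch below scores a request on three crude signals. The keyword list, the point weights, and the threshold are all assumptions; a production classifier would be trained and evaluated on real traffic.

```python
import re

# Hypothetical hint words that tend to signal multi-step reasoning.
HARD_HINTS = re.compile(r"\b(why|prove|compare|design|debug|derive)\b", re.IGNORECASE)


def complexity_points(prompt: str) -> int:
    """Score a request up front, before any model is called."""
    points = 0
    if len(prompt.split()) > 60:   # long request
        points += 2
    if HARD_HINTS.search(prompt):  # reasoning-style wording
        points += 2
    if prompt.count("\n") >= 5:    # multi-line input such as logs or code
        points += 1
    return points


def route(prompt: str, strong_at: int = 2) -> str:
    """Return the tier name for this request: 'cheap' or 'strong'."""
    return "strong" if complexity_points(prompt) >= strong_at else "cheap"


print(route("Translate 'hello' to French"))
print(route("Compare two queue designs and explain why one stalls"))
```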
Task-aware router
Different models for different tasks: a compact one for classification and extraction, a strong one for reasoning and generation. This is also where you pick the trade-off between quality, latency, and cost.
Any routing decision balances the quality / cost / latency triangle: a shift toward one vertex is usually paid for at one or both of the others.
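One way to make the triangle explicit is to treat quality and latency as constraints and cost as the objective. The sketch below assumes a hypothetical model catalog with offline evaluation scores per task; all names and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    cost_per_request: float  # USD, illustrative
    p95_latency_s: float     # seconds, illustrative
    quality: dict            # task name -> offline eval score in [0, 1]


CATALOG = [
    ModelProfile("small", 0.001, 0.4,
                 {"classification": 0.94, "extraction": 0.91, "reasoning": 0.58}),
    ModelProfile("large", 0.020, 2.5,
                 {"classification": 0.96, "extraction": 0.95, "reasoning": 0.88}),
]


def pick_model(task: str, min_quality: float, max_latency_s: float):
    """Return the cheapest model that meets the quality floor and latency ceiling."""
    candidates = [
        m for m in CATALOG
        if m.quality.get(task, 0.0) >= min_quality and m.p95_latency_s <= max_latency_s
    ]
    if not candidates:
        return None  # the constraints cannot all be met; one of them has to give
    return min(candidates, key=lambda m: m.cost_per_request)
```

A `None` result is the triangle in action: asking for strong reasoning within a one-second budget has no answer in this catalog until the quality floor or the latency ceiling is relaxed.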
Caching
Exact response cache
A match on the normalized request returns a ready answer without touching the model. Cheap and fast, but it demands discipline with cache keys and careful invalidation when data changes.
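A minimal in-memory sketch of the key discipline. The normalization rule, the `data_version` field, and the TTL are assumptions; the point is that the model name and the version of the underlying data are part of the key, so a data change invalidates old entries without a manual purge.

```python
import hashlib
import time


def normalize(prompt: str) -> str:
    """Collapse case and whitespace so trivially different requests share a key."""
    return " ".join(prompt.lower().split())


class ExactCache:
    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._store = {}  # key -> (answer, stored_at)

    def _key(self, model: str, prompt: str, data_version: str) -> str:
        raw = "\x1f".join((model, data_version, normalize(prompt)))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str, data_version: str):
        key = self._key(model, prompt, data_version)
        entry = self._store.get(key)
        if entry is None:
            return None
        answer, stored_at = entry
        if self._clock() - stored_at > self._ttl_s:
            del self._store[key]  # expired: treat as a miss
            return None
        return answer

    def put(self, model: str, prompt: str, data_version: str, answer: str) -> None:
        key = self._key(model, prompt, data_version)
        self._store[key] = (answer, self._clock())
```

Bumping `data_version` when the source data changes makes every old entry unreachable, and the TTL bounds how long a stale answer can survive if a version bump is missed.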
Semantic cache
Semantically similar requests are found via an embedding in a vector database. This raises the hit rate, but adds the risk of returning an answer to a not-quite-identical question, so the similarity threshold has to be calibrated.
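The lookup logic can be sketched without a vector database. Here a linear scan stands in for the index, embeddings are assumed to be computed by the caller, and the 0.95 threshold is a placeholder to be calibrated on real query pairs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, embedding, answer) -> None:
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the answer of the nearest stored request, or None below the threshold."""
        best_score, best_answer = -1.0, None
        for stored, answer in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "answer about refunds")  # toy 2-D embeddings
near = cache.get([0.99, 0.10])   # almost the same direction
far = cache.get([0.70, 0.70])    # a related but different question
```

Lowering the threshold raises the hit rate and the chance of a wrong answer together, so it is worth tuning against labeled pairs of "same question" and "different question".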
Provider prompt caching
Reuse of the KV state for a shared prompt prefix on the provider side. Anthropic explicitly prices cache writes and reads; OpenAI enables caching automatically for eligible long prefixes and reduces input-token cost according to current platform rules.
The main risk of every layer is staleness: a cache with no clear invalidation rules saves money right up until it silently starts serving wrong answers.
Source note: provider-side prompt caching and token pricing are policy knobs, not protocol guarantees. Check current Anthropic/OpenAI docs before modeling costs; the architectural pattern here is "stable prefix → eligible cache hit," not a fixed discount percentage.
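In line with the note above, the sketch below keeps the multipliers as parameters rather than constants. The 1.25x write and 0.1x read values in the example are assumptions for illustration, and the model assumes every request after the first arrives while the cached prefix is still alive.

```python
def input_cost_without_cache(n_requests, prefix_tokens, suffix_tokens, usd_per_mtok):
    """Every request pays full price for the shared prefix and its own suffix."""
    return n_requests * (prefix_tokens + suffix_tokens) * usd_per_mtok / 1_000_000


def input_cost_with_cache(n_requests, prefix_tokens, suffix_tokens, usd_per_mtok,
                          write_multiplier, read_multiplier):
    """First request writes the prefix to the cache; the remaining ones read it."""
    billed_tokens = (
        prefix_tokens * write_multiplier
        + (n_requests - 1) * prefix_tokens * read_multiplier
        + n_requests * suffix_tokens
    )
    return billed_tokens * usd_per_mtok / 1_000_000


# Assumed scenario: 4,000-token stable prefix, 200-token variable suffix, $3/Mtok input.
plain = input_cost_without_cache(100, 4_000, 200, 3.0)
cached = input_cost_with_cache(100, 4_000, 200, 3.0,
                               write_multiplier=1.25, read_multiplier=0.1)
single = input_cost_with_cache(1, 4_000, 200, 3.0,
                               write_multiplier=1.25, read_multiplier=0.1)
print(f"100 requests: ${plain:.4f} without cache, ${cached:.4f} with cache")
```

With any write premium, a prefix that is used only once costs more with caching than without, so the stable prefix has to be genuinely shared to pay off.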
Token reduction
- Prompt compression: strip repetition and filler from the system part, and move stable instructions into a cacheable prefix.
- Context trimming: feed the model only the relevant fragments after retrieval, not the entire retrieved corpus.
- Output limits: a hard max_tokens and a structured (for example, JSON) answer cut the output side, the most expensive part of the bill.
- RAG cost control: every document added to context is input tokens on every request, so the number and length of RAG fragments must be budgeted as strictly as the model itself.
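Context trimming and RAG budgeting from the list above can be sketched as a greedy selection under a token budget. The whitespace token counter is a stand-in assumption; a real system would use the target model's tokenizer.

```python
def trim_context(chunks, budget_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most relevant retrieved chunks that fit within a token budget.

    `chunks` is a list of (relevance_score, text) pairs. Chunks are considered in
    descending relevance; one that does not fit is skipped, and smaller ones
    further down the ranking may still be added.
    """
    selected, used = [], 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        n = count_tokens(text)
        if used + n > budget_tokens:
            continue
        selected.append(text)
        used += n
    return selected, used


retrieved = [
    (0.9, "refunds within 30 days"),
    (0.2, "our company was founded in a garage long ago"),
    (0.7, "receipt required"),
]
context, used = trim_context(retrieved, budget_tokens=6)
```

The budget turns "how much context" from a retrieval default into an explicit cost decision per request.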
Batching, async, and quotas
- Cheap, latency-insensitive tasks (labeling, enrichment, offline analytics) move to batch and async modes, where providers offer a reduced tariff.
- Quotas and rate limiting are set not only by request count but by cost: a spend limit per tenant guards against a sudden bill from loops or abuse.
- A priority queue separates interactive traffic from background jobs so that cheap bulk processing does not crowd out a paying user.
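The spend quota and the priority queue from the list above can be sketched in a few lines. Both classes are hypothetical in-memory illustrations; a real deployment would also need window resets, persistence, and reconciliation of estimated against actual cost.

```python
import heapq
import itertools
from collections import defaultdict


class SpendQuota:
    """Per-tenant budget measured in money rather than request count."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent = defaultdict(float)

    def try_charge(self, tenant: str, estimated_cost: float) -> bool:
        if self.spent[tenant] + estimated_cost > self.limit_usd:
            return False  # reject or defer before the model is called
        self.spent[tenant] += estimated_cost
        return True


INTERACTIVE, BACKGROUND = 0, 1  # lower number is served first


class JobQueue:
    """Interactive jobs always leave before background ones; FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, priority: int, job) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```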
LLM gateway: a single control point
Single routing point
A gateway in front of providers picks the model by cascade and task rules, hiding the differences between provider APIs behind one contract — essentially the API gateway idea extended to LLM traffic.
Cost accounting and fallbacks
The gateway counts tokens and money per call, applies budgets, and switches to a fallback model or provider on error or timeout. In LiteLLM this is routing with a cost-based strategy, cooldowns, and fallbacks.
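The accounting and fallback behavior can be illustrated with a plain Python sketch. This is not LiteLLM's API: the `Gateway` class, the route format, and the stub providers are all hypothetical.

```python
from collections import defaultdict


class ProviderError(Exception):
    """Raised by a route when the provider call fails."""


class Gateway:
    def __init__(self, routes):
        # routes: list of (name, call) in preference order; call(prompt) -> (text, cost)
        self.routes = routes
        self.spend = defaultdict(float)  # tenant -> USD

    def complete(self, tenant: str, prompt: str) -> dict:
        failed = []
        for name, call in self.routes:
            try:
                text, cost = call(prompt)
            except (ProviderError, TimeoutError):
                failed.append(name)  # fall through to the next route
                continue
            self.spend[tenant] += cost
            return {"model": name, "text": text, "cost": cost, "failed": failed}
        raise ProviderError(f"all routes failed: {failed}")


def primary(prompt):  # stub provider that always times out
    raise TimeoutError("primary timed out")


def fallback(prompt):  # stub provider that succeeds
    return "ok", 0.004


gateway = Gateway([("primary", primary), ("fallback", fallback)])
result = gateway.complete("acme", "hello")
```

Recording which routes failed alongside the cost keeps fallbacks visible in the bill instead of hiding them inside retries.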
Cost observability
End-to-end observability of cost, tokens, and cache share per route and per tenant turns the LLM bill from an opaque line item into a managed metric.
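A sketch of the aggregation behind such a view, assuming a hypothetical event format in which the gateway logs one dict per request with `tenant`, `route`, `cost`, `tokens`, and `cache_hit` fields.

```python
from collections import defaultdict


def summarize(events):
    """Aggregate cost, tokens, and cache share per (tenant, route)."""
    acc = defaultdict(lambda: {"requests": 0, "cost": 0.0, "tokens": 0, "cache_hits": 0})
    for event in events:
        row = acc[(event["tenant"], event["route"])]
        row["requests"] += 1
        row["cost"] += event["cost"]
        row["tokens"] += event["tokens"]
        row["cache_hits"] += int(event["cache_hit"])
    return {
        key: {**row, "cache_share": row["cache_hits"] / row["requests"]}
        for key, row in acc.items()
    }


events = [
    {"tenant": "acme", "route": "chat", "cost": 0.010, "tokens": 1200, "cache_hit": False},
    {"tenant": "acme", "route": "chat", "cost": 0.0, "tokens": 0, "cache_hit": True},
    {"tenant": "beta", "route": "extract", "cost": 0.002, "tokens": 400, "cache_hit": False},
]
report = summarize(events)
```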
Key trade-offs
- A cheap model saves money but raises the escalation rate; a cascade saves on the average request at the cost of higher latency on hard ones.
- Aggressive caching cuts the bill but raises stale-answer risk; a semantic cache widens hits at the cost of answering a slightly different question.
- Rich RAG context improves answers but linearly grows input tokens per request — quality here converts directly into money.
- Self-hosting pays off only at consistently high GPU utilization; on uneven traffic a hosted API is almost always cheaper and simpler.
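The first trade-off can be written down for a two-step cascade. The symbols are introduced here for illustration: c and t are the per-request cost and latency of each model, and p is the escalation rate, with every escalated request paying for both calls.

```latex
\begin{aligned}
\mathbb{E}[C_{\text{cascade}}] &= c_{\text{cheap}} + p \, c_{\text{strong}}, \\
\mathbb{E}[C_{\text{cascade}}] < c_{\text{strong}}
  &\iff p < 1 - \frac{c_{\text{cheap}}}{c_{\text{strong}}}, \\
T_{\text{escalated}} &= t_{\text{cheap}} + t_{\text{strong}}.
\end{aligned}
```

For example, with an assumed cheap call at $0.001 and a strong call at $0.020, the cascade is cheaper on average than always calling the strong model as long as fewer than 95% of requests escalate, while every escalated request is slower than a direct strong call by the cheap model's latency.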
Recommendations
Routing, caching, and token control only work together, and only with measurement: without cost per resolved task, any "saving" stays a hypothesis.
References
Related chapters
- LLM Inference Optimization - The internals of the inference engine — KV cache, batching, and quantization — that define the per-token cost on a self-hosted path.
- GenAI RAG System Architecture - Retrieval and context assembly drive answer quality — and in the same move grow input tokens, and therefore the cost of every request.
- AI Engineering Overview - The broad context of engineering LLM applications, into which model routing and cost control are embedded.
- Model Serving & Inference Architecture - The inference runtime, queues, and degraded modes on top of which request routing and economics live.
