The neighboring serving-architecture chapter describes the outer runtime: online/batch/stream modes, routing, the latency budget, and autoscaling. This chapter opens up the LLM inference engine — what happens inside a single node when a large language model actually generates an answer, one token at a time.
Naive LLM inference is slow and expensive for a fundamental reason: text is produced autoregressively, and every new token needs a separate forward pass through the whole model. The key insight is that the prefill phase is compute-bound and sets the time to first token (TTFT), while the decode phase is memory-bandwidth-bound and sets the time per output token (TPOT), that is, the speed of the token stream.
From there we walk the engine's levers: the KV-cache and PagedAttention, continuous batching from Orca, weight and KV-cache quantization (GPTQ, AWQ, FP8/INT8), speculative decoding, parallelism and prefill-decode disaggregation, plus capacity and cost-per-token economics under SLOs.
Practical value of this chapter
Autoregression and the two phases
An answer of a few hundred tokens is a few hundred sequential passes through the model. Prefill processes the whole prompt in one pass (compute-bound, sets TTFT), while decode generates one token at a time, reading all weights and the whole KV-cache each step (memory-bandwidth-bound, sets TPOT/inter-token latency). Separating these phases is the basis of every optimization.
KV-cache and PagedAttention
The KV-cache stores attention keys and values so they are not recomputed per token; it grows linearly with context and batch and dominates GPU memory. Naive engines waste 60-80% of it to fragmentation and over-reservation, by the vLLM authors' measurements. Kwon et al. (SOSP 2023) bring OS-style paged memory to attention: fixed-size blocks mapped to non-contiguous memory remove almost all fragmentation and yield 2-4x higher throughput at the same latency.
Batching and quantization
Continuous (in-flight) batching from Orca (OSDI 2022) rebuilds the batch at every iteration and keeps GPU utilization high. Quantization cuts memory and speeds up decode: weight-only GPTQ and AWQ compress weights to 3-4 bits, FP8/INT8 speeds up both phases, and KV-cache quantization frees memory for larger batches and longer contexts — all at a quality risk you must measure on your own data.
Speculation, parallelism, economics
Speculative decoding (Leviathan et al., ICML 2023; Medusa/EAGLE) produces several tokens per heavy pass when draft and target models agree. Tensor/pipeline/sequence parallelism and prefill-decode disaggregation scale large models and split phases across pools. The operational bottom line is cost per 1K tokens under separate TTFT/TPOT SLOs.
The neighboring chapter "Model Serving & Inference Architecture" describes the outer runtime: online/batch/stream modes, routing, the latency budget, and autoscaling. This chapter is about what happens inside a single node when a large language model actually generates an answer. Instead of repeating the serving runtime, we open up the inference engine: KV-cache, batching, quantization, speculative decoding, and parallelism.
Naive LLM inference is slow and expensive for a fundamental reason: text is generated autoregressively — one token at a time, and each new token requires a separate forward pass through the whole model. A single answer of a few hundred tokens is a few hundred sequential passes, each reading all model weights for just a few operations per token. Everything that follows turns on one observation: prefill and decode are bound by different resources, and no single trick optimizes both.
The two decoding phases: prefill and decode
Prefill — processing the prompt
All prompt tokens pass through the network in a single forward pass and fill the attention cache in parallel. This phase is dominated by large matrix multiplications: it is compute-bound and sets TTFT.
Decode — one token at a time
Each subsequent token is a separate forward pass that reads all model weights and the whole KV-cache. There is little arithmetic per byte of memory read, so this phase is memory-bandwidth-bound and sets the token-stream speed.
Where LLM inference is bound
A single request passes through both phases, and each is bound by its own resource: prefill loads compute and sets TTFT, decode reads memory per token and sets TPOT. This split is where the optimization techniques diverge.
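A back-of-the-envelope roofline estimate makes the split concrete. The sketch below assumes roughly 2 FLOPs per parameter per token and that the weights are read once per forward pass; it ignores KV-cache reads and attention FLOPs. The `Hardware` numbers and the 7B model size are hypothetical placeholders, not measurements of any real device.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # FLOP/s (hypothetical value below)
    mem_bandwidth: float  # bytes/s (hypothetical value below)


def forward_pass_time(params, bytes_per_param, tokens_in_pass, hw):
    """Return (seconds, limiting resource) for one pass under a simple roofline."""
    flops = 2.0 * params * tokens_in_pass    # ~2 FLOPs per parameter per token
    weight_bytes = params * bytes_per_param  # weights are read once per pass
    compute_time = flops / hw.peak_flops
    memory_time = weight_bytes / hw.mem_bandwidth
    if compute_time >= memory_time:
        return compute_time, "compute"
    return memory_time, "memory"


hw = Hardware(peak_flops=1.0e15, mem_bandwidth=3.0e12)  # assumed accelerator
params = 7e9                                            # assumed 7B model, FP16

prefill = forward_pass_time(params, 2, tokens_in_pass=2048, hw=hw)
decode = forward_pass_time(params, 2, tokens_in_pass=1, hw=hw)
print("prefill bound by:", prefill[1])  # compute
print("decode bound by:", decode[1])    # memory
```

With these assumed numbers, the 2048-token prefill pass is limited by arithmetic, while a single-token decode pass spends almost all of its time streaming weights from memory.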
Metrics and the latency / throughput / cost trade-off
Time to first token
TTFT
Latency from request arrival to the first generated token. Driven mainly by prompt length, warm-up, and the prefill phase; this is what users perceive as the model thinking before it starts answering.
Time per output token
TPOT / ITL
Time-per-output-token (a.k.a. inter-token latency): the average time for each subsequent token during decode. It defines how smoothly the answer streams; its inverse is the single-request token rate.
Throughput
tokens/s
Aggregate token-generation rate across all requests on a node. It grows with batch size and high GPU utilization, but almost always competes with the latency of any individual request.
Goodput
The share of requests served within the TTFT and TPOT SLOs. A high tokens/s number is useless if half the requests violate the latency SLO.
These metrics pull in different directions: enlarging the batch raises tokens/s and lowers cost-per-token but worsens TPOT and tail latency. Optimizing one in isolation is useless — measure the gain against latency, throughput, and cost at once, or an improvement on one axis quietly gets paid for on another.
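As a minimal sketch, the three latency-side metrics can be derived from per-token timestamps. The function names, the request tuples, and the SLO values here are illustrative assumptions, not the output of any particular engine.

```python
def ttft(arrival, token_times):
    """Time to first token: first token timestamp minus arrival time."""
    return token_times[0] - arrival


def tpot(token_times):
    """Average gap between consecutive output tokens after the first one."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def goodput(requests, ttft_slo, tpot_slo):
    """Share of requests that meet both the TTFT and the TPOT SLO."""
    ok = sum(
        1
        for arrival, times in requests
        if ttft(arrival, times) <= ttft_slo and tpot(times) <= tpot_slo
    )
    return ok / len(requests)


# (arrival time, timestamps of generated tokens), all in seconds; made-up data
requests = [
    (0.0, [0.5, 0.75, 1.0, 1.25]),  # TTFT 0.5 s, TPOT 0.25 s
    (1.0, [3.0, 3.25, 3.5]),        # TTFT 2.0 s, violates a 1.0 s TTFT SLO
]
share_within_slo = goodput(requests, ttft_slo=1.0, tpot_slo=0.25)
print(share_within_slo)  # 0.5
```

The second request streams just as smoothly as the first, yet it fails the TTFT SLO and counts against goodput, which is why raw tokens/s alone can mislead.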
KV-cache: the main memory consumer
What it is
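The KV-cache stores attention keys and values for every already-processed token so that generating each new token does not recompute attention over the whole sequence. Without it, every decode step would re-run the model over the entire prefix, so the work per new token would grow with the sequence and the total cost of an answer would be at least quadratic in its length.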
Why it dominates memory
KV-cache size grows linearly with context length, layer count, and the number of requests in a batch. On long contexts it easily exceeds the model weights themselves and becomes the main consumer of GPU memory, capping the maximum batch size.
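The linear growth can be written out directly. This sketch assumes standard multi-head attention with one key and one value vector per layer, per KV head, per token; the model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical example, not a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """Total KV-cache size: 2 tensors (K and V) per layer, per token, per request."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch


# Hypothetical model shape, FP16 cache (2 bytes per value)
per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1, batch=1)
total = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=16)

print(per_token / 2**20, "MiB per token")      # 0.5 MiB per token
print(total / 2**30, "GiB for 16 x 4096 tokens")  # 32.0 GiB
```

For comparison, 7 billion FP16 parameters occupy about 14 GB, so under these assumptions a batch of sixteen 4K-token requests already needs more memory for the cache than for the weights.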
Fragmentation
Naive engines reserve a contiguous buffer for each request's maximum length. Real answers are shorter, so internal and external fragmentation plus reserved-but-unused memory appear — up to 60-80% wasted by the vLLM authors' measurements.
PagedAttention / vLLM
Kwon et al. (SOSP 2023) port the idea of paged virtual memory from operating systems into attention: the KV-cache is stored in small fixed-size blocks mapped to non-contiguous physical memory. This removes almost all fragmentation, allows blocks to be shared across requests, and raises throughput by 2-4x at the same latency.
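The block-table idea can be sketched in a few lines. This is a toy allocator written for illustration, not vLLM's implementation: the class name `PagedKVAllocator` and its methods are hypothetical, it tracks only block bookkeeping (no tensors), and it omits block sharing and copy-on-write.

```python
class PagedKVAllocator:
    """Toy block table: maps each request's logical token index to a physical block."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve a slot for one more token, taking a new block only when needed."""
        length = self.lengths.get(req_id, 0)
        table = self.block_tables.setdefault(req_id, [])
        if length == len(table) * self.block_size:  # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = length + 1

    def locate(self, req_id, token_idx):
        """Translate a logical token index into (physical block, offset in block)."""
        table = self.block_tables[req_id]
        return table[token_idx // self.block_size], token_idx % self.block_size

    def release(self, req_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):  # two requests grow in an interleaved fashion
    alloc.append_token("a")
    alloc.append_token("b")

print(alloc.block_tables)    # {'a': [7, 5], 'b': [6, 4]}
print(alloc.locate("a", 4))  # (5, 0)
```

Each request holds non-contiguous blocks, and the only unused memory is the tail of its last block (at most `block_size - 1` slots), instead of a buffer reserved for the maximum length.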
Batching: from static to continuous
Static batch
Requests are collected into a fixed batch and processed until all finish together. Short answers idle waiting for the longest one in the batch, and new requests wait for the next window — low utilization and high tail latency.
Dynamic batch
The engine waits a short window and forms a batch on the fly, balancing latency against fill. Better than static, but it still schedules at the granularity of a whole request, so it gets stuck on the longest generations. The serving-side batching windows are covered in the neighboring serving chapter.
Continuous / in-flight batching
Orca (OSDI 2022) proposes iteration-level scheduling: the batch is rebuilt at every decoding step. A finished request leaves the batch immediately and a new one takes its place without waiting for the others. Selective batching batches only the operations that do not depend on per-request sequence length (such as the linear layers) and runs attention separately for each request. This keeps GPU utilization high even when answer lengths vary.
Continuous (in-flight) batching is the main technique that raises GPU utilization: the batch never idles waiting for the longest answer, but is constantly refilled with new requests. Together with a paged KV-cache it is the foundation of modern engines (vLLM, TensorRT-LLM).
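A toy simulation shows why iteration-level scheduling helps. The sketch counts decode iterations only, assumes every active request emits one token per iteration, and ignores prefill and memory limits; the output lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # every active request emits one token; finished requests leave the batch
        running = [r - 1 for r in running if r > 1]
    return steps


output_lengths = [8, 2, 2, 2, 2]  # one long answer, four short ones
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
print(static_steps, continuous_steps)  # 12 8
```

In the static schedule the short answer in the first batch idles for six iterations behind the long one; in the continuous schedule the short requests cycle through the second slot while the long one keeps decoding.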
Quantization: quality / speed / memory
| Method | What it does | Memory saving | Trade-off |
|---|---|---|---|
| GPTQ (weight-only) | Layer-by-layer weight quantization to 3-4 bits using approximate second-order information | ~4x smaller weights (FP16 -> INT4) | Near-zero quality loss at 4 bits (3-bit degradation is more noticeable and is better tolerated by very large models); quantizes a 175B model in ~4 GPU-hours |
| AWQ (weight-only) | 4-bit quantization that protects the ~1% salient channels identified from activation statistics | ~4x smaller weights | Better quality robustness than naive round-to-nearest; >3x speedup vs FP16 in TinyChat |
| FP8 / INT8 | Quantizes weights and/or activations to 8 bits; FP8 is natively accelerated on modern GPUs | ~2x smaller weights; speeds up both prefill and decode | Minimal quality loss with careful calibration; requires hardware support |
| KV-cache quantization | Stores keys and values in INT8/FP8 instead of FP16 | ~2x smaller KV-cache -> larger batch / longer context | Risk of quality degradation, especially on very long contexts |
GPTQ (Frantar et al.) and AWQ (Lin et al.) are weight-only quantization: only the weights are compressed, which helps the memory-bandwidth-bound decode phase the most. FP8/INT8 compress activations as well, speeding up both phases on hardware with native support. KV-cache quantization frees memory for larger batches and longer contexts.
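For orientation, here is the naive round-to-nearest baseline that GPTQ and AWQ improve on. This is a sketch of symmetric per-tensor INT8 quantization, not of either method; the weight values are arbitrary examples.

```python
def quantize_int8(weights):
    """Symmetric per-tensor round-to-nearest quantization to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.64, 0.0]  # arbitrary example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, and the scale is set by the largest weight in the tensor. A single outlier therefore coarsens the grid for every other weight, which is the kind of problem that per-channel scales and activation-aware methods such as AWQ are designed to reduce.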
Speculative decoding
The idea of Leviathan, Kalman, and Matias (ICML 2023): a small, fast draft model proposes several next tokens, and the large target model verifies them all in one parallel pass. Tokens are accepted left to right; at the first rejection that token is resampled from a corrected distribution and the remaining draft tokens are discarded. The output distribution is provably identical to that of ordinary decoding with the target model, yet one heavy pass can yield several tokens: the authors report a 2-3x speedup on T5-XXL.
Medusa adds several heads to the model that predict future tokens without a separate draft model; EAGLE predicts features of the next step rather than tokens, raising the acceptance rate. The technique helps when the draft agrees well with the target (predictable text, code); with a low acceptance rate the overhead can cancel the gain or even slow generation down.
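The verification rule for one draft token can be sketched as follows. It follows the speculative sampling scheme of Leviathan et al.: a token drawn from the draft distribution `q` is accepted with probability min(1, p/q) under the target distribution `p`, and on rejection a replacement is drawn from the normalized residual max(0, p - q). The function names and toy distributions are illustrative; `expected_tokens_per_pass` uses the paper's closed form, which assumes a constant, independent acceptance rate.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), the distribution used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)


def verify_draft_token(token, p, q, rng):
    """Check one draft token (sampled from q) against the target distribution p.

    Returns (token, accepted). On rejection the token is resampled.
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    residual = residual_distribution(p, q)
    new_token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return new_token, False


def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens produced per target-model pass for a given acceptance rate."""
    a = acceptance_rate
    if a == 1.0:
        return draft_len + 1
    return (1 - a ** (draft_len + 1)) / (1 - a)


rng = random.Random(0)
p = [0.6, 0.4]  # toy target distribution
q = [0.9, 0.1]  # toy draft distribution (overconfident in token 0)
print(verify_draft_token(0, p, q, rng))
print(expected_tokens_per_pass(0.5, draft_len=4))  # 1.9375
```

At a 50% acceptance rate, four drafted tokens yield fewer than two tokens per heavy pass on average, which shows how quickly the gain erodes when draft and target disagree.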
Parallelism for large models
Tensor parallelism
Each layer (attention and MLP matrices) is split across GPUs that compute their share in lockstep and exchange partial results at every layer. It lowers latency and per-GPU memory, but needs a fast interconnect (NVLink) and is sensitive to its bandwidth.
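A column-wise split of one weight matrix illustrates the idea. This is a pure-Python sketch in which list slices stand in for GPUs and list concatenation stands in for the all-gather; it assumes the number of output columns is divisible by the number of shards, and the matrices are arbitrary small examples.

```python
def matmul(x, w):
    """Multiply x (rows x in_dim) by w (in_dim x out_dim), both as nested lists."""
    return [
        [sum(xi * w[k][j] for k, xi in enumerate(row)) for j in range(len(w[0]))]
        for row in x
    ]


def split_columns(w, parts):
    """Split w column-wise into equal shards, one per simulated device."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)
    # each "device" multiplies the full input by its own slice of the weights
    partials = [matmul(x, shard) for shard in shards]
    # all-gather step: concatenate the output columns from every device
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
print(full == sharded)  # True
```

Each shard stores only a fraction of the weights and does a fraction of the arithmetic, but the gather step happens at every layer, which is where the interconnect bandwidth comes in.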
Pipeline parallelism
The model is sliced by layers into stages placed on different GPUs, and requests flow through the pipeline. Communication is cheap, but stages sit idle while the pipeline fills and drains (the bubble); splitting the batch into micro-batches shrinks that idle share.
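The effect of micro-batches on the idle share can be estimated with the usual formula for a simple fill-and-drain schedule with equal stage times: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The stage and micro-batch counts below are arbitrary examples.

```python
def pipeline_bubble_fraction(stages, micro_batches):
    """Idle share of a fill-and-drain pipeline schedule with equal stage times."""
    return (stages - 1) / (micro_batches + stages - 1)


for m in (1, 4, 12):
    print(m, pipeline_bubble_fraction(stages=4, micro_batches=m))
```

With four stages, a single batch leaves the pipeline idle 75% of the time, while twelve micro-batches bring that down to 20%.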
Sequence parallelism
A long sequence and its KV-cache are split across devices along the token axis. It relaxes the memory limit on very long contexts, complementing tensor and pipeline parallelism.
Prefill-decode disaggregation
The compute-bound prefill phase and the memory-bandwidth-bound decode phase are placed on separate GPU pools. This removes mutual interference (a long prefill no longer chokes a neighbor's decode stream) and lets each phase be scaled and tuned independently.
Capacity and operations
Choosing the batch size
A larger batch raises tokens/s and GPU utilization but hurts TPOT and risks breaking the SLO. The batch size is chosen from the goodput curve: the largest batch at which TTFT and TPOT still fit the budget.
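The selection rule can be sketched as a filter over load-test results. The measurement records, field names, and SLO thresholds below are hypothetical placeholders; in practice they would come from benchmarking the engine at each batch size.

```python
def pick_batch_size(measurements, ttft_slo, tpot_slo):
    """Return the largest-batch measurement that still meets both latency SLOs."""
    feasible = [
        m for m in measurements
        if m["ttft_p95"] <= ttft_slo and m["tpot_p95"] <= tpot_slo
    ]
    if not feasible:
        return None  # no batch size fits the budget
    return max(feasible, key=lambda m: m["batch"])


# Made-up load-test results (seconds and tokens/s), for illustration only
measurements = [
    {"batch": 8, "ttft_p95": 0.3, "tpot_p95": 0.02, "tokens_per_s": 400},
    {"batch": 16, "ttft_p95": 0.4, "tpot_p95": 0.03, "tokens_per_s": 700},
    {"batch": 32, "ttft_p95": 0.6, "tpot_p95": 0.05, "tokens_per_s": 1100},
    {"batch": 64, "ttft_p95": 1.1, "tpot_p95": 0.09, "tokens_per_s": 1500},
]
best = pick_batch_size(measurements, ttft_slo=0.8, tpot_slo=0.05)
print(best["batch"])  # 32
```

Batch 64 has the highest raw throughput in this example but violates both SLOs, so the rule settles on 32.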
Autoscaling against the SLO
Autoscaling is driven not by average load but by TTFT/TPOT SLO violations and queue depth. A heavy LLM engine warms up slowly, so you need pre-warmed capacity — as described in the neighboring serving chapter.
Cost-per-token economics
The core unit of economics is cost per 1K tokens: the GPU-hour price divided by the number of tokens the node sustains per hour within the SLO (tokens/s times 3,600), multiplied by 1,000. Quantization, continuous batching, and disaggregation lower it by raising useful hardware utilization.
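The unit conversion is easy to get wrong, so here it is spelled out. The price and throughput values are hypothetical; the throughput should be the rate sustained while meeting the SLO, not the peak.

```python
def cost_per_1k_tokens(gpu_hour_price, tokens_per_second):
    """Cost of 1,000 generated tokens given an hourly price and a sustained rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_price / tokens_per_hour * 1000


# Hypothetical node: 3.60 per GPU-hour, 2,000 tokens/s sustained within the SLO
baseline = cost_per_1k_tokens(gpu_hour_price=3.60, tokens_per_second=2000)
doubled = cost_per_1k_tokens(gpu_hour_price=3.60, tokens_per_second=4000)
print(baseline, doubled)
```

Doubling the sustained rate on the same hardware halves the cost per token, which is why the optimizations in this chapter are ultimately judged by the throughput they deliver inside the latency budget.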
Key trade-offs
- A larger batch and longer window raise tokens/s and lower cost-per-token, but they worsen TTFT/TPOT and the tail latency of interactive requests.
- Aggressive quantization (especially below 4 bits, and KV-cache quantization on long contexts) saves memory and speeds up inference at the cost of quality risk — measure it on your own tasks rather than trusting averaged benchmarks.
- Speculative decoding speeds up decode when the draft and target models agree well, but with a low acceptance rate it adds overhead and can slow generation down.
- Tensor parallelism lowers latency but needs an expensive fast interconnect; pipeline parallelism is cheap on communication but adds an idle bubble — the choice depends on node topology.
References
Source map: vLLM/PagedAttention supports the KV-cache and paged-memory explanation; Orca supports continuous batching; GPTQ/AWQ support quantization; the speculative decoding paper supports draft-model acceleration; TensorRT-LLM documents practical in-flight batching, paged KV-cache, and FP8/INT8 features.
Related chapters
- Model Serving & Inference Architecture - The neighboring chapter on the broader serving runtime: online/batch/stream, routing, latency budget, and autoscaling that the LLM engine plugs into.
- AI Engineering: Overview - Where inference optimization sits in the AI application lifecycle and how it ties to evaluation and cost.
- GenAI/RAG System Architecture - How long RAG contexts stress the KV-cache and why inference optimization defines answer latency and price.
- History of NVIDIA AI accelerators - Why memory bandwidth and FP8 support on accelerators directly set the limits of LLM inference.
