LLM Post-Training: SFT, LoRA, and Alignment (DPO/RLHF)

After pretraining, an LLM can continue text, but it is not yet an assistant: a foundation model simply predicts the next token, does not follow instructions, and is not aligned with human preferences.

This chapter is about adapting a base model after pretraining: instruction tuning (SFT), efficient fine-tuning (LoRA/QLoRA), and preference alignment (DPO/RLHF). Neighboring chapters draw the boundaries: inference optimization is about serving a ready model, RAG is about supplying context by retrieval, and the AI Engineering overview is about the big picture.

From there we walk the pretraining → SFT → alignment pipeline, the fine-tune vs prompt/RAG fork, the economics of full versus PEFT, preference collection and evaluation, and the common mistakes that make an aligned release regress on reasoning and code.

Practical value of this chapter

Three stages: pretraining → SFT → alignment

Pretraining teaches next-token prediction and yields a base model; SFT on instruction→answer pairs teaches instruction following; alignment on preference pairs (DPO/RLHF) brings the model to a useful assistant. Post-training is SFT plus alignment on top of a ready base model.

SFT and the forgetting risk

Supervised fine-tuning changes not knowledge but a form of behavior: format, role, answering to the point. A small or one-sided dataset leads to overfitting and catastrophic forgetting — the model trades general ability for a narrow style, so data matters more than it seems.

PEFT: LoRA and QLoRA

LoRA (Hu et al., 2021) freezes weights and injects trainable low-rank matrices — a fraction of a percent of parameters is trained. QLoRA (Dettmers et al., 2023) adds 4-bit quantization of the base model and sharply lowers GPU memory needs. Adapters weigh megabytes, so they are cheap to store, version, and load at serving time.

Alignment and evaluation

RLHF (InstructGPT, Ouyang et al., 2022) builds a reward model and optimizes the policy via PPO; DPO (Rafailov et al., 2023) expresses the same objective directly without a separate reward model or RL. Evaluation runs in layers — benchmarks, LLM-judge, and A/B — to catch reward hacking and regressions in reasoning and code.

Related chapter

GenAI RAG System Architecture

When knowledge is better supplied by retrieval at query time than baked into weights through fine-tuning.

Читать обзор

After pretraining, an LLM can already continue text, but it is not yet an assistant: a pre-trained foundation model simply predicts the next token, does not follow instructions, and is not aligned with what people consider a good answer. Post-training closes that gap: instruction following (SFT), efficient fine-tuning (LoRA/QLoRA), and preference alignment (DPO/RLHF) turn a token predictor into an interlocutor that holds the format and answers to the point. The neighboring chapters draw the boundaries: LLM inference optimization is about serving an already-built model; RAG is about supplying context by retrieval; the AI Engineering overview is about the big picture. This chapter is about how a base model becomes a useful, aligned assistant.

Three stages: pretraining → SFT → alignment

The path from pretraining to an aligned model runs through three stages. For each one you can see what comes in, what the model learns, and what comes out.

SFT: supervised fine-tuning for instruction following

Data format

The input is pairs of instruction and desired answer, often with a system prompt and a chat template. The model picks up no new facts here; it absorbs a form of behavior: follow the format, hold the role, answer to the point.

What changes

Same language model, same next-token objective, but the data distribution shifts toward demonstrations of how to behave. This is the first step that turns a base model into a useful assistant.

Risks

A small or one-sided dataset leads to overfitting and catastrophic forgetting: the model trades the general abilities gained in pretraining for a narrow style.

Parameter-efficient fine-tuning: LoRA, QLoRA, adapters

LoRA

Hu et al. (2021) freeze the pre-trained weights and inject small trainable low-rank matrices into the layers. Only a tiny fraction of parameters is trained and the original weights are untouched, which makes it cheap to store many adapters over one base model.

QLoRA

Dettmers et al. (2023) add 4-bit quantization of the frozen base model and train LoRA adapters on top of it. This sharply lowers GPU memory requirements and lets you fine-tune large models where full fine-tuning is out of reach.

Adapters and when it is enough

The PEFT family (LoRA, adapters, prefix/prompt-tuning) works well when you need to adapt behavior or domain rather than retrain from scratch. For style, format, and a narrow task it is almost always enough.

Under the hood the whole PEFT family runs on one trick: train a small fraction of parameters over a frozen base model. The payoffs follow from that — a lower training and memory bill, cheaper storage for many versions, and a single shared base model that keeps serving everyone in production.

Preference alignment: RLHF and DPO

RLHF (reward model + PPO)

The classic path from InstructGPT (Ouyang et al., 2022): train a separate reward model on preference pairs, then use PPO to fine-tune the LLM to maximize its score. Powerful, but it is a full reinforcement-learning pipeline with its own instability and cost.

DPO

Rafailov et al. (2023) show that preference alignment can be reduced to direct optimization with a simple classification loss — no separate reward model and no RL loop. Noticeably simpler and more stable to train.

Variants (IPO/KTO/ORPO)

A family of methods grew around DPO: IPO adjusts the loss against overfitting to preferences, KTO learns from standalone good/bad labels without pairs, and ORPO folds the preference signal directly into the SFT stage.

Both RLHF and DPO train the model on the same signal — human preferences between answers. The difference is mechanics: RLHF builds a separate reward model and optimizes the policy via PPO, whereas DPO expresses the same objective directly and skips the RL loop, which makes it a common default choice.

Data and evaluation

Preference collection: pairs of answers labeled with which one is better. Label quality and inter-annotator agreement directly cap how far alignment can go.
Evaluation runs in layers: benchmarks, an LLM-judge (a model as arbiter), and a product A/B on live traffic. No single layer is sufficient on its own.
Reward hacking: the model finds a way to please the reward model or judge without becoming more useful. A typical symptom is verbose, confident, but empty answers.
Regressions: a win in style and safety is easy to pay for with a drop in reasoning, code, or rare languages. You need separate slices, not one aggregate number.

Cost and infrastructure

Full fine-tuning updates all weights and needs memory for gradients and optimizer states for the whole model; PEFT/LoRA trains a fraction of a percent of parameters and removes most of the GPU load.
Many experiments mean many checkpoints and versions. Adapters weigh megabytes instead of tens of gigabytes, so they are cheap to store, version, and compare.
Adapters also shape serving: one base model can stay in memory while different LoRA weights are loaded per tenant or task. How that runs on a node is the subject of the neighboring inference chapter.

When to choose what: the cost-and-control ladder

The fine-tune vs prompt/RAG fork is decided by cost and the level of control you need, not by fashion. Climb the ladder only when the previous rung does not solve the task.

minimal

Prompting / few-shot

Change behavior without training — through task framing and in-context examples. The cheapest, fastest lever; always try it first.

low

RAG

Supply fresh knowledge by retrieval at query time without touching weights. It solves freshness and factuality better than fine-tuning. See the RAG chapter.

medium

PEFT (LoRA/QLoRA)

When you need to durably change style, format, or domain behavior and prompting/RAG do not deliver it. Cheap to train and store, easy to roll back.

high

Full SFT + alignment

Full behavior adaptation and preference alignment. Maximum control and cost; justified when the model is the core of the product.

Key trade-offs

Control versus cost: full SFT plus alignment gives the most behavioral control, but it is the most expensive path and the slowest to iterate.
Alignment tax: optimizing for preferences and safety often slightly lowers the model's peak capability. It is a trade-off, not a bug.
Data over method: the quality and diversity of demonstrations and preferences almost always matter more than the choice between DPO and RLHF.
PEFT versus full: LoRA is cheaper and safer to iterate with, but for a deep change of behavior it can lag behind full fine-tuning — verify this, do not assume it.

Common mistakes

Fine-tuning for fresh facts instead of RAG: weights hold knowledge poorly, go stale fast, and are expensive to update.

Jumping straight into training without exhausting prompting and few-shot: often the needed effect is reached without a single gradient step.

Evaluating a release with one aggregate metric and missing regressions in reasoning, code, or rare scenarios.

Trusting the reward model or LLM-judge as absolute truth and not checking alignment on live traffic and slices.

Recommendations

Climb the cost ladder bottom-up: prompt to RAG to PEFT to full SFT+alignment, stopping where the task is already solved.

Do SFT first for base behavior and format, then preference alignment (DPO as a simple default, RLHF when it is justified).

Keep a layered evaluation: offline benchmarks, an LLM-judge, and a product A/B, plus separate slices to catch regressions.

Prefer LoRA/QLoRA for iteration: cheap to train, easy to version and roll back, convenient to keep several adapters over one base model.

References

Source map: InstructGPT supports the RLHF flow with a reward model and PPO; LoRA and QLoRA support parameter-efficient fine-tuning; DPO supports preference optimization without a separate reward model; TRL documents library implementations of these stages. The choice among SFT, DPO, RLHF, and PEFT is still empirical: dataset quality, eval policy, and the base model matter more than the method name.

Ouyang et al. — Training language models to follow instructions with human feedback (InstructGPT, RLHF, 2022)Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (2021)Dettmers et al. — QLoRA: Efficient Finetuning of Quantized LLMs (2023)Rafailov et al. — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO, 2023)Hugging Face — TRL: a library for SFT, reward modeling, PPO, and DPO

Related chapters

AI Engineering Overview - The overall map of working with LLMs in a product, where post-training fits as one lever alongside prompting, RAG, and serving.
GenAI RAG System Architecture - The alternative to fine-tuning for supplying knowledge: retrieve context at query time instead of baking facts into weights.
LLM Inference Optimization - What happens when serving an already-trained model and adapters: decode phases, the KV-cache, batching, and cost per token.
Model Serving & Inference Architecture - The outer runtime around the model: routing, latency budget, degraded modes, and loading adapters per tenant.