Post-Training Pipeline 2026: SFT, DPO, GRPO, and the Rise of Verifiable Rewards
The 2026 LLM post-training stack — SFT, DPO, RLHF, GRPO, RLVR. What each step does, when to use it, and what frontier labs do differently.
The 2026 Post-Training Pipeline
Pretraining gives you a model that completes text. Post-training turns that into a model that follows instructions, refuses unsafe requests, reasons step by step, and is rated as helpful by human evaluators. The 2026 pipeline has more stages than 2023's "SFT + RLHF."
This piece walks through each stage, what it does, and the order frontier labs ship them in.
The Pipeline
```mermaid
flowchart LR
    Pre[Pretrained Base] --> SFT[SFT]
    SFT --> Pref[Preference Optimization<br/>DPO / IPO / KTO]
    Pref --> RL[RL with Rewards<br/>GRPO / PPO]
    RL --> RLVR[RL with Verifiable Rewards<br/>RLVR]
    RLVR --> Final[Aligned Model]
```
SFT (Supervised Fine-Tuning)
Standard fine-tuning on instruction/response pairs. Teaches format, basic instruction following, and refusal patterns. The foundation for everything later. Typically a few hundred thousand to a few million examples.
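A minimal sketch of the SFT stage with Hugging Face TRL. The dataset and model identifiers below are placeholders, not a recommendation, and argument names can shift between TRL versions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset of instruction/response pairs; swap in your own data.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",                    # recent TRL accepts a model id string
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-checkpoint"),
)
trainer.train()
```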
Preference Optimization
Replaces the RLHF "reward model + PPO" sandwich with simpler offline objectives. The 2026 lineup:
- DPO (Direct Preference Optimization): train the model to prefer chosen responses over rejected ones, no separate reward model
- IPO (Identity Preference Optimization): a regularized variant that handles noisy preference labels better
- KTO (Kahneman-Tversky Optimization): works with binary thumbs-up/thumbs-down data instead of pairs
DPO and its cousins are the dominant preference-optimization choice in 2026 because they are simpler and more stable than PPO.
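The DPO objective is compact enough to show directly. A minimal PyTorch sketch of the loss, assuming you have already computed summed per-token log-probabilities of the chosen and rejected responses under both the trainable policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023). Inputs are tensors of summed
    per-token log-probabilities for a batch of chosen/rejected responses
    under the policy and the frozen reference model."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; training pushes it positive for chosen over rejected.
    margin = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(margin).mean()
```

No reward model and no sampling loop: the preference data is consumed offline, which is why this family is simpler to run than PPO.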
RL with Reward Models (PPO and successors)
The classical RLHF approach is still used at frontier labs for the final polish. PPO is the most common; some teams use REINFORCE++ or GRPO for efficiency.
GRPO (Group Relative Policy Optimization)
DeepSeek-R1 popularized GRPO in early 2025. Sample multiple responses per prompt, compute each response's advantage relative to the group's mean reward, and update the policy. Requires no learned value function (critic), which makes it cheaper than PPO. Strong fit for reasoning tasks.
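The group-relative advantage is the distinctive piece. A sketch of how per-response advantages are formed by standardizing rewards within each group, following the DeepSeek-R1 formulation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled response.
    Advantages are the rewards standardized within each group, so no learned
    critic is needed."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, binary rewards from a verifier.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```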
RLVR (RL with Verifiable Rewards)
The 2026 frontier. For tasks where success is verifiable — math (does the answer match?), code (do tests pass?), tool use (did the tool succeed?) — use the verifier as the reward. No human preference labels needed. The "thinking" models from major labs are largely RLVR-trained.
Why Verifiable Rewards Are a Big Deal
Reward modeling is hard. Reward models are noisy. Reward hacking is real. Verifiers are cleaner: a Python interpreter says "yes, correct" or "no, wrong." When your task has a verifier, you skip a fragile abstraction.
```mermaid
flowchart TD
    Prompt[Math problem] --> Model[Model proposes solution]
    Model --> Verifier[Python interpreter checks]
    Verifier -->|correct| Reward[Reward = 1]
    Verifier -->|incorrect| Reward2[Reward = 0]
    Reward --> Update[Update policy]
    Reward2 --> Update
```
This is the engine behind 2025-2026 reasoning gains. Math, code, and structured tool-use benchmarks all jumped dramatically once RLVR pipelines stabilized.
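In code, a verifiable reward is just a deterministic check. Two toy verifiers as a sketch (function names are illustrative; production systems sandbox execution, enforce timeouts, and use more robust answer matching such as symbolic equality):

```python
def math_verifier_reward(model_answer: str, reference_answer: str) -> float:
    """1.0 if the final answer matches the reference after light normalization,
    else 0.0. Real pipelines use symbolic checks (e.g. sympy) rather than strings."""
    normalize = lambda s: s.strip().lower().replace(" ", "").rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0

def code_verifier_reward(candidate_source: str, test_source: str) -> float:
    """1.0 if the candidate code passes the supplied tests, else 0.0.
    Sketch only: never exec untrusted model output outside a sandbox."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function(s)
        exec(test_source, namespace)       # run assertions against them
        return 1.0
    except Exception:
        return 0.0
```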
Distillation
Often interleaved with the stages above: train smaller models to imitate larger, already post-trained models. Distillation can preserve most of the quality at a fraction of the cost. DeepSeek-R1's distilled variants and Microsoft's Phi-4 family are recent examples.
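There are two common recipes: sequence-level distillation (plain SFT on teacher-generated outputs, as in the R1 distills) and logit-level distillation. A minimal sketch of the logit-level loss, which matches the student's softened next-token distribution to the teacher's:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation (Hinton et al., 2015). logits: (batch, vocab).
    The T^2 factor keeps gradient magnitudes comparable across temperatures."""
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_p = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * temperature ** 2
```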
Order in Practice
A 2026 frontier-lab order, simplified:
- SFT on diverse instructions
- DPO/IPO on preference data for general helpfulness and safety
- PPO or GRPO against a learned reward model (still used by labs that have invested in RM infrastructure)
- RLVR on math, code, and structured tool-use tasks
- Final SFT on the highest-quality outputs to consolidate
Open-source recipes (Tulu 3, Llama 4 post-training, Qwen3 post-training) follow similar patterns with public data.
What Practitioners Do
For application-specific fine-tuning in 2026:
- Stick to SFT and DPO unless you have unique data
- Use LoRA / QLoRA (see the config sketch after this list); full fine-tunes are rarely needed
- For domain-specific reasoning (code, math, structured outputs), invest in a verifier and try RLVR
- Reserve full RLHF for cases where you have the human labeling pipeline already
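A minimal LoRA setup with the PEFT library. The base model and hyperparameters below are illustrative, not a recommendation:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; swap in whatever you are adapting.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```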
Open-Source Tooling
- TRL (Hugging Face): SFT, DPO, KTO, PPO, GRPO recipes (a GRPO example is sketched after this list)
- OpenRLHF: scalable PPO for large models
- Axolotl: configuration-driven SFT/DPO
- LLaMA-Factory: comprehensive post-training UI/CLI
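To tie GRPO and verifiable rewards back to the tooling, here is a hedged sketch of TRL's GRPOTrainer with a toy verifier as the reward function. Names are placeholders, argument details vary by TRL version, and the sketch assumes recent TRL behavior of passing dataset columns to reward functions as keyword arguments; check the TRL docs for your release:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy verifiable reward: 1.0 if the completion contains the gold final answer
# (GSM8K stores it after "####"). Real verifiers parse answers more carefully.
def exact_match_reward(completions, answer, **kwargs):
    finals = [a.split("####")[-1].strip() for a in answer]
    return [1.0 if final in completion else 0.0
            for completion, final in zip(completions, finals)]

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})  # GRPO expects a "prompt" column

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder base model
    reward_funcs=exact_match_reward,
    args=GRPOConfig(output_dir="grpo-checkpoint", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```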
Sources
- DPO paper — https://arxiv.org/abs/2305.18290
- DeepSeek-R1 paper — https://arxiv.org/abs/2501.12948
- "Tulu 3" Allen AI — https://arxiv.org/abs/2411.15124
- TRL library — https://huggingface.co/docs/trl
- OpenRLHF — https://github.com/OpenRLHF/OpenRLHF