Speculative Decoding in 2026: EAGLE-3, Medusa-V2, and Self-Speculation
Speculative decoding is now standard for LLM inference. The 2026 algorithms — EAGLE-3, Medusa-V2, MTP — and how to choose between them.
What Speculative Decoding Does
Autoregressive LLM generation is bottlenecked by sequential, token-by-token decoding. Speculative decoding attacks that bottleneck: a small, fast "draft" model proposes several tokens ahead, the large "target" model verifies them in parallel, and the system accepts the longest run the target agrees with.
When the draft is well aligned with the target, this yields a 2-4x throughput improvement with no quality loss: the rejection-sampling step guarantees the output distribution is exactly the target model's. By 2026 it is standard in every production inference server.
The Core Algorithm
```mermaid
flowchart LR
    Prompt --> Draft[Draft Model<br/>fast]
    Draft --> Tokens[Propose K tokens]
    Tokens --> Target[Target Model<br/>verify in parallel]
    Target --> Accept{Compare<br/>distributions}
    Accept -->|match| Take[Accept tokens]
    Accept -->|mismatch| Resample[Resample at first divergence]
    Take --> Loop[Repeat]
    Resample --> Loop
```
The key property: when all draft tokens are accepted, you get K tokens (plus one bonus token sampled from the target) for the latency of a single target forward pass. When a token is rejected, the remaining draft compute is wasted, but the resample step still yields one correctly distributed target token. The output is never wrong, just slower on bad guesses.
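A minimal sketch of one such step, using toy stand-in models (the `draft_probs` and `target_probs` callables, the draft length K, and the vocabulary are all illustrative assumptions, not any engine's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(ctx, draft_probs, target_probs, K=4):
    """One speculative step: draft K tokens, verify, accept/resample.

    draft_probs / target_probs are toy stand-ins: callables mapping a
    token list to a probability vector over the vocabulary.
    """
    # 1. The draft proposes K tokens autoregressively (cheap passes).
    drafted, q = [], []
    for _ in range(K):
        dist = draft_probs(ctx + drafted)
        tok = int(rng.choice(len(dist), p=dist))
        drafted.append(tok)
        q.append(dist)

    # 2. The target scores all K+1 positions; in a real engine this is
    #    one batched forward pass, which is where the speedup comes from.
    p = [target_probs(ctx + drafted[:i]) for i in range(K + 1)]

    # 3. Accept drafted token i with probability min(1, p_i(tok)/q_i(tok)).
    out = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized.
        # This step is what keeps the output distribution exactly the target's.
        residual = np.maximum(p[i] - q[i], 0.0)
        out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
        return out  # stop at first divergence
    # Everything accepted: one bonus token from the target's final position.
    out.append(int(rng.choice(len(p[K]), p=p[K])))
    return out
```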
The Algorithms That Matter in 2026
EAGLE-3
EAGLE-family methods train the draft as a tiny decoder head that takes the target model's hidden states as input. EAGLE-3 (2025) uses the target's deep hidden states and a draft tree (multiple candidates per position) to push acceptance rates above 75 percent on standard benchmarks, making it the highest-quality method in 2026 for accelerating general-purpose LLMs.
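A hedged sketch of the idea, not the exact EAGLE-3 architecture: the draft head fuses the target's hidden state with the current token embedding and runs one small transformer layer (the layer sizes and fusion scheme here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Toy EAGLE-style draft head: reuses the target's hidden states."""

    def __init__(self, d_model: int = 1024, vocab_size: int = 32000):
        super().__init__()
        # Fuse the target's hidden state with the current token embedding.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # A single small transformer layer is the whole "draft model"
        # (a real implementation would apply a causal attention mask).
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8,
                                                batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, target_hidden: torch.Tensor,
                tok_emb: torch.Tensor) -> torch.Tensor:
        # target_hidden, tok_emb: (batch, seq, d_model)
        x = self.fuse(torch.cat([target_hidden, tok_emb], dim=-1))
        x = self.block(x)            # attention over the drafted prefix
        return self.lm_head(x)       # logits for the next draft token

head = EagleStyleDraftHead()
h = torch.randn(1, 5, 1024)          # hidden states from the target model
e = torch.randn(1, 5, 1024)          # embeddings of the tokens so far
print(head(h, e).shape)              # torch.Size([1, 5, 32000])
```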
Medusa-V2
Medusa attaches multiple decoding heads to the target model itself, each predicting a different position into the future. It is the simpler, more compact approach: easier to deploy, with slightly lower acceptance rates than EAGLE-3.
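A hedged sketch of the structure (the MLP shape is illustrative; the actual Medusa heads use residual blocks): K extra heads sit on the target's last hidden state, and head k guesses the token k+1 positions ahead:

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """Toy Medusa-style heads: head k predicts the token k+1 steps ahead."""

    def __init__(self, d_model: int = 1024, vocab_size: int = 32000,
                 num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                          nn.Linear(d_model, vocab_size))
            for _ in range(num_heads))

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, d_model), the target's final hidden state.
        # Returns num_heads logit tensors for positions t+1 .. t+num_heads.
        return [head(last_hidden) for head in self.heads]

heads = MedusaStyleHeads()
logits = heads(torch.randn(2, 1024))
print(len(logits), logits[0].shape)   # 4 torch.Size([2, 32000])
```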
Multi-Token Prediction (MTP)
DeepSeek pioneered this in V3 and V4: the model is trained from scratch to predict multiple tokens in parallel, so no separate draft model is needed; the target itself produces multiple tokens per step. It delivers the highest acceptance rates but requires retraining the target.
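A hedged sketch of what such a training objective can look like, with one auxiliary head predicting two steps ahead (the head outputs, offsets, and 0.3 weight are illustrative assumptions, not DeepSeek's exact recipe):

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_t1: torch.Tensor, logits_t2: torch.Tensor,
             tokens: torch.Tensor, aux_weight: float = 0.3) -> torch.Tensor:
    """Next-token loss plus an auxiliary loss for predicting two steps ahead.

    logits_t1 / logits_t2: (batch, seq, vocab) from the main and MTP heads.
    tokens: (batch, seq) ground-truth token ids.
    """
    loss_1 = F.cross_entropy(                     # position t predicts t+1
        logits_t1[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())
    loss_2 = F.cross_entropy(                     # position t predicts t+2
        logits_t2[:, :-2].flatten(0, 1), tokens[:, 2:].flatten())
    return loss_1 + aux_weight * loss_2

B, S, V = 2, 16, 1000
print(mtp_loss(torch.randn(B, S, V), torch.randn(B, S, V),
               torch.randint(0, V, (B, S))))
```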
Self-Speculation
The target model uses its own earlier tokens (from the same sequence) as the draft. It is cheap to deploy and adds no extra parameters; acceptance rates are lower, but the memory overhead is zero.
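One common flavor, sketched below: reuse the longest earlier occurrence of the current suffix as the draft, in the style of prompt-lookup decoding (the n-gram length and draft cap are illustrative):

```python
def draft_from_context(tokens, ngram=3, max_draft=5):
    """Return up to max_draft tokens copied from a past match, else []."""
    suffix = tuple(tokens[-ngram:])
    # Scan backwards for an earlier occurrence of the suffix n-gram.
    for i in range(len(tokens) - ngram - 1, -1, -1):
        if tuple(tokens[i:i + ngram]) == suffix:
            start = i + ngram
            return tokens[start:start + max_draft]
    return []  # no match: fall back to plain decoding

# Repeated phrasing lets the context itself supply a draft.
print(draft_from_context([5, 9, 2, 7, 1, 4, 5, 9, 2]))  # -> [7, 1, 4, 5, 9]
```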
Side-by-Side
| Method | Acceptance | Setup | Memory Overhead |
|---|---|---|---|
| EAGLE-3 | 70-78% | Train EAGLE head | Small |
| Medusa-V2 | 60-70% | Train heads | Small |
| MTP | Built-in 80%+ | Retrain target | None (built into model) |
| Self-Speculation | 40-55% | None | None |
For deploying an existing model, EAGLE-3 is the leader in 2026. For new pretraining, MTP is the path most frontier labs are taking (DeepSeek V4 is the public example).
Tree Verification
```mermaid
flowchart TB
    Prompt --> D[Draft proposes tree:<br/>multiple candidates per position]
    D --> T[Target verifies tree in one pass]
    T --> Acc[Accept longest matching path]
```
Tree-based drafts (EAGLE-3, SpecInfer) propose multiple candidate continuations at each position. The target verifies all of them in a single forward pass via a tree-attention mask, which yields higher hardware utilization and higher acceptance rates per pass.
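A minimal sketch of the mask construction (the node layout and parent encoding are illustrative): each tree node may attend only to itself and its ancestors, so independent branches never see each other inside the shared forward pass:

```python
import numpy as np

def tree_attention_mask(parents):
    """parents[i] is the parent index of node i, or -1 for the root."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# A 2-branch tree: root(0) -> {1, 2}, 1 -> 3. Row 3 attends to {0, 1, 3}.
print(tree_attention_mask([-1, 0, 0, 1]).astype(int))
```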
Cost and Latency
Representative numbers from 2026 benchmarks, for Llama-3-70B on an H200:
- Baseline: ~38 tokens/sec
- Medusa-V2: ~85 tokens/sec
- EAGLE-3: ~115 tokens/sec
- MTP-style (DeepSeek V4): ~140 tokens/sec on the equivalent model size
For batch-1, latency-sensitive workloads (voice agents, interactive code completion), speculative decoding is essential: it can be the difference between 200 ms and 80 ms of per-token latency.
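A back-of-envelope model makes these numbers concrete. If each draft token is accepted independently with probability a (an idealization) and the draft proposes K tokens, one verification pass yields (1 - a^(K+1)) / (1 - a) tokens on average, ignoring the draft model's own cost:

```python
def expected_tokens_per_pass(a: float, K: int) -> float:
    """Expected tokens per target pass: sum of a**i for i in 0..K."""
    return (1 - a ** (K + 1)) / (1 - a)

for a in (0.5, 0.7, 0.8):
    print(f"acceptance {a:.0%}: "
          f"~{expected_tokens_per_pass(a, K=4):.2f} tokens/pass")
```

This is why the acceptance-rate gaps in the comparison table translate almost directly into the throughput gaps above.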
Where It Underperforms
- Highly creative or random sampling: high temperature reduces acceptance rates because the draft and target diverge more
- Out-of-distribution prompts: draft model trained on different data than target loses acceptance
- Very large drafts: a 7B drafting for a 70B target is too slow; the draft must be much smaller
What Inference Servers Ship
vLLM, TensorRT-LLM, SGLang, and TGI all ship speculative decoding in 2026. EAGLE and Medusa support is mature; MTP is integrated when serving a model trained for it (DeepSeek V4, etc.).
For most teams, the right action is to enable speculative decoding with the engine's default; tune draft model size only if benchmarks reveal headroom.
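As a starting point, enabling it in vLLM can look like the sketch below. Treat this as a hedged example: the exact argument names have shifted across vLLM releases (older versions took top-level speculative_model / num_speculative_tokens arguments), and the model names here are illustrative, so verify against the docs for the version you run:

```python
from vllm import LLM, SamplingParams

# Hedged sketch: config keys vary by vLLM version; see https://docs.vllm.ai.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # small draft model
        "num_speculative_tokens": 4,                  # K tokens per step
    },
)
out = llm.generate(["Speculative decoding works by"],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```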
Sources
- EAGLE-3 paper — https://arxiv.org/abs/2503.01840
- Medusa paper — https://arxiv.org/abs/2401.10774
- DeepSeek V3 MTP discussion — https://github.com/deepseek-ai/DeepSeek-V3
- vLLM speculative decoding docs — https://docs.vllm.ai
- "Speculative decoding survey" 2025 — https://arxiv.org/abs/2401.07851