---
title: "Speculative Decoding in 2026: EAGLE-3, Medusa-V2, and Self-Speculation"
description: "Speculative decoding is now standard for LLM inference. The 2026 algorithms — EAGLE-3, Medusa-V2, MTP — and how to choose between them."
canonical: https://callsphere.ai/blog/speculative-decoding-2026-eagle-3-medusa-v2-self-speculation
category: "Large Language Models"
tags: ["Speculative Decoding", "Inference", "LLM", "Optimization", "EAGLE"]
author: "CallSphere Team"
published: 2026-04-24T00:00:00.000Z
updated: 2026-05-08T17:27:37.432Z
---

# Speculative Decoding in 2026: EAGLE-3, Medusa-V2, and Self-Speculation

> Speculative decoding is now standard for LLM inference. The 2026 algorithms — EAGLE-3, Medusa-V2, MTP — and how to choose between them.

## What Speculative Decoding Does

LLM autoregressive generation is bottlenecked by sequential token-by-token decoding. Speculative decoding works around that bottleneck: a small, fast "draft" model proposes several tokens ahead, the big "target" model verifies them in parallel, and the system accepts the longest prefix that the target model agrees with.

When the draft is well-aligned with the target, this gives 2-4x throughput improvement at zero quality cost. By 2026 it is standard in every production inference server.

## The Core Algorithm

```mermaid
flowchart LR
    Prompt --> Draft[Draft Model<br/>fast]
    Draft --> Tokens[Propose K tokens]
    Tokens --> Target[Target Model<br/>verify in parallel]
    Target --> Accept{Compare<br/>distributions}
    Accept -->|match| Take[Accept tokens]
    Accept -->|mismatch| Resample[Resample at first divergence]
    Take --> Loop[Repeat]
    Resample --> Loop
```

The key property: when all K draft tokens are accepted, you produce K+1 tokens (the drafts plus one bonus token from the target) for the latency of a single target forward pass. When the target rejects, you waste the draft compute but still emit a correctly sampled target token, so the output is never wrong, just slower on bad guesses.
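
Here is what that verify step looks like in code, following the accept/reject rule from the original speculative sampling papers. This is a minimal sketch with illustrative names, assuming you already have per-position probabilities from both models:

```python
import torch

def speculative_verify(draft_tokens, p_draft, p_target):
    """Accept/reject rule from speculative sampling (Leviathan et al., 2023).

    draft_tokens: (K,)     proposed token ids from the draft model
    p_draft:      (K, V)   draft probabilities at each draft position
    p_target:     (K+1, V) target probabilities, with one extra row for
                           the bonus token when every draft is accepted

    Returns a list of accepted token ids. The output distribution is
    provably identical to sampling from the target model alone.
    """
    out = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = p_target[i, tok], p_draft[i, tok]
        # Accept draft token i with probability min(1, p/q).
        if torch.rand(()) <= torch.clamp(p / q, max=1.0):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(p_target - p_draft, 0),
        # renormalized, then stop. This correction keeps the scheme lossless.
        residual = torch.clamp(p_target[i] - p_draft[i], min=0.0)
        out.append(torch.multinomial(residual / residual.sum(), 1).item())
        return out
    # All K drafts accepted: the extra target row yields a bonus token,
    # so one verification pass can emit up to K+1 tokens.
    out.append(torch.multinomial(p_target[-1], 1).item())
    return out
```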

## The Algorithms That Matter in 2026

### EAGLE-3

EAGLE-family algorithms train the draft as a tiny decoder head that takes the target model's hidden states as input rather than raw tokens. EAGLE-3 (2025) fuses hidden states from several target layers and drafts a tree of candidates (multiple per position), pushing acceptance rates above 75 percent on standard benchmarks. It is the highest-quality method in 2026 for accelerating an existing general-purpose LLM.
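
To make the architecture concrete, here is a schematic of an EAGLE-style draft head in PyTorch. This is a simplified sketch, not the paper's exact design; in particular, EAGLE-3's multi-layer feature fusion and tree drafting are reduced here to a single linear fuse and one decoder block:

```python
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Tiny draft head conditioned on target hidden states (schematic)."""

    def __init__(self, hidden: int, vocab: int, n_fused_layers: int = 3):
        super().__init__()
        # EAGLE-3 fuses features from several target layers; here that is
        # just a linear projection of their concatenation.
        self.fuse = nn.Linear(n_fused_layers * hidden, hidden)
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, fused_states: torch.Tensor, tok_emb: torch.Tensor):
        # fused_states: (B, T, n_fused_layers * hidden), from the target
        # tok_emb:      (B, T, hidden), embeddings of tokens drafted so far
        h = self.fuse(fused_states) + tok_emb
        h = self.block(h)        # one cheap self-attention pass
        return self.lm_head(h)   # logits for the next draft token
```

Because the head reuses the target's representations instead of re-deriving them from raw tokens, it stays tiny while staying closely aligned with the target, which is where the high acceptance rates come from.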

### Medusa-V2

Medusa attaches multiple decoding heads to the target model itself, each predicting a different number of positions into the future. It is the simpler, more compact take on speculative decoding: easier to deploy, with slightly lower acceptance rates than EAGLE-3.
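
A sketch of the head layout, with illustrative names. The real Medusa heads are residual blocks initialized from the target's own LM head, but the shape of the idea is just K extra projections on the final hidden state:

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """K extra decoding heads on the target's final hidden state (sketch).

    Head k predicts the token k+1 positions ahead, so one target forward
    pass yields one regular token plus K speculative candidates to verify.
    """

    def __init__(self, hidden: int, vocab: int, num_heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU())
            for _ in range(num_heads)
        )
        self.lm_heads = nn.ModuleList(
            nn.Linear(hidden, vocab) for _ in range(num_heads)
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (B, T, hidden) from the frozen target model.
        # Residual block per head, then a per-head vocabulary projection.
        return [
            head(last_hidden + block(last_hidden))
            for block, head in zip(self.blocks, self.lm_heads)
        ]
```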

### Multi-Token Prediction (MTP)

DeepSeek pioneered this in V3 and continued it in V4: the model is trained from scratch to predict multiple tokens per step, so no separate draft model is needed; the target drafts for itself as a byproduct of pretraining. It delivers the highest quality but requires retraining the target.
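
The training objective behind this is easy to state. A schematic version (not DeepSeek's exact module, which chains a small transformer block per depth): average the cross-entropy over prediction depths, where depth d targets the token d steps ahead:

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_per_depth: list[torch.Tensor], tokens: torch.Tensor):
    """Multi-token prediction loss (schematic).

    logits_per_depth[d-1]: (B, T, V) logits predicting the token d steps
    ahead of each position; depth 1 is the ordinary next-token objective.
    """
    losses = []
    for d, logits in enumerate(logits_per_depth, start=1):
        pred = logits[:, :-d]   # positions that still have a target d ahead
        tgt = tokens[:, d:]     # the token d steps ahead
        losses.append(
            F.cross_entropy(pred.reshape(-1, pred.size(-1)), tgt.reshape(-1))
        )
    return torch.stack(losses).mean()
```

At inference time those extra depth predictions double as the draft, which is why no separate draft model is needed.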

### Self-Speculation

The target model drafts for itself with no extra parameters, either by running a cheaper version of itself (skipping intermediate layers, as in layer-skip methods) or by reusing n-grams it has already emitted in the same sequence (prompt lookup). Cheap to deploy, zero memory overhead, lower acceptance rates.
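
The simplest variant to show is prompt lookup: match the trailing n-gram against earlier text in the same sequence and propose whatever followed it last time. A minimal sketch, with illustrative names:

```python
def prompt_lookup_draft(tokens: list[int], ngram: int = 3, k: int = 5) -> list[int]:
    """Draft up to k tokens by matching the trailing n-gram in the sequence.

    No extra model and no extra parameters: the sequence drafts for
    itself. Works well on repetitive text (code, structured extraction),
    poorly on free-form prose.
    """
    if len(tokens) <= ngram:
        return []
    tail = tokens[-ngram:]
    # Scan backwards for the most recent earlier occurrence of the tail.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start : start + ngram] == tail:
            return tokens[start + ngram : start + ngram + k]  # proposed draft
    return []  # no match: fall back to plain decoding for this step
```

The target still verifies every proposed token, so a bad match only costs wasted compute, never a wrong output.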

## Side-by-Side

| Method | Acceptance | Setup | Memory Overhead |
| --- | --- | --- | --- |
| EAGLE-3 | 70-78% | Train EAGLE head | Small |
| Medusa-V2 | 60-70% | Train heads | Small |
| MTP | Built-in 80%+ | Retrain target | None (built into model) |
| Self-Speculation | 40-55% | None | None |

For deploying an existing model, EAGLE-3 is the leader in 2026. For new pretraining, MTP is the path most frontier labs are taking (DeepSeek V4 is the public example).

## Tree Verification

```mermaid
flowchart TB
    Prompt --> D[Draft proposes tree:<br/>multiple candidates per position]
    D --> T[Target verifies tree in one pass]
    T --> Acc[Accept longest matching path]
```

Tree-based drafts (EAGLE-3, SpecInfer) propose multiple candidate continuations at each position. The target verifies all of them in a single forward pass via a tree-attention mask, which buys higher hardware utilization and higher acceptance rates.
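
A sketch of how that mask is built: each tree node may attend to itself and its ancestors only, so sibling branches stay independent while the target scores the whole tree in one pass (the parent-array encoding here is illustrative):

```python
import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Attention mask for verifying a draft tree in one forward pass.

    parents[i] is the index of node i's parent (-1 for roots that attach
    directly to the already-verified prefix). Entry (i, j) is True when
    node i may attend to node j, i.e. j is i itself or an ancestor of i.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True   # self, then walk up through each ancestor
            j = parents[j]
    return mask

# Example: root 0 with children 1 and 2; node 3 extends branch 2.
print(tree_attention_mask([-1, 0, 0, 2]).int())
```

After verification, the accepted output is the longest root-to-leaf path whose tokens all pass the accept/reject test.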

## Cost and Latency

Representative 2026 benchmark numbers for Llama-3-70B on an H200:

- Baseline: ~38 tokens/sec
- Medusa-V2: ~85 tokens/sec
- EAGLE-3: ~115 tokens/sec
- MTP-style (DeepSeek V4): ~140 tokens/sec on the equivalent model size

For batch-1, latency-sensitive workloads (voice agents, interactive code completion), speculative decoding is essential. Decoding at batch size 1 is memory-bandwidth-bound, so verifying K tokens costs little more than generating one; that is the difference between a reply trickling out at ~38 tokens/sec and streaming at ~115.
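
These multiples line up with the standard back-of-envelope model: with per-token acceptance rate α and a chain draft of length K, one verification pass yields (1 − α^(K+1)) / (1 − α) tokens in expectation, per the analysis in the original speculative decoding papers (which ignores draft cost):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """E[tokens emitted per target verification pass], assuming an
    independent per-token acceptance probability alpha and a chain
    draft of length k. Draft compute is ignored, so this is an
    optimistic upper bound on the speedup."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Plugging in the acceptance rates from the table above, with K = 4:
for name, alpha in [("Self-spec", 0.50), ("Medusa-V2", 0.65), ("EAGLE-3", 0.75)]:
    print(f"{name}: ~{expected_tokens_per_pass(alpha, k=4):.1f} tokens/pass")
# Self-spec ~1.9, Medusa-V2 ~2.5, EAGLE-3 ~3.1: roughly the ratios above.
```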

## Where It Underperforms

- **Highly creative or random sampling**: high temperature reduces acceptance rates because the draft and target diverge more
- **Out-of-distribution prompts**: a draft trained on different data than the target sees its acceptance rate collapse
- **Very large drafts**: a 7B model drafting for a 70B target is too slow to pay off; the draft must be much smaller than the target

## What Inference Servers Ship

vLLM, TensorRT-LLM, SGLang, and TGI all ship speculative decoding in 2026. EAGLE and Medusa support is mature; MTP is integrated when serving a model trained for it (DeepSeek V4, etc.).

For most teams, the right action is to enable speculative decoding with the engine's default; tune draft model size only if benchmarks reveal headroom.
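
For concreteness, here is roughly what enabling it looks like in vLLM. Treat this as a sketch: the speculative-decoding knobs have been renamed across vLLM releases (older versions took `speculative_model=` and `num_speculative_tokens=` as direct kwargs), and the EAGLE head checkpoint named here is an assumption, so check the docs for your version:

```python
from vllm import LLM, SamplingParams

# Sketch only: the config keys below follow recent vLLM speculative_config
# usage and may differ in your release; the draft checkpoint is assumed.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-70B",  # assumed EAGLE head
        "num_speculative_tokens": 5,
    },
)

out = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```

If the engine's metrics show low acceptance rates on your traffic, that, not raw tokens/sec on a benchmark, is the signal to try a different method or draft length.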

## Sources

- EAGLE-3 paper — [https://arxiv.org/abs/2503.01840](https://arxiv.org/abs/2503.01840)
- Medusa paper — [https://arxiv.org/abs/2401.10774](https://arxiv.org/abs/2401.10774)
- DeepSeek V3 MTP discussion — [https://github.com/deepseek-ai/DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3)
- vLLM speculative decoding docs — [https://docs.vllm.ai](https://docs.vllm.ai)
- "Speculative decoding survey" 2025 — [https://arxiv.org/abs/2401.07851](https://arxiv.org/abs/2401.07851)

## Speculative Decoding in 2026: EAGLE-3, Medusa-V2, and Self-Speculation — operator perspective

Reading this piece as an operator, the question isn't 'is this exciting?', it's 'does this change anything in my agent loop, my prompt cache, or my cost per session?' The CallSphere stack treats announcements as input to an evals queue, not a product roadmap. Production agents stay pinned; new releases earn their slot only after a regression suite confirms cost, latency, and tool-call reliability move the right way.

## Base model vs. production LLM stack — the gap that costs you uptime

A base model is a checkpoint. A production LLM stack is a whole different artifact: eval gates that fail the build on regression, prompt caching that cuts repeated-system-prompt cost by 40-70%, structured outputs that prevent JSON drift on tool calls, fallback chains that route to a smaller-model retry when the primary times out, and request-side guardrails that cap tool calls per session before the loop spirals.

CallSphere runs LLMs in tandem on purpose: `gpt-4o-realtime` for the live call (streaming audio in and out, tool calls inline) and `gpt-4o-mini` for post-call analytics (sentiment scoring, lead qualification, summary generation, and the lower-stakes async work that doesn't need realtime). That split is not a cost optimization — it's a reliability decision. Realtime is optimized for low-latency turn-taking; mini is optimized for cheap, deterministic batch scoring. Mixing them lets each do what it's good at without one regressing the other.

The teams that struggle with LLMs in production almost always made the same mistake: they treated "the model" as a single dependency, instead of as a small portfolio of models, each pinned to a job, each behind its own eval suite, each with a documented fallback.

## FAQs

**Q: Does speculative decoding actually move p95 latency or tool-call reliability?**

A: Most of the time it doesn't, and that's the right starting assumption. The relevant test is whether it improves at least one of: p95 first-token latency, tool-call argument accuracy on noisy inputs, multi-turn handoff stability, or per-session cost. CallSphere runs 37 specialized AI agents wired to 90+ function tools across 115+ database tables in 6 live verticals, which is the traffic any candidate change has to prove itself against.

**Q: What would have to be true before a new speculative decoding method ships into production?**

A: The eval gate is unsentimental: a regression suite that simulates real call traffic (noisy ASR, partial inputs, tool-call timeouts) measures four numbers, and a candidate has to win on three of the four without losing badly on the fourth. Anything else is treated as a blog post, not a stack change.

**Q: Which CallSphere vertical would benefit from speculative decoding first?**

A: In a CallSphere deployment, new model and API capabilities land first in the post-call analytics pipeline (lower stakes, async, easy to roll back) and only later in the live realtime path. Today the vertical most likely to absorb new capability first is Real Estate, which already runs the largest share of production traffic.

## See it live

Want to see sales agents handle real traffic? Walk through https://sales.callsphere.tech or grab 20 minutes with the founder: https://calendly.com/sagar-callsphere/new-meeting.

