
Speculative Decoding in 2026: EAGLE-3, Medusa-V2, and Self-Speculation

Speculative decoding is now standard for LLM inference. The 2026 algorithms — EAGLE-3, Medusa-V2, MTP — and how to choose between them.

What Speculative Decoding Does

LLM autoregressive generation is bottlenecked by sequential token-by-token decoding. Speculative decoding flips that: a small fast "draft" model proposes several tokens ahead, the big "target" model verifies them in parallel, and the system accepts the longest run that the target model agrees with.

When the draft is well-aligned with the target, this gives a 2-4x throughput improvement with no change in output quality: the accept/resample rule guarantees that the final tokens follow the target model's distribution exactly. By 2026 it is standard in every production inference server.
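Under the standard analysis (a per-token acceptance probability α, assumed i.i.d. across positions, and draft length K), the expected number of tokens produced per target forward pass is (1 − α^(K+1)) / (1 − α) — a quick way to see where the 2-4x figure comes from:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target forward pass (accepted draft tokens
    plus the bonus token), assuming i.i.d. per-token acceptance alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A well-aligned draft (alpha = 0.75) with 4 drafted tokens:
print(round(expected_tokens_per_pass(0.75, 4), 2))  # → 3.05
# A weaker draft (alpha = 0.5) with the same draft length:
print(round(expected_tokens_per_pass(0.5, 4), 2))   # → 1.94
```

The i.i.d. assumption is a simplification — real acceptance is context-dependent — but the formula matches observed speedups reasonably well.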

The Core Algorithm

flowchart LR
    Prompt --> Draft[Draft Model<br/>fast]
    Draft --> Tokens[Propose K tokens]
    Tokens --> Target[Target Model<br/>verify in parallel]
    Target --> Accept{Compare<br/>distributions}
    Accept -->|match| Take[Accept tokens]
    Accept -->|mismatch| Resample[Resample at first divergence]
    Take --> Loop[Repeat]
    Resample --> Loop

The key property: when you accept the draft, you produce K tokens for the latency of one target forward pass. When you reject, you waste the draft compute but produce a target-sampled token anyway — never wrong, just slow on bad guesses.
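The accept/resample rule can be sketched end-to-end with toy distributions. Everything here — the two fixed distributions, the vocabulary size, the draft length — is illustrative stand-in code, not a real model:

```python
import random

random.seed(0)
VOCAB = 8  # toy vocabulary

def draft_dist(ctx):
    # Toy "draft model": a fixed distribution (context-independent here).
    base = [i + 1.0 for i in range(VOCAB)]
    return [b / sum(base) for b in base]

def target_dist(ctx):
    # Toy "target model": similar to, but not identical with, the draft.
    base = [i + 1.5 for i in range(VOCAB)]
    return [b / sum(base) for b in base]

def sample(dist):
    return random.choices(range(VOCAB), weights=dist, k=1)[0]

def speculative_step(ctx, k=4):
    """One draft-then-verify step; returns the tokens to append."""
    # 1. Draft proposes k tokens autoregressively, recording its dists q.
    proposed, q, c = [], [], list(ctx)
    for _ in range(k):
        d = draft_dist(c)
        t = sample(d)
        proposed.append(t); q.append(d); c.append(t)
    # 2. Target scores all k positions (one parallel pass in practice).
    p = [target_dist(ctx + proposed[:i]) for i in range(k)]
    # 3. Accept token t with prob min(1, p[t]/q[t]); at the first
    #    rejection, resample from the residual max(0, p - q).
    out = []
    for i, t in enumerate(proposed):
        if random.random() < min(1.0, p[i][t] / q[i][t]):
            out.append(t)
        else:
            resid = [max(0.0, pj - qj) for pj, qj in zip(p[i], q[i])]
            z = sum(resid)
            out.append(sample([r / z for r in resid]) if z > 0 else sample(p[i]))
            return out  # stop at first divergence
    # 4. All k accepted: take a "bonus" token from the target's next dist.
    out.append(sample(target_dist(ctx + proposed)))
    return out

tokens = speculative_step([1, 2, 3], k=4)
print(len(tokens))  # between 1 (immediate rejection) and 5 (all accepted + bonus)
```

The accept-with-probability-min(1, p/q) rule plus residual resampling is what makes the scheme lossless: the output is distributed exactly as if the target had sampled every token itself.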

The Algorithms That Matter in 2026

EAGLE-3

EAGLE family algorithms train the draft as a tiny decoder head that uses the target model's hidden states as input. EAGLE-3 (2025) uses the target's deep hidden states and a draft tree (multiple candidates per position) to push acceptance rates above 75 percent on standard benchmarks. It is the highest-quality method in 2026 for general-purpose LLMs.
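The core idea — drafting from the target's hidden features rather than from raw tokens — can be sketched as follows. The layer fusion, shapes, and random weights below are illustrative only, not EAGLE-3's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, VOCAB = 16, 32  # toy sizes

# Illustrative: fuse features from several target layers (EAGLE-3 draws on
# multiple depths, not just the final layer) and project to draft logits.
layer_feats = [rng.standard_normal(HIDDEN) for _ in range(3)]  # low/mid/high layers
fused = np.concatenate(layer_feats)                # [3 * HIDDEN]
W_fuse = rng.standard_normal((3 * HIDDEN, HIDDEN)) # stand-in for a trained fusion layer
W_out = rng.standard_normal((HIDDEN, VOCAB))       # stand-in for a trained draft head

logits = np.tanh(fused @ W_fuse) @ W_out
print(logits.shape)  # (32,) — one draft distribution over the vocabulary
```

Because the draft head reuses the target's representations, it can be tiny (a fraction of a transformer layer) yet stay well-aligned with the target, which is what pushes acceptance rates so high.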


Medusa-V2

Medusa attaches multiple decoding heads to the target model itself; each head predicts the token at a different future position. It is the simpler, more compact approach to speculative decoding — easier to deploy, with somewhat lower acceptance rates than EAGLE-3.
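A minimal sketch of the multi-head idea, with random weights standing in for trained heads (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, HEADS = 16, 32, 4  # toy sizes

# Each "head" is an independent linear projection of the same final hidden
# state; head h predicts the token at position t + 1 + h.
W_heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(HEADS)]

h_last = rng.standard_normal(HIDDEN)  # target's hidden state at step t
draft = [int(np.argmax(h_last @ w)) for w in W_heads]  # greedy token per head
print(len(draft))  # 4 candidate future tokens from one target forward pass
```

All heads read the same hidden state, so the whole draft comes "for free" with the target's forward pass — no separate draft model runs at all; the verification step then checks the candidates as usual.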

Multi-Token Prediction (MTP)

DeepSeek pioneered this in V3 and continued it in V4: the model is trained from scratch to predict multiple future tokens per step, so no separate draft model is needed — the target drafts for itself. It achieves the highest acceptance rates, but requires training (or retraining) the target model with the MTP objective.

Self-Speculation

The target model uses its own earlier tokens (from the same sequence) as draft. Cheap to deploy, no extra parameters. Lower acceptance rates but zero memory overhead.
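One common flavor of this is n-gram (prompt-lookup) drafting: match the last few tokens against earlier occurrences in the same sequence and propose whatever followed them. The helper below is a hypothetical sketch with illustrative parameters (other self-speculation variants instead draft with a subset of the target's own layers):

```python
def ngram_draft(tokens, n=2, k=4):
    """Propose up to k draft tokens by finding the most recent earlier
    occurrence of the last n tokens and copying what followed it."""
    if len(tokens) < n + 1:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards for a prior match of the n-gram suffix.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            return tokens[i + n : i + n + k]
    return []  # no match: fall back to plain autoregressive decoding

print(ngram_draft([1, 2, 3, 4, 1, 2]))  # → [3, 4, 1, 2]
```

This works well on repetitive text (code, structured output, retrieval-heavy prompts) and contributes nothing on novel text — consistent with the lower average acceptance rates in the table below.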

Side-by-Side

| Method | Acceptance | Setup | Memory Overhead |
|---|---|---|---|
| EAGLE-3 | 70-78% | Train EAGLE head | Small |
| Medusa-V2 | 60-70% | Train heads | Small |
| MTP (built-in) | 80%+ | Retrain target | None (built into model) |
| Self-Speculation | 40-55% | None | None |

For deploying an existing model, EAGLE-3 is the leader in 2026. For new pretraining, MTP is the path most frontier labs are taking (DeepSeek V4 is the public example).


Tree Verification

flowchart TB
    Prompt --> D[Draft proposes tree:<br/>multiple candidates per position]
    D --> T[Target verifies tree in one pass]
    T --> Acc[Accept longest matching path]

Tree-based drafts (EAGLE-3, SpecInfer) propose multiple candidate continuations at each position. The target verifies all of them in a single forward pass via a tree-attention mask. Higher hardware utilization, higher acceptance.
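The ancestor-only attention mask that makes single-pass tree verification possible can be built like this (illustrative sketch; real implementations fold this into the attention kernel):

```python
def tree_attention_mask(parent):
    """Causal mask for a tree-structured draft: token i may attend to j
    iff j is i itself or an ancestor of i in the draft tree.
    parent[i] is the index of node i's parent, -1 for the root."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking visible positions
            mask[i][j] = True
            j = parent[j]
    return mask

# Root 0 with children 1 and 2; node 3 is a child of 1:
m = tree_attention_mask([-1, 0, 0, 1])
print(m[3])  # [True, True, False, True]: node 3 sees 0, 1, and itself
```

Each root-to-leaf path behaves like an independent candidate continuation, yet the target scores all of them in one batched forward pass; the longest path whose tokens pass verification is accepted.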

Cost and Latency

Representative 2026 benchmark numbers for Llama-3-70B on an H200:

  • Baseline: ~38 tokens/sec
  • Medusa-V2: ~85 tokens/sec
  • EAGLE-3: ~115 tokens/sec
  • MTP-style (DeepSeek V4): ~140 tokens/sec on the equivalent model size

For batch-1 latency-sensitive workloads (voice agents, interactive code completion), speculative decoding is essential — it can be the difference between roughly 200ms and 80ms of per-token decode latency.

Where It Underperforms

  • High-temperature or highly random sampling: the draft and target distributions diverge more, so acceptance rates drop
  • Out-of-distribution prompts: a draft model trained on different data than the target sees fewer of its guesses accepted
  • Oversized drafts: a 7B model drafting for a 70B target spends too long drafting; the draft must be many times smaller than the target

What Inference Servers Ship

vLLM, TensorRT-LLM, SGLang, and TGI all ship speculative decoding in 2026. EAGLE and Medusa support is mature; MTP is integrated when serving a model trained for it (DeepSeek V4, etc.).

For most teams, the right action is to enable speculative decoding with the engine's default; tune draft model size only if benchmarks reveal headroom.
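As a concrete illustration, enabling EAGLE-style speculation in vLLM looks roughly like the sketch below. Treat it as an assumption-laden example, not a recipe: the `speculative_config` schema has changed across vLLM releases, and both model names are placeholders — verify field names and checkpoints against your installed version's documentation.

```python
from vllm import LLM, SamplingParams

# Illustrative only; check your vLLM version's docs for the exact schema.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # target model (placeholder)
    speculative_config={
        "method": "eagle",                          # draft method
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-70B",  # EAGLE head (placeholder)
        "num_speculative_tokens": 5,                # draft length K
    },
)
out = llm.generate(["Speculative decoding works by"],
                   SamplingParams(max_tokens=64))
```

The defaults (draft length, tree width) are usually sensible; measure end-to-end tokens/sec on your own traffic before tuning them.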
