Learn Agentic AI

Speculative Decoding: Using Small Models to Speed Up Large Model Inference

Learn how speculative decoding uses lightweight draft models to generate candidate tokens that a large target model verifies in parallel, achieving 2-3x inference speedups without quality loss.

The Inference Bottleneck

Large language model inference is fundamentally bottlenecked by memory bandwidth, not compute. Each token generation requires loading billions of parameters from memory, but the actual computation per token is minimal. This means that whether you are generating one token or checking five candidate tokens, the wall-clock time is similar — the memory transfer dominates.
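A rough back-of-envelope sketch makes this concrete. The parameter count, precision, and bandwidth figures below are illustrative assumptions, not measurements:

```python
# Decoding is memory-bound: each generated token must stream the full
# set of weights from memory at least once.
params = 70e9           # assumed 70B-parameter model
bytes_per_param = 2     # fp16/bf16 weights
bandwidth = 2e12        # assumed ~2 TB/s of accelerator memory bandwidth

bytes_per_token = params * bytes_per_param
seconds_per_token = bytes_per_token / bandwidth
print(f"~{seconds_per_token * 1000:.0f} ms/token floor, "
      f"~{1 / seconds_per_token:.0f} tokens/s ceiling")
```

Under these assumptions the hardware caps sequential generation at roughly 14 tokens per second regardless of how little compute each token needs, which is exactly the headroom speculative decoding exploits.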

Speculative decoding exploits this insight: use a small, fast model to draft several tokens at once, then verify all of them in a single pass through the large model. If the large model agrees with the draft, you have generated multiple tokens in the time it would take to generate one.

How Speculative Decoding Works

The process has three phases:

Draft phase. A small model (the draft model) autoregressively generates K candidate tokens. Because the draft model is small, this is fast — often faster than a single forward pass of the target model.

Verify phase. The large target model processes all K draft tokens in a single forward pass, computing the probability distribution at each position. This is efficient because a forward pass over K tokens in parallel costs roughly the same wall-clock time as a pass over one token: the memory-bandwidth bottleneck means weight loading dominates, and the extra compute for K positions is nearly free.

Accept/reject phase. Each draft token is compared against the target model's distribution. Tokens are accepted or rejected using a modified rejection sampling scheme that preserves the exact output distribution of the target model.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def speculative_decode(
    draft_model,
    target_model,
    tokenizer,
    prompt: str,
    max_tokens: int = 100,
    draft_length: int = 5,
) -> str:
    """Speculative decoding with a draft model and target model."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids.clone()

    tokens_generated = 0
    while tokens_generated < max_tokens:
        # Phase 1: Draft K tokens with the small model
        # (recomputes the full prefix each step for clarity; real
        # implementations reuse a KV cache here)
        draft_ids = generated.clone()
        draft_probs_list = []

        for _ in range(draft_length):
            with torch.no_grad():
                draft_out = draft_model(draft_ids)
                draft_logits = draft_out.logits[:, -1, :]
                draft_probs = torch.softmax(draft_logits, dim=-1)
                draft_probs_list.append(draft_probs)
                next_token = torch.multinomial(draft_probs, 1)
                draft_ids = torch.cat([draft_ids, next_token], dim=1)

        # Phase 2: Verify all draft tokens with the target model
        with torch.no_grad():
            target_out = target_model(draft_ids)
            target_logits = target_out.logits

        # Phase 3: Accept or reject each draft token
        n_accepted = 0
        for i in range(draft_length):
            pos = generated.shape[1] + i
            target_probs = torch.softmax(target_logits[:, pos - 1, :], dim=-1)
            draft_token = draft_ids[:, pos]
            draft_p = draft_probs_list[i][:, draft_token].item()
            target_p = target_probs[:, draft_token].item()

            # Acceptance criterion preserving target distribution
            if np.random.random() < min(1.0, target_p / (draft_p + 1e-10)):
                n_accepted += 1
            else:
                # Reject: sample from adjusted distribution
                adjusted = torch.clamp(target_probs - draft_probs_list[i], min=0)
                adjusted = adjusted / adjusted.sum()
                new_token = torch.multinomial(adjusted, 1)
                generated = torch.cat([generated, draft_ids[:, generated.shape[1]:pos].reshape(1, -1), new_token], dim=1)
                tokens_generated += n_accepted + 1
                break
        else:
            # All draft tokens were accepted: also sample one bonus token
            # from the target model's distribution at the final position,
            # which the verify pass already computed for free
            bonus_probs = torch.softmax(target_logits[:, -1, :], dim=-1)
            bonus_token = torch.multinomial(bonus_probs, 1)
            generated = torch.cat([draft_ids, bonus_token], dim=1)
            tokens_generated += draft_length + 1

        if tokenizer.eos_token_id in generated[0, input_ids.shape[1]:]:
            break

    return tokenizer.decode(generated[0, input_ids.shape[1]:], skip_special_tokens=True)

Speedup Factors and Draft Model Selection

The speedup depends on the acceptance rate — how often the target model agrees with the draft model. A well-matched draft model that agrees 70-80% of the time typically yields 2-3x speedup. Poor matches drop to 1.2-1.5x or even no speedup.

Good draft model choices:

  • A smaller model from the same family (Llama-7B drafting for Llama-70B)
  • A quantized version of the target model
  • A model fine-tuned on similar data distributions
A quick way to turn a measured acceptance rate into an expected speedup:

def estimate_speedup(
    acceptance_rate: float, draft_length: int,
    draft_time_ms: float, target_time_ms: float,
) -> float:
    """Estimate the speculative decoding speedup factor."""
    # Expected tokens per speculation round (geometric series, including
    # the bonus token); the closed form is undefined at acceptance_rate == 1,
    # where every round yields draft_length + 1 tokens
    if acceptance_rate >= 1.0:
        expected_tokens = float(draft_length + 1)
    else:
        expected_tokens = (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)

    # Time per speculation round: K draft passes plus one target pass
    round_time = draft_length * draft_time_ms + target_time_ms

    # Standard autoregressive time for the same number of tokens
    standard_time = expected_tokens * target_time_ms

    return standard_time / round_time

For example, with a 50 ms target pass, a 5 ms draft pass, K=5, and a 70% acceptance rate, this estimates roughly a 2x speedup.

Implementation in Agent Pipelines

For agent developers using API-based inference, speculative decoding is typically handled by the serving infrastructure (vLLM, TensorRT-LLM, llama.cpp all support it). Your role is choosing the right draft model and tuning the draft length.

For self-hosted agents, enable speculative decoding in your serving framework. In vLLM, it is a configuration flag. The serving layer handles the draft-verify-accept cycle transparently, and your application code sees only faster token generation with identical output quality.
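As an illustration, the offline Python API for this has looked roughly like the following in recent vLLM releases. The argument names change between versions (newer releases take a speculative_config dict) and the model names here are placeholders, so check the documentation for your installed version:

```python
from vllm import LLM

# Hypothetical sketch: a large target model paired with a smaller
# draft model from the same family. Argument names vary across
# vLLM versions; consult your version's docs before copying this.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)
```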

FAQ

Does speculative decoding change the output quality?

No. The mathematical guarantee of speculative decoding is that the output distribution is identical to what the target model would produce on its own. The rejection sampling scheme ensures that accepted tokens follow the exact same probability distribution. You get speed without any quality tradeoff.
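This guarantee is easy to check empirically. The sketch below uses a toy three-token vocabulary with made-up target and draft distributions, and shows that the accept/reject rule recovers the target distribution exactly even though proposals come from the draft:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model distribution
q = np.array([0.3, 0.4, 0.3])   # draft model distribution

def speculative_sample(p, q, rng):
    x = rng.choice(3, p=q)                    # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with prob min(1, p/q)
        return x
    residual = np.clip(p - q, 0, None)        # on rejection, resample from
    residual /= residual.sum()                # the normalized residual
    return rng.choice(3, p=residual)

samples = np.array([speculative_sample(p, q, rng) for _ in range(100_000)])
freqs = np.bincount(samples, minlength=3) / len(samples)
print(freqs)  # empirically close to p = [0.6, 0.3, 0.1]
```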

What draft length should I use?

Start with K=5 and tune based on your acceptance rate. Higher acceptance rates support longer draft lengths (K=8-10). Lower acceptance rates benefit from shorter drafts (K=3-4) because rejected tokens waste the draft model's compute. Monitor the acceptance rate in production and adjust accordingly.
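The tuning can be sketched with the same expected-tokens formula used for speedup estimation. All latency and acceptance numbers below are illustrative assumptions; in practice you would plug in measured values:

```python
# Sweep draft length K for a given acceptance rate and per-pass latencies.
def speedup(alpha: float, k: int, draft_ms: float, target_ms: float) -> float:
    # Expected accepted tokens per round (geometric series, incl. bonus token)
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected * target_ms / (k * draft_ms + target_ms)

alpha, draft_ms, target_ms = 0.7, 5.0, 50.0   # assumed measurements
best_k = max(range(1, 11), key=lambda k: speedup(alpha, k, draft_ms, target_ms))
print(best_k, round(speedup(alpha, best_k, draft_ms, target_ms), 2))
```

With these made-up numbers the sweep lands near K=4: past that point, the extra draft passes cost more than the marginal accepted tokens are worth.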

Can I use speculative decoding with API providers like OpenAI?

Not directly from your application code — the draft-verify cycle requires access to both models' logits during generation. However, API providers implement speculative decoding internally on their serving infrastructure. You benefit from it automatically without any code changes.


#SpeculativeDecoding #InferenceOptimization #DraftModels #Performance #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
