Learn Agentic AI

Speculative Decoding: Using Small Models to Speed Up Large Model Inference

Learn how speculative decoding uses lightweight draft models to generate candidate tokens that a large target model verifies in parallel, achieving 2-3x inference speedups without quality loss.

The Inference Bottleneck

Large language model inference is fundamentally bottlenecked by memory bandwidth, not compute. Each token generation requires loading billions of parameters from memory, but the actual computation per token is minimal. This means that whether you are generating one token or checking five candidate tokens, the wall-clock time is similar — the memory transfer dominates.
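A rough back-of-envelope sketch makes this concrete. The parameter count, precision, and bandwidth figures below are illustrative assumptions, not measurements:

```python
# Decoding is memory-bound: each generated token must stream the full
# set of weights from memory at least once.
params = 70e9           # assumed 70B-parameter model
bytes_per_param = 2     # fp16/bf16 weights
bandwidth = 2e12        # assumed ~2 TB/s of accelerator memory bandwidth

bytes_per_token = params * bytes_per_param
seconds_per_token = bytes_per_token / bandwidth
print(f"~{seconds_per_token * 1000:.0f} ms/token floor, "
      f"~{1 / seconds_per_token:.0f} tokens/s ceiling")
```

Under these assumptions the hardware caps sequential generation at roughly 14 tokens per second regardless of how little compute each token needs, which is exactly the headroom speculative decoding exploits.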

Speculative decoding exploits this insight: use a small, fast model to draft several tokens at once, then verify all of them in a single pass through the large model. If the large model agrees with the draft, you have generated multiple tokens in the time it would take to generate one.

How Speculative Decoding Works

The process has three phases:

Draft phase. A small model (the draft model) autoregressively generates K candidate tokens. Because the draft model is small, this is fast — often faster than a single forward pass of the target model.

Verify phase. The large target model processes all K draft tokens in a single forward pass, computing the probability distribution at each position. This is efficient because a forward pass over K tokens in parallel costs roughly the same wall-clock time as a pass over one token: the memory-bandwidth bottleneck means weight loading dominates, and the extra compute for K positions is nearly free.

Accept/reject phase. Each draft token is compared against the target model's distribution. Tokens are accepted or rejected using a modified rejection sampling scheme that preserves the exact output distribution of the target model.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def speculative_decode(
    draft_model,
    target_model,
    tokenizer,
    prompt: str,
    max_tokens: int = 100,
    draft_length: int = 5,
) -> str:
    """Speculative decoding with a draft model and target model."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = input_ids.clone()

    tokens_generated = 0
    while tokens_generated < max_tokens:
        # Phase 1: Draft K tokens with the small model
        # (recomputes the full prefix each step for clarity; real
        # implementations reuse a KV cache here)
        draft_ids = generated.clone()
        draft_probs_list = []

        for _ in range(draft_length):
            with torch.no_grad():
                draft_out = draft_model(draft_ids)
                draft_logits = draft_out.logits[:, -1, :]
                draft_probs = torch.softmax(draft_logits, dim=-1)
                draft_probs_list.append(draft_probs)
                next_token = torch.multinomial(draft_probs, 1)
                draft_ids = torch.cat([draft_ids, next_token], dim=1)

        # Phase 2: Verify all draft tokens with the target model
        with torch.no_grad():
            target_out = target_model(draft_ids)
            target_logits = target_out.logits

        # Phase 3: Accept or reject each draft token
        n_accepted = 0
        for i in range(draft_length):
            pos = generated.shape[1] + i
            target_probs = torch.softmax(target_logits[:, pos - 1, :], dim=-1)
            draft_token = draft_ids[:, pos]
            draft_p = draft_probs_list[i][:, draft_token].item()
            target_p = target_probs[:, draft_token].item()

            # Acceptance criterion preserving target distribution
            if np.random.random() < min(1.0, target_p / (draft_p + 1e-10)):
                n_accepted += 1
            else:
                # Reject: sample from adjusted distribution
                adjusted = torch.clamp(target_probs - draft_probs_list[i], min=0)
                adjusted = adjusted / adjusted.sum()
                new_token = torch.multinomial(adjusted, 1)
                generated = torch.cat([generated, draft_ids[:, generated.shape[1]:pos].reshape(1, -1), new_token], dim=1)
                tokens_generated += n_accepted + 1
                break
        else:
            # All draft tokens were accepted: also sample one bonus token
            # from the target model's distribution at the final position,
            # which the verify pass already computed for free
            bonus_probs = torch.softmax(target_logits[:, -1, :], dim=-1)
            bonus_token = torch.multinomial(bonus_probs, 1)
            generated = torch.cat([draft_ids, bonus_token], dim=1)
            tokens_generated += draft_length + 1

        if tokenizer.eos_token_id in generated[0, input_ids.shape[1]:]:
            break

    return tokenizer.decode(generated[0, input_ids.shape[1]:], skip_special_tokens=True)

Speedup Factors and Draft Model Selection

The speedup depends on the acceptance rate — how often the target model agrees with the draft model. A well-matched draft model that agrees 70-80% of the time typically yields 2-3x speedup. Poor matches drop to 1.2-1.5x or even no speedup.

Good draft model choices:

  • A smaller model from the same family (Llama-7B drafting for Llama-70B)
  • A quantized version of the target model
  • A model fine-tuned on similar data distributions
A quick way to turn a measured acceptance rate into an expected speedup:

def estimate_speedup(
    acceptance_rate: float, draft_length: int,
    draft_time_ms: float, target_time_ms: float,
) -> float:
    """Estimate the speculative decoding speedup factor."""
    # Expected tokens per speculation round (geometric series, including
    # the bonus token); the closed form is undefined at acceptance_rate == 1,
    # where every round yields draft_length + 1 tokens
    if acceptance_rate >= 1.0:
        expected_tokens = float(draft_length + 1)
    else:
        expected_tokens = (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)

    # Time per speculation round: K draft passes plus one target pass
    round_time = draft_length * draft_time_ms + target_time_ms

    # Standard autoregressive time for the same number of tokens
    standard_time = expected_tokens * target_time_ms

    return standard_time / round_time

For example, with a 50 ms target pass, a 5 ms draft pass, K=5, and a 70% acceptance rate, this estimates roughly a 2x speedup.

Implementation in Agent Pipelines

For agent developers using API-based inference, speculative decoding is typically handled by the serving infrastructure (vLLM, TensorRT-LLM, llama.cpp all support it). Your role is choosing the right draft model and tuning the draft length.

For self-hosted agents, enable speculative decoding in your serving framework. In vLLM, it is a configuration flag. The serving layer handles the draft-verify-accept cycle transparently, and your application code sees only faster token generation with identical output quality.
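As an illustration, the offline Python API for this has looked roughly like the following in recent vLLM releases. The argument names change between versions (newer releases take a speculative_config dict) and the model names here are placeholders, so check the documentation for your installed version:

```python
from vllm import LLM

# Hypothetical sketch: a large target model paired with a smaller
# draft model from the same family. Argument names vary across
# vLLM versions; consult your version's docs before copying this.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)
```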

FAQ

Does speculative decoding change the output quality?

No. The mathematical guarantee of speculative decoding is that the output distribution is identical to what the target model would produce on its own. The rejection sampling scheme ensures that accepted tokens follow the exact same probability distribution. You get speed without any quality tradeoff.
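This guarantee is easy to check empirically. The sketch below uses a toy three-token vocabulary with made-up target and draft distributions, and shows that the accept/reject rule recovers the target distribution exactly even though proposals come from the draft:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model distribution
q = np.array([0.3, 0.4, 0.3])   # draft model distribution

def speculative_sample(p, q, rng):
    x = rng.choice(3, p=q)                    # draft proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with prob min(1, p/q)
        return x
    residual = np.clip(p - q, 0, None)        # on rejection, resample from
    residual /= residual.sum()                # the normalized residual
    return rng.choice(3, p=residual)

samples = np.array([speculative_sample(p, q, rng) for _ in range(100_000)])
freqs = np.bincount(samples, minlength=3) / len(samples)
print(freqs)  # empirically close to p = [0.6, 0.3, 0.1]
```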

What draft length should I use?

Start with K=5 and tune based on your acceptance rate. Higher acceptance rates support longer draft lengths (K=8-10). Lower acceptance rates benefit from shorter drafts (K=3-4) because rejected tokens waste the draft model's compute. Monitor the acceptance rate in production and adjust accordingly.
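The tuning can be sketched with the same expected-tokens formula used for speedup estimation. All latency and acceptance numbers below are illustrative assumptions; in practice you would plug in measured values:

```python
# Sweep draft length K for a given acceptance rate and per-pass latencies.
def speedup(alpha: float, k: int, draft_ms: float, target_ms: float) -> float:
    # Expected accepted tokens per round (geometric series, incl. bonus token)
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected * target_ms / (k * draft_ms + target_ms)

alpha, draft_ms, target_ms = 0.7, 5.0, 50.0   # assumed measurements
best_k = max(range(1, 11), key=lambda k: speedup(alpha, k, draft_ms, target_ms))
print(best_k, round(speedup(alpha, best_k, draft_ms, target_ms), 2))
```

With these made-up numbers the sweep lands near K=4: past that point, the extra draft passes cost more than the marginal accepted tokens are worth.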

Can I use speculative decoding with API providers like OpenAI?

Not directly from your application code — the draft-verify cycle requires access to both models' logits during generation. However, API providers implement speculative decoding internally on their serving infrastructure. You benefit from it automatically without any code changes.


#SpeculativeDecoding #InferenceOptimization #DraftModels #Performance #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
