---
title: "Speculative Decoding: Using Small Models to Speed Up Large Model Inference"
description: "Learn how speculative decoding uses lightweight draft models to generate candidate tokens that a large target model verifies in parallel, achieving 2-3x inference speedups without quality loss."
canonical: https://callsphere.ai/blog/speculative-decoding-small-models-speed-up-large-model-inference
category: "Learn Agentic AI"
tags: ["Speculative Decoding", "Inference Optimization", "Draft Models", "Performance", "Agentic AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-08T09:30:25.490Z
---

# Speculative Decoding: Using Small Models to Speed Up Large Model Inference

> Learn how speculative decoding uses lightweight draft models to generate candidate tokens that a large target model verifies in parallel, achieving 2-3x inference speedups without quality loss.

## The Inference Bottleneck

Large language model inference is fundamentally bottlenecked by memory bandwidth, not compute. Each token generation requires loading billions of parameters from memory, but the actual computation per token is minimal. This means that whether you are generating one token or checking five candidate tokens, the wall-clock time is similar — the memory transfer dominates.
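A rough back-of-the-envelope calculation shows why. The numbers below are illustrative assumptions (a hypothetical 70B-parameter model in fp16 on an accelerator with roughly 2 TB/s of memory bandwidth), not measurements:

```python
# Rough lower bound on per-token latency when decoding is memory-bandwidth bound.
# All figures are illustrative assumptions, not benchmarks.
params = 70e9          # hypothetical 70B-parameter model
bytes_per_param = 2    # fp16 weights
bandwidth = 2e12       # ~2 TB/s of accelerator memory bandwidth

weight_bytes = params * bytes_per_param       # ~140 GB read per decode step
min_latency_s = weight_bytes / bandwidth      # ~0.07 s, i.e. ~70 ms per token
print(f"~{min_latency_s * 1000:.0f} ms per token just to stream the weights")
```

The arithmetic per token barely registers next to that memory traffic, which is why verifying several candidate tokens in one pass costs about as much as generating one.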

Speculative decoding exploits this insight: use a small, fast model to draft several tokens at once, then verify all of them in a single pass through the large model. If the large model agrees with the draft, you have generated multiple tokens in the time it would take to generate one.

## How Speculative Decoding Works

The process has three phases:

```mermaid
flowchart LR
    CTX(["Prompt + tokens so far"])
    DRAFT["Draft phase
small model proposes K tokens"]
    VERIFY["Verify phase
target model scores all K
in one forward pass"]
    ACCEPT{"Accept or
reject each token?"}
    KEEP["Append accepted tokens
+ one target token"]
    RESAMPLE["Resample rejected position
from target distribution"]
    OUT(["Continue or finish"])
    CTX --> DRAFT --> VERIFY --> ACCEPT
    ACCEPT -->|Accepted| KEEP
    ACCEPT -->|Rejected| RESAMPLE --> KEEP
    KEEP --> OUT
    style DRAFT fill:#4f46e5,stroke:#4338ca,color:#fff
    style VERIFY fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style KEEP fill:#0ea5e9,stroke:#0369a1,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

**Draft phase.** A small model (the draft model) autoregressively generates K candidate tokens. Because the draft model is small, this is fast — often faster than a single forward pass of the target model.

**Verify phase.** The large target model processes all K draft tokens in a single forward pass, computing its probability distribution for each position. This is efficient because a forward pass over K tokens in parallel costs roughly the same as generating one token: both are dominated by the time spent streaming the weights from memory.

**Accept/reject phase.** Each draft token is compared against the target model's distribution. Tokens are accepted or rejected using a modified rejection sampling scheme that preserves the exact output distribution of the target model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def speculative_decode(
    draft_model,
    target_model,
    tokenizer,
    prompt: str,
    max_tokens: int = 100,
    draft_length: int = 5,
) -> str:
    """Greedy speculative decoding (exact-match acceptance is lossless for argmax decoding)."""
    generated = tokenizer.encode(prompt, return_tensors="pt")
    tokens_generated = 0
    while tokens_generated < max_tokens:
        # Draft phase: the small model proposes draft_length candidate tokens.
        draft_ids = generated
        for _ in range(draft_length):
            next_id = draft_model(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        draft_tokens = draft_ids[:, generated.shape[1]:]

        # Verify phase: one target forward pass scores every drafted position.
        target_logits = target_model(draft_ids).logits
        start = generated.shape[1] - 1
        target_preds = target_logits[:, start:-1, :].argmax(-1)

        # Accept/reject phase: keep drafts while they match the target's choice.
        num_accepted = 0
        while (num_accepted < draft_length
               and draft_tokens[0, num_accepted] == target_preds[0, num_accepted]):
            num_accepted += 1

        # Append accepted drafts plus one target token (the correction on a
        # mismatch, or a bonus token if every draft was accepted).
        bonus = target_logits[:, start + num_accepted, :].argmax(-1, keepdim=True)
        generated = torch.cat([generated, draft_tokens[:, :num_accepted], bonus], dim=-1)
        tokens_generated += num_accepted + 1
        if bonus.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0], skip_special_tokens=True)


def estimate_speedup(
    acceptance_rate: float,
    draft_length: int,
    draft_time_ms: float,
    target_time_ms: float,
) -> float:
    """Estimate speculative decoding speedup factor."""
    # Expected tokens per speculation round (geometric series in the acceptance rate)
    expected_tokens = (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)
    # Time per speculation round: draft_length draft steps plus one verification pass
    round_time = draft_length * draft_time_ms + target_time_ms
    # Standard autoregressive time for the same number of tokens
    standard_time = expected_tokens * target_time_ms
    return standard_time / round_time
```
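As a quick sanity check, here is the estimator applied to illustrative timings. The numbers (an 80% acceptance rate, a 2 ms draft step, a 30 ms target forward pass) are assumptions chosen for the example, not measurements:

```python
# Illustrative numbers only: 80% acceptance, 2 ms per draft step,
# 30 ms per target forward pass, 5 drafted tokens per round.
speedup = estimate_speedup(
    acceptance_rate=0.8,
    draft_length=5,
    draft_time_ms=2.0,
    target_time_ms=30.0,
)
print(f"Estimated speedup: {speedup:.1f}x")  # ~2.8x with these assumptions
```

Under these assumptions the estimate lands in the 2-3x range typically reported for well-matched draft models.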

## Implementation in Agent Pipelines

For agent developers using API-based inference, speculative decoding is typically handled by the serving infrastructure (vLLM, TensorRT-LLM, llama.cpp all support it). Your role is choosing the right draft model and tuning the draft length.

For self-hosted agents, enable speculative decoding in your serving framework. In vLLM, it is a configuration flag. The serving layer handles the draft-verify-accept cycle transparently, and your application code sees only faster token generation with identical output quality.
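For example, with vLLM's offline `LLM` API the setup looks roughly like the sketch below. The exact argument names have changed across vLLM releases (newer versions take a `speculative_config` dict, older ones used separate `speculative_model`/`num_speculative_tokens` arguments), and the model names here are placeholders, so treat this as an assumption to check against your installed version's documentation:

```python
from vllm import LLM, SamplingParams

# Sketch only: speculative decoding argument names vary by vLLM version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",        # target model (example)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # small draft model (example)
        "num_speculative_tokens": 5,                  # draft length K
    },
)
outputs = llm.generate(["Summarize this ticket:"], SamplingParams(max_tokens=256))
```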

## FAQ

### Does speculative decoding change the output quality?

No. The mathematical guarantee of speculative decoding is that the output distribution is identical to what the target model would produce on its own. The rejection sampling scheme ensures that accepted tokens follow the exact same probability distribution. You get speed without any quality tradeoff.
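For readers who want to see the rule itself, here is a minimal sketch of the standard per-token accept/resample step; the helper name and shapes are illustrative, and `p_draft`/`p_target` are the full next-token distributions at one position:

```python
import torch

def accept_or_resample(draft_token: int, p_draft: torch.Tensor, p_target: torch.Tensor) -> int:
    """One position of the modified rejection sampling step (illustrative helper)."""
    # Accept the drafted token x with probability min(1, p_target[x] / p_draft[x]).
    accept_prob = torch.clamp(p_target[draft_token] / p_draft[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token
    # Otherwise resample from the residual distribution max(0, p_target - p_draft),
    # renormalized. This correction is what keeps the overall output distribution
    # identical to sampling from the target model alone.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, 1).item()
```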

### What draft length should I use?

Start with K=5 and tune based on your acceptance rate. Higher acceptance rates support longer draft lengths (K=8-10). Lower acceptance rates benefit from shorter drafts (K=3-4) because rejected tokens waste the draft model's compute. Monitor the acceptance rate in production and adjust accordingly.
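The `estimate_speedup` helper from the snippet above makes this tradeoff concrete. With illustrative timings of 2 ms per draft step and 30 ms per target pass (assumptions, not measurements), sweeping K at two acceptance rates shows why the optimum shifts:

```python
# Illustrative timings: 2 ms per draft step, 30 ms per target forward pass.
for acceptance_rate in (0.6, 0.9):
    best_k = max(
        range(2, 11),
        key=lambda k: estimate_speedup(acceptance_rate, k, 2.0, 30.0),
    )
    print(f"acceptance={acceptance_rate}: best draft length K={best_k}")
# With these timings the sweep favors a short draft (K=4) at 60% acceptance
# and a long one (K=10) at 90% acceptance.
```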

### Can I use speculative decoding with API providers like OpenAI?

Not directly from your application code — the draft-verify cycle requires access to both models' logits during generation. However, API providers implement speculative decoding internally on their serving infrastructure. You benefit from it automatically without any code changes.

---

#SpeculativeDecoding #InferenceOptimization #DraftModels #Performance #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/speculative-decoding-small-models-speed-up-large-model-inference
