Learn Agentic AI

Distillation: Training Smaller Models to Mimic Larger Ones for Production Use

Learn how to use knowledge distillation to transfer capabilities from large teacher models to smaller, cheaper student models suitable for production deployment, with concrete examples of cost savings and quality tradeoffs.

The Production Cost Problem

GPT-4o produces excellent results. It also costs $2.50 per million input tokens and $10 per million output tokens. At 100,000 requests per day with an average of 2,000 tokens per request, that is on the order of $15,000-30,000 per month, depending on the input/output split. A distilled GPT-4o-mini or fine-tuned Llama 3.1 8B can deliver 80-95% of the quality at 5-20% of the cost.

Knowledge distillation is the process of training a smaller "student" model to replicate the behavior of a larger "teacher" model. Unlike traditional fine-tuning where you need human-labeled data, distillation uses the teacher model itself to generate training data and labels.

The Distillation Pipeline

The basic approach is straightforward: send your production prompts to the teacher model, collect its responses, and fine-tune the student model on those input-output pairs.

from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_teacher_response(
    client: OpenAI,
    messages: list[dict],
    model: str = "gpt-4o",
) -> Optional[str]:
    """Get a deterministic response from the teacher model."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Teacher error: {e}")
        return None

def build_distillation_dataset(
    production_inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
    output_path: str = "distillation_data.jsonl",
) -> int:
    """Generate distillation training data from production inputs."""
    count = 0

    with open(output_path, "w") as f:
        for input_data in production_inputs:
            user_message = input_data["user_message"]
            messages = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": user_message})

            try:
                teacher_response = client.chat.completions.create(
                    model=teacher_model,
                    messages=messages,
                    temperature=0.0,
                ).choices[0].message.content
            except Exception as e:
                # Skip failed requests rather than aborting the whole run
                print(f"Teacher error: {e}")
                continue

            if teacher_response:
                training_example = {
                    "messages": messages + [
                        {"role": "assistant", "content": teacher_response}
                    ]
                }
                f.write(json.dumps(training_example) + "\n")
                count += 1

    print(f"Generated {count} distillation examples")
    return count
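A malformed line can fail an entire fine-tuning job, so it is worth validating the JSONL before uploading it. A minimal sketch, assuming the chat-format structure written above (the `validate_distillation_file` helper is illustrative, not part of any SDK):

```python
import json

def validate_distillation_file(path: str) -> tuple[int, list[str]]:
    """Check that every JSONL line is a valid chat-format training example."""
    errors = []
    valid = 0
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            messages = example.get("messages", [])
            roles = [m.get("role") for m in messages]
            # Each example must contain a user turn and end with the teacher's answer
            if not messages or roles[-1] != "assistant" or "user" not in roles:
                errors.append(f"line {i}: bad message structure {roles}")
                continue
            if any(not m.get("content") for m in messages):
                errors.append(f"line {i}: empty content")
                continue
            valid += 1
    return valid, errors
```

Running this before upload catches truncated writes and empty teacher responses early, when regenerating a handful of examples is cheap.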

Selective Distillation: Focus on What Matters

Not all teacher responses are worth learning from. A teacher that produces a mediocre response teaches mediocre behavior. Filter teacher responses before adding them to the training set.


def selective_distillation(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    judge_model: str = "gpt-4o-mini",
    quality_threshold: float = 4.0,
    system_prompt: str = "",
) -> list[dict]:
    """Generate and filter distillation data using a quality judge."""
    high_quality = []

    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_message})

        # Get teacher response (skip this input on API errors or empty output)
        try:
            teacher_response = client.chat.completions.create(
                model=teacher_model,
                messages=messages,
                temperature=0.0,
            ).choices[0].message.content
        except Exception as e:
            print(f"Teacher error: {e}")
            continue
        if not teacher_response:
            continue

        # Judge the quality
        judge_response = client.chat.completions.create(
            model=judge_model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rate this response on a 1-5 scale for accuracy, "
                        "helpfulness, and completeness. Output only the number."
                    ),
                },
                {
                    "role": "user",
                    "content": f"Question: {user_message}\n\nResponse: {teacher_response}",
                },
            ],
            temperature=0.0,
            max_tokens=5,
        )

        try:
            score = float(judge_response.choices[0].message.content.strip())
        except ValueError:
            continue

        if score >= quality_threshold:
            high_quality.append({
                "messages": messages + [
                    {"role": "assistant", "content": teacher_response}
                ],
                "quality_score": score,
            })

    print(f"Kept {len(high_quality)}/{len(inputs)} examples (score >= {quality_threshold})")
    return high_quality
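The filtered examples still carry the `quality_score` field, which a fine-tuning API will not expect in the training file; strip it when writing the final JSONL. A small helper following this article's conventions (the name and best-first ordering are our choices, not an SDK call):

```python
import json

def write_training_file(
    examples: list[dict],
    output_path: str = "filtered_distillation.jsonl",
) -> int:
    """Write judge-approved examples to JSONL, keeping only the messages."""
    # Sort best-first so truncating the file still keeps the strongest examples
    ranked = sorted(examples, key=lambda e: e.get("quality_score", 0), reverse=True)
    with open(output_path, "w") as f:
        for example in ranked:
            f.write(json.dumps({"messages": example["messages"]}) + "\n")
    return len(ranked)
```

The score-descending order also makes it easy to experiment with stricter cutoffs later by taking only the first N lines.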

Chain-of-Thought Distillation

For reasoning-heavy tasks, distill the teacher's reasoning process — not just its final answer. This transfers the problem-solving strategy, not merely the output.

def distill_with_reasoning(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
) -> list[dict]:
    """Distill chain-of-thought reasoning from teacher to student."""
    examples = []

    cot_system = (
        f"{system_prompt}\n\n"
        "Think through the problem step by step before giving your final answer. "
        "Format: first show your reasoning under '## Reasoning', "
        "then give the final answer under '## Answer'."
    )

    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = [
            {"role": "system", "content": cot_system},
            {"role": "user", "content": user_message},
        ]

        response = client.chat.completions.create(
            model=teacher_model,
            messages=messages,
            temperature=0.0,
        ).choices[0].message.content

        # Keep only well-formed responses containing both sections (and skip None)
        if response and "## Reasoning" in response and "## Answer" in response:
            examples.append({
                "messages": [
                    {"role": "system", "content": cot_system},
                    {"role": "user", "content": user_message},
                    {"role": "assistant", "content": response},
                ]
            })

    return examples
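At inference time you usually want only the final answer, with the reasoning logged separately rather than shown to users. A simple parser for the two-section format produced by the `cot_system` prompt above (the function name is ours):

```python
def split_cot_response(response: str) -> tuple[str, str]:
    """Split a '## Reasoning' / '## Answer' response into its two parts."""
    marker = "## Answer"
    if marker not in response:
        # Malformed output: treat the whole response as the answer
        return "", response.strip()
    reasoning_part, answer_part = response.split(marker, 1)
    reasoning = reasoning_part.replace("## Reasoning", "", 1).strip()
    return reasoning, answer_part.strip()
```

Keeping the parser forgiving matters in production: a student model occasionally drops the section headers, and returning the raw text as the answer degrades more gracefully than raising an error.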

Cost Analysis: Teacher vs Student

def calculate_distillation_roi(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    teacher_input_cost_per_m: float,   # e.g., $2.50 for GPT-4o
    teacher_output_cost_per_m: float,  # e.g., $10.00 for GPT-4o
    student_input_cost_per_m: float,   # e.g., $0.15 for GPT-4o-mini
    student_output_cost_per_m: float,  # e.g., $0.60 for GPT-4o-mini
    distillation_examples: int = 5000,
) -> dict:
    """Calculate the ROI of distillation."""
    # Monthly inference costs
    monthly_requests = daily_requests * 30
    monthly_input_tokens = monthly_requests * avg_input_tokens / 1_000_000
    monthly_output_tokens = monthly_requests * avg_output_tokens / 1_000_000

    teacher_monthly = (
        monthly_input_tokens * teacher_input_cost_per_m
        + monthly_output_tokens * teacher_output_cost_per_m
    )
    student_monthly = (
        monthly_input_tokens * student_input_cost_per_m
        + monthly_output_tokens * student_output_cost_per_m
    )

    # One-time distillation cost (generating training data with the teacher:
    # inputs billed at the input rate, outputs at the output rate)
    distillation_input_tokens = distillation_examples * avg_input_tokens
    distillation_output_tokens = distillation_examples * avg_output_tokens
    distillation_cost = (
        distillation_input_tokens / 1_000_000 * teacher_input_cost_per_m
        + distillation_output_tokens / 1_000_000 * teacher_output_cost_per_m
    )

    monthly_savings = teacher_monthly - student_monthly
    break_even_months = distillation_cost / monthly_savings if monthly_savings > 0 else float("inf")

    return {
        "teacher_monthly_cost": f"${teacher_monthly:,.2f}",
        "student_monthly_cost": f"${student_monthly:,.2f}",
        "monthly_savings": f"${monthly_savings:,.2f}",
        "distillation_cost": f"${distillation_cost:,.2f}",
        "break_even_months": round(break_even_months, 1),
        "annual_savings": f"${monthly_savings * 12 - distillation_cost:,.2f}",
    }

# Example: 50K requests/day, 500 input + 300 output tokens average
roi = calculate_distillation_roi(
    daily_requests=50_000,
    avg_input_tokens=500,
    avg_output_tokens=300,
    teacher_input_cost_per_m=2.50,
    teacher_output_cost_per_m=10.00,
    student_input_cost_per_m=0.15,
    student_output_cost_per_m=0.60,
)
# teacher_monthly: $6,375.00, student_monthly: $382.50, monthly_savings: $5,992.50
# break_even: well under a month, annual_savings: ~$71,900

FAQ

How much quality loss should I expect from distillation?

For well-defined tasks (classification, extraction, formatting), distilled models retain 90-98% of teacher quality. For open-ended generation and complex reasoning, expect 80-90%. Narrow tasks distill well; broad creative tasks distill poorly because the student cannot capture the teacher's full capability distribution.
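To put a number on retention for your own task, score both models on the same held-out set with the same judge (using the judge pattern from selective distillation above) and compare the means. A sketch of just the arithmetic, given two score lists:

```python
def quality_retention(teacher_scores: list[float], student_scores: list[float]) -> float:
    """Student quality as a percentage of teacher quality on a shared eval set."""
    if len(teacher_scores) != len(student_scores) or not teacher_scores:
        raise ValueError("Score lists must be the same non-zero length")
    teacher_mean = sum(teacher_scores) / len(teacher_scores)
    student_mean = sum(student_scores) / len(student_scores)
    return round(100 * student_mean / teacher_mean, 1)
```

Run this on a held-out set the student never trained on; retention measured on training prompts will be optimistically inflated.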

Should I distill within the same model family or cross-family?

Same-family distillation (GPT-4o to GPT-4o-mini, Llama 70B to 8B) works better because architectures share representations. Cross-family works but needs more data. Choose based on deployment needs — if you need self-hosted, distill to open-source regardless of teacher family.

How many distillation examples do I need?

For focused tasks, 1,000-3,000 high-quality examples suffice. For broader capabilities, aim for 5,000-10,000. Coverage matters more than volume — a thousand diverse examples beats ten thousand repetitive ones.
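One cheap way to enforce coverage is to drop near-duplicate prompts before generating teacher responses. A greedy sketch using token-set Jaccard similarity (the threshold and helper name are illustrative; embedding-based clustering scales better for large corpora):

```python
def deduplicate_inputs(inputs: list[dict], threshold: float = 0.8) -> list[dict]:
    """Greedily keep inputs whose token overlap with kept ones stays below threshold."""
    kept: list[dict] = []
    kept_tokens: list[set[str]] = []
    for item in inputs:
        tokens = set(item["user_message"].lower().split())
        is_duplicate = False
        for seen in kept_tokens:
            union = tokens | seen
            # Jaccard similarity: shared tokens over total distinct tokens
            if union and len(tokens & seen) / len(union) >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(item)
            kept_tokens.append(tokens)
    return kept
```

Deduplicating before calling the teacher also directly cuts the one-time data-generation cost, since you never pay for responses to redundant prompts.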


#KnowledgeDistillation #ModelCompression #FineTuning #ProductionML #CostOptimization #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
