---
title: "Distillation: Training Smaller Models to Mimic Larger Ones for Production Use"
description: "Learn how to use knowledge distillation to transfer capabilities from large teacher models to smaller, cheaper student models suitable for production deployment, with concrete examples of cost savings and quality tradeoffs."
canonical: https://callsphere.ai/blog/knowledge-distillation-training-smaller-models-production
category: "Learn Agentic AI"
tags: ["Knowledge Distillation", "Model Compression", "Fine-Tuning", "Production ML", "Cost Optimization"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T02:36:10.163Z
---

# Distillation: Training Smaller Models to Mimic Larger Ones for Production Use

> Learn how to use knowledge distillation to transfer capabilities from large teacher models to smaller, cheaper student models suitable for production deployment, with concrete examples of cost savings and quality tradeoffs.

## The Production Cost Problem

GPT-4o produces excellent results. It also costs $2.50 per million input tokens and $10 per million output tokens. At 100,000 requests per day with an average of 2,000 tokens per request, that is roughly 6 billion tokens per month — about $15,000-60,000 depending on the input/output split. A distilled GPT-4o-mini or fine-tuned Llama 3.1 8B can deliver 80-95% of the quality at 5-20% of the cost.

Knowledge distillation is the process of training a smaller "student" model to replicate the behavior of a larger "teacher" model. Unlike traditional fine-tuning where you need human-labeled data, distillation uses the teacher model itself to generate training data and labels.

## The Distillation Pipeline

The basic approach is straightforward: send your production prompts to the teacher model, collect its responses, and fine-tune the student model on those input-output pairs.

```mermaid
flowchart LR
    DATA[("Curated dataset
instruction or chat")]
    CLEAN["Clean and dedupe
PII filter"]
    TOK["Tokenize and pack"]
    METHOD{"Method"}
    LORA["LoRA or QLoRA
adapters only"]
    SFT["Full SFT
all params"]
    DPO["DPO or RLHF
preference learning"]
    EVAL["Held out eval
plus regression suite"]
    DEPLOY[("Adapter or
merged model")]
    DATA --> CLEAN --> TOK --> METHOD
    METHOD --> LORA --> EVAL
    METHOD --> SFT --> EVAL
    METHOD --> DPO --> EVAL
    EVAL --> DEPLOY
    style METHOD fill:#4f46e5,stroke:#4338ca,color:#fff
    style EVAL fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DEPLOY fill:#059669,stroke:#047857,color:#fff
```

```python
from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_teacher_response(
    client: OpenAI,
    messages: list[dict],
    model: str = "gpt-4o",
) -> Optional[str]:
    """Get a response from the teacher model, or None on failure."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Teacher error: {e}")
        return None

def build_distillation_dataset(
    production_inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
    output_path: str = "distillation_data.jsonl",
) -> int:
    """Generate distillation training data from production inputs."""
    count = 0

    with open(output_path, "w") as f:
        for input_data in production_inputs:
            user_message = input_data["user_message"]
            messages = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": user_message})

            # Skip inputs the teacher fails on rather than crashing the run
            teacher_response = generate_teacher_response(
                client, messages, model=teacher_model
            )

            if teacher_response:
                training_example = {
                    "messages": messages + [
                        {"role": "assistant", "content": teacher_response}
                    ]
                }
                f.write(json.dumps(training_example) + "\n")
                count += 1

    print(f"Generated {count} distillation examples")
    return count
```

## Selective Distillation: Focus on What Matters

Not all teacher responses are worth learning from. A teacher that produces a mediocre response teaches mediocre behavior. Filter teacher responses before adding them to the training set.

```python
def selective_distillation(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    judge_model: str = "gpt-4o-mini",
    quality_threshold: float = 4.0,
    system_prompt: str = "",
) -> list[dict]:
    """Generate and filter distillation data using a quality judge."""
    high_quality = []

    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_message})

        # Get teacher response; skip inputs the teacher fails on
        teacher_response = client.chat.completions.create(
            model=teacher_model,
            messages=messages,
            temperature=0.0,
        ).choices[0].message.content
        if not teacher_response:
            continue

        # Judge the quality
        judge_response = client.chat.completions.create(
            model=judge_model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rate this response on a 1-5 scale for accuracy, "
                        "helpfulness, and completeness. Output only the number."
                    ),
                },
                {
                    "role": "user",
                    "content": f"Question: {user_message}\n\nResponse: {teacher_response}",
                },
            ],
            temperature=0.0,
            max_tokens=5,
        )

        try:
            score = float(judge_response.choices[0].message.content.strip())
        except ValueError:
            continue

        if score >= quality_threshold:
            high_quality.append({
                "messages": messages + [
                    {"role": "assistant", "content": teacher_response}
                ],
                "quality_score": score,
            })

    print(f"Kept {len(high_quality)}/{len(inputs)} examples (score >= {quality_threshold})")
    return high_quality
```
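
`selective_distillation` returns an in-memory list rather than a file, so one more step is needed before fine-tuning: serializing the kept examples to JSONL and stripping the bookkeeping `quality_score` field, which the fine-tuning format does not expect. A small helper (hypothetical name) along these lines:

```python
import json

def write_training_jsonl(examples: list[dict], output_path: str) -> int:
    """Write filtered examples to JSONL, keeping only the messages field."""
    count = 0
    with open(output_path, "w") as f:
        for example in examples:
            # Drop metadata like quality_score; training expects only messages
            f.write(json.dumps({"messages": example["messages"]}) + "\n")
            count += 1
    return count
```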

## Chain-of-Thought Distillation

For reasoning-heavy tasks, distill the teacher's reasoning process — not just its final answer. This transfers the problem-solving strategy, not merely the output.

```python
def distill_with_reasoning(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
) -> list[dict]:
    """Distill chain-of-thought reasoning from teacher to student."""
    examples = []

    cot_system = (
        f"{system_prompt}\n\n"
        "Think through the problem step by step before giving your final answer. "
        "Format: first show your reasoning under '## Reasoning', "
        "then give the final answer under '## Answer'."
    )

    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = [
            {"role": "system", "content": cot_system},
            {"role": "user", "content": user_message},
        ]

        response = client.chat.completions.create(
            model=teacher_model,
            messages=messages,
            temperature=0.0,
        ).choices[0].message.content

        # Verify the response exists and contains both sections
        if response and "## Reasoning" in response and "## Answer" in response:
            examples.append({
                "messages": [
                    {"role": "system", "content": cot_system},
                    {"role": "user", "content": user_message},
                    {"role": "assistant", "content": response},
                ]
            })

    return examples
```
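
At inference time you usually want only the final answer from a CoT-distilled student, not the reasoning trace. A small parser (hypothetical helper) can split on the same section header the training format enforces:

```python
def extract_final_answer(response: str) -> str:
    """Return the text under '## Answer', or the whole response as a fallback."""
    marker = "## Answer"
    if marker in response:
        return response.split(marker, 1)[1].strip()
    # Fallback: the student skipped the format; return everything
    return response.strip()
```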

## Cost Analysis: Teacher vs Student

```python
def calculate_distillation_roi(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    teacher_input_cost_per_m: float,   # e.g., $2.50 for GPT-4o
    teacher_output_cost_per_m: float,  # e.g., $10.00 for GPT-4o
    student_input_cost_per_m: float,   # e.g., $0.15 for GPT-4o-mini
    student_output_cost_per_m: float,  # e.g., $0.60 for GPT-4o-mini
    distillation_examples: int = 5000,
) -> dict:
    """Calculate the ROI of distillation (data-generation cost only;
    fine-tuning training fees are billed separately and not modeled here)."""
    # Monthly inference costs
    monthly_requests = daily_requests * 30
    monthly_input_tokens = monthly_requests * avg_input_tokens / 1_000_000
    monthly_output_tokens = monthly_requests * avg_output_tokens / 1_000_000

    teacher_monthly = (
        monthly_input_tokens * teacher_input_cost_per_m
        + monthly_output_tokens * teacher_output_cost_per_m
    )
    student_monthly = (
        monthly_input_tokens * student_input_cost_per_m
        + monthly_output_tokens * student_output_cost_per_m
    )

    # One-time distillation cost (generating training data with the teacher:
    # prompts are billed at the input rate, completions at the output rate)
    distillation_input_tokens = distillation_examples * avg_input_tokens
    distillation_output_tokens = distillation_examples * avg_output_tokens
    distillation_cost = (
        distillation_input_tokens / 1_000_000 * teacher_input_cost_per_m
        + distillation_output_tokens / 1_000_000 * teacher_output_cost_per_m
    )

    monthly_savings = teacher_monthly - student_monthly
    break_even_months = distillation_cost / monthly_savings if monthly_savings > 0 else float("inf")

    return {
        "teacher_monthly_cost": f"${teacher_monthly:,.2f}",
        "student_monthly_cost": f"${student_monthly:,.2f}",
        "monthly_savings": f"${monthly_savings:,.2f}",
        "distillation_cost": f"${distillation_cost:,.2f}",
        "break_even_months": round(break_even_months, 1),
        "annual_savings": f"${monthly_savings * 12 - distillation_cost:,.2f}",
    }

# Example: 50K requests/day, 500 input + 300 output tokens average
roi = calculate_distillation_roi(
    daily_requests=50_000,
    avg_input_tokens=500,
    avg_output_tokens=300,
    teacher_input_cost_per_m=2.50,
    teacher_output_cost_per_m=10.00,
    student_input_cost_per_m=0.15,
    student_output_cost_per_m=0.60,
)
# teacher_monthly: $6,375, student_monthly: $382.50, monthly_savings: ~$5,992
# break_even: 0.0 months, annual_savings: ~$72,000
```

## FAQ

### How much quality loss should I expect from distillation?

For well-defined tasks (classification, extraction, formatting), distilled models retain 90-98% of teacher quality. For open-ended generation and complex reasoning, expect 80-90%. Narrow tasks distill well; broad creative tasks distill poorly because the student cannot capture the teacher's full capability distribution.
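
For the well-defined end of that spectrum, retention can be measured directly as teacher-student agreement on a held-out set. A minimal sketch, assuming short classification-style outputs where normalized exact match is meaningful; for open-ended generation you would substitute an LLM judge:

```python
def agreement_rate(teacher_outputs: list[str], student_outputs: list[str]) -> float:
    """Fraction of held-out examples where student matches teacher after normalization."""
    assert len(teacher_outputs) == len(student_outputs)

    def norm(s: str) -> str:
        # Case- and whitespace-insensitive comparison
        return " ".join(s.lower().split())

    if not teacher_outputs:
        return 0.0
    matches = sum(norm(t) == norm(s) for t, s in zip(teacher_outputs, student_outputs))
    return matches / len(teacher_outputs)
```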

### Should I distill within the same model family or cross-family?

Same-family distillation (GPT-4o to GPT-4o-mini, Llama 70B to 8B) works better because architectures share representations. Cross-family works but needs more data. Choose based on deployment needs — if you need self-hosted, distill to open-source regardless of teacher family.

### How many distillation examples do I need?

For focused tasks, 1,000-3,000 high-quality examples suffice. For broader capabilities, aim for 5,000-10,000. Coverage matters more than volume — a thousand diverse examples beats ten thousand repetitive ones.
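
A cheap way to bias toward coverage is to drop near-identical prompts before spending teacher tokens on them. A minimal sketch using exact dedup on normalized text; embedding-based near-duplicate detection is a heavier alternative when paraphrases are common:

```python
def dedupe_inputs(inputs: list[dict]) -> list[dict]:
    """Drop inputs whose normalized user_message was already seen."""
    seen: set[str] = set()
    unique = []
    for item in inputs:
        # Normalize case and whitespace so trivial variants collapse
        key = " ".join(item["user_message"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```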

---

#KnowledgeDistillation #ModelCompression #FineTuning #ProductionML #CostOptimization #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/knowledge-distillation-training-smaller-models-production
