Learn Agentic AI

Synthetic Data Generation for Fine-Tuning: Using LLMs to Create Training Data

Learn how to use large language models to generate, filter, and validate synthetic training data for fine-tuning smaller models, with techniques for ensuring quality, diversity, and deduplication.

Why Generate Synthetic Training Data

The biggest bottleneck in fine-tuning is not compute or infrastructure — it is high-quality training data. Expert annotation is expensive and slow. Production logs may not cover edge cases. Synthetic data generation uses a capable LLM (the "teacher") to create training examples for a smaller model (the "student").

This approach is used extensively in production. Many of the best open-source models were trained partly on synthetic data generated by larger models. The key is quality control — raw LLM output is not training-ready. It requires filtering, validation, and deduplication.

The Generation Pipeline

A robust synthetic data pipeline has four stages: seed creation, generation, filtering, and deduplication.

The first stage, seed creation, prompts the teacher model for a broad list of specific topics:
from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_seed_topics(domain: str, count: int = 50) -> list[str]:
    """Generate diverse seed topics for a domain."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate diverse, specific topics. Output one topic per line, no numbering."
            },
            {
                "role": "user",
                "content": f"List {count} diverse topics for a {domain} assistant. "
                           f"Cover common cases, edge cases, and tricky scenarios."
            },
        ],
        temperature=1.0,  # High temperature for diversity
    )
    topics = [
        line.strip()
        for line in response.choices[0].message.content.strip().split("\n")
        if line.strip()
    ]
    return topics

# Generate seeds
topics = generate_seed_topics("customer support for a SaaS billing platform")
print(f"Generated {len(topics)} seed topics")

Generating Training Examples

For each seed topic, generate a complete conversation. Use detailed system prompts to control the format and quality of the output.

GENERATION_PROMPT = """You are generating training data for a customer support AI.

Given a topic, create a realistic customer support interaction.

Requirements:
- The customer message should sound natural, as if written by a real person
- Include relevant details (account numbers, dates, specific issues)
- The assistant response should be helpful, accurate, and follow company policy
- Keep responses concise but complete
- Vary the tone: some customers are frustrated, some are polite, some are confused

Output EXACTLY this JSON format:
{
  "user_message": "the customer's message",
  "assistant_response": "the support agent's response"
}"""

def generate_example(
    topic: str,
    system_prompt: str,
    model: str = "gpt-4o",
) -> Optional[dict]:
    """Generate a single training example from a seed topic."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": GENERATION_PROMPT},
                {"role": "user", "content": f"Topic: {topic}"},
            ],
            temperature=0.8,
            response_format={"type": "json_object"},
        )
        data = json.loads(response.choices[0].message.content)

        if "user_message" not in data or "assistant_response" not in data:
            return None

        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": data["user_message"]},
                {"role": "assistant", "content": data["assistant_response"]},
            ]
        }
    except (json.JSONDecodeError, KeyError):
        return None

# Generate examples in batch
SYSTEM_PROMPT = "You are a helpful customer support agent for BillingPro, a SaaS billing platform."

def generate_batch(
    topics: list[str],
    system_prompt: str,
    examples_per_topic: int = 3,
) -> list[dict]:
    """Generate multiple examples per topic."""
    all_examples = []
    for topic in topics:
        for _ in range(examples_per_topic):
            example = generate_example(topic, system_prompt)
            if example:
                all_examples.append(example)
    print(f"Generated {len(all_examples)} examples from {len(topics)} topics")
    return all_examples
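Generating hundreds of examples sequentially can take hours because each API call blocks the next. A minimal sketch of a concurrent variant, assuming the generation callable (such as generate_example above, wrapped to take a single topic argument) is passed in as gen_fn:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_many(topics, gen_fn, examples_per_topic=3, max_workers=8):
    """Run gen_fn(topic) concurrently and drop failed (None) generations."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every (topic, attempt) pair up front
        futures = [
            pool.submit(gen_fn, topic)
            for topic in topics
            for _ in range(examples_per_topic)
        ]
        for future in as_completed(futures):
            example = future.result()
            if example is not None:
                results.append(example)
    return results
```

Threads are appropriate here because the work is I/O-bound API calls; keep max_workers modest so you stay within your provider's rate limits.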

Quality Filtering

Not all generated examples are good enough for training. Filter by length, coherence, and content quality.

def quality_filter(examples: list[dict]) -> list[dict]:
    """Filter examples based on quality heuristics."""
    filtered = []

    for ex in examples:
        messages = ex["messages"]
        user_msg = messages[1]["content"]
        assistant_msg = messages[2]["content"]

        # Length checks
        user_words = len(user_msg.split())
        assistant_words = len(assistant_msg.split())

        if user_words < 5 or user_words > 500:
            continue
        if assistant_words < 10 or assistant_words > 1000:
            continue

        # Content checks
        if assistant_msg.strip().startswith("I'm sorry, I can't"):
            continue

        # Check for placeholder text
        placeholders = ["[insert", "[your", "xxx", "placeholder"]
        if any(p in assistant_msg.lower() for p in placeholders):
            continue

        # Heuristic: responses far shorter than the question are likely low-effort
        if len(assistant_msg) < len(user_msg) * 0.3:
            continue

        filtered.append(ex)

    print(f"Quality filter: {len(filtered)}/{len(examples)} passed")
    return filtered

Deduplication for Synthetic Data

LLMs tend to generate similar outputs even with different seeds. Aggressive deduplication is essential.

import hashlib
from difflib import SequenceMatcher

def dedup_synthetic(examples: list[dict], threshold: float = 0.80) -> list[dict]:
    """Remove near-duplicate synthetic examples."""
    unique = []
    seen_hashes = set()

    for ex in examples:
        user_msg = ex["messages"][1]["content"]
        assistant_msg = ex["messages"][2]["content"]
        combined = user_msg + assistant_msg

        # Exact dedup
        content_hash = hashlib.md5(combined.encode()).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # Fuzzy dedup against all kept examples
        is_dup = False
        for kept in unique:
            kept_combined = kept["messages"][1]["content"] + kept["messages"][2]["content"]
            similarity = SequenceMatcher(None, combined, kept_combined).ratio()
            if similarity > threshold:
                is_dup = True
                break

        if not is_dup:
            unique.append(ex)

    print(f"Dedup: {len(unique)}/{len(examples)} unique")
    return unique
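SequenceMatcher compares every new example against every kept one, and each comparison is itself expensive on long strings. For larger datasets, one cheaper alternative is to precompute word-shingle sets per example and compare Jaccard similarity instead. The helper names below are hypothetical, and the 0.6 threshold is an assumption to tune on your own data:

```python
def shingles(text: str, n: int = 3) -> set:
    """Word n-grams as a cheap fingerprint; falls back for very short texts."""
    words = text.lower().split()
    if len(words) <= n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedup_by_shingles(examples: list[dict], threshold: float = 0.6) -> list[dict]:
    """Near-duplicate removal using one precomputed shingle set per example."""
    unique, kept_sets = [], []
    for ex in examples:
        combined = ex["messages"][1]["content"] + " " + ex["messages"][2]["content"]
        sig = shingles(combined)
        if any(jaccard(sig, s) > threshold for s in kept_sets):
            continue
        unique.append(ex)
        kept_sets.append(sig)
    return unique
```

The pair loop is still quadratic, but each comparison is a fast set intersection rather than a character-level alignment, which matters once you pass a few thousand examples.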

Full Pipeline
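The pipeline below calls filter_by_score, an LLM-as-judge scoring step that is not defined earlier in this article. One plausible sketch, in which the judge prompt wording, the 1-5 rubric, and the injectable score_fn parameter are all assumptions:

```python
import json

JUDGE_PROMPT = """Rate this customer support exchange from 1 to 5 on accuracy,
helpfulness, and naturalness. Output JSON: {"score": <number>}"""

def llm_score(example: dict) -> float:
    """Score one example with an LLM judge (requires the openai package)."""
    from openai import OpenAI  # imported lazily so offline scoring functions still work
    client = OpenAI()
    user_msg = example["messages"][1]["content"]
    assistant_msg = example["messages"][2]["content"]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Customer: {user_msg}\n\nAgent: {assistant_msg}"},
        ],
        temperature=0.0,  # deterministic judging
        response_format={"type": "json_object"},
    )
    return float(json.loads(response.choices[0].message.content)["score"])

def filter_by_score(examples, min_score=4.0, score_fn=llm_score):
    """Keep only examples the judge rates at or above min_score."""
    kept = [ex for ex in examples if score_fn(ex) >= min_score]
    print(f"Score filter: {len(kept)}/{len(examples)} passed")
    return kept
```

Making score_fn injectable lets you swap in a cheaper heuristic or a cached judge during development without touching the pipeline.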

def synthetic_data_pipeline(
    domain: str,
    system_prompt: str,
    target_count: int = 500,
) -> list[dict]:
    """End-to-end synthetic data generation pipeline."""
    topics = generate_seed_topics(domain, count=target_count // 2)
    raw = generate_batch(topics, system_prompt, examples_per_topic=3)
    cleaned = quality_filter(raw)
    scored = filter_by_score(cleaned, min_score=4.0)
    final = dedup_synthetic(scored, threshold=0.80)

    # Write to JSONL
    with open("synthetic_training_data.jsonl", "w") as f:
        for ex in final:
            f.write(json.dumps(ex) + "\n")

    return final

FAQ

Is it legal and ethical to use one model's outputs to train another?

OpenAI's terms allow using their API outputs to train models, including fine-tuning. However, some model licenses restrict using outputs to train competing models. Always check the terms of service for the specific API you use for generation. Ethically, be transparent about synthetic data usage and validate that generated data does not contain harmful biases or fabricated facts.

How do I ensure diversity in synthetic data so the model does not just learn one pattern?

Use three techniques: vary seed topics broadly, use high temperature (0.7-1.0) during generation, and explicitly prompt for different customer personas and scenarios. After generation, analyze the distribution of topics, tones, and response styles. If any category is under-represented, generate additional targeted examples for that category.
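The persona technique can be as simple as crossing every topic with a persona list before generation; a sketch with a hypothetical PERSONAS list and a deterministic shuffle:

```python
import itertools
import random

# Hypothetical persona list; extend with whatever tones fit your domain
PERSONAS = ["frustrated", "polite", "confused", "in a hurry"]

def diversified_prompts(topics: list[str], personas: list[str] = PERSONAS, seed: int = 0) -> list[str]:
    """Cross every topic with every persona, then shuffle reproducibly."""
    rng = random.Random(seed)
    prompts = [
        f"Topic: {topic}\nCustomer persona: {persona}"
        for topic, persona in itertools.product(topics, personas)
    ]
    rng.shuffle(prompts)
    return prompts
```

Each resulting string would be passed as the user message to the generation step, so the same billing topic is written once by a frustrated customer and once by a confused one.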

What ratio of synthetic to real data should I use?

Start with 100% synthetic data if you have no real data, then gradually replace synthetic examples with real ones as you collect production data. A common production ratio is 30-50% real data mixed with 50-70% synthetic data. Real data anchors the model to actual user patterns while synthetic data provides coverage for edge cases and rare scenarios.
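The ratio guidance above can be enforced programmatically. A sketch of a mixing helper, in which the function name and the default 40% real fraction are assumptions:

```python
import random

def mix_datasets(real: list[dict], synthetic: list[dict],
                 real_fraction: float = 0.4, seed: int = 0) -> list[dict]:
    """Build a training set where all real examples make up real_fraction
    of the mix, with the remainder sampled from the synthetic pool."""
    rng = random.Random(seed)
    n_real = len(real)
    # Total size implied by treating the real data as real_fraction of the mix
    total = round(n_real / real_fraction) if real_fraction > 0 else len(synthetic)
    n_synth = min(total - n_real, len(synthetic))
    mixed = list(real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed
```

As production data accumulates, rerunning this with the same real_fraction automatically shrinks the synthetic share of each training run.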


#SyntheticData #FineTuning #DataGeneration #LLM #TrainingData #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
