Learn Agentic AI

Synthetic Data Generation for Fine-Tuning: Using LLMs to Create Training Data

Learn how to use large language models to generate, filter, and validate synthetic training data for fine-tuning smaller models, with techniques for ensuring quality, diversity, and deduplication.

Why Generate Synthetic Training Data

The biggest bottleneck in fine-tuning is not compute or infrastructure — it is high-quality training data. Expert annotation is expensive and slow. Production logs may not cover edge cases. Synthetic data generation uses a capable LLM (the "teacher") to create training examples for a smaller model (the "student").

This approach is used extensively in production. Many of the best open-source models were trained partly on synthetic data generated by larger models. The key is quality control — raw LLM output is not training-ready. It requires filtering, validation, and deduplication.

The Generation Pipeline

A robust synthetic data pipeline has four stages: seed creation, generation, filtering, and deduplication.

The first stage, seed creation, prompts the teacher model for a broad list of specific topics:
from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_seed_topics(domain: str, count: int = 50) -> list[str]:
    """Generate diverse seed topics for a domain."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate diverse, specific topics. Output one topic per line, no numbering."
            },
            {
                "role": "user",
                "content": f"List {count} diverse topics for a {domain} assistant. "
                           f"Cover common cases, edge cases, and tricky scenarios."
            },
        ],
        temperature=1.0,  # High temperature for diversity
    )
    topics = [
        line.strip()
        for line in response.choices[0].message.content.strip().split("\n")
        if line.strip()
    ]
    return topics

# Generate seeds
topics = generate_seed_topics("customer support for a SaaS billing platform")
print(f"Generated {len(topics)} seed topics")

Generating Training Examples

For each seed topic, generate a complete conversation. Use detailed system prompts to control the format and quality of the output.

GENERATION_PROMPT = """You are generating training data for a customer support AI.

Given a topic, create a realistic customer support interaction.

Requirements:
- The customer message should sound natural, as if written by a real person
- Include relevant details (account numbers, dates, specific issues)
- The assistant response should be helpful, accurate, and follow company policy
- Keep responses concise but complete
- Vary the tone: some customers are frustrated, some are polite, some are confused

Output EXACTLY this JSON format:
{
  "user_message": "the customer's message",
  "assistant_response": "the support agent's response"
}"""

def generate_example(
    topic: str,
    system_prompt: str,
    model: str = "gpt-4o",
) -> Optional[dict]:
    """Generate a single training example from a seed topic."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": GENERATION_PROMPT},
                {"role": "user", "content": f"Topic: {topic}"},
            ],
            temperature=0.8,
            response_format={"type": "json_object"},
        )
        data = json.loads(response.choices[0].message.content)

        if "user_message" not in data or "assistant_response" not in data:
            return None

        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": data["user_message"]},
                {"role": "assistant", "content": data["assistant_response"]},
            ]
        }
    except (json.JSONDecodeError, KeyError):
        return None

# Generate examples in batch
SYSTEM_PROMPT = "You are a helpful customer support agent for BillingPro, a SaaS billing platform."

def generate_batch(
    topics: list[str],
    system_prompt: str,
    examples_per_topic: int = 3,
) -> list[dict]:
    """Generate multiple examples per topic."""
    all_examples = []
    for topic in topics:
        for _ in range(examples_per_topic):
            example = generate_example(topic, system_prompt)
            if example:
                all_examples.append(example)
    print(f"Generated {len(all_examples)} examples from {len(topics)} topics")
    return all_examples
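Generating hundreds of examples sequentially can take hours because each API call blocks the next. A minimal sketch of a concurrent variant, assuming the generation callable (such as generate_example above, wrapped to take a single topic argument) is passed in as gen_fn:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_many(topics, gen_fn, examples_per_topic=3, max_workers=8):
    """Run gen_fn(topic) concurrently and drop failed (None) generations."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every (topic, attempt) pair up front
        futures = [
            pool.submit(gen_fn, topic)
            for topic in topics
            for _ in range(examples_per_topic)
        ]
        for future in as_completed(futures):
            example = future.result()
            if example is not None:
                results.append(example)
    return results
```

Threads are appropriate here because the work is I/O-bound API calls; keep max_workers modest so you stay within your provider's rate limits.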

Quality Filtering

Not all generated examples are good enough for training. Filter by length, coherence, and content quality.

def quality_filter(examples: list[dict]) -> list[dict]:
    """Filter examples based on quality heuristics."""
    filtered = []

    for ex in examples:
        messages = ex["messages"]
        user_msg = messages[1]["content"]
        assistant_msg = messages[2]["content"]

        # Length checks
        user_words = len(user_msg.split())
        assistant_words = len(assistant_msg.split())

        if user_words < 5 or user_words > 500:
            continue
        if assistant_words < 10 or assistant_words > 1000:
            continue

        # Content checks
        if assistant_msg.strip().startswith("I'm sorry, I can't"):
            continue

        # Check for placeholder text
        placeholders = ["[insert", "[your", "xxx", "placeholder"]
        if any(p in assistant_msg.lower() for p in placeholders):
            continue

        # Heuristic: responses far shorter than the question are likely low-effort
        if len(assistant_msg) < len(user_msg) * 0.3:
            continue

        filtered.append(ex)

    print(f"Quality filter: {len(filtered)}/{len(examples)} passed")
    return filtered

Deduplication for Synthetic Data

LLMs tend to generate similar outputs even with different seeds. Aggressive deduplication is essential.

import hashlib
from difflib import SequenceMatcher

def dedup_synthetic(examples: list[dict], threshold: float = 0.80) -> list[dict]:
    """Remove near-duplicate synthetic examples."""
    unique = []
    seen_hashes = set()

    for ex in examples:
        user_msg = ex["messages"][1]["content"]
        assistant_msg = ex["messages"][2]["content"]
        combined = user_msg + assistant_msg

        # Exact dedup
        content_hash = hashlib.md5(combined.encode()).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # Fuzzy dedup against all kept examples
        is_dup = False
        for kept in unique:
            kept_combined = kept["messages"][1]["content"] + kept["messages"][2]["content"]
            similarity = SequenceMatcher(None, combined, kept_combined).ratio()
            if similarity > threshold:
                is_dup = True
                break

        if not is_dup:
            unique.append(ex)

    print(f"Dedup: {len(unique)}/{len(examples)} unique")
    return unique
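SequenceMatcher compares every new example against every kept one, and each comparison is itself expensive on long strings. For larger datasets, one cheaper alternative is to precompute word-shingle sets per example and compare Jaccard similarity instead. The helper names below are hypothetical, and the 0.6 threshold is an assumption to tune on your own data:

```python
def shingles(text: str, n: int = 3) -> set:
    """Word n-grams as a cheap fingerprint; falls back for very short texts."""
    words = text.lower().split()
    if len(words) <= n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedup_by_shingles(examples: list[dict], threshold: float = 0.6) -> list[dict]:
    """Near-duplicate removal using one precomputed shingle set per example."""
    unique, kept_sets = [], []
    for ex in examples:
        combined = ex["messages"][1]["content"] + " " + ex["messages"][2]["content"]
        sig = shingles(combined)
        if any(jaccard(sig, s) > threshold for s in kept_sets):
            continue
        unique.append(ex)
        kept_sets.append(sig)
    return unique
```

The pair loop is still quadratic, but each comparison is a fast set intersection rather than a character-level alignment, which matters once you pass a few thousand examples.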

Full Pipeline
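The pipeline below calls filter_by_score, an LLM-as-judge scoring step that is not defined earlier in this article. One plausible sketch, in which the judge prompt wording, the 1-5 rubric, and the injectable score_fn parameter are all assumptions:

```python
import json

JUDGE_PROMPT = """Rate this customer support exchange from 1 to 5 on accuracy,
helpfulness, and naturalness. Output JSON: {"score": <number>}"""

def llm_score(example: dict) -> float:
    """Score one example with an LLM judge (requires the openai package)."""
    from openai import OpenAI  # imported lazily so offline scoring functions still work
    client = OpenAI()
    user_msg = example["messages"][1]["content"]
    assistant_msg = example["messages"][2]["content"]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Customer: {user_msg}\n\nAgent: {assistant_msg}"},
        ],
        temperature=0.0,  # deterministic judging
        response_format={"type": "json_object"},
    )
    return float(json.loads(response.choices[0].message.content)["score"])

def filter_by_score(examples, min_score=4.0, score_fn=llm_score):
    """Keep only examples the judge rates at or above min_score."""
    kept = [ex for ex in examples if score_fn(ex) >= min_score]
    print(f"Score filter: {len(kept)}/{len(examples)} passed")
    return kept
```

Making score_fn injectable lets you swap in a cheaper heuristic or a cached judge during development without touching the pipeline.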

def synthetic_data_pipeline(
    domain: str,
    system_prompt: str,
    target_count: int = 500,
) -> list[dict]:
    """End-to-end synthetic data generation pipeline."""
    topics = generate_seed_topics(domain, count=target_count // 2)
    raw = generate_batch(topics, system_prompt, examples_per_topic=3)
    cleaned = quality_filter(raw)
    scored = filter_by_score(cleaned, min_score=4.0)
    final = dedup_synthetic(scored, threshold=0.80)

    # Write to JSONL
    with open("synthetic_training_data.jsonl", "w") as f:
        for ex in final:
            f.write(json.dumps(ex) + "\n")

    return final

FAQ

Is it legal and ethical to use one model's outputs to train another?

OpenAI's terms allow using their API outputs to train models, including fine-tuning. However, some model licenses restrict using outputs to train competing models. Always check the terms of service for the specific API you use for generation. Ethically, be transparent about synthetic data usage and validate that generated data does not contain harmful biases or fabricated facts.

How do I ensure diversity in synthetic data so the model does not just learn one pattern?

Use three techniques: vary seed topics broadly, use high temperature (0.7-1.0) during generation, and explicitly prompt for different customer personas and scenarios. After generation, analyze the distribution of topics, tones, and response styles. If any category is under-represented, generate additional targeted examples for that category.
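The persona technique can be as simple as crossing every topic with a persona list before generation; a sketch with a hypothetical PERSONAS list and a deterministic shuffle:

```python
import itertools
import random

# Hypothetical persona list; extend with whatever tones fit your domain
PERSONAS = ["frustrated", "polite", "confused", "in a hurry"]

def diversified_prompts(topics: list[str], personas: list[str] = PERSONAS, seed: int = 0) -> list[str]:
    """Cross every topic with every persona, then shuffle reproducibly."""
    rng = random.Random(seed)
    prompts = [
        f"Topic: {topic}\nCustomer persona: {persona}"
        for topic, persona in itertools.product(topics, personas)
    ]
    rng.shuffle(prompts)
    return prompts
```

Each resulting string would be passed as the user message to the generation step, so the same billing topic is written once by a frustrated customer and once by a confused one.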

What ratio of synthetic to real data should I use?

Start with 100% synthetic data if you have no real data, then gradually replace synthetic examples with real ones as you collect production data. A common production ratio is 30-50% real data mixed with 50-70% synthetic data. Real data anchors the model to actual user patterns while synthetic data provides coverage for edge cases and rare scenarios.
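The ratio guidance above can be enforced programmatically. A sketch of a mixing helper, in which the function name and the default 40% real fraction are assumptions:

```python
import random

def mix_datasets(real: list[dict], synthetic: list[dict],
                 real_fraction: float = 0.4, seed: int = 0) -> list[dict]:
    """Build a training set where all real examples make up real_fraction
    of the mix, with the remainder sampled from the synthetic pool."""
    rng = random.Random(seed)
    n_real = len(real)
    # Total size implied by treating the real data as real_fraction of the mix
    total = round(n_real / real_fraction) if real_fraction > 0 else len(synthetic)
    n_synth = min(total - n_real, len(synthetic))
    mixed = list(real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed
```

As production data accumulates, rerunning this with the same real_fraction automatically shrinks the synthetic share of each training run.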


#SyntheticData #FineTuning #DataGeneration #LLM #TrainingData #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
