---
title: "Synthetic Data Generation for Fine-Tuning: Using LLMs to Create Training Data"
description: "Learn how to use large language models to generate, filter, and validate synthetic training data for fine-tuning smaller models, with techniques for ensuring quality, diversity, and deduplication."
canonical: https://callsphere.ai/blog/synthetic-data-generation-fine-tuning-llms-training-data
category: "Learn Agentic AI"
tags: ["Synthetic Data", "Fine-Tuning", "Data Generation", "LLM", "Training Data"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T20:16:58.476Z
---

# Synthetic Data Generation for Fine-Tuning: Using LLMs to Create Training Data

> Learn how to use large language models to generate, filter, and validate synthetic training data for fine-tuning smaller models, with techniques for ensuring quality, diversity, and deduplication.

## Why Generate Synthetic Training Data

The biggest bottleneck in fine-tuning is not compute or infrastructure — it is high-quality training data. Expert annotation is expensive and slow. Production logs may not cover edge cases. Synthetic data generation uses a capable LLM (the "teacher") to create training examples for a smaller model (the "student").

This approach is used extensively in production. Many of the best open-source models were trained partly on synthetic data generated by larger models. The key is quality control — raw LLM output is not training-ready. It requires filtering, validation, and deduplication.

## The Generation Pipeline

A robust synthetic data pipeline has four stages: seed creation, generation, filtering, and deduplication.

```mermaid
flowchart LR
    SEED["Seed topics
diverse, per domain"]
    GEN["Generate examples
teacher LLM"]
    FILTER["Quality filter
heuristics plus LLM judge"]
    DEDUP["Dedupe
exact plus fuzzy"]
    OUT[("Training JSONL")]
    SEED --> GEN --> FILTER --> DEDUP --> OUT
    style FILTER fill:#4f46e5,stroke:#4338ca,color:#fff
    style DEDUP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from openai import OpenAI
import json
import random
from typing import Optional

client = OpenAI()

def generate_seed_topics(domain: str, count: int = 50) -> list[str]:
    """Generate diverse seed topics for a domain."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate diverse, specific topics. Output one topic per line, no numbering."
            },
            {
                "role": "user",
                "content": f"List {count} diverse topics for a {domain} assistant. "
                           f"Cover common cases, edge cases, and tricky scenarios."
            },
        ],
        temperature=1.0,  # High temperature for diversity
    )
    topics = [
        line.strip()
        for line in response.choices[0].message.content.strip().split("\n")
        if line.strip()
    ]
    return topics

# Generate seeds
topics = generate_seed_topics("customer support for a SaaS billing platform")
print(f"Generated {len(topics)} seed topics")
```

## Generating Training Examples

For each seed topic, generate a complete conversation. Use detailed system prompts to control the format and quality of the output.

```python
GENERATION_PROMPT = """You are generating training data for a customer support AI.

Given a topic, create a realistic customer support interaction.

Requirements:
- The customer message should sound natural, as if written by a real person
- Include relevant details (account numbers, dates, specific issues)
- The assistant response should be helpful, accurate, and follow company policy
- Keep responses concise but complete
- Vary the tone: some customers are frustrated, some are polite, some are confused

Output EXACTLY this JSON format:
{
  "user_message": "the customer's message",
  "assistant_response": "the support agent's response"
}"""

def generate_example(
    topic: str,
    system_prompt: str,
    model: str = "gpt-4o",
) -> Optional[dict]:
    """Generate a single training example from a seed topic."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": GENERATION_PROMPT},
                {"role": "user", "content": f"Topic: {topic}"},
            ],
            temperature=0.8,
            response_format={"type": "json_object"},
        )
        data = json.loads(response.choices[0].message.content)

        if "user_message" not in data or "assistant_response" not in data:
            return None

        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": data["user_message"]},
                {"role": "assistant", "content": data["assistant_response"]},
            ]
        }
    except (json.JSONDecodeError, KeyError):
        return None

# Generate examples in batch
SYSTEM_PROMPT = "You are a helpful customer support agent for BillingPro, a SaaS billing platform."

def generate_batch(
    topics: list[str],
    system_prompt: str,
    examples_per_topic: int = 3,
) -> list[dict]:
    """Generate multiple examples per topic."""
    all_examples = []
    for topic in topics:
        for _ in range(examples_per_topic):
            example = generate_example(topic, system_prompt)
            if example:
                all_examples.append(example)
    print(f"Generated {len(all_examples)} examples from {len(topics)} topics")
    return all_examples
```

## Quality Filtering

Not all generated examples are good enough for training. Filter by length, coherence, and content quality.

```python
def quality_filter(examples: list[dict]) -> list[dict]:
    """Filter examples based on quality heuristics."""
    filtered = []

    for ex in examples:
        messages = ex["messages"]
        user_msg = messages[1]["content"]
        assistant_msg = messages[2]["content"]

        # Length checks
        user_words = len(user_msg.split())
        assistant_words = len(assistant_msg.split())

        if user_words < 5 or user_words > 500:
            continue
        if assistant_words < 10 or assistant_words > 1000:
            continue

        # Content checks
        if assistant_msg.strip().startswith("I'm sorry, I can't"):
            continue

        # Check for placeholder text
        placeholders = ["[insert", "[your", "xxx", "placeholder"]
        if any(p in assistant_msg.lower() for p in placeholders):
            continue

        # Check assistant actually addresses the user's question
        # (crude proxy: the reply should not be trivially short)
        if len(assistant_msg) < len(user_msg) // 2:
            continue

        filtered.append(ex)

    print(f"Quality filter: {len(filtered)}/{len(examples)} passed")
    return filtered
```

## Deduplication

Teacher models prompted with similar seeds tend to produce near-duplicate examples, and duplicates push the student toward memorization rather than generalization. Remove exact matches by hashing, then near-matches by fuzzy string similarity.

```python
import hashlib
from difflib import SequenceMatcher

def dedup_synthetic(
    examples: list[dict],
    threshold: float = 0.85,
) -> list[dict]:
    """Remove near-duplicate synthetic examples."""
    unique = []
    seen_hashes = set()

    for ex in examples:
        user_msg = ex["messages"][1]["content"]
        assistant_msg = ex["messages"][2]["content"]
        combined = user_msg + assistant_msg

        # Exact dedup
        content_hash = hashlib.md5(combined.encode()).hexdigest()
        if content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)

        # Fuzzy dedup against all kept examples
        is_dup = False
        for kept in unique:
            kept_combined = kept["messages"][1]["content"] + kept["messages"][2]["content"]
            similarity = SequenceMatcher(None, combined, kept_combined).ratio()
            if similarity > threshold:
                is_dup = True
                break

        if not is_dup:
            unique.append(ex)

    print(f"Dedup: {len(unique)}/{len(examples)} unique")
    return unique
```

## Full Pipeline

```python
def synthetic_data_pipeline(
    domain: str,
    system_prompt: str,
    target_count: int = 500,
) -> list[dict]:
    """End-to-end synthetic data generation pipeline."""
    topics = generate_seed_topics(domain, count=target_count // 2)
    raw = generate_batch(topics, system_prompt, examples_per_topic=3)
    cleaned = quality_filter(raw)
    scored = filter_by_score(cleaned, min_score=4.0)
    final = dedup_synthetic(scored, threshold=0.80)

    # Write to JSONL
    with open("synthetic_training_data.jsonl", "w") as f:
        for ex in final:
            f.write(json.dumps(ex) + "\n")

    return final
```
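
Before uploading the JSONL file to a fine-tuning API, a final structural check catches malformed records early. This is a minimal sketch; the `validate_jsonl` helper is illustrative, not part of any API.

```python
import json

def validate_jsonl(path: str) -> int:
    """Return the number of valid records; raise on the first malformed line."""
    count = 0
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = [m["role"] for m in record["messages"]]
            assert roles == ["system", "user", "assistant"], f"line {i}: roles {roles}"
            assert all(m["content"].strip() for m in record["messages"]), \
                f"line {i}: empty content"
            count += 1
    return count
```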

## FAQ

### Is it legal and ethical to use LLM-generated data for fine-tuning?

OpenAI's terms allow using their API outputs to train models, including fine-tuning. However, some model licenses restrict using outputs to train competing models. Always check the terms of service for the specific API you use for generation. Ethically, be transparent about synthetic data usage and validate that generated data does not contain harmful biases or fabricated facts.

### How do I ensure diversity in synthetic data so the model does not just learn one pattern?

Use three techniques: vary seed topics broadly, use high temperature (0.7-1.0) during generation, and explicitly prompt for different customer personas and scenarios. After generation, analyze the distribution of topics, tones, and response styles. If any category is under-represented, generate additional targeted examples for that category.

### What ratio of synthetic to real data should I use?

Start with 100% synthetic data if you have no real data, then gradually replace synthetic examples with real ones as you collect production data. A common production ratio is 30-50% real data mixed with 50-70% synthetic data. Real data anchors the model to actual user patterns while synthetic data provides coverage for edge cases and rare scenarios.
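
The mixing step can be sketched as a small helper. `mix_datasets` below is hypothetical: it keeps every real example (real data is the scarce resource) and samples synthetic examples to hit the target fraction, assuming `0 < real_fraction < 1`.

```python
import random

def mix_datasets(
    real: list[dict],
    synthetic: list[dict],
    real_fraction: float = 0.4,
    seed: int = 42,
) -> list[dict]:
    """Use every real example, then sample synthetic to hit the target mix."""
    rng = random.Random(seed)
    # Solve real / (real + synth) = real_fraction for the synthetic count
    n_synth = int(len(real) * (1 - real_fraction) / real_fraction)
    sampled = rng.sample(synthetic, min(n_synth, len(synthetic)))
    mixed = real + sampled
    rng.shuffle(mixed)
    return mixed
```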

---

#SyntheticData #FineTuning #DataGeneration #LLM #TrainingData #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/synthetic-data-generation-fine-tuning-llms-training-data
