---
title: "Building Evaluation Datasets: Synthetic Generation, Human Labeling, and Active Learning"
description: "A practical guide to creating high-quality evaluation datasets for AI agents using synthetic data generation, human annotation pipelines, active learning for efficient labeling, and dataset versioning strategies."
canonical: https://callsphere.ai/blog/building-evaluation-datasets-synthetic-generation-human-labeling-active-learning
category: "Learn Agentic AI"
tags: ["Evaluation Datasets", "Synthetic Data", "Data Labeling", "Active Learning", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.621Z
---

# Building Evaluation Datasets: Synthetic Generation, Human Labeling, and Active Learning

> A practical guide to creating high-quality evaluation datasets for AI agents using synthetic data generation, human annotation pipelines, active learning for efficient labeling, and dataset versioning strategies.

## The Dataset Is the Evaluation

Your evaluation is only as good as your dataset. A perfect scoring pipeline running against a biased or unrepresentative dataset gives you false confidence. Building evaluation datasets for AI agents is particularly challenging because agent interactions are multi-turn, involve tool calls, and have complex success criteria that go beyond simple text matching.

This guide covers three complementary approaches: synthetic generation for scale, human labeling for quality, and active learning for efficiency. Used together, they give you a dataset that is large enough for statistical reliability, accurate enough for trust, and continuously improving as your agent evolves.

## Synthetic Dataset Generation

Use an LLM to generate diverse evaluation samples at scale. The key is generating both the user inputs and the expected agent behavior.

```mermaid
flowchart LR
    GEN["Synthetic generation
high-temperature LLM"]
    FILTER["Quality filters
dedupe and validation"]
    LABEL["Human annotation
2+ annotators per task"]
    CONS["Consensus labels
majority vote"]
    DS[("Versioned dataset
fingerprint per commit")]
    AL["Active learning
uncertainty sampling"]
    GEN --> FILTER --> LABEL --> CONS --> DS
    DS --> AL
    AL -->|next batch| LABEL
    style GEN fill:#4f46e5,stroke:#4338ca,color:#fff
    style DS fill:#059669,stroke:#047857,color:#fff
    style AL fill:#f59e0b,stroke:#d97706,color:#1f2937
```

```python
import json
import asyncio
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SyntheticSample:
    sample_id: str
    user_input: str
    expected_response: str
    expected_tool_calls: list[dict] = field(
        default_factory=list
    )
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)
    generated_by: str = "synthetic"

async def generate_synthetic_samples(
    llm_client,
    task_description: str,
    tool_definitions: list[dict],
    count: int = 20,
    difficulties: Optional[list[str]] = None,
) -> list[SyntheticSample]:
    difficulties = difficulties or ["easy", "medium", "hard"]
    tools_text = json.dumps(tool_definitions, indent=2)

    prompt = f"""Generate {count} diverse evaluation samples for
an AI agent with the following task and available tools.

## Task Description
{task_description}

## Available Tools
{tools_text}

For each sample, generate:
1. A realistic user input message
2. The expected agent response (or key points)
3. Expected tool calls with parameters
4. Difficulty level: {difficulties}
5. Tags describing the capability tested

Vary the samples across:
- Different user phrasings and communication styles
- Edge cases and unusual requests
- Multi-step and single-step tasks
- Clear and ambiguous intents

Return a JSON object with a "samples" array:
{{
  "samples": [
    {{
      "user_input": "...",
      "expected_response_summary": "...",
      "expected_tool_calls": [{{"name": "...", "params": {{}}}}],
      "difficulty": "easy|medium|hard",
      "tags": ["..."]
    }}
  ]
}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.9,
    )
    raw = json.loads(response.choices[0].message.content)
    items = raw if isinstance(raw, list) else raw.get("samples", [])

    samples = []
    for i, item in enumerate(items):
        samples.append(SyntheticSample(
            sample_id=f"syn-{i:04d}",
            user_input=item["user_input"],
            expected_response=item.get(
                "expected_response_summary", ""
            ),
            expected_tool_calls=item.get(
                "expected_tool_calls", []
            ),
            difficulty=item.get("difficulty", "medium"),
            tags=item.get("tags", []),
        ))
    return samples
```

Set the temperature high (0.8 to 1.0) for generation to maximize diversity. Then filter and validate the results. Synthetic data is a starting point — it fills the volume gap while you build out human-labeled gold sets.
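The deterministic part of that filtering can be a few lines. Here is a minimal sketch (the `filter_synthetic_samples` helper and its exact rejection rules are assumptions, not part of the generator above): drop empty fields, unknown difficulty labels, and exact-duplicate inputs before anything reaches human review.

```python
def filter_synthetic_samples(
    samples: list[SyntheticSample],
) -> list[SyntheticSample]:
    """Deterministic first-pass filter: drop empty, malformed,
    and exact-duplicate samples before any human review."""
    seen_inputs: set[str] = set()
    kept: list[SyntheticSample] = []
    for s in samples:
        text = s.user_input.strip().lower()
        if not text or not s.expected_response.strip():
            continue  # empty or whitespace-only fields
        if s.difficulty not in {"easy", "medium", "hard"}:
            continue  # malformed difficulty label
        if text in seen_inputs:
            continue  # duplicate user input
        seen_inputs.add(text)
        kept.append(s)
    return kept
```

Track how many samples each rule rejects; a rising rejection rate is a signal to tighten the generation prompt rather than generate more volume.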

## Human Annotation Pipeline

Human-labeled data is your ground truth. Design the annotation workflow to maximize consistency and minimize annotator fatigue.

```python
@dataclass
class AnnotationTask:
    task_id: str
    conversation: list[dict]
    agent_response: str
    questions: list[dict]  # What to annotate

@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    labels: dict
    confidence: float  # 0.0 to 1.0
    time_seconds: float
    notes: Optional[str] = None

class AnnotationPipeline:
    def __init__(self, min_annotators: int = 2):
        self.min_annotators = min_annotators
        self.tasks: list[AnnotationTask] = []
        self.annotations: list[Annotation] = []

    def add_task(self, task: AnnotationTask):
        self.tasks.append(task)

    def submit_annotation(self, annotation: Annotation):
        self.annotations.append(annotation)

    def get_consensus(self, task_id: str) -> Optional[dict]:
        task_annotations = [
            a for a in self.annotations
            if a.task_id == task_id
        ]
        if len(task_annotations) < self.min_annotators:
            return None

        # Majority vote per label key across annotators
        # (label values are assumed hashable, e.g. strings or booleans)
        consensus: dict = {}
        label_keys: set = set()
        for a in task_annotations:
            label_keys.update(a.labels.keys())
        for key in label_keys:
            votes = [
                a.labels[key]
                for a in task_annotations
                if key in a.labels
            ]
            consensus[key] = max(set(votes), key=votes.count)
        return consensus
```

Require at least two annotators per task and resolve disagreements by majority vote. Frequent disagreement usually means the annotation guidelines are ambiguous, so tighten the guidelines before collecting more labels.

## Active Learning for Efficient Labeling

You cannot afford to label everything. Active learning prioritizes the samples where a human label adds the most information, typically the ones the agent handles with the least confidence.

```python
import random

class ActiveLearningSelector:
    def __init__(self, unlabeled: list[dict]):
        self.unlabeled = unlabeled

    def score_uncertainty(self, sample: dict) -> float:
        """Score how uncertain the agent is on this sample.
        Higher = more valuable to label."""
        agent_confidence = sample.get(
            "agent_confidence", 0.5
        )
        # Invert: low agent confidence = high labeling value
        uncertainty = 1.0 - agent_confidence

        # Boost novel patterns
        if sample.get("is_novel_pattern", False):
            uncertainty = min(1.0, uncertainty + 0.2)

        return uncertainty

    def select_batch(self, batch_size: int = 50) -> list[dict]:
        scored = [
            (self.score_uncertainty(s), s)
            for s in self.unlabeled
        ]
        # Mix: 70% highest uncertainty, 30% random
        scored.sort(key=lambda x: -x[0])
        n_uncertain = int(batch_size * 0.7)
        n_random = batch_size - n_uncertain

        selected = [s for _, s in scored[:n_uncertain]]
        remaining = [s for _, s in scored[n_uncertain:]]
        if remaining:
            selected.extend(
                random.sample(
                    remaining, min(n_random, len(remaining))
                )
            )

        # Remove selected from unlabeled pool
        selected_ids = {s.get("id") for s in selected}
        self.unlabeled = [
            s for s in self.unlabeled
            if s.get("id") not in selected_ids
        ]
        return selected
```

The 70/30 split between uncertain and random samples is important. Pure uncertainty sampling can create a biased dataset that only covers hard cases. The random component ensures your dataset still represents the full distribution of user requests.
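
A short usage sketch ties the pieces together; `candidate_samples` and `build_annotation_task` are hypothetical here, with the latter standing in for whatever wraps a raw sample into an `AnnotationTask`.

```python
selector = ActiveLearningSelector(unlabeled=candidate_samples)
pipeline = AnnotationPipeline(min_annotators=2)

# One labeling round: select the most informative samples
# and queue them for at least two human annotators each.
for sample in selector.select_batch(batch_size=50):
    pipeline.add_task(build_annotation_task(sample))
```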

## Dataset Versioning and Quality Control

Track every change to your dataset so evaluation results are always reproducible.

```python
import hashlib
from datetime import datetime, timezone

@dataclass
class DatasetVersion:
    version: str
    fingerprint: str
    sample_count: int
    created_at: str
    parent_version: Optional[str] = None
    changes: list[str] = field(default_factory=list)

class VersionedDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[dict] = []
        self.versions: list[DatasetVersion] = []

    def fingerprint(self) -> str:
        content = json.dumps(self.samples, sort_keys=True)
        return hashlib.sha256(
            content.encode()
        ).hexdigest()[:12]

    def commit(
        self, version: str, changes: list[str]
    ) -> DatasetVersion:
        parent = (
            self.versions[-1].version
            if self.versions else None
        )
        v = DatasetVersion(
            version=version,
            fingerprint=self.fingerprint(),
            sample_count=len(self.samples),
            created_at=datetime.now(timezone.utc).isoformat(),
            parent_version=parent,
            changes=changes,
        )
        self.versions.append(v)
        return v

    def quality_report(self) -> dict:
        tags_coverage = set()
        difficulties = {"easy": 0, "medium": 0, "hard": 0}
        for sample in self.samples:
            tags_coverage.update(sample.get("tags", []))
            diff = sample.get("difficulty", "medium")
            difficulties[diff] = difficulties.get(diff, 0) + 1

        return {
            "total_samples": len(self.samples),
            "unique_tags": len(tags_coverage),
            "difficulty_distribution": difficulties,
            "fingerprint": self.fingerprint(),
            "version": (
                self.versions[-1].version
                if self.versions else "uncommitted"
            ),
        }
```

Always reference the dataset fingerprint alongside evaluation results. When a score changes, you can immediately determine whether it was caused by a model change or a dataset change.
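
One way to wire that in, sketched with illustrative values (the result payload shape and the model name are assumptions): pull the version and fingerprint from `quality_report()` and store them next to every aggregate score.

```python
dataset = VersionedDataset("agent-eval-set")
# ... add samples ...
dataset.commit("1.1.0", changes=["added human-labeled samples"])

report = dataset.quality_report()
eval_result = {
    "model": "agent-v2",          # illustrative
    "aggregate_score": 0.87,      # illustrative
    "dataset_version": report["version"],
    "dataset_fingerprint": report["fingerprint"],
}
```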

## FAQ

### How many samples do I need for a reliable evaluation dataset?

Aim for at least 50 samples per capability or task type. When comparing two models, you typically need 200 or more samples per comparison before differences of a few percentage points rise above sampling noise. Start with synthetic generation to reach volume, then replace low-quality synthetic samples with human-labeled ones over time. A 500-sample dataset that is 30 percent human-labeled and 70 percent high-quality synthetic is a strong starting point.
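
A back-of-envelope way to sanity-check that 200-sample floor is the standard normal-approximation margin of error for a measured pass rate:

```python
import math

def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence margin for a measured pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# At an 80% pass rate: roughly +/-11 points with 50 samples,
# roughly +/-5.5 points with 200 samples.
print(round(margin_of_error(0.80, 50), 3))   # 0.111
print(round(margin_of_error(0.80, 200), 3))  # 0.055
```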

### How do I detect and remove bad synthetic samples?

Run three quality filters. First, a deterministic filter that catches formatting issues, empty fields, and duplicate inputs. Second, a self-consistency check where you generate the same task twice with different seeds and compare — inconsistent outputs suggest an underspecified prompt. Third, a human spot-check on 10 percent of each generated batch. Track the rejection rate to improve your generation prompts.
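
The self-consistency check can be as simple as comparing tool-call names across two generations. A sketch, assuming you regenerate the expected behavior for the same user inputs with a different seed (`self_consistency_flags` and the choice of tool-call names as the comparison key are assumptions):

```python
def self_consistency_flags(
    run_a: list[SyntheticSample],
    run_b: list[SyntheticSample],
) -> list[str]:
    """Flag samples whose two generations disagree on which tools
    should be called, a hint the generation prompt is underspecified."""
    by_input = {s.user_input: s for s in run_b}
    flagged = []
    for s in run_a:
        twin = by_input.get(s.user_input)
        if twin is None:
            continue
        calls_a = [c.get("name") for c in s.expected_tool_calls]
        calls_b = [c.get("name") for c in twin.expected_tool_calls]
        if calls_a != calls_b:
            flagged.append(s.sample_id)
    return flagged
```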

### When should I create a new dataset version versus modifying the existing one?

Create a new version whenever you add more than 10 percent new samples, remove samples, change annotation guidelines, or fix systematic labeling errors. For small additions (under 10 percent), append and increment a minor version. Always preserve old versions so you can re-run evaluations against them for trend analysis.

---

#EvaluationDatasets #SyntheticData #DataLabeling #ActiveLearning #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/building-evaluation-datasets-synthetic-generation-human-labeling-active-learning
