---
title: "How to Build a Golden Dataset for Production AI Agents"
description: "A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith."
canonical: https://callsphere.ai/blog/golden-dataset-production-ai-agents-langsmith
category: "Agentic AI"
tags: ["Agent Evaluation", "LangSmith", "Datasets", "Annotation Queue", "Agent Observability", "MLOps", "AI Engineering"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.354Z
---

# How to Build a Golden Dataset for Production AI Agents

> A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith.

## TL;DR

A **golden dataset** is the single most leveraged artifact in an agent program. It is not a CSV your intern made. It is a **versioned, auditable, evolving collection of inputs paired with reference outputs and graders**, mined from real production traces, annotated by humans who know the domain, and refreshed every time the agent fails in a new way. In LangSmith, a golden dataset is a first-class `Dataset` object with `Examples`, attached evaluators, and an experiments history that lets you compare prompts, models, and tool changes against the same yardstick. This post is the playbook I use when I bootstrap one from zero, scale it past 1,000 examples, and keep it alive in CI for years.

If you only remember three things: **mine production, don't write fiction**; **version every change**; **let failures, not vibes, decide what gets added**.

## Why Golden Datasets Decide Whether Your Agent Ships

Every agent team I have walked into has the same shape of problem at month three: the demo works, the prompt has been tweaked 200 times, and nobody can answer the question, *"is this version better than last week's version?"* You ship a change, somebody on Slack says it feels worse, you revert, somebody else says the revert feels worse, and the team gives up on iterating.

The reason is always the same. There is no fixed reference. Without a golden dataset, every prompt change is judged on whichever 6 calls the on-call engineer happened to look at this morning. That is not evaluation — that is augury.

A golden dataset fixes this. It is the **regression test suite for an agent**. When you have one:

- Every prompt change runs against the same N examples.
- Every model swap runs against the same N examples.
- Every tool refactor runs against the same N examples.
- Wins and regressions are measurable to two decimal places, and the *category* of regression (tool selection, hallucination, refusal, latency) is visible.

The teams that ship voice and chat agents at scale — the ones we work with at [CallSphere](/products) across [healthcare, real estate, sales, and IT helpdesk verticals](/industries) — all have one. The teams that get stuck in pilot don't.

## What Belongs in a Golden Dataset (and What Doesn't)

A golden `Example` in LangSmith is a triple: **input**, **reference output**, **metadata**. For agents specifically, "input" is rarely just a string — it is a structured payload that captures the *full evaluable unit*: the user message, the conversation history, the tool catalog available at that point, the system prompt revision, and any retrieved context.

| Field | What goes in it | Why it matters |
| --- | --- | --- |
| `inputs.messages` | Full conversation up to the eval turn | Agents are stateful; one-shot inputs don't reproduce real failures |
| `inputs.tools` | Tool schemas available at runtime | Tool selection is a top-3 failure mode |
| `inputs.context` | RAG chunks, customer profile, prior call summary | Eliminates "context drift" as a confound |
| `outputs.reference` | Ideal final answer OR ideal trajectory | Either is fine; pick one and be consistent |
| `outputs.must_call` | Tools the agent MUST call | Trajectory-level grading |
| `outputs.must_not_say` | Forbidden phrases (PHI, competitor names) | Compliance graders run cheaply |
| `metadata.source` | `production`, `synthetic`, `adversarial`, `regression` | Lets you slice metrics by provenance |
| `metadata.severity` | `p0`, `p1`, `p2` | Weight failures appropriately |
| `metadata.persona` | `new_user`, `angry_customer`, `enterprise_admin` | Stratified sampling at eval time |
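
If you want that shape enforced in code rather than by convention, here is a minimal sketch with plain `TypedDict`s — the field names mirror the table above and are this post's conventions, not anything LangSmith itself requires:

```python
from typing import TypedDict

class GoldenInputs(TypedDict):
    messages: list[dict]   # full conversation up to the eval turn
    tools: list[str]       # tool names/schemas available at runtime
    context: dict          # RAG chunks, customer profile, prior call summary

class GoldenOutputs(TypedDict, total=False):
    reference: str             # ideal final answer or trajectory summary
    must_call: list[str]       # tools the agent must call
    must_not_say: list[str]    # forbidden phrases

class GoldenMetadata(TypedDict):
    source: str     # production | synthetic | adversarial | regression
    severity: str   # p0 | p1 | p2
    persona: str    # new_user | angry_customer | enterprise_admin
```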

**What does NOT belong**: examples your PM made up in a Notion doc to "represent the user." Synthetic data has a place — adversarial generation, edge-case probes, distribution gap-filling — but the *core* of the golden dataset must be production-mined. Real users break agents in ways no PM ever imagines.

## The Build Pipeline: From Trace to Golden Example

Here is the pipeline I run, end to end. It is the same flow whether the agent is a customer support bot, a voice agent on Twilio, or a multi-agent research assistant.

```mermaid
flowchart LR
  A[Production traces<br/>LangSmith Project] --> B{Filter}
  B -->|negative user feedback| C[Candidate pool]
  B -->|tool error| C
  B -->|latency p99 spike| C
  B -->|judge score low| C
  B -->|random 1%| D[Sample pool]
  C --> E[Annotation Queue]
  D --> E
  E --> F[SME review]
  F -->|accept + correct| G[Dataset v_n+1]
  F -->|reject| H[Discard log]
  G --> I[Run evaluators]
  I --> J{Regression?}
  J -->|yes| K[Block deploy]
  J -->|no| L[Promote to prod]
  L --> A
  style A fill:#e6f3ff
  style E fill:#fff4e6
  style G fill:#e8f5e8
  style K fill:#fcc
```

*Figure 1 — The golden dataset is a closed loop. Production feeds it, SMEs curate it, evaluators gate deploys, and accepted deploys produce the next batch of traces. Break this loop and your dataset rots.*

### Step 1 — Mine Production Traces

LangSmith stores every `Run` you trace under a `Project`. The first job is to extract the runs that *should* end up in the dataset. Three filters give you 80% of the value.

```python
from langsmith import Client
from datetime import datetime, timedelta

client = Client()

# 1. Negative user feedback (thumbs-down, low CSAT, etc.)
negative_runs = client.list_runs(
    project_name="prod-voice-agent",
    filter='and(eq(feedback_key, "user_score"), lt(feedback_score, 3))',
    start_time=datetime.utcnow() - timedelta(days=7),
    is_root=True,
)

# 2. Tool errors / exceptions inside the trace
error_runs = client.list_runs(
    project_name="prod-voice-agent",
    filter='eq(error, true)',
    start_time=datetime.utcnow() - timedelta(days=7),
    is_root=True,
)

# 3. Latency outliers (anything > 5s end-to-end)
slow_runs = client.list_runs(
    project_name="prod-voice-agent",
    filter='gt(latency, 5)',
    start_time=datetime.utcnow() - timedelta(days=7),
    is_root=True,
)

# Materialize the lazy iterators so counting doesn't exhaust them before Step 2.
negative_runs, error_runs, slow_runs = map(list, (negative_runs, error_runs, slow_runs))
print(f"Mined: {len(negative_runs)} neg, {len(error_runs)} err, {len(slow_runs)} slow")
```

In a real prod system at ~50K calls/day, this typically yields **300–800 candidates per week**. That is too many to annotate exhaustively and too valuable to ignore. Dedupe the pile first, then let the Annotation Queue bridge the rest of the gap.
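
A minimal dedup sketch before queueing — it assumes the root run's inputs carry the message list as traced above, and the dedup key (error text plus the opening user utterance) is a deliberately crude placeholder; swap in whatever signal separates distinct failure modes for your agent:

```python
def dedupe_candidates(runs, cap=150):
    """Collapse near-duplicate failures so the weekly queue stays reviewable."""
    seen, kept = set(), []
    for run in runs:
        messages = (run.inputs or {}).get("messages", [])
        first_user = next(
            (m.get("content", "") for m in messages if m.get("role") == "user"), ""
        )
        # Crude bucket: same error prefix + same opening utterance ~= same failure
        key = (str(run.error)[:80], first_user[:80])
        if key in seen:
            continue
        seen.add(key)
        kept.append(run)
        if len(kept) >= cap:
            break
    return kept

candidates = dedupe_candidates([*negative_runs, *error_runs, *slow_runs])
```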

### Step 2 — Send Candidates to an Annotation Queue

LangSmith's [Annotation Queue](https://docs.langchain.com/langsmith/evaluation) is a curated workspace where SMEs review one trace at a time, mark it correct or incorrect, and (critically) **edit the reference output** to what the agent *should* have said.

```typescript
import { Client } from "langsmith";

const client = new Client();

// Create a queue once per surface
await client.createAnnotationQueue({
  name: "voice-agent-failures-2026-w18",
  description: "Weekly SME review for golden dataset growth",
  defaultDataset: "voice-agent-golden-v3",
});

// Push candidate runs into the queue
const candidateIds: string[] = [/* run ids from Step 1 */];
await client.addRunsToAnnotationQueue(
  "voice-agent-failures-2026-w18",
  candidateIds,
);
```

The cardinal rule: **the SME owns the reference output, not the engineer.** Your SME is the support lead, the nurse practitioner, the sales manager, the compliance officer. They know what the right answer was. Engineers are not allowed to write reference outputs — that is how you bake the model's biases into the grader.

### Step 3 — Promote Annotated Examples Into the Dataset

Once an SME accepts an example, it gets promoted into the versioned dataset. LangSmith versions datasets automatically — every `create_examples` (or `create_example`) call produces a new dataset version, which you can tag and pin in CI.

```python
from langsmith import Client

client = Client()

# Create the dataset once
dataset = client.create_dataset(
    dataset_name="voice-agent-golden-v3",
    description="Production-mined golden set for the voice agent.",
)

# Promote one annotated example
client.create_example(
    dataset_id=dataset.id,
    inputs={
        "messages": [
            {"role": "system", "content": "You are an after-hours triage agent."},
            {"role": "user", "content": "My 4-year-old has had a fever for 3 days."},
        ],
        "tools": ["lookup_clinic_hours", "page_on_call", "schedule_callback"],
        "context": {"clinic_id": "clinic_482", "patient_age": 4},
    },
    outputs={
        "reference": "I'm going to page the on-call physician now. Please stay on the line.",
        "must_call": ["page_on_call"],
        "must_not_say": ["I'm not a doctor", "call 911"],
    },
    metadata={
        "source": "production",
        "severity": "p0",
        "persona": "worried_parent",
        "trace_id": "8f3a...e91",
        "annotator": "rn_kelly",
        "version_added": "v3.4",
    },
)
```
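
In practice promotion happens in weekly batches rather than one call at a time. A sketch using the batched `create_examples` — `load_weekly_accepted()` is a hypothetical stand-in for however you export SME-approved records, each shaped like the single example above:

```python
accepted = load_weekly_accepted()  # hypothetical helper: SME-approved records

client.create_examples(
    dataset_id=dataset.id,
    inputs=[rec["inputs"] for rec in accepted],
    outputs=[rec["outputs"] for rec in accepted],
    metadata=[rec["metadata"] for rec in accepted],
)
```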

## Versioning, Splits, and the Two Datasets You Actually Need

People treat datasets as one big bucket. They shouldn't. In production I always run **two**:

1. **Golden — frozen.** The regression set. Curated, balanced across personas and severities, capped at ~500–1,500 examples. Changes go through review. CI runs against this dataset on every PR. **This is the gate.**
2. **Drift — rolling.** Last 30 days of production-mined examples. Grows weekly. No promotion needed. Used for *trend* analysis and to spot distribution shift. **This is the canary.**

Both live in LangSmith. The drift dataset feeds the golden one — examples that survive 30 days, get SME-confirmed, and represent something *new* (not already covered by an existing golden example) get promoted.

| Property | Golden | Drift |
| --- | --- | --- |
| Size | 500–1,500 | 5,000–50,000 |
| Cadence of change | Reviewed monthly | Refreshed weekly |
| Used for | CI regression gate | Distribution monitoring |
| Pinned version in CI | Yes (`v3.4`) | No (always latest) |
| Edited reference outputs | Yes | No (uses agent's own output) |
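
The "pinned version" in the table is just a dataset version tag. A minimal sketch, assuming a recent `langsmith` SDK where the client exposes `update_dataset_tag` and `list_examples` accepts `as_of`:

```python
from datetime import datetime, timezone
from langsmith import Client

client = Client()

# After the monthly review, stamp the current state of the dataset with a tag.
client.update_dataset_tag(
    dataset_name="voice-agent-golden-v3",
    as_of=datetime.now(timezone.utc),  # "the dataset as it exists right now"
    tag="v3.4",
)

# CI resolves the tag instead of "whatever the dataset looks like today".
pinned = list(client.list_examples(dataset_name="voice-agent-golden-v3", as_of="v3.4"))
```

The `pinned` list can be passed straight to `evaluate(..., data=pinned)`, so the CI job and the monthly release note always refer to the same examples.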

## Running Evaluators Against the Golden Set

Once the dataset exists, the next question is what to grade. For agents specifically, you grade three layers:

1. **Final-answer correctness** — does the last assistant message match the reference? Use an LLM judge with rubric, NOT exact string match.
2. **Trajectory correctness** — did the agent call `must_call` tools and avoid forbidden tools? Pure code grader, deterministic.
3. **Safety/policy** — did any message contain a `must_not_say` phrase? Pure code grader.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

def trajectory_grader(run, example):
    """Did the agent call all required tools?"""
    called = {c["name"] for c in run.outputs.get("tool_calls", [])}
    required = set(example.outputs.get("must_call", []))
    return {
        "key": "required_tools_called",
        "score": 1.0 if required.issubset(called) else 0.0,
        "comment": f"missing: {required - called}" if required - called else "ok",
    }

def policy_grader(run, example):
    """Did the agent say anything forbidden?"""
    text = " ".join(m["content"] for m in run.outputs.get("messages", []))
    forbidden = example.outputs.get("must_not_say", [])
    hits = [p for p in forbidden if p.lower() in text.lower()]
    return {
        "key": "policy_violations",
        "score": 0.0 if hits else 1.0,
        "comment": f"forbidden hits: {hits}" if hits else "clean",
    }

results = evaluate(
    lambda inputs: my_agent.invoke(inputs),
    data="voice-agent-golden-v3",
    evaluators=[trajectory_grader, policy_grader],
    experiment_prefix="prompt-rev-117",
    max_concurrency=8,
)

print(results.to_pandas().describe())
```
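
The code above covers layers 2 and 3. Layer 1 — final-answer correctness — is the `llm_judge` the CI snippet later refers to. A minimal rubric-based sketch using the `openai` SDK; the judge model, the rubric, and the assumption that the agent's final text lives in `run.outputs["answer"]` are all placeholders to adapt:

```python
import json
from openai import OpenAI

oai = OpenAI()

def llm_judge(run, example):
    """LLM-as-judge: grade the final answer against the SME-written reference."""
    prompt = (
        "Score the candidate against the reference from 0.0 to 1.0.\n"
        "1.0 = same action and same key facts; 0.0 = wrong action or contradicts it.\n"
        f"Reference: {example.outputs['reference']}\n"
        f"Candidate: {run.outputs.get('answer', '')}\n"  # placeholder output key
        'Reply as JSON: {"score": <float>, "reason": "<one sentence>"}'
    )
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    verdict = json.loads(resp.choices[0].message.content)
    return {
        "key": "answer_correctness",
        "score": float(verdict["score"]),
        "comment": verdict["reason"],
    }
```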

The output is a comparable experiment row in LangSmith. You change the prompt, you re-run, you get a diff. That is what "we made the agent better" looks like in evidence form.

## How a Golden Dataset Grows Without Rotting

A static dataset rots in 3 months. The agent improves, the easy examples become trivial, and the dataset stops discriminating. The discipline that keeps it useful:

- **Promote every novel failure.** If a production trace fails in a way no golden example covers, it gets a candidate slot. *Novel* is the key word — duplicates of existing failures don't help.
- **Retire solved examples.** When the agent has scored 100% on an example for 90 days across two model swaps, demote it to an "archive" tag. Keep it; stop running it on every PR.
- **Rebalance quarterly.** Pull persona/severity counts (a counting sketch follows this list). If 70% of examples are `new_user` and your prod traffic shifted to `enterprise_admin`, you are evaluating the wrong agent.
- **Adversarial top-up.** Once a quarter, generate 50 synthetic adversarial examples — prompt injection, jailbreaks, role-switching — and SME-review them in. This is the one place synthetic data earns its keep.
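
The quarterly rebalance is a five-minute script. A sketch that pulls persona counts straight from example metadata (swap `persona` for `severity` to check the other axis):

```python
from collections import Counter
from langsmith import Client

client = Client()

persona_counts = Counter(
    (ex.metadata or {}).get("persona", "unknown")
    for ex in client.list_examples(dataset_name="voice-agent-golden-v3")
)
total = sum(persona_counts.values())
for persona, count in persona_counts.most_common():
    print(f"{persona:>20}  {count:4d}  ({count / total:.0%})")
```

Compare those percentages against the same breakdown of the last 30 days of production traffic; a large gap is the signal to rebalance.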

## Common Mistakes I See

- **Engineer-written references.** The model already thinks like an engineer; engineer-written references just measure self-agreement.
- **No versioning.** "We added some examples last week" is not a version. Pin a tag in CI.
- **Treating the dataset as the test set.** It is the *training-the-prompt* set too, so you will overfit to it. Hold out 20% that you never look at while iterating on the prompt.
- **One dataset for chat and voice.** Voice has interruptions, latency budgets, and hand-offs that chat doesn't. Separate them.
- **No `must_not_say` field.** Compliance failures are silent until they aren't. A two-line policy grader catches them in CI for ~$0 per run.

## Wiring It Into CI

The whole point is to gate deploys. A LangSmith `evaluate()` run produces a structured result you can fail a PR on.

```python
import os
import sys

from langsmith.evaluation import evaluate

result = evaluate(
    my_agent_fn,
    data="voice-agent-golden-v3",  # pinned version
    evaluators=[trajectory_grader, policy_grader, llm_judge],
    experiment_prefix=f"pr-{os.environ['GITHUB_PR_NUMBER']}",
)

df = result.to_pandas()
required_tools_pass = df["feedback.required_tools_called"].mean()
policy_pass = df["feedback.policy_violations"].mean()

# Fail the PR if either gate slips below threshold
if required_tools_pass < 0.95 or policy_pass < 0.99:
    print(f"REGRESSION: tools={required_tools_pass:.3f} policy={policy_pass:.3f}")
    sys.exit(1)
```

That snippet, dropped into a GitHub Action, is the difference between an agent that gets better every week and an agent that is one bad prompt edit away from a Slack incident.

## FAQ

**Q: How big should the golden dataset be?**
A: 500 examples is a strong floor for a single-purpose agent. 1,500 is plenty for most production systems. Past 2,000 you are adding examples for sport — your CI cost is climbing and your discriminative power is not. Bigger is not better; *more diverse* is better.

**Q: Can I bootstrap a golden dataset before I have production traffic?**
A: Yes, but flag everything as `source=synthetic` and treat the metrics as directional only. The day production traffic exists, start mining. The synthetic examples will get pushed out of the regression set within 60 days — that is healthy.

**Q: How often should I re-version the dataset?**
A: Promote a new version monthly, with a release note describing what was added, removed, or rebalanced. CI pins the version, so promotion is non-disruptive — engineers update the pin when they are ready.

**Q: Do I need an SME, or can engineers annotate?**
A: For trivial agents, engineers are fine. For anything domain-specific (medical, legal, finance, sales), an engineer-annotated dataset will encode the engineer's misunderstandings. Pay the SME hours. The dataset is the most leveraged thing they will ever produce.

**Q: How does this differ from a unit test suite?**
A: Unit tests are deterministic and cover code paths. A golden dataset is probabilistic and covers *behaviors* — tool selection, refusal, persona-appropriate tone — that no unit test can express. They are complements, not substitutes.

## Build Your Golden Dataset With CallSphere

If you are running voice or chat agents in production, you already have the raw material — every call is a candidate. CallSphere ships with the trace export, annotation flow, and evaluator hooks you need to stand a golden dataset up in a week, not a quarter. See the [products page](/products), the [agent evaluation glossary entry](/glossary), or [book a working session](/demo).

[Book a demo](/demo) · [See products](/products) · [Browse the glossary](/glossary)

