How to Build a Golden Dataset for Production AI Agents
A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith.
TL;DR
A golden dataset is the single most leveraged artifact in an agent program. It is not a CSV your intern made. It is a versioned, auditable, evolving collection of inputs paired with reference outputs and graders, mined from real production traces, annotated by humans who know the domain, and refreshed every time the agent fails in a new way. In LangSmith, a golden dataset is a first-class Dataset object with Examples, attached evaluators, and an experiment history that lets you compare prompts, models, and tool changes against the same yardstick. This post is the playbook I use when I bootstrap one from zero, scale it past 1,000 examples, and keep it alive in CI for years.
If you only remember three things: mine production, don't write fiction; version every change; let failures, not vibes, decide what gets added.
Why Golden Datasets Decide Whether Your Agent Ships
Every agent team I have walked into has the same shape of problem at month three: the demo works, the prompt has been tweaked 200 times, and nobody can answer the question, "is this version better than last week's version?" You ship a change, somebody on Slack says it feels worse, you revert, somebody else says the revert feels worse, and the team gives up on iterating.
The reason is always the same. There is no fixed reference. Without a golden dataset, every prompt change is judged on whichever 6 calls the on-call engineer happened to look at this morning. That is not evaluation — that is augury.
A golden dataset fixes this. It is the regression test suite for an agent. When you have one:
- Every prompt change runs against the same N examples.
- Every model swap runs against the same N examples.
- Every tool refactor runs against the same N examples.
- Wins and regressions are measurable to two decimal places, and the category of regression (tool selection, hallucination, refusal, latency) is visible.
The teams that ship voice and chat agents at scale — the ones we work with at CallSphere across healthcare, real estate, sales, and IT helpdesk verticals — all have one. The teams that get stuck in pilot don't.
What Belongs in a Golden Dataset (and What Doesn't)
A golden Example in LangSmith is a triple: input, reference output, metadata. For agents specifically, "input" is rarely just a string — it is a structured payload that captures the full evaluable unit: the user message, the conversation history, the tool catalog available at that point, the system prompt revision, and any retrieved context.
| Field | What goes in it | Why it matters |
|---|---|---|
| `inputs.messages` | Full conversation up to the eval turn | Agents are stateful; one-shot inputs don't reproduce real failures |
| `inputs.tools` | Tool schemas available at runtime | Tool selection is a top-3 failure mode |
| `inputs.context` | RAG chunks, customer profile, prior call summary | Eliminates "context drift" as a confound |
| `outputs.reference` | Ideal final answer OR ideal trajectory | Either is fine; pick one and be consistent |
| `outputs.must_call` | Tools the agent MUST call | Trajectory-level grading |
| `outputs.must_not_say` | Forbidden phrases (PHI, competitor names) | Compliance graders run cheaply |
| `metadata.source` | `production`, `synthetic`, `adversarial`, `regression` | Lets you slice metrics by provenance |
| `metadata.severity` | `p0`, `p1`, `p2` | Weight failures appropriately |
| `metadata.persona` | `new_user`, `angry_customer`, `enterprise_admin` | Stratified sampling at eval time |
What does NOT belong: examples your PM made up in a Notion doc to "represent the user." Synthetic data has a place — adversarial generation, edge-case probes, distribution gap-filling — but the core of the golden dataset must be production-mined. Real users break agents in ways no PM ever imagines.
The Build Pipeline: From Trace to Golden Example
Here is the pipeline I run, end to end. It is the same flow whether the agent is a customer support bot, a voice agent on Twilio, or a multi-agent research assistant.
```mermaid
flowchart LR
    A[Production traces<br/>LangSmith Project] --> B{Filter}
    B -->|user feedback < 3| C[Failure pool]
    B -->|tool error| C
    B -->|latency p99 spike| C
    B -->|judge score low| C
    B -->|random 1%| D[Sample pool]
    C --> E[Annotation Queue]
    D --> E
    E --> F[SME review]
    F -->|accept + correct| G[Dataset v_n+1]
    F -->|reject| H[Discard log]
    G --> I[Run evaluators]
    I --> J{Regression?}
    J -->|yes| K[Block deploy]
    J -->|no| L[Promote to prod]
    L --> A
    style A fill:#e6f3ff
    style E fill:#fff4e6
    style G fill:#e8f5e8
    style K fill:#fcc
```
Figure 1 — The golden dataset is a closed loop. Production feeds it, SMEs curate it, evaluators gate deploys, and accepted deploys produce the next batch of traces. Break this loop and your dataset rots.
Step 1 — Mine Production Traces
LangSmith stores every Run you trace under a Project. The first job is to extract the runs that should end up in the dataset. Three filters give you 80% of the value.
```python
from datetime import datetime, timedelta

from langsmith import Client

client = Client()
window_start = datetime.utcnow() - timedelta(days=7)

# Materialize each generator into a list so the run ids can be reused in Step 2
# (list_runs returns a one-shot iterator; counting it would exhaust it).

# 1. Negative user feedback (thumbs-down, low CSAT, etc.)
negative_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='and(eq(feedback_key, "user_score"), lt(feedback_score, 3))',
    start_time=window_start,
    is_root=True,
))

# 2. Tool errors / exceptions inside the trace
error_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='eq(error, true)',
    start_time=window_start,
    is_root=True,
))

# 3. Latency outliers (anything > 5s end-to-end)
slow_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='gt(latency, 5)',
    start_time=window_start,
    is_root=True,
))

print(f"Mined: {len(negative_runs)} neg, "
      f"{len(error_runs)} err, "
      f"{len(slow_runs)} slow")
```
In a real prod system at ~50K calls/day, this typically yields 300–800 candidates per week. That is too many to annotate by hand and too few to skip. The Annotation Queue is how you bridge that gap.
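Before anything reaches the queue, it pays to merge the three pools, dedupe runs that tripped more than one filter, and cap the batch at your weekly annotation budget. A minimal sketch — the priority ordering and the 150-example cap are my own conventions, not LangSmith features:

```python
def build_annotation_batch(
    pools: dict[str, list[str]], cap: int = 150
) -> list[tuple[str, str]]:
    """Merge candidate pools in priority order, dedupe by run id, cap at budget.

    pools maps a label (e.g. "negative") to a list of run ids; a run that
    appears in several pools keeps only its highest-priority label.
    """
    seen: set[str] = set()
    batch: list[tuple[str, str]] = []
    for label in ("negative", "error", "slow", "random"):  # priority order
        for run_id in pools.get(label, []):
            if run_id in seen:
                continue  # already claimed by a higher-priority pool
            seen.add(run_id)
            batch.append((run_id, label))
            if len(batch) >= cap:
                return batch
    return batch
```

Run the mined ids through this before Step 2, and the label travels with the run as queue metadata so annotators see why each trace was flagged.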
Step 2 — Send Candidates to an Annotation Queue
LangSmith's Annotation Queue is a curated workspace where SMEs review one trace at a time, mark it correct or incorrect, and (critically) edit the reference output to what the agent should have said.
```typescript
import { Client } from "langsmith";

const client = new Client();

// Create a queue once per surface; keep the returned queue so we have its id
const queue = await client.createAnnotationQueue({
  name: "voice-agent-failures-2026-w18",
  description: "Weekly SME review for golden dataset growth",
  defaultDataset: "voice-agent-golden-v3",
});

// Push candidate runs into the queue by id
const candidateIds: string[] = [/* run ids from Step 1 */];
await client.addRunsToAnnotationQueue(queue.id, candidateIds);
```
The cardinal rule: the SME owns the reference output, not the engineer. Your SME is the support lead, the nurse practitioner, the sales manager, the compliance officer. They know what the right answer was. Engineers are not allowed to write reference outputs — that is how you bake the model's biases into the grader.
Step 3 — Promote Annotated Examples Into the Dataset
Once an SME accepts an example, it gets promoted into the versioned dataset. LangSmith versions datasets automatically — every add_examples call creates a new version tag you can pin in CI.
```python
from langsmith import Client

client = Client()

# Create the dataset once
dataset = client.create_dataset(
    dataset_name="voice-agent-golden-v3",
    description="Production-mined golden set for the voice agent.",
)

# Promote one annotated example
client.create_example(
    dataset_id=dataset.id,
    inputs={
        "messages": [
            {"role": "system", "content": "You are an after-hours triage agent."},
            {"role": "user", "content": "My 4-year-old has had a fever for 3 days."},
        ],
        "tools": ["lookup_clinic_hours", "page_on_call", "schedule_callback"],
        "context": {"clinic_id": "clinic_482", "patient_age": 4},
    },
    outputs={
        "reference": "I'm going to page the on-call physician now. Please stay on the line.",
        "must_call": ["page_on_call"],
        "must_not_say": ["I'm not a doctor", "call 911"],
    },
    metadata={
        "source": "production",
        "severity": "p0",
        "persona": "worried_parent",
        "trace_id": "8f3a...e91",
        "annotator": "rn_kelly",
        "version_added": "v3.4",
    },
)
```
Versioning, Splits, and the Two Datasets You Actually Need
People treat datasets as one big bucket. They shouldn't. In production I always run two:
- Golden — frozen. The regression set. Curated, balanced across personas and severities, capped at ~500–1,500 examples. Changes go through review. CI runs against this dataset on every PR. This is the gate.
- Drift — rolling. Last 30 days of production-mined examples. Grows weekly. No promotion needed. Used for trend analysis and to spot distribution shift. This is the canary.
Both live in LangSmith. The drift dataset feeds the golden one — examples that survive 30 days, get SME-confirmed, and represent something new (not already covered by an existing golden example) get promoted.
| Property | Golden | Drift |
|---|---|---|
| Size | 500–1,500 | 5,000–50,000 |
| Cadence of change | Reviewed monthly | Refreshed weekly |
| Used for | CI regression gate | Distribution monitoring |
| Pinned version in CI | Yes (`v3.4`) | No (always latest) |
| Edited reference outputs | Yes | No (uses agent's own output) |
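The drift-to-golden promotion rule can be written down as a pure predicate. A sketch under stated assumptions — the `added_at`, `sme_confirmed`, `persona`, and `failure_mode` metadata fields are illustrative names from my own convention, not a LangSmith schema:

```python
from datetime import datetime, timedelta, timezone

def should_promote(example: dict, golden_signatures: set, now: datetime) -> bool:
    """Promote a drift example if it survived 30 days, was SME-confirmed,
    and covers a (persona, failure_mode) combination the golden set lacks."""
    meta = example.get("metadata", {})
    added_at = datetime.fromisoformat(meta["added_at"])
    aged = (now - added_at) >= timedelta(days=30)
    confirmed = meta.get("sme_confirmed", False)
    signature = (meta.get("persona"), meta.get("failure_mode"))
    return aged and confirmed and signature not in golden_signatures
```

A weekly job filters the drift dataset through this predicate and pushes the survivors into the annotation queue for final SME sign-off before they land in golden.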
Running Evaluators Against the Golden Set
Once the dataset exists, the next question is what to grade. For agents specifically, you grade three layers:
- Final-answer correctness — does the last assistant message match the reference? Use an LLM judge with a rubric, NOT exact string match.
- Trajectory correctness — did the agent call `must_call` tools and avoid forbidden tools? Pure code grader, deterministic.
- Safety/policy — did any message contain a `must_not_say` phrase? Pure code grader.
```python
from langsmith.evaluation import evaluate

def trajectory_grader(run, example):
    """Did the agent call all required tools?"""
    called = {c["name"] for c in run.outputs.get("tool_calls", [])}
    required = set(example.outputs.get("must_call", []))
    return {
        "key": "required_tools_called",
        "score": 1.0 if required.issubset(called) else 0.0,
        "comment": f"missing: {required - called}" if required - called else "ok",
    }

def policy_grader(run, example):
    """Did the agent say anything forbidden?"""
    text = " ".join(m["content"] for m in run.outputs.get("messages", []))
    forbidden = example.outputs.get("must_not_say", [])
    hits = [p for p in forbidden if p.lower() in text.lower()]
    return {
        "key": "policy_violations",
        "score": 0.0 if hits else 1.0,
        "comment": f"forbidden hits: {hits}" if hits else "clean",
    }

results = evaluate(
    lambda inputs: my_agent.invoke(inputs),
    data="voice-agent-golden-v3",
    evaluators=[trajectory_grader, policy_grader],
    experiment_prefix="prompt-rev-117",
    max_concurrency=8,
)

print(results.to_pandas().describe())
```
The output is a comparable experiment row in LangSmith. You change the prompt, you re-run, you get a diff. That is what "we made the agent better" looks like in evidence form.
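When you want to gate on that diff rather than eyeball it, the comparison is a few lines over the per-example score rows. A minimal sketch over plain dicts — in practice you would pull the rows out of `results.to_pandas()`:

```python
def mean_scores(rows: list[dict]) -> dict:
    """Average each feedback key over per-example score rows."""
    totals: dict[str, list[float]] = {}
    for row in rows:
        for key, score in row.items():
            totals.setdefault(key, []).append(score)
    return {key: sum(v) / len(v) for key, v in totals.items()}

def experiment_diff(baseline_rows: list[dict], candidate_rows: list[dict]) -> dict:
    """Per-metric delta: positive means the candidate improved on that metric."""
    base, cand = mean_scores(baseline_rows), mean_scores(candidate_rows)
    return {k: round(cand.get(k, 0.0) - base.get(k, 0.0), 3)
            for k in sorted(set(base) | set(cand))}
```

Slicing the same diff by `metadata.source` or `metadata.persona` tells you not just whether the candidate regressed, but where.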
How a Golden Dataset Grows Without Rotting
A static dataset rots in 3 months. The agent improves, the easy examples become trivial, and the dataset stops discriminating. The discipline that keeps it useful:
- Promote every novel failure. If a production trace fails in a way no golden example covers, it gets a candidate slot. Novel is the key word — duplicates of existing failures don't help.
- Retire solved examples. When the agent has scored 100% on an example for 90 days across two model swaps, demote it to an "archive" tag. Keep it; stop running it on every PR.
- Rebalance quarterly. Pull persona/severity counts. If 70% of examples are `new_user` and your prod traffic shifted to `enterprise_admin`, you are evaluating the wrong agent.
- Adversarial top-up. Once a quarter, generate 50 synthetic adversarial examples — prompt injection, jailbreaks, role-switching — and SME-review them in. This is the one place synthetic data earns its keep.
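The quarterly rebalance check is a ten-line script. A sketch, assuming every example's `metadata.persona` field is populated and you know current production traffic shares:

```python
from collections import Counter

def rebalance_report(examples: list[dict],
                     traffic_share: dict[str, float]) -> dict[str, float]:
    """Dataset persona share minus production traffic share.

    Positive = over-represented in the dataset; negative = under-represented
    (i.e. you are under-evaluating that slice of real traffic)."""
    counts = Counter(ex["metadata"]["persona"] for ex in examples)
    total = sum(counts.values())
    return {persona: round(counts.get(persona, 0) / total - share, 2)
            for persona, share in traffic_share.items()}
```

Anything beyond roughly ±0.15 is a signal to add or retire examples for that persona in the next monthly version.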
Common Mistakes I See
- Engineer-written references. The model already thinks like an engineer; engineer-written references just measure self-agreement.
- No versioning. "We added some examples last week" is not a version. Pin a tag in CI.
- Treating the dataset as the test set. It is the training-the-prompt set too. You will overfit. Hold out 20% as a true held-out set.
- One dataset for chat and voice. Voice has interruptions, latency budgets, and hand-offs that chat doesn't. Separate them.
- No `must_not_say` field. Compliance failures are silent until they aren't. A two-line policy grader catches them in CI for ~$0 per run.
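The 20% held-out set mentioned above should be chosen deterministically, so an example never migrates between splits as the dataset grows. A hash-based sketch:

```python
import hashlib

def holdout_split(example_ids: list[str],
                  holdout_pct: int = 20) -> tuple[list[str], list[str]]:
    """Deterministically split ids: the same id always lands in the same
    bucket, regardless of insertion order or how much the dataset has grown."""
    dev, heldout = [], []
    for ex_id in example_ids:
        # Stable bucket in [0, 100) derived from the id itself
        bucket = int(hashlib.sha256(ex_id.encode()).hexdigest(), 16) % 100
        (heldout if bucket < holdout_pct else dev).append(ex_id)
    return dev, heldout
```

Iterate prompts against the dev split; score the held-out split only when you think you are done. A gap between the two is your overfitting meter.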
Wiring It Into CI
The whole point is to gate deploys. A LangSmith evaluate() run produces a structured result you can fail a PR on.
```python
import os
import sys

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

result = evaluate(
    my_agent_fn,
    # Pin the dataset version so every PR runs against the same examples
    data=client.list_examples(dataset_name="voice-agent-golden-v3", as_of="v3.4"),
    evaluators=[trajectory_grader, policy_grader, llm_judge],
    experiment_prefix=f"pr-{os.environ['GITHUB_PR_NUMBER']}",
)

df = result.to_pandas()
required_tools_pass = df["feedback.required_tools_called"].mean()
policy_pass = df["feedback.policy_violations"].mean()

# Fail the PR if either gate slips below threshold
if required_tools_pass < 0.95 or policy_pass < 0.99:
    print(f"REGRESSION: tools={required_tools_pass:.3f} policy={policy_pass:.3f}")
    sys.exit(1)
```
That snippet, dropped into a GitHub Action, is the difference between an agent that gets better every week and an agent that is one bad prompt edit away from a Slack incident.
FAQ
Q: How big should the golden dataset be? A: 500 examples is a strong floor for a single-purpose agent. 1,500 is plenty for most production systems. Past 2,000 you are adding examples for sport — your CI cost is climbing and your discriminative power is not. Bigger is not better; more diverse is better.
Q: Can I bootstrap a golden dataset before I have production traffic?
A: Yes, but flag everything as source=synthetic and treat the metrics as directional only. The day production traffic exists, start mining. The synthetic examples will get pushed out of the regression set within 60 days — that is healthy.
Q: How often should I re-version the dataset? A: Promote a new version monthly, with a release note describing what was added, removed, or rebalanced. CI pins the version, so promotion is non-disruptive — engineers update the pin when they are ready.
Q: Do I need an SME, or can engineers annotate? A: For trivial agents, engineers are fine. For anything domain-specific (medical, legal, finance, sales), an engineer-annotated dataset will encode the engineer's misunderstandings. Pay the SME hours. The dataset is the most leveraged thing they will ever produce.
Q: How does this differ from a unit test suite? A: Unit tests are deterministic and cover code paths. A golden dataset is probabilistic and covers behaviors — tool selection, refusal, persona-appropriate tone — that no unit test can express. They are complements, not substitutes.
Build Your Golden Dataset With CallSphere
If you are running voice or chat agents in production, you already have the raw material — every call is a candidate. CallSphere ships with the trace export, annotation flow, and evaluator hooks you need to stand a golden dataset up in a week, not a quarter. See the products page, the agent evaluation glossary entry, or book a working session.