By Sagar Shankaran, Founder of CallSphere
A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith.
Key takeaways
A golden dataset is the single most leveraged artifact in an agent program. It is not a CSV your intern made. It is a versioned, auditable, evolving collection of inputs paired with reference outputs and graders, mined from real production traces, annotated by humans who know the domain, and refreshed every time the agent fails in a new way. In LangSmith, a golden dataset is a first-class Dataset object with Examples, attached evaluators, and an experiments history that lets you compare prompts, models, and tool changes against the same yardstick. This post is the playbook I use when I bootstrap one from zero, scale it past 1,000 examples, and keep it alive in CI for years.
If you only remember three things: mine production, don't write fiction; version every change; let failures, not vibes, decide what gets added.
Every agent team I have walked into has the same shape of problem at month three: the demo works, the prompt has been tweaked 200 times, and nobody can answer the question, "is this version better than last week's version?" You ship a change, somebody on Slack says it feels worse, you revert, somebody else says the revert feels worse, and the team gives up on iterating.
The reason is always the same. There is no fixed reference. Without a golden dataset, every prompt change is judged on whichever 6 calls the on-call engineer happened to look at this morning. That is not evaluation — that is augury.
A golden dataset fixes this. It is the regression test suite for an agent. When you have one:
The teams that ship voice and chat agents at scale — the ones we work with at CallSphere across healthcare, real estate, sales, and IT helpdesk verticals — all have one. The teams that get stuck in pilot don't.
A golden Example in LangSmith is a triple: input, reference output, metadata. For agents specifically, "input" is rarely just a string — it is a structured payload that captures the full evaluable unit: the user message, the conversation history, the tool catalog available at that point, the system prompt revision, and any retrieved context.
| Field | What goes in it | Why it matters |
|---|---|---|
| inputs.messages | Full conversation up to the eval turn | Agents are stateful; one-shot inputs don't reproduce real failures |
| inputs.tools | Tool schemas available at runtime | Tool selection is a top-3 failure mode |
| inputs.context | RAG chunks, customer profile, prior call summary | Eliminates "context drift" as a confound |
| outputs.reference | Ideal final answer OR ideal trajectory | Either is fine; pick one and be consistent |
| outputs.must_call | Tools the agent MUST call | Trajectory-level grading |
| outputs.must_not_say | Forbidden phrases (PHI, competitor names) | Compliance graders run cheaply |
| metadata.source | production, synthetic, adversarial, regression | Lets you slice metrics by provenance |
| metadata.severity | p0, p1, p2 | Weight failures appropriately |
| metadata.persona | new_user, angry_customer, enterprise_admin | Stratified sampling at eval time |
What does NOT belong: examples your PM made up in a Notion doc to "represent the user." Synthetic data has a place — adversarial generation, edge-case probes, distribution gap-filling — but the core of the golden dataset must be production-mined. Real users break agents in ways no PM ever imagines.
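The field table above can be enforced before anything reaches the dataset. What follows is a minimal sketch, not part of LangSmith: `validate_example` and the constant names are hypothetical, and the rule that an example needs either a reference or a `must_call` list is an assumption drawn from the table rather than a stated requirement.

```python
from typing import Any

# Field names and allowed values are taken from the table above.
REQUIRED_INPUT_KEYS = {"messages", "tools", "context"}
ALLOWED_SOURCES = {"production", "synthetic", "adversarial", "regression"}
ALLOWED_SEVERITIES = {"p0", "p1", "p2"}


def validate_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the example is well-formed."""
    problems: list[str] = []
    inputs = example.get("inputs", {})
    outputs = example.get("outputs", {})
    metadata = example.get("metadata", {})

    missing = REQUIRED_INPUT_KEYS - inputs.keys()
    if missing:
        problems.append(f"inputs missing: {sorted(missing)}")
    elif not inputs["messages"]:
        problems.append("inputs.messages is empty")

    # Assumption: a gradable example needs a reference or a required-tool list.
    if "reference" not in outputs and not outputs.get("must_call"):
        problems.append("outputs needs a reference or must_call")

    if metadata.get("source") not in ALLOWED_SOURCES:
        problems.append(f"unknown metadata.source: {metadata.get('source')!r}")
    if metadata.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"unknown metadata.severity: {metadata.get('severity')!r}")
    return problems
```

Running a check like this at promotion time keeps a Notion-sourced example from slipping in under an unrecognized `source` value.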
Here is the pipeline I run, end to end. It is the same flow whether the agent is a customer support bot, a voice agent on Twilio, or a multi-agent research assistant.
```mermaid
flowchart LR
    A[Production traces<br/>LangSmith Project] --> B{Filter}
    B -->|user feedback < 3| C[Failure pool]
    B -->|tool error| C
    B -->|latency p99 spike| C
    B -->|judge score low| C
    B -->|random 1%| D[Sample pool]
    C --> E[Annotation Queue]
    D --> E
    E --> F[SME review]
    F -->|accept + correct| G[Dataset v_n+1]
    F -->|reject| H[Discard log]
    G --> I[Run evaluators]
    I --> J{Regression?}
    J -->|yes| K[Block deploy]
    J -->|no| L[Promote to prod]
    L --> A
    style A fill:#e6f3ff
    style E fill:#fff4e6
    style G fill:#e8f5e8
    style K fill:#fcc
```
Figure 1 — The golden dataset is a closed loop. Production feeds it, SMEs curate it, evaluators gate deploys, and accepted deploys produce the next batch of traces. Break this loop and your dataset rots.
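The Filter node in Figure 1 can be sketched as a pure function over a trace summary. This is an illustration under assumptions: the dict keys (`user_score`, `tool_error`, `latency_s`, `judge_score`), the 0.5 judge cutoff, and the use of a fixed latency threshold in place of a true p99 comparison are all hypothetical. The 1% sample is made deterministic by hashing the run id, so re-running the miner never reshuffles the sample pool.

```python
import hashlib
from typing import Optional


def route_trace(
    trace: dict,
    sample_rate: float = 0.01,
    latency_threshold_s: float = 5.0,   # stand-in for a real p99 comparison
    judge_threshold: float = 0.5,       # hypothetical cutoff for "judge score low"
) -> Optional[str]:
    """Return 'failure', 'sample', or None (not mined) for one trace summary."""
    score = trace.get("user_score")
    if score is not None and score < 3:
        return "failure"
    if trace.get("tool_error"):
        return "failure"
    if trace.get("latency_s", 0.0) > latency_threshold_s:
        return "failure"
    judge = trace.get("judge_score")
    if judge is not None and judge < judge_threshold:
        return "failure"

    # Deterministic sampling: map the run id to a stable bucket in [0, 1).
    digest = hashlib.sha256(trace["run_id"].encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "sample" if bucket < sample_rate else None
```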
LangSmith stores every Run you trace under a Project. The first job is to extract the runs that should end up in the dataset. Three filters give you 80% of the value.
```python
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()
since = datetime.now(timezone.utc) - timedelta(days=7)

# list_runs returns a lazy iterator, so each result is materialized with
# list(); counting an iterator directly would exhaust it before later steps.

# 1. Negative user feedback (thumbs-down, low CSAT, etc.)
negative_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='and(eq(feedback_key, "user_score"), lt(feedback_score, 3))',
    start_time=since,
    is_root=True,
))

# 2. Tool errors / exceptions inside the trace
error_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='eq(error, true)',
    start_time=since,
    is_root=True,
))

# 3. Latency outliers (anything > 5s end-to-end).
# LangSmith's documented filter examples write durations as strings like "5s".
slow_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='gt(latency, "5s")',
    start_time=since,
    is_root=True,
))

print(f"Mined: {len(negative_runs)} neg, "
      f"{len(error_runs)} err, "
      f"{len(slow_runs)} slow")
```
In a real prod system at ~50K calls/day, this typically yields 300–800 candidates per week. That is too many to annotate by hand and too few to skip. The Annotation Queue is how you bridge that gap.
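The three filters overlap: a single run can be slow, errored, and thumbs-down at once. One way to trim the candidate list before it reaches reviewers is to merge the pools, dedupe by run id, and rank by how many filters flagged each run. This is a sketch of one possible triage heuristic, not something LangSmith does for you; `merge_candidates` and the `budget` parameter are hypothetical.

```python
def merge_candidates(
    pools: dict[str, list[str]], budget: int
) -> list[tuple[str, list[str]]]:
    """Merge run ids from several pools and keep the most-flagged ones.

    Returns (run_id, reasons) pairs, most reasons first, capped at `budget`.
    """
    reasons: dict[str, list[str]] = {}
    for reason, run_ids in pools.items():
        for run_id in run_ids:
            flagged = reasons.setdefault(run_id, [])
            if reason not in flagged:  # ignore duplicates within one pool
                flagged.append(reason)
    # Sort by number of reasons (descending), then run id for stable output.
    ranked = sorted(reasons.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return ranked[:budget]
```

Keeping the reasons next to each run id also gives the SME a hint about why the trace landed in front of them.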
LangSmith's Annotation Queue is a curated workspace where SMEs review one trace at a time, mark it correct or incorrect, and (critically) edit the reference output to what the agent should have said.
```typescript
import { Client } from "langsmith";

const client = new Client();

// Create a queue once per surface
const queue = await client.createAnnotationQueue({
  name: "voice-agent-failures-2026-w18",
  description: "Weekly SME review for golden dataset growth",
  defaultDataset: "voice-agent-golden-v3",
});

// Push candidate runs into the queue.
// addRunsToAnnotationQueue expects the queue's id, not its display name.
const candidateIds: string[] = [/* run ids from Step 1 */];
await client.addRunsToAnnotationQueue(queue.id, candidateIds);
```
The cardinal rule: the SME owns the reference output, not the engineer. Your SME is the support lead, the nurse practitioner, the sales manager, the compliance officer. They know what the right answer was. Engineers are not allowed to write reference outputs — that is how you bake the model's biases into the grader.
Once an SME accepts an example, it gets promoted into the versioned dataset. LangSmith versions datasets automatically: every time examples are added, updated, or deleted, a new version is created, and you can tag that version and pin it in CI.
```python
from langsmith import Client

client = Client()

# Create the dataset once
dataset = client.create_dataset(
    dataset_name="voice-agent-golden-v3",
    description="Production-mined golden set for the voice agent.",
)

# Promote one annotated example
client.create_example(
    dataset_id=dataset.id,
    inputs={
        "messages": [
            {"role": "system", "content": "You are an after-hours triage agent."},
            {"role": "user", "content": "My 4-year-old has had a fever for 3 days."},
        ],
        "tools": ["lookup_clinic_hours", "page_on_call", "schedule_callback"],
        "context": {"clinic_id": "clinic_482", "patient_age": 4},
    },
    outputs={
        "reference": "I'm going to page the on-call physician now. Please stay on the line.",
        "must_call": ["page_on_call"],
        "must_not_say": ["I'm not a doctor", "call 911"],
    },
    metadata={
        "source": "production",
        "severity": "p0",
        "persona": "worried_parent",
        "trace_id": "8f3a...e91",
        "annotator": "rn_kelly",
        "version_added": "v3.4",
    },
)
```
People treat datasets as one big bucket. They shouldn't. In production I always run two:
Both live in LangSmith. The drift dataset feeds the golden one — examples that survive 30 days, get SME-confirmed, and represent something new (not already covered by an existing golden example) get promoted.
| Property | Golden | Drift |
|---|---|---|
| Size | 500–1,500 | 5,000–50,000 |
| Cadence of change | Reviewed monthly | Refreshed weekly |
| Used for | CI regression gate | Distribution monitoring |
| Pinned version in CI | Yes (v3.4) | No (always latest) |
| Edited reference outputs | Yes | No (uses agent's own output) |
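The promotion rule described above (survives 30 days, SME-confirmed, not already covered) is easy to state as code. The sketch below is an assumption about how you might track it: `DriftExample`, its `fingerprint` field, and `promotable` are hypothetical names, and how you compute a fingerprint that means "represents the same thing" is left to you.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DriftExample:
    example_id: str
    added_on: date
    sme_confirmed: bool
    fingerprint: str  # hypothetical: e.g. normalized intent plus required tools


def promotable(
    drift: list[DriftExample],
    golden_fingerprints: set[str],
    today: date,
    min_age_days: int = 30,
) -> list[DriftExample]:
    """Drift examples old enough, SME-confirmed, and not already in golden."""
    seen = set(golden_fingerprints)
    selected = []
    for ex in drift:
        old_enough = (today - ex.added_on).days >= min_age_days
        if old_enough and ex.sme_confirmed and ex.fingerprint not in seen:
            selected.append(ex)
            seen.add(ex.fingerprint)  # avoid promoting two copies of one case
    return selected
```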
Once the dataset exists, the next question is what to grade. For agents specifically, you grade three layers:
- Trajectory: did the agent call the must_call tools and avoid forbidden tools? Pure code grader, deterministic.
- Policy: did the agent say any must_not_say phrase? Pure code grader.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()


def trajectory_grader(run, example):
    """Did the agent call all required tools?"""
    outputs = run.outputs or {}  # outputs is None when the run errored
    called = {c["name"] for c in outputs.get("tool_calls", [])}
    required = set(example.outputs.get("must_call", []))
    missing = required - called
    return {
        "key": "required_tools_called",
        "score": 1.0 if not missing else 0.0,
        "comment": f"missing: {missing}" if missing else "ok",
    }


def policy_grader(run, example):
    """Did the agent say anything forbidden?"""
    outputs = run.outputs or {}
    text = " ".join(m["content"] for m in outputs.get("messages", []))
    forbidden = example.outputs.get("must_not_say", [])
    hits = [p for p in forbidden if p.lower() in text.lower()]
    return {
        "key": "policy_violations",
        "score": 0.0 if hits else 1.0,
        "comment": f"forbidden hits: {hits}" if hits else "clean",
    }


# my_agent is your own agent object; it is not defined in this snippet.
results = evaluate(
    lambda inputs: my_agent.invoke(inputs),
    data="voice-agent-golden-v3",
    evaluators=[trajectory_grader, policy_grader],
    experiment_prefix="prompt-rev-117",
    max_concurrency=8,
)
print(results.to_pandas().describe())
```
The output is a comparable experiment row in LangSmith. You change the prompt, you re-run, you get a diff. That is what "we made the agent better" looks like in evidence form.
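The aggregate score is only half of the diff; the other half is which examples flipped. A minimal sketch of that comparison, assuming you have already pulled per-example scores for one feedback key out of two experiments into plain dicts keyed by example id (`find_regressions` is a hypothetical helper, not a LangSmith function):

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[tuple[str, float, float]]:
    """Examples whose score dropped by more than `tolerance`.

    Returns (example_id, old_score, new_score), largest drop first.
    Examples present in only one experiment are skipped.
    """
    regressions = []
    for example_id, old in baseline.items():
        new = candidate.get(example_id)
        if new is not None and old - new > tolerance:
            regressions.append((example_id, old, new))
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)
```

Two experiments with the same mean can still trade a p0 pass for a p2 pass, and a per-example list like this is what surfaces that trade.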
A static dataset rots in 3 months. The agent improves, the easy examples become trivial, and the dataset stops discriminating. The discipline that keeps it useful:
- If the dataset is mostly new_user and your prod traffic shifted to enterprise_admin, you are evaluating the wrong agent.
- Fill in the must_not_say field. Compliance failures are silent until they aren't. A two-line policy grader catches them in CI for ~$0 per run.

The whole point is to gate deploys. A LangSmith evaluate() run produces a structured result you can fail a PR on.
```python
import os
import sys

from langsmith.evaluation import evaluate

# my_agent_fn, trajectory_grader, policy_grader, and llm_judge are assumed
# to be imported from your own evaluation module.
result = evaluate(
    my_agent_fn,
    data="voice-agent-golden-v3",  # pinned version
    evaluators=[trajectory_grader, policy_grader, llm_judge],
    experiment_prefix=f"pr-{os.environ['GITHUB_PR_NUMBER']}",
)

df = result.to_pandas()
required_tools_pass = df["feedback.required_tools_called"].mean()
policy_pass = df["feedback.policy_violations"].mean()

# Fail the PR if either gate slips below threshold
if required_tools_pass < 0.95 or policy_pass < 0.99:
    print(f"REGRESSION: tools={required_tools_pass:.3f} policy={policy_pass:.3f}")
    sys.exit(1)
```
That snippet, dropped into a GitHub Action, is the difference between an agent that gets better every week and an agent that is one bad prompt edit away from a Slack incident.
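An overall mean can hide a p0 failure behind a pile of p2 passes, which is what the severity metadata is for. A small sketch of slicing a pass rate by severity follows; the row shape (a `severity` key plus one score per metric) is an assumption about how you flatten the experiment results, and `pass_rate_by_severity` is a hypothetical helper.

```python
from collections import defaultdict


def pass_rate_by_severity(rows: list[dict], metric: str) -> dict[str, float]:
    """Mean score of `metric` within each severity bucket."""
    scores: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        scores[row["severity"]].append(row[metric])
    return {sev: sum(vals) / len(vals) for sev, vals in scores.items()}


rows = [
    {"severity": "p0", "policy_violations": 1.0},
    {"severity": "p0", "policy_violations": 0.0},
    {"severity": "p2", "policy_violations": 1.0},
    {"severity": "p2", "policy_violations": 1.0},
    {"severity": "p2", "policy_violations": 1.0},
    {"severity": "p2", "policy_violations": 1.0},
]
rates = pass_rate_by_severity(rows, "policy_violations")
```

Here the overall pass rate is 5 of 6, while the p0 slice sits at 0.5. Whether to apply a stricter threshold to the p0 slice is a policy choice for your team.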
Q: How big should the golden dataset be?
A: 500 examples is a strong floor for a single-purpose agent. 1,500 is plenty for most production systems. Past 2,000 you are adding examples for sport: your CI cost is climbing and your discriminative power is not. Bigger is not better; more diverse is better.
Q: Can I bootstrap a golden dataset before I have production traffic?
A: Yes, but flag everything as source=synthetic and treat the metrics as directional only. The day production traffic exists, start mining. The synthetic examples will get pushed out of the regression set within 60 days — that is healthy.
Q: How often should I re-version the dataset?
A: Promote a new version monthly, with a release note describing what was added, removed, or rebalanced. CI pins the version, so promotion is non-disruptive; engineers update the pin when they are ready.
Q: Do I need an SME, or can engineers annotate?
A: For trivial agents, engineers are fine. For anything domain-specific (medical, legal, finance, sales), an engineer-annotated dataset will encode the engineer's misunderstandings. Pay the SME hours. The dataset is the most leveraged thing they will ever produce.
Q: How does this differ from a unit test suite?
A: Unit tests are deterministic and cover code paths. A golden dataset is probabilistic and covers behaviors (tool selection, refusal, persona-appropriate tone) that no unit test can express. They are complements, not substitutes.
If you are running voice or chat agents in production, you already have the raw material — every call is a candidate. CallSphere ships with the trace export, annotation flow, and evaluator hooks you need to stand a golden dataset up in a week, not a quarter. See the products page, the agent evaluation glossary entry, or book a working session.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.