
How to Build a Golden Dataset for Production AI Agents

A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith.

TL;DR

A golden dataset is the single most leveraged artifact in an agent program. It is not a CSV your intern made. It is a versioned, auditable, evolving collection of inputs paired with reference outputs and graders, mined from real production traces, annotated by humans who know the domain, and refreshed every time the agent fails in a new way. In LangSmith, a golden dataset is a first-class Dataset object with Examples, attached evaluators, and an experiments history that lets you compare prompts, models, and tool changes against the same yardstick. This post is the playbook I use when I bootstrap one from zero, scale it past 1,000 examples, and keep it alive in CI for years.

If you only remember three things: mine production, don't write fiction; version every change; let failures, not vibes, decide what gets added.

Why Golden Datasets Decide Whether Your Agent Ships

Every agent team I have walked into has the same shape of problem at month three: the demo works, the prompt has been tweaked 200 times, and nobody can answer the question, "is this version better than last week's version?" You ship a change, somebody on Slack says it feels worse, you revert, somebody else says the revert feels worse, and the team gives up on iterating.

The reason is always the same. There is no fixed reference. Without a golden dataset, every prompt change is judged on whichever 6 calls the on-call engineer happened to look at this morning. That is not evaluation — that is augury.

A golden dataset fixes this. It is the regression test suite for an agent. When you have one:

  • Every prompt change runs against the same N examples.
  • Every model swap runs against the same N examples.
  • Every tool refactor runs against the same N examples.
  • Wins and regressions are measurable to two decimal places, and the category of regression (tool selection, hallucination, refusal, latency) is visible.

The teams that ship voice and chat agents at scale — the ones we work with at CallSphere across healthcare, real estate, sales, and IT helpdesk verticals — all have one. The teams that get stuck in pilot don't.

What Belongs in a Golden Dataset (and What Doesn't)

A golden Example in LangSmith is a triple: input, reference output, metadata. For agents specifically, "input" is rarely just a string — it is a structured payload that captures the full evaluable unit: the user message, the conversation history, the tool catalog available at that point, the system prompt revision, and any retrieved context.

| Field | What goes in it | Why it matters |
| --- | --- | --- |
| inputs.messages | Full conversation up to the eval turn | Agents are stateful; one-shot inputs don't reproduce real failures |
| inputs.tools | Tool schemas available at runtime | Tool selection is a top-3 failure mode |
| inputs.context | RAG chunks, customer profile, prior call summary | Eliminates "context drift" as a confound |
| outputs.reference | Ideal final answer OR ideal trajectory | Either is fine; pick one and be consistent |
| outputs.must_call | Tools the agent MUST call | Trajectory-level grading |
| outputs.must_not_say | Forbidden phrases (PHI, competitor names) | Compliance graders run cheaply |
| metadata.source | production, synthetic, adversarial, regression | Lets you slice metrics by provenance |
| metadata.severity | p0, p1, p2 | Weight failures appropriately |
| metadata.persona | new_user, angry_customer, enterprise_admin | Stratified sampling at eval time |

What does NOT belong: examples your PM made up in a Notion doc to "represent the user." Synthetic data has a place — adversarial generation, edge-case probes, distribution gap-filling — but the core of the golden dataset must be production-mined. Real users break agents in ways no PM ever imagines.


The Build Pipeline: From Trace to Golden Example

Here is the pipeline I run, end to end. It is the same flow whether the agent is a customer support bot, a voice agent on Twilio, or a multi-agent research assistant.

flowchart LR
  A[Production traces<br/>LangSmith Project] --> B{Filter}
  B -->|user feedback < 3| C[Failure pool]
  B -->|tool error| C
  B -->|latency p99 spike| C
  B -->|judge score low| C
  B -->|random 1%| D[Sample pool]
  C --> E[Annotation Queue]
  D --> E
  E --> F[SME review]
  F -->|accept + correct| G[Dataset v_n+1]
  F -->|reject| H[Discard log]
  G --> I[Run evaluators]
  I --> J{Regression?}
  J -->|yes| K[Block deploy]
  J -->|no| L[Promote to prod]
  L --> A
  style A fill:#e6f3ff
  style E fill:#fff4e6
  style G fill:#e8f5e8
  style K fill:#fcc

Figure 1 — The golden dataset is a closed loop. Production feeds it, SMEs curate it, evaluators gate deploys, and accepted deploys produce the next batch of traces. Break this loop and your dataset rots.

Step 1 — Mine Production Traces

LangSmith stores every Run you trace under a Project. The first job is to extract the runs that should end up in the dataset. Three filters give you 80% of the value.

from langsmith import Client
from datetime import datetime, timedelta

client = Client()

# 1. Negative user feedback (thumbs-down, low CSAT, etc.)
# Materialize the generators so the run IDs survive for Step 2.
negative_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='and(eq(feedback_key, "user_score"), lt(feedback_score, 3))',
    start_time=datetime.utcnow() - timedelta(days=7),
    is_root=True,
))

# 2. Tool errors / exceptions inside the trace
error_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='eq(error, true)',
    start_time=datetime.utcnow() - timedelta(days=7),
    is_root=True,
))

# 3. Latency outliers (anything > 5s end-to-end; the filter DSL
# takes latency as a duration string)
slow_runs = list(client.list_runs(
    project_name="prod-voice-agent",
    filter='gt(latency, "5s")',
    start_time=datetime.utcnow() - timedelta(days=7),
    is_root=True,
))

print(f"Mined: {len(negative_runs)} neg, "
      f"{len(error_runs)} err, "
      f"{len(slow_runs)} slow")

In a real prod system at ~50K calls/day, this typically yields 300–800 candidates per week. That is too many to annotate by hand and too few to skip. The Annotation Queue is how you bridge that gap.
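Part of bridging that gap happens before the queue: collapse near-duplicate failures so SMEs never review the same breakage twice. A minimal sketch in plain Python (the failing_tool and error_class fields are illustrative stand-ins for whatever you extract from a trace, not LangSmith run attributes):

```python
import hashlib
import random

def failure_signature(run: dict) -> str:
    """Coarse dedup key: same failing tool + same error class collapse
    into one bucket. Field names here are illustrative."""
    key = f"{run.get('failing_tool', '')}|{run.get('error_class', '')}"
    return hashlib.sha1(key.encode()).hexdigest()[:8]

def weekly_batch(candidates: list[dict], budget: int = 100, seed: int = 0) -> list[dict]:
    """Keep one candidate per failure signature, then sample down to the
    annotation budget so SME time is spent on distinct failures."""
    seen, unique = set(), []
    for run in candidates:
        sig = failure_signature(run)
        if sig not in seen:
            seen.add(sig)
            unique.append(run)
    random.Random(seed).shuffle(unique)
    return unique[:budget]

candidates = [
    {"id": "r1", "failing_tool": "page_on_call", "error_class": "Timeout"},
    {"id": "r2", "failing_tool": "page_on_call", "error_class": "Timeout"},  # duplicate
    {"id": "r3", "failing_tool": "schedule_callback", "error_class": "BadArgs"},
]
batch = weekly_batch(candidates, budget=100)
print(len(batch))  # 2 distinct failure signatures
```

A signature this coarse over-merges sometimes; that is the point. You would rather lose an occasional variant than burn SME hours on fifty copies of the same timeout.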

Step 2 — Send Candidates to an Annotation Queue

LangSmith's Annotation Queue is a curated workspace where SMEs review one trace at a time, mark it correct or incorrect, and (critically) edit the reference output to what the agent should have said.

import { Client } from "langsmith";

const client = new Client();

// Create a queue once per surface
const queue = await client.createAnnotationQueue({
  name: "voice-agent-failures-2026-w18",
  description: "Weekly SME review for golden dataset growth",
  defaultDataset: "voice-agent-golden-v3",
});

// Push candidate runs into the queue (the API takes the queue id, not the name)
const candidateIds: string[] = [/* run ids from Step 1 */];
await client.addRunsToAnnotationQueue(queue.id, candidateIds);

The cardinal rule: the SME owns the reference output, not the engineer. Your SME is the support lead, the nurse practitioner, the sales manager, the compliance officer. They know what the right answer was. Engineers are not allowed to write reference outputs — that is how you bake the model's biases into the grader.

Step 3 — Promote Annotated Examples Into the Dataset

Once an SME accepts an example, it gets promoted into the versioned dataset. LangSmith versions datasets automatically — every change creates a new version, which you can tag and pin in CI.

from langsmith import Client

client = Client()

# Create the dataset once
dataset = client.create_dataset(
    dataset_name="voice-agent-golden-v3",
    description="Production-mined golden set for the voice agent.",
)

# Promote one annotated example
client.create_example(
    dataset_id=dataset.id,
    inputs={
        "messages": [
            {"role": "system", "content": "You are an after-hours triage agent."},
            {"role": "user", "content": "My 4-year-old has had a fever for 3 days."},
        ],
        "tools": ["lookup_clinic_hours", "page_on_call", "schedule_callback"],
        "context": {"clinic_id": "clinic_482", "patient_age": 4},
    },
    outputs={
        "reference": "I'm going to page the on-call physician now. Please stay on the line.",
        "must_call": ["page_on_call"],
        "must_not_say": ["I'm not a doctor", "call 911"],
    },
    metadata={
        "source": "production",
        "severity": "p0",
        "persona": "worried_parent",
        "trace_id": "8f3a...e91",
        "annotator": "rn_kelly",
        "version_added": "v3.4",
    },
)

Versioning, Splits, and the Two Datasets You Actually Need

People treat datasets as one big bucket. They shouldn't. In production I always run two:

  1. Golden — frozen. The regression set. Curated, balanced across personas and severities, capped at ~500–1,500 examples. Changes go through review. CI runs against this dataset on every PR. This is the gate.
  2. Drift — rolling. Last 30 days of production-mined examples. Grows weekly. No promotion needed. Used for trend analysis and to spot distribution shift. This is the canary.

Both live in LangSmith. The drift dataset feeds the golden one — examples that survive 30 days, get SME-confirmed, and represent something new (not already covered by an existing golden example) get promoted.
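That promotion rule can be made mechanical. Here is a hedged sketch of the novelty check, assuming examples are plain dicts shaped like the schema table above; the covers heuristic is deliberately coarse and should be tuned per domain:

```python
def covers(golden: dict, candidate: dict) -> bool:
    """Treat a candidate as covered when an existing golden example
    already exercises the same persona and required-tool set
    (a deliberately coarse heuristic; tune per domain)."""
    return (
        golden["metadata"]["persona"] == candidate["metadata"]["persona"]
        and set(golden["outputs"].get("must_call", []))
        == set(candidate["outputs"].get("must_call", []))
    )

def promotable(candidate: dict, golden_set: list[dict], min_age_days: int = 30) -> bool:
    """Promote only SME-confirmed drift examples that have survived the
    rolling window and are not already covered by a golden example."""
    meta = candidate["metadata"]
    return (
        meta.get("sme_confirmed", False)
        and meta.get("age_days", 0) >= min_age_days
        and not any(covers(g, candidate) for g in golden_set)
    )

golden_set = [{
    "metadata": {"persona": "worried_parent"},
    "outputs": {"must_call": ["page_on_call"]},
}]
novel = {
    "metadata": {"persona": "enterprise_admin", "sme_confirmed": True, "age_days": 31},
    "outputs": {"must_call": ["schedule_callback"]},
}
duplicate = {
    "metadata": {"persona": "worried_parent", "sme_confirmed": True, "age_days": 45},
    "outputs": {"must_call": ["page_on_call"]},
}
print(promotable(novel, golden_set), promotable(duplicate, golden_set))  # True False
```

Running this weekly over the drift dataset gives SMEs a short promotion list instead of a judgment call per example.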

| Property | Golden | Drift |
| --- | --- | --- |
| Size | 500–1,500 | 5,000–50,000 |
| Cadence of change | Reviewed monthly | Refreshed weekly |
| Used for | CI regression gate | Distribution monitoring |
| Pinned version in CI | Yes (v3.4) | No (always latest) |
| Edited reference outputs | Yes | No (uses agent's own output) |

Running Evaluators Against the Golden Set

Once the dataset exists, the next question is what to grade. For agents specifically, you grade three layers:

  1. Final-answer correctness — does the last assistant message match the reference? Use an LLM judge with rubric, NOT exact string match.
  2. Trajectory correctness — did the agent call must_call tools and avoid forbidden tools? Pure code grader, deterministic.
  3. Safety/policy — did any message contain a must_not_say phrase? Pure code grader.
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

def trajectory_grader(run, example):
    """Did the agent call all required tools?"""
    called = {c["name"] for c in run.outputs.get("tool_calls", [])}
    required = set(example.outputs.get("must_call", []))
    return {
        "key": "required_tools_called",
        "score": 1.0 if required.issubset(called) else 0.0,
        "comment": f"missing: {required - called}" if required - called else "ok",
    }

def policy_grader(run, example):
    """Did the agent say anything forbidden?"""
    text = " ".join(m["content"] for m in run.outputs.get("messages", []))
    forbidden = example.outputs.get("must_not_say", [])
    hits = [p for p in forbidden if p.lower() in text.lower()]
    return {
        "key": "policy_violations",
        "score": 0.0 if hits else 1.0,
        "comment": f"forbidden hits: {hits}" if hits else "clean",
    }

results = evaluate(
    lambda inputs: my_agent.invoke(inputs),
    data="voice-agent-golden-v3",
    evaluators=[trajectory_grader, policy_grader],
    experiment_prefix="prompt-rev-117",
    max_concurrency=8,
)

print(results.to_pandas().describe())

The output is a comparable experiment row in LangSmith. You change the prompt, you re-run, you get a diff. That is what "we made the agent better" looks like in evidence form.
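The diff itself is a few lines of pandas once both experiments are exported. A sketch, assuming the feedback.* column naming shown in the graders above; the toy frames here stand in for real evaluate(...).to_pandas() exports:

```python
import pandas as pd

def experiment_diff(baseline: pd.DataFrame, candidate: pd.DataFrame,
                    metrics: list[str]) -> pd.DataFrame:
    """Per-metric mean scores for two experiment exports, plus the delta."""
    rows = [
        {
            "metric": m,
            "baseline": baseline[m].mean(),
            "candidate": candidate[m].mean(),
            "delta": candidate[m].mean() - baseline[m].mean(),
        }
        for m in metrics
    ]
    return pd.DataFrame(rows)

# Toy frames standing in for two evaluate(...).to_pandas() exports
base = pd.DataFrame({"feedback.required_tools_called": [1.0, 0.0, 1.0, 1.0]})
cand = pd.DataFrame({"feedback.required_tools_called": [1.0, 1.0, 1.0, 1.0]})
print(experiment_diff(base, cand, ["feedback.required_tools_called"]))
```

Pasting that table into the PR description is a habit worth enforcing: the reviewer sees the delta, not a vibe.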

How a Golden Dataset Grows Without Rotting

A static dataset rots in 3 months. The agent improves, the easy examples become trivial, and the dataset stops discriminating. The discipline that keeps it useful:

  • Promote every novel failure. If a production trace fails in a way no golden example covers, it gets a candidate slot. Novel is the key word — duplicates of existing failures don't help.
  • Retire solved examples. When the agent has scored 100% on an example for 90 days across two model swaps, demote it to an "archive" tag. Keep it; stop running it on every PR.
  • Rebalance quarterly. Pull persona/severity counts. If 70% of examples are new_user and your prod traffic shifted to enterprise_admin, you are evaluating the wrong agent.
  • Adversarial top-up. Once a quarter, generate 50 synthetic adversarial examples — prompt injection, jailbreaks, role-switching — and SME-review them in. This is the one place synthetic data earns its keep.
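The retire rule lends itself to automation. A sketch of the decision function, assuming you can export each example's dated scores and count model swaps in the window (both inputs are illustrative, not LangSmith APIs):

```python
from datetime import date, timedelta

def should_archive(score_history: list[tuple[date, float]],
                   model_swaps_in_window: int,
                   window_days: int = 90) -> bool:
    """Archive when the example scored 1.0 on every run in the window
    across at least two model swaps (the thresholds from the list above)."""
    cutoff = date.today() - timedelta(days=window_days)
    recent = [score for day, score in score_history if day >= cutoff]
    return (
        bool(recent)
        and all(s == 1.0 for s in recent)
        and model_swaps_in_window >= 2
    )

today = date.today()
solved = [(today - timedelta(days=d), 1.0) for d in range(0, 90, 7)]
flaky = solved[:-1] + [(today, 0.8)]
print(should_archive(solved, 2), should_archive(flaky, 2))  # True False
```

Apply the verdict by flipping a metadata tag on the example rather than deleting it, so an archived example can be revived the day it starts failing again.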

Common Mistakes I See

  • Engineer-written references. The model already thinks like an engineer; engineer-written references just measure self-agreement.
  • No versioning. "We added some examples last week" is not a version. Pin a tag in CI.
  • Treating the dataset as the test set. It is also the set you tune the prompt against, so you will overfit to it. Hold out 20% as a true held-out split that never influences prompt edits.
  • One dataset for chat and voice. Voice has interruptions, latency budgets, and hand-offs that chat doesn't. Separate them.
  • No must_not_say field. Compliance failures are silent until they aren't. A two-line policy grader catches them in CI for ~$0 per run.
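On the holdout point: a deterministic, hash-based split keeps each example in the same bucket across dataset versions, no matter how examples are reordered or when they were added. A sketch using the 20% figure from the list above:

```python
import hashlib

def split_of(example_id: str, holdout_pct: int = 20) -> str:
    """Stable split assignment: hash the example id so the same example
    always lands in the same bucket, independent of dataset ordering."""
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 100
    return "holdout" if bucket < holdout_pct else "ci"

ids = [f"example-{i}" for i in range(1000)]
holdout = [i for i in ids if split_of(i) == "holdout"]
print(len(holdout), "of", len(ids), "held out")
```

Store the split name in each example's metadata at promotion time; CI then filters to the ci split and the holdout split is only scored on release candidates.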

Wiring It Into CI

The whole point is to gate deploys. A LangSmith evaluate() run produces a structured result you can fail a PR on.

import os
import sys
from langsmith.evaluation import evaluate

result = evaluate(
    my_agent_fn,
    data="voice-agent-golden-v3",  # pinned version
    evaluators=[trajectory_grader, policy_grader, llm_judge],
    experiment_prefix=f"pr-{os.environ['GITHUB_PR_NUMBER']}",
)

df = result.to_pandas()
required_tools_pass = df["feedback.required_tools_called"].mean()
policy_pass = df["feedback.policy_violations"].mean()

# Fail the PR if either gate slips below threshold
if required_tools_pass < 0.95 or policy_pass < 0.99:
    print(f"REGRESSION: tools={required_tools_pass:.3f} policy={policy_pass:.3f}")
    sys.exit(1)

That snippet, dropped into a GitHub Action, is the difference between an agent that gets better every week and an agent that is one bad prompt edit away from a Slack incident.

FAQ

Q: How big should the golden dataset be? A: 500 examples is a strong floor for a single-purpose agent. 1,500 is plenty for most production systems. Past 2,000 you are adding examples for sport — your CI cost is climbing and your discriminative power is not. Bigger is not better; more diverse is better.

Q: Can I bootstrap a golden dataset before I have production traffic? A: Yes, but flag everything as source=synthetic and treat the metrics as directional only. The day production traffic exists, start mining. The synthetic examples will get pushed out of the regression set within 60 days — that is healthy.

Q: How often should I re-version the dataset? A: Promote a new version monthly, with a release note describing what was added, removed, or rebalanced. CI pins the version, so promotion is non-disruptive — engineers update the pin when they are ready.

Q: Do I need an SME, or can engineers annotate? A: For trivial agents, engineers are fine. For anything domain-specific (medical, legal, finance, sales), an engineer-annotated dataset will encode the engineer's misunderstandings. Pay the SME hours. The dataset is the most leveraged thing they will ever produce.

Q: How does this differ from a unit test suite? A: Unit tests are deterministic and cover code paths. A golden dataset is probabilistic and covers behaviors — tool selection, refusal, persona-appropriate tone — that no unit test can express. They are complements, not substitutes.

Build Your Golden Dataset With CallSphere

If you are running voice or chat agents in production, you already have the raw material — every call is a candidate. CallSphere ships with the trace export, annotation flow, and evaluator hooks you need to stand a golden dataset up in a week, not a quarter. See the products page, the agent evaluation glossary entry, or book a working session.

Book a demo · See products · Browse the glossary
