Skip to content
Agentic AI
Agentic AI8 min read0 views

Testing & evals for Claude analytics agents: gate releases

Build an eval loop for a Claude self-service analytics agent: golden datasets, LLM-judge grading, and CI gates that block releases when quality regresses.

You change one sentence in your analytics agent's system prompt to fix a tool-routing bug, ship it, and three days later discover it quietly broke the way the agent handles date ranges — a category of question that was working fine before. This is the central problem with shipping LLM agents: the surface you're tuning is fuzzy, the failure modes are subtle, and a fix in one place can regress another with no compiler to catch it. The answer is the same one that disciplined software has always reached for, adapted to a probabilistic system: an evaluation loop that measures quality on a fixed set of cases and refuses to let a release through if the numbers drop. This post is about building that loop for a self-service data analytics agent specifically.

What "quality" means for an analytics agent

An eval is only as good as its definition of correct, and for an analytics agent "correct" has more than one dimension. The most important is answer accuracy: did the agent return the right number? For "total revenue in Q3," there's a single ground-truth value, and the agent either matched it or didn't. But accuracy alone misses things that matter in self-service. There's query correctness — did the agent join the right tables and filter on the right columns, or did it luck into the right number through a wrong query that will break on different data? There's tool-path correctness — did it look up the metric definition before querying, as it should? And there's refusal calibration — when a question is ambiguous or unanswerable from the available data, does the agent ask for clarification rather than confidently fabricating?

A good eval suite scores several of these, because optimizing only the headline number leads you astray. An agent that gets the right answer via a fragile query is a regression waiting to happen, and an agent that never asks for clarification is one that will eventually hand a stakeholder a confident wrong number. Define your dimensions up front, and write each test case to assert on the ones that matter for it.

Building the golden dataset

The foundation of the loop is a golden dataset: a curated set of representative questions, each paired with the known-correct answer and, where relevant, the expected query shape or tool path. Seed it from real usage — the questions analysts actually ask — and grow it deliberately. The highest-value entries are the failures: every time the agent gets something wrong in production, capture the full transcript, label what the right answer was, and add it as a permanent case. This is how the suite compounds. A bug you fix once becomes a test that guards against its return forever, and over a few months your golden dataset becomes a precise map of your agent's hard edges.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Cover the spread of difficulty deliberately. Include easy aggregations, multi-step questions that require joins and filtering, ambiguous questions where the correct behavior is to ask for clarification, and adversarial questions that probe injection or off-scope access. Because the Claude API is stateless and a transcript is fully self-contained, each case is just a saved input you can replay, which makes the dataset cheap to maintain and trivial to run.

flowchart TD
  A["Golden dataset of questions"] --> B["Run agent on each case"]
  B --> C["Capture answer + query + tool path"]
  C --> D{"Exact-match check"}
  D -->|Numeric / structured| E["Deterministic grader"]
  D -->|Open-ended| F["LLM-judge with rubric"]
  E --> G["Aggregate score"]
  F --> G
  G --> H{"Score below threshold?"}
  H -->|Yes| I["Block release"]
  H -->|No| J["Promote build"]

Grading: deterministic where you can, LLM-judge where you must

Grading splits cleanly into two regimes. Where the answer is a number, a structured result, or a specific tool path, grade deterministically: compare the agent's number to the expected value within a tolerance, check that the generated SQL references the expected tables, assert that get_metric_definition was called before run_sql. These checks are fast, free, and unambiguous, and you should push as much of your suite into this regime as possible.

Some dimensions resist exact matching — whether an explanation is faithful to the data, whether a clarifying question is appropriately scoped, whether a refusal was warranted. For these, use an LLM judge: a separate Claude call, given the question, the agent's response, the ground truth, and an explicit rubric, that scores the response against the rubric. An LLM judge is simply a model call that evaluates another model's output against stated criteria. The discipline that makes it reliable is the rubric: be concrete ("the answer states a single number and cites which table it came from") rather than vague ("the answer is good"), because a vague rubric produces noisy scores. Run the judge at a sensible effort and give it the ground truth so it's grading against fact, not vibes. To trust your judge, validate it against a small set of human-labeled cases and confirm it agrees with you before you let it gate anything.

Gating releases in CI

An eval suite that runs only when someone remembers to run it doesn't gate anything. Wire it into your release pipeline so every prompt change, tool-schema edit, or model bump triggers the full run, and set thresholds that block promotion when scores regress. The threshold can be absolute ("answer accuracy must stay above ninety percent") or relative ("no dimension may drop more than two points versus the current production build"). The relative form catches the insidious case where an overall number holds steady while a specific category quietly breaks — gate per-dimension, not just on the aggregate, so a fix that trades date-range handling for tool-routing gets caught.

Use the Batches API to run the suite cheaply: the cases are independent, so submit them as a batch at half price and poll for results, sharing a cached prefix across all of them to cut input cost further. Surface the diff against the last run prominently — which specific cases flipped from pass to fail — because "accuracy dropped one point" is far less actionable than "these four date-range questions now fail." The goal is that a developer changing the prompt sees, before merge, exactly what their change did to every category of question.

Closing the loop with production

The eval loop and production form a flywheel. Production surfaces new failure modes; you label them and fold them into the golden dataset; the suite grows more representative; future changes get tested against a richer set of real cases. Sample live traffic, run the same graders against production responses to catch drift that your fixed suite might miss, and route the failures back into the dataset. Over time the suite stops being a snapshot of what you thought to test and becomes an accumulated record of every way your agent has ever been wrong — which is exactly the asset you want guarding the gate. The teams that ship analytics agents confidently aren't the ones whose agents never fail; they're the ones whose every failure becomes a permanent test.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Frequently asked questions

What should I measure beyond answer accuracy?

Query correctness (did it join and filter correctly, not just luck into the number), tool-path correctness (did it look up the metric definition before querying), and refusal calibration (does it ask for clarification on ambiguous questions instead of fabricating). Optimizing accuracy alone rewards fragile queries and overconfident answers, so grade these dimensions explicitly.

When should I use an LLM judge versus deterministic grading?

Use deterministic grading wherever the answer is a number, a structured result, or a specific tool path — it's fast, free, and unambiguous. Reserve an LLM judge for things exact matching can't capture, like whether an explanation is faithful or a refusal was warranted, and always give the judge a concrete rubric and the ground truth so its scores are reliable.

How do I make sure the eval suite catches regressions, not just overall drops?

Gate per-dimension and per-category, not only on the aggregate score. A change can hold overall accuracy steady while quietly breaking one category like date ranges. Set thresholds that block promotion when any dimension or category regresses, and surface a case-level diff so you see exactly which questions flipped.

How big does the golden dataset need to be?

Start small with real, representative questions and grow it from production failures — every wrong answer becomes a permanent case. There's no magic number; what matters is coverage of difficulty levels and failure modes (easy aggregations, multi-step joins, ambiguous questions, adversarial probes). A focused, well-labeled few dozen cases that map your agent's hard edges beats hundreds of redundant easy ones.

From eval gates to live conversations

The same eval discipline — golden cases, rubric-based grading, release gates — is what lets a voice agent improve without quietly regressing on the calls that matter. CallSphere applies these agentic testing patterns to voice and chat, so AI assistants that answer every call and book work 24/7 keep getting better, not flakier. See it live at callsphere.ai.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.