Testing and Evals for Claude Agents: Gate Releases
Build a Claude agent eval loop: outcome and trajectory metrics, real-failure datasets, rubric-driven LLM-as-judge, and CI gates that block regressions.
You cannot ship an agent you cannot measure. Unlike a deterministic service, an agent's behavior drifts with every prompt edit, tool change, and model upgrade, and a tweak that fixes one case quietly breaks three others. Vibes-based iteration works until it doesn't — usually in production. The fix is an eval loop: a repeatable harness that scores your agent against a fixed dataset and blocks releases that regress. This post is about building that loop for Claude agents, from choosing what to measure to wiring an LLM-as-judge you can actually trust.
Key takeaways
- Evals turn agent quality from a feeling into a number you can gate releases on.
- Measure outcomes and trajectories: did the agent reach the right end state, and did it take a sane path to get there?
- Build a dataset from real failures — every production bug becomes a permanent regression test.
- Use a Claude model as a judge for fuzzy criteria, but pin it with a rubric and validate it against human labels.
- Run evals in CI, cache the shared prompt prefix to keep them cheap, and fail the build on regression.
What to measure in an agent
Agent quality is multidimensional, and picking the wrong metric leads you astray. There are three families worth tracking. Outcome correctness asks whether the final state is right: was the ticket actually created, the answer factually correct, the refund the right amount? Trajectory quality asks whether the path was reasonable: did the agent pick the right tools, avoid loops, and not take destructive detours? Operational metrics cover cost and latency: tokens per run, turns per run, and wall-clock time.
A clean definition to anchor on: an agent eval is an automated test that runs the agent on a fixed input and scores its output and behavior against predefined success criteria. The word "fixed" is doing real work — without a frozen dataset, you cannot compare runs, and comparison is the entire point.
Building the eval dataset
Start small and real. Twenty to fifty hand-picked cases beat a thousand synthetic ones, because each should encode a specific behavior you care about. Seed the set from three sources: golden-path cases that must always work, edge cases you know are tricky, and — most valuably — real production failures. Make it a rule that every bug you fix gets a corresponding eval case added before the fix merges. Over a few months this dataset becomes your most valuable asset: a precise, growing specification of what "good" means for your agent.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Code / prompt change"] --> B["Run agent on fixed eval set"]
B --> C["Score outcome + trajectory"]
C --> D{"LLM-as-judge for fuzzy cases"}
D --> E{"Pass rate >= threshold?"}
E -->|Yes| F["Allow merge / release"]
E -->|No| G["Block & surface failing cases"]
G --> A
Each case is just structured data: an input, the success criteria, and any required final-state assertions. Keep it in version control next to the code so the dataset evolves with the agent.
{
"id": "refund-wrong-item-001",
"input": "I got the blue mug but ordered the red one, order ORD-10024829",
"expect": {
"final_tool": "refund_order",
"args": { "order_id": "ORD-10024829", "reason": "wrong_item" },
"max_turns": 6,
"must_not_call": ["delete_order"]
}
}
Scoring: deterministic checks first, judge second
Prefer code-based checks wherever the criterion is objective. Did the agent call refund_order with the right ID? Assert it directly. Did it stay under the turn cap and avoid the forbidden tool? Assert those too. Deterministic checks are fast, free, and unambiguous, so use them for everything you can express as a rule.
For criteria that resist hard rules — was the tone appropriate, was the explanation accurate and complete — use a Claude model as a judge. The trick is to make the judge as deterministic as possible: give it a specific rubric, ask for a structured verdict with a short justification, and run it at low temperature. A vague "rate this 1-10" judge is noise; a rubric-driven "does the response satisfy each of these three named criteria, true or false" judge is signal.
You are grading a support agent's reply. Score each criterion true/false:
1. factual: every claim matches the provided order data
2. resolved: the reply states a concrete next step
3. tone: professional, no blame toward the customer
Return JSON: { "factual": bool, "resolved": bool, "tone": bool, "why": "one sentence" }
Trusting the judge: validate it against humans
An LLM judge is itself a model that can be wrong, so validate it before you depend on it. Hand-label a sample of fifty cases, run the judge on the same cases, and measure agreement. If the judge agrees with your human labels most of the time, you can trust it for the rest; if it disagrees often, fix the rubric until it does. Re-validate whenever you change the judge prompt or the judge model. A judge you have never checked against human labels is just a confident guess.
Wiring it into CI and keeping it cheap
An eval loop only changes behavior if it blocks bad releases. Run the suite on every pull request that touches the prompt, the tools, or the model version, and fail the build if the pass rate drops below your threshold. Evals can be token-hungry, so keep them affordable: share one stable system-plus-tools prefix across all cases and cache it, so each case pays full price only for its unique tail. Run independent cases through the batch path for an extra discount when CI latency allows. Report the delta against the previous run so reviewers see exactly which cases regressed.
Common pitfalls
- Only checking final answers. An agent can reach the right answer by a dangerous path. Score the trajectory — tools used, turns taken, forbidden actions avoided — not just the end state.
- Letting the dataset rot. If new failures never become cases, your suite slowly stops reflecting reality. Add a case for every fixed bug, no exceptions.
- An unvalidated judge. Treating an LLM-as-judge score as ground truth without checking it against human labels bakes the judge's blind spots into your gate.
- Evals that never block. A suite that reports but does not fail the build is documentation, not a gate. Wire it to the merge decision.
- Overfitting to the eval set. Tuning prompts until the fixed set passes can hurt generalization. Hold out a slice you never tune against and check it periodically.
Stand up an eval loop in 6 steps
- Pick metrics across outcome, trajectory, and cost — write down what "good" means.
- Assemble 20-50 cases from golden paths, edge cases, and real production failures.
- Write deterministic assertions for everything objective; reserve a rubric-driven judge for fuzzy criteria.
- Validate the judge against human labels and fix the rubric until agreement is high.
- Run the suite in CI with a cached shared prefix, failing the build below your pass-rate threshold.
- Add a new case for every bug and track the pass-rate delta on every release.
| Criterion type | Scoring method | Example |
|---|---|---|
| Objective final state | Code assertion | Correct tool called with correct ID |
| Behavioral constraint | Code assertion | Stayed under turn cap, avoided forbidden tool |
| Subjective quality | Rubric LLM-as-judge | Tone, accuracy, completeness |
| Cost / latency | Usage metrics | Tokens and turns per run |
Frequently asked questions
How many eval cases do I need to start?
Far fewer than you think — 20 to 50 well-chosen cases that each encode a real behavior beat thousands of synthetic ones. Grow the set by adding a case for every production failure, so the suite becomes a precise, lived-in specification of correct behavior over time.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Can I trust an LLM as a judge?
Only after you validate it. Pin the judge with a named-criteria rubric and low temperature, then measure its agreement against a sample of human labels. If agreement is high, use it; if not, refine the rubric and re-check whenever you change the judge prompt or model.
How do I keep evals from getting expensive?
Share one stable system-plus-tools prefix across every case and cache it, so each case only pays full price for its unique input tail. Route independent cases through the batch path when CI can tolerate the latency, and watch cache-read tokens to confirm the savings.
What should fail a release?
A drop in pass rate below your defined threshold, especially on golden-path or trajectory checks. Treat a regression on a previously passing case as a hard stop — the whole point of the loop is that quality can only move forward, never silently backward.
Bringing agentic AI to your phone lines
CallSphere gates its voice and chat agents the same way — fixed eval sets, rubric-driven judges, and CI thresholds — so every release of a call-handling agent is measured before it talks to a customer. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.