Testing and Evals for AI Agents: Gate Every Release
Build an eval loop for Claude agents — deterministic checks, LLM-as-judge, regression suites — that measures quality and gates releases in CI.
Every team that ships agents eventually hits the same wall: a prompt tweak that fixes one case quietly breaks three others, and nobody notices until a customer does. Manual spot-checking does not scale, and "it looked good in the demo" is not a release criterion. The discipline that separates teams who ship agents confidently from those who ship and pray is evals — a repeatable, automated measurement of agent quality that you run on every change and gate releases against. This post is about building that loop for Claude agents in a way that catches regressions before users do.
To define it plainly: an eval is an automated test that runs an agent against a fixed set of representative inputs and scores its outputs against quality criteria, so you can measure whether a change made the agent better or worse. Unlike unit tests, the answers are often open-ended, so scoring blends deterministic checks with model-based judgment. Done well, evals turn "I think this is better" into a number you can defend.
Key takeaways
- Build a dataset of real, representative tasks — including the weird edge cases and past failures — and grow it every time something breaks.
- Use deterministic checks where you can (exact match, schema validity, tool-call correctness) and LLM-as-judge only where output is genuinely open-ended.
- Score the whole trajectory, not just the final answer — wrong tools, wasted steps, and loops are failures even when the answer is right.
- Set a quality bar and gate releases on it in CI; a change that drops the score doesn't merge.
- Every production bug becomes a permanent eval case so the same regression can never ship twice.
- Keep evals fast and cheap enough to run on every PR — batch them and cache stable prompts.
What to measure
An agent's quality is more than its final text. Three dimensions matter: task success (did it achieve the goal?), process quality (did it use the right tools efficiently without looping?), and safety (did it avoid forbidden actions and stay in scope?). A run that produces the correct answer after eight wasted tool calls and one near-miss on a destructive action is not a pass — it is a latent incident. So your eval suite should inspect the trajectory: which tools were called, with what arguments, in what order, and how many steps it took. Capture all of that during the run and assert on it, the same way you would assert on a return value.
How the eval loop gates a release
The eval loop is a gate, not a report you read after shipping. The flow below shows where it sits in your release pipeline.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Change to prompt / tools / model"] --> B["Run agent on eval dataset"]
B --> C{"Deterministic checks pass?"}
C -->|No| F["Fail: block merge"]
C -->|Yes| D["LLM-as-judge scores open outputs"]
D --> E{"Score >= quality bar?"}
E -->|No| F
E -->|Yes| G{"Any safety violation?"}
G -->|Yes| F
G -->|No| H["Pass: allow release"]
Deterministic checks first, judges second
Reach for the cheapest, most reliable scoring you can. Many agent behaviors are checkable deterministically: did the output parse as valid JSON, did it call refund_order exactly once, did it include the order ID, did it avoid calling any tool on the deny-list? These are fast, free, and unambiguous. Only when the output is genuinely open — a summary, an explanation, a customer reply — do you bring in LLM-as-judge, where a separate model scores the output against a rubric. Use Claude as the judge with a tight rubric and concrete pass/fail criteria; vague rubrics produce noisy scores. Here is a compact eval harness pattern:
cases = load_eval_dataset("agent_evals.jsonl")
results = []
for case in cases:
run = run_agent(case["input"]) # returns final text + tool trajectory
checks = {
"valid_json": is_valid_json(run.final),
"called_expected_tool": case["expect_tool"] in run.tools_used,
"no_forbidden_tool": not (set(run.tools_used) & FORBIDDEN),
"step_budget_ok": run.steps <= case.get("max_steps", 10),
}
if case.get("rubric"): # open-ended -> LLM judge
checks["quality"] = judge_with_claude(run.final, case["rubric"]) >= 4
results.append({"id": case["id"], "passed": all(checks.values()), **checks})
score = sum(r["passed"] for r in results) / len(results)
assert score >= 0.95, f"Eval score {score:.2%} below release bar" # gate
The final assert is the gate: run this in CI on every pull request, and a change that pushes the pass rate below your bar fails the build before it can merge.
Common pitfalls
- Grading only the final answer. An agent can reach the right answer the wrong way. Assert on the tool trajectory and step count, not just the text.
- Vague judge rubrics. "Is this good?" produces noisy, irreproducible scores. Give the judge specific, checkable criteria and a numeric scale.
- A frozen dataset. If your eval set never grows, it stops representing reality. Add every production failure as a new case.
- Slow, expensive evals. If the suite takes an hour, nobody runs it on every change. Batch the runs and cache stable prompts to keep it fast.
- Testing the judge with the same model under test, unchecked. Periodically spot-check judge scores against human labels so the judge itself doesn't drift.
Stand up an eval loop in 6 steps
- Collect 30–100 real, representative tasks, weighting toward edge cases and known failures.
- Write deterministic checks for everything verifiable: schema validity, expected tools, forbidden tools, step budget.
- Add LLM-as-judge with a concrete rubric only for genuinely open-ended outputs.
- Pick a release bar (for example, 95% pass) and assert on it.
- Wire the suite into CI so it runs on every pull request and blocks merges that drop below the bar.
- On every production bug, add a regression case and re-run; never let the same failure ship twice.
Scoring method comparison
| Method | Best for | Pros | Cons |
|---|---|---|---|
| Exact / structural match | Structured outputs, tool calls | Fast, free, unambiguous | Only works for closed answers |
| Schema validation | JSON / typed outputs | Catches malformed data instantly | Doesn't judge content quality |
| LLM-as-judge | Summaries, replies, explanations | Handles open-ended quality | Costs tokens; needs a tight rubric |
| Human review | Calibration & sampling | Ground truth | Slow; can't gate every PR |
Frequently asked questions
How many eval cases do I need to start?
Begin with 30–100 representative tasks covering your common paths plus known edge cases. A small, sharp set you run on every change beats a huge set you run never. Grow it as failures surface.
When should I use LLM-as-judge versus exact match?
Use exact or structural match whenever the correct output is closed — a tool call, a schema, an ID. Reserve LLM-as-judge for open-ended text, and give it a concrete rubric with a numeric scale to keep scores stable.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
How do evals gate a release?
Run the suite in CI on each change. If the pass rate drops below your bar or any safety check fails, the build fails and the change can't merge. That turns quality into an enforced contract rather than a hope.
How do I keep evals from getting expensive?
Run them through the Message Batches API, cache the stable parts of your prompts, and keep deterministic checks doing most of the work so the LLM judge only runs where it's truly needed.
Bringing agentic AI to your phone lines
CallSphere gates its voice and chat agents the same way — trajectory-level evals and regression suites that must pass before any change reaches a live call — so quality is measured, not assumed. See the result at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.