Testing & evals for Claude agents: gating releases

You can hand-test a Claude agent ten times, watch it succeed, ship it, and still be surprised in production, because ten runs of a non-deterministic system tell you almost nothing about the eleven-thousandth. The teams that ship agents confidently are not the ones who test harder by hand; they are the ones who built an eval loop that turns "it seemed to work" into a number they can defend. Without that, every prompt tweak is a gamble and every release is a leap of faith.

This post lays out how to measure the quality of an agent that reaches production systems through MCP, and how to wire those measurements into a gate that blocks bad releases automatically.

Why manual testing fails for agents

Agents are stochastic and path-dependent. The same task can be solved three different ways, and a change that helps one path can quietly break another. Manual spot-checks sample a tiny, biased slice of behavior, and they are not repeatable — you cannot rerun yesterday's hand-test against today's prompt to see what moved. Worse, the failures that matter most are often rare: the agent handles 95% of cases beautifully and mangles a specific 5% that happens to include your highest-value transactions. Manual testing almost never finds the 5%.

An eval, by contrast, is a fixed, repeatable measurement: an automated test that runs the agent against a curated set of representative tasks with known expectations and scores its outputs against defined success criteria, producing a quality metric you can compare across versions. Once you have that number, you can answer the only question that matters at release time: did this change make the agent better or worse?

Building the eval dataset

Your eval is only as good as its cases, so curate them deliberately. Start from reality: pull real transcripts and tasks from logs, especially the ones that failed or did something surprising. Every production incident should become a permanent eval case so that bug can never silently return — this is how your suite compounds in value over time. Cover the happy paths, the tricky edge cases, and the adversarial inputs you care about, including attempted prompt injections if security is in scope.
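One way to make the suite durable is to keep each case as a small record in a versioned file. The sketch below is only an illustration: the `EvalCase` fields, the `source` labels, and the JSON Lines layout are assumptions made for this example, not a format the post prescribes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalCase:
    """One curated case. All field names are hypothetical."""
    case_id: str
    source: str          # e.g. "log", "incident", "adversarial"
    user_input: str
    required_tools: list[str] = field(default_factory=list)
    forbidden_tools: list[str] = field(default_factory=list)
    rubric: list[str] = field(default_factory=list)


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialize cases one per line so additions show up as clean diffs."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def from_jsonl(text: str) -> list[EvalCase]:
    """Load cases back, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Turning an incident into a permanent case is then a one-line append to that file, which is reviewed like any other change.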

flowchart TD
  A["New agent / prompt version"] --> B["Run eval suite (N cases)"]
  B --> C{"Deterministic checks pass?"}
  C -->|No| D["Fail: tool call / arg mismatch"]
  C -->|Yes| E["LLM judge scores quality"]
  E --> F{"Score >= threshold & no regressions?"}
  F -->|No| G["Block release, surface diffs"]
  F -->|Yes| H["Promote to production"]
  D --> G
  G --> I["Fix & add failing case to suite"]

For each case, define what success means concretely. Sometimes that is a deterministic assertion — the agent must call create_ticket with a priority field, or must never call delete_account. These tool-trajectory checks are cheap, fast, and unambiguous, and they catch a large class of regressions. Wherever you can express correctness as a hard assertion on which tools were called with which arguments, do so; it is far more reliable than judging prose.
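A tool-trajectory check of this kind might look like the following minimal sketch. The `ToolCall` record and the `check_trajectory` helper are hypothetical names for this example; it assumes your harness can capture the agent's tool calls as a list of names and argument dictionaries.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """A recorded tool invocation (hypothetical shape)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def check_trajectory(
    calls: list[ToolCall],
    required: dict[str, set[str]],
    forbidden: set[str],
) -> list[str]:
    """Return human-readable failures; an empty list means the case passes."""
    failures = []
    called = {c.name for c in calls}
    for name in sorted(forbidden & called):
        failures.append(f"forbidden tool called: {name}")
    for name, arg_names in required.items():
        matches = [c for c in calls if c.name == name]
        if not matches:
            failures.append(f"required tool not called: {name}")
            continue
        # Pass if at least one call to the tool carries every required argument.
        if not any(arg_names <= set(c.args) for c in matches):
            failures.append(f"{name} missing args: {sorted(arg_names)}")
    return failures
```

For the examples in the text, `required={"create_ticket": {"priority"}}` and `forbidden={"delete_account"}` express both rules in one call.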

Judging the open-ended parts

Much of an agent's quality is not reducible to an assertion — was the final answer accurate, complete, and appropriately phrased? For that, use an LLM-as-judge: a separate Claude call that scores the agent's output against a rubric you write. The rubric is everything. Vague criteria like "is the answer good" produce noisy scores; specific criteria like "does the response state the order status, give an accurate delivery date, and avoid promising anything not in the data" produce signal. Have the judge output a structured verdict with a short justification so you can audit why it scored as it did.
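As a rough sketch of a rubric-driven judge with a structured verdict: the code below builds the prompt and validates the reply, but leaves the model call itself to you. `call_model` is a hypothetical callable that takes a prompt string and returns the judge's raw text; it stands in for whatever client you use and is not a real SDK function. The JSON verdict shape is likewise an assumption for this example.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Specific, checkable criteria taken from the order-status example in the text.
RUBRIC = [
    "States the order status",
    "Gives a delivery date that matches the data",
    "Promises nothing that is not in the data",
]


@dataclass
class Verdict:
    passed: list[bool]      # one entry per rubric criterion
    justification: str

    @property
    def score(self) -> float:
        return sum(self.passed) / len(self.passed)


def build_judge_prompt(task: str, data: str, answer: str) -> str:
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    return (
        "You are grading an agent's answer against a rubric.\n"
        f"Task:\n{task}\n\nData available to the agent:\n{data}\n\n"
        f"Agent answer:\n{answer}\n\nCriteria:\n{criteria}\n\n"
        'Reply with JSON only: {"passed": [true or false per criterion], '
        '"justification": "one or two sentences"}'
    )


def judge(task: str, data: str, answer: str, call_model: Callable[[str], str]) -> Verdict:
    """Ask the judge model for a verdict and reject malformed replies."""
    raw = call_model(build_judge_prompt(task, data, answer))
    parsed = json.loads(raw)
    passed = parsed["passed"]
    if len(passed) != len(RUBRIC) or not all(isinstance(p, bool) for p in passed):
        raise ValueError(f"malformed verdict: {raw!r}")
    return Verdict(passed=passed, justification=str(parsed["justification"]))
```

Scoring per criterion rather than asking for a single number keeps the justification auditable: a failed case tells you which criterion was missed.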

Be aware of the judge's limits. LLM judges can be inconsistent and can be biased toward verbose or confident answers, so calibrate them: periodically have a human review a sample of judged cases, measure how often the judge agrees with the human, and revise the rubric where they diverge. Where stakes are high, combine deterministic checks (which never drift) with the judge (which handles nuance). The deterministic layer catches the unambiguous failures; the judge catches the subtle quality regressions that assertions cannot express.
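Calibration can start as a simple agreement report over the human-reviewed sample. This sketch assumes each reviewed case has a pass/fail label from the judge and one from the human; the function name and the returned keys are made up for this example.

```python
def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> dict[str, float]:
    """Compare judge pass/fail labels with human labels on the same cases."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(judge_labels, human_labels))
    n = len(pairs)
    return {
        # Share of cases where judge and human reached the same verdict.
        "agreement": sum(j == h for j, h in pairs) / n,
        # Judge passed a case the human failed: the judge is too lenient.
        "judge_too_lenient": sum(j and not h for j, h in pairs) / n,
        # Judge failed a case the human passed: the judge is too strict.
        "judge_too_strict": sum(h and not j for j, h in pairs) / n,
    }
```

Splitting disagreement by direction matters because a lenient judge lets regressions through the gate, while a strict one blocks good releases.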

Turning evals into a release gate

An eval you run occasionally is a curiosity; an eval that runs automatically on every change is a safety system. Wire the suite into CI so that any change to the prompt, tools, or model triggers the full run. Establish a quality threshold and a no-regression rule: a release passes only if the aggregate score meets the threshold and no previously passing case has started failing. That second condition is what stops the classic trap where a change lifts the average while quietly breaking a critical scenario.
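The two conditions can be sketched as a single gate function. This is a simplified illustration that assumes per-case pass/fail results keyed by case ID and uses the pass rate as the aggregate score; a real suite might aggregate judge scores instead.

```python
def release_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    threshold: float,
) -> tuple[bool, list[str]]:
    """Return (ok, reasons). ok is True only if both gate conditions hold."""
    reasons = []
    pass_rate = sum(candidate.values()) / len(candidate)
    if pass_rate < threshold:
        reasons.append(f"pass rate {pass_rate:.2f} below threshold {threshold:.2f}")
    # A case that passed on the baseline must still pass (and still exist).
    regressions = sorted(
        case for case, ok in baseline.items() if ok and not candidate.get(case, False)
    )
    reasons.extend(f"regression: {case}" for case in regressions)
    return (not reasons, reasons)


# The average rises from 2/3 to 3/4, yet case "b" broke, so the release is blocked.
ok, reasons = release_gate(
    baseline={"a": True, "b": True, "c": False},
    candidate={"a": True, "b": False, "c": True, "d": True},
    threshold=0.70,
)
```

In CI, a script would typically print `reasons` and exit with a non-zero status when `ok` is false so the pipeline stops.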

Make the output diff-friendly. When a case fails, the gate should show exactly what changed — which tool call differed, where the answer diverged from expectation, what the judge flagged — so a developer can triage in seconds rather than rerunning the agent by hand. Treat eval results like test results: red means the release does not ship, and the fix includes adding the newly-discovered failure as a permanent case. Over months this discipline produces a suite that encodes everything your agent has ever gotten wrong, which is the most valuable regression asset you can own.
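For the tool-call part of that output, a plain unified diff of expected versus actual calls is often enough. The sketch below uses Python's standard `difflib`; the `(name, args)` tuple shape for a recorded call is an assumption for this example.

```python
import difflib
import json
from typing import Any

Call = tuple[str, dict[str, Any]]


def render_calls(calls: list[Call]) -> list[str]:
    """Render each call on one line with sorted keys so diffs are stable."""
    return [f"{name}({json.dumps(args, sort_keys=True)})" for name, args in calls]


def trajectory_diff(expected: list[Call], actual: list[Call]) -> str:
    """Unified diff of two tool trajectories; empty string if they match."""
    return "\n".join(
        difflib.unified_diff(
            render_calls(expected),
            render_calls(actual),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
```

Attaching this string to the failed case, next to the judge's justification, lets a developer see the divergence without rerunning the agent.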

Monitoring beyond the gate

Pre-release evals catch known cases; production throws novel ones. Close the loop by sampling live runs and scoring them with the same judge and checks you use offline, so quality drift shows up as a falling online score before it shows up as complaints. Watch operational signals too — rising loop-guard trips, climbing validation-rejection rates, growing latency — because these often precede a visible quality drop. The strongest setups feed production surprises straight back into the offline suite, so every real-world failure becomes a permanent guard against its own recurrence.
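A minimal sketch of the online side, assuming you score a random sample of live runs with the same checks and alert when a rolling mean drops below a floor. The class name, window size, floor, and sample rate are all illustrative choices, not recommendations from the post.

```python
import random
from collections import deque
from typing import Optional


class OnlineScoreMonitor:
    """Rolling mean over the most recent sampled scores (hypothetical helper)."""

    def __init__(self, window: int, floor: float, sample_rate: float,
                 rng: Optional[random.Random] = None) -> None:
        self.scores: deque = deque(maxlen=window)
        self.floor = floor
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def should_sample(self) -> bool:
        """Decide whether to score this live run."""
        return self.rng.random() < self.sample_rate

    def record(self, score: float) -> bool:
        """Add a score; return True if the full window's mean is below the floor."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge drift
        return sum(self.scores) / len(self.scores) < self.floor
```

Waiting for a full window before alerting avoids paging on the first low score; runs that trip the alert are natural candidates for new offline cases.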

Frequently asked questions

How many eval cases do I need to start?

Begin small and grow deliberately. A few dozen well-chosen cases covering happy paths, key edge cases, and known past failures already gate most regressions. The suite's value comes from curation and from adding every production incident as a permanent case, not from raw count.

Should I use deterministic checks or an LLM judge?

Both. Use deterministic assertions on tool trajectories and arguments wherever correctness is unambiguous — they are fast and never drift. Use an LLM judge with a specific rubric for open-ended quality the assertions cannot express, and calibrate the judge against human review periodically.

What does it mean to gate a release on evals?

It means a change ships only if the eval suite passes a defined threshold and no previously-passing case regresses. The suite runs automatically in CI on every prompt, tool, or model change, so quality is verified before deployment rather than discovered in production.

How do I keep evals from going stale?

Feed production back in. Sample live runs, score them with the same checks, and convert every real-world failure and surprise into a new offline case. This keeps the suite aligned with how the agent is actually used and makes it compound in value over time.

Bringing agentic AI to your phone lines

CallSphere runs the same eval discipline — curated cases, LLM judges, and release gates — behind voice and chat agents that handle every call and message, use tools mid-conversation, and book work 24/7 with measured, defensible quality. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
