Evals for Claude Code: gating releases with a test loop

You can't ship a Claude Code workflow on vibes. It worked in the three cases you tried by hand, so you merged it — and two weeks later it's quietly mishandling an edge case nobody noticed, because there was never a measurement that would have caught the regression. Agentic systems are non-deterministic, which means the only honest way to know whether a change made the workflow better or worse is to measure it across a representative set of tasks, every time. An eval loop is the engineering practice that turns "it seems fine" into "it passes the bar we defined," and it's the difference between a demo and a system you can trust to run unattended.

This post is about building that loop: what to score, how to grade non-deterministic output, how to assemble a task set that catches real regressions, and how to wire the whole thing into a release gate so quality is enforced rather than hoped for.

What an eval actually measures

An eval for an agentic workflow is a defined set of input tasks, each paired with a way to judge whether the agent's behavior was acceptable. The crucial word is behavior, not just output. For a chatbot you might score only the final text, but for a workflow that takes actions, you often care about the path: did it call the right tools, in a reasonable order, without taking a destructive action it shouldn't have? An agentic eval is a repeatable measurement of whether a workflow produces correct outcomes and safe behavior across a representative set of tasks.

The temptation is to score everything, which produces a number nobody can interpret. Instead, decide what actually matters for your workflow and score that. For a data-migration agent, correctness of the migrated data and absence of destructive operations matter most. For a research agent, factual accuracy and completeness matter; tool order doesn't. Pick the two or three dimensions that define success for this workflow and resist the urge to measure everything else.

A useful eval also distinguishes outright failures from quality gradients. A run that deleted the wrong table is a hard failure that should block release outright. A run that produced a correct but slightly verbose summary is a soft quality signal you track over time. Conflating the two — treating a safety violation and a style nitpick as the same kind of "score" — hides the failures that matter under the noise of the ones that don't.
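
One way to keep hard failures and soft signals from blending into a single number is to record them in separate fields. The sketch below is a minimal illustration, not a prescribed format: `RunResult`, its field names, and the 0.0 to 1.0 score range are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Outcome of one eval run (hypothetical structure for illustration)."""
    task_id: str
    # Hard failures: anything here should block a release outright.
    hard_failures: list[str] = field(default_factory=list)
    # Soft signals: per-dimension scores, assumed to lie in 0.0-1.0.
    quality_scores: dict[str, float] = field(default_factory=dict)

    @property
    def blocked(self) -> bool:
        # A single hard failure blocks, however good the soft scores are.
        return bool(self.hard_failures)


def mean_quality(results: list[RunResult], dimension: str) -> float:
    """Average one soft dimension over the runs that reported it."""
    values = [r.quality_scores[dimension] for r in results
              if dimension in r.quality_scores]
    return sum(values) / len(values) if values else 0.0
```

Because `blocked` never looks at `quality_scores`, a run that deleted the wrong table cannot be rescued by a high style score, and a verbose summary cannot block a release by itself.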

Grading non-deterministic output

The hardest part of evaluating an agentic workflow is that there's no single correct string to diff against. The same task can produce different valid outputs on different runs. You need graders that judge correctness without demanding an exact match. There are three workhorse approaches, and good eval suites use all three.

The first is programmatic checks: assertions you can write in code. Did the migration produce the right row count? Does the output JSON validate against the schema? Was a forbidden command never called? These are cheap, fast, deterministic, and should cover everything that can be expressed as a rule. The second is reference-based scoring for tasks that do have an expected answer or a set of required facts — checking that the agent's output contains the key points, even if phrased differently. The third is an LLM-as-judge: using a model to grade open-ended quality against a rubric, for dimensions like helpfulness or clarity that resist programmatic checks.
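
The first two grader types can be sketched in a few lines. Everything here is illustrative: the function names, the idea that the harness records a list of tool names per run, and the use of case-insensitive substring matching for facts are assumptions. Substring matching in particular is a crude stand-in, since it will miss a fact that the agent paraphrased.

```python
import json


def forbidden_calls(called_tools: list[str], forbidden: set[str]) -> list[str]:
    """Programmatic check: which forbidden tools did the agent call?"""
    return sorted(set(called_tools) & forbidden)


def valid_json_with_keys(raw_output: str, required_keys: set[str]) -> bool:
    """Programmatic check: output parses as a JSON object with the required keys."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def fact_coverage(output: str, required_facts: list[str]) -> float:
    """Reference-based score: fraction of required facts found in the output.

    Assumption: a fact counts as present if its text appears as a
    case-insensitive substring. Paraphrases are not detected.
    """
    if not required_facts:
        return 1.0
    text = output.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)
```

A non-empty result from `forbidden_calls` is a natural hard failure, while `fact_coverage` gives a graded score that can be compared against a threshold.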

flowchart TD
  A["Code change to workflow"] --> B["Run eval suite on golden tasks"]
  B --> C["Programmatic checks"]
  B --> D["Reference / fact match"]
  B --> E["LLM-as-judge rubric"]
  C --> F{"Hard failure?"}
  D --> F
  E --> G{"Quality below threshold?"}
  F -->|Yes| H["Block release"]
  G -->|Yes| H
  F -->|No| I{"Score >= bar?"}
  G -->|No| I
  I -->|Yes| J["Promote release"]
  I -->|No| H

The diagram shows how the three grader types feed one gate. Hard failures block immediately; quality scores must clear a threshold; only a run that passes both is promoted. LLM-as-judge is powerful but should be validated against human judgment on a sample, because a judge with a sloppy rubric just launders subjectivity into a number.
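
Validating the judge can start with something as simple as an agreement rate on a hand-labeled sample. The sketch below assumes both the judge and the human reviewers produce a pass/fail verdict per task ID. The function names and the dictionary format are hypothetical.

```python
def judge_agreement(judge_labels: dict[str, bool],
                    human_labels: dict[str, bool]) -> float:
    """Fraction of shared tasks where the judge matches the human verdict."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no tasks were labeled by both judge and humans")
    agree = sum(1 for task in shared if judge_labels[task] == human_labels[task])
    return agree / len(shared)


def disagreements(judge_labels: dict[str, bool],
                  human_labels: dict[str, bool]) -> list[str]:
    """Task IDs worth reading by hand: the judge and the human disagree."""
    shared = judge_labels.keys() & human_labels.keys()
    return sorted(t for t in shared if judge_labels[t] != human_labels[t])
```

Raw agreement can look flattering when almost every sample passes, so the list of disagreements is often more informative than the rate. Reading those cases is usually how a sloppy rubric gets tightened.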

Building a task set that catches regressions

An eval is only as good as its tasks. A suite of three happy-path cases will pass forever while real failures slip through. The goal is a set of golden tasks that represents the distribution of work the workflow actually faces — including the edge cases, the ambiguous inputs, and the adversarial cases that have bitten you before.

The most valuable source of eval tasks is production failures. Every time the workflow does something wrong in the real world, capture that case, define the correct behavior, and add it to the suite. This is regression testing for agents: the bug you fixed today becomes the test that prevents it from coming back next month. Over time, a suite grown from real incidents becomes a sharp, opinionated definition of what your workflow must get right.

Early on, coverage matters more than size, but the suite still needs to be large enough that one lucky or unlucky run doesn't swing the verdict. Because runs are non-deterministic, consider running each task several times and looking at pass rates rather than single outcomes: a task that passes four times out of five is telling you something different from one that passes once out of five, and a single run would hide that.
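
Per-task pass rates over repeated runs take only a few lines to compute. This sketch assumes the harness emits one `(task_id, passed)` pair per run. That record shape and the function names are illustrative.

```python
from collections import defaultdict


def pass_rates(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per task across however many times each task was run."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for task_id, passed in outcomes:
        totals[task_id] += 1
        passes[task_id] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


def flaky_tasks(rates: dict[str, float]) -> list[str]:
    """Tasks that sometimes pass and sometimes fail."""
    return sorted(task for task, rate in rates.items() if 0.0 < rate < 1.0)
```

A task at 0.8 and a task at 0.2 both show up as flaky, but they call for different responses: the first may need a tighter prompt or a more tolerant grader, while the second is close to a plain failure.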

Wiring evals into the release gate

An eval suite that you run manually when you remember to is barely better than none, because the moment you're in a hurry — exactly when regressions slip in — you'll skip it. The discipline that makes evals matter is automation: the suite runs on every meaningful change, and a result below the bar blocks the release. This is continuous integration applied to agent quality, and it's what lets a team move fast without silently degrading the system.

Set explicit thresholds and treat them as a contract. A hard-failure rate above zero on safety-critical checks blocks unconditionally. A quality score must clear a defined bar. When a change improves the score, you can ratchet the bar up; when you intentionally trade some quality for speed or cost, you adjust it deliberately and visibly rather than letting it drift. The point is that quality becomes a number the whole team can see and defend, not a feeling that erodes one rushed merge at a time.
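
As a rough sketch, the contract described above reduces to a small decision function. The names `gate` and `ratchet`, the message strings, and the 0.02 margin are assumptions chosen for illustration, not a standard.

```python
def gate(hard_failure_count: int, quality_score: float,
         quality_bar: float) -> tuple[bool, str]:
    """Return (promote?, reason). Hard failures are checked first and always block."""
    if hard_failure_count > 0:
        return False, f"blocked: {hard_failure_count} hard failure(s)"
    if quality_score < quality_bar:
        return False, (f"blocked: quality {quality_score:.2f} "
                       f"below bar {quality_bar:.2f}")
    return True, "promoted"


def ratchet(quality_bar: float, quality_score: float,
            margin: float = 0.02) -> float:
    """Raise the bar after an improvement, leaving a small margin for noise.

    This never lowers the bar. Lowering it is meant to be a deliberate,
    visible change made by a person, not something the loop does for you.
    """
    return max(quality_bar, quality_score - margin)
```

In a CI job, the calling script would exit with a non-zero status whenever the first element of the result is `False`, which is what turns the threshold from a dashboard number into an enforced gate.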

Frequently asked questions

What should I score in an agentic eval — output or behavior?

Both, weighted by what matters for the workflow. For agents that take actions, behavior often matters as much as output: did it call the right tools and avoid destructive ones? Score the two or three dimensions that define success for your specific workflow rather than trying to measure everything.

How do I grade output when every run is different?

Combine three graders: programmatic checks for anything expressible as a rule, reference or fact matching for tasks with expected content, and an LLM-as-judge for open-ended quality. Validate the judge against human ratings on a sample so it measures real quality rather than laundering subjectivity into a score.

Where do good eval tasks come from?

Mostly from production failures. Each time the workflow does something wrong in the real world, capture the case, define the correct behavior, and add it to the suite. A task set grown from real incidents becomes a sharp regression guard that prevents old bugs from returning.

How do evals gate a release in practice?

Run the suite automatically on every meaningful change and block any release whose results fail a safety check or fall below the quality threshold. Treating those thresholds as a contract turns quality into a visible, defensible number instead of a feeling that erodes one rushed merge at a time.

Bringing agentic AI to your phone lines

Eval-gated releases are how CallSphere keeps its voice and chat agents dependable as they evolve — assistants that answer every call and message, use tools mid-conversation, and book work around the clock, with every change measured before it ships. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
