Testing and evals for Claude Cowork agents that ship

Here is an uncomfortable truth about agentic AI: most teams ship changes to their Claude Cowork agents the way they would change a config file — tweak a prompt, eyeball a couple of examples, and deploy. Then a week later someone notices the agent has quietly gotten worse at a task it used to handle perfectly, and nobody can say which change broke it. The cause is the absence of evals. Without a measurement loop, you are flying blind, and the model's non-determinism guarantees you will eventually crash.

An eval is a repeatable test that scores an agent's behavior on a fixed set of tasks so you can compare versions objectively. Evals are to agents what a test suite is to code: the thing that lets you make changes with confidence instead of superstition. This post lays out how to build an eval loop that actually gates releases rather than producing a dashboard nobody trusts.

Start from tasks, not prompts

The most common mistake is evaluating the wrong thing. You do not care whether the agent produced a particular sentence; you care whether it accomplished the task. So your eval cases should be defined as tasks with success criteria: "given this inbox, the agent should draft a reply that references the correct order number and offers a refund only if policy allows it." Each case pairs an input scenario with a check that captures what success means for that scenario.
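One way to picture a task-shaped eval case is as a small record that pairs a scenario with a success check. The sketch below is illustrative only: the `EvalCase` type, the field names, and the shape of the agent output (`reply_text`, `offered_refund`) are assumptions made for this example, not part of any Claude Cowork API.

```python
from dataclasses import dataclass
from typing import Any, Callable

Scenario = dict[str, Any]
Output = dict[str, Any]


@dataclass
class EvalCase:
    """Hypothetical eval case: an input scenario plus a success check."""
    case_id: str
    scenario: Scenario
    check: Callable[[Scenario, Output], bool]


def refund_reply_check(scenario: Scenario, output: Output) -> bool:
    """Pass if the reply cites the right order and any refund offer is within policy."""
    references_order = scenario["order_id"] in output["reply_text"]
    refund_within_policy = scenario["refund_allowed"] or not output["offered_refund"]
    return references_order and refund_within_policy


# Assumed scenario: policy does NOT allow a refund for this order.
CASE = EvalCase(
    case_id="refund-001",
    scenario={"order_id": "A-1042", "refund_allowed": False},
    check=refund_reply_check,
)
```

Note that the check never compares against a fixed sentence. Any reply that cites order A-1042 and stays within policy passes, which is the point of grading the task rather than the wording.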

Build this dataset from reality. The richest source is your own production transcripts — especially the failures. Every time the agent does something wrong, distill it into an eval case with the correct expected behavior. Over a few weeks this turns your bug reports into a regression suite that grows more valuable than any synthetic benchmark, because it encodes the exact ways your specific agent and tools tend to break.

How to grade an agent's output

Grading is where eval design gets interesting, because agent outputs are open-ended. You generally combine three grader types: deterministic checks, an LLM judge, and trajectory checks. Deterministic checks verify hard facts: did the agent call the refund tool, did it pass the right order ID, and did it avoid touching the deletion connector? These are cheap, fast, and unambiguous, so use them for anything you can express as a rule.
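The three rule-style questions above can be sketched as a check over a recorded list of tool calls. The tool names `issue_refund` and `delete_record` and the `ToolCall` record are hypothetical stand-ins for whatever your agent's trace actually contains.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation from an agent trace."""
    name: str
    args: dict = field(default_factory=dict)


def deterministic_failures(calls, expected_order_id,
                           forbidden_tools=frozenset({"delete_record"})):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    refund_calls = [c for c in calls if c.name == "issue_refund"]
    if not refund_calls:
        failures.append("refund tool was never called")
    elif not any(c.args.get("order_id") == expected_order_id for c in refund_calls):
        failures.append("refund tool was called with the wrong order ID")
    # Any overlap between called tools and the forbidden set is a failure.
    touched = sorted({c.name for c in calls} & set(forbidden_tools))
    if touched:
        failures.append("forbidden tool called: " + ", ".join(touched))
    return failures
```

Returning the list of reasons instead of a bare boolean keeps the failure explainable when it shows up in a report.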

flowchart TD
  A["Code change to agent"] --> B["Run eval suite"]
  B --> C["Deterministic checks"]
  B --> D["LLM-judge on quality"]
  B --> E["Trajectory checks: right tools"]
  C --> F{"Score >= threshold?"}
  D --> F
  E --> F
  F -->|Yes| G["Allow merge / deploy"]
  F -->|No| H["Block & show failing cases"]

For the open-ended parts — tone, completeness, whether a summary is actually faithful — use an LLM as a judge: a separate model call that scores the output against a rubric you write. The judge is powerful but must itself be validated; spot-check its scores against human judgment on a sample, and keep the rubric concrete so it grades the same way every run. Finally, add trajectory checks that score how the agent got there, not just the final answer, since an agent that reached the right output via three wrong tool calls is fragile even when the result looks fine.
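Validating the judge against human labels can be as simple as measuring agreement on a shared sample. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) over pass/fail labels. The function name and label values are assumptions for illustration, and what counts as an acceptable kappa is a judgment call the post does not specify.

```python
def judge_agreement(judge_labels, human_labels):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined, treat as perfect.
        return observed, 1.0
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa
```

Raw agreement alone can look high when nearly every case passes, which is why the chance-corrected figure is worth reporting next to it.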

Gating releases in CI

An eval suite that runs only when someone remembers to run it is theater. The value comes from wiring it into the release path so a regression actually blocks the change. The pattern mirrors unit tests: on every change to a prompt, a tool description, a Skill, or a model version, the eval suite runs and the deploy is blocked unless the score clears a threshold. The output should name the specific cases that failed, not just a number, so the author can see exactly what broke.
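A minimal sketch of such a gate, assuming the suite has already produced a mapping from case ID to pass/fail. The function names and the 0.9 default threshold are placeholders, not a recommendation.

```python
def gate(results, threshold):
    """Return (allowed, failing_case_ids) for a dict of case_id -> passed."""
    failing = sorted(case_id for case_id, passed in results.items() if not passed)
    score = (len(results) - len(failing)) / len(results) if results else 0.0
    return score >= threshold, failing


def main(results, threshold=0.9):
    """Print failing cases by name and return a process exit code (0 = allow)."""
    allowed, failing = gate(results, threshold)
    for case_id in failing:
        print(f"FAIL {case_id}")
    passed = len(results) - len(failing)
    print(f"{passed}/{len(results)} cases passed (threshold {threshold:.0%})")
    return 0 if allowed else 1

# In a CI job, the entry point would end with something like:
#     sys.exit(main(load_results()))   # load_results is a hypothetical loader
```

Most CI systems treat a non-zero exit code as a failed step, so returning 1 is what turns the score into a block.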

Set the threshold honestly. Because agents are stochastic, a single run can fluctuate, so either run each case a few times and average, or set the bar with a margin that tolerates normal variance. The goal is to catch real regressions without flapping on noise. When a change improves the average but breaks two previously-passing cases, that is a signal to investigate, not to wave it through — a net-positive aggregate can hide a serious narrow regression.
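Both ideas in this section, averaging repeated runs and catching narrow regressions behind a better aggregate, fit in a few lines. This is a sketch under assumed names; the 0.8 per-case pass bar is arbitrary.

```python
def pass_rates(runs):
    """Average repeated boolean outcomes per case: case_id -> pass rate."""
    return {case_id: sum(outcomes) / len(outcomes) for case_id, outcomes in runs.items()}


def newly_broken(baseline, candidate, pass_bar=0.8):
    """Cases that cleared the bar on baseline but not on the candidate."""
    return sorted(
        case_id
        for case_id, rate in baseline.items()
        if rate >= pass_bar and candidate.get(case_id, 0.0) < pass_bar
    )


def overall(rates):
    """Unweighted mean pass rate across cases."""
    return sum(rates.values()) / len(rates)
```

For example, if the baseline passes cases `a` and `b` on every run and always fails `c`, while the candidate fixes `c` but passes `b` only one time in three, the overall average rises from about 0.67 to about 0.78 and `newly_broken` still reports `b`.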

Metrics that matter beyond pass rate

Task success rate is the headline, but a mature eval setup tracks more. Cost and token usage per task catch the case where a change improves accuracy by quietly tripling the bill. Latency catches the change that makes the agent thorough but unusably slow. Tool-call efficiency — how many turns and how many tool calls a task took — is an early warning for the loop and over-calling failure modes that inflate cost. Watching these together stops you from optimizing one dimension into a different problem.
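To watch these dimensions together, a run summary can carry cost, latency, and tool-call counts next to the pass rate. The `TaskRun` fields and the 1.25x growth tolerance below are assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class TaskRun:
    """Hypothetical per-task measurements collected by the eval harness."""
    case_id: str
    passed: bool
    cost_usd: float
    latency_s: float
    tool_calls: int


def summarize(runs):
    """Aggregate one suite run into headline and budget metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "median_latency_s": median(r.latency_s for r in runs),
        "mean_tool_calls": mean(r.tool_calls for r in runs),
    }


BUDGET_METRICS = ("mean_cost_usd", "median_latency_s", "mean_tool_calls")


def budget_regressions(baseline, candidate, max_ratio=1.25):
    """Budget metrics that grew by more than max_ratio versus baseline."""
    return sorted(
        key for key in BUDGET_METRICS
        if baseline[key] > 0 and candidate[key] > baseline[key] * max_ratio
    )
```

A change that keeps the pass rate flat but triples cost per task would show up here as `["mean_cost_usd"]` even though the headline number did not move.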

It helps to separate capability evals from safety evals. Capability evals ask whether the agent does the job well; safety evals ask whether it refuses to do dangerous things, resists injected instructions, and stays within its permissions. Both should gate releases, but you tune them differently — a safety regression is a hard block even if capability improved, because shipping an agent that does the job slightly better but can be coaxed into an unauthorized action is a net loss no aggregate score should be allowed to disguise.
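The different tuning can be made explicit in the release decision: safety cases are all-or-nothing, while capability is compared to a threshold. A minimal sketch, with hypothetical names throughout:

```python
def release_decision(capability_score, capability_threshold, safety_results):
    """Return (allowed, reason). Any safety failure blocks, regardless of capability."""
    safety_failures = sorted(
        case_id for case_id, passed in safety_results.items() if not passed
    )
    if safety_failures:
        return False, "safety regression: " + ", ".join(safety_failures)
    if capability_score < capability_threshold:
        return False, "capability score below threshold"
    return True, "ok"
```

Checking safety first means a capability improvement can never offset a safety failure in the outcome.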

Keeping the eval suite honest over time

An eval suite is not a build-once artifact; it decays if you let it. As the agent's tools and prompts evolve, old cases can become stale — testing a connector that no longer exists, or encoding an "expected" behavior you have since deliberately changed. Schedule a periodic review where you prune dead cases, update expectations that intentionally moved, and confirm the suite still reflects what you actually want. A suite full of obsolete cases is worse than none, because it produces failing scores nobody believes, and an eval everyone learns to ignore is no gate at all.
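One kind of staleness can be found mechanically: cases that reference a tool or connector that no longer exists. The sketch below assumes each case declares the tools it depends on, which is a convention you would have to adopt. It does not detect expectations that were deliberately changed; those still need a human review.

```python
def stale_cases(case_tools, live_tools):
    """Map case_id -> tools it references that are no longer available.

    case_tools: dict of case_id -> set of tool names the case depends on
    live_tools: set of tool names currently registered with the agent
    """
    stale = {}
    for case_id, tools in case_tools.items():
        missing = set(tools) - set(live_tools)
        if missing:
            stale[case_id] = missing
    return stale
```

Running a check like this ahead of the periodic review gives reviewers a shortlist to prune or rewrite.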

Guard against a subtler form of rot: overfitting to the suite. If engineers start tuning prompts specifically to pass known eval cases, the score climbs while real-world quality stalls. Counter this by keeping a held-out set of cases that never inform day-to-day tuning and by continuously feeding fresh failures from production into the suite. The eval should always contain surprises the team has not optimized against, or it stops measuring generalization and starts measuring memorization.
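A held-out set stays held out only if membership is stable and not hand-picked. One common approach, sketched here as an assumption rather than a prescribed method, is to assign each case by hashing its ID, so a case lands in the same split every time, including fresh production failures added later.

```python
import hashlib


def is_held_out(case_id, fraction=0.2):
    """Deterministically place roughly `fraction` of cases in the held-out set."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def split_cases(case_ids, fraction=0.2):
    """Return (tuning_ids, held_out_ids) without overlap."""
    tuning = [c for c in case_ids if not is_held_out(c, fraction)]
    held_out = [c for c in case_ids if is_held_out(c, fraction)]
    return tuning, held_out
```

Engineers iterate against the tuning list day to day; the held-out list is scored only at release time, so a gap between the two scores is a hint that prompts are being fit to known cases.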

Frequently asked questions

What exactly is an eval for an agent?

An eval is a repeatable test that scores an agent on a fixed set of tasks with defined success criteria, so you can compare versions objectively. It plays the role a unit test suite plays for code, letting you change prompts, tools, or models without guessing whether quality moved.

Should I use an LLM to grade outputs?

Yes, for open-ended qualities like tone, faithfulness, and completeness that rules cannot capture — but validate the judge against human scores on a sample and keep its rubric concrete. Pair it with deterministic checks for anything expressible as a hard rule, since those are cheaper and unambiguous.

How do I stop a bad change from shipping?

Wire the eval suite into CI so a change to any prompt, tool, Skill, or model triggers a run and blocks the deploy if the score falls below threshold. Surface the specific failing cases, and set the threshold with margin so stochastic noise does not cause false blocks.

Where do good eval cases come from?

Your production transcripts, especially the failures. Every real mistake becomes an eval case with the correct expected behavior, building a regression suite that captures exactly how your agent and tools break — far more valuable than generic synthetic benchmarks.

Bringing agentic AI to your phone lines

A disciplined eval loop is exactly how CallSphere keeps its agentic voice and chat assistants improving without regressing — every change gated against real conversation transcripts before it reaches a live caller. See it live at callsphere.ai.
