Building an Eval Loop for Claude Abstraction Agents

You cannot ship a clinical abstraction agent on vibes. The whole point of abstraction is that the output feeds something consequential — a quality registry, a research dataset, a reimbursement calculation — and a 3% silent error rate in the wrong field is a real-world problem, not a rounding artifact. Yet most teams build the agent, eyeball a few charts, and push it to production with no systematic measure of whether it's right. The first time they learn it's wrong is when a downstream auditor tells them. An eval loop is how you avoid that.

An eval loop is the engineering practice of measuring an agent's output against known-correct answers, every time you change the agent, and refusing to ship changes that make the numbers worse. For abstraction, this is unusually tractable because the task has objective answers: a chart either documents an ejection fraction of 40% or it doesn't. That objectivity lets you build sharp, trustworthy metrics — if you set them up right.

Start with a gold set

The foundation is a gold set: a collection of real charts, each with the correct abstracted values established by expert human abstractors. This is your ground truth. Invest in it. A few dozen carefully adjudicated charts beat a thousand sloppily labeled ones, because every disagreement between your agent and a wrong gold label is wasted debugging. Cover the distribution deliberately: common cases, but also the edge cases that break agents, such as missing fields, conflicting documentation, ambiguous phrasing, and charts where the right answer is "not documented."

Treat the gold set as versioned, governed data. When abstractors update a label because guidelines changed, version it so you can tell whether a metric moved because the agent changed or because the truth did. A gold set that silently mutates is worse than none, because it makes your trend lines lie.
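One way to make that versioning concrete is to keep labels append-only and record every guideline-driven change under a new gold-set version. The sketch below is a minimal illustration under that assumption; `GoldLabel`, `GoldSet`, `revise`, and `snapshot` are hypothetical names, and the chart values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class GoldLabel:
    chart_id: str
    field_name: str
    value: Optional[str]  # None stands for "not documented" (assumed convention)
    version: int          # gold-set version that introduced this label
    reason: str           # why the label was set or changed


@dataclass
class GoldSet:
    version: int = 1
    history: list[GoldLabel] = field(default_factory=list)  # append-only log

    def add(self, chart_id, field_name, value, reason="initial adjudication"):
        self.history.append(
            GoldLabel(chart_id, field_name, value, self.version, reason)
        )

    def revise(self, chart_id, field_name, value, reason):
        """Record a changed label under a new gold-set version."""
        self.version += 1
        self.add(chart_id, field_name, value, reason)

    def snapshot(self, at_version=None):
        """Labels as they stood at a given version (latest by default)."""
        at = self.version if at_version is None else at_version
        labels = {}
        for lab in self.history:  # later entries override earlier ones
            if lab.version <= at:
                labels[(lab.chart_id, lab.field_name)] = lab.value
        return labels


gold = GoldSet()
gold.add("chart-001", "ejection_fraction", "40")
gold.revise("chart-001", "ejection_fraction", "40-45",
            reason="guideline update: record documented ranges")
```

Scoring an old agent build against `gold.snapshot(1)` and a new one against `gold.snapshot()` is what lets you separate "the agent changed" from "the truth changed."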

Measure at the field level, not the chart level

A chart-level pass/fail metric hides everything useful. A chart with nine perfect fields and one wrong one is neither a pass nor a clean fail — and rolling it up to a single boolean throws away the signal you need. Score each field independently. For categorical fields like a diagnosis code, measure exact-match accuracy. For "present versus not documented" decisions, build a confusion matrix: the agent saying "not documented" when the chart does document the field is a very different, often more dangerous, error than the reverse.
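Per-field scoring and the documented versus not-documented confusion matrix can be sketched in a few lines. This assumes gold labels and predictions are dictionaries keyed by `(chart_id, field)` and that `None` stands for "not documented"; the function names and sample values are illustrative, not taken from any library.

```python
from collections import Counter

NOT_DOCUMENTED = None  # assumed sentinel for "not documented"


def field_accuracy(gold, predicted):
    """Exact-match accuracy per field.

    A missing prediction is treated as "not documented" in this sketch.
    """
    hits, totals = Counter(), Counter()
    for (chart_id, field_name), truth in gold.items():
        totals[field_name] += 1
        if predicted.get((chart_id, field_name), NOT_DOCUMENTED) == truth:
            hits[field_name] += 1
    return {f: hits[f] / totals[f] for f in totals}


def documentation_confusion(gold, predicted, field_name):
    """Counts for the documented / not-documented decision on one field.

    This looks only at whether a value was produced, not whether it is right.
    """
    counts = Counter()
    for (chart_id, name), truth in gold.items():
        if name != field_name:
            continue
        guess = predicted.get((chart_id, name), NOT_DOCUMENTED)
        truth_doc = truth is not NOT_DOCUMENTED
        guess_doc = guess is not NOT_DOCUMENTED
        if truth_doc and guess_doc:
            counts["documented_found"] += 1
        elif truth_doc:
            counts["documented_missed"] += 1
        elif guess_doc:
            counts["not_documented_but_value_given"] += 1
        else:
            counts["not_documented_agreed"] += 1
    return dict(counts)


gold = {("c1", "ef"): "40", ("c2", "ef"): None,
        ("c3", "ef"): "55", ("c1", "dx"): "I50.9"}
pred = {("c1", "ef"): "40", ("c2", "ef"): "35",
        ("c3", "ef"): None, ("c1", "dx"): "I50.9"}

accuracy = field_accuracy(gold, pred)
confusion = documentation_confusion(gold, pred, "ef")
```

Keeping `documented_missed` and `not_documented_but_value_given` as separate counts is the point: the two error directions can then be gated with different tolerances.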

flowchart TD
  A["Code change"] --> B["Run agent on gold set"]
  B --> C["Score each field vs truth"]
  C --> D{"Metrics >= baseline?"}
  D -->|No| E["Block release; inspect regressions"]
  D -->|Yes| F{"New errors on critical fields?"}
  F -->|Yes| E
  F -->|No| G["Promote & update baseline"]

Weight fields by consequence. Getting the principal diagnosis wrong matters more than getting an optional comment field wrong, and your release gate should reflect that. A useful pattern is a hard floor on critical fields — exact-match accuracy on the principal diagnosis must not drop at all — combined with an aggregate target across the rest. That way a change that quietly degrades the most important field can never slip through on the strength of improvements elsewhere.
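The hard-floor-plus-aggregate pattern might look like the following sketch. The function name `release_gate`, the weight values, and the aggregate target are assumptions chosen for illustration; the accuracies are invented.

```python
def release_gate(baseline, candidate, critical_fields, weights, aggregate_target):
    """Return (ok, reasons). baseline and candidate map field -> accuracy in [0, 1]."""
    reasons = []

    # Hard floor: a critical field may never score below its baseline.
    for f in critical_fields:
        if candidate[f] < baseline[f]:
            reasons.append(
                f"critical field {f!r} dropped {baseline[f]:.3f} -> {candidate[f]:.3f}"
            )

    # Consequence-weighted aggregate over the remaining fields.
    rest = [f for f in candidate if f not in critical_fields]
    total_weight = sum(weights.get(f, 1.0) for f in rest)
    if total_weight > 0:
        aggregate = sum(weights.get(f, 1.0) * candidate[f] for f in rest) / total_weight
        if aggregate < aggregate_target:
            reasons.append(
                f"weighted aggregate {aggregate:.3f} below target {aggregate_target:.3f}"
            )

    return (not reasons, reasons)


baseline = {"principal_dx": 0.98, "discharge_date": 0.95, "comment": 0.80}
candidate = {"principal_dx": 0.97, "discharge_date": 0.99, "comment": 0.90}

ok, reasons = release_gate(
    baseline, candidate,
    critical_fields={"principal_dx"},
    weights={"discharge_date": 3.0, "comment": 1.0},
    aggregate_target=0.90,
)
```

Here the candidate improves both non-critical fields, yet the gate still fails because `principal_dx` slipped below its baseline. Critical fields are deliberately excluded from the aggregate so gains elsewhere cannot offset them.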

Separate trajectory evals from output evals

For an agent, the final answer isn't the only thing worth scoring. Two runs can land on the same correct value while one took two clean tool calls and the other thrashed through nine, looped twice, and got lucky. The second run is fragile even though its output passed. So run two kinds of eval: output evals that score the abstracted fields against the gold set, and trajectory evals that score how the agent got there. Did it stay within its tool budget? Did it ground every value in a source span? Did it avoid redundant calls? A change can pass output evals while quietly degrading the trajectory, and that degradation is your next production incident waiting to happen.

Trajectory metrics are also early warning. Tool-call count creeping up, source-grounding rate dropping, retries rising — these move before output accuracy does, because the agent compensates for sloppier reasoning right up until it can't. Gating on trajectory health alongside output correctness catches regressions while they're still cheap to fix, and it keeps your agent's behavior predictable as the prompt, tools, and model evolve underneath it.
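A trajectory eval can be as simple as a function over the recorded run. The sketch below assumes a run log that stores tool calls, abstracted values, and source spans; `ToolCall`, `Run`, and `trajectory_metrics` are hypothetical names, and a "redundant" call is defined here as an exact repeat of an earlier call.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: tuple  # hashable, so exact repeats can be detected


@dataclass
class Run:
    tool_calls: list
    values: dict        # field -> abstracted value
    source_spans: dict  # field -> (start, end) chart offsets, when grounded


def trajectory_metrics(run, tool_budget):
    """Score how the agent got to its answer, independent of correctness."""
    seen, redundant = set(), 0
    for call in run.tool_calls:
        key = (call.name, call.args)
        if key in seen:
            redundant += 1
        seen.add(key)

    grounded = sum(1 for f in run.values if f in run.source_spans)
    return {
        "tool_calls": len(run.tool_calls),
        "within_budget": len(run.tool_calls) <= tool_budget,
        "redundant_calls": redundant,
        "grounding_rate": grounded / len(run.values) if run.values else 1.0,
    }


run = Run(
    tool_calls=[
        ToolCall("search_chart", ("ejection fraction",)),
        ToolCall("read_note", (3,)),
        ToolCall("search_chart", ("ejection fraction",)),  # exact repeat
    ],
    values={"ef": "40", "dx": "I50.9"},
    source_spans={"ef": (120, 135)},  # "dx" has no supporting span
)

metrics = trajectory_metrics(run, tool_budget=4)
```

Averaging these per-run numbers across the gold set and tracking them between releases gives the early-warning trend described above.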

Use an LLM judge where exact match falls short

Some fields resist exact match. A free-text summary or a clinical rationale can be correct in several phrasings. Here, a second Claude instance acting as a judge — given the chart span, the gold answer, and the agent's answer, and asked whether they're clinically equivalent — gives you a usable score without brittle string matching. Keep the judge's rubric narrow and concrete, and validate the judge itself against human ratings on a sample, because an unchecked judge can drift and quietly bless wrong answers.
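Validating the judge against human ratings comes down to an agreement statistic on a shared sample. The sketch below uses Cohen's kappa, which is one reasonable choice rather than something the approach requires; the verdict lists are invented for illustration.

```python
def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length lists of boolean verdicts."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty rating lists of equal length")

    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's rate of "equivalent".
    p_human, p_judge = sum(human) / n, sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)

    if expected == 1.0:
        # Both raters gave the same constant verdict; kappa is undefined,
        # and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# True means "clinically equivalent to the gold answer".
human_verdicts = [True, True, True, False, False, True, False, True]
judge_verdicts = [True, True, False, False, False, True, True, True]

kappa = cohens_kappa(human_verdicts, judge_verdicts)
```

Raw agreement on this sample is 75%, but kappa comes out lower because both raters say "equivalent" most of the time. What counts as an acceptable kappa is a policy decision for your team.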

Don't over-apply the judge. For anything with a canonical answer — codes, dates, numeric values — exact or normalized matching is cheaper, faster, and more trustworthy than an LLM judge. Reserve the judge for the genuinely open-ended fields. A common mistake is judging everything with an LLM and inheriting its noise across fields that didn't need it.
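For canonical fields, normalized matching usually means a small normalizer per field type followed by plain equality. The helpers below are a sketch under stated assumptions: that dotted and undotted codes should compare equal, that dates arrive in one of two known formats, and that slash dates are month-first. Each of these is a policy choice to confirm with your abstractors.

```python
from datetime import datetime


def normalize_code(value):
    """Compare codes case-insensitively and ignore the dot: 'i50.9' equals 'I509'."""
    return value.strip().upper().replace(".", "")


def normalize_date(value, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Parse one of the assumed formats into ISO form; raise if none fits."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def numbers_match(gold, predicted, tolerance=0.0):
    """Numeric comparison after stripping whitespace and a trailing percent sign."""
    def to_float(raw):
        return float(str(raw).strip().rstrip("%"))

    return abs(to_float(gold) - to_float(predicted)) <= tolerance


code_equal = normalize_code(" i50.9 ") == normalize_code("I509")
date_equal = normalize_date("03/07/2024") == normalize_date("2024-03-07")
ef_equal = numbers_match("40%", "40.0")
```

Raising on an unrecognized date, rather than silently falling back to string comparison, keeps format drift visible in the eval output.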

Wire the evals into CI as a gate

An eval suite you run manually is an eval suite you'll stop running. Put it in your pipeline: every change to the prompt, the tools, the model version, or the harness triggers a run against the gold set, and the pipeline blocks the change if any gated metric regresses. This turns quality from an aspiration into a property the system enforces. It also catches the most insidious failures — the prompt tweak that fixes one chart and breaks five others — before they ship.
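Most CI systems fail a job when a step exits non-zero, so the gate can be a small script whose exit code reflects the comparison. This is a minimal sketch; `gate_exit_code` and the metric names are hypothetical, and loading the baseline and candidate metrics is left to your pipeline.

```python
import sys


def gate_exit_code(baseline, candidate, gated_metrics):
    """Return 0 if no gated metric regressed, 1 otherwise."""
    regressed = [m for m in gated_metrics if candidate[m] < baseline[m]]
    for m in regressed:
        print(
            f"REGRESSION {m}: {baseline[m]:.3f} -> {candidate[m]:.3f}",
            file=sys.stderr,
        )
    return 1 if regressed else 0


baseline = {"principal_dx_accuracy": 0.98, "grounding_rate": 0.99}
candidate = {"principal_dx_accuracy": 0.98, "grounding_rate": 0.96}

code = gate_exit_code(baseline, candidate, ["principal_dx_accuracy", "grounding_rate"])
# In the CI entry point, finish with: sys.exit(code)
```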

Make the gate's output legible. When a run fails, surface exactly which charts and which fields regressed, with the agent's answer, the gold answer, and the source span side by side. The faster a developer can see what broke, the faster they fix it, and the more the team trusts the gate instead of routing around it. A red gate nobody can interpret gets disabled within a week.
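A legible failure report needs little more than consistent formatting. The sketch below assumes each regression is a dictionary with `chart_id`, `field`, `agent`, `gold`, and `source_span` keys; those key names and the sample record are illustrative.

```python
def format_regressions(regressions):
    """Render one readable block per regressed (chart, field) pair."""
    lines = []
    for r in regressions:
        lines.append(f"{r['chart_id']} / {r['field']}")
        lines.append(f"  agent : {r['agent']!r}")
        lines.append(f"  gold  : {r['gold']!r}")
        lines.append(f"  source: {r['source_span']!r}")
    return "\n".join(lines)


report = format_regressions([
    {
        "chart_id": "chart-017",
        "field": "ejection_fraction",
        "agent": None,  # agent answered "not documented"
        "gold": "40",
        "source_span": "Echo today: LVEF 40%.",
    },
])
print(report)
```

Using `repr` for the values keeps `None`, empty strings, and stray whitespace distinguishable at a glance, which matters when the bug is exactly that kind of difference.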

Close the loop with production feedback

Your gold set should grow from production. When human reviewers correct the agent's output downstream, capture those corrections as candidate gold labels. The charts that get corrected are the ones your agent got wrong, and they are likely the cases your eval set undersamples. Feeding adjudicated production errors back into the gold set makes each release cycle test against the agent's real weaknesses, not just the cases you imagined when you started.
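Turning reviewer corrections into candidate labels can be sketched as a small aggregation step. The rule used here is an assumption for illustration, not a recommendation from the source: a correction is proposed only when at least `min_reviewers` distinct reviewers agree and nobody disagrees. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Correction:
    chart_id: str
    field_name: str
    agent_value: Optional[str]
    reviewer_value: Optional[str]
    reviewer_id: str


def candidate_labels(corrections, min_reviewers=2):
    """Split corrections into proposed labels and cases needing adjudication."""
    votes = {}  # (chart, field) -> {corrected value -> set of reviewer ids}
    for c in corrections:
        if c.reviewer_value == c.agent_value:
            continue  # reviewer confirmed the agent; not a correction
        key = (c.chart_id, c.field_name)
        votes.setdefault(key, {}).setdefault(c.reviewer_value, set()).add(c.reviewer_id)

    proposed, needs_adjudication = {}, []
    for key, by_value in votes.items():
        agreed = [v for v, who in by_value.items() if len(who) >= min_reviewers]
        if len(by_value) == 1 and agreed:
            proposed[key] = agreed[0]
        else:
            needs_adjudication.append(key)  # too few reviewers, or they disagree
    return proposed, needs_adjudication


corrections = [
    Correction("c1", "ef", "40", "35", "reviewer-a"),
    Correction("c1", "ef", "40", "35", "reviewer-b"),
    Correction("c2", "dx", "I50.9", "I50.22", "reviewer-a"),
]
proposed, needs_adjudication = candidate_labels(corrections)
```

Proposed labels still go through the same versioned gold-set process as any other label change; this step only decides what is worth an adjudicator's time.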

Track regressions, not just averages

An aggregate accuracy number can stay flat while the agent quietly trades correctness on one slice of charts for gains on another. The dangerous changes are the ones that improve the common case and break a rare-but-critical one — say, charts where the field is genuinely not documented. Diff every eval run against the previous one at the per-chart, per-field level, and flag any case that flipped from correct to wrong even if the overall number rose. New regressions on critical fields should block a release outright, regardless of how good the average looks.
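The per-chart, per-field diff is a short loop over the gold set. In this sketch, `find_flips` is a hypothetical name, runs are dictionaries keyed by `(chart_id, field)`, and `None` stands for "not documented"; a private sentinel keeps a missing output from being mistaken for a correct "not documented" answer. The sample values are invented.

```python
_MISSING = object()  # distinguishes "no output" from a None ("not documented") answer


def find_flips(gold, previous, current, critical_fields=()):
    """Cases that were correct in the previous run and are wrong in the current one."""
    flips = []
    for key, truth in gold.items():
        was_right = previous.get(key, _MISSING) == truth
        is_right = current.get(key, _MISSING) == truth
        if was_right and not is_right:
            chart_id, field_name = key
            flips.append({
                "chart_id": chart_id,
                "field": field_name,
                "gold": truth,
                "current": current.get(key),
                "critical": field_name in critical_fields,
            })
    return flips


gold = {("c1", "dx"): "I50.9", ("c2", "dx"): "I21.4",
        ("c3", "ef"): None, ("c4", "ef"): "55"}
previous = {("c1", "dx"): "I50.9", ("c2", "dx"): "I21.9",
            ("c3", "ef"): None, ("c4", "ef"): "50"}   # 2 of 4 correct
current = {("c1", "dx"): "I50.9", ("c2", "dx"): "I21.4",
           ("c3", "ef"): "40", ("c4", "ef"): "55"}    # 3 of 4 correct

flips = find_flips(gold, previous, current, critical_fields={"ef"})
block_release = any(f["critical"] for f in flips)
```

Overall accuracy rose from 2 of 4 to 3 of 4, yet the run introduced a new error on a not-documented chart, which is exactly the case an average hides.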

This is why the gate should compare against a stored baseline rather than an absolute threshold. "Accuracy above 95%" tells you nothing about whether this change made things worse than last week. "No critical-field regression versus the current baseline" does. Keep the baseline versioned alongside the gold set, and only advance it when a change clears the gate cleanly. That way your bar ratchets upward over time and never silently slips, which is exactly the property you want for something feeding regulated downstream systems.
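The ratchet can be implemented as a stored JSON baseline that only moves forward when the gate clears. The sketch below assumes a single baseline file that records the gold-set version it was scored against; the function name, file layout, and numbers are illustrative.

```python
import json
import tempfile
from pathlib import Path


def maybe_advance_baseline(path, candidate, critical_fields, gold_version):
    """Write the candidate metrics as the new baseline unless a critical field regressed."""
    path = Path(path)
    if path.exists():
        stored = json.loads(path.read_text())
        if stored["gold_version"] != gold_version:
            # Metrics from different gold-set versions are not comparable.
            raise ValueError("baseline was scored on a different gold-set version")
        for f in critical_fields:
            if candidate[f] < stored["metrics"][f]:
                return False  # regression: keep the old baseline
    path.write_text(
        json.dumps({"gold_version": gold_version, "metrics": candidate}, indent=2)
    )
    return True


with tempfile.TemporaryDirectory() as tmp:
    baseline_file = Path(tmp) / "baseline.json"
    first = maybe_advance_baseline(
        baseline_file, {"principal_dx": 0.97}, ["principal_dx"], gold_version=3
    )
    worse = maybe_advance_baseline(
        baseline_file, {"principal_dx": 0.95}, ["principal_dx"], gold_version=3
    )
    kept = json.loads(baseline_file.read_text())["metrics"]["principal_dx"]
```

The second call returns `False` and leaves the file untouched, so the stored bar stays at the best gate-clearing result. In practice the baseline file would be committed alongside the gold set rather than kept in a temporary directory.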

Frequently asked questions

How big does the gold set need to be?

Big enough to cover your field distribution and edge cases; it does not need to be big enough to be statistically perfect. Many teams start with a few dozen well-adjudicated charts and grow from production corrections. Quality and coverage of labels matter far more than raw count.

Can I trust an LLM judge for clinical correctness?

Only after you've validated it against human ratings and only for fields where exact match doesn't apply. Give it a tight rubric, check its agreement with experts on a sample, and re-check periodically. For coded and numeric fields, skip the judge and use deterministic matching.

What metric should gate the release?

Use per-field metrics with consequence weighting: a hard no-regression floor on critical fields like principal diagnosis, plus an aggregate target across the rest. A single chart-level pass rate hides the errors that matter most.

How do I keep evals from going stale?

Version the gold set, feed in adjudicated production corrections, and run the suite automatically on every change. An eval set that only contains the cases you thought of at the start slowly stops reflecting the charts your agent actually struggles with.

Evals for live conversations

CallSphere holds its voice and chat agents to the same bar — scored against real transcripts, gated before release, and improved from production feedback so quality goes up over time. See the approach at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
