Evals for Claude Agents: Gating Finance Narrative Quality

Measure Claude agent quality and gate releases with an eval loop — build a finance dataset, grade prose and tool use, and catch regressions before production.

A finance team can ship a Claude agent that writes a beautiful variance narrative on Monday and a subtly wrong one on Friday, and never know why. You changed a tool description. You bumped a model version. You tweaked the system prompt to fix one complaint. Each change felt safe, and any of them could have quietly degraded the output. The only way to ship agent changes with confidence — the same way you ship code with confidence — is an eval loop: a repeatable way to measure quality and block releases that make it worse.

This post is about building that loop for a finance narrative agent: how to assemble a dataset that reflects the work, how to grade open-ended prose and tool-use behavior, and how to wire evals into a gate so a regression never reaches the controller's desk.

Why "it looks good" isn't a quality bar

The trap with agents is that bad output often reads fluently. A narrative that says "gross margin expanded on favorable mix" sounds authoritative whether or not the underlying decomposition is correct. Eyeballing a few outputs feels like testing but isn't — it doesn't catch the case you didn't look at, and it can't tell you whether last week's fix broke this week's behavior. An eval is a repeatable test that scores an agent's output against defined criteria so quality can be measured, compared across versions, and used to gate releases.

The shift in mindset is from "does this one look right" to "across a representative set of cases, did quality go up, down, or stay flat." That requires three things you have to build deliberately: a dataset of real tasks, graders that turn fuzzy quality into numbers, and a threshold that decides pass or fail. Without all three, you're guessing.

Building the eval dataset

Your dataset is the foundation, and it should come from real work, not invented examples. Mine your history: the quarters you've already closed, the variances you've already explained, the narratives a human approved. Each case is an input (the period, the data context, the goal) paired with what "good" looks like (the drivers that should be identified, the figures that must be cited, the claims that must be sourced). You don't need thousands of cases to start; a few dozen well-chosen ones that span the easy, the typical, and the genuinely tricky will catch most regressions.
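
As a rough sketch of what one case might look like in code, the record below pairs the input (period, data context, goal) with the expectations a good narrative must meet. The `EvalCase` class, its field names, and the sample figures are all illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One eval case: an input plus what a good narrative must contain."""
    case_id: str
    period: str                    # e.g. "2024-Q3"
    goal: str                      # the task handed to the agent
    data_context: dict             # figures the agent may draw on
    expected_drivers: tuple = ()   # drivers that should be identified
    required_figures: tuple = ()   # figures that must be cited
    difficulty: str = "typical"    # "easy", "typical", or "tricky"


def coverage_by_difficulty(cases):
    """Count cases per tier to spot a dataset that is all happy paths."""
    counts = {"easy": 0, "typical": 0, "tricky": 0}
    for case in cases:
        counts[case.difficulty] = counts.get(case.difficulty, 0) + 1
    return counts


# Made-up sample cases, for illustration only.
cases = [
    EvalCase("q3-gm", "2024-Q3", "Explain gross margin variance vs. budget",
             {"gross_margin_pct": 41.2, "budget_gross_margin_pct": 39.8},
             expected_drivers=("product mix",), required_figures=(41.2, 39.8)),
    EvalCase("q3-fx", "2024-Q3", "Explain revenue variance after FX restatement",
             {"revenue": 12_400_000, "fx_impact": -310_000},
             expected_drivers=("fx",), required_figures=(-310_000,),
             difficulty="tricky"),
]
print(coverage_by_difficulty(cases))
```

A tier count like this is a cheap way to notice when the suite has drifted toward easy cases only.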

Crucially, seed the dataset with your failures. Every time the agent produces a bad narrative in production — it hallucinated a driver, it missed a material variance, it cited the wrong period — that case becomes a permanent eval. This is how the suite gets sharp over time: it accumulates exactly the mistakes your agent is prone to. A common mistake is building a dataset of only happy paths; an eval suite that never includes the hard cases will happily pass a model that fails them.
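
One way to make that habit mechanical is a small helper that turns a production failure into a permanent case. This is a minimal sketch: the dictionary layout, the `record_failure` and `failure_share` names, and the sample data are assumptions for illustration.

```python
import json


def record_failure(dataset, case_id, period, goal, failure_mode):
    """Append a production failure as a permanent eval case.

    Returns False and changes nothing if case_id is already present.
    """
    if any(case["case_id"] == case_id for case in dataset):
        return False
    dataset.append({
        "case_id": case_id,
        "period": period,
        "goal": goal,
        "failure_mode": failure_mode,    # e.g. "hallucinated_driver"
        "source": "production_failure",
    })
    return True


def failure_share(dataset):
    """Fraction of cases that originated from production failures."""
    if not dataset:
        return 0.0
    failures = sum(1 for c in dataset if c.get("source") == "production_failure")
    return failures / len(dataset)


# Made-up starting dataset with one approved historical case.
dataset = [{"case_id": "q2-opex", "period": "2024-Q2",
            "goal": "Explain opex variance", "source": "approved_history"}]
record_failure(dataset, "q3-wrong-period", "2024-Q3",
               "Explain revenue variance", "cited_wrong_period")
print(json.dumps(dataset[-1], indent=2))
print(failure_share(dataset))
```

Tracking the share of failure-derived cases gives a quick signal of whether the suite is still dominated by happy paths.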

flowchart TD
  A["Proposed change: prompt/tool/model"] --> B["Run agent over eval dataset"]
  B --> C["Deterministic graders: facts, sourcing"]
  B --> D["LLM-judge graders: clarity, completeness"]
  C --> E{"Score vs threshold"}
  D --> E
  E -->|Pass| F["Promote to production"]
  E -->|Fail| G["Block & review regressions"]
  G --> A

The loop above is the whole game: a change runs against the dataset, two kinds of graders score it, and a threshold decides whether it ships or goes back. The change never reaches production without passing the gate.

Grading prose and behavior

Finance narratives need two grader types because they have two kinds of correctness. Deterministic graders check the things that are objectively true or false: did the narrative cite the exact figures from the query results? Did it identify the three largest variances the data actually contains? Did it avoid mentioning any number that doesn't appear in a tool result? These are code you write — string and numeric checks against ground truth — and they're your strongest defense against hallucinated figures because they can't be fooled by fluent prose.
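
A deterministic grader of this kind could be sketched as follows. It assumes tool results arrive as flat dictionaries of numeric values and that figures match exactly; real narratives that round or rescale numbers (for example "12.4 million") would need a normalization step this sketch leaves out. The function names and sample figures are hypothetical.

```python
import re

# Matches numbers like 41.2, 12,400,000 or -310,000. The lookbehind skips
# digits glued to a letter, so "Q3" is not treated as the number 3.
NUMBER_RE = re.compile(r"(?<![\w.])-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values mentioned in the text."""
    return {float(m.replace(",", "")) for m in NUMBER_RE.findall(text)}


def grade_figures(narrative, tool_results, required=()):
    """Fail if any cited number is unsourced or a required figure is missing."""
    sourced = {float(v) for result in tool_results for v in result.values()
               if isinstance(v, (int, float))}
    cited = extract_numbers(narrative)
    unsourced = sorted(cited - sourced)
    missing = sorted(r for r in required if float(r) not in cited)
    return {"passed": not unsourced and not missing,
            "unsourced": unsourced, "missing": missing}


# Made-up tool result and narratives, for illustration only.
tool_results = [{"gross_margin_pct": 41.2, "prior_gross_margin_pct": 39.8}]
good = "Gross margin expanded to 41.2% from 39.8% on favorable mix."
bad = "Gross margin expanded to 41.2% from 38.5% on favorable mix."
print(grade_figures(good, tool_results, required=[41.2]))
print(grade_figures(bad, tool_results, required=[41.2]))
```

The second narrative reads just as fluently as the first, but the check flags 38.5 because no tool result contains it.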

But "is this narrative clear, complete, and appropriately hedged" isn't a numeric check, and that's where an LLM judge comes in: you ask a separate Claude call to score the output against a rubric — does it explain the why, not just the what; is the tone right for a board; does it flag uncertainty where the data is thin. The key to a reliable judge is a specific rubric with concrete criteria, not a vague "rate this 1–10." Use the deterministic graders to catch factual errors and the judge for qualities only a reader can assess, and weight them so a single fabricated number is an automatic fail regardless of how well it reads.
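
The rubric and the weighting rule might look something like the sketch below. The model call itself is deliberately left out: `judge_scores` stands in for whatever per-criterion verdicts (1 for met, 0 for not met) the judge returns. The criterion names, the prompt wording, and the 0.75 pass mark are assumptions, not recommended values.

```python
# Hypothetical rubric: each criterion is concrete enough to answer yes or no.
RUBRIC = {
    "explains_why": "Explains why each variance occurred, not only what changed.",
    "board_tone": "Uses a tone suitable for a board audience.",
    "flags_uncertainty": "Flags uncertainty where the underlying data is thin.",
}


def build_judge_prompt(narrative):
    """Assemble a rubric-based prompt for a separate judge call."""
    criteria = "\n".join(f"- {name}: {text}" for name, text in RUBRIC.items())
    return ("Score the narrative against each criterion with 1 (met) or 0 (not met).\n"
            f"Criteria:\n{criteria}\n\nNarrative:\n{narrative}")


def combine(deterministic_passed, judge_scores, pass_mark=0.75):
    """Blend graders; a factual failure is an automatic fail."""
    if not deterministic_passed:
        return {"score": 0.0, "passed": False, "reason": "factual check failed"}
    # Raises KeyError if the judge skipped a criterion, which is intentional.
    mean = sum(judge_scores[name] for name in RUBRIC) / len(RUBRIC)
    return {"score": mean, "passed": mean >= pass_mark, "reason": "judge rubric"}


perfect = {"explains_why": 1, "board_tone": 1, "flags_uncertainty": 1}
print(combine(True, perfect))     # passes on the rubric
print(combine(False, perfect))    # fails despite reading well
```

Putting the factual check ahead of the rubric average means polished writing can never offset a fabricated number.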

Evaluating the tool-use trajectory

For agents, the final text isn't the only thing worth grading — the path matters too. Two runs can produce the same narrative, but one queried the right tables efficiently and the other thrashed through fifteen calls and got lucky. Trajectory evals score the process: did the agent call the actuals tool rather than the forecast tool? Did it validate the period before querying? Did it stay within a reasonable number of turns? A narrative that's correct by accident is a regression waiting to happen, and only a trajectory eval catches it.
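
A trajectory grader can be a handful of checks over the recorded tool calls. The sketch below assumes each call is logged as a dictionary with `tool` and `args` keys, and that the tools are named `validate_period`, `query_actuals`, and `query_forecast`; those names and the call budget of 8 are hypothetical.

```python
import json


def grade_trajectory(calls, max_calls=8):
    """Score the tool-use path for a task that should use actuals data."""
    names = [c["tool"] for c in calls]
    query_positions = [i for i, n in enumerate(names) if n.startswith("query_")]
    first_query = query_positions[0] if query_positions else None
    validated_first = (
        "validate_period" in names
        and first_query is not None
        and names.index("validate_period") < first_query
    )

    # Identical tool + arguments seen twice suggests the agent is looping.
    seen, repeats = set(), 0
    for c in calls:
        key = (c["tool"], json.dumps(c.get("args", {}), sort_keys=True))
        repeats += key in seen
        seen.add(key)

    checks = {
        "used_actuals": "query_actuals" in names,
        "avoided_forecast": "query_forecast" not in names,
        "validated_period_first": validated_first,
        "within_call_budget": len(calls) <= max_calls,
        "no_repeated_calls": repeats == 0,
    }
    return {"passed": all(checks.values()), "checks": checks,
            "num_calls": len(calls)}


# Made-up traces, for illustration only.
clean = [{"tool": "validate_period", "args": {"period": "2024-Q3"}},
         {"tool": "query_actuals", "args": {"period": "2024-Q3"}}]
thrash = [{"tool": "query_forecast", "args": {"period": "2024-Q3"}},
          {"tool": "query_actuals", "args": {"period": "2024-Q3"}},
          {"tool": "query_actuals", "args": {"period": "2024-Q3"}}]
print(grade_trajectory(clean)["passed"])
print(grade_trajectory(thrash)["checks"])
```

Both traces could end in the same narrative; only the per-check breakdown shows that the second one got there by a path worth fixing.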

This is also where you catch the failure modes that text grading misses: an agent that converged on the right answer but looped three times along the way is more expensive and more fragile than its final narrative suggests. Grading the trajectory turns those silent inefficiencies into visible, scoreable signals you can drive down release over release.

Wiring evals into the release gate

An eval suite that runs only when someone remembers to run it provides little protection. The point is to make it a gate. Every proposed change — a new system prompt, a tweaked tool description, a model upgrade from Sonnet 4.6 to a newer release — runs the full suite automatically, and the change only ships if it clears the threshold and shows no regression on cases that previously passed. Tie this into CI so it happens on every pull request, exactly like a test suite.

Set the threshold from a measured baseline, not a wish. Run your current production agent over the dataset, record the score, and require new versions to match or beat it. When a change fails, the failing cases tell you precisely what broke — "three narratives now omit the largest variance" is an actionable diagnosis, not a vague unease. Over time the gate becomes the thing that lets you move fast: you can try an aggressive prompt rewrite or a model swap because the suite will catch it if you're wrong.
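
The gate decision described here can be sketched in a few lines. It assumes each run is summarized as a mapping from case ID to pass or fail; the `gate` function, its `tolerance` parameter, and the sample results are illustrative assumptions.

```python
def gate(baseline, candidate, tolerance=0.0):
    """Decide whether a candidate may ship, given per-case pass/fail results.

    baseline and candidate map case_id -> bool. A case missing from the
    candidate run counts as a failure.
    """
    total = len(baseline)
    baseline_rate = sum(baseline.values()) / total
    candidate_rate = sum(candidate.get(cid, False) for cid in baseline) / total
    # Cases that passed on the baseline but fail on the candidate.
    regressions = sorted(cid for cid, ok in baseline.items()
                         if ok and not candidate.get(cid, False))
    ship = candidate_rate >= baseline_rate - tolerance and not regressions
    return {"ship": ship, "baseline_rate": baseline_rate,
            "candidate_rate": candidate_rate, "regressions": regressions}


# Made-up results: the overall pass rate is unchanged, but one case regressed.
baseline = {"q3-gm": True, "q3-fx": True, "q3-opex": False}
candidate = {"q3-gm": True, "q3-fx": False, "q3-opex": True}
result = gate(baseline, candidate)
print(result["ship"], result["regressions"])
```

In CI, a script built on this would exit with a non-zero status when `ship` is false so the pull request is blocked. The example also shows why the aggregate score alone is not enough: the pass rate is identical, yet a previously passing case now fails.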

Frequently asked questions

How many eval cases do I need to start?

Fewer than you'd think. A few dozen cases spanning easy, typical, and hard scenarios catch most regressions and give you a meaningful score to track. Start small with real, approved examples, then grow the suite by adding every production failure as a permanent case. Coverage of failure modes matters far more than raw count.

When should I use an LLM judge versus a deterministic check?

Use deterministic graders for anything objectively verifiable — cited figures, identified drivers, sourcing — because they're reliable and can't be charmed by good writing. Use an LLM judge for qualities that need reading judgment, like clarity, completeness, and tone, and always pair it with a specific rubric. The two are complementary, not interchangeable.

Should evals grade the tool-use path or just the final answer?

Both. The final narrative tells you whether the output is right; the trajectory tells you whether it was produced reliably and efficiently. A correct answer reached by a thrashing, fifteen-call path is fragile and expensive, and only a trajectory eval surfaces that. Grade the process and the product.

How do I keep an LLM judge from being inconsistent?

Give it a concrete rubric with explicit criteria and examples of good and bad outputs, rather than a vague scale. Keep the judge's task narrow, and validate it periodically against human-graded cases to confirm its scores track human judgment. A well-specified judge is far more stable than an open-ended "rate this."
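
A periodic validation pass could be as simple as measuring how often the judge agrees with human graders on the same cases. This sketch assumes both sets of verdicts are pass/fail booleans keyed by case ID; the function name and sample labels are made up.

```python
def judge_agreement(human, judge):
    """Return (agreement rate, disagreeing case IDs) over shared cases."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no cases graded by both human and judge")
    disagreements = [cid for cid in shared if human[cid] != judge[cid]]
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements


# Made-up verdicts, for illustration only.
human = {"c1": True, "c2": False, "c3": True, "c4": True}
judge = {"c1": True, "c2": True, "c3": True, "c4": True}
rate, disagreements = judge_agreement(human, judge)
print(rate, disagreements)
```

The disagreeing cases are the useful output: reading them usually shows which rubric criterion is too loosely worded.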

Quality gates for every conversation

The same eval discipline that keeps a finance narrative accurate is what keeps a voice agent trustworthy turn after turn. CallSphere applies these agentic-AI patterns to voice and chat — assistants measured against real conversations before they ship, so they answer every call and book work reliably. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
