Testing and Evals for Grounded Claude: Gate Releases
Build an eval loop for grounded Claude: faithfulness and citation metrics, a hand-labeled set, an LLM judge, and a CI gate that blocks regressions.
You cannot eyeball your way to a trustworthy grounded agent. The whole point of citations is that a human should not have to re-read the source corpus to believe the answer — but if you only spot-check a handful of outputs by hand, you are doing exactly that, slowly, and you will still ship regressions. The teams that keep grounded Claude systems reliable treat quality the way they treat code: a measurable property, defended by an eval suite that runs in CI and blocks bad releases automatically.
This post is about building that eval loop for citation-grounded answers. We will define the metrics that actually matter for grounding — faithfulness and citation correctness, not just "did it sound right" — build a small but honest eval set, use Claude as a judge without fooling yourself, and wire the whole thing into a release gate so a prompt change that quietly breaks citations cannot reach production.
Key takeaways
- Measure faithfulness and citation correctness separately — an answer can be true yet wrongly cited, or well-cited yet unsupported.
- A small, hand-labeled eval set beats a large noisy one: 50–100 honest cases catch most regressions.
- Deterministic checks first, LLM judge second — verify spans with code, judge nuance with Claude.
- Gate releases on thresholds so a prompt or model change that lowers grounding quality fails CI.
- Grow the eval set from production failures — every real bug becomes a permanent regression test.
The metrics that matter for grounded answers
Generic "answer quality" hides the failures that grounding is supposed to prevent. Break it into specific, measurable dimensions. Faithfulness: is every claim in the answer supported by the retrieved context, with no invented facts? Citation correctness: does each citation actually point to a span that supports the sentence it is attached to? Coverage: does every factual sentence carry a citation, or are there ungrounded assertions? Retrieval quality: did the right documents get retrieved in the first place, measured independently of the answer?
Keeping these separate is what makes debugging tractable. A drop in faithfulness with stable retrieval points at the generation prompt. A drop in retrieval quality points at the index or chunking. If you only track one blended score, a regression in one dimension can be masked by an improvement in another, and you will ship it. Each metric should map to a clear owner and a clear fix.
Build the eval set before the eval harness
An eval is only as good as its cases. Start by collecting real questions — from logs, from your team, from anticipated user intents — and hand-label the correct answer and the correct supporting chunk(s) for each. This is tedious and irreplaceable; a hand-labeled set of fifty to a hundred cases will catch the overwhelming majority of regressions, far more reliably than thousands of auto-generated ones. Include the hard cases on purpose: questions the corpus cannot answer (the correct response is "not found"), questions with conflicting sources, and questions whose answer spans multiple chunks.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
{
"id": "eval_017",
"question": "Does the refund policy cover digital goods?",
"gold_answer": "No. Digital goods are explicitly excluded from refunds.",
"supporting_chunks": ["kb_refunds_3"],
"must_cite": true,
"expected_behavior": "answer_with_citation",
"notes": "Tempting distractor in kb_refunds_2 about physical returns"
}
Note the expected_behavior field. Some cases should ground-and-answer; others should refuse or say "not found." An eval set that only contains answerable questions will happily reward a model that never learns to concede — and conceding correctly is half the job of a grounded agent.
flowchart TD
A["Commit / prompt change"] --> B["Run agent over eval set"]
B --> C["Deterministic checks: span & coverage"]
C --> D["Claude-as-judge: faithfulness"]
D --> E["Aggregate scores"]
E --> F{"Above thresholds?"}
F -->|Yes| G["Allow release"]
F -->|No| H["Fail CI & show diffs"]
H --> I["Add new failures to eval set"]
Deterministic checks before the LLM judge
Run the cheap, objective checks first. Citation correctness has a large deterministic component: you can verify in code that each cited chunk was retrieved and that each quoted span is a literal substring of its chunk. Coverage is countable — what fraction of factual sentences carry a citation. These checks are fast, free, and unarguable, and they catch the most common and most dangerous failures before you spend a single judge token.
Use Claude as a judge for what code cannot decide: whether a span genuinely supports a claim, whether two answers are semantically equivalent, whether a refusal was appropriate. Give the judge the question, the answer, and the cited spans, and ask for a structured verdict with a reason. Critically, validate your judge against your hand labels before trusting it — if the judge disagrees with humans on your gold set, fix the judge prompt before using it to grade anything else.
Wire it into a release gate
An eval suite that runs only when someone remembers is not a gate. Put it in CI on every change to prompts, tools, retrieval config, or model version, and set explicit thresholds: minimum faithfulness, minimum citation correctness, zero tolerance for citing a chunk that was never retrieved. A pull request that drops any metric below its floor fails, the same way a failing unit test fails. This is what turns "we think it got better" into "the numbers say it got better, and nothing regressed."
Run the suite through the Batches API to keep it cheap, since eval runs are latency-tolerant by nature. And close the loop: every production incident — a wrong citation a user reported, a loop you caught in logs — becomes a new labeled case in the eval set. Over time the suite becomes a precise map of every way your system has ever failed, and a wall against repeating any of them.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Common pitfalls
- Blended single scores. One "quality" number hides which dimension regressed. Track faithfulness, citation correctness, coverage, and retrieval separately.
- Only answerable questions in the eval set. Without "should refuse" cases, you reward confident guessing. Include unanswerable and conflicting cases.
- Trusting the LLM judge blind. An unvalidated judge can be systematically wrong. Calibrate it against human labels first.
- Evals that never run automatically. Manual evals get skipped under deadline. Gate releases in CI or the discipline evaporates.
- A frozen eval set. If it never grows from real failures, it slowly stops reflecting production. Add every incident as a case.
Stand up a grounding eval gate in 6 steps
- Define faithfulness, citation correctness, coverage, and retrieval quality as separate metrics.
- Hand-label 50–100 real cases, including unanswerable and conflicting ones.
- Implement deterministic checks for span validity and citation coverage.
- Add a Claude judge for faithfulness and calibrate it against your human labels.
- Set thresholds and fail CI when any metric drops below its floor.
- Feed every production failure back in as a new permanent eval case.
Deterministic vs. judge-based checks
| Check | Method | Catches |
|---|---|---|
| Chunk was retrieved | Deterministic | Hallucinated citations |
| Quote is verbatim | Deterministic | Fabricated spans |
| Coverage of factual sentences | Deterministic | Ungrounded claims |
| Span supports the claim | Claude-as-judge | Subtle unsupported answers |
Frequently asked questions
What metrics should I track for grounded Claude answers?
Track faithfulness (every claim supported by context), citation correctness (each citation points to a span that supports its sentence), coverage (every factual sentence is cited), and retrieval quality (the right documents were fetched). Keeping them separate lets you tell whether a regression came from generation, citation, or retrieval, each of which has a different fix.
How big does my eval set need to be?
Smaller and honest beats larger and noisy. A hand-labeled set of fifty to a hundred real cases — including questions the corpus cannot answer and questions with conflicting sources — catches most regressions. Grow it over time by adding every production failure as a new permanent case rather than chasing raw volume.
Can I trust Claude to grade its own outputs?
Use Claude as a judge for nuanced calls like whether a span truly supports a claim, but calibrate it first. Run the judge against your human-labeled gold set; if it disagrees with people, fix the judge prompt before relying on it. Always run cheap deterministic checks for spans and coverage before invoking the judge.
How do evals gate a release?
Run the eval suite in CI on every change to prompts, tools, retrieval, or model version, with explicit thresholds for each metric. A change that drops any metric below its floor fails the build, exactly like a failing unit test, so a regression in grounding quality cannot reach production silently.
Measured quality on every conversation
CallSphere holds its voice and chat agents to the same eval discipline — faithfulness and citation checks gating every release so the answers customers hear stay grounded and accurate. See the results live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.