---
title: "Evals for Contextual Retrieval RAG on Claude"
description: "Build an eval loop for contextual-retrieval Claude agents: recall and precision at k, a calibrated LLM judge, and CI gates that block quality regressions."
canonical: https://callsphere.ai/blog/evals-for-contextual-retrieval-rag-on-claude
category: "Agentic AI"
tags: ["agentic ai", "claude", "evals", "rag", "contextual retrieval", "llm-as-judge", "testing"]
author: "CallSphere Team"
published: 2026-01-30T12:09:33.000Z
updated: 2026-06-07T01:28:23.683Z
---

# Evals for Contextual Retrieval RAG on Claude

> Build an eval loop for contextual-retrieval Claude agents: recall and precision at k, a calibrated LLM judge, and CI gates that block quality regressions.

You cannot improve a contextual-retrieval agent you cannot measure, and "it feels better" is not a measurement. The teams that ship reliable agentic RAG share one habit: they have an eval loop that turns every prompt tweak, retrieval change, or model upgrade into a number, and they refuse to ship when the number drops. Without that loop, every change is a gamble — you fix one case and silently break three others, and you only find out from an angry user. With it, you catch regressions before release and you can actually tell whether contextual retrieval is earning its keep.

This post shows how to build a practical eval loop for a contextual-retrieval Claude agent: what to measure (retrieval quality and answer quality are different things), how to use an LLM-as-judge without fooling yourself, how to assemble a dataset that catches real regressions, and how to wire the whole thing into CI so a bad change can't merge.

## Key takeaways

- Measure **retrieval and answer quality separately** — a great answer can come from luck despite bad retrieval, and you need to know which layer failed.
- For retrieval, track **recall and precision at k** against a labeled set of expected chunks; for answers, score **faithfulness and correctness**.
- **LLM-as-judge** with Claude scales answer grading, but calibrate it against human labels and use a strict rubric, or you'll trust a biased grader.
- Build the dataset from **real production traces**, especially failures, not just happy-path questions.
- **Gate releases in CI**: define a passing threshold per metric and block the merge when a change regresses it.

## Why retrieval and answer quality are different metrics

The most common eval mistake in RAG is measuring only the final answer. A correct answer tells you the system worked *this time*, but not why. The model might have produced the right answer from prior knowledge despite retrieving useless chunks, which means your retrieval is broken and you don't know it — until a question comes along where the model has no prior knowledge to fall back on, and the bad retrieval is suddenly exposed.

So split the evaluation. Retrieval evals ask: did we surface the chunks that contain the answer? You measure this against a dataset where each question is labeled with the chunk IDs that should have been retrieved, computing recall (did we get the right chunks?) and precision (how much of what we retrieved was relevant?) at your chosen k. Answer evals ask a different question: given what we retrieved, is the final response faithful to it and correct? Contextual retrieval lives in the retrieval layer, so retrieval-specific metrics are exactly how you prove it's helping — you compare recall at k with and without the context headers.
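
To make the two retrieval metrics concrete, here is a minimal sketch. The function names and chunk IDs are illustrative, and it assumes retrieved chunk IDs are unique and ordered best-first. Precision here divides by the number of results actually returned (up to k); some teams divide by k itself, so pick one convention and keep it fixed across runs.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], expected: Set[str], k: int) -> float:
    """Fraction of the expected chunks that appear in the top-k results."""
    if not expected:
        # Nothing to find (e.g. an out-of-corpus question); treated as trivially
        # satisfied here, which is an assumption you may want to change.
        return 1.0
    top_k = set(retrieved[:k])
    return len(top_k & expected) / len(expected)


def precision_at_k(retrieved: Sequence[str], expected: Set[str], k: int) -> float:
    """Fraction of the top-k results that are expected chunks."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in expected) / len(top_k)


# Hypothetical example: five retrieved chunks, three labeled as expected.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
expected = {"c2", "c4", "c8"}
print(recall_at_k(retrieved, expected, k=5))     # 2 of 3 expected chunks found
print(precision_at_k(retrieved, expected, k=5))  # 2 of 5 retrieved chunks relevant
```

Averaging these per-item scores over the dataset gives the suite-level numbers you track from run to run.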

```mermaid
flowchart TD
  A["Eval dataset: Q + expected chunks + reference answer"] --> B["Run agent on each item"]
  B --> C["Score retrieval: recall & precision @k"]
  B --> D["Score answer: faithfulness & correctness"]
  C --> E{"Metrics >= thresholds?"}
  D --> E
  E -->|Yes| F["Allow release"]
  E -->|No| G["Block merge, surface regressed cases"]
```

## Building a dataset that catches real regressions

A synthetic dataset of obvious questions will pass forever and protect nothing. The dataset that earns its place comes from production. Mine your logs for real questions, and weight heavily toward the cases that went wrong — the loops, the wrong answers, the "I don't know" failures. Each one becomes a fixed test that must keep passing. This turns your eval set into a growing regression suite that encodes every bug you've already fixed.

For each item, capture three things: the question, the chunk IDs that *should* be retrieved (label these by hand or with assisted review), and a reference answer or a rubric describing what a correct answer must contain. Aim for coverage across the categories that matter — common questions, known hard cases, adversarial or ambiguous queries, and questions whose answer simply isn't in the corpus (the right answer there is a graceful "I don't have that information"). A few hundred well-chosen items beat thousands of trivial ones.
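
One way to hold those three fields is a small record type with a category label, so coverage gaps are easy to spot. This is a sketch under assumptions: the class name, field names, category labels, and sample questions are all hypothetical, not a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical category labels mirroring the coverage list above.
CATEGORIES = {"common", "hard", "adversarial", "out_of_corpus"}


@dataclass
class EvalItem:
    question: str
    expected_chunk_ids: List[str]          # empty for out-of-corpus questions
    reference_answer: Optional[str] = None
    rubric: Optional[str] = None           # alternative to a reference answer
    category: str = "common"

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.reference_answer is None and self.rubric is None:
            raise ValueError("item needs a reference answer or a rubric")
        if self.category == "out_of_corpus" and self.expected_chunk_ids:
            raise ValueError("out-of-corpus items should expect no chunks")


def coverage(items: List[EvalItem]) -> Counter:
    """Count items per category, keeping zero counts so gaps are visible."""
    counts = Counter({category: 0 for category in CATEGORIES})
    counts.update(item.category for item in items)
    return counts


items = [
    EvalItem("How do I reset my password?", ["kb-12#3"],
             reference_answer="Use the reset link on the login page."),
    EvalItem("What is the refund policy for a product you don't sell?", [],
             reference_answer="I don't have that information.",
             category="out_of_corpus"),
]
missing = sorted(c for c, n in coverage(items).items() if n == 0)
print(missing)  # categories that still have no items
```

Validating at construction time keeps malformed labels out of the suite, which matters once the dataset grows by one item per fixed bug.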

## Using Claude as a judge without fooling yourself

Grading answer quality by hand doesn't scale, so you reach for an LLM judge — a Claude call that reads the question, the retrieved context, and the agent's answer, and scores it against a rubric. This works well, but only if you guard against its failure modes. Give the judge a strict, explicit rubric with concrete criteria rather than asking it to rate "quality" on a vague scale. Ask it to evaluate faithfulness (is every claim supported by the retrieved context?) separately from correctness (does it match the reference answer?), because an answer can be faithful to bad context yet wrong.

```python
JUDGE_PROMPT = """You are grading a RAG answer. Use ONLY the rubric.

Question: {question}
Retrieved context: {context}
Agent answer: {answer}
Reference answer: {reference}

Score two axes, 0-2 each:
FAITHFULNESS: 2 = every claim supported by context; 1 = minor unsupported detail; 0 = contradicts or invents.
CORRECTNESS: 2 = matches reference; 1 = partially correct; 0 = wrong or missing.
Return JSON: {{"faithfulness": n, "correctness": n, "reason": "..."}}"""
```
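
Because the judge returns JSON inside free-form model output, it helps to parse and validate the reply before trusting the scores. The sketch below assumes the reply contains a single JSON object with the two axes from the rubric; `parse_judge_reply` is a hypothetical helper, and the call to Claude itself is left out.

```python
import json
from typing import Dict

AXES = ("faithfulness", "correctness")


def parse_judge_reply(raw: str) -> Dict[str, object]:
    """Extract the JSON object from a judge reply and validate both scores."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in judge reply")
    verdict = json.loads(raw[start : end + 1])  # raises ValueError on bad JSON
    for axis in AXES:
        score = verdict.get(axis)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 2:
            raise ValueError(f"{axis} must be an integer 0-2, got {score!r}")
    verdict.setdefault("reason", "")
    return verdict


# Hypothetical reply with some text around the JSON.
reply = 'Here is my grade: {"faithfulness": 2, "correctness": 1, "reason": "Omits the fee."}'
verdict = parse_judge_reply(reply)
print(verdict["faithfulness"], verdict["correctness"])
```

Treating an unparseable or out-of-range reply as an error, rather than silently scoring it as zero, keeps judge failures from masquerading as answer-quality regressions.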

The non-negotiable step is calibration. Hand-label a sample of cases yourself, run the judge on the same cases, and check agreement. If the judge disagrees with you often, tighten the rubric until it tracks human judgment. An uncalibrated judge gives you a number that feels rigorous and isn't, which is worse than no number at all.
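
"Check agreement" can be made measurable. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between human and judge scores on one axis. The scores are made-up sample data, and unweighted kappa treats the 0-2 scale as unordered labels; a weighted variant may suit ordinal scores better.

```python
from collections import Counter
from typing import Sequence


def exact_agreement(human: Sequence[int], judge: Sequence[int]) -> float:
    """Share of cases where the judge gave the same score as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[int], judge: Sequence[int]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(human)
    observed = exact_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up faithfulness scores for eight hand-labeled cases.
human = [2, 2, 1, 0, 2, 1, 0, 2]
judge = [2, 2, 1, 1, 2, 1, 0, 1]
print(f"agreement={exact_agreement(human, judge):.2f}")
print(f"kappa={cohens_kappa(human, judge):.2f}")

# The disagreements are the cases worth reading when tightening the rubric.
disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
print(disagreements)
```

What counts as "enough" agreement is a judgment call for your team; the point is to record the number before and after each rubric change.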

## Wiring the eval loop into CI

An eval suite that runs only when someone remembers to run it will rot. Put it in your pipeline. On every change to prompts, tool definitions, retrieval config, or model version, run the suite and compare each metric to a threshold. If recall at k or the faithfulness score drops below the bar, fail the build and print the specific cases that regressed so the author sees exactly what broke. This is the gate that lets a team move fast without quietly degrading.

Make the output actionable: don't just report "score 0.82." Diff against the last passing run and list which previously-passing items now fail. Track the metrics over time so you can see slow drift, and re-run the full suite when you adopt a new model — a Sonnet-to-Opus move can change behavior on edge cases even when the average improves, and the eval loop is how you confirm the upgrade is a net win before it reaches users.
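
A gate along these lines can be a short script that returns a non-zero exit code. In this sketch the metric names, threshold values, and item IDs are placeholders; it assumes you persist per-item pass/fail results from the last passing run so the current run can be diffed against them.

```python
from typing import Dict, List


def failed_metrics(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Describe every metric that is below its threshold (missing counts as 0)."""
    return [
        f"{name}: {metrics.get(name, 0.0):.3f} < {bar:.3f}"
        for name, bar in thresholds.items()
        if metrics.get(name, 0.0) < bar
    ]


def regressed_items(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Items that passed in the last passing run and fail (or are missing) now."""
    return sorted(
        item_id for item_id, passed in previous.items()
        if passed and not current.get(item_id, False)
    )


def gate(metrics, thresholds, previous, current) -> int:
    """Print failures and regressed cases; return the process exit code."""
    failures = failed_metrics(metrics, thresholds)
    for line in failures:
        print("FAIL", line)
    for item_id in regressed_items(previous, current):
        print("REGRESSED", item_id)
    return 1 if failures else 0


# Placeholder numbers for illustration only.
thresholds = {"recall_at_k": 0.85, "faithfulness": 1.70}
metrics = {"recall_at_k": 0.81, "faithfulness": 1.80}
previous = {"q1": True, "q2": True, "q3": False}
current = {"q1": True, "q2": False, "q3": False}

exit_code = gate(metrics, thresholds, previous, current)
# In a CI job, pass exit_code to sys.exit() so a failure blocks the merge.
```

Printing regressed items even when the thresholds still pass gives early warning of a change that trades one set of cases for another while the average holds.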

## Stand up an eval loop in five steps

1. Mine production logs for 100–300 real questions, over-sampling past failures.
2. Label each with expected chunk IDs and a reference answer or correctness rubric.
3. Add retrieval metrics (recall and precision at k) and answer metrics (faithfulness, correctness via a calibrated Claude judge).
4. Set a passing threshold per metric based on your current baseline.
5. Run the suite in CI on every relevant change and block merges that regress any metric.

## Common pitfalls

- **Only grading the final answer.** You'll miss broken retrieval that the model happens to paper over. Score retrieval separately.
- **An uncalibrated LLM judge.** A judge that doesn't agree with humans gives false confidence. Calibrate against hand labels first.
- **A happy-path dataset.** Easy questions pass forever and catch nothing. Build from real failures.
- **No "answer not in corpus" cases.** Without them you never test whether the agent admits ignorance instead of hallucinating.
- **Evals outside CI.** Manual evals get skipped under deadline pressure. The gate only works if it's automatic.

## Metric reference

| Metric | Layer | What it tells you | How to score |
| --- | --- | --- | --- |
| Recall @k | Retrieval | Did we surface the right chunks? | vs. labeled expected chunk IDs |
| Precision @k | Retrieval | How much retrieved content was relevant? | vs. labeled expected chunk IDs |
| Faithfulness | Answer | Are claims grounded in context? | Calibrated Claude judge |
| Correctness | Answer | Does it match the reference? | Judge or human review |

## Frequently asked questions

### How big does my eval dataset need to be?

Smaller than most people expect if the items are well-chosen. A few hundred questions that cover common cases, known hard cases, adversarial inputs, and out-of-corpus questions will catch the regressions that matter. Quality and coverage beat raw count; a thousand trivial questions tell you nothing a hundred sharp ones don't.

### Can I trust Claude to grade its own agent's answers?

Yes, with calibration and a strict rubric. Using an LLM as judge is standard practice and scales far better than manual grading, but you must verify it agrees with human labels on a sample before relying on it. Separate faithfulness from correctness so the judge isn't collapsing two different failure modes into one fuzzy score.

### How does this prove contextual retrieval is actually helping?

Run your retrieval metrics with and without the context headers on the same dataset. Because contextual retrieval works at the retrieval layer, recall at k is the direct measure — if recall improves with the headers, the technique is earning its cost. Answer metrics will usually follow, but retrieval recall is the cleanest signal.
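
A paired comparison makes this concrete: score the same items under both index variants and look at the per-item differences, not just the two averages. The sketch assumes you already have per-item recall at k for each variant; the function name and numbers are illustrative.

```python
from statistics import mean
from typing import Dict, List, Tuple


def compare_recall(
    baseline: Dict[str, float], contextual: Dict[str, float]
) -> Tuple[float, List[str], List[str]]:
    """Mean recall delta plus the items that improved and worsened."""
    shared = sorted(baseline.keys() & contextual.keys())
    if not shared:
        raise ValueError("the two runs share no items")
    delta = mean(contextual[q] - baseline[q] for q in shared)
    improved = [q for q in shared if contextual[q] > baseline[q]]
    worsened = [q for q in shared if contextual[q] < baseline[q]]
    return delta, improved, worsened


# Illustrative per-item recall@k without and with context headers.
without_headers = {"q1": 0.5, "q2": 1.0, "q3": 0.0}
with_headers = {"q1": 1.0, "q2": 0.5, "q3": 0.5}

delta, improved, worsened = compare_recall(without_headers, with_headers)
print(f"mean recall delta: {delta:+.3f}")
print("improved:", improved)
print("worsened:", worsened)
```

The worsened list is worth reading even when the mean goes up, since those are the questions where the added context pulled retrieval in the wrong direction.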

### What threshold should gate a release?

Start by measuring your current production system to establish a baseline, then set the gate at that baseline, or slightly below it if scores fluctuate from run to run, so the unchanged system passes and changes can't regress meaningfully below where you already are. Tighten thresholds over time as the system improves. The gate's job is to prevent backsliding, so anchoring it to your live baseline is the safe default.

## Measured quality on every call

CallSphere runs **voice and chat** agents behind the same kind of eval discipline — measuring retrieval and answer quality so assistants stay accurate as prompts and models evolve, and nothing ships that degrades a real conversation. See it at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

