Contextual Retrieval RAG: A Real End-to-End Build

Abstract advice about contextual retrieval is easy to nod at and hard to apply. So this post follows one realistic build from start to finish: a mid-sized SaaS support team whose plain-RAG help bot kept answering with the wrong plan's billing rules, and how they rebuilt it on contextual retrieval with Claude into something they actually trusted to face customers. The shape of the problem is common; the specific decisions are where the lessons live. Every choice below is one a real team has to make, with the tradeoffs called out.

Key takeaways

The win came from situating each chunk with document context at index time, not from a bigger model or a fancier vector database.
Hybrid retrieval — dense embeddings plus BM25 plus a rerank — fixed the exact-match billing-code failures that pure embeddings missed.
An evaluation set built from real failed tickets, made before any code, is what told the team they had actually improved.
Wrapping retrieval as a Claude tool let the agent retry with a reformulated query when the first fetch was weak.
Shipping behind a confidence gate — answer, ask, or hand off — turned a risky launch into a safe one.

The problem: confident, wrong, and unciteable

The team's existing help bot chunked every doc into 500-token slices and embedded them directly. A customer on the Growth plan would ask about overage charges and get the Enterprise plan's rules, because the chunk that said "overage is billed at the standard rate after 10,000 units" never mentioned which plan it belonged to. The embedding had no way to know. Worse, the bot gave no citation, so support agents could not tell when it was wrong until a customer complained. Trust collapsed, and agents started telling customers to ignore it.

The root cause was textbook context loss: chunks that are meaningful inside a document become ambiguous once isolated. That is exactly the failure contextual retrieval addresses. But the team resisted the urge to start coding. The first decision — and the right one — was to build a way to measure the problem before touching the pipeline.

It is worth pausing on why this failure was so corrosive. A bot that is obviously broken gets ignored and replaced. A bot that is right most of the time but wrong about billing — the one topic where a wrong answer costs the company money and trust — is worse, because people half-rely on it. The team's real goal was not a higher accuracy number in the abstract; it was to earn back enough trust that support agents would stop overriding the bot by reflex. That framing shaped every later decision toward traceability over cleverness.

Step one: a ground-truth set from real failures

Before any reindexing, an engineer pulled 180 real tickets where the old bot had answered wrong, plus 120 it had gotten right, and for each wrote the question and the IDs of the chunks that should have been retrieved. This became the evaluation set. It is unglamorous work, and it is the highest-leverage thing the team did, because every later change could now be judged: did retrieval recall on these 300 cases go up or down? The flow they built around this set looked like this.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart TD
  A["Real failed tickets"] --> B["Build eval set: query + gold chunks"]
  B --> C["Re-chunk & contextualize docs"]
  C --> D["Hybrid index: dense + BM25"]
  D --> E["Rerank top results"]
  E --> F{"Recall up on eval set?"}
  F -->|No| C
  F -->|Yes| G["Ship behind confidence gate"]

The loop on the left is the whole method: change the pipeline, measure against the gold set, and only ship when the number moves the right way. Without box B, every later step would have been guesswork dressed up as progress.

Step two: contextualize chunks at index time

Next they reindexed. For each chunk, a Claude Haiku call generated a one-sentence situating context using the full document as a cached prefix, so the document was paid for once across all its chunks. The contextualized chunk that used to read "overage is billed at the standard rate after 10,000 units" now read, prefixed: "In the Growth plan billing section, overage on the Growth plan is billed at the standard rate after 10,000 API units." That single sentence is what made the chunk retrievable for a Growth-plan question.

The retrieval call itself became a tool the Claude agent invokes, so it could reformulate and retry. The tool definition looked like this:

{
  "name": "search_docs",
  "description": "Search support docs. Returns chunks with situating context, source doc, plan, and a relevance score. Re-query with a refined phrase if results are weak.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "plan": {"type": "string", "description": "customer plan to filter by"}
    },
    "required": ["query"]
  }
}

Because the tool exposed a plan filter and returned the relevance score, the agent could narrow to the customer's plan and decide whether the results were strong enough to answer on. That is the difference between single-shot RAG and agentic retrieval: the agent participates in getting good context, rather than accepting whatever the first lookup returns.

Step three: hybrid retrieval and reranking

Embeddings alone still missed exact identifiers — error codes, plan SKUs, API limit numbers — because semantic similarity blurs precise tokens. Adding a BM25 keyword index in parallel caught those, and a reranking pass over the combined candidates ordered them by true relevance. The before-and-after on the eval set is what convinced leadership.

Configuration	Top-5 retrieval recall	Wrong-plan answers
Plain RAG (baseline)	Low	Frequent
+ Contextual chunks	Much higher	Rare
+ Hybrid + rerank	Highest	Near zero on eval set

The numbers are kept qualitative here because every corpus differs, but the ordering is robust and repeatable: contextual chunks give the biggest single jump, and hybrid plus rerank closes the gap on exact-match queries that embeddings fumble. The team did not need a different vector database or a larger model — they needed better context and a second retrieval path.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Step four: ship behind a confidence gate

Rather than flipping the bot to fully autonomous, they shipped it with three outcomes: if retrieval confidence was high, the agent answered with a citation to the source doc and plan; if medium, it answered but flagged for a human agent to confirm; if low, it asked a clarifying question or handed off. This made the launch safe. Support agents saw citations, could verify in one click, and trust returned. Over the following weeks the high-confidence path widened as the eval set grew and the index improved.

Common pitfalls this team hit (so you can skip them)

Almost coding before measuring. The instinct was to reindex first. Building the eval set first is what made every later decision objective instead of vibes.
Forgetting the document cache. The first contextualizing run re-sent the full document for every chunk and the cost spiked. Caching the document prefix cut it dramatically.
Embeddings-only confidence. Pure vector search kept missing exact error codes until BM25 was added in parallel. Identifiers need keyword retrieval.
Launching fully autonomous. A confidence gate with a human fallback turned a scary cutover into an incremental, trustworthy rollout.
Letting the eval set go stale. They scheduled a monthly refresh from new failed tickets so the metric kept reflecting reality.

Reproduce this in five steps

Collect 200–300 real queries with the chunks that should answer them; this is your gold set.
Reindex with a one-sentence situating context per chunk, generated by Claude Haiku with the document cached.
Add a parallel BM25 index and a reranking pass so exact identifiers survive.
Expose retrieval as a Claude tool the agent can filter and retry against.
Ship behind a confidence gate that answers, asks, or hands off, and grow the autonomous path as the metric proves it.

Frequently asked questions

How long does a build like this take?

For a single corpus, a small team can reach a measurable improvement in two to four weeks. The eval set takes a few days, contextual reindexing and hybrid retrieval take about a week, and the rest is the confidence gate and rollout. The slow part is discipline, not difficulty.

Did they need a new vector database?

No. The wins came from contextual chunks, a parallel keyword index, and reranking — all of which most existing stacks support. Swapping databases is rarely the lever; better context and a second retrieval path almost always are.

Why expose retrieval as a tool instead of a fixed lookup?

Because real questions are messy. As a tool, the agent can filter by plan, judge whether results are strong, and re-query with a better phrase when they are weak. A fixed single lookup accepts whatever it gets, which is exactly how the original bot failed.

What convinced leadership to ship?

The before-and-after recall on the eval set, plus citations that let support agents verify answers in one click. Concrete, traceable improvement on real failed tickets is far more persuasive than a demo, and it is what rebuilt trust internally.

Bringing agentic AI to your phone lines

CallSphere takes this same build pattern to voice and chat — agents that retrieve the right plan, account, or policy mid-conversation and cite it back. See a working version at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Contextual Retrieval RAG: A Real End-to-End Build

Key takeaways

The problem: confident, wrong, and unciteable

Step one: a ground-truth set from real failures

Step two: contextualize chunks at index time

Step three: hybrid retrieval and reranking

Step four: ship behind a confidence gate

Common pitfalls this team hit (so you can skip them)

Reproduce this in five steps

Frequently asked questions

How long does a build like this take?

Did they need a new vector database?

Why expose retrieval as a tool instead of a fixed lookup?

What convinced leadership to ship?

Bringing agentic AI to your phone lines

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Cowork is heading and how to prepare

Where Claude Code GTM engineering is heading next

Measuring Claude Cowork success: metrics that prove it

How to measure success of Claude Code GTM workflows

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild