Wiring MCP Servers into Contextual RAG Agents
Expose contextual retrieval via MCP: auth at the boundary, strict schemas, structured errors, and idempotent results so Claude agents search safely.
Contextual retrieval is most useful when more than one agent can reach it. The clean way to share it is to put it behind a Model Context Protocol server, so any Claude agent — Claude Code, a Cowork plugin, a custom Agent SDK build — can call search_kb without knowing how the retriever works inside. But the moment retrieval becomes a network tool, you inherit the unglamorous concerns every API has: authentication, schema validation, error handling, and idempotency. This post is about doing those four things right.
Model Context Protocol is an open standard, introduced in late 2024, that lets Claude connect to external tools and data through a uniform server interface. Wrapping your retriever as an MCP server means the agent sees a typed tool; you keep full control over auth and failure behavior on the server side.
Key takeaways
- Expose contextual retrieval as a single, well-described MCP tool — keep the surface small and the schema strict.
- Authenticate at the MCP server boundary; never let the model hold raw credentials.
- Validate inputs against the schema and return structured, model-readable errors instead of stack traces.
- Make retrieval idempotent — identical queries return identical results within a version — so retries and caching are safe.
- Version your index so an agent can reason about freshness and so cache keys stay correct across re-indexing.
Designing the MCP tool surface
Resist the urge to expose ten tools. A retrieval server usually needs one: search_kb, with a strict input schema. A tight surface is easier for the model to use correctly and easier for you to secure. Put filters (date ranges, document types) into the schema as optional fields rather than spawning separate tools per filter.
{
  "name": "search_kb",
  "description": "Search the contextual knowledge base. Call when you need grounded facts you are unsure of.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "minLength": 1},
      "top_k": {"type": "integer", "minimum": 1, "maximum": 20, "default": 6},
      "doc_type": {"type": "string", "enum": ["policy", "contract", "faq"]}
    },
    "required": ["query"]
  }
}

The minLength, maximum, and enum constraints are not decoration. They let the server reject malformed calls before any expensive embedding or search happens, and they nudge the model toward valid inputs because the schema is part of what Claude reads.
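As a rough illustration of rejecting malformed calls early, the sketch below hand-checks the three constraints from the schema before any retrieval work. The function names and the error shape are assumptions made for this example; a real server might instead delegate to a JSON Schema validation library.

```python
from typing import Any

# Hypothetical mirror of the search_kb schema constraints.
ALLOWED_DOC_TYPES = {"policy", "contract", "faq"}
DEFAULT_TOP_K = 6


def invalid_input(message: str) -> dict[str, Any]:
    """Build a structured error the model can read and act on."""
    return {"error": {"type": "invalid_input", "message": message}}


def validate_search_args(args: dict[str, Any]) -> dict[str, Any]:
    """Return normalized arguments, or a structured invalid_input error."""
    query = args.get("query")
    if not isinstance(query, str) or len(query) < 1:
        return invalid_input("'query' must be a non-empty string.")

    top_k = args.get("top_k", DEFAULT_TOP_K)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 20:
        return invalid_input("'top_k' must be an integer from 1 to 20.")

    doc_type = args.get("doc_type")
    if doc_type is not None and (
        not isinstance(doc_type, str) or doc_type not in ALLOWED_DOC_TYPES
    ):
        return invalid_input("'doc_type' must be one of: contract, faq, policy.")

    return {"query": query, "top_k": top_k, "doc_type": doc_type}
```

Calling `validate_search_args({"query": "refund policy"})` fills in the default `top_k` of 6, while an out-of-range `top_k` or an unknown `doc_type` comes back as an `invalid_input` error without touching the index.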
Treat the description field as prompt engineering, because that is exactly what it is. Claude routes tool calls off the description, so "Search the contextual knowledge base. Call when you need grounded facts you are unsure of" produces sharper, better-timed calls than a bare "searches documents." Spell out when to call the tool, what kind of query works well, and what the tool will not do, so the model does not reach for it on turns that need no grounding. A few extra sentences here can reduce both spurious calls and missed ones, and the runtime cost is small: the description adds only a handful of tokens to each request that includes the tool.
Authentication at the boundary
The agent should never see a database password or an embedding-provider key. Auth lives on the MCP server. The agent presents a token to the server (via the transport), the server validates it, and only then does the server use its own privileged credentials to hit the vector store, BM25 index, and any rerank service.
flowchart TD
    A["Claude agent"] --> B["MCP server: validate token"]
    B -->|invalid| C["Return auth error to model"]
    B -->|valid| D["Validate input schema"]
    D -->|bad input| E["Return structured error"]
    D -->|ok| F["Retrieve: fuse + rerank"]
    F -->|backend down| G["Return retryable error"]
    F -->|ok| H["Return scored chunks + index_version"]

Read the diagram as a gauntlet: token, then schema, then backend. Each gate fails closed with a clear, structured message the model can act on, never a leaked exception. The privileged credentials only come out after both the token and the input pass.
This boundary is also where you enforce authorization, not just authentication. A token does not only prove who is calling; it should carry the scope of what that caller may retrieve. The server reads the scope from the token and applies it as a filter on the index query, so a caller restricted to public policy documents can never surface a confidential contract chunk — the restriction is applied before retrieval runs, not bolted on afterward as a fragile post-filter. Doing it server-side means the agent has no way to escalate its own access by crafting a clever query, because the documents it lacks permission for are simply not in the candidate set the retriever ever considers.
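One way to picture scope enforcement is the sketch below. It assumes a simple HMAC-signed token whose claims carry a `doc_types` list; the token format, the `SIGNING_KEY` value, and every function name are hypothetical. A production server would more likely rely on an established token standard and would also check expiry.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Assumption: a signing secret held only by the server (placeholder value).
SIGNING_KEY = b"server-side-secret"


def sign_token(claims: dict) -> str:
    """Issue a token: base64url(JSON claims) + '.' + hex HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    sig = hmac.new(SIGNING_KEY, body.encode("ascii"), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"


def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature checks out, otherwise None."""
    body, sep, sig = token.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(SIGNING_KEY, body.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    try:
        return json.loads(base64.urlsafe_b64decode(body))
    except ValueError:
        return None


def candidate_chunks(chunks: list, claims: dict, requested_doc_type=None) -> list:
    """Apply the token's scope before the retriever sees any chunk."""
    allowed = set(claims.get("doc_types", []))
    if requested_doc_type is not None:
        # A requested filter can only narrow the scope, never widen it.
        allowed &= {requested_doc_type}
    return [c for c in chunks if c["doc_type"] in allowed]
```

Because the scope is intersected with the requested filter, a caller scoped to `policy` and `faq` who asks for `doc_type="contract"` gets an empty candidate set, not contract chunks.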
Error handling the model can actually use
An MCP tool that throws raw exceptions teaches the agent nothing. Return errors as structured results that distinguish three cases: the caller's input was bad (do not retry, fix the query), the backend is temporarily down (retry with backoff), or the query is valid but found nothing (answer that you do not know). Each demands a different agent response.
{
  "error": {
    "type": "retryable",
    "message": "Vector store unavailable; retry shortly.",
    "retry_after_ms": 500
  }
}

Tag the type explicitly. A well-prompted Claude agent reads "type": "retryable" and waits, reads "type": "invalid_input" and rewrites its query, and reads an empty-result case and tells the user it could not find an answer rather than inventing one. The structure carries the recovery strategy.
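On the server side, producing these typed results might look like the following sketch. The exception classes and the `run_search` wrapper are hypothetical names chosen for illustration, and the retriever is passed in as a plain callable so the example stays self-contained.

```python
from typing import Any, Callable


class InvalidQuery(Exception):
    """Hypothetical: raised when the caller's input cannot be used."""


class BackendUnavailable(Exception):
    """Hypothetical: raised when the vector store or reranker is down."""


def typed_error(kind: str, message: str, **extra: Any) -> dict[str, Any]:
    """Build the structured error payload the model reads."""
    return {"error": {"type": kind, "message": message, **extra}}


def run_search(
    retrieve: Callable[[str], list], query: str, index_version: str
) -> dict[str, Any]:
    """Call the retriever and translate each outcome into a structured result."""
    try:
        chunks = retrieve(query)
    except InvalidQuery as exc:
        # The caller must change the query; retrying as-is will not help.
        return typed_error("invalid_input", str(exc))
    except BackendUnavailable:
        # Temporary condition; tell the caller how long to wait.
        return typed_error(
            "retryable", "Vector store unavailable; retry shortly.", retry_after_ms=500
        )
    if not chunks:
        # A valid query that matched nothing is its own case, not a failure.
        return typed_error("empty", "No matching passages were found.")
    return {"chunks": chunks, "index_version": index_version}
```

The sketch covers only the three cases named above; anything else that can go wrong in a real server would need its own mapping so that no raw exception reaches the model.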
Idempotency and safe retries
Retrieval is naturally read-only, which is a gift: identical inputs should produce identical outputs within a given index version. Lean into that. Make the server deterministic — stable tie-breaking in fusion, fixed rerank ordering — so a retried call after a timeout returns exactly what the first call would have. That lets the transport and the agent retry freely without fear of divergent results.
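To make "stable tie-breaking in fusion" concrete, here is a minimal sketch that assumes reciprocal rank fusion (RRF) as the fusion method. The post does not say which fusion algorithm the retriever uses, so RRF, the constant `k = 60`, and the function name are illustrative choices only.

```python
from collections import defaultdict


def fuse_rankings(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal rank fusion with a deterministic tie-break on chunk ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending, then by chunk ID ascending, so chunks with
    # equal scores always come out in the same order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Without the secondary sort key, two chunks with equal fused scores would be ordered by whichever backend happened to report them first, which is exactly the kind of instability that makes a retry diverge from the original call.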
The subtlety is index versioning. If you re-index between two identical queries, results can legitimately change. Return an index_version field with every response so the agent (and any caching layer) can tell whether a difference came from a new query or a new index, and so cache keys can include the version.
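A caching layer could fold the version into its keys along these lines. This is a sketch under assumptions: the key layout, the `VersionedCache` class, and the in-memory dictionary are invented for illustration and are not part of any MCP SDK.

```python
import hashlib
import json


def cache_key(index_version: str, query: str, top_k: int = 6, doc_type=None) -> str:
    """Derive a cache key that changes whenever the index version changes."""
    payload = json.dumps(
        {"v": index_version, "q": query, "k": top_k, "t": doc_type},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VersionedCache:
    """Tiny in-memory cache keyed on the index version plus the arguments."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, index_version, query, compute, top_k=6, doc_type=None):
        key = cache_key(index_version, query, top_k, doc_type)
        if key not in self._store:
            # Only run retrieval on a miss; tag the result with its version.
            self._store[key] = {"chunks": compute(), "index_version": index_version}
        return self._store[key]
```

After a re-index bumps the version, the same query maps to a new key, so stale entries are simply never hit again and no explicit invalidation pass is needed.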
Idempotency pays off most under failure, which is exactly when you need it. Networks drop responses, transports retry, and an agent may re-issue a call it is unsure completed. If your retriever is deterministic, none of that matters — the retry returns the same chunks the original would have, so there is no risk of the agent silently reasoning over two different result sets across an invisible retry. The only legitimate source of variation is a re-index, and that is precisely what the version field surfaces. Make determinism a property you test for: feed the same query twice and assert byte-identical results within a version, and you will catch unstable sorts or nondeterministic tie-breaking long before they confuse a production agent.
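A determinism check of that kind can be quite small. The helper below is a sketch: `search` stands in for whatever callable wraps your retriever, and comparing canonical JSON bytes is one assumed way to define "byte-identical".

```python
import json


def canonical_bytes(result: dict) -> bytes:
    """Serialize a response with sorted keys so equal results give equal bytes."""
    return json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")


def assert_deterministic(search, query: str, runs: int = 3) -> None:
    """Call search repeatedly and fail if any response differs from the first."""
    baseline = canonical_bytes(search(query))
    for _ in range(runs - 1):
        if canonical_bytes(search(query)) != baseline:
            raise AssertionError(f"non-deterministic results for query {query!r}")
```

Chunk order inside the response is deliberately left untouched by the serialization, so a reordering caused by unstable tie-breaking still shows up as a mismatch.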
| Concern | Wrong way | Right way |
|---|---|---|
| Auth | Key in agent prompt | Token validated at server |
| Bad input | 500 + stack trace | Structured invalid_input error |
| Backend down | Hang or crash | Retryable error + backoff |
| Retries | Nondeterministic results | Idempotent + index_version |
Wire it up in 5 steps
- Define one search_kb MCP tool with a strict input schema and an intent-rich description.
- Add token validation at the server boundary; load privileged backend credentials only after auth passes.
- Validate every input against the schema and reject malformed calls before any retrieval work.
- Return structured errors typed as invalid_input, retryable, or empty, never raw exceptions.
- Make retrieval deterministic and attach an index_version to every successful response.
Common pitfalls
- Letting the agent hold backend secrets. Credentials belong on the server. The agent gets a scoped token, nothing more.
- Returning unstructured errors. A stack trace string tells Claude nothing about whether to retry; a typed error tells it exactly what to do.
- Nondeterministic fusion. Unstable tie-breaking makes retries return different chunks, which breaks idempotency and confuses caches.
- Exploding the tool surface. Ten near-identical search tools confuse routing; one tool with optional schema filters is cleaner and safer.
- Omitting index_version. Without it, you cannot tell a stale cache hit from a fresh result, and freshness reasoning becomes guesswork.
Frequently asked questions
What is Model Context Protocol, briefly?
Model Context Protocol is an open standard that connects Claude to external tools and data through a uniform server interface, so any compatible agent can call your tools without bespoke integration code. It pairs with Skills, which teach the model how to use those tools well.
How do I scope what each agent can retrieve?
Encode permissions in the token the agent presents, and enforce them on the server by filtering the index query. The model never sees documents its token does not authorize, because the filtering happens before retrieval runs, not after.
Should retrieval errors fail the whole agent run?
No. Return a typed error and let the agent decide. A retryable backend hiccup should trigger a retry; an empty result should produce an honest "I could not find that." Failing the entire run throws away the agent's ability to recover gracefully.
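For harness code that sits around the agent, recovery by error type could be sketched as below. `call_tool` is a hypothetical callable that sends arguments to the MCP tool and returns its structured result; the status labels and the doubling backoff are likewise assumptions for this example.

```python
import time
from typing import Any, Callable


def call_with_recovery(
    call_tool: Callable[[dict], dict],
    args: dict,
    max_attempts: int = 3,
    sleep: Callable[[float], Any] = time.sleep,
) -> dict:
    """Dispatch on the typed error instead of failing the whole run."""
    error = None
    for attempt in range(1, max_attempts + 1):
        result = call_tool(args)
        error = result.get("error")
        if error is None:
            return {"status": "ok", "result": result}
        if error["type"] == "retryable" and attempt < max_attempts:
            # Honor the server's hint, doubling the wait on each further attempt.
            delay_ms = error.get("retry_after_ms", 500) * 2 ** (attempt - 1)
            sleep(delay_ms / 1000)
            continue
        break
    if error is not None and error["type"] == "empty":
        return {"status": "not_found", "message": "I could not find that."}
    # invalid_input, or a retryable error that outlasted every attempt.
    return {"status": "failed", "error": error}
```

Passing `sleep` as a parameter keeps the backoff testable without real delays, and a `not_found` status gives the agent something honest to say instead of aborting the run.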
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.