Caching with Claude Tools and MCP Servers: A Practical Guide

Tool use is where prompt caching gets subtle. A static document at the front of a prompt is easy to cache and forget. A live tool surface backed by MCP servers is not: the schemas render at position zero, the auth flow can leak per-request data into the prefix, error results change the message stream, and a non-idempotent tool can corrupt a conversation that caching has made cheap to replay. Getting caching and tools to coexist means treating the tool layer with the same byte-level discipline you apply to the system prompt, plus a few rules that are specific to wiring external servers in.

This post is the practical guide to that intersection. We will cover how to keep MCP-backed tool schemas deterministic, where authentication credentials are allowed to live, how to structure tool results and errors so they do not silently invalidate the cache, and why idempotency matters more when caching makes replays nearly free. The throughline: tools are cacheable infrastructure, and you wire them so they stay byte-stable.

Key takeaways

Tool schemas live at position zero — sort them deterministically and never vary the set per request, or the entire cache is lost on every call.
Keep credentials out of the cached prefix. Auth tokens belong in transport headers or host-side handlers, never interpolated into tool descriptions or the system prompt.
MCP tool schemas must be converted to a stable, sorted shape; an unsorted property map breaks the prefix match across requests.
Tool error results (is_error: true) change the message stream, which is expected: they affect caching only from that turn forward and leave the cached tools-plus-system prefix untouched.
Make tools idempotent (or guard them with keys) because caching makes cheap replays likely, and a replayed non-idempotent write double-charges the real world.

Why the tool surface is the most fragile part of the cache

Recall the render order: tools, then system, then messages. Tools occupy the very front of the hashed stream, so any instability in how they are defined or serialized invalidates everything downstream, the system prompt and the entire conversation included. The reverse does not hold: a document in the system prompt can change and still leave the cached tools block intact, costing you only the system and messages caches, whereas a single reordered tool property costs you all three. That asymmetry is why the tool surface deserves the most attention.

The definition of "stable" here is byte-level. Two tool lists that are semantically identical but serialized with different JSON key orders produce different bytes and therefore different cache keys. MCP servers make this easy to get wrong, because the tool definitions arrive from an external process and you convert them into Anthropic's tool format on the fly. If that conversion is nondeterministic — iterating a dict whose order varies, or letting the server's ordering leak through — your position-zero block changes shape between requests and caching never engages.
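To make the byte-level point concrete, here is a minimal sketch. It uses a SHA-256 of the serialized JSON as a stand-in for a cache key; the real key is computed server-side and is not exposed, so the `fingerprint` helper and the sample `get_weather` tool are illustrative assumptions, not part of any SDK.

```python
import hashlib
import json

def fingerprint(tools, canonical):
    """Hash the serialized tool list; a stand-in for the (internal) cache key."""
    payload = json.dumps(tools, sort_keys=canonical, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two semantically identical definitions of a hypothetical tool,
# differing only in the order their keys were inserted.
tool_a = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "units": {"type": "string"}},
        "required": ["city"],
    },
}]
tool_b = [{
    "input_schema": {
        "required": ["city"],
        "properties": {"units": {"type": "string"}, "city": {"type": "string"}},
        "type": "object",
    },
    "description": "Look up current weather for a city.",
    "name": "get_weather",
}]

same_meaning = tool_a == tool_b  # dict equality ignores key order
raw_match = fingerprint(tool_a, False) == fingerprint(tool_b, False)
canonical_match = fingerprint(tool_a, True) == fingerprint(tool_b, True)
print(same_meaning, raw_match, canonical_match)  # True False True
```

The two lists compare equal as Python objects, yet their default serializations differ; only the sorted-key serialization produces matching bytes.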

Wiring MCP tools so they stay byte-stable

When you bring tools in from an MCP server, normalize them into a canonical, sorted shape before they ever reach the request. The pattern is to fetch the server's tool list once, convert each tool deterministically, and sort the result by name.

def normalize_mcp_tools(mcp_tools):
    out = []
    for t in mcp_tools:
        out.append({
            "name": t.name,
            "description": t.description or "",
            "input_schema": t.inputSchema,  # already a JSON schema object
        })
    return sorted(out, key=lambda x: x["name"])  # canonical order

tools = normalize_mcp_tools(mcp_client_tools)  # build once, reuse every request

Two rules make this safe. First, build the normalized list once at startup and reuse the same object across requests rather than reconverting per call — reconversion reintroduces nondeterminism. Second, if a server's schema includes objects whose property order is not guaranteed, serialize those with sorted keys. The Python SDK's MCP conversion helpers produce stable tool types you can pass to the tool runner, but the moment you hand-roll the conversion, sorting becomes your responsibility.
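The second rule can be sketched as a recursive key sort applied during a hand-rolled conversion. This is one possible approach, not the SDK's own helper: `canonicalize`, `build_tools`, and the `(name, description, schema)` triple shape are hypothetical names chosen for illustration. List order is deliberately left alone, on the assumption that it can carry meaning (enum values, for example).

```python
import json

def canonicalize(value):
    """Return a copy of a JSON-like value with every dict's keys sorted.

    Lists keep their original order; only mapping keys are reordered.
    """
    if isinstance(value, dict):
        return {k: canonicalize(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value

def build_tools(raw_tools):
    """raw_tools: iterable of (name, description, schema) triples (hypothetical shape)."""
    tools = [
        {
            # Fixed top-level key order plus a recursively sorted schema.
            "name": name,
            "description": description or "",
            "input_schema": canonicalize(schema),
        }
        for name, description, schema in raw_tools
    ]
    return sorted(tools, key=lambda t: t["name"])

# The same two tools, delivered in different orders with different key orders.
first = build_tools([
    ("search", "Search docs.", {"type": "object", "properties": {"q": {"type": "string"}}}),
    ("fetch", None, {"properties": {"url": {"type": "string"}}, "type": "object"}),
])
second = build_tools([
    ("fetch", None, {"type": "object", "properties": {"url": {"type": "string"}}}),
    ("search", "Search docs.", {"properties": {"q": {"type": "string"}}, "type": "object"}),
])
print(json.dumps(first) == json.dumps(second))  # True
```

Calling `build_tools` once at startup and holding on to the returned list follows the first rule; the canonicalization is a safety net for cases where the list does have to be rebuilt, such as a process restart.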

flowchart TD
  A["MCP server tool list"] --> B["Normalize + sort by name"]
  B --> C["Frozen tools block (position 0)"]
  C --> D["System prompt + breakpoint"]
  D --> E["Claude requests a tool"]
  E --> F{"Tool succeeds?"}
  F -->|Yes| G["tool_result appended to messages"]
  F -->|No| H["tool_result is_error:true appended"]
  G --> I["Cache holds tools+system; only messages tail changes"]
  H --> I

The diagram makes the boundary explicit: whether the tool succeeds or errors, only the tail of the messages array changes. The frozen tools-plus-system prefix stays cached, so the cost of a tool round trip is just the new result block, not a re-processing of the whole prompt.

Authentication without poisoning the prefix

The dangerous instinct with MCP auth is to put credentials somewhere convenient — a token in the system prompt, an API key in a tool description "so the model knows." Both place per-request or per-tenant secrets into the cached prefix, which not only breaks caching across users but is a genuine security problem because prompt content is durably stored in conversation history.

The correct pattern keeps credentials entirely out of the model-visible prompt. For remote MCP servers, auth travels in the transport layer — the connection's headers or an OAuth credential the runtime injects after the request leaves your code. For tools you execute yourself, the credential lives in your host-side handler: the model emits a tool call with no secret, your handler reads the credential from its own environment, performs the authenticated call, and returns only the result. The model never sees the key, the prefix stays identical across tenants, and one cached tools-plus-system block serves every user.
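For the self-executed case, a minimal sketch of the host-side pattern might look like the following. Everything named here is an assumption made for illustration: the `crm_lookup` tool, the `CRM_API_TOKEN` environment variable, the placeholder URL, and the injected `send_request` callable, which stands in for a real HTTP client so the example runs without a network.

```python
import json
import os

# Model-visible definition: describes the capability and contains no secret.
CRM_LOOKUP_TOOL = {
    "name": "crm_lookup",
    "description": "Look up a customer record by email address.",
    "input_schema": {
        "type": "object",
        "properties": {"email": {"type": "string"}},
        "required": ["email"],
    },
}

def handle_crm_lookup(tool_input, send_request):
    """Host-side handler: attaches the credential after the model has emitted the call.

    send_request(url, headers, params) is a hypothetical transport callable.
    """
    token = os.environ["CRM_API_TOKEN"]  # hypothetical variable name
    headers = {"Authorization": f"Bearer {token}"}
    record = send_request(
        "https://crm.example.invalid/customers",  # placeholder URL
        headers,
        {"email": tool_input["email"]},
    )
    # Return only the data; the credential never enters the conversation.
    return json.dumps(record, sort_keys=True)

# Demo with a fake transport that records the headers it was given.
os.environ["CRM_API_TOKEN"] = "demo-token"  # demo only; set outside the code in practice
seen_headers = {}

def fake_send(url, headers, params):
    seen_headers.update(headers)
    return {"email": params["email"], "plan": "pro"}

result = handle_crm_lookup({"email": "a@example.com"}, fake_send)
print(result)
```

The tool definition is the same for every tenant, so it can sit in the shared cached prefix; only the handler's environment differs per deployment.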

Structuring tool results and errors

Tool results are appended to the messages array as tool_result blocks, each carrying the tool_use_id that matches the originating call. This is expected message growth: the new blocks extend the uncached tail, and everything before them, including the tools-plus-system prefix, still matches the cache. So you do not need to fear tool results breaking caching; you need to keep them well-formed and byte-identical on later requests so the cache for prior turns keeps matching.

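# execute() and ToolError are application-defined: your tool dispatcher
# and the exception type it raises on a failed call.
tool_results = []
for block in response.content:
    if block.type == "tool_use":
        try:
            # The result must be a string or a list of content blocks.
            data = execute(block.name, block.input)
            tool_results.append({"type": "tool_result",
                "tool_use_id": block.id, "content": data})
        except ToolError as e:
            # Report the failure to the model instead of raising.
            tool_results.append({"type": "tool_result",
                "tool_use_id": block.id,
                "content": f"Error: {e}", "is_error": True})
# All results for one assistant turn go back in a single user message.
messages.append({"role": "user", "content": tool_results})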

Mark genuine failures with is_error: true and a description the model can act on; it will typically retry differently or ask for clarification. The caching implication is mild: an error result is just another message block, so it affects caching only from that turn on, exactly like a successful result. Keep results deterministic where you can. If a tool embeds timestamps or returns rows in random order, re-executing it on a replayed turn yields different bytes, and every cached block after that point stops matching.
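One way to keep results deterministic is to normalize them before they are appended. The sketch below assumes the tool returns a list of row dictionaries and that `fetched_at` and `request_id` are volatile fields the model does not need; both field names and the `stable_result` helper are hypothetical.

```python
import json

VOLATILE_FIELDS = {"fetched_at", "request_id"}  # hypothetical field names

def stable_result(rows, sort_key):
    """Serialize tool output so the same data always yields the same bytes."""
    cleaned = [
        {k: v for k, v in row.items() if k not in VOLATILE_FIELDS}
        for row in rows
    ]
    cleaned.sort(key=lambda row: row[sort_key])  # fixed row order
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))

# Same data, fetched twice: different row order, key order, and timestamps.
run_1 = stable_result(
    [{"id": 2, "name": "b", "fetched_at": "2024-01-01T00:00:00Z"},
     {"id": 1, "name": "a", "fetched_at": "2024-01-01T00:00:00Z"}],
    sort_key="id",
)
run_2 = stable_result(
    [{"name": "a", "id": 1, "fetched_at": "2024-01-01T00:00:05Z"},
     {"name": "b", "id": 2, "fetched_at": "2024-01-01T00:00:05Z"}],
    sort_key="id",
)
print(run_1 == run_2)  # True
```

Dropping a field is only appropriate when the model does not need it; if a timestamp matters to the answer, keep it and accept that a re-executed call will produce new bytes.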

Idempotency: why caching raises the stakes

Caching makes it cheap to replay a conversation prefix — for retries, for forked sub-agents, for re-running a turn after an error. That cheapness is exactly why idempotency matters more once caching is in play. If a tool performs a non-idempotent side effect — charging a card, sending an email, creating a record — and the turn that calls it gets replayed, you double the real-world action even though the model and cache happily reuse state.

The pattern is to give every side-effecting tool an idempotency key derived from the request, and to have the handler deduplicate on it. A send_email tool takes a stable message ID; a create_order tool takes a client-supplied order key. The handler checks whether that key was already processed and returns the prior result instead of acting twice. This keeps replays — which caching actively encourages — safe. Read-only tools need no such guard, which is another reason to promote dangerous actions to dedicated, gated tools rather than running them through an opaque bash call.
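The deduplication step can be sketched as a small wrapper around the side-effecting function. This is an illustration under stated assumptions: `IdempotentHandler`, the `order_key` field, and the in-memory dictionary are hypothetical, and a real deployment would need durable, shared storage and protection against concurrent calls with the same key.

```python
class IdempotentHandler:
    """Wraps a side-effecting function and deduplicates on an idempotency key."""

    def __init__(self, action, key_field):
        self._action = action
        self._key_field = key_field
        self._results = {}  # in-memory and not thread-safe; illustration only

    def __call__(self, tool_input):
        key = tool_input[self._key_field]
        if key in self._results:
            # Replay: return the prior result without acting again.
            return self._results[key]
        result = self._action(tool_input)
        self._results[key] = result
        return result

# Demo: a fake create_order that records every real charge it performs.
charges = []

def create_order(tool_input):
    charges.append(tool_input["amount_cents"])
    return {"order_id": f"ord_{len(charges)}", "status": "created"}

handler = IdempotentHandler(create_order, key_field="order_key")
first = handler({"order_key": "cart-42", "amount_cents": 1999})
replay = handler({"order_key": "cart-42", "amount_cents": 1999})
print(first == replay, charges)  # True [1999]
```

Because the replay returns the stored result, the tool_result the model sees is also byte-identical to the original, which keeps the replayed turn consistent with what was cached.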

Common pitfalls

Reconverting MCP tools every request. Each conversion can reorder properties; build the normalized, sorted tool list once and reuse it.
Putting an API key in a tool description. It poisons the prefix across tenants and persists the secret in history. Keep credentials host-side or in transport headers.
Letting tools return nondeterministic content. Timestamps and random ordering in results change replayed-turn bytes and erode downstream cache hits.
Treating an error result as a cache problem. It is normal message growth; it invalidates only from that turn forward, not the cached tools-plus-system prefix.
Non-idempotent side-effecting tools. Cached replays double the real-world effect; add idempotency keys and deduplicate in the handler.

Frequently asked questions

Does connecting an MCP server invalidate my cache every time?

Only if the tool definitions it contributes change between requests. A server whose tool list is fetched once, normalized, sorted, and reused produces a byte-stable position-zero block, so the cache holds. Invalidation comes from the schemas changing or being reordered, not from the mere presence of an MCP connection.

Where should MCP OAuth tokens be stored so they do not break caching?

Out of the prompt entirely. For hosted MCP servers, credentials are supplied through the connection or a vault the runtime injects after the request leaves your code; for self-executed tools, they live in your host-side handler's environment. Either way the model-visible prefix is identical across users, which is what keeps one cached entry serving everyone.

Do tool_result blocks count toward the twenty-block lookback window?

Yes. Each tool_use and tool_result is a content block, so a turn with many tool round trips can exceed the roughly twenty-block window that the cache lookup searches backward from a breakpoint. In long agent turns, add an intermediate breakpoint roughly every fifteen blocks so the next request still finds the prior cache, keeping in mind that a request can carry at most four breakpoints.
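A rough sketch of that spacing follows. It assumes every message's content is a list of block dictionaries, that three breakpoints are available for messages (leaving one for the tools-plus-system prefix), and that every block type present can carry a cache_control marker; `place_breakpoints` is a hypothetical helper, not an SDK function.

```python
import copy

def place_breakpoints(messages, every=15, max_breakpoints=3):
    """Return a copy of messages with cache_control set on selected blocks.

    Marks the final content block, then one block every `every` blocks
    before it, until `max_breakpoints` markers have been placed.
    """
    messages = copy.deepcopy(messages)  # leave the caller's list untouched
    blocks = [block for msg in messages for block in msg["content"]]
    for block in blocks:
        block.pop("cache_control", None)  # clear markers from earlier requests
    index = len(blocks) - 1
    placed = 0
    while index >= 0 and placed < max_breakpoints:
        blocks[index]["cache_control"] = {"type": "ephemeral"}
        placed += 1
        index -= every
    return messages

# Demo: 41 single-block messages with alternating roles.
messages = [
    {"role": "user" if i % 2 == 0 else "assistant",
     "content": [{"type": "text", "text": f"block {i}"}]}
    for i in range(41)
]
marked = [
    i for i, msg in enumerate(place_breakpoints(messages))
    if "cache_control" in msg["content"][0]
]
print(marked)  # [10, 25, 40]
```

Counting backward from the end keeps the newest breakpoint on the final block, where the next request's lookup starts, and spaces the older ones so each sits within the window of the one after it.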

How does idempotency relate to caching specifically?

Caching lowers the cost of replaying a prefix, which makes retries and forks routine. A non-idempotent tool that runs during a replayed turn performs its side effect twice. Idempotency keys plus handler-side deduplication make those caching-encouraged replays safe without changing the cached prefix.

Bringing agentic AI to your phone lines

CallSphere wires tools and MCP servers into voice and chat agents the same careful way — assistants that answer every call and message, call external systems mid-conversation, and book work 24/7 while keeping the cached prefix intact. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
