Skip to content
Agentic AI
Agentic AI7 min read0 views

MCP Token Cost: Caching, Batching & Cheap Agent Runs

Cut Model Context Protocol agent cost on Claude with prompt caching, batched tool calls, context trimming, and right-sized models for fast, cheap runs.

An agent that works is only half the job. An agent that works and costs forty cents per run while taking ninety seconds is a demo; an agent that costs four cents and finishes in twelve seconds is a product. The gap between those two is almost entirely engineering — and most of it lives in how you handle context, caching, and tool round-trips when Claude talks to your Model Context Protocol (MCP) servers.

This post is about the unglamorous economics of agent runs: where the tokens actually go, which levers move cost and latency the most, and how to pull them without making the agent dumber.

Where the tokens actually go

Before optimizing, measure. In a typical MCP agent loop, tokens accumulate in three buckets: the system prompt and tool definitions (sent every turn), the growing conversation history (every prior tool call and result), and the model's own output. People assume output tokens dominate. They usually do not. The silent budget-killer is the conversation history — by step fifteen, every turn re-sends fourteen turns of tool results, and a multi-step agent can re-pay for the same context dozens of times across a single run.

Tool definitions are the second surprise. If your MCP server exposes forty tools with verbose schemas, those definitions ride along on every turn. A bloated catalog is a tax you pay on every single model call, whether or not any of those tools get used.

Prompt caching is the biggest lever

Anthropic's prompt caching lets you mark a stable prefix of your request — system prompt, tool definitions, long reference documents — so that on subsequent calls Claude reads it from cache at a large discount instead of reprocessing it at full price. For an agent that makes ten model calls in one run, caching the system prompt and tool schemas once and reusing them nine times is the difference between paying for that prefix ten times and paying for it roughly once.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Build request"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Read prefix from cache, cheap"]
  B -->|No| D["Process full prefix, write cache"]
  C --> E["Process only new turn tokens"]
  D --> E
  E --> F["Claude calls MCP tool"]
  F --> G["Append result, next turn reuses cache"]

The rule for caching is to order your context from most stable to least stable. Put the system prompt and tool definitions first (they never change within a run), then long static documents, then the volatile conversation tail last. Caching only helps for the unchanged prefix, so anything dynamic near the top poisons the cache and forces a reprocess. Structure the request deliberately and the discount applies on nearly every turn of a long agent loop.

Caching also cuts latency, not just cost, because the cached prefix does not need to be re-encoded. For interactive agents and voice systems where a half-second matters, that is often the bigger win.

Batch and parallelize tool calls

Claude can request multiple tool calls in a single turn when they are independent. If your agent needs the customer record, their recent orders, and their support tickets, and none depends on the others, do not march through three sequential turns — let the model fire all three in one turn and have your MCP layer execute them in parallel. You collapse three round-trips into one, which saves both the per-turn context re-send and the wall-clock time of three sequential model calls.

For workloads that are not latency-sensitive — overnight enrichment, bulk classification, document processing — Anthropic's Message Batches API processes large volumes asynchronously at a significant discount versus real-time calls. The pattern is simple: if a human is not waiting on the answer right now, batch it. Reserve your expensive real-time budget for the interactions where latency is the product.

Trim context aggressively

The cheapest token is the one you never send. As an MCP agent runs, prune what it carries. Tool results are the worst offenders: a single API call can return a 5,000-token JSON blob when the model only needed three fields. Have your MCP server (or a thin wrapper) project results down to what matters before they enter the context. Return the order status and total, not the full nested object with shipping metadata the agent will never read.

For long-running agents, summarize-and-compact: once history grows past a threshold, replace the oldest turns with a short model-written summary of what was learned and done so far. You trade a little fidelity for a lot of recurring savings, since that summary then costs a fraction of the raw turns on every remaining step. Claude Code and the Agent SDK support this kind of context compaction, and on multi-step runs it is often the difference between a run that fits the budget and one that does not.

Match the model to the step

Not every step needs your most capable model. In 2026 the Claude 4.x family spans Opus 4.8 (most capable), Sonnet 4.6, and Haiku 4.5, and the price-per-token gap between them is large. A well-designed agent routes work: Haiku for cheap, high-volume classification and routing; Sonnet for most reasoning and tool orchestration; Opus reserved for the genuinely hard planning steps where a mistake is expensive. A common pattern is a cheap model deciding which tool to call and a stronger one only invoked when the cheap model flags low confidence.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Be deliberate about multi-agent designs here. Orchestrator-plus-subagent systems are powerful, but they typically burn several times more tokens than a single agent because each subagent carries its own context and the orchestrator pays to coordinate them. Use multi-agent when the parallelism genuinely pays for itself — independent research threads, fan-out over many documents — not as a default architecture.

Frequently asked questions

What makes Claude MCP agents expensive?

Usually the conversation history re-sent on every turn and a bloated tool catalog, not output tokens. A ten-step agent can re-pay for the same context many times unless you cache the stable prefix and prune results.

How much can prompt caching save on an agent run?

For a multi-call run, caching the system prompt and tool definitions means you process that prefix roughly once instead of once per call, and it cuts latency too. The longer the run, the bigger the saving.

When should I use batching instead of real-time calls?

Whenever no human is waiting on the result — bulk enrichment, classification, document processing. The Message Batches API runs them asynchronously at a meaningful discount versus real-time requests.

Should I always use the most capable Claude model?

No. Route by difficulty: Haiku for high-volume simple steps, Sonnet for most orchestration, Opus only for the hardest planning. Matching model to step is one of the largest cost levers available.

Bringing agentic AI to your phone lines

Caching, batching, and lean context are exactly what make a real-time voice agent both fast enough to hold a conversation and cheap enough to run at scale. CallSphere applies these agentic-AI patterns to voice and chat — assistants that answer every call and message, call tools mid-conversation, and book work 24/7. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.