---
title: "Cutting token cost in Claude agents: caching & batching (Agents Reach Production With MCP)"
description: "Keep Claude MCP agents fast and cheap with prompt caching, batching, context pruning, and model routing — the levers that actually move your token bill."
canonical: https://callsphere.ai/blog/cutting-token-cost-in-claude-agents-caching-batching-agents-reach-prod
category: "Agentic AI"
tags: ["agentic ai", "claude", "mcp", "prompt caching", "token cost", "performance"]
author: "CallSphere Team"
published: 2026-04-22T11:23:11.000Z
updated: 2026-06-06T21:47:43.280Z
---

# Cutting token cost in Claude agents: caching & batching (Agents Reach Production With MCP)

> Keep Claude MCP agents fast and cheap with prompt caching, batching, context pruning, and model routing — the levers that actually move your token bill.

An agent that works is the easy milestone. An agent that works *cheaply* is the one that survives a finance review. The moment a Claude agent starts looping through MCP tools against real systems, every turn re-sends a growing context: the system prompt, the full tool catalog, the accumulating transcript of calls and results. Left unmanaged, a single multi-step task can re-bill the same tokens a dozen times. The good news is that agentic workloads are unusually amenable to optimization, because so much of what they send is repetitive and predictable.

This post is about the levers that actually move the bill — prompt caching, batching, context discipline, and model routing — and the order in which to pull them.

## Where the tokens actually go

Before optimizing, measure. In a typical tool-using loop, the dominant cost is not the user's question or Claude's final answer; it is the *input* tokens re-sent on every turn. A ten-step task with a large system prompt and twenty tool definitions can pay for that fixed preamble ten times over. Add the transcript, which grows with each tool result appended, and input tokens often outweigh output tokens by a wide margin. The first instinct of "make the model write less" is therefore usually the wrong one — the savings live on the input side.

Instrument per-run token accounting that separates cached input, fresh input, and output. Once you can see that 80% of your spend is re-sent preamble, the optimization strategy becomes obvious, and you stop wasting effort trimming the parts that barely matter.

## Prompt caching: the single biggest win

Prompt caching lets you mark a stable prefix of the context — your system prompt, tool definitions, long reference material — so that on subsequent calls Claude reads it from cache rather than reprocessing it. Cached input tokens are billed at a steep discount compared to fresh ones, which is exactly what an agent loop needs, since that prefix is identical on every turn. Definitionally, **prompt caching is a mechanism that stores a reusable prefix of the model's input so repeated requests sharing that prefix skip most of the cost and latency of reprocessing it.**

```mermaid
flowchart TD
  A["Agent turn starts"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Read prefix from cache (cheap)"]
  B -->|No| D["Process prefix fresh (full price)"]
  D --> E["Write prefix to cache"]
  C --> F["Append new turn (tool result)"]
  E --> F
  F --> G["Model decides next tool"]
  G --> H{"Run complete?"}
  H -->|No| A
  H -->|Yes| I["Return answer + log token split"]
```

To benefit, order your context deliberately: put the truly stable material first — system prompt, tool schemas — and the volatile material (the latest tool result, the user's newest message) last. The cache matches on a prefix, so a single early change invalidates everything after it. A common mistake is interpolating a timestamp or a per-request ID near the top of the prompt; that silently busts the cache on every call and you pay full price while believing you are caching. Keep the volatile parts at the tail and your hit rate stays high across a long agent loop.

## Batching what does not need to be live

Not every Claude call in an agentic system is interactive. Overnight enrichment, bulk classification of yesterday's transcripts, generating summaries for a thousand records — these are throughput problems, not latency problems. For workloads that can tolerate a delay, batch processing trades immediacy for a substantial per-token discount. The pattern is to separate your traffic into two lanes: the synchronous lane that a user is waiting on, and the asynchronous lane that a queue is waiting on. Route everything that can wait into the batch lane and you reclaim a large fraction of spend that interactive pricing was costing you for no benefit.

Within the live lane, you can still batch at the tool level. If the agent needs to look up five records, a tool that accepts an array and returns all five in one MCP round trip beats five sequential calls — fewer turns, less re-sent context, lower latency. Design MCP tools to accept and return collections where it makes sense, so the agent naturally consolidates work instead of fanning out one item at a time.

## Context discipline: prune the transcript

The transcript is a tax that compounds. By the tenth tool call, the agent is re-reading nine prior tool results it may no longer need. Aggressive context management keeps runs cheap. One effective pattern is to summarize-and-compact: after a chunk of exploratory tool calls, replace the raw results with a short distilled summary of what was learned, dropping the verbose intermediate payloads. The agent keeps the conclusions and sheds the bulk.

Be ruthless about tool-result size, too. If an MCP tool can return a 50-field record but the agent only needs three fields, have the tool project down to those fields. Every byte a tool returns becomes input tokens on every subsequent turn for the rest of the run. Trimming tool output at the source is one of the highest-leverage, lowest-effort optimizations available, and it improves accuracy as a side effect by reducing the noise the model must sift.

## Routing to the right model

Not every step needs your most capable model. A practical agent architecture routes by difficulty: use a fast, inexpensive model like Claude Haiku for routine classification, extraction, and routing decisions, and reserve a flagship model like Opus for the genuinely hard reasoning steps. In an orchestrator–subagent design, the orchestrator's planning may warrant the strong model while the subagents executing well-scoped subtasks run on a cheaper one. The cost difference between tiers is large enough that thoughtful routing often matters more than any single prompt tweak.

The discipline is to make model choice a per-step decision rather than a global default. Profile which steps actually need the headroom, downgrade the rest, and watch the quality gates (covered separately) to confirm you have not traded away accuracy for the savings. Done well, you keep flagship quality where it counts and pay Haiku prices everywhere else.

## Putting it together without overfitting

Pull these levers in order of leverage: cache the stable prefix first, prune tool outputs and transcripts second, route models third, and batch the asynchronous lane fourth. Re-measure after each change, because the bottleneck shifts as you fix things — once caching lands, transcript growth becomes the new top line. Avoid premature micro-optimization of prompts; a tighter wording saves a few tokens, while a cache hit on a large preamble saves thousands. Optimize where the mass is.

## Frequently asked questions

### Does prompt caching change Claude's outputs?

No. Caching only affects how the input prefix is processed and billed; the model sees the identical context and produces the same kind of output. It is a cost-and-latency optimization, not a behavioral one.

### Why is my cache hit rate low even though I enabled caching?

Almost always because something volatile sits near the top of the prompt — a timestamp, request ID, or per-call variable — that invalidates the prefix. Move all dynamic content to the tail of the context so the stable prefix stays byte-identical across turns.

### When should I use batch processing instead of live calls?

Whenever no user is actively waiting on the result — overnight enrichment, bulk classification, report generation. Batch lanes offer a significant per-token discount in exchange for delayed delivery, so route any latency-tolerant work there.

### How much can context pruning actually save?

It depends on run length, but in long tool-using loops the re-sent transcript often dominates input cost. Summarizing intermediate results and projecting tool outputs to only the needed fields can cut input tokens substantially while also improving accuracy.

## Bringing agentic AI to your phone lines

CallSphere runs these same cost disciplines — caching, batching, and lean context — behind **voice and chat** agents that handle every call and message, call tools mid-conversation, and book work 24/7 without burning a fortune in tokens. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cutting-token-cost-in-claude-agents-caching-batching-agents-reach-prod
