---
title: "Cut Claude agent token cost: caching, batching, speed (Extending Claude Skills MCP)"
description: "Keep Claude agents fast and cheap on Skills and MCP with prompt caching, batching, context discipline, model routing, and cost-per-run metrics."
canonical: https://callsphere.ai/blog/cut-claude-agent-token-cost-caching-batching-speed-extending-claude-sk
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "token cost", "mcp", "performance", "batching"]
author: "CallSphere Team"
published: 2026-02-18T11:23:11.000Z
updated: 2026-06-06T21:47:44.780Z
---

# Cut Claude agent token cost: caching, batching, speed (Extending Claude Skills MCP)

> Keep Claude agents fast and cheap on Skills and MCP with prompt caching, batching, context discipline, model routing, and cost-per-run metrics.

An agent that works is not the same as an agent you can afford to run. The first time you put a Claude agent into production with a handful of MCP servers and a few Skills, the demo feels magical. Then the bill arrives. Every turn re-sends the full system prompt, the tool definitions, the skill instructions, and the growing transcript — and a multi-step run can re-process the same tokens a dozen times. Performance work for agentic systems is mostly about not paying for the same context over and over, and not spawning more work than the task needs.

This post walks through the levers that actually move token cost and latency: prompt caching, batching, context discipline, model selection, and measuring cost per run so you can tell whether a change helped. None of it requires a different model; it requires using the one you have deliberately.

## Where the tokens actually go in an agent run

Start by understanding the shape of the spend. In an agentic loop, the input grows every turn. Turn one sends the system prompt plus tool schemas. Turn two re-sends all of that plus the first tool call and its result. By turn ten, the bulk of your input tokens are old context you've already paid to process several times. Output tokens — the model's reasoning and tool calls — are usually a minority of the cost in tool-heavy workflows.

This matters because it tells you where to optimize. Shaving the system prompt by ten percent helps a little; making sure that prompt is cached so you pay full price for it once instead of ten times helps enormously. The single highest-leverage performance technique for Claude agents is prompt caching, and most teams under-use it.

A working definition: prompt caching is a mechanism that stores the processed representation of a stable prefix of your prompt so that subsequent requests reusing that exact prefix are billed at a steep discount and served faster. The catch is the word "exact" — caching keys on a byte-identical prefix, so anything that changes early in the prompt invalidates everything after it.

## Designing the prompt for cache hits

Order your context from most stable to most volatile. Put the system prompt, tool definitions, and skill instructions first — they rarely change within a session — and mark the cache breakpoint after them. Put the conversation transcript and per-turn data last, where it's allowed to vary. If you interpolate a timestamp or a request ID into the top of your system prompt, you have just guaranteed a cache miss on every single turn. Move volatile values to the end.

```mermaid
flowchart TD
  A["New agent turn"] --> B["Assemble prompt: stable prefix first"]
  B --> C{"Prefix byte-identical to cached?"}
  C -->|Yes| D["Cache hit: cheap & fast input"]
  C -->|No| E["Cache miss: full price, re-warm cache"]
  D --> F["Claude reasons & calls tool"]
  E --> F
  F --> G{"Run complete?"}
  G -->|No| A
  G -->|Yes| H["Log tokens & cost per run"]
```

Skills make this easier than it looks. Because a skill is loaded only when relevant, you can keep the always-on system prompt lean and let task-specific instructions arrive just-in-time. But be aware: loading a new skill mid-run changes the prompt and can move your cache breakpoint. Group related work so that the skill loads early and the cached prefix stays stable for the rest of that task.

## Batching: do more per call, spawn fewer agents

Batching operates at two levels. At the tool level, prefer MCP tools that accept and return collections. An agent that calls `get_record` fifty times pays fifty round trips of overhead and fifty turns of growing context; an agent that calls `get_records` once with fifty IDs pays one. When you design or choose MCP servers, favor bulk endpoints; they cut both latency and token churn dramatically.

At the workflow level, batching means resisting the urge to spawn subagents for everything. Multi-agent systems are powerful, but a multi-agent run typically uses several times more tokens than a single-agent run because each subagent carries its own context and the orchestrator pays to summarize their outputs. Spawn parallel subagents when the subtasks are genuinely independent and the latency win justifies the token premium — and keep a single agent for linear work.

For offline, non-interactive jobs — bulk classification, enrichment, nightly summaries — use asynchronous batch processing rather than the real-time API. Throughput jobs that can tolerate latency are far cheaper run as a batch, and there is no user waiting on the other end.

## Context discipline keeps runs from bloating

Long runs rot. As the transcript grows, every turn costs more and the model has more to wade through. Compaction helps: periodically replace the verbose middle of a transcript with a concise summary of what was decided and what state exists, keeping the recent turns verbatim. The Claude Agent SDK supports this pattern, and it both lowers cost and improves quality by reducing distraction.

Be ruthless about what enters the context in the first place. If an MCP tool returns a 200-kilobyte JSON blob and the agent needs three fields, transform the response at the server or in a wrapper before it reaches the model. Feeding raw, oversized tool results into the context is one of the most common and most avoidable sources of token waste in agentic systems.

## Right-sizing the model for each step

Not every step needs the most capable model. The Claude 4.x family spans Opus 4.8, Sonnet 4.6, and Haiku 4.5, and a well-built agent can route by difficulty. Use a smaller, faster model for routing, classification, and simple extraction, and reserve the strongest model for the hard reasoning steps. A mixed-model agent often costs a fraction of an all-Opus agent with no quality loss on the easy steps, because the easy steps never needed the heavy model.

## Measuring cost per run so changes are real

You cannot optimize what you don't measure. Log, for every run, the input tokens, the cached versus uncached split, the output tokens, the number of tool calls, and the wall-clock latency. Track these as a distribution over many runs, not a single example. Then every optimization becomes testable: did the cache hit rate go up, did tokens per run go down, did latency improve? Treat cost per run as a first-class metric next to accuracy, and you will catch regressions — like a new field that broke the cache prefix — before they reach the invoice.

## Frequently asked questions

### Why is my prompt cache not hitting even though the system prompt is the same?

Something volatile is sneaking into the prefix. The most common culprits are an injected timestamp, a per-request ID, a reordered tool list, or a skill that loads at a different point between runs. Caching requires a byte-identical prefix, so move every changing value after the cache breakpoint.

### When is multi-agent worth the extra tokens?

When subtasks are truly independent and can run in parallel, and the latency reduction or quality gain outweighs the several-times token premium. For sequential, dependent work, a single agent with good context management is cheaper and usually just as accurate.

### Should I use batch processing or the real-time API?

Use batch processing for offline, latency-tolerant jobs like bulk enrichment or nightly summarization, where no user is waiting. Use the real-time API for interactive agents. The batch path is meaningfully cheaper for throughput-style work.

### What's the single biggest token win for most agents?

Prompt caching done correctly, followed closely by trimming oversized tool responses before they hit the context. Together they often cut input token cost substantially without touching the agent's logic at all.

## Fast, affordable agents on the phone

Latency and cost are not abstractions when a caller is waiting for an answer. CallSphere applies these same caching, batching, and context-discipline patterns to **voice and chat** agents that respond in real time and stay cheap to run at scale. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cut-claude-agent-token-cost-caching-batching-speed-extending-claude-sk
