Cutting Token Cost in Claude Agents: Caching & Batching (Building Agents With Skills)
Keep Claude agents fast and cheap with prompt caching, batching, smart model routing, and context discipline — without sacrificing answer quality.
The first invoice is where agent enthusiasm meets reality. An agent that felt almost free in development — a few dollars of testing across a week — can cost real money the moment hundreds of users hit it daily, especially if it loads heavy context and runs several tool turns per task. The good news is that agent cost is overwhelmingly a function of how you manage tokens, and tokens are something you can engineer. This post covers the levers that actually move the needle when you build agents with Claude Agent Skills: caching, batching, model routing, and ruthless context discipline.
Start with the right mental model. Every run costs you input tokens (everything the model reads — system prompt, loaded skills, tool results, conversation history) plus output tokens (what it writes). For agents, the input side usually dominates, because each tool turn re-sends the growing transcript back to the model. A ten-turn agent does not pay for one prompt; it pays for the prompt growing larger on every single turn. That compounding is where the money goes, and it is where your optimizations should focus.
Prompt caching is the biggest single win
Prompt caching lets you mark a stable prefix of your input — your system prompt, tool definitions, loaded skill instructions, any fixed reference material — so that on subsequent calls the model reuses the cached version at a steep discount instead of reprocessing it from scratch. For agents this is transformative, because that prefix is identical across every turn of a run and often across every run of a session. You are paying full price to process your 4,000-token system-and-skills block once, then a fraction of that on every turn afterward.
The discipline that makes caching pay off is ordering. Put everything stable at the front and everything volatile at the back. System prompt, tool schemas, and skill content go first; the live conversation and fresh tool results go last. If you interleave a changing timestamp or a per-request variable into the middle of your prefix, you invalidate the cache below it and lose the discount. Treat the cached prefix as sacred and immutable for the life of the run.
flowchart TD
A["Incoming task"] --> B{"Stable prefix cached?"}
B -->|Yes| C["Reuse cache: pay fraction for prefix"]
B -->|No| D["Process full prefix, write cache"]
C --> E{"Simple or complex task?"}
D --> E
E -->|Simple| F["Route to Haiku"]
E -->|Complex| G["Route to Sonnet/Opus"]
F --> H["Return result"]
G --> HRoute the model to the task, not the task to the model
A frequent and expensive mistake is running every step of an agent on your most capable model. In the Claude 4.x family, Opus is the most capable, Sonnet is the balanced workhorse, and Haiku is the fast, inexpensive option. Many agent steps — classifying intent, extracting a field, formatting a result, deciding which skill applies — do not need frontier reasoning. They need a quick, correct answer. Routing those steps to Haiku and reserving Opus for genuinely hard reasoning can cut cost by a large multiple while leaving quality on the hard parts untouched.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
A practical pattern is a cheap triage step in front of an expensive worker step. A small, fast model reads the task and decides how hard it is; only the genuinely difficult cases escalate. You pay frontier prices only when frontier capability is actually required, which on most real workloads is a minority of requests.
Batch the work that does not need to be live
Not every agent task is interactive. Overnight enrichment, bulk classification, summarizing a backlog of tickets — these have no human waiting on the other end. For that class of work, the Message Batches API processes large volumes asynchronously at a significant discount compared to real-time calls. The trade is latency: you submit a batch and collect results later rather than instantly. For anything that runs on a schedule rather than in front of a user, that is a trade worth making every time, and it is one of the easiest cost wins to capture because it requires no change to your prompts at all.
Context discipline: the cheapest token is the one you never send
The most overlooked lever is simply sending less. Skills help here by design — they load only when relevant, so you are not paying to carry every capability in the base prompt on every turn. Lean into that. Keep skill instructions tight and free of filler. Have tools return only the fields the model needs rather than dumping entire API responses into the context; a tool that returns a 5,000-token JSON blob when the model needed three fields is taxing every subsequent turn of the run.
For long-running agents, manage history actively. Once a run has produced a settled intermediate result, summarize the earlier turns into a compact note and drop the verbose originals from the working context. The model keeps what it needs to proceed without re-reading a transcript that grows without bound. Uncapped history is the quiet killer of long agent sessions, and a periodic compaction step keeps cost roughly flat instead of climbing with every turn.
Parallelize independent work, serialize dependent work
Latency and cost are not the same axis, but they often move together, and the cleanest way to improve both is to stop doing sequentially what could happen at once. When an agent needs three independent lookups — pull the customer record, fetch the order history, check inventory — there is no reason to wait for each to finish before starting the next. Issuing those tool calls in parallel collapses three round-trips into roughly one, which shortens the run and, just as importantly, shrinks the number of turns that each re-send the growing context. Fewer turns is fewer copies of your transcript billed.
The discipline is recognizing what is genuinely independent versus what has a real data dependency. If step two needs the output of step one, parallelizing is impossible and pretending otherwise produces hallucinated inputs. But a surprising amount of agent work is embarrassingly parallel once you look — gathering context before reasoning over it is the classic example. Structure skills so that the gather phase fans out and only the reasoning phase, which truly depends on all of it, runs in sequence.
Measure before and after every change
Cost optimization without measurement is guesswork. Log tokens per run, broken down by input, output, and cache reads, and watch the average over time. When you make a change — reorder the prefix for caching, route a step to Haiku, trim a tool's output — confirm the number actually dropped and that quality on your eval set held. Some optimizations look good on paper and quietly degrade answers; the only way to know is to measure both axes together. Cheaper and worse is not a win.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Build a simple cost dashboard early, before you think you need one. Track median and p95 tokens per task, cache hit rate, and the share of runs that escalate to your expensive model. Those three numbers tell you almost everything: a falling cache hit rate means your prefix is drifting, a climbing p95 means some tasks are looping or carrying bloated context, and a rising escalation share means your triage is sending too much to Opus. Each has a specific, known fix, and watching the trend catches regressions long before they show up as a surprising invoice at the end of the month.
Frequently asked questions
Does prompt caching change the model's answers?
No. Caching is purely a cost and latency optimization — it reuses already-processed tokens and returns identical results. The only behavioral requirement is keeping your cached prefix byte-for-byte stable across calls; a single changed character in the cached region forces reprocessing and erases the savings.
When should I use the Batch API instead of normal calls?
Whenever no human is waiting on the result. Scheduled jobs, bulk backfills, and overnight enrichment are ideal because they tolerate the delayed, asynchronous delivery in exchange for a real per-token discount. Keep anything user-facing and interactive on standard real-time calls.
How do I decide which model handles which step?
Match capability to difficulty. Route classification, extraction, and formatting to Haiku; reserve Sonnet and Opus for steps that genuinely need deep reasoning. A cheap triage step that scores task difficulty and escalates only the hard cases captures most of the savings without sacrificing quality where it matters.
What is the single biggest cost driver in agents?
Growing input context across turns. Each tool turn re-sends an expanding transcript, so unbounded history and bloated tool outputs compound fast. Prompt caching on the stable prefix plus active history compaction on the volatile tail together address the largest share of agent spend.
Bringing agentic AI to your phone lines
CallSphere runs these same efficiency patterns on voice and chat — multi-agent assistants that stay fast and affordable at call-center scale by caching context, routing models smartly, and trimming every wasted token. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.