Cutting Claude Code Token Cost: Caching & Batching Wins
Keep Opus 4.8 agent runs cheap and fast — prompt caching, batching tool calls, and context pruning from a Built-with-Opus hackathon.
On the second day of the Built-with-Opus hackathon, a team opened their billing dashboard and went quiet. Their agent worked beautifully — and a single full run cost more than they expected, because every turn re-sent the same 40,000 tokens of context to the model. Nothing was broken. The agent was just expensive by construction. By the time the weekend ended, they had cut the cost of an identical run by a large margin without changing what the agent could do. This post is the set of moves that got them there, and what every team learned about keeping Claude Code runs cheap and fast.
The headline insight: agent cost is dominated by repeated input tokens, not output. An agent loop re-sends the system prompt, the tool definitions, and the growing conversation on every single turn. A ten-turn task can re-bill the same instructions ten times. Once you see cost as "input tokens times turns" rather than "answer length," the optimizations become obvious.
Where the tokens actually go
Before optimizing anything, measure. The hackathon teams that saved the most started by breaking a run into three buckets: the static prefix (system prompt, tool schemas, skill instructions), the accumulating conversation (every prior tool call and result), and the fresh generation each turn. For a typical run, the static prefix and accumulated history dwarfed the new output. That is the target.
Token cost is the sum, across every turn, of the input tokens sent plus the output tokens generated, priced per model. The practical lever is the input side, because it grows with both context size and turn count. A long-running agent with a fat unchanging prefix is paying for that prefix again and again — which is exactly the case prompt caching was built to fix.
Prompt caching: pay for the prefix once
Prompt caching lets you mark a stable prefix of your request — system prompt, tool definitions, long reference material — so that on subsequent calls the model reuses the cached computation at a steep discount instead of reprocessing it from scratch. For an agent that sends the same 40,000-token prefix on every turn, this is the single highest-leverage change available, and it requires no change to agent logic.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Agent turn N"] --> B{"Prefix unchanged from last call?"}
B -->|Yes| C["Cache hit: prefix billed at discount"]
B -->|No| D["Cache miss: full prefix reprocessed"]
C --> E["Only new tokens billed full price"]
D --> F["Write new cache entry"]
F --> E
E --> G["Lower cost per turn, faster latency"]The diagram shows why prefix stability matters so much. A cache hit only happens when the cached portion is byte-identical to the previous call. The hackathon mistake we saw repeatedly: teams put a timestamp or a per-turn counter near the top of the system prompt, which invalidated the cache every single turn and silently killed the savings. Move volatile content to the end of the prompt and keep the prefix frozen. Order your request as: cached static instructions first, then the changing conversation, then the fresh user turn. That ordering is the whole game.
Batching: do more per turn, loop less
The second big lever is reducing turn count. Every turn is a full round trip with full input billing, so an agent that calls one tool, waits, calls another, waits, is paying a tax on each hop. Where the work is independent, batch it. Claude on Opus 4.8 can request multiple tool calls in a single turn, and a well-designed harness runs them in parallel. A repo-analysis agent that reads eight files one per turn becomes an agent that requests all eight reads at once — eight times fewer round trips for the same information.
Batching also applies at the harness level for fan-out work. If your task spawns several subagents to investigate different parts of a codebase, launch them in one batch rather than serially. The token cost of the subagents is unchanged, but wall-clock time collapses and you avoid an orchestrator turn between each spawn. The rule the teams settled on: if two actions do not depend on each other's output, they should happen in the same turn.
Pruning context before it bloats the bill
The third lever is keeping the conversation from growing unbounded. Tool results are the worst offenders — a single file read or API response can dump thousands of tokens into context that then ride along, re-billed, for the rest of the run. Most of that content is irrelevant after the turn that used it.
Two pruning tactics worked well. First, summarize-and-drop: after a large tool result has been used, replace it in the running context with a short summary of what mattered. Second, fetch narrowly: instead of reading a whole file, read the specific range or use a search that returns only matching lines. The discipline is to never pull more tokens into context than the current step actually needs. On a long agent, aggressive context hygiene mattered as much as caching, because it slows the growth of the per-turn input that everything else is multiplied by.
Right-sizing the model to the step
Not every step needs the most capable model. Several teams used a tiered approach: Opus 4.8 for the planning and hard-reasoning turns, and a faster, cheaper model like Haiku 4.5 for mechanical sub-steps such as formatting, extraction, or simple classification. The orchestrator stays on Opus to keep judgment quality high, while routine subtasks run on a lighter model. This is not about cutting corners on quality; it is about matching capability to difficulty so you do not pay flagship rates to reformat a list.
The trap is over-tiering: if you push a step that needs real reasoning onto a small model, it fails, retries, and the retries cost more than doing it right once. Profile which steps are genuinely mechanical before downgrading them. The teams that got this right treated model choice as a per-step decision, not a global setting.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Measure, then optimize — in that order
The thread running through all of this: instrument cost per run before touching anything. Log input tokens, output tokens, cache hit rate, and turn count for every run. With that dashboard, the wins announce themselves — a zero percent cache hit rate screams "your prefix is unstable," and a high turn count screams "batch your tool calls." The hackathon teams that optimized by intuition often made runs slower; the teams that optimized by measurement consistently cut cost without losing a single capability.
Frequently asked questions
What is prompt caching and when should I use it?
Prompt caching marks a stable prefix of your request — system prompt, tool definitions, reference text — so the model reuses cached computation at a discount on later calls instead of reprocessing it. Use it for any agent that re-sends a large unchanging prefix across many turns; it is the highest-leverage cost cut available and requires no logic change.
Why is my agent so expensive even with short answers?
Because cost is dominated by input tokens multiplied by turn count, not output length. Each turn re-sends the system prompt, tool schemas, and the entire growing conversation. Reduce the static prefix with caching, prune accumulated context, and cut turns by batching tool calls.
How does batching tool calls save money?
Each agent turn carries full input billing, so fewer turns means less repeated input cost. When tool calls are independent, request them in a single turn and run them in parallel. A task that read eight files across eight turns can read them in one, cutting round trips and latency dramatically.
Should I use a smaller model to save cost?
For genuinely mechanical sub-steps — formatting, extraction, simple classification — a lighter model like Haiku 4.5 saves money with no quality loss. Keep Opus 4.8 for planning and hard reasoning. The risk is downgrading a step that needs real reasoning, where failed retries cost more than doing it right once.
Bringing agentic AI to your phone lines
Cheap, fast runs are not optional when an agent talks to customers in real time. CallSphere brings these same efficiency patterns — caching, batching, and tight context — to voice and chat assistants that answer every call, call tools mid-conversation, and stay responsive at scale. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.