Skip to content
Agentic AI
Agentic AI8 min read0 views

Cutting Claude Token Cost: Caching, Batching, Fast Runs

Keep Claude finance agents cheap and fast with prompt caching, batching, context budgeting, and smart model tiering across Opus, Sonnet, and Haiku.

A finance team can fall in love with a Claude agent that turns a messy variance analysis into a clean board narrative — right up until the first monthly invoice. Agents are token-hungry by nature: every turn re-sends the system prompt, the tool definitions, the schema, and the entire growing conversation. Run that loop a few hundred times across a close cycle, sprinkle in a multi-agent setup, and a workflow that felt nearly free in a demo becomes a real line item. The good news is that most agent cost is waste, and waste is fixable without making the output worse.

This post is about keeping Claude finance agents cheap and fast — where the tokens actually go, and the specific levers (caching, batching, context budgeting, model selection) that cut cost by large factors without touching quality.

Where the tokens actually go

Before optimizing anything, you have to see the bill. In an agentic run, input tokens usually dwarf output tokens, and the reason is the loop. On turn one, Claude reads your system prompt, your tool schemas, and the user request. On turn two, it reads all of that again plus the first tool call and its result. By turn ten, you have re-sent the same fixed preamble ten times. For a finance agent whose system prompt embeds a chart of accounts, reporting calendar, and house style guide, that preamble can be thousands of tokens that you pay for on every single turn.

The mental model that fixes this: separate your context into stable and volatile. Stable context — instructions, tool definitions, reference schemas, your accounting policies — never changes within a run. Volatile context — the latest tool result, the user's follow-up — changes every turn. The entire cost game is about not paying full price for the stable part over and over. That is exactly what prompt caching is for.

Prompt caching: the biggest single lever

Prompt caching lets you mark the stable prefix of your prompt so Claude stores it and reuses it on later calls at a steep discount instead of re-charging full input price. Prompt caching is a technique where the model reuses a previously processed, unchanging prefix of the prompt so repeated agent turns avoid paying full input-token cost for the same context. For a finance agent that ships the same multi-thousand-token policy preamble on every turn, this is the difference between a workflow you ration and one you run freely.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

The discipline that makes caching work is ordering. Put everything stable at the very front — system instructions, tool definitions, the chart of accounts, the reporting calendar — and everything volatile after it. Caching matches on an exact prefix, so a single changed character early in the prompt invalidates the whole cache. A common self-inflicted wound is jamming a timestamp or a run ID into the top of the system prompt; that one dynamic token means you cache nothing. Keep the volatile bits — "today is the 3rd business day of close" — below the cached boundary.

flowchart TD
  A["Agent turn"] --> B{"Stable prefix cached?"}
  B -->|Hit| C["Reuse prefix at discount"]
  B -->|Miss| D["Process full prefix, store cache"]
  C --> E["Process volatile suffix only"]
  D --> E
  E --> F{"More turns?"}
  F -->|Yes| A
  F -->|No| G["Return narrative"]

The diagram shows why caching compounds: the first turn pays to build the cache, and every turn after it rides the discount for the entire stable prefix, processing only the new volatile suffix at full price. Over a long agent run, the savings are not marginal — they grow with the length of the conversation.

Batching the work that isn't conversational

Not every finance task is an interactive loop. Generating the standalone commentary for forty cost centers, or classifying a thousand journal-entry descriptions, is embarrassingly parallel — each item is independent. For this shape of work, batching is the right tool. Instead of firing a thousand synchronous requests and paying for low-latency processing you don't need, submit them as a batch and accept results asynchronously, typically at a meaningful discount over real-time calls.

The trade-off is latency: batch results come back later, not instantly. That is exactly the right trade for an overnight close job where nobody is staring at the screen, and exactly the wrong trade for the interactive narrative the CFO is editing live. The rule of thumb: if a human is waiting on the result, run it synchronously with caching; if it's bulk work that can finish by morning, batch it. Many teams split their pipeline so the bulk enrichment runs as an overnight batch and only the final, human-in-the-loop narrative runs interactively.

Budgeting context so it doesn't balloon

Caching makes the stable prefix cheap, but the volatile part still grows every turn, and a long agent run can quietly accumulate a huge transcript. Two habits keep it in check. First, don't dump raw data into context — summarize at the tool boundary. If a query returns ten thousand rows, your tool should return the aggregates and the top variances, not the raw table. The agent reasons over the summary; the raw data stays in the database where it belongs. Second, compact the conversation. Once the agent has extracted what it needs from turns one through eight, replace those turns with a short structured summary so you stop re-sending stale tool dumps on every subsequent turn.

This matters even with Claude's very large context windows. A 1M-token window means you can hold an enormous transcript, not that you should pay to re-process it on every turn. Context budgeting is a cost discipline, not just a capacity one.

Choosing the right model for each step

The most expensive mistake is running every step on your most powerful model. A finance pipeline has steps of wildly different difficulty: classifying a transaction is easy; reconciling a tricky intercompany variance and explaining it in prose is hard. Match the model to the step. Use Haiku 4.5 for high-volume, low-judgment work like categorization and extraction. Use Sonnet 4.6 for the bulk of the analysis. Reserve Opus 4.8 for the genuinely hard reasoning and the final narrative where quality is non-negotiable.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

This tiering pairs naturally with a multi-agent design, but mind the multiplier: multi-agent runs typically consume several times more tokens than a single agent because each subagent carries its own context. That's worth it when subtasks are truly parallel and independent; it's pure waste when a single well-prompted agent could have done the job. Reach for orchestration deliberately, not by default.

Frequently asked questions

How much can prompt caching realistically save?

It depends on how large your stable prefix is relative to the volatile part and how many turns each run takes. For finance agents with big policy preambles and long tool-use loops, the cached prefix dominates the bill, so the savings on input cost are substantial — often the single biggest lever available. The longer the conversation, the more caching pays off.

When should I batch instead of streaming live?

Batch whenever no human is waiting and the items are independent — overnight enrichment, bulk classification, generating per-segment commentary. Stream live only for interactive, human-in-the-loop work. Mixing the two in one pipeline, with bulk work batched and the final draft interactive, usually gives the best cost-to-experience ratio.

Does a 1M-token context window mean I can stop optimizing?

No. A large window removes a hard limit but not the per-turn cost of re-processing context. Even with room to spare, summarize at tool boundaries and compact the transcript so you're not paying to re-read stale data on every turn. Capacity and cost are different problems.

Is multi-agent always more expensive?

Generally yes — each subagent maintains its own context, so token usage multiplies. It's justified when work is genuinely parallel and a single agent would be slower or lower quality. For linear finance tasks, a single well-tiered agent is usually both cheaper and simpler.

Bringing fast, frugal agents to the phone

The same caching, batching, and model-tiering discipline that keeps a finance narrative cheap is what makes real-time voice affordable at scale. CallSphere applies these agentic-AI patterns to voice and chat — assistants that answer every call and message, use tools mid-conversation, and book work 24/7 without runaway cost. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.