Skip to content
Agentic AI
Agentic AI7 min read0 views

Cut Claude Cowork Token Costs: Caching, Batching, Cheap Runs

Make Claude Cowork agents cheap and fast with prompt caching, batching, model right-sizing, and lean context discipline that slashes token spend.

Agentic workflows are gloriously capable and quietly expensive. A single Claude Cowork run can fire off dozens of tool calls, re-read the same documents on every turn, and re-send a multi-thousand-token system prompt over and over. None of that is wasted intelligence — but a lot of it is wasted tokens. When you move from a demo to something that runs hundreds of times a day, the bill and the latency both become design constraints. This post is about keeping agentic runs cheap and fast without dumbing them down.

Start with the right mental model. In an agentic run, cost scales with the total tokens that flow through the model across every turn, not with the number of tasks. Because each turn re-sends the growing conversation as input, long multi-turn runs spend most of their tokens re-reading context the model has already seen. That single fact — input tokens dominate, and they compound across turns — points at almost every optimization worth making.

Prompt caching: stop paying for the same prefix twice

The highest-leverage win is prompt caching. Most agentic runs share a large, stable prefix on every turn: the system prompt, the tool definitions, and any reference documents. Without caching you pay full input price to re-process that identical prefix on turn after turn. With caching, that prefix is processed once and reused, and the repeated portion is billed at a steep discount.

To benefit, you must keep the cacheable part stable and put it first. Order your context as static-then-dynamic: durable system instructions and tool schemas at the top, then the volatile, per-turn user content at the bottom. If you sprinkle a changing timestamp or a freshly shuffled list near the top, you invalidate the cache for everything after it and pay full freight again. Caching rewards discipline about what is constant and what changes.

Right-size the model for each step

Not every step of a workflow needs the most capable model. The Claude 4.x family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 as the balanced workhorse, and Haiku 4.5 for fast, cheap, high-volume steps. A common and costly mistake is running an entire workflow on the biggest model when most of its steps are routing, extraction, or formatting that a smaller model handles perfectly.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

The flow below shows how to route each step by difficulty so you spend premium tokens only where they earn their keep.

flowchart TD
  A["Incoming task"] --> B{"Step type?"}
  B -->|Classify / extract| C["Haiku 4.5 — fast & cheap"]
  B -->|Standard reasoning| D["Sonnet 4.6 — balanced"]
  B -->|Hard multi-step| E["Opus 4.8 — most capable"]
  C --> F{"Confidence high?"}
  F -->|No| D
  F -->|Yes| G["Return result"]
  D --> G
  E --> G

A useful pattern is escalation: try the cheap model first, and only fall through to a larger one when the small model signals low confidence or fails validation. Many tasks resolve at the cheap tier, so you pay for the expensive tier only on the genuinely hard minority. This routing logic lives in your orchestration code, not in the prompt, which keeps it testable.

Batch independent work instead of looping serially

When a workflow processes many similar items — classify fifty support tickets, summarize forty documents — resist the instinct to feed them through one long conversational loop. A loop re-sends the whole accumulating context for every item, so the hundredth item carries the weight of the previous ninety-nine. That is the compounding-input problem at its worst.

Instead, treat independent items as independent requests. Each gets the cached shared prefix plus only its own small payload, and nothing accumulates. For high-volume offline work where you don't need answers immediately, batch processing trades latency for a meaningful per-token discount, which is ideal for nightly enrichment or backfills. The rule of thumb: if two items don't need to see each other's results, never put them in the same conversation.

Keep context lean across turns

Long agentic runs accumulate junk — verbose tool outputs, dead ends, intermediate scratch work — and every byte of it gets re-sent on the next turn. Two habits keep this under control. First, have tools return compact, structured results instead of raw dumps; a connector that returns a 200-line JSON blob when the agent needs three fields is paying to re-read 197 useless lines every subsequent turn. Second, summarize and prune. When a sub-task finishes, collapse its sprawling transcript into a short result the rest of the run can carry forward cheaply.

Multi-agent designs deserve special caution here. Spawning several sub-agents multiplies token usage severalfold compared to a single agent, because each carries its own context. That can be entirely worth it for genuinely parallel, hard problems — but reach for multiple agents deliberately, not reflexively. Many tasks that look like they want a team of agents are better and cheaper as one focused agent with good tools.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Measure before you optimize

You cannot tune what you don't measure. Log input and output token counts per run and per step, and watch which steps dominate. Most workflows have one or two hot spots — a giant document re-read on every turn, a verbose tool, a needlessly large model on a trivial step — that account for most of the cost. Fixing those handful of hot spots usually beats micro-optimizing everything else. Optimize the bill you actually have, not the one you imagine.

Frequently asked questions

What is prompt caching and when does it help?

Prompt caching reuses the processed form of a stable prompt prefix — system instructions, tool definitions, reference documents — so you don't pay full input price to re-process identical content on every turn. It helps most in multi-turn agentic runs where a large prefix repeats, provided you keep that prefix unchanged and placed first.

Should I use Opus, Sonnet, or Haiku for my agent?

Match the model to the step, not the whole workflow. Use Haiku 4.5 for high-volume extraction and routing, Sonnet 4.6 for standard reasoning, and Opus 4.8 for the hardest multi-step problems. Escalating from cheap to capable only when needed keeps quality high and cost low.

Why does my multi-turn run get more expensive each turn?

Because each turn re-sends the entire growing conversation as input, so later turns carry the weight of all earlier ones. Keep context lean by returning compact tool results, pruning finished sub-tasks, and avoiding processing many independent items inside one long conversation.

Is a multi-agent setup more expensive than a single agent?

Generally yes — running several sub-agents typically uses several times the tokens of a single agent because each maintains its own context. Use multiple agents when the problem is genuinely parallel and hard enough to justify the spend, not as a default.

Bringing efficient agents to the phone

The same cost discipline — caching stable prompts, right-sizing models, and trimming context — is what makes real-time voice agents both fast and affordable. CallSphere applies these agentic patterns to voice and chat, with assistants that answer every call, call tools mid-conversation, and book work 24/7. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.