---
title: "Cutting Claude Cowork token costs: caching and batching"
description: "Keep agentic Claude Cowork runs cheap and fast with prompt caching, batching, context trimming, and per-step model routing without sacrificing quality."
canonical: https://callsphere.ai/blog/cutting-claude-cowork-token-costs-caching-and-batching
category: "Agentic AI"
tags: ["agentic ai", "claude", "claude cowork", "prompt caching", "token cost", "performance", "batching"]
author: "CallSphere Team"
published: 2026-06-05T11:23:11.000Z
updated: 2026-06-06T00:48:34.356Z
---

# Cutting Claude Cowork token costs: caching and batching

> Keep agentic Claude Cowork runs cheap and fast with prompt caching, batching, context trimming, and per-step model routing without sacrificing quality.

Agentic AI has a billing problem that nobody warns you about until the invoice arrives. A single Claude Cowork task that touches five connectors and reasons across a dozen turns can consume an order of magnitude more tokens than a one-shot prompt, because every turn re-sends the accumulating context. Multiply that by a team running hundreds of tasks a day and "cheap automation" turns into a line item someone in finance is asking about. The good news: most of that cost is avoidable with a handful of disciplined techniques.

This post is about making agentic runs cheap and fast without dumbing them down. The two are linked — the same context bloat that drives cost also drives latency, because the model has to read everything you send before it can respond. Trim the tokens and you usually get a faster, cheaper, *and* more focused agent at the same time.

## Where the tokens actually go

Before optimizing, measure. In a typical agentic run the spend concentrates in three places: the system prompt and tool definitions re-sent on every turn, the growing transcript of prior tool calls and their results, and oversized tool outputs that dump raw payloads into context. A single connector that returns a 6,000-token JSON document, called four times, has just cost you 24,000 input tokens of pure noise.

The instinct to "give the agent everything just in case" is the single biggest cost driver. Context is not free working memory; you pay for it on every turn it stays resident. The discipline is to treat the context window like a budget and ask, for each thing you include, whether the model needs it *this* turn.

## Prompt caching: stop paying for the same prefix

Prompt caching is the highest-leverage lever available. The idea is simple: the stable prefix of your prompt — system instructions, tool definitions, long reference documents — is cached after the first call, and subsequent calls that reuse that exact prefix are billed at a steep discount for the cached portion. In an agentic loop, where the system prompt and tools are identical across every turn of a task, this turns a recurring cost into a one-time one.

```mermaid
flowchart TD
  A["Task starts"] --> B["Stable prefix: system + tools + docs"]
  B --> C{"Prefix cached?"}
  C -->|First call| D["Write to cache, full price"]
  C -->|Later turns| E["Read from cache, discounted"]
  D --> F["Append turn-specific context"]
  E --> F
  F --> G["Model responds"]
  G --> H{"More turns?"}
  H -->|Yes| C
  H -->|No| I["Task done"]
```

To benefit, you must keep the cached prefix byte-stable and put it at the very front. Anything that changes per turn — the latest observation, the current sub-goal — goes *after* the cached block. A common mistake is injecting a timestamp or a per-turn counter into the system prompt, which invalidates the cache on every call and silently throws the savings away. Order your prompt so the immutable material leads and the volatile material trails.

## Batching: amortize the overhead

When you have many independent items to process — classify 500 support tickets, extract fields from 200 documents — running them as separate live agent calls is the most expensive possible approach. Batching helps in two distinct ways. For truly independent items with no urgency, asynchronous batch processing trades latency for a large per-token discount, which is ideal for overnight enrichment jobs. For items that share context, grouping several into one prompt lets them ride on the same cached prefix instead of re-establishing it each time.

Be deliberate about which kind of batching fits. Interactive Cowork tasks where a human waits for the answer should stay synchronous and lean on caching. Bulk back-office work that can finish within hours rather than seconds is where asynchronous batch pricing earns its keep. Mixing them up — running urgent work through a slow batch queue, or pushing a giant overnight job through expensive live calls — is a common and costly mistake.

## Right-sizing the model per step

Not every step in an agentic run needs the most capable model. The 2026 Claude family spans Opus 4.8, Sonnet 4.6, and Haiku 4.5, and a well-engineered agent routes work to the cheapest model that can do each job reliably. Use a smaller, faster model for mechanical steps — classifying intent, extracting a field, deciding which tool to call — and reserve the most capable model for the genuinely hard reasoning, like synthesizing a final recommendation from conflicting sources.

This routing can cut cost dramatically because the cheap steps are usually the frequent ones. The pitfall is over-downgrading: if a small model makes the wrong tool choice, it can trigger a loop that costs far more than the model you saved on. Measure quality at each downgrade and keep the routing where accuracy holds.

## Trimming context and tool outputs

The cheapest token is the one you never send. Three habits compound: summarize stale parts of a long transcript instead of carrying every raw turn forward; have connectors return only the fields the task needs rather than full records; and scope each sub-agent to a narrow slice of context rather than the whole task history. A sub-agent that only needs to format an email should not be carrying the entire research transcript that produced its inputs.

Tool output shaping deserves special attention because it is invisible until you profile. Wrap noisy connectors so they project down to the relevant fields before the result hits the model's context. This single change often cuts both cost and loop rate, since a lean, legible result is also easier for the model to act on correctly.

## Frequently asked questions

### What is prompt caching and when does it help?

Prompt caching stores the stable prefix of your prompt — system instructions, tool definitions, reference documents — so reused prefixes are billed at a discount on later calls. It helps most in agentic loops where the same system prompt and tools repeat across every turn of a task.

### Why do agentic runs cost so much more than single prompts?

Because each turn re-sends the accumulating context, so a multi-turn run pays for its system prompt, tools, and transcript many times over. Caching the stable prefix and trimming tool outputs removes most of that repeated cost.

### Should I use a smaller model to save money?

Route mechanical steps like classification and tool selection to a smaller, faster model and reserve the most capable model for hard reasoning. Verify quality at each downgrade, because a wrong decision by a weak model can trigger an expensive loop.

### When is batch processing the right choice?

Use asynchronous batching for large volumes of independent, non-urgent work — overnight document enrichment or bulk classification — where you can trade latency for a per-token discount. Keep interactive tasks synchronous and rely on caching instead.

## Bringing agentic AI to your phone lines

Keeping runs cheap and fast is exactly what lets CallSphere run agentic **voice and chat** assistants at scale — caching stable context and routing work to the right model so every call is answered without the bill running away. See it live at [callsphere.ai](https://callsphere.ai).

---

Source: https://callsphere.ai/blog/cutting-claude-cowork-token-costs-caching-and-batching
