Skip to content
Agentic AI
Agentic AI8 min read0 views

Cutting Claude Computer Use Token Cost: Caching & Batching

Keep Claude computer-use runs cheap with prompt caching, screenshot pruning, batching, and model routing. A 5-step plan and a cost-lever table.

Computer use has a cost shape unlike any other Claude workload: it sends a fresh image on almost every turn, and images are expensive in tokens. A single 1080p screenshot can cost as much as a couple of pages of text, and a real task might involve thirty turns. Multiply that across a fleet of agents running all day and the difference between a thoughtful token strategy and a naive one is the difference between a viable product and a runaway invoice.

The good news is that computer-use cost is highly controllable once you understand where the tokens go. They go into three buckets: the static prompt prefix (system prompt, tool definitions, instructions), the growing conversation history (every prior screenshot and action), and the new image each turn. Each bucket has a different lever — prompt caching, history pruning, and image discipline — and you can pull all three at once.

Key takeaways

  • Prompt caching on your stable prefix is the biggest single win; cached input tokens are billed at a fraction of the normal rate.
  • Screenshots dominate cost — control resolution and only send a new image when the screen actually changed.
  • Prune old screenshots from history; the model rarely needs a frame from twenty turns ago.
  • Batch independent tasks through the Message Batches API for non-interactive work at roughly half the per-token price.
  • Pick the right model per step — Haiku for simple navigation, Opus for tricky reasoning — instead of running everything on the most expensive tier.

Where the tokens actually go

Before optimizing, instrument. Every Claude API response reports usage, including input tokens, output tokens, and the cache read and write counts. Log all of them per turn and you will immediately see the pattern: input tokens climb steadily as history grows, and each turn carries a large fixed image cost. Output tokens, by contrast, are usually small — the model emits a short action, not an essay.

That asymmetry tells you where to aim. Optimizing output is mostly wasted effort. The prize is input: the repeated prefix and the accumulating screenshots. A computer-use run that looks expensive is almost always paying over and over for the same system prompt and a backlog of stale images it no longer needs.

Caching and batching flow

flowchart TD
  A["New turn"] --> B{"Screen changed since last frame?"}
  B -->|No| C["Reuse prior image, send action only"]
  B -->|Yes| D["Capture new screenshot"]
  D --> E{"Prefix cached?"}
  E -->|Yes| F["Cache read: pay reduced rate"]
  E -->|No| G["Cache write: pay once, reuse later"]
  F --> H["Prune screenshots older than N turns"]
  G --> H
  H --> I["Send compact request to Claude"]

The flow shows the two checks that should gate every turn: did the screen change (if not, do not pay for a new image), and is the prefix cached (so you pay the reduced read rate instead of full price). Pruning old screenshots before sending keeps the history bucket from ballooning.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Prompt caching: the biggest lever

Prompt caching lets you mark a stable portion of your request — system prompt, tool definitions, long instructions — so Claude stores it and bills subsequent reads at a steep discount. For computer use this is enormous, because the tool definitions and operating instructions are identical on every single turn of every task. You write them to cache once and read them cheaply thousands of times.

The mechanics are a cache_control breakpoint placed at the end of the content you want cached. Order matters: put everything stable first (system prompt, tools), then the volatile conversation. The cache is a prefix match, so any change before the breakpoint invalidates it. Keep your prefix byte-for-byte identical across turns.

{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "You operate a desktop... [long stable instructions]",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "tools": [ /* computer-use + custom tools, stable */ ],
  "messages": [ /* volatile conversation goes here, after the cached prefix */ ]
}

With that breakpoint in place, the first turn pays a small write premium and every later turn reads the prefix at the reduced cache rate. On a thirty-turn task the savings on the prefix alone are substantial, and they compound across every concurrent agent sharing the same instructions.

Screenshot discipline

Images are the second-largest cost, and the rule is simple: send the fewest, smallest legible images you can. Two tactics do most of the work. First, do not re-send a screenshot when the screen has not changed — hash the frame and, if it matches the previous one, reuse it and send only the new action. Second, prune. The model almost never needs the screenshot from twenty turns ago; keep the last few frames in history and drop older images, replacing them with a short text summary of what happened.

Resolution is a genuine tradeoff. Too low and the model hallucinates coordinates; too high and you pay for pixels it does not need. The right answer is the lowest resolution at which buttons and text are clearly legible in the logged image. Test it visually: if you can read the UI, so can Claude.

Batching and model selection

Not every computer-use workload is interactive. If you are running many independent tasks — process these fifty documents, fill out these forms — and you can tolerate results coming back asynchronously, the Message Batches API processes them at roughly half the standard per-token price. The catch is that batching suits parallel, non-real-time work; it does not help a single user waiting on a live session.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Model selection is the other underused lever. A computer-use task is rarely uniformly hard. Routine navigation — clicking a known button, scrolling to a section — can run on Haiku at a fraction of Opus pricing, while you escalate to Opus only for the genuinely ambiguous steps. Routing by step difficulty, rather than running the whole task on the most capable model, often cuts cost without a visible quality drop.

Common pitfalls

  • Putting volatile content before the cache breakpoint. One changing token at the front invalidates the entire cached prefix. Keep stable content strictly first.
  • Letting history grow unbounded. Every old screenshot rides along on every turn. Prune aggressively.
  • Maxing out screenshot resolution "to be safe." You pay for every pixel. Use the lowest legible resolution.
  • Running everything on Opus. Most steps are easy. Reserve the top model for the hard ones.
  • Not logging cache hits. If you do not track cache read vs write counts, you cannot tell whether caching is even working.

Cut your bill in 5 steps

  1. Log per-turn usage including cache read and write counts; find the dominant bucket.
  2. Add a cache_control breakpoint after your stable system prompt and tool definitions.
  3. Hash screenshots and skip re-sending unchanged frames.
  4. Prune images older than the last few turns, replacing them with short text summaries.
  5. Route easy steps to Haiku and batch non-interactive jobs through the Batches API.

Cost levers compared

LeverTargetsEffortTypical impact
Prompt cachingRepeated prefixLowLarge
Screenshot pruningHistory growthLowLarge
Resolution tuningPer-image costMediumMedium
Message BatchesNon-interactive jobsMedium~50% on those
Model routingEasy stepsMediumMedium

Frequently asked questions

Why is computer use so much more expensive than text-only Claude calls?

Because it sends a screenshot nearly every turn, and images cost far more tokens than equivalent text. A single high-resolution frame can cost as much as a couple of pages of prose, and a task may run dozens of turns. Controlling image count and size is the core of cost control.

Does prompt caching work with computer use?

Yes, and it is the highest-leverage optimization. Your system prompt and tool definitions are identical on every turn, so caching that prefix means you write it once and read it at a reduced rate for the rest of the task. Just keep the cached prefix byte-for-byte stable.

When should I use the Message Batches API?

When the work is independent and you do not need real-time responses — processing many documents or forms in parallel. It runs at roughly half the standard per-token price. For a live, interactive session where a user is waiting, batching does not apply.

Can I really save money by switching models mid-task?

Often, yes. Most steps in a computer-use task are routine navigation that a smaller model handles fine. Route those to Haiku and escalate to Opus only for genuinely ambiguous steps, and you cut cost with little quality loss.

Efficient agents on your phone lines

CallSphere applies the same efficiency thinking — caching stable context, trimming what each turn carries, and routing work to the right model — to voice and chat agents that handle every call and message and book work 24/7 without runaway cost. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.