---
title: "Cutting Claude Cowork Token Cost on a 4,000-Account Book"
description: "Performance tuning for Claude Cowork: prompt caching, batching, model routing, and context discipline to run a 4,000-account book cheap and fast."
canonical: https://callsphere.ai/blog/cutting-claude-cowork-token-cost-on-a-4-000-account-book
category: "Agentic AI"
tags: ["agentic ai", "claude", "claude cowork", "token cost", "prompt caching", "performance", "batching"]
author: "CallSphere Team"
published: 2026-05-20T11:23:11.000Z
updated: 2026-06-06T21:47:42.138Z
---

# Cutting Claude Cowork Token Cost on a 4,000-Account Book

> Performance tuning for Claude Cowork: prompt caching, batching, model routing, and context discipline to run a 4,000-account book cheap and fast.

Running an agent over four thousand accounts is where token economics stop being theoretical. A workflow that costs a fraction of a cent per account feels free on a demo of ten records and turns into a real line item when you multiply it by four thousand and run it nightly. The good news: most of that cost is waste, and the same changes that make a run cheaper almost always make it faster too. This is a guide to squeezing a large Claude Cowork book down to a sensible cost and runtime without sacrificing quality.

## Where the tokens actually go

Before optimizing anything, measure. In a book-management run, tokens accumulate in four places: the instructions and tool definitions resent on every step, the account data pulled in from connectors, the model's own reasoning and tool-call output, and — the silent killer — the growing transcript the agent carries forward as it works through a record. The fourth one dominates surprisingly often, because every additional turn re-bills the entire conversation so far as input.

Token cost in an agentic run is the sum of input tokens (everything the model reads, re-billed each turn) plus output tokens (everything it generates), and the multi-turn nature of agents means input tokens usually dwarf output. That single fact reorders your priorities: shrinking what the model re-reads matters far more than shrinking what it writes.

## Prompt caching: stop paying for the same prefix

The highest-leverage optimization is prompt caching. Your system instructions, tool definitions, and skill content are identical across all four thousand accounts. Without caching, you pay full input price for that stable prefix on every single step of every single account. With caching, that prefix is stored and replayed at a steep discount, and you only pay full price for the part that changes — the specific account's data.

To make caching actually fire, structure your context so the stable parts come first and the variable parts come last. Put instructions, tool schemas, and skills at the top; put the per-account record at the bottom. If you interleave a dynamic timestamp or a shuffled account list into the prefix, you bust the cache on every call and pay full freight. Order discipline is the whole trick.

```mermaid
flowchart TD
  A["Run starts"] --> B["Stable prefix: instructions + tools + skills"]
  B --> C{"Prefix cached?"}
  C -->|Hit| D["Pay discounted rate for prefix"]
  C -->|Miss| E["Pay full rate, write to cache"]
  D --> F["Append per-account data"]
  E --> F
  F --> G["Model reasons + calls tools"]
  G --> H{"More accounts?"}
  H -->|Yes| B
  H -->|No| I["Run complete"]
```

## Batching: amortize the fixed overhead

The second lever is batching. If you process accounts one at a time with a fresh agent context for each, you pay the full setup cost — loading instructions, establishing the task — once per account. If instead you let the agent process a small batch of similar accounts in one context, the fixed overhead is shared. Grouping ten accounts that need the same treatment (say, all stale leads in one territory) lets the agent establish the pattern once and apply it ten times.

There's a tension here: batches that are too large blow up the context window and slow the model down, and one bad record can pollute the reasoning for the rest of the batch. The sweet spot is small, homogeneous batches — accounts that share a stage, a region, or a required action — so the agent's plan transfers cleanly across them. For genuinely independent work, run separate batches in parallel rather than one giant sequential context; parallelism cuts wall-clock time even when it doesn't cut tokens.

## Route models by difficulty, not by habit

Not every account needs your most capable model. A book of four thousand accounts is mostly routine — log an attempt, update a field, apply a tag — with a minority that need real judgment, like drafting a nuanced re-engagement note or untangling a messy account history. Sending everything to the most expensive model is like flying first class to the mailbox.

The pattern that works is tiered routing: a fast, cheap model such as Haiku 4.5 handles the bulk mechanical updates and triage, and escalates to a more capable model like Sonnet 4.6 or Opus 4.8 only when it hits ambiguity or a task it flags as hard. You can implement this as a triage pass — the cheap model classifies each account's difficulty, and only the hard bucket goes to the expensive model. On a typical book, the majority of accounts never need the premium tier, and your average cost per account drops accordingly while the hard cases still get the reasoning they deserve.

## Context discipline: the cheapest token is the one you never send

The most underrated optimization is simply sending less. Connectors love to return whole records — thirty fields when the agent needs four. Every unused field is input tokens you pay for on every turn the agent keeps that record in context. Project your tool outputs down to the fields the task actually uses, and you cut input cost across the board with zero quality loss.

The same applies to history. As the agent works a long batch, the transcript grows, and old reasoning that's no longer relevant keeps getting re-billed. Periodically compacting the context — summarizing completed accounts into a short "done" note and dropping their full detail — keeps the per-turn input bounded. Pair this with skills that load only when relevant rather than stuffing every instruction into the base prompt, and the prefix stays lean. Cheaper and faster turn out to be the same optimization: less to read means less to pay for and less to wait on.

## Measure cost per account, not per run

Optimize against the right metric. Total run cost hides where the waste is; cost per account, broken down by token type, tells you exactly what to fix. Instrument the run to emit, for each account, the input and output token counts and the cache hit rate. When you spot accounts costing ten times the median, you've usually found a record that triggered a loop or pulled an oversized payload — a performance bug and a debugging signal at once. Treat that per-account cost number as a first-class health metric, watch it as you scale, and you'll catch regressions before they show up on the invoice.

## Frequently asked questions

### What's the single biggest cost saver for a large Cowork run?

Prompt caching of the stable prefix — instructions, tool definitions, and skills — because that content is identical across every account and otherwise gets re-billed at full input price on every step. Order your context so stable content comes first and per-account data comes last, and the cache will actually hit.

### Does batching reduce token cost or just runtime?

Both, when done right. Small homogeneous batches amortize the fixed setup overhead across multiple accounts (cost) and let you run independent batches in parallel (runtime). Oversized batches backfire by bloating the context window and letting one bad record degrade the rest, so keep batches small and similar.

### How do I decide which model to use for each account?

Route by difficulty. Use a fast, cheap model like Haiku 4.5 for the bulk of mechanical updates and to triage difficulty, and escalate to Sonnet 4.6 or Opus 4.8 only for accounts that need real judgment. Most accounts in a sales book are routine, so tiered routing cuts average cost substantially.

### Why does context keep growing during a run, and does it matter?

The agent carries the conversation forward, so every completed account's reasoning is re-billed as input on later turns. It matters a lot at scale. Periodically compact finished work into short summaries and project tool outputs down to the needed fields to keep per-turn input bounded.

## Bringing agentic AI to your phone lines

Caching stable context, routing easy work to cheap models, and watching cost per interaction are exactly what make real-time voice agents economical at scale. CallSphere applies these agentic-AI patterns to **voice and chat** — assistants that answer every call and message, use tools mid-conversation, and book work 24/7 without runaway cost. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cutting-claude-cowork-token-cost-on-a-4-000-account-book