Prompt and context design for Claude batch jobs at scale

What to put in context for Claude Message Batches: frozen vs volatile prefixes, caching, few-shot examples, and per-item model routing.

A prompt that works once works once. A prompt that runs a hundred thousand times in a batch is a different object: every wasted token is multiplied by your request count, every ambiguity becomes a distribution of failures, and every byte you put in the shared prefix either earns a cache hit on the next 99,999 requests or doesn't. Designing context for batch is less about clever wording and more about deciding, ruthlessly, what belongs in context at all. This post is that discipline.

Key takeaways

  • Split context into a frozen shared prefix (instructions, schema, few-shot examples) and a volatile per-item suffix — the split is what makes prompt caching pay off across a batch.
  • Leave out anything the model can infer, anything not used in every request, and anything that varies per request but sits in the prefix — each kind silently costs you.
  • Put output format and constraints in the system prompt once, not repeated in every user message, so they are read from cache at the discounted rate instead of being billed at full input price on every request.
  • Use two to three sharp few-shot examples over a long prose specification; examples generalize better across a varied corpus and cost fewer tokens.
  • Right-size the model and effort per item — a five-way classifier does not need Opus; reserve reasoning budget for the requests that earn it.

The economics that change everything at batch scale

In an interactive call, a hundred extra tokens of context is invisible. In a batch of 100,000, it is ten million tokens you pay for, every run. That single fact reorders your priorities. The question stops being "what context might help?" and becomes "what context is load-bearing on every request, and where does it sit?" Context that helps one request in fifty is not free insurance: in a batch of 100,000 it is dead weight on the other 98,000.
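The arithmetic is easy to sanity-check. The following is a minimal sketch; the function names are made up for illustration, and the price argument is a placeholder you supply, not a published rate.

```python
def wasted_tokens(extra_tokens_per_request: int, n_requests: int) -> int:
    """Total tokens spent on context that rides along on every request."""
    return extra_tokens_per_request * n_requests


def requests_not_helped(n_requests: int, helps_one_in: int) -> int:
    """How many requests carry the context without benefiting from it."""
    return n_requests - n_requests // helps_one_in


def wasted_cost(extra_tokens_per_request: int, n_requests: int,
                price_per_million: float) -> float:
    """Cost of that context at a caller-supplied price per million tokens."""
    tokens = wasted_tokens(extra_tokens_per_request, n_requests)
    return tokens / 1_000_000 * price_per_million


print(wasted_tokens(100, 100_000))       # 10000000
print(requests_not_helped(100_000, 50))  # 98000
```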

Prompt and context design for batch processing is the practice of deciding which information belongs in every request's context window, which belongs only in some, and which belongs nowhere, then placing the universal part where it can be cached. Get the placement right and the shared instructions are billed at roughly a tenth of the full input price on every request that reads them from cache, once a first request has written them. In a batch those hits are best-effort, because requests are processed concurrently and in no fixed order. Get it wrong and you re-pay for the same preamble a hundred thousand times.
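To see how much the placement is worth, here is a rough cost model for the shared prefix alone. It is a sketch under stated assumptions: the 1.25x cache-write and 0.1x cache-read multipliers are parameters (they match the default short-lived cache pricing as I understand it, but check current pricing), misses are assumed to pay the write rate again, and `hit_rate` is there because batch cache hits are not guaranteed.

```python
def prefix_cost_units(prefix_tokens: int, n_requests: int,
                      hit_rate: float = 1.0,
                      write_multiplier: float = 1.25,
                      read_multiplier: float = 0.1) -> float:
    """Prefix cost in 'base-price input tokens' across a whole batch.

    The first request writes the cache. Of the remaining requests, a
    fraction `hit_rate` reads it; the rest are assumed to write it again.
    """
    later = n_requests - 1
    hits = later * hit_rate
    misses = later - hits
    return prefix_tokens * (write_multiplier * (1 + misses)
                            + read_multiplier * hits)


def uncached_cost_units(prefix_tokens: int, n_requests: int) -> float:
    """Same prefix with no caching at all: full price on every request."""
    return float(prefix_tokens * n_requests)


baseline = uncached_cost_units(2_000, 100_000)
for rate in (1.0, 0.8, 0.0):
    ratio = prefix_cost_units(2_000, 100_000, hit_rate=rate) / baseline
    print(f"hit_rate={rate:.1f} -> {ratio:.3f}x of uncached prefix cost")
```

With a perfect hit rate the prefix costs about a tenth of the uncached figure; with a hit rate of zero it costs more than not caching at all, which is why a silently broken prefix is worth catching early.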

The frozen-versus-volatile split

Every batch prompt decomposes into two zones. The frozen zone is identical across all requests: the persona, the task instructions, the output schema, the few-shot examples, any shared reference document. The volatile zone is the one thing that differs per request: this item's text, this document, this question. The architectural rule is to render frozen-before-volatile and put the cache breakpoint at the boundary.

flowchart TD
  A["Raw item"] --> B["Classify each input"]
  B --> C{"Same on every request?"}
  C -->|Yes| D["Frozen prefix:\ninstructions + schema + examples"]
  C -->|No, used every time| E["Volatile suffix:\nthis item's content"]
  C -->|Inferable or rarely used| F["Leave out of context"]
  D --> G["cache_control breakpoint"]
  G --> E
  D --> H{"Prefix byte-stable?"}
  H -->|Yes| I["cache_read on later items"]
  H -->|No| J["silent invalidation: full price"]
  E --> K["Model answers"]

The discipline this diagram enforces is a three-way sort of every candidate piece of context: frozen, volatile, or omitted. Most batch prompt bloat comes from items that should have been omitted entirely but got swept into the prefix "just in case."
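As a concrete illustration of frozen-before-volatile, the sketch below builds batch request payloads as plain dictionaries. `FROZEN_SYSTEM`, `build_request`, and the ticket IDs are hypothetical names for this example; the `custom_id`/`params` shape and the `cache_control` block follow the Message Batches and prompt caching APIs as I understand them. The system text here is far too short to be cached in practice and only stands in for a real prefix.

```python
# Frozen zone: identical for every request, ends with the cache breakpoint.
FROZEN_SYSTEM = [
    {
        "type": "text",
        "text": ("You are a classifier. Reply with exactly one label from: "
                 "billing, bug, feature, other."),
        "cache_control": {"type": "ephemeral"},  # breakpoint at the boundary
    }
]


def build_request(item_id: str, body: str) -> dict:
    """One batch entry: shared frozen prefix plus this item's volatile text."""
    return {
        "custom_id": item_id,  # per-item ID lives outside the prompt prefix
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 12,
            "system": FROZEN_SYSTEM,
            "messages": [{"role": "user", "content": body}],  # volatile zone
        },
    }


items = {
    "ticket-1": "I was charged twice this month.",
    "ticket-2": "The app crashes when I log in.",
}
requests = [build_request(item_id, body) for item_id, body in items.items()]

# With the official Python SDK these would be submitted roughly as:
#   anthropic.Anthropic().messages.batches.create(requests=requests)
```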

What to leave out, and why each omission helps

Three categories of context earn their removal. First, anything the model can infer: telling Claude that "billing" relates to invoices spends tokens on something it already knows. Second, anything used in only some requests: a special-case instruction relevant to 2% of items belongs in those items' suffix, not the shared prefix. Third, anything volatile that crept into the frozen zone: a per-item ID interpolated into the system prompt changes the prefix on every request, so no request can reuse another's cache entry.

# Sample values so the snippet runs on its own
item_id, body = "T-1001", "I was charged twice this month."

# Anti-pattern: per-item value in the frozen system prompt
bad_system = f"You are a classifier. Current item: {item_id}."  # breaks cache

# Correct: item_id lives in the volatile suffix, system stays frozen
SYSTEM = (
    "You are a classifier. Reply with exactly one label from: "
    "billing, bug, feature, other."
)
messages = [{"role": "user",
             "content": f"[{item_id}] {body}"}]   # volatile, after breakpoint

The verification is mechanical: check usage.cache_read_input_tokens on a handful of results. A consistent zero across requests that share a prefix usually means a volatile value leaked into the frozen zone: almost always a timestamp, a UUID, an unsorted JSON dump, or a per-item ID like the one above. The other common cause is a prefix shorter than the model's minimum cacheable length, which is never cached at all.
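That check can be automated over parsed batch results. A minimal sketch, assuming each result has been loaded as a dictionary with the `result.type` and `result.message.usage` fields of the batch results format; `cache_summary` and the sample data are invented for illustration.

```python
def cache_summary(results: list[dict]) -> dict:
    """Count cache reads and writes across succeeded batch results."""
    succeeded = [r for r in results if r["result"]["type"] == "succeeded"]
    usages = [r["result"]["message"]["usage"] for r in succeeded]
    hits = sum(1 for u in usages
               if (u.get("cache_read_input_tokens") or 0) > 0)
    writes = sum(1 for u in usages
                 if (u.get("cache_creation_input_tokens") or 0) > 0)
    return {
        "succeeded": len(succeeded),
        "hits": hits,
        "writes": writes,
        "hit_rate": hits / len(succeeded) if succeeded else 0.0,
    }


def _ok(custom_id: str, read: int, created: int) -> dict:
    """Build a fake succeeded result for the demo below."""
    usage = {"input_tokens": 40, "output_tokens": 3,
             "cache_read_input_tokens": read,
             "cache_creation_input_tokens": created}
    return {"custom_id": custom_id,
            "result": {"type": "succeeded", "message": {"usage": usage}}}


sample = [
    _ok("a", read=0, created=2000),     # first request writes the cache
    _ok("b", read=2000, created=0),     # later requests read it
    _ok("c", read=2000, created=0),
    {"custom_id": "d", "result": {"type": "errored"}},  # ignored
]
summary = cache_summary(sample)
print(summary["hits"], summary["writes"], summary["succeeded"])  # 2 1 3
```

A hit rate near zero with many writes is the signature of a prefix that changes per request.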

Examples over specifications

When you need the model to behave consistently across a wide variety of inputs, two or three precise few-shot examples in the frozen prefix do more than three paragraphs of prose rules. Examples are concrete, they demonstrate edge cases implicitly, and they generalize across the corpus better than abstract instructions. They also cost fewer tokens than the equivalent exhaustive specification, and because they live in the frozen zone, they are written to the cache once and read back at the discounted rate afterwards.

The counterintuitive part is restraint. More examples are not better past a small number — a handful of well-chosen, diverse cases beats a dozen near-duplicates that just inflate the prefix. Pick examples that sit at the boundaries of your label space or that demonstrate the format precisely, then stop.
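One way to keep a few-shot block frozen is to render it from a fixed list with a deterministic function. This is a sketch, not a prescribed format: the `Input:`/`Label:` layout, the example tickets, and the names `EXAMPLES` and `render_prefix` are all assumptions; only the label set comes from the classifier prompt earlier in this post.

```python
INSTRUCTIONS = (
    "You are a classifier. Reply with exactly one label from: "
    "billing, bug, feature, other."
)

# A few boundary-defining cases, in a fixed order (hypothetical tickets).
EXAMPLES = [
    ("I was charged twice this month.", "billing"),
    ("The export button does nothing when clicked.", "bug"),
    ("Could you add a dark mode?", "feature"),
]


def render_prefix(instructions: str, examples) -> str:
    """Render instructions plus examples; same inputs give the same bytes."""
    lines = [instructions, "", "Examples:"]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


prefix = render_prefix(INSTRUCTIONS, EXAMPLES)
print(prefix)
```

Because the list order and formatting are fixed in code, the rendered prefix stays byte-identical from request to request and from run to run.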

Right-sizing model and effort per item

Context design is not only about tokens in the prompt; it is also about how much reasoning each item warrants. A batch is the place where per-item right-sizing pays off most, because the savings multiply. A five-way classification does not need Opus and does not need extended thinking — Haiku with no thinking handles it for a fraction of the cost. Save the expensive configuration for the items in the batch that genuinely require multi-step reasoning.

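def params_for(item):
    """Return per-request model settings based on what the item needs."""
    if item.kind == "classify":
        # Simple labeling: small model, tiny output budget, no thinking.
        return dict(model="claude-haiku-4-5", max_tokens=12,
                    thinking={"type": "disabled"})
    # Everything else is treated as reasoning-heavy.
    return dict(model="claude-opus-4-8", max_tokens=4096,
                thinking={"type": "adaptive"},
                output_config={"effort": "high"})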

Because each batched request carries its own model and configuration, this routing happens at request-build time with no extra orchestration. The cheap path and the expensive path coexist in one submitted batch.
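Here is how that routing might look when a mixed batch is assembled. The `Item` dataclass and `build_request` are hypothetical; the routing rule, model IDs, and thinking/effort settings are copied as-is from the snippet above and are not independently verified here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    id: str
    kind: str   # "classify" or anything else (treated as reasoning-heavy)
    text: str


def params_for(item: Item) -> dict:
    """Routing rule from the article: cheap path vs. expensive path."""
    if item.kind == "classify":
        return dict(model="claude-haiku-4-5", max_tokens=12,
                    thinking={"type": "disabled"})
    return dict(model="claude-opus-4-8", max_tokens=4096,
                thinking={"type": "adaptive"},
                output_config={"effort": "high"})


def build_request(item: Item) -> dict:
    """Merge the routed settings with this item's message."""
    params = params_for(item)
    params["messages"] = [{"role": "user", "content": item.text}]
    return {"custom_id": item.id, "params": params}


items = [
    Item("t-1", "classify", "I was charged twice."),
    Item("t-2", "classify", "Please add dark mode."),
    Item("t-3", "analyze", "Summarize the root cause across these logs."),
]
requests = [build_request(i) for i in items]
counts = Counter(r["params"]["model"] for r in requests)
print(dict(counts))
```

Counting requests per model before submitting is a cheap way to confirm that the expensive path is only taken by the items that need it.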

Common pitfalls

  • Kitchen-sink prefixes. Stuffing every conceivable instruction into the shared prompt "to be safe" multiplies token cost by your request count. Include only what is load-bearing on every request.
  • Volatile values in the frozen zone. A per-item ID, timestamp, or unsorted JSON in the system prompt silently invalidates the cache for every request.
  • Repeating constraints per user message. Output format and rules belong once in the cached system prompt, not re-stated in each volatile suffix where they are billed at full input price every time.
  • Over-exampling. A dozen near-identical few-shots inflate the prefix without improving accuracy. Use a few diverse, boundary-defining cases.
  • One model for everything. Running simple classification on Opus wastes money at scale. Right-size per item.
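The volatile-value pitfall can be guarded against before a batch is submitted. The sketch below serializes any embedded JSON deterministically and checks that every request shares one prefix fingerprint. The helper names are hypothetical, and it assumes requests are dictionaries with the system prompt under `params.system`.

```python
import hashlib
import json


def stable_json(obj) -> str:
    """Serialize with sorted keys and fixed separators for stable bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def prefix_fingerprint(system) -> str:
    """Hash of the system prompt (string or list of content blocks)."""
    return hashlib.sha256(stable_json(system).encode("utf-8")).hexdigest()


def assert_single_prefix(requests: list[dict]) -> str:
    """Raise if the requests do not all share one system prompt."""
    fingerprints = {prefix_fingerprint(r["params"]["system"])
                    for r in requests}
    if len(fingerprints) != 1:
        raise ValueError(
            f"{len(fingerprints)} distinct prefixes found; "
            "a volatile value has probably leaked into the system prompt")
    return fingerprints.pop()


good = [{"params": {"system": "You are a classifier."}} for _ in range(3)]
bad = [{"params": {"system": f"You are a classifier. Item: {i}."}}
       for i in range(3)]

assert_single_prefix(good)       # passes silently
try:
    assert_single_prefix(bad)
except ValueError as err:
    print(err)
```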

Design a batch prompt in 5 steps

  1. List every piece of context and sort each into frozen, volatile, or omit.
  2. Assemble the frozen zone — instructions, schema, a few sharp examples — and mark its end with cache_control.
  3. Put only the per-item content in the volatile suffix, after the breakpoint.
  4. Verify a cache hit by inspecting cache_read_input_tokens on sample results.
  5. Route model and effort per item so cheap tasks run cheap and only hard items pay for reasoning.

Keep, move, or cut

| Context piece | Decision | Why |
| --- | --- | --- |
| Task instructions, output schema | Frozen prefix | Universal: cache once, reuse everywhere |
| This item's text | Volatile suffix | Differs per request, must stay after breakpoint |
| Per-item ID / timestamp | Volatile suffix (never prefix) | In the prefix it shatters the cache |
| Facts the model already knows | Cut | Pure token waste at scale |
| Rule used by 2% of items | Suffix of those items only | Not load-bearing on the other 98% |

Frequently asked questions

Why does context placement matter more in a batch than in a chat?

Because the same prompt runs thousands of times. A token of waste or a cache-busting value is multiplied by your request count, and a correctly frozen prefix can be read from cache by the requests that follow the first one that writes it. Hits within a batch are best-effort rather than guaranteed, but the savings only add up at batch scale.

How do I tell if my shared prefix is actually caching?

Inspect usage.cache_read_input_tokens on several results. If it is consistently zero across requests that share the prefix, a volatile value — timestamp, UUID, unsorted JSON, per-item ID — has leaked into the frozen zone and is invalidating it.

Should output instructions go in the system prompt or each user message?

Once in the system prompt, which sits in the cached frozen zone. Repeating them in every user message bills the same tokens at full input price on every request and gains nothing.

Is it worth mixing models within one batch?

Yes. Each request carries its own model, so routing simple items to Haiku and reasoning-heavy items to Opus within a single batch is straightforward and meaningfully cheaper than running everything on the most capable model.

Bringing agentic AI to your phone lines

The same context discipline — freeze what is shared, cut what is inferable, right-size per task — is what keeps a real-time agent fast and affordable. CallSphere applies these Claude patterns to voice and chat: agents that answer every call, use tools mid-conversation, and book work 24/7. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
