Reusable code patterns for Claude batch processing jobs

Code-level patterns for Claude batch processing: request factories, cache-friendly context layering, structured outputs, and self-describing custom_ids.

Your first Claude batch job is a script. Your tenth is a system. Somewhere between those two, the ad-hoc create() call you copy-pasted starts to creak: the prompt assembly is duplicated across files, the cache never hits, and reassembling results has become a fragile web of string parsing. This post is a set of reusable patterns — request factories, context layering, structured output, and result joins — that turn batch processing from a one-off script into a component you can trust at scale.

Key takeaways

  • Wrap request construction in a factory function so the stable parts (model, system, tools) live in one place and only the per-item content varies.
  • Layer your prompt as frozen prefix then volatile suffix — a cacheable shared block followed by the per-request question — to make prompt caching actually hit inside a batch.
  • Use structured outputs (output_config.format) so every result is machine-parseable JSON, eliminating brittle text scraping during reassembly.
  • Encode reassembly metadata in the custom_id itself (a delimited key) so the result join needs no side table.
  • Build a resubmission helper that takes errored and expired custom_ids and rebuilds exactly those requests from your source data.

Pattern 1: the request factory

The enemy of a maintainable batch job is duplication in request construction. Every request shares 90% of its body — same model, same system prompt, same tool list — and differs only in the user content. Centralize the constant part in a factory. This single move makes the shared prefix byte-identical across requests, which is also the precondition for caching to work.

from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

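# EXTRACTION_GUIDE is assumed to be a long, stable instruction string that is
# loaded once at startup and never edited per request.
SHARED_SYSTEM = [
    {"type": "text", "text": "You are a precise data extraction engine."},
    {"type": "text", "text": EXTRACTION_GUIDE,
     "cache_control": {"type": "ephemeral"}},   # frozen, cacheable
]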

def make_request(key: str, document: str) -> Request:
    return Request(
        custom_id=key,
        params=MessageCreateParamsNonStreaming(
            model="claude-opus-4-8",
            max_tokens=2048,
            system=SHARED_SYSTEM,            # identical bytes every call
            messages=[{"role": "user", "content": document}],
        ),
    )

# corpus is assumed to be an iterable of (custom_id, document_text) pairs.
requests = [make_request(k, doc) for k, doc in corpus]

Because SHARED_SYSTEM is constructed once and reused, every request renders the same prefix. Reorder the keys in a dict or interpolate a timestamp into that block and you would silently shatter the cache — keep the frozen prefix truly frozen.
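One way to catch that kind of silent drift before you submit is to fingerprint the frozen block of every request and confirm there is exactly one distinct value. The sketch below is a hypothetical helper, not part of the Anthropic SDK, and it assumes the system blocks are plain JSON-serializable dicts. It deliberately does not sort keys, so a reordered dict produces a different fingerprint, mirroring the byte-level sensitivity described above.

```python
import hashlib
import json


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash the system blocks in their current key order.

    Hypothetical helper. json.dumps preserves dict insertion order, so a
    reordered key or an interpolated value changes the digest.
    """
    serialized = json.dumps(system_blocks, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def assert_shared_prefix(all_system_blocks: list[list[dict]]) -> str:
    """Return the common fingerprint, or raise if the prefixes differ."""
    fingerprints = {prefix_fingerprint(blocks) for blocks in all_system_blocks}
    if len(fingerprints) != 1:
        raise ValueError(f"{len(fingerprints)} distinct system prefixes found")
    return fingerprints.pop()
```

Running `assert_shared_prefix` over the `system` value of each request just before submission is cheap, and a failure points at a frozen block that is not actually frozen. It approximates the bytes sent on the wire rather than reproducing them, so treat it as a tripwire, not a proof.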

Pattern 2: layer context, frozen before volatile

Caching is a prefix match: any byte change invalidates everything after it. The design rule that follows is mechanical. Put everything stable — persona, reference documents, few-shot examples — at the front, marked with cache_control. Put the one thing that changes per request — the actual question or document — after the breakpoint, unmarked.

flowchart TD
  A["Per-request body"] --> B["Frozen prefix:\npersona + guide + examples"]
  B --> C["cache_control breakpoint"]
  C --> D["Volatile suffix:\nthis item's content"]
  B --> E{"Prefix byte-identical\nacross requests?"}
  E -->|Yes| F["cache_read on later items\n~0.1x input price"]
  E -->|No| G["cache miss: full price\nevery request"]
  D --> H["Claude processes\nsuffix + cached prefix"]
  F --> H

The practical test is to inspect usage.cache_read_input_tokens on a few results. If it is consistently zero across requests that should share a prefix, a silent invalidator has crept in — most often a non-deterministic JSON serialization or a per-item value that leaked into the frozen block.

One timing nuance is worth internalizing so you do not misread the metrics. A cache entry becomes readable only after the request that writes it begins processing, and in a batch the requests sharing a prefix do not all start at once. So the very first items to run will show a cache write rather than a read, and the read rate climbs as the batch drains. If you sample only the earliest results, you may conclude caching is broken when it is simply warming up. Sample across the run, and look at the aggregate write-versus-read ratio rather than any single request.
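The aggregate check described above can be sketched as a small accumulator. This is a minimal illustration, assuming each usage object exposes `cache_creation_input_tokens` and `cache_read_input_tokens` as attributes; `CacheStats` and `summarize_cache` are hypothetical names, not SDK types.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CacheStats:
    requests: int = 0
    write_tokens: int = 0
    read_tokens: int = 0
    requests_with_read: int = 0

    @property
    def read_share(self) -> float:
        """Fraction of cache-related tokens that were reads, not writes."""
        total = self.write_tokens + self.read_tokens
        return self.read_tokens / total if total else 0.0


def summarize_cache(usages: Iterable[object]) -> CacheStats:
    """Accumulate cache write and read tokens across a whole run."""
    stats = CacheStats()
    for usage in usages:
        # A missing or None field is treated as zero.
        written = getattr(usage, "cache_creation_input_tokens", 0) or 0
        read = getattr(usage, "cache_read_input_tokens", 0) or 0
        stats.requests += 1
        stats.write_tokens += written
        stats.read_tokens += read
        if read > 0:
            stats.requests_with_read += 1
    return stats
```

Feed it the usage object of every succeeded result, not just the first handful. A `read_share` near zero over the full run suggests a real invalidator, while a low value over only the earliest results is more likely the warm-up effect.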

Pattern 3: structured outputs for clean reassembly

Text scraping is a leading source of batch reassembly bugs, behind only positional joins, where results are matched to inputs by order even though a batch can return them in any order. If you ask for "the category and confidence" in prose, you will spend the afternoon writing regexes for the model's three favorite phrasings. Constrain the output to a schema instead and parse JSON deterministically.

SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string",
                     "enum": ["billing", "bug", "feature", "other"]},
        "confidence": {"type": "string",
                       "enum": ["low", "medium", "high"]},
    },
    "required": ["category", "confidence"],
    "additionalProperties": False,
}

def make_request(key: str, text: str) -> Request:
    return Request(
        custom_id=key,
        params=MessageCreateParamsNonStreaming(
            model="claude-haiku-4-5",
            max_tokens=128,
            output_config={"format": {"type": "json_schema", "schema": SCHEMA}},
            messages=[{"role": "user", "content": text}],
        ),
    )

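On the result side, the first text block of a completed response conforms to your schema, so reassembly is json.loads(text) with no regexes. The cases to check for are a refusal or a response cut off by max_tokens, either of which can leave you without a complete object, so inspect stop_reason before parsing. Structured outputs compose cleanly with batches: the constraint applies per request exactly as it would synchronously.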
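A minimal sketch of that parse step follows. It assumes the message object exposes `stop_reason` and a `content` list whose blocks carry `type` and `text`, and that a normal completion reports `"end_turn"`. `parse_structured` and `UnusableResult` are hypothetical names introduced for illustration.

```python
import json


class UnusableResult(ValueError):
    """Raised when a result cannot be turned into the expected object."""


def parse_structured(message) -> dict:
    """Return the JSON object carried by the first text block.

    Anything other than a normal completion is rejected up front, because
    a truncated or refused response may not contain a complete object.
    """
    if message.stop_reason != "end_turn":
        raise UnusableResult(f"unexpected stop_reason: {message.stop_reason}")
    for block in message.content:
        if block.type == "text":
            return json.loads(block.text)
    raise UnusableResult("no text block in message")
```

Treating an `UnusableResult` like an errored request keeps the retry path uniform. The custom_id goes into the resubmission list, ideally with a larger max_tokens if truncation was the cause.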

Pattern 4: self-describing custom_ids

You can push reassembly metadata directly into the custom_id and avoid a side table entirely. A delimited composite key — entity type, primary key, and a version or shard tag — survives the round trip and tells you everything you need to route the result.

def encode_id(entity: str, pk: int, shard: str) -> str:
    return f"{entity}|{pk}|{shard}"

def decode_id(cid: str) -> tuple[str, int, str]:
    entity, pk, shard = cid.split("|")
    return entity, int(pk), shard

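# On the way back, when reading results. Assumes client is an
# anthropic.Anthropic() instance, batch is the submitted batch, and
# route_result is your own handler.
for r in client.messages.batches.results(batch.id):
    if r.result.type == "succeeded":
        entity, pk, shard = decode_id(r.custom_id)
        route_result(entity, pk, shard, r.result.message)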

Keep the delimiter out of your raw data values and keep the whole string within the API's custom_id limits, and this pattern eliminates an entire class of "which row was this again?" bugs. Two rules of thumb help. First, pick a delimiter that cannot appear in any field you encode and that the API accepts: confirm the allowed length and character set in the Message Batches reference before committing, because if ids are restricted to letters, digits, hyphens, and underscores, a pipe or double colon will be rejected at submission and a double underscore is the safer separator. Second, validate on decode so a malformed id fails loudly rather than silently routing a result to the wrong place. When the keys themselves might contain arbitrary text, reach for a structured encoding like a short URL-safe base64 blob instead of raw concatenation, so the round trip is lossless regardless of content.
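Here is one way the validation and the base64 fallback might look. Treat it as a sketch under stated assumptions: `MAX_ID_LENGTH = 64` is an assumed limit to confirm against the API reference, `DELIMITER` should be swapped for something like `"__"` if the API rejects the pipe character, and the `_b64` helpers are hypothetical names.

```python
import base64
import json

MAX_ID_LENGTH = 64  # assumption: confirm the real limit in the API reference
DELIMITER = "|"     # assumption: swap for "__" if the API rejects this character


def encode_id(entity: str, pk: int, shard: str) -> str:
    """Build a delimited id, refusing fields that would break the decode."""
    for field in (entity, shard):
        if DELIMITER in field:
            raise ValueError(f"delimiter {DELIMITER!r} found in {field!r}")
    cid = f"{entity}{DELIMITER}{pk}{DELIMITER}{shard}"
    if len(cid) > MAX_ID_LENGTH:
        raise ValueError(f"custom_id is {len(cid)} chars, limit {MAX_ID_LENGTH}")
    return cid


def decode_id(cid: str) -> tuple[str, int, str]:
    """Split an id into its three fields, failing loudly on malformed input."""
    parts = cid.split(DELIMITER)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed custom_id: {cid!r}")
    entity, pk, shard = parts
    return entity, int(pk), shard  # int() raises ValueError on a non-numeric pk


def encode_id_b64(fields: list) -> str:
    """Lossless variant for fields that may contain arbitrary text."""
    raw = json.dumps(fields, separators=(",", ":")).encode("utf-8")
    cid = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    if len(cid) > MAX_ID_LENGTH:
        raise ValueError(f"custom_id is {len(cid)} chars, limit {MAX_ID_LENGTH}")
    return cid


def decode_id_b64(cid: str) -> list:
    """Restore the stripped padding, then decode back to the field list."""
    padded = cid + "=" * (-len(cid) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

The URL-safe alphabet with padding stripped uses only letters, digits, hyphens, and underscores, so it should survive a restrictive character set. The price is length: base64 inflates the payload by about a third, which is one more reason to keep only routing keys in the id.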

Pattern 5: a resubmission helper

Errored and expired requests are normal at scale, not exceptional. Build the retry path as a first-class function from day one. It takes the list of custom_ids that need rework, rebuilds exactly those requests from your source data using the same factory, and submits a fresh, smaller batch.

def resubmit(failed_ids: list[str], source: dict) -> str:
    retry_requests = [
        make_request(cid, source[cid]) for cid in failed_ids
    ]
    new_batch = client.messages.batches.create(requests=retry_requests)
    return new_batch.id

Because the factory is the single source of truth for request shape, retries are guaranteed to match the original job's configuration. No drift, no special-casing.
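The helper needs a list of failed ids to work from. A minimal sketch of collecting them is below, assuming each result entry exposes `custom_id` and `result.type`, with `"errored"` and `"expired"` as the retryable types named in this post. `partition_results` and `ids_to_resubmit` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Iterable

RETRYABLE_TYPES = ("errored", "expired")


def partition_results(results: Iterable[object]) -> dict[str, list[str]]:
    """Group custom_ids by result type in a single pass over the results."""
    by_type: dict[str, list[str]] = defaultdict(list)
    for entry in results:
        by_type[entry.result.type].append(entry.custom_id)
    return dict(by_type)


def ids_to_resubmit(results: Iterable[object]) -> list[str]:
    """Return the ids whose result type is retryable, errored first."""
    by_type = partition_results(results)
    return [cid for kind in RETRYABLE_TYPES for cid in by_type.get(kind, [])]
```

The output of `ids_to_resubmit` plugs straight into `resubmit()`. One caveat worth checking in your own data: an errored request that failed validation will most likely fail again unchanged, so it can pay to inspect the error detail before retrying blindly.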

Common pitfalls

  • Rebuilding the system block per request. Constructing SHARED_SYSTEM inside the loop risks subtle byte differences and kills caching. Build it once, outside the comprehension.
  • Marking the volatile suffix with cache_control. If you put the breakpoint after the per-item content, every request writes a unique cache entry and nothing is ever read. Mark the end of the shared prefix only.
  • Prose outputs for structured data. Free-text answers force brittle parsing. Use output_config.format whenever the downstream consumer is code.
  • Unbounded custom_id length. Composite keys are great until they overflow length limits. Keep them compact.
  • No retry path. A batch job without a resubmission helper means hand-editing failures at 2am. Write it up front.

Adopt these patterns in 5 steps

  1. Extract request construction into a single make_request() factory.
  2. Split your prompt into a frozen, cache_control-marked prefix and an unmarked volatile suffix.
  3. Replace prose instructions with an output_config.format schema wherever code consumes the result.
  4. Adopt a delimited, self-describing custom_id scheme that encodes your join keys.
  5. Write a resubmit() helper that rebuilds failed requests from the same factory.

Pattern tradeoffs at a glance

| Pattern | Buys you | Costs you |
| --- | --- | --- |
| Request factory | One source of truth, cache-friendly prefix | A little upfront structure |
| Frozen/volatile layering | Cache reads at ~0.1x input price | Discipline about what is frozen |
| Structured outputs | Deterministic, parse-free reassembly | Schema maintenance |
| Self-describing custom_id | No side table for the join | Length and delimiter care |

Frequently asked questions

Does prompt caching really help inside a batch?

Yes, when many requests share a large identical prefix. The savings accrue as the batch drains rather than all at once, because cache entries become readable only after the first writing request begins, but you still pay roughly a tenth of the input price for the cached prefix on later requests — stacked on top of the 50% batch discount.
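To make the stacking concrete, here is a back-of-the-envelope sketch. The 0.5 batch discount and the 0.1 cache-read multiplier come from this post. The 1.25 cache-write multiplier is an assumption not stated here, so check current pricing before relying on it. The function name is hypothetical, and it models every cache miss as a cache write.

```python
def relative_input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    cache_hit_rate: float,
    batch_discount: float = 0.5,
    cache_read_multiplier: float = 0.1,
    cache_write_multiplier: float = 1.25,  # assumption, not from this post
) -> float:
    """Expected input cost per request, in units of one base-price token.

    Prefix tokens are billed at the read multiplier on a hit and at the
    write multiplier on a miss. Suffix tokens are billed at the base rate.
    The batch discount applies to the whole input.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    prefix_multiplier = (
        cache_hit_rate * cache_read_multiplier
        + (1.0 - cache_hit_rate) * cache_write_multiplier
    )
    return batch_discount * (prefix_tokens * prefix_multiplier + suffix_tokens)
```

With a 10,000-token prefix, a 500-token suffix, and a 90% hit rate, this works out to about 1,325 token-equivalents per request under those assumed multipliers. The same request costs 5,250 in a batch with no caching and 10,500 when sent synchronously with no caching. The hit rate is the sensitive input, which is why measuring it across the whole run matters.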

Can I mix models in one batch using a factory?

Yes. Each request carries its own model, so a factory can branch — Haiku for simple classification, Opus for the reasoning-heavy items — within a single submitted batch.
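A branching factory might look roughly like the sketch below. The model ids are the ones used in this post's examples, while the task names and the `pick_model` and `make_params` helpers are hypothetical. The returned dict has the same shape as the params built in the earlier factory.

```python
CHEAP_MODEL = "claude-haiku-4-5"   # model ids as written in this post
STRONG_MODEL = "claude-opus-4-8"

# Hypothetical task labels that warrant the stronger model.
REASONING_TASKS = frozenset({"root_cause", "contract_review"})


def pick_model(task: str) -> str:
    """Route reasoning-heavy tasks to the stronger model, the rest to the cheap one."""
    return STRONG_MODEL if task in REASONING_TASKS else CHEAP_MODEL


def make_params(task: str, text: str, max_tokens: int = 1024) -> dict:
    """Build per-request params with the model chosen by task type."""
    return {
        "model": pick_model(task),
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": text}],
    }
```

Keeping the routing rule in one function means a retry rebuilt from the same factory lands on the same model as the original request.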

Why prefer structured outputs over a tool definition for extraction?

When you only need a typed JSON object back and nothing is executed, output_config.format is the lighter path: it constrains the response shape directly without the overhead of a tool-use round trip.

What goes in the custom_id versus a side table?

Put the minimal join keys you need to route the result — entity type and primary key — in the custom_id, and keep bulky context in your own store. The id is a routing label, not a payload.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
