Inside the Claude Message Batches API architecture
How Anthropic's Message Batches API works end to end: async queue, per-request isolation, the 50% discount, caching, and the 29-day result store.
The first time you submit a hundred thousand prompts to Claude and get a single batch ID back in a few hundred milliseconds, it feels like the work disappeared into a black box. It did not. Behind that ID sits a deliberate pipeline that trades latency for throughput and cost, and understanding the shape of that pipeline is the difference between a batch job that finishes in twenty minutes and one that silently expires after a day. This post pulls the box apart and walks the path a single request takes from your create() call to a line in the results file.
Key takeaways
- The Message Batches API is a submit-and-poll async queue: you hand over up to 100,000 requests, get an ID, and poll for a terminal status. There is no streaming and no per-request callback.
- Every request inside a batch runs the full Messages API independently — vision, tools, structured outputs, and prompt caching all work, and one failure never poisons its neighbors.
- The 50% discount is the economic counterpart to relaxed latency: Anthropic backfills batch work into spare capacity rather than reserving low-latency lanes for it.
- Results are written to a 29-day result store keyed by your custom_id, which is the only thread connecting an output back to its input.
- Batches are capped at 100,000 requests or 256 MB, complete within 24 hours (usually under one), and surface four terminal per-request states: succeeded, errored, canceled, expired.
What problem does batching actually solve?
Interactive Messages API calls optimize for time-to-first-token. That optimization costs money and capacity: the platform has to keep a fast lane open for your request the instant it arrives. When you are classifying a backlog of two million support tickets overnight, or generating embeddings-style summaries for an entire document corpus, none of that urgency applies. You care about the aggregate completing before the morning standup, not about any single response landing in 800 milliseconds.
The Message Batches API is the surface Anthropic exposes for exactly this trade. It is an asynchronous endpoint that accepts a large array of independent Messages API requests, processes them off the latency-critical path, and returns results as a downloadable file at 50% of standard token prices. You give up streaming, immediate responses, and tight latency guarantees. In exchange you get half-price tokens and the ability to enqueue a hundred thousand jobs in one HTTP call.
The mental model that matters: a batch is not a single giant prompt. It is a container of fully separate conversations that happen to be submitted together. Each carries its own model, system, messages, tools, and token limits. This independence is the architectural keystone — it is why one malformed request returns an invalid_request error while the other 99,999 complete untouched.
The path of a single request through the pipeline
When you call client.messages.batches.create(), the SDK serializes your requests array and POSTs it to /v1/messages/batches. The platform validates the envelope — array size, total payload under 256 MB, well-formed JSON — and synchronously returns a batch object with an id and a processing_status of in_progress. Nothing has been inferred yet; the body has merely been accepted into durable storage and fanned out into a work queue.
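Because the envelope check happens before any inference, it can be worth mirroring it on the client so an oversized batch never leaves your machine. The following is a minimal sketch, not the platform's actual validation: `check_envelope` is a hypothetical helper, the requests are assumed to be plain dicts, the serialized JSON length only approximates what the SDK sends, and treating 256 MB as 256 × 1024 × 1024 bytes is an assumption.

```python
import json

MAX_REQUESTS = 100_000            # per-batch request cap described above
MAX_BYTES = 256 * 1024 * 1024     # 256 MB cap; the binary interpretation is an assumption


def check_envelope(requests):
    """Return a list of problems; an empty list means the envelope looks acceptable.

    `requests` is a list of plain dicts shaped like
    {"custom_id": "...", "params": {...}}.
    """
    problems = []
    if not requests:
        problems.append("batch is empty")
    if len(requests) > MAX_REQUESTS:
        problems.append(f"{len(requests)} requests exceeds {MAX_REQUESTS}")

    # Approximate the wire size by serializing the structure we intend to send.
    size = len(json.dumps({"requests": requests}).encode("utf-8"))
    if size > MAX_BYTES:
        problems.append(f"payload is {size} bytes, over {MAX_BYTES}")
    return problems
```

If the list comes back non-empty, splitting the job into several smaller batches is usually simpler than trimming individual requests.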
```mermaid
flowchart TD
    A["client.messages.batches.create()"] --> B{"Envelope valid?\n<=100k reqs, <=256MB"}
    B -->|No| C["413 / 400 returned synchronously"]
    B -->|Yes| D["Batch persisted, id issued\nstatus: in_progress"]
    D --> E["Fan-out: each request enqueued\nwith its custom_id"]
    E --> F["Worker picks request\noff spare capacity"]
    F --> G{"Per-request outcome"}
    G -->|ok| H["succeeded: full Message"]
    G -->|bad input| I["errored: invalid_request"]
    G -->|infra| J["errored: api_error (retryable)"]
    H & I & J --> K["Write to 29-day result store\nkeyed by custom_id"]
    K --> L{"All requests terminal?"}
    L -->|No| F
    L -->|Yes| M["status: ended\nresults available"]
```

From the queue, individual workers pull requests as capacity frees up. This is the crucial scheduling decision: batch work is opportunistic. It runs in the gaps left by interactive traffic, which is why completion time is a window (up to 24 hours) rather than a promise. Most batches finish inside an hour because spare capacity is usually plentiful, but the contract is the ceiling, not the average.
Each worker executes the request exactly as the synchronous Messages API would — same model, same tool loop, same caching behavior. The result, whether a completed Message or an error object, is written into the result store under your custom_id. Only when every request in the batch has reached a terminal outcome does the batch flip to processing_status: ended.
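Waiting for that flip to ended can be sketched as a small polling loop. This is one possible shape, not an official helper: `wait_until_ended` and its `max_polls` and `sleep` parameters are hypothetical names, and `client` is assumed to be an `anthropic.Anthropic()` instance (or anything exposing `messages.batches.retrieve`). The sleep function is injectable so the loop can be exercised without waiting.

```python
import time


def wait_until_ended(client, batch_id, interval_s=60, max_polls=None, sleep=time.sleep):
    """Poll a batch until processing_status is "ended", then return the batch object.

    Raises TimeoutError if `max_polls` polls pass without reaching "ended".
    """
    polls = 0
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(
                f"batch {batch_id} still {batch.processing_status} after {polls} polls"
            )
        sleep(interval_s)  # stay well inside the 30-60 second cadence range
```

Setting `max_polls` to roughly 24 hours' worth of intervals gives the loop the same ceiling as the batch itself.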
Why per-request isolation changes how you design jobs
Because requests are isolated, the batch has no shared failure mode at the inference layer. A request that references a retired model fails with invalid_request; the rest proceed. This is liberating and dangerous in equal measure. Liberating, because you can mix models and prompt shapes freely in one batch. Dangerous, because a silent 2% error rate is easy to miss when you only inspect the succeeded results.
The request_counts object on the batch is your early-warning system. It breaks down into processing, succeeded, errored, canceled, and expired. Reconcile those numbers against the count you submitted before you trust the output. An errored result with type invalid_request means your payload was malformed — fix and resubmit only that request. An errored result that is an api_error is a transient server-side problem and is safe to retry as-is.
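The reconciliation and triage rules above fit in two small functions. This is a sketch under assumptions: `reconcile` and `next_action` are hypothetical helpers, `request_counts` is taken as a plain mapping (with the Python SDK you would likely convert the counts object first, for example with `model_dump()`, which is an assumption about the SDK's pydantic-based models), and treating every non-`invalid_request` error as retryable follows the rule stated in the text rather than an exhaustive list of error types.

```python
def reconcile(request_counts, submitted):
    """Compare a batch's request_counts against the number of requests submitted."""
    terminal = sum(
        request_counts[key] for key in ("succeeded", "errored", "canceled", "expired")
    )
    return {
        # Should be 0: every submitted request is either terminal or still processing.
        "unaccounted": submitted - terminal - request_counts["processing"],
        "not_succeeded": submitted - request_counts["succeeded"],
        "success_rate": request_counts["succeeded"] / submitted if submitted else 0.0,
    }


def next_action(result_type, error_type=None):
    """Map one per-request outcome to the follow-up described in the text."""
    if result_type == "succeeded":
        return "keep"
    if result_type == "errored":
        # Malformed payloads need a fix; other errors are treated as transient.
        return "fix_and_resubmit" if error_type == "invalid_request" else "retry_as_is"
    return "resubmit"  # canceled or expired: resubmit if the output is still wanted
```

A non-zero `unaccounted` value means the counts and your own bookkeeping disagree, which is worth investigating before reading any results.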
Where prompt caching fits the architecture
Prompt caching composes beautifully with batches, and the architecture is the reason. If a thousand requests share a 40 KB system preamble — an analyst persona plus a reference document — you mark that block with cache_control once and every request that renders the identical prefix reads it from cache at roughly a tenth of the input price, on top of the 50% batch discount.
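```python
import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# reference_doc (str) and questions (list[str]) are assumed to be defined by the caller.
shared_system = [
    {"type": "text", "text": "You are a contracts analyst."},
    # The cache breakpoint goes on the last block of the shared prefix.
    {"type": "text", "text": reference_doc, "cache_control": {"type": "ephemeral"}},
]

batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"clause-{i}",
            params=MessageCreateParamsNonStreaming(
                model="claude-opus-4-8",
                max_tokens=1024,
                system=shared_system,
                messages=[{"role": "user", "content": q}],
            ),
        )
        for i, q in enumerate(questions)
    ]
)
```

One caveat the pipeline imposes: cache entries become readable only after the first request that writes them begins processing. In a batch, requests sharing a prefix do not all start simultaneously, so cache reads accrue as the batch drains rather than all at once. You can still capture the savings, though hits are best-effort because requests run concurrently and in no guaranteed order; the timing just spreads across the run.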
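The stacking of the two discounts is easy to sanity-check with arithmetic. The sketch below is illustrative only: `batch_input_cost` and `sync_uncached_cost` are hypothetical helpers, the base price and token counts are made-up inputs, the 1.25x cache-write multiplier is an assumption not stated in this post, and the model assumes the discounts multiply and that a cache miss pays the write price again. The `hit_rate` parameter stands in for the fact that not every request in a batch is guaranteed a cache read.

```python
def batch_input_cost(n_requests, prefix_tokens, unique_tokens, base_price_per_mtok,
                     hit_rate=1.0, batch_discount=0.5,
                     cache_write_mult=1.25, cache_read_mult=0.10):
    """Estimate input-token cost (same currency as the price) for a cached batch.

    The first request always writes the cache; each later request reads it with
    probability `hit_rate`, and a miss is billed as another cache write.
    """
    if n_requests <= 0:
        return 0.0
    per_token = base_price_per_mtok / 1_000_000
    hits = (n_requests - 1) * hit_rate
    writes = n_requests - hits
    prefix_cost = prefix_tokens * per_token * (
        writes * cache_write_mult + hits * cache_read_mult
    )
    unique_cost = n_requests * unique_tokens * per_token
    return (prefix_cost + unique_cost) * (1 - batch_discount)


def sync_uncached_cost(n_requests, prefix_tokens, unique_tokens, base_price_per_mtok):
    """Baseline: every request pays the full price for every input token."""
    return n_requests * (prefix_tokens + unique_tokens) * base_price_per_mtok / 1_000_000
```

With 1,000 requests, a 10,000-token shared prefix, 100 unique tokens each, and a hypothetical base price of 10 per million tokens, the baseline is 101.0, the batch with perfect cache hits is about 5.56, and the batch with no hits at all is 63.0. Under these assumptions, the hit rate decides whether marking the prefix pays off.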
The result store and the 29-day clock
Results are not pushed to you; they are pulled. Once the batch is ended, you stream them with client.messages.batches.results(batch_id), which returns one entry per request, in no guaranteed order. This is the architectural reason custom_id is mandatory and must be unique: it is the only key that maps an output back to the row of input data it came from. If your custom_id encodes a database primary key or a file offset, reassembly is trivial. If it is just an array index you have since lost, you are stuck.
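Joining on custom_id rather than on position can be sketched as follows. `join_on_custom_id` is a hypothetical helper, and the result entries are assumed to be plain dicts with `custom_id` and `result` keys (SDK result objects would need converting to dicts first).

```python
def join_on_custom_id(inputs_by_id, results):
    """Attach each result to the input row it came from, ignoring arrival order.

    `inputs_by_id` maps custom_id -> your original row.
    Returns (joined, missing_ids, unknown_ids).
    """
    joined = {}
    unknown = []
    for entry in results:
        cid = entry["custom_id"]
        if cid not in inputs_by_id:
            unknown.append(cid)  # a result we never submitted: worth investigating
            continue
        joined[cid] = {"input": inputs_by_id[cid], "result": entry["result"]}
    # Inputs with no result at all, in submission order.
    missing = [cid for cid in inputs_by_id if cid not in joined]
    return joined, missing, unknown
```

Returning the missing and unknown IDs alongside the joined rows makes a lost or mismatched key visible instead of silently shortening the output.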
The store keeps results for 29 days, counted from batch creation. After that they are gone: the batch object may still be viewable, but its results can no longer be downloaded. Treat the batch as a transient compute layer, not a database: pull the results into your own storage as soon as the batch ends, then forget the batch ID.
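Pulling results into your own storage can be as simple as writing one JSON object per line. This is a minimal sketch: `persist_results` is a hypothetical helper, `out` is any writable text stream, and the `to_dict()` call on non-dict entries is an assumption about the SDK's result objects.

```python
import json


def persist_results(results, out):
    """Write each result entry to `out` as one JSON line; return how many were written."""
    written = 0
    for entry in results:
        # Plain dicts pass through; SDK objects are assumed to expose to_dict().
        record = entry if isinstance(entry, dict) else entry.to_dict()
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written
```

In practice you might open a file with `open("batch_results.jsonl", "w", encoding="utf-8")` and pass it as `out` together with the iterator returned by `client.messages.batches.results(batch_id)`. Comparing the returned count with the number of submitted requests is a cheap final check.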
Common pitfalls
- Polling too aggressively. The batch will not finish faster because you poll every second. Poll at a 30–60 second cadence; tighter polling just burns your request rate limit.
- Trusting succeeded-only output. Always reconcile request_counts. A job that submitted 50,000 and succeeded on 49,100 has 900 requests that did not succeed; find out which, and why, before downstream code consumes the file.
- Reusing custom_ids. Duplicate custom_id values make result reassembly ambiguous. Keep them unique and meaningful.
- Ignoring the 29-day window. Results are not permanent. Persist them immediately on ended.
- Assuming order. Results come back unordered. Never zip the results stream against your input array positionally; join on custom_id.
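The duplicate custom_id pitfall above is cheap to catch before submission. A minimal sketch, assuming the requests are plain dicts with a `custom_id` key; `duplicate_custom_ids` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_custom_ids(requests):
    """Return the sorted custom_id values that appear more than once."""
    counts = Counter(req["custom_id"] for req in requests)
    return sorted(cid for cid, n in counts.items() if n > 1)
```

An empty return value means every ID is unique; anything else should block the submit.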
Ship a batch in 6 steps
- Build a requests array where each entry has a unique, meaningful custom_id and a full params body.
- Mark any large shared prefix with cache_control to stack caching on top of the batch discount.
- Call batches.create() and store the returned id durably.
- Poll batches.retrieve(id) every 30–60s until processing_status == "ended".
- Reconcile request_counts against your submitted count; queue any invalid_request rows for repair.
- Stream batches.results(id), join each result on custom_id, and write to your own store before the 29-day clock runs out.
Batch vs. synchronous Messages API
| Dimension | Synchronous Messages API | Message Batches API |
|---|---|---|
| Latency | Seconds, streamed | Up to 24h, usually <1h |
| Token price | Standard | 50% off |
| Max requests per call | 1 | 100,000 (or 256 MB) |
| Result delivery | Inline response | Pulled from 29-day store |
| Best for | Chat, agents, anything interactive | Classification, summarization, backfills |
Frequently asked questions
Is a batch a single large prompt to Claude?
No. A batch is a container of independent Messages API requests submitted together. Each has its own model, system prompt, messages, and tools, and each succeeds or fails on its own. The grouping is purely a submission and billing convenience.
Why does the same job cost half as much through batches?
Because you relax the latency requirement, Anthropic can schedule the work opportunistically into spare capacity instead of reserving a low-latency lane. That scheduling freedom is what the 50% discount pays for.
What happens if my batch is not finished after 24 hours?
Any requests still unprocessed move to the expired terminal state. The succeeded ones remain available; you resubmit only the expired requests, identified by their custom_id.
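Selecting only the expired requests for a follow-up batch might look like this. It is a sketch with assumed shapes: `requests_to_resubmit` is a hypothetical helper, and both the original requests and the result entries are taken to be plain dicts.

```python
def requests_to_resubmit(original_requests, results, retry_types=("expired",)):
    """Return the original request dicts whose result type is in `retry_types`.

    The output follows the order of `results`, and the originals are reused
    unchanged so each resubmitted request keeps its custom_id.
    """
    by_id = {req["custom_id"]: req for req in original_requests}
    return [
        by_id[entry["custom_id"]]
        for entry in results
        if entry["result"]["type"] in retry_types and entry["custom_id"] in by_id
    ]
```

Passing `retry_types=("expired", "canceled")` would widen the selection if canceled requests should run again too.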
Do prompt caching and tool use work inside a batch?
Yes. A batched request runs the full Messages API, so vision, tool calls, structured outputs, and prompt caching all behave as they do synchronously. Caching savings stack on top of the batch discount.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.