A real Message Batches API project, end to end

A realistic Claude Message Batches API project from problem to shipped outcome: enriching 600k product records overnight, with the real decisions and gotchas.

Most writing about batch processing stays abstract — submit requests, poll, fetch results — and skips the part that actually consumes a sprint: the messy middle. So this post follows one realistic project from the moment the problem lands on your desk to the moment the result ships. The scenario is composite but every decision and gotcha is drawn from real batch work on the Message Batches API. The task: a catalog of roughly 600,000 product records, each with a sparse, inconsistent description, needs clean structured attributes — category, key features, and a one-line summary — so the search and recommendation teams can use them. There is no budget for a human to touch each row, and the data team wants it by next week.

Key takeaways

  • A real batch project is mostly data shaping and verification, not prompt writing.
  • Start with a 100-row pilot to lock the prompt and schema before spending on volume.
  • Use stable custom_ids so the inevitable re-runs are safe and cheap.
  • Land results in staging, validate, then promote — never write straight to the catalog.
  • Budget time for the tail: the last few percent of weird rows takes as long as the first 95%.

Day one: framing the problem, not the prompt

The instinct is to start writing the prompt. The right first move is to define the output contract. What exactly does "clean structured attributes" mean? After a short conversation with the search team, we settle on a strict JSON shape: category from a fixed list of 40 values, features as an array of up to five short strings, and summary as a single sentence under 120 characters. Fixing this contract first does two things: it gives the eval something concrete to check, and it makes malformed output detectable rather than a judgment call.
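
One way to make that contract checkable is a small validator that returns a reason code instead of a bare pass/fail. The sketch below is illustrative only: `ALLOWED_CATEGORIES` is a three-value stand-in for the real list of 40, and the function name and reason codes are assumptions, not part of the project described here.

```python
import json

# Hypothetical stand-in for the real fixed list of 40 categories.
ALLOWED_CATEGORIES = {"kitchen", "garden", "lighting"}

MAX_FEATURES = 5
MAX_SUMMARY_CHARS = 120  # the contract says "under 120 characters"


def validate_record(raw_text, allowed=ALLOWED_CATEGORIES):
    """Check one model response against the output contract.

    Returns (ok, reason_code). The "single sentence" rule is not checked
    here; only the length limit on the summary is enforced.
    """
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError:
        return False, "unparseable_json"
    if not isinstance(record, dict):
        return False, "not_an_object"
    if set(record) != {"category", "features", "summary"}:
        return False, "wrong_keys"

    category = record["category"]
    features = record["features"]
    summary = record["summary"]

    if not isinstance(category, str) or category not in allowed:
        return False, "category_not_allowed"
    if not isinstance(features, list) or len(features) > MAX_FEATURES:
        return False, "bad_features"
    if not all(isinstance(f, str) and f.strip() for f in features):
        return False, "bad_features"
    if not isinstance(summary, str) or len(summary) >= MAX_SUMMARY_CHARS:
        return False, "bad_summary"
    return True, "ok"
```

Returning a reason code rather than a boolean is what later lets failed rows be grouped by cause instead of read one at a time.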

Next we pick a model. This is a high-volume, bounded extraction task, not deep reasoning, so a smaller, faster model is the right default — reserve the larger models for the genuinely hard rows. We will start everything on the smaller model and only escalate the rows that fail validation. That tiering decision alone shapes the cost and the timeline.

Day two: the pilot that saves the week

Before touching 600,000 rows, we run 100. We hand-pick them to include the ugly cases: empty descriptions, descriptions in the wrong language, joke entries, and a few perfect ones. We submit this tiny batch, fetch the results, and read every single one by hand. This is the highest-value hour of the whole project. The diagram below is the loop we run until the pilot passes.

flowchart TD
  A["Pick 100 hard rows"] --> B["Submit pilot batch"]
  B --> C["Read every result by hand"]
  C --> D{"Schema & quality OK?"}
  D -->|No| E["Fix prompt / contract"]
  E --> B
  D -->|Yes| F["Lock prompt + eval set"]
  F --> G["Scale to full 600k"]

The pilot surfaces exactly the problems we hoped it would. The model invents categories outside our list of 40 — fixed by listing the allowed values explicitly in the prompt and validating against them. Empty descriptions produce confident hallucinated features — fixed by instructing the model to return a null category and empty arrays when the input is too sparse, which we then route to a human queue rather than the catalog. By the end of day two we have a prompt we trust and a 100-row eval set that encodes "good output." That eval set is now the contract every future run is graded against.
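
To make "graded against the eval set" concrete, here is a minimal grading sketch. It assumes the eval set is stored as a mapping from custom_id to the hand-checked category (`None` for rows judged too sparse), and that model outputs have already been parsed into dicts. Both shapes, and the failure codes, are assumptions for illustration.

```python
def grade(eval_set, outputs):
    """Compare parsed model outputs to hand-checked expectations.

    eval_set: {custom_id: expected_category or None for sparse rows}
    outputs:  {custom_id: parsed output dict with "category" and "features"}
    Returns (pass_rate, failures), where failures is a list of
    (custom_id, reason_code) tuples in eval-set order.
    """
    if not eval_set:
        raise ValueError("eval set is empty")
    failures = []
    for custom_id, expected in eval_set.items():
        got = outputs.get(custom_id)
        if got is None:
            failures.append((custom_id, "missing_output"))
        elif got.get("category") != expected:
            failures.append((custom_id, "category_mismatch"))
        elif expected is None and got.get("features"):
            # Sparse input must yield empty features, not invented ones.
            failures.append((custom_id, "features_on_sparse_input"))
    passed = len(eval_set) - len(failures)
    return passed / len(eval_set), failures
```

A prompt change is then accepted only if the pass rate on the same 100 rows does not drop, which turns "the prompt feels better" into a number.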

Day three: building the request set

Now we generate the full request set. The key discipline is the custom_id: it is derived from the product's stable SKU, so every request maps unambiguously back to its row and re-runs are idempotent. Here is the core of the builder.

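# PROMPT (a str.format template with {allowed} and {desc} slots), load_categories(),
# catalog, and submit_in_chunks() are assumed to be defined elsewhere in the project.
ALLOWED = load_categories()  # 40 fixed values

def build(row):
    """Build one batch request; the custom_id is derived from the stable SKU."""
    return {
        "custom_id": f"sku_{row['sku']}",  # same SKU -> same id, so re-runs are idempotent
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 300,  # hard ceiling on output cost per row
            "messages": [{"role": "user", "content": PROMPT.format(
                allowed=", ".join(ALLOWED),
                desc=row["description"] or "(empty)")}],
        },
    }

requests = [build(r) for r in catalog]   # 600k of these
submit_in_chunks(requests)               # respect per-batch size limits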

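Two practical notes. First, we chunk the submission to respect per-batch size limits rather than assuming one giant batch. At the time of writing the documented ceiling is 100,000 requests or 256 MB per batch, whichever is reached first, so 600,000 rows means at least six batches. Second, we cap max_tokens at 300: generous for our schema, but a hard ceiling so no single runaway response can balloon the bill.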
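
The chunking step can be sketched as a generator that closes a chunk when either a request count or a byte budget would be exceeded. The two ceilings below are assumptions to verify against the current API limits, and the byte count is only an approximation of the real request body, so leaving headroom is sensible.

```python
import json

# Assumed ceilings; confirm against the current Message Batches API limits.
MAX_REQUESTS_PER_BATCH = 100_000
MAX_BYTES_PER_BATCH = 256 * 1024 * 1024


def chunk_requests(requests, max_count=MAX_REQUESTS_PER_BATCH,
                   max_bytes=MAX_BYTES_PER_BATCH):
    """Yield lists of requests that each stay within both ceilings.

    Size is estimated from each request's JSON serialization, which slightly
    undercounts the final HTTP body (no outer wrapper or separators).
    """
    chunk, chunk_bytes = [], 0
    for request in requests:
        size = len(json.dumps(request).encode("utf-8"))
        if size > max_bytes:
            raise ValueError(f"request {request['custom_id']} exceeds max_bytes on its own")
        # Close the current chunk before it would break either limit.
        if chunk and (len(chunk) >= max_count or chunk_bytes + size > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(request)
        chunk_bytes += size
    if chunk:
        yield chunk
```

Each chunk could then be passed to the Anthropic Python SDK's `client.messages.batches.create(requests=chunk)`, recording the returned batch id next to the range of SKUs it covers so every result file can be traced back to its submission.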

Day four: the run, and the tail

We submit, then poll. The bulk finishes well within the batch window. Now the part nobody budgets for: the tail. Out of 600,000 rows, a few thousand come back errored or with output that fails validation, and a few hundred expired, meaning the batch's processing window closed before those requests were handled. We do not panic, because the architecture anticipated this. Every failed row is quarantined with a reason code. We bucket them: the expired ones are resubmitted unchanged under the same custom_id; the unparseable ones get a single retry; the sparse-input ones go to the human queue by design; the few genuinely hard rows get re-submitted to a larger model as a second, tiny batch. This escalation batch is where the smaller-model-first decision pays off: we spend premium tokens on only the rows that need them.
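
The bucketing can be sketched as one function that maps each result line to a bucket and a reason code. The sketch assumes each line of the results file has been parsed into a dict with a `custom_id` and a `result` whose `type` is `succeeded`, `errored`, `canceled`, or `expired`. The bucket names, the reason codes, and the choice to retry API errors once are assumptions made for illustration.

```python
import json


def triage(line, validate, attempt=1):
    """Map one parsed line of the batch results file to (bucket, reason_code).

    line:     dict with "custom_id" and "result" keys
    validate: callable taking the parsed record, returning (ok, reason_code)
    attempt:  1 for the first run, 2 for the single retry
    """
    result = line["result"]
    kind = result["type"]
    if kind in ("expired", "canceled"):
        return "resubmit", kind                      # never processed
    if kind == "errored":
        return ("retry", "api_error") if attempt == 1 else ("quarantine", "api_error")

    message = result["message"]
    text = "".join(b["text"] for b in message["content"] if b["type"] == "text")
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        # A response cut off at max_tokens usually shows up as broken JSON.
        truncated = message.get("stop_reason") == "max_tokens"
        reason = "truncated" if truncated else "unparseable_json"
        return ("retry", reason) if attempt == 1 else ("escalate", reason)

    if isinstance(record, dict) and "category" in record and record["category"] is None:
        return "human_queue", "sparse_input"         # by design, not a failure
    ok, reason = validate(record)
    return ("staging", "ok") if ok else ("escalate", reason)
```

Because the function is pure, the same triage can be replayed over stored result files whenever the validation rules change, without calling the API again.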

By the end of day four, 600,000 rows have either a validated result in staging or a documented reason for being in a queue. Nothing has touched the production catalog yet.

Day five: reconcile, promote, ship

Reconciliation is the final gate. We count rows by final state: succeeded, errored, expired, quarantined, human-queued. Each row carries exactly one of these states (a row that was retried is counted under its last outcome), so the sum must equal 600,000 exactly. It does, and that equality is the single most important number in the project, because it proves no row silently vanished. Then we promote staging to the catalog in a transaction keyed by SKU, so the write is reversible. The search team gets clean attributes; we hand them the eval set and the quarantine report so they trust the numbers.
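
A minimal sketch of that reconciliation check, assuming final states are kept in a mapping keyed by custom_id so that no row can hold two states at once. The function name and the error messages are illustrative.

```python
from collections import Counter


def reconcile(source_ids, final_states):
    """Verify every source row has exactly one final state, then count states.

    source_ids:   list of custom_ids built from the source catalog
    final_states: {custom_id: state}; a dict gives each id at most one state
    Raises ValueError if any id is duplicated, missing, or unexpected.
    """
    source = set(source_ids)
    if len(source) != len(source_ids):
        raise ValueError("duplicate ids in source")
    seen = set(final_states)
    missing = source - seen
    unexpected = seen - source
    if missing or unexpected:
        raise ValueError(f"{len(missing)} missing, {len(unexpected)} unexpected")
    # At this point the per-state counts necessarily sum to the source size.
    return Counter(final_states.values())
```

Checking set equality on ids, rather than only comparing totals, also catches the case where one lost row and one stray row happen to cancel out.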

Before/after: what the project actually delivered

| Dimension | Before | After |
| --- | --- | --- |
| Catalog coverage | Sparse, inconsistent | Structured attributes on ~99% of rows |
| Human effort | Infeasible by hand | One queue of sparse rows only |
| Cost profile | Unknown | Capped, mostly small-model |
| Trust | None | Eval set + exact row reconciliation |

What the retrospective taught us

A week later, the search team found a small systematic issue: products in one obscure category were being summarized as if they belonged to an adjacent one. This is exactly the kind of error that no per-row check catches, because each individual summary looked plausible — the bias only showed up in aggregate. Because we had staged and kept the eval set, the fix was contained: we added two examples of the confusing category to the prompt, re-ran only the affected SKUs as a tiny targeted batch keyed by their stable IDs, and promoted the corrections. No full re-run, no downtime, no scramble. That is the payoff of the discipline — the architecture turned a scary "the catalog is wrong" into a routine 200-row correction.
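
A targeted correction like this reduces to two small operations: select the affected ids from staging, then overwrite just those records once the re-run comes back. The sketch below assumes staging is held as a mapping from custom_id to record; the function names are hypothetical.

```python
def select_for_rerun(staging, suspect_category):
    """Return the custom_ids whose staged category is the suspect one."""
    return sorted(cid for cid, rec in staging.items()
                  if rec["category"] == suspect_category)


def apply_corrections(staging, corrections):
    """Return a new staging mapping with corrected records swapped in.

    Refuses ids that were never staged, so a typo in the re-run cannot
    introduce rows the original project never produced.
    """
    unknown = set(corrections) - set(staging)
    if unknown:
        raise KeyError(f"corrections for unknown ids: {sorted(unknown)}")
    updated = dict(staging)      # shallow copy; the original mapping is untouched
    updated.update(corrections)
    return updated
```

Because the ids are stable, applying the same corrections twice gives the same result, which is what makes a partial re-run safe to repeat.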

The broader lesson is that batch projects are never truly "done" at promotion. The honest finish line is when you have handed over the eval set, the quarantine report, and the runbook for re-running a slice — so that the next person who spots a problem can fix a few hundred rows instead of redoing six hundred thousand. Build for that handoff from day one and the project ages well.

Common pitfalls in a project like this

  • Skipping the pilot. Running full volume before locking the prompt means discovering the category bug 600,000 rows too late.
  • No reason codes on failures. Quarantining without recording why turns the tail into an unsolvable mystery.
  • One model for everything. Using a large model for all rows wastes money; using a small model for the hard rows wastes quality. Tier them.
  • Writing straight to the catalog. Without staging there is no clean rollback when someone spots a systematic error next week.
  • Forgetting expired rows. The tail includes requests that expired because the batch's processing window closed before they were handled; if you do not requeue them, your coverage quietly drops.
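
The staging pitfall is the easiest one to guard against in code. Below is a minimal sketch of a reversible promotion using SQLite from the Python standard library. The table names, columns, and the `batch_tag` label are all hypothetical, and a production catalog would likely sit in a different database, but the pattern is the same: record the previous values, write inside one transaction, and keep enough history to undo it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE catalog_attrs (sku TEXT PRIMARY KEY, category TEXT, summary TEXT);
CREATE TABLE staging       (sku TEXT PRIMARY KEY, category TEXT, summary TEXT);
CREATE TABLE history (batch_tag TEXT, sku TEXT, existed INTEGER,
                      prev_category TEXT, prev_summary TEXT);
"""


def promote(conn, batch_tag):
    """Copy staging into the catalog, saving prior values, in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            """INSERT INTO history (batch_tag, sku, existed, prev_category, prev_summary)
               SELECT ?, s.sku, c.sku IS NOT NULL, c.category, c.summary
               FROM staging s LEFT JOIN catalog_attrs c ON c.sku = s.sku""",
            (batch_tag,))
        conn.execute(
            """INSERT OR REPLACE INTO catalog_attrs (sku, category, summary)
               SELECT sku, category, summary FROM staging""")


def rollback(conn, batch_tag):
    """Undo one promotion: delete rows it created, restore rows it overwrote."""
    with conn:
        conn.execute(
            """DELETE FROM catalog_attrs WHERE sku IN
               (SELECT sku FROM history WHERE batch_tag = ? AND existed = 0)""",
            (batch_tag,))
        conn.execute(
            """INSERT OR REPLACE INTO catalog_attrs (sku, category, summary)
               SELECT sku, prev_category, prev_summary
               FROM history WHERE batch_tag = ? AND existed = 1""",
            (batch_tag,))
```

Tagging each promotion means a later targeted correction can be rolled back on its own without disturbing the original full run.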

Run your own batch project in five steps

  1. Define the output contract (strict schema) before writing any prompt.
  2. Run a 100-row hard-case pilot and read every result by hand.
  3. Build requests with stable custom_ids and capped max_tokens.
  4. Quarantine failures with reason codes; escalate only hard rows to a larger model.
  5. Reconcile exact counts, then promote from staging in a reversible write.

Frequently asked questions

How long does a project like this really take?

The model work is fast; the verification work sets the timeline. Expect roughly half your time on framing, piloting, and reconciliation, and the rest on the run and its tail. The last few percent of weird rows often takes as long as the first 95%.

Why run a 100-row pilot instead of just starting?

Because a prompt bug found at 100 rows costs an hour and a prompt bug found at 600,000 rows costs the whole budget. The pilot locks the prompt and produces the eval set that grades every later run.

When should I escalate rows to a larger model?

Only after a smaller model has tried and the output failed validation. Tiering — small model first, large model for the failures — gives you most of the quality at a fraction of the cost, because premium tokens go only to genuinely hard rows.

How do I prove to stakeholders that nothing was lost?

Reconcile exact counts: give every row exactly one final state, then check that succeeded plus errored plus expired plus quarantined plus human-queued equals your source row count. That single equality is the proof that no row silently disappeared, and it is what earns the data's trust.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
