Multi-Tenant Batching Strategies for Chat Agents in 2026
Batching async workloads across tenants can cut LLM costs by 50%. Here is when to use the OpenAI Batch API, when to use continuous batching, and how to attribute cost per tenant correctly.
The cost problem
Most chat agent traffic is not actually realtime. Post-call summaries, lead scoring, sentiment analysis, follow-up email drafting, knowledge-base re-indexing: all of these can wait minutes or hours. They do not need a 400ms voice-to-voice latency budget.
If you serve dozens or hundreds of tenants from one stack, you can batch this work to get 50%+ discounts. But you have to do it without leaking PII across tenants and without creating month-end attribution chaos.
How batching is priced
OpenAI Batch API (May 2026):
- 50% discount on input + output tokens vs synchronous
- 24-hour SLA (typically delivered in under an hour)
- Same models supported, same prompt caching applies
- Single line item per batch job
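A minimal submission flow with the official openai Python SDK looks like this; the file name, model, and custom_id scheme are illustrative assumptions, not a prescribed layout:

```python
# Sketch: submit a JSONL file of requests to the OpenAI Batch API.
from openai import OpenAI

client = OpenAI()

# Each JSONL line looks like:
# {"custom_id": "tenant-42-call-1001", "method": "POST",
#  "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [...]}}
batch_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch",
)

job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)  # poll client.batches.retrieve(job.id) until "completed"
```

Results come back as a JSONL output file keyed by custom_id, which is exactly the hook the attribution section below relies on.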
Anthropic Message Batches (May 2026):
- 50% discount on input + output tokens
- 24-hour SLA
- Compatible with prompt caching
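The equivalent sketch with the anthropic Python SDK, again with an illustrative model name and ID scheme:

```python
# Sketch: Anthropic Message Batches. Model and custom_id are assumptions.
import anthropic

client = anthropic.Anthropic()

# Each request carries a custom_id so results can be routed back per tenant.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "tenant-42-call-1001",
            "params": {
                "model": "claude-haiku-4-5",  # pick your actual model
                "max_tokens": 512,
                "messages": [{"role": "user", "content": "Summarize this call: ..."}],
            },
        },
    ],
)
print(batch.id, batch.processing_status)  # poll, then stream results once "ended"
```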
Continuous batching at the inference server level (vLLM, TGI):
- Not a vendor discount — an architectural pattern for self-hosted
- Throughput improvement 2–4× on the same GPU
- One $4/hr H100 effectively delivers the concurrent capacity of 2–4 GPUs
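There is nothing to buy here; vLLM's scheduler admits new requests into the running generation loop at every decoding step. A minimal offline sketch, assuming the model actually fits on your hardware (an fp16 70B does not fit on a single H100 without quantization or tensor parallelism):

```python
# Sketch: vLLM applies continuous batching internally across these prompts.
# Model and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")  # assumes quantized or multi-GPU fit
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarize call transcript #{i}: ..." for i in range(64)]
outputs = llm.generate(prompts, params)  # one engine, iteration-level scheduling
for out in outputs:
    print(out.outputs[0].text[:80])
```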
Honest math
Profile A — 10,000 post-call summaries per day, 8k input + 1k output, GPT-4o-mini:
- Synchronous: 10k × (8k × $0.15/M + 1k × $0.60/M) = 10k × ($0.0012 + $0.0006) = $18/day
- Add prompt caching (90% cache hit rate): ≈$3.60/day
- Batch API alone (no cache): $9/day
- Batch + cache combined: ≈$1.80/day, roughly 90% cheaper than naive sync
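The sync and batch lines are straightforward to reproduce; the cached figures depend on assumptions the profile leaves implicit, flagged in the comments:

```python
# Reproducing Profile A (GPT-4o-mini list prices, $/token).
reqs, tok_in, tok_out = 10_000, 8_000, 1_000
p_in, p_out = 0.15 / 1e6, 0.60 / 1e6

sync = reqs * (tok_in * p_in + tok_out * p_out)  # $18.00/day
batch = 0.5 * sync                               # 50% off everything: $9.00/day
print(f"sync ${sync:.2f}/day, batch ${batch:.2f}/day")
# The $3.60 and $1.80 figures assume caching wipes out ~80% of total spend
# (90% hit rate on a large shared prefix); batch then halves the remainder.
```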
Profile B — 50k embedding jobs per day for retrieval, text-embedding-3-large:
- Synchronous: 50k × 2k tokens avg × $0.13/M = $13/day
- Batch: $6.50/day
Profile C — 100 self-hosted Llama-3-70B inferences per minute, vLLM continuous batching on 1 × H100:
- Without continuous batching: 25 reqs/sec sustained max
- With continuous batching: 80 reqs/sec sustained
- Same hardware ($3.95/hr Modal H100), 3.2× throughput
- Effective $/req drops 70%
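The per-request economics follow directly from throughput:

```python
# Effective $/request on a $3.95/hr H100 at each throughput level.
gpu_per_sec = 3.95 / 3600
for label, rps in [("static batching", 25), ("continuous batching", 80)]:
    print(f"{label}: ${gpu_per_sec / rps:.6f}/req")
# 80 / 25 = 3.2x throughput, so cost per request falls ~69%
```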
Multi-tenant attribution gotchas
- Pre-batch enrichment leaks identity. If you bake tenant_id into the prompt text, tenants no longer share a common prompt prefix, so prompt caching breaks; keep tenant identity in custom_id, not in the prompt.
- Post-batch routing is required. You need a job ID → tenant_id mapping table to fan out results (see the sketch after this list).
- Per-tenant cost tracking. Without explicit cost attribution, 3% of tenants typically eat 60% of tokens. We have seen this on every multi-tenant deployment.
- Latency variance. Some tenants will tolerate batch latency, others will not. Add a per-tenant policy.
- PII isolation. Batches that span tenants need PII redaction or pre-tagging.
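A sketch of that fan-out, assuming OpenAI-style result rows and an in-memory job-to-tenant map; production would use a real mapping table, and the prices are illustrative:

```python
# Sketch: route batch results back to tenants and accumulate per-tenant cost.
# Row shape follows the OpenAI Batch API output JSONL.
import json
from collections import defaultdict

P_IN, P_OUT = 0.075 / 1e6, 0.30 / 1e6  # batch-discounted gpt-4o-mini, $/token

def fan_out(results_jsonl: str, job_to_tenant: dict[str, str]) -> dict[str, float]:
    per_tenant_cost: dict[str, float] = defaultdict(float)
    for line in results_jsonl.splitlines():
        row = json.loads(line)
        tenant_id = job_to_tenant[row["custom_id"]]  # job ID -> tenant_id
        if row.get("error"):
            continue  # park failed rows for per-tenant retry instead
        usage = row["response"]["body"]["usage"]
        per_tenant_cost[tenant_id] += (
            usage["prompt_tokens"] * P_IN + usage["completion_tokens"] * P_OUT
        )
        # deliver row["response"]["body"]["choices"][0]["message"] to the tenant here
    return dict(per_tenant_cost)
```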
How CallSphere optimizes
CallSphere runs three batching patterns across 6 verticals — 37 agents, 90+ tools, 115+ DB tables:
1. Post-call analytics (Healthcare, Sales, OneRoof Real Estate). Every call ends with a summary, sentiment score (–1 to +1), and lead score (0–100). These are not realtime; they queue and run in 5-minute Batch API windows (a minimal queue-and-flush sketch follows this list). The 50% Batch discount on top of the 90% prompt-cache discount is ~95% off vs naive sync.
2. Lead scoring and follow-up draft generation (Sales, Salon GlamBook). Daily batch run scores yesterday's leads and drafts tomorrow's outreach mail through the email_marketing pipeline. Generated mails are wrapped by the existing GTM v7 HTML template. Cost: under $40/day across all 6 verticals.
3. Knowledge base re-indexing. Whenever a tenant uploads new docs, we batch the embeddings via OpenAI Batch API and pay 50% less for vector index builds. Average tenant onboarding embedding cost: $0.40 vs $0.80 sync.
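The queue-and-flush pattern behind pattern 1 is small. Here is a sketch under the assumption that jobs arrive on an in-process queue; submit_batch is a stand-in for the JSONL upload shown earlier:

```python
# Sketch: accumulate async jobs and flush them as one batch every 5 minutes.
import json
import queue
import threading
import time

jobs: "queue.Queue[dict]" = queue.Queue()
FLUSH_SECONDS = 300  # 5-minute windows

def submit_batch(jsonl_lines: list[str]) -> None:
    ...  # write JSONL, upload, client.batches.create(...) as sketched above

def flush_loop() -> None:
    while True:
        time.sleep(FLUSH_SECONDS)
        pending: list[str] = []
        while not jobs.empty():
            pending.append(json.dumps(jobs.get_nowait()))
        if pending:
            submit_batch(pending)

threading.Thread(target=flush_loop, daemon=True).start()
```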
Per-tenant cost attribution lives in our Postgres ledger — every API call is tagged with tenant_id, vertical, agent_id, and cost in micro-dollars. Without that ledger, the pricing tiers ($149 / $499 / $1499) would not be sustainable. The ROI calculator on the site reads from the same ledger to show prospective customers what they would actually pay. Try it on the 14-day no-card trial.
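A hedged sketch of such a ledger write with psycopg; the table and column names are illustrative, not CallSphere's actual schema. Integer micro-dollars avoid float drift when summing millions of rows:

```python
# Sketch: per-call cost ledger write. Schema is illustrative.
import psycopg

LEDGER_INSERT = """
    INSERT INTO llm_cost_ledger (tenant_id, vertical, agent_id, cost_microdollars)
    VALUES (%s, %s, %s, %s)
"""

def record_cost(conn: psycopg.Connection, tenant_id: str, vertical: str,
                agent_id: str, cost_usd: float) -> None:
    with conn.cursor() as cur:
        cur.execute(LEDGER_INSERT,
                    (tenant_id, vertical, agent_id, round(cost_usd * 1_000_000)))
    conn.commit()
```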
Optimization checklist
- Identify async workloads — anything that can wait 5+ minutes.
- Move post-call analytics, lead scoring, and embedding jobs to Batch API.
- Combine Batch API with prompt caching — both discounts stack.
- Build a per-tenant cost ledger from day one.
- Tag every span with tenant_id; tenant-less spans are a debugging nightmare.
- Use continuous batching (vLLM) only on self-hosted — vendors handle it server-side.
- Set per-tenant rate limits so one tenant cannot blow the batch budget (see the sketch after this checklist).
- Pre-warm batches at off-peak times to smooth GPU cost.
- Watch your p99 completion time: the Batch API rarely takes the full 24-hour SLA, but plan for the case where it does.
- Re-evaluate which workloads are truly realtime; most "live" chat features tolerate 200ms of buffering.
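A minimal per-tenant budget gate, referenced from the rate-limit item above; the budgets and in-memory counters are assumptions (use Redis or the cost ledger in production):

```python
# Sketch: token-budget admission check before a job enters the batch queue.
from collections import defaultdict

DAILY_TOKEN_BUDGET = {"default": 2_000_000}  # illustrative budgets, tokens/day
used_today: dict[str, int] = defaultdict(int)

def admit(tenant_id: str, estimated_tokens: int) -> bool:
    budget = DAILY_TOKEN_BUDGET.get(tenant_id, DAILY_TOKEN_BUDGET["default"])
    if used_today[tenant_id] + estimated_tokens > budget:
        return False  # defer to a later window instead of blowing the budget
    used_today[tenant_id] += estimated_tokens
    return True
```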
FAQ
What is the OpenAI Batch API? A separate endpoint that accepts a JSONL file of requests and returns results within 24 hours at a 50% discount.
Can I use prompt caching with Batch API? Yes — both discounts stack. We routinely combine them.
How do I attribute cost across tenants in a batched job? Tag every input row with tenant_id; the cost ledger entry references the tenant on the way out.
What is continuous batching? Server-side technique (vLLM, TensorRT-LLM, TGI) that batches multiple incoming requests into a single GPU forward pass, increasing throughput 2–4×.
When should I avoid batching? Anything user-facing with sub-second latency requirements, anything where input data has not been generated yet at batch submission time.
Sources
- OpenAI Batch API docs — https://platform.openai.com/docs/guides/batch
- Anthropic Message Batches — https://platform.claude.com/docs/en/build-with-claude/message-batches
- TokenMix Batch API pricing — https://tokenmix.ai/blog/openai-batch-api-pricing
- Mavik Labs LLM cost optimization — https://www.maviklabs.com/blog/llm-cost-optimization-2026
- Paxrel agent cost optimization guide — https://paxrel.com/blog-ai-agent-cost-optimization
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.