Multi-Tenant Batching Strategies for Chat Agents in 2026
Batching async workloads across tenants can cut LLM costs by 50%. Here is when to use the OpenAI Batch API, when to use continuous batching, and how to attribute cost per tenant correctly.
The cost problem
Most chat agent traffic is not actually realtime. Post-call summaries, lead scoring, sentiment analysis, follow-up email drafting, knowledge-base re-indexing: all of these can wait minutes or hours. They do not need a 400ms voice-to-voice latency budget.
If you serve dozens or hundreds of tenants from one stack, you can batch this work to get 50%+ discounts. But you have to do it without leaking PII across tenants and without creating month-end attribution chaos.
How batching is priced
OpenAI Batch API (May 2026):
- 50% discount on input + output tokens vs synchronous
- 24-hour SLA (typically delivered in under an hour)
- Same models supported, same prompt caching applies
- Single line item per batch job
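A minimal submission flow with the official openai Python SDK looks like this; the file name, model, and custom_id scheme are illustrative assumptions, not a prescribed layout:

```python
# Sketch: submit a JSONL file of requests to the OpenAI Batch API.
from openai import OpenAI

client = OpenAI()

# Each JSONL line looks like:
# {"custom_id": "tenant-42-call-1001", "method": "POST",
#  "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [...]}}
batch_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch",
)

job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)  # poll client.batches.retrieve(job.id) until "completed"
```

Results come back as a JSONL output file keyed by custom_id, which is exactly the hook the attribution section below relies on.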
Anthropic Message Batches (May 2026):
- 50% discount on input + output tokens
- 24-hour SLA
- Compatible with prompt caching
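The equivalent sketch with the anthropic Python SDK, again with an illustrative model name and ID scheme:

```python
# Sketch: Anthropic Message Batches. Model and custom_id are assumptions.
import anthropic

client = anthropic.Anthropic()

# Each request carries a custom_id so results can be routed back per tenant.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "tenant-42-call-1001",
            "params": {
                "model": "claude-haiku-4-5",  # pick your actual model
                "max_tokens": 512,
                "messages": [{"role": "user", "content": "Summarize this call: ..."}],
            },
        },
    ],
)
print(batch.id, batch.processing_status)  # poll, then stream results once "ended"
```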
Continuous batching at the inference server level (vLLM, TGI):
- Not a vendor discount — an architectural pattern for self-hosted
- Throughput improvement 2–4× on the same GPU
- One $4/hr H100 effectively delivers the concurrent capacity of 2–4 GPUs
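There is nothing to buy here; vLLM's scheduler admits new requests into the running generation loop at every decoding step. A minimal offline sketch, assuming the model actually fits on your hardware (an fp16 70B does not fit on a single H100 without quantization or tensor parallelism):

```python
# Sketch: vLLM applies continuous batching internally across these prompts.
# Model and sampling settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")  # assumes quantized or multi-GPU fit
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarize call transcript #{i}: ..." for i in range(64)]
outputs = llm.generate(prompts, params)  # one engine, iteration-level scheduling
for out in outputs:
    print(out.outputs[0].text[:80])
```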
Honest math
Profile A — 10,000 post-call summaries per day, 8k input + 1k output, GPT-4o-mini:
- Synchronous: 10k × (8k × $0.15/M + 1k × $0.60/M) = 10k × ($0.0012 + $0.0006) = $18/day
- Add prompt caching (90% cache hit rate): ≈$3.60/day
- Batch API alone (no cache): $9/day
- Batch + cache combined: ≈$1.80/day, roughly 90% cheaper than naive sync
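The sync and batch lines are straightforward to reproduce; the cached figures depend on assumptions the profile leaves implicit, flagged in the comments:

```python
# Reproducing Profile A (GPT-4o-mini list prices, $/token).
reqs, tok_in, tok_out = 10_000, 8_000, 1_000
p_in, p_out = 0.15 / 1e6, 0.60 / 1e6

sync = reqs * (tok_in * p_in + tok_out * p_out)  # $18.00/day
batch = 0.5 * sync                               # 50% off everything: $9.00/day
print(f"sync ${sync:.2f}/day, batch ${batch:.2f}/day")
# The $3.60 and $1.80 figures assume caching wipes out ~80% of total spend
# (90% hit rate on a large shared prefix); batch then halves the remainder.
```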
Profile B — 50k embedding jobs per day for retrieval, text-embedding-3-large:
- Synchronous: 50k × 2k tokens avg × $0.13/M = $13/day
- Batch: $6.50/day
Profile C — 100 self-hosted Llama-3-70B inferences per minute, vLLM continuous batching on 1 × H100:
- Without continuous batching: 25 reqs/sec sustained max
- With continuous batching: 80 reqs/sec sustained
- Same hardware ($3.95/hr Modal H100), 3.2× throughput
- Effective $/req drops 70%
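The per-request economics follow directly from throughput:

```python
# Effective $/request on a $3.95/hr H100 at each throughput level.
gpu_per_sec = 3.95 / 3600
for label, rps in [("static batching", 25), ("continuous batching", 80)]:
    print(f"{label}: ${gpu_per_sec / rps:.6f}/req")
# 80 / 25 = 3.2x throughput, so cost per request falls ~69%
```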
Multi-tenant attribution gotchas
- Pre-batch enrichment leaks identity. If you bake tenant_id into the prompt text, tenants no longer share a common prompt prefix, so prompt caching breaks; keep tenant identity in custom_id, not in the prompt.
- Post-batch routing is required. You need a job ID → tenant_id mapping table to fan out results (see the sketch after this list).
- Per-tenant cost tracking. Without explicit cost attribution, 3% of tenants typically eat 60% of tokens. We have seen this on every multi-tenant deployment.
- Latency variance. Some tenants will tolerate batch latency, others will not. Add a per-tenant policy.
- PII isolation. Batches that span tenants need PII redaction or pre-tagging.
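A sketch of that fan-out, assuming OpenAI-style result rows and an in-memory job-to-tenant map; production would use a real mapping table, and the prices are illustrative:

```python
# Sketch: route batch results back to tenants and accumulate per-tenant cost.
# Row shape follows the OpenAI Batch API output JSONL.
import json
from collections import defaultdict

P_IN, P_OUT = 0.075 / 1e6, 0.30 / 1e6  # batch-discounted gpt-4o-mini, $/token

def fan_out(results_jsonl: str, job_to_tenant: dict[str, str]) -> dict[str, float]:
    per_tenant_cost: dict[str, float] = defaultdict(float)
    for line in results_jsonl.splitlines():
        row = json.loads(line)
        tenant_id = job_to_tenant[row["custom_id"]]  # job ID -> tenant_id
        if row.get("error"):
            continue  # park failed rows for per-tenant retry instead
        usage = row["response"]["body"]["usage"]
        per_tenant_cost[tenant_id] += (
            usage["prompt_tokens"] * P_IN + usage["completion_tokens"] * P_OUT
        )
        # deliver row["response"]["body"]["choices"][0]["message"] to the tenant here
    return dict(per_tenant_cost)
```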
How CallSphere optimizes
CallSphere runs three batching patterns across 6 verticals — 37 agents, 90+ tools, 115+ DB tables:
1. Post-call analytics (Healthcare, Sales, OneRoof Real Estate). Every call ends with a summary, sentiment score (–1 to +1), and lead score (0–100). These are not realtime; they queue and run in 5-minute Batch API windows (a minimal queue-and-flush sketch follows this list). The 50% Batch discount on top of the 90% prompt-cache discount is ~95% off vs naive sync.
2. Lead scoring and follow-up draft generation (Sales, Salon GlamBook). Daily batch run scores yesterday's leads and drafts tomorrow's outreach mail through the email_marketing pipeline. Generated mails are wrapped by the existing GTM v7 HTML template. Cost: under $40/day across all 6 verticals.
3. Knowledge base re-indexing. Whenever a tenant uploads new docs, we batch the embeddings via OpenAI Batch API and pay 50% less for vector index builds. Average tenant onboarding embedding cost: $0.40 vs $0.80 sync.
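The queue-and-flush pattern behind pattern 1 is small. Here is a sketch under the assumption that jobs arrive on an in-process queue; submit_batch is a stand-in for the JSONL upload shown earlier:

```python
# Sketch: accumulate async jobs and flush them as one batch every 5 minutes.
import json
import queue
import threading
import time

jobs: "queue.Queue[dict]" = queue.Queue()
FLUSH_SECONDS = 300  # 5-minute windows

def submit_batch(jsonl_lines: list[str]) -> None:
    ...  # write JSONL, upload, client.batches.create(...) as sketched above

def flush_loop() -> None:
    while True:
        time.sleep(FLUSH_SECONDS)
        pending: list[str] = []
        while not jobs.empty():
            pending.append(json.dumps(jobs.get_nowait()))
        if pending:
            submit_batch(pending)

threading.Thread(target=flush_loop, daemon=True).start()
```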
Per-tenant cost attribution lives in our Postgres ledger — every API call is tagged with tenant_id, vertical, agent_id, and cost in micro-dollars. Without that ledger, the pricing tiers ($149 / $499 / $1499) would not be sustainable. The ROI calculator on the site reads from the same ledger to show prospective customers what they would actually pay. Try it on the 14-day no-card trial.
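A hedged sketch of such a ledger write with psycopg; the table and column names are illustrative, not CallSphere's actual schema. Integer micro-dollars avoid float drift when summing millions of rows:

```python
# Sketch: per-call cost ledger write. Schema is illustrative.
import psycopg

LEDGER_INSERT = """
    INSERT INTO llm_cost_ledger (tenant_id, vertical, agent_id, cost_microdollars)
    VALUES (%s, %s, %s, %s)
"""

def record_cost(conn: psycopg.Connection, tenant_id: str, vertical: str,
                agent_id: str, cost_usd: float) -> None:
    with conn.cursor() as cur:
        cur.execute(LEDGER_INSERT,
                    (tenant_id, vertical, agent_id, round(cost_usd * 1_000_000)))
    conn.commit()
```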
Optimization checklist
- Identify async workloads — anything that can wait 5+ minutes.
- Move post-call analytics, lead scoring, and embedding jobs to Batch API.
- Combine Batch API with prompt caching — both discounts stack.
- Build a per-tenant cost ledger from day one.
- Tag every span with tenant_id; tenant-less spans are a debugging nightmare.
- Use continuous batching (vLLM) only on self-hosted — vendors handle it server-side.
- Set per-tenant rate limits so one tenant cannot blow the batch budget (see the sketch after this checklist).
- Pre-warm batches at off-peak times to smooth GPU cost.
- Watch your p99 completion time: the Batch API rarely takes the full 24-hour SLA, but plan for the case where it does.
- Re-evaluate which workloads are truly realtime; most "live" chat features tolerate 200ms of buffering.
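A minimal per-tenant budget gate, referenced from the rate-limit item above; the budgets and in-memory counters are assumptions (use Redis or the cost ledger in production):

```python
# Sketch: token-budget admission check before a job enters the batch queue.
from collections import defaultdict

DAILY_TOKEN_BUDGET = {"default": 2_000_000}  # illustrative budgets, tokens/day
used_today: dict[str, int] = defaultdict(int)

def admit(tenant_id: str, estimated_tokens: int) -> bool:
    budget = DAILY_TOKEN_BUDGET.get(tenant_id, DAILY_TOKEN_BUDGET["default"])
    if used_today[tenant_id] + estimated_tokens > budget:
        return False  # defer to a later window instead of blowing the budget
    used_today[tenant_id] += estimated_tokens
    return True
```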
FAQ
What is the OpenAI Batch API? A separate endpoint that accepts a JSONL file of requests and returns results within 24 hours at a 50% discount.
Can I use prompt caching with Batch API? Yes — both discounts stack. We routinely combine them.
How do I attribute cost across tenants in a batched job? Tag every input row with tenant_id; the cost ledger entry references the tenant on the way out.
What is continuous batching? Server-side technique (vLLM, TensorRT-LLM, TGI) that batches multiple incoming requests into a single GPU forward pass, increasing throughput 2–4×.
When should I avoid batching? Anything user-facing with sub-second latency requirements, anything where input data has not been generated yet at batch submission time.
Sources
- OpenAI Batch API docs — https://platform.openai.com/docs/guides/batch
- Anthropic Message Batches — https://platform.claude.com/docs/en/build-with-claude/message-batches
- TokenMix Batch API pricing — https://tokenmix.ai/blog/openai-batch-api-pricing
- Mavik Labs LLM cost optimization — https://www.maviklabs.com/blog/llm-cost-optimization-2026
- Paxrel agent cost optimization guide — https://paxrel.com/blog-ai-agent-cost-optimization
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.