By Sagar Shankaran, Founder of CallSphere
Batching async workloads across tenants can cut LLM costs 50%. Here is when to use OpenAI Batch API, when to use continuous batching, and how to attribute cost per tenant correctly.
Key takeaways
Batching async workloads across tenants can cut LLM costs 50%. Here is when to use OpenAI Batch API, when to use continuous batching, and how to attribute cost per tenant correctly.
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]Most chat agent traffic is not actually realtime. Post-call summaries, lead scoring, sentiment analysis, follow-up email drafting, knowledge-base re-indexing — all of these can wait minutes or hours. They do not need a 400ms voice-to-voice latency budget.
If you serve dozens or hundreds of tenants from one stack, you can batch this work to get 50%+ discounts. But you have to do it without leaking PII across tenants and without creating month-end attribution chaos.
OpenAI Batch API (May 2026):
Anthropic Message Batches (May 2026):
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Continuous batching at the inference server level (vLLM, TGI):
Profile A — 10,000 post-call summaries per day, 8k input + 1k output, GPT-4o-mini:
Profile B — 50k embedding jobs per day for retrieval, text-embedding-3-large:
Profile C — 100 self-hosted Llama-3-70B inferences per minute, vLLM continuous batching on 1 × H100:
CallSphere runs three batching patterns across 6 verticals — 37 agents, 90+ tools, 115+ DB tables:
1. Post-call analytics (Healthcare, Sales, OneRoof Real Estate). Every call ends with a summary, sentiment score (–1 to +1), and lead score (0–100). These are not realtime — they queue and run in 5-minute Batch API windows. 50% Batch discount on top of 90% prompt-cache discount = ~95% off vs naive sync.
2. Lead scoring and follow-up draft generation (Sales, Salon GlamBook). Daily batch run scores yesterday's leads and drafts tomorrow's outreach mail through the email_marketing pipeline. Generated mails are wrapped by the existing GTM v7 HTML template. Cost: under $40/day across all 6 verticals.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
3. Knowledge base re-indexing. Whenever a tenant uploads new docs, we batch the embeddings via OpenAI Batch API and pay 50% less for vector index builds. Average tenant onboarding embedding cost: $0.40 vs $0.80 sync.
Per-tenant cost attribution lives in our Postgres ledger — every API call is tagged with tenant_id, vertical, agent_id, and cost in micro-dollars. Without that ledger, the pricing tiers ($149 / $499 / $1499) would not be sustainable. The ROI calculator on the site reads from the same ledger to show prospective customers what they would actually pay. Try it on the 14-day no-card trial.
What is OpenAI Batch API? A separate endpoint that accepts a JSONL file of requests and returns results within 24 hours at 50% discount.
Can I use prompt caching with Batch API? Yes — both discounts stack. We routinely combine them.
How do I attribute cost across tenants in a batched job? Tag every input row with tenant_id; the cost ledger entry references the tenant on the way out.
What is continuous batching? Server-side technique (vLLM, TensorRT-LLM, TGI) that batches multiple incoming requests into a single GPU forward pass, increasing throughput 2–4×.
When should I avoid batching? Anything user-facing with sub-second latency requirements, anything where input data has not been generated yet at batch submission time.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
Every 100ms of latency costs you. So does every cent per minute. Here is the decision matrix we use across 6 verticals to pick where to spend and where to save on voice AI infrastructure.
78% of issues resolve via AI bots and 87% of users report positive experiences. Here is how 2026 chat agents fire inline 1–5 stars, NPS chips, and follow-up CSAT without survey fatigue.
Companies that safely automate 60 to 80 percent of refund requests with verifiable accuracy reduce costs and improve customer experience. Here is how to ship a chat-driven refund and cancellation flow without losing the customer.
11x.ai and Artisan promised to replace BDRs entirely. By 2026 most adopters reverted to hybrid models. Here is the outbound chat pattern that actually works.
How leaders should think about Claude equity research — adoption patterns, ROI, competitive dynamics, and what financial AI means for the next 12 months.
A practical engineering deep dive into Claude Sonnet 4.6 vision, covering architecture, tradeoffs, and what production teams need to know about multimodal AI.
© 2026 CallSphere LLC. All rights reserved.