---
title: "Multi-Tenant Batching Strategies for Chat Agents in 2026"
description: "Batching async workloads across tenants can cut LLM costs 50%. Here is when to use OpenAI Batch API, when to use continuous batching, and how to attribute cost per tenant correctly."
canonical: https://callsphere.ai/blog/vw2c-multi-tenant-batching-strategies-chat-agents-2026
category: "AI Engineering"
tags: ["Batching", "Multi-Tenant", "Cost", "LLM", "Chat Agents"]
author: "CallSphere Team"
published: 2026-04-05T00:00:00.000Z
updated: 2026-05-07T09:32:11.135Z
---

# Multi-Tenant Batching Strategies for Chat Agents in 2026

> Batching async workloads across tenants can cut LLM costs 50%. Here is when to use OpenAI Batch API, when to use continuous batching, and how to attribute cost per tenant correctly.

## The cost problem

```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

*CallSphere reference architecture*

Most chat agent traffic is not actually realtime. Post-call summaries, lead scoring, sentiment analysis, follow-up email drafting, knowledge-base re-indexing — all of these can wait minutes or hours. They do not need a 400ms voice-to-voice latency budget.

If you serve dozens or hundreds of tenants from one stack, you can batch this work to get 50%+ discounts. But you have to do it without leaking PII across tenants and without creating month-end attribution chaos.

## How batching is priced

**OpenAI Batch API (May 2026):**

- 50% discount on input + output tokens vs synchronous
- 24-hour SLA (typically delivered in under an hour)
- Same models supported, same prompt caching applies
- Single line item per batch job
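
For concreteness, here is a minimal sketch of what a Batch API submission can look like. The request-line shape (`custom_id`, `method`, `url`, `body`) follows OpenAI's documented batch input format; the tenant-tagged `custom_id` convention, the helper function, and the prompt text are our own illustration, not part of the API.

```python
import json

# One JSONL line per queued job. Encoding the tenant in custom_id (never in the
# prompt itself) keeps the shared prompt prefix cacheable and makes post-batch
# fan-out trivial.
def batch_line(job_id: str, tenant_id: str, transcript: str) -> str:
    return json.dumps({
        "custom_id": f"{tenant_id}:{job_id}",  # recovered when results come back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                # Shared, tenant-agnostic prefix -> eligible for prompt caching.
                {"role": "system", "content": "Summarise the call in 5 bullet points."},
                {"role": "user", "content": transcript},
            ],
        },
    })

with open("batch_input.jsonl", "w") as f:
    for job_id, tenant_id, transcript in [
        ("job-001", "tenant-a", "example transcript"),
        ("job-002", "tenant-b", "example transcript"),
    ]:
        f.write(batch_line(job_id, tenant_id, transcript) + "\n")
```

The file is then uploaded with purpose `batch` and submitted with a 24-hour completion window; results come back as another JSONL file keyed by the same `custom_id`.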

**Anthropic Message Batches (May 2026):**

- 50% discount on input + output tokens
- 24-hour SLA
- Compatible with prompt caching

**Continuous batching at the inference server level (vLLM, TGI):**

- Not a vendor discount — an architectural pattern for self-hosted
- Throughput improvement 2–4× on the same GPU
- Effectively gives one $4/hr H100 the concurrent capacity of two to four GPUs

## Honest math

**Profile A — 10,000 post-call summaries per day, 8k input + 1k output, GPT-4o-mini:**

- Synchronous: 10k × (8k × $0.15/M + 1k × $0.60/M) = 10k × ($0.0012 + $0.0006) = **$18/day**
- Cached prompts (90% cache rate): **$3.60/day**
- Batch API (no cache): **$9/day**
- Batch + cache combined: **$1.80/day** — 90% cheaper than naive sync
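
As a sanity check, here is the Profile A arithmetic spelled out. The prices are the GPT-4o-mini list prices assumed above; the caching line is deliberately left out of the code because the saving depends on your actual hit rate and your provider's cached-input discount.

```python
# Profile A: 10k post-call summaries/day on GPT-4o-mini (assumed list prices).
JOBS_PER_DAY  = 10_000
INPUT_TOKENS  = 8_000        # per job
OUTPUT_TOKENS = 1_000        # per job
INPUT_PRICE   = 0.15 / 1e6   # $ per token
OUTPUT_PRICE  = 0.60 / 1e6   # $ per token

per_job_sync = INPUT_TOKENS * INPUT_PRICE + OUTPUT_TOKENS * OUTPUT_PRICE
sync_daily   = JOBS_PER_DAY * per_job_sync   # -> $18.00
batch_daily  = sync_daily * 0.5              # 50% Batch API discount -> $9.00

print(f"sync:  ${sync_daily:.2f}/day")
print(f"batch: ${batch_daily:.2f}/day")
```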

**Profile B — 50k embedding jobs per day for retrieval, text-embedding-3-large:**

- Synchronous: 50k × $0.13/M × 2k tokens avg = **$13/day**
- Batch: **$6.50/day**

**Profile C — 100 self-hosted Llama-3-70B inferences per minute, vLLM continuous batching on 1 × H100:**

- Without continuous batching: 25 reqs/sec sustained max
- With continuous batching: 80 reqs/sec sustained
- Same hardware ($3.95/hr Modal H100), 3.2× throughput
- Effective $/req drops 70%

## Multi-tenant attribution gotchas

1. **Pre-batch enrichment leaks identity.** If you embed tenant_id in the prompt text itself, every tenant gets a different prompt prefix, so prompt caching breaks and batches cannot be safely shared across tenants. Keep tenant identity in request metadata (for example the `custom_id`) rather than in the prompt.
2. **Post-batch routing is required.** You need a job ID → tenant_id mapping table to fan out results (a minimal sketch follows this list).
3. **Per-tenant cost tracking.** Without explicit cost attribution, 3% of tenants typically eat 60% of tokens. We have seen this on every multi-tenant deployment.
4. **Latency variance.** Some tenants will tolerate batch latency, others will not. Add a per-tenant policy.
5. **PII isolation.** Batches that span tenants need PII redaction or pre-tagging.
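
To make items 2 and 3 concrete, here is a minimal fan-out sketch, assuming the tenant-tagged `custom_id` convention from the earlier JSONL example. The field paths follow the documented batch output shape, but the helper names and the micro-dollar convention are illustrative, not our production code.

```python
from collections import defaultdict

def deliver_result(tenant_id: str, job_id: str, row: dict) -> None:
    # Placeholder: push the result into the tenant's own queue / DB partition.
    pass

def fan_out(results: list[dict], price_in: float, price_out: float) -> dict[str, int]:
    """Route batch results back to tenants and return micro-dollars spent per tenant.

    `results` is the parsed batch output JSONL; prices are $ per token.
    """
    cost_by_tenant: dict[str, int] = defaultdict(int)
    for row in results:
        tenant_id, job_id = row["custom_id"].split(":", 1)   # "<tenant>:<job>"
        usage = row["response"]["body"]["usage"]
        dollars = usage["prompt_tokens"] * price_in + usage["completion_tokens"] * price_out
        cost_by_tenant[tenant_id] += round(dollars * 1_000_000)  # micro-dollars
        deliver_result(tenant_id, job_id, row)
    return dict(cost_by_tenant)
```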

## How CallSphere optimizes

CallSphere runs three batching patterns across 6 verticals — 37 agents, 90+ tools, 115+ DB tables:

**1. Post-call analytics (Healthcare, Sales, OneRoof Real Estate).** Every call ends with a summary, a sentiment score (–1 to +1), and a lead score (0–100). None of this is realtime: jobs queue and are flushed to the Batch API every five minutes. 50% Batch discount on top of 90% prompt-cache discount = ~95% off vs naive sync.

**2. Lead scoring and follow-up draft generation (Sales, Salon GlamBook).** A daily batch run scores yesterday's leads and drafts tomorrow's outreach emails through the email_marketing pipeline. Generated emails are wrapped in the existing GTM v7 HTML template. Cost: under $40/day across all 6 verticals.

**3. Knowledge base re-indexing.** Whenever a tenant uploads new docs, we batch the embeddings via OpenAI Batch API and pay 50% less for vector index builds. Average tenant onboarding embedding cost: $0.40 vs $0.80 sync.

Per-tenant cost attribution lives in our Postgres ledger — every API call is tagged with tenant_id, vertical, agent_id, and cost in micro-dollars. Without that ledger, the [pricing tiers](/pricing) ($149 / $499 / $1499) would not be sustainable. The [ROI calculator](/tools/roi-calculator) on the site reads from the same ledger to show prospective customers what they would actually pay. Try it on the [14-day no-card trial](/trial).
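
A ledger write can be as small as one insert per API call. The sketch below assumes a plain Postgres table named `llm_cost_ledger` with the columns shown; the schema, environment variable, and psycopg2 usage are illustrative, not our actual production code.

```python
import os
import psycopg2  # pip install psycopg2-binary

LEDGER_INSERT = """
    INSERT INTO llm_cost_ledger
        (tenant_id, vertical, agent_id, model, prompt_tokens, completion_tokens, cost_micro_usd)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
"""

def record_cost(conn, *, tenant_id: str, vertical: str, agent_id: str, model: str,
                prompt_tokens: int, completion_tokens: int, cost_micro_usd: int) -> None:
    # One ledger row per API call; invoices and the ROI calculator become
    # simple GROUP BY tenant_id aggregations over this table.
    with conn.cursor() as cur:
        cur.execute(LEDGER_INSERT, (tenant_id, vertical, agent_id, model,
                                    prompt_tokens, completion_tokens, cost_micro_usd))
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed connection string
    record_cost(conn, tenant_id="tenant-a", vertical="healthcare", agent_id="intake-01",
                model="gpt-4o-mini", prompt_tokens=8_000, completion_tokens=1_000,
                cost_micro_usd=900)
```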

## Optimization checklist

1. Identify async workloads — anything that can wait 5+ minutes.
2. Move post-call analytics, lead scoring, and embedding jobs to Batch API.
3. Combine Batch API with prompt caching — both discounts stack.
4. Build a per-tenant cost ledger from day one.
5. Tag every span with tenant_id; tenant-less spans are a debugging nightmare.
6. Use continuous batching (vLLM) only on self-hosted — vendors handle it server-side.
7. Set per-tenant rate limits to prevent one tenant from blowing the batch budget (see the policy sketch after this list).
8. Pre-warm batches at off-peak times to smooth GPU cost.
9. Watch your p99 completion time: Batch API jobs rarely take the full 24-hour SLA, but plan for the case where they do.
10. Re-evaluate which workloads are truly realtime — most "live" chat features can be 200ms-buffered.
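
A per-tenant policy can be a small record the batch scheduler consults before enqueueing work. The field names and defaults below are chosen for this post, not taken from our codebase.

```python
from dataclasses import dataclass

@dataclass
class TenantBatchPolicy:
    tenant_id: str
    allow_batch: bool = True            # some tenants insist on sync-only processing
    max_daily_tokens: int = 5_000_000   # hard cap so one tenant cannot eat the batch budget
    flush_interval_s: int = 300         # how long jobs may wait before a batch is submitted

def should_enqueue(policy: TenantBatchPolicy, tokens_used_today: int, job_tokens: int) -> bool:
    """Gate a job before it joins the shared batch queue."""
    return policy.allow_batch and (tokens_used_today + job_tokens) <= policy.max_daily_tokens
```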

## FAQ

**What is OpenAI Batch API?**
A separate endpoint that accepts a JSONL file of requests and returns results within 24 hours at a 50% discount.

**Can I use prompt caching with Batch API?**
Yes — both discounts stack. We routinely combine them.

**How do I attribute cost across tenants in a batched job?**
Tag every input row with tenant_id; the cost ledger entry references the tenant on the way out.

**What is continuous batching?**
Server-side technique (vLLM, TensorRT-LLM, TGI) that batches multiple incoming requests into a single GPU forward pass, increasing throughput 2–4×.

**When should I avoid batching?**
Anything user-facing with sub-second latency requirements, and anything whose input data has not yet been generated at batch submission time.

## Sources

- OpenAI Batch API docs — [https://platform.openai.com/docs/guides/batch](https://platform.openai.com/docs/guides/batch)
- Anthropic Message Batches — [https://platform.claude.com/docs/en/build-with-claude/message-batches](https://platform.claude.com/docs/en/build-with-claude/message-batches)
- TokenMix Batch API pricing — [https://tokenmix.ai/blog/openai-batch-api-pricing](https://tokenmix.ai/blog/openai-batch-api-pricing)
- Mavik Labs LLM cost optimization — [https://www.maviklabs.com/blog/llm-cost-optimization-2026](https://www.maviklabs.com/blog/llm-cost-optimization-2026)
- Paxrel agent cost optimization guide — [https://paxrel.com/blog-ai-agent-cost-optimization](https://paxrel.com/blog-ai-agent-cost-optimization)

