By Sagar Shankaran, Founder of CallSphere
Use Cloudflare AI Gateway to route voice agent inference across OpenAI, Anthropic, Google, and Workers AI with automatic fallback, caching, and per-tenant rate limits.
Key takeaways
TL;DR: Cloudflare AI Gateway sits between your voice agent and any LLM provider, giving you caching, observability, rate limits, and automatic failover across providers via the universal endpoint. Point your OpenAI client at https://gateway.ai.cloudflare.com/v1/{account}/{gw}/openai and you immediately get analytics + caching with no code change beyond the base URL.
The build: a voice agent fronted by AI Gateway that tries OpenAI gpt-realtime first, falls back to Azure Voice Live on a rate limit, then to Google gemini-2.5-flash-live on a full outage. Per-tenant token budgets are enforced via Gateway, and cached answers for FAQ-style turns save 60% on input tokens.
flowchart LR
V[Voice Bridge] -->|gateway URL| GW[Cloudflare AI Gateway]
GW -->|primary| OAI[OpenAI Realtime]
GW -->|fallback 1| AZ[Azure Voice Live]
GW -->|fallback 2| GG[Google Gemini Live]
GW --> CACHE[(Cache)]
GW --> LOG[(Analytics + Logs)]
GW --> LIM[Per-tenant Rate Limits]
In the Cloudflare dashboard, go to AI → AI Gateway → Create gateway and name the gateway voice-prod. Note the URL: https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod.
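```python
import os

from openai import OpenAI

# Cloudflare account ID; the environment variable name is an assumption.
ACCOUNT = os.environ["CLOUDFLARE_ACCOUNT_ID"]

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    # Send requests through the voice-prod gateway instead of api.openai.com.
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai",
)
```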
That's it — every request now flows through the gateway. For Realtime WebSockets, use wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai/realtime.
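The HTTPS and WebSocket URLs above differ only in scheme and suffix, so a small helper can keep them in sync across a codebase. This is a minimal sketch: `gateway_url` and its parameters are hypothetical names, and the `/realtime` suffix simply mirrors the URL quoted above.

```python
GATEWAY_HOST = "gateway.ai.cloudflare.com"


def gateway_url(account: str, gateway: str, provider: str,
                realtime: bool = False) -> str:
    """Build a provider URL for an AI Gateway (hypothetical helper)."""
    # Realtime sessions use a WebSocket scheme; everything else is HTTPS.
    scheme = "wss" if realtime else "https"
    url = f"{scheme}://{GATEWAY_HOST}/v1/{account}/{gateway}/{provider}"
    return f"{url}/realtime" if realtime else url


print(gateway_url("ACCOUNT_ID", "voice-prod", "openai"))
print(gateway_url("ACCOUNT_ID", "voice-prod", "openai", realtime=True))
```

Pass the first form as `base_url` to the OpenAI client and the second to your WebSocket bridge.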
The Universal endpoint accepts a JSON array of provider attempts; AI Gateway tries them in order until one succeeds:
```bash
curl https://gateway.ai.cloudflare.com/v1/$ACCOUNT/voice-prod \
  -H "Content-Type: application/json" \
  -d '[
    {
      "provider": "openai",
      "endpoint": "chat/completions",
      "headers": { "authorization": "Bearer sk-..." },
      "query": {
        "model": "gpt-5",
        "messages": [{"role":"user","content":"hi"}]
      }
    },
    {
      "provider": "azure-openai",
      "endpoint": "chat/completions?api-version=2025-05-01-preview",
      "headers": { "api-key": "..." },
      "query": {
        "messages": [{"role":"user","content":"hi"}]
      }
    },
    {
      "provider": "google-vertex-ai",
      "endpoint": "publishers/google/models/gemini-2.5-flash:generateContent",
      "headers": { "authorization": "Bearer ya29..." },
      "query": {
        "contents": [{"role":"user","parts":[{"text":"hi"}]}]
      }
    }
  ]'
```
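The same fallback chain can be assembled in Python so that the attempt order lives in code rather than in a shell string. This is a sketch under assumptions: `build_fallback_chain` and `build_request` are hypothetical helper names, the keys are placeholders, and only the first two providers from the curl body are reproduced. The request is built but not sent.

```python
import json
import urllib.request


def build_fallback_chain(prompt: str, openai_key: str, azure_key: str) -> list:
    """Return provider attempts in priority order (mirrors the curl body)."""
    messages = [{"role": "user", "content": prompt}]
    return [
        {
            "provider": "openai",
            "endpoint": "chat/completions",
            "headers": {"authorization": f"Bearer {openai_key}"},
            "query": {"model": "gpt-5", "messages": messages},
        },
        {
            "provider": "azure-openai",
            "endpoint": "chat/completions?api-version=2025-05-01-preview",
            "headers": {"api-key": azure_key},
            "query": {"messages": messages},
        },
    ]


def build_request(account: str, gateway: str, chain: list) -> urllib.request.Request:
    """Wrap the chain in a POST to the gateway's Universal endpoint."""
    url = f"https://gateway.ai.cloudflare.com/v1/{account}/{gateway}"
    return urllib.request.Request(
        url,
        data=json.dumps(chain).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


chain = build_fallback_chain("hi", "sk-placeholder", "azure-placeholder")
req = build_request("ACCOUNT_ID", "voice-prod", chain)
print(req.full_url, [step["provider"] for step in chain])
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Keeping the chain as a plain list makes it easy to reorder providers per tenant before the request is built.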
In the gateway settings, enable cache with a 1-hour TTL and a custom cache key that includes the system prompt + user message hash. Voice agents often re-handle the same intent ("what are your hours?") — cache hits return in <50ms with no token cost.
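```bash
# sha256sum prints "<hash>  -"; cut keeps only the hash so the header stays clean.
curl ... \
  -H "cf-aig-cache-ttl: 3600" \
  -H "cf-aig-cache-key: $(echo -n 'hours' | sha256sum | cut -d' ' -f1)"
```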
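The curl example hashes a fixed intent label. To key the cache on the system prompt plus the user message, as described above, one possible sketch follows. `cache_headers` is a hypothetical helper, and the lowercase and whitespace normalization is an assumption added so that trivially different transcripts share a key; it is not something the gateway does for you.

```python
import hashlib


def cache_headers(system_prompt: str, user_message: str,
                  ttl_seconds: int = 3600) -> dict:
    """Build cf-aig-cache-* headers from prompt content (hypothetical helper)."""
    # Assumption: collapse case and whitespace so near-identical turns match.
    normalized = " ".join(user_message.lower().split())
    material = f"{system_prompt}\n{normalized}".encode("utf-8")
    return {
        "cf-aig-cache-ttl": str(ttl_seconds),
        "cf-aig-cache-key": hashlib.sha256(material).hexdigest(),
    }


headers = cache_headers("You are a salon receptionist.", "What are your hours?")
print(headers["cf-aig-cache-ttl"], len(headers["cf-aig-cache-key"]))
```

Pass the result as `extra_headers` on the OpenAI client call. Because the system prompt is part of the key, changing the prompt naturally invalidates old answers.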
Use cf-aig-metadata to tag every call with a tenant ID, then create a rate-limit rule in the dashboard: "if metadata.tenant == X, max 50 req/min".
```python
client.chat.completions.create(
    ...,
    extra_headers={"cf-aig-metadata": '{"tenant":"acme-co"}'},
)
```
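Hand-written JSON strings break as soon as a tenant ID contains a quote or you add a second field, so it may be safer to serialize the metadata. A minimal sketch, assuming a hypothetical `tenant_headers` helper; the extra `vertical` field is illustrative only.

```python
import json


def tenant_headers(tenant_id: str, **extra: str) -> dict:
    """Build the cf-aig-metadata header for one tenant (hypothetical helper)."""
    # tenant_id is applied last so an extra field cannot overwrite it.
    metadata = {**extra, "tenant": tenant_id}
    return {"cf-aig-metadata": json.dumps(metadata, separators=(",", ":"))}


print(tenant_headers("acme-co"))
print(tenant_headers("acme-co", vertical="salon"))
```

The returned dict can be passed directly as `extra_headers` in the call shown above.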
Every request lands in the AI Gateway dashboard with: latency, token counts, cache hits, errors, and a full request/response replay (gated by RBAC). Pipe to your warehouse via the Logpush sink.
Replace the upstream URL in your existing bridge (any of the previous posts) with the gateway URL. WebSocket realtime calls work the same — Cloudflare proxies the bidirectional socket transparently.
Azure OpenAI requires api-version in the URL; Vertex requires a Google bearer token that must be refreshed. Handle both in your code, not in the gateway.
CallSphere routes between OpenAI Realtime, Anthropic Claude on Bedrock, and Gemini Flash through our own model router, which sits in FastAPI on :8084, because we need per-tenant routing tied to our 115+ Postgres tables (Healthcare PHI tenants must hit Bedrock; OneRoof multi-family hits OpenAI). AI Gateway is excellent for teams without that complexity. 37 voice agents, 90+ tools, 6 verticals, $149/$499/$1499, 14-day trial, 22% affiliate.
Q: Can I cache speech-to-speech audio? Not directly through Gateway — caching is text-payload-aware. Cache the LLM tier of your sandwich; STT/TTS layers handle their own caching.
Q: Does Gateway speak the OpenAI Realtime WS protocol? Yes — it transparently proxies; no translation needed.
Q: How does Gateway compare to LiteLLM? LiteLLM is self-hosted and gives you full control. Gateway is managed and on Cloudflare's edge; lower latency, less ops.
Q: Can I do A/B testing across models? Yes. Use the JSON-array endpoint with different weights, or split at the tenant level via cf-aig-metadata.
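For the tenant-level split, one common approach is to hash the tenant ID into a stable bucket so the same tenant always sees the same model. This is a sketch, not a gateway feature: `ab_bucket` and the bucket names are hypothetical, and mapping a bucket to a model is left to your own code.

```python
import hashlib


def ab_bucket(tenant_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a tenant to 'treatment' or 'control'."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if fraction < treatment_share else "control"


for tenant in ("acme-co", "globex", "initech"):
    print(tenant, ab_bucket(tenant, treatment_share=0.2))
```

Recording the bucket in cf-aig-metadata alongside the tenant ID lets you compare latency and cost per arm in the gateway logs.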
Q: What's the latency overhead? Roughly 10-30 ms compared with going direct. The overhead stays small because Cloudflare's edge POPs are often closer to your users than the LLM provider is.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.