Build Multi-LLM Voice Routing with Cloudflare AI Gateway (2026)
Use Cloudflare AI Gateway to route voice agent inference across OpenAI, Anthropic, Google, and Workers AI with automatic fallback, caching, and per-tenant rate limits.
TL;DR — Cloudflare AI Gateway sits between your voice agent and any LLM provider, giving you caching, observability, rate limits, and automatic failover across providers via the Universal endpoint. Point your OpenAI client at https://gateway.ai.cloudflare.com/v1/{account}/{gw}/openai and you immediately get analytics + caching with no code change.
What you'll build
A voice agent fronted by AI Gateway that tries OpenAI gpt-realtime first, falls back to Azure Voice Live on rate-limit, then to Google gemini-2.5-flash-live on full outage. Per-tenant token budgets enforced via Gateway; cached answers for FAQ-style turns saving 60% on input tokens.
Prerequisites
- Cloudflare account with AI Gateway enabled (gateway.ai.cloudflare.com).
- API keys for OpenAI, Azure, and Google AI Studio.
- Existing voice bridge (any of the previous tutorials in this series).
Architecture
flowchart LR
V[Voice Bridge] -->|gateway URL| GW[Cloudflare AI Gateway]
GW -->|primary| OAI[OpenAI Realtime]
GW -->|fallback 1| AZ[Azure Voice Live]
GW -->|fallback 2| GG[Google Gemini Live]
GW --> CACHE[(Cache)]
GW --> LOG[(Analytics + Logs)]
GW --> LIM[Per-tenant Rate Limits]
Step 1 — Create the gateway
In the Cloudflare dashboard → AI → AI Gateway → Create gateway named voice-prod. Note the URL: https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod.
Step 2 — Point your OpenAI client at the gateway
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai",
)
```
That's it — every request now flows through the gateway. For Realtime WebSockets, use wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai/realtime.
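For the WebSocket case, the only moving part is the URL. A minimal sketch of building the gateway-proxied Realtime URL — the account ID, gateway name, and model value below are placeholders for your own deployment:

```python
from urllib.parse import urlencode


def gateway_realtime_url(account: str, gateway: str, model: str) -> str:
    """Build the gateway-proxied Realtime WebSocket URL with the model
    passed as a query parameter."""
    base = f"wss://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai/realtime"
    return f"{base}?{urlencode({'model': model})}"


# Placeholder account/gateway values; a WS client would then connect to
# this URL with your usual OpenAI Realtime auth headers.
url = gateway_realtime_url("abc123", "voice-prod", "gpt-realtime")
```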
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Use the Universal endpoint for failover
The Universal endpoint accepts a JSON array of provider attempts; AI Gateway tries them in order until one succeeds:
```bash
curl https://gateway.ai.cloudflare.com/v1/$ACCOUNT/voice-prod \
  -H "Content-Type: application/json" \
  -d '[
    {
      "provider": "openai",
      "endpoint": "chat/completions",
      "headers": { "authorization": "Bearer sk-..." },
      "query": {
        "model": "gpt-5",
        "messages": [{"role": "user", "content": "hi"}]
      }
    },
    {
      "provider": "azure-openai",
      "endpoint": "chat/completions?api-version=2025-05-01-preview",
      "headers": { "api-key": "..." },
      "query": {
        "messages": [{"role": "user", "content": "hi"}]
      }
    },
    {
      "provider": "google-vertex-ai",
      "endpoint": "publishers/google/models/gemini-2.5-flash:generateContent",
      "headers": { "authorization": "Bearer ya29..." },
      "query": {
        "contents": [{"role": "user", "parts": [{"text": "hi"}]}]
      }
    }
  ]'
```
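If you'd rather build the failover array in application code, here is a minimal Python sketch using the same provider names and endpoints as the curl call above (the key arguments are assumed to come from your secrets store; only the first two providers are shown):

```python
import json


def build_attempts(openai_key: str, azure_key: str, user_text: str) -> list[dict]:
    """Ordered provider attempts for the Universal endpoint: the gateway
    tries each entry in order until one succeeds."""
    messages = [{"role": "user", "content": user_text}]
    return [
        {
            "provider": "openai",
            "endpoint": "chat/completions",
            "headers": {"authorization": f"Bearer {openai_key}"},
            "query": {"model": "gpt-5", "messages": messages},
        },
        {
            "provider": "azure-openai",
            "endpoint": "chat/completions?api-version=2025-05-01-preview",
            "headers": {"api-key": azure_key},
            "query": {"messages": messages},
        },
    ]


# POST this JSON body to https://gateway.ai.cloudflare.com/v1/$ACCOUNT/voice-prod
payload = json.dumps(build_attempts("sk-test", "az-test", "hi"))
```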
Step 4 — Enable cache for FAQ-like turns
In the gateway settings, enable cache with a 1-hour TTL and a custom cache key that includes the system prompt + user message hash. Voice agents often re-handle the same intent ("what are your hours?") — cache hits return in <50ms with no token cost.
```bash
curl ... \
  -H "cf-aig-cache-ttl: 3600" \
  -H "cf-aig-cache-key: $(echo -n 'hours' | sha256sum | cut -d' ' -f1)"
```
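The shell one-liner above hashes only the user text; per the cache-key rule ("system prompt + user message hash"), a Python sketch that covers both so two tenants with different system prompts never collide on the same user utterance:

```python
import hashlib


def cache_key(system_prompt: str, user_message: str) -> str:
    """Deterministic cache key over both prompts, sent in the
    cf-aig-cache-key header."""
    digest = hashlib.sha256()
    digest.update(system_prompt.encode("utf-8"))
    digest.update(b"\x00")  # separator avoids ambiguous concatenation
    digest.update(user_message.encode("utf-8"))
    return digest.hexdigest()


key = cache_key("You are a salon receptionist.", "what are your hours?")
```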
Step 5 — Per-tenant rate limits
Use cf-aig-metadata to tag every call with a tenant ID, then create a rate-limit rule in the dashboard: "if metadata.tenant == X, max 50 req/min".
```python
client.chat.completions.create(
    ...,
    extra_headers={"cf-aig-metadata": '{"tenant": "acme-co"}'},
)
```
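Rather than hand-writing the JSON string, a small helper (hypothetical, not part of any SDK) keeps the metadata header valid even when tenant IDs contain quotes or non-ASCII characters:

```python
import json


def tenant_headers(tenant_id: str) -> dict[str, str]:
    """Build the cf-aig-metadata header; json.dumps handles escaping,
    so odd tenant IDs still produce valid JSON."""
    return {"cf-aig-metadata": json.dumps({"tenant": tenant_id})}


headers = tenant_headers("acme-co")
# pass as extra_headers=headers on each chat.completions.create call
```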
Step 6 — Observability
Every request lands in the AI Gateway dashboard with: latency, token counts, cache hits, errors, and a full request/response replay (gated by RBAC). Pipe to your warehouse via the Logpush sink.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 7 — Wire into your voice agent
Replace the upstream URL in your existing bridge (any of the previous posts) with the gateway URL. WebSocket realtime calls work the same — Cloudflare proxies the bidirectional socket transparently.
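A sketch of the URL swap, assuming your bridge stores the upstream endpoint as a plain string; the rewrite preserves the path and the ws/http scheme, so the direct Realtime URL maps onto the gateway form shown in Step 2:

```python
def to_gateway(url: str, account: str, gateway: str) -> str:
    """Rewrite a direct OpenAI endpoint to its gateway-proxied form."""
    for direct, proxied in (
        ("https://api.openai.com/v1",
         f"https://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai"),
        ("wss://api.openai.com/v1",
         f"wss://gateway.ai.cloudflare.com/v1/{account}/{gateway}/openai"),
    ):
        if url.startswith(direct):
            return proxied + url[len(direct):]
    return url  # already gateway-shaped or unknown host: leave untouched
```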
Pitfalls
- No WebSocket failover: the JSON-array failover applies to HTTP requests only; you can't currently use it for streaming WS endpoints.
- Cache key collisions: don't cache by user prompt alone — include system prompt + temperature.
- Provider quirks: Azure OpenAI requires api-version in the URL; Vertex requires a (regularly refreshed) Google bearer token. Handle both in your code, not the gateway.
- Per-request logs are sampled at high QPS; turn on full logging only for forensic analysis.
- Cost: Gateway itself is free up to 100k req/day; beyond that it's $1 per 1M requests on the Pro plan.
How CallSphere does this in production
CallSphere routes between OpenAI Realtime, Anthropic Claude on Bedrock, and Gemini Flash through our own model router in FastAPI on :8084, because we need per-tenant routing tied to our 115+ Postgres tables (Healthcare PHI tenants must hit Bedrock; OneRoof multi-family hits OpenAI). AI Gateway is excellent for teams without that complexity. 37 voice agents, 90+ tools, 6 verticals, plans at $149/$499/$1499, 14-day trial, 22% affiliate commission.
FAQ
Q: Can I cache speech-to-speech audio? Not directly through Gateway — caching is text-payload-aware. Cache the LLM tier of your STT→LLM→TTS sandwich; the STT/TTS layers handle their own caching.
Q: Does Gateway speak the OpenAI Realtime WS protocol? Yes — it transparently proxies; no translation needed.
Q: How does Gateway compare to LiteLLM? LiteLLM is self-hosted and gives you full control. Gateway is managed and on Cloudflare's edge; lower latency, less ops.
Q: Can I do A/B testing across models?
Yes — use the JSON-array endpoint with different weights, or split at the tenant level via cf-aig-metadata.
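A sketch of the tenant-level split: hash the tenant ID into a stable bucket so every call from the same tenant hits the same model (model names and weights below are illustrative):

```python
import hashlib


def ab_bucket(tenant_id: str, weights: dict[str, int]) -> str:
    """Deterministically assign a tenant to a model bucket. Stable across
    restarts because it hashes the tenant ID instead of using random()."""
    h = int(hashlib.sha256(tenant_id.encode("utf-8")).hexdigest(), 16) % 100
    cumulative = 0
    for model, pct in weights.items():
        cumulative += pct
        if h < cumulative:
            return model
    return next(iter(weights))  # fallback if weights sum to less than 100


model = ab_bucket("acme-co", {"gpt-realtime": 80, "gemini-2.5-flash-live": 20})
```

Tag the chosen bucket into cf-aig-metadata alongside the tenant ID so the split shows up in Gateway analytics.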
Q: What's the latency overhead? Roughly 10–30 ms versus going direct; Cloudflare's edge POPs are often closer to your users than the LLM provider, which offsets much of it.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.