
Build Multi-LLM Voice Routing with Cloudflare AI Gateway (2026)

Use Cloudflare AI Gateway to route voice agent inference across OpenAI, Anthropic, Google, and Workers AI with automatic fallback, caching, and per-tenant rate limits.

TL;DR — Cloudflare AI Gateway sits between your voice agent and any LLM provider, giving you caching, observability, rate limits, and automatic failover across providers via the Universal endpoint. Point your OpenAI client at https://gateway.ai.cloudflare.com/v1/{account}/{gw}/openai and you immediately get analytics + caching with a one-line base-URL change.

What you'll build

A voice agent fronted by AI Gateway that tries OpenAI gpt-realtime first, falls back to Azure Voice Live on rate limits, then to Google gemini-2.5-flash-live on a full outage. Per-tenant token budgets are enforced at the gateway, and cached answers for FAQ-style turns cut input-token spend by roughly 60%.

Prerequisites

  1. Cloudflare account with AI Gateway enabled (gateway.ai.cloudflare.com).
  2. API keys for OpenAI, Azure, and Google AI Studio.
  3. Existing voice bridge (any of the previous tutorials in this series).

Architecture

```mermaid
flowchart LR
  V[Voice Bridge] -->|gateway URL| GW[Cloudflare AI Gateway]
  GW -->|primary| OAI[OpenAI Realtime]
  GW -->|fallback 1| AZ[Azure Voice Live]
  GW -->|fallback 2| GG[Google Gemini Live]
  GW --> CACHE[(Cache)]
  GW --> LOG[(Analytics + Logs)]
  GW --> LIM[Per-tenant Rate Limits]
```

Step 1 — Create the gateway

In the Cloudflare dashboard, go to AI → AI Gateway → Create gateway and name it voice-prod. Note the gateway URL: https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod.

Step 2 — Point your OpenAI client at the gateway

```python
import os

from openai import OpenAI

ACCOUNT = os.environ["CF_ACCOUNT_ID"]  # your Cloudflare account ID

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai",
)
```

That's it — every request now flows through the gateway. For Realtime WebSockets, use wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai/realtime.
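For the WebSocket path you don't need a new SDK, just the gateway host in the URL. A minimal sketch using aiohttp, assuming the gpt-realtime model and a CF_ACCOUNT_ID env var (both placeholders):

```python
import asyncio
import os

import aiohttp

ACCOUNT = os.environ["CF_ACCOUNT_ID"]  # placeholder: your Cloudflare account ID
WS_URL = (
    f"wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod"
    "/openai/realtime?model=gpt-realtime"
)

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.ws_connect(
            WS_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        ) as ws:
            # First server event confirms the session; from here on it's the
            # normal Realtime protocol, unchanged by the gateway in between.
            event = await ws.receive_json()
            print(event.get("type"))

asyncio.run(main())
```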

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 3 — Use the Universal endpoint for failover

The Universal endpoint accepts a JSON array of provider attempts; AI Gateway tries them in order until one succeeds:

```bash
curl https://gateway.ai.cloudflare.com/v1/$ACCOUNT/voice-prod \
  -H "Content-Type: application/json" \
  -d '[
    {
      "provider": "openai",
      "endpoint": "chat/completions",
      "headers": { "authorization": "Bearer sk-..." },
      "query": {
        "model": "gpt-5",
        "messages": [{"role": "user", "content": "hi"}]
      }
    },
    {
      "provider": "azure-openai",
      "endpoint": "chat/completions?api-version=2025-05-01-preview",
      "headers": { "api-key": "..." },
      "query": {
        "messages": [{"role": "user", "content": "hi"}]
      }
    },
    {
      "provider": "google-vertex-ai",
      "endpoint": "publishers/google/models/gemini-2.5-flash:generateContent",
      "headers": { "authorization": "Bearer ya29..." },
      "query": {
        "contents": [{"role": "user", "parts": [{"text": "hi"}]}]
      }
    }
  ]'
```
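The same failover array can be built in application code; a minimal sketch with requests, using the first two attempts from the curl above (keys and account ID come from placeholder env vars):

```python
import json
import os

import requests

ACCOUNT = os.environ["CF_ACCOUNT_ID"]  # placeholder: your Cloudflare account ID

# Ordered list of provider attempts; the gateway tries each until one succeeds.
attempts = [
    {
        "provider": "openai",
        "endpoint": "chat/completions",
        "headers": {"authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        "query": {"model": "gpt-5", "messages": [{"role": "user", "content": "hi"}]},
    },
    {
        "provider": "azure-openai",
        "endpoint": "chat/completions?api-version=2025-05-01-preview",
        "headers": {"api-key": os.environ["AZURE_OPENAI_KEY"]},
        "query": {"messages": [{"role": "user", "content": "hi"}]},
    },
]

resp = requests.post(
    f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod",
    json=attempts,
    timeout=30,
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```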

Step 4 — Enable cache for FAQ-like turns

In the gateway settings, enable cache with a 1-hour TTL and a custom cache key that includes the system prompt + user message hash. Voice agents often re-handle the same intent ("what are your hours?") — cache hits return in <50ms with no token cost.

```bash
curl ... \
  -H "cf-aig-cache-ttl: 3600" \
  -H "cf-aig-cache-key: $(echo -n 'hours' | sha256sum | cut -d' ' -f1)"
```
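The same headers work from the SDK. A sketch using the client from Step 2, hashing system prompt, temperature, and user text into the key (the hashing scheme is our convention, not a Gateway requirement; it also avoids the collision pitfall noted below):

```python
import hashlib

SYSTEM_PROMPT = "You are a helpful receptionist."  # placeholder
TEMPERATURE = 0.3

def cached_completion(client, user_text: str):
    # Key on system prompt + temperature + user message so different
    # configurations never share a cache entry.
    raw = f"{SYSTEM_PROMPT}|{TEMPERATURE}|{user_text}"
    cache_key = hashlib.sha256(raw.encode()).hexdigest()
    return client.chat.completions.create(
        model="gpt-5",
        temperature=TEMPERATURE,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        extra_headers={
            "cf-aig-cache-ttl": "3600",     # 1-hour TTL, matching the setting above
            "cf-aig-cache-key": cache_key,  # custom key overrides the default
        },
    )
```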

Step 5 — Per-tenant rate limits

Use cf-aig-metadata to tag every call with a tenant ID, then create a rate-limit rule in the dashboard: "if metadata.tenant == X, max 50 req/min".

```python
# Tag the request with the tenant so Gateway rate-limit rules can match on it.
client.chat.completions.create(
    ...,
    extra_headers={"cf-aig-metadata": '{"tenant":"acme-co"}'},
)
```
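When a tenant exhausts its budget, the gateway answers with a 429 like any provider rate limit, which the OpenAI SDK surfaces as openai.RateLimitError. A sketch of one way a voice turn can degrade gracefully (the fallback behavior is ours):

```python
import openai

def answer_turn(client, messages: list, tenant: str):
    try:
        return client.chat.completions.create(
            model="gpt-5",
            messages=messages,
            extra_headers={"cf-aig-metadata": f'{{"tenant":"{tenant}"}}'},
        )
    except openai.RateLimitError:
        # The per-tenant rule tripped: return None so the bridge can play a
        # canned "one moment" prompt instead of leaving dead air on the call.
        return None
```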

Step 6 — Observability

Every request lands in the AI Gateway dashboard with: latency, token counts, cache hits, errors, and a full request/response replay (gated by RBAC). Pipe to your warehouse via the Logpush sink.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 7 — Wire into your voice agent

Replace the upstream URL in your existing bridge (any of the previous posts) with the gateway URL. WebSocket realtime calls work the same — Cloudflare proxies the bidirectional socket transparently.
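A minimal sketch of that swap, assuming the bridge reads its upstream from env vars (the variable names are ours):

```python
import os

ACCOUNT = os.environ["CF_ACCOUNT_ID"]  # placeholder: your Cloudflare account ID

# Flip USE_AI_GATEWAY=1 to route through the gateway; unset it to go direct.
DIRECT_WS = "wss://api.openai.com/v1/realtime"
GATEWAY_WS = f"wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai/realtime"

UPSTREAM_WS = GATEWAY_WS if os.environ.get("USE_AI_GATEWAY") == "1" else DIRECT_WS
```

Because the gateway proxies the socket byte-for-byte, nothing else in the bridge has to change.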

Pitfalls

  • Failover is HTTP-only: the Universal endpoint's JSON-array failover doesn't apply to streaming WebSocket connections; those go through the provider-specific endpoint with no automatic fallback.
  • Cache key collisions: don't cache by user prompt alone — include system prompt + temperature.
  • Provider quirks: Azure OpenAI requires api-version in the URL; Vertex requires a Google OAuth bearer token that you must refresh yourself. Handle both in your code; the gateway won't do it for you.
  • Per-request logs are sampled at high QPS; turn on full logging only for forensic analysis.
  • Cost: Gateway itself is free up to 100k req/day; beyond that it's $1 per 1M requests on the Pro plan.

How CallSphere does this in production

CallSphere routes between OpenAI Realtime, Anthropic Claude on Bedrock, and Gemini Flash through our own model router, a FastAPI service on :8084, because our per-tenant routing is tied to 115+ Postgres tables (healthcare PHI tenants must hit Bedrock; OneRoof multi-family hits OpenAI). AI Gateway is excellent for teams without that complexity. CallSphere: 37 voice agents, 90+ tools, 6 verticals, plans at $149/$499/$1,499, 14-day trial, 22% affiliate program.

FAQ

Q: Can I cache speech-to-speech audio? Not directly through Gateway — caching is text-payload-aware. Cache the LLM tier of your sandwich; STT/TTS layers handle their own caching.

Q: Does Gateway speak the OpenAI Realtime WS protocol? Yes — it transparently proxies; no translation needed.

Q: How does Gateway compare to LiteLLM? LiteLLM is self-hosted and gives you full control. Gateway is managed and on Cloudflare's edge; lower latency, less ops.

Q: Can I do A/B testing across models? Yes — use the JSON-array endpoint with different weights, or split at the tenant level via cf-aig-metadata.

Q: What's the latency overhead? Typically ~10-30ms versus going direct, and often less in practice, since Cloudflare's edge POPs are frequently closer to your users than the LLM provider's servers.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available; no signup required.