---
title: "Build Multi-LLM Voice Routing with Cloudflare AI Gateway (2026)"
description: "Use Cloudflare AI Gateway to route voice agent inference across OpenAI, Anthropic, Google, and Workers AI with automatic fallback, caching, and per-tenant rate limits."
canonical: https://callsphere.ai/blog/vw5h-build-multi-llm-voice-routing-cloudflare-ai-gateway
category: "AI Infrastructure"
tags: ["Cloudflare", "AI Gateway", "Multi-LLM", "Routing", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-08T00:00:00.000Z
updated: 2026-05-07T16:30:07.814Z
---

# Build Multi-LLM Voice Routing with Cloudflare AI Gateway (2026)

> Use Cloudflare AI Gateway to route voice agent inference across OpenAI, Anthropic, Google, and Workers AI with automatic fallback, caching, and per-tenant rate limits.

> **TL;DR** — Cloudflare AI Gateway sits between your voice agent and any LLM provider, giving you caching, observability, rate limits, and automatic failover across providers via the Universal endpoint. Point your OpenAI client at `https://gateway.ai.cloudflare.com/v1/{account}/{gw}/openai` and you immediately get analytics and caching with a one-line base-URL change.

## What you'll build

A voice agent fronted by AI Gateway that tries OpenAI `gpt-realtime` first, falls back to Azure Voice Live on rate limits, and falls back again to Google `gemini-2.5-flash-live` on a full outage. Per-tenant token budgets are enforced at the Gateway, and cached answers for FAQ-style turns save roughly 60% on input tokens.

## Prerequisites

1. Cloudflare account with AI Gateway enabled (`gateway.ai.cloudflare.com`).
2. API keys for OpenAI, Azure, and Google AI Studio.
3. Existing voice bridge (any of the previous tutorials in this series).

## Architecture

```mermaid
flowchart LR
  V[Voice Bridge] -->|gateway URL| GW[Cloudflare AI Gateway]
  GW -->|primary| OAI[OpenAI Realtime]
  GW -->|fallback 1| AZ[Azure Voice Live]
  GW -->|fallback 2| GG[Google Gemini Live]
  GW --> CACHE[(Cache)]
  GW --> LOG[(Analytics + Logs)]
  GW --> LIM[Per-tenant Rate Limits]
```

## Step 1 — Create the gateway

In the Cloudflare dashboard → AI → AI Gateway → Create gateway named `voice-prod`. Note the URL: `https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod`.
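
If you'd rather script this than click through the dashboard, Cloudflare's REST API exposes gateway management. A minimal sketch with `requests`; the route is the documented `ai-gateway/gateways` one, but treat the body fields beyond `id` as assumptions and check the current API reference for required fields:

```python
import os

import requests

ACCOUNT = os.environ["CF_ACCOUNT_ID"]
resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT}/ai-gateway/gateways",
    headers={"Authorization": f"Bearer {os.environ['CF_API_TOKEN']}"},
    json={
        "id": "voice-prod",     # the gateway name that appears in the URL
        "cache_ttl": 3600,      # assumed field: default cache TTL in seconds
        "collect_logs": True,   # assumed field: keep request/response logs
    },
)
resp.raise_for_status()
print(resp.json()["result"])
```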

## Step 2 — Point your OpenAI client at the gateway

```python
import os

from openai import OpenAI

ACCOUNT = os.environ["CF_ACCOUNT_ID"]  # your Cloudflare account ID

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai",
)
```

That's it — every request now flows through the gateway. For Realtime WebSockets, use `wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod/openai/realtime`.
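
A minimal connectivity check for the Realtime path, assuming the `websockets` library (in v14+ the header kwarg is `additional_headers`; older versions use `extra_headers`) and OpenAI's documented `OpenAI-Beta: realtime=v1` handshake:

```python
import asyncio
import json
import os

import websockets

ACCOUNT = os.environ["CF_ACCOUNT_ID"]
URL = (
    f"wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod"
    "/openai/realtime?model=gpt-realtime"
)

async def main():
    async with websockets.connect(
        URL,
        additional_headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # The first server event should be session.created if the proxy works.
        print(json.loads(await ws.recv())["type"])

asyncio.run(main())
```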

## Step 3 — Use the Universal endpoint for failover

The Universal endpoint accepts a JSON array of provider attempts; AI Gateway tries them in order until one succeeds:

```bash
curl https://gateway.ai.cloudflare.com/v1/$ACCOUNT/voice-prod \
  -H "Content-Type: application/json" \
  -d '[
    {
      "provider": "openai",
      "endpoint": "chat/completions",
      "headers": { "authorization": "Bearer sk-..." },
      "query": { "model": "gpt-5", "messages": [{"role":"user","content":"hi"}] }
    },
    {
      "provider": "azure-openai",
      "endpoint": "chat/completions?api-version=2025-05-01-preview",
      "headers": { "api-key": "..." },
      "query": { "messages": [{"role":"user","content":"hi"}] }
    },
    {
      "provider": "google-vertex-ai",
      "endpoint": "publishers/google/models/gemini-2.5-flash:generateContent",
      "headers": { "authorization": "Bearer ya29..." },
      "query": { "contents": [{"role":"user","parts":[{"text":"hi"}]}] }
    }
  ]'
```
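
The same failover array from Python, for bridges that talk HTTP rather than shelling out to curl. A sketch with `requests`; the env var names are placeholders for your own key management:

```python
import os

import requests

ACCOUNT = os.environ["CF_ACCOUNT_ID"]
attempts = [
    {
        "provider": "openai",
        "endpoint": "chat/completions",
        "headers": {"authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        "query": {"model": "gpt-5", "messages": [{"role": "user", "content": "hi"}]},
    },
    {
        "provider": "azure-openai",
        "endpoint": "chat/completions?api-version=2025-05-01-preview",
        "headers": {"api-key": os.environ["AZURE_OPENAI_KEY"]},
        "query": {"messages": [{"role": "user", "content": "hi"}]},
    },
    # ...append the google-vertex-ai attempt from the curl example as needed
]
resp = requests.post(
    f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod",
    json=attempts,  # ordered: the gateway tries each until one succeeds
)
resp.raise_for_status()
print(resp.json())
```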

## Step 4 — Enable cache for FAQ-like turns

In the gateway settings, enable caching with a 1-hour TTL and a custom cache key that hashes the system prompt together with the user message. Voice agents often re-handle the same intent ("what are your hours?"); cache hits return in <50ms with no token cost.

```bash
curl ... \
  -H "cf-aig-cache-ttl: 3600" \
  -H "cf-aig-cache-key: $(echo -n "${SYSTEM_PROMPT}${USER_MSG}" | sha256sum | cut -d' ' -f1)"
```
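
The same headers from the OpenAI SDK, reusing the `client` from Step 2. A sketch: the key hashes system prompt plus user turn, matching the collision advice in the Pitfalls below.

```python
import hashlib

SYSTEM = "You are the CallSphere receptionist."  # placeholder system prompt
user_turn = "What are your hours?"
cache_key = hashlib.sha256((SYSTEM + user_turn).encode()).hexdigest()

resp = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_turn},
    ],
    extra_headers={
        "cf-aig-cache-ttl": "3600",     # 1-hour TTL
        "cf-aig-cache-key": cache_key,  # deterministic per (system, turn)
    },
)
```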

## Step 5 — Per-tenant rate limits

Use `cf-aig-metadata` to tag every call with a tenant ID, then create a rate-limit rule in the dashboard: "if metadata.tenant == X, max 50 req/min".

```python
client.chat.completions.create(
    model="gpt-5", messages=[{"role": "user", "content": "hi"}],
    extra_headers={"cf-aig-metadata": '{"tenant": "acme-co"}'},
)
```
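
When a tenant exhausts its budget, the gateway rejects the request. A sketch of client-side backoff, assuming the rejection surfaces to the SDK as a rate-limit error (verify the status code your gateway rule actually returns):

```python
import json
import time

import openai

def call_with_budget(client, tenant: str, messages: list, retries: int = 3):
    headers = {"cf-aig-metadata": json.dumps({"tenant": tenant})}
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="gpt-5", messages=messages, extra_headers=headers
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
    raise RuntimeError(f"tenant {tenant} still rate-limited after {retries} tries")
```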

## Step 6 — Observability

Every request lands in the AI Gateway dashboard with: latency, token counts, cache hits, errors, and a full request/response replay (gated by RBAC). Pipe to your warehouse via the Logpush sink.
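
You can also spot-check gateway behavior from the client. A sketch using the OpenAI SDK's `with_raw_response` wrapper to read response headers; `cf-aig-cache-status` is the documented cache header, but confirm the exact names your gateway version emits:

```python
raw = client.chat.completions.with_raw_response.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "What are your hours?"}],
)
print(raw.headers.get("cf-aig-cache-status"))  # HIT on a cached turn, else MISS
completion = raw.parse()  # unwrap to the usual ChatCompletion object
```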

## Step 7 — Wire into your voice agent

Replace the upstream URL in your existing bridge (any of the previous posts) with the gateway URL. WebSocket realtime calls work the same — Cloudflare proxies the bidirectional socket transparently.
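
A sketch of the swap, assuming your bridge reads its upstream from an env var (`GATEWAY_BASE_URL` is a hypothetical name for illustration); leaving it unset falls back to direct OpenAI:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get(
        "GATEWAY_BASE_URL",            # the gateway URL when set...
        "https://api.openai.com/v1",   # ...direct OpenAI otherwise
    ),
)
```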

## Pitfalls

- **Universal failover is HTTP-only**: you can't currently use the JSON-array failover for streaming WS endpoints, so Realtime sockets need client-side fallback (see the sketch after this list).
- **Cache key collisions**: don't cache by user prompt alone — include system prompt + temperature.
- **Provider quirks**: Azure OpenAI requires `api-version` in the URL; Vertex requires a Google OAuth bearer token that must be refreshed periodically. Handle these in your code, not in the gateway.
- **Per-request logs** are sampled at high QPS; turn on full logging only for forensic analysis.
- **Cost**: Gateway itself is free up to 100k req/day; beyond that it's $1 per 1M requests on the Pro plan.
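
Since Universal failover doesn't cover WebSockets, the Realtime fallback chain from the architecture diagram has to live in your bridge. A minimal sketch: try each candidate socket in order; the Azure and Google entries are placeholders for your actual endpoints and auth.

```python
import asyncio
import os

import websockets

ACCOUNT = os.environ["CF_ACCOUNT_ID"]
CANDIDATES = [
    (
        f"wss://gateway.ai.cloudflare.com/v1/{ACCOUNT}/voice-prod"
        "/openai/realtime?model=gpt-realtime",
        {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
         "OpenAI-Beta": "realtime=v1"},
    ),
    # ("wss://<your-azure-voice-live-endpoint>", {...}),  # fallback 1
    # ("wss://<your-gemini-live-endpoint>", {...}),       # fallback 2
]

async def connect_with_failover(timeout: float = 5.0):
    for url, headers in CANDIDATES:
        try:
            return await asyncio.wait_for(
                websockets.connect(url, additional_headers=headers), timeout
            )
        except Exception as exc:
            print(f"failover: {url.split('?')[0]} failed: {exc!r}")
    raise RuntimeError("all realtime providers are down")
```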

## How CallSphere does this in production

CallSphere routes between OpenAI Realtime, Anthropic Claude on Bedrock, and Gemini Flash through our own model router, a FastAPI service on :8084, because we need per-tenant routing tied to our 115+ Postgres tables: Healthcare PHI tenants must hit Bedrock, while OneRoof multi-family hits OpenAI. AI Gateway is excellent for teams without that complexity. CallSphere runs 37 voice agents and 90+ tools across 6 verticals, with plans at $149/$499/$1499, a 14-day trial, and a 22% affiliate program.

## FAQ

**Q: Can I cache speech-to-speech audio?**
Not directly through Gateway — caching is text-payload-aware. Cache the LLM tier of your sandwich; STT/TTS layers handle their own caching.

**Q: Does Gateway speak the OpenAI Realtime WS protocol?**
Yes — it transparently proxies; no translation needed.

**Q: How does Gateway compare to LiteLLM?**
LiteLLM is self-hosted and gives you full control. Gateway is managed and on Cloudflare's edge; lower latency, less ops.

**Q: Can I do A/B testing across models?**
Yes — reorder the JSON-array attempts per request to weight providers client-side, or split at the tenant level via `cf-aig-metadata` and compare cohorts in the analytics.
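
A sketch of the deterministic split: hash the tenant ID into a bucket so each tenant consistently lands in one arm, then reorder the Universal-endpoint attempts accordingly:

```python
import hashlib

def provider_order(tenant: str, split: float = 0.5) -> list:
    """Stable A/B bucket per tenant; reorder failover attempts by arm."""
    bucket = int(hashlib.sha256(tenant.encode()).hexdigest(), 16) % 100
    if bucket < split * 100:
        return ["openai", "azure-openai"]  # arm A: OpenAI primary
    return ["azure-openai", "openai"]      # arm B: Azure primary

print(provider_order("acme-co"))
```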

**Q: What's the latency overhead?**
Typically ~10-30ms over going direct, and sometimes a wash, since Cloudflare's edge POPs are often closer to your users than the LLM provider's endpoints.

## Sources

- [Cloudflare AI Gateway Overview](https://developers.cloudflare.com/ai-gateway/)
- [AI Gateway Universal Endpoint](https://developers.cloudflare.com/ai-gateway/providers/universal/)
- [Cloudflare AI Platform — inference layer for agents](https://blog.cloudflare.com/ai-platform/)
- [Cloudflare AI Gateway pricing — Truefoundry](https://www.truefoundry.com/blog/cloudflare-ai-gateway-pricing)
- [Top 5 LLM Gateways in 2026 — Maxim AI](https://www.getmaxim.ai/articles/top-5-llm-gateways-in-2026-a-production-ready-comparison/)

