
Build a Voice Agent on Azure AI Foundry with GPT-5 Realtime (2026)

Use Microsoft Foundry's GPT Realtime API plus the Voice Live API for a sub-second voice agent: real Python code, Speech Service config, Azure AD auth, and deployment to Container Apps.

TL;DR — Microsoft Foundry (formerly Azure AI Foundry) ships the GPT Realtime API and a higher-level Voice Live API that bundles STT + LLM + TTS. Both speak the OpenAI Realtime WebSocket protocol. Pair with Azure AD token auth and Container Apps for HIPAA/SOC2-aligned voice agents.

What you'll build

A Python service hosted on Azure Container Apps that exposes a WebSocket bridge for browser callers, talks to the Voice Live API at /voice-agent/realtime, uses a custom system prompt, and falls back to GPT-5 with manual STT/TTS for tenants that need a custom voice. AAD-authenticated, scoped to a specific Azure AI Foundry project.

Prerequisites

  1. Azure subscription with Microsoft Foundry resource provisioned.
  2. Speech Service resource in the same region (East US 2 or West Europe).
  3. Python 3.11, openai>=1.55, azure-identity, websockets.
  4. az CLI logged in; an AAD-managed identity for Container Apps.

Architecture

```mermaid
flowchart LR
  B[Browser Caller] -->|wss| BR[FastAPI Bridge Container Apps]
  BR -->|AAD token| KV[(Key Vault)]
  BR -->|Voice Live API wss| VL[Foundry Voice Live]
  VL --> GPT5[GPT-5 / GPT-Realtime-Mini]
  BR -->|fallback| STT[Azure Speech STT]
  STT --> GPT5
  GPT5 --> TTS[Azure Speech Neural TTS]
  TTS --> BR
  BR --> B
```

Step 1 — Provision Foundry + Speech

```bash
az group create -n vox -l eastus2
az cognitiveservices account create -g vox -n vox-foundry \
  --kind AIServices --sku S0 -l eastus2
az cognitiveservices account create -g vox -n vox-speech \
  --kind SpeechServices --sku S0 -l eastus2
```

Note the endpoint URL — it will be https://vox-foundry.cognitiveservices.azure.com.

Step 2 — Get an AAD token (no API keys in code)

```python
from azure.identity import DefaultAzureCredential

cred = DefaultAzureCredential()

def aad_token() -> str:
    # Token scoped to Cognitive Services; valid for about an hour.
    return cred.get_token("https://cognitiveservices.azure.com/.default").token
```

Set Cognitive Services User role on the Container Apps managed identity for both resources.
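That role assignment can be scripted with the az CLI. A sketch, assuming the resource names from Step 1; `<principal-id>` is a placeholder for the managed identity's object ID:

```shell
# <principal-id> is the managed identity's object ID (placeholder -- substitute your own).
az role assignment create \
  --assignee "<principal-id>" \
  --role "Cognitive Services User" \
  --scope $(az cognitiveservices account show -g vox -n vox-foundry --query id -o tsv)

az role assignment create \
  --assignee "<principal-id>" \
  --role "Cognitive Services User" \
  --scope $(az cognitiveservices account show -g vox -n vox-speech --query id -o tsv)
```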

Step 3 — Connect to Voice Live API

```python
import asyncio, base64, json
import websockets

ENDPOINT = (
    "wss://vox-foundry.cognitiveservices.azure.com/voice-agent/realtime"
    "?api-version=2025-05-01-preview&model=gpt-realtime"
)

async def voice_session():
    headers = {"Authorization": f"Bearer {aad_token()}"}
    async with websockets.connect(ENDPOINT, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a friendly receptionist. Keep replies short.",
                "voice": "en-US-AvaMultilingualNeural",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad", "threshold": 0.5},
                "tools": [],
            },
        }))
        async for msg in ws:
            ev = json.loads(msg)
            # handle response.audio.delta, response.done, etc.
```

The Voice Live wire format mirrors OpenAI Realtime; only the URL and auth differ.
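Because only those two things differ, the upstream can be made pluggable. A minimal sketch: the helper name is ours, the Azure hostname reuses the resource from Step 1, and the OpenAI branch assumes its public Realtime endpoint:

```python
def realtime_target(backend: str, token: str, model: str = "gpt-realtime"):
    """Return (url, headers) for the same Realtime-protocol session
    against either upstream. Hostnames here are illustrative."""
    if backend == "azure":
        url = (
            "wss://vox-foundry.cognitiveservices.azure.com/voice-agent/realtime"
            f"?api-version=2025-05-01-preview&model={model}"
        )
    else:
        # OpenAI's public Realtime endpoint
        url = f"wss://api.openai.com/v1/realtime?model={model}"
    return url, {"Authorization": f"Bearer {token}"}
```

The rest of the session code (session.update, audio buffers, events) stays identical across both backends.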

Step 4 — Bridge browser audio

```python
import asyncio, base64, json

import websockets
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/agent")
async def agent(ws: WebSocket):
    await ws.accept()
    headers = {"Authorization": f"Bearer {aad_token()}"}
    async with websockets.connect(ENDPOINT, additional_headers=headers) as az:

        async def in_loop():
            # Forward raw PCM frames from the browser to the Voice Live socket.
            async for frame in ws.iter_bytes():
                await az.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(frame).decode(),
                }))

        async def out_loop():
            # Stream synthesized audio deltas back to the browser.
            async for msg in az:
                ev = json.loads(msg)
                if ev["type"] == "response.audio.delta":
                    await ws.send_bytes(base64.b64decode(ev["delta"]))

        await asyncio.gather(in_loop(), out_loop())
```

Step 5 — Add tool calling (function calls)

In session.update, include:

```json
"tools": [{
  "type": "function",
  "name": "lookup_appointment",
  "description": "Get next appointment for a patient",
  "parameters": {
    "type": "object",
    "properties": {"patient_id": {"type": "string"}},
    "required": ["patient_id"]
  }
}],
"tool_choice": "auto"
```

When the model calls the tool, you get a response.function_call_arguments.done event; reply with conversation.item.create of type function_call_output then response.create.
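That reply sequence can be packaged as a small helper. A sketch assuming the event fields named above (call_id, arguments); the helper name is ours:

```python
import json

def tool_reply_events(ev: dict, result: dict) -> list[str]:
    """Given a response.function_call_arguments.done event, build the two
    client events to send back: the function_call_output item, then a
    fresh response.create so the model speaks the result."""
    assert ev["type"] == "response.function_call_arguments.done"
    item = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": ev["call_id"],
            "output": json.dumps(result),
        },
    }
    return [json.dumps(item), json.dumps({"type": "response.create"})]
```

Each returned string goes straight out on the upstream WebSocket, in order.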

Step 6 — Fallback to GPT-5 + Azure Speech (custom voices)

For tenants who need a Custom Neural Voice, swap the Voice Live socket for the standard GPT-5 chat completions API plus Azure Speech SDK STT/TTS:


```python
import azure.cognitiveservices.speech as speechsdk

# The Speech SDK's AAD auth token format is "aad#<resource-id>#<token>",
# where <resource-id> is the Speech resource's full ARM ID.
resource_id = "/subscriptions/<sub>/resourceGroups/vox/providers/Microsoft.CognitiveServices/accounts/vox-speech"
token = cred.get_token("https://cognitiveservices.azure.com/.default").token
sc = speechsdk.SpeechConfig(auth_token=f"aad#{resource_id}#{token}", region="eastus2")
sc.speech_synthesis_voice_name = "en-US-MyCustomVoice"
```

Step 7 — Deploy to Container Apps

```bash
az containerapp env create -g vox -n vox-env -l eastus2
az containerapp create -g vox -n vox-agent --environment vox-env \
  --image ghcr.io/you/vox:latest --target-port 8080 --ingress external \
  --user-assigned my-managed-identity --min-replicas 1 --max-replicas 50
```

Container Apps' built-in WebSocket support means no Front Door needed for dev.

Pitfalls

  • Region locking: Voice Live API GA regions in May 2026 are East US 2, Sweden Central, West Europe. Deploy your container in the same region as Foundry.
  • AAD token TTL is 1 hour — refresh on long calls.
  • Voice list: en-US-AvaMultilingualNeural is the default; pick from the Speech Studio voice gallery.
  • Custom voices need a Custom Neural Voice deployment — only available for "Pro" tier customers.
  • Cost: Voice Live API is roughly $0.06/min; GPT-5 + STT/TTS combined is cheaper (~$0.04/min) but you eat the integration cost.
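The one-hour token TTL above is straightforward to handle with a cache that refreshes ahead of expiry. A sketch (the class name is ours; it works with any azure-identity-style credential exposing get_token):

```python
import time

class TokenCache:
    """Refresh an AAD token shortly before its roughly one-hour expiry.

    `credential` is any object with an Azure-style get_token(scope) method
    returning an object with .token and .expires_on (epoch seconds),
    e.g. azure.identity.DefaultAzureCredential.
    """
    def __init__(self, credential,
                 scope="https://cognitiveservices.azure.com/.default",
                 skew_seconds=300):
        self._credential = credential
        self._scope = scope
        self._skew = skew_seconds
        self._token = None
        self._expires_on = 0.0

    def token(self) -> str:
        # Refresh on first use, or when within `skew_seconds` of expiry.
        if time.time() >= self._expires_on - self._skew:
            t = self._credential.get_token(self._scope)
            self._token, self._expires_on = t.token, t.expires_on
        return self._token
```

On a long call, call token() before each reconnect or keepalive rather than holding one token for the session's lifetime.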

How CallSphere does this in production

CallSphere runs a multi-cloud strategy: OpenAI Realtime as primary, Azure Voice Live API as a fallback for tenants in Microsoft procurement frameworks (we have several enterprise Healthcare customers locked to Azure). Same FastAPI :8084 surface, same 90+ tools, same 115+ DB tables, just a different upstream socket. 37 agents across 6 verticals. Pricing: $149/$499/$1499, 14-day trial, 22% affiliate.

FAQ

Q: GPT-5 vs GPT-Realtime-Mini for voice? Mini is purpose-built for voice — lower latency and 30% cheaper. GPT-5 wins on complex reasoning but adds 200-400ms.

Q: Can I use my own STT/TTS with Voice Live? No — Voice Live is fully managed. For BYO STT/TTS, drop down to the standalone Realtime API + Azure Speech SDK.

Q: HIPAA? Sign a BAA via the Azure portal, enable Customer Managed Keys on the Foundry resource, use Private Endpoints to keep audio off the public internet.

Q: Latency target? ~600-800ms voice-to-voice on Voice Live in East US 2.
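One way to verify that number in your own deployment is to time from the input_audio_buffer.committed server event to the first response.audio.delta. A rough sketch (the class name is ours; an injectable clock keeps it testable):

```python
import time

class LatencyProbe:
    """Rough voice-to-voice latency: elapsed time from committing the
    caller's audio to the first synthesized audio delta."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._t0 = None
        self.last_ms = None

    def on_event(self, ev: dict):
        t = ev.get("type")
        if t == "input_audio_buffer.committed":
            self._t0 = self._clock()
        elif t == "response.audio.delta" and self._t0 is not None:
            self.last_ms = (self._clock() - self._t0) * 1000
            self._t0 = None  # only measure up to the first delta
        return self.last_ms
```

Feed every upstream event through on_event inside the bridge's out_loop and log last_ms per turn.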

Q: Streaming function-call args? Yes — listen to response.function_call_arguments.delta and parse incrementally for ultra-low latency tool dispatch.
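A minimal accumulator for those delta events, as a sketch: it buffers per call_id and parses once the arguments are complete (truly incremental dispatch would need a streaming JSON parser on top):

```python
import json

class ArgStream:
    """Accumulate response.function_call_arguments.delta chunks per
    call_id; return the parsed arguments dict at the .done event."""
    def __init__(self):
        self._buf: dict[str, str] = {}

    def feed(self, ev: dict):
        t = ev.get("type")
        if t == "response.function_call_arguments.delta":
            self._buf[ev["call_id"]] = self._buf.get(ev["call_id"], "") + ev["delta"]
        elif t == "response.function_call_arguments.done":
            # Prefer the accumulated buffer; fall back to the done
            # event's own arguments field if no deltas were seen.
            raw = self._buf.pop(ev["call_id"], "") or ev.get("arguments", "{}")
            return json.loads(raw)
        return None
```

feed() returns None until the arguments are complete, so the event loop can call it on every message and dispatch the tool as soon as it gets a dict back.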
