---
title: "Build a Voice Agent on Azure AI Foundry with GPT-5 Realtime (2026)"
description: "Use Microsoft Foundry's GPT Realtime API plus Voice Live API for a sub-second voice agent. Real C# and Python code, Speech Service config, Azure AD auth, deploy to Container Apps."
canonical: https://callsphere.ai/blog/vw5h-build-voice-agent-azure-ai-foundry-gpt5-realtime
category: "AI Voice Agents"
tags: ["Azure", "AI Foundry", "GPT-5", "Realtime", "Speech", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-30T00:00:00.000Z
updated: 2026-05-07T16:30:06.929Z
---

# Build a Voice Agent on Azure AI Foundry with GPT-5 Realtime (2026)

> Use Microsoft Foundry's GPT Realtime API plus Voice Live API for a sub-second voice agent. Real C# and Python code, Speech Service config, Azure AD auth, deploy to Container Apps.

> **TL;DR** — Microsoft Foundry (formerly Azure AI Foundry) ships the GPT Realtime API and a higher-level Voice Live API that bundles STT + LLM + TTS. Both speak the OpenAI Realtime WebSocket protocol. Pair with Azure AD token auth and Container Apps for HIPAA/SOC2-aligned voice agents.

## What you'll build

A Python service hosted on Azure Container Apps that exposes a WebSocket bridge for browser callers, talks to the **Voice Live API** at `/voice-agent/realtime`, uses a custom system prompt, and falls back to **GPT-5** with manual STT/TTS for tenants that need a custom voice. AAD-authenticated, scoped to a specific Azure AI Foundry project.

## Prerequisites

1. Azure subscription with Microsoft Foundry resource provisioned.
2. Speech Service resource in the same region (East US 2 or West Europe).
3. Python 3.11, `openai>=1.55`, `azure-identity`, `websockets`.
4. `az` CLI logged in; an AAD-managed identity for Container Apps.

## Architecture

```mermaid
flowchart LR
  B[Browser Caller] -->|wss| BR[FastAPI Bridge Container Apps]
  BR -->|AAD token| KV[(Key Vault)]
  BR -->|Voice Live API wss| VL[Foundry Voice Live]
  VL --> GPT5[GPT-5 / GPT-Realtime-Mini]
  BR -->|fallback| STT[Azure Speech STT]
  STT --> GPT5
  GPT5 --> TTS[Azure Speech Neural TTS]
  TTS --> BR
  BR --> B
```

## Step 1 — Provision Foundry + Speech

```bash
az group create -n vox -l eastus2
az cognitiveservices account create -g vox -n vox-foundry --kind AIServices --sku S0 -l eastus2
az cognitiveservices account create -g vox -n vox-speech --kind SpeechServices --sku S0 -l eastus2
```

Note the endpoint URL — it will be `https://vox-foundry.cognitiveservices.azure.com`.
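Rather than hand-constructing the URL, you can read it back from the resource itself (the `--query` path below assumes the standard `az cognitiveservices` output shape):

```bash
# Print the endpoint for the Foundry resource created above
az cognitiveservices account show -g vox -n vox-foundry \
  --query properties.endpoint -o tsv
```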

## Step 2 — Get an AAD token (no API keys in code)

```python
from azure.identity import DefaultAzureCredential
cred = DefaultAzureCredential()
def aad_token():
    return cred.get_token("https://cognitiveservices.azure.com/.default").token
```

Set `Cognitive Services User` role on the Container Apps managed identity for both resources.
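AAD tokens expire (see Pitfalls below), and fetching a fresh one on every WebSocket connect adds latency. A small cache that refreshes a few minutes before expiry avoids both problems. This is a minimal sketch: it takes any `fetch` callable returning `(token, expires_on_epoch_seconds)`, so it isn't tied to `azure-identity` — in production you'd wrap `cred.get_token(...)`, which returns an object with `.token` and `.expires_on`.

```python
import time

class TokenCache:
    """Cache a bearer token and refresh it shortly before it expires.

    `fetch` is any callable returning (token, expires_on_epoch_seconds),
    e.g. a thin wrapper around DefaultAzureCredential.get_token().
    """
    def __init__(self, fetch, skew: float = 300.0):
        self._fetch = fetch
        self._skew = skew          # refresh this many seconds early
        self._token = None
        self._expires_on = 0.0

    def get(self) -> str:
        if self._token is None or time.time() >= self._expires_on - self._skew:
            self._token, self._expires_on = self._fetch()
        return self._token

# Hypothetical wiring against azure-identity:
# cache = TokenCache(lambda: (lambda t: (t.token, t.expires_on))(
#     cred.get_token("https://cognitiveservices.azure.com/.default")))
```

Call `cache.get()` wherever the snippets below call `aad_token()`.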

## Step 3 — Connect to Voice Live API

```python
import asyncio, websockets, json, base64
ENDPOINT = "wss://vox-foundry.cognitiveservices.azure.com/voice-agent/realtime?api-version=2025-05-01-preview&model=gpt-realtime"

async def voice_session():
    headers = {"Authorization": f"Bearer {aad_token()}"}
    async with websockets.connect(ENDPOINT, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a friendly receptionist. Keep replies short.",
                "voice": "en-US-AvaMultilingualNeural",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad", "threshold": 0.5},
                "tools": [],
            }
        }))
        async for msg in ws:
            ev = json.loads(msg)
            # handle response.audio.delta, response.done, etc.
```

The Voice Live wire format mirrors OpenAI Realtime; only the URL and auth differ.

## Step 4 — Bridge browser audio

```python
# Continues the Step 3 module: reuses asyncio, websockets, json, base64,
# ENDPOINT, and aad_token defined there.
from fastapi import FastAPI, WebSocket
app = FastAPI()

@app.websocket("/agent")
async def agent(ws: WebSocket):
    await ws.accept()
    headers = {"Authorization": f"Bearer {aad_token()}"}
    async with websockets.connect(ENDPOINT, additional_headers=headers) as az:
        async def in_loop():
            async for frame in ws.iter_bytes():
                await az.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(frame).decode()
                }))
        async def out_loop():
            async for msg in az:
                ev = json.loads(msg)
                if ev["type"] == "response.audio.delta":
                    await ws.send_bytes(base64.b64decode(ev["delta"]))
        await asyncio.gather(in_loop(), out_loop())
```

## Step 5 — Add tool calling (function calls)

In `session.update`, include:

```json
"tools": [{
  "type": "function",
  "name": "lookup_appointment",
  "description": "Get next appointment for a patient",
  "parameters": {"type":"object","properties":{"patient_id":{"type":"string"}},"required":["patient_id"]}
}],
"tool_choice": "auto"
```

When the model calls the tool, you get a `response.function_call_arguments.done` event; reply with `conversation.item.create` of type `function_call_output` then `response.create`.
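The round-trip above can be sketched as a small helper that turns a `...arguments.done` event plus your tool's result into the two messages to send back over the socket. The event field names (`call_id`, `arguments`) follow the OpenAI Realtime wire format that Voice Live mirrors; `lookup_appointment` is the hypothetical tool from the session config above.

```python
import json

def tool_output_events(ev: dict, result) -> list[str]:
    """Given a response.function_call_arguments.done event and the tool's
    result, build the two frames to send: a function_call_output item,
    then response.create to let the model speak the answer."""
    item = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": ev["call_id"],
            "output": json.dumps(result),
        },
    }
    return [json.dumps(item), json.dumps({"type": "response.create"})]

# In the receive loop:
# if ev["type"] == "response.function_call_arguments.done":
#     result = lookup_appointment(**json.loads(ev["arguments"]))
#     for frame in tool_output_events(ev, result):
#         await ws.send(frame)
```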

## Step 6 — Fallback to GPT-5 + Azure Speech (custom voices)

For tenants who need a Custom Neural Voice, swap the Voice Live socket for the standard GPT-5 chat completions API plus Azure Speech SDK STT/TTS:

```python
import azure.cognitiveservices.speech as speechsdk
# Speech SDK AAD tokens take the form "aad#<resource-id>#<token>", where
# <resource-id> is the Speech resource's full ARM resource ID.
resource_id = "/subscriptions/<sub>/resourceGroups/vox/providers/Microsoft.CognitiveServices/accounts/vox-speech"
aad = cred.get_token("https://cognitiveservices.azure.com/.default").token
sc = speechsdk.SpeechConfig(auth_token=f"aad#{resource_id}#{aad}", region="eastus2")
sc.speech_synthesis_voice_name = "en-US-MyCustomVoice"
```
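The turn loop itself is the same regardless of backend, so it helps to keep it pure and inject the STT/LLM/TTS pieces. This sketch assumes `complete` wraps a GPT-5 chat-completions call returning the reply text and `speak` wraps `SpeechSynthesizer.speak_text` on the config above — both names are ours, not SDK APIs:

```python
def fallback_turn(heard: str, complete, speak) -> str:
    """One STT -> LLM -> TTS turn with injectable backends.

    `heard` is the recognized utterance; `complete(messages) -> str` wraps
    chat completions; `speak(text) -> None` wraps speech synthesis."""
    reply = complete([
        {"role": "system",
         "content": "You are a friendly receptionist. Keep replies short."},
        {"role": "user", "content": heard},
    ])
    speak(reply)
    return reply
```

Swapping test doubles in for `complete` and `speak` also gives you a turn loop you can unit-test without touching Azure.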

## Step 7 — Deploy to Container Apps

```bash
az containerapp env create -g vox -n vox-env -l eastus2
az containerapp create -g vox -n vox-agent --environment vox-env \
  --image ghcr.io/you/vox:latest --target-port 8080 --ingress external \
  --user-assigned my-managed-identity --min-replicas 1 --max-replicas 50
```

Container Apps' built-in WebSocket support means no Front Door needed for dev.
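A minimal Dockerfile for the bridge might look like this — the exposed port must match the `--target-port 8080` flag above, and `main:app` assumes the FastAPI module from Step 4 lives in `main.py`:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Must match --target-port in the az containerapp create call
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```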

## Pitfalls

- **Region locking**: Voice Live API GA regions in May 2026 are East US 2, Sweden Central, West Europe. Deploy your container in the same region as Foundry.
- **AAD token TTL is 1 hour** — refresh on long calls.
- **Voice list**: `en-US-AvaMultilingualNeural` is the default; pick from the Speech Studio voice gallery.
- **Custom voices** need a Custom Neural Voice deployment — only available for "Pro" tier customers.
- **Cost**: Voice Live API is roughly $0.06/min; GPT-5 + STT/TTS combined is cheaper (~$0.04/min) but you eat the integration cost.

## How CallSphere does this in production

CallSphere runs a multi-cloud strategy: OpenAI Realtime as primary, Azure Voice Live API as a fallback for tenants in Microsoft procurement frameworks (we have several enterprise Healthcare customers locked to Azure). Same FastAPI :8084 surface, same 90+ tools, same 115+ DB tables, just a different upstream socket. 37 agents across 6 verticals. Pricing: $149/$499/$1499, 14-day trial, 22% affiliate.

## FAQ

**Q: GPT-5 vs GPT-Realtime-Mini for voice?**
Mini is purpose-built for voice — lower latency and 30% cheaper. GPT-5 wins on complex reasoning but adds 200-400ms.

**Q: Can I use my own STT/TTS with Voice Live?**
No — Voice Live is fully managed. For BYO STT/TTS, drop down to the standalone Realtime API + Azure Speech SDK.

**Q: HIPAA?**
Sign a BAA via the Azure portal, enable Customer Managed Keys on the Foundry resource, use Private Endpoints to keep audio off the public internet.

**Q: Latency target?**
~600-800ms voice-to-voice on Voice Live in East US 2.

**Q: Streaming function-call args?**
Yes — listen to `response.function_call_arguments.delta` and parse incrementally for ultra-low latency tool dispatch.
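Incremental parsing amounts to buffering `delta` fragments per `call_id` until the `.done` event arrives, then decoding the accumulated JSON once. A minimal sketch, assuming the standard Realtime event shape (`call_id`, `delta`, and `arguments` on the done event):

```python
import json

class StreamingArgs:
    """Accumulate response.function_call_arguments.delta events per call_id
    and return the parsed arguments when the .done event arrives."""
    def __init__(self):
        self._buf: dict[str, list[str]] = {}

    def feed(self, ev: dict):
        if ev["type"] == "response.function_call_arguments.delta":
            self._buf.setdefault(ev["call_id"], []).append(ev["delta"])
            return None
        if ev["type"] == "response.function_call_arguments.done":
            raw = "".join(self._buf.pop(ev["call_id"], [])) or ev.get("arguments", "")
            return json.loads(raw)
        return None
```

For truly speculative dispatch you would also run a partial-JSON parser over the buffer on every delta; the sketch above only parses once the arguments are complete.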

## Sources

- [Use the GPT Realtime API for speech and audio with Azure OpenAI](https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/realtime-audio)
- [Voice Live API Overview — Microsoft Learn](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/voice-live)
- [How to build a voice agent — Microsoft Foundry](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-voice-agent-integration)
- [Foundry Models sold directly by Azure](https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure)
- [What's new in Microsoft Foundry — Dec 2025 / Jan 2026](https://devblogs.microsoft.com/foundry/whats-new-in-microsoft-foundry-dec-2025-jan-2026/)

---

Source: https://callsphere.ai/blog/vw5h-build-voice-agent-azure-ai-foundry-gpt5-realtime
