Build an AI Voice Agent with Azure Communication Services Call Automation (2026)

Wire ACS Call Automation bidirectional streaming to the Voice Live API for production PSTN AI agents: a real C# web app, EventGrid hookup, mid-call barge-in, and a transfer-to-human flow.

TL;DR: As of January 2026, ACS Call Automation bidirectional streaming is GA. You purchase a phone number through ACS, hook EventGrid to your webhook, answer the call with AnswerCallAsync, pass MediaStreamingOptions with EnableBidirectional = true, and pipe the audio frames into the Voice Live API. The whole loop fits in a small ASP.NET service.

What you'll build

A C# ASP.NET service that answers ACS-routed calls, opens a bidirectional media stream, and bridges audio to Voice Live API. Mid-call the agent can call a tool to fetch order status, and a "transfer to human" intent triggers AddParticipant with a queue's phone number.

Prerequisites

  1. Azure subscription with ACS resource and a purchased PSTN number.
  2. EventGrid topic subscribed to ACS events.
  3. Voice Live API enabled on a Foundry resource (same region recommended).
  4. .NET 8, Azure.Communication.CallAutomation NuGet, Azure.Identity.
  5. Public HTTPS callback URL (use ngrok or Container Apps for dev).

Architecture

flowchart TD
  PSTN[Caller PSTN] --> ACS[ACS Call Automation]
  ACS -->|EventGrid IncomingCall| API[ASP.NET Webhook]
  API -->|AnswerCall + MediaStreaming| ACS
  ACS <-->|wss audio frames| API
  API <-->|wss| VL[Voice Live API]
  VL --> GPT[gpt-realtime-mini]
  API -->|AddParticipant| QUEUE[Human Queue Number]

Step 1 — Wire the EventGrid IncomingCall handler

```csharp
// Needs: Azure.Messaging.EventGrid, Azure.Messaging.EventGrid.SystemEvents,
//        Azure.Communication.CallAutomation, Microsoft.AspNetCore.Mvc
[HttpPost("/incoming")]
public async Task<IActionResult> Incoming([FromBody] EventGridEvent[] events)
{
    foreach (var e in events)
    {
        // Validation handshake: echo the code back so the subscription activates.
        if (e.EventType == "Microsoft.EventGrid.SubscriptionValidationEvent")
        {
            var data = e.Data.ToObjectFromJson<SubscriptionValidationEventData>();
            return Ok(new { validationResponse = data.ValidationCode });
        }

        if (e.EventType == "Microsoft.Communication.IncomingCall")
        {
            var d = e.Data.ToObjectFromJson<AcsIncomingCallEventData>();
            await _calls.AnswerCallAsync(
                new AnswerCallOptions(d.IncomingCallContext, new Uri($"{Host}/callback"))
                {
                    // The constructor shape can differ between SDK versions; check yours.
                    MediaStreamingOptions = new MediaStreamingOptions(
                        new Uri($"wss://{HostNoScheme}/media"),
                        MediaStreamingContent.Audio,
                        MediaStreamingAudioChannel.Mixed,
                        startMediaStreaming: true)
                    {
                        EnableBidirectional = true,
                        AudioFormat = AudioFormat.Pcm24KMono
                    }
                });
        }
    }
    return Ok();
}
```

Step 2 — Accept the WebSocket and bridge to Voice Live

```csharp
[HttpGet("/media")]
public async Task Media()
{
    if (!HttpContext.WebSockets.IsWebSocketRequest)
    {
        HttpContext.Response.StatusCode = 400;
        return;
    }

    // Socket opened by ACS for the call's media stream.
    var acs = await HttpContext.WebSockets.AcceptWebSocketAsync();

    // Outbound socket to Voice Live, authenticated with an AAD bearer token.
    using var vl = new ClientWebSocket();
    vl.Options.SetRequestHeader("Authorization", $"Bearer {await GetAadToken()}");
    await vl.ConnectAsync(new Uri("wss://vox-foundry.cognitiveservices.azure.com/voice-agent/realtime?api-version=2025-05-01-preview&model=gpt-realtime"), default);

    await SendSessionUpdate(vl);
    var t1 = Pump(acs, vl, ParseAcsFrame);   // ACS -> Voice Live
    var t2 = Pump(vl, acs, FormatAcsFrame);  // Voice Live -> ACS
    await Task.WhenAny(t1, t2);              // either side closing ends the bridge
}
```

With AudioFormat.Pcm24KMono, ACS bidirectional frames are JSON-wrapped base64 PCM at 24kHz mono, the same sample rate Voice Live uses natively, so no resampling is needed.
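To sanity-check buffer sizes for that format, the arithmetic can be sketched as below. The 16-bit sample width and the 20 ms frame duration are assumptions that the text above does not state, and `PcmMath` with its method names is hypothetical. Under those assumptions a 24kHz mono frame is 960 bytes of PCM, or 1,280 base64 characters on the wire.

```csharp
using System;

static class PcmMath
{
    // Raw PCM bytes in one frame. bitsPerSample = 16 is an assumption.
    public static int FrameBytes(int sampleRateHz, int frameMs,
                                 int channels = 1, int bitsPerSample = 16)
        => sampleRateHz * frameMs / 1000 * channels * (bitsPerSample / 8);

    // Length of the padded base64 text for rawBytes bytes:
    // every group of up to 3 bytes becomes 4 characters.
    public static int Base64Length(int rawBytes) => (rawBytes + 2) / 3 * 4;
}
```

Sizing the receive buffer from these numbers (plus the JSON envelope) avoids reassembling a single frame across several reads.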

Step 3 — Frame format helpers

```csharp
// Shape of an inbound ACS media frame; property names mirror the JSON keys.
record AcsAudioFrame(string kind, AcsAudioData audioData);
record AcsAudioData(string data, string timestamp, string participantRawID, bool silent);

// ACS frame -> raw PCM. Metadata and silent frames yield an empty array.
byte[] ParseAcsFrame(byte[] raw)
{
    var f = JsonSerializer.Deserialize<AcsAudioFrame>(raw);
    return f?.kind == "AudioData" && !f.audioData.silent
        ? Convert.FromBase64String(f.audioData.data)
        : Array.Empty<byte>();
}

// Raw PCM -> ACS frame.
byte[] FormatAcsFrame(byte[] pcm)
{
    var f = new
    {
        kind = "AudioData",
        audioData = new
        {
            data = Convert.ToBase64String(pcm),
            timestamp = DateTime.UtcNow.ToString("o")
        }
    };
    return JsonSerializer.SerializeToUtf8Bytes(f);
}
```
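ParseAcsFrame and FormatAcsFrame move raw PCM, but the Voice Live side of the socket exchanges JSON events, so in practice each direction needs an envelope translation as well. The sketch below is one way to do it, and it is built on assumptions: the event names `input_audio_buffer.append` (field `audio`), `response.audio.delta` (field `delta`) and `input_audio_buffer.speech_started` follow the OpenAI Realtime convention, and the `StopAudio` frame shape is taken from the ACS audio streaming format as I understand it. Confirm both, including key casing, against your SDK and API version. `VoiceLiveBridge` is a hypothetical name. The `speech_started` branch is what gives you barge-in: when the caller starts talking, ACS is told to drop any agent audio it has queued.

```csharp
using System;
using System.Text.Json;

static class VoiceLiveBridge
{
    // ACS inbound frame -> Voice Live input event.
    // Returns an empty array for metadata or silent frames.
    public static byte[] AcsToVoiceLive(byte[] acsFrame)
    {
        using var doc = JsonDocument.Parse(acsFrame);
        var root = doc.RootElement;
        if (!root.TryGetProperty("kind", out var kind) || kind.GetString() != "AudioData")
            return Array.Empty<byte>();

        var audio = root.GetProperty("audioData");
        if (audio.TryGetProperty("silent", out var silent) && silent.ValueKind == JsonValueKind.True)
            return Array.Empty<byte>();

        // The payload is already base64, so it is forwarded without decoding.
        return JsonSerializer.SerializeToUtf8Bytes(new
        {
            type = "input_audio_buffer.append",
            audio = audio.GetProperty("data").GetString()
        });
    }

    // Voice Live event -> ACS outbound frame (empty array = nothing to send).
    public static byte[] VoiceLiveToAcs(byte[] voiceLiveEvent)
    {
        using var doc = JsonDocument.Parse(voiceLiveEvent);
        var root = doc.RootElement;
        var type = root.TryGetProperty("type", out var t) ? t.GetString() : null;

        return type switch
        {
            // Agent speech: wrap the base64 chunk in an ACS AudioData frame.
            "response.audio.delta" => JsonSerializer.SerializeToUtf8Bytes(new
            {
                kind = "AudioData",
                audioData = new { data = root.GetProperty("delta").GetString() }
            }),
            // Barge-in: the caller started speaking, so flush queued agent audio.
            "input_audio_buffer.speech_started" => JsonSerializer.SerializeToUtf8Bytes(new
            {
                kind = "StopAudio",
                stopAudio = new { }
            }),
            _ => Array.Empty<byte>()
        };
    }
}
```

If your Pump skips empty results, these two functions can stand in for ParseAcsFrame and FormatAcsFrame in Step 2. Tool-call events would still need their own handling.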

Step 4 — Tool: lookup_order via function calling

Register the tool in session.update. When Voice Live emits response.function_call_arguments.done, dispatch to your CRM SDK, then reply with a conversation.item.create whose item type is function_call_output, followed by response.create. This is the same pattern as OpenAI Realtime.
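A minimal sketch of those three messages, assuming the payload shapes match OpenAI Realtime as the step says. The `order_id` parameter, the tool description, the `OrderTool` class and the `lookupOrder` delegate are all hypothetical stand-ins for your own schema and CRM call.

```csharp
using System;
using System.Text.Json;

static class OrderTool
{
    // session.update payload that registers lookup_order (schema is illustrative).
    public static string SessionUpdate() => JsonSerializer.Serialize(new
    {
        type = "session.update",
        session = new
        {
            tools = new[]
            {
                new
                {
                    type = "function",
                    name = "lookup_order",
                    description = "Look up the status of an order by its ID.",
                    parameters = new
                    {
                        type = "object",
                        properties = new { order_id = new { type = "string" } },
                        required = new[] { "order_id" }
                    }
                }
            },
            tool_choice = "auto"
        }
    });

    // Turns a response.function_call_arguments.done event into the two replies
    // to send back; returns null for any other event or tool.
    public static string[]? HandleEvent(string eventJson, Func<string, string> lookupOrder)
    {
        using var doc = JsonDocument.Parse(eventJson);
        var root = doc.RootElement;
        if (root.GetProperty("type").GetString() != "response.function_call_arguments.done")
            return null;
        if (root.TryGetProperty("name", out var name) && name.GetString() != "lookup_order")
            return null;

        // "arguments" arrives as a JSON-encoded string, so it needs a second parse.
        using var args = JsonDocument.Parse(root.GetProperty("arguments").GetString()!);
        var result = lookupOrder(args.RootElement.GetProperty("order_id").GetString()!);

        var output = JsonSerializer.Serialize(new
        {
            type = "conversation.item.create",
            item = new
            {
                type = "function_call_output",
                call_id = root.GetProperty("call_id").GetString(),
                output = result
            }
        });
        // response.create asks the model to speak using the tool result.
        return new[] { output, "{\"type\":\"response.create\"}" };
    }
}
```

Send `SessionUpdate()` once after connecting, and run `HandleEvent` on every text message received from Voice Live. A null result means the event was not a lookup_order call.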

Step 5 — Transfer to human

When the user says "agent please", parse the model's signal (a tool call request_transfer is the cleanest), then:

```csharp
await _calls.GetCallConnection(callConnectionId).AddParticipantAsync(
    new CallInvite(
        new PhoneNumberIdentifier("+1800SUPPORT"),        // target: the human queue's number
        new PhoneNumberIdentifier("+1YourACSNumber")));   // caller ID: your ACS number
```

AddParticipantAsync has ACS dial the queue number and join it to the existing call, so your service does no telephony signaling itself; the AI can stay in the call as a transcriber or drop with HangUpAsync.
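The two numbers in the snippet above are placeholders. ACS phone identifiers are expected in E.164 form, so it can be worth validating configured numbers before attempting a transfer. A minimal check, with `PhoneNumbers.IsE164` as a hypothetical helper name:

```csharp
using System.Text.RegularExpressions;

static class PhoneNumbers
{
    // E.164: "+", a non-zero leading digit, then 1 to 14 more digits (15 digits max).
    private static readonly Regex E164 =
        new Regex(@"^\+[1-9][0-9]{1,14}\z", RegexOptions.CultureInvariant);

    public static bool IsE164(string? number) =>
        number is not null && E164.IsMatch(number);
}
```

Running this at startup against the queue and caller ID settings surfaces a misconfigured number before a caller asks for a human.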

Step 6 — Recording + Contact Lens-equivalent

Start call recording for compliance (in the .NET SDK this goes through GetCallRecording().StartAsync). ACS recordings drop into your storage account; pipe them through Azure AI Speech batch transcription and Foundry sentiment analysis for post-call analytics.

Step 7 — Ship to Container Apps with managed identity

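Same as the previous post: az containerapp create with --user-assigned for the managed identity (AAD), --ingress external, and scaling to 50 replicas. No sticky-session setup is needed for the media stream, because an established WebSocket stays on the replica that accepted it for the life of the connection.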

Pitfalls

  • EventGrid validation handshake: must respond to SubscriptionValidationEvent with the validation code or the topic stays unverified.
  • MediaStreamingOptions requires EnableBidirectional=true for the new GA bidirectional path; otherwise you get the legacy one-way stream, with audio flowing out of the call only.
  • Audio format: ACS bidirectional streaming carries mono PCM only (AudioFormat.Pcm24KMono in this post; the SDK also defines Pcm16KMono). If your downstream is 8kHz mu-law, transcode on the fly.
  • Concurrent call quota starts at 100 — request increases via support.
  • PSTN ingress costs ~$0.014/min in addition to Voice Live; cheaper than Twilio in most regions but watch billing.
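For the 8kHz mu-law case in the list above, a transcoding sketch follows. It assumes 16-bit little-endian PCM input at 24kHz, and it downsamples by averaging each group of three samples, which is a crude stand-in for a proper low-pass resampler. The encoder follows the usual G.711 mu-law bias-and-segment scheme. `MuLaw` and its method names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

static class MuLaw
{
    // Encodes one 16-bit linear sample as a G.711 mu-law byte.
    public static byte LinearToMuLaw(short sample)
    {
        const int Bias = 0x84;    // 132, added before segment search
        const int Clip = 32635;   // keeps s + Bias within 15 bits

        int s = sample;
        int sign = (s >> 8) & 0x80;
        if (sign != 0) s = -s;
        if (s > Clip) s = Clip;
        s += Bias;

        // The segment (exponent) is the position of the highest set bit, from 7 down to 0.
        int exponent = 7;
        for (int mask = 0x4000; (s & mask) == 0 && exponent > 0; mask >>= 1)
            exponent--;

        int mantissa = (s >> (exponent + 3)) & 0x0F;
        return (byte)(~(sign | (exponent << 4) | mantissa) & 0xFF);
    }

    // 24 kHz PCM16 mono -> 8 kHz mu-law, one output byte per three input samples.
    public static byte[] Pcm24kToMuLaw8k(ReadOnlySpan<byte> pcm24k)
    {
        int samples = pcm24k.Length / 2;
        var output = new byte[samples / 3];
        for (int i = 0; i < output.Length; i++)
        {
            int sum = 0;
            for (int j = 0; j < 3; j++)
                sum += BinaryPrimitives.ReadInt16LittleEndian(pcm24k.Slice((i * 3 + j) * 2, 2));
            output[i] = LinearToMuLaw((short)(sum / 3));
        }
        return output;
    }
}
```

A 960-byte input (20 ms at 24kHz under the assumptions above) produces 160 mu-law bytes, which is 20 ms at 8kHz. For production audio quality, swap the averaging step for a real resampling filter.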

How CallSphere does this in production

CallSphere uses ACS for select Microsoft-aligned enterprise tenants in Healthcare (HIPAA + BAA on AAD) but our default voice path is OpenAI Realtime over Twilio Media Streams via FastAPI :8084. We've measured ACS bidirectional median latency at ~750ms vs ~650ms for Twilio + OpenAI, but ACS wins on data residency for EU customers. 37 agents, 90+ tools, 115+ DB tables, 6 verticals. $149/$499/$1499, 14-day trial, 22% affiliate.

FAQ

Q: Can I use ACS without Voice Live? Yes — bridge to any STT+LLM+TTS stack. Voice Live just removes the integration tax.

Q: How do I get an EU number? ACS supports number purchase in 30+ countries via the portal; pick country during PurchasePhoneNumbers.

Q: Latency vs Twilio Media Streams? On East US 2 with Voice Live: ~750ms. With Twilio + OpenAI Realtime: ~650ms. ACS catches up in EU regions where Twilio adds a transatlantic hop.

Q: How do I do warm transfers? Use AddParticipantAsync then MuteParticipantAsync for the AI; the live agent picks up the same call leg.

Q: Can I record with redaction? Yes — pipe recordings through Azure AI Speech batch transcription with profanity=Masked + a Foundry redaction prompt for PII/PHI before storing.
