---
title: "Twilio TwiML Stream Deep Dive: Bidirectional Media for AI Voice in 2026"
description: "Twilio's <Connect><Stream> verb is the load-bearing primitive behind 80%+ of production AI voice in 2026. Mark and Clear events for barge-in, mulaw 8 kHz one-way at base, and a hard 1-stream-per-call limit. Here is how to build on it."
canonical: https://callsphere.ai/blog/vw4d-twilio-twiml-stream-deep-dive-2026
category: "AI Voice Agents"
tags: ["Twilio", "TwiML", "Media Streams", "WebSocket", "AI Voice"]
author: "CallSphere Team"
published: 2026-03-29T00:00:00.000Z
updated: 2026-05-07T16:13:30.987Z
---

# Twilio TwiML Stream Deep Dive: Bidirectional Media for AI Voice in 2026

> Twilio's `<Connect><Stream>` verb is the load-bearing primitive behind 80%+ of production AI voice in 2026. Mark and Clear events for barge-in, mulaw 8 kHz one-way at base, and a hard 1-stream-per-call limit. Here is how to build on it.

> Twilio Media Streams started life in 2019 as a one-way stream-out feature. Bidirectional `<Connect><Stream>` went GA in 2023, and as of 2026 it is the substrate underneath ConversationRelay and probably 80% of Twilio-fronted AI voice products. The format is simple, the constraints are real, and once you understand Mark and Clear events, barge-in becomes a one-line change.

## Background

Twilio Programmable Voice lets you control calls with TwiML, an XML markup built from verbs and the nouns nested inside them. The `<Stream>` noun inside `<Connect>` opens a WebSocket from Twilio to your server. Audio flows in both directions: `media` events carry base64-encoded mulaw 8 kHz 8-bit payloads (160 bytes per 20 ms frame), and your server can send the same format back to be played to the caller.
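The frame size follows directly from the format: 8000 samples per second, 20 ms per frame, one byte per sample. A minimal sketch of decoding one inbound `media` event is below; the helper name `decode_media_event` and the sample message are illustrative assumptions, and real inbound events carry additional fields (such as sequence numbers) that this sketch ignores.

```python
import base64
import json

SAMPLE_RATE_HZ = 8000   # G.711 mulaw telephony rate
FRAME_MS = 20           # frame duration described above
BYTES_PER_SAMPLE = 1    # mulaw stores 8 bits per sample

# 8000 samples/s * 0.020 s * 1 byte/sample = 160 bytes per frame
EXPECTED_FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE


def decode_media_event(raw_message: str) -> bytes:
    """Return the raw mulaw bytes carried by one `media` event."""
    event = json.loads(raw_message)
    if event.get("event") != "media":
        raise ValueError(f"not a media event: {event.get('event')!r}")
    return base64.b64decode(event["media"]["payload"])


# Hypothetical inbound frame: 160 bytes of mulaw silence (0xFF), base64-encoded.
sample = json.dumps({
    "event": "media",
    "streamSid": "MZxx",
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
})
frame = decode_media_event(sample)
```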

`<Start><Stream>` is the older one-way variant; `<Connect><Stream>` is bidirectional and blocks subsequent TwiML until the WebSocket disconnects. The bidirectional version added Mark and Clear events: Mark lets you tag a position in your sent audio buffer and get a confirmation when Twilio plays past it; Clear empties Twilio's outbound buffer for instant interruption when the caller starts speaking.

The 8 kHz mulaw default is the friction point. OpenAI Realtime accepts G.711 directly, so for many builders Twilio's native format is fine end-to-end. For better quality you transcode upstream to 16 kHz L16 or Opus.
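For readers who do want linear PCM, the sketch below expands one mulaw frame to 16-bit samples and doubles the rate to 16 kHz. It is written in pure Python because the standard-library `audioop` module was removed in Python 3.13. The function names are assumptions made for illustration, and the linear interpolation here is deliberately naive; a production pipeline would more likely use a proper resampling filter.

```python
import struct


def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mulaw byte to a signed 16-bit linear sample."""
    byte = ~byte & 0xFF                 # mulaw bytes are stored inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Precompute all 256 values so per-frame decoding is a table lookup.
MULAW_TABLE = [mulaw_to_linear(b) for b in range(256)]


def mulaw_frame_to_l16_16k(frame: bytes) -> bytes:
    """Decode 8 kHz mulaw and upsample 2x by naive linear interpolation."""
    samples = [MULAW_TABLE[b] for b in frame]
    upsampled = []
    for i, current in enumerate(samples):
        following = samples[i + 1] if i + 1 < len(samples) else current
        upsampled.append(current)
        upsampled.append((current + following) // 2)  # midpoint sample
    # Little-endian signed 16-bit PCM
    return struct.pack(f"<{len(upsampled)}h", *upsampled)


pcm = mulaw_frame_to_l16_16k(b"\xff" * 160)  # one 20 ms frame of silence
```

One 160-byte frame becomes 320 samples, or 640 bytes of L16.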

## Architecture

```mermaid
graph LR
    A[Caller PSTN] --> B[Twilio Voice]
    B -->|mulaw 8k 20ms frames| C[Your WebSocket Server]
    C -->|JSON media events| D[Audio Decoder]
    D -->|L16 16k| E[OpenAI Realtime]
    E -->|Opus or PCM back| F[Audio Encoder]
    F -->|mulaw 8k frames| C
    C -->|JSON media + mark + clear| B
    B --> A
```

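```xml
<Response>
  <Connect>
    <!-- Placeholder URL: point this at your own WebSocket server -->
    <Stream url="wss://example.com/media-stream" />
  </Connect>
</Response>
```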
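A `<Connect><Stream>` response can also be generated in code rather than served as a static document. The sketch below uses only the Python standard library; `build_stream_twiml`, the example URL, and the `tenant` parameter are hypothetical names chosen for illustration. The nested `<Parameter>` elements are how custom values are passed along to the stream's `start` event.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_stream_twiml(ws_url: str, params: Optional[Dict[str, str]] = None) -> str:
    """Build a TwiML document that connects the call to a bidirectional stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", {"url": ws_url})
    # Each custom parameter becomes a <Parameter name="..." value="..."/> child.
    for name, value in (params or {}).items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    return ET.tostring(response, encoding="unicode")


# Hypothetical URL and parameter, for illustration only.
twiml = build_stream_twiml("wss://example.com/media-stream", {"tenant": "demo"})
```

Return the resulting string from your voice webhook with an XML content type.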

```json
// Outbound media event from your server to Twilio (base64 mulaw)
{"event":"media","streamSid":"MZxx","media":{"payload":"PT4+Pj4..."}}
// Mark to track playback position
{"event":"mark","streamSid":"MZxx","mark":{"name":"utterance-42-end"}}
// Clear to interrupt currently buffered audio (barge-in)
{"event":"clear","streamSid":"MZxx"}
```
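These three messages are easy to wrap in small builder functions so the base64 step is never forgotten. This is a sketch; the function names are assumptions, not part of any Twilio SDK.

```python
import base64
import json


def media_message(stream_sid: str, mulaw_frame: bytes) -> str:
    """Wrap raw mulaw bytes as an outbound `media` event."""
    payload = base64.b64encode(mulaw_frame).decode("ascii")
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": payload},
    })


def mark_message(stream_sid: str, name: str) -> str:
    """Tag the current position in the outbound audio buffer."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })


def clear_message(stream_sid: str) -> str:
    """Ask Twilio to drop any audio it has buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

Each function returns a JSON string ready to be written to the WebSocket as a text frame.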

## CallSphere implementation

CallSphere uses TwiML `<Connect><Stream>` as the load-bearing primitive across every product. Healthcare AI calls land on a FastAPI service on port 8084 that proxies the bidirectional stream into OpenAI Realtime over WebSocket; we send Clear events the moment OpenAI's `input_audio_buffer.speech_started` fires, which gives sub-200 ms barge-in. Sales Calling AI fires up to 5 concurrent outbound calls per tenant, each on its own `<Stream>`. After-Hours AI uses a different pattern: a separate TwiML flow with simultaneous call + SMS for 120 seconds. Real Estate AI, Salon AI, and IT Helpdesk AI all share the same `<Stream>` wiring with per-vertical agent prompts. The platform spans 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 attestations, $149/$499/$1499 plans, a 14-day trial, and a 22% affiliate program.

## Build steps

1. Allocate a TwiML endpoint that returns the `<Connect><Stream>` response with your WebSocket URL.
2. Build the WebSocket handler: accept connection, parse start event for streamSid and parameters, then loop on media events.
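3. If you are not passing G.711 through unchanged, decode mulaw 8 kHz to L16 and resample to the rate the model expects before sending it to OpenAI Realtime. Each Twilio frame is 160 bytes of mulaw = 20 ms = 160 samples after expansion; upsampling to 16 kHz gives 320 samples of L16, and upsampling to 24 kHz (the rate OpenAI Realtime's `pcm16` format expects) gives 480.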
4. Encode model output back to mulaw 8 kHz; chunk into 20 ms frames; send as media events with the streamSid.
5. Send Mark events at sentence boundaries; OpenAI sends response.audio.delta events that you align with marks.
6. On speech_started, send Clear event immediately to flush Twilio's outbound buffer for natural interruption.
7. Monitor `statusCallback` for `stream-error` and `stream-stopped` to clean up server-side state.
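Steps 2, 4, and 6 can be sketched as a transport-agnostic session object that consumes Twilio's JSON messages and produces the JSON to send back. The class `StreamSession` and its method names are hypothetical, the model side is left out entirely, and padding the final frame with mulaw silence (`0xFF`) is a design choice of this sketch rather than a Twilio requirement.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional

FRAME_BYTES = 160        # 20 ms of 8 kHz mulaw
MULAW_SILENCE = b"\xff"  # one sample of mulaw silence


@dataclass
class StreamSession:
    """Per-call state for one Twilio media stream (hypothetical helper)."""
    stream_sid: Optional[str] = None
    parameters: Dict[str, str] = field(default_factory=dict)
    inbound_frames: List[bytes] = field(default_factory=list)
    closed: bool = False

    def handle(self, raw: str) -> None:
        """Step 2: parse one message received from Twilio."""
        msg = json.loads(raw)
        kind = msg.get("event")
        if kind == "start":
            self.stream_sid = msg["start"]["streamSid"]
            self.parameters = dict(msg["start"].get("customParameters", {}))
        elif kind == "media":
            self.inbound_frames.append(base64.b64decode(msg["media"]["payload"]))
        elif kind == "stop":
            self.closed = True
        # Other events (connected, mark, dtmf) are ignored in this sketch.

    def outbound_media(self, mulaw_audio: bytes) -> List[str]:
        """Step 4: chunk model audio into 20 ms frames wrapped as media events."""
        messages = []
        for i in range(0, len(mulaw_audio), FRAME_BYTES):
            # Pad a short final chunk with silence so every frame is 20 ms.
            frame = mulaw_audio[i:i + FRAME_BYTES].ljust(FRAME_BYTES, MULAW_SILENCE)
            messages.append(json.dumps({
                "event": "media",
                "streamSid": self.stream_sid,
                "media": {"payload": base64.b64encode(frame).decode("ascii")},
            }))
        return messages

    def on_speech_started(self) -> str:
        """Step 6: the caller started talking, so flush Twilio's buffer."""
        return json.dumps({"event": "clear", "streamSid": self.stream_sid})
```

In a real handler, `handle` would be called for every text frame read from the WebSocket, and the strings returned by `outbound_media` and `on_speech_started` would be written back to the same socket.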

## Pitfalls

- One bidirectional `<Stream>` per call. You cannot fork to two AI services; you must demux server-side.
- DTMF is inbound only (caller-to-server). You cannot send DTMF outbound from the server through `<Stream>`.
- Mulaw payload base64-encoded in JSON; if you forget to base64-decode, you stream garbage and the model says "Hello, hello, are you there?" forever.
- Clear events take ~50 ms to take effect; do not assume instant flush.
- Bidirectional streams have a 30-second idle timeout; send keepalive media frames or expect disconnects.
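The keepalive in the last pitfall can be as simple as one frame of mulaw silence sent before the idle window closes. The sketch below assumes the 30-second timeout described above; the 10-second safety margin and the function names are hypothetical choices to tune for your own deployment.

```python
import base64
import json

IDLE_TIMEOUT_S = 30.0          # idle timeout as described in the pitfall above
SAFETY_MARGIN_S = 10.0         # hypothetical margin; tune for your deployment
SILENCE_FRAME = b"\xff" * 160  # 20 ms of mulaw silence


def keepalive_due(last_sent_at: float, now: float) -> bool:
    """True once the stream has been quiet long enough to risk a timeout."""
    return now - last_sent_at >= IDLE_TIMEOUT_S - SAFETY_MARGIN_S


def silence_message(stream_sid: str) -> str:
    """One silent media frame, usable as a keepalive."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(SILENCE_FRAME).decode("ascii")},
    })
```

Feed `keepalive_due` timestamps from `time.monotonic()` so that wall-clock adjustments do not trigger or suppress a keepalive.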

## FAQ

**Should I use ConversationRelay instead of Streams for AI?**
ConversationRelay packages STT, LLM, and TTS into one TwiML verb: less control, faster build. `<Connect><Stream>` wins when you need custom STT/LLM/TTS, multi-modal, or non-OpenAI vendors.

**What is the latency of a Twilio bidirectional Stream?**
20-60 ms for the Twilio leg, plus your server hop, plus the model. End-to-end voice-to-voice 600-900 ms is typical with OpenAI Realtime.

**Is mulaw lossy enough to hurt ASR?**
For Whisper and Deepgram on names and digits, yes; ~3-5% absolute WER hit vs G.722 wideband. Transcode upstream if your trunk supports it.

**Can I record a Stream call?**
Yes via Twilio's separate recording API; the Stream itself does not store audio.

**Mark vs Clear: when do I use which?**
Mark for tracking playback progress (used to align tool calls with what the user already heard). Clear for barge-in interruption.
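The two events work together: marks tell you what the caller has heard, and a Clear tells you what they never will. A minimal sketch of that bookkeeping follows. `MarkTracker` is a hypothetical helper, and it assumes, based on our reading of Twilio's WebSocket Messages reference, that marks still pending at the time of a Clear may be echoed back afterwards; the tracker treats those as unheard.

```python
from collections import deque
from typing import Deque, List


class MarkTracker:
    """Track which marked utterances were played versus interrupted."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self.played: List[str] = []
        self.interrupted: List[str] = []

    def sent(self, name: str) -> None:
        """Record that a mark with this name was sent to Twilio."""
        self._pending.append(name)

    def on_mark_event(self, name: str) -> None:
        """Twilio echoed a mark: playback reached that position."""
        if name in self._pending:
            self._pending.remove(name)
            self.played.append(name)
        # Marks echoed after a Clear are no longer pending and are ignored.

    def on_clear_sent(self) -> None:
        """A Clear was sent: everything still pending was never heard."""
        self.interrupted.extend(self._pending)
        self._pending.clear()


tracker = MarkTracker()
for name in ("utterance-1-end", "utterance-2-end", "utterance-3-end"):
    tracker.sent(name)
tracker.on_mark_event("utterance-1-end")  # caller heard the first sentence
tracker.on_clear_sent()                   # barge-in flushes the rest
tracker.on_mark_event("utterance-2-end")  # late echo after Clear is ignored
```

After a barge-in, `tracker.interrupted` lists the utterances to treat as unsaid when building the model's next turn.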

## Sources

- [Twilio Media Streams Overview](https://www.twilio.com/docs/voice/media-streams)
- [Twilio TwiML Stream verb reference](https://www.twilio.com/docs/voice/twiml/stream)
- [Bi-directional Streaming changelog](https://www.twilio.com/en-us/changelog/bi-directional-streaming-support-with-media-streams)
- [WebSocket Messages reference](https://www.twilio.com/docs/voice/media-streams/websocket-messages)

Start a [14-day trial](/trial) on our Twilio-powered stack, see [pricing](/pricing) for $149/$499/$1499, or [book a demo](/demo) to hear barge-in latency in production.

