
Twilio TwiML Stream Deep Dive: Bidirectional Media for AI Voice in 2026

Twilio's <Connect><Stream> verb is the load-bearing primitive behind 80%+ of production AI voice in 2026: Mark and Clear events for barge-in, mulaw 8 kHz as the base format, and a hard one-stream-per-call limit. Here is how to build on it.

Twilio Media Streams started life in 2019 as a one-way stream-out feature. Bidirectional support went GA in 2023, and as of 2026 it is the substrate underneath ConversationRelay and probably 80% of Twilio-fronted AI voice products. The format is simple, the constraints are real, and once you understand Mark and Clear events, barge-in becomes a one-line change.

Background

Twilio Programmable Voice lets you control calls with TwiML, an XML markup with verbs like <Say>, <Play>, <Gather>, and <Dial>. The <Stream> noun inside <Connect> opens a WebSocket from Twilio to your server. Audio flows in both directions: media events carry base64-encoded mulaw 8 kHz 8-bit payloads (160 bytes per 20 ms frame), and your server can send the same format back to be played to the caller.

<Start><Stream> is the older one-way variant; <Connect><Stream> is bidirectional and blocks subsequent TwiML until the WebSocket disconnects. The bidirectional version added Mark and Clear events: Mark lets you tag a position in your sent audio buffer and get a confirmation when Twilio plays past it; Clear empties Twilio's outbound buffer for instant interruption when the caller starts speaking.


The 8 kHz mulaw default is the friction point. OpenAI Realtime accepts G.711 mulaw directly, so for many builders Twilio's native format is fine end-to-end. For better quality, transcode upstream to 16 kHz L16 or Opus.
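As a sketch of that transcode step, here is a hand-rolled G.711 mu-law decoder plus naive 2x linear-interpolation upsampling. Function names are ours, and a production bridge would use a vetted codec library rather than this loop:

```python
def mulaw_to_pcm16(mu_bytes: bytes) -> list[int]:
    """Decode 8-bit G.711 mu-law samples to signed 16-bit linear PCM."""
    samples = []
    for b in mu_bytes:
        b = ~b & 0xFF                      # mu-law bytes are stored inverted
        sign = b & 0x80
        exponent = (b >> 4) & 0x07
        mantissa = b & 0x0F
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        samples.append(-magnitude if sign else magnitude)
    return samples

def upsample_2x(samples: list[int]) -> list[int]:
    """Naive 8 kHz -> 16 kHz: insert the midpoint between adjacent samples."""
    out = []
    for a, b in zip(samples, samples[1:] + samples[-1:]):
        out.append(a)
        out.append((a + b) // 2)
    return out
```

Each 160-byte Twilio frame decodes to 160 PCM samples and upsamples to the 320 samples of 16 kHz L16 that a wideband model expects per 20 ms.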

Architecture

The media path, end to end:

graph LR
    A[Caller PSTN] --> B[Twilio Voice]
    B -->|mulaw 8k 20ms frames| C[Your WebSocket Server]
    C -->|JSON media events| D[Audio Decoder]
    D -->|L16 16k| E[OpenAI Realtime]
    E -->|Opus or PCM back| F[Audio Encoder]
    F -->|mulaw 8k frames| C
    C -->|JSON media + mark + clear| B
    B --> A
The TwiML that opens the stream:

<Response>
  <Connect>
    <Stream url="wss://bridge.callsphere.ai/realtime"
            track="inbound_track"
            statusCallback="https://callsphere.ai/api/twilio/stream-status">
      <Parameter name="tenant" value="abc123"/>
      <Parameter name="agent" value="healthcare-intake"/>
    </Stream>
  </Connect>
</Response>
// Outbound media event from your server to Twilio (base64 mulaw)
{"event":"media","streamSid":"MZxx","media":{"payload":"PT4+Pj4..."}}
// Mark to track playback position
{"event":"mark","streamSid":"MZxx","mark":{"name":"utterance-42-end"}}
// Clear to interrupt currently buffered audio (barge-in)
{"event":"clear","streamSid":"MZxx"}

CallSphere implementation

CallSphere uses TwiML <Connect><Stream> as the load-bearing primitive across every product. Healthcare AI calls land on a FastAPI service on port 8084 that proxies the bidirectional stream into OpenAI Realtime over WebSocket; we send Clear events the moment OpenAI's input_audio_buffer.speech_started fires, which gives sub-200 ms barge-in. Sales Calling AI fires up to 5 concurrent outbound calls per tenant, each on its own <Connect><Stream>. After-Hours AI uses a different pattern: a <Dial> with simultaneous call + SMS for 120 seconds. Real Estate AI, Salon AI, and IT Helpdesk AI all share the same wiring with per-vertical agent prompts. The stack spans 37 agents, 90+ tools, and 115+ DB tables, with HIPAA and SOC 2 attestations, $149/$499/$1499 plans, a 14-day trial, and a 22% affiliate program.

Build steps

  1. Allocate a TwiML endpoint that returns the response with your WebSocket URL.
  2. Build the WebSocket handler: accept connection, parse start event for streamSid and parameters, then loop on media events.
  3. Decode mulaw 8 kHz to L16 16 kHz before sending to OpenAI Realtime; Twilio frames are 160 bytes of mulaw = 20 ms = 160 samples after expansion, upsample to 320 samples L16.
  4. Encode model output back to mulaw 8 kHz; chunk into 20 ms frames; send as media events with the streamSid.
  5. Send Mark events at sentence boundaries; OpenAI sends response.audio.delta events that you align with marks.
  6. On speech_started, send Clear event immediately to flush Twilio's outbound buffer for natural interruption.
  7. Monitor statusCallback for stream-failed and stream-stopped to clean up server-side state.
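Step 4 is mostly framing bookkeeping. A minimal sketch, with the frame constant from the spec above and a function name of our choosing:

```python
import base64
import json

FRAME_BYTES = 160  # 20 ms of 8 kHz mu-law, 1 byte per sample

def to_media_events(mulaw_audio: bytes, stream_sid: str) -> list[str]:
    """Chunk encoded mu-law audio into 20 ms frames, each wrapped as a
    Twilio `media` message ready to send on the stream WebSocket."""
    events = []
    for i in range(0, len(mulaw_audio), FRAME_BYTES):
        frame = mulaw_audio[i:i + FRAME_BYTES]
        events.append(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        }))
    return events
```

Note the payload is base64 of raw mu-law bytes, with no WAV header and no length prefix; Twilio infers timing from the 160-byte frame size.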

Pitfalls

  • One <Stream> per call. You cannot fork to two AI services; demux server-side.
  • DTMF is inbound only (caller-to-server). You cannot send DTMF outbound from the server through <Stream>.
  • Mulaw payload base64-encoded in JSON; if you forget to base64-decode, you stream garbage and the model says "Hello, hello, are you there?" forever.
  • Clear events take ~50 ms to take effect; do not assume instant flush.
  • Bidirectional streams have a 30-second idle timeout; send keepalive media frames or expect disconnects.

FAQ

Should I use ConversationRelay instead of Streams for AI? ConversationRelay packages STT, LLM, and TTS into one TwiML verb: less control, faster build. <Connect><Stream> wins when you need custom STT/LLM/TTS, multi-modal, or non-OpenAI vendors.

What is the latency of a Twilio bidirectional Stream? 20-60 ms for the Twilio leg, plus your server hop, plus the model. End-to-end voice-to-voice 600-900 ms is typical with OpenAI Realtime.


Is mulaw lossy enough to hurt ASR? For Whisper and Deepgram on names and digits, yes; ~3-5% absolute WER hit vs G.722 wideband. Transcode upstream if your trunk supports it.

Can I record a Stream call? Yes via Twilio's separate recording API; the Stream itself does not store audio.

Mark vs Clear: when do I use which? Mark for tracking playback progress (used to align tool calls with what the user already heard). Clear for barge-in interruption.
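In practice that means queuing a named mark right after the last media frame of each utterance. A sketch; the helper name is ours:

```python
import json

def mark_message(stream_sid: str, name: str) -> str:
    """A `mark` queued behind already-sent audio; Twilio echoes a mark event
    back once playback passes this point, e.g. to gate a follow-up tool call."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })
```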


Start a 14-day trial on our Twilio-powered stack, see pricing for $149/$499/$1499, or book a demo to hear barge-in latency in production.
