By Sagar Shankaran, Founder of CallSphere
Twilio's <Connect><Stream> verb is the load-bearing primitive behind 80%+ of production AI voice in 2026. Mark and Clear events for barge-in, mulaw 8 kHz one-way at base, and a hard 1-stream-per-call limit. Here is how to build on it.
Key takeaways
Twilio Media Streams started life in 2019 as a one-way stream-out feature. Bidirectional streaming went GA in 2023, and as of 2026 it is the substrate underneath ConversationRelay and probably 80% of Twilio-fronted AI voice products. The format is simple, the constraints are real, and once you understand Mark and Clear events, barge-in becomes a one-line change.
Twilio Programmable Voice lets you control calls with TwiML, an XML markup with verbs like
<Start><Stream> is the older one-way variant; <Connect><Stream> is bidirectional and blocks subsequent TwiML until the WebSocket disconnects. The bidirectional version added Mark and Clear events: Mark lets you tag a position in your sent audio buffer and get a confirmation when Twilio plays past it; Clear empties Twilio's outbound buffer for instant interruption when the caller starts speaking.
The 8 kHz mulaw default is the friction point. OpenAI Realtime accepts G.711 directly, so for many builders Twilio's native format is fine end-to-end. For better quality you transcode upstream to 16 kHz L16 or Opus.
graph LR
A[Caller PSTN] --> B[Twilio Voice]
B -->|mulaw 8k 20ms frames| C[Your WebSocket Server]
C -->|JSON media events| D[Audio Decoder]
D -->|L16 16k| E[OpenAI Realtime]
E -->|Opus or PCM back| F[Audio Encoder]
F -->|mulaw 8k frames| C
C -->|JSON media + mark + clear| B
B --> A
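The Audio Decoder step in the pipeline above can be sketched in pure Python. This is a minimal illustration, not CallSphere's implementation: it expands G.711 mu-law bytes with the standard bias-and-shift formula and doubles the sample rate by linear interpolation. The function names are hypothetical, the little-endian output is an assumption about what the downstream model expects, and a production bridge would normally use a proper low-pass resampler. Upsampling changes the format only; it does not restore detail lost on the 8 kHz leg.

```python
import base64
import struct

MULAW_BIAS = 0x84  # standard G.711 mu-law bias (132)


def mulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit linear sample."""
    byte = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if sign else magnitude


# Precompute all 256 values once so per-frame decoding is a table lookup.
MULAW_TABLE = [mulaw_byte_to_pcm16(b) for b in range(256)]


def upsample_2x(samples: list[int]) -> list[int]:
    """Naive 8 kHz -> 16 kHz: keep each sample, then add the midpoint to the next."""
    out: list[int] = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def media_payload_to_pcm16_16k(payload_b64: str) -> bytes:
    """Turn a base64 mu-law media payload into little-endian 16-bit PCM at 16 kHz."""
    mulaw = base64.b64decode(payload_b64)
    pcm_8k = [MULAW_TABLE[b] for b in mulaw]
    pcm_16k = upsample_2x(pcm_8k)
    return struct.pack(f"<{len(pcm_16k)}h", *pcm_16k)
```

A 20 ms inbound frame is 160 mu-law bytes, so each call returns 320 samples, or 640 bytes, of 16 kHz PCM.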
<Response>
  <Connect>
    <Stream url="wss://bridge.callsphere.ai/realtime"
            track="inbound_track"
            statusCallback="https://callsphere.ai/api/twilio/stream-status">
      <Parameter name="tenant" value="abc123"/>
      <Parameter name="agent" value="healthcare-intake"/>
    </Stream>
  </Connect>
</Response>
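On the server side, the Parameter values in the TwiML above arrive in the first "start" message on the WebSocket, under customParameters. The sketch below shows one way to pull them out, assuming the message layout described in Twilio's Media Streams documentation. StreamSession and parse_start_event are hypothetical names introduced for illustration, and the "CAxx" call SID is a placeholder.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StreamSession:
    """Per-call state captured from the stream's start message (hypothetical type)."""
    stream_sid: str
    call_sid: str
    encoding: str
    sample_rate: int
    params: dict = field(default_factory=dict)


def parse_start_event(raw: str) -> Optional[StreamSession]:
    """Return a StreamSession for a start message, or None for any other event."""
    msg = json.loads(raw)
    if msg.get("event") != "start":
        return None
    start = msg["start"]
    fmt = start.get("mediaFormat", {})
    return StreamSession(
        stream_sid=start["streamSid"],
        call_sid=start["callSid"],
        # Fall back to the mulaw 8 kHz format if mediaFormat is absent.
        encoding=fmt.get("encoding", "audio/x-mulaw"),
        sample_rate=int(fmt.get("sampleRate", 8000)),
        params=dict(start.get("customParameters", {})),
    )


# Example start message using the parameter names from the TwiML above.
example = json.dumps({
    "event": "start",
    "streamSid": "MZxx",
    "start": {
        "streamSid": "MZxx",
        "callSid": "CAxx",
        "tracks": ["inbound"],
        "customParameters": {"tenant": "abc123", "agent": "healthcare-intake"},
        "mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1},
    },
})
session = parse_start_event(example)
```

Keeping the stream SID on the session matters because every outbound media, mark, and clear message has to carry it.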
// Outbound media event from your server to Twilio (base64 mulaw)
{"event":"media","streamSid":"MZxx","media":{"payload":"PT4+Pj4..."}}
// Mark to track playback position
{"event":"mark","streamSid":"MZxx","mark":{"name":"utterance-42-end"}}
// Clear to interrupt currently buffered audio (barge-in)
{"event":"clear","streamSid":"MZxx"}
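The three message shapes above can be produced by a few small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, and the 160-byte frame size simply mirrors the 20 ms framing in the diagram rather than a requirement of the protocol. The input is raw mu-law bytes with no file header.

```python
import base64
import json

FRAME_BYTES = 160  # 20 ms of 8 kHz mulaw at one byte per sample


def media_messages(stream_sid: str, mulaw_audio: bytes) -> list[str]:
    """Split raw mulaw audio into frames and wrap each one in a media event."""
    messages = []
    for offset in range(0, len(mulaw_audio), FRAME_BYTES):
        frame = mulaw_audio[offset:offset + FRAME_BYTES]
        messages.append(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        }))
    return messages


def mark_message(stream_sid: str, name: str) -> str:
    """Build a mark event that tags the end of the audio sent so far."""
    return json.dumps({"event": "mark", "streamSid": stream_sid, "mark": {"name": name}})


def clear_message(stream_sid: str) -> str:
    """Build a clear event that asks Twilio to drop buffered outbound audio."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})


def utterance_messages(stream_sid: str, mulaw_audio: bytes, mark_name: str) -> list[str]:
    """Media frames for one utterance, followed by a mark at its end."""
    return media_messages(stream_sid, mulaw_audio) + [mark_message(stream_sid, mark_name)]
```

Each returned string is sent as one WebSocket text message. On barge-in, the server stops sending further frames and sends the result of clear_message.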
CallSphere uses TwiML
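Should I use ConversationRelay instead of Streams for AI? ConversationRelay packages STT and TTS behind a single TwiML element and exchanges text with your server over a WebSocket, so you still bring the LLM. You get less control over the audio path and a faster build.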
What is the latency of a Twilio bidirectional Stream? 20-60 ms for the Twilio leg, plus your server hop, plus the model. End-to-end voice-to-voice 600-900 ms is typical with OpenAI Realtime.
Is mulaw lossy enough to hurt ASR? For Whisper and Deepgram on names and digits, yes; ~3-5% absolute WER hit vs G.722 wideband. Transcode upstream if your trunk supports it.
Can I record a Stream call? Yes, via Twilio's separate recording API; the Stream itself does not store audio.
Mark vs Clear: when do I use which? Mark for tracking playback progress (used to align tool calls with what the user already heard). Clear for barge-in interruption.
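The split between the two events can be made concrete with a small tracker that records which utterances the caller actually heard. It is a sketch, not a definitive implementation. It assumes the behavior described in Twilio's documentation: a mark is echoed back once playback passes it, and after a clear Twilio sends back any marks still in the buffer. PlaybackTracker and its method names are hypothetical.

```python
import json


class PlaybackTracker:
    """Tracks which sent utterances were played before a barge-in (hypothetical helper)."""

    def __init__(self, stream_sid: str) -> None:
        self.stream_sid = stream_sid
        self.pending: dict[str, str] = {}  # mark name -> utterance text, in send order
        self.flushed: set[str] = set()     # marks discarded by a clear
        self.heard: list[str] = []         # utterances confirmed as played

    def sent(self, mark_name: str, text: str) -> None:
        """Record that an utterance and its closing mark were queued to Twilio."""
        self.pending[mark_name] = text

    def barge_in(self) -> str:
        """Caller started speaking: forget unplayed utterances and return a clear event."""
        self.flushed.update(self.pending)
        self.pending.clear()
        return json.dumps({"event": "clear", "streamSid": self.stream_sid})

    def on_message(self, raw: str) -> None:
        """Handle an inbound WebSocket message; only mark events are of interest here."""
        msg = json.loads(raw)
        if msg.get("event") != "mark":
            return
        name = msg["mark"]["name"]
        if name in self.flushed:
            # Echoed because of the clear, not because the audio was played.
            self.flushed.discard(name)
            return
        text = self.pending.pop(name, None)
        if text is not None:
            self.heard.append(text)
```

After a barge-in, the heard list is what you would feed back into the model's context, so that tool calls and follow-up turns line up with what the caller actually heard rather than with everything the model generated.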
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.