Twilio Voice <Stream> Bidirectional Patterns for AI Agents (2026)
Bidirectional Media Streams ship raw mulaw both directions over a single WebSocket. We break down the four patterns CallSphere ships in production: proxy-to-OpenAI, sidecar STT, conference fork, and replay-on-reconnect.
TL;DR — Bidirectional
<Stream> is the cleanest path from PSTN to a Realtime LLM. Send mulaw 8 kHz both ways, mark every chunk with a sequence number, and gate barge-in on the mark event — not on the audio buffer.
Background
Twilio's <Stream> verb opens a WebSocket from the call leg to your server. In unidirectional mode you only receive audio (good for transcription). In bidirectional mode (<Stream bidirectional="true">) you can also push base64-encoded mulaw frames back into the call. That second direction is what unlocks AI voice agents — you stream OpenAI Realtime / Deepgram Aura / ElevenLabs output straight onto the PSTN line without a second SIP leg.
Anatomy of a stream:
- start event — once per call; contains streamSid, callSid, accountSid, and any custom parameters.
- media events — 20 ms mulaw frames, base64-encoded, ~50 per second per direction.
- mark events — your own labels. Twilio echoes them back when the corresponding outbound audio finishes playing. This is the only reliable barge-in signal.
- stop event — leg ended.
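The frame arithmetic behind those numbers is worth pinning down before sizing buffers or bandwidth. A quick sanity check in plain Node (no Twilio dependency):

```javascript
// 20 ms of 8 kHz mulaw audio = one byte per sample = 160 bytes per frame.
const SAMPLE_RATE = 8000;
const FRAME_MS = 20;

const bytesPerFrame = SAMPLE_RATE * (FRAME_MS / 1000); // 160
const framesPerSecond = 1000 / FRAME_MS;               // 50 — Twilio's cadence

// Base64 inflates each frame: 4 chars per 3 bytes, rounded up to a full group.
const b64Length = Buffer.alloc(bytesPerFrame).toString("base64").length; // 216

console.log({ bytesPerFrame, framesPerSecond, b64Length });
```

So each direction is roughly 50 × 216 ≈ 11 KB/s of base64 on the wire before WebSocket framing — small enough that the JSON envelope, not the audio, dominates overhead.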
Architecture / config
```mermaid
flowchart LR
PSTN[Caller / PSTN] --> TW[Twilio Voice]
TW -- TwiML <Stream bidirectional> --> WS[wss://yourapp/stream]
WS -- inbound mulaw --> STT[STT or Realtime API]
STT --> LLM[LLM turn]
LLM --> TTS[TTS or Realtime API]
TTS -- outbound mulaw --> WS
WS -- "mark" events --> BARGE[Barge-in detector]
BARGE -- "clear" --> WS
```
Four patterns we run in production:
- Proxy-to-Realtime — your WS server proxies frames straight into OpenAI Realtime over a second WS. ~120 ms median round trip.
- Sidecar STT + LLM + TTS — split STT (Deepgram), LLM (Anthropic / OpenAI Chat), TTS (ElevenLabs streaming). Higher latency (~450 ms) but per-stage observability.
- Conference fork — call goes into a Twilio <Conference>, you fork audio to your AI stream, and an AI participant is added back via a TwiML App. Useful for AI as a third party.
- Replay-on-reconnect — buffer the last 8 s of inbound + last 4 s of outbound in Redis; on a stop followed by a new start with the same callSid, replay so the LLM has continuity.
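The replay-on-reconnect buffer is easy to get wrong around eviction. A minimal in-memory sketch of the idea (the article's production version sits in Redis; `ReplayBuffer` and its method names are illustrative, not a library API):

```javascript
// Rolling per-direction frame buffer: 8 s inbound + 4 s outbound of 20 ms frames.
const FRAMES_PER_SECOND = 50;

class ReplayBuffer {
  constructor(inboundSecs = 8, outboundSecs = 4) {
    this.cap = {
      inbound: inboundSecs * FRAMES_PER_SECOND,   // 400 frames
      outbound: outboundSecs * FRAMES_PER_SECOND, // 200 frames
    };
    this.frames = { inbound: [], outbound: [] };
  }

  push(direction, b64Payload) {
    const q = this.frames[direction];
    q.push(b64Payload);
    if (q.length > this.cap[direction]) q.shift(); // evict the oldest frame
  }

  // On a new `start` carrying a known callSid, feed these back to the LLM.
  replay(direction) {
    return this.frames[direction].slice();
  }
}
```

Keying the buffer by callSid rather than streamSid is what makes the reconnect case work: Twilio issues a fresh streamSid on the new leg, but the callSid survives.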
CallSphere implementation
CallSphere runs Twilio across all six verticals. The Healthcare agent fronts a FastAPI service on port :8084 that proxies bidirectional audio into OpenAI Realtime; Sales runs five concurrent outbound calls per account with separate WS workers; the After-hours agent fires a simultaneous voice call + SMS in a 120-second race. Every leg flows through the same /twilio/stream Fastify route, with streamSid keyed into Postgres for replay.
Stack snapshot:
- 37 specialized agents · 90+ tools · 115+ DB tables · 6 verticals.
- HIPAA + SOC 2 — TLS to the WS, mulaw recording opt-in per tenant, BAA covers Twilio + OpenAI.
- $149 / $499 / $1499 plans · 14-day trial · 22% lifetime affiliate.
Build steps with code
```xml
<!-- TwiML returned from your /voice webhook -->
<Response>
  <Connect>
    <Stream url="wss://api.callsphere.ai/twilio/stream" bidirectional="true">
      <Parameter name="tenant_id" value="tnt_123"/>
      <Parameter name="agent" value="healthcare-intake"/>
    </Stream>
  </Connect>
</Response>
```
```javascript
// Fastify WS handler — frames inbound, mark-gated barge-in.
// `openai` stands in for your Realtime client wrapper.
import Fastify from "fastify";
import websocket from "@fastify/websocket";

const app = Fastify();
app.register(websocket);

app.get("/twilio/stream", { websocket: true }, (conn) => {
  let streamSid = "";
  conn.socket.on("message", async (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.event === "start") streamSid = evt.start.streamSid;             // capture once per leg
    if (evt.event === "media") openai.sendAudio(evt.media.payload);         // base64 mulaw in
    if (evt.event === "mark" && evt.mark.name === "tts-end") openai.flush(); // playback confirmed
  });
  // Outbound frames must carry the streamSid or Twilio drops them silently.
  openai.on("audio", (b64) => {
    conn.socket.send(JSON.stringify({ event: "media", streamSid, media: { payload: b64 } }));
    conn.socket.send(JSON.stringify({ event: "mark", streamSid, mark: { name: "tts-end" } }));
  });
});
```
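The BARGE → "clear" edge in the diagram maps to Twilio's clear message: when the caller starts talking over queued TTS, you tell Twilio to drop the un-played outbound buffer. A minimal sketch (the fake socket here is only for illustration):

```javascript
// Barge-in: caller speech detected while synthesized audio is still queued →
// send Twilio's "clear" message so the un-played outbound buffer is flushed.
function bargeIn(socket, streamSid) {
  socket.send(JSON.stringify({ event: "clear", streamSid }));
}

// Usage with any WebSocket-like object:
// bargeIn(conn.socket, streamSid);
```

After a clear, any marks for the flushed audio still echo back, so reset your pending-mark state at the same time.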
Pitfalls
- Forgetting bidirectional="true" — you'll silently get one-way audio and waste an afternoon.
- Not echoing streamSid in outbound media — Twilio drops the frame.
- Using a 16 kHz sample rate — <Stream> is mulaw 8 kHz only on PSTN; resample.
- Treating audio buffer length as barge-in — race condition. Trust mark events.
- Logging full base64 frames — explodes Datadog cost; log every 200th frame at most.
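The last pitfall is cheap to fix with a sampled logger. A sketch (`makeFrameLogger` is an illustrative helper, not a Twilio or Datadog API):

```javascript
// Sampled logging: one line per N media frames, metadata only — never the payload.
function makeFrameLogger(every = 200, log = console.log) {
  let count = 0;
  return (evt) => {
    if (evt.event !== "media") return; // start/mark/stop events are cheap to log in full
    count += 1;
    if (count % every === 0) {
      log(`media frame #${count}, payload ${evt.media.payload.length} b64 chars`);
    }
  };
}
```

Wire it into the same message handler that feeds your STT — one call per parsed event.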
FAQ
Q: How many bidirectional streams per Twilio account?
Default cap is 100 concurrent; raise via support ticket. We run 800 concurrent in production.
Q: Mulaw vs PCM?
PSTN is mulaw 8 kHz. Twilio <Stream> does not transcode for you — your TTS must output mulaw or you must resample server-side.
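Resampling server-side ends with a G.711 µ-law encode of each PCM16 sample. A reference encoder — this is the standard G.711 algorithm, independent of any Twilio SDK:

```javascript
// Encode one signed 16-bit PCM sample to a G.711 µ-law byte.
function pcm16ToMulaw(sample) {
  const BIAS = 0x84;   // standard G.711 bias (132)
  const CLIP = 32635;  // max magnitude before bias
  const sign = (sample >> 8) & 0x80;
  if (sign) sample = -sample;
  if (sample > CLIP) sample = CLIP;
  sample += BIAS;

  // Find the segment (exponent): position of the highest set bit above bit 7.
  let exponent = 7;
  let mask = 0x4000;
  while ((sample & mask) === 0 && exponent > 0) { exponent--; mask >>= 1; }

  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // µ-law bytes are complemented
}

console.log(pcm16ToMulaw(0).toString(16));      // "ff" — µ-law silence
console.log(pcm16ToMulaw(32767).toString(16));  // "80" — max positive
console.log(pcm16ToMulaw(-32768).toString(16)); // "0"  — max negative
```

If your TTS emits 16 kHz or 24 kHz PCM, decimate to 8 kHz first (ideally with a low-pass filter), then run each sample through this before base64-encoding the frame.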
Q: Can I record while streaming?
Yes — <Start><Stream/></Start> plus standard <Record> works. Recordings are stored separately.
Q: How do I detect dropped streams?
Watch for stop events without prior mark echoes within 5 s. Reconnect with replay buffer.
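That 5-second rule reduces to a small pure check you can run on a timer. A sketch under the assumptions above (`isStreamStale` and the timestamp bookkeeping are illustrative):

```javascript
// Dropped-stream heuristic: a mark was sent but not echoed back within 5 s.
const MARK_TIMEOUT_MS = 5000;

function isStreamStale(lastMarkSentAt, lastMarkEchoAt, now = Date.now()) {
  if (lastMarkSentAt == null) return false;          // nothing outstanding
  if (lastMarkEchoAt != null && lastMarkEchoAt >= lastMarkSentAt) return false; // echoed
  return now - lastMarkSentAt > MARK_TIMEOUT_MS;     // overdue → treat as dropped
}
```

Record `lastMarkSentAt` when you emit a mark and `lastMarkEchoAt` when Twilio echoes it; when the check fires, tear down the leg and kick off the replay-buffer reconnect.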
Q: Latency floor?
~80 ms one-way Twilio→WS in us-east-1. Add LLM + TTS to estimate end-to-end.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.