Twilio Voice <Stream> Bidirectional Patterns for AI Agents (2026)
Bidirectional Media Streams ship raw mulaw both directions over a single WebSocket. We break down the four patterns CallSphere ships in production: proxy-to-OpenAI, sidecar STT, conference fork, and replay-on-reconnect.
TL;DR — Bidirectional
<Stream> is the cleanest path from PSTN to a Realtime LLM. Send mulaw 8 kHz both ways, mark every chunk with a sequence number, and gate barge-in on the mark event — not on the audio buffer.
Background
Twilio's <Stream> verb opens a WebSocket from the call leg to your server. In unidirectional mode you only receive audio (good for transcription). In bidirectional mode (<Stream bidirectional="true">) you can also push base64-encoded mulaw frames back into the call. That second direction is what unlocks AI voice agents — you stream OpenAI Realtime / Deepgram Aura / ElevenLabs output straight onto the PSTN line without a second SIP leg.
Anatomy of a stream:
- start event — once per call; contains streamSid, callSid, accountSid, and any custom parameters.
- media events — 20 ms mulaw frames, base64-encoded, ~50 per second per direction.
- mark events — your own labels. Twilio echoes them back when the corresponding outbound audio finishes playing. This is the only reliable barge-in signal.
- stop event — leg ended.
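The frame arithmetic behind those numbers is worth pinning down before sizing buffers or bandwidth. A quick sanity check in plain Node (no Twilio dependency):

```javascript
// 20 ms of 8 kHz mulaw audio = one byte per sample = 160 bytes per frame.
const SAMPLE_RATE = 8000;
const FRAME_MS = 20;

const bytesPerFrame = SAMPLE_RATE * (FRAME_MS / 1000); // 160
const framesPerSecond = 1000 / FRAME_MS;               // 50 — Twilio's cadence

// Base64 inflates each frame: 4 chars per 3 bytes, rounded up to a full group.
const b64Length = Buffer.alloc(bytesPerFrame).toString("base64").length; // 216

console.log({ bytesPerFrame, framesPerSecond, b64Length });
```

So each direction is roughly 50 × 216 ≈ 11 KB/s of base64 on the wire before WebSocket framing — small enough that the JSON envelope, not the audio, dominates overhead.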
Architecture / config
```mermaid
flowchart LR
PSTN[Caller / PSTN] --> TW[Twilio Voice]
TW -- TwiML <Stream bidirectional> --> WS[wss://yourapp/stream]
WS -- inbound mulaw --> STT[STT or Realtime API]
STT --> LLM[LLM turn]
LLM --> TTS[TTS or Realtime API]
TTS -- outbound mulaw --> WS
WS -- "mark" events --> BARGE[Barge-in detector]
BARGE -- "clear" --> WS
```
Four patterns we run in production:
- Proxy-to-Realtime — your WS server proxies frames straight into OpenAI Realtime over a second WS. ~120 ms median round trip.
- Sidecar STT + LLM + TTS — split STT (Deepgram), LLM (Anthropic / OpenAI Chat), TTS (ElevenLabs streaming). Higher latency (~450 ms) but per-stage observability.
- Conference fork — call goes into a Twilio <Conference>, you fork audio to your AI stream, and an AI participant is added back via a TwiML App. Useful for AI as a third party.
- Replay-on-reconnect — buffer the last 8 s of inbound + last 4 s of outbound in Redis; on a stop followed by a new start with the same callSid, replay so the LLM has continuity.
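The replay-on-reconnect buffer is easy to get wrong around eviction. A minimal in-memory sketch of the idea (the article's production version sits in Redis; `ReplayBuffer` and its method names are illustrative, not a library API):

```javascript
// Rolling per-direction frame buffer: 8 s inbound + 4 s outbound of 20 ms frames.
const FRAMES_PER_SECOND = 50;

class ReplayBuffer {
  constructor(inboundSecs = 8, outboundSecs = 4) {
    this.cap = {
      inbound: inboundSecs * FRAMES_PER_SECOND,   // 400 frames
      outbound: outboundSecs * FRAMES_PER_SECOND, // 200 frames
    };
    this.frames = { inbound: [], outbound: [] };
  }

  push(direction, b64Payload) {
    const q = this.frames[direction];
    q.push(b64Payload);
    if (q.length > this.cap[direction]) q.shift(); // evict the oldest frame
  }

  // On a new `start` carrying a known callSid, feed these back to the LLM.
  replay(direction) {
    return this.frames[direction].slice();
  }
}
```

Keying the buffer by callSid rather than streamSid is what makes the reconnect case work: Twilio issues a fresh streamSid on the new leg, but the callSid survives.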
CallSphere implementation
CallSphere runs Twilio across all six verticals. The Healthcare agent fronts a FastAPI service on port :8084 that proxies bidirectional audio into OpenAI Realtime; Sales runs five concurrent outbound calls per account with separate WS workers; the After-hours agent fires a simultaneous voice call + SMS in a 120-second race. Every leg flows through the same /twilio/stream Fastify route, with streamSid keyed into Postgres for replay.
Stack snapshot:
- 37 specialized agents · 90+ tools · 115+ DB tables · 6 verticals.
- HIPAA + SOC 2 — TLS to the WS, mulaw recording opt-in per tenant, BAA covers Twilio + OpenAI.
- $149 / $499 / $1499 plans · 14-day trial · 22% lifetime affiliate.
Build steps with code
```xml
<!-- TwiML returned from your /voice webhook -->
<Response>
  <Connect>
    <Stream url="wss://api.callsphere.ai/twilio/stream" bidirectional="true">
      <Parameter name="tenant_id" value="tnt_123"/>
      <Parameter name="agent" value="healthcare-intake"/>
    </Stream>
  </Connect>
</Response>
```
```javascript
// Fastify WS handler — frames inbound, mark-gated barge-in.
// `openai` stands in for your Realtime client wrapper.
import Fastify from "fastify";
import websocket from "@fastify/websocket";

const app = Fastify();
app.register(websocket);

app.get("/twilio/stream", { websocket: true }, (conn) => {
  let streamSid = "";
  conn.socket.on("message", async (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.event === "start") streamSid = evt.start.streamSid;             // capture once per leg
    if (evt.event === "media") openai.sendAudio(evt.media.payload);         // base64 mulaw in
    if (evt.event === "mark" && evt.mark.name === "tts-end") openai.flush(); // playback confirmed
  });
  // Outbound frames must carry the streamSid or Twilio drops them silently.
  openai.on("audio", (b64) => {
    conn.socket.send(JSON.stringify({ event: "media", streamSid, media: { payload: b64 } }));
    conn.socket.send(JSON.stringify({ event: "mark", streamSid, mark: { name: "tts-end" } }));
  });
});
```
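The BARGE → "clear" edge in the diagram maps to Twilio's clear message: when the caller starts talking over queued TTS, you tell Twilio to drop the un-played outbound buffer. A minimal sketch (the fake socket here is only for illustration):

```javascript
// Barge-in: caller speech detected while synthesized audio is still queued →
// send Twilio's "clear" message so the un-played outbound buffer is flushed.
function bargeIn(socket, streamSid) {
  socket.send(JSON.stringify({ event: "clear", streamSid }));
}

// Usage with any WebSocket-like object:
// bargeIn(conn.socket, streamSid);
```

After a clear, any marks for the flushed audio still echo back, so reset your pending-mark state at the same time.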
Pitfalls
- Forgetting bidirectional="true" — you'll silently get one-way audio and waste an afternoon.
- Not echoing streamSid in outbound media — Twilio drops the frame.
- Using a 16 kHz sample rate — <Stream> is mulaw 8 kHz only on PSTN; resample.
- Treating audio buffer length as barge-in — race condition. Trust mark events.
- Logging full base64 frames — explodes Datadog cost; log every 200th frame at most.
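The last pitfall is cheap to fix with a sampled logger. A sketch (`makeFrameLogger` is an illustrative helper, not a Twilio or Datadog API):

```javascript
// Sampled logging: one line per N media frames, metadata only — never the payload.
function makeFrameLogger(every = 200, log = console.log) {
  let count = 0;
  return (evt) => {
    if (evt.event !== "media") return; // start/mark/stop events are cheap to log in full
    count += 1;
    if (count % every === 0) {
      log(`media frame #${count}, payload ${evt.media.payload.length} b64 chars`);
    }
  };
}
```

Wire it into the same message handler that feeds your STT — one call per parsed event.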
FAQ
Q: How many bidirectional streams per Twilio account?
Default cap is 100 concurrent; raise via support ticket. We run 800 concurrent in production.
Q: Mulaw vs PCM?
PSTN is mulaw 8 kHz. Twilio <Stream> does not transcode for you — your TTS must output mulaw or you must resample server-side.
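Resampling server-side ends with a G.711 µ-law encode of each PCM16 sample. A reference encoder — this is the standard G.711 algorithm, independent of any Twilio SDK:

```javascript
// Encode one signed 16-bit PCM sample to a G.711 µ-law byte.
function pcm16ToMulaw(sample) {
  const BIAS = 0x84;   // standard G.711 bias (132)
  const CLIP = 32635;  // max magnitude before bias
  const sign = (sample >> 8) & 0x80;
  if (sign) sample = -sample;
  if (sample > CLIP) sample = CLIP;
  sample += BIAS;

  // Find the segment (exponent): position of the highest set bit above bit 7.
  let exponent = 7;
  let mask = 0x4000;
  while ((sample & mask) === 0 && exponent > 0) { exponent--; mask >>= 1; }

  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // µ-law bytes are complemented
}

console.log(pcm16ToMulaw(0).toString(16));      // "ff" — µ-law silence
console.log(pcm16ToMulaw(32767).toString(16));  // "80" — max positive
console.log(pcm16ToMulaw(-32768).toString(16)); // "0"  — max negative
```

If your TTS emits 16 kHz or 24 kHz PCM, decimate to 8 kHz first (ideally with a low-pass filter), then run each sample through this before base64-encoding the frame.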
Q: Can I record while streaming?
Yes — <Start><Stream/></Start> plus standard <Record> works. Recordings are stored separately.
Q: How do I detect dropped streams?
Watch for stop events without prior mark echoes within 5 s. Reconnect with replay buffer.
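That 5-second rule reduces to a small pure check you can run on a timer. A sketch under the assumptions above (`isStreamStale` and the timestamp bookkeeping are illustrative):

```javascript
// Dropped-stream heuristic: a mark was sent but not echoed back within 5 s.
const MARK_TIMEOUT_MS = 5000;

function isStreamStale(lastMarkSentAt, lastMarkEchoAt, now = Date.now()) {
  if (lastMarkSentAt == null) return false;          // nothing outstanding
  if (lastMarkEchoAt != null && lastMarkEchoAt >= lastMarkSentAt) return false; // echoed
  return now - lastMarkSentAt > MARK_TIMEOUT_MS;     // overdue → treat as dropped
}
```

Record `lastMarkSentAt` when you emit a mark and `lastMarkEchoAt` when Twilio echoes it; when the check fires, tear down the leg and kick off the replay-buffer reconnect.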
Q: Latency floor?
~80 ms one-way Twilio→WS in us-east-1. Add LLM + TTS to estimate end-to-end.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.