By Sagar Shankaran, Founder of CallSphere
How Twilio Elastic SIP Trunking and OpenAI's Realtime SIP connector now bridge directly, what the call flow looks like, and the latency budget that actually works.
Key takeaways
The cleanest production pattern for AI phone agents in 2026 is no longer "WebSocket proxy in the middle." Twilio Elastic SIP Trunking now hands a call directly to OpenAI's Realtime SIP endpoint, and your server only steps in to accept the session and stream tools.
flowchart TD
Out[Outbound campaign] --> Twilio[Twilio Voice API]
Twilio --> STIR[STIR/SHAKEN attestation]
STIR --> Carrier[Originating carrier]
Carrier --> Term[Terminating carrier]
Term --> Recipient[Recipient phone]
Recipient --> Webhook[/voice webhook/]
Webhook --> Agent[AI sales agent]

Through most of 2024 and 2025, the canonical pattern for an AI phone agent on Twilio was a Node or Python WebSocket server sitting between Twilio Media Streams and OpenAI's Realtime API. The server transcoded mu-law 8 kHz audio into 16 kHz PCM16, forwarded it to OpenAI over WebSocket, transcoded the response back, and pushed it to Twilio. It worked, but every team hit the same problems: WebSocket reconnect storms during deploys, audio drift on long calls, and a fragile interruption model that lost the last 200 to 600 ms of speech when the user barged in.
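To make the transcoding step in the legacy proxy concrete, here is a minimal sketch of G.711 mu-law decoding to 16-bit PCM in pure Python. It covers only the decode half; resampling from 8 kHz to the model's input rate is a separate step not shown here. The function names are illustrative and are not taken from any CallSphere code.

```python
import struct

BIAS = 0x84  # standard G.711 mu-law bias (132)


def mulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF            # mu-law bytes are transmitted bit-inverted
    exponent = (u >> 4) & 0x07  # 3-bit segment number
    mantissa = u & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if u & 0x80 else magnitude


def mulaw_to_pcm16(frame: bytes) -> bytes:
    """Decode a mu-law frame to little-endian PCM16 bytes."""
    samples = [mulaw_byte_to_pcm16(b) for b in frame]
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 20 ms frame at 8 kHz is 160 mu-law bytes, which decodes to 320 bytes of PCM16. Running this per frame, in both directions, for every live call is the media-plane work the SIP-direct pattern removes from your server.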
In late 2025 OpenAI shipped a SIP connector for the Realtime API. The Realtime endpoint speaks SIP natively. Twilio Elastic SIP Trunking can point an origination URI directly at sip:project-id@sip.api.openai.com;transport=tls. The audio path stops bouncing through your server. Your server only handles a webhook ("realtime.call.incoming"), accepts the session with a voice and an instructions block, and opens a thin WebSocket only for tool calls.
This is the production pattern most serious teams are migrating toward in 2026.
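Before acting on a `realtime.call.incoming` webhook, a server would normally authenticate it. The sketch below assumes the event is signed following the Standard Webhooks scheme: `webhook-id`, `webhook-timestamp`, and `webhook-signature` headers, HMAC-SHA256 over `id.timestamp.body`, and a base64 secret prefixed with `whsec_`. The header names and secret format are assumptions to confirm against OpenAI's webhook documentation, and `sign` and `verify_webhook` are illustrative names.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def sign(secret: str, msg_id: str, timestamp: str, body: bytes) -> str:
    """Compute a 'v1,<base64>' signature (assumed Standard Webhooks format)."""
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = msg_id.encode() + b"." + timestamp.encode() + b"." + body
    digest = hmac.new(key, signed, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode()


def verify_webhook(secret: str, headers: Mapping[str, str], body: bytes,
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Return True only if the signature matches and the timestamp is fresh."""
    try:
        msg_id = headers["webhook-id"]
        timestamp = headers["webhook-timestamp"]
        candidates = headers["webhook-signature"].split()
        sent_at = int(timestamp)
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_s:
        return False  # reject stale or replayed deliveries
    expected = sign(secret, msg_id, timestamp, body).encode()
    # Constant-time comparison against each space-separated signature.
    return any(hmac.compare_digest(expected, c.encode()) for c in candidates)
```

Verification has to run against the raw request bytes, so read the body before any JSON parsing in your web framework.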
The call lifecycle splits cleanly into three legs:
1. Twilio Elastic SIP Trunking hands the call to sip.api.openai.com.
2. OpenAI's edge accepts, opens an SRTP media stream, and starts decoding the inbound G.711 to 24 kHz PCM for the Realtime model.
3. Your server calls /accept with the model, voice, and instructions, then opens a WebSocket to receive function-call events and to push tool results.

The key insight: voice audio never touches your application server in the steady state. Your server is on the control plane only.
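As a rough sketch of the accept step, the function below builds the HTTP request a server might send after receiving a `realtime.call.incoming` event. It only constructs the request, so it can be inspected without network access. The URL shape, the location of `call_id` in the event, the payload keys, and the `gpt-realtime` and `alloy` defaults are all assumptions for illustration; check OpenAI's current Realtime SIP documentation before relying on them.

```python
import json
from dataclasses import dataclass

# Assumed endpoint shape; confirm against OpenAI's Realtime SIP docs.
ACCEPT_URL_TEMPLATE = "https://api.openai.com/v1/realtime/calls/{call_id}/accept"


@dataclass
class AcceptRequest:
    url: str
    headers: dict
    body: str


def build_accept_request(event: dict, api_key: str, instructions: str,
                         model: str = "gpt-realtime",
                         voice: str = "alloy") -> AcceptRequest:
    """Build (but do not send) the accept call for an incoming SIP session."""
    if event.get("type") != "realtime.call.incoming":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    call_id = event["data"]["call_id"]  # assumed location of the call id
    payload = {
        "type": "realtime",
        "model": model,
        "instructions": instructions,
        "audio": {"output": {"voice": voice}},  # assumed payload layout
    }
    return AcceptRequest(
        url=ACCEPT_URL_TEMPLATE.format(call_id=call_id),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        body=json.dumps(payload),
    )
```

Keeping request construction separate from the HTTP client makes the accept step easy to unit test, which matters because a failed accept is what pushes a caller onto the fallback path.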
CallSphere runs Twilio across all six verticals: Healthcare AI, Real Estate AI, Sales Calling AI, Salon AI, IT Helpdesk AI, and After-Hours AI. The Healthcare receptionist uses a FastAPI service on port 8084 to bridge to OpenAI Realtime; Sales Calling AI runs five concurrent outbound calls per tenant on Twilio Programmable Voice; After-Hours AI fires a simultaneous Twilio call plus SMS per on-call contact with a 120-second timeout before falling through to the next contact.
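The After-Hours escalation described above (call plus SMS at the same time, then fall through after a timeout) can be sketched with `asyncio`. This is an illustration of the control flow only, not CallSphere's implementation: `place_call` and `send_sms` are hypothetical coroutines that resolve to `True` when the contact acknowledges.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# Hypothetical notifier: resolves to True if the contact acknowledged.
Notifier = Callable[[str], Awaitable[bool]]


async def escalate(contacts: Sequence[str], place_call: Notifier,
                   send_sms: Notifier,
                   timeout_s: float = 120.0) -> Optional[str]:
    """Return the first contact who acknowledges, or None if nobody does."""
    for contact in contacts:
        # Fire the call and the SMS simultaneously for this contact.
        tasks = [
            asyncio.create_task(place_call(contact)),
            asyncio.create_task(send_sms(contact)),
        ]
        try:
            for finished in asyncio.as_completed(tasks, timeout=timeout_s):
                if await finished:
                    return contact
        except asyncio.TimeoutError:
            pass  # nobody acknowledged in time; fall through to the next contact
        finally:
            for task in tasks:
                task.cancel()  # no-op for tasks that already finished
    return None
```

If both channels report no acknowledgement before the timeout, the loop moves on immediately rather than waiting out the full window.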
The platform ships 37 agents across 90+ tools and 115+ database tables, with HIPAA and SOC 2 controls in place. Pricing is $149, $499, and $1499 for 1, 3, and 10 numbers respectively, with a 14-day trial and a 22% recurring affiliate program. The relevant change for 2026: new Healthcare deployments default to the SIP-direct pattern; older deployments keep the WebSocket-proxy pattern until their next migration window.
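Since tool calls are the main traffic on the control-plane WebSocket, here is a minimal dispatch sketch. It assumes the socket delivers JSON events of type `response.function_call_arguments.done` carrying `name`, `call_id`, and `arguments` fields, and that results go back as a `conversation.item.create` item followed by `response.create`. The `lookup_appointment` tool and the `TOOLS` registry are hypothetical.

```python
import json
from typing import Callable, Dict, List


def lookup_appointment(patient_id: str) -> dict:
    # Hypothetical tool; a real one would query a scheduling system.
    return {"patient_id": patient_id, "next_slot": "unknown"}


TOOLS: Dict[str, Callable[..., dict]] = {"lookup_appointment": lookup_appointment}


def handle_event(raw: str) -> List[str]:
    """Turn one inbound event into zero or more outbound WebSocket messages."""
    event = json.loads(raw)
    if event.get("type") != "response.function_call_arguments.done":
        return []  # not a tool call; nothing to send back
    tool = TOOLS.get(event.get("name", ""))
    if tool is None:
        result = {"error": f"unknown tool {event.get('name')!r}"}
    else:
        try:
            result = tool(**json.loads(event["arguments"]))
        except Exception as exc:  # report the failure to the model instead of crashing
            result = {"error": str(exc)}
    output_item = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(result),
        },
    }
    # Ask the model to continue now that the tool result is in the conversation.
    return [json.dumps(output_item), json.dumps({"type": "response.create"})]
```

Returning errors as tool output keeps the call alive: the model can apologize or retry rather than going silent.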
1. Point the Twilio Elastic SIP trunk's origination URI at sip:YOUR_PROJECT_ID@sip.api.openai.com;transport=tls.
2. Receive the realtime.call.incoming event, signed with HMAC.
3. Call the /accept endpoint with model, voice, and instructions.
4. Open a WebSocket and listen for response.function_call_arguments.done events for tool calls.

<!-- Twilio TwiML fallback when OpenAI accept fails -->
<Response>
  <Say voice="Polly.Joanna-Neural">
    Our assistant is briefly unavailable. Please leave a message after the beep.
  </Say>
  <Record
    timeout="5"
    maxLength="120"
    transcribe="true"
    transcribeCallback="https://api.callsphere.ai/voicemail/transcribe"
    action="https://api.callsphere.ai/voicemail/done"/>
  <Say>We did not capture a message. Goodbye.</Say>
</Response>
Do I still need a WebSocket server with the SIP-direct pattern? Yes, but only for tool calls and observability. Audio bypasses your server entirely.
What happens if my tool server is down when a call comes in? The call still completes; the model just answers without tools. CallSphere recommends a circuit breaker that switches the model to a "safe-mode" instructions block when tool servers are degraded.
Is this cheaper than the proxy pattern? Usually yes, because you stop paying for media-plane bandwidth and CPU on your application servers.
Does interruption handling improve? Materially, yes. OpenAI handles barge-in inside its own audio pipeline rather than over a round-trip through your proxy.
Can I still record the call? Yes. Enable recording on the Twilio trunk side; the SIP-direct path does not break Twilio call recordings.
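The circuit breaker mentioned in the tool-server answer above could look roughly like this. It is a sketch under stated assumptions: the thresholds, the class name, and both instruction strings are hypothetical, and a production breaker would usually add a half-open probe state.

```python
import time
from typing import Callable, Optional

# Hypothetical instruction blocks for illustration.
NORMAL_INSTRUCTIONS = "You can book, look up, and cancel appointments using tools."
SAFE_MODE_INSTRUCTIONS = "Tools are unavailable. Take a message and promise a callback."


class ToolCircuitBreaker:
    """Opens after repeated tool failures and closes again after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset so the next call can try tools again.
            self.record_success()
            return False
        return True

    def instructions(self) -> str:
        """Pick the instructions block to send when accepting a new call."""
        return SAFE_MODE_INSTRUCTIONS if self.is_open() else NORMAL_INSTRUCTIONS
```

Calling `instructions()` at accept time means a degraded tool server changes what the agent promises callers, instead of surfacing as mid-call tool errors.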
Start a 14-day trial to see the SIP-direct pattern in production, compare pricing for 1, 3, or 10 numbers, or read the Twilio integration page for setup details.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.