Telephony Integration for Voice Agents: Connecting to Phone Systems
Connect your AI voice agents to real phone systems using SIP, Twilio, and WebSocket transport with the OpenAI Realtime API for inbound and outbound call handling.
Bridging AI Voice Agents and the Phone Network
A voice agent running in a browser demo is impressive. A voice agent that answers your business phone line is useful. The gap between those two is telephony integration — connecting your AI agent to the Public Switched Telephone Network (PSTN) so real callers on real phones can interact with it.
This post covers three integration patterns: Twilio Media Streams as telephony middleware, direct SIP trunking, and a WebRTC gateway for browser and mobile app callers.
Telephony Architecture Patterns
Pattern 1: Twilio Media Streams + OpenAI Realtime API
This is the most accessible approach. Twilio handles all telephony complexity (phone numbers, call routing, PSTN connectivity) and forwards raw audio to your server via WebSocket Media Streams.
┌──────────┐    PSTN     ┌──────────┐   Media Stream   ┌──────────────┐
│  Caller  │────────────►│  Twilio  │◄────────────────►│ Your Server  │
│ (Phone)  │             │          │   (WebSocket)    │  (FastAPI)   │
└──────────┘             └──────────┘                   └──────┬───────┘
                                                              │
                                                       ┌──────▼───────┐
                                                       │    OpenAI    │
                                                       │ Realtime API │
                                                       └──────────────┘
Pattern 2: Direct SIP Trunk
For high-volume call centers, you connect your SIP-capable server directly to a SIP trunk provider. This eliminates the Twilio middleman but requires you to handle SIP signaling, codec negotiation, and RTP media streams yourself.
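Handling RTP yourself starts with parsing each packet's fixed 12-byte header before handing the G.711 payload to your audio pipeline. A minimal sketch following RFC 3550 (it ignores CSRC lists and header extensions, which real traffic can include):

```python
import struct

def parse_rtp_packet(packet: bytes) -> dict:
    """Split an RTP packet into its fixed header fields and payload (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("too short to be an RTP packet")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,         # always 2 for RTP
        "payload_type": b1 & 0x7F,  # 0 = PCMU (G.711 u-law)
        "marker": bool(b1 & 0x80),
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],     # 160 bytes per 20 ms frame at 8 kHz G.711
    }
```

For G.711 at 8 kHz, each 20 ms packet carries 160 payload bytes; the sequence number lets you detect loss and reordering before forwarding audio to the agent.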
Pattern 3: WebRTC Gateway
For browser-based or mobile app callers, you use a WebRTC gateway that bridges browser audio to your voice agent pipeline. This is the approach used in web-based customer portals.
Implementation: Twilio Media Streams
Step 1: Twilio Configuration
First, configure a Twilio phone number to forward calls to your server via TwiML.
# twilio_config.py
from twilio.rest import Client
import os

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def configure_phone_number(phone_sid: str, webhook_url: str):
    """Point a Twilio phone number at our voice webhook."""
    client.incoming_phone_numbers(phone_sid).update(
        voice_url=f"{webhook_url}/twilio/voice",
        voice_method="POST",
    )
Step 2: TwiML Voice Webhook
When Twilio receives a call, it hits your webhook. You respond with TwiML that opens a Media Stream WebSocket back to your server.
# main.py
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice_webhook(request: Request):
    """Twilio calls this when a new inbound call arrives."""
    form = await request.form()
    caller = form.get("From", "unknown")
    call_sid = form.get("CallSid", "")
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Please hold while we connect you to our assistant.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="caller" value="{caller}" />
            <Parameter name="call_sid" value="{call_sid}" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
Step 3: Media Stream WebSocket Handler
This is the core: a WebSocket endpoint that receives Twilio's audio stream, forwards it to OpenAI's Realtime API, and sends the response audio back to Twilio.
# media_stream.py
import asyncio
import json
import os

import websockets
from fastapi import WebSocket, WebSocketDisconnect

OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime"
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

SYSTEM_INSTRUCTIONS = """You are a helpful customer support agent for Acme Corp.
You are speaking with a customer on the phone. Keep responses concise and natural.
When you need to look up information, tell the customer you are checking.
If you cannot help, offer to transfer them to a human agent."""

async def handle_twilio_media_stream(websocket: WebSocket):
    """Bridge between Twilio Media Stream and OpenAI Realtime API."""
    await websocket.accept()
    stream_sid = None
    caller = "unknown"

    # Connect to OpenAI Realtime API
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(
        f"{OPENAI_REALTIME_URL}?model=gpt-4o-realtime-preview",
        additional_headers=headers,
    ) as openai_ws:
        # Configure the OpenAI session
        session_config = {
            "type": "session.update",
            "session": {
                "instructions": SYSTEM_INSTRUCTIONS,
                "voice": "alloy",
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 700,
                },
            },
        }
        await openai_ws.send(json.dumps(session_config))

        async def twilio_to_openai():
            """Forward Twilio audio to OpenAI."""
            nonlocal stream_sid, caller
            try:
                while True:
                    message = await websocket.receive_text()
                    data = json.loads(message)
                    if data["event"] == "start":
                        stream_sid = data["start"]["streamSid"]
                        params = data["start"].get("customParameters", {})
                        caller = params.get("caller", "unknown")
                    elif data["event"] == "media":
                        audio_event = {
                            "type": "input_audio_buffer.append",
                            "audio": data["media"]["payload"],
                        }
                        await openai_ws.send(json.dumps(audio_event))
                    elif data["event"] == "stop":
                        break
            except WebSocketDisconnect:
                pass

        async def openai_to_twilio():
            """Forward OpenAI audio back to Twilio."""
            try:
                async for message in openai_ws:
                    data = json.loads(message)
                    if data["type"] == "response.audio.delta":
                        twilio_message = {
                            "event": "media",
                            "streamSid": stream_sid,
                            "media": {"payload": data["delta"]},
                        }
                        await websocket.send_json(twilio_message)
                    elif data["type"] == "response.audio.done":
                        # Mark end of response for logging
                        pass
                    elif data["type"] == "input_audio_buffer.speech_started":
                        # User started speaking — clear any pending audio
                        clear_msg = {
                            "event": "clear",
                            "streamSid": stream_sid,
                        }
                        await websocket.send_json(clear_msg)
            except Exception:
                pass

        await asyncio.gather(twilio_to_openai(), openai_to_twilio())
Step 4: Register the WebSocket Route
# In main.py, add the media stream route
from fastapi import WebSocket
from media_stream import handle_twilio_media_stream

@app.websocket("/twilio/media-stream")
async def twilio_media_stream(websocket: WebSocket):
    await handle_twilio_media_stream(websocket)
Outbound Calls
Voice agents can also initiate calls — for appointment reminders, follow-ups, or proactive support.
# outbound.py
from twilio.rest import Client
import os

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def initiate_outbound_call(
    to_number: str,
    from_number: str,
    webhook_base_url: str,
    purpose: str = "follow_up",
):
    """Initiate an outbound call that connects to our AI agent."""
    twiml_url = f"{webhook_base_url}/twilio/outbound-voice?purpose={purpose}"
    call = client.calls.create(
        to=to_number,
        from_=from_number,
        url=twiml_url,
        method="POST",
        status_callback=f"{webhook_base_url}/twilio/call-status",
        status_callback_event=["initiated", "ringing", "answered", "completed"],
    )
    return call.sid

# In main.py — when the callee answers, connect them to the same media stream
@app.post("/twilio/outbound-voice")
async def outbound_voice_webhook(request: Request):
    """Handle the outbound call connection."""
    params = request.query_params
    purpose = params.get("purpose", "general")
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="direction" value="outbound" />
            <Parameter name="purpose" value="{purpose}" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
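Twilio rejects numbers that aren't in E.164 format, so it can pay to validate `to_number` before calling `initiate_outbound_call`. A quick sketch (this regex is a simplification of the full E.164 rules; it only checks for `+`, a non-zero first digit, and the 15-digit cap):

```python
import re

# Simplified E.164 shape: "+" then 2-15 digits, first digit non-zero
E164_RE = re.compile(r"\+[1-9]\d{1,14}")

def is_e164(number: str) -> bool:
    """True if `number` looks like an E.164 phone number."""
    return E164_RE.fullmatch(number) is not None
```

Rejecting malformed numbers up front gives a clean error to the caller of your API instead of a Twilio exception mid-campaign.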
DTMF Tone Handling
Some callers prefer pressing buttons. You can handle DTMF input alongside voice by gathering digits before connecting the Media Stream.
@app.post("/twilio/voice-with-dtmf")
async def voice_with_dtmf(request: Request):
    """Offer a DTMF menu before connecting to the AI agent."""
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather numDigits="1" action="/twilio/dtmf-handler" method="POST" timeout="5">
        <Say voice="alice">
            Press 1 for billing, 2 for refunds, or stay on the line
            to speak with our AI assistant.
        </Say>
    </Gather>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="triage" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.post("/twilio/dtmf-handler")
async def dtmf_handler(request: Request):
    """Route based on DTMF digit pressed."""
    form = await request.form()
    digit = form.get("Digits", "")
    department_map = {"1": "billing", "2": "refunds"}
    department = department_map.get(digit, "triage")
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Connecting you now.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="{department}" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
SIP Integration Overview
For direct SIP integration without Twilio, you need a SIP stack. The open-source library pjsip (via pjsua2 Python bindings) handles SIP signaling, while you manage the RTP audio stream yourself.
# sip_overview.py (conceptual — requires pjsua2)
"""
SIP integration requires three components:
1. SIP User Agent — registers with your SIP provider and handles
INVITE/BYE/CANCEL signaling
2. RTP Media Handler — receives and sends audio packets using
the negotiated codec (typically G.711 u-law or a-law)
3. Audio Bridge — converts between RTP packets and the PCM16
format expected by OpenAI's Realtime API
The flow:
SIP INVITE → Accept call → Negotiate codec → Open RTP stream
→ Forward RTP audio to OpenAI → Receive response audio
→ Send as RTP back to caller → BYE to end call
Key considerations:
- Codec negotiation: Prefer G.711 u-law for compatibility
- NAT traversal: Use STUN/TURN if your server is behind NAT
- Registration refresh: SIP registrations expire; re-register periodically
- Call recording: Tap the RTP stream for compliance recording
"""
Production Checklist
When deploying telephony-connected voice agents:
- Phone number management: Use a pool of numbers for outbound calls to avoid spam flagging
- Call recording consent: Announce recording at the start of each call where legally required
- Failover: If the AI pipeline is down, fall back to a traditional IVR or voicemail
- Cost monitoring: Track per-minute costs across Twilio, OpenAI Realtime API, and compute
- Concurrent call limits: Size your WebSocket server for your peak concurrent call volume
- Audio quality logging: Log audio quality metrics (jitter, packet loss) for debugging
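The concurrent-call limit is easiest to enforce at the webhook: track active streams and answer with busy TwiML (for example a `<Say>` followed by `<Hangup/>`) once you are at capacity. A minimal counter sketch (the limit of 50 is an arbitrary placeholder; derive yours from load testing):

```python
import asyncio

class CallLimiter:
    """Track active calls and refuse new ones past a fixed limit."""

    def __init__(self, limit: int):
        self._limit = limit
        self._active = 0
        self._lock = asyncio.Lock()

    async def try_acquire(self) -> bool:
        """Claim a call slot; returns False when at capacity."""
        async with self._lock:
            if self._active >= self._limit:
                return False
            self._active += 1
            return True

    async def release(self) -> None:
        """Free a slot when the media stream closes."""
        async with self._lock:
            self._active = max(0, self._active - 1)

limiter = CallLimiter(limit=50)  # placeholder; size from load testing
```

In the voice webhook, check `await limiter.try_acquire()` before returning the `<Connect>` TwiML, and call `release()` in a `finally` block when the Media Stream WebSocket closes so crashed calls do not leak slots.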
Written by
CallSphere Team