
Telephony Integration for Voice Agents: Connecting to Phone Systems

Connect your AI voice agents to real phone systems using SIP, Twilio, and WebSocket transport with the OpenAI Realtime API for inbound and outbound call handling.

Bridging AI Voice Agents and the Phone Network

A voice agent running in a browser demo is impressive. A voice agent that answers your business phone line is useful. The gap between those two is telephony integration — connecting your AI agent to the Public Switched Telephone Network (PSTN) so real callers on real phones can interact with it.

This post covers three integration patterns: Twilio as telephony middleware via Media Streams, direct SIP trunking, and a WebRTC gateway for browser and mobile callers.

Telephony Architecture Patterns

Pattern 1: Twilio Media Streams + OpenAI Realtime API

This is the most accessible approach. Twilio handles all telephony complexity (phone numbers, call routing, PSTN connectivity) and forwards raw audio to your server via WebSocket Media Streams.

┌──────────┐    PSTN     ┌──────────┐   Media Stream   ┌──────────────┐
│  Caller  │────────────►│  Twilio  │◄────────────────►│  Your Server │
│ (Phone)  │             │          │    (WebSocket)    │  (FastAPI)   │
└──────────┘             └──────────┘                   └──────┬───────┘
                                                               │
                                                        ┌──────▼───────┐
                                                        │ OpenAI       │
                                                        │ Realtime API │
                                                        └──────────────┘

Pattern 2: Direct SIP Trunk

For high-volume call centers, you connect your SIP-capable server directly to a SIP trunk provider. This eliminates the Twilio middleman but requires you to handle SIP signaling, codec negotiation, and RTP media streams yourself.

Pattern 3: WebRTC Gateway

For browser-based or mobile app callers, you use a WebRTC gateway that bridges browser audio to your voice agent pipeline. This is the approach used in web-based customer portals.

Implementation: Twilio Media Streams

Step 1: Twilio Configuration

First, configure a Twilio phone number to forward calls to your server via TwiML.

# twilio_config.py
from twilio.rest import Client
import os

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def configure_phone_number(phone_sid: str, webhook_url: str):
    """Point a Twilio phone number at our voice webhook."""
    client.incoming_phone_numbers(phone_sid).update(
        voice_url=f"{webhook_url}/twilio/voice",
        voice_method="POST",
    )

Step 2: TwiML Voice Webhook

When Twilio receives a call, it hits your webhook. You respond with TwiML that opens a Media Stream WebSocket back to your server.

# main.py
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice_webhook(request: Request):
    """Twilio calls this when a new inbound call arrives."""
    form = await request.form()
    caller = form.get("From", "unknown")
    call_sid = form.get("CallSid", "")

    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Please hold while we connect you to our assistant.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="caller" value="{caller}" />
            <Parameter name="call_sid" value="{call_sid}" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")

Step 3: Media Stream WebSocket Handler

This is the core: a WebSocket endpoint that receives Twilio's audio stream, forwards it to OpenAI's Realtime API, and sends the response audio back to Twilio.

# media_stream.py
import asyncio
import json
import base64
import websockets
from fastapi import WebSocket, WebSocketDisconnect
import os

OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime"
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

SYSTEM_INSTRUCTIONS = """You are a helpful customer support agent for Acme Corp.
You are speaking with a customer on the phone. Keep responses concise and natural.
When you need to look up information, tell the customer you are checking.
If you cannot help, offer to transfer them to a human agent."""

async def handle_twilio_media_stream(websocket: WebSocket):
    """Bridge between Twilio Media Stream and OpenAI Realtime API."""
    await websocket.accept()

    stream_sid = None
    caller = "unknown"

    # Connect to OpenAI Realtime API
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(
        f"{OPENAI_REALTIME_URL}?model=gpt-4o-realtime-preview",
        additional_headers=headers,  # named extra_headers on websockets < 14
    ) as openai_ws:

        # Configure the OpenAI session
        session_config = {
            "type": "session.update",
            "session": {
                "instructions": SYSTEM_INSTRUCTIONS,
                "voice": "nova",
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 700,
                },
            },
        }
        await openai_ws.send(json.dumps(session_config))

        async def twilio_to_openai():
            """Forward Twilio audio to OpenAI."""
            nonlocal stream_sid, caller
            try:
                while True:
                    message = await websocket.receive_text()
                    data = json.loads(message)

                    if data["event"] == "start":
                        stream_sid = data["start"]["streamSid"]
                        params = data["start"].get("customParameters", {})
                        caller = params.get("caller", "unknown")

                    elif data["event"] == "media":
                        audio_payload = data["media"]["payload"]
                        audio_event = {
                            "type": "input_audio_buffer.append",
                            "audio": audio_payload,
                        }
                        await openai_ws.send(json.dumps(audio_event))

                    elif data["event"] == "stop":
                        break
            except WebSocketDisconnect:
                pass

        async def openai_to_twilio():
            """Forward OpenAI audio back to Twilio."""
            try:
                async for message in openai_ws:
                    data = json.loads(message)

                    if data["type"] == "response.audio.delta":
                        audio_delta = data["delta"]
                        twilio_message = {
                            "event": "media",
                            "streamSid": stream_sid,
                            "media": {"payload": audio_delta},
                        }
                        await websocket.send_json(twilio_message)

                    elif data["type"] == "response.audio.done":
                        # Mark end of response for logging
                        pass

                    elif data["type"] == "input_audio_buffer.speech_started":
                        # User started speaking — clear any pending audio
                        clear_msg = {
                            "event": "clear",
                            "streamSid": stream_sid,
                        }
                        await websocket.send_json(clear_msg)
            except Exception:
                pass

        await asyncio.gather(twilio_to_openai(), openai_to_twilio())

Step 4: Register the WebSocket Route

# In main.py, add the media stream route
from media_stream import handle_twilio_media_stream

@app.websocket("/twilio/media-stream")
async def twilio_media_stream(websocket: WebSocket):
    await handle_twilio_media_stream(websocket)
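
Before pointing real calls at the server, the message framing can be sanity-checked locally. The sketch below builds the same `media` envelope the handler above parses (the 20 ms / 160-byte frame size matches Twilio's 8 kHz u-law stream; `make_media_message` is an illustrative helper, not part of any SDK):

```python
# framing_check.py — build a Twilio-style "media" frame (illustrative helper)
import base64
import json

def make_media_message(stream_sid: str, ulaw_bytes: bytes) -> str:
    """Wrap base64 u-law audio in the envelope the media-stream handler expects."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(ulaw_bytes).decode("ascii")},
    })

# 160 bytes of u-law = 20 ms of audio at 8 kHz, one byte per sample
frame = make_media_message("MZ123", b"\xff" * 160)
parsed = json.loads(frame)
```

Feeding frames like this to the WebSocket endpoint lets you exercise the OpenAI bridge without placing a phone call.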

Outbound Calls

Voice agents can also initiate calls — for appointment reminders, follow-ups, or proactive support.

# outbound.py
import os

from fastapi import Request
from fastapi.responses import Response
from twilio.rest import Client

from main import app  # reuse the FastAPI app defined in main.py

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def initiate_outbound_call(
    to_number: str,
    from_number: str,
    webhook_base_url: str,
    purpose: str = "follow_up",
):
    """Initiate an outbound call that connects to our AI agent."""
    twiml_url = f"{webhook_base_url}/twilio/outbound-voice?purpose={purpose}"

    call = client.calls.create(
        to=to_number,
        from_=from_number,
        url=twiml_url,
        method="POST",
        status_callback=f"{webhook_base_url}/twilio/call-status",
        status_callback_event=["initiated", "ringing", "answered", "completed"],
    )
    return call.sid

@app.post("/twilio/outbound-voice")
async def outbound_voice_webhook(request: Request):
    """Handle the outbound call connection."""
    params = request.query_params
    purpose = params.get("purpose", "general")

    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="direction" value="outbound" />
            <Parameter name="purpose" value="{purpose}" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")
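
`initiate_outbound_call` points `status_callback` at `/twilio/call-status`, which is not defined above. A minimal sketch of the tracking logic might look like this (`CallSid` and `CallStatus` are Twilio's documented callback fields; `record_call_status` and the in-memory dict are illustrative assumptions — production code would persist these):

```python
# call_status.py — track the lifecycle of outbound calls (illustrative sketch)
call_states: dict[str, str] = {}

def record_call_status(form: dict) -> str:
    """Record the latest status (initiated/ringing/answered/completed) per call."""
    call_sid = form.get("CallSid", "unknown")
    status = form.get("CallStatus", "unknown")
    call_states[call_sid] = status
    return status
```

Wire it up with a `@app.post("/twilio/call-status")` route that passes the parsed form data to `record_call_status`.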

DTMF Tone Handling

Some callers prefer pressing buttons. You can handle DTMF input alongside voice by gathering digits before connecting the Media Stream.

@app.post("/twilio/voice-with-dtmf")
async def voice_with_dtmf(request: Request):
    """Offer a DTMF menu before connecting to the AI agent."""
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather numDigits="1" action="/twilio/dtmf-handler" method="POST" timeout="5">
        <Say voice="alice">
            Press 1 for billing, 2 for refunds, or stay on the line
            to speak with our AI assistant.
        </Say>
    </Gather>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="triage" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")

@app.post("/twilio/dtmf-handler")
async def dtmf_handler(request: Request):
    """Route based on DTMF digit pressed."""
    form = await request.form()
    digit = form.get("Digits", "")

    department_map = {"1": "billing", "2": "refunds"}
    department = department_map.get(digit, "triage")

    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Connecting you now.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="{department}" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")

SIP Integration Overview

For direct SIP integration without Twilio, you need a SIP stack. The open-source library pjsip (via pjsua2 Python bindings) handles SIP signaling, while you manage the RTP audio stream yourself.

# sip_overview.py (conceptual — requires pjsua2)
"""
SIP integration requires three components:

1. SIP User Agent — registers with your SIP provider and handles
   INVITE/BYE/CANCEL signaling
2. RTP Media Handler — receives and sends audio packets using
   the negotiated codec (typically G.711 u-law or a-law)
3. Audio Bridge — converts between RTP packets and the PCM16
   format expected by OpenAI's Realtime API

The flow:
  SIP INVITE → Accept call → Negotiate codec → Open RTP stream
  → Forward RTP audio to OpenAI → Receive response audio
  → Send as RTP back to caller → BYE to end call

Key considerations:
- Codec negotiation: Prefer G.711 u-law for compatibility
- NAT traversal: Use STUN/TURN if your server is behind NAT
- Registration refresh: SIP registrations expire; re-register periodically
- Call recording: Tap the RTP stream for compliance recording
"""

Production Checklist

When deploying telephony-connected voice agents:

  1. Phone number management: Use a pool of numbers for outbound calls to avoid spam flagging
  2. Call recording consent: Announce recording at the start of each call where legally required
  3. Failover: If the AI pipeline is down, fall back to a traditional IVR or voicemail
  4. Cost monitoring: Track per-minute costs across Twilio, OpenAI Realtime API, and compute
  5. Concurrent call limits: Size your WebSocket server for your peak concurrent call volume
  6. Audio quality logging: Log audio quality metrics (jitter, packet loss) for debugging
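
Item 1 can start as simple round-robin rotation over your verified numbers (a sketch; `NumberPool` is an illustrative helper, not a Twilio SDK class):

```python
# number_pool.py — rotate outbound caller IDs to spread volume across numbers
import itertools

class NumberPool:
    """Round-robin pool of outbound caller IDs (illustrative sketch)."""

    def __init__(self, numbers: list[str]):
        if not numbers:
            raise ValueError("pool must contain at least one number")
        self._cycle = itertools.cycle(numbers)

    def next_number(self) -> str:
        """Return the next caller ID in round-robin order."""
        return next(self._cycle)

pool = NumberPool(["+15550100", "+15550101"])
```

Pass `pool.next_number()` as `from_number` when calling `initiate_outbound_call`.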

Written by

CallSphere Team
