RTP and the Opus Codec: AI Voice Quality from PSTN to Model
How RTP carries AI voice end-to-end, why Opus matters more than G.711 for model accuracy, and the codec negotiation patterns that ship in 2026.
If you care about the accuracy of your AI voice agent, the codec choice on the carrier leg matters more than the model choice. The full-fidelity end of the call is where the model lives; the squeezed end is where the PSTN forces 8 kHz mu-law. The 2026 patterns push the high-fidelity boundary as far toward the caller as possible.
Background: the codec hierarchy that matters in 2026
flowchart LR
Phone["PSTN caller"] --> Carrier["Carrier"]
Carrier -- "SIP INVITE" --> SBC["Session Border Controller"]
SBC -- "SIP" --> PBX["Twilio / Asterisk"]
PBX -- "RTP · Opus" --> Bridge["AI Voice Gateway"]
Bridge --> AI["OpenAI Realtime"]
AI --> Bridge
Bridge --> PBX
Voice over IP audio is encoded by a codec on the sender, packetized into RTP, and decoded by a codec on the receiver. The codecs that matter for AI agents are:
- G.711 (PCMU/PCMA): 8 kHz, 64 kbps, the PSTN baseline. Universal but narrowband.
- G.722: 16 kHz, 64 kbps. The simplest "HD voice" upgrade. Compatible with most modern carriers.
- Opus: 8 to 48 kHz, 6 to 510 kbps, the WebRTC standard. Adaptive, low-latency, and the codec used on WebRTC connections to the OpenAI Realtime API.
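To make "packetized into RTP" concrete, here is a minimal sketch of building and parsing the fixed 12-byte RTP header. The function names are hypothetical, the payload type table covers only the static assignments for the G.711 and G.722 codecs listed above, and header extensions are not handled.

```python
import struct

# Static payload types from RFC 3551. Opus uses a dynamic type (96-127)
# assigned in SDP, so it is deliberately absent from this table.
STATIC_PAYLOAD_TYPES = {0: "PCMU", 8: "PCMA", 9: "G722"}

def build_rtp_packet(payload_type, seq, timestamp, ssrc, payload, marker=False):
    """Prepend a 12-byte RTP header (version 2, no padding, extension, or CSRCs)."""
    byte0 = 2 << 6  # version=2 in the top two bits, all other flags zero
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload

def parse_rtp_packet(packet):
    """Split a packet into header fields and payload, skipping any CSRC entries.

    Sketch only: header extensions and padding are not handled.
    """
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = byte0 & 0x0F
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12 + 4 * csrc_count:],
    }

# One 20 ms G.711 mu-law frame: 160 samples at 8 kHz, one byte per sample.
frame = bytes([0xFF]) * 160
packet = build_rtp_packet(0, seq=1, timestamp=160, ssrc=0x1234ABCD, payload=frame)
fields = parse_rtp_packet(packet)
print(len(packet), STATIC_PAYLOAD_TYPES[fields["payload_type"]])
```

The timestamp advances by the number of samples per frame at the codec's RTP clock rate, which is why a 20 ms G.711 frame moves it by 160.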
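For voice to feel natural, one-way latency should stay under ~150 ms. Opus supports frame sizes from 2.5 to 60 ms, with 20 ms the common default; shorter frames cut packetization delay at the cost of more per-packet overhead, which gives you a real lever on the latency budget.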
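The latency budget can be worked through as simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the network and jitter-buffer figures are made-up inputs, and the 6.5 ms codec look-ahead is an assumed value rather than a measurement.

```python
IP_UDP_RTP_OVERHEAD_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP headers, no extensions

def packet_stats(bitrate_bps, frame_ms):
    """Return (payload bytes per packet, packets per second, on-wire bits per second)."""
    payload_bytes = bitrate_bps * frame_ms // 8000
    packets_per_second = 1000 / frame_ms
    wire_bps = (payload_bytes + IP_UDP_RTP_OVERHEAD_BYTES) * 8 * packets_per_second
    return payload_bytes, packets_per_second, wire_bps

def one_way_latency_ms(frame_ms, network_ms, jitter_buffer_ms, codec_lookahead_ms=0.0):
    """Sum the main contributors to mouth-to-ear delay on one leg."""
    return frame_ms + codec_lookahead_ms + network_ms + jitter_buffer_ms

# G.711 at 64 kbps with 20 ms frames: 160-byte payloads, 50 packets per second.
print(packet_stats(64000, 20))

# Assumed inputs: 40 ms network delay, 60 ms jitter buffer, 6.5 ms look-ahead.
for frame_ms in (20, 60):
    total = one_way_latency_ms(frame_ms, network_ms=40, jitter_buffer_ms=60,
                               codec_lookahead_ms=6.5)
    verdict = "within" if total <= 150 else "over"
    print(f"{frame_ms} ms frames: {total} ms one-way, {verdict} the 150 ms target")
```

With these assumed inputs, 20 ms frames leave headroom under 150 ms, while 60 ms frames (a size Opus also allows) push the same path over the target.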
The 2026 trend: keep Opus all the way from caller to model where possible (WebRTC client to your AI), and only collapse to G.711 when crossing into the PSTN. Every codec hop adds artifacts that hurt model accuracy.
How VoIP and SIP work for this use case
Codec negotiation happens in the SDP body of the SIP INVITE. The caller sends an offer listing supported codecs in priority order. The receiver answers with the codec it picks. RTP carries the encoded payloads on the negotiated UDP port pair.
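As a rough illustration of the offer/answer step, the sketch below parses the audio section of an SDP offer and picks a codec by local preference. The SDP text, IP address, and function names are invented for the example; a production stack would rely on its SIP library rather than hand-parsing.

```python
# Static payload types need no a=rtpmap line (RFC 3551). G.722 is advertised
# with an 8000 Hz RTP clock for historical reasons, although it samples at 16 kHz.
STATIC_RTPMAP = {0: "PCMU/8000", 8: "PCMA/8000", 9: "G722/8000"}

def parse_audio_offer(sdp):
    """Return [(payload_type, codec_name, clock_rate)] in the offerer's priority order."""
    payload_types, rtpmap = [], dict(STATIC_RTPMAP)
    in_audio = False
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            in_audio = line.startswith("m=audio ")
            if in_audio:
                # m=audio <port> <proto> <pt> <pt> ...
                payload_types = [int(pt) for pt in line.split()[3:]]
        elif in_audio and line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            rtpmap[int(pt)] = encoding
    offer = []
    for pt in payload_types:
        if pt in rtpmap:
            name, rate = rtpmap[pt].split("/")[:2]
            offer.append((pt, name.upper(), int(rate)))
    return offer

def choose_codec(offer, local_prefs):
    """Pick the first locally preferred codec that the offer also contains."""
    for wanted in local_prefs:
        for pt, name, rate in offer:
            if name == wanted.upper():
                return pt, name, rate
    return None  # no codec in common; the call would be rejected

OFFER = """
v=0
o=- 0 0 IN IP4 203.0.113.10
s=-
c=IN IP4 203.0.113.10
t=0 0
m=audio 40000 RTP/AVP 111 9 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
"""

offer = parse_audio_offer(OFFER)
print(choose_codec(offer, ["opus", "G722", "PCMU"]))
# A PSTN-style offer with only static payload types still resolves to G.711.
print(choose_codec(parse_audio_offer("m=audio 40000 RTP/AVP 0 8"), ["opus", "G722", "PCMU"]))
```

Here the answerer's preference list wins. A stack that instead honors the offerer's order would loop over the offer first and the local list second.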
For a typical AI voice call:
- PSTN to Twilio: G.711 mu-law, 8 kHz.
- Twilio to OpenAI SIP: still G.711 when the call arrived from the PSTN; Opus when it arrived from a browser or mobile client.
- OpenAI to model: 24 kHz PCM internally.
Most of the audio quality loss is in that PSTN G.711 leg. There is nothing your software can do about it; the originating carrier picked the codec when the call left the caller's phone. What you control is everything after the carrier hands the call to you: keep it Opus, keep it 16 or 24 kHz, and avoid double-transcoding through low-rate codecs.
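To show what happens to a PSTN frame on its way to a 24 kHz pipeline, here is a minimal sketch that decodes G.711 mu-law to 16-bit PCM and upsamples by three. The function names are hypothetical, and the linear interpolation is a stand-in for a proper low-pass resampler; it does not describe how Twilio or OpenAI resample internally.

```python
def ulaw_to_linear(code):
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    code = ~code & 0xFF               # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def upsample_3x(samples):
    """Naive 8 kHz -> 24 kHz upsampling by linear interpolation (sketch only)."""
    out = []
    for i, current in enumerate(samples):
        # Hold the last sample at the end of the buffer.
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        step = nxt - current
        out.extend([current, current + step // 3, current + 2 * step // 3])
    return out

# One 20 ms mu-law frame (160 bytes) becomes 480 samples at 24 kHz.
frame = bytes([0xFF]) * 160
pcm_8k = [ulaw_to_linear(b) for b in frame]
pcm_24k = upsample_3x(pcm_8k)
print(len(pcm_8k), len(pcm_24k))
```

Upsampling changes the sample rate, not the content: nothing above roughly 4 kHz comes back once the PSTN leg has removed it, which is why the later legs can only preserve quality, not restore it.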
CallSphere implementation
CallSphere uses Twilio across all products. PSTN inbound to the Healthcare AI receptionist on FastAPI :8084 is G.711 mu-law on the carrier leg, then bridged to OpenAI Realtime, where the audio is upsampled to 24 kHz internally. Sales Calling AI, which runs five concurrent outbound calls on Twilio Programmable Voice, and After-Hours AI, which pairs a simultaneous Twilio call and SMS with a 120-second timeout, follow the same path.
For browser and mobile demo paths (the /demo page), CallSphere uses Opus end-to-end via WebRTC, which produces noticeably crisper audio and slightly higher model accuracy than the PSTN paths. The 37 agents, 90+ tools, 115+ database tables, HIPAA and SOC 2 controls, and the $149/$499/$1499 pricing for 1/3/10 numbers do not change based on codec, but customers using wideband-capable headsets or phones see better outcomes.
Build and integration steps
- Inventory the codecs in use on every leg of your call path.
- Where you control the leg (browser, mobile, internal SIP), use Opus.
- Where the PSTN forces G.711, accept it and avoid additional transcodes.
- Configure your media servers (FreeSWITCH, Asterisk, SBC) to prefer wideband on inter-server legs.
- Tune the jitter buffer to adapt to network conditions; default fixed buffers are usually too large.
- Run a needle-in-the-noise transcription accuracy test across your codec configurations.
- Track per-call metrics: codec used, packet loss percentage, jitter, mean opinion score (if available).
- Alert on packet loss above 1% and on jitter above 30 ms.
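The last two steps can be sketched as a small per-stream tracker. The class and method names below are hypothetical; the jitter estimate follows the running interarrival formula from RFC 3550, and the thresholds are the 1% loss and 30 ms jitter figures from the list above. Sequence-number wraparound is ignored to keep the example short.

```python
LOSS_ALERT_PCT = 1.0
JITTER_ALERT_MS = 30.0

class RtpStreamStats:
    """Track loss and interarrival jitter for one RTP stream (sketch only)."""

    def __init__(self, clock_rate):
        self.clock_rate = clock_rate
        self.jitter_units = 0.0      # running jitter in RTP timestamp units
        self.received = 0
        self.first_seq = None
        self.highest_seq = None
        self._last_transit = None

    def on_packet(self, seq, rtp_timestamp, arrival_seconds):
        # Transit time = arrival time (in timestamp units) minus RTP timestamp.
        transit = arrival_seconds * self.clock_rate - rtp_timestamp
        if self._last_transit is not None:
            d = abs(transit - self._last_transit)
            self.jitter_units += (d - self.jitter_units) / 16.0
        self._last_transit = transit
        self.received += 1
        if self.first_seq is None:
            self.first_seq = self.highest_seq = seq
        else:
            self.highest_seq = max(self.highest_seq, seq)  # ignores 16-bit wraparound

    @property
    def jitter_ms(self):
        return 1000.0 * self.jitter_units / self.clock_rate

    @property
    def loss_pct(self):
        if self.received == 0:
            return 0.0
        expected = self.highest_seq - self.first_seq + 1
        return 100.0 * max(expected - self.received, 0) / expected

    def alerts(self):
        triggered = []
        if self.loss_pct > LOSS_ALERT_PCT:
            triggered.append("packet_loss")
        if self.jitter_ms > JITTER_ALERT_MS:
            triggered.append("jitter")
        return triggered

# Simulated G.711 stream: 100 packets at 20 ms spacing, two of them lost.
stats = RtpStreamStats(clock_rate=8000)
for seq in range(100):
    if seq in (50, 51):
        continue
    stats.on_packet(seq, rtp_timestamp=seq * 160, arrival_seconds=seq * 0.020)
print(stats.loss_pct, round(stats.jitter_ms, 3), stats.alerts())
```

In this simulation the packets arrive with perfect spacing, so only the loss alert fires; real traffic would feed `on_packet` with receive timestamps from the media server.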
Code or config snippet
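<!-- FreeSWITCH SIP profile: prefer Opus, fall back to G.722, then G.711 -->
<settings>
  <!-- Codec preference lists, highest priority first -->
  <param name="inbound-codec-prefs" value="opus,G722,PCMU,PCMA"/>
  <param name="outbound-codec-prefs" value="opus,G722,PCMU,PCMA"/>
  <!-- "greedy" lets the local preference order win; "generous" would defer to the caller's order -->
  <param name="inbound-codec-negotiation" value="greedy"/>
  <param name="enable-3pcc" value="proxy"/>
  <!-- Hang up after 300 s without RTP, or 1800 s while on hold -->
  <param name="rtp-timeout-sec" value="300"/>
  <param name="rtp-hold-timeout-sec" value="1800"/>
</settings>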
FAQ
Does Opus work over the PSTN? No, the PSTN forces G.711 on the last mile. Opus matters on the legs you control: browser, mobile, internal SIP.
Will my AI accuracy improve if I force G.722 on the carrier leg? Slightly, on calls where the originating carrier supports it end-to-end. Most US PSTN paths still collapse to G.711 somewhere.
What about transcoding cost? Modern SBCs and softswitches transcode in software at low cost. The bigger penalty is latency and audio fidelity, not CPU.
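Should I use 8 kHz Opus or 16 kHz Opus? 16 kHz where bandwidth permits. Wideband audio keeps speech energy above 4 kHz, such as the difference between "s" and "f" sounds, that narrowband audio discards, so the model accuracy gain from 16 kHz is real.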
Does the OpenAI Realtime API care which codec I send? Over its WebSocket interface it accepts 24 kHz PCM16, G.711 mu-law, and G.711 a-law; over WebRTC the audio is Opus. Internally it works on 24 kHz PCM regardless.