# RTP Transcoding Cost for AI Voice in 2026: Why Edge Placement Beats Central GPU
Transcoding RTP to WebSocket is more CPU-intensive than people expect. For AI voice in 2026, where you place the transcode (edge near the carrier vs central near the model) decides your cost-per-minute.
A 24-vCPU bridge node can transcode roughly 800 simultaneous PCMU-to-L16 calls. That same node running OpenAI Realtime client connections plus WebSocket framing handles closer to 200. The 4x gap is where your AI voice unit economics live.
## Background
```mermaid
flowchart LR
    UA[SIP UA] -- REGISTER --> Reg[Registrar]
    UA -- INVITE --> Proxy[SIP Proxy]
    Proxy --> Dispatcher[Kamailio dispatcher]
    Dispatcher --> Worker1[FreeSWITCH worker]
    Dispatcher --> Worker2[FreeSWITCH worker]
    Worker1 --> AI[(AI agent)]
    Worker2 --> AI
```

Transcoding in the AI voice context means converting between codecs (PCMU 8 kHz mu-law to PCM16 16 kHz to Opus 48 kHz, etc.) along the audio path. Pure transcoding is cheap on modern CPUs - a few percent of a core per call - but the operations layered on top (resampling, jitter buffering, WebSocket framing, base64 encoding for Twilio Media Streams JSON envelopes, TLS for WSS) add up.
Where the transcoding happens matters more than people expect. Central placement (one big bridge cluster near your GPU model server) is operationally simple but expensive: every call's audio crosses the public internet twice (carrier-to-bridge, bridge-to-model). Edge placement (transcode near the carrier or in the carrier's region) keeps the heavy media-shuffling close to the source and only sends the model-friendly format over the long haul.
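Much of that layered work is just envelope handling. A minimal sketch of the per-frame work on a Twilio Media Streams leg, assuming the documented "media" event shape (the handler name here is ours, not Twilio's):

```python
import base64
import json

def handle_media_event(ws_text: str) -> bytes:
    """Parse one Media Streams WebSocket message and return raw PCMU bytes."""
    msg = json.loads(ws_text)              # JSON envelope parse
    if msg.get("event") != "media":
        return b""                         # start/mark/stop events carry no audio
    # payload is base64-encoded PCMU: 160 bytes = 20 ms at 8 kHz
    return base64.b64decode(msg["media"]["payload"])
```

Every one of those steps (JSON parse, dict lookup, base64 decode) runs 50 times per second per call, which is why they show up as distinct rows in the CPU table below.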
## Technical deep-dive
The CPU breakdown for one inbound call from PSTN to OpenAI Realtime:
| Step | CPU per call (% of one vCPU, typical) |
|---|---|
| RTP receive + de-jitter | 0.3% |
| PCMU mu-law decode to PCM16 | 0.1% |
| Resample 8 kHz to 16 kHz | 0.4% |
| Optional resample to 24 kHz for Realtime | 0.3% |
| WebSocket framing + TLS | 0.5% |
| JSON envelope (Twilio Media Streams) | 0.4% |
| Base64 encode | 0.2% |
| Total inbound | ~2.2% |
The reverse path (model to caller) roughly doubles it, and tool-call orchestration in your bridge code adds another 1-3% depending on tool density. A realistic budget is 5-8% of a vCPU per concurrent call for the bridge alone.
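That budget can be sanity-checked with a quick sum over the table; the orchestration figure below is an assumed mid-range value, and real fleets leave utilization headroom:

```python
# Back-of-envelope node capacity from the CPU table (percent of one vCPU per call)
inbound = 0.3 + 0.1 + 0.4 + 0.3 + 0.5 + 0.4 + 0.2  # ~2.2% for the inbound leg
outbound = inbound                                   # model-to-caller leg roughly doubles it
orchestration = 2.0                                  # assumed mid-range tool-call overhead (1-3%)
per_call = inbound + outbound + orchestration        # ~6.4% of a vCPU per call
node_calls = 24 * 100 / per_call                     # 24-vCPU node at 100% utilization
```

That comes out near 375 calls at full utilization on a 24-vCPU node; leave normal headroom and you land in the neighborhood of the ~200-call figure quoted for a Realtime-connected bridge.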
```python
# Use sample-rate libraries that vectorize, not naive loops
import audioop  # stdlib mu-law codec through Python 3.12 (removed in 3.13)

import numpy as np
from scipy import signal

def resample_pcmu_to_pcm16_16khz(mulaw_bytes: bytes) -> bytes:
    # mu-law decode to 8 kHz int16 PCM
    pcm8 = np.frombuffer(audioop.ulaw2lin(mulaw_bytes, 2), dtype=np.int16)
    pcm16 = signal.resample_poly(pcm8, 2, 1)  # polyphase upsample to 16 kHz
    return pcm16.astype(np.int16).tobytes()
```
A naive Python loop for resampling can be 50x slower than scipy.signal.resample_poly. For high-concurrency bridges, drop into Cython, Rust, or a compiled extension; libsamplerate is a good non-Python choice.
The economics: a $0.10/hour vCPU at 8% per call costs $0.008/hour per call, or about $0.00013/min. Per minute that is invisible next to OpenAI Realtime's $0.30/min audio cost. But the aggregate matters at scale: 1,000 concurrent calls is 80 vCPUs, which is $8/hour or roughly $5,760/month, on top of GPU time. Edge placement (carrier-region bridges) trims that bill and shaves 30-100 ms off every turn-take.
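The same arithmetic, written out (the vCPU price is an assumed on-demand rate and the month is taken as 720 hours):

```python
# Bridge-cost arithmetic for the paragraph above
vcpu_hour = 0.10                            # $/vCPU-hour (assumed on-demand rate)
per_call_frac = 0.08                        # 8% of a vCPU per concurrent call
per_call_hour = vcpu_hour * per_call_frac   # $0.008/hour per call
per_call_min = per_call_hour / 60           # ~$0.00013/min
concurrent = 1000
vcpus = concurrent * per_call_frac          # 80 vCPUs for the fleet
monthly = vcpus * vcpu_hour * 24 * 30       # ~$5,760/month
```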
## CallSphere implementation
CallSphere runs all six verticals on Twilio Programmable Voice. Twilio's Media Streams handles the carrier-edge transcoding (PCMU to PCM16 + WebSocket framing) on Twilio's edge before the audio reaches our FastAPI :8084 bridge for Healthcare AI. We pay for that as part of Twilio's per-minute price; in exchange we do not own the carrier-edge CPU. Our bridge does the final resample and the OpenAI Realtime WebSocket. Sales Calling AI's 5 concurrent outbound calls per tenant share the same bridge fleet, which autoscales on the count of active streams. After-Hours AI uses Twilio simultaneous call+SMS to on-call staff with a 120-second timeout; the SMS path costs nothing in transcoding. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2, $149/$499/$1,499 pricing, a 14-day trial, and a 22% affiliate program, the cost model is simple: Twilio absorbs the carrier-edge transcode, and we pay for the bridge-to-Realtime path.
## Implementation steps
- Measure your real per-call CPU on the bridge: pidstat or an eBPF profile during a load test. Do not trust guesses.
- Vectorize hot paths: resampling, mu-law decode, PCM scaling - all in compiled code.
- If you control SIP-trunk placement, pick a region close to your callers; cross-region transcoding adds latency and bandwidth cost.
- Co-locate bridges with the model server when geographically possible; OpenAI Realtime regions are limited so plan for transatlantic in EU.
- Avoid extra resampling steps; if your input is 8 kHz and the model accepts 16 kHz, skip 24 kHz unless required.
- Reuse WebSocket connections to OpenAI per worker process; do not open one per call if you can multiplex.
- Profile JSON parsing - rapidjson, orjson, msgspec all crush stdlib json by 5-10x.
- Plan for tail call duration: a 4-minute average with a 12-minute P95 means you provision for the long tail.
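The last point follows from Little's law (concurrency = arrival rate x duration); the arrival rate and durations below are assumed example numbers, not measurements:

```python
# Little's law sketch: provision against tail duration, not the average
arrivals_per_min = 20                                # assumed steady call arrival rate
avg_dur_min, p95_dur_min = 4, 12                     # average and P95 call durations
avg_concurrent = arrivals_per_min * avg_dur_min      # 80 concurrent in steady state
tail_concurrent = arrivals_per_min * p95_dur_min     # 240 if long calls pile up
vcpu_per_call = 0.08                                 # 8% of a vCPU per call
avg_vcpus = avg_concurrent * vcpu_per_call           # ~6.4 vCPUs
tail_vcpus = tail_concurrent * vcpu_per_call         # ~19.2 vCPUs
```

The 3x spread between the average-based and tail-based vCPU counts is exactly the headroom an autoscaler (or a static over-provision) has to cover.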
## FAQ
Should I keep transcoding on Twilio or run my own SBC? For most teams Twilio's edge transcode is the right tradeoff. Roll your own only at very high scale (10k+ concurrent) or for HIPAA/data-residency reasons.
Does GPU help transcoding? No. Codec encode/decode is integer-heavy; GPUs do not help. CPU SIMD does.
What about ARM CPUs? Graviton (ARM) often gives 30-40% better cost per call for this workload than x86; benchmark before committing.
Can I skip transcoding by streaming Opus all the way? On WebRTC paths, yes. On PSTN paths, no - the PSTN is PCMU.
How big is the CPU saving from edge placement? Marginal in CPU; significant in latency (30-100 ms saved) and bandwidth (less long-haul UDP).
## Sources
- DEV: Designing High-Performance RTP Media Infrastructure at Massive Scale
- HireVoIPDeveloper: SIP AI Voice Integration - Media and Orchestration Pitfalls
- OpenAI: Delivering low-latency voice AI at scale
Start a 14-day trial, see pricing for $149/$499/$1499 tiers, or contact us about scaled AI voice cost engineering.