By Sagar Shankaran, Founder of CallSphere
Transcoding RTP to WebSocket is more CPU-intensive than people expect. For AI voice in 2026, where you place the transcode (edge near the carrier vs central near the model) decides your cost-per-minute.
Key takeaways
A 24-vCPU bridge node can transcode roughly 800 simultaneous PCMU-to-L16 calls. That same node running OpenAI Realtime client connections plus WebSocket framing handles closer to 200. The 4x gap is where your AI voice unit economics live.
flowchart LR
UA[SIP UA] -- REGISTER --> Reg[Registrar]
UA -- INVITE --> Proxy[SIP Proxy]
Proxy --> Dispatcher[Kamailio dispatcher]
Dispatcher --> Worker1[FreeSWITCH worker]
Dispatcher --> Worker2[FreeSWITCH worker]
Worker1 --> AI[(AI agent)]
Worker2 --> AITranscoding in the AI voice context means converting between codecs (PCMU 8 kHz mu-law to PCM16 16 kHz to Opus 48 kHz, etc.) along the audio path. Pure transcoding is cheap on modern CPUs - a few percent of a core per call - but the operations layered on top (resampling, jitter buffering, WebSocket framing, base64 encoding for Twilio Media Streams JSON envelopes, TLS for WSS) add up.
Where the transcoding happens matters more than people expect. Central placement (one big bridge cluster near your GPU model server) is operationally simple but expensive: every call's audio crosses the public internet twice (carrier-to-bridge, bridge-to-model). Edge placement (transcode near the carrier or in the carrier's region) keeps the heavy media-shuffling close to the source and only sends the model-friendly format over the long haul.
The CPU breakdown for one inbound call from PSTN to OpenAI Realtime:
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
| Step | CPU % per core per call (typical) |
|---|---|
| RTP receive + de-jitter | 0.3% |
| PCMU mu-law decode to PCM16 | 0.1% |
| Resample 8 kHz to 16 kHz | 0.4% |
| Optional resample to 24 kHz for Realtime | 0.3% |
| WebSocket framing + TLS | 0.5% |
| JSON envelope (Twilio Media Streams) | 0.4% |
| Base64 encode | 0.2% |
| Total inbound | ~2.2% |
The reverse path (model to caller) doubles it. Plus tool-call orchestration in your bridge code adds another 1-3% depending on tool density. Realistic budget: 5-8% of a vCPU per concurrent call for the bridge alone.
# Use sample-rate libraries that vectorize, not naive loops
from scipy import signal
def resample_pcmu_to_pcm16_16khz(mulaw_bytes: bytes) -> bytes:
pcm8 = mulaw_to_linear(mulaw_bytes) # 8 kHz int16
pcm16 = signal.resample_poly(pcm8, 2, 1) # 16 kHz
return pcm16.astype("int16").tobytes()
A naive Python loop for resampling can be 50x slower than scipy.signal.resample_poly. For high-concurrency bridges, drop into Cython, Rust, or a compiled extension; libsamplerate is a good non-Python choice.
The economics: a $0.10/hour vCPU at 8% per call costs $0.0008/hour per call, or $0.000013/min. Per-minute that is invisible compared to OpenAI Realtime's $0.30/min audio cost. But the aggregate matters at scale: 1000 concurrent calls is 80 vCPUs which is $8/hour or $5,800/month, on top of GPU time. Edge placement (carrier-region bridges) cuts that and shaves 30-100 ms off every turn-take.
CallSphere runs all six verticals on Twilio Programmable Voice. Twilio's Media Streams handles the carrier-edge transcoding (PCMU to PCM16 + WebSocket framing) on their edge before reaching our FastAPI :8084 bridge for Healthcare AI. We pay for that as part of Twilio's per-minute price; in exchange we do not own the carrier-edge CPU. Our bridge does the final resample + OpenAI Realtime WebSocket. Sales Calling AI's 5 concurrent outbound calls per tenant share the same bridge fleet; the bridge is autoscaled based on the count of active streams. After-Hours AI uses Twilio simul call+SMS to on-call staff with a 120-second timeout where the SMS path costs nothing on transcoding. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2, $149/$499/$1499 pricing, 14-day trial, and 22% affiliate, the cost model is "Twilio absorbs the carrier-edge transcode, we pay for the bridge-to-Realtime path".
pidstat or eBPF profile during a load test. Do not trust guesses.Should I keep transcoding on Twilio or run my own SBC? For most teams Twilio's edge transcode is the right tradeoff. Roll your own only at very high scale (10k+ concurrent) or for HIPAA/data-residency reasons.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Does GPU help transcoding? No. Codec encode/decode is integer-heavy; GPUs do not help. CPU SIMD does.
What about ARM CPUs? Graviton (ARM) often gives 30-40% better cost per call for this workload than x86; benchmark before committing.
Can I skip transcoding by streaming Opus all the way? On WebRTC paths, yes. On PSTN paths, no - the PSTN is PCMU.
How big is the CPU saving from edge placement? Marginal in CPU; significant in latency (30-100 ms saved) and bandwidth (less long-haul UDP).
Start a 14-day trial, see pricing for $149/$499/$1499 tiers, or contact us about scaled AI voice cost engineering.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
A founder's guide to the female voice generator landscape: AI female voices, Japanese voices, robot voices, and how CallSphere ships 57+ voices live.
Every 100ms of latency costs you. So does every cent per minute. Here is the decision matrix we use across 6 verticals to pick where to spend and where to save on voice AI infrastructure.
MOS 4.3+ is the band where AI voice feels human. Drop below 3.6 and conversations break. Here is how to measure, improve, and alert on MOS in production AI voice using G.711, Opus, and the underlying packet loss / jitter / latency math.
Bedrock Claude + Transcribe streaming + Polly Neural runs $0.06–$0.10 per minute on paper. The honest math reveals where the AWS-native stack beats and where it loses to OpenAI Realtime.
Embeddings, vector storage, graph nodes, and recall API calls all add up faster than expected. The cost model for serving 100k users with agent memory at scale.
Picking an LLM is choosing two of three: latency, quality, cost. The 2026 framework for explicit trade-offs and how to negotiate them.
© 2026 CallSphere LLC. All rights reserved.