
RTP Transcoding Cost for AI Voice in 2026: Why Edge Placement Beats Central GPU

Transcoding RTP to WebSocket is more CPU-intensive than people expect. For AI voice in 2026, where you place the transcode (edge near the carrier vs central near the model) decides your cost-per-minute.

A 24-vCPU bridge node can transcode roughly 800 simultaneous PCMU-to-L16 calls. That same node running OpenAI Realtime client connections plus WebSocket framing handles closer to 200. The 4x gap is where your AI voice unit economics live.

Background

```mermaid
flowchart LR
  UA[SIP UA] -- REGISTER --> Reg[Registrar]
  UA -- INVITE --> Proxy[SIP Proxy]
  Proxy --> Dispatcher[Kamailio dispatcher]
  Dispatcher --> Worker1[FreeSWITCH worker]
  Dispatcher --> Worker2[FreeSWITCH worker]
  Worker1 --> AI[(AI agent)]
  Worker2 --> AI
```

*CallSphere reference architecture*

Transcoding in the AI voice context means converting between codecs (PCMU 8 kHz mu-law to PCM16 16 kHz to Opus 48 kHz, etc.) along the audio path. The codec conversion itself is cheap on modern CPUs - a fraction of a percent of a core per call - but the operations layered on top (resampling, jitter buffering, WebSocket framing, base64 encoding for Twilio Media Streams JSON envelopes, TLS for WSS) add up.
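Those layered envelope operations are easy to see in code. A minimal sketch of peeling a Twilio Media Streams `media` frame back to raw mu-law bytes - the frame below is fabricated sample data, and `extract_pcmu` is a hypothetical helper, though the field names follow Twilio's documented media event shape:

```python
import base64
import json

# Fabricated inbound frame: 160 bytes of PCMU is 20 ms of audio at 8 kHz
frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode()},
})

def extract_pcmu(raw: str) -> bytes:
    """Peel the JSON envelope and base64 layer to recover raw mu-law bytes."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return b""  # ignore non-media events (start, stop, mark, ...)
    return base64.b64decode(msg["media"]["payload"])

pcmu = extract_pcmu(frame)
```

Every inbound packet pays for one JSON parse and one base64 decode before any audio work begins, which is why those two rows show up in the CPU table at all.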

Where the transcoding happens matters more than people expect. Central placement (one big bridge cluster near your GPU model server) is operationally simple but expensive: every call's audio crosses the public internet twice (carrier-to-bridge, bridge-to-model). Edge placement (transcode near the carrier or in the carrier's region) keeps the heavy media-shuffling close to the source and only sends the model-friendly format over the long haul.

Technical deep-dive

The CPU breakdown for one inbound call from PSTN to OpenAI Realtime:

| Step | CPU % per core per call (typical) |
| --- | --- |
| RTP receive + de-jitter | 0.3% |
| PCMU mu-law decode to PCM16 | 0.1% |
| Resample 8 kHz to 16 kHz | 0.4% |
| Optional resample to 24 kHz for Realtime | 0.3% |
| WebSocket framing + TLS | 0.5% |
| JSON envelope (Twilio Media Streams) | 0.4% |
| Base64 encode | 0.2% |
| **Total inbound** | **~2.2%** |

The reverse path (model to caller) roughly doubles it, and tool-call orchestration in your bridge code adds another 1-3% depending on tool density. Realistic budget: 5-8% of a vCPU per concurrent call for the bridge alone.
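The budget above as arithmetic - the per-step numbers are copied from the table, and the 1.5% tool overhead is a mid-range assumption from the 1-3% band:

```python
# Per-step inbound cost, % of a vCPU per concurrent call (from the table above)
inbound = {
    "rtp_receive_dejitter": 0.3,
    "mulaw_decode": 0.1,
    "resample_8k_to_16k": 0.4,
    "resample_to_24k": 0.3,
    "ws_framing_tls": 0.5,
    "json_envelope": 0.4,
    "base64": 0.2,
}
inbound_total = sum(inbound.values())   # ~2.2%
outbound_total = inbound_total          # reverse path roughly doubles it
tool_overhead = 1.5                     # assumed mid-range of the 1-3% band
per_call = inbound_total + outbound_total + tool_overhead
print(f"{per_call:.1f}% of a vCPU per concurrent call")
```

The result lands inside the 5-8% band; real deployments drift toward the top of it as tool density grows.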

```python
# Use sample-rate libraries that vectorize, not naive loops
import audioop  # stdlib G.711 codec; removed in Python 3.13, use a C extension there

import numpy as np
from scipy import signal

def resample_pcmu_to_pcm16_16khz(mulaw_bytes: bytes) -> bytes:
    # Decode mu-law to 16-bit linear PCM, still at 8 kHz
    pcm8 = np.frombuffer(audioop.ulaw2lin(mulaw_bytes, 2), dtype=np.int16)
    # Polyphase resample 8 kHz -> 16 kHz (up=2, down=1)
    pcm16 = signal.resample_poly(pcm8, 2, 1)
    return pcm16.astype("int16").tobytes()
```

A naive Python loop for resampling can be 50x slower than scipy.signal.resample_poly. For high-concurrency bridges, drop into Cython, Rust, or a compiled extension; libsamplerate is a good non-Python choice.

The economics: a $0.10/hour vCPU at 8% per call costs $0.008/hour per call, or about $0.00013/min. Per-minute that is invisible next to OpenAI Realtime's $0.30/min audio cost. But the aggregate matters at scale: 1000 concurrent calls is 80 vCPUs, which is $8/hour or roughly $5,800/month, on top of GPU time. Edge placement (carrier-region bridges) cuts that and shaves 30-100 ms off every turn-take.
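The same arithmetic as a sketch - the vCPU price, the 8% per-call figure, and the 1000-call fleet size are the assumptions from the text:

```python
# Back-of-envelope bridge cost at scale
vcpu_price_hr = 0.10      # assumed on-demand vCPU price, $/hour
cpu_per_call = 0.08       # 8% of a vCPU per concurrent call
concurrent_calls = 1000

vcpus = concurrent_calls * cpu_per_call           # fleet size in vCPUs
hourly = vcpus * vcpu_price_hr                    # $/hour for the bridge fleet
monthly = hourly * 24 * 30                        # $/month, 720-hour month
per_call_min = vcpu_price_hr * cpu_per_call / 60  # bridge $ per call-minute
```

Note how the per-call-minute figure rounds to noise while the monthly fleet bill does not - that asymmetry is the whole argument for caring about bridge placement only once concurrency is high.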

CallSphere implementation

CallSphere runs all six verticals on Twilio Programmable Voice. Twilio Media Streams handles the carrier-edge transcoding (PCMU to PCM16 plus WebSocket framing) on Twilio's edge before audio reaches our FastAPI bridge on :8084 for Healthcare AI. That transcode is bundled into Twilio's per-minute price; in exchange, we do not own the carrier-edge CPU. Our bridge performs the final resample and the OpenAI Realtime WebSocket connection. Sales Calling AI's five concurrent outbound calls per tenant share the same bridge fleet, autoscaled on the count of active streams. After-Hours AI uses Twilio simultaneous call + SMS to on-call staff with a 120-second timeout; the SMS path costs nothing in transcoding. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2, $149/$499/$1499 pricing, a 14-day trial, and a 22% affiliate program, the cost model is simple: Twilio absorbs the carrier-edge transcode, and we pay for the bridge-to-Realtime path.

Implementation steps

  1. Measure your real per-call CPU on the bridge: pidstat or eBPF profile during a load test. Do not trust guesses.
  2. Vectorize hot paths: resampling, mu-law decode, PCM scale - all in compiled code.
  3. If you control SIP-trunk placement, pick a region close to your callers; cross-region transcoding adds latency and bandwidth cost.
  4. Co-locate bridges with the model server when geographically possible; OpenAI Realtime regions are limited so plan for transatlantic in EU.
  5. Avoid extra resampling steps; if your input is 8 kHz and the model accepts 16 kHz, skip 24 kHz unless required.
  6. Reuse WebSocket connections to OpenAI per worker process; do not open one per call if you can multiplex.
  7. Profile JSON parsing - rapidjson, orjson, msgspec all crush stdlib json by 5-10x.
  8. Plan for 90th percentile call duration: a 4-minute average with 12-minute P95 means you provision for the long tail.
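Step 8 in numbers - a rough sketch where the call-arrival rate (600 calls/hour) is an illustrative assumption, and the durations come from the example in the step:

```python
# Provision for the long tail, not the average
avg_call_min = 4       # average call duration from step 8
p95_call_min = 12      # P95 call duration from step 8
calls_per_hour = 600   # assumed arrival rate, illustrative only

# Little's law: average concurrency = arrival rate x average duration
avg_concurrent = calls_per_hour * avg_call_min / 60
# Crude tail sizing: capacity as if every call ran the P95 duration
tail_concurrent = calls_per_hour * p95_call_min / 60

cpu_per_call = 0.08                          # 8% of a vCPU per call
vcpus_avg = avg_concurrent * cpu_per_call    # what averages suggest
vcpus_tail = tail_concurrent * cpu_per_call  # what the tail demands
```

Sizing on the tail triples the fleet in this example; in practice you would land between the two figures with autoscaling, but the average alone will underprovision.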

FAQ

Should I keep transcoding on Twilio or run my own SBC? For most teams Twilio's edge transcode is the right tradeoff. Roll your own only at very high scale (10k+ concurrent) or for HIPAA/data-residency reasons.


Does GPU help transcoding? No. Codec encode/decode is integer-heavy; GPUs do not help. CPU SIMD does.

What about ARM CPUs? Graviton (ARM) often gives 30-40% better cost per call for this workload than x86; benchmark before committing.

Can I skip transcoding by streaming Opus all the way? On WebRTC paths, yes. On PSTN paths, no - the PSTN is PCMU.

How big is the CPU saving from edge placement? Marginal in CPU; significant in latency (30-100 ms saved) and bandwidth (less long-haul UDP).


Start a 14-day trial, see pricing for $149/$499/$1499 tiers, or contact us about scaled AI voice cost engineering.
