---
title: "RTP Transcoding Cost for AI Voice in 2026: Why Edge Placement Beats Central GPU"
description: "Transcoding RTP to WebSocket is more CPU-intensive than people expect. For AI voice in 2026, where you place the transcode (edge near the carrier vs central near the model) decides your cost-per-minute."
canonical: https://callsphere.ai/blog/vw3d-rtp-transcoding-cost-ai-voice-2026
category: "AI Infrastructure"
tags: ["RTP", "Transcoding", "Edge Compute", "AI Voice", "Cost"]
author: "CallSphere Team"
published: 2026-04-12T00:00:00.000Z
updated: 2026-05-07T09:59:38.215Z
---

# RTP Transcoding Cost for AI Voice in 2026: Why Edge Placement Beats Central GPU

> Transcoding RTP to WebSocket is more CPU-intensive than people expect. For AI voice in 2026, where you place the transcode (edge near the carrier vs central near the model) decides your cost-per-minute.

> A 24-vCPU bridge node can transcode roughly 800 simultaneous PCMU-to-L16 calls. That same node running OpenAI Realtime client connections plus WebSocket framing handles closer to 200. The 4x gap is where your AI voice unit economics live.

## Background

```mermaid
flowchart LR
  UA[SIP UA] -- REGISTER --> Reg[Registrar]
  UA -- INVITE --> Proxy[SIP Proxy]
  Proxy --> Dispatcher[Kamailio dispatcher]
  Dispatcher --> Worker1[FreeSWITCH worker]
  Dispatcher --> Worker2[FreeSWITCH worker]
  Worker1 --> AI[(AI agent)]
  Worker2 --> AI
```

*CallSphere reference architecture*

Transcoding in the AI voice context means converting between codecs (PCMU 8 kHz mu-law to PCM16 16 kHz to Opus 48 kHz, etc.) along the audio path. Pure transcoding is cheap on modern CPUs - a few percent of a core per call - but the operations layered on top (resampling, jitter buffering, WebSocket framing, base64 encoding for Twilio Media Streams JSON envelopes, TLS for WSS) add up.
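To make the framing overhead concrete, here is a minimal sketch of unwrapping one Twilio Media Streams frame, assuming the standard `media` event shape with a base64 payload:

```python
import base64
import json

def decode_media_frame(ws_message: str) -> bytes:
    """Extract raw PCMU bytes from a Twilio Media Streams 'media' event."""
    envelope = json.loads(ws_message)
    if envelope.get("event") != "media":
        return b""
    # payload is base64-encoded 8 kHz mu-law audio; one 20 ms frame is 160 bytes
    return base64.b64decode(envelope["media"]["payload"])

# 160 bytes of mu-law silence (0xFF), wrapped the way it arrives over the stream
frame = json.dumps(
    {"event": "media", "media": {"payload": base64.b64encode(b"\xff" * 160).decode()}}
)
pcmu = decode_media_frame(frame)
```

Every 20 ms frame pays a JSON parse plus a base64 decode - at 50 frames per second per call, per direction, that is where the envelope and base64 rows in the table below come from.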

Where the transcoding happens matters more than people expect. Central placement (one big bridge cluster near your GPU model server) is operationally simple but expensive: every call's audio crosses the public internet twice (carrier-to-bridge, bridge-to-model). Edge placement (transcode near the carrier or in the carrier's region) keeps the heavy media-shuffling close to the source and only sends the model-friendly format over the long haul.

## Technical deep-dive

The CPU breakdown for one inbound call from PSTN to OpenAI Realtime:

| Step | Typical CPU per call (% of one core) |
| --- | --- |
| RTP receive + de-jitter | 0.3% |
| PCMU mu-law decode to PCM16 | 0.1% |
| Resample 8 kHz to 16 kHz | 0.4% |
| Optional resample to 24 kHz for Realtime | 0.3% |
| WebSocket framing + TLS | 0.5% |
| JSON envelope (Twilio Media Streams) | 0.4% |
| Base64 encode | 0.2% |
| Total inbound | ~2.2% |

The reverse path (model to caller) roughly doubles it, and tool-call orchestration in your bridge code adds another 1-3% depending on tool density. Realistic budget: 5-8% of a vCPU per concurrent call for the bridge alone.
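The budget can be sanity-checked by summing the table rows (figures copied from the table above, not measured here):

```python
# Per-call CPU shares from the table, in % of one core (inbound leg only)
INBOUND_CPU = {
    "rtp_receive_dejitter": 0.3,
    "pcmu_decode": 0.1,
    "resample_8k_to_16k": 0.4,
    "resample_to_24k": 0.3,
    "websocket_framing_tls": 0.5,
    "json_envelope": 0.4,
    "base64_encode": 0.2,
}

inbound = sum(INBOUND_CPU.values())  # ~2.2% of a core
round_trip = 2 * inbound             # outbound leg roughly doubles it: ~4.4%
with_tools = round_trip + 3.0        # worst-case tool orchestration lands near the top of the 5-8% band
```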

```python
# Use sample-rate libraries that vectorize, not naive loops
import numpy as np
from scipy import signal

def mulaw_to_linear(mulaw_bytes: bytes) -> np.ndarray:
    """Vectorized ITU-T G.711 mu-law decode to int16 PCM."""
    u = ~np.frombuffer(mulaw_bytes, dtype=np.uint8)  # G.711 bytes are stored complemented
    t = (((u & 0x0F).astype(np.int32) << 3) + 0x84) << ((u >> 4) & 0x07)
    return np.where(u & 0x80, 0x84 - t, t - 0x84).astype(np.int16)

def resample_pcmu_to_pcm16_16khz(mulaw_bytes: bytes) -> bytes:
    pcm8 = mulaw_to_linear(mulaw_bytes)  # 8 kHz int16
    pcm16 = signal.resample_poly(pcm8, 2, 1)  # 16 kHz
    return pcm16.astype("int16").tobytes()
```

A naive Python loop for resampling can be 50x slower than `scipy.signal.resample_poly`. For high-concurrency bridges, drop into Cython, Rust, or a compiled extension; libsamplerate is a good non-Python choice.

The economics: a $0.10/hour vCPU at 8% per call costs $0.008/hour per call, or about $0.00013/min. Per-minute that is invisible next to OpenAI Realtime's $0.30/min audio cost. But the *aggregate* matters at scale: 1000 concurrent calls is 80 vCPUs, which is $8/hour or ~$5,800/month, on top of GPU time. Edge placement (carrier-region bridges) cuts that and shaves 30-100 ms off every turn-take.
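The same arithmetic as a small sketch you can feed your own vCPU price and measured CPU fraction into:

```python
def bridge_cost(vcpu_price_per_hour: float, cpu_fraction_per_call: float,
                concurrent_calls: int, hours_per_month: float = 730) -> dict:
    """Bridge CPU cost model: per-call, per-minute, and fleet-level figures."""
    per_call_hour = vcpu_price_per_hour * cpu_fraction_per_call
    return {
        "per_call_per_hour": per_call_hour,                 # $/hour for one concurrent call
        "per_call_per_min": per_call_hour / 60,             # $/minute of audio
        "vcpus": concurrent_calls * cpu_fraction_per_call,  # fleet sizing
        "fleet_per_month": concurrent_calls * per_call_hour * hours_per_month,
    }

# $0.10/hr vCPU, 8% of a vCPU per call, 1000 concurrent calls
costs = bridge_cost(0.10, 0.08, concurrent_calls=1000)
```

With these inputs the model lands on 80 vCPUs at $8/hour - tiny per call, very visible on the monthly invoice.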

## CallSphere implementation

CallSphere runs all six verticals on Twilio Programmable Voice. Twilio's Media Streams handles the carrier-edge transcoding (PCMU to PCM16 plus WebSocket framing) on Twilio's edge before the audio reaches our FastAPI :8084 bridge for Healthcare AI. We pay for that as part of Twilio's per-minute price; in exchange, we do not own the carrier-edge CPU. Our bridge handles the final resample and the OpenAI Realtime WebSocket.

Sales Calling AI's 5 concurrent outbound calls per tenant share the same bridge fleet, which is autoscaled on the count of active streams. After-Hours AI uses Twilio simultaneous call + SMS to on-call staff with a 120-second timeout; the SMS path costs nothing in transcoding. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 compliance, $149/$499/$1499 pricing, a 14-day trial, and a 22% affiliate program, the cost model is simple: Twilio absorbs the carrier-edge transcode, and we pay for the bridge-to-Realtime path.
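A hypothetical sketch of that scaling policy (illustrative numbers only - `streams_per_node` follows the ~200-call figure from the intro, not our actual config):

```python
import math

def desired_bridge_replicas(active_streams: int,
                            streams_per_node: int = 200,
                            min_replicas: int = 2) -> int:
    """Scale the bridge fleet on active Media Streams count, keeping a warm minimum."""
    return max(min_replicas, math.ceil(active_streams / streams_per_node))
```

At 450 active streams this asks for 3 nodes; an idle fleet stays at the 2-node floor so new calls never wait for a cold start.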

## Implementation steps

1. Measure your real per-call CPU on the bridge: `pidstat` or eBPF profile during a load test. Do not trust guesses.
2. Vectorize hot paths: resampling, mu-law decode, PCM scale - all in compiled code.
3. If you control SIP-trunk placement, pick a region close to your callers; cross-region transcoding adds latency and bandwidth cost.
4. Co-locate bridges with the model server when geographically possible; OpenAI Realtime regions are limited so plan for transatlantic in EU.
5. Avoid extra resampling steps; if your input is 8 kHz and the model accepts 16 kHz, skip 24 kHz unless required.
6. Reuse WebSocket connections to OpenAI per worker process; do not open one per call if you can multiplex.
7. Profile JSON parsing - rapidjson, orjson, msgspec all crush stdlib json by 5-10x.
8. Plan for 90th percentile call duration: a 4-minute average with 12-minute P95 means you provision for the long tail.
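Step 8 as arithmetic - a sketch using assumed numbers (8% of a vCPU per call, 30% headroom), not a sizing recommendation:

```python
import math

def vcpus_for_peak(peak_concurrent_calls: int,
                   cpu_fraction_per_call: float = 0.08,
                   headroom: float = 0.30) -> int:
    """Provision for peak concurrency plus headroom; P95 call duration (not the
    average) determines how long each call holds its CPU share."""
    # round() guards against float artifacts before taking the ceiling
    return math.ceil(round(peak_concurrent_calls * cpu_fraction_per_call * (1 + headroom), 6))
```

For a 1000-call peak this yields 104 vCPUs: the 80-vCPU baseline plus a buffer so the 12-minute P95 tail does not saturate the fleet.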

## FAQ

**Should I keep transcoding on Twilio or run my own SBC?**
For most teams Twilio's edge transcode is the right tradeoff. Roll your own only at very high scale (10k+ concurrent) or for HIPAA/data-residency reasons.

**Does GPU help transcoding?**
No. Codec encode/decode is integer-heavy; GPUs do not help. CPU SIMD does.

**What about ARM CPUs?**
Graviton (ARM) often gives 30-40% better cost per call for this workload than x86; benchmark before committing.

**Can I skip transcoding by streaming Opus all the way?**
On WebRTC paths, yes. On PSTN paths, no - the PSTN speaks G.711 (PCMU/PCMA).

**How big is the CPU saving from edge placement?**
Marginal in CPU; significant in latency (30-100 ms saved) and bandwidth (less long-haul UDP).

## Sources

- [DEV: Designing High-Performance RTP Media Infrastructure at Massive Scale](https://dev.to/ecosmob_technologies/designing-high-performance-rtp-media-infrastructure-at-massive-scale-36mj)
- [HireVoIPDeveloper: SIP AI Voice Integration - Media and Orchestration Pitfalls](https://www.hirevoipdeveloper.com/blog/ai-voice-integration-in-sip-ims-architectures/)
- [OpenAI: Delivering low-latency voice AI at scale](https://openai.com/index/delivering-low-latency-voice-ai-at-scale/)

Start a [14-day trial](/trial), see [pricing](/pricing) for $149/$499/$1499 tiers, or [contact us](/contact) about scaled AI voice cost engineering.

---

Source: https://callsphere.ai/blog/vw3d-rtp-transcoding-cost-ai-voice-2026
