AI Infrastructure

Jitter Buffer Tuning for AI Agent Latency in 2026: Less Buffer, More Brain

Default jitter buffers (60-200 ms) are tuned for human ears, not LLM turn-taking. Here is how to retune them for AI voice agents without melting your audio quality on cell phone calls.

A 200 ms jitter buffer is invisible to a human listener and devastating to an AI voice agent. It is one of the easiest end-to-end latency improvements you can ship, and one of the most overlooked.

Background

flowchart LR
  UA[SIP UA] -- REGISTER --> Reg[Registrar]
  UA -- INVITE --> Proxy[SIP Proxy]
  Proxy --> Dispatcher[Kamailio dispatcher]
  Dispatcher --> Worker1[FreeSWITCH worker]
  Dispatcher --> Worker2[FreeSWITCH worker]
  Worker1 --> AI[(AI agent)]
  Worker2 --> AI
CallSphere reference architecture

A jitter buffer holds incoming RTP packets briefly to absorb network-induced timing variation before play-out. Adaptive jitter buffers (AJB) adjust their depth between configured min and max based on observed jitter. Default values across SIP stacks (Asterisk, FreeSWITCH, PJSIP, Twilio) cluster around 60-200 ms minimum and 200-500 ms maximum. Those defaults are tuned for one-way listening: the buffer can grow to absorb a 300 ms reordering event without dropouts, and the human ear forgives the latency.
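Adaptive buffers typically size themselves from the RFC 3550 interarrival jitter estimate. Here is a minimal Python sketch of that estimator; the sample arrival times are illustrative, not measured data:

```python
def rfc3550_jitter(arrivals_ms, rtp_ts_ms):
    """Running interarrival jitter estimate per RFC 3550, in ms.

    arrivals_ms: wall-clock arrival time of each RTP packet
    rtp_ts_ms:   RTP timestamp of each packet, converted to ms
    """
    j = 0.0
    for i in range(1, len(arrivals_ms)):
        # Transit-time difference between consecutive packets
        d = abs((arrivals_ms[i] - arrivals_ms[i - 1]) -
                (rtp_ts_ms[i] - rtp_ts_ms[i - 1]))
        # Exponential smoothing per the RFC: J += (|D| - J) / 16
        j += (d - j) / 16.0
    return j

# 20 ms packetization with one packet delayed 15 ms by a burst
arrivals = [0, 20, 55, 60, 80, 100]
stamps   = [0, 20, 40, 60, 80, 100]
print(rfc3550_jitter(arrivals, stamps))
```

An AJB sitting well above this estimate is pure added latency; the retuning below is about closing that gap.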

For AI voice that math is wrong. The AI's loop is: voice activity detection -> end-of-turn wait -> ASR partial -> ASR final -> LLM token stream -> TTS first byte -> RTP send. Every millisecond of jitter buffer on inbound RTP becomes a millisecond of additional turn-taking lag. The user perceives the delay between finishing their sentence and the AI starting its response. Studies put the "feels natural" threshold for human conversation at around 800 ms; modern AI voice ships in the 600-1500 ms range. Cutting 100 ms off the jitter buffer is a meaningful share of that budget.
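To see how much of the budget the buffer eats, the arithmetic can be sketched as below; the per-stage numbers are illustrative assumptions, not measurements:

```python
# Illustrative turn-taking latency budget (ms); stage values are assumptions
budget = {
    "inbound jitter buffer": 200,   # default AJB depth
    "end-of-turn wait (VAD)": 300,
    "ASR final": 150,
    "LLM first token": 250,
    "TTS first byte": 150,
    "network + RTP send": 50,
}
total = sum(budget.values())
print(total)                                    # total turn-taking lag, ms
print(budget["inbound jitter buffer"] / total)  # buffer's share of the budget

# Retuning the buffer to 60 ms leaves every other stage untouched
tuned = dict(budget, **{"inbound jitter buffer": 60})
print(total - sum(tuned.values()))              # ms saved by the retune
```

With these assumed numbers the default buffer is roughly a fifth of the whole turn, which is why it is such cheap latency to reclaim.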

Technical deep-dive

Three knobs matter: minimum depth, maximum depth, and adaptation aggressiveness.

; Asterisk rtp.conf for AI voice agent inbound
[general]
rtpstart=10000
rtpend=20000
rtpchecksums=no
strictrtp=yes
icesupport=no

; Asterisk sip.conf [general] -- the jb* options live in the channel
; driver config, not rtp.conf
[general]
jbenable=yes
jbforce=yes
jbmaxsize=120
jbresyncthreshold=500
jbimpl=adaptive
jbtargetextra=20

<!-- FreeSWITCH var per-call for AI bridge -->
<action application="set" data="jitterbuffer_msec=20:80:60"/>
<action application="set" data="rtp_jitter_buffer_plc=true"/>

The FreeSWITCH triplet sets minimum 20 ms, maximum 80 ms, and target 60 ms. That is aggressive but right for a low-loss path. The PLC flag turns on packet-loss concealment so a single dropped packet does not punch a hole through to the ASR engine.

For PJSIP-based Twilio Voice SDK clients, the relevant settings are audioJitterBufferMaxPackets and audioJitterBufferFastAccelerate in the createPeerConnection options. WebRTC defaults are roughly 50-200 ms depending on the implementation; Chrome adapted faster than Firefox in 2026 measurements.

The risk is over-cutting: a buffer too shallow drops packets on burst jitter. Symptom: choppy ASR, missing words at the start of utterances, ASR engines transcribing "hello" as "ello".

CallSphere implementation

CallSphere runs Twilio Programmable Voice across all six verticals. For Healthcare AI on FastAPI :8084 we accept Twilio Media Streams over WebSocket; Twilio's send-side jitter is sub-30 ms median and we run no additional jitter buffer between the WebSocket and OpenAI Realtime, instead relying on Realtime's own server VAD turn detection. For Sales Calling AI's 5 concurrent outbound calls per tenant we tune our internal bridge buffer to 40-80 ms because cell phone destinations have higher native jitter. After-Hours AI uses simultaneous call+SMS to on-call staff with a 120-second timeout, where reliability beats latency optimization. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2, $149/$499/$1499 pricing, and a 14-day trial, the buffer policy is documented per vertical with measured turn-taking latency on each.

Implementation steps

  1. Establish your baseline: log RTP arrival timestamps for 1000 calls and compute the 95th percentile inter-packet delta minus the expected packetization interval.
  2. If P95 jitter is under 30 ms, you are over-buffered with defaults; cut min to 20 ms and max to 80 ms.
  3. If P95 jitter is over 80 ms, you are likely on a noisy mobile path; keep max at 120-150 ms but reduce min to 40 ms.
  4. Turn on PLC unconditionally. A reconstructed packet beats a hole.
  5. Measure end-to-end turn-taking latency before and after with a stopwatch script: send "hello", wait, time first AI audio byte.
  6. Add jitter and loss alerts to your CDR pipeline; sustained P95 above 100 ms means a peering or SBC problem.
  7. For OpenAI Realtime, prefer server VAD turn detection over client; the model has its own end-of-turn classifier that handles small jitter implicitly.
  8. Re-run quarterly; carrier and SIP peering paths shift.
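Steps 1-3 above can be sketched in Python; the packetization interval, synthetic arrival data, and threshold-to-depth mapping follow the playbook above but are assumptions to adapt to your own logs:

```python
PTIME_MS = 20  # expected packetization interval (20 ms is typical for G.711/Opus)

def p95_jitter(arrival_times_ms):
    """P95 of |inter-packet delta - expected ptime| over one call's RTP arrivals."""
    deltas = [abs((b - a) - PTIME_MS)
              for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    deltas.sort()
    return deltas[int(0.95 * (len(deltas) - 1))]

def recommend_buffer(p95_ms):
    """Map measured P95 jitter to (min, max) buffer depths per the steps above."""
    if p95_ms < 30:
        return (20, 80)     # clean path: cut hard
    if p95_ms > 80:
        return (40, 150)    # noisy mobile path: keep headroom
    return (30, 120)        # in between: moderate depths

# Synthetic call: ideal 20 ms cadence with every 10th packet 25 ms late
arrivals = [i * 20 for i in range(100)]
for k in range(5, 100, 10):
    arrivals[k] += 25
print(p95_jitter(arrivals), recommend_buffer(p95_jitter(arrivals)))
```

In production you would feed this from RTP capture timestamps across the 1000-call baseline rather than a single synthetic call.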

FAQ

Why not just remove the jitter buffer entirely? Even a 5 ms buffer is doing useful work because RTP arrival is bursty. Zero buffer creates audio glitches the ASR engine hears as phantom syllables.


Does Twilio expose jitter buffer config on Programmable Voice? Not on the trunk side, but on Voice SDK clients you can tune it in the WebRTC PeerConnection. On Media Streams the server-side buffer is short and you can effectively bypass it.

Will this break for callers on slow networks? You will see more PLC events and occasional ASR misreads on bad networks. Set a fallback buffer depth that triggers on detected loss above 5%.
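That fallback can be sketched as a simple hysteresis controller; the class and method names are hypothetical, and the 5% threshold is the figure from the answer above:

```python
class BufferFallback:
    """Widen the jitter buffer when packet loss crosses a threshold,
    and narrow it again after several consecutive clean windows."""

    def __init__(self, tight=(20, 80), safe=(60, 150),
                 loss_threshold=0.05, recover_windows=3):
        self.tight, self.safe = tight, safe
        self.loss_threshold = loss_threshold
        self.recover_windows = recover_windows
        self.depth = tight
        self._clean = 0

    def observe(self, expected_pkts, received_pkts):
        """Call once per measurement window; returns (min_ms, max_ms) to apply."""
        loss = 1.0 - received_pkts / expected_pkts
        if loss > self.loss_threshold:
            self.depth, self._clean = self.safe, 0
        else:
            self._clean += 1
            if self._clean >= self.recover_windows:
                self.depth = self.tight
        return self.depth

fb = BufferFallback()
print(fb.observe(500, 460))  # 8% loss -> widen to the safe depth
print(fb.observe(500, 500))  # clean window 1: hold at safe
print(fb.observe(500, 500))  # clean window 2: hold at safe
print(fb.observe(500, 500))  # clean window 3 -> back to tight
```

The hold-down before narrowing prevents the buffer from flapping when loss arrives in bursts, which is exactly the pattern bad mobile paths produce.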

How does this interact with Opus FEC? Opus inband FEC reconstructs the previous frame so a small jitter buffer plus FEC is more robust than a large jitter buffer with G.722.

What about the outbound side? Outbound RTP is sent on a fixed cadence; jitter buffer only matters on the receiver. The AI as receiver is the side to optimize.


Start a 14-day trial and feel the latency, see pricing for $149/$499/$1499 tiers, or contact us about jitter buffer tuning for high-stakes voice AI.

