Jitter Buffer Tuning for AI Agent Latency in 2026: Less Buffer, More Brain
Default jitter buffers (60-200 ms) are tuned for human ears, not LLM turn-taking. Here is how to retune them for AI voice agents without melting your audio quality on cell phone calls.
A 200 ms jitter buffer is invisible to a human listener and devastating to an AI voice agent. It is one of the easiest end-to-end latency improvements you can ship, and one of the most overlooked.
Background
flowchart LR
UA[SIP UA] -- REGISTER --> Reg[Registrar]
UA -- INVITE --> Proxy[SIP Proxy]
Proxy --> Dispatcher[Kamailio dispatcher]
Dispatcher --> Worker1[FreeSWITCH worker]
Dispatcher --> Worker2[FreeSWITCH worker]
Worker1 --> AI[(AI agent)]
Worker2 --> AI

A jitter buffer holds incoming RTP packets briefly to absorb network-induced timing variation before play-out. Adaptive jitter buffers (AJB) adjust their depth between configured min and max based on observed jitter. Default values across SIP stacks (Asterisk, FreeSWITCH, PJSIP, Twilio) cluster around 60-200 ms minimum and 200-500 ms maximum. Those defaults are tuned for one-way listening: the buffer can grow to absorb a 300 ms reordering event without dropouts, and the human ear forgives the latency.
For AI voice that math is wrong. The AI's loop is: voice activity detection -> end-of-turn wait -> ASR partial -> ASR final -> LLM token stream -> TTS first byte -> RTP send. Every millisecond of jitter buffer on inbound RTP becomes a millisecond of additional turn-taking lag. The user perceives the delay between finishing their sentence and the AI starting its response. Studies put the human "feels natural" threshold at around 800 ms; modern AI voice ships in the 600-1500 ms range. Cutting 100 ms off the jitter buffer is a meaningful share of that budget.
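To make the budget concrete, here is a back-of-the-envelope sketch in Python. All stage values are illustrative assumptions, not measurements from any particular stack; the point is the share a default-depth buffer claims of the total turn lag.

```python
# Illustrative turn-taking latency budget (values are assumed examples).
budget_ms = {
    "inbound_jitter_buffer": 200,  # default adaptive-buffer depth
    "vad_end_of_turn_wait": 300,
    "asr_final": 150,
    "llm_first_token": 250,
    "tts_first_byte": 120,
}

total = sum(budget_ms.values())
share = budget_ms["inbound_jitter_buffer"] / total
print(f"total turn lag: {total} ms, jitter buffer share: {share:.0%}")

# Retuning the buffer from 200 ms down to 60 ms removes 140 ms outright.
retuned = total - budget_ms["inbound_jitter_buffer"] + 60
print(f"retuned total: {retuned} ms")
```

Under these assumed numbers the buffer alone is roughly a fifth of the total lag, and retuning it moves the pipeline from well above the 800 ms threshold to just inside it.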
Technical deep-dive
Three knobs matter: minimum depth, maximum depth, and adaptation aggressiveness.
; Asterisk rtp.conf -- RTP port range and transport for AI voice agent inbound
[general]
rtpstart=10000
rtpend=20000
rtpchecksums=no
strictrtp=yes
icesupport=no

; Jitter buffer options live in the channel driver config (e.g. sip.conf
; [general]), not rtp.conf:
jbenable=yes
jbforce=yes
jbmaxsize=120
jbresyncthreshold=500
jbimpl=adaptive
jbtargetextra=20
<!-- FreeSWITCH var per-call for AI bridge -->
<action application="set" data="jitterbuffer_msec=20:80:60"/>
<action application="set" data="rtp_jitter_buffer_plc=true"/>
The FreeSWITCH triplet sets minimum 20 ms, maximum 80 ms, and target 60 ms. That is aggressive but right for a low-loss path. The PLC flag turns on packet-loss concealment so a single dropped packet does not punch a hole through to the ASR engine.
For WebRTC-based clients such as the Twilio Voice SDK, the relevant settings are audioJitterBufferMaxPackets and audioJitterBufferFastAccelerate in the peer-connection options. WebRTC defaults are roughly 50-200 ms depending on the implementation; Chrome adapted faster than Firefox in 2026 measurements.
The risk is over-cutting: a buffer too shallow drops packets on burst jitter. Symptom: choppy ASR, missing words at the start of utterances, ASR engines transcribing "hello" as "ello".
CallSphere implementation
CallSphere runs Twilio Programmable Voice across all six verticals. For Healthcare AI on FastAPI :8084 we accept Twilio Media Streams over WebSocket; Twilio's send-side jitter is sub-30 ms median, so we run no additional jitter buffer between the WebSocket and OpenAI Realtime and instead rely on Realtime's own server VAD turn detection. For Sales Calling AI's 5 concurrent outbound calls per tenant we tune our internal bridge buffer to 40-80 ms because cell phone destinations have higher native jitter. After-Hours AI uses simultaneous call + SMS to on-call staff with a 120-second timeout, where reliability beats latency optimization. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2 compliance, $149/$499/$1499 pricing, and a 14-day trial, the buffer policy is documented per vertical with measured turn-taking latency for each.
Implementation steps
- Establish your baseline: log RTP arrival timestamps for 1000 calls and compute the 95th percentile inter-packet delta minus the expected packetization interval.
- If P95 jitter is under 30 ms, you are over-buffered with defaults; cut min to 20 ms and max to 80 ms.
- If P95 jitter is over 80 ms, you are likely on a noisy mobile path; keep max at 120-150 ms but reduce min to 40 ms.
- Turn on PLC unconditionally. A reconstructed packet beats a hole.
- Measure end-to-end turn-taking latency before and after with a stopwatch script: send "hello", wait, time first AI audio byte.
- Add jitter and loss alerts to your CDR pipeline; sustained P95 above 100 ms means a peering or SBC problem.
- For OpenAI Realtime, prefer server VAD turn detection over client; the model has its own end-of-turn classifier that handles small jitter implicitly.
- Re-run quarterly; carrier and SIP peering paths shift.
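The baseline and threshold steps above can be sketched as a short Python helper. The function names, the 20 ms default ptime, and the exact depth recommendations are illustrative; the thresholds mirror the 30 ms / 80 ms cut points in the list.

```python
import statistics

def p95_jitter_ms(arrival_ts_ms: list[float], ptime_ms: float = 20.0) -> float:
    """Baseline from step 1: 95th-percentile inter-packet delta minus the
    expected packetization interval, from one call's RTP arrival
    timestamps in milliseconds."""
    deltas = [b - a for a, b in zip(arrival_ts_ms, arrival_ts_ms[1:])]
    p95 = statistics.quantiles(deltas, n=20)[-1]  # 19th cut point = P95
    return max(0.0, p95 - ptime_ms)

def recommend(p95: float) -> tuple[int, int]:
    """Map measured P95 jitter to a (min, max) buffer depth in ms,
    following the thresholds in the steps above."""
    if p95 < 30:
        return (20, 80)    # over-buffered with defaults: cut aggressively
    if p95 > 80:
        return (40, 150)   # likely noisy mobile path: keep headroom
    return (40, 120)       # middle ground
```

Run `p95_jitter_ms` per call over your 1000-call sample, take the distribution's P95, and feed that into `recommend` to pick a starting depth before measuring end-to-end.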
FAQ
Why not just remove the jitter buffer entirely? Even a 5 ms buffer is doing useful work because RTP arrival is bursty. Zero buffer creates audio glitches the ASR engine hears as phantom syllables.
Does Twilio expose jitter buffer config on Programmable Voice? Not on the trunk side, but on Voice SDK clients you can tune it in the WebRTC PeerConnection. On Media Streams the server-side buffer is short and you can effectively bypass it.
Will this break for callers on slow networks? You will see more PLC events and occasional ASR misreads on bad networks. Set a fallback buffer depth that triggers on detected loss above 5%.
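The loss-triggered fallback mentioned here can be sketched as a small state machine. The class name, depths, 5% threshold, and the three-clean-window hysteresis are illustrative assumptions, not a shipped API; the hysteresis exists so the buffer does not flap between depths on every window.

```python
class LossAwareBuffer:
    """Escalate to a deeper jitter buffer when packet loss over a
    measurement window exceeds 5%; de-escalate only after three
    consecutive clean windows (hysteresis avoids flapping)."""

    def __init__(self) -> None:
        self.depth = (20, 80)      # (min, max) ms: aggressive default
        self.clean_windows = 0

    def on_window(self, loss_pct: float) -> tuple[int, int]:
        if loss_pct > 5.0:
            self.depth = (60, 150)  # fallback depth for lossy paths
            self.clean_windows = 0
        else:
            self.clean_windows += 1
            if self.clean_windows >= 3:
                self.depth = (20, 80)
        return self.depth
```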
How does this interact with Opus FEC? Opus in-band FEC reconstructs the previous frame, so a small jitter buffer plus FEC is more robust than a large jitter buffer with G.722, which has no built-in FEC.
What about the outbound side? Outbound RTP is sent on a fixed cadence; jitter buffer only matters on the receiver. The AI as receiver is the side to optimize.
Sources
- Wildix: RTP, RTCP and Jitter Buffer
- Telnyx: Adaptive Jitter Buffer for SIP Trunking
- PJSIP: Jitter Buffer Features and Operations
Start a 14-day trial and feel the latency, see pricing for $149/$499/$1499 tiers, or contact us about jitter buffer tuning for high-stakes voice AI.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.