By Sagar Shankaran, Founder of CallSphere
Default jitter buffers (60-200 ms) are tuned for human ears, not LLM turn-taking. Here is how to retune them for AI voice agents without melting your audio quality on cell phone calls.
Key takeaways
A 200 ms jitter buffer is invisible to a human listener and devastating to an AI voice agent. It is one of the easiest end-to-end latency improvements you can ship, and one of the most overlooked.
flowchart LR
UA[SIP UA] -- REGISTER --> Reg[Registrar]
UA -- INVITE --> Proxy[SIP Proxy]
Proxy --> Dispatcher[Kamailio dispatcher]
Dispatcher --> Worker1[FreeSWITCH worker]
Dispatcher --> Worker2[FreeSWITCH worker]
Worker1 --> AI[(AI agent)]
Worker2 --> AIA jitter buffer holds incoming RTP packets briefly to absorb network-induced timing variation before play-out. Adaptive jitter buffers (AJB) adjust their depth between configured min and max based on observed jitter. Default values across SIP stacks (Asterisk, FreeSWITCH, PJSIP, Twilio) cluster around 60-200 ms minimum and 200-500 ms maximum. Those defaults are tuned for one-way listening: the buffer can grow to absorb a 300 ms reordering event without dropouts, and the human ear forgives the latency.
For AI voice that math is wrong. The AI's loop is: voice activity detection -> waited end-of-turn -> ASR partial -> ASR final -> LLM token stream -> TTS first byte -> RTP send. Every milliseconds of jitter buffer on inbound RTP becomes a millisecond of additional turn-taking lag. The user perceives the delay between finishing their sentence and the AI starting its response. Studies put the human "feels natural" threshold at around 800 ms; modern AI voice ships in the 600-1500 ms range. Cutting 100 ms off the jitter buffer is a meaningful share of that budget.
Three knobs matter: minimum depth, maximum depth, and adaptation aggressiveness.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
; Asterisk rtp.conf for AI voice agent inbound
[general]
rtpstart=10000
rtpend=20000
rtpchecksums=no
strictrtp=yes
icesupport=no
[jitter]
jbenable=yes
jbforce=yes
jbmaxsize=120
jbresyncthreshold=500
jbimpl=adaptive
jbtargetextra=20
<!-- FreeSWITCH var per-call for AI bridge -->
<action application="set" data="jitterbuffer_msec=20:80:60"/>
<action application="set" data="rtp_jitter_buffer_plc=true"/>
The FreeSWITCH triplet sets minimum 20 ms, maximum 80 ms, and target 60 ms. That is aggressive but right for a low-loss path. The PLC flag turns on packet-loss concealment so a single dropped packet does not punch a hole through to the ASR engine.
For PJSIP-based Twilio Voice SDK clients, the relevant setting is audioJitterBufferMaxPackets and audioJitterBufferFastAccelerate in the createPeerConnection options. WebRTC defaults are roughly 50-200 ms depending on the implementation; Chrome adapted faster than Firefox in 2026 measurements.
The risk is over-cutting: a buffer too shallow drops packets on burst jitter. Symptom: choppy ASR, missing words at the start of utterances, ASR engines transcribing "hello" as "ello".
CallSphere runs Twilio Programmable Voice across all six verticals. For Healthcare AI on FastAPI :8084 we accept Twilio Media Streams over WebSocket; Twilio's send-side jitter is sub-30 ms median and we run no additional jitter buffer between the WebSocket and OpenAI Realtime, instead relying on Realtime's own server VAD turn detection. For Sales Calling AI's 5 concurrent outbound calls per tenant we tune our internal bridge buffer to 40-80 ms because cell phone destinations have higher native jitter. After-Hours AI uses simul call+SMS to on-call staff with a 120-second timeout where reliability beats latency optimization. Across 37 agents, 90+ tools, 115+ DB tables, HIPAA + SOC 2, $149/$499/$1499 pricing, and 14-day trial, the buffer policy is documented per-vertical with measured turn-taking latency on each.
Why not just remove the jitter buffer entirely? Even a 5 ms buffer is doing useful work because RTP arrival is bursty. Zero buffer creates audio glitches the ASR engine hears as phantom syllables.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Does Twilio expose jitter buffer config on Programmable Voice? Not on the trunk side, but on Voice SDK clients you can tune it in the WebRTC PeerConnection. On Media Streams the server-side buffer is short and you can effectively bypass it.
Will this break for callers on slow networks? You will see more PLC events and occasional ASR misreads on bad networks. Set a fallback buffer depth that triggers on detected loss above 5%.
How does this interact with Opus FEC? Opus inband FEC reconstructs the previous frame so a small jitter buffer plus FEC is more robust than a large jitter buffer with G.722.
What about the outbound side? Outbound RTP is sent on a fixed cadence; jitter buffer only matters on the receiver. The AI as receiver is the side to optimize.
Start a 14-day trial and feel the latency, see pricing for $149/$499/$1499 tiers, or contact us about jitter buffer tuning for high-stakes voice AI.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
A founder's guide to texto a voz (text-to-speech in Spanish): LATAM vs Castilian voices, free options, and how CallSphere ships Spanish agents.
A founder's guide to the female voice generator landscape: AI female voices, Japanese voices, robot voices, and how CallSphere ships 57+ voices live.
A founder's guide to the Siri voice generator landscape: how AI voice cloning works, what is legal, and how CallSphere uses 57+ voices in production.
A founder's guide to AI voice assistants for ecommerce: customer service, order lookup, and how CallSphere fits in versus virtual receptionists.
Robot text to speech in 2026: how I pick TTS APIs, when robotic voices help, and how CallSphere ships 57+ language voice agents. Hands-on guide.
The customer support specialist role in 2026 is half human, half AI. Here is what the job looks like, the AI tools that pair with it, and how we ship it.
© 2026 CallSphere LLC. All rights reserved.