By Sagar Shankaran, Founder of CallSphere
How to apply real backpressure to a WebSocket carrying AI audio: bounded queues, token-bucket grants, sentence-level streaming, and the buffer trap to avoid.
Key takeaways
The browser WebSocket API has no pause() method. There is no built-in backpressure. Whatever you ship is what you build, and most teams ship "send and pray."
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]
Audio backs up because WebSocket is a fire-and-forget message protocol. The browser accepts frames into its receive buffer as fast as the network delivers them and drops the connection only when the buffer is exhausted. There is no stream.pull() semantic. So when your AI generates 4 seconds of TTS audio in 600 ms and your phone client can only render in real time, 3.4 seconds of audio (4 s generated minus the 0.6 s already played) are queued. If the user interrupts, all 3.4 seconds still have to be flushed before the agent can stop talking.
The fix is application-layer backpressure: bounded queues at every stage, explicit ACKs from the consumer, and producers that pause until they get a grant.
Three patterns dominate in 2026:
Sentence-level streaming. Split the LLM output by sentence (or into 200-character chunks) and synthesize each piece independently. Send the client one sentence at a time, and wait for an explicit played ACK before sending the next. Latency stays low because the first sentence arrives quickly; backpressure is automatic because the queue cannot grow past one sentence in flight.
Token-bucket grants. The client gives the server a "credit" of N seconds of audio it can buffer. Server tracks remaining credit per session, pauses sends when credit drops to zero, resumes when the client emits a grant event after consuming.
Bounded asyncio.Queue between stages. Inside the server, every stage (STT → reasoning → TTS → send) has a queue with maxsize=N. When a downstream stage is slow, the queue fills, and the upstream stage blocks on put(). This pushes the pause signal back to the audio source automatically.
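The sentence-level pattern from the list above can be sketched roughly as follows. This is a minimal illustration, not CallSphere's implementation: split_sentences, stream_by_sentence, and the three callables passed in (synthesize, send, wait_for_played) are hypothetical names, and the played ACK is assumed to be whatever message your client emits when playback of a piece finishes.

```python
import asyncio
import re

MAX_CHARS = 200  # fallback chunk size mentioned in the text


def split_sentences(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split on sentence-ending punctuation; hard-wrap pieces over max_chars."""
    pieces: list[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        while len(sentence) > max_chars:
            pieces.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if sentence:
            pieces.append(sentence)
    return pieces


async def stream_by_sentence(text, synthesize, send, wait_for_played) -> int:
    """Send one synthesized piece at a time; at most one is ever in flight.

    synthesize, send, and wait_for_played are hypothetical async callables
    supplied by the caller (TTS call, socket send, client ACK wait).
    """
    sent = 0
    for sentence in split_sentences(text):
        audio = await synthesize(sentence)
        await send(audio)
        await wait_for_played()  # do not start the next piece until the client ACKs
        sent += 1
    return sent
```

Because the loop awaits the ACK before synthesizing the next piece, an interruption only ever has one sentence of audio to discard.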
The trap to avoid is "infinite client buffer." The browser's bufferedAmount will grow to gigabytes if you let it. Always bound the producer.
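Here is a small sketch of what "bound the producer" looks like with a bounded asyncio.Queue between two stages. The stage names, the 1 ms sleep standing in for slow downstream work, and the default sizes are illustrative assumptions; only asyncio.Queue(maxsize=...) and its blocking put() come from the pattern described above.

```python
import asyncio


async def producer(q: asyncio.Queue, n_chunks: int) -> None:
    """Fast upstream stage: put() suspends whenever the queue is full."""
    for i in range(n_chunks):
        await q.put(i)  # waits here once maxsize items are queued
    await q.put(None)   # sentinel: no more chunks


async def consumer(q: asyncio.Queue, depths: list[int]) -> int:
    """Slow downstream stage: records the queue depth each time it takes a chunk."""
    handled = 0
    while True:
        item = await q.get()
        if item is None:
            return handled
        depths.append(q.qsize())
        await asyncio.sleep(0.001)  # stand-in for slow work such as TTS or a send
        handled += 1


async def run_pipeline(n_chunks: int = 50, maxsize: int = 10) -> tuple[int, int]:
    """Run both stages and return (chunks handled, peak queue depth observed)."""
    q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    depths: list[int] = []
    _, handled = await asyncio.gather(producer(q, n_chunks), consumer(q, depths))
    return handled, max(depths)
```

Calling asyncio.run(run_pipeline()) processes every chunk while the observed depth never exceeds maxsize, because the producer is suspended inside put() whenever the consumer falls behind.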
The CallSphere voice agents use all three patterns at different layers:
response.audio.delta is treated as a chunk; on user interruption, we cancel the response and emit a Twilio clear to drain.
asyncio.Queue(maxsize=20) between transcription and reasoning, sized to roughly 400 ms of headroom.
This is one reason our voice agents stay under 1.2 s mic-to-mic latency even when the network jitters.
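import asyncio


class GrantedSender:
    """Sends audio only while the client has granted enough buffer credit."""

    def __init__(self, ws, max_credit_ms: int = 800) -> None:
        self.ws = ws
        self.credit = max_credit_ms  # ms of audio the client can still buffer
        self.cv = asyncio.Condition()

    async def send_chunk(self, chunk: bytes, dur_ms: int) -> None:
        async with self.cv:
            # Wait until enough credit is available for this chunk.
            while self.credit < dur_ms:
                await self.cv.wait()
            self.credit -= dur_ms
        # Send outside the lock so on_grant() is never held up by a slow socket.
        await self.ws.send_bytes(chunk)

    async def on_grant(self, grant_ms: int) -> None:
        # Called when the client reports it has consumed grant_ms of audio.
        async with self.cv:
            self.credit += grant_ms
            self.cv.notify_all()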
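send_chunk needs a dur_ms for every chunk. For raw audio that is just arithmetic on the byte count, sketched below for the two formats in the flowchart (μ-law at 8 kHz and PCM16 at 24 kHz). The helper names are hypothetical, and mono audio is assumed; adjust the constants if your stream is stereo or uses a different sample rate.

```python
# Bytes per millisecond of audio, assuming a single (mono) channel.
MULAW_8K_BYTES_PER_MS = 8 * 1    # 8,000 samples/s x 1 byte per sample
PCM16_24K_BYTES_PER_MS = 24 * 2  # 24,000 samples/s x 2 bytes per sample


def chunk_duration_ms(chunk: bytes, bytes_per_ms: int) -> int:
    """Playback duration of a raw audio chunk, rounded up to whole milliseconds."""
    return -(-len(chunk) // bytes_per_ms)  # ceiling division


def credit_window_bytes(credit_ms: int, bytes_per_ms: int) -> int:
    """How many bytes of raw audio fit inside a credit window of credit_ms."""
    return credit_ms * bytes_per_ms
```

With these constants, the default 800 ms window holds 6,400 bytes of μ-law audio or 38,400 bytes of PCM16 at 24 kHz, which gives a sense of how small a sensible client buffer really is.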
Start with maxsize=10 and tune.
Monitor bufferedAmount on the server side; alert when it crosses 256 KB per connection.
Use tc qdisc add to inject 100 ms latency and 5% packet loss, then verify that your queues stay bounded.
Why does my agent talk over interruption? Because you flushed the in-flight buffer instead of dropping it. On interrupt, send a clear to the client and response.cancel to the model.
What is the right credit window? 600–1000 ms. Less and you stutter; more and interruption feels laggy.
Can I use TCP_NODELAY to fix this? No. TCP-level tuning helps small messages, but cannot help an oversized application-level buffer.
Does WebTransport solve this? Yes — its streams have native backpressure. But browser support is still uneven in 2026; WebSocket remains the default.
Should I use WebRTC instead? For audio specifically, yes — WebRTC's pacer applies backpressure in the codec layer. We use WebRTC for our Real Estate agent for exactly this reason.
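The interruption answer above (drop, do not flush) can be sketched like this. It is a minimal illustration under stated assumptions: drop_queued_audio and handle_interrupt are hypothetical names, send_to_twilio and send_to_model are whatever async send functions your bridge already has, and the two JSON shapes reflect my understanding of the Twilio Media Streams clear message and the OpenAI Realtime response.cancel event. Verify both against the current docs before relying on them.

```python
import asyncio
import json


def drop_queued_audio(q: asyncio.Queue) -> int:
    """Discard every chunk still waiting to be sent; return how many were dropped."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1


async def handle_interrupt(outbound, send_to_twilio, send_to_model, stream_sid) -> int:
    """Stop the agent talking: drop local audio, clear the client, cancel the model."""
    # 1. Drop, rather than flush, anything queued on our side.
    dropped = drop_queued_audio(outbound)
    # 2. Ask the client side to discard audio it has already buffered.
    await send_to_twilio(json.dumps({"event": "clear", "streamSid": stream_sid}))
    # 3. Tell the model to stop generating the current response.
    await send_to_model(json.dumps({"type": "response.cancel"}))
    return dropped
```

The order matters less than doing all three: skipping the local drop is what produces the "agent talks over me" symptom, because the already-queued audio keeps going out after the cancel.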
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.