WebSocket Backpressure for AI Audio Streams: Flow Control That Works
How to apply real backpressure to a WebSocket carrying AI audio: bounded queues, token-bucket grants, sentence-level streaming, and the buffer trap to avoid.
The browser WebSocket API has no `pause()` method. There is no built-in backpressure: whatever flow control you get is whatever you build, and most teams ship "send and pray."
Why is backpressure hard on WebSockets?
```mermaid
flowchart LR
    Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
    Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
    OAI --> Bridge
    Bridge --> Twilio
    Bridge --> Logs[(structured logs · OTel)]
```

Because WebSocket is a fire-and-forget message protocol. The browser accepts frames into its receive buffer as fast as the network can deliver them and only drops the connection once that buffer is exhausted; there is no `stream.pull()` semantic. So when your AI generates 4 seconds of TTS audio in 600 ms and your phone client can only render it in real time, you have 3.4 seconds of audio queued. If the user interrupts, all 3.4 seconds still need to be flushed before the agent can stop talking.
The fix is application-layer backpressure: bounded queues at every stage, explicit ACKs from the consumer, and producers that pause until they get a grant.
How should backpressure actually work?
Three patterns dominate in 2026:
- Sentence-level streaming. Split the LLM output by sentence (or ~200-character chunks) and TTS each piece independently. Send the client one sentence at a time and wait for an explicit `played` ACK before sending the next. Latency stays low because the first sentence arrives quickly; backpressure is automatic because at most one sentence is ever in flight.
- Token-bucket grants. The client gives the server a "credit" of N seconds of audio it can buffer. The server tracks remaining credit per session, pauses sends when credit drops to zero, and resumes when the client emits a `grant` event after consuming audio.
- Bounded `asyncio.Queue` between stages. Inside the server, every stage (STT → reasoning → TTS → send) has a queue with `maxsize=N`. When a downstream stage is slow, the queue fills and the upstream stage blocks on `put()`, pushing the pause signal back to the audio source automatically.
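The bounded-queue pattern needs nothing beyond the standard library. This sketch uses illustrative stage names and payloads; the sleep stands in for real-time playout:

```python
import asyncio

async def producer(q: asyncio.Queue) -> None:
    """Stands in for a fast TTS stage emitting audio chunks."""
    for i in range(50):
        # put() blocks while the queue is full -- this is the backpressure.
        await q.put(f"chunk-{i}".encode())
    await q.put(None)  # sentinel: end of stream

async def consumer(q: asyncio.Queue, played: list) -> None:
    """Stands in for a slow playout stage."""
    while (chunk := await q.get()) is not None:
        await asyncio.sleep(0.001)  # simulated real-time rendering
        played.append(chunk)

async def pipeline() -> list:
    q: asyncio.Queue = asyncio.Queue(maxsize=10)  # bounded stage boundary
    played: list = []
    await asyncio.gather(producer(q), consumer(q, played))
    return played

chunks = asyncio.run(pipeline())
```

The producer never gets more than 10 chunks ahead of playout, no matter how fast it runs; the bound on the queue is the entire flow-control mechanism.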
The trap to avoid is "infinite client buffer." The browser's `bufferedAmount` will grow to gigabytes if you let it. Always bound the producer.
CallSphere's implementation
The CallSphere voice agents use all three patterns at different layers:
- Sentence-level streaming between the OpenAI Realtime model and the client. Each `response.audio.delta` is treated as a chunk; on user interruption, we cancel the response and emit a Twilio `clear` to drain.
- Token-bucket grants between the FastAPI Healthcare service and the bridge. The bridge advertises 800 ms of audio credit and refills as it plays out.
- Bounded queues inside the Sales Calling and After-hours services, with `asyncio.Queue(maxsize=20)` between transcription and reasoning, sized to roughly 400 ms of headroom.
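The sizing in that last bullet follows from chunk duration: 400 ms of headroom over 20 ms chunks works out to a queue of 20 (the 20 ms chunk size is an assumption; the helper name is illustrative):

```python
import math

def queue_size_for_headroom(headroom_ms: int, chunk_ms: int) -> int:
    """Number of chunks a bounded queue must hold to cover the headroom."""
    return math.ceil(headroom_ms / chunk_ms)

# e.g. 400 ms of headroom with 20 ms audio chunks -> maxsize of 20
```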
This is one reason our voice agents stay under 1.2 s mic-to-mic latency even when the network jitters.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Code: credit-based sending with explicit grants
```python
import asyncio


class GrantedSender:
    """Sends audio only while the client has granted buffering credit."""

    def __init__(self, ws, max_credit_ms: int = 800) -> None:
        self.ws = ws
        self.credit = max_credit_ms  # ms of audio the client will buffer
        self.cv = asyncio.Condition()

    async def send_chunk(self, chunk: bytes, dur_ms: int) -> None:
        async with self.cv:
            # Back off until the client has played enough audio.
            while self.credit < dur_ms:
                await self.cv.wait()
            self.credit -= dur_ms
            await self.ws.send_bytes(chunk)

    async def on_grant(self, grant_ms: int) -> None:
        # Called when the client reports audio it has consumed.
        async with self.cv:
            self.credit += grant_ms
            self.cv.notify_all()
```
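The grant itself is just a small client-to-server control message. The article doesn't specify a wire format, so the JSON shape and `parse_grant` helper here are assumptions; the server would feed the parsed value into `on_grant`:

```python
import json
from typing import Optional

def parse_grant(raw: str) -> Optional[int]:
    """Return grant_ms if this is a grant message, else None.

    The {"type": "grant", "grant_ms": N} shape is hypothetical.
    """
    msg = json.loads(raw)
    if msg.get("type") == "grant":
        return int(msg["grant_ms"])
    return None

ms = parse_grant('{"type": "grant", "grant_ms": 200}')
```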
Build steps
- Pick the granularity of backpressure (audio chunk, sentence, or message). Finer granularity means lower latency but more control-message chatter.
- Implement bounded queues between every stage of your pipeline. Default to `maxsize=10` and tune.
- Add an explicit ACK or grant message from client to server. Browsers cannot push back implicitly.
- Watch the send backlog (`bufferedAmount` in the browser, the socket write buffer on the server); alert when it crosses 256 KB per connection.
- On user interruption, cancel upstream production before draining the local buffer: drop, do not finish.
- Load test with a slow client: use `tc qdisc add` to inject 100 ms latency and 5% packet loss, then verify your queues stay bounded.
FAQ
Why does my agent talk over interruption? Because you flushed the in-flight buffer instead of dropping it. On interrupt, send a `clear` to the client and `response.cancel` to the model.
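That interrupt path fits in one coroutine. The message shapes match the Twilio Media Streams `clear` event and the OpenAI Realtime `response.cancel` event, but the function and socket names are illustrative, and the stub stands in for the live connections:

```python
import asyncio
import json

async def on_user_interrupt(twilio_ws, oai_ws, stream_sid: str) -> None:
    # Drop audio already buffered on the phone leg (Twilio "clear" event).
    await twilio_ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))
    # Stop the model from producing more audio (Realtime "response.cancel").
    await oai_ws.send(json.dumps({"type": "response.cancel"}))

class _CaptureWS:
    """Stub socket for illustration; real code uses the live connections."""
    def __init__(self) -> None:
        self.sent: list = []
    async def send(self, msg: str) -> None:
        self.sent.append(msg)

twilio, oai = _CaptureWS(), _CaptureWS()
asyncio.run(on_user_interrupt(twilio, oai, "MZ123"))
```

Note the order: cancel upstream production first, then drain, so no new audio lands behind the flush.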
What is the right credit window? 600–1000 ms. Less and you stutter; more and interruption feels laggy.
Can I use `TCP_NODELAY` to fix this? No. TCP-level tuning helps small-message latency, but it cannot shrink an oversized application-level buffer.
Does WebTransport solve this? Yes — its streams have native backpressure. But browser support is still uneven in 2026; WebSocket remains the default.
Should I use WebRTC instead? For audio specifically, yes — WebRTC's pacer applies backpressure in the codec layer. We use WebRTC for our Real Estate agent for exactly this reason.
CallSphere builds backpressure into 90+ tools across the platform. Start the 14-day trial for $149/$499/$1499.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.