
WebSocket Backpressure for AI Audio Streams: Flow Control That Works

How to apply real backpressure to a WebSocket carrying AI audio: bounded queues, token-bucket grants, sentence-level streaming, and the buffer trap to avoid.

The browser WebSocket API has no pause() method and no built-in backpressure. Any flow control you get is flow control you build yourself, and most teams ship "send and pray."

Why is backpressure hard on WebSockets?

flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
Diagram: CallSphere reference architecture

Because WebSocket is a fire-and-forget message protocol. The browser accepts frames into its receive buffer as fast as the network can deliver them; nothing pushes back until the kernel's buffers fill, and by then TCP is stalling the entire connection. There is no stream.pull() semantic. So when your AI generates 4 seconds of TTS audio in 600 ms and your phone client can only play it out in real time, you have 3.4 seconds of audio queued, and if the user interrupts, all 3.4 seconds still have to be flushed before the agent can stop talking.

The fix is application-layer backpressure: bounded queues at every stage, explicit ACKs from the consumer, and producers that pause until they get a grant.

How should backpressure actually work?

Three patterns dominate in 2026:

  1. Sentence-level streaming. Split the LLM output by sentence (or 200-character chunks) and TTS each piece independently. Send to the client one sentence at a time and wait for an explicit "played" ACK before sending the next (a sketch of this loop follows the list). Latency stays low because the first sentence arrives quickly; backpressure is automatic because the queue can never grow past one chunk in flight.

  2. Token-bucket grants. The client gives the server a "credit" of N seconds of audio it can buffer. Server tracks remaining credit per session, pauses sends when credit drops to zero, resumes when the client emits a grant event after consuming.

  3. Bounded asyncio.Queue between stages. Inside the server, every stage (STT → reasoning → TTS → send) has a queue with maxsize=N. When a downstream stage is slow, the queue fills, and the upstream stage blocks on put(). This pushes the pause signal back to the audio source automatically (a runnable sketch follows below).
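
Here is a minimal sketch of pattern 1, assuming an aiohttp-style WebSocket (send_bytes / receive_json) and a {"type": "played"} ACK message; the helper name and the message shape are ours, not a standard:

from typing import AsyncIterable, Awaitable, Callable

async def stream_sentences(
    ws,
    sentences: AsyncIterable[str],
    synthesize: Callable[[str], Awaitable[bytes]],
) -> None:
    # At most one sentence in flight: send, then block until the client
    # confirms playback. The blocked await *is* the backpressure.
    async for text in sentences:
        audio = await synthesize(text)   # your TTS call
        await ws.send_bytes(audio)
        while True:
            msg = await ws.receive_json()
            if msg.get("type") == "played":
                break                    # client drained; send the next one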

The trap to avoid is the "infinite client buffer." A WebSocket's bufferedAmount on the sending side will grow to gigabytes if you let it. Always bound the producer.
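
Pattern 3 needs almost no code, because asyncio's queue is the mechanism. A runnable sketch (frame sizes and sleep times are illustrative):

import asyncio

async def producer(q: asyncio.Queue) -> None:
    # put() blocks once the queue holds maxsize items; that block is
    # the pause signal propagating upstream to the audio source.
    for _ in range(50):
        await q.put(b"\x00" * 640)   # 20 ms of 16 kHz PCM16, say
    await q.put(None)                # sentinel: end of stream

async def consumer(q: asyncio.Queue) -> None:
    while (frame := await q.get()) is not None:
        await asyncio.sleep(0.02)    # simulate real-time playout
    # the producer can outrun this by at most maxsize frames

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue(maxsize=10)  # ~200 ms of 20 ms frames
    await asyncio.gather(producer(q), consumer(q))

asyncio.run(main())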

CallSphere's implementation

The CallSphere voice agents use all three patterns at different layers:

  • Sentence-level streaming between the OpenAI Realtime model and the client. Each response.audio.delta is treated as a chunk; on user interruption, we cancel the response and emit a Twilio clear to drain (sketched below).
  • Token-bucket grants between the FastAPI Healthcare service and the bridge. The bridge advertises 800 ms of audio credit and refills as it plays out.
  • Bounded queues inside the Sales Calling and After-hours services, with asyncio.Queue(maxsize=20) between transcription and reasoning, sized to roughly 400 ms of headroom.

This is one reason our voice agents stay under 1.2 s mic-to-mic latency even when the network jitters.
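
The interruption path from the first bullet is worth showing. A minimal sketch, assuming websockets-style .send() on both connections; response.cancel is the OpenAI Realtime client event and clear is the Twilio Media Streams message:

import json

async def on_user_interrupt(openai_ws, twilio_ws, stream_sid: str) -> None:
    # Cancel upstream production FIRST so no new audio is generated...
    await openai_ws.send(json.dumps({"type": "response.cancel"}))
    # ...then tell Twilio to drop everything buffered for this stream
    # rather than play it out: drop, do not finish.
    await twilio_ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))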


Code: token-bucket credit with explicit grants

import asyncio

class GrantedSender:
    """Gates audio sends on playback credit granted by the client."""

    def __init__(self, ws, max_credit_ms: int = 800) -> None:
        self.ws = ws
        self.max_credit_ms = max_credit_ms
        self.credit = max_credit_ms            # start with a full bucket
        self.cv = asyncio.Condition()

    async def send_chunk(self, chunk: bytes, dur_ms: int) -> None:
        # Block until the client has played out enough audio to cover
        # this chunk's duration; this wait is the producer-side pause.
        async with self.cv:
            while self.credit < dur_ms:
                await self.cv.wait()
            self.credit -= dur_ms
        await self.ws.send_bytes(chunk)

    async def on_grant(self, grant_ms: int) -> None:
        # Called when the client reports consumed audio. Clamp so a
        # buggy or malicious client cannot inflate credit past the window.
        async with self.cv:
            self.credit = min(self.credit + grant_ms, self.max_credit_ms)
            self.cv.notify_all()
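
Wiring it up, assuming the client emits {"type": "grant", "ms": N} after playout and an aiohttp-style WebSocket that yields messages via async iteration; the message shape and helper are ours:

import asyncio
import json

async def run_session(ws, chunks: list[bytes]) -> None:
    sender = GrantedSender(ws)

    async def pump() -> None:
        for chunk in chunks:
            await sender.send_chunk(chunk, dur_ms=20)  # 20 ms frames

    async def grants() -> None:
        async for msg in ws:                 # WSMessage in aiohttp
            data = json.loads(msg.data)
            if data.get("type") == "grant":
                await sender.on_grant(data["ms"])

    await asyncio.gather(pump(), grants())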

Build steps

  1. Pick the granularity of backpressure (audio chunk, sentence, or message). Finer granularity means lower latency but more control-message chatter.
  2. Implement bounded queues between every stage of your pipeline. Default to maxsize=10 and tune.
  3. Add an explicit ACK or grant message from client to server. Browsers cannot push back implicitly.
  4. Watch the server-side send buffer (the server's analogue of the browser's bufferedAmount); alert when it crosses 256 KB per connection (a sketch follows the list).
  5. On user interruption, cancel upstream production before draining the local buffer — drop, do not finish.
  6. Load test with a slow client: use tc with netem (e.g. tc qdisc add dev eth0 root netem delay 100ms loss 5%) to inject 100 ms latency and 5% packet loss, then verify your queues stay bounded.
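
For step 4, asyncio transports expose the outstanding write buffer directly. A sketch; how you reach the transport depends on your WebSocket library, so treat the wiring as an assumption:

import asyncio
import logging

BUFFER_ALERT_BYTES = 256 * 1024   # threshold from step 4

async def watch_send_buffer(transport: asyncio.WriteTransport,
                            conn_id: str,
                            interval_s: float = 1.0) -> None:
    # A steadily growing write buffer means the peer is not draining
    # and your backpressure has failed somewhere upstream.
    while not transport.is_closing():
        pending = transport.get_write_buffer_size()
        if pending > BUFFER_ALERT_BYTES:
            logging.warning("conn %s send buffer at %d bytes", conn_id, pending)
        await asyncio.sleep(interval_s)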

FAQ

Why does my agent talk over interruption? Because you flushed the in-flight buffer instead of dropping it. On interrupt, send a clear to the client and response.cancel to the model (see the interruption sketch above).

What is the right credit window? 600–1000 ms. Less and you stutter; more and interruption feels laggy.

Can I use TCP_NODELAY to fix this? No. TCP-level tuning helps latency for small messages; it cannot drain an application-level buffer you let grow too large.

Does WebTransport solve this? Yes — its streams have native backpressure. But browser support is still uneven in 2026; WebSocket remains the default.

Should I use WebRTC instead? For audio specifically, yes: WebRTC's pacer and congestion control apply backpressure at the transport layer and feed bitrate decisions back into the encoder. We use WebRTC for our Real Estate agent for exactly this reason.

CallSphere builds backpressure into 90+ tools across the platform. Start the 14-day trial for $149/$499/$1499.
