
Server-Sent Events for Agent Streaming: Pushing Token-by-Token Responses to Clients

Implement Server-Sent Events (SSE) to stream AI agent responses token by token to browser clients using FastAPI StreamingResponse, EventSource API, and proper reconnection handling.

Why SSE for Agent Streaming

When a user sends a message to an AI agent, waiting 10-30 seconds for a complete response creates a terrible experience. Streaming tokens as they are generated makes the agent feel responsive and intelligent. Server-Sent Events (SSE) is the simplest protocol for this — it is HTTP-based, works through proxies and firewalls, auto-reconnects on failure, and requires zero client-side libraries.

Unlike WebSockets, SSE is unidirectional: the server pushes events to the client. This is a perfect fit for the most common agent pattern — the user sends a message (via a regular POST), and the server streams back the response token by token.

The SSE Protocol

SSE follows a simple text format. Each event is a block of lines separated by a blank line:

event: token
data: {"content": "Hello"}

event: token
data: {"content": " world"}

event: done
data: {"session_id": "abc-123", "total_tokens": 47}

The event: field names the event type. The data: field contains the payload. Clients parse these automatically with the browser EventSource API.

FastAPI Streaming Endpoint

Use StreamingResponse with an async generator to produce SSE events:

# app/routes/stream.py
from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse
from agents import Agent, Runner
from openai.types.responses import ResponseTextDeltaEvent
import json

router = APIRouter(prefix="/api/v1/agent", tags=["Streaming"])

agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant.",
    model="gpt-4o",
)

async def event_generator(message: str):
    """Yield SSE-formatted events as the agent generates tokens."""
    result = Runner.run_streamed(agent, message)

    async for event in result.stream_events():
        if event.type == "raw_response_event" and isinstance(
            event.data, ResponseTextDeltaEvent
        ):
            text = event.data.delta  # the token text for this chunk
            if text:
                yield f"event: token\ndata: {json.dumps({'content': text})}\n\n"

    yield f"event: done\ndata: {json.dumps({'content': result.final_output})}\n\n"

@router.post("/stream")
async def stream_agent(request: Request):
    body = await request.json()
    message = body.get("message", "")

    return StreamingResponse(
        event_generator(message),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        },
    )

The X-Accel-Buffering: no header is critical when running behind nginx or similar reverse proxies — without it, the proxy buffers the entire response and delivers it at once, defeating the purpose of streaming.
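Proxies can also drop connections that go quiet between tokens. A common mitigation is to emit SSE comment lines (lines starting with `:`), which clients silently ignore, whenever the stream is idle. A sketch of a wrapper generator (`with_keepalive` is our own helper, not a FastAPI feature):

```python
import asyncio

async def with_keepalive(source, interval: float = 15.0):
    """Wrap an SSE async generator so that idle gaps longer than
    `interval` seconds emit comment frames instead of silence."""
    it = source.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    while True:
        # Wait for the next chunk, but never longer than `interval`
        done, _ = await asyncio.wait({pending}, timeout=interval)
        if not done:
            yield ": keepalive\n\n"  # comment frame, ignored by EventSource
            continue
        try:
            chunk = pending.result()
        except StopAsyncIteration:
            break
        yield chunk
        pending = asyncio.ensure_future(it.__anext__())
```

Wrap the generator at the call site: `StreamingResponse(with_keepalive(event_generator(message)), ...)`.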


Client-Side with EventSource

The browser EventSource API handles SSE natively, but it only supports GET requests. For POST-based streaming, use the fetch API with a stream reader:

// Using fetch for POST-based SSE (JavaScript)
async function streamAgent(message) {
  const response = await fetch("/api/v1/agent/stream", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({message}),
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const {done, value} = await reader.read();
    if (done) break;
    // {stream: true} keeps multi-byte characters intact across chunks
    const text = decoder.decode(value, {stream: true});
    parseSSEChunks(text);  // your SSE parser: buffer and split on blank lines
  }
}

Handling Backpressure

If the client reads slower than the server produces tokens, you need backpressure handling to avoid unbounded memory growth:

import asyncio
from openai.types.responses import ResponseTextDeltaEvent

async def event_generator_with_backpressure(message: str):
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)

    async def producer():
        result = Runner.run_streamed(agent, message)
        async for event in result.stream_events():
            if event.type == "raw_response_event" and isinstance(
                event.data, ResponseTextDeltaEvent
            ):
                text = event.data.delta
                if text:
                    await queue.put(text)  # Blocks when the queue is full
        await queue.put(None)  # Sentinel: generation finished

    task = asyncio.create_task(producer())  # Keep a reference so the task isn't GC'd

    while True:
        token = await queue.get()
        if token is None:
            break
        yield f"event: token\ndata: {json.dumps({'content': token})}\n\n"

    await task  # Surface any exception raised in the producer
    yield f"event: done\ndata: {json.dumps({'status': 'complete'})}\n\n"
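The key property is that `queue.put` suspends the producer once `maxsize` tokens are waiting, so memory stays bounded no matter how slowly the client reads. A toy demonstration (sizes and delays chosen arbitrarily):

```python
import asyncio

async def demo():
    """Fast producer, slow consumer: the bounded queue caps how
    far the producer can run ahead of the consumer."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=3)
    max_backlog = 0

    async def producer():
        for i in range(10):
            await queue.put(i)  # suspends while 3 items are already waiting
        await queue.put(None)   # sentinel

    task = asyncio.create_task(producer())
    received = []
    while True:
        await asyncio.sleep(0.005)  # simulate a slow client
        max_backlog = max(max_backlog, queue.qsize())
        item = await queue.get()
        if item is None:
            break
        received.append(item)
    await task
    return received, max_backlog

received, max_backlog = asyncio.run(demo())
```

The backlog never exceeds the queue's `maxsize`, even though the producer never sleeps.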

Adding Reconnection Support

SSE has built-in reconnection. Use the id: field so clients can resume from where they left off:

from openai.types.responses import ResponseTextDeltaEvent

async def event_generator_with_ids(message: str, last_id: int = 0):
    token_index = 0
    result = Runner.run_streamed(agent, message)

    async for event in result.stream_events():
        if event.type == "raw_response_event" and isinstance(
            event.data, ResponseTextDeltaEvent
        ):
            text = event.data.delta
            if text:
                token_index += 1
                if token_index <= last_id:
                    continue  # Skip already-sent tokens
                yield f"id: {token_index}\nevent: token\ndata: {json.dumps({'content': text})}\n\n"

    yield f"event: done\ndata: {json.dumps({'status': 'complete'})}\n\n"

When the connection drops, the browser EventSource sends a Last-Event-ID header on reconnect. Your endpoint reads this header and skips already-delivered tokens.
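In FastAPI that header is available via `request.headers.get("last-event-id")` (Starlette header lookups are case-insensitive; the plain dict below is just for illustration). A defensive parse guards against missing or malformed values (`parse_last_event_id` is our own helper name):

```python
def parse_last_event_id(headers) -> int:
    """Return the client's Last-Event-ID as an int, or 0 when the
    header is absent or not a number (i.e. a fresh connection)."""
    raw = headers.get("last-event-id")
    try:
        return int(raw)
    except (TypeError, ValueError):
        return 0
```

In the endpoint: `last_id = parse_last_event_id(request.headers)`, then pass it to `event_generator_with_ids`.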

FAQ

When should I use SSE instead of WebSockets for AI agents?

Use SSE when the communication is primarily server-to-client — the most common agent pattern where the user sends a message and receives a streamed response. SSE is simpler to implement, works through all HTTP proxies, and auto-reconnects natively. Use WebSockets when you need true bidirectional communication, such as allowing users to interrupt or redirect the agent mid-generation.

How do I handle SSE through a load balancer?

Most load balancers support SSE out of the box since it is standard HTTP. Disable response buffering in your reverse proxy (nginx: proxy_buffering off; AWS ALB streams by default). Set the load balancer's idle timeout higher than your maximum response time, and use sticky sessions if your agent keeps in-memory state across requests.
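For nginx specifically, a minimal location block might look like this (the upstream port and path are assumptions; adjust them to your deployment):

```nginx
location /api/v1/agent/stream {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;       # deliver each event immediately
    proxy_cache off;
    proxy_read_timeout 300s;   # longer than the slowest agent response
}
```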

What is the maximum number of concurrent SSE connections a browser supports?

Browsers limit concurrent SSE connections to 6 per domain when using HTTP/1.1. With HTTP/2 the limit increases to 100 or more. If your application needs many simultaneous streams, use HTTP/2 or multiplex multiple agent streams over a single SSE connection with event type prefixes.


#SSE #Streaming #AIAgents #FastAPI #Frontend #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
