---
title: "Designing Streaming APIs for LLM Applications: SSE, WebSockets, and HTTP Chunked Transfer"
description: "Learn how to choose and implement the right streaming protocol for LLM applications. Covers Server-Sent Events, WebSockets, and HTTP chunked transfer with FastAPI code examples and error handling strategies."
canonical: https://callsphere.ai/blog/designing-streaming-apis-llm-applications-sse-websockets-chunked-transfer
category: "Learn Agentic AI"
tags: ["Streaming APIs", "Server-Sent Events", "WebSockets", "FastAPI", "LLM API Design"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T17:04:28.888Z
---

# Designing Streaming APIs for LLM Applications: SSE, WebSockets, and HTTP Chunked Transfer

> Learn how to choose and implement the right streaming protocol for LLM applications. Covers Server-Sent Events, WebSockets, and HTTP chunked transfer with FastAPI code examples and error handling strategies.

## Why LLM Applications Need Streaming

Large language models generate tokens sequentially, often taking several seconds to produce a complete response. Without streaming, users stare at a blank screen until the entire response is ready. Streaming lets you push tokens to the client as they are generated, dramatically improving perceived latency and user experience.

Three protocols dominate the streaming landscape for LLM applications: Server-Sent Events (SSE), WebSockets, and HTTP chunked transfer encoding. Each comes with distinct tradeoffs in complexity, browser support, and bidirectional capability.

## Server-Sent Events: The Default Choice

SSE is a unidirectional protocol built on top of standard HTTP. The server pushes a stream of events over a long-lived connection. It is the protocol OpenAI, Anthropic, and most LLM providers use for their streaming endpoints.

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Edge as Edge Worker
    participant LLM as LLM Provider
    participant DB as Logs and Trace
    Client->>Edge: POST /chat (stream=true)
    Edge->>LLM: messages.create(stream=true)
    loop Each token
        LLM-->>Edge: SSE chunk delta
        Edge-->>Client: SSE chunk delta
        Edge->>DB: append token to span
    end
    LLM-->>Edge: stop_reason=end_turn
    Edge-->>Client: event: done
    Edge->>DB: finalize trace
```

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI()

async def generate_tokens(prompt: str):
    """Simulate LLM token generation (the prompt is ignored in this stub)."""
    words = ["Hello", " there!", " I", " am", " an", " AI", " assistant."]
    for token in words:
        yield token
        await asyncio.sleep(0.1)

@app.post("/v1/chat/completions")
async def stream_chat(request: dict):
    prompt = request.get("prompt", "")

    async def event_stream():
        async for token in generate_tokens(prompt):
            chunk = {
                "choices": [{"delta": {"content": token}, "finish_reason": None}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
```

The `X-Accel-Buffering: no` header tells reverse proxies like Nginx to disable response buffering, which is critical for real-time streaming. The `Cache-Control: no-cache` header prevents intermediaries from caching the stream.
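On the client side, the stream can be consumed incrementally with nothing beyond the standard library. The sketch below assumes the endpoint above is running locally; `parse_sse_line` is just one way to slice the `data:` framing, not part of any SSE library:

```python
import json
import urllib.request

def parse_sse_line(line: str):
    """Extract the token from one SSE line; return None for blank
    keep-alive lines and for the [DONE] sentinel."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content", "")

def stream_completion(prompt: str, url: str = "http://localhost:8000/v1/chat/completions"):
    """Yield tokens from the SSE endpoint above as they arrive."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw_line in resp:  # HTTPResponse iterates line by line
            token = parse_sse_line(raw_line.decode())
            if token is not None:
                yield token
```

The important detail is reading line by line as bytes arrive, not calling `resp.read()` once, which would defeat the point of streaming.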

## WebSockets: When You Need Bidirectional Communication

WebSockets provide full-duplex communication over a single TCP connection. Use WebSockets when the client needs to send data during generation, such as cancellation signals, follow-up context, or tool results mid-stream.

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

# generate_tokens is the async token generator from the SSE example above.
app = FastAPI()

@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            prompt = data.get("prompt", "")

            async for token in generate_tokens(prompt):
                await websocket.send_json({
                    "type": "token",
                    "content": token,
                })

            await websocket.send_json({
                "type": "done",
                "usage": {"prompt_tokens": 10, "completion_tokens": 7},
            })
    except WebSocketDisconnect:
        pass
```
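The handler above streams each response to completion. Mid-stream cancellation, one of the main reasons to reach for WebSockets in the first place, can be factored as a race between the token stream and a pending receive. This is a sketch of that pattern, not a fixed protocol; the `"cancelled"` message shape is an assumption:

```python
import asyncio

async def stream_with_cancel(tokens, send, recv_cancel):
    """Stream tokens until exhausted or until recv_cancel resolves.

    tokens: async iterator of tokens
    send: async callable delivering a dict to the client
    recv_cancel: awaitable that resolves when the client asks to cancel
    Returns True if the stream completed, False if it was cancelled.
    """
    cancel_task = asyncio.ensure_future(recv_cancel)
    try:
        async for token in tokens:
            if cancel_task.done():
                await send({"type": "cancelled"})
                return False
            await send({"type": "token", "content": token})
        await send({"type": "done"})
        return True
    finally:
        cancel_task.cancel()
```

Inside the handler above, this would be called as `await stream_with_cancel(generate_tokens(prompt), websocket.send_json, websocket.receive_json())`, so any message the client sends during generation interrupts the stream.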

## HTTP Chunked Transfer: The Simplest Approach

HTTP chunked transfer encoding sends the response body in chunks without knowing the total size upfront. It requires no special protocol support, works everywhere HTTP works, and is the simplest to implement. The downside is that it lacks the structured event format of SSE and the bidirectionality of WebSockets.

```python
@app.post("/v1/generate")
async def chunked_generate(request: dict):
    async def chunked_response():
        async for token in generate_tokens(request.get("prompt", "")):
            yield token

    return StreamingResponse(
        chunked_response(),
        media_type="text/plain",
    )
```
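On the consuming side, the key is again to read in small increments rather than buffering the whole body. A minimal sketch, written against any binary file-like stream (such as the response object returned by `urllib.request.urlopen`); the incremental decoder guards against a multi-byte UTF-8 character straddling a read boundary:

```python
import codecs

def iter_chunks(stream, size: int = 64):
    """Yield text from a binary stream in small reads, so tokens
    surface as soon as they arrive rather than after the full body."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```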

## Error Handling During Streams

Errors during streaming are tricky because HTTP status codes are sent before the body. Once the stream starts, you cannot change the status code. The standard pattern is to embed errors inside the stream itself.

```python
async def safe_event_stream(prompt: str):
    try:
        async for token in generate_tokens(prompt):
            chunk = {"choices": [{"delta": {"content": token}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
    except Exception as e:
        error_event = {
            "error": {
                "message": str(e),
                "type": "stream_error",
                "code": "generation_failed",
            }
        }
        yield f"data: {json.dumps(error_event)}\n\n"
    finally:
        yield "data: [DONE]\n\n"
```
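On the client, the HTTP status is already 200 by the time an error arrives, so error detection has to happen per event. A small classifier for the decoded payloads of the stream above; the three-way return shape is just one reasonable convention:

```python
import json

def handle_stream_event(payload: str):
    """Classify one decoded SSE payload from the stream above.
    Returns ("token", text), ("error", message), or ("done", None)."""
    if payload == "[DONE]":
        return ("done", None)
    data = json.loads(payload)
    if "error" in data:
        return ("error", data["error"]["message"])
    return ("token", data["choices"][0]["delta"]["content"])
```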

## Protocol Selection Guide

Choose **SSE** when your application follows a request-response pattern where the client sends a prompt and receives a streamed response. It works behind most proxies without configuration, and the browser `EventSource` API provides automatic reconnection. Note, though, that `EventSource` only issues GET requests, so POST-based endpoints like the ones above are typically consumed with `fetch` and a readable stream instead.

Choose **WebSockets** when you need the client to send cancellation signals, provide tool call results during generation, or maintain a persistent conversational session with server-push notifications.

Choose **HTTP chunked transfer** when you need maximum compatibility, your consumers are backend services rather than browsers, or you are building internal microservice communication.

## FAQ

### When should I use SSE over WebSockets for LLM streaming?

Use SSE when your pattern is unidirectional: the client sends a prompt and the server streams back tokens. SSE is simpler to implement, works through HTTP proxies without special configuration, has built-in browser reconnection via EventSource, and uses standard HTTP semantics for authentication. Most production LLM APIs, including OpenAI and Anthropic, use SSE.

### How do I handle connection drops during a long LLM stream?

For SSE, include an `id` field with each event. The browser EventSource API sends the last received ID in a `Last-Event-ID` header on reconnection, letting your server resume from where it left off. For WebSockets, implement application-level heartbeats and reconnection logic with exponential backoff. In both cases, cache partial generation state on the server keyed by a request ID so you can resume.
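A minimal server-side sketch of this pattern, assuming sequential integer event ids and a cached token list keyed by request id (the caching itself is out of scope here):

```python
import json
from typing import List, Optional

def format_sse_event(event_id: int, token: str) -> str:
    """Render one SSE event with an id that the browser will echo back
    in the Last-Event-ID header after a reconnect."""
    chunk = {"choices": [{"delta": {"content": token}}]}
    return f"id: {event_id}\ndata: {json.dumps(chunk)}\n\n"

def events_to_replay(cached_tokens: List[str], last_event_id: Optional[str]) -> List[str]:
    """Re-render only the events the client has not yet seen, assuming
    ids are sequential indices into the cached partial generation."""
    start = int(last_event_id) + 1 if last_event_id else 0
    return [format_sse_event(i, tok) for i, tok in enumerate(cached_tokens) if i >= start]
```

In a FastAPI handler, the id would come from `request.headers.get("last-event-id")`; on reconnect you replay `events_to_replay(...)` first, then continue streaming new tokens.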

### Why does my SSE stream appear to arrive all at once instead of token by token?

This is almost always caused by response buffering in a reverse proxy (Nginx, AWS ALB, Cloudflare) or in your application server. Set the `X-Accel-Buffering: no` header for Nginx, disable proxy buffering in your load balancer, and ensure your ASGI server (uvicorn) is not batching output. Also check that your client is reading the stream incrementally rather than awaiting the full response.

---

#StreamingAPIs #ServerSentEvents #WebSockets #FastAPI #LLMAPIDesign #AgenticAI #LearnAI #AIEngineering

