Learn Agentic AI

WebSocket Transport for Low-Latency Agent Communication

Enable WebSocket transport in the OpenAI Agents SDK for persistent connections, reduced latency, and faster multi-turn agent interactions using set_default_openai_responses_transport.

Why WebSocket Transport Matters for Agents

By default, the OpenAI Agents SDK uses HTTP for every API call. Each tool call, each generation, each handoff results in a new HTTP request — a new TCP connection (or at least a new request on a keep-alive connection), TLS handshake overhead, and HTTP header parsing. For a single agent call, this overhead is negligible. For a multi-agent workflow with ten tool calls and three handoffs, it adds up.

WebSocket transport replaces these individual HTTP requests with a single persistent connection. The agent opens a WebSocket to the OpenAI API once, and all subsequent messages flow over that connection with minimal overhead. The result is measurably lower latency for multi-turn and tool-heavy agent interactions.

Enabling WebSocket Transport

The SDK provides a one-line configuration to switch to WebSocket transport:

from agents import set_default_openai_responses_transport

# Enable WebSocket transport globally
set_default_openai_responses_transport("websocket")

That is it. Every subsequent Runner.run() call will use WebSocket instead of HTTP. No changes to your agent definitions, tools, or handoffs are needed.

How It Works Under the Hood

When you set the transport to "websocket", the SDK:

  1. Opens a persistent WebSocket connection to the OpenAI Responses API
  2. Sends agent generation requests as WebSocket messages
  3. Receives streaming responses over the same connection
  4. Keeps the connection alive across multiple tool call rounds within a single Runner.run()
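The single-connection lifecycle above can be illustrated with a toy session object (a sketch using a stand-in class, not the SDK's actual internals):

```python
class ToyWebSocketSession:
    """Stand-in for a persistent connection: opened once, reused every turn."""

    def __init__(self):
        self.connects = 0  # how many times a connection was opened
        self.sent = []     # messages that flowed over the one connection

    def connect(self):
        self.connects += 1

    def send(self, message: str) -> str:
        self.sent.append(message)
        return f"response to {message}"


session = ToyWebSocketSession()
session.connect()  # one handshake for the whole run

# Four turns (initial generation plus three tool-result rounds) reuse the connection
for turn in ["generate", "tool_result_1", "tool_result_2", "tool_result_3"]:
    session.send(turn)

print(session.connects, len(session.sent))  # → 1 4
```

With HTTP transport, each of those four turns would have been its own request; here a single `connect()` serves all of them.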

The key performance benefit is in multi-turn interactions. Consider an agent that calls three tools sequentially:

HTTP transport (default):

  • Request 1: Initial generation -> Response (tool call) — ~200ms overhead
  • Request 2: Tool result -> Response (tool call) — ~200ms overhead
  • Request 3: Tool result -> Response (tool call) — ~200ms overhead
  • Request 4: Tool result -> Final response — ~200ms overhead
  • Total overhead: ~800ms just from HTTP round trips

WebSocket transport:

  • Connection established once: ~300ms
  • Message 1: Initial generation -> Response (tool call) — ~20ms overhead
  • Message 2: Tool result -> Response (tool call) — ~20ms overhead
  • Message 3: Tool result -> Response (tool call) — ~20ms overhead
  • Message 4: Tool result -> Final response — ~20ms overhead
  • Total overhead: ~380ms

For tool-heavy workflows, WebSocket transport can roughly halve transport overhead, which in practice translates into a 30-50% reduction in total latency.
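The arithmetic above can be written down directly (the per-request and handshake costs are the illustrative figures from this section, not measured constants):

```python
HTTP_OVERHEAD_MS = 200  # illustrative per-request HTTP cost
WS_HANDSHAKE_MS = 300   # illustrative one-time WebSocket setup cost
WS_MESSAGE_MS = 20      # illustrative per-message WebSocket cost


def http_overhead(round_trips: int) -> int:
    """Total transport overhead when every round trip is a fresh HTTP request."""
    return round_trips * HTTP_OVERHEAD_MS


def ws_overhead(round_trips: int) -> int:
    """Total transport overhead with one handshake plus a small per-message cost."""
    return WS_HANDSHAKE_MS + round_trips * WS_MESSAGE_MS


print(http_overhead(4))  # → 800
print(ws_overhead(4))    # → 380
```

With these figures, the crossover is at two round trips: a single-request run is cheaper over HTTP, while anything with two or more round trips favors WebSocket.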


Benchmarking the Difference

Let us build a benchmark to measure the actual impact:

from agents import Agent, Runner, function_tool, set_default_openai_responses_transport
import asyncio
import time


@function_tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f"Weather in {city}: 22C, partly cloudy"


@function_tool
def get_population(city: str) -> str:
    """Get population of a city."""
    return f"Population of {city}: 8.3 million"


@function_tool
def get_timezone(city: str) -> str:
    """Get timezone of a city."""
    return f"Timezone of {city}: UTC+5:30"


agent = Agent(
    name="CityInfoAgent",
    model="gpt-4.1",
    instructions=(
        "When asked about a city, always call all three tools "
        "(weather, population, timezone) before responding."
    ),
    tools=[get_weather, get_population, get_timezone],
)


async def benchmark_transport(transport: str, iterations: int = 5):
    """Benchmark agent runs with the specified transport."""
    set_default_openai_responses_transport(transport)

    durations = []
    for i in range(iterations):
        start = time.monotonic()
        result = await Runner.run(
            agent,
            input="Tell me about Mumbai.",
        )
        elapsed = time.monotonic() - start
        durations.append(elapsed)

    avg = sum(durations) / len(durations)
    p95 = sorted(durations)[int(len(durations) * 0.95)]  # rough p95 for small sample counts
    return {"transport": transport, "avg_ms": avg * 1000, "p95_ms": p95 * 1000}


async def main():
    print("Benchmarking HTTP transport...")
    http_results = await benchmark_transport("http", iterations=10)
    print(f"  HTTP  - avg: {http_results['avg_ms']:.0f}ms, p95: {http_results['p95_ms']:.0f}ms")

    print("Benchmarking WebSocket transport...")
    ws_results = await benchmark_transport("websocket", iterations=10)
    print(f"  WS    - avg: {ws_results['avg_ms']:.0f}ms, p95: {ws_results['p95_ms']:.0f}ms")

    improvement = (1 - ws_results["avg_ms"] / http_results["avg_ms"]) * 100
    print(f"  Improvement: {improvement:.1f}%")

asyncio.run(main())

In typical benchmarks, you will see 30-50% latency reduction for agents with three or more tool calls per run.
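The benchmark's index-based p95 is only a rough cut for small sample counts; with more iterations, the standard library's `statistics.quantiles` gives a properly interpolated percentile:

```python
from statistics import quantiles


def p95(samples) -> float:
    """Interpolated 95th percentile (requires at least two samples)."""
    return quantiles(samples, n=20)[-1]  # last of 19 cut points = p95


durations = [x / 1000 for x in range(100, 200)]  # fake latencies: 100ms..199ms
print(f"p95: {p95(durations) * 1000:.1f}ms")
```

For 100 evenly spaced samples this interpolates between the 95th and 96th values rather than snapping to the maximum, which the crude index approach does at small n.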

Per-Agent Transport Configuration

You can also set the transport per-agent rather than globally:

from agents import Agent

# This agent uses WebSocket for its low-latency requirement
fast_agent = Agent(
    name="FastAgent",
    model="gpt-4.1",
    instructions="Respond quickly using available tools.",
    tools=[get_weather, get_population],
    model_settings={"transport": "websocket"},
)

# This agent uses default HTTP (simpler debugging)
debug_agent = Agent(
    name="DebugAgent",
    model="gpt-4.1",
    instructions="Process requests for debugging and analysis.",
)

Connection Management

WebSocket connections need lifecycle management in production. The SDK handles most of this automatically, but you should be aware of the behavior:

from agents import set_default_openai_responses_transport, Runner
import asyncio


async def handle_request(user_input: str):
    """Each Runner.run() manages its own WebSocket lifecycle."""
    # The SDK opens a WebSocket for this run
    result = await Runner.run(agent, input=user_input)
    # The WebSocket is closed when the run completes
    return result.final_output


async def handle_concurrent_requests(inputs: list[str]):
    """Concurrent runs each get their own WebSocket connection."""
    tasks = [handle_request(inp) for inp in inputs]
    results = await asyncio.gather(*tasks)
    return results


# Enable WebSocket globally
set_default_openai_responses_transport("websocket")

# Handle 10 concurrent requests — each gets its own connection
asyncio.run(handle_concurrent_requests(["Query " + str(i) for i in range(10)]))

Each Runner.run() call manages its own WebSocket connection. Concurrent runs create concurrent connections. This is safe and correct — WebSocket connections are lightweight, and the OpenAI API supports many simultaneous connections per API key.
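If you do need to cap simultaneous connections (for example, to stay under a per-key connection limit), a generic semaphore wrapper works; this is plain asyncio, not an SDK feature:

```python
import asyncio


async def bounded_gather(coros, limit: int = 5):
    """Run coroutines concurrently, but at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        async with semaphore:  # at most `limit` tasks inside this block at once
            return await coro

    return await asyncio.gather(*(run_one(c) for c in coros))


async def fake_request(i: int) -> str:
    await asyncio.sleep(0.01)  # stand-in for a Runner.run() call
    return f"result {i}"


results = asyncio.run(bounded_gather([fake_request(i) for i in range(10)], limit=3))
print(results[0], len(results))  # → result 0 10
```

Swap `fake_request` for your real request handler and at most three WebSocket connections will be open at any moment, while `gather` still preserves result order.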

Streaming with WebSocket Transport

WebSocket transport pairs naturally with streaming, since the connection is already persistent:

from agents import Agent, Runner, set_default_openai_responses_transport

set_default_openai_responses_transport("websocket")

agent = Agent(
    name="StreamingAgent",
    model="gpt-4.1",
    instructions="Provide detailed answers.",
)


async def stream_response(user_input: str):
    """Stream agent output over WebSocket transport."""
    result = Runner.run_streamed(agent, input=user_input)

    async for event in result.stream_events():
        if event.type == "raw_response_event":
            if hasattr(event.data, "delta") and event.data.delta:
                print(event.data.delta, end="", flush=True)

    print()  # Newline after streaming completes
    return result.final_output

The combination of WebSocket transport and streaming gives you the lowest possible time-to-first-token for agent responses.
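To quantify that, time the arrival of the first streamed event. The helper below works with any async iterator; the event stream here is a stand-in, not the SDK's:

```python
import asyncio
import time


async def time_to_first_event(stream):
    """Seconds from call until the first event arrives from an async iterator."""
    start = time.monotonic()
    async for _event in stream:
        return time.monotonic() - start
    return None  # stream produced no events


async def fake_event_stream():
    await asyncio.sleep(0.05)  # simulated time-to-first-token
    yield "first token"
    yield "second token"


ttft = asyncio.run(time_to_first_event(fake_event_stream()))
print(f"TTFT: {ttft * 1000:.0f}ms")
```

In a real setup you would pass `result.stream_events()` in place of the fake stream and compare TTFT across the two transports.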

When to Use WebSocket Transport

Use WebSocket when:

  • Your agents make three or more tool calls per run
  • You have multi-agent workflows with handoffs
  • Latency is a key metric (real-time chat, voice agents)
  • You are running streaming responses

Stick with HTTP when:

  • Your agents are simple single-turn, no-tool interactions
  • You are debugging and want clear request/response pairs in your network inspector
  • Your infrastructure (proxies, load balancers) does not support WebSocket passthrough
  • You are behind a corporate firewall that blocks WebSocket upgrades
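One way to encode these rules of thumb is a small helper (a heuristic sketch, not part of the SDK):

```python
def choose_transport(
    tool_calls_per_run: int,
    uses_handoffs: bool = False,
    latency_sensitive: bool = False,
    websocket_supported: bool = True,  # proxies or firewalls may block upgrades
) -> str:
    """Pick a transport using the criteria above."""
    if not websocket_supported:
        return "http"
    if tool_calls_per_run >= 3 or uses_handoffs or latency_sensitive:
        return "websocket"
    return "http"


print(choose_transport(tool_calls_per_run=5))                          # → websocket
print(choose_transport(tool_calls_per_run=0, latency_sensitive=True))  # → websocket
print(choose_transport(tool_calls_per_run=1))                          # → http
```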

Infrastructure Considerations

If you deploy behind a reverse proxy or load balancer, ensure WebSocket support is enabled:

# nginx.conf
location /api/agent/ {
    proxy_pass http://agent-service:8000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}

For Kubernetes ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agent-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/websocket-services: "agent-service"
spec:
  rules:
    - host: agents.example.com
      http:
        paths:
          - path: /api/
            pathType: Prefix
            backend:
              service:
                name: agent-service
                port:
                  number: 8000
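To check that a proxy actually passes WebSocket upgrades through, it helps to see what the client handshake looks like. The builder below follows RFC 6455 (a sketch for inspection; real clients should use a WebSocket library):

```python
import base64
import os


def websocket_handshake_request(host: str, path: str = "/") -> str:
    """Build the HTTP/1.1 Upgrade request a WebSocket client sends first."""
    key = base64.b64encode(os.urandom(16)).decode()  # random 16-byte nonce
    return "\r\n".join([
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",  # blank line terminates the header block
    ])


request = websocket_handshake_request("agents.example.com", "/api/agent/")
print(request.splitlines()[0])  # → GET /api/agent/ HTTP/1.1
```

A proxy with passthrough configured correctly answers with `HTTP/1.1 101 Switching Protocols`; anything else (a 200, 400, or a stripped `Upgrade` header) means the hop is downgrading the connection.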

WebSocket transport is a straightforward optimization that yields meaningful latency improvements for tool-heavy and multi-agent workflows. Enable it globally, benchmark it against HTTP for your specific use case, and ensure your infrastructure supports WebSocket passthrough. The single-line configuration change makes it one of the easiest performance wins available.

Written by

CallSphere Team
