Learn Agentic AI

WebSocket Transport for Low-Latency Agent Communication

Enable WebSocket transport in the OpenAI Agents SDK for persistent connections, reduced latency, and faster multi-turn agent interactions using set_default_openai_responses_transport.

Why WebSocket Transport Matters for Agents

By default, the OpenAI Agents SDK uses HTTP for every API call. Each tool call, each generation, each handoff results in a new HTTP request — a new TCP connection (or at least a new request on a keep-alive connection), TLS handshake overhead, and HTTP header parsing. For a single agent call, this overhead is negligible. For a multi-agent workflow with ten tool calls and three handoffs, it adds up.

WebSocket transport replaces these individual HTTP requests with a single persistent connection. The agent opens a WebSocket to the OpenAI API once, and all subsequent messages flow over that connection with minimal overhead. The result is measurably lower latency for multi-turn and tool-heavy agent interactions.

Enabling WebSocket Transport

The SDK provides a one-line configuration to switch to WebSocket transport:

from agents import set_default_openai_responses_transport

# Enable WebSocket transport globally
set_default_openai_responses_transport("websocket")

That is it. Every subsequent Runner.run() call will use WebSocket instead of HTTP. No changes to your agent definitions, tools, or handoffs are needed.

How It Works Under the Hood

When you set the transport to "websocket", the SDK:

  1. Opens a persistent WebSocket connection to the OpenAI Responses API
  2. Sends agent generation requests as WebSocket messages
  3. Receives streaming responses over the same connection
  4. Keeps the connection alive across multiple tool call rounds within a single Runner.run()
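The single-connection lifecycle above can be illustrated with a toy session object (a sketch using a stand-in class, not the SDK's actual internals):

```python
class ToyWebSocketSession:
    """Stand-in for a persistent connection: opened once, reused every turn."""

    def __init__(self):
        self.connects = 0  # how many times a connection was opened
        self.sent = []     # messages that flowed over the one connection

    def connect(self):
        self.connects += 1

    def send(self, message: str) -> str:
        self.sent.append(message)
        return f"response to {message}"


session = ToyWebSocketSession()
session.connect()  # one handshake for the whole run

# Four turns (initial generation plus three tool-result rounds) reuse the connection
for turn in ["generate", "tool_result_1", "tool_result_2", "tool_result_3"]:
    session.send(turn)

print(session.connects, len(session.sent))  # → 1 4
```

With HTTP transport, each of those four turns would have been its own request; here a single `connect()` serves all of them.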

The key performance benefit is in multi-turn interactions. Consider an agent that calls three tools sequentially:

HTTP transport (default):

  • Request 1: Initial generation -> Response (tool call) — ~200ms overhead
  • Request 2: Tool result -> Response (tool call) — ~200ms overhead
  • Request 3: Tool result -> Response (tool call) — ~200ms overhead
  • Request 4: Tool result -> Final response — ~200ms overhead
  • Total overhead: ~800ms just from HTTP round trips

WebSocket transport:

  • Connection established once: ~300ms
  • Message 1: Initial generation -> Response (tool call) — ~20ms overhead
  • Message 2: Tool result -> Response (tool call) — ~20ms overhead
  • Message 3: Tool result -> Response (tool call) — ~20ms overhead
  • Message 4: Tool result -> Final response — ~20ms overhead
  • Total overhead: ~380ms

For tool-heavy workflows, WebSocket transport can roughly halve transport overhead, which in practice translates into a 30-50% reduction in total latency.
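The arithmetic above can be written down directly (the per-request and handshake costs are the illustrative figures from this section, not measured constants):

```python
HTTP_OVERHEAD_MS = 200  # illustrative per-request HTTP cost
WS_HANDSHAKE_MS = 300   # illustrative one-time WebSocket setup cost
WS_MESSAGE_MS = 20      # illustrative per-message WebSocket cost


def http_overhead(round_trips: int) -> int:
    """Total transport overhead when every round trip is a fresh HTTP request."""
    return round_trips * HTTP_OVERHEAD_MS


def ws_overhead(round_trips: int) -> int:
    """Total transport overhead with one handshake plus a small per-message cost."""
    return WS_HANDSHAKE_MS + round_trips * WS_MESSAGE_MS


print(http_overhead(4))  # → 800
print(ws_overhead(4))    # → 380
```

With these figures, the crossover is at two round trips: a single-request run is cheaper over HTTP, while anything with two or more round trips favors WebSocket.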


Benchmarking the Difference

Let us build a benchmark to measure the actual impact:

from agents import Agent, Runner, function_tool, set_default_openai_responses_transport
import asyncio
import time


@function_tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    return f"Weather in {city}: 22C, partly cloudy"


@function_tool
def get_population(city: str) -> str:
    """Get population of a city."""
    return f"Population of {city}: 8.3 million"


@function_tool
def get_timezone(city: str) -> str:
    """Get timezone of a city."""
    return f"Timezone of {city}: UTC+5:30"


agent = Agent(
    name="CityInfoAgent",
    model="gpt-4.1",
    instructions=(
        "When asked about a city, always call all three tools "
        "(weather, population, timezone) before responding."
    ),
    tools=[get_weather, get_population, get_timezone],
)


async def benchmark_transport(transport: str, iterations: int = 5):
    """Benchmark agent runs with the specified transport."""
    set_default_openai_responses_transport(transport)

    durations = []
    for i in range(iterations):
        start = time.monotonic()
        result = await Runner.run(
            agent,
            input="Tell me about Mumbai.",
        )
        elapsed = time.monotonic() - start
        durations.append(elapsed)

    avg = sum(durations) / len(durations)
    p95 = sorted(durations)[int(len(durations) * 0.95)]  # rough p95 for small sample counts
    return {"transport": transport, "avg_ms": avg * 1000, "p95_ms": p95 * 1000}


async def main():
    print("Benchmarking HTTP transport...")
    http_results = await benchmark_transport("http", iterations=10)
    print(f"  HTTP  - avg: {http_results['avg_ms']:.0f}ms, p95: {http_results['p95_ms']:.0f}ms")

    print("Benchmarking WebSocket transport...")
    ws_results = await benchmark_transport("websocket", iterations=10)
    print(f"  WS    - avg: {ws_results['avg_ms']:.0f}ms, p95: {ws_results['p95_ms']:.0f}ms")

    improvement = (1 - ws_results["avg_ms"] / http_results["avg_ms"]) * 100
    print(f"  Improvement: {improvement:.1f}%")

asyncio.run(main())

In typical benchmarks, you will see 30-50% latency reduction for agents with three or more tool calls per run.
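The benchmark's index-based p95 is only a rough cut for small sample counts; with more iterations, the standard library's `statistics.quantiles` gives a properly interpolated percentile:

```python
from statistics import quantiles


def p95(samples) -> float:
    """Interpolated 95th percentile (requires at least two samples)."""
    return quantiles(samples, n=20)[-1]  # last of 19 cut points = p95


durations = [x / 1000 for x in range(100, 200)]  # fake latencies: 100ms..199ms
print(f"p95: {p95(durations) * 1000:.1f}ms")
```

For 100 evenly spaced samples this interpolates between the 95th and 96th values rather than snapping to the maximum, which the crude index approach does at small n.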

Per-Agent Transport Configuration

You can also set the transport per-agent rather than globally:

from agents import Agent

# This agent uses WebSocket for its low-latency requirement
fast_agent = Agent(
    name="FastAgent",
    model="gpt-4.1",
    instructions="Respond quickly using available tools.",
    tools=[get_weather, get_population],
    model_settings={"transport": "websocket"},
)

# This agent uses default HTTP (simpler debugging)
debug_agent = Agent(
    name="DebugAgent",
    model="gpt-4.1",
    instructions="Process requests for debugging and analysis.",
)

Connection Management

WebSocket connections need lifecycle management in production. The SDK handles most of this automatically, but you should be aware of the behavior:

from agents import set_default_openai_responses_transport, Runner
import asyncio


async def handle_request(user_input: str):
    """Each Runner.run() manages its own WebSocket lifecycle."""
    # The SDK opens a WebSocket for this run
    result = await Runner.run(agent, input=user_input)
    # The WebSocket is closed when the run completes
    return result.final_output


async def handle_concurrent_requests(inputs: list[str]):
    """Concurrent runs each get their own WebSocket connection."""
    tasks = [handle_request(inp) for inp in inputs]
    results = await asyncio.gather(*tasks)
    return results


# Enable WebSocket globally
set_default_openai_responses_transport("websocket")

# Handle 10 concurrent requests — each gets its own connection
asyncio.run(handle_concurrent_requests(["Query " + str(i) for i in range(10)]))

Each Runner.run() call manages its own WebSocket connection. Concurrent runs create concurrent connections. This is safe and correct — WebSocket connections are lightweight, and the OpenAI API supports many simultaneous connections per API key.
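If you do need to cap simultaneous connections (for example, to stay under a per-key connection limit), a generic semaphore wrapper works; this is plain asyncio, not an SDK feature:

```python
import asyncio


async def bounded_gather(coros, limit: int = 5):
    """Run coroutines concurrently, but at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        async with semaphore:  # at most `limit` tasks inside this block at once
            return await coro

    return await asyncio.gather(*(run_one(c) for c in coros))


async def fake_request(i: int) -> str:
    await asyncio.sleep(0.01)  # stand-in for a Runner.run() call
    return f"result {i}"


results = asyncio.run(bounded_gather([fake_request(i) for i in range(10)], limit=3))
print(results[0], len(results))  # → result 0 10
```

Swap `fake_request` for your real request handler and at most three WebSocket connections will be open at any moment, while `gather` still preserves result order.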

Streaming with WebSocket Transport

WebSocket transport pairs naturally with streaming, since the connection is already persistent:

from agents import Agent, Runner, set_default_openai_responses_transport

set_default_openai_responses_transport("websocket")

agent = Agent(
    name="StreamingAgent",
    model="gpt-4.1",
    instructions="Provide detailed answers.",
)


async def stream_response(user_input: str):
    """Stream agent output over WebSocket transport."""
    result = Runner.run_streamed(agent, input=user_input)

    async for event in result.stream_events():
        if event.type == "raw_response_event":
            if hasattr(event.data, "delta") and event.data.delta:
                print(event.data.delta, end="", flush=True)

    print()  # Newline after streaming completes
    return result.final_output

The combination of WebSocket transport and streaming gives you the lowest possible time-to-first-token for agent responses.
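To quantify that, time the arrival of the first streamed event. The helper below works with any async iterator; the event stream here is a stand-in, not the SDK's:

```python
import asyncio
import time


async def time_to_first_event(stream):
    """Seconds from call until the first event arrives from an async iterator."""
    start = time.monotonic()
    async for _event in stream:
        return time.monotonic() - start
    return None  # stream produced no events


async def fake_event_stream():
    await asyncio.sleep(0.05)  # simulated time-to-first-token
    yield "first token"
    yield "second token"


ttft = asyncio.run(time_to_first_event(fake_event_stream()))
print(f"TTFT: {ttft * 1000:.0f}ms")
```

In a real setup you would pass `result.stream_events()` in place of the fake stream and compare TTFT across the two transports.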

When to Use WebSocket Transport

Use WebSocket when:

  • Your agents make three or more tool calls per run
  • You have multi-agent workflows with handoffs
  • Latency is a key metric (real-time chat, voice agents)
  • You are running streaming responses

Stick with HTTP when:

  • Your agents are simple single-turn, no-tool interactions
  • You are debugging and want clear request/response pairs in your network inspector
  • Your infrastructure (proxies, load balancers) does not support WebSocket passthrough
  • You are behind a corporate firewall that blocks WebSocket upgrades
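One way to encode these rules of thumb is a small helper (a heuristic sketch, not part of the SDK):

```python
def choose_transport(
    tool_calls_per_run: int,
    uses_handoffs: bool = False,
    latency_sensitive: bool = False,
    websocket_supported: bool = True,  # proxies or firewalls may block upgrades
) -> str:
    """Pick a transport using the criteria above."""
    if not websocket_supported:
        return "http"
    if tool_calls_per_run >= 3 or uses_handoffs or latency_sensitive:
        return "websocket"
    return "http"


print(choose_transport(tool_calls_per_run=5))                          # → websocket
print(choose_transport(tool_calls_per_run=0, latency_sensitive=True))  # → websocket
print(choose_transport(tool_calls_per_run=1))                          # → http
```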

Infrastructure Considerations

If you deploy behind a reverse proxy or load balancer, ensure WebSocket support is enabled:

# nginx.conf
location /api/agent/ {
    proxy_pass http://agent-service:8000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}

For Kubernetes ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agent-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/websocket-services: "agent-service"
spec:
  rules:
    - host: agents.example.com
      http:
        paths:
          - path: /api/
            pathType: Prefix
            backend:
              service:
                name: agent-service
                port:
                  number: 8000
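To check that a proxy actually passes WebSocket upgrades through, it helps to see what the client handshake looks like. The builder below follows RFC 6455 (a sketch for inspection; real clients should use a WebSocket library):

```python
import base64
import os


def websocket_handshake_request(host: str, path: str = "/") -> str:
    """Build the HTTP/1.1 Upgrade request a WebSocket client sends first."""
    key = base64.b64encode(os.urandom(16)).decode()  # random 16-byte nonce
    return "\r\n".join([
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",  # blank line terminates the header block
    ])


request = websocket_handshake_request("agents.example.com", "/api/agent/")
print(request.splitlines()[0])  # → GET /api/agent/ HTTP/1.1
```

A proxy with passthrough configured correctly answers with `HTTP/1.1 101 Switching Protocols`; anything else (a 200, 400, or a stripped `Upgrade` header) means the hop is downgrading the connection.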

WebSocket transport is a straightforward optimization that yields meaningful latency improvements for tool-heavy and multi-agent workflows. Enable it globally, benchmark it against HTTP for your specific use case, and ensure your infrastructure supports WebSocket passthrough. The single-line configuration change makes it one of the easiest performance wins available.

Written by

CallSphere Team
