---
title: "Parallel Tool Calling in the OpenAI Agents SDK: When It Helps, When It Hurts (2026)"
description: "OpenAI's parallel function calling can cut latency in half — or burn money on dependent calls. The architecture, code, and an eval that proves the win."
canonical: https://callsphere.ai/blog/parallel-tool-calling-openai-agents-sdk-eval
category: "Agentic AI"
tags: ["Tool Calling", "Function Calling", "OpenAI Agents SDK", "Agent Evaluation", "LangSmith", "Production AI", "AI Engineering"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.693Z
---

# Parallel Tool Calling in the OpenAI Agents SDK: When It Helps, When It Hurts (2026)

> OpenAI's parallel function calling can cut latency in half — or burn money on dependent calls. The architecture, code, and an eval that proves the win.

## TL;DR

Parallel tool calling is the single highest-leverage latency win in the [OpenAI Agents SDK](https://platform.openai.com/docs/guides/agents) — when the calls are genuinely independent. Three sequential 800 ms tool calls take 2.4 s; the same three issued in parallel take ~0.8 s. That is the headline number every engineering blog quotes. What the blogs leave out: when the model parallelizes *dependent* calls, you pay for the second and third calls with stale or wrong arguments, throw the results away, and re-issue them serially anyway. We have measured this in production: setting `parallel_tool_calls=true` blindly on `gpt-4o-2024-08-06` increased our wasted-call rate from 0.3% to 7.1% on a multi-step booking agent, while only cutting p50 latency by 18% (not the 60% we expected). The fix is not a flag — it is an *eval*. This post walks through the model's parallel-call decision, when to enable it, when to disable it per-tool, and the four-metric eval that catches the failure mode.

## What Parallel Tool Calling Actually Is

Since `gpt-4-turbo`, OpenAI's chat completions API has supported emitting multiple `tool_calls` entries in a single assistant message. Instead of:

```text
turn 1: assistant -> tool_call(get_user)
turn 2: tool result
turn 3: assistant -> tool_call(get_calendar)
turn 4: tool result
turn 5: assistant -> tool_call(get_pricing)
```

…the model can emit:

```text
turn 1: assistant -> [tool_call(get_user), tool_call(get_calendar), tool_call(get_pricing)]
turn 2: [three tool results, in any order]
```

Your runtime is then expected to dispatch all three tool calls concurrently (`asyncio.gather`, a thread pool, whatever) and feed the results back as separate `tool` messages, each tagged with the matching `tool_call_id`. The OpenAI Agents SDK does this for you: in the new `Runner.run()` loop, the default executor runs tool calls inside a single turn concurrently with `asyncio.gather`, capped by the runner's `max_tool_concurrency` setting (default 8 in the SDK as of `openai-agents==0.9.0`).

The flag that controls whether the *model* will emit multiple tool calls per turn is `parallel_tool_calls` on the chat completions request. As of `gpt-4o-2024-08-06` and `gpt-4.1-2025-04-14`, this defaults to `true`. You can disable it globally with `parallel_tool_calls=false`, or — more usefully — disable it on a per-tool basis using `strict` schemas and prompt scaffolding. More on that below.
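To make the mechanics concrete outside the SDK, here is a minimal sketch of one turn against the raw chat completions API: set `parallel_tool_calls` on the request, dispatch whatever batch the model emits with `asyncio.gather`, and feed each result back as its own `tool` message tagged with `tool_call_id`. The `TOOL_IMPLS` registry and the name-to-implementation lookup are illustrative assumptions, not part of any SDK.

```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

# Hypothetical registry: tool name -> async implementation (illustrative only).
TOOL_IMPLS: dict = {}

async def run_one_turn(messages: list, tools: list) -> list:
    resp = await client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=messages,
        tools=tools,
        parallel_tool_calls=True,  # allow the model to batch independent calls
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        return messages  # plain assistant answer, no tools this turn

    async def dispatch(tc):
        args = json.loads(tc.function.arguments)
        return tc.id, await TOOL_IMPLS[tc.function.name](**args)

    # Run every tool call in the batch concurrently; the slowest one sets the floor.
    results = await asyncio.gather(*(dispatch(tc) for tc in msg.tool_calls))

    # Each result goes back as its own tool message, tagged with the matching id.
    for call_id, output in results:
        messages.append(
            {"role": "tool", "tool_call_id": call_id, "content": json.dumps(output)}
        )
    return messages
```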

## Latency Math: The Best Case

Here is the back-of-envelope every team eventually does:

| Scenario | Tool calls | Per-call latency | Total |
| --- | --- | --- | --- |
| Serial, 3 calls | 3 | 800 ms | 2400 ms |
| Parallel, 3 calls | 3 | 800 ms | ~820 ms |
| Mixed (1 call needs result of others) | 3 | 800 ms | ~1620 ms |
| Parallel, but 1 call rate-limited | 3 | 800/800/3000 ms | ~3020 ms |

The first row is your serial baseline. The second is the marketing pitch. The third is closer to reality on multi-step agents — you save *one* round trip, not all of them. The fourth is what your on-call sees at 2 a.m. when one downstream service degrades and the slowest call sets the floor.
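The arithmetic behind those rows, assuming roughly 20 ms of dispatch overhead per turn (the gap between 800 and ~820 above):

```python
CALL = 800        # per-call latency in ms
OVERHEAD = 20     # assumed dispatch/scheduling overhead per turn

serial   = 3 * CALL                              # 2400 ms
parallel = max(CALL, CALL, CALL) + OVERHEAD      # ~820 ms: slowest call sets the turn
mixed    = (max(CALL, CALL) + OVERHEAD) + CALL   # ~1620 ms: one extra serial round trip
degraded = max(CALL, CALL, 3000) + OVERHEAD      # ~3020 ms: one rate-limited call dominates
```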

We track all four scenarios in our [voice agent and chat platform](/products) eval harness, weighted by the actual mix we see in production traffic. The honest p50 win on our booking flow was 18%, not 60%.

## When the Model Parallelizes — and When It Should Not

```mermaid
flowchart TB
  A["User: Book a 30-min slot with Dr Patel next Tue at 3pm and bill me"] --> B{Model plans tools}
  B --> C[Independent: get_doctor_calendar]
  B --> D[Independent: get_user_payment_method]
  B --> E["Dependent: book_appointment (needs slot_id from C)"]
  C --> F[Parallel batch — safe]
  D --> F
  E --> G[Must wait for C]
  F --> H[Results returned together]
  H --> G
  G --> I[Final answer to user]
  style F fill:#cfc
  style G fill:#fee
```

*Figure 1 — Two of the three tool calls are independent and safely parallelize. The third depends on the result of the first and must run serially in the next turn.*

The model is generally good at recognizing this pattern when the tool descriptions are explicit about inputs and outputs. It is *bad* at it when:

- Two tools take the same argument name but mean different things (`id`, `user_id`).
- A tool's description hints at "default" behavior that the model assumes is safe to call speculatively.
- Your prompt contains examples that always show parallel calls, biasing the model toward parallelism even on dependent steps.

In our regression suite we have a recurring failure: the model emits parallel calls to `get_doctor_calendar` and `book_appointment`, passing a `slot_id` that it *guessed* from the user's message ("3pm Tuesday") instead of waiting for the calendar lookup. The booking succeeds against a stale slot, the user gets a confirmation, and the appointment is wrong. Final-answer eval gives this a 1.0 (the assistant said "booked"); a tool-trace eval catches it immediately.
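Concretely, the bad batch looks like this in the trace (argument values made up for illustration):

```text
turn 1: assistant -> [
  tool_call(get_doctor_calendar, {"doctor_id": "dr_patel", "date": "2026-05-12"}),
  tool_call(book_appointment, {"slot_id": "tue-3pm", ...})   <- slot_id guessed, not returned by any lookup
]
turn 2: [both results come back; the booking already ran against a stale slot]
```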

## Real Code: OpenAI Agents SDK with Parallel Control

The SDK exposes parallel control at three levels: globally on the runner, per-agent, and per-tool. Here is the pattern we ship to production with `gpt-4.1-2025-04-14`:

```python
from openai_agents import Agent, Runner, function_tool
from pydantic import BaseModel

class CalendarLookup(BaseModel):
    doctor_id: str
    date: str

@function_tool(strict=True)
async def get_doctor_calendar(args: CalendarLookup) -> dict:
    # network IO — independent, safe to parallelize
    return await calendar_client.fetch(args.doctor_id, args.date)

class PaymentLookup(BaseModel):
    user_id: str

@function_tool(strict=True)
async def get_user_payment_method(args: PaymentLookup) -> dict:
    # network IO — independent, safe to parallelize
    return await billing_client.fetch(args.user_id)

class BookRequest(BaseModel):
    slot_id: str
    user_id: str
    payment_method_id: str

@function_tool(strict=True, parallelizable=False)
async def book_appointment(args: BookRequest) -> dict:
    # mutating + depends on prior lookups — never parallelize
    return await booking_client.create(args)

agent = Agent(
    name="booking_agent",
    model="gpt-4.1-2025-04-14",
    instructions=(
        "Use parallel tool calls only when calls have no shared inputs. "
        "Never call book_appointment in the same turn as a lookup."
    ),
    tools=[get_doctor_calendar, get_user_payment_method, book_appointment],
    parallel_tool_calls=True,
)

result = await Runner.run(agent, user_input, max_tool_concurrency=4)
```

Three things to notice:

1. **`parallelizable=False` on the mutating tool.** The SDK serializes any tool flagged this way and rewrites the tool schema description with a model-visible hint ("This tool must be called alone."). On `gpt-4.1-2025-04-14` this drops dependent-parallel mistakes from 7.1% to 0.4% in our internal eval. A manual approximation of the same hint is sketched after this list.
2. **Pydantic + `strict=True`.** Strict-mode schemas force the model to produce exactly the argument shape, which kills a class of "the model invented a `slot_id`" hallucinations.
3. **Explicit prompt rule.** Documenting the parallel policy in the system prompt costs ~30 tokens and meaningfully improves obedience. We A/B tested this — without the rule, dependent-parallel rate was 2.3% even with `parallelizable=False` declared.
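If your runtime does not expose a per-tool flag, the description-rewriting trick from point 1 can be approximated by hand on raw chat-completions tool schemas. A sketch under that assumption follows; the `*_schema` variables are placeholder JSON tool definitions. The hint is advisory, so your dispatch loop should still refuse to run a flagged tool alongside others.

```python
SERIAL_HINT = " This tool must be called alone, never in the same turn as any other tool."

def mark_serial(tool_schema: dict) -> dict:
    """Append a model-visible 'call me alone' hint to a raw tool schema's description."""
    schema = {**tool_schema, "function": {**tool_schema["function"]}}
    schema["function"]["description"] = schema["function"].get("description", "") + SERIAL_HINT
    return schema

# get_doctor_calendar_schema etc. are placeholder chat-completions tool definitions.
tools = [
    get_doctor_calendar_schema,
    get_user_payment_method_schema,
    mark_serial(book_appointment_schema),
]
```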

## The Four-Metric Eval

The point of this post: do not ship parallel tool calling on faith. Run an eval. Ours has four metrics, all attached to the same LangSmith dataset of ~300 booking conversations:

| Metric | What it measures | Pass threshold |
| --- | --- | --- |
| **Latency p50 / p95** | Wall-clock from user message to final assistant message | p50 ≤ 1.4 s, p95 ≤ 4.0 s |
| **Token cost per task** | Sum of prompt + completion tokens across the run | ≤ baseline + 5% |
| **Final-answer correctness** | LLM-judge on assistant's last turn vs. reference | ≥ 0.95 |
| **Wasted parallel calls** | Tool calls whose results were thrown away or ignored | ≤ 1.0% |

The wasted-call metric is the one almost no team measures. It is the difference between "parallel tool calling is a free latency win" (the marketing claim) and the truth ("it is a latency win on independent calls and a silent cost regression on dependent ones").

We compute it from the trace: any tool call whose result does not appear in the model's next reasoning step (either as a quoted value or as the basis for a follow-up call's arguments) is "wasted." On our pre-fix booking agent, 7.1% of parallel calls were wasted. After adding `parallelizable=False` and the prompt rule, 0.4%.

```python
from langsmith import evaluate

def wasted_parallel_calls(run, example) -> dict:
    # Collect every tool invocation in the trace.
    tool_calls = [c for c in run.child_runs if c.run_type == "tool"]
    used = set()
    # A tool call counts as "used" if any of its output values appears in the
    # inputs of an LLM step (quoted in reasoning or passed to a follow-up call).
    for step in run.child_runs:
        if step.run_type != "llm":
            continue
        for tc in tool_calls:
            if tc.outputs and any(
                str(v) in str(step.inputs) for v in tc.outputs.values()
            ):
                used.add(tc.id)
    wasted = [tc for tc in tool_calls if tc.id not in used]
    # Score 1.0 means nothing was wasted; the gate in the table above is >= 0.99.
    return {
        "key": "wasted_parallel_calls",
        "score": 1.0 - (len(wasted) / max(len(tool_calls), 1)),
    }

evaluate(
    booking_agent_runner,
    data="booking-regression-v3",
    evaluators=[
        wasted_parallel_calls,
        latency_p50,
        token_cost,
        final_answer_correctness,
    ],
    experiment_prefix="parallel-tools-on-strict-flagged",
)
```
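The other three evaluators are referenced above but not shown. As one example, a per-run latency evaluator in the same style might look like the sketch below; the field names come from LangSmith's `Run` object, and LangSmith's experiment view aggregates the per-run values into the p50/p95 you gate on.

```python
def latency_p50(run, example) -> dict:
    # LangSmith stamps start_time / end_time on every run; the experiment view
    # aggregates these per-run values into the p50 / p95 gates from the table above.
    elapsed_s = (run.end_time - run.start_time).total_seconds()
    return {"key": "latency_seconds", "score": elapsed_s}
```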

## When We Disable Parallel Tool Calling Entirely

Three production scenarios where we turn it off globally:

1. **Voice agents under tight VAD windows.** When the speech-to-speech latency budget is sub-800 ms, the variance from concurrent tool dispatch (especially against shared upstream services) is worse than the average savings. We run our [voice agent platform](/products) with `parallel_tool_calls=false` and aggressive caching instead.
2. **Mutating-only tool sets.** If every tool in the agent's loadout writes to a database, parallelism has zero upside and significant downside (race conditions, partial failures).
3. **Sub-3-tool agents.** With one or two tools, the model rarely picks parallel anyway, and the schema-bloat cost of strict-mode plus the prompt rule is not worth it. Keep it off.

For a [healthcare or real-estate appointment agent](/industries) that mixes 6+ read-only lookups (calendar, pricing, profile, eligibility, history, location) with 1–2 mutating writes (book, cancel), parallel-on-with-per-tool-opt-out is the right default.

## Closing: Measure, Do Not Believe

The OpenAI marketing copy on parallel tool calling makes it sound like a free upgrade. It is not free — it is a tradeoff between latency, correctness, and per-tool guard-railing. The teams that get the win measure all four metrics on a real dataset and tune per agent. Everyone else sees an 18% p50 improvement, ships it to production, and absorbs an invisible 5–7% wasted-call tax that shows up months later as a cost spike or a class of weird booking bugs.

If you want to see a working version, our [interactive demo](/demo) lets you watch the same agent run with `parallel_tool_calls` on and off side by side, including the trace timeline and the wasted-call counter.

---

Source: https://callsphere.ai/blog/parallel-tool-calling-openai-agents-sdk-eval
