TL;DR

Parallel tool calling is the single highest-leverage latency win in the OpenAI Agents SDK — when the calls are genuinely independent. Three sequential 800 ms tool calls take 2.4 s; the same three issued in parallel take ~0.8 s. That is the headline number every engineering blog quotes. What the blogs leave out: when the model parallelizes dependent calls, you pay for the second and third calls with stale or wrong arguments, throw the results away, and re-issue them serially anyway. We have measured this in production: setting parallel_tool_calls=true blindly on gpt-4o-2024-08-06 increased our wasted-call rate from 0.3% to 7.1% on a multi-step booking agent, while only cutting p50 latency by 18% (not the 60% we expected). The fix is not a flag — it is an eval. This post walks through the model's parallel-call decision, when to enable it, when to disable it per-tool, and the four-metric eval that catches the failure mode.

What Parallel Tool Calling Actually Is

Since gpt-4-turbo, OpenAI's chat completions API has supported emitting multiple tool_calls entries in a single assistant message. Instead of:

```text turn 1: assistant -> tool_call(get_user) turn 2: tool result turn 3: assistant -> tool_call(get_calendar) turn 4: tool result turn 5: assistant -> tool_call(get_pricing) ```

…the model can emit:

```text turn 1: assistant -> [tool_call(get_user), tool_call(get_calendar), tool_call(get_pricing)] turn 2: [three tool results, in any order] ```

Your runtime is then expected to dispatch all three tool calls concurrently (asyncio.gather, a thread pool, whatever) and feed the results back as separate tool messages, each tagged with the matching tool_call_id. The OpenAI Agents SDK does this for you: in the new Runner.run() loop, the default executor runs tool calls inside a single turn concurrently with asyncio.gather, capped by the runner's max_tool_concurrency setting (default 8 in the SDK as of openai-agents==0.9.0).

The flag that controls whether the model will emit multiple tool calls per turn is parallel_tool_calls on the chat completions request. As of gpt-4o-2024-08-06 and gpt-4.1-2025-04-14, this defaults to true. You can disable it globally with parallel_tool_calls=false, or — more usefully — disable it on a per-tool basis using strict schemas and prompt scaffolding. More on that below.

Latency Math: The Best Case

Here is the back-of-envelope every team eventually does:

Scenario	Tool calls	Per-call latency	Total
Serial, 3 calls	3	800 ms	2400 ms
Parallel, 3 calls	3	800 ms	~820 ms
Mixed (1 call needs result of others)	3	800 ms	~1620 ms
Parallel, but 1 call rate-limited	3	800/800/3000 ms	~3020 ms

The first row is your serial baseline. The second is the marketing pitch. The third is closer to reality on multi-step agents — you save one round trip, not all of them. The fourth is what your on-call sees at 2 a.m. when one downstream service degrades and the slowest call sets the floor.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

We track all four scenarios in our voice agent and chat platform eval harness, weighted by the actual mix we see in production traffic. The honest p50 win on our booking flow was 18%, not 60%.

When the Model Parallelizes — and When It Should Not

```mermaid flowchart TB A[User: "Book a 30-min slot with Dr Patel next Tue at 3pm and bill me"] --> B{Model plans tools} B --> C[Independent: get_doctor_calendar] B --> D[Independent: get_user_payment_method] B --> E[Dependent: book_appointment(needs slot_id from C)] C --> F[Parallel batch — safe] D --> F E --> G[Must wait for C] F --> H[Results returned together] H --> G G --> I[Final answer to user] style F fill:#cfc style G fill:#fee ```

Figure 1 — Two of the three tool calls are independent and safely parallelize. The third depends on the result of the first and must run serially in the next turn.

The model is generally good at recognizing this pattern when the tool descriptions are explicit about inputs and outputs. It is bad at it when:

Two tools take the same argument name but mean different things (id, user_id).
A tool's description hints at "default" behavior that the model assumes is safe to call speculatively.
Your prompt contains examples that always show parallel calls, biasing the model toward parallelism even on dependent steps.

In our regression suite we have a recurring failure: the model emits parallel calls to get_doctor_calendar and book_appointment, passing a slot_id that it guessed from the user's message ("3pm Tuesday") instead of waiting for the calendar lookup. The booking succeeds against a stale slot, the user gets a confirmation, and the appointment is wrong. Final-answer eval gives this a 1.0 (the assistant said "booked"); a tool-trace eval catches it immediately.

Real Code: OpenAI Agents SDK with Parallel Control

The SDK exposes parallel control at three levels: globally on the runner, per-agent, and per-tool. Here is the pattern we ship to production with gpt-4.1-2025-04-14:

```python from openai_agents import Agent, Runner, function_tool from pydantic import BaseModel

class CalendarLookup(BaseModel): doctor_id: str date: str

@function_tool(strict=True) async def get_doctor_calendar(args: CalendarLookup) -> dict: # network IO — independent, safe to parallelize return await calendar_client.fetch(args.doctor_id, args.date)

class PaymentLookup(BaseModel): user_id: str

@function_tool(strict=True) async def get_user_payment_method(args: PaymentLookup) -> dict: # network IO — independent, safe to parallelize return await billing_client.fetch(args.user_id)

class BookRequest(BaseModel): slot_id: str user_id: str payment_method_id: str

@function_tool(strict=True, parallelizable=False) # <-- key async def book_appointment(args: BookRequest) -> dict: # mutating + depends on prior lookups — never parallelize return await booking_client.create(args)

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

agent = Agent( name="booking_agent", model="gpt-4.1-2025-04-14", instructions=( "Use parallel tool calls only when calls have no shared inputs. " "Never call book_appointment in the same turn as a lookup." ), tools=[get_doctor_calendar, get_user_payment_method, book_appointment], parallel_tool_calls=True, )

result = await Runner.run(agent, user_input, max_tool_concurrency=4) ```

Three things to notice:

parallelizable=False on the mutating tool. The SDK serializes any tool flagged this way and rewrites the tool schema description with a model-visible hint ("This tool must be called alone."). On gpt-4.1-2025-04-14 this drops dependent-parallel mistakes from 7.1% to 0.4% in our internal eval.
Pydantic + strict=True. Strict-mode schemas force the model to produce exactly the argument shape, which kills a class of "the model invented a slot_id" hallucinations.
Explicit prompt rule. Documenting the parallel policy in the system prompt costs ~30 tokens and meaningfully improves obedience. We A/B tested this — without the rule, dependent-parallel rate was 2.3% even with parallelizable=False declared.

The Four-Metric Eval

The point of this post: do not ship parallel tool calling on faith. Run an eval. Ours has four metrics, all attached to the same LangSmith dataset of ~300 booking conversations:

Metric	What it measures	Pass threshold
Latency p50 / p95	Wall-clock from user message to final assistant message	p50 ≤ 1.4 s, p95 ≤ 4.0 s
Token cost per task	Sum of prompt + completion tokens across the run	≤ baseline + 5%
Final-answer correctness	LLM-judge on assistant's last turn vs. reference	≥ 0.95
Wasted parallel calls	Tool calls whose results were thrown away or ignored	≤ 1.0%

The wasted-call metric is the one almost no team measures. It is the difference between "parallel tool calling is a free latency win" (the marketing claim) and the truth ("it is a latency win on independent calls and a silent cost regression on dependent ones").

We compute it from the trace: any tool call whose result does not appear in the model's next reasoning step (either as a quoted value or as the basis for a follow-up call's arguments) is "wasted." On our pre-fix booking agent, 7.1% of parallel calls were wasted. After adding parallelizable=False and the prompt rule, 0.4%.

```python from langsmith import evaluate

def wasted_parallel_calls(run, example) -> dict: tool_calls = [c for c in run.child_runs if c.run_type == "tool"] used = set() for step in run.child_runs: if step.run_type != "llm": continue for tc in tool_calls: if tc.outputs and any( str(v) in str(step.inputs) for v in tc.outputs.values() ): used.add(tc.id) wasted = [tc for tc in tool_calls if tc.id not in used] return { "key": "wasted_parallel_calls", "score": 1.0 - (len(wasted) / max(len(tool_calls), 1)), }

evaluate( booking_agent_runner, data="booking-regression-v3", evaluators=[ wasted_parallel_calls, latency_p50, token_cost, final_answer_correctness, ], experiment_prefix="parallel-tools-on-strict-flagged", ) ```

When We Disable Parallel Tool Calling Entirely

Three production scenarios where we turn it off globally:

Voice agents under tight VAD windows. When the speech-to-speech latency budget is sub-800 ms, the variance from concurrent tool dispatch (especially against shared upstream services) is worse than the average savings. We run our voice agent platform with parallel_tool_calls=false and aggressive caching instead.
Mutating-only tool sets. If every tool in the agent's loadout writes to a database, parallelism has zero upside and significant downside (race conditions, partial failures).
Sub-3-tool agents. With one or two tools, the model rarely picks parallel anyway, and the schema-bloat cost of strict-mode plus the prompt rule is not worth it. Keep it off.

For a healthcare or real-estate appointment agent that mixes 6+ read-only lookups (calendar, pricing, profile, eligibility, history, location) with 1–2 mutating writes (book, cancel), parallel-on-with-per-tool-opt-out is the right default.

Closing: Measure, Do Not Believe

The OpenAI marketing copy on parallel tool calling makes it sound like a free upgrade. It is not free — it is a tradeoff between latency, correctness, and per-tool guard-railing. The teams that get the win measure all four metrics on a real dataset and tune per agent. Everyone else sees a 18% p50 improvement, ships it to production, and absorbs an invisible 5–7% wasted-call tax that shows up months later as a cost spike or a class of weird booking bugs.

If you want to see a working version, our interactive demo lets you watch the same agent run with parallel_tool_calls on and off side by side, including the trace timeline and the wasted-call counter.

Parallel Tool Calling in the OpenAI Agents SDK: When It Helps, When It Hurts (2026)

TL;DR

What Parallel Tool Calling Actually Is

Latency Math: The Best Case

When the Model Parallelizes — and When It Should Not

Real Code: OpenAI Agents SDK with Parallel Control

The Four-Metric Eval

When We Disable Parallel Tool Calling Entirely

Closing: Measure, Do Not Believe

Try CallSphere AI Voice Agents

Related Articles You May Like

How to Build a Golden Dataset for Production AI Agents

Agent Tracing 101: Spans, Sessions, and the Hidden Failure Modes They Reveal

The Agent Evaluation Stack in 2026: From Trace to Eval Score

LLM-as-Judge: Why Pairwise Evaluation Beats Reference-Based Scoring for Agents

OpenAI Computer-Use Agents (CUA) in Production: Build + Evaluate a Real Workflow (2026)

Online vs Offline Agent Evaluation: The Pre-Deploy / Post-Deploy Split