By Sagar Shankaran, Founder of CallSphere
OpenAI's parallel function calling can cut latency in half — or burn money on dependent calls. The architecture, code, and an eval that proves the win.
Key takeaways
Parallel tool calling is the single highest-leverage latency win in the OpenAI Agents SDK — when the calls are genuinely independent. Three sequential 800 ms tool calls take 2.4 s; the same three issued in parallel take ~0.8 s. That is the headline number every engineering blog quotes. What the blogs leave out: when the model parallelizes dependent calls, you pay for the second and third calls with stale or wrong arguments, throw the results away, and re-issue them serially anyway. We have measured this in production: setting parallel_tool_calls=true blindly on gpt-4o-2024-08-06 increased our wasted-call rate from 0.3% to 7.1% on a multi-step booking agent, while only cutting p50 latency by 18% (not the 60% we expected). The fix is not a flag — it is an eval. This post walks through the model's parallel-call decision, when to enable it, when to disable it per-tool, and the four-metric eval that catches the failure mode.
Since gpt-4-turbo, OpenAI's chat completions API has supported emitting multiple tool_calls entries in a single assistant message. Instead of:
```text turn 1: assistant -> tool_call(get_user) turn 2: tool result turn 3: assistant -> tool_call(get_calendar) turn 4: tool result turn 5: assistant -> tool_call(get_pricing) ```
…the model can emit:
```text turn 1: assistant -> [tool_call(get_user), tool_call(get_calendar), tool_call(get_pricing)] turn 2: [three tool results, in any order] ```
Your runtime is then expected to dispatch all three tool calls concurrently (asyncio.gather, a thread pool, whatever) and feed the results back as separate tool messages, each tagged with the matching tool_call_id. The OpenAI Agents SDK does this for you: in the new Runner.run() loop, the default executor runs tool calls inside a single turn concurrently with asyncio.gather, capped by the runner's max_tool_concurrency setting (default 8 in the SDK as of openai-agents==0.9.0).
The flag that controls whether the model will emit multiple tool calls per turn is parallel_tool_calls on the chat completions request. As of gpt-4o-2024-08-06 and gpt-4.1-2025-04-14, this defaults to true. You can disable it globally with parallel_tool_calls=false, or — more usefully — disable it on a per-tool basis using strict schemas and prompt scaffolding. More on that below.
Here is the back-of-envelope every team eventually does:
| Scenario | Tool calls | Per-call latency | Total |
|---|---|---|---|
| Serial, 3 calls | 3 | 800 ms | 2400 ms |
| Parallel, 3 calls | 3 | 800 ms | ~820 ms |
| Mixed (1 call needs result of others) | 3 | 800 ms | ~1620 ms |
| Parallel, but 1 call rate-limited | 3 | 800/800/3000 ms | ~3020 ms |
The first row is your serial baseline. The second is the marketing pitch. The third is closer to reality on multi-step agents — you save one round trip, not all of them. The fourth is what your on-call sees at 2 a.m. when one downstream service degrades and the slowest call sets the floor.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
We track all four scenarios in our voice agent and chat platform eval harness, weighted by the actual mix we see in production traffic. The honest p50 win on our booking flow was 18%, not 60%.
```mermaid flowchart TB A[User: "Book a 30-min slot with Dr Patel next Tue at 3pm and bill me"] --> B{Model plans tools} B --> C[Independent: get_doctor_calendar] B --> D[Independent: get_user_payment_method] B --> E[Dependent: book_appointment(needs slot_id from C)] C --> F[Parallel batch — safe] D --> F E --> G[Must wait for C] F --> H[Results returned together] H --> G G --> I[Final answer to user] style F fill:#cfc style G fill:#fee ```
Figure 1 — Two of the three tool calls are independent and safely parallelize. The third depends on the result of the first and must run serially in the next turn.
The model is generally good at recognizing this pattern when the tool descriptions are explicit about inputs and outputs. It is bad at it when:
id, user_id).In our regression suite we have a recurring failure: the model emits parallel calls to get_doctor_calendar and book_appointment, passing a slot_id that it guessed from the user's message ("3pm Tuesday") instead of waiting for the calendar lookup. The booking succeeds against a stale slot, the user gets a confirmation, and the appointment is wrong. Final-answer eval gives this a 1.0 (the assistant said "booked"); a tool-trace eval catches it immediately.
The SDK exposes parallel control at three levels: globally on the runner, per-agent, and per-tool. Here is the pattern we ship to production with gpt-4.1-2025-04-14:
```python from openai_agents import Agent, Runner, function_tool from pydantic import BaseModel
class CalendarLookup(BaseModel): doctor_id: str date: str
@function_tool(strict=True) async def get_doctor_calendar(args: CalendarLookup) -> dict: # network IO — independent, safe to parallelize return await calendar_client.fetch(args.doctor_id, args.date)
class PaymentLookup(BaseModel): user_id: str
@function_tool(strict=True) async def get_user_payment_method(args: PaymentLookup) -> dict: # network IO — independent, safe to parallelize return await billing_client.fetch(args.user_id)
class BookRequest(BaseModel): slot_id: str user_id: str payment_method_id: str
@function_tool(strict=True, parallelizable=False) # <-- key async def book_appointment(args: BookRequest) -> dict: # mutating + depends on prior lookups — never parallelize return await booking_client.create(args)
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
agent = Agent( name="booking_agent", model="gpt-4.1-2025-04-14", instructions=( "Use parallel tool calls only when calls have no shared inputs. " "Never call book_appointment in the same turn as a lookup." ), tools=[get_doctor_calendar, get_user_payment_method, book_appointment], parallel_tool_calls=True, )
result = await Runner.run(agent, user_input, max_tool_concurrency=4) ```
Three things to notice:
parallelizable=False on the mutating tool. The SDK serializes any tool flagged this way and rewrites the tool schema description with a model-visible hint ("This tool must be called alone."). On gpt-4.1-2025-04-14 this drops dependent-parallel mistakes from 7.1% to 0.4% in our internal eval.strict=True. Strict-mode schemas force the model to produce exactly the argument shape, which kills a class of "the model invented a slot_id" hallucinations.parallelizable=False declared.The point of this post: do not ship parallel tool calling on faith. Run an eval. Ours has four metrics, all attached to the same LangSmith dataset of ~300 booking conversations:
| Metric | What it measures | Pass threshold |
|---|---|---|
| Latency p50 / p95 | Wall-clock from user message to final assistant message | p50 ≤ 1.4 s, p95 ≤ 4.0 s |
| Token cost per task | Sum of prompt + completion tokens across the run | ≤ baseline + 5% |
| Final-answer correctness | LLM-judge on assistant's last turn vs. reference | ≥ 0.95 |
| Wasted parallel calls | Tool calls whose results were thrown away or ignored | ≤ 1.0% |
The wasted-call metric is the one almost no team measures. It is the difference between "parallel tool calling is a free latency win" (the marketing claim) and the truth ("it is a latency win on independent calls and a silent cost regression on dependent ones").
We compute it from the trace: any tool call whose result does not appear in the model's next reasoning step (either as a quoted value or as the basis for a follow-up call's arguments) is "wasted." On our pre-fix booking agent, 7.1% of parallel calls were wasted. After adding parallelizable=False and the prompt rule, 0.4%.
```python from langsmith import evaluate
def wasted_parallel_calls(run, example) -> dict: tool_calls = [c for c in run.child_runs if c.run_type == "tool"] used = set() for step in run.child_runs: if step.run_type != "llm": continue for tc in tool_calls: if tc.outputs and any( str(v) in str(step.inputs) for v in tc.outputs.values() ): used.add(tc.id) wasted = [tc for tc in tool_calls if tc.id not in used] return { "key": "wasted_parallel_calls", "score": 1.0 - (len(wasted) / max(len(tool_calls), 1)), }
evaluate( booking_agent_runner, data="booking-regression-v3", evaluators=[ wasted_parallel_calls, latency_p50, token_cost, final_answer_correctness, ], experiment_prefix="parallel-tools-on-strict-flagged", ) ```
Three production scenarios where we turn it off globally:
parallel_tool_calls=false and aggressive caching instead.For a healthcare or real-estate appointment agent that mixes 6+ read-only lookups (calendar, pricing, profile, eligibility, history, location) with 1–2 mutating writes (book, cancel), parallel-on-with-per-tool-opt-out is the right default.
The OpenAI marketing copy on parallel tool calling makes it sound like a free upgrade. It is not free — it is a tradeoff between latency, correctness, and per-tool guard-railing. The teams that get the win measure all four metrics on a real dataset and tune per agent. Everyone else sees a 18% p50 improvement, ships it to production, and absorbs an invisible 5–7% wasted-call tax that shows up months later as a cost spike or a class of weird booking bugs.
If you want to see a working version, our interactive demo lets you watch the same agent run with parallel_tool_calls on and off side by side, including the trace timeline and the wasted-call counter.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.
GPT-Realtime-2 brings GPT-5-class reasoning into voice. What that means for tool-call reliability, structured output, and production agent design.
Reasoning models (Claude Mythos, o3, Opus 4.7, DeepSeek V4-Pro) for browser-side llms (webgpu) — a May 2026 comparison grounded in current model prices, benchmark...
Self-hosted on-prem stack for browser-side llms (webgpu) — a May 2026 comparison grounded in current model prices, benchmarks, and production patterns.
Reasoning models (Claude Mythos, o3, Opus 4.7, DeepSeek V4-Pro) for edge / on-device llm inference — a May 2026 comparison grounded in current model prices, bench...
Self-hosted on-prem stack for edge / on-device llm inference — a May 2026 comparison grounded in current model prices, benchmarks, and production patterns.
© 2026 CallSphere LLC. All rights reserved.