
Multi-Agent Handoffs with the OpenAI Agents SDK: The Pattern That Actually Scales (2026)

Handoffs done right — when one agent should hand control to another, how to preserve context, and how to evaluate the handoff decision itself.

TL;DR

Most teams reach for "one giant prompt" first, then "agent-as-tool" second, and only discover handoffs after the giant prompt collapses under its own weight at around five tools and three personas. The OpenAI Agents SDK ships a first-class `handoff()` primitive that transfers control of the conversation — not just data — to a specialist agent. That distinction is the whole point. In this post I walk through the production handoff pattern we run on CallSphere's voice and chat agent platform: a triage agent in front, three specialist agents behind it, full conversation history preserved, and a separate evaluator that scores handoff correctness as its own metric. Pinned to the OpenAI Agents SDK 0.9.x line and `gpt-4o-2024-08-06` for the triage path. Setup time: about a day. Reduction in "the bot couldn't help me, please transfer" complaints in our internal data: 71%.

Why a Single Mega-Agent Fails Past a Certain Size

The temptation when you have, say, a customer support agent that needs to handle billing, technical troubleshooting, and pre-sales questions is to write one system prompt with sections for each domain and one tool list with everything attached. This works at 3 tools. It limps at 8. By 15 tools and 4 personas it produces three failure modes you cannot prompt your way out of:

  1. Tool selection drift. The model picks the wrong tool because there are too many that look similar (e.g., `refund_charge` vs. `adjust_subscription` vs. `apply_credit`).
  2. Persona bleed. The agent uses billing-language tone when a customer asks a sales question, because the sales section of the prompt got buried below 2,000 tokens of policy.
  3. Eval flatness. You cannot measure "is the billing flow working" in isolation because everything is one agent. A regression in tech support drags the aggregate score down and you have no scalpel.

The fix is structural, not prompt-level: split into specialists, put a triage agent in front, and use handoffs to transfer control.

Handoff vs. Agent-as-Tool vs. Mega-Agent

Three patterns, three tradeoffs. Pick deliberately.

| Pattern | Control flow | Context preserved? | When it wins | When it loses |
|---|---|---|---|---|
| Single mega-agent | One agent, all tools | N/A | Prototypes, ≤5 tools, one persona | Persona drift, tool confusion at scale |
| Agent-as-tool | Parent agent calls child as a function, child returns a string | Only what parent passes in args | Parent needs structured help (e.g., a research sub-agent) and wants to keep control | Multi-turn specialist conversations; child can't drive the user dialog |
| Handoff | Triage agent transfers control; specialist now owns the dialog | Full conversation history by default | Domain specialists with their own personas and tools (billing, tech support, sales) | When you genuinely want one consistent voice and don't need specialization |

The mental model that finally clicked for me: agent-as-tool is RPC, handoff is transfer-of-control. Both are useful. They are not interchangeable.
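A compact way to see the difference in code. This is a sketch: `as_tool` exists as an `Agent` helper in the SDK, and the agent names here are illustrative, not from our production config:

```python
from agents import Agent, handoff

research_agent = Agent(
    name="ResearchAgent",
    instructions="Summarize the requested topic in three bullet points.",
)

# RPC: the parent calls the child as a function and keeps the conversation.
parent = Agent(
    name="ParentAgent",
    instructions="Use the research tool when you need background facts.",
    tools=[research_agent.as_tool(
        tool_name="research_topic",
        tool_description="Returns a three-bullet summary of a topic.",
    )],
)

# Transfer of control: the specialist owns the dialog from here on.
router = Agent(
    name="RouterAgent",
    instructions="Route research questions to the specialist.",
    handoffs=[handoff(research_agent)],
)
```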

The Topology We Run

```mermaid
flowchart TD
    U[Caller / chat user] --> T[Triage Agent<br/>gpt-4o-2024-08-06]
    T -->|handoff| B[Billing Agent<br/>gpt-4o-2024-08-06]
    T -->|handoff| S[Tech Support Agent<br/>gpt-4o-2024-08-06]
    T -->|handoff| Sa[Sales Agent<br/>gpt-4o-2024-08-06]
    B -->|escalate| H[Human Handoff Tool]
    S -->|escalate| H
    Sa -->|book demo| C[Calendar Tool]
    B -.handoff back.-> T
    S -.handoff back.-> T
    Sa -.handoff back.-> T
    style T fill:#ffd
    style H fill:#fcc
```

Figure 1 — Triage in front, three specialists behind, and a designated escape hatch to a human. The dotted lines are handback paths: a specialist can hand control back to triage when the user pivots topics mid-conversation.

The Code, End to End

The OpenAI Agents SDK exposes `Agent` and `handoff()`. The simplest production wiring looks like this. (Pinned to OpenAI Agents SDK 0.9.x; the API in 0.3.x was different and 1.0 may diverge again.)

```python
from agents import Agent, Runner, handoff
from pydantic import BaseModel

# Tool functions (lookup_invoice, issue_refund, change_plan, fetch_call_log,
# run_webhook_diagnostic, lookup_pricing, book_demo, send_one_pager,
# escalate_to_human) are assumed to be defined elsewhere as function tools.

# ── Specialist agents ──

billing_agent = Agent(
    name="BillingAgent",
    handoff_description="Handles invoices, refunds, plan changes, payment failures.",
    instructions=(
        "You are CallSphere's billing specialist. Be precise about amounts, "
        "dates, and policy. Always confirm the customer's account ID before "
        "making changes. Escalate any refund over $500 to a human."
    ),
    model="gpt-4o-2024-08-06",
    tools=[lookup_invoice, issue_refund, change_plan, escalate_to_human],
)


tech_support_agent = Agent(
    name="TechSupportAgent",
    handoff_description="Diagnoses voice quality, integration, and webhook issues.",
    instructions=(
        "You are CallSphere's technical support engineer. Walk users through "
        "diagnostics step by step. Never guess — ask for the call_id, "
        "tenant_id, and approximate timestamp before troubleshooting."
    ),
    model="gpt-4o-2024-08-06",
    tools=[fetch_call_log, run_webhook_diagnostic, escalate_to_human],
)

sales_agent = Agent(
    name="SalesAgent",
    handoff_description="Handles pricing, plan comparisons, and demo bookings.",
    instructions=(
        "You are CallSphere's pre-sales engineer. Be helpful, never pushy. "
        "If the prospect asks about a competitor, answer factually and "
        "redirect to CallSphere's strengths. Offer a demo when interest is clear."
    ),
    model="gpt-4o-2024-08-06",
    tools=[lookup_pricing, book_demo, send_one_pager],
)

# ── Triage agent with handoffs ──

class HandoffReason(BaseModel):
    reason: str  # short rationale, logged for evals

async def on_billing_handoff(ctx, input: HandoffReason):
    ctx.logger.info(f"Handoff to billing: {input.reason}")

triage_agent = Agent(
    name="TriageAgent",
    instructions=(
        "You are CallSphere's front-desk triage agent. Your ONLY job is to "
        "decide which specialist should handle this conversation, then hand "
        "off. Do not attempt to answer billing, technical, or sales questions "
        "yourself. If the user is just chatting, ask one clarifying question "
        "to determine the topic."
    ),
    model="gpt-4o-2024-08-06",
    handoffs=[
        handoff(billing_agent, on_handoff=on_billing_handoff, input_type=HandoffReason),
        handoff(tech_support_agent),
        handoff(sales_agent),
    ],
)

# ── Run it ──

async def run_conversation(user_message: str, session_id: str):
    result = await Runner.run(
        triage_agent,
        input=user_message,
        context={"session_id": session_id},
    )
    return result.final_output, result.last_agent.name
```
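To sanity-check the wiring, a quick invocation sketch (assuming the module above is importable; the session ID and the expected agent in the comment are illustrative):

```python
import asyncio

# One turn through triage; the result tells us who ended up owning the dialog.
output, agent_name = asyncio.run(
    run_conversation("Why was I charged $99 twice in May?", session_id="sess-001")
)
print(agent_name)  # "BillingAgent" if triage routed the billing question correctly
```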

A few things to call out because they are easy to miss on a first read:

  • `handoff_description` on each specialist is what the triage agent actually sees in its tool descriptions. Treat it like API docs — terse, concrete, no marketing language.
  • `on_handoff` callback + `input_type` lets you require the triage agent to produce a structured rationale. This is the hook you need for evaluating handoff correctness later. Without it you are flying blind.
  • The triage agent has no tools other than handoffs. This is deliberate. If you give triage tools, it will use them and skip the handoff. The whole point of triage is to decide and delegate.
  • `result.last_agent` tells you which agent finished the turn. Log this. It is the cheapest signal you will ever get for "did the routing work."

Preserving Context Across the Handoff

By default, the OpenAI Agents SDK passes the full conversation history to the specialist on handoff. That is usually what you want — the billing agent should see the user's earlier "I was charged twice for May" message even though triage technically received it.

There are two cases where you want to override the default:

  1. Trim noisy preamble — if triage had a long clarifying back-and-forth, the specialist may not need all of it. Use the `input_filter` argument on `handoff()` to prune.
  2. Inject structured context — if triage has already extracted the customer's account_id or call_id, pass it through as structured state via the `context` parameter on `Runner.run` so the specialist's tools can use it without re-asking.

```python
from agents.extensions.handoff_filters import remove_all_tools

handoff(
    billing_agent,
    on_handoff=on_billing_handoff,
    input_type=HandoffReason,
    input_filter=remove_all_tools,  # drop triage's tool calls, keep messages
)
```

The `remove_all_tools` filter is the one I reach for most often. Triage should not be making tool calls anyway, but if a developer adds one experimentally, you don't want its output cluttering the specialist's context window.
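When the stock filters are too blunt, you can write your own. A minimal sketch, assuming `HandoffInputData` is the dataclass the SDK passes to `input_filter` and that `input_history` arrives as either a raw string or a tuple of items (true in recent SDK releases; verify against your 0.9.x pin):

```python
from dataclasses import replace

from agents import HandoffInputData, handoff

def keep_recent_history(data: HandoffInputData) -> HandoffInputData:
    """Trim triage's clarifying back-and-forth down to the most recent items."""
    history = data.input_history
    if isinstance(history, tuple):
        history = history[-6:]  # keep the last 6 items; tune per use case
    return replace(data, input_history=history)

handoff(billing_agent, input_filter=keep_recent_history)
```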

Evaluating the Handoff Decision Itself

Here is the part most teams skip. They evaluate the final answer the specialist produced and assume that if the answer is right, the handoff was right. Wrong. You can get a correct-looking answer from the wrong specialist (the sales agent giving a vague "yes, we support that" to a technical question that should have gone to tech support and been answered with diagnostic steps).


We treat handoff correctness as its own evaluator, with its own dataset. The dataset rows look like:

| Input | Expected agent | Actual agent | Score |
|---|---|---|---|
| "Why was I charged $99 twice in May?" | BillingAgent | BillingAgent | 1.0 |
| "My webhook returns 502 every 10 minutes." | TechSupportAgent | TechSupportAgent | 1.0 |
| "Can I get a discount if I prepay annually?" | SalesAgent | BillingAgent | 0.0 |
| "How do I export call recordings?" | TechSupportAgent | SalesAgent | 0.0 |

The evaluator is structural — `actual_agent == expected_agent` — which makes it cheap and deterministic. We pair it with an LLM-as-judge evaluator that scores the rationale the triage agent produced (the `HandoffReason` payload). Cheap evaluator catches the obvious; judge catches "right answer for the wrong reason."

```python
from langsmith import evaluate

def handoff_correct(run, example):
    actual = run.outputs["last_agent"]
    expected = example.outputs["expected_agent"]
    return {"key": "handoff_correct", "score": float(actual == expected)}

def rationale_quality(run, example):
    # LLM-as-judge: was the triage rationale specific and topical?
    rationale = run.outputs.get("handoff_rationale", "")
    # ... call judge model, return 0..1 ...
    return {"key": "rationale_quality", "score": 0.0}  # placeholder until the judge is wired in

evaluate(
    triage_predict,
    data="triage-routing-suite",
    evaluators=[handoff_correct, rationale_quality],
    experiment_prefix="triage-routing-eval",
)
```
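The `triage_predict` target the harness calls is elided above. Ours looks roughly like this; the `handoff_rationale` plumbing assumes the `on_handoff` callbacks stash the rationale on the context dict, which is our convention, not an SDK feature:

```python
import asyncio

def triage_predict(inputs: dict) -> dict:
    """Eval target: run one turn through triage, report who handled it."""
    ctx = {"session_id": "eval-run", "handoff_rationale": None}  # hypothetical keys
    result = asyncio.run(Runner.run(triage_agent, input=inputs["input"], context=ctx))
    return {
        "last_agent": result.last_agent.name,
        "handoff_rationale": ctx.get("handoff_rationale") or "",
    }
```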

This decomposition is what lets us ship triage prompt changes independently of specialist prompt changes. When routing accuracy regresses, we know exactly which prompt to look at.

What We Found Running This in Production

Across roughly 280k voice and chat sessions/month on the CallSphere platform, here is what the data showed once we moved from one mega-agent to triage + 3 specialists:

  • Routing accuracy on a 220-row labeled dataset: 94.1% with the triage agent, 71.3% with the mega-agent's implicit "use the right tool" routing.
  • Specialist task success (did the user get what they came for): 88.6% triaged, 76.4% mega.
  • p95 latency: +180ms with handoffs (one extra LLM hop). Acceptable for our use case; your mileage will vary if you're sub-500ms-strict.
  • Cost per session: +14% with handoffs because of the extra triage call. Offset by ~22% reduction in handoffs to humans, which dominates economics.
  • Engineer iteration speed: dramatically faster. Specialist prompts can change without retesting the entire surface area.

Honest Tradeoffs

This pattern is not free. The costs:

  • One extra LLM call per conversation start. Triage has to think before it routes. For very simple use cases (one persona, two tools), it's overkill.
  • Handoff loops are a real failure mode. Specialist hands back to triage, triage hands back to specialist. We cap recursion at 5 handoffs per conversation and emit a warning to logs if we hit 3; a minimal guard sketch follows this list.
  • Persona consistency is harder. Three specialists means three voices. We solve this with a shared style guide injected into every specialist's instructions, but it requires discipline.
  • Debugging is harder. A bad answer might be a triage routing failure, a specialist prompt failure, or a tool failure. Tracing across the handoff is essential — see our trace-to-fix workflow piece for the playbook.
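The handoff cap mentioned above is not an SDK feature; we enforce it in a shared `on_handoff` hook. A minimal sketch, assuming the callback receives a wrapper whose `.context` is the dict passed to `Runner.run`:

```python
import logging

logger = logging.getLogger("handoff_guard")

MAX_HANDOFFS = 5  # hard stop
WARN_AT = 3       # early signal in logs

async def count_handoff(ctx, _input=None):
    """Shared on_handoff hook: count hops, warn early, kill runaway loops."""
    n = ctx.context.get("handoff_count", 0) + 1
    ctx.context["handoff_count"] = n
    if n >= WARN_AT:
        logger.warning("conversation reached %d handoffs", n)
    if n > MAX_HANDOFFS:
        raise RuntimeError(f"handoff loop: exceeded {MAX_HANDOFFS} handoffs")
```

Wire `count_handoff` into every `handoff()` edge, including the handback edges from specialists to triage.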

The alternative — keep stuffing the mega-prompt — works until it doesn't, and the day it doesn't is usually a customer-visible outage. We chose the structural fix.

Frequently Asked Questions

When should I prefer agent-as-tool over a handoff?

When the parent agent needs the child's output as data to continue its own reasoning, and the child does not need to drive a multi-turn conversation. Classic example: a research sub-agent that returns a structured summary. Handoff is for when the child should own the dialog from here on.

Can a specialist hand off to another specialist?

Yes — wire the specialists' `handoffs` lists to include each other. We generally route through triage instead because it keeps the topology star-shaped and easier to reason about. Specialist-to-specialist edges are tempting but turn your topology into a mesh fast.
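For reference, the mesh wiring is one line per edge. Whether `handoffs` can be assigned after construction depends on the SDK version (the Agent type is a plain dataclass in the versions we have used), so treat this as a sketch:

```python
# Specialist-to-specialist edges (we avoid these in production):
billing_agent.handoffs = [handoff(tech_support_agent), handoff(sales_agent)]
tech_support_agent.handoffs = [handoff(billing_agent), handoff(sales_agent)]
```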

How do I prevent the triage agent from "answering" instead of handing off?

Three things, in order of impact: (1) give triage zero tools other than handoffs, (2) include explicit "do not attempt to answer X, Y, Z yourself" in the triage instructions, (3) evaluate triage on a dataset where the correct answer is which specialist, not the final response — and gate prompt changes on that score.

Does this work with voice agents (audio in/audio out)?

Yes. The OpenAI Agents SDK has a Voice extension and the realtime model variants support handoffs. We run handoffs in our voice agent flow with the realtime model on the specialist tier; the triage step uses gpt-4o because the latency budget there is more forgiving than mid-conversation. See our glossary for the realtime vs. cascaded tradeoff.

What if my handoffs are slower than acceptable for real-time voice?

Two mitigations: (1) collapse low-traffic specialists into a single "general support" agent so the routing is binary (general vs. specific specialist), (2) use a cheaper triage model — gpt-4o-mini works well for the routing decision in our tests, with about 2-3 percentage points of accuracy lost vs. full gpt-4o, but ~3x cheaper and faster. For most conversational latency budgets the mini-on-triage tradeoff is the right call.
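If your SDK version ships `Agent.clone()` (recent releases do; verify against your pin), swapping the triage model is a one-liner:

```python
# Same instructions and handoff wiring, cheaper and faster routing model.
fast_triage = triage_agent.clone(model="gpt-4o-mini")
```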
