By Sagar Shankaran, Founder of CallSphere
Handoffs done right — when one agent should hand control to another, how to preserve context, and how to evaluate the handoff decision itself.
Key takeaways
Most teams reach for "one giant prompt" first, then "agent-as-tool" second, and only discover handoffs after the giant prompt collapses under its own weight at around five tools and three personas. The OpenAI Agents SDK ships a first-class `handoff()` primitive that transfers control of the conversation — not just data — to a specialist agent. That distinction is the whole point. In this post I walk through the production handoff pattern we run on CallSphere's voice and chat agent platform: a triage agent in front, three specialist agents behind it, full conversation history preserved, and a separate evaluator that scores handoff correctness as its own metric. Pinned to the OpenAI Agents SDK 0.9.x line and `gpt-4o-2024-08-06` for the triage path. Setup time: about a day. Reduction in "the bot couldn't help me, please transfer" complaints in our internal data: 71%.
The temptation when you have, say, a customer support agent that needs to handle billing, technical troubleshooting, and pre-sales questions is to write one system prompt with sections for each domain and one tool list with everything attached. This works at 3 tools. It limps at 8. By 15 tools and 4 personas it produces three failure modes you cannot prompt your way out of:
The fix is structural, not promptual: split into specialists, put a triage agent in front, and use handoffs to transfer control.
Three patterns, three tradeoffs. Pick deliberately.
| Pattern | Control flow | Context preserved? | When it wins | When it loses |
|---|---|---|---|---|
| Single mega-agent | One agent, all tools | N/A | Prototypes, ≤5 tools, one persona | Persona drift, tool confusion at scale |
| Agent-as-tool | Parent agent calls child as a function, child returns a string | Only what parent passes in args | Parent needs structured help (e.g., a research sub-agent) and wants to keep control | Multi-turn specialist conversations; child can't drive the user dialog |
| Handoff | Triage agent transfers control; specialist now owns the dialog | Full conversation history by default | Domain specialists with their own personas and tools (billing, tech support, sales) | When you genuinely want one consistent voice and don't need specialization |
The mental model that finally clicked for me: agent-as-tool is RPC, handoff is transfer-of-control. Both are useful. They are not interchangeable.
```mermaid
flowchart TD
U[Caller / chat user] --> T[Triage Agent
gpt-4o-2024-08-06]
T -->|handoff| B[Billing Agent
gpt-4o-2024-08-06]
T -->|handoff| S[Tech Support Agent
gpt-4o-2024-08-06]
T -->|handoff| Sa[Sales Agent
gpt-4o-2024-08-06]
B -->|escalate| H[Human Handoff Tool]
S -->|escalate| H
Sa -->|book demo| C[Calendar Tool]
B -.handoff back.-> T
S -.handoff back.-> T
Sa -.handoff back.-> T
style T fill:#ffd
style H fill:#fcc
```
Figure 1 — Triage in front, three specialists behind, and a designated escape hatch to a human. The dotted lines are handback paths: a specialist can hand control back to triage when the user pivots topics mid-conversation.
The OpenAI Agents SDK exposes `Agent` and `handoff()`. The simplest production wiring looks like this. (Pinned to OpenAI Agents SDK 0.9.x; the API in 0.3.x was different and 1.0 may diverge again.)
```python from agents import Agent, Runner, handoff from pydantic import BaseModel
billing_agent = Agent( name="BillingAgent", handoff_description="Handles invoices, refunds, plan changes, payment failures.", instructions=( "You are CallSphere's billing specialist. Be precise about amounts, " "dates, and policy. Always confirm the customer's account ID before " "making changes. Escalate any refund over $500 to a human." ), model="gpt-4o-2024-08-06", tools=[lookup_invoice, issue_refund, change_plan, escalate_to_human], )
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
tech_support_agent = Agent( name="TechSupportAgent", handoff_description="Diagnoses voice quality, integration, and webhook issues.", instructions=( "You are CallSphere's technical support engineer. Walk users through " "diagnostics step by step. Never guess — ask for the call_id, " "tenant_id, and approximate timestamp before troubleshooting." ), model="gpt-4o-2024-08-06", tools=[fetch_call_log, run_webhook_diagnostic, escalate_to_human], )
sales_agent = Agent( name="SalesAgent", handoff_description="Handles pricing, plan comparisons, and demo bookings.", instructions=( "You are CallSphere's pre-sales engineer. Be helpful, never pushy. " "If the prospect asks about a competitor, answer factually and " "redirect to CallSphere's strengths. Offer a demo when interest is clear." ), model="gpt-4o-2024-08-06", tools=[lookup_pricing, book_demo, send_one_pager], )
class HandoffReason(BaseModel): reason: str # short rationale, logged for evals
async def on_billing_handoff(ctx, input: HandoffReason): ctx.logger.info(f"Handoff to billing: {input.reason}")
triage_agent = Agent( name="TriageAgent", instructions=( "You are CallSphere's front-desk triage agent. Your ONLY job is to " "decide which specialist should handle this conversation, then hand " "off. Do not attempt to answer billing, technical, or sales questions " "yourself. If the user is just chatting, ask one clarifying question " "to determine the topic." ), model="gpt-4o-2024-08-06", handoffs=[ handoff(billing_agent, on_handoff=on_billing_handoff, input_type=HandoffReason), handoff(tech_support_agent), handoff(sales_agent), ], )
async def run_conversation(user_message: str, session_id: str): result = await Runner.run( triage_agent, input=user_message, context={"session_id": session_id}, ) return result.final_output, result.last_agent.name ```
A few things to call out because they are easy to miss on a first read:
By default, the OpenAI Agents SDK passes the full conversation history to the specialist on handoff. That is usually what you want — the billing agent should see the user's earlier "I was charged twice for May" message even though triage technically received it.
There are two cases where you want to override the default:
```python from agents.extensions.handoff_filters import remove_all_tools
handoff( billing_agent, on_handoff=on_billing_handoff, input_type=HandoffReason, input_filter=remove_all_tools, # drop triage's tool calls, keep messages ) ```
The `remove_all_tools` filter is the one I reach for most often. Triage should not be making tool calls anyway, but if a developer adds one experimentally, you don't want its output cluttering the specialist's context window.
Here is the part most teams skip. They evaluate the final answer the specialist produced and assume that if the answer is right, the handoff was right. Wrong. You can get a correct-looking answer from the wrong specialist (the sales agent giving a vague "yes, we support that" to a technical question that should have gone to tech support and been answered with diagnostic steps).
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
We treat handoff correctness as its own evaluator, with its own dataset. The dataset rows look like:
| Input | Expected agent | Actual agent | Score |
|---|---|---|---|
| "Why was I charged $99 twice in May?" | BillingAgent | BillingAgent | 1.0 |
| "My webhook returns 502 every 10 minutes." | TechSupportAgent | TechSupportAgent | 1.0 |
| "Can I get a discount if I prepay annually?" | SalesAgent | BillingAgent | 0.0 |
| "How do I export call recordings?" | TechSupportAgent | SalesAgent | 0.0 |
The evaluator is structural — `actual_agent == expected_agent` — which makes it cheap and deterministic. We pair it with an LLM-as-judge evaluator that scores the rationale the triage agent produced (the `HandoffReason` payload). Cheap evaluator catches the obvious; judge catches "right answer for the wrong reason."
```python from langsmith import evaluate
def handoff_correct(run, example): actual = run.outputs["last_agent"] expected = example.outputs["expected_agent"] return {"key": "handoff_correct", "score": float(actual == expected)}
def rationale_quality(run, example): # LLM-as-judge: was the triage rationale specific and topical? rationale = run.outputs.get("handoff_rationale", "") # ... call judge model, return 0..1 ...
evaluate( triage_predict, data="triage-routing-suite", evaluators=[handoff_correct, rationale_quality], experiment_prefix="triage-routing-eval", ) ```
This decomposition is what lets us ship triage prompt changes independently of specialist prompt changes. When routing accuracy regresses, we know exactly which prompt to look at.
Across roughly 280k voice and chat sessions/month on the CallSphere platform, here is what the data showed once we moved from one mega-agent to triage + 3 specialists:
This pattern is not free. The costs:
The alternative — keep stuffing the mega-prompt — works until it doesn't, and the day it doesn't is usually a customer-visible outage. We chose the structural fix.
When the parent agent needs the child's output as data to continue its own reasoning, and the child does not need to drive a multi-turn conversation. Classic example: a research sub-agent that returns a structured summary. Handoff is for when the child should own the dialog from here on.
Yes — wire the specialists' `handoffs` lists to include each other. We generally route through triage instead because it keeps the topology star-shaped and easier to reason about. Specialist-to-specialist edges are tempting but turn your topology into a mesh fast.
Three things, in order of impact: (1) give triage zero tools other than handoffs, (2) include explicit "do not attempt to answer X, Y, Z yourself" in the triage instructions, (3) evaluate triage on a dataset where the correct answer is which specialist, not the final response — and gate prompt changes on that score.
Yes. The OpenAI Agents SDK has a Voice extension and the realtime model variants support handoffs. We run handoffs in our voice agent flow with the realtime model on the specialist tier; the triage step uses gpt-4o because the latency budget there is more forgiving than mid-conversation. See our glossary for the realtime vs. cascaded tradeoff.
Two mitigations: (1) collapse low-traffic specialists into a single "general support" agent so the routing is binary (general vs. specific specialist), (2) use a cheaper triage model — gpt-4o-mini works well for the routing decision in our tests, with about 2-3 percentage points of accuracy lost vs. full gpt-4o, but ~3x cheaper and faster. For most conversational latency budgets the mini-on-triage tradeoff is the right call.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
A founder's guide to the personal AI assistant market: best AI assistant apps, business-grade options, and how CallSphere's voice agent fits in.
A founder's guide to free AI agents, low-code AI agent builders, and how to know when you should pay for a real platform like CallSphere.
Graphiti is the open-source temporal knowledge graph for AI agents in 2026. Learn how bi-temporal memory beats vector RAG for voice agents and long-running LLMs.
Chatbot app vs ChatGPT in 2026: a founder's clear take on the difference, when to use which, and how a real AI chatbot app development works.
How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.
Five proven multi-agent architecture patterns built on A2A — orchestrator, peer mesh, hub-and-spoke, marketplace, and tiered specialist.
© 2026 CallSphere LLC. All rights reserved.
Watch how CallSphere handles real customer calls, schedules appointments, and processes payments — live.
Try Live DemoBook a DemoCalculate Your ROI