By Sagar Shankaran, Founder of CallSphere
LangGraph multi-agent supervisor handoffs: the supervisor pattern in LangGraph for coordinating specialist agents, with full code, an eval pipeline that scores routing accuracy, and the failure modes to watch for.
Key takeaways
The supervisor pattern in LangGraph is what you reach for when you have several specialist agents and you want a single orchestrator deciding who goes next, with shared state and a clear termination condition. It sits between the "network" pattern (every agent can call every other agent — chaos) and the "hierarchical" pattern (supervisor of supervisors — overkill for most teams). In this post I show a working four-specialist team — research, code, math, writing — coordinated by a supervisor with `langgraph-supervisor` and `create_supervisor()`. Pinned to LangGraph 0.6.x and `gpt-4o-2024-08-06`. Includes the eval pipeline that scores routing accuracy and tool calls per task, the cost analysis (multi-agent is expensive — here is when it's worth it), and the failure modes that bit us in production. Companion to our OpenAI Agents SDK handoff piece; same problem, different idiom.
LangGraph names three multi-agent topologies. Choose deliberately:
| Topology | Edges | Best for | Pain |
|---|---|---|---|
| Network | Every agent can call every other | Genuinely peer-to-peer collaboration; rare | Combinatorial routing decisions; near-impossible to evaluate or debug |
| Supervisor | One supervisor routes to N workers; workers return to supervisor | 90% of real teams: one orchestrator, several specialists | Supervisor becomes a bottleneck if it has to think too hard |
| Hierarchical | Supervisor of supervisors | Large teams with sub-teams (e.g., a "research wing" with its own internal supervisor) | Triple the cost; only worth it past ~8 specialists |
For four specialists — research, code, math, writing — supervisor is the right answer. Network is anarchy at this size; hierarchy is overengineering.
```mermaid
flowchart TD
    U[User task] --> S["Supervisor<br/>gpt-4o-2024-08-06"]
    S -->|route| R["Research Agent<br/>web_search, arxiv"]
    S -->|route| C["Code Agent<br/>python_repl, run_tests"]
    S -->|route| M["Math Agent<br/>wolfram, sympy"]
    S -->|route| W["Writing Agent<br/>style_check"]
    R -->|return result| S
    C -->|return result| S
    M -->|return result| S
    W -->|return result| S
    S -->|FINISH| O[Final answer to user]
    style S fill:#ffd
    style O fill:#cfc
```
Figure 1 — Star topology with the supervisor at the center. Every worker reports back; supervisor decides next step or terminates with FINISH.
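The control flow in Figure 1 can be sketched without any LLM. The toy loop below is only an illustration of the star topology, not how `langgraph-supervisor` is implemented: `route` is a hypothetical rule-based stand-in for the supervisor's model call, `WORKERS` holds plain functions instead of agents, and `run_team` is a made-up name.

```python
from typing import Callable

FINISH = "FINISH"

# Hypothetical workers: each reads the shared transcript and returns one message.
WORKERS: dict[str, Callable[[list[str]], str]] = {
    "research_agent": lambda transcript: "research: found 3 sources",
    "writing_agent": lambda transcript: "writing: drafted intro",
}


def route(transcript: list[str]) -> str:
    """Stand-in for the supervisor's LLM call: pick the next worker or FINISH."""
    if not any(m.startswith("research:") for m in transcript):
        return "research_agent"
    if not any(m.startswith("writing:") for m in transcript):
        return "writing_agent"
    return FINISH


def run_team(task: str, max_steps: int = 25) -> tuple[list[str], list[str]]:
    """Supervisor picks one worker, the worker reports back, repeat until FINISH."""
    transcript = [f"user: {task}"]
    trace: list[str] = []
    for _ in range(max_steps):
        nxt = route(transcript)
        if nxt == FINISH:
            return transcript, trace
        trace.append(nxt)
        transcript.append(WORKERS[nxt](transcript))
    raise RuntimeError("step budget exhausted; likely a routing loop")


transcript, trace = run_team("summarize papers and write an intro")
print(trace)
```

Workers never call each other here; every result goes back through `route`, which is the property that makes the supervisor topology easy to trace and evaluate.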
`langgraph-supervisor` (the helper package, separate from core LangGraph) ships a `create_supervisor` factory that wires the topology for you. The relevant pieces:
```python
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor

MODEL = "gpt-4o-2024-08-06"
llm = ChatOpenAI(model=MODEL, temperature=0)

# The tool objects (web_search, python_repl, ...) are assumed to be defined elsewhere.
research_agent = create_react_agent(
    model=llm,
    tools=[web_search, arxiv_search, fetch_url],
    name="research_agent",
    prompt=(
        "You are a research specialist. Find authoritative sources, summarize "
        "findings concisely, and always cite URLs. If asked to do math or write "
        "code, do not attempt; return a note that the supervisor should route "
        "to the math or code specialist."
    ),
)

code_agent = create_react_agent(
    model=llm,
    tools=[python_repl, run_tests, lint_code],
    name="code_agent",
    prompt=(
        "You are a code specialist. Write, run, and debug Python. Always "
        "execute code to verify before returning results. If the task requires "
        "research or pure math, defer."
    ),
)

math_agent = create_react_agent(
    model=llm,
    tools=[wolfram_query, sympy_evaluate],
    name="math_agent",
    prompt=(
        "You are a math specialist. Solve symbolic and numeric problems "
        "precisely. Show your work briefly. Refuse code or research tasks."
    ),
)

writing_agent = create_react_agent(
    model=llm,
    tools=[style_check, grammar_check],
    name="writing_agent",
    prompt=(
        "You are a writing specialist. Polish prose for clarity, structure, "
        "and tone. Do not invent facts; if information is missing, request "
        "the supervisor route to research first."
    ),
)

supervisor = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=llm,
    prompt=(
        "You are the team supervisor. Decompose the user's task and route to "
        "ONE specialist at a time. After a specialist returns, decide whether "
        "to route to another, or to FINISH and produce the final answer. "
        "Never do specialist work yourself. Never call more than one specialist "
        "in a single step."
    ),
    output_mode="last_message",
).compile(name="multi_agent_team")
```
A few production-grade notes you will not find in the README:
```python
from langchain_core.messages import HumanMessage

result = supervisor.invoke(
    {
        "messages": [
            HumanMessage(content=(
                "Find the three most-cited 2025 papers on speculative decoding, "
                "summarize each in one paragraph, and write a short blog "
                "introduction (under 200 words) that synthesizes them."
            )),
        ],
    },
    config={"recursion_limit": 25},
)

print(result["messages"][-1].content)
```
That request exercises two specialists across three routing steps: research → research → writing. The supervisor decides at each step. No worker is "in charge" except in its turn.
Three operational concerns that separate a demo from a production system.
Termination. The supervisor terminates when it returns without naming a next worker (the `langgraph-supervisor` helper interprets this as FINISH). In practice, you encode this in the supervisor prompt: "When the user's request is fully addressed, produce the final answer and stop." For paranoia, set `recursion_limit` on `invoke()` to bound the worst case.
Recursion limit. Each "supervisor → worker → supervisor" cycle is two graph steps. A four-specialist task realistically takes 6–10 steps. We default to `recursion_limit=25`. When we hit it, it is almost always a routing loop (supervisor keeps asking the wrong specialist who keeps deferring back). The fix is in the supervisor prompt, not the limit.
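The step accounting above is easy to turn into a quick check. The sketch below assumes you already have a route trace (a list of worker names in call order); `cycles_within`, `longest_streak`, and the `max_streak=3` threshold are hypothetical choices for illustration, not part of LangGraph.

```python
from itertools import groupby

STEPS_PER_CYCLE = 2  # supervisor turn + worker turn, per the accounting above


def cycles_within(recursion_limit: int) -> int:
    """How many full supervisor -> worker cycles fit under a step limit."""
    return recursion_limit // STEPS_PER_CYCLE


def longest_streak(route_trace: list[str]) -> tuple[str, int]:
    """Return the worker routed to most times in a row, and how many times."""
    best = ("", 0)
    for worker, run in groupby(route_trace):
        n = len(list(run))
        if n > best[1]:
            best = (worker, n)
    return best


def looks_like_routing_loop(route_trace: list[str], max_streak: int = 3) -> bool:
    """Heuristic: the same worker was chosen more than max_streak times in a row."""
    return longest_streak(route_trace)[1] > max_streak


healthy = ["research_agent", "research_agent", "writing_agent"]
stuck = ["math_agent"] * 12

print(cycles_within(25))                 # 12
print(looks_like_routing_loop(healthy))  # False
print(looks_like_routing_loop(stuck))    # True
```

A limit of 25 leaves room for about 12 cycles, comfortably above the 3 to 5 cycles implied by a 6–10 step task, so a run that exhausts it is far more likely to be looping than to be legitimately long.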
Shared state. The default state is the message list. If you need richer shared state (intermediate facts, citations, a partial draft), define a custom `MessagesState` subclass and have workers append structured updates:
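```python
from operator import add
from typing import Annotated

from langgraph.graph import MessagesState


class TeamState(MessagesState):
    # Reducer: each worker's citations are appended to the existing list.
    citations: Annotated[list[str], add]
    # No reducer: the latest write replaces the previous draft.
    draft: str | None
```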
Workers can then read `state["citations"]` and append their own. This is how we get the writing agent to know what the research agent already found without dumping the entire research conversation into its context window.
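To make the reducer behavior concrete, here is a dependency-free model of how annotated state keys merge. It is a sketch of the semantics only, not LangGraph's actual merge code: `TeamStateSketch` and `apply_update` are hypothetical names, and the URLs are placeholders.

```python
from operator import add
from typing import Annotated, Any, Optional, TypedDict, get_type_hints


class TeamStateSketch(TypedDict):
    citations: Annotated[list[str], add]  # reducer: old + new
    draft: Optional[str]                  # no reducer: last write wins


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Merge a worker's partial update into state, honoring Annotated reducers."""
    hints = get_type_hints(TeamStateSketch, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata:
            reducer = metadata[0]
            merged[key] = reducer(state[key], value)
        else:
            merged[key] = value
    return merged


state: dict[str, Any] = {"citations": [], "draft": None}
# Research worker contributes a citation.
state = apply_update(state, {"citations": ["https://example.com/a"]})
# Writing worker adds another citation and sets the draft.
state = apply_update(state, {"citations": ["https://example.com/b"], "draft": "Intro v1"})
print(state)
```

The point of the reducer is that two workers can both write `citations` without clobbering each other, while `draft` deliberately keeps only the most recent version.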
A multi-agent team has more failure modes than a single agent, so the eval pipeline needs more axes. The three we score:
| Metric | What it measures | How |
|---|---|---|
| Route accuracy | Did the supervisor pick the right specialist at each step? | Labeled trace dataset; structural match on `next` decision |
| Tool calls per task | Efficiency — did we get the answer in N tool calls or 3N? | Aggregate from LangSmith run tree |
| End-to-end success | Did the final answer satisfy the rubric? | LLM-as-judge against reference |
The eval runner:
```python
from langchain_core.messages import HumanMessage
from langsmith import evaluate


def predict(inputs: dict) -> dict:
    out = supervisor.invoke(
        {"messages": [HumanMessage(inputs["task"])]},
        config={"recursion_limit": 25, "configurable": {"thread_id": "eval"}},
    )
    return {
        "final_answer": out["messages"][-1].content,
        "route_trace": [m.name for m in out["messages"] if getattr(m, "name", None)],
        "tool_calls": sum(
            1 for m in out["messages"] if getattr(m, "tool_calls", None)
        ),
    }


def route_accuracy(run, example):
    expected = example.outputs["expected_route"]
    actual = run.outputs["route_trace"]
    # Accept any superset that hits all expected workers in order.
    # Membership tests on a shared iterator consume it, so this is an
    # ordered-subsequence check rather than plain containment.
    remaining = iter(actual)
    matches = all(w in remaining for w in expected)
    return {"key": "route_accuracy", "score": float(matches)}


def efficiency(run, example):
    budget = example.outputs.get("tool_call_budget", 6)
    actual = run.outputs["tool_calls"]
    return {"key": "efficiency", "score": float(actual <= budget)}


def end_to_end(run, example):
    # LLM-as-judge against the reference answer rubric.
    # judge_answer is assumed to be defined elsewhere.
    return judge_answer(run.outputs["final_answer"], example.outputs["rubric"])


evaluate(
    predict,
    data="supervisor-team-eval-v1",
    evaluators=[route_accuracy, efficiency, end_to_end],
    experiment_prefix="supervisor-team-2026-05",
    metadata={"model": MODEL, "supervisor_version": "v3"},
    max_concurrency=4,
)
```
We gate PRs against this in CI exactly the way we do for single-agent systems — see our continuous evaluation in CI/CD piece for the GitHub Actions wiring. Multi-agent does not need a different gate; it needs more evaluators.
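A gate over several evaluators can be as small as a per-metric threshold check. The sketch below is one possible shape, not our actual CI script: the `THRESHOLDS` values and the `sample` scores are invented for illustration, and `gate` is a hypothetical helper that takes per-example scores keyed by evaluator name.

```python
from statistics import mean

# Hypothetical thresholds; tune them against your own baseline.
THRESHOLDS = {"route_accuracy": 0.90, "efficiency": 0.80, "end_to_end": 0.85}


def gate(results: list[dict[str, float]], thresholds: dict[str, float]) -> list[str]:
    """Return failure messages; an empty list means the PR may merge."""
    failures = []
    for key, minimum in thresholds.items():
        avg = mean(r[key] for r in results)
        if avg < minimum:
            failures.append(f"{key}: {avg:.2f} < {minimum:.2f}")
    return failures


# Made-up per-example scores, one dict per eval example.
sample = [
    {"route_accuracy": 1.0, "efficiency": 1.0, "end_to_end": 1.0},
    {"route_accuracy": 1.0, "efficiency": 0.0, "end_to_end": 1.0},
    {"route_accuracy": 1.0, "efficiency": 1.0, "end_to_end": 0.0},
    {"route_accuracy": 1.0, "efficiency": 1.0, "end_to_end": 1.0},
]
failures = gate(sample, THRESHOLDS)
print(failures)
```

In a CI job, a non-empty `failures` list would translate into a non-zero exit code so the check blocks the merge.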
Multi-agent is expensive. Every supervisor turn is a full LLM call before a worker even starts thinking. Here is the per-task math from our internal eval suite (averaged over 200 mixed-difficulty tasks, gpt-4o-2024-08-06, May 2026):
| Approach | Avg tokens | Avg cost | E2E success | When it wins |
|---|---|---|---|---|
| Single mega-agent | 4,200 | $0.022 | 71% | Simple tasks; one persona |
| ReAct agent + many tools | 6,800 | $0.038 | 79% | Medium complexity |
| Supervisor + 4 specialists | 11,400 | $0.061 | 89% | Heterogeneous tasks; specialist tools |
| Hierarchical (supervisor of supervisors) | 18,200 | $0.097 | 91% | Only past 8+ specialists |
The supervisor pattern is roughly 3x the cost of a single mega-agent for an 18-point lift in success rate. Whether that is "worth it" depends entirely on what the task is worth. For a $0.02 customer-support turn, probably not. For a $50 research synthesis, absolutely. For our voice agent flows where one bad turn loses a deal, the math is easy.
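The trade-off is easier to judge as cost per successful task and as marginal cost per extra success. The snippet below only rearranges the numbers from the table above; the dictionary keys and helper names are made up for this example.

```python
# (avg cost in USD, end-to-end success rate), copied from the table above.
APPROACHES = {
    "single_mega_agent": (0.022, 0.71),
    "react_many_tools": (0.038, 0.79),
    "supervisor_4_specialists": (0.061, 0.89),
    "hierarchical": (0.097, 0.91),
}


def cost_per_success(cost: float, success: float) -> float:
    """Expected spend per task that actually passes the rubric."""
    return cost / success


def marginal_cost_per_extra_success(base: str, candidate: str) -> float:
    """Extra dollars per additional successful task when switching approaches."""
    (c0, s0), (c1, s1) = APPROACHES[base], APPROACHES[candidate]
    return (c1 - c0) / (s1 - s0)


for name, (cost, success) in APPROACHES.items():
    print(f"{name}: ${cost_per_success(cost, success):.3f} per successful task")

ratio = APPROACHES["supervisor_4_specialists"][0] / APPROACHES["single_mega_agent"][0]
extra = marginal_cost_per_extra_success("single_mega_agent", "supervisor_4_specialists")
print(f"cost ratio: {ratio:.2f}x, marginal: ${extra:.3f} per extra success")
```

On these figures the supervisor costs about 2.8x the single agent per task, and each additional successful task it delivers costs roughly $0.22. That second number is the one to compare against what a successful task is worth to you.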
The cost lever you have most control over is supervisor model choice. Swapping the supervisor (not the workers) to gpt-4o-mini drops total cost ~35% with about 4 percentage points of routing accuracy lost. We run mini-on-supervisor for non-critical paths and full gpt-4o for revenue-impact paths.
Things that look fine in dev and bite you in prod:
This pattern is not always the right call. The honest comparison:
Rule of thumb we use: if a single agent with all the tools is hitting ≥85% on your eval and your tasks are reasonably homogeneous, stay single. Move to supervisor when you have demonstrably distinct task types and a single agent's accuracy plateaus despite prompt iteration. The cross-cutting agent observability workflow we use for both topologies is the same — that part doesn't change.
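The rule of thumb can be written down as a small decision helper. This is one hypothetical encoding: the function name, the boolean inputs, and the fallback branch are assumptions layered on the thresholds stated in this post (85% single-agent success, roughly 8 specialists before hierarchy pays off).

```python
def recommend_topology(
    single_agent_success: float,
    tasks_homogeneous: bool,
    accuracy_plateaued: bool,
    specialists: int = 4,
) -> str:
    """Encode the rule of thumb; inputs come from your own eval results."""
    # A single agent that already clears the bar on homogeneous tasks stays single.
    if single_agent_success >= 0.85 and tasks_homogeneous:
        return "single_agent"
    # Distinct task types plus a plateau despite prompt iteration justify a supervisor.
    if not tasks_homogeneous and accuracy_plateaued:
        return "hierarchical" if specialists > 8 else "supervisor"
    # Assumption: in every other case, keep iterating before adding agents.
    return "keep_iterating_on_single_agent"


print(recommend_topology(0.88, tasks_homogeneous=True, accuracy_plateaued=False))
print(recommend_topology(0.74, tasks_homogeneous=False, accuracy_plateaued=True))
```

The fallback branch reflects a judgment call: without both signals, adding a supervisor mostly adds cost.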
OpenAI Agents SDK handoffs and the LangGraph supervisor are both valid. Pick OpenAI Agents SDK handoffs when you want one agent to fully take over the dialog (typical for support-style multi-persona conversations). Pick LangGraph supervisor when you need an orchestrator that retains control and chains multiple specialists per task (typical for research/analysis workloads). They overlap in the middle; the deciding factor is usually whether you want the user to "talk to" a specialist directly (handoff) or always talk to one orchestrator that delegates (supervisor).
Three layers keep the supervisor from doing specialist work itself: (1) zero specialist tools on the supervisor, which only has "route" or "finish"; (2) a prompt that explicitly forbids it; (3) a route-accuracy evaluator that catches violations and fails CI. Layer 3 is the one that actually prevents it long-term.
Default `recursion_limit` to `25` for 4-specialist teams; bump to 40 for hierarchical. Hitting the limit should be alarming, not routine: if more than 1% of runs hit it, the supervisor is looping and the limit is masking a prompt bug.
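The 1% rule is straightforward to monitor once each run records whether it ended on the recursion limit (in LangGraph that surfaces as a `GraphRecursionError`). The helpers below are a hypothetical sketch that works on a plain list of booleans; the run counts are invented.

```python
def limit_hit_rate(hit_limit: list[bool]) -> float:
    """Fraction of runs that ended by exhausting the recursion limit."""
    return sum(hit_limit) / len(hit_limit) if hit_limit else 0.0


def should_alert(hit_limit: list[bool], threshold: float = 0.01) -> bool:
    """Alert when more than `threshold` of runs hit the limit."""
    return limit_hit_rate(hit_limit) > threshold


# Made-up batches of 1,000 runs each.
quiet_week = [True] * 8 + [False] * 992
noisy_week = [True] * 15 + [False] * 985

print(limit_hit_rate(quiet_week), should_alert(quiet_week))
print(limit_hit_rate(noisy_week), should_alert(noisy_week))
```

When the alert fires, raising the limit only hides the symptom; the place to look is the supervisor prompt.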
Long-running conversations and streaming both work with this setup. Custom shared state plus LangGraph's Postgres-backed checkpointers keyed to `thread_id` give you long-running context. For streaming, `.astream()` emits events at every node transition; we surface "supervisor consulting research_agent…" status updates in the UI so users see progress instead of a blank screen.