LangGraph Supervisor Pattern: Orchestrating Multi-Agent Teams in 2026
The supervisor pattern in LangGraph for coordinating specialist agents, with full code, an eval pipeline that scores routing accuracy, and the failure modes to watch for.
TL;DR
The supervisor pattern in LangGraph is what you reach for when you have several specialist agents and you want a single orchestrator deciding who goes next, with shared state and a clear termination condition. It sits between the "network" pattern (every agent can call every other agent — chaos) and the "hierarchical" pattern (supervisor of supervisors — overkill for most teams). In this post I show a working four-specialist team — research, code, math, writing — coordinated by a supervisor with `langgraph-supervisor` and `create_supervisor()`. Pinned to LangGraph 0.6.x and `gpt-4o-2024-08-06`. Includes the eval pipeline that scores routing accuracy and tool calls per task, the cost analysis (multi-agent is expensive — here is when it's worth it), and the failure modes that bit us in production. Companion to our OpenAI Agents SDK handoff piece; same problem, different idiom.
Why Supervisor, Not Network or Hierarchy
LangGraph names three multi-agent topologies. Choose deliberately:
| Topology | Edges | Best for | Pain |
|---|---|---|---|
| Network | Every agent can call every other | Genuinely peer-to-peer collaboration; rare | Combinatorial routing decisions; near-impossible to evaluate or debug |
| Supervisor | One supervisor routes to N workers; workers return to supervisor | 90% of real teams: one orchestrator, several specialists | Supervisor becomes a bottleneck if it has to think too hard |
| Hierarchical | Supervisor of supervisors | Large teams with sub-teams (e.g., a "research wing" with its own internal supervisor) | Triple the cost; only worth it past ~8 specialists |
For four specialists — research, code, math, writing — supervisor is the right answer. Network is anarchy at this size; hierarchy is overengineering.
The Topology
```mermaid
flowchart TD
U[User task] --> S["Supervisor<br/>gpt-4o-2024-08-06"]
S -->|route| R["Research Agent<br/>web_search, arxiv"]
S -->|route| C["Code Agent<br/>python_repl, run_tests"]
S -->|route| M["Math Agent<br/>wolfram, sympy"]
S -->|route| W["Writing Agent<br/>style_check"]
R -->|return result| S
C -->|return result| S
M -->|return result| S
W -->|return result| S
S -->|FINISH| O[Final answer to user]
style S fill:#ffd
style O fill:#cfc
```
Figure 1 — Star topology with the supervisor at the center. Every worker reports back; supervisor decides next step or terminates with FINISH.
The Code
`langgraph-supervisor` (the helper package, separate from core LangGraph) ships a `create_supervisor` factory that wires the topology for you. The relevant pieces:
```python
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor

MODEL = "gpt-4o-2024-08-06"
llm = ChatOpenAI(model=MODEL, temperature=0)

# ── Specialist workers ──
# Tool functions (web_search, python_repl, wolfram_query, …) are assumed
# defined elsewhere; only the topology wiring is shown here.

research_agent = create_react_agent(
    model=llm,
    tools=[web_search, arxiv_search, fetch_url],
    name="research_agent",
    prompt=(
        "You are a research specialist. Find authoritative sources, summarize "
        "findings concisely, and always cite URLs. If asked to do math or write "
        "code, do not attempt — return a note that the supervisor should route "
        "to the math or code specialist."
    ),
)

code_agent = create_react_agent(
    model=llm,
    tools=[python_repl, run_tests, lint_code],
    name="code_agent",
    prompt=(
        "You are a code specialist. Write, run, and debug Python. Always "
        "execute code to verify before returning results. If the task requires "
        "research or pure math, defer."
    ),
)

math_agent = create_react_agent(
    model=llm,
    tools=[wolfram_query, sympy_evaluate],
    name="math_agent",
    prompt=(
        "You are a math specialist. Solve symbolic and numeric problems "
        "precisely. Show your work briefly. Refuse code or research tasks."
    ),
)

writing_agent = create_react_agent(
    model=llm,
    tools=[style_check, grammar_check],
    name="writing_agent",
    prompt=(
        "You are a writing specialist. Polish prose for clarity, structure, "
        "and tone. Do not invent facts; if information is missing, request "
        "the supervisor route to research first."
    ),
)
# ── Supervisor ──

supervisor = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=llm,
    prompt=(
        "You are the team supervisor. Decompose the user's task and route to "
        "ONE specialist at a time. After a specialist returns, decide whether "
        "to route to another, or to FINISH and produce the final answer. "
        "Never do specialist work yourself. Never call more than one specialist "
        "in a single step."
    ),
    output_mode="last_message",
).compile(name="multi_agent_team")
```
A few production-grade notes you will not find in the README:
- `temperature=0` on the supervisor. Routing should be deterministic. We allow temperature on workers because creativity matters at the leaves, not the trunk.
- `output_mode="last_message"` vs `"full_history"`. `last_message` keeps context windows under control as the conversation grows; `full_history` is useful for debugging but expensive in production. We log full history to LangSmith and run with `last_message` in prod.
- The supervisor prompt explicitly forbids "do specialist work yourself." Without this, supervisors will try to answer simple questions directly. This collapses your routing eval signal because half the time there is no routing decision to grade.
- Worker prompts include "if asked X, defer." This is your primary defense against scope creep at the leaves. A research agent that decides to write code is the start of a debugging nightmare.
Running It
```python
from langchain_core.messages import HumanMessage

result = supervisor.invoke(
    {
        "messages": [
            HumanMessage(content=(
                "Find the three most-cited 2025 papers on speculative decoding, "
                "summarize each in one paragraph, and write a short blog "
                "introduction (under 200 words) that synthesizes them."
            )),
        ],
    },
    config={"recursion_limit": 25},
)

print(result["messages"][-1].content)
```
That request takes three routing steps across two specialists: research → research → writing. The supervisor decides at each step. No worker is "in charge" except during its turn.
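To see the routing for yourself, pull worker names off the returned messages (the same trick the eval's `predict` function uses later), assuming the workers were created with `name=` as above:
```python
# Worker replies carry the agent's name; filter the transcript down to them.
route = [m.name for m in result["messages"] if getattr(m, "name", None)]
print(route)  # expect something like ['research_agent', 'research_agent', 'writing_agent']
```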
Termination, Recursion, and Shared State
Three operational concerns that separate a demo from a production system.
Termination. The supervisor terminates when it returns without naming a next worker (the `langgraph-supervisor` helper interprets this as FINISH). In practice, you encode this in the supervisor prompt: "When the user's request is fully addressed, produce the final answer and stop." For paranoia, set `recursion_limit` on `invoke()` to bound the worst case.
Recursion limit. Each "supervisor → worker → supervisor" cycle is two graph steps. A four-specialist task realistically takes 6–10 steps. We default to `recursion_limit=25`. When we hit it, it is almost always a routing loop (supervisor keeps asking the wrong specialist who keeps deferring back). The fix is in the supervisor prompt, not the limit.
Shared state. The default state is the message list. If you need richer shared state (intermediate facts, citations, a partial draft), define a custom `MessagesState` subclass and have workers append structured updates:
```python
from operator import add
from typing import Annotated

from langgraph.graph import MessagesState

class TeamState(MessagesState):
    citations: Annotated[list[str], add]  # reducer: workers' lists are concatenated
    draft: str | None
```
Workers can then read `state["citations"]` and append their own. This is how we get the writing agent to know what the research agent already found without dumping the entire research conversation into its context window.
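For illustration, a hypothetical custom worker node (the name `research_node` and the hardcoded result are ours, not part of `langgraph-supervisor`) returning a partial state update; the `add` reducer merges its citations with everyone else's instead of overwriting:
```python
def research_node(state: TeamState) -> dict:
    # Run the research agent here; a hardcoded stand-in result for the sketch.
    found = ["https://arxiv.org/abs/2404.00001"]
    # Return only the keys you change; the `add` reducer appends to
    # state["citations"], it does not overwrite other workers' entries.
    return {"citations": found}
```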
The Eval Pipeline
A multi-agent team has more failure modes than a single agent, so the eval pipeline needs more axes. The three we score:
| Metric | What it measures | How |
|---|---|---|
| Route accuracy | Did the supervisor pick the right specialist at each step? | Labeled trace dataset; structural match on `next` decision |
| Tool calls per task | Efficiency — did we get the answer in N tool calls or 3N? | Aggregate from LangSmith run tree |
| End-to-end success | Did the final answer satisfy the rubric? | LLM-as-judge against reference |
The eval runner:
```python
from langsmith import evaluate

def predict(inputs: dict) -> dict:
    out = supervisor.invoke(
        {"messages": [HumanMessage(inputs["task"])]},
        config={"recursion_limit": 25, "configurable": {"thread_id": "eval"}},
    )
    return {
        "final_answer": out["messages"][-1].content,
        "route_trace": [m.name for m in out["messages"] if getattr(m, "name", None)],
        "tool_calls": sum(
            1 for m in out["messages"] if getattr(m, "tool_calls", None)
        ),
    }

def route_accuracy(run, example):
    expected = example.outputs["expected_route"]
    actual = run.outputs["route_trace"]
    # Accept any superset that hits all expected workers in order
    # (iterator trick: each `in` check resumes where the last one stopped).
    it = iter(actual)
    matches = all(w in it for w in expected)
    return {"key": "route_accuracy", "score": float(matches)}

def efficiency(run, example):
    budget = example.outputs.get("tool_call_budget", 6)
    actual = run.outputs["tool_calls"]
    return {"key": "efficiency", "score": float(actual <= budget)}

def end_to_end(run, example):
    # LLM-as-judge against reference answer rubric
    return judge_answer(run.outputs["final_answer"], example.outputs["rubric"])

evaluate(
    predict,
    data="supervisor-team-eval-v1",
    evaluators=[route_accuracy, efficiency, end_to_end],
    experiment_prefix="supervisor-team-2026-05",
    metadata={"model": MODEL, "supervisor_version": "v3"},
    max_concurrency=4,
)
```
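`judge_answer` is referenced above but not defined. A minimal sketch, assuming a 0-to-1 rubric grade via structured output (the `Verdict` schema and prompt wording are ours):
```python
from pydantic import BaseModel, Field

class Verdict(BaseModel):
    score: float = Field(ge=0.0, le=1.0, description="0 = fails the rubric, 1 = fully satisfies it")
    reasoning: str

# Deterministic judge on the same pinned model; swap in a stronger judge if budget allows.
judge_llm = ChatOpenAI(model=MODEL, temperature=0).with_structured_output(Verdict)

def judge_answer(answer: str, rubric: str) -> dict:
    verdict = judge_llm.invoke(
        "Grade the answer against the rubric. Be strict.\n\n"
        f"Rubric:\n{rubric}\n\nAnswer:\n{answer}"
    )
    return {"key": "end_to_end", "score": verdict.score, "comment": verdict.reasoning}
```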
We gate PRs against this in CI exactly the way we do for single-agent systems — see our continuous evaluation in CI/CD piece for the GitHub Actions wiring. Multi-agent does not need a different gate; it needs more evaluators.
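The gate itself is a few lines. A sketch, assuming iterating LangSmith's `ExperimentResults` yields rows with an `evaluation_results` entry (verify against your SDK version); the 0.90 threshold is ours, not a library default:
```python
results = evaluate(
    predict,
    data="supervisor-team-eval-v1",
    evaluators=[route_accuracy, efficiency, end_to_end],
)

# Collect per-example route-accuracy scores from the experiment rows.
scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "route_accuracy"
]

accuracy = sum(scores) / len(scores)
print(f"route_accuracy: {accuracy:.2%} over {len(scores)} examples")
if accuracy < 0.90:  # gate threshold is ours; tune per team
    raise SystemExit(1)  # non-zero exit fails the CI job
```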
Cost Analysis: When Is This Worth It?
Multi-agent is expensive. Every supervisor turn is a full LLM call before a worker even starts thinking. Here is the per-task math from our internal eval suite (averaged over 200 mixed-difficulty tasks, gpt-4o-2024-08-06, May 2026):
| Approach | Avg tokens | Avg cost | E2E success | When it wins |
|---|---|---|---|---|
| Single mega-agent | 4,200 | $0.022 | 71% | Simple tasks; one persona |
| ReAct agent + many tools | 6,800 | $0.038 | 79% | Medium complexity |
| Supervisor + 4 specialists | 11,400 | $0.061 | 89% | Heterogeneous tasks; specialist tools |
| Hierarchical (supervisor of supervisors) | 18,200 | $0.097 | 91% | Only past 8+ specialists |
The supervisor pattern is roughly 3x the cost of a single mega-agent for an 18-point lift in success rate. Whether that is "worth it" depends entirely on what the task is worth. For a $0.02 customer-support turn, probably not. For a $50 research synthesis, absolutely. For our voice agent flows where one bad turn loses a deal, the math is easy.
The cost lever you have most control over is supervisor model choice. Swapping the supervisor (not the workers) to gpt-4o-mini drops total cost ~35% with about 4 percentage points of routing accuracy lost. We run mini-on-supervisor for non-critical paths and full gpt-4o for revenue-impact paths.
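A sketch of the swap, reusing the worker agents defined earlier; `SUPERVISOR_PROMPT` stands in for the routing prompt from the first code block:
```python
SUPERVISOR_PROMPT = "..."  # the supervisor routing prompt shown earlier

# Cheap model for routing decisions; workers keep the full model bound above.
supervisor_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

cheap_supervisor = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=supervisor_llm,
    prompt=SUPERVISOR_PROMPT,
    output_mode="last_message",
).compile(name="multi_agent_team_mini")
```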
Failure Modes We Hit
Things that look fine in dev and bite you in prod:
- Routing loops. Supervisor asks research, research defers ("this is a math question"), supervisor asks math, math defers ("this needs research first"). Recursion limit catches it; the cure is sharper specialist prompts about what they will do, not what they refuse.
- Supervisor over-eager FINISH. Supervisor declares the task done after only one specialist has answered, missing the synthesis step. Fix: explicit "before finishing, verify all parts of the user's request are addressed" in the supervisor prompt + an end-to-end evaluator that catches incomplete answers.
- Worker scope creep. A worker gets a partial task and tries to "be helpful" by doing the next step too. Solved by tight worker prompts and the route-accuracy evaluator.
- Context window blowup. With `output_mode="full_history"` and a long task, you blow past 128k tokens by step 20. Use `last_message` in prod, log full history to LangSmith for debugging, and consider message summarization or trimming between supervisor turns for very long tasks (see the sketch after this list).
- State key collisions. Two workers both trying to set `state["draft"]` overwrite each other. The Annotated reducer pattern (`Annotated[list[str], add]`) is your friend.
- Worker descriptions matter more than ever. Supervisors route based on the worker's `name` and prompt summary. Vague worker prompts produce vague routing. Treat `name` and `prompt` like API documentation.
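For the trimming mentioned above, a minimal sketch using `trim_messages` from `langchain_core`; where you wire it in depends on your graph (for example, a hook before the supervisor's model call), and the 8,000-token budget is our choice:
```python
from langchain_core.messages import trim_messages

def compact(messages):
    # Keep the most recent turns under a fixed token budget. The system
    # message survives so the supervisor keeps its routing instructions.
    return trim_messages(
        messages,
        max_tokens=8_000,
        token_counter=llm,   # count tokens with the bound model
        strategy="last",
        include_system=True,
        start_on="human",
    )
```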
Honest Tradeoffs vs. Single-Agent
This pattern is not always the right call. The honest comparison:
- Latency: supervisor adds ~1.5–2x wall-clock vs. a single agent. Each supervisor turn is a sequential LLM call.
- Cost: 2–3x in our measurements.
- Engineering complexity: more code, more evaluators, more failure modes.
- What you gain: independent iteration on specialists, sharper evals (you can pinpoint which specialist regressed), better task success on heterogeneous workloads, and a topology that scales to more specialists without a prompt rewrite.
Rule of thumb we use: if a single agent with all the tools is hitting ≥85% on your eval and your tasks are reasonably homogeneous, stay single. Move to supervisor when you have demonstrably distinct task types and a single agent's accuracy plateaus despite prompt iteration. The cross-cutting agent observability workflow we use for both topologies is the same — that part doesn't change.
Frequently Asked Questions
Supervisor vs. OpenAI Agents SDK handoffs — when to pick which?
Both are valid. Pick OpenAI Agents SDK handoffs when you want one agent to fully take over the dialog (typical for support-style multi-persona conversations). Pick LangGraph supervisor when you need an orchestrator that retains control and chains multiple specialists per task (typical for research/analysis workloads). They overlap in the middle; the deciding factor is usually whether you want the user to "talk to" a specialist directly (handoff) or always talk to one orchestrator that delegates (supervisor).
How do I prevent the supervisor from doing specialist work?
Three layers: (1) zero specialist tools on the supervisor — it only has "route" or "finish," (2) the prompt explicitly forbids it, (3) the route-accuracy evaluator catches violations and fails CI. Layer 3 is the one that actually keeps it out long-term.
What recursion_limit should I use?
Default to `25` for 4-specialist teams; bump to 40 for hierarchical. Hitting the limit should be alarming, not routine — if more than 1% of runs hit it, the supervisor is looping and the limit is masking a prompt bug.
Can I share memory across specialists, and how do I stream?
Yes — custom shared state plus LangGraph's Postgres-backed checkpointers keyed to `thread_id` give you long-running context. For streaming, `.astream()` emits events at every node transition; we surface "supervisor consulting research_agent…" status updates in the UI so users see progress instead of a blank screen.
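A minimal streaming sketch with `stream_mode="updates"`, where each event is keyed by the node that just ran (the status wording is ours):
```python
import asyncio

async def run_with_status(task: str) -> None:
    async for event in supervisor.astream(
        {"messages": [HumanMessage(task)]},
        config={"recursion_limit": 25},
        stream_mode="updates",
    ):
        for node_name in event:  # e.g. "supervisor", "research_agent"
            print(f"consulting {node_name}…")

asyncio.run(run_with_status("Summarize three papers on speculative decoding"))
```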