LangGraph Supervisor Pattern: Orchestrating Multi-Agent Teams in 2026
The supervisor pattern in LangGraph for coordinating specialist agents, with full code, an eval pipeline that scores routing accuracy, and the failure modes to watch for.
TL;DR
The supervisor pattern in LangGraph is what you reach for when you have several specialist agents and you want a single orchestrator deciding who goes next, with shared state and a clear termination condition. It sits between the "network" pattern (every agent can call every other agent — chaos) and the "hierarchical" pattern (supervisor of supervisors — overkill for most teams). In this post I show a working four-specialist team — research, code, math, writing — coordinated by a supervisor with `langgraph-supervisor` and `create_supervisor()`. Pinned to LangGraph 0.6.x and `gpt-4o-2024-08-06`. Includes the eval pipeline that scores routing accuracy and tool calls per task, the cost analysis (multi-agent is expensive — here is when it's worth it), and the failure modes that bit us in production. Companion to our OpenAI Agents SDK handoff piece; same problem, different idiom.
Why Supervisor, Not Network or Hierarchy
LangGraph names three multi-agent topologies. Choose deliberately:
| Topology | Edges | Best for | Pain |
|---|---|---|---|
| Network | Every agent can call every other | Genuinely peer-to-peer collaboration; rare | Combinatorial routing decisions; near-impossible to evaluate or debug |
| Supervisor | One supervisor routes to N workers; workers return to supervisor | 90% of real teams: one orchestrator, several specialists | Supervisor becomes a bottleneck if it has to think too hard |
| Hierarchical | Supervisor of supervisors | Large teams with sub-teams (e.g., a "research wing" with its own internal supervisor) | Triple the cost; only worth it past ~8 specialists |
For four specialists — research, code, math, writing — supervisor is the right answer. Network is anarchy at this size; hierarchy is overengineering.
The Topology
```mermaid
flowchart TD
U[User task] --> S["Supervisor<br/>gpt-4o-2024-08-06"]
S -->|route| R["Research Agent<br/>web_search, arxiv"]
S -->|route| C["Code Agent<br/>python_repl, run_tests"]
S -->|route| M["Math Agent<br/>wolfram, sympy"]
S -->|route| W["Writing Agent<br/>style_check"]
R -->|return result| S
C -->|return result| S
M -->|return result| S
W -->|return result| S
S -->|FINISH| O[Final answer to user]
style S fill:#ffd
style O fill:#cfc
```
Figure 1 — Star topology with the supervisor at the center. Every worker reports back; supervisor decides next step or terminates with FINISH.
The Code
`langgraph-supervisor` (the helper package, separate from core LangGraph) ships a `create_supervisor` factory that wires the topology for you. The relevant pieces:
```python
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor

MODEL = "gpt-4o-2024-08-06"
llm = ChatOpenAI(model=MODEL, temperature=0)

# ── Specialist workers ──
# Tool functions (web_search, python_repl, wolfram_query, …) are assumed
# defined elsewhere; only the topology wiring is shown here.

research_agent = create_react_agent(
    model=llm,
    tools=[web_search, arxiv_search, fetch_url],
    name="research_agent",
    prompt=(
        "You are a research specialist. Find authoritative sources, summarize "
        "findings concisely, and always cite URLs. If asked to do math or write "
        "code, do not attempt — return a note that the supervisor should route "
        "to the math or code specialist."
    ),
)

code_agent = create_react_agent(
    model=llm,
    tools=[python_repl, run_tests, lint_code],
    name="code_agent",
    prompt=(
        "You are a code specialist. Write, run, and debug Python. Always "
        "execute code to verify before returning results. If the task requires "
        "research or pure math, defer."
    ),
)

math_agent = create_react_agent(
    model=llm,
    tools=[wolfram_query, sympy_evaluate],
    name="math_agent",
    prompt=(
        "You are a math specialist. Solve symbolic and numeric problems "
        "precisely. Show your work briefly. Refuse code or research tasks."
    ),
)

writing_agent = create_react_agent(
    model=llm,
    tools=[style_check, grammar_check],
    name="writing_agent",
    prompt=(
        "You are a writing specialist. Polish prose for clarity, structure, "
        "and tone. Do not invent facts; if information is missing, request "
        "the supervisor route to research first."
    ),
)
# ── Supervisor ──

supervisor = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=llm,
    prompt=(
        "You are the team supervisor. Decompose the user's task and route to "
        "ONE specialist at a time. After a specialist returns, decide whether "
        "to route to another, or to FINISH and produce the final answer. "
        "Never do specialist work yourself. Never call more than one specialist "
        "in a single step."
    ),
    output_mode="last_message",
).compile(name="multi_agent_team")
```
A few production-grade notes you will not find in the README:
- `temperature=0` on the supervisor. Routing should be deterministic. We allow temperature on workers because creativity matters at the leaves, not the trunk.
- `output_mode="last_message"` vs `"full_history"`. `last_message` keeps context windows under control as the conversation grows; `full_history` is useful for debugging but expensive in production. We log full history to LangSmith and run with `last_message` in prod.
- The supervisor prompt explicitly forbids "do specialist work yourself." Without this, supervisors will try to answer simple questions directly. This collapses your routing eval signal because half the time there is no routing decision to grade.
- Worker prompts include "if asked X, defer." This is your primary defense against scope creep at the leaves. A research agent that decides to write code is the start of a debugging nightmare.
Running It
```python
from langchain_core.messages import HumanMessage

result = supervisor.invoke(
    {
        "messages": [
            HumanMessage(content=(
                "Find the three most-cited 2025 papers on speculative decoding, "
                "summarize each in one paragraph, and write a short blog "
                "introduction (under 200 words) that synthesizes them."
            )),
        ],
    },
    config={"recursion_limit": 25},
)

print(result["messages"][-1].content)
```
That request takes three routing steps across two specialists: research → research → writing. The supervisor decides at each step. No worker is "in charge" except during its turn.
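To see the routing for yourself, pull worker names off the returned messages (the same trick the eval's `predict` function uses later), assuming the workers were created with `name=` as above:
```python
# Worker replies carry the agent's name; filter the transcript down to them.
route = [m.name for m in result["messages"] if getattr(m, "name", None)]
print(route)  # expect something like ['research_agent', 'research_agent', 'writing_agent']
```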
Termination, Recursion, and Shared State
Three operational concerns that separate a demo from a production system.
Termination. The supervisor terminates when it returns without naming a next worker (the `langgraph-supervisor` helper interprets this as FINISH). In practice, you encode this in the supervisor prompt: "When the user's request is fully addressed, produce the final answer and stop." For paranoia, set `recursion_limit` on `invoke()` to bound the worst case.
Recursion limit. Each "supervisor → worker → supervisor" cycle is two graph steps. A four-specialist task realistically takes 6–10 steps. We default to `recursion_limit=25`. When we hit it, it is almost always a routing loop (supervisor keeps asking the wrong specialist who keeps deferring back). The fix is in the supervisor prompt, not the limit.
Shared state. The default state is the message list. If you need richer shared state (intermediate facts, citations, a partial draft), define a custom `MessagesState` subclass and have workers append structured updates:
```python
from operator import add
from typing import Annotated

from langgraph.graph import MessagesState

class TeamState(MessagesState):
    citations: Annotated[list[str], add]  # reducer: workers' lists are concatenated
    draft: str | None
```
Workers can then read `state["citations"]` and append their own. This is how we get the writing agent to know what the research agent already found without dumping the entire research conversation into its context window.
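For illustration, a hypothetical custom worker node (the name `research_node` and the hardcoded result are ours, not part of `langgraph-supervisor`) returning a partial state update; the `add` reducer merges its citations with everyone else's instead of overwriting:
```python
def research_node(state: TeamState) -> dict:
    # Run the research agent here; a hardcoded stand-in result for the sketch.
    found = ["https://arxiv.org/abs/2404.00001"]
    # Return only the keys you change; the `add` reducer appends to
    # state["citations"], it does not overwrite other workers' entries.
    return {"citations": found}
```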
The Eval Pipeline
A multi-agent team has more failure modes than a single agent, so the eval pipeline needs more axes. The three we score:
| Metric | What it measures | How |
|---|---|---|
| Route accuracy | Did the supervisor pick the right specialist at each step? | Labeled trace dataset; structural match on `next` decision |
| Tool calls per task | Efficiency — did we get the answer in N tool calls or 3N? | Aggregate from LangSmith run tree |
| End-to-end success | Did the final answer satisfy the rubric? | LLM-as-judge against reference |
The eval runner:
```python
from langsmith import evaluate

def predict(inputs: dict) -> dict:
    out = supervisor.invoke(
        {"messages": [HumanMessage(inputs["task"])]},
        config={"recursion_limit": 25, "configurable": {"thread_id": "eval"}},
    )
    return {
        "final_answer": out["messages"][-1].content,
        "route_trace": [m.name for m in out["messages"] if getattr(m, "name", None)],
        "tool_calls": sum(
            1 for m in out["messages"] if getattr(m, "tool_calls", None)
        ),
    }

def route_accuracy(run, example):
    expected = example.outputs["expected_route"]
    actual = run.outputs["route_trace"]
    # Accept any superset that hits all expected workers in order
    # (iterator trick: each `in` check resumes where the last one stopped).
    it = iter(actual)
    matches = all(w in it for w in expected)
    return {"key": "route_accuracy", "score": float(matches)}

def efficiency(run, example):
    budget = example.outputs.get("tool_call_budget", 6)
    actual = run.outputs["tool_calls"]
    return {"key": "efficiency", "score": float(actual <= budget)}

def end_to_end(run, example):
    # LLM-as-judge against reference answer rubric
    return judge_answer(run.outputs["final_answer"], example.outputs["rubric"])

evaluate(
    predict,
    data="supervisor-team-eval-v1",
    evaluators=[route_accuracy, efficiency, end_to_end],
    experiment_prefix="supervisor-team-2026-05",
    metadata={"model": MODEL, "supervisor_version": "v3"},
    max_concurrency=4,
)
```
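`judge_answer` is referenced above but not defined. A minimal sketch, assuming a 0-to-1 rubric grade via structured output (the `Verdict` schema and prompt wording are ours):
```python
from pydantic import BaseModel, Field

class Verdict(BaseModel):
    score: float = Field(ge=0.0, le=1.0, description="0 = fails the rubric, 1 = fully satisfies it")
    reasoning: str

# Deterministic judge on the same pinned model; swap in a stronger judge if budget allows.
judge_llm = ChatOpenAI(model=MODEL, temperature=0).with_structured_output(Verdict)

def judge_answer(answer: str, rubric: str) -> dict:
    verdict = judge_llm.invoke(
        "Grade the answer against the rubric. Be strict.\n\n"
        f"Rubric:\n{rubric}\n\nAnswer:\n{answer}"
    )
    return {"key": "end_to_end", "score": verdict.score, "comment": verdict.reasoning}
```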
We gate PRs against this in CI exactly the way we do for single-agent systems — see our continuous evaluation in CI/CD piece for the GitHub Actions wiring. Multi-agent does not need a different gate; it needs more evaluators.
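The gate itself is a few lines. A sketch, assuming iterating LangSmith's `ExperimentResults` yields rows with an `evaluation_results` entry (verify against your SDK version); the 0.90 threshold is ours, not a library default:
```python
results = evaluate(
    predict,
    data="supervisor-team-eval-v1",
    evaluators=[route_accuracy, efficiency, end_to_end],
)

# Collect per-example route-accuracy scores from the experiment rows.
scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "route_accuracy"
]

accuracy = sum(scores) / len(scores)
print(f"route_accuracy: {accuracy:.2%} over {len(scores)} examples")
if accuracy < 0.90:  # gate threshold is ours; tune per team
    raise SystemExit(1)  # non-zero exit fails the CI job
```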
Cost Analysis: When Is This Worth It?
Multi-agent is expensive. Every supervisor turn is a full LLM call before a worker even starts thinking. Here is the per-task math from our internal eval suite (averaged over 200 mixed-difficulty tasks, gpt-4o-2024-08-06, May 2026):
| Approach | Avg tokens | Avg cost | E2E success | When it wins |
|---|---|---|---|---|
| Single mega-agent | 4,200 | $0.022 | 71% | Simple tasks; one persona |
| ReAct agent + many tools | 6,800 | $0.038 | 79% | Medium complexity |
| Supervisor + 4 specialists | 11,400 | $0.061 | 89% | Heterogeneous tasks; specialist tools |
| Hierarchical (supervisor of supervisors) | 18,200 | $0.097 | 91% | Only past 8+ specialists |
The supervisor pattern is roughly 3x the cost of a single mega-agent for an 18-point lift in success rate. Whether that is "worth it" depends entirely on what the task is worth. For a $0.02 customer-support turn, probably not. For a $50 research synthesis, absolutely. For our voice agent flows where one bad turn loses a deal, the math is easy.
The cost lever you have most control over is supervisor model choice. Swapping the supervisor (not the workers) to gpt-4o-mini drops total cost ~35% with about 4 percentage points of routing accuracy lost. We run mini-on-supervisor for non-critical paths and full gpt-4o for revenue-impact paths.
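A sketch of the swap, reusing the worker agents defined earlier; `SUPERVISOR_PROMPT` stands in for the routing prompt from the first code block:
```python
SUPERVISOR_PROMPT = "..."  # the supervisor routing prompt shown earlier

# Cheap model for routing decisions; workers keep the full model bound above.
supervisor_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

cheap_supervisor = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=supervisor_llm,
    prompt=SUPERVISOR_PROMPT,
    output_mode="last_message",
).compile(name="multi_agent_team_mini")
```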
Failure Modes We Hit
Things that look fine in dev and bite you in prod:
- Routing loops. Supervisor asks research, research defers ("this is a math question"), supervisor asks math, math defers ("this needs research first"). Recursion limit catches it; the cure is sharper specialist prompts about what they will do, not what they refuse.
- Supervisor over-eager FINISH. Supervisor declares the task done after only one specialist has answered, missing the synthesis step. Fix: explicit "before finishing, verify all parts of the user's request are addressed" in the supervisor prompt + an end-to-end evaluator that catches incomplete answers.
- Worker scope creep. A worker gets a partial task and tries to "be helpful" by doing the next step too. Solved by tight worker prompts and the route-accuracy evaluator.
- Context window blowup. With `output_mode="full_history"` and a long task, you blow past 128k tokens by step 20. Use `last_message` in prod, log full history to LangSmith for debugging, and consider message summarization or trimming between supervisor turns for very long tasks (see the sketch after this list).
- State key collisions. Two workers both trying to set `state["draft"]` overwrite each other. The Annotated reducer pattern (`Annotated[list[str], add]`) is your friend.
- Worker descriptions matter more than ever. Supervisors route based on the worker's `name` and prompt summary. Vague worker prompts produce vague routing. Treat `name` and `prompt` like API documentation.
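For the trimming mentioned above, a minimal sketch using `trim_messages` from `langchain_core`; where you wire it in depends on your graph (for example, a hook before the supervisor's model call), and the 8,000-token budget is our choice:
```python
from langchain_core.messages import trim_messages

def compact(messages):
    # Keep the most recent turns under a fixed token budget. The system
    # message survives so the supervisor keeps its routing instructions.
    return trim_messages(
        messages,
        max_tokens=8_000,
        token_counter=llm,   # count tokens with the bound model
        strategy="last",
        include_system=True,
        start_on="human",
    )
```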
Honest Tradeoffs vs. Single-Agent
This pattern is not always the right call. The honest comparison:
- Latency: supervisor adds ~1.5–2x wall-clock vs. a single agent. Each supervisor turn is a sequential LLM call.
- Cost: 2–3x in our measurements.
- Engineering complexity: more code, more evaluators, more failure modes.
- What you gain: independent iteration on specialists, sharper evals (you can pinpoint which specialist regressed), better task success on heterogeneous workloads, and a topology that scales to more specialists without a prompt rewrite.
Rule of thumb we use: if a single agent with all the tools is hitting ≥85% on your eval and your tasks are reasonably homogeneous, stay single. Move to supervisor when you have demonstrably distinct task types and a single agent's accuracy plateaus despite prompt iteration. The cross-cutting agent observability workflow we use for both topologies is the same — that part doesn't change.
Frequently Asked Questions
Supervisor vs. OpenAI Agents SDK handoffs — when to pick which?
Both are valid. Pick OpenAI Agents SDK handoffs when you want one agent to fully take over the dialog (typical for support-style multi-persona conversations). Pick LangGraph supervisor when you need an orchestrator that retains control and chains multiple specialists per task (typical for research/analysis workloads). They overlap in the middle; the deciding factor is usually whether you want the user to "talk to" a specialist directly (handoff) or always talk to one orchestrator that delegates (supervisor).
How do I prevent the supervisor from doing specialist work?
Three layers: (1) zero specialist tools on the supervisor — it only has "route" or "finish," (2) the prompt explicitly forbids it, (3) the route-accuracy evaluator catches violations and fails CI. Layer 3 is the one that actually keeps it out long-term.
What recursion_limit should I use?
Default to `25` for 4-specialist teams; bump to 40 for hierarchical. Hitting the limit should be alarming, not routine — if more than 1% of runs hit it, the supervisor is looping and the limit is masking a prompt bug.
Can I share memory across specialists, and how do I stream?
Yes — custom shared state plus LangGraph's Postgres-backed checkpointers keyed to `thread_id` give you long-running context. For streaming, `.astream()` emits events at every node transition; we surface "supervisor consulting research_agent…" status updates in the UI so users see progress instead of a blank screen.
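A minimal streaming sketch with `stream_mode="updates"`, where each event is keyed by the node that just ran (the status wording is ours):
```python
import asyncio

async def run_with_status(task: str) -> None:
    async for event in supervisor.astream(
        {"messages": [HumanMessage(task)]},
        config={"recursion_limit": 25},
        stream_mode="updates",
    ):
        for node_name in event:  # e.g. "supervisor", "research_agent"
            print(f"consulting {node_name}…")

asyncio.run(run_with_status("Summarize three papers on speculative decoding"))
```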