---
title: "LangGraph Supervisor Pattern: Orchestrating Multi-Agent Teams in 2026"
description: "The supervisor pattern in LangGraph for coordinating specialist agents, with full code, an eval pipeline that scores routing accuracy, and the failure modes to watch for."
canonical: https://callsphere.ai/blog/langgraph-supervisor-multi-agent-orchestration-2026
category: "Agentic AI"
tags: ["Multi-Agent Systems", "OpenAI Agents SDK", "LangGraph", "Agent Orchestration", "AI Agents", "Production AI", "Agent Evaluation"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.591Z
---

# LangGraph Supervisor Pattern: Orchestrating Multi-Agent Teams in 2026

> The supervisor pattern in LangGraph for coordinating specialist agents, with full code, an eval pipeline that scores routing accuracy, and the failure modes to watch for.

## TL;DR

The **supervisor pattern** in LangGraph is what you reach for when you have several specialist agents and you want a single orchestrator deciding who goes next, with shared state and a clear termination condition. It sits between the "network" pattern (every agent can call every other agent — chaos) and the "hierarchical" pattern (supervisor of supervisors — overkill for most teams). In this post we show a working four-specialist team — research, code, math, writing — coordinated by a supervisor with `langgraph-supervisor` and `create_supervisor()`. Pinned to LangGraph 0.6.x and `gpt-4o-2024-08-06`. Includes the eval pipeline that scores routing accuracy and tool calls per task, the cost analysis (multi-agent is expensive — here is when it's worth it), and the failure modes that bit us in production. Companion to our [OpenAI Agents SDK handoff piece](/blog/multi-agent-handoffs-openai-agents-sdk-pattern); same problem, different idiom.

## Why Supervisor, Not Network or Hierarchy

LangGraph names three multi-agent topologies. Choose deliberately:

| Topology | Edges | Best for | Pain |
| --- | --- | --- | --- |
| Network | Every agent can call every other | Genuinely peer-to-peer collaboration; rare | Combinatorial routing decisions; near-impossible to evaluate or debug |
| Supervisor | One supervisor routes to N workers; workers return to supervisor | 90% of real teams: one orchestrator, several specialists | Supervisor becomes a bottleneck if it has to think too hard |
| Hierarchical | Supervisor of supervisors | Large teams with sub-teams (e.g., a "research wing" with its own internal supervisor) | Triple the cost; only worth it past ~8 specialists |

For four specialists — research, code, math, writing — supervisor is the right answer. Network is anarchy at this size; hierarchy is overengineering.

## The Topology

```mermaid
flowchart TD
  U[User task] --> S["Supervisor<br/>gpt-4o-2024-08-06"]
  S -->|route| R["Research Agent<br/>web_search, arxiv"]
  S -->|route| C["Code Agent<br/>python_repl, run_tests"]
  S -->|route| M["Math Agent<br/>wolfram, sympy"]
  S -->|route| W["Writing Agent<br/>style_check"]
  R -->|return result| S
  C -->|return result| S
  M -->|return result| S
  W -->|return result| S
  S -->|FINISH| O[Final answer to user]
  style S fill:#ffd
  style O fill:#cfc
```

*Figure 1 — Star topology with the supervisor at the center. Every worker reports back; supervisor decides next step or terminates with FINISH.*

## The Code

`langgraph-supervisor` (the helper package, separate from core LangGraph) ships a `create_supervisor` factory that wires the topology for you. The relevant pieces:

```python
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor

MODEL = "gpt-4o-2024-08-06"
llm = ChatOpenAI(model=MODEL, temperature=0)

# Worker tools (web_search, python_repl, wolfram_query, style_check, ...)
# are assumed to be defined elsewhere as @tool-decorated functions.

# ── Specialist workers ──

research_agent = create_react_agent(
    model=llm,
    tools=[web_search, arxiv_search, fetch_url],
    name="research_agent",
    prompt=(
        "You are a research specialist. Find authoritative sources, summarize "
        "findings concisely, and always cite URLs. If asked to do math or write "
        "code, do not attempt — return a note that the supervisor should route "
        "to the math or code specialist."
    ),
)

code_agent = create_react_agent(
    model=llm,
    tools=[python_repl, run_tests, lint_code],
    name="code_agent",
    prompt=(
        "You are a code specialist. Write, run, and debug Python. Always "
        "execute code to verify before returning results. If the task requires "
        "research or pure math, defer."
    ),
)

math_agent = create_react_agent(
    model=llm,
    tools=[wolfram_query, sympy_evaluate],
    name="math_agent",
    prompt=(
        "You are a math specialist. Solve symbolic and numeric problems "
        "precisely. Show your work briefly. Refuse code or research tasks."
    ),
)

writing_agent = create_react_agent(
    model=llm,
    tools=[style_check, grammar_check],
    name="writing_agent",
    prompt=(
        "You are a writing specialist. Polish prose for clarity, structure, "
        "and tone. Do not invent facts; if information is missing, request "
        "the supervisor route to research first."
    ),
)

# ── Supervisor ──

supervisor = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=llm,
    prompt=(
        "You are the team supervisor. Decompose the user's task and route to "
        "ONE specialist at a time. After a specialist returns, decide whether "
        "to route to another, or to FINISH and produce the final answer. "
        "Never do specialist work yourself. Never call more than one specialist "
        "in a single step."
    ),
    output_mode="last_message",
).compile(name="multi_agent_team")
```

A few production-grade notes you will not find in the README:

- **`temperature=0` on the supervisor.** Routing should be deterministic. We allow temperature on workers because creativity matters at the leaves, not the trunk. (The snippet above pins a single `llm` at 0 for reproducibility; the split looks like the sketch after this list.)
- **`output_mode="last_message"` vs `"full_history"`.** `last_message` keeps context windows under control as the conversation grows; `full_history` is useful for debugging but expensive in production. We log full history to LangSmith and run with `last_message` in prod.
- **The supervisor prompt explicitly forbids "do specialist work yourself."** Without this, supervisors *will* try to answer simple questions directly. This collapses your routing eval signal because half the time there is no routing decision to grade.
- **Worker prompts include "if asked X, defer."** This is your primary defense against scope creep at the leaves. A research agent that decides to write code is the start of a debugging nightmare.
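A minimal sketch of the temperature split from the first note above, using two model handles instead of one (the 0.3 worker temperature is an illustrative value, not a recommendation):

```python
# Sketch: deterministic routing at the trunk, some creativity at the leaves.
supervisor_llm = ChatOpenAI(model=MODEL, temperature=0)   # routing decisions
worker_llm = ChatOpenAI(model=MODEL, temperature=0.3)     # specialist output

# Pass worker_llm as `model=` to each create_react_agent call above,
# and supervisor_llm as `model=` to create_supervisor.
```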

## Running It

```python
from langchain_core.messages import HumanMessage

result = supervisor.invoke({
    "messages": [
        HumanMessage(content=(
            "Find the three most-cited 2025 papers on speculative decoding, "
            "summarize each in one paragraph, and write a short blog "
            "introduction (under 200 words) that synthesizes them."
        )),
    ],
}, config={"recursion_limit": 25})

print(result["messages"][-1].content)
```

That request exercises two specialists across three routing steps: research → research → writing. The supervisor decides at each step; no worker is "in charge" except during its own turn.
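To see that routing for yourself, each worker message in the result carries a `name`; a quick inspection sketch over the `result` from above:

```python
# Sketch: print which specialist produced each message in the run above.
for m in result["messages"]:
    if getattr(m, "name", None):
        print(m.name, "→", str(m.content)[:80])
```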

## Termination, Recursion, and Shared State

Three operational concerns that separate a demo from a production system.

**Termination.** The supervisor terminates when it returns without naming a next worker (the `langgraph-supervisor` helper interprets this as FINISH). In practice, you encode this in the supervisor prompt: "When the user's request is fully addressed, produce the final answer and stop." For paranoia, set `recursion_limit` on `invoke()` to bound the worst case.

**Recursion limit.** Each "supervisor → worker → supervisor" cycle is two graph steps. A four-specialist task realistically takes 6–10 steps. We default to `recursion_limit=25`. When we hit it, it is almost always a routing loop (supervisor keeps asking the wrong specialist who keeps deferring back). The fix is in the supervisor prompt, not the limit.
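When the limit does trip, LangGraph raises `GraphRecursionError`. A minimal guard so a looping run fails loudly (`task_input` and `logger` are assumed to exist in your app):

```python
from langgraph.errors import GraphRecursionError

try:
    result = supervisor.invoke(task_input, config={"recursion_limit": 25})
except GraphRecursionError:
    # Almost always a routing loop: inspect the trace in LangSmith
    # and fix the supervisor prompt rather than raising the limit.
    logger.error("supervisor hit recursion_limit; probable routing loop")
    raise
```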

**Shared state.** The default state is the message list. If you need richer shared state (intermediate facts, citations, a partial draft), define a custom `MessagesState` subclass and have workers append structured updates:

```python
from typing import Annotated, TypedDict
from operator import add
from langgraph.graph import MessagesState

class TeamState(MessagesState):
    citations: Annotated[list[str], add]
    draft: str | None
```

Workers can then read `state["citations"]` and append their own. This is how we get the writing agent to know what the research agent already found without dumping the entire research conversation into its context window.
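From the worker's side, a structured update is just extra keys in the returned dict. A hedged sketch of a custom node over `TeamState` (the URLs are placeholders):

```python
from langchain_core.messages import AIMessage

def research_node(state: TeamState) -> dict:
    # ... run the research tools, collect sources ...
    found = ["https://example.org/paper-1", "https://example.org/paper-2"]
    return {
        "messages": [AIMessage("Found 2 sources.", name="research_agent")],
        "citations": found,  # merged via the `add` reducer, never overwritten
    }
```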

## The Eval Pipeline

A multi-agent team has more failure modes than a single agent, so the eval pipeline needs more axes. The three we score:

| Metric | What it measures | How |
| --- | --- | --- |
| Route accuracy | Did the supervisor pick the right specialist at each step? | Labeled trace dataset; structural match on `next` decision |
| Tool calls per task | Efficiency — did we get the answer in N tool calls or 3N? | Aggregate from LangSmith run tree |
| End-to-end success | Did the final answer satisfy the rubric? | LLM-as-judge against reference |

The eval runner:

```python
from langsmith import evaluate
from langchain_core.messages import HumanMessage

def predict(inputs: dict) -> dict:
    out = supervisor.invoke(
        {"messages": [HumanMessage(inputs["task"])]},
        config={"recursion_limit": 25, "configurable": {"thread_id": "eval"}},
    )
    return {
        "final_answer": out["messages"][-1].content,
        "route_trace": [m.name for m in out["messages"] if getattr(m, "name", None)],
        "tool_calls": sum(
            1 for m in out["messages"] if getattr(m, "tool_calls", None)
        ),
    }

def route_accuracy(run, example):
    expected = example.outputs["expected_route"]
    actual = run.outputs["route_trace"]
    # Accept any superset that hits all expected workers in order;
    # consuming one iterator makes `in` an ordered-subsequence check.
    it = iter(actual)
    matches = all(w in it for w in expected)
    return {"key": "route_accuracy", "score": float(matches)}

def efficiency(run, example):
    budget = example.outputs.get("tool_call_budget", 6)
    actual = run.outputs["tool_calls"]
    return {"key": "efficiency", "score": float(actual <= budget)}

def end_to_end(run, example):
    # LLM-as-judge against the reference rubric (judge_answer sketched below)
    return judge_answer(run.outputs["final_answer"], example.outputs["rubric"])

evaluate(
    predict,
    data="supervisor-team-eval-v1",
    evaluators=[route_accuracy, efficiency, end_to_end],
    experiment_prefix="supervisor-team-2026-05",
    metadata={"model": MODEL, "supervisor_version": "v3"},
    max_concurrency=4,
)
```
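`judge_answer` is a plain LLM-as-judge helper. A minimal sketch, assuming structured-output support on the judge model:

```python
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class Verdict(BaseModel):
    score: float       # 0.0 to 1.0 against the rubric
    reasoning: str

_judge = ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0)

def judge_answer(answer: str, rubric: str) -> dict:
    verdict = _judge.with_structured_output(Verdict).invoke(
        f"Rubric:\n{rubric}\n\nCandidate answer:\n{answer}\n\n"
        "Score the answer against the rubric from 0.0 to 1.0 and explain briefly."
    )
    return {"key": "end_to_end", "score": verdict.score, "comment": verdict.reasoning}
```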

We gate PRs against this in CI exactly the way we do for single-agent systems — see our [continuous evaluation in CI/CD piece](/blog/continuous-evaluation-langsmith-cicd-agent-releases) for the GitHub Actions wiring. Multi-agent does not need a different gate; it needs more evaluators.

## Cost Analysis: When Is This Worth It?

Multi-agent is expensive. Every supervisor turn is a full LLM call before a worker even starts thinking. Here is the per-task math from our internal eval suite (averaged over 200 mixed-difficulty tasks, gpt-4o-2024-08-06, May 2026):

| Approach | Avg tokens | Avg cost | E2E success | When it wins |
| --- | --- | --- | --- | --- |
| Single mega-agent | 4,200 | $0.022 | 71% | Simple tasks; one persona |
| ReAct agent + many tools | 6,800 | $0.038 | 79% | Medium complexity |
| Supervisor + 4 specialists | 11,400 | $0.061 | 89% | Heterogeneous tasks; specialist tools |
| Hierarchical (supervisor of supervisors) | 18,200 | $0.097 | 91% | Only past 8+ specialists |

The supervisor pattern is roughly **3x the cost of a single mega-agent for an 18-point lift in success rate.** Whether that is "worth it" depends entirely on what the task is worth. For a $0.02 customer-support turn, probably not. For a $50 research synthesis, absolutely. For our [voice agent](/products) flows where one bad turn loses a deal, the math is easy.

The cost lever you have most control over is **supervisor model choice**. Swapping the supervisor (not the workers) to gpt-4o-mini drops total cost ~35% with about 4 percentage points of routing accuracy lost. We run mini-on-supervisor for non-critical paths and full gpt-4o for revenue-impact paths.
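The swap is a one-line change at the orchestrator. A sketch, with `SUPERVISOR_PROMPT` standing in for the prompt string shown earlier:

```python
# Sketch: mini on the supervisor, full gpt-4o stays on the workers.
mini = ChatOpenAI(model="gpt-4o-mini", temperature=0)

cheap_team = create_supervisor(
    agents=[research_agent, code_agent, math_agent, writing_agent],
    model=mini,  # only the routing turns run on the cheaper model
    prompt=SUPERVISOR_PROMPT,
    output_mode="last_message",
).compile(name="multi_agent_team_mini")
```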

## Failure Modes We Hit

Things that look fine in dev and bite you in prod:

1. **Routing loops.** Supervisor asks research, research defers ("this is a math question"), supervisor asks math, math defers ("this needs research first"). Recursion limit catches it; the cure is sharper specialist prompts about what they *will* do, not what they refuse.
2. **Supervisor over-eager FINISH.** Supervisor declares the task done after only one specialist has answered, missing the synthesis step. Fix: explicit "before finishing, verify all parts of the user's request are addressed" in the supervisor prompt + an end-to-end evaluator that catches incomplete answers.
3. **Worker scope creep.** A worker gets a partial task and tries to "be helpful" by doing the next step too. Solved by tight worker prompts and the route-accuracy evaluator.
4. **Context window blowup.** With `output_mode="full_history"` and a long task, you blow past 128k tokens by step 20. Use `last_message` in prod, log full history to LangSmith for debugging, and consider message summarization between supervisor turns for very long tasks.
5. **State key collisions.** Two workers both trying to set `state["draft"]` overwrite each other. The Annotated reducer pattern (`Annotated[list[str], add]`) is your friend.
6. **Handoff tool descriptions matter more than ever.** The supervisor routes through auto-generated `transfer_to_<name>` handoff tools, so all it sees of a worker is its `name` and prompt summary. Vague worker prompts produce vague routing. Treat `name` and `prompt` like API documentation.

## Honest Tradeoffs vs. Single-Agent

This pattern is not always the right call. The honest comparison:

- **Latency**: supervisor adds ~1.5–2x wall-clock vs. a single agent. Each supervisor turn is a sequential LLM call.
- **Cost**: 2–3x in our measurements.
- **Engineering complexity**: more code, more evaluators, more failure modes.
- **What you gain**: independent iteration on specialists, sharper evals (you can pinpoint *which* specialist regressed), better task success on heterogeneous workloads, and a topology that scales to more specialists without a prompt rewrite.

Rule of thumb we use: if a single agent with all the tools is hitting ≥85% on your eval and your tasks are reasonably homogeneous, **stay single**. Move to supervisor when you have demonstrably distinct task types and a single agent's accuracy plateaus despite prompt iteration. The cross-cutting [agent observability workflow](/blog/trace-to-production-fix-agent-observability-workflow) we use for both topologies is the same — that part doesn't change.

## Frequently Asked Questions

### Supervisor vs. OpenAI Agents SDK handoffs — when to pick which?

Both are valid. Pick **OpenAI Agents SDK handoffs** when you want one agent to fully take over the dialog (typical for support-style multi-persona conversations). Pick **LangGraph supervisor** when you need an orchestrator that retains control and chains multiple specialists per task (typical for research/analysis workloads). They overlap in the middle; the deciding factor is usually whether you want the user to "talk to" a specialist directly (handoff) or always talk to one orchestrator that delegates (supervisor).

### How do I prevent the supervisor from doing specialist work?

Three layers: (1) zero specialist tools on the supervisor — it only has "route" or "finish," (2) the prompt explicitly forbids it, (3) the route-accuracy evaluator catches violations and fails CI. Layer 3 is the one that actually keeps it out long-term.

### What recursion_limit should I use?

Default to `25` for 4-specialist teams; bump to 40 for hierarchical. Hitting the limit should be alarming, not routine: if more than 1% of runs hit it, the supervisor is looping and the limit is masking a prompt bug.

### Can I share memory across specialists, and how do I stream?

Yes — custom shared state plus LangGraph's Postgres-backed checkpointers keyed to `thread_id` give you long-running context. For streaming, `.astream()` emits events at every node transition; we surface "supervisor consulting research_agent…" status updates in the UI so users see progress instead of a blank screen. See our [products page](/products).
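A sketch of both pieces together, assuming the `langgraph-checkpoint-postgres` package, a reachable DSN, and the agents and `SUPERVISOR_PROMPT` from earlier (`.astream()` is the async twin of the `.stream()` used here):

```python
from langgraph.checkpoint.postgres import PostgresSaver

with PostgresSaver.from_conn_string("postgresql://user:pass@host/db") as saver:
    saver.setup()  # first run only: creates the checkpoint tables
    team = create_supervisor(
        agents=[research_agent, code_agent, math_agent, writing_agent],
        model=llm,
        prompt=SUPERVISOR_PROMPT,
        output_mode="last_message",
    ).compile(name="multi_agent_team", checkpointer=saver)

    # stream_mode="updates" yields one dict per node transition, keyed by
    # the node that just ran: enough for a live status line in the UI.
    for chunk in team.stream(
        {"messages": [HumanMessage("Summarize the latest on speculative decoding")]},
        config={"configurable": {"thread_id": "user-42"}},
        stream_mode="updates",
    ):
        for node in chunk:
            print(f"consulting {node}…")
```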

---

Source: https://callsphere.ai/blog/langgraph-supervisor-multi-agent-orchestration-2026
