
Debugging Multi-Agent Workflows with OpenAI Traces

Master the art of debugging multi-agent systems using OpenAI's built-in tracing infrastructure to trace handoffs, profile tool calls, and identify bottlenecks in complex agent pipelines.

Why Multi-Agent Debugging Is Different

Single-agent debugging is straightforward — you read the prompt, inspect the output, and fix the disconnect. Multi-agent systems are a different challenge entirely. When an orchestrator hands off to a specialist, which passes context to a reviewer, which calls three tools and then escalates back to the orchestrator, finding out why the final answer is wrong requires tracing through a chain of decisions spread across multiple agents.

OpenAI's Agents SDK includes a built-in tracing system that captures every agent invocation, handoff, tool call, and guardrail evaluation as structured spans within a trace. This post walks through how to use that tracing system to systematically debug multi-agent workflows in production.

Understanding the Trace Hierarchy

Every call to Runner.run() produces a trace. Inside that trace, spans are nested in a hierarchy that mirrors the agent execution:

  • Trace — the top-level container for the entire workflow
    • Agent span — one per agent invocation
      • Generation span — each LLM call made by the agent
      • Tool call span — each function tool invocation
      • Handoff span — when the agent transfers to another agent
      • Guardrail span — input or output guardrail evaluations
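As a mental model, the hierarchy can be sketched as nested spans and walked depth-first. Note that `Span` and `render` here are a hypothetical illustration of the shape of a trace, not the SDK's own types:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    kind: str      # "agent", "generation", "tool", "handoff", or "guardrail"
    name: str
    duration_ms: float = 0.0
    children: list["Span"] = field(default_factory=list)

def render(span: Span, depth: int = 0) -> list[str]:
    # Depth-first walk: one indented line per span, mirroring the dashboard view
    lines = [f"{'  ' * depth}{span.kind}: {span.name} ({span.duration_ms:.0f}ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines

trace_root = Span("agent", "Orchestrator", 4200.0, [
    Span("generation", "llm-call", 900.0),
    Span("handoff", "transfer_to_research", 10.0),
    Span("agent", "ResearchAgent", 3100.0, [
        Span("tool", "search_knowledge_base", 2400.0),
    ]),
])
print("\n".join(render(trace_root)))
```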

This hierarchy is the foundation of all debugging. Let us start by enabling tracing and examining the output.

Enabling and Viewing Traces

Tracing is enabled by default when you use the Agents SDK. Every run produces a trace visible in the OpenAI dashboard:

from agents import Agent, Runner
import asyncio

research_agent = Agent(
    name="ResearchAgent",
    instructions="You research topics thoroughly using available tools.",
)

writer_agent = Agent(
    name="WriterAgent",
    instructions="You write clear, structured content based on research.",
)

orchestrator = Agent(
    name="Orchestrator",
    instructions="Route research requests to ResearchAgent, then pass results to WriterAgent.",
    handoffs=[research_agent, writer_agent],
)

async def main():
    result = await Runner.run(
        orchestrator,
        input="Write a report on the current state of quantum computing.",
    )
    print(result.final_output)
    # Trace URL is printed automatically to the console

asyncio.run(main())

After running this, the console outputs a trace URL. Opening it in the OpenAI dashboard reveals the full span hierarchy — every agent that was invoked, every LLM generation, every handoff event.

Naming Traces for Searchability

In production, you need to find specific traces quickly. Use the trace context manager to give meaningful names and metadata:

from agents import trace, Runner

async def handle_support_ticket(ticket_id: str, message: str):
    with trace(
        workflow_name="support-ticket-resolution",
        group_id=ticket_id,
        metadata={"ticket_id": ticket_id, "channel": "email"},
    ):
        result = await Runner.run(
            triage_agent,
            input=message,
        )
        return result.final_output

The workflow_name groups related traces in the dashboard. The group_id ties traces to a specific entity (like a ticket or session). The metadata dictionary adds arbitrary key-value pairs you can filter on later.
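A small helper keeps that metadata consistent across services. The specific keys here are illustrative conventions, not anything the SDK requires:

```python
def build_trace_metadata(ticket_id: str, channel: str, **extra) -> dict[str, str]:
    # The dashboard filters on flat string key-value pairs, so coerce every value
    meta = {"ticket_id": ticket_id, "channel": channel}
    meta.update({key: str(value) for key, value in extra.items()})
    return meta
```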

Diagnosing Handoff Failures

The most common multi-agent bug is a handoff that goes to the wrong agent or fails to transfer critical context. Here is how to diagnose it:

from agents import Agent, handoff, RunContextWrapper

def transfer_with_context(ctx: RunContextWrapper[dict]) -> str:
    """Provide context when handing off to the billing specialist."""
    return (
        "The customer has already verified their identity. "
        "Account ID: " + ctx.context.get("account_id", "unknown")
    )

billing_agent = Agent(
    name="BillingSpecialist",
    instructions="Handle billing inquiries. The customer is already verified.",
)

triage_agent = Agent(
    name="TriageAgent",
    instructions="Route billing questions to BillingSpecialist.",
    handoffs=[
        handoff(
            agent=billing_agent,
            on_handoff=transfer_with_context,
            tool_name_override="transfer_to_billing",
            tool_description_override="Transfer to billing specialist for payment and invoice questions.",
        )
    ],
)

In the trace, the handoff span shows:

  1. Which agent initiated the handoff
  2. The tool call that triggered it (the handoff appears as a tool call)
  3. The context string returned by on_handoff
  4. Which agent received the handoff

If the context string is empty or incorrect, the downstream agent will lack the information it needs. The trace makes this immediately visible.
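Once the trace has exposed a bad handoff, a cheap regression guard is to check the context string before the transfer happens. This is an illustrative helper, not an SDK feature:

```python
def check_handoff_context(context: str, required_markers: list[str]) -> list[str]:
    """Return the required markers missing from a handoff context string.

    An empty result means the downstream agent gets what it needs; anything
    else is worth logging alongside the trace URL.
    """
    problems: list[str] = []
    if not context.strip():
        problems.append("<empty context>")
    problems.extend(m for m in required_markers if m not in context)
    return problems
```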

Profiling Tool Call Latency

Slow workflows are usually caused by slow tool calls. The trace shows the duration of every span, so you can identify which tool calls are the bottleneck:

from agents import function_tool
import httpx
import time

@function_tool
async def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant articles."""
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://internal-api.example.com/search",
            json={"query": query, "limit": 5},
        )
        response.raise_for_status()
        elapsed = time.monotonic() - start

        if elapsed > 2.0:
            # This will appear in application logs alongside the trace
            print(f"SLOW_TOOL: search_knowledge_base took {elapsed:.2f}s for query: {query}")

        results = response.json()
        return "\n".join(
            f"- {r['title']}: {r['snippet']}" for r in results["articles"]
        )

In the trace dashboard, sort spans by duration to find the slowest ones. Common findings include:

  • External API calls that take 3-5 seconds and dominate total latency
  • Multiple sequential tool calls that could be parallelized
  • Redundant tool calls where the agent asks for the same data twice
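When the trace shows independent tool calls running back to back, the fix is often to gather them concurrently. A minimal sketch with asyncio, where the sleep stands in for a slow external request:

```python
import asyncio
import time

async def fetch_source(name: str, delay: float) -> str:
    # Stand-in for a slow external call (e.g. an HTTP request)
    await asyncio.sleep(delay)
    return f"{name}-result"

async def parallel_fetch() -> list[str]:
    # Independent calls run concurrently: total latency is the max, not the sum
    return list(await asyncio.gather(
        fetch_source("pricing", 0.1),
        fetch_source("inventory", 0.1),
    ))

start = time.monotonic()
results = asyncio.run(parallel_fetch())
elapsed = time.monotonic() - start
```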

Analyzing Generation Spans for Token Waste

Each generation span shows the input tokens, output tokens, and model used. This is invaluable for spotting token waste:

from agents import Agent, ModelSettings

# Problem: agent gets the entire conversation history, using excessive tokens
# Solution: use truncation to manage context window
efficient_agent = Agent(
    name="EfficientAgent",
    instructions="You answer questions concisely.",
    model_settings=ModelSettings(
        truncation="auto",  # Automatically truncate long histories
        max_tokens=500,     # Cap output length
    ),
)

When reviewing generation spans in the trace, look for:

  • Input token counts that grow linearly with conversation length
  • Output tokens that are much longer than necessary
  • Repeated context that could be summarized before passing to the next agent
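One concrete remedy for input tokens that grow with every turn is to truncate the history yourself before passing it to the next agent. A sketch, assuming chat-style message dicts with a `role` key:

```python
def truncate_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    # Keep every system message plus only the most recent conversation turns
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    if len(rest) <= max_turns:
        return messages
    return system + rest[-max_turns:]
```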

Correlating Errors Across Agents

When the final output is wrong, the error often originates several agents upstream. The trace lets you walk backwards from the final agent to find the root cause:

from agents import Agent, Runner, HandoffOutputItem, ToolCallItem, ToolCallOutputItem

async def debug_workflow(user_input: str):
    result = await Runner.run(orchestrator, input=user_input)

    # Token usage for each LLM call in the run
    for response in result.raw_responses:
        print(f"Input tokens: {response.usage.input_tokens}, "
              f"output tokens: {response.usage.output_tokens}")

    # Execution path: handoffs and tool activity, in order
    for item in result.new_items:
        if isinstance(item, HandoffOutputItem):
            print(f"Handoff: {item.source_agent.name} -> {item.target_agent.name}")
        elif isinstance(item, ToolCallItem):
            print(f"Tool call by agent: {item.agent.name}")
        elif isinstance(item, ToolCallOutputItem):
            print(f"  Tool result: {str(item.output)[:100]}")

This programmatic inspection supplements the visual trace. You can pipe this output to a log aggregator and set up alerts for common failure patterns.
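For example, the aggregated output can be scanned for known failure signatures before deciding whether to alert. The patterns below are illustrative, not a standard list:

```python
import re

# Hypothetical failure signatures matched against aggregated run logs
FAILURE_PATTERNS = {
    "slow_tool": re.compile(r"SLOW_TOOL:"),
    "empty_tool_result": re.compile(r"Tool result:\s*$", re.MULTILINE),
    "unknown_account": re.compile(r"Account ID: unknown"),
}

def scan_log(log_text: str) -> list[str]:
    # Names of every pattern that fired, for routing to the right alert channel
    return [name for name, pattern in FAILURE_PATTERNS.items()
            if pattern.search(log_text)]
```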

Setting Up Trace-Based Alerts

For production systems, you should create alerts based on trace data:

import time

from agents import trace, Runner

async def monitored_run(agent, input_text: str, max_duration: float = 10.0):
    with trace(workflow_name="monitored-agent-run") as current_trace:
        start = time.monotonic()

        result = await Runner.run(agent, input=input_text)

        elapsed = time.monotonic() - start
        if elapsed > max_duration:
            # send_alert is your own notification helper (e.g. a Slack webhook)
            await send_alert(
                channel="agent-alerts",
                message=f"Agent run exceeded {max_duration}s (took {elapsed:.2f}s). "
                        f"Trace ID: {current_trace.trace_id}",
            )

        return result

Debugging Checklist for Multi-Agent Systems

When a multi-agent workflow produces an incorrect result, follow this systematic approach:

  1. Open the trace — Start at the top level and identify which agent produced the final output
  2. Walk the handoff chain — Check each handoff to verify the right agent was selected and the context was transferred correctly
  3. Inspect generation spans — Read the actual prompts and completions at each step to find where the reasoning went wrong
  4. Check tool call results — Verify that tool calls returned the expected data
  5. Profile durations — Identify whether latency issues are causing timeouts or degraded behavior
  6. Examine guardrail spans — Check if any guardrails fired and whether they correctly allowed or blocked content

Tracing is not just a debugging tool — it is the observability layer that makes multi-agent systems manageable in production. Every multi-agent deployment should have tracing enabled, traces should be named and grouped for searchability, and alerts should fire when traces indicate degraded performance.


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
