
Debugging Multi-Agent Workflows with OpenAI Traces

Master the art of debugging multi-agent systems using OpenAI's built-in tracing infrastructure to trace handoffs, profile tool calls, and identify bottlenecks in complex agent pipelines.

Why Multi-Agent Debugging Is Different

Single-agent debugging is straightforward — you read the prompt, inspect the output, and fix the disconnect. Multi-agent systems are a different challenge entirely. When an orchestrator hands off to a specialist, which passes context to a reviewer, which calls three tools and then escalates back to the orchestrator, finding out why the final answer is wrong requires tracing through a chain of decisions spread across multiple agents.

OpenAI's Agents SDK includes a built-in tracing system that captures every agent invocation, handoff, tool call, and guardrail evaluation as structured spans within a trace. This post walks through how to use that tracing system to systematically debug multi-agent workflows in production.

Understanding the Trace Hierarchy

Every call to Runner.run() produces a trace. Inside that trace, spans are nested in a hierarchy that mirrors the agent execution:

  • Trace — the top-level container for the entire workflow
    • Agent span — one per agent invocation
      • Generation span — each LLM call made by the agent
      • Tool call span — each function tool invocation
      • Handoff span — when the agent transfers to another agent
      • Guardrail span — input or output guardrail evaluations
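As a mental model, the hierarchy can be sketched as nested spans and walked depth-first. Note that `Span` and `render` here are a hypothetical illustration of the shape of a trace, not the SDK's own types:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    kind: str      # "agent", "generation", "tool", "handoff", or "guardrail"
    name: str
    duration_ms: float = 0.0
    children: list["Span"] = field(default_factory=list)

def render(span: Span, depth: int = 0) -> list[str]:
    # Depth-first walk: one indented line per span, mirroring the dashboard view
    lines = [f"{'  ' * depth}{span.kind}: {span.name} ({span.duration_ms:.0f}ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines

trace_root = Span("agent", "Orchestrator", 4200.0, [
    Span("generation", "llm-call", 900.0),
    Span("handoff", "transfer_to_research", 10.0),
    Span("agent", "ResearchAgent", 3100.0, [
        Span("tool", "search_knowledge_base", 2400.0),
    ]),
])
print("\n".join(render(trace_root)))
```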

This hierarchy is the foundation of all debugging. Let us start by enabling tracing and examining the output.

Enabling and Viewing Traces

Tracing is enabled by default when you use the Agents SDK. Every run produces a trace visible in the OpenAI dashboard:

from agents import Agent, Runner
import asyncio

research_agent = Agent(
    name="ResearchAgent",
    instructions="You research topics thoroughly using available tools.",
)

writer_agent = Agent(
    name="WriterAgent",
    instructions="You write clear, structured content based on research.",
)

orchestrator = Agent(
    name="Orchestrator",
    instructions="Route research requests to ResearchAgent, then pass results to WriterAgent.",
    handoffs=[research_agent, writer_agent],
)

async def main():
    result = await Runner.run(
        orchestrator,
        input="Write a report on the current state of quantum computing.",
    )
    print(result.final_output)
    # Trace URL is printed automatically to the console

asyncio.run(main())

After running this, the console outputs a trace URL. Opening it in the OpenAI dashboard reveals the full span hierarchy — every agent that was invoked, every LLM generation, every handoff event.

Naming Traces for Searchability

In production, you need to find specific traces quickly. Use the trace context manager to give meaningful names and metadata:

from agents import trace, Runner

async def handle_support_ticket(ticket_id: str, message: str):
    with trace(
        workflow_name="support-ticket-resolution",
        group_id=ticket_id,
        metadata={"ticket_id": ticket_id, "channel": "email"},
    ):
        result = await Runner.run(
            triage_agent,
            input=message,
        )
        return result.final_output

The workflow_name groups related traces in the dashboard. The group_id ties traces to a specific entity (like a ticket or session). The metadata dictionary adds arbitrary key-value pairs you can filter on later.
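A small helper keeps that metadata consistent across services. The specific keys here are illustrative conventions, not anything the SDK requires:

```python
def build_trace_metadata(ticket_id: str, channel: str, **extra) -> dict[str, str]:
    # The dashboard filters on flat string key-value pairs, so coerce every value
    meta = {"ticket_id": ticket_id, "channel": channel}
    meta.update({key: str(value) for key, value in extra.items()})
    return meta
```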

Diagnosing Handoff Failures

The most common multi-agent bug is a handoff that goes to the wrong agent or fails to transfer critical context. Here is how to diagnose it:

from agents import Agent, handoff, RunContextWrapper

def transfer_with_context(ctx: RunContextWrapper[dict]) -> str:
    """Provide context when handing off to the billing specialist."""
    return (
        "The customer has already verified their identity. "
        "Account ID: " + ctx.context.get("account_id", "unknown")
    )

billing_agent = Agent(
    name="BillingSpecialist",
    instructions="Handle billing inquiries. The customer is already verified.",
)

triage_agent = Agent(
    name="TriageAgent",
    instructions="Route billing questions to BillingSpecialist.",
    handoffs=[
        handoff(
            agent=billing_agent,
            on_handoff=transfer_with_context,
            tool_name_override="transfer_to_billing",
            tool_description_override="Transfer to billing specialist for payment and invoice questions.",
        )
    ],
)

In the trace, the handoff span shows:

  1. Which agent initiated the handoff
  2. The tool call that triggered it (the handoff appears as a tool call)
  3. The context string returned by on_handoff
  4. Which agent received the handoff

If the context string is empty or incorrect, the downstream agent will lack the information it needs. The trace makes this immediately visible.
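Once the trace has exposed a bad handoff, a cheap regression guard is to check the context string before the transfer happens. This is an illustrative helper, not an SDK feature:

```python
def check_handoff_context(context: str, required_markers: list[str]) -> list[str]:
    """Return the required markers missing from a handoff context string.

    An empty result means the downstream agent gets what it needs; anything
    else is worth logging alongside the trace URL.
    """
    problems: list[str] = []
    if not context.strip():
        problems.append("<empty context>")
    problems.extend(m for m in required_markers if m not in context)
    return problems
```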

Profiling Tool Call Latency

Slow workflows are usually caused by slow tool calls. The trace shows the duration of every span, so you can identify which tool calls are the bottleneck:

from agents import function_tool
import httpx
import time

@function_tool
async def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant articles."""
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://internal-api.example.com/search",
            json={"query": query, "limit": 5},
        )
        response.raise_for_status()
        elapsed = time.monotonic() - start

        if elapsed > 2.0:
            # This will appear in application logs alongside the trace
            print(f"SLOW_TOOL: search_knowledge_base took {elapsed:.2f}s for query: {query}")

        results = response.json()
        return "\n".join(
            f"- {r['title']}: {r['snippet']}" for r in results["articles"]
        )

In the trace dashboard, sort spans by duration to find the slowest ones. Common findings include:

  • External API calls that take 3-5 seconds and dominate total latency
  • Multiple sequential tool calls that could be parallelized
  • Redundant tool calls where the agent asks for the same data twice
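When the trace shows independent tool calls running back to back, the fix is often to gather them concurrently. A minimal sketch with asyncio, where the sleep stands in for a slow external request:

```python
import asyncio
import time

async def fetch_source(name: str, delay: float) -> str:
    # Stand-in for a slow external call (e.g. an HTTP request)
    await asyncio.sleep(delay)
    return f"{name}-result"

async def parallel_fetch() -> list[str]:
    # Independent calls run concurrently: total latency is the max, not the sum
    return list(await asyncio.gather(
        fetch_source("pricing", 0.1),
        fetch_source("inventory", 0.1),
    ))

start = time.monotonic()
results = asyncio.run(parallel_fetch())
elapsed = time.monotonic() - start
```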

Analyzing Generation Spans for Token Waste

Each generation span shows the input tokens, output tokens, and model used. This is invaluable for spotting token waste:

from agents import Agent, ModelSettings

# Problem: agent gets the entire conversation history, using excessive tokens
# Solution: use truncation to manage context window
efficient_agent = Agent(
    name="EfficientAgent",
    instructions="You answer questions concisely.",
    model_settings=ModelSettings(
        truncation="auto",  # Automatically truncate long histories
        max_tokens=500,     # Cap output length
    ),
)

When reviewing generation spans in the trace, look for:

  • Input token counts that grow linearly with conversation length
  • Output tokens that are much longer than necessary
  • Repeated context that could be summarized before passing to the next agent
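One concrete remedy for input tokens that grow with every turn is to truncate the history yourself before passing it to the next agent. A sketch, assuming chat-style message dicts with a `role` key:

```python
def truncate_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    # Keep every system message plus only the most recent conversation turns
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    if len(rest) <= max_turns:
        return messages
    return system + rest[-max_turns:]
```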

Correlating Errors Across Agents

When the final output is wrong, the error often originates several agents upstream. The trace lets you walk backwards from the final agent to find the root cause:

from agents import Agent, Runner, HandoffOutputItem, ToolCallItem, ToolCallOutputItem

async def debug_workflow(user_input: str):
    result = await Runner.run(orchestrator, input=user_input)

    # Token usage for each LLM call in the run
    for response in result.raw_responses:
        print(f"Input tokens: {response.usage.input_tokens}, "
              f"output tokens: {response.usage.output_tokens}")

    # Execution path: handoffs and tool activity, in order
    for item in result.new_items:
        if isinstance(item, HandoffOutputItem):
            print(f"Handoff: {item.source_agent.name} -> {item.target_agent.name}")
        elif isinstance(item, ToolCallItem):
            print(f"Tool call by agent: {item.agent.name}")
        elif isinstance(item, ToolCallOutputItem):
            print(f"  Tool result: {str(item.output)[:100]}")

This programmatic inspection supplements the visual trace. You can pipe this output to a log aggregator and set up alerts for common failure patterns.
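For example, the aggregated output can be scanned for known failure signatures before deciding whether to alert. The patterns below are illustrative, not a standard list:

```python
import re

# Hypothetical failure signatures matched against aggregated run logs
FAILURE_PATTERNS = {
    "slow_tool": re.compile(r"SLOW_TOOL:"),
    "empty_tool_result": re.compile(r"Tool result:\s*$", re.MULTILINE),
    "unknown_account": re.compile(r"Account ID: unknown"),
}

def scan_log(log_text: str) -> list[str]:
    # Names of every pattern that fired, for routing to the right alert channel
    return [name for name, pattern in FAILURE_PATTERNS.items()
            if pattern.search(log_text)]
```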

Setting Up Trace-Based Alerts

For production systems, you should create alerts based on trace data:

import time

from agents import trace, Runner

async def monitored_run(agent, input_text: str, max_duration: float = 10.0):
    with trace(workflow_name="monitored-agent-run") as current_trace:
        start = time.monotonic()

        result = await Runner.run(agent, input=input_text)

        elapsed = time.monotonic() - start
        if elapsed > max_duration:
            # send_alert is your own notification helper (e.g. a Slack webhook)
            await send_alert(
                channel="agent-alerts",
                message=f"Agent run exceeded {max_duration}s (took {elapsed:.2f}s). "
                        f"Trace ID: {current_trace.trace_id}",
            )

        return result

Debugging Checklist for Multi-Agent Systems

When a multi-agent workflow produces an incorrect result, follow this systematic approach:

  1. Open the trace — Start at the top level and identify which agent produced the final output
  2. Walk the handoff chain — Check each handoff to verify the right agent was selected and the context was transferred correctly
  3. Inspect generation spans — Read the actual prompts and completions at each step to find where the reasoning went wrong
  4. Check tool call results — Verify that tool calls returned the expected data
  5. Profile durations — Identify whether latency issues are causing timeouts or degraded behavior
  6. Examine guardrail spans — Check if any guardrails fired and whether they correctly allowed or blocked content

Tracing is not just a debugging tool — it is the observability layer that makes multi-agent systems manageable in production. Every multi-agent deployment should have tracing enabled, traces should be named and grouped for searchability, and alerts should fire when traces indicate degraded performance.


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
