---
title: "Debugging Multi-Agent Workflows with OpenAI Traces"
description: "Master the art of debugging multi-agent systems using OpenAI's built-in tracing infrastructure to trace handoffs, profile tool calls, and identify bottlenecks in complex agent pipelines."
canonical: https://callsphere.ai/blog/debugging-multi-agent-workflows-openai-traces
category: "Learn Agentic AI"
tags: ["OpenAI", "Debugging", "Multi-Agent", "Tracing"]
author: "CallSphere Team"
published: 2026-03-14T00:00:00.000Z
updated: 2026-05-06T01:02:41.605Z
---

# Debugging Multi-Agent Workflows with OpenAI Traces

> Master the art of debugging multi-agent systems using OpenAI's built-in tracing infrastructure to trace handoffs, profile tool calls, and identify bottlenecks in complex agent pipelines.

## Why Multi-Agent Debugging Is Different

Single-agent debugging is straightforward — you read the prompt, inspect the output, and fix the disconnect. Multi-agent systems are a different challenge entirely. When an orchestrator hands off to a specialist, which passes context to a reviewer, which calls three tools and then escalates back to the orchestrator, finding out why the final answer is wrong requires tracing through a chain of decisions spread across multiple agents.

OpenAI's Agents SDK includes a built-in tracing system that captures every agent invocation, handoff, tool call, and guardrail evaluation as structured spans within a trace. This post walks through how to use that tracing system to systematically debug multi-agent workflows in production.

## Understanding the Trace Hierarchy

Every call to `Runner.run()` produces a trace. Inside that trace, spans are nested in a hierarchy that mirrors the agent execution:

```mermaid
flowchart TD
    TRACE(["Trace: workflow run"])
    A1["Agent span: Orchestrator"]
    G1["Generation span: LLM call"]
    H1["Handoff span: transfer_to_research"]
    A2["Agent span: ResearchAgent"]
    G2["Generation span: LLM call"]
    T1["Tool call span: search_knowledge_base"]
    GR["Guardrail span: output check"]
    TRACE --> A1
    A1 --> G1
    A1 --> H1
    H1 --> A2
    A2 --> G2
    A2 --> T1
    A2 --> GR
    style TRACE fill:#4f46e5,stroke:#4338ca,color:#fff
    style H1 fill:#f59e0b,stroke:#d97706,color:#1f2937
```

- **Trace** — the top-level container for the entire workflow
- **Agent span** — one per agent invocation
- **Generation span** — each LLM call made by the agent
- **Tool call span** — each function tool invocation
- **Handoff span** — when the agent transfers to another agent
- **Guardrail span** — input or output guardrail evaluations

This hierarchy is the foundation of all debugging. Let's start by enabling tracing and examining the output.

## Enabling and Viewing Traces

Tracing is enabled by default when you use the Agents SDK. Every run produces a trace visible in the OpenAI dashboard:

```python
from agents import Agent, Runner
import asyncio

research_agent = Agent(
    name="ResearchAgent",
    instructions="You research topics thoroughly using available tools.",
)

writer_agent = Agent(
    name="WriterAgent",
    instructions="You write clear, structured content based on research.",
)

orchestrator = Agent(
    name="Orchestrator",
    instructions="Route research requests to ResearchAgent, then pass results to WriterAgent.",
    handoffs=[research_agent, writer_agent],
)

async def main():
    result = await Runner.run(
        orchestrator,
        input="Write a report on the current state of quantum computing.",
    )
    print(result.final_output)
    # The run is traced automatically; find it in the Traces view of the OpenAI dashboard

asyncio.run(main())
```

After running this, open the Traces view in the OpenAI dashboard (platform.openai.com/traces) to see the full span hierarchy: every agent that was invoked, every LLM generation, every handoff event.
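
The built-in spans cover agents, generations, tools, handoffs, and guardrails. If your workflow has non-agent steps you want visible in the same hierarchy, the SDK's `custom_span` helper can wrap them. Here is a minimal sketch; the normalization step is a stand-in for whatever preprocessing you actually run:

```python
from agents import Runner, custom_span, trace

async def run_with_preprocessing(raw_input: str):
    with trace(workflow_name="report-generation"):
        # Wrap a non-agent step in a custom span so it appears
        # in the same trace as the agent run that follows.
        with custom_span("normalize-input", data={"length": len(raw_input)}):
            cleaned = raw_input.strip()  # hypothetical preprocessing step

        result = await Runner.run(orchestrator, input=cleaned)
        return result.final_output
```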

## Naming Traces for Searchability

In production, you need to find specific traces quickly. Use the `trace` context manager to give meaningful names and metadata:

```python
from agents import trace, Runner

async def handle_support_ticket(ticket_id: str, message: str):
    with trace(
        workflow_name="support-ticket-resolution",
        group_id=ticket_id,
        metadata={"ticket_id": ticket_id, "channel": "email"},
    ):
        result = await Runner.run(
            triage_agent,
            input=message,
        )
        return result.final_output
```

The `workflow_name` groups related traces in the dashboard. The `group_id` ties traces to a specific entity (like a ticket or session). The `metadata` dictionary adds arbitrary key-value pairs you can filter on later.
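
The same context manager can also group multiple `Runner.run` calls into one trace, which keeps a multi-step workflow together in the dashboard instead of scattering it across separate traces. A sketch, reusing the research and writer agents from earlier:

```python
from agents import Runner, trace

async def research_and_write(topic: str):
    # Both runs land in a single trace, so the dashboard shows them
    # as one workflow rather than two unrelated traces.
    with trace(workflow_name="research-then-write", group_id=topic):
        research = await Runner.run(research_agent, input=f"Research: {topic}")
        draft = await Runner.run(
            writer_agent,
            input=f"Write a report from these notes:\n{research.final_output}",
        )
        return draft.final_output
```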

## Diagnosing Handoff Failures

The most common multi-agent bug is a handoff that goes to the wrong agent or fails to transfer critical context. Here is how to diagnose it:

```python
from agents import Agent, handoff, RunContextWrapper

def transfer_with_context(ctx: RunContextWrapper[None]) -> str:
    """Provide context when handing off to the billing specialist."""
    return (
        "The customer has already verified their identity. "
        "Account ID: " + ctx.context.get("account_id", "unknown")
    )

billing_agent = Agent(
    name="BillingSpecialist",
    instructions="Handle billing inquiries. The customer is already verified.",
)

triage_agent = Agent(
    name="TriageAgent",
    instructions="Route billing questions to BillingSpecialist.",
    handoffs=[
        handoff(
            agent=billing_agent,
            on_handoff=transfer_with_context,
            tool_name_override="transfer_to_billing",
            tool_description_override="Transfer to billing specialist for payment and invoice questions.",
        )
    ],
)
```

In the trace, the handoff span shows:

1. Which agent initiated the handoff
2. The tool call that triggered it (the handoff appears as a tool call)
3. The arguments the model supplied for the handoff (the `BillingHandoffData` fields, recorded with the tool call)
4. Which agent received the handoff

If the handoff input is missing fields or carries the wrong values, the downstream agent will lack the information it needs. The trace makes this immediately visible.
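
The opposite failure is transferring too much context: by default the receiving agent sees the full conversation history, including tool chatter it does not need. The SDK ships input filters for pruning that history; the sketch below uses the built-in `remove_all_tools` filter and reuses `billing_agent` from above. If a downstream agent in your trace suddenly lacks information, an over-aggressive filter is one place to look:

```python
from agents import Agent, handoff
from agents.extensions import handoff_filters

triage_agent_pruned = Agent(
    name="TriageAgent",
    instructions="Route billing questions to BillingSpecialist.",
    handoffs=[
        handoff(
            agent=billing_agent,  # defined in the example above
            # Strip earlier tool calls and results from the history
            # the billing agent receives, keeping its context focused.
            input_filter=handoff_filters.remove_all_tools,
        )
    ],
)
```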

## Profiling Tool Call Latency

Slow workflows are usually caused by slow tool calls. The trace shows the duration of every span, so you can identify which tool calls are the bottleneck:

```python
from agents import function_tool
import httpx
import time

@function_tool
async def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant articles."""
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://internal-api.example.com/search",
            json={"query": query, "limit": 5},
        )
        response.raise_for_status()
        elapsed = time.monotonic() - start

        if elapsed > 2.0:
            # This will appear in application logs alongside the trace
            print(f"SLOW_TOOL: search_knowledge_base took {elapsed:.2f}s for query: {query}")

        results = response.json()
        return "\n".join(
            f"- {r['title']}: {r['snippet']}" for r in results["articles"]
        )
```

In the trace dashboard, sort spans by duration to find the slowest ones. Common findings include:

- External API calls that take 3-5 seconds and dominate total latency
- Multiple sequential tool calls that could be parallelized
- Redundant tool calls where the agent asks for the same data twice (a caching sketch follows below)
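
That last pattern is usually the cheapest to fix. A small in-process cache in front of the tool absorbs repeated lookups; the sketch below is illustrative (the TTL value and the `_search_backend` helper are hypothetical, not SDK features):

```python
import time

from agents import function_tool

_CACHE: dict[str, tuple[float, str]] = {}
_TTL_SECONDS = 300  # illustrative: tune to how fresh results need to be

@function_tool
async def cached_search(query: str) -> str:
    """Search the knowledge base, reusing recent results for repeated queries."""
    now = time.monotonic()
    hit = _CACHE.get(query)
    if hit and now - hit[0] < _TTL_SECONDS:
        return hit[1]  # cache hit: the slow network call is skipped entirely
    result = await _search_backend(query)  # hypothetical: your real search call
    _CACHE[query] = (now, result)
    return result
```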

## Analyzing Generation Spans for Token Waste

Each generation span shows the input tokens, output tokens, and model used. This is invaluable for spotting token waste:

```python
from agents import Agent, ModelSettings

# Problem: agent gets the entire conversation history, using excessive tokens
# Solution: use truncation to manage context window
efficient_agent = Agent(
    name="EfficientAgent",
    instructions="You answer questions concisely.",
    model_settings=ModelSettings(
        truncation="auto",  # Automatically truncate long histories
        max_tokens=500,     # Cap output length
    ),
)
```

When reviewing generation spans in the trace, look for:

- Input token counts that grow linearly with conversation length
- Output tokens that are much longer than necessary
- Repeated context that could be summarized before passing to the next agent (see the sketch below)
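
That last pattern responds well to an explicit compression step: run a cheap summarizer over the accumulated context before invoking the next agent, so input tokens stay bounded. A sketch, with an illustrative model choice and the writer agent from earlier:

```python
from agents import Agent, Runner

summarizer = Agent(
    name="Summarizer",
    instructions="Condense the following research notes into the key facts only.",
    model="gpt-4o-mini",  # illustrative: a cheap model is fine for compression
)

async def compressed_handoff(notes: str) -> str:
    # Summarize before passing downstream; the next agent's generation
    # span should show a much smaller input token count.
    summary = await Runner.run(summarizer, input=notes)
    result = await Runner.run(writer_agent, input=summary.final_output)
    return result.final_output
```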

## Correlating Errors Across Agents

When the final output is wrong, the error often originates several agents upstream. The trace lets you walk backwards from the final agent to find the root cause:

```python
from agents import Agent, Runner

async def debug_workflow(user_input: str):
    result = await Runner.run(orchestrator, input=user_input)

    # Print the full execution path
    for item in result.raw_responses:
        print(f"Agent: {item.agent_name}")
        print(f"  Input tokens: {item.usage.input_tokens}")
        print(f"  Output tokens: {item.usage.output_tokens}")
        if item.handoff:
            print(f"  Handed off to: {item.handoff.target_agent}")
        for tool_call in item.tool_calls:
            print(f"  Tool: {tool_call.name} -> {tool_call.result[:100]}")
        print()
```

This programmatic inspection supplements the visual trace. You can pipe this output to a log aggregator and set up alerts for common failure patterns.
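
For the aggregator, structured JSON beats free text: emit one JSON object per run and alerting queries become trivial. A sketch built on the same result fields as above:

```python
import json

from agents import HandoffOutputItem, Runner, ToolCallItem

async def log_run_summary(user_input: str):
    result = await Runner.run(orchestrator, input=user_input)
    summary = {
        "workflow": "orchestrator-run",
        "llm_calls": len(result.raw_responses),
        "input_tokens": sum(r.usage.input_tokens for r in result.raw_responses),
        "output_tokens": sum(r.usage.output_tokens for r in result.raw_responses),
        "handoffs": sum(isinstance(i, HandoffOutputItem) for i in result.new_items),
        "tool_calls": sum(isinstance(i, ToolCallItem) for i in result.new_items),
    }
    print(json.dumps(summary))  # one line per run, ready for a log aggregator
    return result
```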

## Setting Up Trace-Based Alerts

For production systems, you should create alerts based on trace data:

```python
from agents import trace, Runner

async def monitored_run(agent, input_text: str, max_duration: float = 10.0):
    with trace(workflow_name="monitored-agent-run") as current_trace:
        import time
        start = time.monotonic()

        result = await Runner.run(agent, input=input_text)

        elapsed = time.monotonic() - start
        if elapsed > max_duration:
            await send_alert(
                channel="agent-alerts",
                message=f"Agent run exceeded {max_duration}s (took {elapsed:.2f}s). "
                        f"Trace: {current_trace.trace_url}",
            )

        return result
```
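
The snippet above times a single run from the outside. For fleet-wide alerting, the SDK also lets you register a custom trace processor that receives every trace and span as it completes. A minimal sketch that flags errored spans, assuming the `TracingProcessor` interface from `agents.tracing`; the print is a placeholder for your real alerting call:

```python
from typing import Any

from agents.tracing import Span, Trace, TracingProcessor, add_trace_processor

class ErrorAlertProcessor(TracingProcessor):
    """Flags any span that ends with an error attached."""

    def on_trace_start(self, trace: Trace) -> None:
        pass

    def on_trace_end(self, trace: Trace) -> None:
        pass

    def on_span_start(self, span: Span[Any]) -> None:
        pass

    def on_span_end(self, span: Span[Any]) -> None:
        if span.error:
            # Placeholder: route this to your real alerting system
            print(f"SPAN_ERROR trace={span.trace_id} span={span.span_id}: {span.error}")

    def shutdown(self) -> None:
        pass

    def force_flush(self) -> None:
        pass

# Registers alongside the default OpenAI exporter rather than replacing it
add_trace_processor(ErrorAlertProcessor())
```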

## Debugging Checklist for Multi-Agent Systems

When a multi-agent workflow produces an incorrect result, follow this systematic approach:

1. **Open the trace** — Start at the top level and identify which agent produced the final output
2. **Walk the handoff chain** — Check each handoff to verify the right agent was selected and the context was transferred correctly
3. **Inspect generation spans** — Read the actual prompts and completions at each step to find where the reasoning went wrong
4. **Check tool call results** — Verify that tool calls returned the expected data
5. **Profile durations** — Identify whether latency issues are causing timeouts or degraded behavior
6. **Examine guardrail spans** — Check if any guardrails fired and whether they correctly allowed or blocked content

Tracing is not just a debugging tool — it is the observability layer that makes multi-agent systems manageable in production. Every multi-agent deployment should have tracing enabled, traces should be named and grouped for searchability, and alerts should fire when traces indicate degraded performance.

