---
title: "Testing Multi-Agent Handoffs: Verifying Routing Logic and Context Transfer"
description: "Learn how to test multi-agent handoff logic, verify conversation routing, validate context transfer between agents, and test boundary conditions in agent orchestration systems."
canonical: https://callsphere.ai/blog/testing-multi-agent-handoffs-routing-logic-context-transfer
category: "Learn Agentic AI"
tags: ["Multi-Agent", "Handoffs", "Routing", "Testing", "Python", "Orchestration"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T11:21:58.934Z
---

# Testing Multi-Agent Handoffs: Verifying Routing Logic and Context Transfer

> Learn how to test multi-agent handoff logic, verify conversation routing, validate context transfer between agents, and test boundary conditions in agent orchestration systems.

## Why Handoff Testing Is Critical

In multi-agent systems, a triage agent routes conversations to specialized agents — billing, technical support, sales. Handoff failures are some of the most damaging bugs: a customer asking about a refund gets routed to tech support, or context is lost during transfer and the next agent asks the customer to repeat everything.

Testing handoffs requires verifying three things: the router selects the right destination agent, the full conversation context transfers correctly, and edge cases like ambiguous requests or mid-conversation re-routing work properly.

## Modeling Handoffs for Testability

Define handoffs as explicit, inspectable objects rather than implicit side effects.

```mermaid
flowchart LR
    USER(["User message"])
    TRIAGE["Triage agent"]
    DECIDE{"Handoff?"}
    RESPOND(["Respond directly"])
    CTX[("Handoff object
context plus history")]
    BILLING["Billing agent"]
    TECH["Tech support agent"]
    SALES["Sales agent"]
    USER --> TRIAGE --> DECIDE
    DECIDE -->|No| RESPOND
    DECIDE -->|Yes| CTX
    CTX --> BILLING
    CTX --> TECH
    CTX --> SALES
    style TRIAGE fill:#4f46e5,stroke:#4338ca,color:#fff
    style DECIDE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style CTX fill:#059669,stroke:#047857,color:#fff
```

```python
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Handoff:
    source_agent: str
    target_agent: str
    reason: str
    context: dict = field(default_factory=dict)
    conversation_history: list[dict] = field(default_factory=list)

@dataclass
class HandoffResult:
    should_handoff: bool
    handoff: Optional[Handoff] = None
    response: Optional[str] = None

class TriageAgent:
    def __init__(self, llm, available_agents: list[str]):
        self.llm = llm
        self.available_agents = available_agents

    def _build_routing_prompt(self) -> str:
        agents = ", ".join(self.available_agents)
        return (
            f"You are a triage agent. Available specialists: {agents}. "
            'Reply with JSON: {"action": "handoff", "target": ..., '
            '"reason": ..., "extracted_context": {...}} to transfer, or '
            '{"action": "respond", "response": ...} to answer directly.'
        )

    def _parse_decision(self, content: str) -> dict:
        return json.loads(content)

    def process(self, message: str, history: list[dict]) -> HandoffResult:
        # LLM determines routing
        decision = self.llm.chat([
            {"role": "system", "content": self._build_routing_prompt()},
            *history,
            {"role": "user", "content": message},
        ])
        parsed = self._parse_decision(decision["content"])

        if parsed["action"] == "handoff":
            target = parsed["target"]
            # Reject hallucinated destinations before transferring control
            if target not in self.available_agents:
                raise ValueError(f"Unknown agent: {target}")
            return HandoffResult(
                should_handoff=True,
                handoff=Handoff(
                    source_agent="triage",
                    target_agent=target,
                    reason=parsed["reason"],
                    context=parsed.get("extracted_context", {}),
                    conversation_history=history + [
                        {"role": "user", "content": message}
                    ],
                ),
            )
        return HandoffResult(should_handoff=False, response=parsed["response"])
```
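The handoff object is only half the contract: the destination agent has to turn it back into a prompt. A minimal sketch of that consumption side, using a hypothetical `handoff_to_messages` helper (not part of the article's API; the `Handoff` dataclass is mirrored here so the snippet runs standalone):

```python
from dataclasses import dataclass, field

# Mirrors the Handoff dataclass above so this snippet runs standalone.
@dataclass
class Handoff:
    source_agent: str
    target_agent: str
    reason: str
    context: dict = field(default_factory=dict)
    conversation_history: list[dict] = field(default_factory=list)

def handoff_to_messages(handoff: Handoff) -> list[dict]:
    """Build the destination agent's opening message list: a system
    briefing with the transfer reason and extracted context, followed
    by the full conversation history."""
    briefing = (
        f"Transferred from {handoff.source_agent}. "
        f"Reason: {handoff.reason}. "
        f"Known context: {handoff.context}"
    )
    return [{"role": "system", "content": briefing}, *handoff.conversation_history]
```

Because the briefing embeds the extracted context, a test can assert that key facts (the amount, the issue) appear in the destination agent's very first system message, which is exactly what prevents the "please repeat everything" failure mode.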

## Testing Routing Decisions

Use a FakeLLM to control routing decisions and verify the triage agent routes correctly.
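The tests below rely on a `FakeLLM` test double, which is not defined in the snippets above. A minimal sketch, assuming an OpenAI-style `chat(messages) -> {"content": ...}` interface:

```python
class FakeLLM:
    """Test double that returns canned responses in order and records
    every message list it receives, so tests can assert on both the
    routing output and the prompt the agent actually sent."""

    def __init__(self, responses: list[str]):
        self.responses = list(responses)
        self.calls: list[list[dict]] = []

    def chat(self, messages: list[dict]) -> dict:
        self.calls.append(messages)
        return {"content": self.responses.pop(0)}
```

Recording `calls` is worth the two extra lines: it lets a test verify that the full conversation history reached the model, not just that the right decision came back.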

```python
import pytest

@pytest.fixture
def fake_llm():
    return FakeLLM(responses=[])

def make_routing_response(target: str, reason: str) -> str:
    return f'{{"action": "handoff", "target": "{target}", "reason": "{reason}"}}'

def test_billing_question_routes_to_billing(fake_llm):
    fake_llm.responses = [make_routing_response("billing", "refund request")]
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech", "sales"])

    result = triage.process("I want a refund for my last charge", history=[])

    assert result.should_handoff is True
    assert result.handoff.target_agent == "billing"

def test_technical_issue_routes_to_tech(fake_llm):
    fake_llm.responses = [make_routing_response("tech", "login error")]
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech", "sales"])

    result = triage.process("I cannot log in to my account", history=[])

    assert result.should_handoff is True
    assert result.handoff.target_agent == "tech"

def test_general_question_stays_in_triage(fake_llm):
    fake_llm.responses = ['{"action": "respond", "response": "How can I help?"}']
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech", "sales"])

    result = triage.process("Hello", history=[])

    assert result.should_handoff is False
    assert result.response is not None
```
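As the routing table grows, one test per destination stops scaling. The same checks compress into a table-driven form with `pytest.mark.parametrize`; this sketch uses a stub `routed_target` function and a compact fake in place of the full `TriageAgent`, so it runs standalone:

```python
import json
from typing import Optional

import pytest

class FakeLLM:  # compact redefinition so this snippet runs standalone
    def __init__(self, responses: list[str]):
        self.responses = list(responses)

    def chat(self, messages: list[dict]) -> dict:
        return {"content": self.responses.pop(0)}

def routed_target(llm: FakeLLM, message: str) -> Optional[str]:
    """Stub standing in for TriageAgent.process: returns the target
    agent name, or None when the decision is a direct response."""
    decision = llm.chat([{"role": "user", "content": message}])
    return json.loads(decision["content"]).get("target")

@pytest.mark.parametrize("message, canned, expected", [
    ("I want a refund", '{"action": "handoff", "target": "billing"}', "billing"),
    ("The app crashes on login", '{"action": "handoff", "target": "tech"}', "tech"),
    ("Do you offer annual plans?", '{"action": "handoff", "target": "sales"}', "sales"),
    ("Hello", '{"action": "respond", "response": "Hi!"}', None),
])
def test_routing_table(message, canned, expected):
    assert routed_target(FakeLLM([canned]), message) == expected
```

Each row pairs a user message with the canned routing decision and the expected destination, so adding a new specialist agent is a one-line change to the table.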

## Testing Context Transfer

Verify that all relevant context passes from the source agent to the destination agent.

```python
def test_context_includes_conversation_history(fake_llm):
    fake_llm.responses = [make_routing_response("billing", "payment issue")]
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech"])

    history = [
        {"role": "user", "content": "Hi, I have a problem"},
        {"role": "assistant", "content": "Sure, what is the issue?"},
    ]
    result = triage.process("I was double charged $49.99", history=history)

    # Full history must transfer — no lost context
    assert len(result.handoff.conversation_history) == 3
    assert result.handoff.conversation_history[-1]["content"] == "I was double charged $49.99"

def test_extracted_context_contains_key_info(fake_llm):
    fake_llm.responses = [
        '{"action": "handoff", "target": "billing", "reason": "refund",'
        ' "extracted_context": {"amount": "$49.99", "issue": "double charge"}}'
    ]
    triage = TriageAgent(llm=fake_llm, available_agents=["billing"])

    result = triage.process("I was double charged $49.99", history=[])

    assert result.handoff.context["amount"] == "$49.99"
    assert result.handoff.context["issue"] == "double charge"
```
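Counting messages catches truncation but not reordering or silent drops in longer conversations. A reusable, order-aware check (a hypothetical helper, not part of the article's API) can be shared across all context-transfer tests:

```python
def assert_no_lost_context(original_history: list[dict],
                           transferred_history: list[dict]) -> None:
    """Assert every original message appears, in order, in the
    transferred history. The transfer may append messages (e.g. the
    triggering user turn) but must never drop or reorder any."""
    remaining = iter(transferred_history)
    for msg in original_history:
        # any() consumes `remaining` up to and including the match,
        # which enforces in-order appearance.
        if not any(candidate == msg for candidate in remaining):
            raise AssertionError(f"Lost or reordered message during handoff: {msg!r}")
```

Dropping this into `test_context_includes_conversation_history` turns a brittle length assertion into a check that survives agents legitimately appending their own messages to the transferred history.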

## Testing Boundary Conditions

Target the edge cases where routing logic is most likely to fail: invalid targets, ambiguous requests, and mid-conversation topic changes.

```python
def test_invalid_target_agent_raises_error(fake_llm):
    fake_llm.responses = [make_routing_response("nonexistent_agent", "test")]
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech"])

    with pytest.raises(ValueError, match="Unknown agent"):
        triage.process("Route me somewhere", history=[])

def test_ambiguous_request_asks_clarification(fake_llm):
    """When the intent is unclear, triage should ask rather than guess."""
    fake_llm.responses = ['{"action": "respond", "response": "Could you clarify?"}']
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech"])

    result = triage.process("I have a problem", history=[])

    assert result.should_handoff is False
    assert "clarif" in result.response.lower()

def test_mid_conversation_rerouting(fake_llm):
    """Agent should re-route if the topic changes mid-conversation."""
    fake_llm.responses = [make_routing_response("tech", "now a tech issue")]
    triage = TriageAgent(llm=fake_llm, available_agents=["billing", "tech"])

    history = [
        {"role": "user", "content": "I need a refund"},
        {"role": "assistant", "content": "Let me connect you to billing."},
        {"role": "user", "content": "Actually, my app keeps crashing"},
    ]
    result = triage.process("The crash happens on every login", history=history)

    assert result.handoff.target_agent == "tech"
```
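One more boundary worth covering is a garbled routing decision. Assuming the decision parser delegates to `json.loads` (as a straightforward implementation would), malformed model output should fail loudly rather than silently misroute the conversation; `parse_decision` here is a standalone stand-in for that parser:

```python
import json

import pytest

def parse_decision(content: str) -> dict:
    # Stand-in for TriageAgent._parse_decision; assumes strict JSON.
    # Adjust if your parser is more forgiving (e.g. strips code fences).
    return json.loads(content)

def test_malformed_routing_json_raises():
    """A garbled routing decision must surface as an error, not a guess."""
    with pytest.raises(json.JSONDecodeError):
        parse_decision("handoff to billing please")  # not valid JSON
```

In production you would catch this error at the orchestrator level and fall back to a clarifying response, but the test pins down that the failure is detectable at all.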

## FAQ

### How do I test handoffs with the OpenAI Agents SDK?

The Agents SDK models handoffs as special tool calls. Mock the LLM to return a handoff tool call, then verify the runner transfers control to the correct agent and carries the conversation state.

### Should handoff tests use real LLMs?

Use mocked LLMs for unit tests of routing logic. Use real LLMs in a small set of integration tests that verify end-to-end handoff flows, especially for ambiguous cases where routing quality depends on prompt wording.

### What is the most common handoff bug?

Lost context. The destination agent does not receive the conversation history or extracted entities, so it asks the user to repeat information. Always assert that the handoff object contains the complete conversation history.

---

#MultiAgent #Handoffs #Routing #Testing #Python #Orchestration #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/testing-multi-agent-handoffs-routing-logic-context-transfer
