Learn Agentic AI · 11 min read

Unit Testing AI Agents: Mocking LLM Calls for Fast, Deterministic Tests

Learn how to mock LLM API calls in your AI agent tests using FakeLLM objects, response fixtures, and assertion patterns for fast, deterministic, cost-free unit tests.

Why Unit Testing Agents Requires Special Patterns

AI agents depend on LLM calls that are non-deterministic, slow, and expensive. A single GPT-4 call takes 2-10 seconds and costs tokens — making it impractical to run hundreds of tests on every commit. Unit tests must be fast, free, and repeatable, which means you need a strategy for replacing real LLM calls with controlled substitutes.

The core challenge is that LLM outputs vary between calls even with temperature=0. Your tests need to verify your agent's logic — tool selection, state management, output parsing — without coupling to the exact wording an LLM produces.

Strategy 1: FakeLLM Classes

Create a drop-in replacement for your LLM client that returns predetermined responses.

from dataclasses import dataclass, field

@dataclass
class FakeLLM:
    """A deterministic LLM replacement for unit tests."""
    responses: list[str] = field(default_factory=list)
    call_log: list[dict] = field(default_factory=list)
    _call_index: int = 0

    def chat(self, messages: list[dict], **kwargs) -> dict:
        self.call_log.append({"messages": messages, **kwargs})
        if self._call_index >= len(self.responses):
            raise AssertionError(
                f"FakeLLM exhausted after {self._call_index} calls; pre-load more responses."
            )
        response = self.responses[self._call_index]
        self._call_index += 1
        return {"role": "assistant", "content": response}

This pattern lets you pre-load a sequence of responses and later inspect exactly what your agent sent to the LLM.

Strategy 2: Response Fixtures with pytest

Store realistic LLM responses as fixtures so multiple tests can share them.


import pytest
import json
from pathlib import Path

@pytest.fixture
def tool_call_response():
    """Fixture simulating an LLM response that invokes a tool."""
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "search_database",
                    "arguments": json.dumps({"query": "open tickets", "limit": 10}),
                },
            }
        ],
    }

@pytest.fixture
def fixture_dir():
    return Path(__file__).parent / "fixtures"

def load_fixture(fixture_dir: Path, name: str) -> dict:
    return json.loads((fixture_dir / f"{name}.json").read_text())

Storing fixtures as JSON files in a tests/fixtures/ directory keeps tests clean and makes it easy to update expected responses when your prompts change.
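To show how a test might consume the fixture's payload, here is a minimal sketch. `dispatch_tool_call` and the stub `search_database` are hypothetical helpers invented for illustration, not part of any library:

```python
import json

def dispatch_tool_call(response: dict, registry: dict) -> object:
    """Route the first tool call in an LLM response to a registered function."""
    call = response["tool_calls"][0]["function"]
    args = json.loads(call["arguments"])
    return registry[call["name"]](**args)

# The same payload the tool_call_response fixture returns.
response = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "search_database",
            "arguments": json.dumps({"query": "open tickets", "limit": 10}),
        },
    }],
}

def search_database(query: str, limit: int) -> list[str]:
    """Stub tool: records that it was called with the parsed arguments."""
    return [f"{query} ({limit})"]

result = dispatch_tool_call(response, {"search_database": search_database})
```

Because the fixture carries real `arguments` JSON, the test exercises your argument-parsing and dispatch logic exactly as a live tool call would.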

Strategy 3: Patching with unittest.mock

Use unittest.mock.patch to intercept LLM calls at the boundary.

from unittest.mock import patch, MagicMock
from my_agent.core import Agent

def test_agent_extracts_entities():
    fake_response = MagicMock()
    fake_response.choices = [
        MagicMock(message=MagicMock(
            content='{"entities": ["Acme Corp", "Jane Doe"]}',
            tool_calls=None,
        ))
    ]

    with patch("my_agent.core.openai_client.chat.completions.create") as mock_create:
        mock_create.return_value = fake_response
        agent = Agent()
        result = agent.extract_entities("Contact Jane Doe at Acme Corp")

    assert result == ["Acme Corp", "Jane Doe"]
    mock_create.assert_called_once()
    call_args = mock_create.call_args
    assert any("extract" in str(m) for m in call_args.kwargs["messages"])

Assertion Patterns for Agent Tests

Focus your assertions on what your code controls, not on LLM output text.

def test_agent_selects_correct_tool(fake_llm):
    """Verify the agent passes the right tools to the LLM."""
    fake_llm.responses = ['{"action": "search", "query": "test"}']
    agent = Agent(llm=fake_llm)

    agent.run("Find recent orders")

    call = fake_llm.call_log[0]
    tool_names = [t["function"]["name"] for t in call["tools"]]
    assert "search_orders" in tool_names
    assert "delete_account" not in tool_names  # safety check

def test_agent_retries_on_parse_failure(fake_llm):
    """Verify retry logic when LLM returns malformed JSON."""
    fake_llm.responses = ["not json", '{"action": "search"}']
    agent = Agent(llm=fake_llm, max_retries=2)

    result = agent.run("Find orders")

    assert len(fake_llm.call_log) == 2  # retried once
    assert result["action"] == "search"

FAQ

How do I handle streaming responses in unit tests?

Create an async generator fixture that yields predetermined chunks. Replace the streaming client method with this generator using patch. This lets you test your chunk-assembly logic without a real stream.
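A minimal sketch of that approach, with `assemble` standing in for your chunk-assembly logic (both names are illustrative):

```python
import asyncio

async def fake_stream(chunks):
    """Async generator yielding predetermined chunks, like a streaming client."""
    for chunk in chunks:
        yield chunk

async def assemble(stream) -> str:
    """The chunk-assembly logic under test: concatenate streamed deltas."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)

result = asyncio.run(assemble(fake_stream(["Hel", "lo ", "world"])))
```

In a real suite you would `patch` the client's streaming method to return `fake_stream(...)`, so the agent code consumes the fake exactly as it would a live stream.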

Should I use temperature=0 instead of mocking?

Setting temperature=0 reduces variance but does not eliminate it — model updates can still change outputs. It also still costs tokens and takes seconds per call. Use temperature=0 for integration tests, but always mock for unit tests.

How many response fixtures should I maintain?

Keep a small, representative set: one normal response, one tool-call response, one refusal, one malformed response, and one empty response. Five to ten fixtures cover most agent logic paths without becoming a maintenance burden.


#UnitTesting #AIAgents #Mocking #Pytest #Python #Testing #AgenticAI #LearnAI #AIEngineering


Written by CallSphere Team

Expert insights on AI voice agents and customer communication automation.
