Unit Testing AI Agents: Mocking LLM Calls for Fast, Deterministic Tests

Why Unit Testing Agents Requires Special Patterns

AI agents depend on LLM calls that are non-deterministic, slow, and expensive. A single GPT-4 call can take 2-10 seconds and is billed per token, which makes it impractical to run hundreds of tests on every commit. Unit tests must be fast, free, and repeatable, so you need a strategy for replacing real LLM calls with controlled substitutes.

The core challenge is that LLM outputs can vary between calls even with temperature=0. Your tests need to verify your agent's own logic (tool selection, state management, output parsing) without coupling to the exact wording an LLM produces.

Strategy 1: FakeLLM Classes

Create a drop-in replacement for your LLM client that returns predetermined responses.

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff

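from dataclasses import dataclass, field

@dataclass
class FakeLLM:
    """A deterministic LLM replacement for unit tests."""
    responses: list[str] = field(default_factory=list)  # scripted replies, returned in order
    call_log: list[dict] = field(default_factory=list)  # every request the agent made
    _call_index: int = 0

    def chat(self, messages: list[dict], **kwargs) -> dict:
        # Record the request first so even a failing call shows up in the log.
        self.call_log.append({"messages": messages, **kwargs})
        if self._call_index >= len(self.responses):
            raise AssertionError(
                f"FakeLLM ran out of responses after {self._call_index} call(s)"
            )
        response = self.responses[self._call_index]
        self._call_index += 1
        return {"role": "assistant", "content": response}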

This pattern lets you pre-load a sequence of responses and later inspect exactly what your agent sent to the LLM.
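As a rough illustration of that workflow, the sketch below pre-loads one reply and then inspects the call log. The `summarize` function is a hypothetical agent step invented for this example, and the fake is repeated in compact form only so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLLM:
    """Compact copy of the fake above so this sketch runs on its own."""
    responses: list = field(default_factory=list)
    call_log: list = field(default_factory=list)
    _call_index: int = 0

    def chat(self, messages, **kwargs):
        self.call_log.append({"messages": messages, **kwargs})
        response = self.responses[self._call_index]
        self._call_index += 1
        return {"role": "assistant", "content": response}


def summarize(llm, text: str) -> str:
    """Hypothetical agent step: build a prompt, call the LLM, tidy the reply."""
    messages = [
        {"role": "system", "content": "Summarize the user's text in one sentence."},
        {"role": "user", "content": text},
    ]
    reply = llm.chat(messages, temperature=0)
    return reply["content"].strip()


def test_summarize_sends_user_text_and_strips_reply():
    llm = FakeLLM(responses=["  Order 42 shipped late.  "])

    result = summarize(llm, "Order 42 left the warehouse three days late.")

    # Assert on what our code did with the reply...
    assert result == "Order 42 shipped late."
    # ...and on what our code sent to the LLM.
    assert len(llm.call_log) == 1
    assert llm.call_log[0]["temperature"] == 0
    assert llm.call_log[0]["messages"][-1]["content"].startswith("Order 42")
```

Because the fake is passed in as an argument, the test needs no patching at all; this only works if your agent accepts its LLM client through its constructor or function parameters.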

Strategy 2: Response Fixtures with pytest

Store realistic LLM responses as fixtures so multiple tests can share them.

import pytest
import json
from pathlib import Path

@pytest.fixture
def tool_call_response():
    """Fixture simulating an LLM response that invokes a tool."""
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "search_database",
                    "arguments": json.dumps({"query": "open tickets", "limit": 10}),
                },
            }
        ],
    }

@pytest.fixture
def fixture_dir():
    return Path(__file__).parent / "fixtures"

def load_fixture(fixture_dir: Path, name: str) -> dict:
    return json.loads((fixture_dir / f"{name}.json").read_text())

Storing fixtures as JSON files in a tests/fixtures/ directory keeps tests clean and makes it easy to update expected responses when your prompts change.
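A minimal sketch of the file-based approach follows. It writes a fixture to a temporary directory purely so the example is self-contained; in a real suite the JSON file would be committed under tests/fixtures/. The file name `tool_call_search` and the `parse_tool_call` helper are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_fixture(fixture_dir: Path, name: str) -> dict:
    """Same helper as above: read <name>.json from the fixtures directory."""
    return json.loads((fixture_dir / f"{name}.json").read_text())


def parse_tool_call(message: dict) -> tuple[str, dict]:
    """Hypothetical agent helper: return the tool name and decoded arguments."""
    call = message["tool_calls"][0]["function"]
    return call["name"], json.loads(call["arguments"])


def demo() -> tuple[str, dict]:
    # Stand-in for tests/fixtures/; a real suite would commit the JSON file.
    with tempfile.TemporaryDirectory() as tmp:
        fixture_dir = Path(tmp)
        recorded = {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_abc123",
                    "type": "function",
                    "function": {
                        "name": "search_database",
                        # Arguments arrive as a JSON string, not a dict.
                        "arguments": json.dumps({"query": "open tickets", "limit": 10}),
                    },
                }
            ],
        }
        (fixture_dir / "tool_call_search.json").write_text(json.dumps(recorded))

        message = load_fixture(fixture_dir, "tool_call_search")
        return parse_tool_call(message)
```

Note that the tool arguments are a JSON string nested inside the JSON file, so the parsing code has to decode twice, which is exactly the kind of detail a fixture-driven test catches.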

Strategy 3: Patching with unittest.mock

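Use unittest.mock.patch to intercept LLM calls at the boundary between your agent and the client library. Patch the name where your code looks it up, not where the SDK defines it.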

from unittest.mock import patch, MagicMock
from my_agent.core import Agent

def test_agent_extracts_entities():
    # Mimic the SDK response shape: response.choices[0].message.content
    fake_response = MagicMock()
    fake_response.choices = [
        MagicMock(message=MagicMock(
            content='{"entities": ["Acme Corp", "Jane Doe"]}',
            tool_calls=None,
        ))
    ]

    # The target assumes my_agent.core holds a module-level client named openai_client.
    with patch("my_agent.core.openai_client.chat.completions.create") as mock_create:
        mock_create.return_value = fake_response
        agent = Agent()
        result = agent.extract_entities("Contact Jane Doe at Acme Corp")

    # Check the parsed result, then check what was sent to the LLM.
    assert result == ["Acme Corp", "Jane Doe"]
    mock_create.assert_called_once()
    call_args = mock_create.call_args
    assert any("extract" in str(m).lower() for m in call_args.kwargs["messages"])
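Mocks can also script a sequence of outcomes, which is useful for error paths. The sketch below uses `side_effect` with a list so the first call raises and the second returns a response. The `ask_with_retry` helper and the model name are hypothetical, and the mock client is passed in directly rather than patched, to keep the example self-contained.

```python
from unittest.mock import MagicMock


def make_completion(content: str) -> MagicMock:
    """Build a mock shaped like the response object used in the test above."""
    message = MagicMock(content=content, tool_calls=None)
    return MagicMock(choices=[MagicMock(message=message)])


def ask_with_retry(client, prompt: str, attempts: int = 2) -> str:
    """Hypothetical agent helper: try again if the client raises TimeoutError."""
    last_error = None
    for _ in range(attempts):
        try:
            response = client.chat.completions.create(
                model="hypothetical-model",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except TimeoutError as exc:
            last_error = exc
    raise last_error


def test_retries_after_timeout():
    client = MagicMock()
    # With a list, exceptions are raised and other items are returned, in order.
    client.chat.completions.create.side_effect = [
        TimeoutError("simulated timeout"),
        make_completion("pong"),
    ]

    assert ask_with_retry(client, "ping") == "pong"
    assert client.chat.completions.create.call_count == 2
```

The same `side_effect` list works on the `mock_create` object returned by `patch`, so the two techniques combine if your client is a module-level object.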

Assertion Patterns for Agent Tests

Focus your assertions on what your code controls, not on LLM output text.

import pytest
from my_agent.core import Agent
from tests.fakes import FakeLLM  # hypothetical path; import FakeLLM from wherever you keep it

@pytest.fixture
def fake_llm():
    """A fresh FakeLLM per test so call logs never leak between tests."""
    return FakeLLM()

def test_agent_selects_correct_tool(fake_llm):
    """Verify the agent passes the right tools to the LLM."""
    fake_llm.responses = ['{"action": "search", "query": "test"}']
    agent = Agent(llm=fake_llm)

    agent.run("Find recent orders")

    # The fake records keyword arguments, so the tools list is in the call log.
    call = fake_llm.call_log[0]
    tool_names = [t["function"]["name"] for t in call["tools"]]
    assert "search_orders" in tool_names
    assert "delete_account" not in tool_names  # safety check

def test_agent_retries_on_parse_failure(fake_llm):
    """Verify retry logic when LLM returns malformed JSON."""
    fake_llm.responses = ["not json", '{"action": "search"}']
    agent = Agent(llm=fake_llm, max_retries=2)

    result = agent.run("Find orders")

    assert len(fake_llm.call_log) == 2  # retried once
    assert result["action"] == "search"
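For context, here is one way an agent satisfying both tests could look. This is a hypothetical minimal sketch, not the actual `Agent` from `my_agent.core`: the tool list, the retry prompt, and the meaning of `max_retries` (extra attempts after the first call) are all assumptions.

```python
import json


class Agent:
    """Hypothetical minimal agent; the real Agent in my_agent.core may differ."""

    def __init__(self, llm, max_retries: int = 2):
        self.llm = llm
        self.max_retries = max_retries
        # Only safe, read-only tools are offered to the model.
        self.tools = [
            {"type": "function", "function": {"name": "search_orders"}},
        ]

    def run(self, user_input: str) -> dict:
        messages = [{"role": "user", "content": user_input}]
        # One initial attempt plus up to max_retries further attempts.
        for _ in range(self.max_retries + 1):
            reply = self.llm.chat(messages, tools=self.tools)
            try:
                return json.loads(reply["content"])
            except json.JSONDecodeError:
                # Build a new list so earlier call-log entries stay unchanged.
                messages = messages + [
                    {"role": "assistant", "content": reply["content"]},
                    {"role": "user", "content": "Reply with valid JSON only."},
                ]
        raise ValueError("LLM did not return valid JSON")
```

The tests never look at the retry prompt's wording, only at the number of calls and the parsed result, so the prompt can change without breaking them.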

FAQ

How do I handle streaming responses in unit tests?

Create an async generator fixture that yields predetermined chunks. Replace the streaming client method with this generator using patch. This lets you test your chunk-assembly logic without a real stream.
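A minimal sketch of that idea, using plain `asyncio` so it does not depend on any pytest plugin. `StreamingClient` and `collect_reply` are hypothetical stand-ins for your real client and chunk-assembly code.

```python
import asyncio
from unittest.mock import patch


class StreamingClient:
    """Hypothetical stand-in for a real streaming LLM client."""

    async def stream(self, prompt: str):
        raise RuntimeError("network call attempted in a unit test")
        yield  # unreachable; its presence makes this method an async generator


async def collect_reply(client: StreamingClient, prompt: str) -> str:
    """Chunk-assembly logic under test: join the streamed text pieces."""
    parts = []
    async for chunk in client.stream(prompt):
        parts.append(chunk)
    return "".join(parts)


async def fake_stream(prompt: str):
    """Async generator that yields predetermined chunks."""
    for chunk in ["Hel", "lo, ", "world"]:
        yield chunk


def test_collect_reply_joins_chunks():
    client = StreamingClient()
    # new= swaps in the plain function, so calling it returns the async generator.
    with patch.object(client, "stream", new=fake_stream):
        result = asyncio.run(collect_reply(client, "greet"))
    assert result == "Hello, world"
```

Passing `new=` matters here: the default replacement is a mock object, which would need extra configuration before it could be iterated with `async for`.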

Should I use temperature=0 instead of mocking?

Setting temperature=0 reduces variance but does not eliminate it — model updates can still change outputs. It also still costs tokens and takes seconds per call. Use temperature=0 for integration tests, but always mock for unit tests.

How many response fixtures should I maintain?

Keep a small, representative set: one normal response, one tool-call response, one refusal, one malformed response, and one empty response. Five to ten fixtures cover most agent logic paths without becoming a maintenance burden.

