Skip to content
Unit Testing AI Agents: Mocking LLM Calls for Fast, Deterministic Tests
Learn Agentic AI11 min read40 views

Unit Testing AI Agents: Mocking LLM Calls for Fast, Deterministic Tests

Learn how to mock LLM API calls in your AI agent tests using FakeLLM objects, response fixtures, and assertion patterns for fast, deterministic, cost-free unit tests.

Why Unit Testing Agents Requires Special Patterns

AI agents depend on LLM calls that are non-deterministic, slow, and expensive. A single GPT-4 call takes 2-10 seconds and costs tokens — making it impractical to run hundreds of tests on every commit. Unit tests must be fast, free, and repeatable, which means you need a strategy for replacing real LLM calls with controlled substitutes.

The core challenge is that LLM outputs vary between calls even with temperature=0. Your tests need to verify your agent's logic — tool selection, state management, output parsing — without coupling to the exact wording an LLM produces.

Strategy 1: FakeLLM Classes

Create a drop-in replacement for your LLM client that returns predetermined responses.

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff
from dataclasses import dataclass, field
from typing import Any

@dataclass
class FakeLLM:
    """A deterministic LLM replacement for unit tests."""
    responses: list[str] = field(default_factory=list)
    call_log: list[dict] = field(default_factory=list)
    _call_index: int = 0

    def chat(self, messages: list[dict], **kwargs) -> dict:
        self.call_log.append({"messages": messages, **kwargs})
        response = self.responses[self._call_index]
        self._call_index += 1
        return {"role": "assistant", "content": response}

This pattern lets you pre-load a sequence of responses and later inspect exactly what your agent sent to the LLM.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Strategy 2: Response Fixtures with pytest

Store realistic LLM responses as fixtures so multiple tests can share them.

import pytest
import json
from pathlib import Path

@pytest.fixture
def tool_call_response():
    """Fixture simulating an LLM response that invokes a tool."""
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "search_database",
                    "arguments": json.dumps({"query": "open tickets", "limit": 10}),
                },
            }
        ],
    }

@pytest.fixture
def fixture_dir():
    return Path(__file__).parent / "fixtures"

def load_fixture(fixture_dir: Path, name: str) -> dict:
    return json.loads((fixture_dir / f"{name}.json").read_text())

Storing fixtures as JSON files in a tests/fixtures/ directory keeps tests clean and makes it easy to update expected responses when your prompts change.

Strategy 3: Patching with unittest.mock

Use unittest.mock.patch to intercept LLM calls at the boundary.

from unittest.mock import patch, MagicMock
from my_agent.core import Agent

def test_agent_extracts_entities():
    fake_response = MagicMock()
    fake_response.choices = [
        MagicMock(message=MagicMock(
            content='{"entities": ["Acme Corp", "Jane Doe"]}',
            tool_calls=None,
        ))
    ]

    with patch("my_agent.core.openai_client.chat.completions.create") as mock_create:
        mock_create.return_value = fake_response
        agent = Agent()
        result = agent.extract_entities("Contact Jane Doe at Acme Corp")

    assert result == ["Acme Corp", "Jane Doe"]
    mock_create.assert_called_once()
    call_args = mock_create.call_args
    assert any("extract" in str(m) for m in call_args.kwargs["messages"])

Assertion Patterns for Agent Tests

Focus your assertions on what your code controls, not on LLM output text.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

def test_agent_selects_correct_tool(fake_llm):
    """Verify the agent passes the right tools to the LLM."""
    fake_llm.responses = ['{"action": "search", "query": "test"}']
    agent = Agent(llm=fake_llm)

    agent.run("Find recent orders")

    call = fake_llm.call_log[0]
    tool_names = [t["function"]["name"] for t in call["tools"]]
    assert "search_orders" in tool_names
    assert "delete_account" not in tool_names  # safety check

def test_agent_retries_on_parse_failure(fake_llm):
    """Verify retry logic when LLM returns malformed JSON."""
    fake_llm.responses = ["not json", '{"action": "search"}']
    agent = Agent(llm=fake_llm, max_retries=2)

    result = agent.run("Find orders")

    assert len(fake_llm.call_log) == 2  # retried once
    assert result["action"] == "search"

FAQ

How do I handle streaming responses in unit tests?

Create an async generator fixture that yields predetermined chunks. Replace the streaming client method with this generator using patch. This lets you test your chunk-assembly logic without a real stream.

Should I use temperature=0 instead of mocking?

Setting temperature=0 reduces variance but does not eliminate it — model updates can still change outputs. It also still costs tokens and takes seconds per call. Use temperature=0 for integration tests, but always mock for unit tests.

How many response fixtures should I maintain?

Keep a small, representative set: one normal response, one tool-call response, one refusal, one malformed response, and one empty response. Five to ten fixtures cover most agent logic paths without becoming a maintenance burden.


#UnitTesting #AIAgents #Mocking #Pytest #Python #Testing #AgenticAI #LearnAI #AIEngineering

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

AI Agents

Personal AI Assistant: How to Pick One for Business in 2026

A founder's guide to the personal AI assistant market: best AI assistant apps, business-grade options, and how CallSphere's voice agent fits in.

AI Agents

Free AI Agents in 2026: When Free Wins and When It Costs You

A founder's guide to free AI agents, low-code AI agent builders, and how to know when you should pay for a real platform like CallSphere.

Agentic AI

Graphiti: How Temporal Knowledge Graphs Give AI Voice Agents Persistent Memory (2026 Guide)

Graphiti is the open-source temporal knowledge graph for AI agents in 2026. Learn how bi-temporal memory beats vector RAG for voice agents and long-running LLMs.

AI Agents

Chatbot App vs ChatGPT: What's the Difference, and Which Do I Need?

Chatbot app vs ChatGPT in 2026: a founder's clear take on the difference, when to use which, and how a real AI chatbot app development works.

HVAC

Building an HVAC After-Hours Emergency Escalation System: A Complete Engineering Guide

How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.

AI Voice Agents

WebRTC Mobile Testing with BrowserStack + Sauce Labs (2026)

BrowserStack offers 30,000+ real devices; Sauce Labs ships deep Appium automation. Here is how AI voice agent teams use both for WebRTC mobile QA in 2026.