
TypeScript AI Agent Testing: Vitest, Mock LLMs, and Snapshot Testing

Learn how to test AI agent applications in TypeScript. Covers Vitest setup, strategies for mocking LLM responses, snapshot testing for agent outputs, deterministic tool testing, and CI integration for reliable agent test suites.

The Testing Challenge with AI Agents

AI agents are inherently non-deterministic. The same prompt can produce different responses across runs, making traditional assertion-based testing unreliable. A robust agent testing strategy separates what you can test deterministically — tool execution, input validation, state management, routing logic — from what requires fuzzy evaluation — the quality and correctness of LLM-generated text.

This guide walks through practical patterns for testing TypeScript AI agents using Vitest.

Setting Up Vitest

Install Vitest and configure it for a TypeScript project:

npm install -D vitest @vitest/coverage-v8
// vitest.config.ts
import { defineConfig } from "vitest/config";
import path from "path";

export default defineConfig({
  test: {
    globals: true,
    environment: "node",
    coverage: {
      provider: "v8",
      include: ["src/**/*.ts"],
      exclude: ["src/**/*.test.ts"],
    },
    testTimeout: 30_000, // Agent tests may be slow
  },
  resolve: {
    alias: {
      "@": path.resolve(__dirname, "src"),
    },
  },
});

Mocking LLM Responses

The most important testing pattern is replacing the LLM client with a mock that returns predetermined responses:

// src/lib/__mocks__/openai-client.ts
import { vi } from "vitest";

export function createMockOpenAI() {
  return {
    chat: {
      completions: {
        create: vi.fn(),
      },
    },
  };
}

export function mockChatResponse(content: string | null, toolCalls?: any[]) {
  return {
    choices: [
      {
        message: {
          role: "assistant",
          content,
          tool_calls: toolCalls ?? null,
        },
        finish_reason: toolCalls ? "tool_calls" : "stop",
      },
    ],
    usage: { prompt_tokens: 100, completion_tokens: 50, total_tokens: 150 },
  };
}

export function mockToolCallResponse(name: string, args: object) {
  return mockChatResponse(null, [
    {
      id: "call_mock_123",
      type: "function",
      function: {
        name,
        arguments: JSON.stringify(args),
      },
    },
  ]);
}

Testing Tool Execution Deterministically

Tools have well-defined inputs and outputs; once their external dependencies are mocked, you can test them directly and deterministically:

// src/tools/weather.test.ts
import { describe, it, expect, vi } from "vitest";
import { weatherTool } from "./weather";

// Mock the external API
vi.mock("./weather-api", () => ({
  fetchWeather: vi.fn().mockResolvedValue({
    temperature: 22,
    condition: "sunny",
    humidity: 45,
  }),
}));

describe("weatherTool", () => {
  it("returns formatted weather data for valid city", async () => {
    const result = await weatherTool.execute({
      city: "San Francisco",
      units: "celsius",
    });

    expect(result).toEqual({
      temperature: 22,
      condition: "sunny",
      humidity: 45,
    });
  });

  it("validates input schema rejects empty city", () => {
    const parsed = weatherTool.inputSchema.safeParse({ city: "" });
    expect(parsed.success).toBe(false);
  });

  it("applies default units when not specified", () => {
    const parsed = weatherTool.inputSchema.safeParse({ city: "Tokyo" });
    expect(parsed.success).toBe(true);
    if (parsed.success) {
      expect(parsed.data.units).toBe("celsius");
    }
  });
});
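The tool under test isn't shown above. A minimal sketch of what these tests assume might look like the following — the fetchWeather stand-in and the hand-rolled inputSchema (which mimics the safeParse shape a zod schema would provide) are assumptions for illustration:

```typescript
// src/tools/weather.ts — hypothetical sketch of the tool the tests exercise.
// A real project would likely define inputSchema with zod; this hand-rolled
// validator just mirrors the safeParse contract the tests rely on.

type WeatherInput = { city: string; units: "celsius" | "fahrenheit" };
type ParseResult =
  | { success: true; data: WeatherInput }
  | { success: false; error: string };

// Stand-in for the external API wrapper that the tests mock out.
async function fetchWeather(city: string, units: string) {
  const temperature = units === "fahrenheit" ? 72 : 22;
  return { temperature, condition: "sunny", humidity: 45 };
}

export const weatherTool = {
  name: "get_weather",
  inputSchema: {
    safeParse(raw: { city?: unknown; units?: unknown }): ParseResult {
      if (typeof raw.city !== "string" || raw.city.trim() === "") {
        return { success: false, error: "city must be a non-empty string" };
      }
      // Default to celsius when units are not specified.
      const units = raw.units === "fahrenheit" ? "fahrenheit" : "celsius";
      return { success: true, data: { city: raw.city, units } };
    },
  },
  async execute(input: WeatherInput) {
    return fetchWeather(input.city, input.units);
  },
};
```

Keeping validation in a schema object separate from execute is what makes the schema tests above possible without touching the network at all.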

Testing the Agent Loop

Test that the agent correctly orchestrates tool calls and handles multi-step conversations:


// src/agent/support-agent.test.ts
import { describe, it, expect, vi, beforeEach } from "vitest";
import { runAgent } from "./support-agent";
import { createMockOpenAI, mockChatResponse, mockToolCallResponse } from "../lib/__mocks__/openai-client";

describe("Support Agent", () => {
  let mockClient: ReturnType<typeof createMockOpenAI>;

  beforeEach(() => {
    mockClient = createMockOpenAI();
  });

  it("calls search tool when user asks a question", async () => {
    // First call: model decides to search
    mockClient.chat.completions.create
      .mockResolvedValueOnce(
        mockToolCallResponse("search_docs", { query: "reset password" })
      )
      // Second call: model responds with answer
      .mockResolvedValueOnce(
        mockChatResponse("To reset your password, go to Settings > Security.")
      );

    const result = await runAgent(mockClient as any, "How do I reset my password?");

    expect(result.text).toContain("reset your password");
    expect(mockClient.chat.completions.create).toHaveBeenCalledTimes(2);
  });

  it("respects maximum iteration limit", async () => {
    // Model keeps calling tools indefinitely
    mockClient.chat.completions.create.mockResolvedValue(
      mockToolCallResponse("search_docs", { query: "something" })
    );

    const result = await runAgent(mockClient as any, "loop forever", { maxIterations: 3 });

    expect(result.text).toContain("maximum iterations");
    expect(mockClient.chat.completions.create).toHaveBeenCalledTimes(3);
  });
});
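For reference, a runAgent loop compatible with the tests above might look like the following sketch. The tools registry, the message shapes, and the exact termination message are assumptions; a real implementation would dispatch to actual tool implementations and use the SDK's own types:

```typescript
// Hypothetical sketch of the agent loop exercised by the tests above.
type ToolCall = { id: string; function: { name: string; arguments: string } };
type Message = {
  role: string;
  content: string | null;
  tool_calls?: ToolCall[] | null;
  tool_call_id?: string;
};
interface ChatClient {
  chat: {
    completions: {
      create: (req: { messages: Message[] }) => Promise<{
        choices: { message: Message; finish_reason: string }[];
      }>;
    };
  };
}

// Hypothetical tool registry; a real agent would register actual tools here.
const tools: Record<string, (args: object) => Promise<string>> = {
  search_docs: async () => JSON.stringify({ results: ["..."] }),
};

export async function runAgent(
  client: ChatClient,
  userInput: string,
  opts: { maxIterations?: number } = {}
): Promise<{ text: string }> {
  const maxIterations = opts.maxIterations ?? 10;
  const messages: Message[] = [{ role: "user", content: userInput }];

  for (let i = 0; i < maxIterations; i++) {
    const res = await client.chat.completions.create({ messages });
    const msg = res.choices[0].message;
    messages.push(msg);

    // No tool calls means the model produced its final answer.
    if (!msg.tool_calls || msg.tool_calls.length === 0) {
      return { text: msg.content ?? "" };
    }
    // Execute each requested tool and feed the result back to the model.
    for (const call of msg.tool_calls) {
      const handler = tools[call.function.name];
      const output = handler
        ? await handler(JSON.parse(call.function.arguments))
        : `Unknown tool: ${call.function.name}`;
      messages.push({ role: "tool", content: output, tool_call_id: call.id });
    }
  }
  return { text: "Stopped: maximum iterations reached." };
}
```

Because the client is injected as a parameter, the tests can pass the mock directly — no module-level patching of the SDK is required.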

Snapshot Testing for Agent Outputs

When you want to catch unexpected changes in agent behavior without brittle exact-match assertions, use snapshots on structured outputs:

it("produces expected structured analysis", async () => {
  mockClient.chat.completions.create.mockResolvedValueOnce(
    mockChatResponse(JSON.stringify({
      sentiment: "positive",
      confidence: 0.92,
      topics: ["product", "pricing"],
    }))
  );

  const result = await analyzeText(mockClient as any, "Great product, fair price!");

  expect(result).toMatchSnapshot();
});

Run vitest --update (alias -u) to regenerate snapshots when behavior intentionally changes. Review snapshot diffs in pull requests to catch unintended regressions.

CI Integration

Add agent tests to your CI pipeline:

# .github/workflows/test.yml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx vitest run --coverage
      - uses: actions/upload-artifact@v4
        with:
          name: coverage
          path: coverage/

Because all LLM calls are mocked, these tests are fast, deterministic, and free — no API keys needed in CI.

FAQ

Should I ever test with real LLM API calls?

Yes, but separately from your main test suite. Run a small set of "smoke tests" or "evaluation tests" against the real API on a schedule (daily or pre-release). These tests use fuzzy assertions — checking that responses contain expected keywords or pass a rubric — rather than exact matches. Keep them in a separate test file with a longer timeout.
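A fuzzy assertion for such smoke tests can be as simple as a keyword rubric. The helper name and threshold below are illustrative, not from any library:

```typescript
// Hypothetical fuzzy-assertion helper for evaluation tests against the real API.
// Passes when the response mentions at least minHits of the required keywords,
// instead of demanding an exact string match.
export function passesKeywordRubric(
  text: string,
  required: string[],
  minHits = 1
): boolean {
  const lower = text.toLowerCase();
  const hits = required.filter((k) => lower.includes(k.toLowerCase())).length;
  return hits >= minHits;
}
```

A scheduled smoke test would then assert passesKeywordRubric(result.text, ["password", "settings"]) rather than comparing full response strings.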

How do I test streaming responses?

Mock the streaming response as an async iterable. Create a helper that yields chunks with simulated delays. Test that your stream processing code correctly accumulates deltas, handles tool call fragments, and emits the final assembled message.
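A minimal sketch of that helper, assuming an OpenAI-style chunk shape with content deltas (tool-call fragments omitted for brevity):

```typescript
// Hypothetical streaming mock: yields OpenAI-style chunks as an async iterable.
type StreamChunk = { choices: { delta: { content?: string } }[] };

export async function* mockStream(
  parts: string[],
  delayMs = 0
): AsyncGenerator<StreamChunk> {
  for (const part of parts) {
    // Simulated network delay between chunks (0 keeps tests fast).
    if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs));
    yield { choices: [{ delta: { content: part } }] };
  }
}

// The accumulation logic under test: joins deltas into the final message.
export async function collectStream(
  stream: AsyncIterable<StreamChunk>
): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta.content ?? "";
  }
  return text;
}
```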

What code coverage target should I aim for?

Focus on 90%+ coverage for tool implementations, input validation, and routing logic. The agent loop orchestration should be covered by integration tests with mocked LLM responses. Do not chase coverage on thin wrapper code that just forwards calls to the LLM SDK.


#Testing #Vitest #TypeScript #AIAgents #Mocking #CICD #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
