---
title: "Designing Agent Test Suites: Unit, Integration, and Trajectory Tests"
description: "Agent testing needs three layers — unit, integration, trajectory — and most teams ship only one. The 2026 test-suite blueprint that catches real regressions."
canonical: https://callsphere.ai/blog/agent-test-suites-unit-integration-trajectory-2026
category: "Agentic AI"
tags: ["Testing", "Agent Evaluation", "MLOps", "Agentic AI"]
author: "CallSphere Team"
published: 2026-04-25T00:00:00.000Z
updated: 2026-05-06T00:18:34.630Z
---

# Designing Agent Test Suites: Unit, Integration, and Trajectory Tests

> Agent testing needs three layers — unit, integration, trajectory — and most teams ship only one. The 2026 test-suite blueprint that catches real regressions.

## Why Three Layers

Most teams that ship LLM agents have exactly one layer of tests: an end-to-end smoke test or a hand-rolled eval. That is not enough. Agents have three layers of correctness, and a regression at any layer can ship undetected unless that layer has dedicated tests.

This piece walks through unit, integration, and trajectory tests for agents — what each catches and how to design them.

## The Three Layers

```mermaid
flowchart TB
    Unit["Unit Tests<br/>per-prompt, per-tool"] --> What1["Catches: prompt + tool regressions"]
    Integ["Integration Tests<br/>multi-step but bounded"] --> What2["Catches: composition + state bugs"]
    Traj["Trajectory Tests<br/>full agent runs"] --> What3["Catches: planning + drift bugs"]
```

Each layer catches different bugs. A change to a prompt may pass unit tests and fail trajectory tests. A new tool may pass unit tests, fail integration tests, and not even reach trajectory tests.

## Unit Tests

Unit tests cover:

- Single prompts (does this classifier prompt return correct labels for these inputs?)
- Single tool calls (does the tool return the expected shape for these inputs?)
- Prompt-output structural validity (does the schema parse cleanly?)
- Specific safety properties (does the model refuse these unsafe inputs?)

Unit tests run on every commit and finish fast (seconds to minutes). They are the analog of conventional software unit tests: the LLM is treated as a function with fixed inputs and asserted outputs.
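
A minimal sketch of a prompt-level unit test in pytest. The `my_agent.prompts` module and `classify_intent` are hypothetical stand-ins for your own wrapper around a single prompt and a pinned model version:

```python
import pytest

# Hypothetical wrapper: one prompt, one pinned model version, plain function.
from my_agent.prompts import classify_intent

# Labeled inputs, ideally harvested from production traces.
CASES = [
    ("I want to cancel my order", "cancel_order"),
    ("Where is my package?", "track_shipment"),
    ("Let me talk to a human", "escalate"),
]

@pytest.mark.parametrize("text,expected", CASES)
def test_intent_classifier_labels(text, expected):
    # Deterministic grading: exact match on the returned label.
    assert classify_intent(text) == expected
```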

## Integration Tests

Integration tests cover bounded multi-step flows:

- Agent completes a 3-5 step task with mocked tools
- Plan-execute-reflect loop runs to completion
- Specialist agents collaborate on a known multi-agent task
- Memory writes and reads work as expected

These run on every PR but may be slower (minutes). Mocked tools mean they are reproducible.
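
A sketch of one such flow, assuming a hypothetical `Agent` class that takes a dict of callables as tools. Every side-effecting tool is a `MagicMock`, so the run is reproducible:

```python
from unittest.mock import MagicMock

from my_agent.core import Agent  # hypothetical agent runtime

def test_refund_flow_completes_with_mocked_tools():
    tools = {
        "lookup_order": MagicMock(return_value={"id": "A1", "status": "delivered"}),
        "issue_refund": MagicMock(return_value={"ok": True}),
    }
    agent = Agent(tools=tools, max_steps=5)  # bounded multi-step run
    result = agent.run("Please refund order A1")

    # Composition check: the order was looked up before the refund was issued.
    tools["lookup_order"].assert_called_once()
    tools["issue_refund"].assert_called_once_with(order_id="A1")
    assert result.status == "completed"
```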

## Trajectory Tests

Trajectory tests run the full agent end-to-end on real or realistic tasks:

- Customer-service scenarios with the full conversation
- Multi-day workflow scenarios
- Adversarial scenarios (jailbreak attempts, edge cases)
- Out-of-distribution scenarios

These are slower (tens of minutes) and may use real tools or sandboxes. Run them pre-release or nightly. They catch what the smaller-scope tests cannot.
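
A trajectory test pairs deterministic outcome checks with judged conduct checks. `run_conversation` and `llm_judge` below are hypothetical harness functions, not a specific framework's API:

```python
from my_harness import run_conversation, llm_judge  # hypothetical harness

def test_angry_customer_gets_refund_without_policy_violation():
    trajectory = run_conversation(
        scenario="scenarios/angry_customer_refund.yaml",
        sandbox=True,  # real tool interfaces, sandboxed side effects
    )
    # Deterministic outcome checks first: cheap and unambiguous.
    assert trajectory.outcome["refund_issued"] is True
    assert trajectory.num_turns <= 12

    # LLM judge for qualities no regex can capture (tone, policy adherence).
    verdict = llm_judge(trajectory, rubric="rubrics/tone_and_policy.md")
    assert verdict.passed, verdict.rationale
```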

## Test Authoring Patterns

```mermaid
flowchart LR
    Real[Real production traces] --> Author[Convert to test cases]
    Bug[Reported bugs] --> Author
    Adversarial[Red-team scenarios] --> Author
    Author --> Suite[Test suite]
```

The cleanest test cases come from production traces of real failures. Adding "every reported bug becomes a test case" as a discipline means your test suite grows where it matters.
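
One way to mechanize that discipline is a small converter that freezes a failed trace into a replayable fixture. The trace fields here are illustrative; map them onto whatever your tracing store actually records:

```python
import json
from pathlib import Path

def trace_to_test_case(trace: dict, failure_note: str) -> dict:
    """Freeze a failed production trace into a regression-test fixture."""
    return {
        "id": trace["trace_id"],
        "user_turns": [t["text"] for t in trace["turns"] if t["role"] == "user"],
        "tool_state": trace["tool_state"],  # starting state to replay against
        "must_not_repeat": failure_note,    # what went wrong last time
    }

def save_case(case: dict, suite_dir: str = "tests/regressions") -> None:
    # One JSON file per bug; the suite loads the whole directory.
    path = Path(suite_dir) / f"{case['id']}.json"
    path.write_text(json.dumps(case, indent=2))
```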

## Grading

Each layer needs a grading method:

- **Unit**: exact match, regex, JSON schema, deterministic
- **Integration**: deterministic where possible (database state changed correctly), LLM-judge where not
- **Trajectory**: LLM-judge with rubric, plus deterministic outcome checks

LLM-judge prompts are themselves test artifacts that need versioning.
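
A sketch of the split, with the judge rubric loaded from a versioned file rather than inlined, so rubric changes show up in diffs like any other test change. `llm_judge` is again a hypothetical harness function; the schema check uses the real `jsonschema` package:

```python
import json
from pathlib import Path

from jsonschema import validate  # pip install jsonschema
from my_harness import llm_judge  # hypothetical

REPLY_SCHEMA = {"type": "object", "required": ["label", "confidence"]}
RUBRIC = Path("rubrics/trajectory_v3.md").read_text()  # versioned artifact

def grade_unit(output: str) -> bool:
    # Fully deterministic: parse, validate shape, exact-match the label.
    parsed = json.loads(output)
    validate(parsed, REPLY_SCHEMA)  # raises on structural failure
    return parsed["label"] in {"cancel_order", "track_shipment", "escalate"}

def grade_trajectory(trajectory) -> bool:
    # Deterministic outcome check first; LLM judge only for the rest.
    if not trajectory.outcome.get("task_completed"):
        return False
    return llm_judge(trajectory, rubric=RUBRIC).passed
```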

## A Concrete Suite

For a customer-service voice agent:

```mermaid
flowchart TB
    Suite[Test Suite] --> U[Unit: 200 prompts]
    Suite --> I[Integration: 30 mocked flows]
    Suite --> T[Trajectory: 50 full conversations]
    U --> Time1[~2 min on CI]
    I --> Time2[~10 min on CI]
    T --> Time3[~40 min nightly]
```

This shape — many cheap tests, fewer expensive ones — is the standard 2026 pyramid for agent test suites.
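
One way to wire the pyramid into CI is pytest markers selected per stage: `pytest -m unit` on every commit, `-m integration` on PRs, `-m trajectory` nightly. The marker names are your choice and should be registered under `markers` in `pytest.ini`:

```python
import pytest

# Register these marker names in pytest.ini to avoid unknown-marker warnings.

@pytest.mark.unit          # ~200 cases, ~2 min total, every commit
def test_intent_label_for_cancellation():
    ...

@pytest.mark.integration   # ~30 mocked flows, ~10 min, every PR
def test_refund_flow_with_mocked_tools():
    ...

@pytest.mark.trajectory    # ~50 conversations, ~40 min, nightly
def test_full_angry_customer_conversation():
    ...
```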

## Regression Tracking

When a test fails, the data you need:

- Which test failed
- Which layer (unit/integration/trajectory)
- What changed (prompt? model? tool? code?)
- Diff against last passing version
- LLM-judge rationale if applicable

Good test infrastructure makes the answer obvious. Bad test infrastructure produces "the eval failed" with no actionable signal.
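
A minimal structured failure record that mirrors the list above. Field names are illustrative; the point is that "what changed?" becomes answerable from the CI artifact rather than from someone's memory:

```python
from dataclasses import dataclass

@dataclass
class TestFailure:
    test_id: str
    layer: str                           # "unit" | "integration" | "trajectory"
    prompt_version: str                  # what changed: prompt? model? tool? code?
    model_version: str
    tool_versions: dict[str, str]
    git_sha: str
    last_passing_sha: str                # anchor for diffing against last green
    judge_rationale: str | None = None   # present only for LLM-judged layers
```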

## Test Drift

LLM behavior changes when models are updated: tests that passed yesterday can fail today with no code change. Patterns that handle this:

- **Pin model versions**: do not let provider auto-upgrade silently
- **Test against a stable baseline**: compare new behavior to last known-good
- **Flake detection**: run flaky tests N times; flag persistent failures vs transient
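
A sketch of the flake-detection pattern: rerun a failing case N times and classify the result. `run_case` is any zero-argument callable returning pass/fail, and the thresholds are arbitrary starting points to tune per suite:

```python
def classify_failure(run_case, n: int = 5, pass_threshold: float = 0.8) -> str:
    """Distinguish persistent regressions from transient flakes."""
    passes = sum(run_case() for _ in range(n))  # run_case() -> bool
    rate = passes / n
    if rate == 0:
        return "persistent"  # fails every time: treat as a real regression
    if rate >= pass_threshold:
        return "flake"       # mostly passes: sampling noise, quarantine it
    return "unstable"        # genuinely nondeterministic: investigate the case
```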

## What Counts as Coverage

Unlike traditional code, "coverage" for LLM agents is fuzzy. Approximations:

- Diversity of inputs covered by unit tests
- Diversity of paths covered by integration tests
- Diversity of trajectories covered by trajectory tests
- Coverage of known failure modes

A monthly review of "what bugs did we ship that the test suite missed" tells you where to invest.
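
The last item is the most tractable to measure. One illustrative approach: tag each test with the failure modes it exercises, then report the modes no test covers:

```python
# Known failure modes; grow this set from the monthly review.
FAILURE_MODES = {"hallucinated_refund", "loop_without_progress", "pii_leak"}

def coverage_gaps(test_tags: dict[str, set[str]]) -> set[str]:
    """Return the known failure modes that no test currently exercises."""
    covered = set().union(*test_tags.values()) if test_tags else set()
    return FAILURE_MODES - covered
```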

## Test Suite as Living Documentation

The test suite is also the agent's behavioral specification: reading the trajectory tests should tell a new engineer what the agent does. Test names that describe scenarios in plain language pay for themselves here.
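
Hypothetical but representative names for the voice-agent suite above:

```python
# The test list doubles as a plain-language behavioral spec.
def test_caller_with_expired_warranty_is_offered_paid_repair(): ...
def test_agent_escalates_after_two_failed_verification_attempts(): ...
def test_agent_never_quotes_refund_amount_before_order_lookup(): ...
```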

