Designing Agent Test Suites: Unit, Integration, and Trajectory Tests
Agent testing needs three layers: unit, integration, and trajectory. Most teams ship only one. Here is the 2026 test-suite blueprint that catches real regressions.
Why Three Layers
Most teams that ship LLM agents have one layer of tests: an end-to-end smoke test or a hand-rolled eval. That is not enough. Agents have three layers of correctness, and a regression at any layer can ship undetected unless that layer has dedicated tests.
This piece walks through unit, integration, and trajectory tests for agents — what each catches and how to design them.
The Three Layers
```mermaid
flowchart TB
    Unit[Unit Tests<br/>per-prompt, per-tool] --> What1[Catches: prompt + tool regressions]
    Integ[Integration Tests<br/>multi-step but bounded] --> What2[Catches: composition + state bugs]
    Traj[Trajectory Tests<br/>full agent runs] --> What3[Catches: planning + drift bugs]
```
Each layer catches different bugs. A change to a prompt may pass unit tests and fail trajectory tests. A new tool may pass unit tests, fail integration tests, and not even reach trajectory tests.
Unit Tests
Unit tests cover:
- Single prompts (does this classifier prompt return correct labels for these inputs?)
- Single tool calls (does the tool return the expected shape for these inputs?)
- Prompt-output structural validity (does the schema parse cleanly?)
- Specific safety properties (does the model refuse these unsafe inputs?)
Unit tests run on every commit and finish fast (seconds to minutes). They are the analog of conventional software unit tests: the LLM is treated as a function.
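A minimal sketch of this layer in pytest, assuming a hypothetical `classify_intent()` wrapper that hides the prompt and model call behind an ordinary function returning a parsed dict:

```python
import pytest

from my_agent.prompts import classify_intent  # hypothetical wrapper

# Golden inputs with known-correct labels.
LABELED = [
    ("I want to cancel my order", "cancel_order"),
    ("Where is my package?", "track_shipment"),
    ("This arrived broken", "complaint"),
]

@pytest.mark.parametrize("text,expected", LABELED)
def test_classifier_labels_golden_inputs(text, expected):
    result = classify_intent(text)
    assert result["label"] == expected  # exact-match grading

def test_classifier_output_is_structurally_valid():
    result = classify_intent("I want a refund")
    assert set(result) >= {"label", "confidence"}  # schema-style check
    assert 0.0 <= result["confidence"] <= 1.0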
Integration Tests
Integration tests cover bounded multi-step flows:
- Agent completes a 3-5 step task with mocked tools
- Plan-execute-reflect loop runs to completion
- Specialist agents collaborate on a known multi-agent task
- Memory writes and reads work as expected
These run on every PR but may be slower (minutes). Mocked tools mean they are reproducible.
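A sketch of one bounded flow, assuming a hypothetical `Agent` class that takes a tool registry and a step budget; `unittest.mock` stands in for the real tools, which is what makes the run reproducible:

```python
from unittest.mock import MagicMock

from my_agent import Agent  # hypothetical

def test_refund_flow_completes_within_step_budget():
    # Mocked tools: canned responses, no network, no side effects.
    tools = {
        "lookup_order": MagicMock(return_value={"id": "A1", "status": "delivered"}),
        "issue_refund": MagicMock(return_value={"ok": True}),
    }
    agent = Agent(tools=tools, max_steps=5)
    result = agent.run("Please refund order A1")

    assert result.completed                     # flow ran to the end
    tools["issue_refund"].assert_called_once()  # right tool, exactly once
    assert len(result.steps) <= 5               # bounded by design
```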
Trajectory Tests
Trajectory tests run the full agent end-to-end on real or realistic tasks:
- Customer-service scenarios with the full conversation
- Multi-day workflow scenarios
- Adversarial scenarios (jailbreak attempts, edge cases)
- Out-of-distribution scenarios
These are slower (tens of minutes) and may use real tools or sandboxes. Run pre-release or nightly. They catch what the smaller-scope tests cannot.
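A hedged sketch of one trajectory test, assuming hypothetical helpers (`Sandbox`, `run_scenario`, `judge_transcript`) for seeded state, scenario playback, and rubric-based judging:

```python
from my_agent.testing import Sandbox, run_scenario, judge_transcript  # hypothetical

def test_angry_customer_refund_scenario():
    with Sandbox(seed_orders=["A1"]) as env:
        transcript = run_scenario(env, "scenarios/angry_customer_refund.yaml")

        # Deterministic outcome check: sandbox state actually changed.
        assert env.order("A1").refunded

    # Fuzzy qualities (tone, policy adherence) go to the LLM judge.
    verdict = judge_transcript(transcript, rubric="rubrics/refund_v3.md")
    assert verdict.passed, verdict.rationale
```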
Test Authoring Patterns
```mermaid
flowchart LR
    Real[Real production traces] --> Author[Convert to test cases]
    Bug[Reported bugs] --> Author
    Adversarial[Red-team scenarios] --> Author
    Author --> Suite[Test suite]
```
The cleanest test cases come from production traces of real failures. Adopting "every reported bug becomes a test case" as a team discipline means the suite grows exactly where it matters.
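One way to mechanize that discipline is a small converter from logged traces to replayable cases. The trace schema and file layout here are assumptions, not a real format:

```python
import json
from pathlib import Path

def trace_to_case(trace_path: Path) -> dict:
    """Turn a logged production failure into a replayable test case."""
    trace = json.loads(trace_path.read_text())
    return {
        "id": f"bug-{trace['incident_id']}",
        "input": trace["first_user_message"],
        "tool_results": trace["tool_calls"],     # recorded outputs, replayed as mocks
        "expected": trace["corrected_outcome"],  # filled in during bug triage
    }

def load_suite(trace_dir: Path) -> list[dict]:
    # Every triaged incident in the directory becomes a suite entry.
    return [trace_to_case(p) for p in sorted(trace_dir.glob("*.json"))]
```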
Grading
Each layer needs a grading method:
- Unit: exact match, regex, JSON schema checks; all deterministic
- Integration: deterministic where possible (database state changed correctly), LLM-judge where not
- Trajectory: LLM-judge with rubric, plus deterministic outcome checks
LLM-judge prompts are themselves test artifacts that need versioning.
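A sketch of that versioning, assuming rubrics live in the repo as plain files: each graded result records the rubric's content hash, so a judge-prompt change shows up in regression diffs like any other change:

```python
import hashlib
from pathlib import Path

RUBRIC_PATH = Path("rubrics/trajectory_judge_v3.md")  # hypothetical layout

def rubric_version() -> str:
    # Content hash: any edit to the judge prompt changes the version.
    return hashlib.sha256(RUBRIC_PATH.read_bytes()).hexdigest()[:12]

def grade(transcript: str, judge) -> dict:
    """judge is any callable LLM wrapper: (rubric, transcript) -> verdict."""
    verdict = judge(rubric=RUBRIC_PATH.read_text(), transcript=transcript)
    return {
        "passed": verdict.passed,
        "rationale": verdict.rationale,
        "rubric_version": rubric_version(),  # ties the grade to the prompt
    }
```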
A Concrete Suite
For a customer-service voice agent:
```mermaid
flowchart TB
    Suite[Test Suite] --> U[Unit: 200 prompts]
    Suite --> I[Integration: 30 mocked flows]
    Suite --> T[Trajectory: 50 full conversations]
    U --> Time1[~2 min on CI]
    I --> Time2[~10 min on CI]
    T --> Time3[~40 min nightly]
```
This shape — many cheap tests, fewer expensive ones — is the standard 2026 pyramid for agent test suites.
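One way to wire the pyramid into CI is pytest markers, one per layer; the marker names and timings below are illustrative, and the markers would need registering in `pytest.ini`:

```python
# CI selects a layer per trigger:
#   every commit:  pytest -m unit                    (~2 min)
#   every PR:      pytest -m "unit or integration"   (~12 min)
#   nightly:       pytest -m trajectory              (~40 min)
import pytest

@pytest.mark.unit
def test_intent_prompt_labels_golden_set():
    ...

@pytest.mark.integration
def test_refund_flow_with_mocked_tools():
    ...

@pytest.mark.trajectory
def test_full_refund_conversation_in_sandbox():
    ...
```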
Regression Tracking
When a test fails, the data you need:
- Which test failed
- Which layer (unit/integration/trajectory)
- What changed (prompt? model? tool? code?)
- Diff against last passing version
- LLM-judge rationale if applicable
Good test infrastructure makes the answer obvious. Bad test infrastructure produces "the eval failed" with no actionable signal.
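A minimal failure record capturing those five fields might look like this (the shape is an assumption, not a standard):

```python
from dataclasses import dataclass

@dataclass
class TestFailure:
    test_id: str
    layer: str                          # "unit" | "integration" | "trajectory"
    changed: list[str]                  # e.g. ["prompt:classifier", "model:2026-01"]
    diff_vs_last_pass: str              # link to, or inline text of, the diff
    judge_rationale: str | None = None  # populated only for LLM-judged tests
```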
Test Drift
LLM behavior changes when models are updated: tests that passed yesterday can fail today with no code change. Patterns that help:
- Pin model versions: do not let provider auto-upgrade silently
- Test against a stable baseline: compare new behavior to last known-good
- Flake detection: rerun failing tests N times to separate transient flakes from persistent regressions (a sketch follows this list)
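A sketch of the flake-detection pattern, where `run_once` is any zero-argument callable from your harness that returns `True` on a pass (a hypothetical hook), and the threshold is illustrative:

```python
def classify_failure(run_once, n: int = 5, flake_threshold: float = 0.8) -> str:
    # Rerun the failing test n times and measure the pass rate.
    pass_rate = sum(bool(run_once()) for _ in range(n)) / n
    if pass_rate >= flake_threshold:
        return "transient"   # flaky: flag for investigation, don't block
    return "persistent"      # reproducible regression: block the merge
```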
What Counts as Coverage
Unlike in traditional software, "coverage" for LLM agents is fuzzy. Useful approximations:
- Diversity of inputs covered by unit tests
- Diversity of paths covered by integration tests
- Diversity of trajectories covered by trajectory tests
- Coverage of known failure modes
A monthly review of "what bugs did we ship that the test suite missed" tells you where to invest.
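The last approximation is the easiest to make concrete. A sketch, assuming a hand-curated failure-mode registry and a test-tagging convention that are both hypothetical:

```python
# Curated list of failure modes the suite is supposed to exercise.
KNOWN_FAILURE_MODES = {"tool_timeout", "jailbreak", "stale_memory", "infinite_loop"}

def failure_mode_coverage(test_tags: dict[str, set[str]]) -> float:
    """test_tags maps test id -> the failure-mode tags that test exercises."""
    covered = set().union(*test_tags.values()) & KNOWN_FAILURE_MODES
    return len(covered) / len(KNOWN_FAILURE_MODES)
```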
Test Suite as Living Documentation
The test suite is also the agent's behavioral specification. Reading the trajectory tests should tell a new engineer what the agent does, so test names that describe scenarios in plain language pay for themselves.