Designing Agent Test Suites: Unit, Integration, and Trajectory Tests
Agent testing needs three layers: unit, integration, and trajectory. Most teams ship only one. Here is the 2026 test-suite blueprint that catches real regressions.
Why Three Layers
Most teams that ship LLM agents have one layer of tests: an end-to-end smoke test or a hand-rolled eval. That is not enough. Agents have three layers of correctness, and a regression at any layer can ship undetected unless that layer has dedicated tests.
This piece walks through unit, integration, and trajectory tests for agents — what each catches and how to design them.
The Three Layers
```mermaid
flowchart TB
    Unit[Unit Tests<br/>per-prompt, per-tool] --> What1[Catches: prompt + tool regressions]
    Integ[Integration Tests<br/>multi-step but bounded] --> What2[Catches: composition + state bugs]
    Traj[Trajectory Tests<br/>full agent runs] --> What3[Catches: planning + drift bugs]
```
Each layer catches different bugs. A change to a prompt may pass unit tests and fail trajectory tests. A new tool may pass unit tests, fail integration tests, and not even reach trajectory tests.
Unit Tests
Unit tests cover:
- Single prompts (does this classifier prompt return correct labels for these inputs?)
- Single tool calls (does the tool return the expected shape for these inputs?)
- Prompt-output structural validity (does the schema parse cleanly?)
- Specific safety properties (does the model refuse these unsafe inputs?)
Unit tests run on every commit and finish fast (seconds to minutes). They are the analog of conventional software unit tests: the LLM is treated as a function.
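A minimal sketch of this layer in pytest, assuming a hypothetical `classify_intent()` wrapper that hides the prompt and model call behind an ordinary function returning a parsed dict:

```python
import pytest

from my_agent.prompts import classify_intent  # hypothetical wrapper

# Golden inputs with known-correct labels.
LABELED = [
    ("I want to cancel my order", "cancel_order"),
    ("Where is my package?", "track_shipment"),
    ("This arrived broken", "complaint"),
]

@pytest.mark.parametrize("text,expected", LABELED)
def test_classifier_labels_golden_inputs(text, expected):
    result = classify_intent(text)
    assert result["label"] == expected  # exact-match grading

def test_classifier_output_is_structurally_valid():
    result = classify_intent("I want a refund")
    assert set(result) >= {"label", "confidence"}  # schema-style check
    assert 0.0 <= result["confidence"] <= 1.0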
Integration Tests
Integration tests cover bounded multi-step flows:
- Agent completes a 3-5 step task with mocked tools
- Plan-execute-reflect loop runs to completion
- Specialist agents collaborate on a known multi-agent task
- Memory writes and reads work as expected
These run on every PR but may be slower (minutes). Mocked tools mean they are reproducible.
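A sketch of one bounded flow, assuming a hypothetical `Agent` class that takes a tool registry and a step budget; `unittest.mock` stands in for the real tools, which is what makes the run reproducible:

```python
from unittest.mock import MagicMock

from my_agent import Agent  # hypothetical

def test_refund_flow_completes_within_step_budget():
    # Mocked tools: canned responses, no network, no side effects.
    tools = {
        "lookup_order": MagicMock(return_value={"id": "A1", "status": "delivered"}),
        "issue_refund": MagicMock(return_value={"ok": True}),
    }
    agent = Agent(tools=tools, max_steps=5)
    result = agent.run("Please refund order A1")

    assert result.completed                     # flow ran to the end
    tools["issue_refund"].assert_called_once()  # right tool, exactly once
    assert len(result.steps) <= 5               # bounded by design
```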
Trajectory Tests
Trajectory tests run the full agent end-to-end on real or realistic tasks:
- Customer-service scenarios with the full conversation
- Multi-day workflow scenarios
- Adversarial scenarios (jailbreak attempts, edge cases)
- Out-of-distribution scenarios
These are slower (tens of minutes) and may use real tools or sandboxes. Run pre-release or nightly. They catch what the smaller-scope tests cannot.
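A hedged sketch of one trajectory test, assuming hypothetical helpers (`Sandbox`, `run_scenario`, `judge_transcript`) for seeded state, scenario playback, and rubric-based judging:

```python
from my_agent.testing import Sandbox, run_scenario, judge_transcript  # hypothetical

def test_angry_customer_refund_scenario():
    with Sandbox(seed_orders=["A1"]) as env:
        transcript = run_scenario(env, "scenarios/angry_customer_refund.yaml")

        # Deterministic outcome check: sandbox state actually changed.
        assert env.order("A1").refunded

    # Fuzzy qualities (tone, policy adherence) go to the LLM judge.
    verdict = judge_transcript(transcript, rubric="rubrics/refund_v3.md")
    assert verdict.passed, verdict.rationale
```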
Test Authoring Patterns
```mermaid
flowchart LR
    Real[Real production traces] --> Author[Convert to test cases]
    Bug[Reported bugs] --> Author
    Adversarial[Red-team scenarios] --> Author
    Author --> Suite[Test suite]
```
The cleanest test cases come from production traces of real failures. Adopting "every reported bug becomes a test case" as a team discipline means the suite grows exactly where it matters.
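One way to mechanize that discipline is a small converter from logged traces to replayable cases. The trace schema and file layout here are assumptions, not a real format:

```python
import json
from pathlib import Path

def trace_to_case(trace_path: Path) -> dict:
    """Turn a logged production failure into a replayable test case."""
    trace = json.loads(trace_path.read_text())
    return {
        "id": f"bug-{trace['incident_id']}",
        "input": trace["first_user_message"],
        "tool_results": trace["tool_calls"],     # recorded outputs, replayed as mocks
        "expected": trace["corrected_outcome"],  # filled in during bug triage
    }

def load_suite(trace_dir: Path) -> list[dict]:
    # Every triaged incident in the directory becomes a suite entry.
    return [trace_to_case(p) for p in sorted(trace_dir.glob("*.json"))]
```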
Grading
Each layer needs a grading method:
- Unit: exact match, regex, JSON schema checks; all deterministic
- Integration: deterministic where possible (database state changed correctly), LLM-judge where not
- Trajectory: LLM-judge with rubric, plus deterministic outcome checks
LLM-judge prompts are themselves test artifacts that need versioning.
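A sketch of that versioning, assuming rubrics live in the repo as plain files: each graded result records the rubric's content hash, so a judge-prompt change shows up in regression diffs like any other change:

```python
import hashlib
from pathlib import Path

RUBRIC_PATH = Path("rubrics/trajectory_judge_v3.md")  # hypothetical layout

def rubric_version() -> str:
    # Content hash: any edit to the judge prompt changes the version.
    return hashlib.sha256(RUBRIC_PATH.read_bytes()).hexdigest()[:12]

def grade(transcript: str, judge) -> dict:
    """judge is any callable LLM wrapper: (rubric, transcript) -> verdict."""
    verdict = judge(rubric=RUBRIC_PATH.read_text(), transcript=transcript)
    return {
        "passed": verdict.passed,
        "rationale": verdict.rationale,
        "rubric_version": rubric_version(),  # ties the grade to the prompt
    }
```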
A Concrete Suite
For a customer-service voice agent:
```mermaid
flowchart TB
    Suite[Test Suite] --> U[Unit: 200 prompts]
    Suite --> I[Integration: 30 mocked flows]
    Suite --> T[Trajectory: 50 full conversations]
    U --> Time1[~2 min on CI]
    I --> Time2[~10 min on CI]
    T --> Time3[~40 min nightly]
```
This shape — many cheap tests, fewer expensive ones — is the standard 2026 pyramid for agent test suites.
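One way to wire the pyramid into CI is pytest markers, one per layer; the marker names and timings below are illustrative, and the markers would need registering in `pytest.ini`:

```python
# CI selects a layer per trigger:
#   every commit:  pytest -m unit                    (~2 min)
#   every PR:      pytest -m "unit or integration"   (~12 min)
#   nightly:       pytest -m trajectory              (~40 min)
import pytest

@pytest.mark.unit
def test_intent_prompt_labels_golden_set():
    ...

@pytest.mark.integration
def test_refund_flow_with_mocked_tools():
    ...

@pytest.mark.trajectory
def test_full_refund_conversation_in_sandbox():
    ...
```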
Regression Tracking
When a test fails, the data you need:
- Which test failed
- Which layer (unit/integration/trajectory)
- What changed (prompt? model? tool? code?)
- Diff against last passing version
- LLM-judge rationale if applicable
Good test infrastructure makes the answer obvious. Bad test infrastructure produces "the eval failed" with no actionable signal.
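A minimal failure record capturing those five fields might look like this (the shape is an assumption, not a standard):

```python
from dataclasses import dataclass

@dataclass
class TestFailure:
    test_id: str
    layer: str                          # "unit" | "integration" | "trajectory"
    changed: list[str]                  # e.g. ["prompt:classifier", "model:2026-01"]
    diff_vs_last_pass: str              # link to, or inline text of, the diff
    judge_rationale: str | None = None  # populated only for LLM-judged tests
```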
Test Drift
LLM behavior changes when models are updated: tests that passed yesterday can fail today with no code change. Patterns that help:
- Pin model versions: do not let provider auto-upgrade silently
- Test against a stable baseline: compare new behavior to last known-good
- Flake detection: rerun failing tests N times to separate transient flakes from persistent regressions (a sketch follows this list)
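A sketch of the flake-detection pattern, where `run_once` is any zero-argument callable from your harness that returns `True` on a pass (a hypothetical hook), and the threshold is illustrative:

```python
def classify_failure(run_once, n: int = 5, flake_threshold: float = 0.8) -> str:
    # Rerun the failing test n times and measure the pass rate.
    pass_rate = sum(bool(run_once()) for _ in range(n)) / n
    if pass_rate >= flake_threshold:
        return "transient"   # flaky: flag for investigation, don't block
    return "persistent"      # reproducible regression: block the merge
```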
What Counts as Coverage
Unlike in traditional software, "coverage" for LLM agents is fuzzy. Useful approximations:
- Diversity of inputs covered by unit tests
- Diversity of paths covered by integration tests
- Diversity of trajectories covered by trajectory tests
- Coverage of known failure modes
A monthly review of "what bugs did we ship that the test suite missed" tells you where to invest.
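The last approximation is the easiest to make concrete. A sketch, assuming a hand-curated failure-mode registry and a test-tagging convention that are both hypothetical:

```python
# Curated list of failure modes the suite is supposed to exercise.
KNOWN_FAILURE_MODES = {"tool_timeout", "jailbreak", "stale_memory", "infinite_loop"}

def failure_mode_coverage(test_tags: dict[str, set[str]]) -> float:
    """test_tags maps test id -> the failure-mode tags that test exercises."""
    covered = set().union(*test_tags.values()) & KNOWN_FAILURE_MODES
    return len(covered) / len(KNOWN_FAILURE_MODES)
```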
Test Suite as Living Documentation
The test suite is also the agent's behavioral specification. Reading the trajectory tests should tell a new engineer what the agent does, so test names that describe scenarios in plain language pay for themselves.