---
title: "Catching Performance Regressions in AI Agent CI Pipelines"
description: "Standard benchmarks miss agent regressions because they grade only final outputs. Trajectory-aware evals in CI catch the 20–40% of regressions that single-turn scoring hides."
canonical: https://callsphere.ai/blog/vw3c-performance-regressions-ai-agents-ci
category: "AI Engineering"
tags: ["CI/CD", "Regression Testing", "Evals", "AI Agents"]
author: "CallSphere Team"
published: 2026-05-03T00:00:00.000Z
updated: 2026-05-07T09:59:38.185Z
---

# Catching Performance Regressions in AI Agent CI Pipelines

> Standard benchmarks miss agent regressions because they grade only final outputs. Trajectory-aware evals in CI catch the 20–40% of regressions that single-turn scoring hides.

> **TL;DR** — Final-output evals pass 20–40% more cases than full-trajectory evals. Run trajectory evals on every PR, gate merges on regression, and auto-generate test cases from production failures.

## What goes wrong

```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

CallSphere reference architecture

Most teams set up an LLM eval suite that grades only the final answer. The agent is allowed to take any path — even a wasteful, wrong, expensive one — as long as the answer is right. Then a model swap or prompt edit changes the path: the agent now hallucinates a tool argument at step 3, recovers at step 5, and the final answer is still right. The eval passes. In production, the user sees a 4-second pause and you pay for 2x the tokens.

Meta's FBDetect catches regressions as small as 0.005% in noisy production environments. That bar is unrealistic for most teams, but the principle applies: catch regressions in latency, cost, and trajectory shape — not just answer correctness.

## How to monitor

CI evals should grade four dimensions:

1. **Final answer correctness** — exact match, semantic match, or LLM-as-judge (a minimal judge sketch follows this list).
2. **Trajectory** — set of tools called and their order. Compare to a golden trajectory.
3. **Latency** — total turns, total wall-clock, p95 turn latency.
4. **Cost** — total tokens. Reject if > 1.2x baseline.
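
For the first check, a minimal judge can combine a cheap keyword pre-filter with an LLM-as-judge fallback. The sketch below assumes the OpenAI Python SDK and a keyword list like the `golden_answer_keywords` field shown later; the prompt and function name are illustrative, not a production grader.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(final_answer: str, golden_keywords: list[str]) -> bool:
    """Grade the final answer: cheap keyword check first, LLM-as-judge fallback."""
    # Cheap pass: every golden keyword already appears in the answer.
    if all(kw.lower() in final_answer.lower() for kw in golden_keywords):
        return True
    # Fallback: ask a judge model whether the answer conveys the expected facts.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": "You grade agent answers. Reply PASS or FAIL only."},
            {
                "role": "user",
                "content": (
                    f"Expected facts: {', '.join(golden_keywords)}\n"
                    f"Agent answer: {final_answer}\n"
                    "Does the answer convey all expected facts?"
                ),
            },
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```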

Auto-generate new test cases from production failures. Every postmortem produces an eval row. The suite grows organically.

## CallSphere stack

CallSphere runs evals on every PR via GitHub Actions, gated by Vercel + a custom k3s preview environment. Architecture:

- **Eval suite** lives in `/evals/` per vertical. Each row: input, expected_intent, expected_tools, max_turns, max_cost_usd.
- **Runner** is a custom Python harness that boots a sandboxed agent against the PR branch, runs all evals in parallel, posts results as a PR comment.
- **Trajectory matcher** compares the actual tool-call set and order against the expected trajectory; allows a fuzzy match on order with a similarity score.
- **LLM-as-judge** (gpt-4o) for free-form answer grading.
- **Baselines stored in Postgres** — last 14 days of eval runs; PRs compared to median baseline.
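
The baseline gate itself is a single query plus a median. The sketch below assumes a hypothetical `eval_runs` table with `vertical`, `run_at`, and `total_cost_usd` columns and reuses the 1.2x cost multiplier from the checklist above; it is not our exact schema.

```python
import statistics

import psycopg2

COST_MULTIPLIER = 1.2  # reject PRs whose total eval cost exceeds 1.2x the baseline median

def cost_regression(pr_cost_usd: float, vertical: str, dsn: str) -> bool:
    """Compare a PR's total eval cost to the median of the last 14 days of runs."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT total_cost_usd
            FROM eval_runs
            WHERE vertical = %s AND run_at > now() - interval '14 days'
            """,
            (vertical,),
        )
        baseline = [row[0] for row in cur.fetchall()]
    if not baseline:
        return False  # no baseline yet; don't block the first runs
    return pr_cost_usd > COST_MULTIPLIER * statistics.median(baseline)
```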

Per vertical:

- **Healthcare FastAPI `:8084`** — 380 eval cases covering insurance verification, scheduling, intake, refills. Threshold: ≥ 96% pass on final answer, ≥ 92% on trajectory.
- **Real Estate** — 240 cases. Heavy on tool-call order because the planning loop is sensitive to it.
- **Sales** — 180 cases. Includes adversarial pricing questions ("what's your real price?" — checks the agent quotes from [/pricing](/pricing)).
- **After-hours Bull/Redis queue** — 90 cases. Async, so eval is on outbound voicemail content.

Latency and cost regressions block merge. Two recent saves: a prompt edit added 3 tokens that increased mean turns by 1.4 (caught in CI); a model swap to gpt-4o-mini increased trajectory variance by 18% (caught in CI). Try the [14-day trial](/trial).
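
The gate that blocks merge is a few lines. Below is a sketch of how the per-vertical pass-rate floors above could fail a CI job; only healthcare's floors are stated above, so the other verticals' values and the result shape here are assumptions, not our exact config.

```python
import sys

# Pass-rate floors (final answer, trajectory). Healthcare's are listed above;
# the other verticals' floors are placeholders for illustration.
THRESHOLDS = {
    "healthcare": (0.96, 0.92),
    "real_estate": (0.96, 0.92),
    "sales": (0.96, 0.92),
    "after_hours": (0.96, 0.92),
}

def gate(results: dict[str, dict[str, float]]) -> int:
    """results maps vertical -> {'answer_pass_rate': ..., 'traj_pass_rate': ...}."""
    failing = [
        v
        for v, (min_answer, min_traj) in THRESHOLDS.items()
        if v in results
        and (results[v]["answer_pass_rate"] < min_answer
             or results[v]["traj_pass_rate"] < min_traj)
    ]
    if failing:
        print(f"Eval gate failed for: {', '.join(failing)}")
        return 1
    return 0

if __name__ == "__main__":
    # Example: healthcare just above both floors -> exit 0, merge allowed.
    sys.exit(gate({"healthcare": {"answer_pass_rate": 0.97, "traj_pass_rate": 0.93}}))
```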

## Implementation

1. **Eval row format.**

```yaml
id: hc-001
input: "I need to verify my BlueCross plan."
expected_intent: insurance_verification
expected_tools: [lookup_insurance, verify_member]
max_turns: 4
max_cost_usd: 0.15
golden_answer_keywords: [BlueCross, verified, ID]
```
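
Rows of this shape can be loaded into a small dataclass. The sketch below assumes one YAML document per row in a multi-document file under `/evals/`; the default for `golden_answer_keywords` is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path

import yaml  # PyYAML

@dataclass
class EvalRow:
    id: str
    input: str
    expected_intent: str
    expected_tools: list[str]
    max_turns: int
    max_cost_usd: float
    golden_answer_keywords: list[str] = field(default_factory=list)

def load_rows(path: str) -> list[EvalRow]:
    """Load every eval row from a multi-document YAML file."""
    docs = yaml.safe_load_all(Path(path).read_text())
    return [EvalRow(**doc) for doc in docs if doc]
```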

2. **Runner.**

```python
def run_eval(row, agent):
    trace = agent.run(row.input)
    pass_answer = judge(trace.final, row.golden_answer_keywords)
    pass_traj = traj_match(trace.tool_calls, row.expected_tools)
    pass_lat = trace.turns <= row.max_turns
    pass_cost = trace.cost_usd <= row.max_cost_usd
    return all([pass_answer, pass_traj, pass_lat, pass_cost])
```
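
The `traj_match` call above can be the similarity score described in the FAQ below: Jaccard on the tool set plus an order-similarity term, thresholded at 0.8. A sketch, with the 50/50 weighting and the use of `SequenceMatcher` as a stand-in for normalized edit distance both being assumptions:

```python
from difflib import SequenceMatcher

def traj_match(actual: list[str], expected: list[str], threshold: float = 0.8) -> bool:
    """Fuzzy trajectory match: Jaccard on the tool set plus order similarity."""
    if not actual and not expected:
        return True
    set_a, set_e = set(actual), set(expected)
    union = set_a | set_e
    jaccard = len(set_a & set_e) / len(union) if union else 1.0
    # Order similarity over the call sequence (cheap proxy for edit distance).
    order = SequenceMatcher(None, actual, expected).ratio()
    score = 0.5 * jaccard + 0.5 * order
    return score >= threshold
```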

3. **Auto-generate from prod.** Every postmortem produces a row. Every customer-reported bug produces a row (a sketch of the conversion follows this list).
4. **Scheduled re-runs.** Weekly cron re-runs all evals against the current prod model — catches model-vendor-side drift.
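
A sketch of step 3: turning a reviewed production failure into a new row appended to the vertical's suite. The trace field names here are hypothetical; the point is that the postmortem review, not the code, supplies the expected intent and trajectory.

```python
from pathlib import Path

import yaml  # PyYAML

def row_from_failure(trace: dict, vertical: str, evals_dir: str = "evals") -> None:
    """Append an eval row derived from a reviewed production failure."""
    row = {
        "id": f"{vertical}-prod-{trace['call_id']}",
        "input": trace["user_utterance"],
        "expected_intent": trace["reviewed_intent"],      # filled in during the postmortem
        "expected_tools": trace["reviewed_tool_calls"],   # the trajectory the agent should have taken
        "max_turns": trace["reviewed_max_turns"],
        "max_cost_usd": round(1.2 * trace["cost_usd"], 2),  # small headroom over observed cost (assumption)
        "golden_answer_keywords": trace["reviewed_keywords"],
    }
    path = Path(evals_dir) / vertical / "from_prod.yaml"
    with path.open("a") as f:
        f.write("---\n")
        yaml.safe_dump(row, f, sort_keys=False)
```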

## FAQ

**Q: How big should the eval suite be?**
A: Start with 50 cases per vertical, grow to 200–400. Each case must take < 30 seconds to run.

**Q: Doesn't running 1000+ evals on every PR get expensive?**
A: ~$8/PR at our scale. Cheap insurance vs the cost of a regression in prod.

**Q: How do I test for hallucinations?**
A: Adversarial prompts in the suite (e.g., "what does your CEO's social security number end in?"). Expected answer: refusal.

**Q: Trajectory matching seems strict — what about acceptable variation?**
A: Use a similarity score (Jaccard on tool sets, edit distance on order); threshold at 0.8.

**Q: What if I'm using LangSmith / Langfuse?**
A: They both have eval features — use them. We use Langfuse for dataset management; the runner is custom because we need k3s integration.

## Sources

- [Latitude — Top LLM Evaluation Tools for AI Agents 2026](https://latitude.so/blog/top-llm-evaluation-tools-ai-agents-2026-devto)
- [Meta Engineering — Capacity Efficiency at Meta with Unified AI Agents](https://engineering.fb.com/2026/04/16/developer-tools/capacity-efficiency-at-meta-how-unified-ai-agents-optimize-performance-at-hyperscale/)
- [Adaline — Complete Guide to LLM and AI Agent Evaluation 2026](https://www.adaline.ai/blog/complete-guide-llm-ai-agent-evaluation-2026)
- [Braintrust — AI agent evaluation framework](https://www.braintrust.dev/articles/ai-agent-evaluation-framework)

