TL;DR — Final-output evals pass 20–40% more cases than full-trajectory evals. Run trajectory evals on every PR, gate merges on regression, and auto-generate test cases from production failures.

What goes wrong

flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]

CallSphere reference architecture

Most teams set up an LLM eval suite that grades only the final answer. The agent is allowed to take any path — even a wasteful, wrong, expensive one — as long as the answer is right. Then a model swap or prompt edit changes the path, the agent now hallucinates a tool argument at step 3, recovers at step 5, and the final answer is still right. Eval passes. In production, the user sees a 4-second pause and pays for 2x the tokens.

Meta's FBDetect catches regressions as small as 0.005% in noisy production environments. That bar is unrealistic for most teams, but the principle applies: catch regressions in latency, cost, and trajectory shape — not just answer correctness.

How to monitor

CI evals should grade four dimensions:

Final answer correctness — exact match, semantic match, or LLM-as-judge.
Trajectory — set of tools called and their order. Compare to a golden trajectory.
Latency — total turns, total wall-clock, p95 turn latency.
Cost — total tokens. Reject if > 1.2x baseline.
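
As a rough sketch of how these four dimensions might roll up into one verdict per case: the `RunMetrics` and `Baseline` types and their field names below are illustrative assumptions, not CallSphere's harness, and the trajectory check is a strict order match for simplicity.

```python
from dataclasses import dataclass


# Hypothetical containers; the field names are assumptions for illustration.
@dataclass
class RunMetrics:
    answer_correct: bool      # outcome of exact match, semantic match, or a judge
    tools_called: list[str]   # tool names in the order they were called
    turns: int
    total_tokens: int


@dataclass
class Baseline:
    golden_tools: list[str]
    max_turns: int
    total_tokens: int


COST_RATIO_LIMIT = 1.2  # "Reject if > 1.2x baseline"


def grade(run: RunMetrics, base: Baseline) -> dict[str, bool]:
    """Return one pass/fail flag per dimension so failures are attributable."""
    return {
        "answer": run.answer_correct,
        "trajectory": run.tools_called == base.golden_tools,  # strict order match
        "latency": run.turns <= base.max_turns,
        "cost": run.total_tokens <= COST_RATIO_LIMIT * base.total_tokens,
    }
```

Returning a flag per dimension, rather than a single boolean, makes it possible to report which dimension regressed in the PR comment.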

Auto-generate new test cases from production failures. Every postmortem produces an eval row. The suite grows organically.

CallSphere stack

CallSphere runs evals on every PR via GitHub Actions, gated by Vercel + a custom k3s preview environment. Architecture:

Eval suite lives in /evals/ per vertical. Each row: input, expected_intent, expected_tools, max_turns, max_cost_usd.
Runner is a custom Python harness that boots a sandboxed agent against the PR branch, runs all evals in parallel, posts results as a PR comment.
Trajectory matcher compares the actual tool-call set and order against the expected ones; order is matched fuzzily and reported as a similarity score.
LLM-as-judge (gpt-4o) for free-form answer grading.
Baselines stored in Postgres — last 14 days of eval runs; PRs compared to median baseline.
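
The comparison against a median baseline could look roughly like the sketch below. It assumes the 14 days of baseline values have already been fetched from storage into a list, that higher values are worse (cost, turns, latency), and that a 5% tolerance is acceptable; the function name and the tolerance are assumptions, not values from the CallSphere setup.

```python
from statistics import median


def regressed(pr_value: float, baseline_values: list[float], tolerance: float = 0.05) -> bool:
    """True if pr_value exceeds the baseline median by more than `tolerance`.

    Intended for metrics where higher is worse (cost, turns, latency).
    """
    if not baseline_values:
        return False  # no history yet, so there is nothing to regress against
    base = median(baseline_values)
    return pr_value > base * (1 + tolerance)
```

Using the median rather than the mean keeps a single noisy run in the window from shifting the baseline.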

Per vertical:

Healthcare FastAPI :8084 — 380 eval cases covering insurance verification, scheduling, intake, refills. Threshold: ≥ 96% pass on final answer, ≥ 92% on trajectory.
Real Estate — 240 cases. Heavy on tool-call order because the planning loop is sensitive to it.
Sales — 180 cases. Includes adversarial pricing questions ("what's your real price?" — checks the agent quotes from /pricing).
After-hours Bull/Redis queue — 90 cases. Async, so eval is on outbound voicemail content.

Latency and cost regressions block merge. Two recent saves, both caught in CI: a prompt edit that added 3 tokens increased mean turns by 1.4, and a model swap to gpt-4o-mini increased trajectory variance by 18%.

Implementation

Eval row format.

id: hc-001
input: "I need to verify my BlueCross plan."
expected_intent: insurance_verification
expected_tools: [lookup_insurance, verify_member]
max_turns: 4
max_cost_usd: 0.15
golden_answer_keywords: [BlueCross, verified, ID]
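
One way to hold such a row in the harness is a small typed record with basic validation. This is a sketch only: `EvalRow` and `parse_row` are hypothetical names, and the input is assumed to be a mapping that a YAML loader has already produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRow:
    id: str
    input: str
    expected_intent: str
    expected_tools: tuple[str, ...]
    max_turns: int
    max_cost_usd: float
    golden_answer_keywords: tuple[str, ...] = ()


def parse_row(raw: dict) -> EvalRow:
    """Validate one already-parsed row and convert it to an EvalRow."""
    if raw["max_turns"] <= 0:
        raise ValueError(f"{raw['id']}: max_turns must be positive")
    if raw["max_cost_usd"] <= 0:
        raise ValueError(f"{raw['id']}: max_cost_usd must be positive")
    return EvalRow(
        id=raw["id"],
        input=raw["input"],
        expected_intent=raw["expected_intent"],
        expected_tools=tuple(raw["expected_tools"]),
        max_turns=int(raw["max_turns"]),
        max_cost_usd=float(raw["max_cost_usd"]),
        golden_answer_keywords=tuple(raw.get("golden_answer_keywords", ())),
    )
```

Rejecting malformed rows at load time keeps a bad row from silently passing or failing every PR.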

Runner.

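def run_eval(row, agent) -> bool:
    """Run one eval row; pass only if all four dimensions pass."""
    # agent.run returns a trace exposing .final, .tool_calls, .turns, .cost_usd
    trace = agent.run(row.input)
    # judge and traj_match are harness helpers defined elsewhere
    pass_answer = judge(trace.final, row.golden_answer_keywords)
    pass_traj = traj_match(trace.tool_calls, row.expected_tools)
    pass_lat = trace.turns <= row.max_turns          # latency budget, in turns
    pass_cost = trace.cost_usd <= row.max_cost_usd   # cost budget, in USD
    return all([pass_answer, pass_traj, pass_lat, pass_cost])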
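
The runner calls `judge` without showing it. For rows that carry `golden_answer_keywords`, a deterministic stand-in could be a whole-word keyword check like the sketch below. This is an assumption about what such a helper might do, not the LLM-as-judge grader described earlier, and it only fits answers where keyword presence is a fair proxy for correctness.

```python
import re


def judge(final_answer: str, keywords: list[str]) -> bool:
    """Pass only if every keyword appears as a whole word, ignoring case."""
    return all(
        re.search(rf"\b{re.escape(keyword)}\b", final_answer, re.IGNORECASE)
        for keyword in keywords
    )
```

Matching on word boundaries matters for short keywords such as "ID", which would otherwise match inside words like "provide".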

GitHub Actions workflow.

- name: Run evals
  run: python -m evals.run --vertical healthcare
- name: Compare to baseline
  id: compare  # required so later steps can read steps.compare.outputs
  run: python -m evals.compare --pr-sha=${{ github.sha }} --baseline-window=14d
- name: Block on regression
  run: exit ${{ steps.compare.outputs.regression_count > 0 && 1 || 0 }}
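
The last step reads `steps.compare.outputs.regression_count`, which only resolves if the compare step has `id: compare` and writes that output itself. GitHub Actions collects step outputs from `name=value` lines appended to the file named by the `GITHUB_OUTPUT` environment variable. A sketch of how the compare script might do that follows; the helper names and the shape of `results` are assumptions.

```python
import os


def write_step_output(name: str, value: str) -> None:
    """Append name=value to the file GitHub Actions exposes via $GITHUB_OUTPUT."""
    path = os.environ.get("GITHUB_OUTPUT")
    if path is None:
        print(f"{name}={value}")  # local run outside Actions: just print it
        return
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


def count_regressions(results: dict[str, bool]) -> int:
    """Count eval ids flagged as regressed (True means regressed)."""
    return sum(1 for is_regressed in results.values() if is_regressed)


def report(results: dict[str, bool]) -> int:
    """Publish the regression count as a step output and return it."""
    count = count_regressions(results)
    write_step_output("regression_count", str(count))
    return count
```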

Auto-generate from prod. Every postmortem produces a row. Every customer-reported bug produces a row.
Scheduled re-runs. Weekly cron re-runs all evals against current prod model — catches model-vendor-side drift.
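
The auto-generation step could be as simple as mapping a postmortem record onto the eval row format. In the sketch below, the `failure` record and its keys are hypothetical, and the turn and cost budgets are passed in explicitly because how to pick them is a team decision.

```python
def row_from_failure(failure: dict, max_turns: int, max_cost_usd: float) -> dict:
    """Build an eval row from a postmortem record.

    `failure` is a hypothetical record with keys: incident_id, user_input,
    correct_intent, correct_tools, and optionally answer_keywords.
    """
    return {
        "id": f"pm-{failure['incident_id']}",
        "input": failure["user_input"],
        "expected_intent": failure["correct_intent"],
        "expected_tools": list(failure["correct_tools"]),
        "max_turns": max_turns,
        "max_cost_usd": max_cost_usd,
        "golden_answer_keywords": list(failure.get("answer_keywords", [])),
    }
```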

FAQ

Q: How big should the eval suite be? A: Start with 50 cases per vertical, grow to 200–400. Each case must take < 30 seconds to run.

Q: Doesn't running 1000+ evals on every PR get expensive? A: ~$8/PR at our scale. Cheap insurance vs the cost of a regression in prod.

Q: How do I test for hallucinations? A: Adversarial prompts in the suite (e.g., "what does your CEO's social security number end in?"). Expected answer: refusal.

Q: Trajectory matching seems strict — what about acceptable variation? A: Use a similarity score (Jaccard on tool sets, edit distance on order); threshold at 0.8.
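
A minimal sketch of such a score, assuming the two signals are averaged with equal weight (the weighting is an assumption; the answer above only names the signals and the 0.8 threshold):

```python
def jaccard(actual: list[str], expected: list[str]) -> float:
    """Overlap of the tool sets, ignoring order and repeats."""
    a, b = set(actual), set(expected)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(actual: list[str], expected: list[str]) -> int:
    """Levenshtein distance over tool-call sequences."""
    prev = list(range(len(expected) + 1))
    for i, x in enumerate(actual, 1):
        cur = [i]
        for j, y in enumerate(expected, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def order_similarity(actual: list[str], expected: list[str]) -> float:
    """Edit distance normalised to [0, 1], where 1 means identical order."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(actual, expected) / longest


def traj_match(actual: list[str], expected: list[str], threshold: float = 0.8) -> bool:
    """Pass if the equally weighted average of both scores meets the threshold."""
    score = 0.5 * jaccard(actual, expected) + 0.5 * order_similarity(actual, expected)
    return score >= threshold
```

With these weights, swapping the order of a two-tool trajectory scores 0.5 and fails, while one missing call in a six-tool trajectory scores about 0.83 and passes.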

Q: What if I'm using LangSmith / Langfuse? A: They both have eval features — use them. We use Langfuse for the dataset management; runner is custom because we need k3s integration.

Catching Performance Regressions in AI Agent CI Pipelines
