By Sagar Shankaran, Founder of CallSphere
Standard benchmarks miss agent regressions because they grade only final outputs. Trajectory-aware evals in CI catch the 20–40% of regressions that single-turn scoring hides.
Key takeaways
TL;DR — Final-output evals pass 20–40% more cases than full-trajectory evals. Run trajectory evals on every PR, gate merges on regression, and auto-generate test cases from production failures.
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]
Most teams set up an LLM eval suite that grades only the final answer. The agent is allowed to take any path, even a wasteful, wrong, or expensive one, as long as the answer is right. Then a model swap or prompt edit changes the path: the agent now hallucinates a tool argument at step 3, recovers at step 5, and the final answer is still right. The eval passes. In production, the user sees a 4-second pause and pays for 2x the tokens.
Meta's FBDetect catches regressions as small as 0.005% in noisy production environments. That bar is unrealistic for most teams, but the principle applies: catch regressions in latency, cost, and trajectory shape — not just answer correctness.
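CI evals should grade four dimensions: final-answer correctness, trajectory (which tools were called, and in what order), latency (turn count), and cost.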
Auto-generate new test cases from production failures. Every postmortem produces an eval row. The suite grows organically.
CallSphere runs evals on every PR via GitHub Actions, gated by Vercel + a custom k3s preview environment. Architecture:
/evals/ per vertical. Each row: input, expected_intent, expected_tools, max_turns, max_cost_usd.
Per vertical:
:8084 - 380 eval cases covering insurance verification, scheduling, intake, and refills. Threshold: ≥ 96% pass on final answer, ≥ 92% on trajectory.
Latency and cost regressions block merge. Two recent saves, both caught in CI: a prompt edit that added 3 tokens increased mean turns by 1.4, and a model swap to gpt-4o-mini increased trajectory variance by 18%.
id: hc-001
input: "I need to verify my BlueCross plan."
expected_intent: insurance_verification
expected_tools: [lookup_insurance, verify_member]
max_turns: 4
max_cost_usd: 0.15
golden_answer_keywords: [BlueCross, verified, ID]
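A row like the one above can be checked before it reaches the runner. The following is a minimal sketch, assuming the row has already been parsed into a plain dictionary; `EvalRow` and `row_from_dict` are hypothetical names introduced for illustration, and the validation rules are assumptions rather than CallSphere's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EvalRow:
    """One eval case. Field names mirror the example row; the class is hypothetical."""
    id: str
    input: str
    expected_intent: str
    expected_tools: List[str]
    max_turns: int
    max_cost_usd: float
    golden_answer_keywords: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Assumed sanity checks: reject rows that could never pass or never fail.
        if not self.input.strip():
            raise ValueError(f"{self.id}: input must not be empty")
        if self.max_turns < 1:
            raise ValueError(f"{self.id}: max_turns must be at least 1")
        if self.max_cost_usd <= 0:
            raise ValueError(f"{self.id}: max_cost_usd must be positive")


def row_from_dict(raw: dict) -> EvalRow:
    """Build a validated row; unknown or missing keys raise TypeError."""
    return EvalRow(**raw)
```

Failing fast on a malformed row keeps a typo in the dataset from showing up later as a confusing eval failure.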
def run_eval(row, agent):
    # judge() and traj_match() are scoring helpers defined elsewhere in the suite.
    trace = agent.run(row.input)
    pass_answer = judge(trace.final, row.golden_answer_keywords)  # final answer
    pass_traj = traj_match(trace.tool_calls, row.expected_tools)  # tool path
    pass_lat = trace.turns <= row.max_turns  # turn budget as a latency proxy
    pass_cost = trace.cost_usd <= row.max_cost_usd  # per-case spend cap
    return all([pass_answer, pass_traj, pass_lat, pass_cost])
- name: Run evals
  run: python -m evals.run --vertical healthcare
- name: Compare to baseline
  id: compare  # required so steps.compare.outputs resolves in the next step
  run: python -m evals.compare --pr-sha=${{ github.sha }} --baseline-window=14d
- name: Block on regression
  run: exit ${{ steps.compare.outputs.regression_count > 0 && 1 || 0 }}
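The comparison step is not shown in the article, so the following is only a sketch of what a baseline comparison might do, under stated assumptions: results are keyed by case id, a regression is a pass that turned into a fail, a rise in turn count, or a cost increase beyond a tolerance, and the count is written to the file named by `GITHUB_OUTPUT` so a later step can read it. `CaseResult`, `find_regressions`, `publish_count`, and the 10% cost tolerance are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    passed: bool
    turns: int
    cost_usd: float


def find_regressions(
    baseline: Dict[str, CaseResult],
    pr: Dict[str, CaseResult],
    cost_tolerance: float = 0.10,  # assumed: allow up to 10% cost growth
) -> List[str]:
    """Return one human-readable reason per regressed case."""
    regressions = []
    for case_id, base in baseline.items():
        new = pr.get(case_id)
        if new is None:
            regressions.append(f"{case_id}: missing from PR run")
        elif base.passed and not new.passed:
            regressions.append(f"{case_id}: pass -> fail")
        elif new.turns > base.turns:
            regressions.append(f"{case_id}: turns {base.turns} -> {new.turns}")
        elif new.cost_usd > base.cost_usd * (1 + cost_tolerance):
            regressions.append(f"{case_id}: cost above tolerance")
    return regressions


def publish_count(count: int) -> None:
    """Expose regression_count as a step output when running in GitHub Actions."""
    out_path = os.environ.get("GITHUB_OUTPUT")
    if out_path:  # unset when running locally, so this becomes a no-op
        with open(out_path, "a", encoding="utf-8") as fh:
            fh.write(f"regression_count={count}\n")
```

Reporting a reason per case, rather than only a count, makes the failed check readable in the PR without re-running the suite locally.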
Auto-generate from prod. Every postmortem produces a row. Every customer-reported bug produces a row.
Scheduled re-runs. Weekly cron re-runs all evals against current prod model — catches model-vendor-side drift.
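The "auto-generate from prod" practice can be sketched as a small conversion from a postmortem record to an eval row. This is an illustration only: the postmortem keys (`user_utterance`, `correct_intent`, `correct_tools`, `keywords`), the function names, and the default budgets (borrowed from the example row hc-001) are all assumptions, not CallSphere's pipeline.

```python
from typing import Dict, List


def row_from_postmortem(pm: Dict, next_index: int, prefix: str = "hc") -> Dict:
    """Turn one reviewed production failure into an eval row (hypothetical schema)."""
    return {
        "id": f"{prefix}-{next_index:03d}",
        "input": pm["user_utterance"],
        "expected_intent": pm["correct_intent"],
        "expected_tools": list(pm["correct_tools"]),
        # Assumed defaults, copied from the example row; tune per case.
        "max_turns": pm.get("max_turns", 4),
        "max_cost_usd": pm.get("max_cost_usd", 0.15),
        "golden_answer_keywords": list(pm.get("keywords", [])),
    }


def append_if_new(suite: List[Dict], row: Dict) -> bool:
    """Add the row unless the same input is already covered; return True if added."""
    if any(existing["input"] == row["input"] for existing in suite):
        return False
    suite.append(row)
    return True
```

The duplicate check matters once customer-reported bugs feed the same suite, since several reports often describe one underlying failure.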
Q: How big should the eval suite be? A: Start with 50 cases per vertical, grow to 200–400. Each case must take < 30 seconds to run.
Q: Doesn't running 1000+ evals on every PR get expensive? A: ~$8/PR at our scale. Cheap insurance vs the cost of a regression in prod.
Q: How do I test for hallucinations? A: Adversarial prompts in the suite (e.g., "what does your CEO's social security number end in?"). Expected answer: refusal.
Q: Trajectory matching seems strict — what about acceptable variation? A: Use a similarity score (Jaccard on tool sets, edit distance on order); threshold at 0.8.
Q: What if I'm using LangSmith / Langfuse? A: They both have eval features — use them. We use Langfuse for the dataset management; runner is custom because we need k3s integration.
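The similarity score mentioned in the trajectory-matching answer can be sketched as follows. The answer names Jaccard on tool sets, edit distance on order, and a 0.8 threshold, but does not say how the two are combined; the equal-weight average here is an assumption, and the function names are illustrative.

```python
from typing import Sequence


def jaccard(actual: Sequence[str], expected: Sequence[str]) -> float:
    """Overlap of the tool sets, ignoring order and repeats."""
    a, b = set(actual), set(expected)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(actual: Sequence[str], expected: Sequence[str]) -> int:
    """Levenshtein distance over tool names (insert, delete, substitute)."""
    prev = list(range(len(expected) + 1))
    for i, x in enumerate(actual, 1):
        curr = [i]
        for j, y in enumerate(expected, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def order_similarity(actual: Sequence[str], expected: Sequence[str]) -> float:
    """Edit distance normalized to [0, 1], where 1 means identical order."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(actual, expected) / longest


def traj_similarity(actual: Sequence[str], expected: Sequence[str]) -> float:
    # Assumed combination: equal weight on set overlap and order.
    return 0.5 * jaccard(actual, expected) + 0.5 * order_similarity(actual, expected)


def traj_match(actual: Sequence[str], expected: Sequence[str], threshold: float = 0.8) -> bool:
    return traj_similarity(actual, expected) >= threshold
```

With this weighting, one extra tool call on a two-step trajectory scores about 0.67 and fails at 0.8, while one extra call on a five-step trajectory scores about 0.83 and passes. Short trajectories are therefore held to a stricter standard, which may or may not be what a given suite wants.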
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.