By Sagar Shankaran, Founder of CallSphere
Run offline evals as a CI gate. GitHub Actions wiring, threshold gates, LangSmith Experiments, and how to block merges on agent regression — with real YAML.
Key takeaways
If your team treats agent prompt and tool changes the same way you treat backend code — review, test, ship — but the "test" step is "the engineer ran it once in a notebook and it looked fine," you have a release process from 2023. Modern agent teams gate every merge on an offline continuous evaluation run that produces a numeric score against a held-out dataset, compares it to the baseline on main, and blocks the PR if any threshold regresses. This post is the working YAML, the working Python, and the operational lessons from running this gate against agents that handle ~280k voice and chat sessions per month on CallSphere. Setup time: half a day. Saves: roughly two production rollbacks per month.
Agent code is a moving target. The same git diff can pass review, deploy cleanly, and degrade quality by 8 points on factual accuracy because some interaction between a new tool argument and the existing system prompt only surfaces on certain conversation patterns. No unit test will catch that. No code review will catch that. The only thing that catches it is running the agent against a representative dataset and grading the outputs.
The argument I hear most often against wiring this into CI is "evals are expensive and slow." Both true. Both not as bad as you think once you tier the suite. Below is the architecture we run.
flowchart TD
A[Engineer opens PR] --> B{What changed?}
B -->|Agent prompt, tools, or model config| C[Run FULL eval suite]
B -->|Other code| D[Run SMOKE eval suite]
C --> E[LangSmith Experiment created]
D --> E
E --> F{Score vs main baseline}
F -->|All thresholds pass| G[Mark CI green + post diff comment]
F -->|Any threshold regressed| H[Mark CI red + block merge]
G --> I[Reviewer approves]
I --> J[Merge to main]
J --> K[Re-run eval on main, write new baseline]
K --> L[Deploy]
L --> M[Online evals on 5% of traffic]
M -->|Stable 24-48h| N[Promote 100%]
M -->|Drift| O[Auto-rollback]
style H fill:#fcc
style G fill:#cfc
style O fill:#fcc
Figure 1 — The eval gate is just another required check on the PR. The branch protection rule does the blocking; the workflow does the measuring.
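The routing decision at the top of Figure 1 can also be reproduced locally, for example in a pre-push hook. The following is a minimal sketch rather than the workflow's own logic: `AGENT_GLOBS` and `choose_suite` are hypothetical names, and `fnmatch` only approximates the glob semantics a CI path filter uses.

```python
from fnmatch import fnmatch

# Hypothetical glob list; adjust to the directories that hold agent code.
AGENT_GLOBS = ["agents/*", "prompts/*", "tools/*", "evaluators/*", "pyproject.toml"]


def choose_suite(changed_files):
    """Return "full" if any changed path looks like agent code, else "smoke"."""
    for path in changed_files:
        # fnmatch's "*" also matches "/", so "agents/*" covers nested files.
        if any(fnmatch(path, pattern) for pattern in AGENT_GLOBS):
            return "full"
    return "smoke"
```

Feeding it the output of `git diff --name-only origin/main` gives a quick preview of which tier a PR is likely to trigger.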
We split the suite into three tiers — not because we love complexity, but because cost and signal are not linearly related.
| Tier | Rows | Runtime | When it runs | Blocks merge? |
|---|---|---|---|---|
| Smoke | 80 | ~90s | Every PR | Yes (loose thresholds) |
| Full | 700 | ~6 min | PRs touching agent code | Yes (tight thresholds) |
| Adversarial | 220 | ~3 min | Nightly + pre-release | Notifies, does not block |
The "smoke" suite is sampled from the regression dataset (covered in the companion observability workflow piece) plus a stratified sample of golden cases. It is fast enough that even a markdown-only docs PR runs it. The full suite runs only when path filters detect changes to agent prompts, tools, evaluator code, or model config.
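One way to draw a stratified sample is to allocate slots to each category in proportion to its size, with at least one row per category so rare conversation types stay represented. A minimal sketch, assuming each dataset row is a dict with a `category` field; that schema and the helper name `stratified_sample` are illustrative, not the production sampler.

```python
import random
from collections import defaultdict


def stratified_sample(rows, total, key="category", seed=0):
    """Pick roughly `total` rows, allocating slots proportionally per stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sample stable across runs
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for _, members in sorted(strata.items()):
        # At least one row per stratum, never more than the stratum holds.
        share = max(1, round(total * len(members) / len(rows)))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample
```

Because of rounding and the one-row minimum, the result can be slightly larger or smaller than `total` when there are many small strata.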
Here is the working file we use. Trim to taste.
name: agent-eval-gate

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

permissions:
  contents: read
  pull-requests: write

jobs:
  detect-scope:
    runs-on: ubuntu-latest
    outputs:
      run_full: ${{ steps.filter.outputs.agent }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            agent:
              - 'agents/**'
              - 'prompts/**'
              - 'tools/**'
              - 'evaluators/**'
              - 'pyproject.toml'

  smoke-eval:
    needs: detect-scope
    runs-on: ubuntu-latest
    timeout-minutes: 8
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: pip
      - run: pip install -e '.[eval]'
      - name: Run smoke eval
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GIT_SHA: ${{ github.sha }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: python scripts/run_eval.py --suite smoke --gate

  full-eval:
    needs: detect-scope
    if: needs.detect-scope.outputs.run_full == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: pip
      - run: pip install -e '.[eval]'
      - name: Run full eval
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GIT_SHA: ${{ github.sha }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: python scripts/run_eval.py --suite full --gate
      - name: Comment results on PR
        # always(): post the summary even when the gate step above failed,
        # which is when reviewers most need to see the diff.
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            if (!fs.existsSync('eval_summary.md')) return;
            const body = fs.readFileSync('eval_summary.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });
Two things to notice. First, the paths-filter step is what makes this affordable — most PRs do not touch agent code, and they pay only the smoke cost. Second, the comment step posts the experiment diff inline on the PR so reviewers do not need to leave GitHub to see whether quality moved.
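The comment step reads `eval_summary.md`, so the eval script has to write that file before it exits. A minimal sketch of such a writer, assuming `scores` and `baseline` are plain dicts of evaluator name to mean score and `failures` is a list of strings; the signature and table layout here are illustrative, not the production helper.

```python
def write_summary_md(scores, baseline, failures, url, path="eval_summary.md"):
    """Render a per-evaluator diff table as Markdown and write it to `path`."""
    lines = [
        "### Eval gate: " + ("FAILED" if failures else "passed"),
        "",
        "| Evaluator | Baseline | This PR | Delta |",
        "|---|---|---|---|",
    ]
    for name in sorted(scores):
        cur = scores[name]
        base = baseline.get(name)
        if base is None:
            # No baseline yet, e.g. the first run on a new dataset.
            lines.append(f"| {name} | n/a | {cur:.3f} | n/a |")
        else:
            lines.append(f"| {name} | {base:.3f} | {cur:.3f} | {cur - base:+.3f} |")
    if failures:
        lines += ["", "**Failures**", ""] + [f"- {f}" for f in failures]
    if url:
        lines += ["", f"[Open experiment]({url})"]
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return path
```

Writing the file before the gate decides whether to exit non-zero means a failing run still leaves a summary behind for the comment step.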
The Python side is where the gating logic lives. The pattern: run evaluate(), fetch the baseline experiment from main, diff per-evaluator, exit non-zero if any threshold regresses.
import argparse
import os
import sys

from langsmith import Client, evaluate

from my_agent import build_agent
from my_evaluators import EVALUATORS

# Per-evaluator gate: "min" is the absolute floor, "max_regression" is the
# largest allowed drop relative to the latest baseline on main.
THRESHOLDS = {
    "factual_match": {"min": 0.92, "max_regression": 0.02},
    "no_hallucination": {"min": 0.95, "max_regression": 0.01},
    "tool_call_correct": {"min": 0.90, "max_regression": 0.03},
    "tone_appropriate": {"min": 0.88, "max_regression": 0.05},
    "latency_ok": {"min": 0.85, "max_regression": 0.03},
}


def predict(inputs):
    return {"output": build_agent().invoke(inputs)}


def main(suite: str, gate: bool):
    client = Client()
    dataset = "smoke-suite" if suite == "smoke" else "regression-suite"
    sha = os.environ["GIT_SHA"][:7]
    # On pushes to main the workflow sets PR_NUMBER to an empty string,
    # so fall back to "main" for empty values as well as missing ones.
    pr = os.environ.get("PR_NUMBER") or "main"

    exp = evaluate(
        predict,
        data=dataset,
        evaluators=EVALUATORS,
        experiment_prefix=f"pr-{pr}-{sha}",
        metadata={"sha": sha, "pr": pr, "suite": suite},
        max_concurrency=8,
    )
    df = exp.to_pandas()
    scores = {ev: df[f"feedback.{ev}"].mean() for ev in THRESHOLDS}

    # Pull baseline (latest main run for this suite)
    baseline = fetch_main_baseline(client, suite)

    failures = []
    for ev, t in THRESHOLDS.items():
        cur = scores[ev]
        base = baseline.get(ev, cur)  # no baseline yet: only the floor applies
        if cur < t["min"]:
            failures.append(f"{ev}: {cur:.3f} below floor {t['min']}")
        if (base - cur) > t["max_regression"]:
            failures.append(
                f"{ev}: regressed {base - cur:.3f} (limit {t['max_regression']})"
            )

    # Write the summary before exiting so the PR comment step can post it.
    write_summary_md(scores, baseline, failures, exp.experiment_url)

    if gate and failures:
        print("EVAL GATE FAILED:")
        for f in failures:
            print(" -", f)
        sys.exit(1)


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--suite", choices=["smoke", "full"], required=True)
    p.add_argument("--gate", action="store_true")
    main(**vars(p.parse_args()))
The gate has two distinct conditions: an absolute floor (factual_match must be ≥ 0.92, period) and a relative regression limit (a one-PR drop bigger than 0.02 is suspicious even if the absolute number is still high). Both are needed. Floors stop slow drift; regression limits stop sudden cliffs.
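To see why both conditions matter, the check can be isolated into a pure function and run on made-up numbers. This is a sketch with hypothetical scores, not output from a real run; `check_gate` is an illustrative name.

```python
def check_gate(scores, baseline, thresholds):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, t in thresholds.items():
        cur = scores[name]
        base = baseline.get(name, cur)  # no baseline: only the floor applies
        if cur < t["min"]:
            failures.append(f"{name}: {cur:.3f} below floor {t['min']}")
        if base - cur > t["max_regression"]:
            failures.append(f"{name}: regressed {base - cur:.3f}")
    return failures


thresholds = {"factual_match": {"min": 0.92, "max_regression": 0.02}}

# Sudden cliff: still above the floor, but a 0.04 drop in one PR.
cliff = check_gate({"factual_match": 0.94}, {"factual_match": 0.98}, thresholds)

# Slow drift: this PR lost only 0.01, but the score has slid under the floor.
drift = check_gate({"factual_match": 0.91}, {"factual_match": 0.92}, thresholds)
```

In the first case only the regression limit fires; in the second only the floor does. Dropping either condition would let one of these PRs through.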
LangSmith's Experiments primitive is the load-bearing piece here. Every PR run is a named experiment tagged with the SHA. Every merge to main runs the same eval and tags it as the new baseline. fetch_main_baseline is just a query for the most recent main-tagged experiment on that dataset:
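def fetch_main_baseline(client: Client, suite: str) -> dict:
    # Use the same dataset names as main(); f"{suite}-suite" would look up
    # "full-suite", which does not match "regression-suite".
    dataset = "smoke-suite" if suite == "smoke" else "regression-suite"
    # Confirm list_experiments and aggregate_scores against your installed
    # langsmith SDK version; the query shape may differ between releases.
    runs = client.list_experiments(
        dataset_name=dataset,
        metadata_filter={"pr": "main"},
        limit=1,
        order_by="-created_at",
    )
    latest = next(iter(runs), None)
    if not latest:
        return {}
    return {
        ev: latest.aggregate_scores[ev]
        for ev in THRESHOLDS
        if ev in latest.aggregate_scores
    }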
This pattern is what makes the gate a true "did this PR make things worse" check rather than a static pass/fail. A PR that improves factual_match from 0.93 to 0.95 should not be blocked because tone_appropriate dipped from 0.93 to 0.91 — unless the regression limit says otherwise. We tune the limits per evaluator based on observed historical noise.
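One possible way to derive a regression limit from historical noise is to measure how much an evaluator's score moves between consecutive main runs. A sketch under that assumption; the multiplier `k=2.0`, the `floor` default, and the helper name are illustrative choices, not the values used in production.

```python
from statistics import stdev


def suggest_max_regression(history, k=2.0, floor=0.01):
    """Suggest a limit from run-to-run deltas of past baseline scores.

    `history` is a chronological list of one evaluator's mean scores on main.
    """
    if len(history) < 3:
        return floor  # not enough data to estimate noise
    deltas = [b - a for a, b in zip(history, history[1:])]
    noise = stdev(deltas)
    # Never go below `floor`, so a quiet stretch does not make the gate brittle.
    return max(floor, round(k * noise, 3))
```

A noisy evaluator ends up with a looser limit than a stable one, which keeps the gate from flagging ordinary run-to-run variation as a regression.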
The CI run is theater unless you mark it required in branch protection rules. The settings we use on the agent repo:
- Required status checks: smoke-eval, full-eval (when applicable), unit-tests, type-check.
- Require branches to be up to date before merging.
Without "branches up to date," a PR that passed eval against an old baseline can merge into a main that has since absorbed two improvements, and the new main quietly regresses against the now-stale baseline. The "up to date" rule forces a re-eval against current main.
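These settings can also be applied through GitHub's REST API (`PUT /repos/{owner}/{repo}/branches/{branch}/protection`) rather than by hand. The sketch below only builds the request body. The helper name is hypothetical, the check names are the job names used in this post, and the `enforce_admins` and review-count values are assumptions to adjust. `strict: true` is the API field that requires the branch to be up to date before merging.

```python
import json


def branch_protection_payload(checks):
    """Build a request body for GitHub's 'update branch protection' endpoint."""
    return {
        "required_status_checks": {
            "strict": True,  # require the branch to be up to date with base
            "contexts": list(checks),
        },
        # Assumed values; the endpoint expects these keys to be present.
        "enforce_admins": True,
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        "restrictions": None,
    }


payload = branch_protection_payload(
    ["smoke-eval", "full-eval", "unit-tests", "type-check"]
)
body = json.dumps(payload)  # send with your HTTP client of choice
```

Keeping the payload in the repo makes the protection rule reviewable in the same way as the workflow file.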
We have run this gate for about 14 months across the agents powering our healthcare, real estate, sales, salon, IT helpdesk, and after-hours verticals. Things we learned the hard way:
- Pin model versions: gpt-4o-2024-08-06, not gpt-4o. Floating aliases poison your historical baselines.
Before we wired this in, our agent regression rate was about 1.3 incidents per week at our scale, and our mean time to detect a quality regression was 38 hours (long enough that real customers noticed and complained). After:
| Metric | Before gate | After 6 months |
|---|---|---|
| Quality regressions per week | 1.3 | 0.2 |
| Mean time to detect regression | 38h | 11 min (CI) |
| Engineer hours/wk on rollbacks | ~6 | ~1 |
| LangSmith + LLM eval cost | $0 | ~$340/mo |
| "Did this prompt change help?" arguments | Many | Zero |
The cost-to-value math is not subtle. The hardest part is the social shift: engineers must accept that an eval result is data, not opinion, even when it disagrees with their gut.
Normal tests assert exact equality on deterministic functions. Agent evals score graded outputs from non-deterministic functions against rubrics, with thresholds and regression bounds. The mental model is closer to a load test or a benchmark than a unit test — you compare distributions of results, not individual pass/fail.
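To make "compare distributions" concrete: given per-row scores from the PR run and the baseline run on the same dataset rows, a paired bootstrap yields an interval for the mean change instead of a single number. This is a generic statistical sketch, not part of the gate described above; the function name and defaults are illustrative.

```python
import random


def paired_bootstrap_ci(cur, base, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(cur - base) over matched dataset rows."""
    if len(cur) != len(base) or not cur:
        raise ValueError("cur and base must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [c - b for c, b in zip(cur, base)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample row-level differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the whole interval sits below zero, the drop is unlikely to be run-to-run noise; if it straddles zero, the change may not be real.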
You can run this gate without LangSmith, but you will rebuild Experiments, Datasets, comparison views, and online evals yourself. We have done both. Off-the-shelf tooling saves about a quarter of engineering time once you account for the maintenance burden of a homegrown eval harness.
Two practices keep the dataset current: (a) every shipped regression goes into the dataset (see the trace-to-fix workflow); (b) a quarterly review in which domain experts audit a sample of dataset rows and either refresh the reference outputs or retire rows that no longer reflect the product. Datasets are living artifacts.
Model upgrades will shift the baseline, intentionally. When OpenAI ships a new model, we run the full suite against it on a branch, accept the new baseline, and update thresholds if the model is genuinely better at some evaluators and slightly worse at others. The gate makes the upgrade decision visible and quantitative instead of vibes-based.
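Because baselines are only comparable when the model behind them is fixed, a cheap guard is to fail fast when the configured model name lacks a date suffix. A sketch assuming the `name-YYYY-MM-DD` snapshot naming seen in `gpt-4o-2024-08-06`; other providers use different schemes, so the regex is an assumption to adapt.

```python
import re

# Assumes dated snapshot names end in -YYYY-MM-DD.
PINNED = re.compile(r".+-\d{4}-\d{2}-\d{2}$")


def is_pinned(model_name):
    """True if the model name carries a dated snapshot suffix."""
    return bool(PINNED.match(model_name))


def assert_pinned(model_name):
    """Raise if the model name looks like a floating alias."""
    if not is_pinned(model_name):
        raise ValueError(f"floating model alias {model_name!r}; pin a dated snapshot")
```

Calling `assert_pinned` at the start of the eval script turns an accidental alias into an immediate CI failure instead of a slowly drifting baseline.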
To keep eval runtime down: tier the suite (smoke vs. full), use path filters so most PRs only run smoke, parallelize evaluators (max_concurrency=8 is the sweet spot for our LLM rate limits), and cache pip + model assets. If your full eval takes over 10 minutes, you have either too many rows for one shard or evaluators that should be moved into the online eval sample instead of the offline gate.
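If the full suite outgrows a single runner, rows can be split across parallel CI jobs. A minimal sketch of deterministic sharding by row ID, assuming each row has a stable string ID; `shard_for`, `rows_for_shard`, and the shard count are hypothetical.

```python
import hashlib


def shard_for(row_id, num_shards):
    """Map a row ID to a shard index that is stable across machines and runs."""
    # Python's built-in hash() is salted per process, so use a real digest.
    digest = hashlib.sha256(row_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def rows_for_shard(row_ids, shard_index, num_shards):
    """Return the subset of row IDs assigned to one shard."""
    return [r for r in row_ids if shard_for(r, num_shards) == shard_index]
```

Each CI job evaluates its own shard, and a final step would need to combine per-shard scores (weighted by row count) before applying the thresholds.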
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.