Continuous Evaluation: Wiring LangSmith into Your CI/CD for Agent Releases

Run offline evals as a CI gate. GitHub Actions wiring, threshold gates, LangSmith Experiments, and how to block merges on agent regression — with real YAML.

TL;DR

If your team treats agent prompt and tool changes the same way you treat backend code — review, test, ship — but the "test" step is "the engineer ran it once in a notebook and it looked fine," you have a release process from 2023. Modern agent teams gate every merge on an offline continuous evaluation run that produces a numeric score against a held-out dataset, compares it to the baseline on main, and blocks the PR if any threshold regresses. This post is the working YAML, the working Python, and the operational lessons from running this gate against agents that handle ~280k voice and chat sessions per month on CallSphere. Setup time: half a day. Saves: roughly two production rollbacks per month.

The Problem With "Looks Fine"

Agent code is a moving target. The same git diff can pass review, deploy cleanly, and degrade quality by 8 points on factual accuracy because some interaction between a new tool argument and the existing system prompt only surfaces on certain conversation patterns. No unit test will catch that. No code review will catch that. The only thing that catches it is running the agent against a representative dataset and grading the outputs.

The argument I hear most often against wiring this into CI is "evals are expensive and slow." Both true. Both not as bad as you think once you tier the suite. Below is the architecture we run.

The Pipeline at a Glance

flowchart TD
  A[Engineer opens PR] --> B{What changed?}
  B -->|Agent prompt, tools, or model config| C[Run FULL eval suite]
  B -->|Other code| D[Run SMOKE eval suite]
  C --> E[LangSmith Experiment created]
  D --> E
  E --> F{Score vs main baseline}
  F -->|All thresholds pass| G[Mark CI green + post diff comment]
  F -->|Any threshold regressed| H[Mark CI red + block merge]
  G --> I[Reviewer approves]
  I --> J[Merge to main]
  J --> K[Re-run eval on main, write new baseline]
  K --> L[Deploy]
  L --> M[Online evals on 5% of traffic]
  M -->|Stable 24-48h| N[Promote 100%]
  M -->|Drift| O[Auto-rollback]
  style H fill:#fcc
  style G fill:#cfc
  style O fill:#fcc

Figure 1 — The eval gate is just another required check on the PR. The branch protection rule does the blocking; the workflow does the measuring.

What Gets Tested at Each Tier

We split the suite into three tiers — not because we love complexity, but because cost and signal are not linearly related.

| Tier | Rows | Runtime | When it runs | Blocks merge? |
|---|---|---|---|---|
| Smoke | 80 | ~90 s | Every PR | Yes (loose thresholds) |
| Full | 700 | ~6 min | PRs touching agent code | Yes (tight thresholds) |
| Adversarial | 220 | ~3 min | Nightly + pre-release | Notifies, does not block |

The "smoke" suite is sampled from the regression dataset (covered in the companion observability workflow piece) plus a stratified sample of golden cases. It is fast enough that even a markdown-only docs PR runs it. The full suite runs only when path filters detect changes to agent prompts, tools, evaluator code, or model config.

The GitHub Actions Workflow

Here is the working file we use. Trim to taste.

name: agent-eval-gate

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

permissions:
  contents: read
  pull-requests: write

jobs:
  detect-scope:
    runs-on: ubuntu-latest
    outputs:
      run_full: ${{ steps.filter.outputs.agent }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            agent:
              - 'agents/**'
              - 'prompts/**'
              - 'tools/**'
              - 'evaluators/**'
              - 'pyproject.toml'

  smoke-eval:
    needs: detect-scope
    runs-on: ubuntu-latest
    timeout-minutes: 8
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: pip
      - run: pip install -e '.[eval]'
      - name: Run smoke eval
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GIT_SHA: ${{ github.sha }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: python scripts/run_eval.py --suite smoke --gate

  full-eval:
    needs: detect-scope
    if: needs.detect-scope.outputs.run_full == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: pip
      - run: pip install -e '.[eval]'
      - name: Run full eval
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GIT_SHA: ${{ github.sha }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: python scripts/run_eval.py --suite full --gate

      - name: Comment results on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('eval_summary.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });

Two things to notice. First, the paths-filter step is what makes this affordable — most PRs do not touch agent code, and they pay only the smoke cost. Second, the comment step posts the experiment diff inline on the PR so reviewers do not need to leave GitHub to see whether quality moved.

The Eval Runner With Threshold Gating

The Python side is where the gating logic lives. The pattern: run evaluate(), fetch the baseline experiment from main, diff per-evaluator, exit non-zero if any threshold regresses.

import os
import sys
from langsmith import Client, evaluate
from my_agent import build_agent
from my_evaluators import EVALUATORS

THRESHOLDS = {
    "factual_match":      {"min": 0.92, "max_regression": 0.02},
    "no_hallucination":   {"min": 0.95, "max_regression": 0.01},
    "tool_call_correct":  {"min": 0.90, "max_regression": 0.03},
    "tone_appropriate":   {"min": 0.88, "max_regression": 0.05},
    "latency_ok":         {"min": 0.85, "max_regression": 0.03},
}

def predict(inputs):
    return {"output": build_agent().invoke(inputs)}

def main(suite: str, gate: bool):
    client = Client()
    dataset = "smoke-suite" if suite == "smoke" else "regression-suite"
    sha = os.environ["GIT_SHA"][:7]
    pr  = os.environ.get("PR_NUMBER") or "main"  # env var is set but empty on push-to-main runs

    exp = evaluate(
        predict,
        data=dataset,
        evaluators=EVALUATORS,
        experiment_prefix=f"pr-{pr}-{sha}",
        metadata={"sha": sha, "pr": pr, "suite": suite},
        max_concurrency=8,
    )

    df = exp.to_pandas()
    scores = {ev: df[f"feedback.{ev}"].mean() for ev in THRESHOLDS}

    # Pull baseline (latest main run for this suite)
    baseline = fetch_main_baseline(client, suite)

    failures = []
    for ev, t in THRESHOLDS.items():
        cur = scores[ev]
        base = baseline.get(ev, cur)
        if cur < t["min"]:
            failures.append(f"{ev}: {cur:.3f} below floor {t['min']}")
        if (base - cur) > t["max_regression"]:
            failures.append(
                f"{ev}: regressed {base - cur:.3f} (limit {t['max_regression']})"
            )

    write_summary_md(scores, baseline, failures, exp.experiment_url)

    if gate and failures:
        print("EVAL GATE FAILED:")
        for f in failures:
            print("  -", f)
        sys.exit(1)

if __name__ == "__main__":
    import argparse
    p = argparse.ArgumentParser()
    p.add_argument("--suite", choices=["smoke", "full"], required=True)
    p.add_argument("--gate", action="store_true")
    main(**vars(p.parse_args()))

The gate has two distinct conditions: an absolute floor (factual_match must be ≥ 0.92, period) and a relative regression limit (a one-PR drop bigger than 0.02 is suspicious even if the absolute number is still high). Both are needed. Floors stop slow drift; regression limits stop sudden cliffs.
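
The one helper the listing above calls but does not show is write_summary_md, which renders the per-evaluator diff into the eval_summary.md file the workflow's comment step posts back to the PR. A minimal sketch; the exact formatting is entirely up to you:

def write_summary_md(scores, baseline, failures, experiment_url, path="eval_summary.md"):
    """Render the per-evaluator diff as markdown for the PR comment step."""
    lines = [
        f"### Eval gate results ([experiment]({experiment_url}))",
        "",
        "| Evaluator | Main baseline | This PR | Delta |",
        "|---|---|---|---|",
    ]
    for ev, cur in scores.items():
        base = baseline.get(ev)
        base_s = f"{base:.3f}" if base is not None else "n/a"
        delta = f"{cur - base:+.3f}" if base is not None else "n/a"
        lines.append(f"| {ev} | {base_s} | {cur:.3f} | {delta} |")
    lines.append("")
    lines.append("**EVAL GATE FAILED**" if failures else "**Eval gate passed**")
    lines.extend(f"- {f}" for f in failures)
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")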

What "Compare to Main" Actually Means

LangSmith's Experiments primitive is the load-bearing piece here. Every PR run is a named experiment tagged with the SHA. Every merge to main runs the same eval and tags it as the new baseline. fetch_main_baseline is just a query for the most recent main-tagged experiment on that dataset:

def fetch_main_baseline(client: Client, suite: str) -> dict:
    runs = client.list_experiments(
        dataset_name=f"{suite}-suite",
        metadata_filter={"pr": "main"},
        limit=1,
        order_by="-created_at",
    )
    latest = next(iter(runs), None)
    if not latest:
        return {}
    return {
        ev: latest.aggregate_scores[ev]
        for ev in THRESHOLDS
        if ev in latest.aggregate_scores
    }

This pattern is what makes the gate a true "did this PR make things worse" check rather than a static pass/fail. A PR that improves factual_match from 0.93 to 0.95 should not be blocked because tone_appropriate dipped from 0.93 to 0.91 — unless the regression limit says otherwise. We tune the limits per evaluator based on observed historical noise.
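
That tuning does not have to be guesswork. One approach, sketched below under the assumption that you can pull the last ~20 main baselines as a list of per-evaluator score dicts (how you fetch them depends on your variant of fetch_main_baseline): set max_regression to roughly two standard deviations of each evaluator's recent scores, so the gate tolerates normal judge noise but trips on anything larger.

import statistics

def tune_regression_limits(history: list[dict], k: float = 2.0) -> dict:
    """history is a list of per-evaluator score dicts from recent main runs.
    Returns a max_regression per evaluator of roughly k standard deviations
    of its recent scores, falling back to the hand-set limit when there is
    not enough history to estimate noise."""
    limits = {}
    for ev, t in THRESHOLDS.items():
        series = [run[ev] for run in history if ev in run]
        if len(series) < 5:
            limits[ev] = t["max_regression"]
            continue
        limits[ev] = round(k * statistics.stdev(series), 3)
    return limits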

Branch Protection: Make It Required

The CI run is theater unless you mark it required in branch protection rules. The settings we use on the agent repo:

  • Require pull request reviews (1 approval).
  • Require status checks: smoke-eval, full-eval (when applicable), unit-tests, type-check.
  • Require branches to be up to date before merging.
  • Restrict who can push to main (humans no, bots yes).

Without "branches up to date," a PR that passed eval against an old baseline can merge into a main that has since absorbed two improvements, and the new main quietly regresses against the now-stale baseline. The "up to date" rule forces a re-eval against current main.

Operational Lessons From a Year of This

We have run this gate for about 14 months across the agents powering our healthcare, real estate, sales, salon, IT helpdesk, and after-hours verticals. Things we learned the hard way:

  1. Cache the agent build, not the eval results. The eval must re-run; the dependency install can be cached aggressively.
  2. Pin the judge model with a date stamp. gpt-4o-2024-08-06 not gpt-4o. Floating aliases poison your historical baselines.
  3. Budget the eval cost as a CI line item. Ours runs ~$340/month at our PR volume. That is one-tenth the cost of a single rollback in revenue + engineering time.
  4. Snapshot the dataset version. When you add 30 new regression rows, the baseline shifts. Tag both the dataset version and the experiment so old comparisons are still meaningful.
  5. Have a glass-break override. Rarely, you genuinely need to ship a fix that regresses one evaluator (e.g., tone got slightly more terse in exchange for fixing a factual bug). We use a labeled override that requires two approvers and posts to a Slack channel (a sketch of the runner-side check follows this list).
  6. Online evals are not a substitute for the offline gate. Online evals run on real traffic, which means by the time they fire, customers have already seen the bug. The gate prevents; online detects.
  7. Smoke must finish in under 2 minutes or developers route around it. This is a social, not technical, constraint. We trim the smoke suite quarterly to keep the wall-clock under target.
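
On lesson 5: the override we use is a PR label plus a small check in the runner, so the gate still runs and still posts its summary; it just downgrades the failure to a loud warning. A sketch; the eval-gate-override label, the PR_LABELS and SLACK_WEBHOOK_URL environment variables, and the two-approver enforcement are our conventions, not anything LangSmith or GitHub provides out of the box.

import json
import os
import urllib.request

def apply_glass_break(failures: list[str]) -> bool:
    """Downgrade a gate failure to a loud warning when the PR carries the
    override label. Who may apply the label (and the two-approver rule) is
    enforced by repo policy, not here."""
    labels = os.environ.get("PR_LABELS", "").split(",")
    if "eval-gate-override" not in labels or not failures:
        return False
    payload = {
        "text": f"Eval gate overridden on PR {os.environ.get('PR_NUMBER')}: "
                + "; ".join(failures)
    }
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
    return True

In main(), the exit check becomes: if gate and failures and not apply_glass_break(failures): sys.exit(1). The workflow passes the labels through an env line such as PR_LABELS: ${{ join(github.event.pull_request.labels.*.name, ',') }}.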

What Goes Wrong Without This

Before we wired this in, our agent regression rate was about 1.3 incidents per week at our scale, and our mean time to detect a quality regression was 38 hours (i.e., long enough that real customers noticed and complained). After:

| Metric | Before gate | After 6 months |
|---|---|---|
| Quality regressions per week | 1.3 | 0.2 |
| Mean time to detect regression | 38 h | 11 min (CI) |
| Engineer hours/wk on rollbacks | ~6 | ~1 |
| LangSmith + LLM eval cost | $0 | ~$340/mo |
| "Did this prompt change help?" arguments | Many | Zero |

The cost-to-value math is not subtle. The hardest part is the social shift: engineers must accept that an eval result is data, not opinion, even when it disagrees with their gut.

Frequently Asked Questions

How is this different from a normal test suite?

Normal tests assert exact equality on deterministic functions. Agent evals grade the outputs of non-deterministic functions against rubrics, with thresholds and regression bounds. The mental model is closer to a load test or a benchmark than a unit test — you compare distributions of results, not individual pass/fail.

Can I do this without LangSmith?

Yes, but you will rebuild Experiments, Datasets, comparison views, and online evals yourself. We have done both. Off-the-shelf tooling saves about a quarter of engineering time once you account for the maintenance burden of a homegrown eval harness.

How do I keep the dataset from going stale?

Two practices: (a) every shipped regression goes into the dataset (see the trace-to-fix workflow); (b) quarterly review where domain experts audit a sample of dataset rows and either refresh the reference outputs or retire rows that no longer reflect the product. Datasets are living artifacts.
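
Practice (a) is only a few lines once you have the run ID of the bad trace. A sketch, assuming the human-corrected reference output is supplied by whoever triaged the incident:

from langsmith import Client

client = Client()

def add_regression_row(run_id: str, corrected_output: dict, dataset: str = "regression-suite"):
    """Turn a flagged production trace into a regression-dataset row whose
    reference output is the human-corrected answer, not the bad one."""
    run = client.read_run(run_id)
    client.create_example(
        inputs=run.inputs,
        outputs=corrected_output,
        dataset_id=client.read_dataset(dataset_name=dataset).id,
        metadata={"source": "production-incident", "source_run_id": str(run_id)},
    )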

What about model upgrades — do they break the gate?

They will, intentionally. When OpenAI ships a new model, we run the full suite against it on a branch, accept the new baseline, and update thresholds if the model is genuinely better at some evaluators and slightly worse at others. The gate makes the upgrade decision visible and quantitative instead of vibes-based.

How do I prevent the gate from becoming a bottleneck?

Tier the suite (smoke vs. full), use path filters so most PRs only run smoke, parallelize evaluators (max_concurrency=8 is the sweet spot for our LLM rate limits), and cache pip + model assets. If your full eval is over 10 minutes you have either too many rows for one shard or evaluators that should be moved into the online eval sample instead of the offline gate.
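
If you do hit the single-shard limit, evaluate() accepts an iterable of examples in place of a dataset name, so splitting the full suite across a CI matrix is straightforward. A sketch, assuming SHARD_INDEX and SHARD_TOTAL environment variables set by the matrix and reusing predict and EVALUATORS from the runner above:

import os
from langsmith import Client, evaluate

client = Client()
idx = int(os.environ.get("SHARD_INDEX", 0))
total = int(os.environ.get("SHARD_TOTAL", 1))

# Deterministic round-robin split so every row lands in exactly one shard.
examples = sorted(client.list_examples(dataset_name="regression-suite"), key=lambda e: str(e.id))
shard = [ex for i, ex in enumerate(examples) if i % total == idx]

exp = evaluate(
    predict,            # same predict function and EVALUATORS as the main runner
    data=shard,
    evaluators=EVALUATORS,
    experiment_prefix=f"pr-shard-{idx}",
    max_concurrency=8,
)

A downstream job then has to merge the shard scores before applying the threshold gate; the gating logic itself does not change.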
