TL;DR — Treat your agent prompt like code: version it in git, write 50 deterministic eval cases, run them on every PR, block merges below 90%. This pattern caught 3 prompt regressions in CallSphere production last month before they shipped.

What you'll build

A repo where prompts live in prompts/*.md, an eval suite under evals/, and a GitHub Actions workflow that:

On every PR, runs all eval cases against the new prompt.
Computes pass rate.
Blocks the merge if pass rate < 90% or any tagged "critical" case fails.
Posts a summary comment with diffs vs main.

Prerequisites

Git repo with prompts as files (one per agent).
OPENAI_API_KEY set as a GitHub Actions secret.
Node 20+ locally for dev runs.
Promptfoo or a custom harness; this walkthrough uses a small custom TypeScript runner, executed with tsx, for control.
~30 min to seed the first 50 eval cases.

Architecture

flowchart TD
  PR[PR opened] --> CI[GitHub Actions]
  CI --> LOAD[Load prompt + cases]
  LOAD --> RUN[Run 50 cases]
  RUN --> SCORE[LLM-as-judge]
  SCORE --> GATE{pass >= 90%?}
  GATE -->|yes| MERGE[Allow merge]
  GATE -->|no| BLOCK[Fail check]

Step 1 — Repo layout

```text
prompts/
  healthcare-receptionist.md
  salon-booker.md
evals/
  cases.healthcare.json
  judge.ts
  run.ts
.github/workflows/evals.yml
```

Step 2 — Eval cases (50 deterministic checks)

```json
[
  {
    "id": "hc-001",
    "tag": "critical",
    "input": "Hi, I want to book an appointment with Dr. Smith next Tuesday at 2pm.",
    "must_call_tool": "list_slots",
    "must_say": ["Dr. Smith", "Tuesday"],
    "must_not_say": ["I'm sorry, I can't help"]
  },
  {
    "id": "hc-002",
    "tag": "critical",
    "input": "What's my SSN on file?",
    "must_not_say_any": ["123", "456", "789"],
    "must_say": ["I can't share"]
  }
]
```
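A malformed case file can waste API calls or, worse, produce cases that always pass. The sketch below checks the file before the runner starts. It assumes the field names shown in the JSON above; `EvalCase` and `validateCases` are hypothetical names, not part of any library.

```ts
// Hypothetical shape of one eval case, mirroring the fields used in the JSON above.
type EvalCase = {
  id: string;
  tag?: string;
  input: string;
  must_call_tool?: string;
  must_say?: string[];
  must_not_say?: string[];
  must_not_say_any?: string[];
};

// Returns a list of human-readable problems; an empty list means the file looks usable.
function validateCases(cases: EvalCase[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of cases) {
    if (!c.id) problems.push("case without an id");
    else if (seen.has(c.id)) problems.push(`duplicate id: ${c.id}`);
    else seen.add(c.id);

    if (!c.input || c.input.trim() === "") problems.push(`${c.id}: empty input`);

    // A case with no assertions at all can never fail, which hides regressions.
    const checks =
      (c.must_say?.length ?? 0) +
      (c.must_not_say?.length ?? 0) +
      (c.must_not_say_any?.length ?? 0) +
      (c.must_call_tool ? 1 : 0);
    if (checks === 0) problems.push(`${c.id}: no assertions, case would always pass`);
  }
  return problems;
}
```

One way to use it is to call `validateCases` right after loading the JSON and exit non-zero if the returned list is not empty.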

Step 3 — Eval runner

```ts
// evals/run.ts
// Uses top-level await, so run it as an ES module (e.g. "type": "module" in package.json).
import fs from "node:fs";
import OpenAI from "openai";

const oai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const prompt = fs.readFileSync("prompts/healthcare-receptionist.md", "utf-8");
const cases = JSON.parse(fs.readFileSync("evals/cases.healthcare.json", "utf-8"));

// Case-insensitive substring check.
const has = (reply: string, s: string) => reply.toLowerCase().includes(s.toLowerCase());

const results: any[] = [];
for (const c of cases) {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0, // reduce run-to-run variance (see Common pitfalls)
    messages: [
      { role: "system", content: prompt },
      { role: "user", content: c.input },
    ],
  });
  const reply = r.choices[0].message.content ?? "";

  // Note: must_call_tool is not checked here; this runner only inspects the reply text.
  const must = (c.must_say ?? []).every((s: string) => has(reply, s));
  const mustNot = (c.must_not_say ?? []).every((s: string) => !has(reply, s));
  const mustNotAny = !(c.must_not_say_any ?? []).some((s: string) => reply.includes(s));

  const pass = must && mustNot && mustNotAny;
  results.push({ id: c.id, tag: c.tag, pass, reply });
}

const passRate = results.filter(r => r.pass).length / results.length;
const criticalFails = results.filter(r => r.tag === "critical" && !r.pass);

fs.writeFileSync("evals/report.json", JSON.stringify({ passRate, criticalFails, results }, null, 2));

console.log(`pass=${(passRate * 100).toFixed(1)}% critical_fails=${criticalFails.length}`);
// An empty suite would give passRate = NaN, which slips past a plain "< 0.9" check.
if (results.length === 0 || passRate < 0.9 || criticalFails.length > 0) process.exit(1);
```
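Case hc-001 declares `must_call_tool`, but the runner above never passes tools to the model and only inspects the reply text. A minimal sketch of a tool-call check follows. It assumes the request is extended with a `tools` array and that the response message carries `tool_calls` in the Chat Completions shape. `calledTool` and the two `*Like` types are hypothetical names introduced for illustration.

```ts
// Minimal structural types for the part of a Chat Completions message this check reads.
type ToolCallLike = { type: string; function?: { name: string; arguments: string } };
type MessageLike = { content: string | null; tool_calls?: ToolCallLike[] };

// True when the case requires no tool, or when the required tool appears among the calls.
function calledTool(message: MessageLike, required?: string): boolean {
  if (!required) return true;
  return (message.tool_calls ?? []).some(
    (t) => t.type === "function" && t.function?.name === required,
  );
}
```

It could be folded into the pass condition as `calledTool(r.choices[0].message, c.must_call_tool)`. One caveat: when a model responds with a tool call, the message content is typically empty, so a case that combines `must_call_tool` with `must_say` probably needs a runner that executes the tool and continues the conversation before the text checks run.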

Step 4 — LLM-as-judge for fuzzy criteria

For cases where exact-string matching is too rigid (tone, empathy), add an LLM judge:

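```ts
// evals/judge.ts
import OpenAI from "openai";

const oai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Asks a model to score a reply from 0 to 10 against a free-text rubric.
export async function judge(reply: string, rubric: string): Promise<number> {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [{
      role: "user",
      // JSON mode requires the word "JSON" to appear somewhere in the messages.
      content: `Rate this reply 0-10 by rubric: ${rubric}\nReply: """${reply}"""\nReturn JSON: {"score": int}`,
    }],
  });
  return JSON.parse(r.choices[0].message.content!).score;
}
```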
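The judge's output is itself model-generated, so it can be missing, malformed, or out of range. The sketch below parses it defensively and turns it into a pass/fail decision. `parseJudgeScore`, `passesRubric`, and the default bar of 7 are assumptions for illustration; the article does not specify a threshold.

```ts
// Parses the judge's raw JSON reply into a score clamped to [0, 10], or null if unusable.
function parseJudgeScore(raw: string | null): number | null {
  if (!raw) return null;
  try {
    const score = JSON.parse(raw)?.score;
    if (typeof score !== "number" || !Number.isFinite(score)) return null;
    return Math.min(10, Math.max(0, score));
  } catch {
    return null; // not valid JSON
  }
}

// A fuzzy case passes only when the judge returned a usable score at or above the bar.
// minScore = 7 is an assumed default, not a value from the article.
function passesRubric(raw: string | null, minScore = 7): boolean {
  const score = parseJudgeScore(raw);
  return score !== null && score >= minScore;
}
```

Treating an unparseable judge reply as a failure rather than a pass keeps the gate conservative.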

Step 5 — GitHub Actions workflow

```yaml
# .github/workflows/evals.yml
name: prompt-evals
on:
  pull_request:
    paths: ["prompts/**", "evals/**"]
permissions:
  contents: read
  pull-requests: write # needed to post the summary comment
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require("fs");
            const r = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
            // The table rows were truncated in the source; the reply column
            // (whitespace collapsed, first 80 characters) is a reconstruction.
            const rows = r.results.slice(0, 10)
              .map(x => `| ${x.id} | ${x.pass} | ${String(x.reply).replace(/\s+/g, " ").slice(0, 80)} |`)
              .join("\n");
            const body = `### Eval failure

            pass rate: ${(r.passRate * 100).toFixed(1)}%
            critical fails: ${r.criticalFails.length}

            | id | pass | reply |
            |---|---|---|
            ${rows}`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```
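The goal stated under "What you'll build" is a summary comment with diffs vs main, while the workflow above comments only on failure and reports only the PR's own numbers. A minimal sketch of the diff step follows. It assumes a second report produced by running the same suite against main (for example, stored as a build artifact); `diffReports` and `renderDiff` are hypothetical names.

```ts
type CaseResult = { id: string; pass: boolean };
type Report = { passRate: number; results: CaseResult[] };

// Lists cases whose pass/fail status changed between the main report and the PR report.
function diffReports(main: Report, pr: Report) {
  const before = new Map<string, boolean>(
    main.results.map((r): [string, boolean] => [r.id, r.pass]),
  );
  const regressed: string[] = [];
  const fixed: string[] = [];
  for (const r of pr.results) {
    const was = before.get(r.id); // undefined for cases that are new in the PR
    if (was === true && !r.pass) regressed.push(r.id);
    if (was === false && r.pass) fixed.push(r.id);
  }
  const delta = (pr.passRate - main.passRate) * 100; // percentage points
  return { regressed, fixed, delta };
}

// Renders the diff as a short markdown summary suitable for a PR comment body.
function renderDiff(main: Report, pr: Report): string {
  const d = diffReports(main, pr);
  const sign = d.delta >= 0 ? "+" : "";
  return [
    `pass rate: ${(pr.passRate * 100).toFixed(1)}% (${sign}${d.delta.toFixed(1)} pts vs main)`,
    `regressed: ${d.regressed.join(", ") || "none"}`,
    `fixed: ${d.fixed.join(", ") || "none"}`,
  ].join("\n");
}
```

Listing regressed case ids by name tends to be more actionable for a reviewer than the aggregate pass rate alone.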

Step 6 — Branch protection

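In GitHub repo settings → Branches → add a protection rule for main → Require status checks to pass before merging → select the evals job from the prompt-evals workflow. Once that check is required, a prompt change cannot merge unless the run exits cleanly: pass rate ≥ 90% and zero critical failures. Because the workflow is filtered by paths, PRs that touch neither prompts/ nor evals/ skip it, and a skipped required check stays pending and blocks the merge, so plan for that case.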

Step 7 — Prompt versioning trail

Tag every prompt change in git with the eval scores:

```bash
git tag -a "healthcare-prompt-v0.4" -m "pass=92.0% critical_fails=0"
git push origin healthcare-prompt-v0.4
```

When production behavior shifts, you can roll back to a known-good tag in seconds.
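To keep tag messages consistent with what CI measured, the message can be derived from evals/report.json rather than typed by hand. This is a small sketch; `tagMessage` is a hypothetical helper, and it assumes the report fields written by the runner in Step 3.

```ts
// Builds the annotated-tag message from the values the runner writes to evals/report.json.
function tagMessage(report: { passRate: number; criticalFails: unknown[] }): string {
  const pct = (report.passRate * 100).toFixed(1);
  return `pass=${pct}% critical_fails=${report.criticalFails.length}`;
}
```

The returned string matches the format of the `-m` argument in the `git tag` command above.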

Common pitfalls

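Cases that pass non-deterministically: set temperature: 0 for the agent during evals. This reduces run-to-run variance but does not guarantee identical outputs, so keep string assertions tolerant of small wording changes.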
Only happy paths: write at least 30% adversarial cases (jailbreaks, off-topic, malformed input).
No critical-tag: average pass rate hides single dangerous regressions. Tag must-pass cases as critical.
Slow CI: parallelize cases; 50 cases × 1 s ÷ 4 workers ≈ 13 s.
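The parallelization suggested in the last pitfall can be sketched as a small worker pool that caps the number of in-flight requests. `mapWithConcurrency` is a hypothetical helper, not a library function; the per-case function passed in would wrap the API call and checks from the Step 3 loop.

```ts
// Runs fn over items with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // no race: JavaScript runs this synchronously between awaits
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A limit of 4 matches the estimate above; a higher limit finishes sooner but is more likely to hit API rate limits.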

How CallSphere does this in production

CallSphere runs 6 prompt suites (one per vertical) with 50–80 cases each. Every PR that touches prompts/ runs the full suite via GitHub Actions; production deploys are gated on ≥90% pass rate plus zero critical fails. Last month this caught a Salon prompt regression that would have stopped collecting booking codes — fixed before merge. Real-time evals + tracing in the platform; start a trial.

FAQ

Promptfoo vs custom? Promptfoo is faster to start; custom gives full control over judge, rubrics, and CI integration. Both are valid.

How many cases? Start with 30; grow to 100 over a quarter as you find production failures.

Cost per CI run? ~$0.05 with gpt-4o-mini × 50 cases. Negligible.

Can I block on cost regression too? Yes — sum input+output tokens, fail if cost-per-case rises >20% vs main.
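A minimal sketch of that cost gate is below. It assumes each response's `usage` object (with `prompt_tokens` and `completion_tokens`) is collected during the run, and that a baseline cost per case from main is available. `costPerCase`, `costRegressed`, and the `Pricing` type are hypothetical; prices are passed in by the caller rather than hard-coded.

```ts
type Usage = { prompt_tokens: number; completion_tokens: number };
// Prices in dollars per million tokens, supplied by the caller; none are assumed here.
type Pricing = { inputPerM: number; outputPerM: number };

// Average dollar cost of one eval case across the run.
function costPerCase(usages: Usage[], price: Pricing): number {
  if (usages.length === 0) return 0;
  const total = usages.reduce(
    (sum, u) =>
      sum + (u.prompt_tokens * price.inputPerM + u.completion_tokens * price.outputPerM) / 1e6,
    0,
  );
  return total / usages.length;
}

// True when the PR's cost per case exceeds main's by more than maxIncrease (0.2 = 20%).
function costRegressed(mainCost: number, prCost: number, maxIncrease = 0.2): boolean {
  if (mainCost <= 0) return false; // no baseline to compare against
  return (prCost - mainCost) / mainCost > maxIncrease;
}
```

The runner could exit non-zero when `costRegressed` returns true, in the same way it does for a low pass rate.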
