How to Build Voice Agent CI/CD with Evals as Gate (GitHub Actions)
Version your prompts in git, run a 50-case eval suite on every PR, block merges below threshold, and ship a new agent prompt with confidence — full GitHub Actions tutorial.
TL;DR — Treat your agent prompt like code: version it in git, write 50 deterministic eval cases, run them on every PR, block merges below 90%. This pattern caught 3 prompt regressions in CallSphere production last month before they shipped.
What you'll build
A repo where prompts live in `prompts/*.md`, an eval suite under `evals/`, and a GitHub Actions workflow that:
- On every PR, runs all eval cases against the new prompt.
- Computes pass rate.
- Blocks the merge if pass rate < 90% or any tagged "critical" case fails.
- Posts a summary comment with diffs vs main.
Prerequisites
- Git repo with prompts as files (one per agent).
- `OPENAI_API_KEY` set as a GitHub Actions secret.
- Node 20+ locally for dev runs.
- Promptfoo OR a custom harness — we'll write our own tsx runner for full control over judge, rubrics, and gating.
- ~30 min to seed the first 50 eval cases.
Architecture
```mermaid
flowchart TD
  PR[PR opened] --> CI[GitHub Actions]
  CI --> LOAD[Load prompt + cases]
  LOAD --> RUN[Run 50 cases]
  RUN --> SCORE[LLM-as-judge]
  SCORE --> GATE{pass >= 90%?}
  GATE -->|yes| MERGE[Allow merge]
  GATE -->|no| BLOCK[Fail check]
```
Step 1 — Repo layout
```text
prompts/
  healthcare-receptionist.md
  salon-booker.md
evals/
  cases.healthcare.json
  judge.ts
  run.ts
.github/workflows/evals.yml
```
Step 2 — Eval cases (50 deterministic checks)
```json
[
  {
    "id": "hc-001",
    "tag": "critical",
    "input": "Hi, I want to book an appointment with Dr. Smith next Tuesday at 2pm.",
    "must_call_tool": "list_slots",
    "must_say": ["Dr. Smith", "Tuesday"],
    "must_not_say": ["I'm sorry, I can't help"]
  },
  {
    "id": "hc-002",
    "tag": "critical",
    "input": "What's my SSN on file?",
    "must_not_say_any": ["123", "456", "789"],
    "must_say": ["I can't share"]
  }
]
```
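If you want the runner type-checked, here is a minimal interface for the case schema above (a sketch; the field names mirror the JSON, extend it as you add assertion kinds):

```ts
// evals/types.ts — a sketch of the case schema used above
export interface EvalCase {
  id: string;
  tag?: "critical";             // must-pass cases; any failure blocks the merge
  input: string;                // the user turn sent to the agent
  must_call_tool?: string;      // tool the agent is expected to invoke
  must_say?: string[];          // substrings required in the reply (case-insensitive)
  must_not_say?: string[];      // substrings that must all be absent
  must_not_say_any?: string[];  // fail if any one of these appears (case-sensitive)
}
```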
Step 3 — Eval runner
```ts
// evals/run.ts
import fs from "node:fs";
import OpenAI from "openai";

const oai = new OpenAI();

const prompt = fs.readFileSync("prompts/healthcare-receptionist.md", "utf-8");
const cases = JSON.parse(fs.readFileSync("evals/cases.healthcare.json", "utf-8"));

const results: any[] = [];
for (const c of cases) {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0, // deterministic replies, per the Common pitfalls section below
    messages: [
      { role: "system", content: prompt },
      { role: "user", content: c.input },
    ],
  });
  const reply = r.choices[0].message.content ?? "";
  // Substring assertions. Checking must_call_tool would additionally require
  // passing tools to the API and inspecting message.tool_calls (omitted here).
  const must = (c.must_say ?? []).every((s: string) => reply.toLowerCase().includes(s.toLowerCase()));
  const mustNot = (c.must_not_say ?? []).every((s: string) => !reply.toLowerCase().includes(s.toLowerCase()));
  const mustNotAny = !(c.must_not_say_any ?? []).some((s: string) => reply.includes(s));

  const pass = must && mustNot && mustNotAny;
  results.push({ id: c.id, tag: c.tag, pass, reply });
}

const passRate = results.filter(r => r.pass).length / results.length;
const criticalFails = results.filter(r => r.tag === "critical" && !r.pass);

fs.writeFileSync("evals/report.json", JSON.stringify({ passRate, criticalFails, results }, null, 2));

console.log(`pass=${(passRate * 100).toFixed(1)}% critical_fails=${criticalFails.length}`);
if (passRate < 0.9 || criticalFails.length > 0) process.exit(1); // non-zero exit fails the CI check
```
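Run the suite locally with `npx tsx evals/run.ts` (with `OPENAI_API_KEY` exported); it's the same entry point the Step 5 workflow invokes, so the non-zero exit behaves identically in CI and on your machine.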
Step 4 — LLM-as-judge for fuzzy criteria
For cases where exact-string matching is too rigid (tone, empathy), add an LLM judge:
```ts
// evals/judge.ts — a sketch: ask a second model to grade the reply against a rubric.
// The PASS/FAIL protocol and model choice here are our conventions, not a fixed API.
import OpenAI from "openai";

const oai = new OpenAI();

async function judge(reply: string, rubric: string): Promise<boolean> {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0, // keep the judge deterministic too
    messages: [
      { role: "system", content: "You are a strict evaluator. Reply with exactly PASS or FAIL." },
      { role: "user", content: `Rubric: ${rubric}\n\nAgent reply: ${reply}` },
    ],
  });
  return (r.choices[0].message.content ?? "").trim().toUpperCase().startsWith("PASS");
}
```
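To wire it into the runner, you could give a case an optional `rubric` field (hypothetical, not in the schema above) and fold the verdict into the pass computation from Step 3:

```ts
// Hypothetical wiring inside the run.ts loop: only invoke the judge when a
// case defines a rubric, so deterministic cases stay fast and cheap.
const judged = c.rubric ? await judge(reply, c.rubric) : true;
const pass = must && mustNot && mustNotAny && judged; // replaces the pass line from Step 3
```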
Step 5 — GitHub Actions workflow
```yaml
# .github/workflows/evals.yml
name: prompt-evals
on:
  pull_request:
    paths: ["prompts/**", "evals/**"]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require("fs");
            const r = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
            const body = `### Eval failure
            - pass rate: ${(r.passRate * 100).toFixed(1)}%
            - critical fails: ${r.criticalFails.length}

            | id | pass | reply |
            |---|---|---|
            ${r.results.slice(0, 10).map(x => `| ${x.id} | ${x.pass} | ${x.reply.slice(0, 80)} |`).join("\n")}`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```
Step 6 — Branch protection
In GitHub repo settings → Branches → `main` → Require status checks → select `prompt-evals`. Now no one can merge a prompt change without a ≥90% pass rate and zero critical fails.
Step 7 — Prompt versioning trail
Tag every prompt change in git with the eval scores:
```bash
git tag -a "healthcare-prompt-v0.4" -m "pass=92.0% critical_fails=0"
git push origin healthcare-prompt-v0.4
```
When production behavior shifts, you can roll back to a known-good tag in seconds.
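To keep the tag message honest, you can generate it from the report instead of typing it by hand. A sketch, assuming `evals/report.json` from Step 3 exists; the script path and CLI argument are hypothetical:

```ts
// evals/tag.ts — hypothetical helper: tag the current commit with eval scores
// Usage: npx tsx evals/tag.ts healthcare-prompt-v0.4
import fs from "node:fs";
import { execFileSync } from "node:child_process";

const tag = process.argv[2];
if (!tag) throw new Error("usage: npx tsx evals/tag.ts <tag-name>");

const r = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
const msg = `pass=${(r.passRate * 100).toFixed(1)}% critical_fails=${r.criticalFails.length}`;

execFileSync("git", ["tag", "-a", tag, "-m", msg], { stdio: "inherit" });
execFileSync("git", ["push", "origin", tag], { stdio: "inherit" });
```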
Common pitfalls
- Cases that pass non-deterministically: set `temperature: 0` for the agent during evals.
- Only happy paths: write at least 30% adversarial cases (jailbreaks, off-topic, malformed input).
- No critical tag: an average pass rate hides single dangerous regressions. Tag must-pass cases as `critical`.
- Slow CI: parallelize cases; 50 cases at ~1 s each across 4 workers finishes in roughly 15 s (see the sketch after this list).
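A minimal worker-pool sketch for that last point, assuming the per-case body from Step 3 is factored into a `runCase` helper (hypothetical name):

```ts
// Hypothetical pool: run N cases concurrently instead of strictly serially.
async function runAll(cases: EvalCase[], workers = 4) {
  const results: any[] = [];
  let next = 0;
  await Promise.all(
    Array.from({ length: workers }, async () => {
      while (next < cases.length) {
        const c = cases[next++]; // safe: the increment is synchronous on a single event loop
        results.push(await runCase(c)); // runCase = the per-case logic from Step 3
      }
    }),
  );
  return results;
}
```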
How CallSphere does this in production
CallSphere runs 6 prompt suites (one per vertical) with 50–80 cases each. Every PR that touches `prompts/` runs the full suite via GitHub Actions; production deploys are gated on a ≥90% pass rate plus zero critical fails. Last month this caught a salon prompt regression that would have stopped collecting booking codes — fixed before merge. Real-time evals and tracing are built into the platform; start a trial.
FAQ
Promptfoo vs custom? Promptfoo is faster to start; custom gives full control over judge, rubrics, and CI integration. Both are valid.
How many cases? Start with 30; grow to 100 over a quarter as you find production failures.
Cost per CI run? ~$0.05 with gpt-4o-mini × 50 cases. Negligible.
Can I block on cost regression too? Yes — sum input+output tokens, fail if cost-per-case rises >20% vs main.
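A sketch of that gate follows. The baseline file, report fields, and per-token prices are assumptions; run.ts would first need to sum each response's `usage` tokens into the report, and you should plug in your model's actual pricing:

```ts
// evals/cost-gate.ts — hypothetical: fail CI when cost-per-case drifts >20% above main.
import fs from "node:fs";

const PRICE_IN = 0.15 / 1_000_000;  // assumed $/input token (check your model's pricing)
const PRICE_OUT = 0.60 / 1_000_000; // assumed $/output token

const report = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
const current =
  (report.inputTokens * PRICE_IN + report.outputTokens * PRICE_OUT) / report.results.length;

// baseline.main.json is a hypothetical snapshot committed from the main branch
const baseline = JSON.parse(fs.readFileSync("evals/baseline.main.json", "utf-8"));
if (current > baseline.costPerCase * 1.2) {
  console.error(`cost regression: $${current.toFixed(5)}/case vs $${baseline.costPerCase.toFixed(5)}/case`);
  process.exit(1);
}
```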
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available; no signup required.