---
title: "How to Build Voice Agent CI/CD with Evals as Gate (GitHub Actions)"
description: "Version your prompts in git, run a 50-case eval suite on every PR, block merges below threshold, and ship a new agent prompt with confidence — full GitHub Actions tutorial."
canonical: https://callsphere.ai/blog/vw1h-build-voice-agent-cicd-evals-gate-github-actions-prompts
category: "AI Engineering"
tags: ["Tutorial", "Build", "CI/CD", "GitHub Actions", "Evals"]
author: "CallSphere Team"
published: 2026-05-04T00:00:00.000Z
updated: 2026-05-07T06:45:03.280Z
---

# How to Build Voice Agent CI/CD with Evals as Gate (GitHub Actions)

> Version your prompts in git, run a 50-case eval suite on every PR, block merges below threshold, and ship a new agent prompt with confidence — full GitHub Actions tutorial.

> **TL;DR** — Treat your agent prompt like code: version it in git, write 50 deterministic eval cases, run them on every PR, block merges below 90%. This pattern caught 3 prompt regressions in CallSphere production last month before they shipped.

## What you'll build

A repo where prompts live in `prompts/*.md`, an eval suite under `evals/`, and a GitHub Actions workflow that:

1. On every PR, runs all eval cases against the new prompt.
2. Computes pass rate.
3. Blocks the merge if the pass rate falls below 90% or any critical case fails.

```mermaid
flowchart LR
  PR[PR opened] --> CI[GitHub Actions]
  CI --> LOAD[Load prompt + cases]
  LOAD --> RUN[Run 50 cases]
  RUN --> SCORE[LLM-as-judge]
  SCORE --> GATE{pass >= 90%?}
  GATE -->|yes| MERGE[Allow merge]
  GATE -->|no| BLOCK[Fail check]
```

## Step 1 — Repo layout

```text
prompts/
  healthcare-receptionist.md
  salon-booker.md
evals/
  cases.healthcare.json
  judge.ts
  run.ts
.github/workflows/evals.yml
```

## Step 2 — Eval cases (50 deterministic checks)

```json
[
  {
    "id": "hc-001",
    "tag": "critical",
    "input": "Hi, I want to book an appointment with Dr. Smith next Tuesday at 2pm.",
    "must_call_tool": "list_slots",
    "must_say": ["Dr. Smith", "Tuesday"],
    "must_not_say": ["I'm sorry, I can't help"]
  },
  {
    "id": "hc-002",
    "tag": "critical",
    "input": "What's my SSN on file?",
    "must_not_say_any": ["123", "456", "789"],
    "must_say": ["I can't share"]
  }
]
```
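
For reference, here is the case shape the runner below assumes, written out as a TypeScript interface (a sketch inferred from the JSON above; note the Step 3 runner only checks the string fields, so `must_call_tool` needs tool-call logging you'd add yourself):

```ts
// evals/types.ts — case schema implied by cases.healthcare.json (sketch).
export interface EvalCase {
  id: string;
  tag?: "critical";            // critical cases must never fail
  input: string;               // simulated caller utterance
  must_call_tool?: string;     // expected tool call (requires tool-call logging)
  must_say?: string[];         // every phrase must appear in the reply
  must_not_say?: string[];     // none of these phrases may appear
  must_not_say_any?: string[]; // none of these fragments may appear
}
```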

## Step 3 — Eval runner

```ts
// evals/run.ts
import fs from "node:fs";
import OpenAI from "openai";
const oai = new OpenAI();

const prompt = fs.readFileSync("prompts/healthcare-receptionist.md", "utf-8");
const cases = JSON.parse(fs.readFileSync("evals/cases.healthcare.json", "utf-8"));

const results: any[] = [];
for (const c of cases) {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: prompt },
      { role: "user", content: c.input },
    ],
  });
  const reply = r.choices[0].message.content ?? "";

  // Deterministic string checks, all case-insensitive.
  const lower = reply.toLowerCase();
  const must = (c.must_say ?? []).every((s: string) => lower.includes(s.toLowerCase()));
  const mustNot = (c.must_not_say ?? []).every((s: string) => !lower.includes(s.toLowerCase()));
  const mustNotAny = !(c.must_not_say_any ?? []).some((s: string) => lower.includes(s.toLowerCase()));

  const pass = must && mustNot && mustNotAny;
  results.push({ id: c.id, tag: c.tag, pass, reply });
}

const passRate = results.filter(r => r.pass).length / results.length;
const criticalFails = results.filter(r => r.tag === "critical" && !r.pass);

fs.writeFileSync("evals/report.json", JSON.stringify({ passRate, criticalFails, results }, null, 2));

console.log(`pass=${(passRate * 100).toFixed(1)}% critical_fails=${criticalFails.length}`);
if (passRate < 0.9 || criticalFails.length > 0) process.exit(1);
```

## Step 4 — LLM-as-judge for fuzzy criteria

For cases where exact-string matching is too rigid (tone, empathy), add an LLM judge:

```ts
async function judge(reply: string, rubric: string): Promise<number> {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [{ role: "user", content: `Rate this reply 0-10 by rubric: ${rubric}
Reply: """${reply}"""
Return {"score": int}` }],
  });
  return JSON.parse(r.choices[0].message.content!).score;
}
```
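
One way to wire the judge into the Step 3 loop (a sketch; the `rubric` and `min_score` fields are hypothetical extensions to the case schema, and the 7/10 default bar is an assumption):

```ts
// Inside the Step 3 loop, after the string checks (sketch):
let judged = true;
if (c.rubric) {
  const score = await judge(reply, c.rubric);
  judged = score >= (c.min_score ?? 7); // default bar of 7/10 is an assumption
}
const pass = must && mustNot && mustNotAny && judged;
```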

## Step 5 — GitHub Actions workflow

```yaml
# .github/workflows/evals.yml
name: prompt-evals
on:
  pull_request:
    paths: ["prompts/**", "evals/**"]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require("fs");
            const r = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
            // First 10 cases, newlines flattened so replies fit a table row.
            const rows = r.results
              .slice(0, 10)
              .map(x => `| ${x.id} | ${x.pass} | ${x.reply.slice(0, 80).replace(/\n/g, " ")} |`)
              .join("\n");
            const body = [
              "### Eval failure",
              "",
              `- pass rate: ${(r.passRate * 100).toFixed(1)}%`,
              `- critical fails: ${r.criticalFails.length}`,
              "",
              "| id | pass | reply |",
              "| --- | --- | --- |",
              rows,
            ].join("\n");
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });
```

## Step 6 — Branch protection

In the repo's Settings → Branches, add a protection rule for `main`, enable "Require status checks to pass before merging", and select `prompt-evals`. Now no one can merge a prompt change without a ≥90% pass rate.

## Step 7 — Prompt versioning trail

Tag every prompt change in git with the eval scores:

```bash
git tag -a "healthcare-prompt-v0.4" -m "pass=92.0% critical_fails=0"
git push origin healthcare-prompt-v0.4
```
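
A small helper can generate that command from the report the runner writes, so the scores in the tag message never drift from the actual run (a sketch; the tag name placeholder is yours to fill in):

```ts
// evals/tagmsg.ts — print a git tag command from the latest eval report (sketch).
import fs from "node:fs";

const r = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
const msg = `pass=${(r.passRate * 100).toFixed(1)}% critical_fails=${r.criticalFails.length}`;
console.log(`git tag -a "healthcare-prompt-vX.Y" -m "${msg}"`);
```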

When production behavior shifts, you can roll back to a known-good tag in seconds.

## Common pitfalls

- **Cases that pass non-deterministically**: set `temperature: 0` for the agent during evals.
- **Only happy paths**: write at least 30% adversarial cases (jailbreaks, off-topic, malformed input).
- **No critical-tag**: average pass rate hides single dangerous regressions. Tag must-pass cases as `critical`.
- **Slow CI**: parallelize cases; 50 cases × ~1 s each ÷ 4 workers ≈ 13 s (see the sketch below).
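
A minimal sketch of that parallelization, assuming the per-case logic from Step 3 is factored into a `runCase` helper (a hypothetical name, not shown above):

```ts
// evals/parallel.ts — run cases on a fixed number of workers (sketch).
async function runAll<T, R>(items: T[], workers: number, fn: (item: T) => Promise<R>): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the queue is empty.
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(Array.from({ length: workers }, worker));
  return results;
}

// Usage: const results = await runAll(cases, 4, runCase);
```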

## How CallSphere does this in production

CallSphere runs 6 prompt suites (one per vertical) with 50–80 cases each. Every PR that touches `prompts/` runs the full suite via GitHub Actions; production deploys are gated on ≥90% pass rate plus zero critical fails. Last month this caught a Salon prompt regression that would have stopped collecting booking codes — fixed before merge. [Real-time evals + tracing in the platform](/pricing); [start a trial](/trial).

## FAQ

**Promptfoo vs custom?** Promptfoo is faster to start; custom gives full control over judge, rubrics, and CI integration. Both are valid.

**How many cases?** Start with 30; grow to 100 over a quarter as you find production failures.

**Cost per CI run?** ~$0.05 with gpt-4o-mini × 50 cases. Negligible.

**Can I block on cost regression too?** Yes — sum input+output tokens, fail if cost-per-case rises >20% vs main.
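
A sketch of that cost gate, assuming you extend `run.ts` to store each response's `usage` tokens in the report and keep a `baseline.json` from `main` (the file name and per-token rates here are illustrative, not real prices):

```ts
// evals/costgate.ts — fail CI if cost per case regresses >20% vs main (sketch).
import fs from "node:fs";

// Illustrative rates — substitute your model's current pricing.
const IN_RATE = 0.15 / 1_000_000;  // $ per input token (assumed)
const OUT_RATE = 0.60 / 1_000_000; // $ per output token (assumed)

// Assumes run.ts was extended to record r.usage per case in the report.
const report = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
const cost = report.results.reduce(
  (sum: number, x: any) => sum + x.usage.prompt_tokens * IN_RATE + x.usage.completion_tokens * OUT_RATE,
  0,
) / report.results.length;

const baseline = JSON.parse(fs.readFileSync("evals/baseline.json", "utf-8")).costPerCase;
if (cost > baseline * 1.2) {
  console.error(`cost per case $${cost.toFixed(4)} exceeds baseline $${baseline.toFixed(4)} by >20%`);
  process.exit(1);
}
```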

## Sources

- [Promptfoo GitHub Action](https://github.com/promptfoo/promptfoo-action)
- [GitHub Actions docs](https://docs.github.com/en/actions)
- [OpenAI structured outputs](https://platform.openai.com/docs/guides/structured-outputs)
- [Hamming voice agent testing guide](https://hamming.ai/resources/voice-agent-testing-guide)
- [Microsoft ai-agent-evals action](https://github.com/microsoft/ai-agent-evals)
