By Sagar Shankaran, Founder of CallSphere
Version your prompts in git, run a 50-case eval suite on every PR, block merges below threshold, and ship a new agent prompt with confidence — full GitHub Actions tutorial.
Key takeaways
TL;DR — Treat your agent prompt like code: version it in git, write 50 deterministic eval cases, run them on every PR, block merges below 90%. This pattern caught 3 prompt regressions in CallSphere production last month before they shipped.
A repo where prompts live in prompts/*.md, an eval suite under evals/, and a GitHub Actions workflow that:
OPENAI_API_KEY set as a GitHub Actions secret.
flowchart TD
PR[PR opened] --> CI[GitHub Actions]
CI --> LOAD[Load prompt + cases]
LOAD --> RUN[Run 50 cases]
RUN --> SCORE[LLM-as-judge]
SCORE --> GATE{pass >= 90%?}
GATE -->|yes| MERGE[Allow merge]
GATE -->|no| BLOCK[Fail check]
```text
prompts/
  healthcare-receptionist.md
  salon-booker.md
evals/
  cases.healthcare.json
  judge.ts
  run.ts
.github/workflows/evals.yml
```
```json
[
  {
    "id": "hc-001",
    "tag": "critical",
    "input": "Hi, I want to book an appointment with Dr. Smith next Tuesday at 2pm.",
    "must_call_tool": "list_slots",
    "must_say": ["Dr. Smith", "Tuesday"],
    "must_not_say": ["I'm sorry, I can't help"]
  },
  {
    "id": "hc-002",
    "tag": "critical",
    "input": "What's my SSN on file?",
    "must_not_say_any": ["123", "456", "789"],
    "must_say": ["I can't share"]
  }
]
```
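A malformed case file is easy to miss until CI fails for the wrong reason, so it can help to validate the cases before spending any API calls. The following is a minimal sketch, assuming the fields shown in the sample above are the whole schema; `EvalCase`, `validateCases`, and `parseCases` are hypothetical names, not part of the original repo.

```ts
// Hypothetical type mirroring the fields used in cases.healthcare.json.
type EvalCase = {
  id: string;
  tag?: string;
  input: string;
  must_call_tool?: string;
  must_say?: string[];
  must_not_say?: string[];
  must_not_say_any?: string[];
};

// Returns human-readable problems; an empty list means the file is usable.
function validateCases(raw: unknown): string[] {
  if (!Array.isArray(raw)) return ["cases file must be a JSON array"];
  const problems: string[] = [];
  const seen = new Set<string>();
  raw.forEach((c, i) => {
    const where = `case[${i}]`;
    if (typeof c !== "object" || c === null) {
      problems.push(`${where}: not an object`);
      return;
    }
    const rec = c as Record<string, unknown>;
    if (typeof rec.id !== "string" || rec.id === "") {
      problems.push(`${where}: missing id`);
    } else if (seen.has(rec.id)) {
      problems.push(`${where}: duplicate id ${rec.id}`);
    } else {
      seen.add(rec.id);
    }
    if (typeof rec.input !== "string" || rec.input === "") {
      problems.push(`${where}: missing input`);
    }
    // The three phrase lists are optional, but must be string arrays when present.
    for (const key of ["must_say", "must_not_say", "must_not_say_any"]) {
      const v = rec[key];
      const ok = v === undefined || (Array.isArray(v) && v.every((s) => typeof s === "string"));
      if (!ok) problems.push(`${where}: ${key} must be an array of strings`);
    }
  });
  return problems;
}

// Throws with all problems at once so a bad file fails fast in CI.
function parseCases(raw: unknown): EvalCase[] {
  const problems = validateCases(raw);
  if (problems.length > 0) throw new Error(problems.join("\n"));
  return raw as EvalCase[];
}
```

In the runner, `parseCases(JSON.parse(...))` could replace the bare `JSON.parse` call so that a duplicate id or a missing input stops the run before any model request is made.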
```ts
// evals/run.ts
// Uses top-level await, so run it in an ESM context (e.g. "type": "module" in package.json).
import fs from "node:fs";
import OpenAI from "openai";

const oai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const prompt = fs.readFileSync("prompts/healthcare-receptionist.md", "utf-8");
const cases = JSON.parse(fs.readFileSync("evals/cases.healthcare.json", "utf-8"));

const results: any[] = [];
for (const c of cases) {
  const r = await oai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0, // keep eval runs as repeatable as possible
    messages: [
      { role: "system", content: prompt },
      { role: "user", content: c.input },
    ],
  });
  const reply = r.choices[0].message.content ?? "";
  const lower = reply.toLowerCase();

  // Every must_say phrase has to appear (case-insensitive).
  const must = (c.must_say ?? []).every((s: string) => lower.includes(s.toLowerCase()));
  // No must_not_say phrase may appear (case-insensitive).
  const mustNot = (c.must_not_say ?? []).every((s: string) => !lower.includes(s.toLowerCase()));
  // No must_not_say_any fragment may appear (exact match).
  const mustNotAny = !(c.must_not_say_any ?? []).some((s: string) => reply.includes(s));
  // Note: must_call_tool is not checked here, because this request passes no tool definitions.

  const pass = must && mustNot && mustNotAny;
  results.push({ id: c.id, tag: c.tag, pass, reply });
}

const passRate = results.filter((r) => r.pass).length / results.length;
const criticalFails = results.filter((r) => r.tag === "critical" && !r.pass);

fs.writeFileSync("evals/report.json", JSON.stringify({ passRate, criticalFails, results }, null, 2));

console.log(`pass=${(passRate * 100).toFixed(1)}% critical_fails=${criticalFails.length}`);
// Gate: fail the CI step below 90% or on any critical failure.
if (passRate < 0.9 || criticalFails.length > 0) process.exit(1);
```
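The sample cases carry a `must_call_tool` field, but the runner only inspects the reply text. Here is a minimal sketch of how that field could be checked, assuming the runner is extended to pass tool definitions in the request (not shown in the original) so the model can emit tool calls. `calledTool` and the two `...Like` types are hypothetical names; the `tool_calls[].function.name` shape follows the Chat Completions response format.

```ts
// Minimal structural view of a Chat Completions assistant message.
type ToolCallLike = { type?: string; function?: { name: string; arguments: string } };
type AssistantMessageLike = { content?: string | null; tool_calls?: ToolCallLike[] | null };

// True when the case needs no tool call, or when the expected tool was called.
function calledTool(message: AssistantMessageLike, expected?: string): boolean {
  if (!expected) return true;
  return (message.tool_calls ?? []).some((tc) => tc.function?.name === expected);
}

// Example with a hand-written message shaped like an API response.
const sample: AssistantMessageLike = {
  content: null,
  tool_calls: [{ type: "function", function: { name: "list_slots", arguments: "{}" } }],
};
console.log(calledTool(sample, "list_slots")); // true
console.log(calledTool(sample, "book_slot")); // false
```

With this in place, the pass condition in the loop could become `must && mustNot && mustNotAny && calledTool(r.choices[0].message, c.must_call_tool)`. When the model answers with a tool call, `content` is usually empty, so cases that combine `must_call_tool` with `must_say` may need a follow-up turn before the text checks make sense.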
For cases where exact-string matching is too rigid (tone, empathy), add an LLM judge:
```ts
async function judge(reply: string, rubric: string): Promise
```
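Only the start of the `judge` signature is shown above, so the following is a sketch of one way such a judge could work rather than the original implementation. It assumes a pass/fail verdict (the original return type is not visible) and takes the model call as a parameter so the logic can be tested without network access. `judgeReply`, `buildJudgeMessages`, `parseVerdict`, and `Complete` are hypothetical names.

```ts
type ChatMessage = { role: "system" | "user"; content: string };
// Hypothetical seam: any function that sends messages to a model and returns its text.
type Complete = (messages: ChatMessage[]) => Promise<string>;

// Builds a judge prompt that asks for PASS or FAIL as the first word.
function buildJudgeMessages(reply: string, rubric: string): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "You grade an assistant reply against a rubric. " +
        "Answer with PASS or FAIL as the first word, then one short reason.",
    },
    { role: "user", content: `Rubric:\n${rubric}\n\nReply:\n${reply}` },
  ];
}

// Accepts only an unambiguous leading PASS; anything else counts as a failure.
function parseVerdict(text: string): boolean {
  return /^\s*PASS\b/i.test(text);
}

async function judgeReply(reply: string, rubric: string, complete: Complete): Promise<boolean> {
  const text = await complete(buildJudgeMessages(reply, rubric));
  return parseVerdict(text);
}

// Possible wiring with the openai client from run.ts (assumption, left commented out):
// const complete: Complete = async (messages) =>
//   (await oai.chat.completions.create({ model: "gpt-4o-mini", temperature: 0, messages }))
//     .choices[0].message.content ?? "";
```

Treating any unparseable verdict as a failure is a deliberate choice here: a judge that rambles or refuses should block the merge, not silently pass it.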
```yaml
name: prompt-evals
on:
  pull_request:
    paths: ["prompts/**", "evals/**"]
permissions:
  contents: read
  pull-requests: write # lets the failure step comment on the PR
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx evals/run.ts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require("fs");
            const r = JSON.parse(fs.readFileSync("evals/report.json", "utf-8"));
            // Assumption: show a one-line, trimmed reply so each result stays on a single table row.
            const rows = r.results
              .slice(0, 10)
              .map(x => `| ${x.id} | ${x.pass} | ${String(x.reply).replace(/[\s|]+/g, " ").slice(0, 120)} |`)
              .join("\n");
            const body = `### Eval failure\n\n| id | pass | reply |\n|---|---|---|\n${rows}`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```
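In the GitHub repo settings, go to Branches → main → Require status checks and select the check from the prompt-evals workflow (it is listed by its job name, evals). From then on, no one can merge a prompt change unless the suite passes at ≥90% with zero critical fails. Because the workflow is path-filtered, a required check stays pending on PRs that touch neither prompts/ nor evals/, so account for that before making it required.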
Tag every prompt change in git with the eval scores:
```bash
git tag -a "healthcare-prompt-v0.4" -m "pass=92.0% critical_fails=0"
git push origin healthcare-prompt-v0.4
```
When production behavior shifts, you can roll back to a known-good tag in seconds.
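To keep tag annotations consistent with what the runner prints, the message can be generated from the report values instead of typed by hand, and parsed back later when comparing tags. A small sketch, assuming the `pass=…% critical_fails=…` format used above; `formatTagMessage` and `parseTagMessage` are hypothetical helper names.

```ts
type TagScores = { passRate: number; criticalFails: number };

// Produces the same string the runner logs, e.g. for `git tag -a ... -m "<message>"`.
function formatTagMessage(scores: TagScores): string {
  return `pass=${(scores.passRate * 100).toFixed(1)}% critical_fails=${scores.criticalFails}`;
}

// Reads the scores back from a tag annotation; returns null if the format does not match.
function parseTagMessage(message: string): TagScores | null {
  const m = /pass=(\d+(?:\.\d+)?)% critical_fails=(\d+)/.exec(message);
  if (!m) return null;
  return { passRate: parseFloat(m[1]) / 100, criticalFails: parseInt(m[2], 10) };
}

console.log(formatTagMessage({ passRate: 0.92, criticalFails: 0 }));
```

Parsing the annotation back makes it possible to script a rollback policy, for example picking the most recent tag whose recorded pass rate clears the threshold.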
temperature: 0 for the agent during evals.
critical.
CallSphere runs 6 prompt suites (one per vertical) with 50–80 cases each. Every PR that touches prompts/ runs the full suite via GitHub Actions; production deploys are gated on a ≥90% pass rate plus zero critical fails. Last month this caught a Salon prompt regression that would have stopped collecting booking codes; it was fixed before merge. Real-time evals and tracing are available in the platform; start a trial.
Promptfoo vs custom? Promptfoo is faster to start; custom gives full control over judge, rubrics, and CI integration. Both are valid.
How many cases? Start with 30; grow to 100 over a quarter as you find production failures.
Cost per CI run? ~$0.05 with gpt-4o-mini × 50 cases. Negligible.
Can I block on cost regression too? Yes — sum input+output tokens, fail if cost-per-case rises >20% vs main.
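The cost gate described in the last answer can be sketched as two small functions. This assumes the runner records the `usage` object from each Chat Completions response (`prompt_tokens` and `completion_tokens`) and that the baseline cost per case on main is stored somewhere the job can read. `costPerCase`, `costRegressed`, and the `Pricing` shape are hypothetical, and the prices in the example are placeholders, not real rates.

```ts
type Usage = { prompt_tokens: number; completion_tokens: number };
// USD per million tokens; supply your model's actual rates.
type Pricing = { inputPerMTok: number; outputPerMTok: number };

// Average dollar cost of one eval case across a run.
function costPerCase(usages: Usage[], pricing: Pricing): number {
  if (usages.length === 0) return 0;
  const input = usages.reduce((sum, u) => sum + u.prompt_tokens, 0);
  const output = usages.reduce((sum, u) => sum + u.completion_tokens, 0);
  const total = (input * pricing.inputPerMTok + output * pricing.outputPerMTok) / 1_000_000;
  return total / usages.length;
}

// True when the current cost exceeds the baseline by more than maxIncrease (20% by default).
function costRegressed(current: number, baseline: number, maxIncrease = 0.2): boolean {
  if (baseline <= 0) return current > 0; // no usable baseline: any cost counts as a rise
  return current > baseline * (1 + maxIncrease);
}

// Example with placeholder prices and token counts.
const exampleCost = costPerCase(
  [
    { prompt_tokens: 1000, completion_tokens: 500 },
    { prompt_tokens: 3000, completion_tokens: 1500 },
  ],
  { inputPerMTok: 1, outputPerMTok: 2 },
);
console.log(costRegressed(exampleCost, exampleCost)); // false: no change from baseline
```

A longer system prompt raises input tokens on every case, so this check tends to catch prompt bloat even when the pass rate is unchanged.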
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.