Skip to content
AI Engineering
AI Engineering9 min read0 views

Agent Evals as CI Gates: How PromptFoo Became the Merge Blocker

OpenAI acquired PromptFoo for $86M in March 2026. Treating agent evals as merge-blocking CI gates is the new production baseline.

On March 9, 2026 OpenAI acquired PromptFoo for $86 million. The acquisition signals what production agent teams already learned: evals are not nice-to-have. They are the merge gate.

What changed

PromptFoo hit 10,800 GitHub stars by Q1 2026 as the leading open-source LLM eval and red-teaming CLI. The OpenAI acquisition closed in March 2026 at $86M; OpenAI plans to integrate PromptFoo into the Agents SDK as the official CI gate for OpenAI-built agents.

The pattern that drove the acquisition: every serious production agent team in 2026 runs evals on every PR. Not periodic batch evals. Per-PR evals that block merge if the score drops below a threshold.

Three specific patterns matured:

  1. Threshold gates. A GitHub Action runs PromptFoo against the PR's prompt changes; if the success rate drops below 85% (configurable), the PR cannot merge.
  2. Red-team gates. Adversarial prompts injected into the eval suite; if the model leaks PII or follows an injection, the PR fails.
  3. Cost gates. Token spend per task tracked in the eval suite; PRs that increase cost-per-task by more than 10% require manual approval.

Why it matters for production agent teams

Agent quality regresses silently. A prompt tweak that improves one user journey can break two others. Without merge-blocking evals, regressions ship and surface as customer complaints days or weeks later.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Three concrete benefits of CI-gated evals:

Faster iteration. Engineers can ship prompt changes confidently because the eval gate catches regressions before they reach production.

Cross-model migration. When you migrate from Sonnet 4.5 to Sonnet 4.6 (or to Opus 4.7), the same eval suite tells you in 30 minutes whether the migration is safe.

Compliance proof. Regulated industries need an audit trail showing every model change was tested. Eval CI logs are that trail.

How CallSphere applies this

CallSphere ships agent changes through a 4-stage gate:

  1. Unit eval (PR-time): PromptFoo runs ~200 representative tasks per agent on every PR. Required pass rate: 85%.
  2. Red-team eval (PR-time): ~50 adversarial prompts (injection, jailbreak, PII leak attempts). Required pass rate: 100%.
  3. Cost eval (PR-time): Average tokens-per-task tracked. Increase >10% requires manual approval.
  4. Canary deploy (post-merge): 5% traffic for 24 hours, monitored against production baselines.

For our 37 agents, this means our nightly eval cost is non-trivial — but the cost of a regression in a HIPAA-regulated voice agent is much higher. Our IT Helpdesk U Rack IT and behavioral health verticals require the highest eval thresholds; our consumer real estate triage agent has a lower bar.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

We also run weekly red-team campaigns that inject novel attacks into the eval suite. The campaigns find failures the static eval cannot.

Migration / build steps

  1. Build a representative eval set. Sample 100-500 production conversations per agent. Anonymize and freeze as your eval baseline.
  2. Wire PromptFoo into GitHub Actions. The official action takes 20 minutes to set up.
  3. Set thresholds conservatively. Start at 80% pass rate; tighten over time as your eval suite stabilizes.
  4. Add red-team evals. Inject prompt injection, PII extraction, and policy-violation prompts. These should never pass.
  5. Track eval cost. Eval runs are not free. Most teams budget 1-3% of model spend for evals.
graph LR
    A[PR Opened] --> B[PromptFoo CI Run]
    B --> C{Pass Rate >= 85%?}
    C -->|no| D[Block Merge]
    C -->|yes| E{Red-Team Pass?}
    E -->|no| D
    E -->|yes| F{Cost OK?}
    F -->|no| G[Manual Review]
    F -->|yes| H[Approve Merge]
    H --> I[Canary Deploy 5%]

FAQ

How big should the eval set be? 100-500 tasks per agent. Larger sets are slower; smaller sets miss regressions. CallSphere runs ~200 per agent.

What is the right pass rate threshold? Domain-dependent. Voice agents in regulated industries: 95%+. Consumer-facing chat: 85%+. Internal ops: 75%+.

Doesn't PromptFoo's OpenAI acquisition raise lock-in concerns? PromptFoo is open source and OpenAI committed to keeping the OSS core. Watch the license over the next 12 months but the immediate risk is low.

What about Braintrust as an alternative? Excellent for teams that want a hosted product with a UI. Braintrust's CI gates work the same way; the choice is hosted vs self-hosted.

Where can I see CallSphere's eval coverage? Our enterprise plans on the pricing page include eval suites maintained by our team. The 22% affiliate program helps you launch with our standard evals.

Sources

## Agent Evals as CI Gates: How PromptFoo Became the Merge Blocker: production view Agent Evals as CI Gates: How PromptFoo Became the Merge Blocker is also a cost-per-conversation problem hiding in plain sight. Once you instrument tokens-in, tokens-out, tool calls, ASR seconds, and TTS seconds against booked-revenue per call, the right tradeoff between Realtime API and an async ASR + LLM + TTS pipeline becomes obvious — and it's almost never the same answer for healthcare as it is for salons. ## Shipping the agent to production Production AI agents live or die on three loops: evals, retries, and handoff state. CallSphere runs **37 agents** across 6 verticals, each with its own eval suite — synthetic call transcripts replayed nightly with assertion checks on extracted entities (date, time, party size, insurance, address). Without that loop, prompt regressions ship silently and you only find out when bookings drop. Structured tools beat free-form text every time. Our **90+ function tools** all enforce JSON schemas validated server-side; if the model hallucinates an integer where a string is required, we retry with a corrective system message before falling back to a deterministic path. For long-running flows, we treat agent handoffs as a state machine — booking → confirmation → SMS — so context survives turn boundaries. The Realtime API vs. async decision usually comes down to "is the user holding the phone right now?" If yes, Realtime; if no (callback queue, after-hours voicemail), async wins on cost-per-conversation, which we track per agent in **115+ database tables** spanning all 6 verticals. ## FAQ **How does this apply to a CallSphere pilot specifically?** Setup runs 3–5 business days, the trial is 14 days with no credit card, and pricing tiers are $149, $499, and $1,499 — so a vertical-specific pilot is a same-week decision, not a quarterly project. For a topic like "Agent Evals as CI Gates: How PromptFoo Became the Merge Blocker", that means you're not starting from scratch — you're configuring an agent template that's already been hardened across thousands of conversations. **What does the typical first-week implementation look like?** Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Day two through five is shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar. **Where does this break down at scale?** The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest — observability, retries, multi-region routing — without your team owning the GPU layer. ## Talk to us Want to see how this maps to your stack? Book a live walkthrough at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting), or try the vertical-specific demo at [escalation.callsphere.tech](https://escalation.callsphere.tech). 14-day trial, no credit card, pilot live in 3–5 business days.
Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.