By Sagar Shankaran, Founder of CallSphere
OpenAI acquired PromptFoo for $86M in March 2026. Treating agent evals as merge-blocking CI gates is the new production baseline.
Key takeaways
On March 9, 2026 OpenAI acquired PromptFoo for $86 million. The acquisition signals what production agent teams already learned: evals are not nice-to-have. They are the merge gate.
PromptFoo hit 10,800 GitHub stars by Q1 2026 as the leading open-source LLM eval and red-teaming CLI. The OpenAI acquisition closed in March 2026 at $86M; OpenAI plans to integrate PromptFoo into the Agents SDK as the official CI gate for OpenAI-built agents.
The pattern that drove the acquisition: every serious production agent team in 2026 runs evals on every PR. Not periodic batch evals. Per-PR evals that block merge if the score drops below a threshold.
Three specific patterns matured:
Agent quality regresses silently. A prompt tweak that improves one user journey can break two others. Without merge-blocking evals, regressions ship and surface as customer complaints days or weeks later.
Three concrete benefits of CI-gated evals:
Faster iteration. Engineers can ship prompt changes confidently because the eval gate catches regressions before they reach production.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Cross-model migration. When you migrate from Sonnet 4.5 to Sonnet 4.6 (or to Opus 4.7), the same eval suite tells you in 30 minutes whether the migration is safe.
Compliance proof. Regulated industries need an audit trail showing every model change was tested. Eval CI logs are that trail.
CallSphere ships agent changes through a 4-stage gate:
For our 37 agents, this means our nightly eval cost is non-trivial — but the cost of a regression in a HIPAA-regulated voice agent is much higher. Our IT Helpdesk U Rack IT and behavioral health verticals require the highest eval thresholds; our consumer real estate triage agent has a lower bar.
We also run weekly red-team campaigns that inject novel attacks into the eval suite. The campaigns find failures the static eval cannot.
graph LR
A[PR Opened] --> B[PromptFoo CI Run]
B --> C{Pass Rate >= 85%?}
C -->|no| D[Block Merge]
C -->|yes| E{Red-Team Pass?}
E -->|no| D
E -->|yes| F{Cost OK?}
F -->|no| G[Manual Review]
F -->|yes| H[Approve Merge]
H --> I[Canary Deploy 5%]
How big should the eval set be? 100-500 tasks per agent. Larger sets are slower; smaller sets miss regressions. CallSphere runs ~200 per agent.
What is the right pass rate threshold? Domain-dependent. Voice agents in regulated industries: 95%+. Consumer-facing chat: 85%+. Internal ops: 75%+.
Doesn't PromptFoo's OpenAI acquisition raise lock-in concerns? PromptFoo is open source and OpenAI committed to keeping the OSS core. Watch the license over the next 12 months but the immediate risk is low.
What about Braintrust as an alternative? Excellent for teams that want a hosted product with a UI. Braintrust's CI gates work the same way; the choice is hosted vs self-hosted.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Where can I see CallSphere's eval coverage? Our enterprise plans on the pricing page include eval suites maintained by our team. The 22% affiliate program helps you launch with our standard evals.
Agent Evals as CI Gates: How PromptFoo Became the Merge Blocker is also a cost-per-conversation problem hiding in plain sight. Once you instrument tokens-in, tokens-out, tool calls, ASR seconds, and TTS seconds against booked-revenue per call, the right tradeoff between Realtime API and an async ASR + LLM + TTS pipeline becomes obvious — and it's almost never the same answer for healthcare as it is for salons.
Production AI agents live or die on three loops: evals, retries, and handoff state. CallSphere runs 37 agents across 6 verticals, each with its own eval suite — synthetic call transcripts replayed nightly with assertion checks on extracted entities (date, time, party size, insurance, address). Without that loop, prompt regressions ship silently and you only find out when bookings drop.
Structured tools beat free-form text every time. Our 90+ function tools all enforce JSON schemas validated server-side; if the model hallucinates an integer where a string is required, we retry with a corrective system message before falling back to a deterministic path. For long-running flows, we treat agent handoffs as a state machine — booking → confirmation → SMS — so context survives turn boundaries.
The Realtime API vs. async decision usually comes down to "is the user holding the phone right now?" If yes, Realtime; if no (callback queue, after-hours voicemail), async wins on cost-per-conversation, which we track per agent in 115+ database tables spanning all 6 verticals.
How does this apply to a CallSphere pilot specifically? Setup runs 3–5 business days, the trial is 14 days with no credit card, and pricing tiers are $149, $499, and $1,499 — so a vertical-specific pilot is a same-week decision, not a quarterly project. For a topic like "Agent Evals as CI Gates: How PromptFoo Became the Merge Blocker", that means you're not starting from scratch — you're configuring an agent template that's already been hardened across thousands of conversations.
What does the typical first-week implementation look like? Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Day two through five is shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar.
Where does this break down at scale? The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest — observability, retries, multi-region routing — without your team owning the GPU layer.
Want to see how this maps to your stack? Book a live walkthrough at calendly.com/sagar-callsphere/new-meeting, or try the vertical-specific demo at escalation.callsphere.tech. 14-day trial, no credit card, pilot live in 3–5 business days.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
GPT-Realtime-2 brings GPT-5-class reasoning into voice. What that means for tool-call reliability, structured output, and production agent design.
The public MCP registry crossed 9,400 servers in April 2026. Here is a curated walkthrough of the SaaS MCP servers CallSphere mounts in production, with OAuth 2.1 PKCE patterns.
A 'did the agent answer correctly?' pass/fail hides broken tool calls, wasted tokens, and silent retries. Here is how to evaluate intermediate steps.
Neo4j's agent-memory project ships short-term, long-term, and reasoning memory in one graph. Microsoft Agent Framework and LangChain both wire it in. Here is the production pattern.
How leaders should think about Claude Sonnet 4.6 customer support — adoption patterns, ROI, competitive dynamics, and what CX automation means for the next 12 months.
AI SDK 5 ships fully typed chat for React, Svelte, Vue, and Angular plus first-class agent loop primitives. Here are the patterns that matter for shipping in 2026.
© 2026 CallSphere LLC. All rights reserved.