Feature Flags for AI: Statsig vs GrowthBook vs LaunchDarkly (2026)
Compare Statsig (now OpenAI-owned), GrowthBook with MCP, and LaunchDarkly for shipping AI prompt and model changes safely. Real flag patterns: prompt rollout, model swap, kill switch.
TL;DR — Statsig (acquired by OpenAI) bundles flags + experimentation; GrowthBook is the open-source alternative with MCP integration; LaunchDarkly is the enterprise default. Use flags for prompts and model versions, not for feature toggles — AI changes need % rollout, not all-or-nothing.
What you'll set up
A voice agent that pulls its system prompt and model name from a feature-flag service at session start, with cohort assignment by tenant ID and a global kill switch. Three patterns shown — pick one platform.
Architecture
```mermaid
flowchart LR
  CALL[New call] --> AGENT[Voice agent]
  AGENT -->|user_id| FLAGS[Flag SDK]
  FLAGS --> SDK_LD[LaunchDarkly]
  FLAGS --> SDK_ST[Statsig]
  FLAGS --> SDK_GB[GrowthBook]
  SDK_LD --> CONFIG[(Eval rules)]
  AGENT -->|chosen prompt + model| LLM[OpenAI Realtime]
  AGENT -.->|track event| EVENTS[Events table]
```
Step 1 — Define what to flag
For AI agents, flag prompts and models, not features:
```json
{
  "agent_system_prompt_v": { "rollout": "v3", "fallback": "v2" },
  "agent_model": { "rollout": "gpt-realtime", "fallback": "gpt-realtime-mini" },
  "agent_tool_set": { "rollout": "core+v2-extras", "fallback": "core" },
  "agent_kill_switch": false
}
```
Kill switch is non-negotiable — if a model regression hits, you flip one bool and traffic moves back to the known-good config in seconds.
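To make the semantics concrete, here is a sketch of the intended resolution order against that JSON — kill switch first, then the rollout value, with per-flag fallbacks. In production the SDKs below do this per-user with percentage rollouts; `resolve_flags` is only illustrative:

```python
def resolve_flags(flags: dict) -> dict:
    """Illustrative resolution order: kill switch first, then rollout vs fallback."""
    if flags.get("agent_kill_switch", False):
        # One bool flips every call back to the known-good fallback values
        return {k: v["fallback"] for k, v in flags.items() if isinstance(v, dict)}
    return {k: v["rollout"] for k, v in flags.items() if isinstance(v, dict)}
```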
Step 2a — LaunchDarkly client (Python)
```python
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("${LD_SDK_KEY}"))
ld = ldclient.get()

def session_config(call_id: str, tenant_id: str):
    ctx = (
        ldclient.Context.builder(call_id)
        .kind("call")
        .set("tenant", tenant_id)
        .set("vertical", "healthcare")
        .build()
    )
    if ld.variation("agent_kill_switch", ctx, False):
        return SAFE_DEFAULTS
    return {
        "prompt_v": ld.variation("agent_system_prompt_v", ctx, "v2"),
        "model": ld.variation("agent_model", ctx, "gpt-realtime-mini"),
    }
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
LaunchDarkly's percentage rollouts and segments are mature, and its experimentation add-on hooks into your warehouse.
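To feed that experimentation add-on, send metric events against the same context the flags were evaluated with. A minimal sketch using the Python SDK's `track` call — the metric names (`call_success`, `call_latency_ms`) are placeholders you would define in LaunchDarkly:

```python
def end_call_ld(ctx, success: bool, latency_ms: float):
    # Metric events keyed to the same context power LaunchDarkly Experimentation
    ld.track("call_success", ctx, metric_value=1.0 if success else 0.0)
    ld.track("call_latency_ms", ctx, metric_value=latency_ms)
    ld.flush()  # optional: push buffered events before the worker exits
```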
Step 2b — Statsig (now OpenAI-owned)
```python
from statsig import statsig, StatsigUser

statsig.initialize("${STATSIG_SDK_KEY}")

def session_config(call_id, tenant_id):
    user = StatsigUser(user_id=call_id, custom={"tenant": tenant_id})
    if statsig.check_gate(user, "agent_kill_switch"):
        return SAFE_DEFAULTS
    cfg = statsig.get_config(user, "voice_agent")
    return {
        "prompt_v": cfg.get("prompt_v", "v2"),
        "model": cfg.get("model", "gpt-realtime-mini"),
    }
```
Statsig is best when you want flags + experiment results in the same dashboard. Now part of OpenAI, it has tightened LLM-aware defaults (auto-track input/output tokens per gate).
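If you want call outcomes in that same dashboard, here is a sketch of logging a call-end event with the Statsig Python SDK — the event name and metadata keys follow this article's conventions, not anything Statsig requires:

```python
from statsig.statsig_event import StatsigEvent

def end_call_statsig(user: StatsigUser, cfg: dict, success: bool, latency_ms: float):
    # Events logged against the same user get joined to gate/config exposures
    statsig.log_event(StatsigEvent(
        user,
        "call_end",
        value=latency_ms,
        metadata={"success": str(success), "prompt_v": cfg["prompt_v"], "model": cfg["model"]},
    ))
```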
Step 2c — GrowthBook with MCP
```python
from growthbook import GrowthBook

gb = GrowthBook(api_host="https://api.growthbook.io", client_key="${GB_KEY}")
gb.load_features()  # fetch feature definitions once at process start

def session_config(call_id, tenant_id):
    gb.set_attributes({"id": call_id, "tenant": tenant_id})
    if gb.is_on("agent_kill_switch"):
        return SAFE_DEFAULTS
    return {
        "prompt_v": gb.get_feature_value("agent_system_prompt_v", "v2"),
        "model": gb.get_feature_value("agent_model", "gpt-realtime-mini"),
    }
```
GrowthBook's MCP server (2026) lets Claude or other agents read/write flags directly — useful for "the agent ships its own canary".
Step 3 — Track outcomes
```python
def end_call(call_id, cfg, success, latency_ms, ev):
    # Attach the flag values that were live for this call to the outcome event
    ev.track(call_id, "call_end", {
        "success": success,
        "latency_ms": latency_ms,
        "prompt_v": cfg["prompt_v"],
        "model": cfg["model"],
    })
```
Without outcome events, you have flags but no learning. Tie every call to its flag values so the experiment view can compute lift.
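A minimal sketch of the `ev` object used above, writing each outcome to a Postgres events table with psycopg2 — the `call_events` table and its columns are assumptions; match them to your own schema:

```python
import json
import psycopg2

class EventTracker:
    def __init__(self, dsn: str):
        self.conn = psycopg2.connect(dsn)

    def track(self, call_id: str, event: str, props: dict):
        # One row per outcome event; the flag values ride along in the props JSON
        with self.conn, self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO call_events (call_id, event, props) VALUES (%s, %s, %s)",
                (call_id, event, json.dumps(props)),
            )
```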
Step 4 — Stream flag changes (no restart)
All three SDKs pick up changes via streaming (SSE) or polling, so new flag values apply to the next call without a restart. In Python, make sure the SDK is initialized once at process start, not per call — per-call init is a huge performance hit.
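One way to keep init out of the per-call path, shown for LaunchDarkly but the same shape works for Statsig and GrowthBook:

```python
from functools import lru_cache

import ldclient
from ldclient.config import Config

@lru_cache(maxsize=1)
def flag_client():
    # Runs once per process; every later call reuses the same streaming client
    ldclient.set_config(Config("${LD_SDK_KEY}"))
    return ldclient.get()

# In the call handler: ld = flag_client()  # no per-call SDK init
```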
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 5 — Bake in a fallback when the flag service is down
```python
SAFE_DEFAULTS = {"prompt_v": "v2", "model": "gpt-realtime-mini", "tool_set": "core"}
```
Every variation() call must have a hard-coded default. If LaunchDarkly's CDN is degraded, you keep serving — just without rollout granularity.
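Belt and braces: wrap bundle resolution so an SDK exception also lands on the safe defaults — `session_config` here is whichever Step 2 variant you chose:

```python
import logging

def resolve_or_default(call_id: str, tenant_id: str) -> dict:
    try:
        return session_config(call_id, tenant_id)
    except Exception:
        # Flag service unreachable, SDK not initialized, etc. Keep serving calls.
        logging.exception("flag resolution failed; using SAFE_DEFAULTS")
        return SAFE_DEFAULTS
```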
Step 6 — Cohort by tenant for vertical safety
For multi-vertical agents (healthcare, salon, etc.), cohort by vertical so a healthcare-only prompt change can't leak to dental.
```json
// LaunchDarkly targeting
{
  "rules": [
    { "if": { "vertical": "healthcare" }, "then": { "variation": "v3-healthcare" } },
    { "default": "v2" }
  ]
}
```
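For those rules to match, the vertical has to be on the context at evaluation time. A sketch, assuming a tenant-to-vertical lookup you maintain yourself (`TENANT_VERTICAL` is hypothetical):

```python
TENANT_VERTICAL = {"acme-health": "healthcare", "smileco": "dental", "glow": "salon"}

def call_context(call_id: str, tenant_id: str):
    return (
        ldclient.Context.builder(call_id)
        .kind("call")
        .set("tenant", tenant_id)
        .set("vertical", TENANT_VERTICAL.get(tenant_id, "general"))
        .build()
    )
```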
Step 7 — Audit log + approval
Lock prompt and model flags behind LaunchDarkly Workflows or Statsig approval gates, so a one-click model change still requires a second pair of eyes.
Pitfalls
- Per-call SDK init — ~100 ms tax. Init once, share the client.
- Sticky bucketing — for a multi-turn voice call, hash on call_id, not tenant_id, and evaluate once at call start so the caller never gets a different prompt mid-call (see the sketch after this list).
- Unsampled outcome events at high QPS run up your events-plan bill. Sample at 10% for telemetry, 100% for experiments.
- Kill switch cache TTL — if your SDK caches for 60 s, your kill switch takes 60 s to fire. Set polling to 5-10 s.
- Flag sprawl — every flag is debt. Delete flags after rollout completes; both Statsig and LaunchDarkly have cleanup reminders.
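A sketch of the per-call bundle cache referenced in the sticky-bucketing bullet — evaluate once when the call starts, reuse that bundle for every turn, and drop it at call end:

```python
_call_flags: dict[str, dict] = {}

def flags_for_call(call_id: str, tenant_id: str) -> dict:
    # Evaluate once per call_id; later turns reuse the same bundle,
    # so a mid-call flag change can't swap the prompt under the caller
    if call_id not in _call_flags:
        _call_flags[call_id] = session_config(call_id, tenant_id)
    return _call_flags[call_id]

def on_call_end(call_id: str):
    _call_flags.pop(call_id, None)  # avoid unbounded growth
```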
How CallSphere does this in production
CallSphere uses self-hosted GrowthBook as the source of truth for system prompts and model selection across 37 voice agents and 6 verticals. Every call assigns a flag bundle at start (cached for the call), and we ship outcome events to Postgres for experiment analysis. Healthcare gets stricter cohort rules; behavioral health has its own approval workflow. The platform spans 90+ tools and 115+ DB tables; plans run $149/$499/$1499 with a 14-day trial and a 22% affiliate program.
FAQ
Q: Why flag prompts instead of just versioning them in code? Speed of rollback (seconds, not a deploy) and percentage rollout per cohort. Version the content of prompts in code; use flags to decide which version is live.
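In practice that means the prompt bodies live in the repo and the flag only picks the key — a sketch:

```python
def build_system_prompt(cfg: dict) -> str:
    # Prompt content lives in code (reviewed, diffed, versioned)...
    SYSTEM_PROMPTS = {
        "v2": "You are a scheduling assistant for ...",
        "v3": "You are a scheduling assistant for ... (tighter refusal rules)",
    }
    # ...while the flag only decides which version is live for this call
    return SYSTEM_PROMPTS[cfg["prompt_v"]]
```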
Q: Statsig + OpenAI — concerns? Acquisition closed 2025. So far the SDK and pricing haven't changed; data residency policies are the watch-item.
Q: Open-source alternative if all three feel heavy? Unleash or PostHog feature flags. Both work fine for AI but have less LLM-specific tooling.
Q: How do I roll back instantly?
Flip the kill switch to true. With ~5s SDK polling, traffic moves to safe defaults globally in <30 s.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.