Feature Flags for AI: Statsig vs GrowthBook vs LaunchDarkly (2026)
Compare Statsig (now OpenAI-owned), GrowthBook with MCP, and LaunchDarkly for shipping AI prompt and model changes safely. Real flag patterns: prompt rollout, model swap, kill switch.
TL;DR — Statsig (acquired by OpenAI) bundles flags + experimentation; GrowthBook is the open-source alternative with MCP integration; LaunchDarkly is the enterprise default. Use flags for prompts and model versions, not for feature toggles — AI changes need % rollout, not all-or-nothing.
What you'll set up
A voice agent that pulls its system prompt and model name from a feature-flag service at session start, with cohort assignment by tenant ID and a global kill switch. Three patterns shown — pick one platform.
Architecture
```mermaid
flowchart LR
  CALL[New call] --> AGENT[Voice agent]
  AGENT -->|user_id| FLAGS[Flag SDK]
  FLAGS --> SDK_LD[LaunchDarkly]
  FLAGS --> SDK_ST[Statsig]
  FLAGS --> SDK_GB[GrowthBook]
  SDK_LD --> CONFIG[(Eval rules)]
  AGENT -->|chosen prompt + model| LLM[OpenAI Realtime]
  AGENT -.->|track event| EVENTS[Events table]
```
Step 1 — Define what to flag
For AI agents, flag prompts and models, not features:
```json
{
  "agent_system_prompt_v": { "rollout": "v3", "fallback": "v2" },
  "agent_model": { "rollout": "gpt-realtime", "fallback": "gpt-realtime-mini" },
  "agent_tool_set": { "rollout": "core+v2-extras", "fallback": "core" },
  "agent_kill_switch": false
}
```
Kill switch is non-negotiable — if a model regression hits, you flip one bool and traffic moves back to the known-good config in seconds.
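To make the semantics concrete, here is a sketch of the intended resolution order against that JSON — kill switch first, then the rollout value, with per-flag fallbacks. In production the SDKs below do this per-user with percentage rollouts; `resolve_flags` is only illustrative:

```python
def resolve_flags(flags: dict) -> dict:
    """Illustrative resolution order: kill switch first, then rollout vs fallback."""
    if flags.get("agent_kill_switch", False):
        # One bool flips every call back to the known-good fallback values
        return {k: v["fallback"] for k, v in flags.items() if isinstance(v, dict)}
    return {k: v["rollout"] for k, v in flags.items() if isinstance(v, dict)}
```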
Step 2a — LaunchDarkly client (Python)
```python
import ldclient
from ldclient.config import Config

ldclient.set_config(Config("${LD_SDK_KEY}"))
ld = ldclient.get()

def session_config(call_id: str, tenant_id: str):
    ctx = (
        ldclient.Context.builder(call_id)
        .kind("call")
        .set("tenant", tenant_id)
        .set("vertical", "healthcare")
        .build()
    )
    if ld.variation("agent_kill_switch", ctx, False):
        return SAFE_DEFAULTS
    return {
        "prompt_v": ld.variation("agent_system_prompt_v", ctx, "v2"),
        "model": ld.variation("agent_model", ctx, "gpt-realtime-mini"),
    }
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
LaunchDarkly's percentage rollouts and segments are mature, and its experimentation add-on hooks into your warehouse.
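To feed that experimentation add-on, send metric events against the same context the flags were evaluated with. A minimal sketch using the Python SDK's `track` call — the metric names (`call_success`, `call_latency_ms`) are placeholders you would define in LaunchDarkly:

```python
def end_call_ld(ctx, success: bool, latency_ms: float):
    # Metric events keyed to the same context power LaunchDarkly Experimentation
    ld.track("call_success", ctx, metric_value=1.0 if success else 0.0)
    ld.track("call_latency_ms", ctx, metric_value=latency_ms)
    ld.flush()  # optional: push buffered events before the worker exits
```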
Step 2b — Statsig (now OpenAI-owned)
```python
from statsig import statsig, StatsigUser

statsig.initialize("${STATSIG_SDK_KEY}")

def session_config(call_id, tenant_id):
    user = StatsigUser(user_id=call_id, custom={"tenant": tenant_id})
    if statsig.check_gate(user, "agent_kill_switch"):
        return SAFE_DEFAULTS
    cfg = statsig.get_config(user, "voice_agent")
    return {
        "prompt_v": cfg.get("prompt_v", "v2"),
        "model": cfg.get("model", "gpt-realtime-mini"),
    }
```
Statsig is best when you want flags + experiment results in the same dashboard. Now part of OpenAI, it has tightened LLM-aware defaults (auto-track input/output tokens per gate).
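If you want call outcomes in that same dashboard, here is a sketch of logging a call-end event with the Statsig Python SDK — the event name and metadata keys follow this article's conventions, not anything Statsig requires:

```python
from statsig.statsig_event import StatsigEvent

def end_call_statsig(user: StatsigUser, cfg: dict, success: bool, latency_ms: float):
    # Events logged against the same user get joined to gate/config exposures
    statsig.log_event(StatsigEvent(
        user,
        "call_end",
        value=latency_ms,
        metadata={"success": str(success), "prompt_v": cfg["prompt_v"], "model": cfg["model"]},
    ))
```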
Step 2c — GrowthBook with MCP
```python
from growthbook import GrowthBook

gb = GrowthBook(api_host="https://api.growthbook.io", client_key="${GB_KEY}")
gb.load_features()  # fetch feature definitions once at process start

def session_config(call_id, tenant_id):
    gb.set_attributes({"id": call_id, "tenant": tenant_id})
    if gb.is_on("agent_kill_switch"):
        return SAFE_DEFAULTS
    return {
        "prompt_v": gb.get_feature_value("agent_system_prompt_v", "v2"),
        "model": gb.get_feature_value("agent_model", "gpt-realtime-mini"),
    }
```
GrowthBook's MCP server (2026) lets Claude or other agents read/write flags directly — useful for "the agent ships its own canary".
Step 3 — Track outcomes
```python
def end_call(call_id, cfg, success, latency_ms, ev):
    # Attach the flag values that were live for this call to the outcome event
    ev.track(call_id, "call_end", {
        "success": success,
        "latency_ms": latency_ms,
        "prompt_v": cfg["prompt_v"],
        "model": cfg["model"],
    })
```
Without outcome events, you have flags but no learning. Tie every call to its flag values so the experiment view can compute lift.
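A minimal sketch of the `ev` object used above, writing each outcome to a Postgres events table with psycopg2 — the `call_events` table and its columns are assumptions; match them to your own schema:

```python
import json
import psycopg2

class EventTracker:
    def __init__(self, dsn: str):
        self.conn = psycopg2.connect(dsn)

    def track(self, call_id: str, event: str, props: dict):
        # One row per outcome event; the flag values ride along in the props JSON
        with self.conn, self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO call_events (call_id, event, props) VALUES (%s, %s, %s)",
                (call_id, event, json.dumps(props)),
            )
```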
Step 4 — Stream flag changes (no restart)
All three SDKs pick up changes via streaming (SSE) or polling, so new flag values apply to the next call without a restart. In Python, make sure the SDK is initialized once at process start, not per call — per-call init is a huge performance hit.
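One way to keep init out of the per-call path, shown for LaunchDarkly but the same shape works for Statsig and GrowthBook:

```python
from functools import lru_cache

import ldclient
from ldclient.config import Config

@lru_cache(maxsize=1)
def flag_client():
    # Runs once per process; every later call reuses the same streaming client
    ldclient.set_config(Config("${LD_SDK_KEY}"))
    return ldclient.get()

# In the call handler: ld = flag_client()  # no per-call SDK init
```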
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 5 — Bake in a fallback when the flag service is down
```python
SAFE_DEFAULTS = {"prompt_v": "v2", "model": "gpt-realtime-mini", "tool_set": "core"}
```
Every variation() call must have a hard-coded default. If LaunchDarkly's CDN is degraded, you keep serving — just without rollout granularity.
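Belt and braces: wrap bundle resolution so an SDK exception also lands on the safe defaults — `session_config` here is whichever Step 2 variant you chose:

```python
import logging

def resolve_or_default(call_id: str, tenant_id: str) -> dict:
    try:
        return session_config(call_id, tenant_id)
    except Exception:
        # Flag service unreachable, SDK not initialized, etc. Keep serving calls.
        logging.exception("flag resolution failed; using SAFE_DEFAULTS")
        return SAFE_DEFAULTS
```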
Step 6 — Cohort by tenant for vertical safety
For multi-vertical agents (healthcare, salon, etc.), cohort by vertical so a healthcare-only prompt change can't leak to dental.
```json
// LaunchDarkly targeting
{
  "rules": [
    { "if": { "vertical": "healthcare" }, "then": { "variation": "v3-healthcare" } },
    { "default": "v2" }
  ]
}
```
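For those rules to match, the vertical has to be on the context at evaluation time. A sketch, assuming a tenant-to-vertical lookup you maintain yourself (`TENANT_VERTICAL` is hypothetical):

```python
TENANT_VERTICAL = {"acme-health": "healthcare", "smileco": "dental", "glow": "salon"}

def call_context(call_id: str, tenant_id: str):
    return (
        ldclient.Context.builder(call_id)
        .kind("call")
        .set("tenant", tenant_id)
        .set("vertical", TENANT_VERTICAL.get(tenant_id, "general"))
        .build()
    )
```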
Step 7 — Audit log + approval
Lock prompt and model flags behind LaunchDarkly Workflows or Statsig approval gates, so a one-click model change still requires a second pair of eyes.
Pitfalls
- Per-call SDK init — ~100 ms tax. Init once, share the client.
- Sticky bucketing — for a multi-turn voice call, hash on call_id, not tenant_id, and evaluate once at call start so the caller never gets a different prompt mid-call (see the sketch after this list).
- Unsampled outcome events at high QPS run up your events-plan bill. Sample at 10% for telemetry, 100% for experiments.
- Kill switch cache TTL — if your SDK caches for 60 s, your kill switch takes 60 s to fire. Set polling to 5-10 s.
- Flag sprawl — every flag is debt. Delete flags after rollout completes; both Statsig and LaunchDarkly have cleanup reminders.
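A sketch of the per-call bundle cache referenced in the sticky-bucketing bullet — evaluate once when the call starts, reuse that bundle for every turn, and drop it at call end:

```python
_call_flags: dict[str, dict] = {}

def flags_for_call(call_id: str, tenant_id: str) -> dict:
    # Evaluate once per call_id; later turns reuse the same bundle,
    # so a mid-call flag change can't swap the prompt under the caller
    if call_id not in _call_flags:
        _call_flags[call_id] = session_config(call_id, tenant_id)
    return _call_flags[call_id]

def on_call_end(call_id: str):
    _call_flags.pop(call_id, None)  # avoid unbounded growth
```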
How CallSphere does this in production
CallSphere uses self-hosted GrowthBook as the source of truth for system prompts and model selection across 37 voice agents and 6 verticals. Every call assigns a flag bundle at start (cached for the call), and we ship outcome events to Postgres for experiment analysis. Healthcare gets stricter cohort rules; behavioral health has its own approval workflow. The platform spans 90+ tools and 115+ DB tables; plans run $149/$499/$1499 with a 14-day trial and a 22% affiliate program.
FAQ
Q: Why flag prompts instead of just versioning them in code? Speed of rollback (seconds, not a deploy) and percentage rollout per cohort. Version the content of prompts in code; use flags to decide which version is live.
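In practice that means the prompt bodies live in the repo and the flag only picks the key — a sketch:

```python
def build_system_prompt(cfg: dict) -> str:
    # Prompt content lives in code (reviewed, diffed, versioned)...
    SYSTEM_PROMPTS = {
        "v2": "You are a scheduling assistant for ...",
        "v3": "You are a scheduling assistant for ... (tighter refusal rules)",
    }
    # ...while the flag only decides which version is live for this call
    return SYSTEM_PROMPTS[cfg["prompt_v"]]
```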
Q: Statsig + OpenAI — concerns? Acquisition closed 2025. So far the SDK and pricing haven't changed; data residency policies are the watch-item.
Q: Open-source alternative if all three feel heavy? Unleash or PostHog feature flags. Both work fine for AI but have less LLM-specific tooling.
Q: How do I roll back instantly?
Flip the kill switch to true. With ~5s SDK polling, traffic moves to safe defaults globally in <30 s.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.