By Sagar Shankaran, Founder of CallSphere
Run a 5% → 25% → 50% → 100% canary on a voice agent with Argo Rollouts, AnalysisTemplate against eval pass-rate, and the new Metric AI plugin from ArgoCon 2026.
Key takeaways
TL;DR — Argo Rollouts replaces a Deployment with a Rollout CRD. AnalysisTemplates measure live metrics (eval pass-rate, p95 first-token latency, error budget burn) at each step. The Metric AI plugin from ArgoCon 2026 lets the rollout controller reason about why a metric moved, not just whether it crossed a threshold.
An Argo Rollouts Rollout for the voice agent: 5% → 25% → 50% → 100% with AnalysisTemplates that hit our internal eval harness and Prometheus, plus the Metric AI plugin to root-cause regressions and auto-rollback.
```mermaid
flowchart LR
    GIT[deploy repo] --> ROLL[Rollout CRD]
    ROLL -->|5%| C1[canary v2]
    C1 --> ANA[AnalysisTemplate]
    ANA -->|prom + evals| METRIC[Metric AI plugin]
    METRIC --> DECIDE{Pass?}
    DECIDE -->|yes| C2[25% canary]
    DECIDE -->|no| RB[Auto rollback]
    C2 --> C3[50%]
    C3 --> FULL[100%]
```
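To make the flow above concrete, here is a minimal Python sketch of the promote-or-roll-back loop. It is a simplification for illustration, not the Argo Rollouts controller's logic: `run_canary` and `analyze` are hypothetical names, and the sketch gates every weight even though a real Rollout only analyzes at the steps where you place an analysis.

```python
from typing import Callable, List


def run_canary(weights: List[int], analyze: Callable[[int], bool]) -> dict:
    """Walk the canary weights in order; stop on the first failed analysis."""
    last_good = 0  # highest weight that passed analysis so far
    for weight in weights:
        if not analyze(weight):
            # A failed analysis aborts: traffic goes back to the stable version.
            return {"status": "rolled_back", "failed_at": weight, "last_good": last_good}
        last_good = weight
    return {"status": "promoted", "failed_at": None, "last_good": last_good}


STEPS = [5, 25, 50, 100]

healthy = run_canary(STEPS, lambda weight: True)
# Hypothetical regression that only shows up once the canary takes 50% of traffic.
regressed = run_canary(STEPS, lambda weight: weight < 50)
```

In the second run the loop stops at 50% and reports 25% as the last weight that passed, which is the point of stepping up gradually: a regression is caught before it reaches all callers.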
```bash
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl apply -f https://github.com/argoproj-labs/rollouts-metricai-plugin/releases/latest/download/install.yaml
```
The Metric AI plugin runs as a sidecar to the rollouts controller; it can run kubectl logs, query Prometheus, and ask Claude to assess "is this canary healthy".
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: voice-agent }
spec:
  replicas: 10
  selector:
    matchLabels: { app: voice-agent }
  template:
    metadata: { labels: { app: voice-agent } }
    spec:
      containers:
        - name: agent
          image: ghcr.io/acme/voice-agent:v1
  strategy:
    canary:
      canaryService: voice-agent-canary
      stableService: voice-agent-stable
      trafficRouting:
        istio:
          virtualService: { name: voice-vs, routes: [primary] }
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates: [{ templateName: voice-canary }]
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates: [{ templateName: voice-canary }]
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
```
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: voice-canary }
spec:
  metrics:
    - name: eval-pass-rate
      successCondition: result[0] >= 0.92
      failureLimit: 0
      provider:
        web:
          url: https://evals.internal/api/run?suite=voice&version={{args.canary-hash}}
          jsonPath: "{$.passRate}"
    - name: p95-first-token
      successCondition: result[0] <= 800
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(voice_first_token_ms_bucket{version="canary"}[5m])) by (le))
    - name: ai-judge
      provider:
        plugin:
          metricai/judge:
            prompt: |
              Compare canary vs stable error logs and Prometheus deltas.
              Return PASS or FAIL with one-sentence reason.
```
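The two threshold metrics in the template above boil down to a comparison plus a failure budget. The sketch below mirrors that in plain Python so the semantics are easy to test locally. The `GATES` table and `metric_passes` are hypothetical helpers rather than part of Argo Rollouts, and the sketch assumes a metric fails once its failed measurements exceed `failureLimit`.

```python
from typing import Callable, Dict, List

# Thresholds copied from the AnalysisTemplate's successCondition fields.
GATES: Dict[str, Callable[[float], bool]] = {
    "eval-pass-rate": lambda value: value >= 0.92,
    "p95-first-token": lambda value: value <= 800,
}


def metric_passes(name: str, measurements: List[float], failure_limit: int = 0) -> bool:
    """Return True while the number of failed measurements stays within failure_limit."""
    failures = sum(1 for value in measurements if not GATES[name](value))
    return failures <= failure_limit


strict = metric_passes("eval-pass-rate", [0.95, 0.91])                    # failureLimit: 0
lenient = metric_passes("eval-pass-rate", [0.95, 0.91], failure_limit=1)  # one miss allowed
latency_ok = metric_passes("p95-first-token", [640.0, 800.0])
```

With `failure_limit=0`, the single 0.91 reading is enough to fail the metric; allowing one failure lets the same series pass.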
The third metric is the AI judge. It looks at logs, traces, and metric deltas and emits a PASS/FAIL verdict, which catches "subtle prompt regression" failures that fixed thresholds miss.
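Because the judge answers in free text, something has to turn its reply into a boolean. The sketch below shows one defensive way to do that. The reply format is an assumption based only on the prompt ("PASS or FAIL with one-sentence reason"); `parse_verdict` is a hypothetical helper, not the plugin's API. Anything that is not a clear PASS is treated as a failure.

```python
import re
from typing import Tuple

# Matches a leading PASS/FAIL token, optional separators, then the reason.
_VERDICT = re.compile(r"^\s*(PASS|FAIL)\b[\s:.\-]*(.*)$", re.IGNORECASE | re.DOTALL)


def parse_verdict(reply: str) -> Tuple[bool, str]:
    """Return (passed, reason); unparseable replies count as failures."""
    match = _VERDICT.match(reply)
    if match is None:
        return False, "unparseable judge reply"
    passed = match.group(1).upper() == "PASS"
    return passed, match.group(2).strip()


ok = parse_verdict("PASS: error rates match stable.")
bad = parse_verdict("FAIL - tool-call rate up 30%")
unknown = parse_verdict("The canary looks mostly fine.")
```

Failing closed on ambiguous output is a deliberate choice here: a vague judge reply should hold the rollout, not promote it.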
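```python
from fastapi import FastAPI  # assumed framework; the decorator style matches FastAPI

app = FastAPI()
# `harness` is the internal eval harness client; its import is not shown here.


@app.get("/api/run")
def run(suite: str, version: str):
    # Run the suite against the canary Service rather than the stable one.
    results = harness.run_suite(suite, model_endpoint="http://voice-agent-canary:8080")
    return {"passRate": results.pass_rate, "totals": results.totals}
```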
The eval service hits voice-agent-canary (the canary subset) directly so it's measuring what real traffic will see.
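For local debugging it helps to reproduce what the web metric does: build the eval URL, read the JSON body, and pull out `passRate`. The helpers below are a hypothetical sketch using only the standard library. They do not call the service, and `abc123` stands in for whatever canary hash the rollout passes.

```python
import json
from urllib.parse import urlencode


def build_eval_url(base: str, suite: str, version: str) -> str:
    """Build the query string the eval endpoint expects (suite and version)."""
    return f"{base}?{urlencode({'suite': suite, 'version': version})}"


def pass_rate_from_body(body: str) -> float:
    """Extract the field the template's jsonPath ({$.passRate}) points at."""
    return float(json.loads(body)["passRate"])


url = build_eval_url("https://evals.internal/api/run", "voice", "abc123")
# Sample response body shaped like the endpoint's return value.
rate = pass_rate_from_body('{"passRate": 0.94, "totals": {"run": 50}}')
gate_open = rate >= 0.92
```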
```yaml
- setWeight: 50
- pause: {}   # indefinite; kubectl argo rollouts promote voice-agent
```
Some teams want a human in the loop for the last step. Use kubectl argo rollouts promote voice-agent to release.
```yaml
spec:
  rollbackWindow: { revisions: 3 }
  strategy:
    canary:
      abortScaleDownDelaySeconds: 30
```
Failed analysis aborts the rollout in <60 s; old ReplicaSet is already running, so traffic just stays on stable. No customer-visible outage.
```yaml
trigger.on-rollout-aborted: |
  send: [voice-aborted]
template.voice-aborted: |
  message: |
    :rotating_light: voice-agent canary aborted at step {{.step}}
    Reason: {{.failure_reason}}
    See: https://argo.example.com/rollouts/voice-agent
```
- A [5m] rate query returns zero when the canary has been live for only 2 min. Always set pause.duration >> query window.
- setWeight: 1 can have zero pods on a 10-replica fleet (1% of 10 rounds to 0). Set canary.dynamicStableScale: true or use maxSurge: 1.
- … istioctl proxy-config routes.
- … maxTokensPerAnalysis: 2000 plugin config.
- failureLimit: 0 means one bad data point aborts. For noisy metrics, raise failureLimit; consecutiveErrorLimit: 3 only tolerates measurement errors (for example, a provider that fails to respond), not failed results.

CallSphere canary-rolls every voice-agent change behind Argo Rollouts: 5% → 25% → 50% → 100% with eval pass-rate ≥ 0.92 and first-token p95 ≤ 800 ms gates. The Metric AI plugin caught a regression in our healthcare agent two weeks ago where a prompt edit caused a +30% tool-call rate without changing latency; pure threshold gates would have shipped it.
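The query-window pitfall above is easy to lint before a manifest ships. The sketch below compares a pause duration against a Prometheus range window. `to_seconds` and `pause_covers_window` are hypothetical helpers, and the factor of 2 is my own stand-in for ">>", not a value from Argo Rollouts.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a simple duration such as '90s', '5m', or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def pause_covers_window(pause: str, window: str, factor: float = 2.0) -> bool:
    """True when the pause is at least `factor` times the query window."""
    return to_seconds(pause) >= factor * to_seconds(window)


first_step_ok = pause_covers_window("5m", "5m")    # 5m pause vs a [5m] rate window
second_step_ok = pause_covers_window("10m", "5m")
```

Under this assumed factor, a 5m pause in front of a [5m] query is flagged, so either lengthen the pause or shorten the window.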
Q: Argo Rollouts vs Flagger?
Argo if you're already on ArgoCD. Flagger if you want zero manifest changes (it uses standard Deployments). Both are at feature parity for canary.
Q: How do I avoid alert fatigue on noisy evals?
Run the eval suite on a fixed seed, embed it as a frozen test file in the agent repo, version it, and gate the canary on regression vs stable rather than on absolute pass-rate.
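A relative gate of that kind can be as small as the sketch below. `regression_gate` and the 0.02 tolerance are hypothetical; pick a tolerance that matches the run-to-run noise you actually see in the suite.

```python
def regression_gate(canary_pass_rate: float, stable_pass_rate: float,
                    tolerance: float = 0.02) -> bool:
    """Pass when the canary is no more than `tolerance` below stable on the same suite."""
    return canary_pass_rate >= stable_pass_rate - tolerance


small_dip = regression_gate(canary_pass_rate=0.91, stable_pass_rate=0.92)
real_regression = regression_gate(canary_pass_rate=0.85, stable_pass_rate=0.92)
```

The first comparison passes because a 0.01 dip is inside the tolerance; the second fails because the canary is 0.07 below stable.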
Q: What about WebRTC sticky sessions during canary?
Use sessionAffinity: ClientIP on the stable Service; canary picks up new sessions only. In-flight calls finish on stable.
Q: Cost of the AI judge?
~$0.01 per analysis tick with Claude Haiku. Cheap insurance.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.