Canary AI Agent Versions with Argo Rollouts + Metric AI Plugin (2026)
Run a 5% → 25% → 50% → 100% canary on a voice agent with Argo Rollouts, an AnalysisTemplate gating on eval pass-rate, and the new Metric AI plugin from ArgoCon 2026.
TL;DR — Argo Rollouts replaces a Deployment with a Rollout CRD. AnalysisTemplates measure live metrics (eval pass-rate, p95 first-token latency, error budget burn) at each step. The Metric AI plugin from ArgoCon 2026 lets the rollout controller reason about why a metric moved, not just whether it crossed a threshold.
What you'll set up
An Argo Rollouts Rollout for the voice agent: 5% → 25% → 50% → 100% with AnalysisTemplates that hit our internal eval harness and Prometheus, plus the Metric AI plugin to root-cause regressions and auto-rollback.
Architecture
```mermaid
flowchart LR
    GIT[deploy repo] --> ROLL[Rollout CRD]
    ROLL -->|5%| C1[canary v2]
    C1 --> ANA[AnalysisTemplate]
    ANA -->|prom + evals| METRIC[Metric AI plugin]
    METRIC --> DECIDE{Pass?}
    DECIDE -->|yes| C2[25% canary]
    DECIDE -->|no| RB[Auto rollback]
    C2 --> C3[50%]
    C3 --> FULL[100%]
```
Step 1 — Install Argo Rollouts + the Metric AI plugin
```bash
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl apply -f https://github.com/argoproj-labs/rollouts-metricai-plugin/releases/latest/download/install.yaml
```
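You'll also want the `kubectl argo rollouts` plugin that the later steps use to promote and abort (install steps per the Argo Rollouts docs; Linux amd64 binary shown):

```bash
# Install the kubectl plugin used throughout this post
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version
```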
The Metric AI plugin runs as a sidecar to the rollouts controller; it can run `kubectl logs`, query Prometheus, and ask Claude to assess whether the canary is healthy.
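If the plugin follows the standard Argo Rollouts metric-plugin mechanism, it also needs to be registered in the controller's `argo-rollouts-config` ConfigMap. The plugin name and download location below are assumptions based on the repo above, not confirmed config:

```bash
# Hypothetical registration sketch — metric plugins are declared under
# metricProviderPlugins in the argo-rollouts-config ConfigMap
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-config
  namespace: argo-rollouts
data:
  metricProviderPlugins: |-
    - name: metricai/judge   # assumed plugin name
      location: https://github.com/argoproj-labs/rollouts-metricai-plugin/releases/latest/download/metricai-plugin
EOF
```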
Step 2 — Replace Deployment with Rollout
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: voice-agent
spec:
  replicas: 10
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent
    spec:
      containers:
        - name: agent
          image: ghcr.io/acme/voice-agent:v1
  strategy:
    canary:
      canaryService: voice-agent-canary
      stableService: voice-agent-stable
      trafficRouting:
        istio:
          virtualService:
            name: voice-vs
            routes:
              - primary
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: voice-canary
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: voice-canary
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
```
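Once applied, watch the canary walk through its steps from the CLI:

```bash
# Live view of weights, pods, and analysis runs as the rollout progresses
kubectl argo rollouts get rollout voice-agent --watch
```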
Step 3 — AnalysisTemplate: eval pass-rate + p95 latency
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: voice-canary
spec:
  args:
    - name: canary-hash   # supplied by the Rollout's analysis step
  metrics:
    - name: eval-pass-rate
      successCondition: result[0] >= 0.92
      failureLimit: 0
      provider:
        web:
          url: https://evals.internal/api/run?suite=voice&version={{args.canary-hash}}
          jsonPath: "{$.passRate}"
    - name: p95-first-token
      successCondition: result[0] <= 800
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(voice_first_token_ms_bucket{version="canary"}[5m])) by (le))
    - name: ai-judge
      provider:
        plugin:
          metricai/judge:
            prompt: |
              Compare canary vs stable error logs and Prometheus deltas.
              Return PASS or FAIL with a one-sentence reason.
```
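You can dry-run the Prometheus gate before wiring it into the template; this hits the standard `/api/v1/query` endpoint behind a port-forward:

```bash
# Check what the p95 gate would see right now
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(voice_first_token_ms_bucket{version="canary"}[5m])) by (le))' \
  | jq '.data.result[0].value[1]'   # current p95 in ms
```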
The third metric is the AI judge. It looks at logs, traces, and metric deltas and emits a PASS/FAIL, catching the subtle prompt-regression failures that fixed thresholds miss.
Step 4 — Wire eval harness to the canary
```python
# evals/serve.py — runs as a Service in the cluster
from fastapi import FastAPI
import harness  # internal eval harness

app = FastAPI()

@app.get("/api/run")
def run(suite: str, version: str):
    # Point the harness at the canary Service so we measure the new version
    results = harness.run_suite(suite, model_endpoint="http://voice-agent-canary:8080")
    return {"passRate": results.pass_rate, "totals": results.totals}
```
The eval service hits voice-agent-canary (the canary subset) directly so it's measuring what real traffic will see.
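A quick smoke test of the endpoint the AnalysisTemplate will call (the hash value here is an illustrative placeholder for `{{args.canary-hash}}`):

```bash
# Expect a JSON body shaped like the handler above: {"passRate": ..., "totals": ...}
curl -s 'https://evals.internal/api/run?suite=voice&version=abc1234' | jq .
```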
Step 5 — Manual gate before 100%
```yaml
- setWeight: 50
- pause: {}  # indefinite; resume with `kubectl argo rollouts promote voice-agent`
```
Some teams want a human in the loop for the final step. Run `kubectl argo rollouts promote voice-agent` to release.
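The full set of manual controls, for reference:

```bash
kubectl argo rollouts promote voice-agent   # release the indefinite pause
kubectl argo rollouts abort voice-agent     # bail out; traffic returns to stable
kubectl argo rollouts undo voice-agent      # roll back to the previous revision
```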
Step 6 — Auto-rollback hooks
```yaml
spec:
  rollbackWindow:
    revisions: 3
  strategy:
    canary:
      abortScaleDownDelaySeconds: 30
```
A failed analysis aborts the rollout in under 60 s; the stable ReplicaSet is still running at full capacity, so traffic simply stays on stable. No customer-visible outage.
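A quick way to confirm that behavior after an abort:

```bash
kubectl argo rollouts status voice-agent       # exits non-zero and reports Degraded
kubectl argo rollouts get rollout voice-agent  # canary weight back at 0, stable serving 100%
```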
Step 7 — Slack notification template
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
  namespace: argo-rollouts
data:
  trigger.on-rollout-aborted: |
    - send: [voice-aborted]
  template.voice-aborted: |
    message: |
      :rotating_light: voice-agent canary aborted at step {{.step}}
      Reason: {{.failure_reason}}
      See: https://argo.example.com/rollouts/voice-agent
```
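Triggers only fire for Rollouts that subscribe to them; subscription is an annotation on the Rollout (the Slack channel name here is an assumption):

```bash
kubectl annotate rollout voice-agent \
  notifications.argoproj.io/subscribe.on-rollout-aborted.slack=voice-oncall
```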
Pitfalls
- AnalysisTemplate query window: a `rate(...[5m])` when the canary has only been live for 2 min returns zero. Always set `pause.duration` well above the query window.
- A canary at `setWeight: 1` can have zero pods on a 10-replica fleet (1% of 10 rounds to 0). Set `canary.dynamicStableScale: true` or use `maxSurge: 1`.
- Istio VirtualService routes named wrong in `trafficRouting` are a silent no-op. Verify with `istioctl proxy-config routes` (see the check after this list).
- AI plugin cost runaway: the AI judge calls Claude on every analysis tick. Cap it with the plugin's `maxTokensPerAnalysis: 2000` config.
- `failureLimit: 0` means a single bad data point aborts. For noisy metrics you sometimes want `consecutiveErrorLimit: 3` instead.
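The Istio check from the third pitfall, spelled out (the ingress gateway deployment name below is the Istio default and an assumption about your mesh):

```bash
# Route names declared in the VirtualService...
kubectl get virtualservice voice-vs -o jsonpath='{.spec.http[*].name}'
# ...must match what trafficRouting.istio references and what Envoy actually received
istioctl proxy-config routes -n istio-system deploy/istio-ingressgateway | grep voice
```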
How CallSphere does this in production
CallSphere canary-rolls every voice-agent change behind Argo Rollouts: 5% → 25% → 50% → 100%, gated on eval pass-rate ≥ 0.92 and first-token p95 ≤ 800 ms. The Metric AI plugin caught a regression in our healthcare agent two weeks ago where a prompt edit caused a +30% tool-call rate without changing latency; pure threshold gates would have shipped it.
FAQ
Q: Argo Rollouts vs Flagger? Argo Rollouts if you're already on Argo CD; Flagger if you want zero manifest changes (it works with standard Deployments). Both are at rough feature parity for canary.
Q: How do I avoid alert fatigue on noisy evals? Run the eval suite on a fixed seed, embed it as a frozen test file in the agent repo, version it, and gate canary on regression vs stable — not absolute pass-rate.
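A minimal sketch of that regression gate, assuming the eval service from Step 4 can target either Service via the `version` query parameter (that routing behavior is an assumption):

```bash
# Fail if canary pass-rate drops more than 2 points below stable on the frozen suite
canary=$(curl -s 'https://evals.internal/api/run?suite=voice&version=canary' | jq -r .passRate)
stable=$(curl -s 'https://evals.internal/api/run?suite=voice&version=stable' | jq -r .passRate)
awk -v c="$canary" -v s="$stable" 'BEGIN { exit !(c >= s - 0.02) }' || echo "REGRESSION vs stable"
```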
Q: What about WebRTC sticky sessions during canary?
Use sessionAffinity: ClientIP on the stable Service; canary picks up new sessions only. In-flight calls finish on stable.
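One way to apply that without touching manifests:

```bash
# Pin in-flight WebRTC sessions to their current pod on the stable Service
kubectl patch service voice-agent-stable \
  -p '{"spec":{"sessionAffinity":"ClientIP"}}'
```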
Q: Cost of the AI judge? ~$0.01 per analysis tick with Claude Haiku. Cheap insurance.
Sources
- Canary Argo Rollouts docs
- ArgoCon Europe 2026: Argo Rollouts AI integration — Carlos Sanchez & Kevin Dubois
- Progressive Delivery: Canary Deployments with Argo Rollouts and Flagger — Calmops
- Canary deployment strategy with Argo Rollouts — Red Hat Developer
- A/B Testing and Canary Deployments for Models — APXML