
Canary AI Agent Versions with Argo Rollouts + Metric AI Plugin (2026)

Run a 5% → 25% → 50% → 100% canary on a voice agent with Argo Rollouts, AnalysisTemplate against eval pass-rate, and the new Metric AI plugin from ArgoCon 2026.

TL;DR — Argo Rollouts replaces a Deployment with a Rollout CRD. AnalysisTemplates measure live metrics (eval pass-rate, p95 first-token latency, error budget burn) at each step. The Metric AI plugin from ArgoCon 2026 lets the rollout controller reason about why a metric moved, not just whether it crossed a threshold.

What you'll set up

An Argo Rollouts Rollout for the voice agent: 5% → 25% → 50% → 100% with AnalysisTemplates that hit our internal eval harness and Prometheus, plus the Metric AI plugin to root-cause regressions and auto-rollback.

Architecture

```mermaid
flowchart LR
  GIT[deploy repo] --> ROLL[Rollout CRD]
  ROLL -->|5%| C1[canary v2]
  C1 --> ANA[AnalysisTemplate]
  ANA -->|prom + evals| METRIC[Metric AI plugin]
  METRIC --> DECIDE{Pass?}
  DECIDE -->|yes| C2[25% canary]
  DECIDE -->|no| RB[Auto rollback]
  C2 --> C3[50%]
  C3 --> FULL[100%]
```

Step 1 — Install Argo Rollouts + the Metric AI plugin

```bash
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl apply -f https://github.com/argoproj-labs/rollouts-metricai-plugin/releases/latest/download/install.yaml
```

The Metric AI plugin runs as a sidecar to the rollouts controller; it can run kubectl logs, query Prometheus, and ask Claude to assess whether the canary is healthy.
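The plugin's exact API isn't shown here, but the shape of its judge loop is easy to picture: compute per-metric deltas between stable and canary, then ask for a PASS/FAIL with a reason. A minimal standalone sketch (the function names and tolerance are assumptions, and the model call is replaced with a deterministic comparison so the example runs without credentials):

```python
# Hypothetical sketch of the Metric AI judge loop. The real plugin hands the
# deltas to Claude; here the "judgment" is a deterministic stub.

def summarize_deltas(stable: dict, canary: dict) -> list:
    """Describe per-metric movement between stable and canary."""
    lines = []
    for name, base in stable.items():
        delta = (canary[name] - base) / base * 100
        lines.append(f"{name}: {delta:+.1f}% vs stable")
    return lines

def judge(stable: dict, canary: dict, tolerance_pct: float = 10.0) -> dict:
    """Return PASS/FAIL with a one-sentence reason, like the plugin's output."""
    deltas = summarize_deltas(stable, canary)
    worst = max(stable, key=lambda m: abs(canary[m] - stable[m]) / stable[m])
    worst_pct = (canary[worst] - stable[worst]) / stable[worst] * 100
    if abs(worst_pct) > tolerance_pct:
        return {"verdict": "FAIL", "reason": f"{worst} moved {worst_pct:+.1f}%", "deltas": deltas}
    return {"verdict": "PASS", "reason": "all metrics within tolerance", "deltas": deltas}

# A latency-neutral tool-call spike is exactly the kind of move a fixed
# latency threshold never sees, but a delta comparison does.
print(judge(
    stable={"error_rate": 0.010, "tool_call_rate": 1.0},
    canary={"error_rate": 0.010, "tool_call_rate": 1.3},
))  # FAIL: tool_call_rate moved +30.0%
```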

Step 2 — Replace Deployment with Rollout

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: voice-agent
spec:
  replicas: 10
  selector:
    matchLabels:
      app: voice-agent
  template:
    metadata:
      labels:
        app: voice-agent
    spec:
      containers:
        - name: agent
          image: ghcr.io/acme/voice-agent:v1
  strategy:
    canary:
      canaryService: voice-agent-canary
      stableService: voice-agent-stable
      trafficRouting:
        istio:
          virtualService:
            name: voice-vs
            routes: [primary]
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: voice-canary
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: voice-canary
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
```

Step 3 — AnalysisTemplate: eval pass-rate + p95 latency

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: voice-canary
spec:
  args:
    - name: canary-hash   # must be declared to use {{args.canary-hash}} below
  metrics:
    - name: eval-pass-rate
      successCondition: result[0] >= 0.92
      failureLimit: 0
      provider:
        web:
          url: https://evals.internal/api/run?suite=voice&version={{args.canary-hash}}
          jsonPath: "{$.passRate}"
    - name: p95-first-token
      successCondition: result[0] <= 800
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(voice_first_token_ms_bucket{version="canary"}[5m])) by (le))
    - name: ai-judge
      provider:
        plugin:
          metricai/judge:
            prompt: |
              Compare canary vs stable error logs and Prometheus deltas.
              Return PASS or FAIL with one-sentence reason.
```
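The histogram_quantile query can look opaque if you haven't worked with Prometheus histograms: it estimates p95 from cumulative bucket counters by interpolating inside the bucket where the 95th observation lands. A rough Python equivalent (simplified from Prometheus's actual implementation, which handles the +Inf bucket and edge cases more carefully):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative histogram buckets.

    buckets: (upper_bound_ms, cumulative_count) pairs sorted by bound,
    like the voice_first_token_ms_bucket{le="..."} series.
    """
    total = buckets[-1][1]   # the last bucket holds the total count
    rank = q * total         # which observation we're looking for
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 900 under 500 ms, 980 under 800 ms, all under 2000 ms.
# The 950th observation (p95) falls in the 500-800 ms bucket.
print(histogram_quantile(0.95, [(500, 900), (800, 980), (2000, 1000)]))  # 687.5
```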


The third metric is the AI judge. It looks at logs, traces, and metric deltas and emits PASS or FAIL, catching the subtle prompt regressions that fixed thresholds miss.

Step 4 — Wire eval harness to the canary

```python
# evals/serve.py — runs as a Service in the cluster.
# `harness` is the internal eval harness; its import is elided here.
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/run")
def run(suite: str, version: str):
    # Hit the canary subset directly so we measure what canary traffic sees
    results = harness.run_suite(suite, model_endpoint="http://voice-agent-canary:8080")
    return {"passRate": results.pass_rate, "totals": results.totals}
```

The eval service hits voice-agent-canary (the canary subset) directly so it's measuring what real traffic will see.
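What the web provider does with that response is small enough to sketch: fetch the JSON body, pull out the jsonPath field, and apply successCondition. A standalone approximation (field names match the template above; the real controller uses a full JSONPath engine rather than a direct key lookup):

```python
import json

def evaluate_web_metric(body: str, success_threshold: float = 0.92) -> bool:
    """Approximate the web provider: extract $.passRate from the response
    and apply successCondition `result[0] >= 0.92`."""
    payload = json.loads(body)
    result = payload["passRate"]   # jsonPath "{$.passRate}"
    return result >= success_threshold

# Response shape from evals/serve.py above
print(evaluate_web_metric('{"passRate": 0.94, "totals": {"passed": 47, "run": 50}}'))  # True
print(evaluate_web_metric('{"passRate": 0.90, "totals": {"passed": 45, "run": 50}}'))  # False
```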

Step 5 — Manual gate before 100%

```yaml
- setWeight: 50
- pause: {}  # indefinite; release with: kubectl argo rollouts promote voice-agent
```

Some teams want a human in the loop for the last step. Use kubectl argo rollouts promote voice-agent to release.

Step 6 — Auto-rollback hooks

```yaml
spec:
  rollbackWindow:
    revisions: 3
  strategy:
    canary:
      abortScaleDownDelaySeconds: 30
```


A failed analysis aborts the rollout in under 60 seconds; the stable ReplicaSet is still running, so traffic simply stays on stable. No customer-visible outage.

Step 7 — Slack notification template

```yaml
trigger.on-rollout-aborted: |
  - send: [voice-aborted]
template.voice-aborted: |
  message: |
    :rotating_light: voice-agent canary aborted at step {{.step}}
    Reason: {{.failure_reason}}
    See: https://argo.example.com/rollouts/voice-agent
```

Pitfalls

  • AnalysisTemplate query window — using [5m] rate when the canary has been live for 2 min returns zero. Always set pause.duration >> query window.
  • canary at setWeight: 1 can have zero pods on a 10-replica fleet (1% of 10 rounds to 0). Set canary.dynamicStableScale: true or use maxSurge: 1.
  • Istio VirtualService routes named wrong in trafficRouting → silent no-op. Verify with istioctl proxy-config routes.
  • AI plugin cost runaway — the AI judge calls Claude on every analysis tick. Cap it with the plugin's maxTokensPerAnalysis: 2000 setting.
  • failureLimit: 0 means one bad data point aborts. Sometimes you want consecutiveErrorLimit: 3 for noisy metrics.
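The difference between those last two settings is worth being precise about. A sketch of the two counting modes (semantics approximated from the field names and the pitfall above; check the Argo Rollouts analysis docs for exact behavior):

```python
def aborts(measurements: list, failure_limit: int = 0,
           consecutive_error_limit=None) -> bool:
    """failure_limit counts total bad measurements (failureLimit: 0 means the
    first bad tick aborts); consecutive_error_limit tolerates isolated noise
    and only aborts after N+1 bad ticks in a row."""
    failures, streak = 0, 0
    for ok in measurements:
        failures += 0 if ok else 1
        streak = 0 if ok else streak + 1
        if consecutive_error_limit is not None:
            if streak > consecutive_error_limit:
                return True
        elif failures > failure_limit:
            return True
    return False

noisy = [True, False, True, True, False, True]   # two isolated bad ticks
print(aborts(noisy, failure_limit=0))             # True: first bad tick aborts
print(aborts(noisy, consecutive_error_limit=3))   # False: never 4 bad in a row
```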

How CallSphere does this in production

CallSphere canary-rolls every voice-agent change behind Argo Rollouts: 5% → 25% → 50% → 100% with eval pass-rate ≥ 0.92 and first-token p95 ≤ 800 ms gates. The Metric AI plugin caught a regression in our healthcare agent two weeks ago where a prompt edit caused a +30% tool-call rate without changing latency; pure threshold gates would have shipped it.

FAQ

Q: Argo Rollouts vs Flagger? Use Argo Rollouts if you're already on Argo CD; use Flagger if you want zero manifest changes (it works with standard Deployments). Both are at rough feature parity for canary.

Q: How do I avoid alert fatigue on noisy evals? Run the eval suite on a fixed seed, embed it as a frozen test file in the agent repo, version it, and gate canary on regression vs stable — not absolute pass-rate.
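Gating on regression versus stable, rather than an absolute pass-rate, is a one-line change in how the success condition is computed. A sketch (the tolerance value is illustrative):

```python
def regression_gate(stable_pass_rate: float, canary_pass_rate: float,
                    max_drop: float = 0.02) -> bool:
    """Pass the canary as long as it doesn't regress more than max_drop
    below whatever stable currently scores on the same frozen suite."""
    return canary_pass_rate >= stable_pass_rate - max_drop

print(regression_gate(0.88, 0.87))  # True: within 2 points of stable's own score
print(regression_gate(0.96, 0.90))  # False: 6-point drop, even though 0.90 looks fine
```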

Q: What about WebRTC sticky sessions during canary? Use sessionAffinity: ClientIP on the stable Service; canary picks up new sessions only. In-flight calls finish on stable.

Q: Cost of the AI judge? ~$0.01 per analysis tick with Claude Haiku. Cheap insurance.

