Drift Detection for AI Infra: Atlantis + driftctl + Daily Plan (2026)

Catch out-of-band changes to AI voice infrastructure with Atlantis-driven Terraform PRs, driftctl coverage scans, and a daily `terraform plan -detailed-exitcode` cron.

TL;DR: Atlantis runs every terraform plan/apply as a PR comment so unreviewed changes can't ship. driftctl scans cloud APIs for resources Terraform doesn't know about. A nightly plan -detailed-exitcode cron posts to Slack on any drift. Three layers, one auditable trail.

What you'll set up

An Atlantis server that watches the infra repo, running on the same k3s cluster as the AI voice agent, plus nightly driftctl coverage scans and a Slack alert whenever plan -detailed-exitcode reports a divergence. The goal: no out-of-band change goes undetected for more than 24 hours.

Architecture

```mermaid
flowchart LR
  PR[Infra PR] --> ATLANTIS[Atlantis server]
  ATLANTIS -->|plan comment| PRREVIEW[Reviewer]
  PRREVIEW -->|atlantis apply| ATLANTIS
  ATLANTIS --> AWS[AWS / Hetzner]
  CRON[Daily cron] --> PLAN[terraform plan -detailed-exitcode]
  CRON --> DRIFTCTL[driftctl scan]
  PLAN --> SLACK[Slack alert]
  DRIFTCTL --> SLACK
```

Step 1 — Deploy Atlantis on k3s

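```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: atlantis }
spec:
  replicas: 1
  selector:
    matchLabels: { app: atlantis }
  template:
    metadata:
      labels: { app: atlantis }
    spec:
      containers:
        - name: atlantis
          image: ghcr.io/runatlantis/atlantis:v0.31.0
          # GH_TOKEN and WH_SECRET must be defined under env (for example
          # from a Kubernetes Secret) for the $(VAR) expansion to work.
          args:
            - server
            - --gh-user=acme-bot
            - --gh-token=$(GH_TOKEN)
            - --gh-webhook-secret=$(WH_SECRET)
            - --repo-allowlist=github.com/acme/infra
            - --atlantis-url=https://atlantis.acme.com
            - --automerge=false
            # Atlantis has deprecated these two flags in favor of
            # apply_requirements in the server-side repo config; check that
            # the version you deploy still accepts them.
            - --require-approval
            - --require-mergeable
```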

--require-approval forces a human review; --automerge=false keeps Atlantis honest — no auto-merge after apply.

Step 2 — Atlantis project config

```yaml
# atlantis.yaml (root of infra repo)
# Repo-level custom workflows must be allowed in the server-side repo config.
version: 3
projects:
  - name: voice-agent-prod
    dir: prod/voice
    terraform_version: v1.10.5
    workflow: locked
    autoplan:
      when_modified: ["**/*.tf", "**/*.tfvars"]
      enabled: true
workflows:
  locked:
    plan:
      steps:
        - init
        - plan: { extra_args: ["-lock-timeout=10m"] }
    apply:
      steps:
        - apply
```
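To get a feel for which changed files would trigger an autoplan under the .tf and .tfvars patterns, the matching can be approximated in plain shell. This is only a sketch: `triggers_autoplan` is a hypothetical helper, it uses shell `case` globbing rather than Atlantis's own matcher, and it ignores the project `dir` scoping.

```bash
# Hypothetical helper: succeed if a changed file looks like Terraform
# source (*.tf or *.tfvars) at any depth. An approximation of the
# when_modified patterns, not Atlantis's real matching logic.
triggers_autoplan() {
  case "$1" in
    *.tf|*.tfvars) return 0 ;;
    *) return 1 ;;
  esac
}

# Example file names are made up for illustration.
for f in main.tf env/prod.tfvars README.md; do
  if triggers_autoplan "$f"; then
    echo "$f: autoplan"
  else
    echo "$f: ignored"
  fi
done
```

Documentation-only changes stay quiet this way, while any edit to Terraform code or variables produces a plan comment on the PR.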

Step 3 — Webhook + GitHub commit-status checks

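In the GitHub repo settings, add a webhook pointing to https://atlantis.acme.com/events, set the content type to application/json, and enter the same secret passed to --gh-webhook-secret. Subscribe it to pull request, pull request review, issue comment, and push events. In branch protection, require the atlantis/plan and atlantis/apply checks before merge. If you also enforce a mergeable requirement, confirm that Atlantis can still run apply while its own atlantis/apply check is pending, otherwise the two requirements can deadlock.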
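The webhook secret should be long and random. One way to generate it, as a sketch: `gen_webhook_secret` is a hypothetical helper name, and the 32-byte length is an assumption rather than a requirement from Atlantis or GitHub.

```bash
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
gen_webhook_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

WH_SECRET="$(gen_webhook_secret)"
# Print only the length so the secret itself never lands in logs.
echo "secret length: ${#WH_SECRET}"
```

Store the value wherever the Atlantis pod reads WH_SECRET from, and paste the same value into the GitHub webhook form.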

Step 4 — Add nightly drift detection cron

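```yaml
apiVersion: batch/v1
kind: CronJob
metadata: { name: drift-check }
spec:
  schedule: "0 5 * * *"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600   # stop a hung run (see Pitfalls)
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: drift
              image: hashicorp/terraform:1.10.5
              # Assumes the infra repo is mounted at /infra, SLACK_URL is
              # set in env, and curl is available in the image.
              command: ["sh", "-c"]
              args:
                - |
                  cd /infra/prod/voice || exit 1
                  terraform init -input=false &&
                    terraform plan -detailed-exitcode -lock=false -input=false
                  # 0 = no changes, 2 = drift, anything else = error
                  case "$?" in
                    0) exit 0 ;;
                    2) msg=":warning: Drift in voice-prod" ;;
                    *) msg=":x: Drift check failed in voice-prod" ;;
                  esac
                  curl -fsS -X POST -H 'Content-Type: application/json' \
                    -d "{\"text\":\"$msg\"}" "$SLACK_URL"
```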

-detailed-exitcode returns 0 (no changes), 2 (changes), or 1 (error). Code 2 means drift; we Slack and (optionally) open an issue.
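The alert routing can be exercised locally without touching real infrastructure by substituting stand-in commands for terraform. A minimal sketch, where `drift_status` is a hypothetical wrapper that runs any command and translates its exit status using the 0/1/2 convention described above:

```bash
# drift_status CMD...: run CMD quietly and translate its exit status
# into "clean" (0), "drift" (2), or "error" (anything else).
drift_status() {
  "$@" >/dev/null 2>&1
  case "$?" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Stand-ins for terraform so each branch can be tried by hand.
drift_status true               # prints: clean
drift_status sh -c 'exit 2'     # prints: drift
drift_status sh -c 'exit 1'     # prints: error
```

In the real job the call would be along the lines of `drift_status terraform plan -detailed-exitcode -lock=false`, with the result deciding which Slack message (or issue) to create.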

Step 5 — driftctl coverage scan

```bash
# --to selects the provider; the region comes from the AWS environment.
AWS_REGION=us-east-1 driftctl scan \
  --from tfstate+s3://tf-state-bucket/voice/prod \
  --to aws+tf \
  --output json://drift.json

jq '.summary' drift.json
# Example output:
# {
#   "total_resources": 142,
#   "total_drifted": 3,
#   "total_unmanaged": 7,
#   "coverage": 92
# }
```

driftctl finds unmanaged resources too — things created in the console that Terraform doesn't even know exist. Coverage % is your "is IaC complete" KPI.
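If coverage is a KPI, the nightly job can fail when it drops below a target. A small sketch: `coverage_ok` is a hypothetical helper and the 90% threshold is an assumed target, not a figure from this setup.

```bash
# coverage_ok COVERAGE THRESHOLD: succeed when the integer coverage
# percentage meets or exceeds the threshold.
coverage_ok() {
  [ "$1" -ge "$2" ]
}

coverage=92   # value taken from the sample summary above
if coverage_ok "$coverage" 90; then
  echo "IaC coverage OK (${coverage}%)"
else
  echo "IaC coverage below target (${coverage}%)"
fi
```

In a pipeline the value would be read from drift.json with jq instead of being hard-coded; the exact field path depends on the driftctl version, so check it against your own output first.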

Step 6 — Tag everything with managed_by to filter

```hcl
provider "aws" {
  default_tags {
    tags = {
      managed_by = "terraform"
      project    = "voice-agent"
      repo       = "github.com/acme/infra"
    }
  }
}
```

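In driftctl, pass a --filter expression on that tag to ignore third-party-created resources you'll never manage (like AWS-managed default VPCs). driftctl filters are JMESPath expressions over each resource's Type, Id, and Attr fields, so the tag check likely takes a form such as --filter "Attr.tags.managed_by=='terraform'"; confirm the exact attribute path against the driftctl version you run.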

Step 7 — Auto-create remediation PR

When drift fires:

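```bash
branch="drift-fix-$(date +%s)"
git checkout -b "$branch"
terraform plan -no-color -out=plan.out

# Edit the .tf files to absorb (or revert) the drift, then commit;
# gh cannot open a PR from a branch with no new commits.
git commit -am "Absorb drift in voice-prod"
git push -u origin "$branch"

# printf expands \n into real newlines in the PR body.
gh pr create --title "Drift fix: voice-prod" \
  --body "$(printf 'Detected drift on %s. Plan:\n\n%s\n' \
    "$(date)" "$(terraform show -no-color plan.out)")"
```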

The PR shows the drift, Atlantis re-plans, reviewer accepts.
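Because the check runs every night, unresolved drift could open a fresh PR each day. One possible guard, sketched here as a variation rather than the workflow described above, is a stable branch name per environment and day so the job can skip when a branch already exists. `drift_branch` is a hypothetical helper.

```bash
# drift_branch ENV DAY: stable branch name, one per environment per day.
drift_branch() {
  printf 'drift-fix/%s/%s\n' "$1" "$2"
}

branch="$(drift_branch voice-prod "$(date +%Y-%m-%d)")"
echo "$branch"

# Before opening a PR, skip if the branch already exists on the remote
# (git ls-remote --exit-code returns non-zero when nothing matches):
#   git ls-remote --exit-code --heads origin "$branch" && exit 0
```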

Pitfalls

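  • Atlantis as a single point of failure: it keeps its PR locks in a local database by default, so running more than one replica needs a shared locking backend. Also enable Terraform state locking (for example, an S3 backend with a DynamoDB lock table) or you'll have plan races.
  • automerge: true is tempting and dangerous. Disable it.
  • driftctl false positives: new provider versions add fields that driftctl reports as drift. Pin both the provider and driftctl versions.
  • A CronJob without a deadline can run for hours if Terraform hangs, for example while waiting on a state lock. Set activeDeadlineSeconds: 600.
  • terraform refresh modifies state. For pure detection, use terraform plan -refresh-only, which reports differences without writing them.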
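The deadline pitfall applies outside Kubernetes too, for instance when the same check runs from plain cron or a CI runner. A sketch using the `timeout` utility, assuming GNU coreutils is installed; `run_with_deadline` is a hypothetical wrapper name.

```bash
# run_with_deadline SECONDS CMD...: stop CMD if it runs past the deadline.
# GNU coreutils timeout exits with status 124 when the deadline is hit.
run_with_deadline() {
  secs="$1"
  shift
  timeout "$secs" "$@"
}

# sleep stands in for a Terraform command that hangs.
run_with_deadline 1 sleep 5
echo "exit status: $?"
```

A real invocation would look like `run_with_deadline 600 terraform plan -detailed-exitcode -lock=false`, with a non-zero, non-2 status treated as a failed check rather than as drift.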

How CallSphere does this in production

CallSphere runs Atlantis in front of every Terraform repo (we've got separate repos per HIPAA tenant). Daily plan -detailed-exitcode runs in GitHub Actions; driftctl coverage scans monthly. We've caught 4 console-poke incidents in 6 months, all fixed within 4 hours of detection.

FAQ

Q: Atlantis vs Terragrunt vs Spacelift? Atlantis is open-source and self-hosted. Terragrunt is a wrapper for DRY. Spacelift is SaaS. Pick on infra spend and team size.

Q: How do I handle "emergency console fix" workflow? Document it: console fix → immediate Terraform PR within 24h to absorb the change. Drift detector slacks the responder if PR doesn't land.
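The 24-hour rule is easy to check mechanically. A minimal sketch with epoch timestamps: `fix_overdue` is a hypothetical helper, and the example timestamp is made up.

```bash
# fix_overdue FIX_EPOCH NOW_EPOCH [WINDOW_HOURS]
# Succeed when more than WINDOW_HOURS (default 24) have passed since
# the console fix without a follow-up PR landing.
fix_overdue() {
  window_secs=$(( ${3:-24} * 3600 ))
  [ $(( $2 - $1 )) -gt "$window_secs" ]
}

fix_time=1700000000                 # hypothetical epoch of the console fix
now=$(( fix_time + 25 * 3600 ))     # 25 hours later
if fix_overdue "$fix_time" "$now"; then
  echo "PR overdue: ping the responder"
fi
```

In practice `now` would come from `date +%s`, and the fix time from wherever the emergency change is logged.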

Q: driftctl on k8s resources too? driftctl is cloud-only. For k8s use ArgoCD diff or kubectl diff.

Q: Performance? terraform plan over 200 resources: ~30 s. driftctl scan over a 1000-resource AWS account: ~2 min.
