---
title: "Drift Detection for AI Infra: Atlantis + driftctl + Daily Plan (2026)"
description: "Catch out-of-band changes to AI voice infrastructure with Atlantis-driven Terraform PRs, driftctl coverage scans, and a daily `terraform plan -detailed-exitcode` cron."
canonical: https://callsphere.ai/blog/vw6h-drift-detection-atlantis-driftctl-ai-infra-2026
category: "AI Infrastructure"
tags: ["Drift Detection", "Atlantis", "Terraform", "GitOps", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-19T00:00:00.000Z
updated: 2026-05-07T16:46:17.454Z
---

# Drift Detection for AI Infra: Atlantis + driftctl + Daily Plan (2026)

> Catch out-of-band changes to AI voice infrastructure with Atlantis-driven Terraform PRs, driftctl coverage scans, and a daily `terraform plan -detailed-exitcode` cron.

> **TL;DR** — Atlantis runs every `terraform plan/apply` as a PR comment so unreviewed changes can't ship. driftctl scans cloud APIs for resources Terraform doesn't know about. A nightly `plan -detailed-exitcode` cron fires Slack on any drift. Three layers, one auditable trail.

## What you'll set up

Three pieces: an Atlantis server that watches the infra repo and runs on the same k3s cluster as the AI voice agent, nightly driftctl coverage scans, and a Slack alert whenever `terraform plan -detailed-exitcode` reports changes. Result: no out-of-band change goes undetected for more than 24 hours.

## Architecture

```mermaid
flowchart LR
  PR[Infra PR] --> ATLANTIS[Atlantis server]
  ATLANTIS -->|plan comment| PRREVIEW[Reviewer]
  PRREVIEW -->|atlantis apply| ATLANTIS
  ATLANTIS --> AWS[AWS / Hetzner]
  CRON[Daily cron] --> PLAN[terraform plan -detailed-exitcode]
  CRON --> DRIFTCTL[driftctl scan]
  PLAN --> SLACK[Slack alert]
  DRIFTCTL --> SLACK
```

## Step 1 — Deploy Atlantis on k3s

```yaml
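apiVersion: apps/v1
kind: Deployment
metadata: { name: atlantis }
spec:
  replicas: 1
  selector:
    matchLabels: { app: atlantis }   # required; must match the pod template labels
  template:
    metadata:
      labels: { app: atlantis }
    spec:
      containers:
        - name: atlantis
          image: ghcr.io/runatlantis/atlantis:v0.31.0
          # $(GH_TOKEN) and $(WH_SECRET) in args expand from these env vars.
          # "atlantis-secrets" and its keys are assumed names; use your own Secret.
          env:
            - name: GH_TOKEN
              valueFrom: { secretKeyRef: { name: atlantis-secrets, key: gh-token } }
            - name: WH_SECRET
              valueFrom: { secretKeyRef: { name: atlantis-secrets, key: webhook-secret } }
          args:
            - server
            - --gh-user=acme-bot
            - --gh-token=$(GH_TOKEN)
            - --gh-webhook-secret=$(WH_SECRET)
            - --repo-allowlist=github.com/acme/infra
            - --atlantis-url=https://atlantis.acme.com
            - --automerge=false
            - --require-approval
            - --require-mergeable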
```

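`--require-approval` forces a human review, `--require-mergeable` blocks applies on PRs that GitHub won't let you merge, and `--automerge=false` keeps Atlantis honest: no auto-merge after apply. Recent Atlantis releases express the first two as `apply_requirements: [approved, mergeable]` in the server-side repo config, so check which form your version accepts.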

## Step 2 — Atlantis project config

```yaml
# atlantis.yaml (root of infra repo)
version: 3
projects:
  - name: voice-agent-prod
    dir: prod/voice
    terraform_version: v1.10.5
    workflow: locked
    autoplan:
      when_modified: ["**/*.tf", "**/*.tfvars"]
      enabled: true
workflows:
  locked:
    plan:
      steps:
        - init
        - plan:
            extra_args: ["-lock-timeout=10m"]
    apply:
      steps:
        - apply
```

## Step 3 — Webhook + GitHub commit-status checks

In the GitHub repo settings, add a webhook to `https://atlantis.acme.com/events` with the secret. Branch protection: require the `atlantis/plan` and `atlantis/apply` checks before merge.

## Step 4 — Add nightly drift detection cron

```yaml
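apiVersion: batch/v1
kind: CronJob
metadata: { name: drift-check }
spec:
  schedule: "0 5 * * *"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600   # stop the job if Terraform hangs on a lock
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: drift
              # Assumes SLACK_URL is set, the repo is mounted at /infra,
              # and the image provides curl (add it if missing).
              image: hashicorp/terraform:1.10.5
              command: ["sh","-c"]
              args:
                - |
                  cd /infra/prod/voice || exit 1
                  terraform init -input=false || exit 1
                  terraform plan -detailed-exitcode -lock=false -input=false
                  rc=$?
                  # 0 = no changes, 1 = error, 2 = drift
                  if [ "$rc" -eq 2 ]; then
                    curl -sS -X POST -H 'Content-type: application/json' \
                      -d '{"text":":warning: Drift in voice-prod"}' "$SLACK_URL"
                  elif [ "$rc" -ne 0 ]; then
                    exit "$rc"   # real failure: let the Job retry and surface it
                  fi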
```

`-detailed-exitcode` returns 0 (no changes), 2 (changes), or 1 (error). Code 2 means drift; we Slack and (optionally) open an issue.
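The three exit codes are easy to conflate in a one-liner, so it can help to name them before deciding whether to alert. A minimal sketch, assuming a hypothetical helper called `classify_plan_exit` (not part of Terraform or Atlantis):

```bash
#!/bin/sh
# classify_plan_exit is a hypothetical helper: it maps the exit status of
# `terraform plan -detailed-exitcode` to a label.
#   0 -> clean (no changes), 2 -> drift (changes present), anything else -> error
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Usage sketch (needs terraform on PATH, so it is left commented out):
#   terraform plan -detailed-exitcode -lock=false -input=false >/dev/null 2>&1
#   status=$(classify_plan_exit "$?")
#   [ "$status" = "drift" ] && echo "alert Slack here"
```

Treating every non-zero, non-2 status as `error` keeps a failed `init` or an expired credential from being reported as drift.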

## Step 5 — driftctl coverage scan

```bash
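# The region comes from the AWS environment; --to selects the provider (aws+tf)
AWS_REGION=us-east-1 driftctl scan \
  --from tfstate+s3://tf-state-bucket/voice/prod \
  --to aws+tf \
  --output json://drift.json

jq '.summary' drift.json
# Illustrative output (exact field names depend on the driftctl version):
# { "total_resources": 142, "total_drifted": 3, "total_unmanaged": 7, "coverage": 92 }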
```

driftctl finds *unmanaged* resources too — things created in the console that Terraform doesn't even know exist. Coverage % is your "is IaC complete" KPI.
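To treat coverage as a KPI you need a number and a threshold. The sketch below assumes coverage means the share of scanned resources that Terraform manages, rounded down. Both helper names (`coverage_pct`, `coverage_gate`) are hypothetical, and the step that pulls the counts out of `drift.json` is left to you.

```bash
#!/bin/sh
# coverage_pct MANAGED TOTAL
# Prints the integer percentage of resources under IaC (rounded down).
coverage_pct() {
  managed=$1
  total=$2
  if [ "$total" -eq 0 ]; then
    echo 0
    return
  fi
  echo $(( managed * 100 / total ))
}

# coverage_gate COVERAGE THRESHOLD
# Exits 0 when coverage meets the threshold, non-zero otherwise.
coverage_gate() {
  [ "$1" -ge "$2" ]
}

# Usage sketch:
#   pct=$(coverage_pct 135 142)
#   coverage_gate "$pct" 90 || echo "IaC coverage below 90%: $pct%"
```

A gate like this can fail the nightly job when coverage regresses, so a new console-created resource surfaces as a broken check instead of a number nobody reads.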

## Step 6 — Tag everything with `managed_by` to filter

```hcl
# default_tags belongs inside the AWS provider block
provider "aws" {
  default_tags {
    tags = {
      managed_by = "terraform"
      project    = "voice-agent"
      repo       = "github.com/acme/infra"
    }
  }
}
```

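In driftctl, pass `--filter "Attr.tags.managed_by=='terraform'"` to restrict the report to tagged resources and ignore ones you'll never manage (like AWS-created default VPCs). Filters are JMESPath expressions over `Type`, `Id`, and `Attr`, so confirm the attribute path against the resource types you scan.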

## Step 7 — Auto-create remediation PR

When drift fires:

```bash
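git checkout -b "drift-fix-$(date +%s)"
terraform plan -no-color -out=plan.out
# A PR needs at least one commit; an empty one works as a placeholder
git commit --allow-empty -m "Track drift in voice-prod"
git push -u origin HEAD
gh pr create --title "Drift fix: voice-prod" \
  --body "Detected drift on $(date). Plan:
$(terraform show -no-color plan.out)"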
```

The PR shows the drift, Atlantis re-plans, reviewer accepts.
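Plan output for a large workspace can exceed the size GitHub accepts for a PR body (the API rejects bodies over 65536 characters), so an automated PR step may want to cap it. A minimal sketch, assuming a hypothetical helper named `truncate_body`; the 60000 limit in the usage note is an arbitrary safety margin.

```bash
#!/bin/sh
# truncate_body MAX
# Reads text on stdin and prints at most MAX bytes of it, appending a marker
# line when something was cut. Length is measured with ${#var}, which matches
# the byte count for ASCII output (use -no-color to keep the plan mostly ASCII).
truncate_body() {
  max=$1
  input=$(cat)
  if [ "${#input}" -le "$max" ]; then
    printf '%s\n' "$input"
  else
    printf '%s' "$input" | head -c "$max"
    printf '\n[plan output truncated]\n'
  fi
}

# Usage sketch (needs terraform and gh, so it is left commented out):
#   terraform show -no-color plan.out | truncate_body 60000 > body.md
#   gh pr create --title "Drift fix: voice-prod" --body-file body.md
```

Writing the body to a file and passing `--body-file` also sidesteps shell quoting problems with plan text that contains quotes or backticks.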

## Pitfalls

- **Atlantis as a single point of failure**: run it HA and keep locking in a shared store such as DynamoDB, or you'll have plan races.
- **`automerge: true`** is tempting and dangerous. Disable it.
- **driftctl false positives**: new provider versions add fields that driftctl reports as drift. Pin both the provider and driftctl versions.
- **CronJob without a deadline** can run for hours if Terraform hangs on a lock. Set `activeDeadlineSeconds: 600` on the Job spec.
- **`terraform refresh` modifies state**: use `terraform plan -refresh-only` for pure detection.

## How CallSphere does this in production

CallSphere runs Atlantis in front of every Terraform repo, with a separate repo per HIPAA tenant. Daily `plan -detailed-exitcode` runs in GitHub Actions, and driftctl coverage scans run monthly. We've caught 4 console-poke incidents in 6 months, all fixed within 4 hours of detection. The platform spans 37 agents, 90+ tools, and 115+ DB tables, priced at $149/$499/$1499 with a 14-day [trial](/trial) and a 22% [affiliate](/affiliate) program.

## FAQ

**Q: Atlantis vs Terragrunt vs Spacelift?**
Atlantis is open-source and self-hosted. Terragrunt is a Terraform wrapper for keeping configuration DRY, not a PR-automation tool. Spacelift is SaaS. Pick on infra spend and team size.

**Q: How do I handle "emergency console fix" workflow?**
Document it: console fix → immediate Terraform PR within 24h to absorb the change. Drift detector slacks the responder if PR doesn't land.
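The 24-hour rule is simple enough to check mechanically. A minimal sketch, assuming you record the detection time as a Unix epoch; `pr_overdue` is a hypothetical helper name, and how you look up whether a PR has landed is left out.

```bash
#!/bin/sh
# pr_overdue DETECTED_EPOCH NOW_EPOCH [WINDOW_SECONDS]
# Exits 0 when more than WINDOW_SECONDS (default 86400 = 24h) have passed
# since drift was detected, non-zero otherwise.
pr_overdue() {
  detected=$1
  now=$2
  window=${3:-86400}
  [ $(( now - detected )) -gt "$window" ]
}

# Usage sketch:
#   if pr_overdue "$detected_at" "$(date +%s)"; then
#     echo "No Terraform PR within 24h of the console fix; ping the responder"
#   fi
```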

**Q: driftctl on k8s resources too?**
driftctl is cloud-only. For k8s use ArgoCD diff or `kubectl diff`.

**Q: Performance?**
`terraform plan` over 200 resources: ~30 s. driftctl scan over a 1000-resource AWS account: ~2 min.

## Sources

- [Open-source Tools for detecting drift in Terraform-managed Infrastructure](https://seifrajhi.github.io/blog/drift-detecting-in-terraform/)
- [Automating Terraform via GitHub Pull Requests with Atlantis — REI Systems](https://www.reisystems.com/automating-terraform-via-github-pull-requests-with-atlantis-for-drift-free-auditable-infrastructure/)
- [snyk/driftctl GitHub](https://github.com/snyk/driftctl)
- [Terraform Atlantis in 2026 Setup, Security & Alternatives — Scalr](https://scalr.com/learning-center/the-ultimate-guide-to-terraform-atlantis-with-terragrunt/)
- [8 Terraform Drift Detection Tools Enterprise Teams Actually Use in 2026 — env zero](https://www.env0.com/blog/8-terraform-drift-detection-tools-enterprise-teams-actually-use-in-2026)

