---
title: "Deploying AI Agents on Kubernetes: Production Architecture"
description: "A hands-on guide to deploying AI agent systems on Kubernetes, covering pod design, autoscaling based on queue depth, GPU scheduling, secrets management, health checks, and production-ready Helm charts for LLM-powered services."
canonical: https://callsphere.ai/blog/deploying-ai-agents-kubernetes-production
category: "Agentic AI"
tags: ["Kubernetes", "AI Deployment", "DevOps", "Infrastructure", "Production", "AI Agents"]
author: "CallSphere Team"
published: 2026-01-11T00:00:00.000Z
updated: 2026-05-07T09:26:54.350Z
---

# Deploying AI Agents on Kubernetes: Production Architecture

> A hands-on guide to deploying AI agent systems on Kubernetes, covering pod design, autoscaling based on queue depth, GPU scheduling, secrets management, health checks, and production-ready Helm charts for LLM-powered services.

## Why Kubernetes for AI Agents

AI agent systems have unique deployment requirements: they make long-running API calls (30-120 seconds), consume variable memory depending on context window size, need access to external secrets (API keys), and benefit from horizontal scaling based on queue depth rather than CPU utilization. Kubernetes handles all of these requirements with its declarative resource management, autoscaling primitives, and secret management.

## Architecture Overview

A production AI agent deployment on Kubernetes typically has four components:

```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions
build plus test"]
    REG[("Container registry
GHCR or ECR")]
    HELM["Helm chart
values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment
rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA
CPU and queue depth"]
    POD[("Inference pods
GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```

```
                 Ingress (nginx/traefik)
                        |
                   API Gateway
                   /    |    \
          Agent API  Worker Pods  Vector DB
              |          |           |
           Redis     Redis Queue   Qdrant/
          (cache)    (task queue)   Weaviate
```

- **Agent API**: Handles HTTP requests, enqueues tasks, returns results
- **Worker Pods**: Process agent tasks from the queue (LLM calls, tool execution)
- **Vector DB**: Serves retrieval queries for RAG pipelines
- **Redis**: Shared cache and task queue

## Pod Design for AI Agents

### The Agent Worker Pod

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
        - name: worker
          image: myregistry/agent-worker:v1.2.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-api-keys
                  key: anthropic-key
            - name: REDIS_URL
              value: "redis://redis-master:6379/0"
            - name: WORKER_CONCURRENCY
              value: "4"
            - name: MAX_CONTEXT_TOKENS
              value: "100000"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
      terminationGracePeriodSeconds: 120
```

Key design decisions:

- **Memory limits at 2Gi**: Agent workers need memory for conversation context, tool results, and parsed documents. 2Gi handles most workloads.
- **terminationGracePeriodSeconds: 120**: Agent tasks can run for minutes. Give pods enough time to finish current work before shutdown.
- **Startup probe with high failure threshold**: The worker may need time to load models or establish connections.

### Health Check Implementation

```python
from fastapi import FastAPI
import asyncio

app = FastAPI()
worker_healthy = True
worker_ready = False

@app.get("/health")
async def health():
    if not worker_healthy:
        return {"status": "unhealthy"}, 503
    return {"status": "healthy"}

@app.get("/ready")
async def ready():
    if not worker_ready:
        return {"status": "not ready"}, 503
    return {
        "status": "ready",
        "active_tasks": task_counter.value,
        "queue_depth": await get_queue_depth(),
    }

@app.on_event("startup")
async def startup():
    global worker_ready
    # Verify LLM API connectivity
    try:
        await test_llm_connection()
        await test_redis_connection()
        worker_ready = True
    except Exception as e:
        logger.error("startup_failed", error=str(e))
```

## Autoscaling AI Agent Workers

Standard CPU-based autoscaling does not work for AI agents. Workers spend most of their time waiting for LLM API responses (I/O bound), so CPU stays low even when the system is overloaded. Scale based on queue depth instead.

### KEDA (Kubernetes Event-Driven Autoscaling)

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: agent-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    - type: redis
      metadata:
        address: redis-master:6379
        listName: agent:task_queue
        listLength: "5"  # Scale up when >5 tasks per worker
        activationListLength: "1"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        # agent_task_duration_seconds is a histogram, so derive the P95
        # from its buckets rather than reading a quantile label
        query: |
          histogram_quantile(0.95, sum(rate(agent_task_duration_seconds_bucket[5m])) by (le)) > bool 30
        threshold: "1"
```

This configuration scales workers when:

- The Redis task queue exceeds 5 items per worker (primary trigger)
- The P95 task duration exceeds 30 seconds (indicating overload)

### Scaling Considerations

| Factor | Recommendation |
| --- | --- |
| Min replicas | 2 (high availability) |
| Max replicas | Based on LLM API rate limits |
| Scale-up speed | Aggressive (15s polling) |
| Scale-down speed | Conservative (300s cooldown) |
| Tasks per worker | 3-5 concurrent (I/O bound) |
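
The "max replicas from rate limits" row can be made concrete with back-of-the-envelope arithmetic: each concurrent task slot issues roughly one LLM request per task duration, so the provider's request budget caps the worker count. A sizing sketch (all numbers are illustrative assumptions, not real provider quotas):

```python
# Rough sizing: cap KEDA's maxReplicaCount by the LLM provider rate limit.

def max_workers(requests_per_minute_limit: int,
                avg_task_duration_s: float,
                concurrency_per_worker: int,
                safety_factor: float = 0.8) -> int:
    """Largest worker count the rate limit supports before 429s start."""
    # Each concurrent slot completes ~(60 / duration) tasks per minute,
    # and each task issues roughly one LLM request.
    requests_per_min_per_slot = 60 / avg_task_duration_s
    per_worker = concurrency_per_worker * requests_per_min_per_slot
    return int((requests_per_minute_limit * safety_factor) // per_worker)

# Example: 1000 req/min limit, 30s average task, 4 concurrent tasks per worker
print(max_workers(1000, 30.0, 4))  # -> 100
```

Multi-step agent tasks make several LLM calls each, so divide the result by the average calls per task before setting `maxReplicaCount`.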

## Secrets Management

Never put API keys in environment variables directly in manifests. Use Kubernetes Secrets with an external secrets operator:

```yaml
# Using External Secrets Operator with AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: llm-api-keys
  namespace: ai-agents
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: llm-api-keys
    creationPolicy: Owner
  data:
    - secretKey: anthropic-key
      remoteRef:
        key: /production/ai-agents/anthropic-api-key
    - secretKey: openai-key
      remoteRef:
        key: /production/ai-agents/openai-api-key
```

## Persistent Storage for Agent State

Agents that maintain conversation history or checkpoint state need persistent storage:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: agent-checkpoints
  namespace: ai-agents
spec:
  accessModes:
    - ReadWriteMany  # Multiple workers need access
  storageClassName: efs-sc  # EFS for shared access
  resources:
    requests:
      storage: 50Gi
```

For most production systems, using Redis or PostgreSQL for agent state is preferable to filesystem storage:

```yaml
# Redis for agent state and caching
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-master
  namespace: ai-agents
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - "--maxmemory"
            - "1gb"
            - "--maxmemory-policy"
            - "allkeys-lru"
            - "--appendonly"
            - "yes"
          resources:
            requests:
              memory: "1Gi"
              cpu: "250m"
            limits:
              memory: "1.5Gi"
          volumeMounts:
            - name: redis-data
              mountPath: /data
      volumes:
        - name: redis-data
          persistentVolumeClaim:
            claimName: redis-pvc
```

## Network Policies

Restrict agent pods to only communicate with necessary services:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-worker-policy
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: agent-worker
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: agent-api
      ports:
        - port: 8080
  egress:
    # Allow Redis
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - port: 6379
    # Allow external LLM APIs (HTTPS)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - port: 443
          protocol: TCP
    # Allow DNS
    - to: []
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```

## Monitoring and Alerting

Deploy Prometheus ServiceMonitor and Grafana dashboards:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-worker-monitor
  namespace: ai-agents
spec:
  selector:
    matchLabels:
      app: agent-worker
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```

Key metrics to expose from your agent workers:

```python
from prometheus_client import Counter, Histogram, Gauge

# Task metrics
tasks_processed = Counter("agent_tasks_total", "Total tasks processed",
                          ["status", "model"])
task_duration = Histogram("agent_task_duration_seconds", "Task duration",
                          buckets=[1, 5, 10, 30, 60, 120, 300])
active_tasks = Gauge("agent_active_tasks", "Currently running tasks")

# LLM metrics
llm_requests = Counter("llm_requests_total", "LLM API calls",
                        ["model", "status"])
llm_tokens = Counter("llm_tokens_total", "Tokens used",
                      ["model", "direction"])  # input/output
llm_latency = Histogram("llm_request_duration_seconds", "LLM call latency",
                         ["model"])

# Cost metrics
llm_cost = Counter("llm_cost_dollars_total", "Estimated LLM cost",
                    ["model"])
```
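
The `llm_cost` counter needs a dollar estimate per call, derived from the token counts the provider returns. A sketch of that calculation (the per-million-token prices are illustrative assumptions, not current rates; the result would feed `llm_cost.labels(model=model).inc(...)`):

```python
# Illustrative per-million-token prices; real prices vary by model and change
# over time, so keep this table in config, not code.
PRICE_PER_MTOK = {
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar estimate for one LLM call, given its token usage."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a call with 10k input and 2k output tokens
print(round(estimate_cost("claude-sonnet", 10_000, 2_000), 4))  # -> 0.06
```

Summing this counter by model in Grafana gives a live spend dashboard, which is usually the first alert stakeholders ask for.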

## Graceful Shutdown

When Kubernetes terminates a pod (during scaling, updates, or node drain), the worker must finish its current task:

```python
import signal
import asyncio

shutdown_event = asyncio.Event()

def handle_shutdown(signum, frame):
    logger.info("Received shutdown signal, finishing current tasks...")
    shutdown_event.set()

signal.signal(signal.SIGTERM, handle_shutdown)

async def worker_loop():
    while not shutdown_event.is_set():
        task = await get_task_from_queue(timeout=5)
        if task:
            active_tasks.inc()
            try:
                await process_task(task)
                tasks_processed.labels(status="success", model=task.model).inc()
            except Exception as e:
                tasks_processed.labels(status="error", model=task.model).inc()
                await requeue_task(task)  # Put it back for another worker
            finally:
                active_tasks.dec()

    logger.info("Worker shutdown complete")
```

## Key Takeaways

Deploying AI agents on Kubernetes is fundamentally about adapting Kubernetes primitives to the unique characteristics of LLM workloads: I/O-bound processing, long task durations, variable memory usage, and queue-based scaling. The patterns covered here (KEDA-based autoscaling, generous termination grace periods, queue-depth triggers, and LLM-specific health checks) form the foundation of a production-ready deployment.

