---
title: "Kubernetes Operators for AI Agents: Custom Controllers for Agent Lifecycle Management"
description: "Build a Kubernetes Operator for AI agent lifecycle management using Custom Resource Definitions, reconciliation loops, and status management to automate agent provisioning and scaling."
canonical: https://callsphere.ai/blog/kubernetes-operators-ai-agents-custom-controllers-lifecycle-management
category: "Learn Agentic AI"
tags: ["Kubernetes Operators", "CRD", "AI Agents", "Custom Controllers", "Automation"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.066Z
---

# Kubernetes Operators for AI Agents: Custom Controllers for Agent Lifecycle Management

> Build a Kubernetes Operator for AI agent lifecycle management using Custom Resource Definitions, reconciliation loops, and status management to automate agent provisioning and scaling.

## What Is a Kubernetes Operator

A Kubernetes Operator extends the Kubernetes API with custom resources and controllers that encode domain-specific operational knowledge. Instead of manually creating Deployments, Services, ConfigMaps, and HPAs for each AI agent, you define an `AIAgent` custom resource and let the Operator reconcile all the underlying infrastructure automatically.

This transforms agent deployment from "create six YAML files and apply them in the right order" to "declare what agent you want and let the Operator handle the rest."
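The idea behind reconciliation can be sketched in a few lines of plain Python. This is a toy model, not Kopf or Kubernetes code: `AgentState` and `reconcile` are hypothetical names, and the "actions" are just strings standing in for API calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentState:
    """Hypothetical, simplified view of one agent's infrastructure."""
    image: str
    replicas: int


def reconcile(desired: AgentState, actual: Optional[AgentState]) -> list:
    """Return the actions needed to move `actual` toward `desired`."""
    if actual is None:
        # Nothing exists yet: create every child resource.
        return ["create configmap", "create deployment", "create service"]
    actions = []
    if actual.image != desired.image:
        actions.append(f"set image to {desired.image}")
    if actual.replicas != desired.replicas:
        actions.append(f"scale from {actual.replicas} to {desired.replicas}")
    return actions


desired = AgentState(image="myregistry/support-agent:2.0.0", replicas=3)
print(reconcile(desired, None))                                        # first run
print(reconcile(desired, AgentState("myregistry/support-agent:1.9.0", 3)))
print(reconcile(desired, desired))                                     # converged
```

A real controller runs this comparison every time the resource changes, and the last call shows the property that matters: once actual state matches desired state, there is nothing left to do.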

## Custom Resource Definition (CRD)

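First, define what an AIAgent resource looks like. The diagram shows a typical delivery path for an agent and the child resources (Deployment, Service, HPA) that sit around its pods; the CRD after it declares the fields of the custom resource: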

```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions
build plus test"]
    REG[("Container registry
GHCR or ECR")]
    HELM["Helm chart
values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment
rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA
CPU and queue depth"]
    POD[("Inference pods
GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```

```yaml
# crd-aiagent.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: aiagents.ai.example.com
spec:
  group: ai.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              # image is required because the operator reads spec.image unconditionally
              required: ["model", "replicas", "image"]
              properties:
                model:
                  type: string
                  description: "LLM model to use"
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                temperature:
                  type: number
                  default: 0.7
                maxTokens:
                  type: integer
                  default: 4096
                image:
                  type: string
                tools:
                  type: array
                  items:
                    type: string
                autoscaling:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: false
                    minReplicas:
                      type: integer
                    maxReplicas:
                      type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                readyReplicas:
                  type: integer
                lastUpdated:
                  type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Model
          type: string
          jsonPath: .spec.model
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Phase
          type: string
          jsonPath: .status.phase
  scope: Namespaced
  names:
    plural: aiagents
    singular: aiagent
    kind: AIAgent
    shortNames:
      - aia
```

Apply the CRD with `kubectl apply -f crd-aiagent.yaml`. After that, you can create AIAgent resources:

```yaml
# my-support-agent.yaml
apiVersion: ai.example.com/v1alpha1
kind: AIAgent
metadata:
  name: support-agent
  namespace: ai-agents
spec:
  model: "gpt-4o"
  replicas: 3
  temperature: 0.5
  maxTokens: 2048
  image: "myregistry/support-agent:2.0.0"
  tools:
    - "knowledge-base-search"
    - "ticket-creator"
    - "calendar-lookup"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 15
```

## Building the Operator in Python with Kopf

Kopf is a Python framework for building Kubernetes Operators. It handles watch streams, retry logic, and status updates.

```python
# operator.py
import kopf
import kubernetes
from kubernetes import client

@kopf.on.create("ai.example.com", "v1alpha1", "aiagents")
def create_agent(spec, name, namespace, patch, logger, **kwargs):
    """Create the child resources when a new AIAgent appears."""
    # Plain `def` on purpose: Kopf runs sync handlers in a thread pool, so the
    # blocking kubernetes client calls below do not stall its event loop.
    logger.info(f"Creating AI agent: {name}")

    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Create ConfigMap with agent settings
    configmap = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(
            name=f"{name}-config",
            namespace=namespace,
        ),
        data={
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        },
    )
    kopf.adopt(configmap)  # add an owner reference so deletion cascades
    core_v1.create_namespaced_config_map(namespace, configmap)

    # Create Deployment
    deployment = build_deployment(name, namespace, spec)
    kopf.adopt(deployment)
    apps_v1.create_namespaced_deployment(namespace, deployment)

    # Create Service
    service = build_service(name, namespace, spec)
    kopf.adopt(service)
    core_v1.create_namespaced_service(namespace, service)

    # Set status through `patch`: a handler's return value is stored under
    # status.create_agent, not in the top-level fields the CRD declares.
    # The phase starts as "Pending" because no replica is ready yet.
    patch.status["phase"] = "Pending"
    patch.status["readyReplicas"] = 0

def build_deployment(name: str, namespace: str, spec: dict):
    """Build a Deployment object from AIAgent spec."""
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(
            name=name,
            namespace=namespace,
        ),
        spec=client.V1DeploymentSpec(
            replicas=spec.get("replicas", 1),
            selector=client.V1LabelSelector(
                match_labels={"aiagent": name}
            ),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(
                    labels={"aiagent": name}
                ),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="agent",
                            image=spec["image"],
                            ports=[client.V1ContainerPort(
                                container_port=8000
                            )],
                            env_from=[
                                client.V1EnvFromSource(
                                    config_map_ref=client.V1ConfigMapEnvSource(
                                        name=f"{name}-config"
                                    )
                                )
                            ],
                        )
                    ]
                ),
            ),
        ),
    )

def build_service(name: str, namespace: str, spec: dict):
    return client.V1Service(
        metadata=client.V1ObjectMeta(
            name=f"{name}-svc",
            namespace=namespace,
        ),
        spec=client.V1ServiceSpec(
            selector={"aiagent": name},
            ports=[client.V1ServicePort(
                port=80, target_port=8000
            )],
        ),
    )
```
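The CRD declares an `autoscaling` block, but the handlers above never act on it. One way to close that gap is sketched below as a plain manifest builder. `build_hpa`, the `-hpa` name suffix, and the 70% CPU target are assumptions made for illustration; none of them come from the CRD.

```python
from typing import Optional

TARGET_CPU_PERCENT = 70  # hypothetical target; the CRD has no field for it


def build_hpa(name: str, namespace: str, spec: dict) -> Optional[dict]:
    """Return an autoscaling/v2 HPA manifest, or None if autoscaling is off."""
    autoscaling = spec.get("autoscaling") or {}
    if not autoscaling.get("enabled", False):
        return None

    replicas = spec.get("replicas", 1)
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{name}-hpa", "namespace": namespace},
        "spec": {
            # Point the HPA at the Deployment the operator creates.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            # Fall back to spec.replicas when the bounds are omitted.
            "minReplicas": autoscaling.get("minReplicas", replicas),
            "maxReplicas": autoscaling.get("maxReplicas", replicas),
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": TARGET_CPU_PERCENT,
                    },
                },
            }],
        },
    }
```

In the create handler, the returned dict could be passed through `kopf.adopt()` and then created with the client library's autoscaling API, just like the other children. Two caveats apply: CPU utilization targets only work when the container declares CPU requests, which `build_deployment` does not do, and once an HPA manages the Deployment the operator should stop writing `spec.replicas` on updates, or the two controllers will fight over the replica count.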

## Handling Updates with the Reconciliation Loop

When someone changes the AIAgent spec, the Operator detects the diff and updates resources:

```python
@kopf.on.update("ai.example.com", "v1alpha1", "aiagents")
def update_agent(spec, name, namespace, diff, patch, logger, **kwargs):
    """Reconcile when an AIAgent spec changes."""
    # Plain `def`: the kubernetes client calls below are blocking.
    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Each diff item is a 4-tuple: (operation, field path, old value, new value)
    for op, field, old_val, new_val in diff:
        logger.info(f"Field {op}: {field} from {old_val} to {new_val}")

    # Update ConfigMap (same keys as in the create handler, including TOOLS)
    configmap_patch = {
        "data": {
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        }
    }
    core_v1.patch_namespaced_config_map(
        f"{name}-config", namespace, configmap_patch
    )

    # Update Deployment replicas and image
    deployment_patch = {
        "spec": {
            "replicas": spec.get("replicas", 1),
            "template": {
                "spec": {
                    "containers": [{
                        "name": "agent",
                        "image": spec["image"],
                    }]
                }
            }
        }
    }
    apps_v1.patch_namespaced_deployment(
        name, namespace, deployment_patch
    )

    # Write to status.phase directly; a return value would land under
    # status.update_agent instead of the field the CRD declares.
    patch.status["phase"] = "Updating"
```
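One caveat with the update path: environment variables injected through `envFrom` are read when a container starts, so patching the ConfigMap alone does not change anything in pods that are already running. A common workaround is to stamp a hash of the config data onto the pod template so that a config change also triggers a rolling update. The sketch below assumes a hypothetical annotation key, `ai.example.com/config-checksum`, and hypothetical helper names.

```python
import copy
import hashlib
import json

CHECKSUM_ANNOTATION = "ai.example.com/config-checksum"  # hypothetical key


def config_checksum(data: dict) -> str:
    """Stable SHA-256 digest of the ConfigMap data (key order does not matter)."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def with_config_checksum(deployment_patch: dict, config_data: dict) -> dict:
    """Return a copy of the patch whose pod template carries the checksum."""
    patched = copy.deepcopy(deployment_patch)
    template = patched.setdefault("spec", {}).setdefault("template", {})
    annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
    # Any change to the pod template, including an annotation, starts a rollout.
    annotations[CHECKSUM_ANNOTATION] = config_checksum(config_data)
    return patched


patch_body = with_config_checksum(
    {"spec": {"replicas": 3}},
    {"MODEL_NAME": "gpt-4o", "TEMPERATURE": "0.5"},
)
print(json.dumps(patch_body, indent=2))
```

In `update_agent`, the Deployment patch could be wrapped as `with_config_checksum(deployment_patch, configmap_patch["data"])` before it is sent. When only `replicas` changes, the checksum stays the same and no restart happens.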

## Status Management

Update the custom resource status to reflect the actual state:

```python
from datetime import datetime, timezone

@kopf.timer("ai.example.com", "v1alpha1", "aiagents", interval=30)
def monitor_agent(spec, name, namespace, patch, logger, **kwargs):
    """Periodically check agent health and update status."""
    # Plain `def`: the kubernetes client call below is blocking.
    apps_v1 = client.AppsV1Api()

    try:
        deployment = apps_v1.read_namespaced_deployment(name, namespace)
        ready = deployment.status.ready_replicas or 0
        desired = deployment.spec.replicas

        phase = "Running" if ready == desired else "Scaling"

        patch.status["readyReplicas"] = ready
        patch.status["phase"] = phase
        # Record the actual check time (UTC) rather than a fixed string
        patch.status["lastUpdated"] = datetime.now(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
    except kubernetes.client.exceptions.ApiException as e:
        patch.status["phase"] = "Error"
        logger.error(f"Failed to read deployment: {e}")
```
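The CRD also declares a `conditions` array, which the timer above leaves empty. A minimal sketch of maintaining it follows. The helper name `set_condition` and the `Ready` condition type are assumptions, and only the three fields the schema defines (`type`, `status`, `message`) are written.

```python
def set_condition(conditions: list, cond_type: str, status: str, message: str) -> list:
    """Return a new conditions list with `cond_type` added or replaced."""
    updated = [c for c in conditions if c.get("type") != cond_type]
    updated.append({"type": cond_type, "status": status, "message": message})
    return updated


def ready_condition(conditions: list, ready: int, desired: int) -> list:
    """Derive a hypothetical 'Ready' condition from replica counts."""
    is_ready = desired > 0 and ready == desired
    return set_condition(
        conditions,
        "Ready",
        "True" if is_ready else "False",  # condition status is a string
        f"{ready}/{desired} replicas ready",
    )


conditions = ready_condition([], ready=1, desired=3)
conditions = ready_condition(conditions, ready=3, desired=3)
print(conditions)
```

Inside `monitor_agent`, this could be used as `patch.status["conditions"] = ready_condition(list(status.get("conditions", [])), ready, desired)`, with `status` added to the handler's parameters. The helper returns the whole list because a merge patch replaces arrays wholesale rather than merging individual items.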

## Using the Operator

Once deployed, managing agents becomes declarative:

```bash
# Create an agent
kubectl apply -f my-support-agent.yaml

# List all agents
kubectl get aiagents -n ai-agents

# Scale an agent (edit the spec)
kubectl patch aiagent support-agent -n ai-agents \
  --type=merge -p '{"spec": {"replicas": 5}}'

# Delete an agent (cleans up all child resources)
kubectl delete aiagent support-agent -n ai-agents
```

## FAQ

### When should I build an Operator versus using Helm charts?

Use Helm when your deployment is a one-time packaging problem, that is, when you only need to template and parameterize YAML. Build an Operator when you need ongoing lifecycle management: automatic scaling adjustments, health monitoring, backup scheduling, or coordinated multi-resource updates that respond to runtime conditions. Operators encode operational knowledge that Helm charts cannot express.

### How do I test a Kubernetes Operator locally?

Use kind (Kubernetes in Docker) or minikube to run a local cluster. Kopf can run outside the cluster with `kopf run operator.py`, which connects using your current kubeconfig context. Write integration tests that create custom resources and assert that the expected child resources appear, for example with pytest and the kubernetes client library to verify Deployment, Service, and ConfigMap creation.
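Logic that never touches the API server can be unit-tested without a cluster at all. The sketch below assumes the ConfigMap data is factored out of the handlers into a hypothetical helper, `config_data`, and tests it pytest-style.

```python
def config_data(spec: dict) -> dict:
    """Build ConfigMap data from an AIAgent spec (hypothetical helper)."""
    return {
        "MODEL_NAME": spec.get("model", "gpt-4o"),
        "TEMPERATURE": str(spec.get("temperature", 0.7)),
        "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
        "TOOLS": ",".join(spec.get("tools", [])),
    }


def test_config_data_applies_defaults():
    data = config_data({"model": "gpt-4o"})
    assert data["TEMPERATURE"] == "0.7"
    assert data["MAX_TOKENS"] == "4096"
    assert data["TOOLS"] == ""


def test_config_data_joins_tools_and_stringifies_values():
    data = config_data({
        "model": "gpt-4o",
        "temperature": 0.5,
        "maxTokens": 2048,
        "tools": ["knowledge-base-search", "ticket-creator"],
    })
    assert data["TEMPERATURE"] == "0.5"
    assert data["MAX_TOKENS"] == "2048"
    assert data["TOOLS"] == "knowledge-base-search,ticket-creator"
    # ConfigMap values must all be strings.
    assert all(isinstance(value, str) for value in data.values())
```

Run it with `pytest` from the directory containing the file. Sharing one helper between the create and update handlers also keeps their ConfigMap keys from drifting apart.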

### What happens to child resources when the custom resource is deleted?

When you call `kopf.adopt()` on a child resource before creating it, Kopf adds an owner reference to the child's metadata that points at the parent AIAgent. Deleting the parent then lets the Kubernetes garbage collector remove all owned Deployments, Services, and ConfigMaps automatically, which prevents orphaned resources. Without adoption, you must handle cleanup manually in a `@kopf.on.delete` handler.
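For reference, an owner reference is a small entry in the child's `metadata.ownerReferences`. The sketch below shows roughly what such an entry looks like when built by hand. `owner_reference` is a hypothetical helper, the UID is a made-up placeholder, and the `controller` and `blockOwnerDeletion` values are illustrative choices rather than a statement of what Kopf sets.

```python
def owner_reference(parent: dict) -> dict:
    """Build an ownerReferences entry pointing at `parent` (illustrative)."""
    return {
        "apiVersion": parent["apiVersion"],
        "kind": parent["kind"],
        "name": parent["metadata"]["name"],
        "uid": parent["metadata"]["uid"],  # ties the child to this exact object
        "controller": True,                # illustrative choice
        "blockOwnerDeletion": True,        # illustrative choice
    }


parent = {
    "apiVersion": "ai.example.com/v1alpha1",
    "kind": "AIAgent",
    "metadata": {
        "name": "support-agent",
        "namespace": "ai-agents",
        "uid": "00000000-0000-0000-0000-000000000000",  # placeholder UID
    },
}

child_metadata = {
    "name": "support-agent-config",
    "namespace": parent["metadata"]["namespace"],  # must match the owner's
    "ownerReferences": [owner_reference(parent)],
}
print(child_metadata)
```

The reference is keyed by UID, so deleting and recreating an AIAgent with the same name does not re-attach old children. A namespaced child can only be owned by an object in the same namespace or by a cluster-scoped object.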

---

Source: https://callsphere.ai/blog/kubernetes-operators-ai-agents-custom-controllers-lifecycle-management
