Kubernetes for AI Agents: Scaling Agent Workloads with K8s

Deploy and scale AI agent services on Kubernetes with Deployments, Services, Horizontal Pod Autoscalers, resource limits, and health checks for production-grade reliability.

Why Kubernetes for AI Agent Workloads

A single FastAPI container running your AI agent is enough for a demo or a handful of users. Production workloads need horizontal scaling, automatic recovery from crashes, rolling updates without downtime, and resource isolation. Kubernetes provides all of this through declarative configuration: you describe the desired state, and K8s continuously reconciles the cluster to match it.

AI agents present unique scaling challenges: requests are long-running (seconds to minutes per LLM call), memory usage spikes with large context windows, and traffic patterns are bursty. Kubernetes gives you the primitives to handle all of these.

The Deployment Manifest

A Deployment defines how many replicas of your agent pod to run and how to update them:

flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
  namespace: ai-agents
  labels:
    app: agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: agent-service
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent-service:1.0.0
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: AGENT_MODEL
              value: "gpt-4o"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

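Key decisions here: resource requests are what the scheduler uses to place pods, so each pod is guaranteed at least 250m of CPU and 512Mi of memory on its node. Limits cap what a single agent can consume during a large context window request: CPU usage above the limit is throttled, and a container that exceeds its memory limit is OOM-killed and restarted. With replicas: 3, maxUnavailable: 1, and maxSurge: 1, a rollout keeps at least 2 pods available and never runs more than 4 at once.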
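
To make the scheduling effect of requests concrete, here is a minimal sketch that estimates how many agent pods fit on one node based purely on their requests. The node size (4 CPUs, 16Gi allocatable) and the function names are assumptions for illustration, and the calculation ignores system pods, DaemonSets, and reserved capacity.

```python
# Rough estimate of pods per node based on resource requests only.
# Handles just the quantity formats used in the manifest above
# ("250m", "1", "512Mi", "1Gi"); real Kubernetes quantities allow more.

_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu(quantity: str) -> int:
    """Return a CPU quantity in millicores ('250m' -> 250, '1' -> 1000)."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Return a memory quantity in bytes for Ki/Mi/Gi suffixes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes


def pods_that_fit(node_cpu: str, node_memory: str,
                  request_cpu: str, request_memory: str) -> int:
    """The scarcer of the two resources decides how many pods fit."""
    by_cpu = parse_cpu(node_cpu) // parse_cpu(request_cpu)
    by_memory = parse_memory(node_memory) // parse_memory(request_memory)
    return min(by_cpu, by_memory)


# Hypothetical node with 4 CPUs and 16Gi allocatable memory.
print(pods_that_fit("4", "16Gi", "250m", "512Mi"))  # 16 (CPU-bound)
```

In this example CPU requests run out first (16 pods) even though memory would allow 32, which is a hint that the CPU request may be the value worth tuning.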

Service for Internal Traffic

A Service gives your agent pods a stable DNS name and load balances traffic:

# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: agent-service
  namespace: ai-agents
spec:
  selector:
    app: agent-service
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  type: ClusterIP

Other services in the cluster reach the agent at http://agent-service.ai-agents.svc.cluster.local.
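
As a small illustration of how a caller inside the cluster might use that DNS name, the sketch below builds the Service URL and probes the /health path using only the standard library. The helper names are hypothetical, and cluster.local is assumed to be the cluster domain (it is the usual default, but clusters can be configured differently).

```python
import urllib.request


def service_url(name: str, namespace: str, port: int = 80, path: str = "/") -> str:
    """Build the in-cluster URL for a Service (assumes the cluster.local domain)."""
    host = f"{name}.{namespace}.svc.cluster.local"
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{path}"


def fetch_health(timeout_s: float = 5.0) -> int:
    """Call the agent's /health endpoint through the Service; returns the HTTP status.

    Only works from inside the cluster, where the Service DNS name resolves.
    """
    url = service_url("agent-service", "ai-agents", path="/health")
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status
```

Callers use port 80 (the Service port); Kubernetes forwards the traffic to port 8000 on the pods via targetPort.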

Secrets Management

Store your API keys in Kubernetes Secrets, not in Deployment manifests:

kubectl create secret generic agent-secrets \
  --namespace ai-agents \
  --from-literal=openai-api-key=sk-proj-your-key-here

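Reference them in your Deployment with valueFrom.secretKeyRef as shown above. The ai-agents namespace must exist before you run this command, and the Secret must exist before the Deployment's pods can start, because a container that references a missing Secret key will not be created.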
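
On the application side, the injected values arrive as ordinary environment variables. The following is a minimal sketch of reading them at startup and failing fast when the key is missing; the function and exception names are hypothetical, and the gpt-4o default simply mirrors the AGENT_MODEL value in the Deployment.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the container."""


def load_agent_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read agent configuration from the environment (or a mapping, for tests)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # Failing at startup surfaces a misconfigured Secret immediately,
        # instead of on the first LLM call.
        raise MissingSecretError(
            "OPENAI_API_KEY is not set; check the agent-secrets Secret"
        )
    return {"api_key": api_key, "model": env.get("AGENT_MODEL", "gpt-4o")}
```

Accepting an optional mapping keeps the function easy to test without touching the real process environment.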

Horizontal Pod Autoscaler

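Scale pods automatically based on CPU utilization (shown here) or custom metrics. CPU utilization is measured as a percentage of each pod's CPU request, and the HPA needs a resource metrics source in the cluster, typically Metrics Server: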

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-service-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

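The scaleDown.stabilizationWindowSeconds: 300 setting prevents thrashing: agent traffic is bursty, and you do not want Kubernetes removing pods only to recreate them a minute later. 300 seconds is also the Kubernetes default for scale-down, so stating it explicitly mainly documents the intent. The scaleUp policy adds at most 2 pods per 60-second period.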
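
The core of the HPA calculation is the documented formula desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with a default tolerance of 10% around the target. The sketch below applies that formula to the values from the manifest above. It is a simplification under stated assumptions: stabilization windows, unready pods, and missing metrics are ignored, and the scale-up policy is reduced to a single "at most N pods added" cap.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,   # average CPU utilization, % of request
    target_utilization: float = 60,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,       # Kubernetes default tolerance
    max_pods_added: int = 2,      # simplified stand-in for the scaleUp policy
) -> int:
    """Simplified HPA decision for a single resource metric."""
    ratio = current_utilization / target_utilization
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Cap how fast we grow, mirroring "2 pods per period" in the manifest.
    if desired > current_replicas:
        desired = min(desired, current_replicas + max_pods_added)
    # Always respect minReplicas / maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 90))    # 5: ceil(3 * 1.5)
print(desired_replicas(3, 150))   # 5: formula says 8, policy caps at +2
print(desired_replicas(3, 63))    # 3: within 10% tolerance
print(desired_replicas(4, 20))    # 2: ceil(4 * 0.33), floor is minReplicas
```

Because utilization is relative to the 250m request, a target of 60 means the HPA aims for roughly 150m of CPU per pod on average.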

Health Check Endpoint Design

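Your /health endpoint should verify the critical dependencies a pod needs to serve traffic, so that a failing readiness probe removes the pod from the Service's endpoints. Be careful when the liveness probe shares this path, as it does in the Deployment above: a Redis outage would then restart every pod rather than just pausing traffic, so a dependency-free liveness endpoint is often the safer choice. A dependency-checking handler looks like this: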

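import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
# Assumption: Redis is reachable at this URL; adjust for your cluster.
redis_client = redis.from_url("redis://redis:6379")

@app.get("/health")
async def health():
    # Collect one status entry per dependency.
    checks = {}
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"

    # Any failing dependency marks the pod as degraded (HTTP 503),
    # which makes the Kubernetes probe fail.
    overall = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    status_code = 200 if overall == "ok" else 503
    return JSONResponse(
        content={"status": overall, "checks": checks},
        status_code=status_code,
    )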
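
One detail worth guarding against: the readiness probe in the Deployment sets no timeoutSeconds, so the Kubernetes default of 1 second applies, and a slow dependency could make the probe time out before the handler answers. The sketch below bounds each check with its own timeout and runs the checks concurrently. It uses only the standard library; run_checks and the check names are hypothetical, and each check is assumed to be an async callable that raises on failure.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Check = Callable[[], Awaitable[object]]


async def _run_one(name: str, check: Check, timeout_s: float) -> Tuple[str, str]:
    """Run one dependency check and map the outcome to a status string."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, "timeout"
    except Exception:
        return name, "down"


async def run_checks(checks: Dict[str, Check], timeout_s: float = 0.5) -> Tuple[int, dict]:
    """Run all checks concurrently; return (HTTP status code, response body)."""
    pairs = await asyncio.gather(
        *(_run_one(name, check, timeout_s) for name, check in checks.items())
    )
    results = dict(pairs)
    overall = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return (200 if overall == "ok" else 503), {"status": overall, "checks": results}
```

Inside the FastAPI handler this could be called as run_checks({"redis": redis_client.ping}), with the returned status code and body passed to JSONResponse. Keeping the per-check timeout well under the probe timeout means the endpoint reports "timeout" itself instead of leaving the kubelet to give up.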

Applying the Manifests

kubectl create namespace ai-agents

kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml

# Watch the rollout
kubectl rollout status deployment/agent-service -n ai-agents

# Check pods
kubectl get pods -n ai-agents -l app=agent-service

FAQ

How should I set resource limits for AI agent pods?

Start by profiling your agent under realistic load. Most Python-based agents with FastAPI use 200-500 MB of RAM at baseline. Set memory requests at your p50 usage and limits at your p99. For CPU, LLM-backed agents are I/O-bound, so 250m-500m CPU request is typically sufficient. Monitor with kubectl top pods and adjust based on actual usage patterns.
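
The p50/p99 rule can be turned into numbers with a few lines of code. The sketch below uses the nearest-rank percentile on a list of memory samples in MiB; the sample values are made up for illustration, and in practice they would come from your monitoring system or repeated kubectl top pods readings.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_memory(samples_mib: Sequence[float]) -> dict:
    """Request at p50, limit at p99, following the rule of thumb above."""
    return {
        "request_mib": percentile(samples_mib, 50),
        "limit_mib": percentile(samples_mib, 99),
    }


# Hypothetical per-pod memory samples (MiB) collected under load.
samples = [310, 320, 330, 340, 350, 360, 400, 450, 600, 900]
print(suggest_memory(samples))  # {'request_mib': 350, 'limit_mib': 900}
```

With only a handful of samples the p99 is simply the maximum observed value, so collect enough data to cover your largest realistic context windows before trusting the limit.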

What happens to in-flight agent requests during a rolling update?

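When a pod is replaced, Kubernetes marks it as terminating, removes it from the Service's endpoints, and sends SIGTERM to its containers. It then waits for terminationGracePeriodSeconds (default 30 seconds) before forcefully killing them with SIGKILL. Handle SIGTERM in your FastAPI app by completing in-flight requests and rejecting new ones. Set terminationGracePeriodSeconds in the pod spec longer than your maximum expected agent response time to prevent dropped requests; the 30-second default is too short for agent calls that run for minutes.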
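
The "finish in-flight work, reject new work" pattern boils down to a small piece of bookkeeping. The sketch below shows it with plain asyncio; DrainGate and the demo handler are hypothetical names, and the SIGTERM wiring is left out because ASGI servers such as Uvicorn typically install their own signal handling. Treat it as an illustration of the idea rather than a drop-in component.

```python
import asyncio


class DrainGate:
    """Counts in-flight requests and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self._in_flight = 0
        self._draining = False
        self._idle = asyncio.Event()
        self._idle.set()  # no requests yet, so we start idle

    def try_enter(self) -> bool:
        """Call at the start of a request; False means 'reject, shutting down'."""
        if self._draining:
            return False
        self._in_flight += 1
        self._idle.clear()
        return True

    def leave(self) -> None:
        """Call when a request finishes, successfully or not."""
        self._in_flight -= 1
        if self._in_flight == 0:
            self._idle.set()

    async def drain(self, timeout_s: float) -> bool:
        """Stop accepting work and wait for in-flight requests to finish."""
        self._draining = True
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=timeout_s)
            return True
        except asyncio.TimeoutError:
            return False  # grace period ran out with work still in flight


async def demo() -> tuple:
    gate = DrainGate()

    async def handle(seconds: float) -> str:
        if not gate.try_enter():
            return "rejected"
        try:
            await asyncio.sleep(seconds)  # stands in for an LLM call
            return "done"
        finally:
            gate.leave()

    task = asyncio.create_task(handle(0.05))
    await asyncio.sleep(0)                # let the request start
    drained = await gate.drain(timeout_s=1.0)  # what a SIGTERM handler would do
    late = await handle(0.0)              # arrives after shutdown began
    return drained, await task, late


print(asyncio.run(demo()))  # (True, 'done', 'rejected')
```

The timeout passed to drain should stay below terminationGracePeriodSeconds so the process can exit cleanly before Kubernetes sends SIGKILL.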

Should I use one pod per agent type or multiplex agents in a single pod?

For most teams, a single service that handles all agent types is simpler to operate. Route to different agent behaviors via a request parameter. Only split into separate Deployments when agent types have significantly different resource profiles — for example, a coding agent that needs 4 GB of RAM versus a simple Q&A agent that needs 512 MB.
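
Routing by request parameter can be as simple as a lookup table. The sketch below shows one way to do it; the agent names and handler functions are placeholders for whatever agent implementations your service exposes.

```python
from typing import Callable, Dict


def qa_agent(prompt: str) -> str:
    """Placeholder for a lightweight Q&A agent."""
    return f"[qa] {prompt}"


def coding_agent(prompt: str) -> str:
    """Placeholder for a heavier coding agent."""
    return f"[coding] {prompt}"


# Registry mapping the request's agent_type parameter to a handler.
AGENTS: Dict[str, Callable[[str], str]] = {
    "qa": qa_agent,
    "coding": coding_agent,
}


def route(agent_type: str, prompt: str) -> str:
    """Dispatch a request to the agent named by agent_type."""
    try:
        handler = AGENTS[agent_type]
    except KeyError:
        raise ValueError(
            f"unknown agent_type {agent_type!r}; expected one of {sorted(AGENTS)}"
        ) from None
    return handler(prompt)


print(route("qa", "What is a Deployment?"))  # [qa] What is a Deployment?
```

If one agent type later needs its own Deployment, the registry entry can be replaced by a thin client that forwards the request to the new Service, leaving callers unchanged.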

