---
title: "Horizontal Pod Autoscaling for AI Agents: Scaling Based on Custom Metrics"
description: "Configure Kubernetes Horizontal Pod Autoscaler for AI agent workloads using CPU, memory, and custom metrics. Learn KEDA integration and scale-to-zero patterns for cost optimization."
canonical: https://callsphere.ai/blog/horizontal-pod-autoscaling-ai-agents-custom-metrics-keda
category: "Learn Agentic AI"
tags: ["Kubernetes", "Autoscaling", "KEDA", "AI Agents", "Cost Optimization"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.070Z
---

# Horizontal Pod Autoscaling for AI Agents: Scaling Based on Custom Metrics

> Configure Kubernetes Horizontal Pod Autoscaler for AI agent workloads using CPU, memory, and custom metrics. Learn KEDA integration and scale-to-zero patterns for cost optimization.

## Why AI Agents Need Autoscaling

AI agent workloads are inherently bursty. A customer support agent might handle 10 requests per minute during quiet hours and 500 during a product launch. Running enough replicas for peak load wastes money during idle periods. Running too few causes timeouts and dropped requests. Horizontal Pod Autoscaling (HPA) dynamically adjusts replica count based on observed metrics.

## Basic HPA with CPU Metrics

The diagram below shows where the HPA sits in a typical delivery pipeline for an agent service. The simplest HPA configuration scales on average CPU utilization across all Pods:

```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions
build plus test"]
    REG[("Container registry
GHCR or ECR")]
    HELM["Helm chart
values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment
rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA
CPU and queue depth"]
    POD[("Inference pods
GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```

```yaml
# ai-agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```

The `behavior` section is critical for AI agents. Scale-up is aggressive — add up to four Pods per minute when load spikes. Scale-down is conservative — remove one Pod every two minutes with a five-minute stabilization window to avoid flapping during variable traffic.
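
To build intuition for how the scale-up policy behaves, here is an illustrative sketch of the rate limit — at most 4 Pods added per 60-second period, clamped to `maxReplicas`. This models the policy's effect, not the actual HPA controller implementation:

```python
def step_scale_up(current: int, desired: int,
                  pods_per_period: int = 4, max_replicas: int = 20) -> int:
    """One scale-up period: move toward `desired`, adding at most
    `pods_per_period` Pods and never exceeding `max_replicas`."""
    allowed = current + pods_per_period
    return min(desired, allowed, max_replicas)

# A spike that needs 12 replicas from a baseline of 2 is reached in steps:
replicas, history = 2, []
while replicas < 12:
    replicas = step_scale_up(replicas, 12)
    history.append(replicas)
# history == [6, 10, 12]: three 60-second periods to reach the target
```

With the conservative scale-down policy, unwinding those same 10 extra Pods takes twenty minutes — which is exactly the asymmetry you want for bursty agent traffic.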

## Custom Metrics with Prometheus

CPU utilization is a poor proxy for AI agent load. A better metric is request queue depth or average response latency. Export custom metrics from your agent:

```python
from prometheus_client import Histogram, Gauge, start_http_server

# Track active agent sessions
active_sessions = Gauge(
    "ai_agent_active_sessions",
    "Number of active agent sessions"
)

# Track response latency
response_latency = Histogram(
    "ai_agent_response_seconds",
    "Time to generate agent response",
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

# Start metrics server on a separate port
start_http_server(9090)
```
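
A hypothetical request handler showing how these two metrics get updated in practice (the handler and metric names here are illustrative, not part of any specific framework):

```python
from prometheus_client import Histogram, Gauge

active_sessions = Gauge("demo_active_sessions", "Active sessions")
response_latency = Histogram("demo_response_seconds", "Response time",
                             buckets=[0.5, 1.0, 2.0, 5.0])

def handle_request(prompt: str) -> str:
    active_sessions.inc()                  # session opens
    try:
        with response_latency.time():      # observes elapsed seconds on exit
            return f"echo: {prompt}"       # stand-in for the real agent call
    finally:
        active_sessions.dec()              # session closes even on error
```

`Histogram.time()` works as both a context manager and a decorator, so the same pattern applies to async handlers wrapped in a sync shim or to whole endpoint functions.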

Configure HPA to use the custom metric via the Prometheus adapter:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa-custom
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: ai_agent_active_sessions
        target:
          type: AverageValue
          averageValue: "10"
```

This configuration maintains an average of 10 active sessions per Pod. When sessions increase, Kubernetes adds replicas. When sessions drop, it removes them.
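
Under the hood, the HPA controller computes the replica count from the ratio of current to target metric value — the documented formula is desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max bounds. A quick sketch:

```python
import math

def desired_replicas(current_replicas: int, current_avg: float,
                     target_avg: float, min_r: int = 2, max_r: int = 20) -> int:
    """Kubernetes HPA core scaling formula, clamped to the HPA's bounds."""
    desired = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_r, min(max_r, desired))

# 4 Pods averaging 15 sessions each against a target of 10 per Pod -> 6 Pods
desired_replicas(4, 15, 10)  # -> 6
```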

## KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queues, databases, and external services. It also supports scale-to-zero, which standard HPA does not.

Install KEDA:

```bash
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```

Create a ScaledObject that scales based on a Redis queue:

```yaml
# ai-agent-keda.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent
  pollingInterval: 10
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: redis
      metadata:
        address: redis-host:6379
        listName: agent-task-queue
        listLength: "5"
        activationListLength: "1"
```

With `minReplicaCount: 0`, the Deployment scales to zero Pods while the queue stays empty, and KEDA reactivates it once the queue length crosses the `activationListLength` threshold. This saves significant cost for agents that handle periodic batch workloads.
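
A simplified model of the scaling math for the Redis list trigger above (the exact activation semantics vary by scaler; this is a sketch, not KEDA's implementation):

```python
import math

def keda_desired_replicas(queue_len: int, target_per_replica: int = 5,
                          activation: int = 0, max_replicas: int = 30) -> int:
    """Approximate replica count for a KEDA list scaler: stay at zero
    until the queue exceeds the activation threshold, then target
    `target_per_replica` items per Pod (listLength: "5" above)."""
    if queue_len <= activation:
        return 0
    return min(max_replicas, math.ceil(queue_len / target_per_replica))
```

For example, a queue of 23 items with a target of 5 per replica yields 5 Pods, and anything past 150 items saturates at the 30-replica cap.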

## Scale-to-Zero Pattern for AI Agents

Scale-to-zero works well for batch agents, but it requires handling both cold starts and graceful shutdown. When KEDA scales the Deployment down, Kubernetes sends SIGTERM, and the agent should stop pulling new work cleanly:

```python
import asyncio
import signal

class GracefulAgent:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self._shutdown)

    def _shutdown(self, signum, frame):
        self.running = False

    async def process_queue(self):
        """Process tasks until shutdown signal."""
        while self.running:
            task = await self.fetch_from_queue(timeout=5)
            if task:
                await self.handle_task(task)

    async def fetch_from_queue(self, timeout: int):
        """Blocking pop from the task queue (e.g. Redis BRPOP with timeout).

        The blocking timeout keeps the loop from busy-spinning when the
        queue is empty; the sleep below is a stand-in for that wait.
        """
        await asyncio.sleep(timeout)
        return None

    async def handle_task(self, task: dict):
        # Agent processing logic
        pass
```
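
The class above stops pulling new work on SIGTERM, but any tasks an agent has buffered locally would be dropped. A self-contained sketch — using `asyncio.Queue` and an `asyncio.Event` as stand-ins for Redis and the signal handler — that finishes buffered work before exiting:

```python
import asyncio

async def drain_until_stopped(queue: asyncio.Queue, stop: asyncio.Event,
                              handle) -> int:
    """Process tasks until a stop is requested AND the queue is empty,
    so in-flight work survives a scale-down."""
    done = 0
    while not (stop.is_set() and queue.empty()):
        try:
            task = await asyncio.wait_for(queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue  # nothing queued; re-check the stop flag
        await handle(task)
        done += 1
    return done
```

Pair this with a `terminationGracePeriodSeconds` on the Pod spec that is longer than your worst-case task duration, or Kubernetes will SIGKILL the Pod mid-drain.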

## FAQ

### What metrics should I use for autoscaling AI agents?

Avoid relying solely on CPU. The best metrics depend on your agent type. For synchronous request-response agents, use request latency (p95) or concurrent connections. For queue-based agents, use queue depth divided by processing rate. For WebSocket-based conversational agents, use active session count. Combine multiple metrics — Kubernetes scales to the highest recommendation from any single metric.
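
For the queue-based case, "queue depth divided by processing rate" translates into a simple sizing rule. A sketch, assuming you know each Pod's sustained throughput (the function and parameter names are illustrative):

```python
import math

def replicas_for_backlog(queue_depth: int, tasks_per_pod_per_sec: float,
                         target_drain_seconds: float,
                         min_r: int = 1, max_r: int = 30) -> int:
    """Replicas needed to drain the current backlog within the target window."""
    needed = math.ceil(queue_depth / (tasks_per_pod_per_sec * target_drain_seconds))
    return max(min_r, min(max_r, needed))

# 600 queued tasks, 2 tasks/sec per Pod, drain within 30s -> 10 replicas
```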

### How do I prevent autoscaling from causing cost overruns?

Set hard `maxReplicas` limits, implement resource quotas at the namespace level, and configure PodDisruptionBudgets. Use cloud provider billing alerts as a safety net. With KEDA, `cooldownPeriod` delays the scale back to zero after the last trigger activity, which prevents flapping between zero and one replica on intermittent traffic.

### What is the cold start time for a scaled-to-zero AI agent?

Cold start includes container pull time, application startup, model loading, and health check passage. For a well-optimized AI agent image without local models, expect 5 to 15 seconds. Pre-pulled images on nodes reduce this to 2 to 5 seconds. If cold start latency is unacceptable, set `minReplicaCount: 1` to keep one warm replica.

---

#Kubernetes #Autoscaling #KEDA #AIAgents #CostOptimization #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/horizontal-pod-autoscaling-ai-agents-custom-metrics-keda
