
Horizontal Pod Autoscaling for AI Agents: Scaling Based on Custom Metrics

Configure Kubernetes Horizontal Pod Autoscaler for AI agent workloads using CPU, memory, and custom metrics. Learn KEDA integration and scale-to-zero patterns for cost optimization.

Why AI Agents Need Autoscaling

AI agent workloads are inherently bursty. A customer support agent might handle 10 requests per minute during quiet hours and 500 during a product launch. Running enough replicas for peak load wastes money during idle periods. Running too few causes timeouts and dropped requests. Horizontal Pod Autoscaling (HPA) dynamically adjusts replica count based on observed metrics.

Basic HPA with CPU Metrics

The simplest HPA scales based on average CPU utilization across all Pods:

# ai-agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

The behavior section is critical for AI agents. Scale-up is aggressive — add up to four Pods per minute when load spikes. Scale-down is conservative — remove one Pod every two minutes with a five-minute stabilization window to avoid flapping during variable traffic.
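Under the hood, the controller computes desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), then clamps the result to the min/max bounds. A minimal sketch of that formula (the function name and clamping helper are illustrative, not part of the Kubernetes API):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Core HPA formula: desired = ceil(current * observed / target),
    clamped to the minReplicas/maxReplicas bounds from the spec."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas averaging 90% CPU against a 60% target -> scale to 6
print(desired_replicas(4, 90, 60))  # 6

# A huge spike still respects maxReplicas
print(desired_replicas(10, 200, 60))  # 20
```

The behavior policies then rate-limit how fast the controller moves toward this computed target.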

Custom Metrics with Prometheus

CPU utilization is a poor proxy for AI agent load. A better metric is request queue depth or average response latency. Export custom metrics from your agent:

from prometheus_client import Histogram, Gauge, start_http_server

# Track active agent sessions
active_sessions = Gauge(
    "ai_agent_active_sessions",
    "Number of active agent sessions"
)

# Track response latency
response_latency = Histogram(
    "ai_agent_response_seconds",
    "Time to generate agent response",
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

# Start metrics server on a separate port
start_http_server(9090)
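What matters as much as defining the gauge is updating it symmetrically: increment on session entry, decrement on exit, even when handling fails, or the value the autoscaler reads drifts upward. A sketch of that discipline, using a hypothetical MiniGauge stand-in so the snippet runs without prometheus_client installed (the real Gauge exposes the same inc()/dec() methods):

```python
class MiniGauge:
    """Hypothetical stand-in mimicking prometheus_client.Gauge's inc()/dec()."""
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1

active_sessions = MiniGauge()

def handle_session(request: str) -> str:
    # Increment on entry, decrement in finally -- even if handling raises,
    # the session count the autoscaler reads always returns to baseline
    active_sessions.inc()
    try:
        return f"handled {request}"
    finally:
        active_sessions.dec()

handle_session("req-1")
print(active_sessions.value)  # 0 -- gauge back to baseline after the request
```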

Configure HPA to use the custom metric via the Prometheus adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa-custom
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: ai_agent_active_sessions
        target:
          type: AverageValue
          averageValue: "10"

This configuration maintains an average of 10 active sessions per Pod. When sessions increase, Kubernetes adds replicas. When sessions drop, it removes them.
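The HPA can only read ai_agent_active_sessions once the Prometheus adapter exposes it through the custom metrics API. A sketch of an adapter discovery rule, assuming the kubernetes-sigs prometheus-adapter; the label matchers and averaging window are illustrative choices:

```yaml
# Excerpt from the prometheus-adapter rules config (illustrative)
rules:
  - seriesQuery: 'ai_agent_active_sessions{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "ai_agent_active_sessions"
      as: "ai_agent_active_sessions"
    metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])'
```

Once the adapter serves the metric, kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 should list it.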

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queues, databases, and external services. It also supports scale-to-zero, which standard HPA does not.


Install KEDA:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Create a ScaledObject that scales based on a Redis queue:

# ai-agent-keda.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent
  pollingInterval: 10
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: redis
      metadata:
        address: redis-host:6379
        listName: agent-task-queue
        listLength: "5"
        activationListLength: "1"

With minReplicaCount: 0, the Deployment scales to zero Pods when the queue is empty, and activates when at least one message appears. This saves significant cost for agents that handle periodic batch workloads.
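The scaler's sizing can be approximated as ceil(queueLength / listLength), gated by the activation threshold and capped at maxReplicaCount. A simplified model (the function name is illustrative, and the real controller also applies HPA behavior policies and the cooldown period):

```python
import math

def keda_redis_replicas(queue_length: int,
                        list_length: int = 5,
                        activation: int = 1,
                        max_replicas: int = 30) -> int:
    """Approximate the redis scaler's target: stay at zero below the
    activation threshold, otherwise aim for listLength items per replica."""
    if queue_length < activation:
        return 0
    return min(max_replicas, math.ceil(queue_length / list_length))

print(keda_redis_replicas(0))    # 0 -- queue empty, scaled to zero
print(keda_redis_replicas(12))   # 3 -- ceil(12 / 5)
print(keda_redis_replicas(500))  # 30 -- capped at maxReplicaCount
```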

Scale-to-Zero Pattern for AI Agents

Scale-to-zero works well for batch agents but requires careful handling of two transitions: cold starts when the first task arrives, and graceful shutdown when KEDA scales the Deployment back down. Handle SIGTERM so in-flight tasks finish before the Pod exits:

import asyncio
import json
import signal

import redis.asyncio as redis

class GracefulAgent:
    def __init__(self, redis_host: str = "redis-host"):
        self.running = True
        self.redis = redis.Redis(host=redis_host, port=6379)
        # Exit the processing loop cleanly when Kubernetes sends SIGTERM
        signal.signal(signal.SIGTERM, self._shutdown)

    def _shutdown(self, signum, frame):
        self.running = False

    async def process_queue(self):
        """Process tasks until shutdown signal."""
        while self.running:
            task = await self.fetch_from_queue(timeout=5)
            if task:
                await self.handle_task(task)

    async def fetch_from_queue(self, timeout: int):
        # BRPOP blocks for up to `timeout` seconds and returns None when the
        # queue stays empty, so the loop re-checks self.running regularly
        item = await self.redis.brpop("agent-task-queue", timeout=timeout)
        return json.loads(item[1]) if item else None

    async def handle_task(self, task: dict):
        # Agent processing logic
        pass
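The SIGTERM handler only helps if Kubernetes gives the Pod enough time to drain before killing it. A Deployment fragment that pairs with the shutdown logic (the grace period and preStop delay are illustrative values to tune for your task durations):

```yaml
# Excerpt from the ai-agent Deployment spec (illustrative values)
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120  # time for in-flight tasks to finish
      containers:
        - name: ai-agent
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]  # let endpoints deregister first
```

Kubernetes sends SIGKILL only after the grace period elapses, so size it above your slowest expected task.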

FAQ

What metrics should I use for autoscaling AI agents?

Avoid relying solely on CPU. The best metrics depend on your agent type. For synchronous request-response agents, use request latency (p95) or concurrent connections. For queue-based agents, use queue depth divided by processing rate. For WebSocket-based conversational agents, use active session count. Combine multiple metrics — Kubernetes scales to the highest recommendation from any single metric.

How do I prevent autoscaling from causing cost overruns?

Set hard maxReplicas limits, implement resource quotas at the namespace level, and configure PodDisruptionBudgets. Use cloud provider billing alerts as a safety net. With KEDA, the cooldownPeriod delays scaling back to zero after the last trigger activity, which prevents flapping between zero and one replica that would otherwise multiply cold starts and Pod churn.

What is the cold start time for a scaled-to-zero AI agent?

Cold start includes container pull time, application startup, model loading, and health check passage. For a well-optimized AI agent image without local models, expect 5 to 15 seconds. Pre-pulled images on nodes reduce this to 2 to 5 seconds. If cold start latency is unacceptable, set minReplicaCount: 1 to keep one warm replica.




Written by

CallSphere Team

