Serverless AI Agents: Running Agents on AWS Lambda and Cloud Functions

Why Serverless for AI Agents

Serverless platforms scale to zero when there is no traffic and scale to thousands of concurrent executions when demand spikes — without you managing a single server. For AI agent workloads with unpredictable traffic patterns, this translates to significant cost savings. You pay only for the milliseconds your agent is actively processing, not for idle pods waiting for requests.

However, serverless introduces constraints that require careful design: cold starts add latency, execution timeouts limit long-running agent tasks, there is no persistent local state, and you cannot maintain WebSocket connections. Understanding these tradeoffs helps you decide which agent workloads belong on Lambda and which need dedicated infrastructure.

When Serverless Works for AI Agents

Serverless is a good fit when your agent: handles simple single-turn queries with response times under 60 seconds, has bursty traffic with quiet periods, does not require persistent in-memory state between requests, and calls external LLM APIs rather than running local models.

flowchart LR
    subgraph IN["Inputs"]
        I1["Monthly call volume"]
        I2["Average deal value"]
        I3["Current answer rate"]
        I4["Receptionist cost<br/>per month"]
    end
    subgraph CALC["CallSphere Captures"]
        C1["Missed calls converted<br/>at 24 by 7 coverage"]
        C2["Receptionist payroll<br/>displaced or freed"]
    end
    subgraph OUT["Outputs"]
        O1["Recovered revenue<br/>per month"]
        O2["Operating cost saved"]
        O3((Net ROI<br/>monthly))
    end
    I1 --> C1
    I2 --> C1
    I3 --> C1
    I4 --> C2
    C1 --> O1 --> O3
    C2 --> O2 --> O3
    style C1 fill:#4f46e5,stroke:#4338ca,color:#fff
    style C2 fill:#4f46e5,stroke:#4338ca,color:#fff
    style O3 fill:#059669,stroke:#047857,color:#fff

Serverless is a poor fit when: you need WebSocket streaming, responses take longer than the platform timeout, the agent requires GPU inference, or you need persistent connections to databases that cannot handle connection surge.

AWS Lambda Agent with Python

Here is a complete Lambda function that runs an AI agent:

# lambda_function.py
import json
import os
import uuid
import boto3
from agents import Agent, Runner

# Initialize outside handler for connection reuse across invocations
agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant. Keep responses concise.",
    model=os.environ.get("AGENT_MODEL", "gpt-4o-mini"),
)

# DynamoDB for session persistence
dynamodb = boto3.resource("dynamodb")
sessions_table = dynamodb.Table(os.environ["SESSIONS_TABLE"])

def get_session_history(session_id: str) -> list:
    """Load conversation history from DynamoDB."""
    try:
        response = sessions_table.get_item(Key={"session_id": session_id})
        return response.get("Item", {}).get("history", [])
    except Exception:
        return []

def save_session_history(session_id: str, history: list):
    """Persist conversation history to DynamoDB."""
    sessions_table.put_item(Item={
        "session_id": session_id,
        "history": history,
        "ttl": int(__import__("time").time()) + 3600,  # 1 hour TTL
    })

def handler(event, context):
    try:
        body = json.loads(event.get("body", "{}"))
        message = body.get("message", "")
        session_id = body.get("session_id") or str(uuid.uuid4())

        if not message:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "message is required"}),
            }

        history = get_session_history(session_id)

        # Run the agent synchronously (Lambda does not support async handlers)
        import asyncio
        result = asyncio.get_event_loop().run_until_complete(
            Runner.run(agent, message, message_history=history)
        )

        new_history = result.to_input_list()
        save_session_history(session_id, new_history)

        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({
                "session_id": session_id,
                "reply": result.final_output,
                "remaining_time_ms": context.get_remaining_time_in_millis(),
            }),
        }

    except Exception as e:
        return {
            "statusCode": 500,
            "body": json.dumps({"error": str(e)}),
        }

Infrastructure as Code with SAM

Define your Lambda and API Gateway with AWS SAM:

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

# template.yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Timeout: 90
    MemorySize: 512
    Runtime: python3.12
    Environment:
      Variables:
        AGENT_MODEL: gpt-4o-mini
        SESSIONS_TABLE: !Ref SessionsTable

Resources:
  AgentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_function.handler
      CodeUri: src/
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref SessionsTable
      Events:
        AgentApi:
          Type: Api
          Properties:
            Path: /agent/chat
            Method: post

  SessionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: agent-sessions
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: session_id
          AttributeType: S
      KeySchema:
        - AttributeName: session_id
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true

Deploy with:

sam build
sam deploy --guided

Cold Start Optimization

Cold starts happen when Lambda creates a new execution environment. For Python-based agents, this adds 1-3 seconds of latency. Minimize it:

# Move all imports and initialization outside the handler
import json          # These run during cold start, then are cached
import os
import boto3
from agents import Agent, Runner

agent = Agent(...)   # Initialized once, reused across invocations
dynamodb = boto3.resource("dynamodb")  # Connection reused

def handler(event, context):
    # Only request-specific logic here
    pass

Use provisioned concurrency to keep warm instances ready:

# In SAM template
AgentFunction:
  Type: AWS::Serverless::Function
  Properties:
    ProvisionedConcurrencyConfig:
      ProvisionedConcurrentExecutions: 5

This keeps 5 instances warm at all times, eliminating cold starts for the first 5 concurrent requests.

Handling Timeouts Gracefully

Lambda has a maximum timeout of 15 minutes (API Gateway timeout is 29 seconds). Check remaining time and fail gracefully:

def handler(event, context):
    remaining_ms = context.get_remaining_time_in_millis()

    if remaining_ms < 10000:  # Less than 10 seconds left
        return {
            "statusCode": 503,
            "body": json.dumps({
                "error": "Insufficient time remaining",
                "suggestion": "Use async processing for complex queries",
            }),
        }

    # For long-running tasks, use Step Functions instead
    pass

Cost Comparison: Serverless vs. Kubernetes

For an agent service handling 10,000 requests per day with an average execution time of 5 seconds:

AWS Lambda: 10,000 requests x 5 seconds x 512 MB = 25,000 GB-seconds/day. At $0.0000166667 per GB-second, that is roughly $12.50/month plus API Gateway costs.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Kubernetes (2 pods, t3.medium): 2 x $30/month = $60/month, running 24/7 regardless of traffic.

Lambda wins for bursty, low-to-moderate traffic. Kubernetes wins for sustained high traffic where pods stay utilized.

Stateless Design Pattern

Since Lambda instances are ephemeral, externalize all state:

# Session state -> DynamoDB
# Cache -> ElastiCache/Redis
# File uploads -> S3
# Task queues -> SQS
# Conversation history -> DynamoDB with TTL

Never rely on /tmp storage or global variables persisting between invocations — they might, but Lambda provides no guarantee.

FAQ

Can I stream AI agent responses from AWS Lambda?

Lambda itself does not support SSE or WebSocket streaming. However, you can use Lambda Function URLs with response streaming enabled — this allows chunked transfer encoding. Alternatively, use API Gateway WebSocket APIs backed by Lambda for bidirectional streaming, though this adds architectural complexity. For simple streaming, consider keeping a dedicated FastAPI service for the streaming endpoint while using Lambda for batch processing.

How do I handle Lambda's 6 MB response payload limit?

For AI agents, 6 MB is typically more than enough for text responses. If your agent generates large outputs (like code generation or document creation), write the output to S3 and return a pre-signed URL in the Lambda response. Set the URL to expire after a reasonable period, like 15 minutes.

Is provisioned concurrency worth the cost for AI agent Lambdas?

It depends on your latency requirements. Provisioned concurrency costs roughly the same as running an equivalent EC2 instance 24/7. If your agents serve user-facing requests where a 2-3 second cold start is unacceptable, provisioned concurrency is worth it. If the agent runs background tasks where latency is not critical, on-demand concurrency is more cost-effective. Start without it and add provisioned concurrency only for latency-sensitive paths.

#Serverless #AWSLambda #AIAgents #CloudFunctions #CostOptimization #AgenticAI #LearnAI #AIEngineering

Serverless AI Agents: Running Agents on AWS Lambda and Cloud Functions

Why Serverless for AI Agents

When Serverless Works for AI Agents

AWS Lambda Agent with Python

Infrastructure as Code with SAM

Cold Start Optimization

Handling Timeouts Gracefully

Cost Comparison: Serverless vs. Kubernetes

Stateless Design Pattern

FAQ

Can I stream AI agent responses from AWS Lambda?

How do I handle Lambda's 6 MB response payload limit?

Is provisioned concurrency worth the cost for AI agent Lambdas?

Try CallSphere AI Voice Agents

Related Articles You May Like

Personal AI Assistant: How to Pick One for Business in 2026

Free AI Agents in 2026: When Free Wins and When It Costs You

Graphiti: How Temporal Knowledge Graphs Give AI Voice Agents Persistent Memory (2026 Guide)

Chatbot App vs ChatGPT: What's the Difference, and Which Do I Need?

Building an HVAC After-Hours Emergency Escalation System: A Complete Engineering Guide

Reasoning models (Claude Mythos, o3, Opus 4.7, DeepSeek V4-Pro): Which Wins for Browser-side LLMs (WebGPU) in 2026?