---
title: "Deploy a Voice Agent on Modal with Python and Serverless GPU"
description: "Modal turns a Python function into autoscaling serverless compute with optional GPU. Deploy a LiveKit Agent with one command and get pay-per-second billing."
canonical: https://callsphere.ai/blog/vw2h-deploy-voice-agent-modal-python-serverless-gpu
category: "AI Infrastructure"
tags: ["Tutorial", "Build", "Modal", "Python", "Serverless", "LiveKit"]
author: "CallSphere Team"
published: 2026-04-21T00:00:00.000Z
updated: 2026-05-07T09:27:42.240Z
---

# Deploy a Voice Agent on Modal with Python and Serverless GPU

> Modal turns a Python function into autoscaling serverless compute with optional GPU. Deploy a LiveKit Agent with one command and get pay-per-second billing.

> **TL;DR** — Modal lets you put `@app.function(gpu="A10G")` on top of a regular Python function and ship it. Pair it with LiveKit Agents and you have a horizontally autoscaling voice cluster in under 100 lines.

## What you'll build

A Modal-deployed LiveKit Agent that uses Deepgram Nova-3 for streaming STT, GPT-4o-mini as the LLM, and ElevenLabs for TTS, plus an optional Whisper-large-v3 GPU pool for heavier transcription (Step 4). Modal autoscales containers up to 1,000 concurrent calls; idle containers go to sleep in seconds, and you're billed only for active CPU/GPU time.

## Prerequisites

1. `pip install modal` and `modal token new`.
2. A LiveKit Cloud project (or self-hosted LiveKit).
3. Modal secrets for LiveKit, OpenAI, Deepgram, and ElevenLabs (created below).
4. Python 3.11+.
5. `pip install livekit-agents livekit-plugins-openai livekit-plugins-deepgram livekit-plugins-silero livekit-plugins-elevenlabs` for local development.
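
Secrets are created once per Modal workspace via the CLI. A minimal sketch with placeholder values; the secret names and variable names must match what the code in Step 1 expects:

```bash
modal secret create livekit \
  LIVEKIT_URL=wss://your-project.livekit.cloud \
  LIVEKIT_API_KEY=... \
  LIVEKIT_API_SECRET=...

modal secret create openai OPENAI_API_KEY=...
modal secret create deepgram DEEPGRAM_API_KEY=...
modal secret create elevenlabs ELEVEN_API_KEY=...
```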

## Architecture

```mermaid
flowchart LR
  B[Browser/iOS/Android] -- WebRTC --> L[LiveKit Cloud]
  L -- dispatch --> M[Modal Function: VoiceAgent]
  M -- API --> D[Deepgram STT]
  M -- API --> O[OpenAI gpt-4o-mini]
  M -- API --> E[ElevenLabs TTS]
  M -. optional GPU .-> W[Whisper large-v3]
```

## Step 1 — Define the Modal app

`agent.py`:

```python
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "livekit-agents>=0.12",
        "livekit-plugins-openai",
        "livekit-plugins-silero",
        "livekit-plugins-elevenlabs",
        "livekit-plugins-deepgram",
    )
)

app = modal.App("callsphere-voice", image=image)

secrets = [
    modal.Secret.from_name("livekit"),       # LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
    modal.Secret.from_name("openai"),        # OPENAI_API_KEY
    modal.Secret.from_name("deepgram"),      # DEEPGRAM_API_KEY
    modal.Secret.from_name("elevenlabs"),    # ELEVEN_API_KEY
]
```
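
Each secret is injected into the container as plain environment variables, which is where the LiveKit plugins look for their keys. A throwaway sanity check you can drop into `agent.py` (not part of the agent itself):

```python
@app.function(secrets=secrets)
def check_secrets():
    import os
    # Prints True for each key if the secrets were created correctly.
    for key in ("LIVEKIT_URL", "OPENAI_API_KEY", "DEEPGRAM_API_KEY", "ELEVEN_API_KEY"):
        print(key, key in os.environ)
```

Run it once with `modal run agent.py::check_secrets`.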

## Step 2 — Write the LiveKit entrypoint

```python
from livekit.agents import (
    AutoSubscribe, JobContext, WorkerOptions, cli, llm
)
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero, elevenlabs, deepgram

async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are CallSphere's salon receptionist. "
            "Keep responses short and warm."
        ),
    )

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(voice="Rachel"),
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room)
    await assistant.say("Hi, this is CallSphere. How can I help?", allow_interruptions=True)
```

## Step 3 — Wrap in a Modal Function

```python
@app.function(
    image=image,
    secrets=secrets,
    timeout=60 * 60,        # 1 hour max per call
    cpu=2.0,
    memory=2048,
    min_containers=1,       # warm pool
    max_containers=200,
)
def run_worker():
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
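
One easy latency win: `WorkerOptions` also accepts a `prewarm_fnc` that runs once per worker process, so the Silero VAD weights load before the first call instead of during it. A sketch, assuming you adjust the Step 2 entrypoint to read the preloaded VAD:

```python
from livekit.agents import JobProcess

def prewarm(proc: JobProcess):
    # Runs once per process; jobs reuse the loaded VAD via proc.userdata.
    proc.userdata["vad"] = silero.VAD.load()

# In run_worker:
#   cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
# ...and in entrypoint, pass vad=ctx.proc.userdata["vad"] instead of silero.VAD.load().
```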

## Step 4 — GPU split for STT (optional)

Heavy Whisper inference belongs on its own autoscaling GPU pool:

```python
# faster-whisper isn't in the base image, so extend it here
@app.cls(image=image.pip_install("faster-whisper"), secrets=secrets, gpu="A10G", min_containers=0)
class WhisperGPU:
    @modal.enter()
    def load(self):
        from faster_whisper import WhisperModel
        # Load weights once per container, not once per request
        self.model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        import io
        # faster-whisper wants a path or file-like object, not raw bytes
        segments, _ = self.model.transcribe(io.BytesIO(audio_bytes), beam_size=1)
        return " ".join(s.text for s in segments)
```

The voice worker calls `WhisperGPU().transcribe.remote(...)`; Modal autoscales the GPU pool independently.
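
From the async entrypoint, use the non-blocking variant. A hypothetical hand-off, assuming you've buffered an utterance into `wav_bytes`:

```python
# .remote.aio() awaits the GPU call without blocking the event loop;
# wav_bytes is a hypothetical buffered utterance, not defined above.
text = await WhisperGPU().transcribe.remote.aio(wav_bytes)
```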

## Step 5 — Deploy

```bash
modal deploy agent.py
```

You'll see a dashboard URL for the deployed app. Connection-wise, nothing points at Modal: the worker dials out to LiveKit over WebSocket using the `LIVEKIT_URL` / `LIVEKIT_API_KEY` / `LIVEKIT_API_SECRET` values in your `livekit` secret and registers itself for job dispatch, so there's no inbound URL to configure on the LiveKit side.
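
To place a test call from a client you still need a join token. A minimal sketch with the `livekit-api` package (`pip install livekit-api`; the room name and identity are placeholders):

```python
from livekit import api

# AccessToken() reads LIVEKIT_API_KEY / LIVEKIT_API_SECRET from the
# environment when constructed without arguments.
token = (
    api.AccessToken()
    .with_identity("test-caller")
    .with_grants(api.VideoGrants(room_join=True, room="demo-room"))
    .to_jwt()
)
print(token)
```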

## Step 6 — Local dev loop

```bash
modal run agent.py::run_worker
```

This runs the same function as an ephemeral Modal app, streaming logs to your terminal and sharing your Modal secrets so you don't manage a `.env`. Note that it executes in Modal's cloud, not on your laptop, so behavior matches production.
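
Once deployed, logs and app state come from the same CLI (`callsphere-voice` is the app name set in Step 1):

```bash
# Tail logs for the deployed app
modal app logs callsphere-voice

# List your deployed and running apps
modal app list
```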

## Common pitfalls

- **Cold starts on GPU** — pre-warm with `min_containers=1` for production.
- **Bundling secrets in code** — always use `modal.Secret.from_name`.
- **Long-running container leaks** — set `timeout` so a hung agent dies.
- **Pip-install on every cold start** — bake into `image` so it's cached.

## How CallSphere does this in production

CallSphere uses Modal for **off-hours batch tasks** like nightly call summarization with Whisper-large — it's the cheapest GPU surface for spiky workloads. Our 24/7 voice plane is on dedicated k3s ([Pion + FastAPI](/industries/real-estate)) for predictable latency. 37 agents across 6 verticals, HIPAA + SOC 2.

## FAQ

**Cost for 10k calls/day?** ~$120 Modal + ~$200 LLM/STT/TTS APIs.

**Cold start on CPU?** ~1.5s; on GPU ~6s with model warmup.

**Can I use Pipecat instead of LiveKit?** Yes — Pipecat has an official Modal guide.

**Modal vs RunPod?** Modal scales to zero in seconds; RunPod cheaper for always-on.

**Logs and metrics?** Built-in dashboard + `modal app logs`.

## Sources

- [Modal blog — One-second voice latency](https://modal.com/blog/low-latency-voice-bot)
- [Modal blog — Deploy LiveKit Agents on Modal](https://modal.com/blog/livekit-modal)
- [Pipecat Modal deploy guide](https://docs.pipecat.ai/deployment/platforms/modal)
- [LiveKit Agents repo](https://github.com/livekit/agents)

