AI Infrastructure · 12 min read

Deploy a Voice Agent on Modal with Python and Serverless GPU

Modal turns a Python function into autoscaling serverless compute with optional GPU. Deploy a LiveKit Agent with one command and get pay-per-second billing.

TL;DR — Modal lets you put @app.function(gpu="A10G") on top of a regular Python function and ship it. Pair it with LiveKit Agents and you have a horizontally autoscaling voice cluster in under 100 lines.

What you'll build

A Modal-deployed LiveKit Agent that uses Deepgram for streaming STT, GPT-4o-mini for the LLM, and ElevenLabs for TTS, with an optional Whisper-large-v3 GPU pool for heavier transcription. Modal autoscales containers up to ~1000 concurrent calls; idle containers go to sleep in seconds, and you're billed only for active CPU/GPU time.

Prerequisites

  1. pip install modal and modal token new.
  2. LiveKit Cloud project (or self-hosted).
  3. Modal secrets for LiveKit + OpenAI + ElevenLabs.
  4. Python 3.11+.
  5. pip install livekit-agents livekit-plugins-openai livekit-plugins-elevenlabs.
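The three Modal secrets from item 3 can be created from the CLI. The secret names below match what the deployment code pulls in with modal.Secret.from_name; every value is a placeholder for your own credentials:

```shell
# Create the three named secrets the app references.
# Replace every placeholder value with your real credentials.
modal secret create livekit \
  LIVEKIT_URL=wss://your-project.livekit.cloud \
  LIVEKIT_API_KEY=your_livekit_key \
  LIVEKIT_API_SECRET=your_livekit_secret

modal secret create openai OPENAI_API_KEY=your_openai_key

modal secret create elevenlabs ELEVEN_API_KEY=your_elevenlabs_key
```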

Architecture

```mermaid
flowchart LR
  B[Browser/iOS/Android] -- WebRTC --> L[LiveKit Cloud]
  L -- dispatch --> M[Modal Function: VoiceAgent]
  M -- GPU --> W[Whisper]
  M -- API --> O[OpenAI gpt-4o-mini]
  M -- API --> E[ElevenLabs TTS]
```

Step 1 — Define the Modal app

agent.py:

```python
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "livekit-agents>=0.12",
        "livekit-plugins-openai",
        "livekit-plugins-silero",
        "livekit-plugins-elevenlabs",
        "livekit-plugins-deepgram",
    )
)

app = modal.App("callsphere-voice", image=image)

secrets = [
    modal.Secret.from_name("livekit"),     # LIVEKIT_URL / API_KEY / API_SECRET
    modal.Secret.from_name("openai"),      # OPENAI_API_KEY
    modal.Secret.from_name("elevenlabs"),  # ELEVEN_API_KEY
]
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 2 — Write the LiveKit entrypoint

```python
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, elevenlabs, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are CallSphere's salon receptionist. "
            "Keep responses short and warm."
        ),
    )

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(voice="Rachel"),
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room)
    await assistant.say(
        "Hi, this is CallSphere. How can I help?",
        allow_interruptions=True,
    )
```

Step 3 — Wrap in a Modal Function

```python
@app.function(
    image=image,
    secrets=secrets,
    timeout=60 * 60,      # 1 hour max per call
    cpu=2.0,
    memory=2048,
    min_containers=1,     # warm pool
    max_containers=200,
)
def run_worker():
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

Step 4 — GPU split for STT (optional)

Heavy Whisper inference belongs on its own autoscaling GPU pool:

```python
@app.cls(
    image=image.pip_install("faster-whisper"),  # GPU pool needs faster-whisper on top of the base image
    secrets=secrets,
    gpu="A10G",
    min_containers=0,
)
class WhisperGPU:
    @modal.enter()
    def load_model(self):
        # Load the model once per container, not once per call.
        from faster_whisper import WhisperModel

        self.model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        import io

        # faster-whisper expects a path or file-like object, not raw bytes.
        segments, _ = self.model.transcribe(io.BytesIO(audio_bytes), beam_size=1)
        return " ".join(s.text for s in segments)
```

The voice worker calls WhisperGPU().transcribe.remote(...); Modal autoscales the GPU pool independently.
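If you batch recorded audio into this pool rather than streaming it, it helps to split long PCM buffers before each transcribe.remote(...) call so chunks fan out across containers. A minimal sample-aligned chunker (plain Python; the function name and defaults are my own, assuming 16 kHz / 16-bit mono PCM):

```python
def chunk_pcm(audio: bytes, seconds: float = 30.0,
              sample_rate: int = 16_000, sample_width: int = 2) -> list[bytes]:
    """Split raw mono PCM into sample-aligned chunks of at most `seconds` each."""
    bytes_per_second = sample_rate * sample_width
    step = int(bytes_per_second * seconds)
    step -= step % sample_width  # never split a sample across chunks
    return [audio[i:i + step] for i in range(0, len(audio), step)]


# 75 s of 16 kHz / 16-bit mono audio -> chunks of 30 s, 30 s, and 15 s
chunks = chunk_pcm(bytes(75 * 16_000 * 2))
```

Each chunk can then be passed to WhisperGPU().transcribe.remote(...) independently, letting Modal scale the GPU pool with the backlog.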

Step 5 — Deploy

```bash
modal deploy agent.py
```

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

You'll see a URL for the function dashboard. The worker connects outbound to LiveKit using the credentials in your livekit secret and registers itself with your project; LiveKit then dispatches incoming rooms to it automatically.

Step 6 — Local dev loop

```bash
modal run agent.py::run_worker
```

This executes the same function as an ephemeral Modal app, streaming logs to your terminal and reusing your Modal secrets so there's no local .env to manage.

Common pitfalls

  • Cold starts on GPU — pre-warm with min_containers=1 for production.
  • Bundling secrets in code — always use modal.Secret.from_name.
  • Long-running container leaks — set timeout so a hung agent dies.
  • Pip-install on every cold start — bake into image so it's cached.

How CallSphere does this in production

CallSphere uses Modal for off-hours batch tasks like nightly call summarization with Whisper-large — it's the cheapest GPU surface for spiky workloads. Our 24/7 voice plane is on dedicated k3s (Pion + FastAPI) for predictable latency. 37 agents across 6 verticals, HIPAA + SOC 2.

FAQ

Cost for 10k calls/day? ~$120 Modal + ~$200 LLM/STT/TTS APIs.
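Back-of-envelope behind that Modal figure — the average call length and per-core-second rate below are illustrative assumptions for a sketch, not Modal's published pricing:

```python
calls_per_day = 10_000
avg_call_seconds = 180            # assumed 3-minute average call
cores = 2.0                       # matches cpu=2.0 on run_worker
rate_per_core_second = 0.0000375  # illustrative $/core-second, not a real price list

core_seconds = calls_per_day * avg_call_seconds * cores
daily_cost = core_seconds * rate_per_core_second  # same ballpark as the figure above
```

Because billing is per active second, shorter calls or a lower warm-pool floor move this number roughly linearly.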

Cold start on CPU? ~1.5s; on GPU ~6s with model warmup.

Can I use Pipecat instead of LiveKit? Yes — Pipecat has an official Modal guide.

Modal vs RunPod? Modal scales to zero in seconds; RunPod cheaper for always-on.

Logs and metrics? Built-in dashboard + modal app logs.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.