Deploy a Voice Agent on Modal with Python and Serverless GPU
Modal turns a Python function into autoscaling serverless compute with optional GPU. Deploy a LiveKit Agent with one command and get pay-per-second billing.
TL;DR — Modal lets you put `@app.function(gpu="A10G")` on top of a regular Python function and ship it. Pair it with LiveKit Agents and you have a horizontally autoscaling voice cluster in under 100 lines.
What you'll build
A Modal-deployed LiveKit Agent that uses Deepgram Nova-3 for low-latency STT (with an optional Whisper-large-v3 GPU pool for heavier transcription), GPT-4o-mini as the LLM, and ElevenLabs for TTS. Modal autoscales containers up to 1000 concurrent calls; idle containers go to sleep in seconds, and you're billed only for active CPU/GPU time.
Prerequisites
- `pip install modal` and `modal token new`.
- LiveKit Cloud project (or self-hosted).
- Modal secrets for LiveKit + OpenAI + ElevenLabs.
- Python 3.11+.
- `pip install livekit-agents livekit-plugins-openai livekit-plugins-silero livekit-plugins-deepgram livekit-plugins-elevenlabs`.
Architecture
```mermaid
flowchart LR
    B[Browser/iOS/Android] -- WebRTC --> L[LiveKit Cloud]
    L -- dispatch --> M[Modal Function: VoiceAgent]
    M -- GPU --> W[Whisper]
    M -- API --> O[OpenAI gpt-4o-mini]
    M -- API --> E[ElevenLabs TTS]
```
Step 1 — Define the Modal app
agent.py:
```python
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "livekit-agents>=0.12",
        "livekit-plugins-openai",
        "livekit-plugins-silero",
        "livekit-plugins-elevenlabs",
        "livekit-plugins-deepgram",
    )
)

app = modal.App("callsphere-voice", image=image)

secrets = [
    modal.Secret.from_name("livekit"),     # LIVEKIT_URL / LIVEKIT_API_KEY / LIVEKIT_API_SECRET
    modal.Secret.from_name("openai"),      # OPENAI_API_KEY
    modal.Secret.from_name("elevenlabs"),  # ELEVEN_API_KEY
]
```
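The secret names above are this guide's convention, not anything Modal requires; create them once with the Modal CLI (all values below are placeholders):

```bash
modal secret create livekit \
  LIVEKIT_URL=wss://your-project.livekit.cloud \
  LIVEKIT_API_KEY=your-key \
  LIVEKIT_API_SECRET=your-secret

modal secret create openai OPENAI_API_KEY=your-openai-key
modal secret create elevenlabs ELEVEN_API_KEY=your-elevenlabs-key
```

Secrets created this way are injected as environment variables into any function that lists them, so the plugins pick them up without code changes.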
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 2 — Write the LiveKit entrypoint
```python
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, elevenlabs, openai, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are CallSphere's salon receptionist. "
            "Keep responses short and warm."
        ),
    )

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(voice="Rachel"),
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room)
    await assistant.say("Hi, this is CallSphere. How can I help?", allow_interruptions=True)
```
Step 3 — Wrap in a Modal Function
```python
@app.function(
    image=image,
    secrets=secrets,
    timeout=60 * 60,     # 1 hour max per call
    cpu=2.0,
    memory=2048,
    min_containers=1,    # warm pool
    max_containers=200,
)
def run_worker():
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
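A quick sanity check on `max_containers=200` against the 1000-call target. The five-sessions-per-container figure is an assumption (LiveKit workers multiplex jobs, so actual capacity depends on your CPU and codec settings); measure it under load before relying on it:

```python
import math

target_concurrent_calls = 1000
sessions_per_container = 5   # assumed; tune from load tests

# Ceiling division: partial containers still count as whole containers.
containers_needed = math.ceil(target_concurrent_calls / sessions_per_container)
print(containers_needed)  # 200
```

At five sessions each, 200 containers covers the stated 1000-call ceiling exactly; if profiling shows fewer sessions per container, raise `max_containers` accordingly.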
Step 4 — GPU split for STT (optional)
Heavy Whisper inference belongs on its own autoscaling GPU pool:
```python
import io

@app.cls(image=image, secrets=secrets, gpu="A10G", min_containers=0)
class WhisperGPU:
    @modal.enter()
    def load_model(self):
        # Load the model once per container, not once per call
        from faster_whisper import WhisperModel
        self.model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        # faster-whisper expects a path or file-like object, not raw bytes
        segments, _ = self.model.transcribe(io.BytesIO(audio_bytes), beam_size=1)
        return " ".join(s.text for s in segments)
```
The voice worker calls `WhisperGPU().transcribe.remote(...)`; Modal autoscales the GPU pool independently.
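If you ship audio to the GPU pool in pieces rather than one blob per call, a small framing helper keeps payloads bounded. This is a sketch assuming 16 kHz, 16-bit mono PCM (an assumption about your capture format); the 30-second window matches Whisper's native chunk length:

```python
SAMPLE_RATE = 16_000   # Hz, assumed capture rate
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM

def chunk_pcm(audio: bytes, seconds: float = 30.0) -> list[bytes]:
    """Split raw PCM into windows no longer than `seconds` each."""
    window = int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)
    return [audio[i:i + window] for i in range(0, len(audio), window)]

# 65 seconds of silence splits into 30s + 30s + 5s windows
pcm = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE * 65)
chunks = chunk_pcm(pcm)
print([len(c) // (SAMPLE_RATE * BYTES_PER_SAMPLE) for c in chunks])  # [30, 30, 5]
```

Each chunk can then go to the remote pool independently, which lets Modal fan transcription out across GPU containers.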
Step 5 — Deploy
```bash
modal deploy agent.py
```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
You'll see a URL for the function dashboard. The worker opens an outbound WebSocket to LiveKit Cloud using `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` and registers itself; LiveKit dispatches jobs to it over that connection, so no inbound networking into Modal is required.
Step 6 — Local dev loop
```bash
modal run agent.py::run_worker
```
This runs the same function in an ephemeral Modal app (use `modal serve agent.py` for a hot-reloading dev loop), with your Modal secrets attached so you don't manage a local `.env`.
Common pitfalls
- Cold starts on GPU — pre-warm with `min_containers=1` for production.
- Bundling secrets in code — always use `modal.Secret.from_name`.
- Long-running container leaks — set `timeout` so a hung agent dies.
- Pip-install on every cold start — bake dependencies into the `image` so they're cached.
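The timeout pitfall also applies inside the agent: a stalled downstream API can wedge a session long before Modal's container timeout fires. A minimal per-call watchdog sketch, assuming you wrap each external call yourself (the helper and its names are illustrative, not part of any library):

```python
import asyncio

async def with_deadline(coro, seconds: float = 10.0, fallback: str = ""):
    """Bound any awaitable; return a fallback instead of hanging the session."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback

async def slow_tts() -> str:
    await asyncio.sleep(5)   # stands in for a stalled TTS request
    return "audio"

result = asyncio.run(with_deadline(slow_tts(), seconds=0.1, fallback="(timed out)"))
print(result)  # (timed out)
```

In a real agent the fallback would trigger a retry or a spoken apology rather than an empty string, but the shape is the same: every external await gets a deadline.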
How CallSphere does this in production
CallSphere uses Modal for off-hours batch tasks like nightly call summarization with Whisper-large — it's the cheapest GPU surface for spiky workloads. Our 24/7 voice plane is on dedicated k3s (Pion + FastAPI) for predictable latency. 37 agents across 6 verticals, HIPAA + SOC 2.
FAQ
Cost for 10k calls/day? ~$120 Modal + ~$200 LLM/STT/TTS APIs.
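A back-of-envelope version of that figure. Every number here is an assumption (average call length, per-core-second rate) used only to show the pay-per-second arithmetic; check Modal's current pricing page for real rates:

```python
calls_per_day = 10_000
avg_call_seconds = 180        # assumed 3-minute average call
cpu_cores = 2.0               # matches the run_worker config above
core_second_rate = 0.0000333  # assumed $/core-second, illustrative only

# Pay-per-second billing: total core-seconds times the rate
compute_seconds = calls_per_day * avg_call_seconds * cpu_cores
daily_cost = compute_seconds * core_second_rate
print(f"${daily_cost:,.0f}/day")  # ≈ $120/day
```

Because idle containers scale to zero, this tracks actual talk time; a day with half the call volume costs roughly half as much.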
Cold start on CPU? ~1.5s; on GPU ~6s with model warmup.
Can I use Pipecat instead of LiveKit? Yes — Pipecat has an official Modal guide.
Modal vs RunPod? Modal scales to zero in seconds; RunPod cheaper for always-on.
Logs and metrics? Built-in dashboard + `modal app logs`.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.