By Sagar Shankaran, Founder of CallSphere
Modal turns a Python function into autoscaling serverless compute with optional GPU. Deploy a LiveKit Agent with one command and get pay-per-second billing.
Key takeaways
TL;DR: Modal lets you put `@app.function(gpu="A10G")` on top of a regular Python function and ship it. Pair it with LiveKit Agents and you have a horizontally autoscaling voice cluster in under 100 lines.
The build is a Modal-deployed LiveKit Agent that uses Whisper-large-v3 (GPU) for STT, GPT-4o-mini for the LLM, and ElevenLabs for TTS. Modal autoscales containers up to 1000 concurrent calls; idle containers shut down within seconds, and you are billed only for active CPU/GPU time.
- `pip install modal` and `modal token new`.
- `pip install livekit-agents livekit-plugins-openai livekit-plugins-elevenlabs`.

flowchart LR
B[Browser/iOS/Android] -- WebRTC --> L[LiveKit Cloud]
L -- dispatch --> M[Modal Function: VoiceAgent]
M -- GPU --> W[Whisper]
M -- API --> O[OpenAI gpt-4o-mini]
M -- API --> E[ElevenLabs TTS]
agent.py:
```python
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "livekit-agents>=0.12,<1.0",  # the code below uses the pre-1.0 VoiceAssistant API
        "livekit-plugins-openai",
        "livekit-plugins-silero",
        "livekit-plugins-elevenlabs",
        "livekit-plugins-deepgram",
        "faster-whisper",  # imported by the WhisperGPU class below
    )
)

app = modal.App("callsphere-voice", image=image)

secrets = [
    modal.Secret.from_name("livekit"),     # LIVEKIT_URL/API_KEY/API_SECRET
    modal.Secret.from_name("openai"),      # OPENAI_API_KEY
    modal.Secret.from_name("elevenlabs"),  # ELEVEN_API_KEY
    modal.Secret.from_name("deepgram"),    # DEEPGRAM_API_KEY (assumed secret name; deepgram.STT is used below)
]
```
```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero, elevenlabs, deepgram


async def entrypoint(ctx: JobContext):
    # Join the room and subscribe to audio tracks only.
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # System prompt that seeds the conversation.
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are CallSphere's salon receptionist. "
            "Keep responses short and warm."
        ),
    )

    # VAD -> STT -> LLM -> TTS pipeline.
    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(voice="Rachel"),
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room)
    await assistant.say("Hi, this is CallSphere. How can I help?", allow_interruptions=True)
```
```python
@app.function(
    image=image,
    secrets=secrets,
    timeout=60 * 60,   # 1 hour max per call
    cpu=2.0,
    memory=2048,
    min_containers=1,  # warm pool
    max_containers=200,
)
def run_worker():
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
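To sanity-check `max_containers` against an expected call volume, a rough sizing calculation can help. The sketch below assumes each worker container handles a fixed number of simultaneous calls. `calls_per_container` is a hypothetical figure you would need to measure for your own agent, and neither helper name comes from Modal or LiveKit.

```python
import math


def containers_needed(concurrent_calls: int, calls_per_container: int) -> int:
    """Containers required for a given load, rounding up for the last partial batch."""
    if concurrent_calls < 0 or calls_per_container <= 0:
        raise ValueError("concurrent_calls must be >= 0 and calls_per_container > 0")
    return math.ceil(concurrent_calls / calls_per_container)


def fits_pool(concurrent_calls: int, calls_per_container: int, max_containers: int) -> bool:
    """True if the load fits under the configured max_containers cap."""
    return containers_needed(concurrent_calls, calls_per_container) <= max_containers
```

With `max_containers=200`, 1000 concurrent calls fit only if each container sustains about five calls at once; at four calls per container the same load would need 250 containers.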
Heavy Whisper inference belongs on its own autoscaling GPU pool:
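```python
@app.cls(image=image, secrets=secrets, gpu="A10G", min_containers=0)
class WhisperGPU:
    @modal.enter()
    def load_model(self):
        # Load the weights once per container instead of on every call.
        from faster_whisper import WhisperModel

        self.model = WhisperModel("large-v3", device="cuda", compute_type="float16")

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        import io

        # faster-whisper takes a path, file-like object, or array, so wrap the raw bytes.
        segments, _ = self.model.transcribe(io.BytesIO(audio_bytes), beam_size=1)
        return " ".join(s.text for s in segments)
```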
The voice worker calls WhisperGPU().transcribe.remote(...); Modal autoscales the GPU pool independently.
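Before the remote call, the worker has to turn captured audio into bytes the GPU class can decode. One possible approach, sketched here with only the standard library, is to wrap raw PCM in a WAV container. It assumes 16-bit mono PCM at 16 kHz and that the receiving side decodes the bytes as an audio file; `pcm16_to_wav_bytes` is a hypothetical helper, not part of LiveKit or Modal.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave only closes files it opened itself, so the buffer is still readable here.
    return buf.getvalue()


# One second of silence as a stand-in for captured audio.
wav_bytes = pcm16_to_wav_bytes(b"\x00\x00" * 16000)
# In the worker this would be followed by the call from the text:
# text = WhisperGPU().transcribe.remote(wav_bytes)
```

Sending a self-describing container means the GPU side does not need to be told the sample rate or channel count separately.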
```bash
modal deploy agent.py
```
You'll see a URL for the function dashboard. LiveKit Cloud's worker dispatches to your Modal function via the LiveKit Cloud API, so connect them in the LiveKit dashboard under "Self-hosted dispatch."
```bash
modal run agent.py::run_worker
```
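This runs the same function as an ephemeral Modal app in the cloud rather than on your machine, with your Modal secrets attached, so you don't manage a .env file. The app stops when the command exits. Hot-reload on file changes is a feature of `modal serve`, not `modal run`.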
- Set `min_containers=1` for production.
- Load credentials with `modal.Secret.from_name`.
- Set `timeout` so a hung agent dies.
- Pin `image` so it's cached.

CallSphere uses Modal for off-hours batch tasks like nightly call summarization with Whisper-large; it's the cheapest GPU surface for spiky workloads. Our 24/7 voice plane is on dedicated k3s (Pion + FastAPI) for predictable latency. 37 agents across 6 verticals, HIPAA + SOC 2.
Cost for 10k calls/day? ~$120 Modal + ~$200 LLM/STT/TTS APIs.
Cold start on CPU? ~1.5s; on GPU ~6s with model warmup.
Can I use Pipecat instead of LiveKit? Yes — Pipecat has an official Modal guide.
Modal vs RunPod? Modal scales to zero in seconds; RunPod is cheaper for always-on workloads.
Logs and metrics? Built-in dashboard + modal app logs.
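The cost answer above can be turned into a per-call figure with simple arithmetic. This is a sketch only: it assumes the quoted ~$120 and ~$200 are daily totals for the 10k calls/day workload, which the original does not state explicitly, and the variable names are illustrative.

```python
CALLS_PER_DAY = 10_000
MODAL_USD = 120.0  # quoted Modal compute cost (assumed to be per day)
API_USD = 200.0    # quoted LLM/STT/TTS API cost (assumed to be per day)

total_usd = MODAL_USD + API_USD
per_call = total_usd / CALLS_PER_DAY  # blended cost of one call
modal_share = MODAL_USD / total_usd   # fraction of spend that is Modal compute

print(f"per call: ${per_call:.3f}, Modal share: {modal_share:.1%}")
```

Under that assumption, the third-party APIs rather than Modal compute account for most of the bill, so model and vendor choice affects cost more than container tuning does.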
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.