By Sagar Shankaran, Founder of CallSphere
faster-whisper handles STT on CPU; vLLM serves a 70B model on a single H100 with 6x throughput vs HF Transformers. Here's the production-grade voice pipeline that connects them.
Key takeaways
TL;DR: faster-whisper (CTranslate2) is 4x faster than openai/whisper on the same hardware. vLLM is the de-facto OSS LLM server (paged-attention, continuous batching). Run faster-whisper on CPU, vLLM on the GPU, and you'll handle 30+ concurrent voice calls per H100.
A FastAPI service that accepts WebSocket audio, transcribes with faster-whisper-large-v3, calls vLLM's OpenAI-compatible /v1/chat/completions for replies, and streams TTS back. Designed for multi-tenant voice agents.
ctranslate2 requires this).
pip install faster-whisper vllm fastapi uvicorn websockets
flowchart LR
CL[Client WSS] --> API[FastAPI]
API -->|audio| FW[faster-whisper large-v3 INT8]
API -->|prompt| VLLM[vLLM /v1/chat/completions]
VLLM --> API
API -->|text| TTS[Piper / Coqui]
TTS --> CL
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8001
```
For a single 80 GB H100, drop --tensor-parallel-size and use Llama-3.1-8B-Instruct to leave headroom for KV cache.
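To make "headroom for KV cache" concrete, here is a rough back-of-envelope estimate. The layer counts, KV-head counts, and head dimension are assumptions based on the published Llama 3.1 configs, and the 50 GiB budget is purely hypothetical. The calculation assumes FP16 cache entries and ignores vLLM's own overhead, so treat the result as an approximation, not a measurement.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: a key and a value per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed architecture values; check config.json of the checkpoint you deploy.
LLAMA_8B = dict(layers=32, kv_heads=8, head_dim=128)
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)


def max_cached_tokens(free_gib: float, **arch) -> int:
    """How many tokens fit in `free_gib` GiB of memory left over for the cache."""
    return int(free_gib * 1024**3 // kv_bytes_per_token(**arch))


per_tok_8b = kv_bytes_per_token(**LLAMA_8B)    # 131072 bytes (128 KiB)
per_tok_70b = kv_bytes_per_token(**LLAMA_70B)  # 327680 bytes (320 KiB)

# Hypothetical budget: what might remain on an 80 GB card after weights and reserve.
free_gib = 50.0
print(max_cached_tokens(free_gib, **LLAMA_8B))  # 409600 tokens
```

Under these assumptions, 409,600 cached tokens is about 50 sequences at the full 8192-token `--max-model-len`. Short voice turns use far less than that per call.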
```python
from faster_whisper import WhisperModel

stt = WhisperModel("large-v3", device="cpu", compute_type="int8",
                   cpu_threads=8, num_workers=2)
```
If you have a spare GPU, use device="cuda", compute_type="float16", but pin it to a different GPU than vLLM. Otherwise the two processes compete for the memory vLLM reserves for its KV cache.
```python
from fastapi import FastAPI, WebSocket
import asyncio
import httpx
import numpy as np

app = FastAPI()

async def llm_chat(messages):
    """Send the conversation to vLLM's OpenAI-compatible endpoint; return the reply text."""
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.post("http://127.0.0.1:8001/v1/chat/completions", json={
            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            "messages": messages,
            "temperature": 0.3,
            "max_tokens": 200})
        r.raise_for_status()  # surface HTTP errors instead of a KeyError below
        return r.json()["choices"][0]["message"]["content"]
```
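`llm_chat` sends whatever message list it receives, and over a long call that list keeps growing toward `--max-model-len`. One way to cap it is sketched below. `trim_history` and its turn limit are hypothetical names, not part of the original service, and counting messages is only a crude stand-in for counting tokens.

```python
def trim_history(messages: list[dict], max_turns: int = 8) -> list[dict]:
    """Keep every system message plus the last `max_turns` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # rest[-0:] would return the whole list, so handle zero explicitly.
    recent = rest[-max_turns:] if max_turns > 0 else []
    return system + recent
```

A call site could then use `reply = await llm_chat(trim_history(history))`, leaving the full `history` intact for logging.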
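```python
from fastapi import WebSocketDisconnect

def transcribe(pcm):
    # transcribe() returns a lazy generator; decoding happens while iterating,
    # so the join has to run in the worker thread as well.
    segs, _ = stt.transcribe(pcm, language="en", vad_filter=True,
                             vad_parameters=dict(min_silence_duration_ms=500))
    return " ".join(s.text for s in segs).strip()

@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    history, buf = [{"role": "system", "content": "Be concise."}], bytearray()
    try:
        while True:
            buf.extend(await ws.receive_bytes())
            if len(buf) < 16000 * 2 * 1.2:  # 1.2 s of 16 kHz int16
                continue
            usable = len(buf) - len(buf) % 2  # int16 needs whole 2-byte samples
            pcm = np.frombuffer(bytes(buf[:usable]), dtype=np.int16).astype(np.float32) / 32768.0
            del buf[:usable]
            # Run the blocking STT call off the event loop.
            text = await asyncio.to_thread(transcribe, pcm)
            if not text:
                continue
            history.append({"role": "user", "content": text})
            reply = await llm_chat(history)
            history.append({"role": "assistant", "content": reply})
            await ws.send_text(reply)
    except WebSocketDisconnect:
        pass  # client hung up
```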
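The buffering arithmetic in the handler above (16,000 samples/s × 2 bytes × 1.2 s) can be isolated into a small helper that is easy to unit-test. This is a standard-library-only sketch for illustration: `PcmWindow` and `pcm16_to_float` are hypothetical names, the input is assumed to be little-endian mono int16, and it releases fixed-size windows, whereas the handler converts everything buffered so far.

```python
import sys
from array import array

SAMPLE_RATE = 16_000   # Hz, matching the handler above
BYTES_PER_SAMPLE = 2   # int16


class PcmWindow:
    """Accumulates raw int16 PCM bytes and releases fixed-duration windows."""

    def __init__(self, seconds: float = 1.2):
        # Whole samples only, so every window has an even byte length.
        self.window_bytes = round(SAMPLE_RATE * seconds) * BYTES_PER_SAMPLE
        self._buf = bytearray()

    def push(self, chunk: bytes) -> list[bytes]:
        """Add a chunk; return zero or more complete windows."""
        self._buf.extend(chunk)
        out = []
        while len(self._buf) >= self.window_bytes:
            out.append(bytes(self._buf[:self.window_bytes]))
            del self._buf[:self.window_bytes]
        return out


def pcm16_to_float(raw: bytes) -> list[float]:
    """Convert little-endian int16 PCM to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    return [s / 32768.0 for s in samples]
```

Because windows are always a whole number of samples, a client that sends odd-sized chunks cannot leave a half sample at the end of a window.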
```python
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Check stock for a SKU",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"]}}}]

async def llm_with_tools(messages):
    # Explicit timeout: httpx's 5 s default is short for an LLM call.
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.post("http://127.0.0.1:8001/v1/chat/completions", json={
            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            "messages": messages,
            "tools": TOOLS,
            "tool_choice": "auto"})
        r.raise_for_status()
        # The full message is returned because it may carry tool_calls instead of content.
        return r.json()["choices"][0]["message"]
```
--enable-auto-tool-choice + --tool-call-parser llama3_json is the magic flag combo. Without both, tool calls return as plain text.
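Once the flags are set, the returned message can carry a `tool_calls` list in the OpenAI chat-completions shape, and something has to execute those calls. The sketch below shows one possible dispatcher. The `check_inventory` body, its stock data, the `HANDLERS` table, and `run_tool_calls` are all hypothetical stand-ins, not part of the original service.

```python
import json


def check_inventory(sku: str) -> dict:
    """Hypothetical stand-in; a real version would query your inventory system."""
    stock = {"ABC-1": 12}
    return {"sku": sku, "in_stock": stock.get(sku, 0)}


HANDLERS = {"check_inventory": check_inventory}


def run_tool_calls(message: dict) -> list[dict]:
    """Turn an assistant message's tool_calls into role="tool" result messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = HANDLERS.get(fn["name"])
        try:
            # Arguments arrive as a JSON-encoded string, not a dict.
            args = json.loads(fn["arguments"] or "{}")
            if handler is None:
                payload = {"error": f"unknown tool {fn['name']}"}
            else:
                payload = handler(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            payload = {"error": str(exc)}
        results.append({"role": "tool",
                        "tool_call_id": call["id"],
                        "content": json.dumps(payload)})
    return results
```

In use, you would append the assistant message and these tool messages to the history, then call `llm_with_tools` again so the model can phrase the spoken answer.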
On a single H100 with Llama-3.1-8B:
ctranslate2==4.4.0 if you're stuck on cuDNN 8.
--max-model-len; voice agents rarely need 32k context.
hermes parser silently drops calls.
CallSphere's stack: 37 specialist agents, 90+ tools, 115+ Postgres tables across 6 verticals. Healthcare runs 14 tools on FastAPI :8084 with OpenAI Realtime; OneRoof Property uses 10 specialists on WebRTC. We benchmark vLLM internally for offline batch workloads (transcript summarization, QA scoring) but keep Realtime for live calls. Flat pricing $149 / $499 / $1499. 14-day trial · 22% affiliate · /demo.
Why not Whisper on GPU too? It's overkill; CPU-INT8 is fast enough and frees the GPU for the LLM.
Will vLLM eat my Realtime budget? It replaces the LLM half; you still need an STT and TTS layer.
Best model for voice today? Llama-3.1-8B for low latency, Llama-3.3-70B for quality.
Can I do speculative decoding? Yes, vLLM supports n-gram and EAGLE.
HIPAA? Possible if you self-host on a HIPAA-eligible cloud and add audit logging. CallSphere does it for Healthcare out-of-the-box.