Build a Voice Agent with faster-whisper + vLLM (2026 Stack)
faster-whisper handles STT on CPU; vLLM serves the LLM on GPU (Llama-3.1-70B across two H100s, or 8B on one) with roughly 6x the throughput of plain HF Transformers. Here's a production-grade voice pipeline that connects them.
TL;DR —
`faster-whisper` (CTranslate2) is 4x faster than `openai/whisper` on the same hardware. `vLLM` is the de-facto OSS LLM server (PagedAttention, continuous batching). Run faster-whisper on CPU, vLLM on the GPU, and you'll handle 30+ concurrent voice calls per H100.
What you'll build
A FastAPI service that accepts WebSocket audio, transcribes with faster-whisper-large-v3, calls vLLM's OpenAI-compatible /v1/chat/completions for replies, and streams TTS back. Designed for multi-tenant voice agents.
Prerequisites
- Linux box with NVIDIA GPU (H100 or 4090).
- CUDA 12 + cuDNN 9 (the latest `ctranslate2` requires this).
- Python 3.11, then `pip install faster-whisper vllm fastapi uvicorn websockets`.
- Hugging Face token for downloading gated models.
Architecture
```mermaid
flowchart LR
    CL[Client WSS] --> API[FastAPI]
    API -->|audio| FW[faster-whisper large-v3 INT8]
    API -->|prompt| VLLM[vLLM /v1/chat/completions]
    VLLM --> API
    API -->|text| TTS[Piper / Coqui]
    TTS --> CL
```
Step 1 — Start vLLM with an OpenAI-compatible server
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8001
```
For a single 80 GB H100, drop `--tensor-parallel-size` and serve `Llama-3.1-8B-Instruct` instead, to leave headroom for the KV cache.
Step 2 — Initialize faster-whisper
```python
from faster_whisper import WhisperModel

# INT8 on CPU is the sweet spot for sub-2s utterances
stt = WhisperModel(
    "large-v3",
    device="cpu",
    compute_type="int8",
    cpu_threads=8,
    num_workers=2,
)
```
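`transcribe()` accepts a float32 NumPy array at 16 kHz. Most telephony and WebSocket stacks hand you raw int16 PCM, so a small conversion helper is useful; the function name here is ours, not part of the library:

```python
import numpy as np

def pcm16_to_float32(pcm_bytes: bytes) -> np.ndarray:
    """Convert little-endian int16 PCM bytes to float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0

# Round-trip a zero, half-scale, and full-scale-negative sample
audio = pcm16_to_float32(np.array([0, 16384, -32768], dtype=np.int16).tobytes())
```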
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
If you have a spare GPU, use `device="cuda", compute_type="float16"`, but pin faster-whisper to a different GPU than vLLM; otherwise the two contend for VRAM that vLLM needs for its KV cache.
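For the pinning itself, faster-whisper exposes a `device_index` argument on `WhisperModel`, so the second GPU can host STT while vLLM owns the first. A config sketch, assuming GPU 1 is free; adjust indices to your topology:

```python
from faster_whisper import WhisperModel

# GPU 0 stays dedicated to vLLM; whisper runs on GPU 1
stt = WhisperModel(
    "large-v3",
    device="cuda",
    device_index=1,
    compute_type="float16",
)
```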
Step 3 — FastAPI WebSocket bridge
```python
import asyncio

import httpx
import numpy as np
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def llm_chat(messages):
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.post(
            "http://127.0.0.1:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
                "messages": messages,
                "temperature": 0.3,
                "max_tokens": 200,
            },
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]
```
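An unbounded message list grows the prompt, and vLLM's KV usage, on every turn. A minimal trimming helper that keeps the system message plus the last N user/assistant messages might look like this; the function and the cap of 8 are our own choices, not from any library:

```python
def trim_history(messages, max_turns=8):
    """Keep system messages plus the last `max_turns` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

# Simulate 12 turns on top of a system prompt
history = [{"role": "system", "content": "Be concise."}]
for i in range(12):
    history.append({"role": "user", "content": f"q{i}"})
trimmed = trim_history(history)
```

Call `trim_history(history)` right before passing the list to `llm_chat` and the prompt stays bounded regardless of call length.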
Step 4 — Stream audio and transcribe in chunks
```python
CHUNK_BYTES = int(16000 * 2 * 1.2)  # 1.2 s of 16 kHz int16 audio

@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    history = [{"role": "system", "content": "Be concise."}]
    buf = bytearray()
    while True:
        msg = await ws.receive_bytes()
        buf.extend(msg)
        if len(buf) < CHUNK_BYTES:
            continue
        # int16 PCM -> float32 in [-1, 1], as transcribe() expects
        pcm = np.frombuffer(bytes(buf), dtype=np.int16).astype(np.float32) / 32768.0
        buf.clear()
        segs, _ = stt.transcribe(
            pcm,
            language="en",
            vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=500),
        )
        text = " ".join(s.text for s in segs).strip()
        if not text:
            continue
        history.append({"role": "user", "content": text})
        reply = await llm_chat(history)
        history.append({"role": "assistant", "content": reply})
        await ws.send_text(reply)
```
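The handler leans on faster-whisper's built-in VAD, but a cheap RMS gate in front of `transcribe()` skips obviously silent chunks before they ever touch the model. A sketch; the threshold is an assumption you'd tune per microphone and input gain:

```python
import numpy as np

SILENCE_RMS = 0.01  # assumed threshold; tune for your audio path

def is_silence(pcm: np.ndarray, threshold: float = SILENCE_RMS) -> bool:
    """True if the float32 chunk's RMS energy falls below the threshold."""
    return float(np.sqrt(np.mean(np.square(pcm)))) < threshold

silent = np.zeros(16000, dtype=np.float32)
tone = (0.2 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)).astype(np.float32)
```

In the handler, check `is_silence(pcm)` right after the int16-to-float32 conversion and `continue` early when it's true.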
Step 5 — Add tool calls (vLLM 0.6+ supports them)
```python
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Check stock for a SKU",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}]

async def llm_with_tools(messages):
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.post(
            "http://127.0.0.1:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
                "messages": messages,
                "tools": TOOLS,
                "tool_choice": "auto",
            },
        )
        return r.json()["choices"][0]["message"]
```
`--enable-auto-tool-choice` + `--tool-call-parser llama3_json` is the magic flag combo. Without both, tool calls come back as plain text.
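Once the parser is working, the returned message carries a `tool_calls` list in the OpenAI format: each entry has a `function.name` and JSON-encoded `function.arguments`. A minimal dispatch loop over that shape; the registry and the fake message below are ours for illustration, not vLLM output:

```python
import json

def check_inventory(sku: str) -> dict:
    # Stand-in implementation for the example
    return {"sku": sku, "in_stock": 3}

REGISTRY = {"check_inventory": check_inventory}

def dispatch_tool_calls(message: dict) -> list[dict]:
    """Run each tool call and return role:"tool" messages to append to history."""
    results = []
    for call in message.get("tool_calls", []):
        fn = REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# Fake assistant message in the OpenAI tool-call shape:
msg = {"tool_calls": [{"id": "call_1", "function": {
    "name": "check_inventory", "arguments": '{"sku": "A-42"}'}}]}
out = dispatch_tool_calls(msg)
```

Append the returned messages to `history` and call the model again so it can phrase the tool result for the caller.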
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Benchmark and tune
On a single H100 with Llama-3.1-8B:
- vLLM throughput: 6,200 output tokens/s aggregate.
- faster-whisper-large-v3 INT8 on 8 CPU threads: ~0.18 s for a 5 s utterance.
- 30 concurrent voice sessions stable; KV cache is the limit.
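That ceiling follows from KV-cache arithmetic. Assuming Llama-3.1-8B's published geometry (32 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache, the per-token and per-session cost works out as below; this is a back-of-envelope estimate, not a vLLM measurement:

```python
# Llama-3.1-8B geometry (from the model config)
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_el = 2  # fp16

# K and V, per token, across all layers
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # bytes
per_session = kv_per_token * 8192  # a full --max-model-len context

kv_per_token_kib = kv_per_token / 1024   # 128 KiB per token
per_session_gib = per_session / 2**30    # 1 GiB per maxed-out session
```

At about 1 GiB per full-length session, an 80 GB H100 minus roughly 16 GB of fp16 weights leaves room for ~60 worst-case sessions, so 30 concurrent calls with shorter real contexts is consistent with the cache being the limit.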
Common pitfalls
- CTranslate2 + CUDA 11. The current build needs CUDA 12 / cuDNN 9. Pin `ctranslate2==4.4.0` if you're stuck on cuDNN 8.
- KV cache OOM. Lower `--max-model-len`; voice agents rarely need 32k context.
- Tool-call parser. Llama 3 uses a JSON-with-tags format; the `hermes` parser silently drops its calls.
How CallSphere does this in production
CallSphere's stack: 37 specialist agents, 90+ tools, 115+ Postgres tables across 6 verticals. Healthcare runs 14 tools on FastAPI :8084 with OpenAI Realtime; OneRoof Property uses 10 specialists on WebRTC. We benchmark vLLM internally for offline batch workloads (transcript summarization, QA scoring) but keep Realtime for live calls. Flat pricing $149 / $499 / $1499. 14-day trial · 22% affiliate · /demo.
FAQ
Why not Whisper on GPU too? It's overkill; CPU-INT8 is fast enough and frees the GPU for the LLM.
Will vLLM eat my Realtime budget? It replaces the LLM half; you still need an STT and TTS layer.
Best model for voice today? Llama-3.1-8B for low latency, Llama-3.3-70B for quality.
Can I do speculative decoding? Yes, vLLM supports n-gram and EAGLE.
HIPAA? Possible if you self-host on a HIPAA-eligible cloud and add audit logging. CallSphere does it for Healthcare out-of-the-box.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.