AI Infrastructure · 13 min read

Build a Voice Agent with faster-whisper + vLLM (2026 Stack)

faster-whisper handles STT on CPU; vLLM serves Llama-3.1 on the GPU (70B across two H100s, or 8B on one) with 6x the throughput of HF Transformers. Here's the production-grade voice pipeline that connects them.

TL;DR: faster-whisper (CTranslate2) is 4x faster than openai/whisper on the same hardware. vLLM is the de-facto OSS LLM server (PagedAttention, continuous batching). Run faster-whisper on CPU, vLLM on the GPU, and you'll handle 30+ concurrent voice calls per H100.

What you'll build

A FastAPI service that accepts WebSocket audio, transcribes with faster-whisper-large-v3, calls vLLM's OpenAI-compatible /v1/chat/completions for replies, and streams TTS back. Designed for multi-tenant voice agents.

Prerequisites

  1. Linux box with NVIDIA GPU (H100 or 4090).
  2. CUDA 12 + cuDNN 9 (latest ctranslate2 requires this).
  3. Python 3.11, pip install faster-whisper vllm fastapi uvicorn websockets.
  4. Hugging Face token for downloading models.

Architecture

```mermaid
flowchart LR
  CL[Client WSS] --> API[FastAPI]
  API -->|audio| FW[faster-whisper large-v3 INT8]
  API -->|prompt| VLLM[vLLM /v1/chat/completions]
  VLLM --> API
  API -->|text| TTS[Piper / Coqui]
  TTS --> CL
```

Step 1 — Start vLLM with an OpenAI-compatible server

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8001
```

For a single 80 GB H100, drop --tensor-parallel-size and use Llama-3.1-8B-Instruct to leave headroom for KV cache.
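
As a concrete sketch, that single-GPU launch is the same command with tensor parallelism dropped and the 8B model substituted (all other flags unchanged from above):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8001
```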

Step 2 — Initialize faster-whisper

```python
from faster_whisper import WhisperModel

# INT8 on CPU is the sweet spot for sub-2s utterances
stt = WhisperModel(
    "large-v3",
    device="cpu",
    compute_type="int8",
    cpu_threads=8,
    num_workers=2,
)
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

If you have spare GPU, use device="cuda", compute_type="float16", but pin it to a different GPU than vLLM or the two will compete for the VRAM vLLM has reserved for its KV cache.
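
A minimal sketch of that GPU split, assuming vLLM owns GPU 0 and a second card sits at index 1 (faster-whisper's device_index picks the card; adjust to your topology):

```python
from faster_whisper import WhisperModel

# Run Whisper on GPU 1 so vLLM keeps GPU 0, and the VRAM it reserved, to itself.
stt_gpu = WhisperModel(
    "large-v3",
    device="cuda",
    device_index=1,          # second GPU; vLLM occupies GPU 0
    compute_type="float16",
)
```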

Step 3 — FastAPI WebSocket bridge

```python
from fastapi import FastAPI, WebSocket
import numpy as np
import httpx
import asyncio

app = FastAPI()

async def llm_chat(messages):
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.post(
            "http://127.0.0.1:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
                "messages": messages,
                "temperature": 0.3,
                "max_tokens": 200,
            },
        )
        return r.json()["choices"][0]["message"]["content"]
```

Step 4 — Stream audio and transcribe in chunks

```python
@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    history = [{"role": "system", "content": "Be concise."}]
    buf = bytearray()
    while True:
        msg = await ws.receive_bytes()
        buf.extend(msg)
        if len(buf) < 16000 * 2 * 1.2:  # wait for 1.2 s of 16 kHz int16 audio
            continue
        pcm = np.frombuffer(bytes(buf), dtype=np.int16).astype(np.float32) / 32768.0
        buf.clear()
        segs, _ = stt.transcribe(
            pcm,
            language="en",
            vad_filter=True,
            vad_parameters=dict(min_silence_duration_ms=500),
        )
        text = " ".join(s.text for s in segs).strip()
        if not text:
            continue
        history.append({"role": "user", "content": text})
        reply = await llm_chat(history)
        history.append({"role": "assistant", "content": reply})
        await ws.send_text(reply)
```
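
The TTS leg of the architecture diagram isn't wired up in the handler above. One rough sketch, assuming the piper CLI (pip install piper-tts) is on PATH with a downloaded voice model; the model filename is illustrative and Coqui or any other TTS would slot in the same way:

```python
import asyncio
import tempfile
from pathlib import Path

async def synthesize(text: str) -> bytes:
    # Sketch: shell out to the Piper CLI, which reads text on stdin and writes a WAV file.
    with tempfile.TemporaryDirectory() as tmp:
        wav_path = Path(tmp) / "reply.wav"
        proc = await asyncio.create_subprocess_exec(
            "piper", "--model", "en_US-lessac-medium.onnx",
            "--output_file", str(wav_path),
            stdin=asyncio.subprocess.PIPE,
        )
        await proc.communicate(text.encode())
        return wav_path.read_bytes()

# In the handler above, stream audio back instead of text:
#     await ws.send_bytes(await synthesize(reply))
```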

Step 5 — Add tool calls (vLLM 0.6+ supports them)

```python
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_inventory",
        "description": "Check stock for a SKU",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
}]

async def llm_with_tools(messages):
    async with httpx.AsyncClient() as c:
        r = await c.post(
            "http://127.0.0.1:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
                "messages": messages,
                "tools": TOOLS,
                "tool_choice": "auto",
            },
        )
        return r.json()["choices"][0]["message"]
```

--enable-auto-tool-choice + --tool-call-parser llama3_json is the magic flag combo. Without both, tool calls return as plain text.
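
When the model does call a tool, the message carries a tool_calls array in the standard OpenAI-compatible shape. A minimal dispatch loop might look like this (the check_inventory stub is hypothetical; replace it with your real lookup):

```python
import json

async def check_inventory(sku: str) -> dict:
    # Hypothetical stub; replace with your real inventory lookup.
    return {"sku": sku, "in_stock": True}

async def run_turn(messages):
    msg = await llm_with_tools(messages)
    if not msg.get("tool_calls"):
        return msg["content"]
    messages.append(msg)
    for call in msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = await check_inventory(**args)  # dispatch on call["function"]["name"] once you add more tools
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    # Second pass lets the model turn the tool result into a spoken reply.
    return (await llm_with_tools(messages))["content"]
```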

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Benchmark and tune

On a single H100 with Llama-3.1-8B:

  • vLLM throughput: 6,200 output tokens/s aggregate.
  • faster-whisper-large-v3 INT8 on 8 CPU threads: ~0.18 s for a 5 s utterance.
  • 30 concurrent voice sessions stable; KV cache is the limit.
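
To see why the KV cache is the binding constraint, here is a back-of-envelope sketch, assuming Llama-3.1-8B's published architecture (32 layers, 8 KV heads of dim 128) and an FP16 cache; exact headroom depends on gpu-memory-utilization and weight precision:

```python
# Rough KV-cache budget for Llama-3.1-8B with an FP16 cache (K + V).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes    # 131,072 B ~ 128 KiB/token
per_session = per_token * 8192                                # ~1 GiB at --max-model-len 8192
print(per_token / 1024, "KiB per token")
print(per_session / 2**30, "GiB per full-length session")
# An 80 GB H100 minus ~16 GB of 8B weights leaves roughly 60 GB of cache,
# so a few dozen long-context sessions is the practical ceiling, in line with the 30 above.
```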

Common pitfalls

  • CTranslate2 + CUDA 11. The current build needs CUDA 12 / cuDNN 9. Pin ctranslate2==4.4.0 if you're stuck on cuDNN 8.
  • KV cache OOM. Lower --max-model-len; voice agents rarely need 32k context.
  • Tool-call parser. Llama 3 uses a JSON-with-tags format; hermes parser silently drops calls.

How CallSphere does this in production

CallSphere's stack: 37 specialist agents, 90+ tools, 115+ Postgres tables across 6 verticals. Healthcare runs 14 tools on FastAPI :8084 with OpenAI Realtime; OneRoof Property uses 10 specialists on WebRTC. We benchmark vLLM internally for offline batch workloads (transcript summarization, QA scoring) but keep Realtime for live calls. Flat pricing $149 / $499 / $1499. 14-day trial · 22% affiliate · /demo.

FAQ

Why not Whisper on GPU too? It's overkill; CPU-INT8 is fast enough and frees the GPU for the LLM.

Will vLLM eat my Realtime budget? It replaces the LLM half; you still need an STT and TTS layer.

Best model for voice today? Llama-3.1-8B for low latency, Llama-3.3-70B for quality.

Can I do speculative decoding? Yes, vLLM supports n-gram and EAGLE.

HIPAA? Possible if you self-host on a HIPAA-eligible cloud and add audit logging. CallSphere does it for Healthcare out-of-the-box.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.