---
title: "Build a Voice Agent with faster-whisper + vLLM (2026 Stack)"
description: "faster-whisper handles STT on CPU; vLLM serves a 70B model on a single H100 with 6x throughput vs HF Transformers. Here's the production-grade voice pipeline that connects them."
canonical: https://callsphere.ai/blog/vw4h-build-voice-agent-faster-whisper-vllm
category: "AI Infrastructure"
tags: ["faster-whisper", "vLLM", "CTranslate2", "Voice Agent", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-21T00:00:00.000Z
updated: 2026-05-07T16:13:44.724Z
---

# Build a Voice Agent with faster-whisper + vLLM (2026 Stack)

> faster-whisper handles STT on CPU; vLLM serves a 70B model on a single H100 with 6x throughput vs HF Transformers. Here's the production-grade voice pipeline that connects them.

> **TL;DR** — `faster-whisper` (CTranslate2) is 4x faster than `openai/whisper` on the same hardware. `vLLM` is the de facto OSS LLM server (PagedAttention, continuous batching). Run faster-whisper on CPU, vLLM on the GPU, and you'll handle 30+ concurrent voice calls per H100.

## What you'll build

A FastAPI service that accepts WebSocket audio, transcribes with faster-whisper-large-v3, calls vLLM's OpenAI-compatible `/v1/chat/completions` for replies, and streams TTS back. Designed for multi-tenant voice agents.

## Prerequisites

1. Linux box with NVIDIA GPU (H100 or 4090).
2. CUDA 12 + cuDNN 9 (latest `ctranslate2` requires this).
3. Python 3.11, `pip install faster-whisper vllm fastapi uvicorn websockets`.
4. Hugging Face token for downloading models.

## Architecture

```mermaid
flowchart LR
  CL[Client WSS] --> API[FastAPI]
  API -->|audio| FW[faster-whisper large-v3 INT8]
  API -->|prompt| VLLM[vLLM /v1/chat/completions]
  VLLM --> API
  API -->|text| TTS[Piper / Coqui]
  TTS --> CL
```

## Step 1 — Start vLLM with an OpenAI-compatible server

```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8001
```

For a single 80 GB H100, drop `--tensor-parallel-size` and use `Llama-3.1-8B-Instruct` to leave headroom for KV cache.
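If you go single-GPU, the launch looks roughly like this; a minimal sketch assuming one 80 GB H100, with `--gpu-memory-utilization` capping how much VRAM vLLM claims for weights plus KV cache:

```bash
# Single-GPU sketch: 8B model, shorter context, and a VRAM cap so other
# processes on the box keep some headroom.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --port 8001
```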

## Step 2 — Initialize faster-whisper

```python
from faster_whisper import WhisperModel

# INT8 on CPU is the sweet spot for sub-2s utterances

stt = WhisperModel("large-v3", device="cpu", compute_type="int8",
                   cpu_threads=8, num_workers=2)
```

If you have spare GPU, use `device="cuda", compute_type="float16"` — but pin to a different GPU than vLLM or you'll thrash the KV cache.
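If you do take that route, `device_index` is how you pin faster-whisper to a specific card (or set `CUDA_VISIBLE_DEVICES` for the whole process). A minimal sketch assuming a two-GPU box with vLLM on GPU 0:

```python
from faster_whisper import WhisperModel

# Sketch: run Whisper on GPU 1 while vLLM owns GPU 0.
# device_index pins the CTranslate2 model to a specific CUDA device.
stt = WhisperModel("large-v3", device="cuda", device_index=1,
                   compute_type="float16")
```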

## Step 3 — FastAPI WebSocket bridge

```python
from fastapi import FastAPI, WebSocket
import numpy as np, httpx, asyncio
app = FastAPI()

async def llm_chat(messages):
    async with httpx.AsyncClient(timeout=30) as c:
        r = await c.post("[http://127.0.0.1:8001/v1/chat/completions](http://127.0.0.1:8001/v1/chat/completions)", json={
            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            "messages": messages, "temperature": 0.3, "max_tokens": 200})
    return r.json()["choices"][0]["message"]["content"]
```
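The blocking call above is the simplest thing that works. For voice you usually want tokens as they are generated so TTS can start early; here's a streaming sketch against the same endpoint, assuming the server speaks the standard OpenAI SSE format (`data: {...}` chunks terminated by `data: [DONE]`):

```python
import json
import httpx

# Sketch: stream completion tokens from vLLM's OpenAI-compatible endpoint
# so the TTS layer can begin speaking before the full reply is finished.
async def llm_chat_stream(messages):
    async with httpx.AsyncClient(timeout=None) as c:
        async with c.stream("POST", "http://127.0.0.1:8001/v1/chat/completions", json={
            "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            "messages": messages, "temperature": 0.3, "max_tokens": 200,
            "stream": True,
        }) as r:
            async for line in r.aiter_lines():
                if not line.startswith("data: "):
                    continue
                payload = line[len("data: "):]
                if payload == "[DONE]":
                    break
                delta = json.loads(payload)["choices"][0]["delta"]
                if delta.get("content"):
                    yield delta["content"]
```

In the handler below you'd accumulate chunks from `llm_chat_stream(history)` and forward each one to TTS as it arrives instead of waiting for the full reply.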

## Step 4 — Stream audio and transcribe in chunks

```python
@app.websocket("/voice")
async def voice(ws: WebSocket):
    await ws.accept()
    history, buf = [{"role":"system","content":"Be concise."}], bytearray()
    while True:
        msg = await ws.receive_bytes()
        buf.extend(msg)
        if len(buf) < 16000 * 2 * 1.2: continue   # 1.2s of 16kHz int16
        pcm = np.frombuffer(bytes(buf), dtype=np.int16).astype(np.float32) / 32768.0
        buf.clear()
        segs, _ = stt.transcribe(pcm, language="en", vad_filter=True,
                                  vad_parameters=dict(min_silence_duration_ms=500))
        text = " ".join(s.text for s in segs).strip()
        if not text: continue
        history.append({"role":"user","content":text})
        reply = await llm_chat(history)
        history.append({"role":"assistant","content":reply})
        await ws.send_text(reply)
```
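One caveat the handler glosses over: `stt.transcribe` is synchronous CPU work, so under concurrent calls it will stall the event loop. A small sketch that pushes it onto a worker thread; swap the direct call in the handler for `await transcribe_async(pcm)`:

```python
import asyncio

# Sketch: run the blocking faster-whisper call in a worker thread so the
# FastAPI event loop stays responsive while other sessions are active.
async def transcribe_async(pcm):
    def _run():
        segs, _ = stt.transcribe(pcm, language="en", vad_filter=True,
                                  vad_parameters=dict(min_silence_duration_ms=500))
        return " ".join(s.text for s in segs).strip()
    return await asyncio.to_thread(_run)
```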

## Step 5 — Add tool calls (vLLM 0.6+ supports them)

```python
TOOLS = [{
  "type":"function",
  "function":{
    "name":"check_inventory",
    "description":"Check stock for a SKU",
    "parameters":{"type":"object","properties":{
      "sku":{"type":"string"}},"required":["sku"]}}}]

async def llm_with_tools(messages):
    async with httpx.AsyncClient() as c:
        r = await c.post("[http://127.0.0.1:8001/v1/chat/completions](http://127.0.0.1:8001/v1/chat/completions)", json={
            "model":"meta-llama/Meta-Llama-3.1-70B-Instruct",
            "messages":messages, "tools":TOOLS, "tool_choice":"auto"})
    return r.json()["choices"][0]["message"]
```

`--enable-auto-tool-choice` + `--tool-call-parser llama3_json` is the magic flag combo. Without both, tool calls return as plain text.
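With the flags in place, the returned message may carry a `tool_calls` array instead of plain `content`. A minimal dispatch sketch; `check_inventory` here is a hypothetical local function matching the schema above:

```python
import json

# Sketch: execute any tool calls the model requested, feed the results back
# as role="tool" messages, then ask the model for its final answer.
async def run_turn(messages):
    msg = await llm_with_tools(messages)
    if not msg.get("tool_calls"):
        return msg["content"]
    messages.append(msg)
    for call in msg["tool_calls"]:
        args = json.loads(call["function"]["arguments"])
        result = check_inventory(args["sku"])  # hypothetical local tool
        messages.append({"role": "tool",
                         "tool_call_id": call["id"],
                         "content": json.dumps(result)})
    final = await llm_with_tools(messages)
    return final["content"]
```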

## Step 6 — Benchmark and tune

On a single H100 with Llama-3.1-8B:

- vLLM throughput: 6,200 output tokens/s aggregate (a rough reproduction sketch follows this list).
- faster-whisper-large-v3 INT8 on 8 CPU threads: ~0.18 s for a 5 s utterance.
- 30 concurrent voice sessions stable; KV cache is the limit.
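Those figures are from our internal runs. A rough load-test sketch if you want to reproduce the throughput number on your own box; the model name and prompt are placeholders, adjust for your workload:

```python
import asyncio, time
import httpx

# Sketch: fire N concurrent chat requests at vLLM and report aggregate
# completion tokens per second, using the usage block in each response.
async def one_request(client):
    r = await client.post("http://127.0.0.1:8001/v1/chat/completions", json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Describe your return policy."}],
        "max_tokens": 200})
    return r.json()["usage"]["completion_tokens"]

async def load_test(n=64):
    async with httpx.AsyncClient(timeout=120) as client:
        start = time.perf_counter()
        tokens = await asyncio.gather(*[one_request(client) for _ in range(n)])
        elapsed = time.perf_counter() - start
        print(f"{sum(tokens) / elapsed:.0f} output tokens/s across {n} requests")

asyncio.run(load_test())
```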

## Common pitfalls

- **CTranslate2 + CUDA 11.** The current build needs CUDA 12 / cuDNN 9. Pin `ctranslate2==4.4.0` if you're stuck on cuDNN 8 (quick sketch after this list).
- **KV cache OOM.** Lower `--max-model-len`; voice agents rarely need 32k context.
- **Tool-call parser.** Llama 3 emits tool calls in a JSON-with-tags format; the `hermes` parser silently drops them, so stick with `llama3_json`.
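For the first pitfall, a quick sketch of the version pin plus a sanity check that CTranslate2 can actually see your GPU:

```bash
# Sketch: pin the last cuDNN-8-friendly CTranslate2 build, then confirm
# CTranslate2 detects at least one CUDA device.
pip install "ctranslate2==4.4.0" faster-whisper
python -c "import ctranslate2; print(ctranslate2.get_cuda_device_count())"
```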

## How CallSphere does this in production

CallSphere's stack: 37 specialist agents, 90+ tools, 115+ Postgres tables across 6 verticals. Healthcare runs 14 tools on FastAPI :8084 with OpenAI Realtime; OneRoof Property uses 10 specialists on WebRTC. We benchmark vLLM internally for offline batch workloads (transcript summarization, QA scoring) but keep Realtime for live calls. Flat pricing $149 / $499 / $1499. [14-day trial](/trial) · [22% affiliate](/affiliate) · [/demo](/demo).

## FAQ

**Why not Whisper on GPU too?** It's overkill; CPU-INT8 is fast enough and frees the GPU for the LLM.

**Will vLLM eat my Realtime budget?** It replaces the LLM half; you still need an STT and TTS layer.

**Best model for voice today?** Llama-3.1-8B for low latency, Llama-3.3-70B for quality.

**Can I do speculative decoding?** Yes, vLLM supports n-gram and EAGLE.
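A launch sketch for the n-gram (prompt-lookup) drafter; flag spellings have shifted across vLLM releases, so treat this as illustrative and check your version's docs:

```bash
# Sketch: n-gram speculative decoding, drafting up to 5 tokens per step
# from prompt lookup. No separate draft model needed.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4 \
  --port 8001
```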

**HIPAA?** Possible if you self-host on a HIPAA-eligible cloud and add audit logging. CallSphere does it for [Healthcare](/industries/healthcare) out-of-the-box.

## Sources

- [faster-whisper on GitHub](https://github.com/SYSTRAN/faster-whisper)
- [vLLM Whisper docs](https://docs.vllm.ai/en/latest/api/vllm/model_executor/models/whisper/)
- [Arm Learning Path: faster-whisper + vLLM](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_voicechatbot/)
- [Red Hat Whisper production guide](https://developers.redhat.com/articles/2026/03/06/private-transcription-whisper-red-hat-ai)

