---
title: "Build a Voice Agent with Krisp Audio Filter and VIVA SDK (2026)"
description: "Krisp's VIVA SDK isolates the primary speaker before STT. Wire it as a pre-processor in front of LiveKit/Pipecat for 30%+ WER drop in noisy calls."
canonical: https://callsphere.ai/blog/vw9h-build-voice-agent-krisp-audio-filter-viva-2026
category: "AI Voice Agents"
tags: ["Krisp", "VIVA", "Voice Agent", "Noise Cancellation", "SDK"]
author: "CallSphere Team"
published: 2026-03-29T00:00:00.000Z
updated: 2026-05-08T17:25:15.771Z
---

# Build a Voice Agent with Krisp Audio Filter and VIVA SDK (2026)

> Krisp's VIVA SDK isolates the primary speaker before STT. Wire it as a pre-processor in front of LiveKit/Pipecat for 30%+ WER drop in noisy calls.

> **TL;DR** — Krisp shipped VIVA (Voice Isolation for Voice Agents) in 2026 — a CPU-only model 3.5x smaller than its predecessor that strips background noise AND secondary voices before audio reaches your STT. Drop it as a pipeline pre-processor and watch WER improve 20-40% on real-world calls.

## What you'll build

A LiveKit Agents pipeline with Krisp VIVA inserted between the room input track and Deepgram STT, so the LLM only hears the primary caller — even at a coffee shop or with a TV in the background.

## Architecture

```mermaid
flowchart LR
  MIC[Caller mic] --> RM[LiveKit room]
  RM --> KR[Krisp VIVA filter]
  KR -- clean PCM --> STT[Deepgram Nova-3]
  STT --> LLM[GPT-4o]
  LLM --> TTS[ElevenLabs]
  TTS --> RM --> MIC
```

## Step 1 — Get Krisp SDK

Sign up at developers.krisp.ai. You'll get an SDK token and a per-platform binary (`libkrisp-audio-sdk.so` for Linux, `.dylib` for macOS, `.dll` for Windows, `.wasm` for the browser).

## Step 2 — Python bindings

```bash
pip install krisp-audio-sdk  # internal pip from Krisp
export KRISP_TOKEN="your-token"
```

## Step 3 — Wrap as an audio processor

```python
import numpy as np
from krisp_audio_sdk import AudioCleaner, ModelType
from livekit.agents import audio

class KrispVAF(audio.AudioProcessor):
    def __init__(self):
        self.cleaner = AudioCleaner(
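            # Voice-call optimised. The model variant must match sample_rate;
            # VIVA_VC_16K is assumed by analogy with the WASM "viva_vc_16k" id.
            model=ModelType.VIVA_VC_16K,
            sample_rate=16000,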
            frame_size_ms=10,
        )
    async def process(self, frame: audio.AudioFrame) -> audio.AudioFrame:
        clean = self.cleaner.clean_frame(frame.data)
        return audio.AudioFrame(data=clean, sample_rate=frame.sample_rate,
                                num_channels=frame.num_channels)
```

## Step 4 — Insert into LiveKit pipeline

```python
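from livekit.agents import AgentSession, RoomInputOptions
from livekit.plugins import deepgram, elevenlabs, openai

# Runs inside your agent entrypoint, where `ctx` is the job context
# and `Concierge` is your own Agent subclass.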
session = AgentSession(
    stt=deepgram.STT(model="nova-3"),
    llm=openai.LLM(model="gpt-4o"),
    tts=elevenlabs.TTS(),
)
await session.start(
    room=ctx.room,
    agent=Concierge(),
    room_input_options=RoomInputOptions(
        audio_processors=[KrispVAF()],
    ),
)
```

## Step 5 — Browser fallback (WASM)

```ts
import { KrispSDK } from "@krisp.ai/krisp-audio-sdk-wasm";

const krisp = await KrispSDK.create({
  authToken: process.env.NEXT_PUBLIC_KRISP_TOKEN!,
  model: "viva_vc_16k",
});
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const cleaned = await krisp.process(stream);   // returns a MediaStream
// pipe `cleaned` into your WebRTC peer connection or AudioWorklet
```

## Step 6 — Pipecat variant

```python
from pipecat.audio.filters.krisp_filter import KrispFilter

transport = DailyTransport(..., DailyParams(
    audio_in_filter=KrispFilter(model="viva_vc_32k"),
))
```

## Step 7 — Measure the win

Run a controlled WER test (e.g. LibriSpeech + cafe-noise SNR 10 dB). Typical numbers in 2026: Deepgram Nova-3 alone hits ~14% WER on noisy mixed clips; Nova-3 + VIVA drops to ~9% — a >30% relative reduction.
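
To make the comparison reproducible, it helps to compute WER yourself rather than rely on a vendor dashboard. The following is a minimal sketch in plain Python, assuming whitespace tokenization and no text normalization (casing, punctuation, and number formatting would need handling on real transcripts). The function names are illustrative and not part of any SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, filtered_wer: float) -> float:
    """Fraction by which WER dropped relative to the unfiltered baseline."""
    return (baseline_wer - filtered_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("book a cut for friday", "book a cat for friday"))
    print(round(relative_reduction(0.14, 0.09), 3))
```

With the figures quoted above, `relative_reduction(0.14, 0.09)` is roughly 0.357, which is where the ">30%" claim comes from. Run both the raw and the filtered audio through the same STT model and the same scoring function so the filter is the only variable.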

## Pitfalls

- **Sample rate**: each VIVA model is pinned to a single sample rate (16 kHz or 32 kHz); resample BEFORE calling `clean_frame`.
- **CPU budget**: VIVA-VC adds ~6-10% single-core CPU per stream; size workers accordingly.
- **Frame size**: Stick with 10ms — 20ms increases buffer latency 2x for marginal quality gain.
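- **Browser cross-origin isolation**: the WASM build requires `Cross-Origin-Embedder-Policy: require-corp` in your response headers. Browsers only treat a page as cross-origin isolated when that header is paired with `Cross-Origin-Opener-Policy: same-origin`.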
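
The sample-rate and frame-size pitfalls can be handled with two small helpers that run before the filter. This is a rough sketch using only NumPy, assuming mono 16-bit PCM; `resample_linear` and `split_frames` are hypothetical names, not Krisp or LiveKit APIs. Linear interpolation has no anti-aliasing filter, so treat it as a placeholder for a proper resampler in production.

```python
import numpy as np


def resample_linear(pcm: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample mono int16 PCM by linear interpolation (no anti-alias filter)."""
    if src_rate == dst_rate:
        return pcm
    n_out = int(round(len(pcm) * dst_rate / src_rate))
    # Where each output sample falls on the input sample axis.
    positions = np.arange(n_out) * (src_rate / dst_rate)
    out = np.interp(positions, np.arange(len(pcm)), pcm.astype(np.float64))
    return np.clip(np.round(out), -32768, 32767).astype(np.int16)


def split_frames(pcm: np.ndarray, sample_rate: int, frame_ms: int = 10) -> list:
    """Cut PCM into fixed-size frames, dropping any trailing partial frame."""
    samples_per_frame = sample_rate * frame_ms // 1000
    n_full = len(pcm) // samples_per_frame
    return [
        pcm[i * samples_per_frame:(i + 1) * samples_per_frame]
        for i in range(n_full)
    ]


if __name__ == "__main__":
    one_second_48k = np.zeros(48000, dtype=np.int16)
    at_16k = resample_linear(one_second_48k, 48000, 16000)
    frames = split_frames(at_16k, 16000, frame_ms=10)
    print(len(at_16k), len(frames), len(frames[0]))
```

At 16 kHz a 10 ms frame is 160 samples, so one second of audio yields 100 frames. In a streaming pipeline you would carry the dropped partial frame over to the next chunk instead of discarding it.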

## How CallSphere does this

CallSphere wraps every inbound call across **6 verticals** with VIVA, then feeds **37 agents** through **90+ tools** and **115+ DB tables**. The salon vertical (loud chair-side noise) saw a 33% WER reduction. **$149/$499/$1,499 · 14-day trial · 22% affiliate**.

## FAQ

**Cloud or local?** Krisp processes audio locally. No audio leaves the worker, so PHI and PII stay inside your own infrastructure.

**License model?** Per-minute via SDK token; volume tiers down to ~$0.001/min at scale.

**Mobile?** iOS + Android binaries ship with the same API surface.

**Compatible with Deepgram/AssemblyAI/Soniox?** Yes — VIVA is a pre-processor, totally vendor-neutral.

## Sources

- Krisp Developers - Real-Time AI Voice SDK - [https://krisp.ai/developers/](https://krisp.ai/developers/)
- Krisp SDK Docs - [https://sdk-docs.krisp.ai/](https://sdk-docs.krisp.ai/)
- Krisp Blog - 3.5x Smaller Voice Isolation Model - [https://krisp.ai/blog/small-voice-isolation-model/](https://krisp.ai/blog/small-voice-isolation-model/)
- Krisp SDK Docs - Twilio Voice Integration - [https://sdk-docs.krisp.ai/docs/twilio-voice](https://sdk-docs.krisp.ai/docs/twilio-voice)

## How this plays out in production

If you are taking the ideas in *Build a Voice Agent with Krisp Audio Filter and VIVA SDK (2026)* and putting them in front of real customers, two constraints decide everything: ASR error rates on long-tail entities (drug names, street names, SKUs) and the post-call pipeline that must reconcile what was actually heard. Treat this as a voice-first system from the first prompt: the agent's persona, its tool surface, and its escalation rules all flow from that single decision. Teams that ship fast tend to instrument the loop end-to-end before they tune any single component, because the bottleneck is rarely where intuition puts it.

## Voice agent architecture, end to end

A production-grade voice stack at CallSphere stitches Twilio Programmable Voice (PSTN ingress, TwiML, bidirectional Media Streams) to a realtime reasoning layer — typically OpenAI Realtime or ElevenLabs Conversational AI — with sub-second response as a hard SLO. Anything north of one second of perceived silence and callers either repeat themselves or hang up; that single number drives the whole architecture. Server-side VAD with proper barge-in support is non-negotiable, otherwise the agent talks over the caller and the conversation collapses. Streaming TTS with phoneme-aligned interruption keeps the cadence natural even when the user changes their mind mid-sentence. Post-call, every transcript is run through a structured pipeline: sentiment, intent classification, lead score, escalation flag, and a normalized slot extraction (name, callback number, reason, urgency). For healthcare workloads, the BAA-covered storage path, audit logs, encryption-at-rest, and PHI-safe transcript redaction are wired in from day one, not bolted on at compliance review. The end state is a system where every call produces a row of structured data, not just a recording.
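
One way to picture "a row of structured data" per call is a typed record plus a normalizer for each slot. The sketch below is illustrative only: the field names mirror the list above, but the `PostCallRecord` class, the label values, and the North American phone-number assumption are hypothetical and not CallSphere's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PostCallRecord:
    call_id: str
    sentiment: str            # e.g. "positive", "neutral", "negative" (assumed labels)
    intent: str
    lead_score: int           # assumed 0-100 scale
    escalate: bool
    name: Optional[str] = None
    callback_number: Optional[str] = None
    reason: Optional[str] = None
    urgency: Optional[str] = None


def normalize_callback_number(raw: str, country_code: str = "1") -> Optional[str]:
    """Normalize a spoken or typed number to +<country><10 digits>, else None.

    Assumes a 10-digit national number (North American style).
    """
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) == 10:
        return f"+{country_code}{digits}"
    if len(digits) == 10 + len(country_code) and digits.startswith(country_code):
        return f"+{digits}"
    return None


if __name__ == "__main__":
    record = PostCallRecord(
        call_id="call-001",
        sentiment="neutral",
        intent="book_appointment",
        lead_score=72,
        escalate=False,
        name="Dana",
        callback_number=normalize_callback_number("(415) 555-0134"),
        reason="haircut",
        urgency="low",
    )
    print(asdict(record))
```

Returning `None` for numbers that do not normalize keeps bad ASR output from silently landing in the database; the agent can then re-ask or flag the call for review.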

## FAQ

**What should you measure first when you deploy the pipeline from *Build a Voice Agent with Krisp Audio Filter and VIVA SDK (2026)*?**

Treat the architecture in this post as a starting point and instrument it before you tune it. The metrics that matter most early on are end-to-end latency (target < 1s for voice, < 3s for chat), barge-in correctness, tool-call success rate, and post-conversation lead score distribution. Optimize whatever the data flags as the bottleneck, not whatever feels slowest in your head.

**Why does this matter for voice agent deployments at scale?**

The two failure modes that bite hardest are silent context loss across multi-turn handoffs and tool calls that succeed in dev but get rate-limited in production. Both are solvable with a proper agent backplane that pins state to a session ID, retries with backoff, and writes every tool invocation to an audit log you can replay.
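
The retry and audit-log half of that backplane might look something like the sketch below. It assumes a synchronous tool function and an in-memory list standing in for a durable log; `call_tool_with_retry` and its parameters are hypothetical names. A real deployment would catch only rate-limit or transient errors and add jitter to the delays.

```python
import json
import time
from typing import Any, Callable, Dict, List


def call_tool_with_retry(
    session_id: str,
    tool_name: str,
    tool: Callable[..., Any],
    args: Dict[str, Any],
    audit_log: List[str],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Invoke a tool, retrying with exponential backoff and logging every attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # Every entry carries the session ID so a call can be replayed later.
        entry = {"session_id": session_id, "tool": tool_name,
                 "args": args, "attempt": attempt}
        try:
            result = tool(**args)
        except Exception as exc:  # sketch only: narrow this to transient errors
            entry.update(status="error", error=repr(exc))
            audit_log.append(json.dumps(entry))
            if attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        else:
            entry.update(status="ok")
            audit_log.append(json.dumps(entry))
            return result
```

Because `sleep` is injectable, the backoff schedule can be unit-tested without waiting, and because each attempt is logged as one JSON line, the log can be replayed per session ID.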

**How does the salon stack (GlamBook) keep bookings clean across stylists and services?**

GlamBook runs 4 agents that handle booking, rescheduling, fuzzy service-name matching, and confirmations. Every appointment gets a deterministic reference like GB-YYYYMMDD-### so the salon, the customer, and the agent all reference the same object across SMS, email, and voice.
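
A reference in that shape is easy to generate deterministically. The sketch below assumes the `###` suffix is a zero-padded per-day sequence number from 1 to 999; that interpretation and the `booking_reference` helper are illustrative, not GlamBook's actual implementation.

```python
from datetime import date


def booking_reference(day: date, sequence: int) -> str:
    """Format a reference like GB-YYYYMMDD-### from a date and a daily counter."""
    if not 1 <= sequence <= 999:
        raise ValueError("sequence must be between 1 and 999")
    return f"GB-{day:%Y%m%d}-{sequence:03d}"


if __name__ == "__main__":
    print(booking_reference(date(2026, 3, 29), 7))
```

The daily counter itself would need to come from something atomic, such as a database sequence, so that two agents booking at the same moment never mint the same reference.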

## See it live

Book a 30-minute working session at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting) and bring a real call flow — we will walk it through the live salon booking agent (GlamBook) at [salon.callsphere.tech](https://salon.callsphere.tech) and show you exactly where the production wiring sits.

