---
title: "Audio Preprocessing for Voice Agents: Noise Reduction, Echo Cancellation, and Normalization"
description: "Build a complete audio preprocessing pipeline for voice AI agents — covering noise reduction, echo cancellation, gain normalization, and both client-side Web Audio API and server-side Python implementations."
canonical: https://callsphere.ai/blog/audio-preprocessing-voice-agents-noise-reduction-echo-cancellation
category: "Learn Agentic AI"
tags: ["Audio Preprocessing", "Noise Reduction", "Echo Cancellation", "Web Audio API", "Voice AI", "Signal Processing"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.284Z
---

# Audio Preprocessing for Voice Agents: Noise Reduction, Echo Cancellation, and Normalization

> Build a complete audio preprocessing pipeline for voice AI agents — covering noise reduction, echo cancellation, gain normalization, and both client-side Web Audio API and server-side Python implementations.

## Why Preprocessing Matters

Raw microphone audio is messy. It contains background noise (fans, traffic, other conversations), echo from the agent's own speech playing through speakers, volume inconsistencies (some users speak quietly, others shout), and room reverberation. Feeding raw audio directly to your STT engine degrades transcription accuracy and produces unreliable results.

A well-designed preprocessing pipeline cleans the audio before it reaches the STT engine, dramatically improving word accuracy and reducing hallucinated transcriptions. The goal is to deliver clean, normalized speech at a consistent volume level.

## Client-Side Preprocessing with Web Audio API

The browser's Web Audio API lets you process audio in real time before sending it to the server. This reduces bandwidth and offloads processing from your backend.

```mermaid
flowchart LR
    CALLER(["Caller"])
    subgraph TEL["Telephony"]
        SIP["Twilio SIP and PSTN"]
    end
    subgraph BRAIN["Business AI Agent"]
        STT["Streaming STT
Deepgram or Whisper"]
        NLU{"Intent and
Entity Extraction"}
        TOOLS["Tool Calls"]
        TTS["Streaming TTS
ElevenLabs or Rime"]
    end
    subgraph DATA["Live Data Plane"]
        CRM[("CRM and Notes")]
        CAL[("Calendar and
Schedule")]
        KB[("Knowledge Base
and Policies")]
    end
    subgraph OUT["Outcomes"]
        O1(["Booking captured"])
        O2(["CRM record created"])
        O3(["Human handoff"])
    end
    CALLER --> SIP --> STT --> NLU
    NLU -->|Lookup| TOOLS
    TOOLS <--> CRM
    TOOLS <--> CAL
    TOOLS <--> KB
    NLU --> TTS --> SIP --> CALLER
    NLU -->|Resolved| O1
    NLU -->|Schedule| O2
    NLU -->|Escalate| O3
    style CALLER fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style NLU fill:#4f46e5,stroke:#4338ca,color:#fff
    style O1 fill:#059669,stroke:#047857,color:#fff
    style O2 fill:#0ea5e9,stroke:#0369a1,color:#fff
    style O3 fill:#f59e0b,stroke:#d97706,color:#1f2937
```

```javascript
class AudioPreprocessor {
  constructor() {
    this.audioContext = null;
    this.sourceNode = null;
    this.processorNode = null;
  }

  async init(stream) {
    this.audioContext = new AudioContext({ sampleRate: 16000 });
    this.sourceNode = this.audioContext.createMediaStreamSource(stream);

    // High-pass filter to remove low-frequency rumble (below 80Hz)
    const highPass = this.audioContext.createBiquadFilter();
    highPass.type = 'highpass';
    highPass.frequency.value = 80;
    highPass.Q.value = 0.7;

    // Low-pass filter to remove high-frequency hiss (above 8kHz)
    const lowPass = this.audioContext.createBiquadFilter();
    lowPass.type = 'lowpass';
    lowPass.frequency.value = 8000;
    lowPass.Q.value = 0.7;

    // Compressor for volume normalization
    const compressor = this.audioContext.createDynamicsCompressor();
    compressor.threshold.value = -30;   // Start compressing at -30dB
    compressor.knee.value = 10;
    compressor.ratio.value = 4;         // 4:1 compression ratio
    compressor.attack.value = 0.005;    // 5ms attack
    compressor.release.value = 0.1;     // 100ms release

    // Gain to boost after compression
    const gainNode = this.audioContext.createGain();
    gainNode.gain.value = 1.5;

    // Connect the chain
    this.sourceNode
      .connect(highPass)
      .connect(lowPass)
      .connect(compressor)
      .connect(gainNode);

    return gainNode;
  }

  getProcessedStream(gainNode) {
    const destination = this.audioContext.createMediaStreamDestination();
    gainNode.connect(destination);
    return destination.stream;
  }
}

// Usage
const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const preprocessor = new AudioPreprocessor();
const outputNode = await preprocessor.init(rawStream);
const cleanStream = preprocessor.getProcessedStream(outputNode);
// Use cleanStream for WebRTC or recording
```

## AudioWorklet for Advanced Processing

For more sophisticated processing such as custom noise suppression, use an AudioWorklet. It runs on the audio rendering thread, so processing never blocks the main UI thread. The example below gates each sample against a running noise-floor estimate — a time-domain simplification of spectral subtraction.

```javascript
// noise-suppressor-worklet.js
class NoiseSuppressorProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.noiseFloor = new Float32Array(128).fill(0.001);
    this.alpha = 0.98;  // Smoothing factor for noise estimation
  }

  process(inputs, outputs) {
    const input = inputs[0][0];
    const output = outputs[0][0];

    if (!input) return true;

    for (let i = 0; i < input.length; i++) {
      const magnitude = Math.abs(input[i]);
      const bin = i % this.noiseFloor.length;

      // Exponentially smoothed running estimate of the noise floor
      this.noiseFloor[bin] =
        this.alpha * this.noiseFloor[bin] + (1 - this.alpha) * magnitude;
      const noiseEst = this.noiseFloor[bin];

      if (magnitude > noiseEst) {
        // Attenuate toward the noise floor, spectral-subtraction style
        output[i] = input[i] * (1 - noiseEst / magnitude);
      } else {
        output[i] = input[i] * 0.05;  // Soft gate, don't zero out
      }
    }

    return true;
  }
}

registerProcessor('noise-suppressor', NoiseSuppressorProcessor);
```

Register and use the worklet in your main code:

```javascript
await audioContext.audioWorklet.addModule('noise-suppressor-worklet.js');
const suppressorNode = new AudioWorkletNode(audioContext, 'noise-suppressor');

// Insert into the processing chain
sourceNode.connect(suppressorNode).connect(compressor);
```

## Server-Side Preprocessing with Python

When you need more powerful noise reduction than what the browser can provide, process audio on the server using libraries like noisereduce and scipy.

```python
import numpy as np
import noisereduce as nr
from scipy.signal import butter, sosfilt
from scipy.io import wavfile

class ServerAudioPreprocessor:
    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.target_rms = 0.1  # Target RMS for normalization

    def preprocess(self, audio: np.ndarray) -> np.ndarray:
        """Full preprocessing pipeline."""
        audio = audio.astype(np.float32)
        if np.abs(audio).max() > 1.0:
            audio = audio / 32768.0  # Convert int16 range to [-1, 1] float

        audio = self._bandpass_filter(audio, low=80, high=7500)  # upper edge must stay below Nyquist (8 kHz at 16 kHz)
        audio = self._reduce_noise(audio)
        audio = self._normalize(audio)
        audio = self._trim_silence(audio)

        return audio

    def _bandpass_filter(
        self, audio: np.ndarray, low: int, high: int
    ) -> np.ndarray:
        sos = butter(
            5, [low, high], btype='band',
            fs=self.sample_rate, output='sos',
        )
        return sosfilt(sos, audio)

    def _reduce_noise(self, audio: np.ndarray) -> np.ndarray:
        return nr.reduce_noise(
            y=audio,
            sr=self.sample_rate,
            stationary=False,   # Non-stationary noise (better for real-world)
            prop_decrease=0.8,  # Reduce noise by 80%
            n_fft=512,
            hop_length=128,
        )

    def _normalize(self, audio: np.ndarray) -> np.ndarray:
        rms = np.sqrt(np.mean(audio ** 2))
        if rms > 0:
            audio = audio * (self.target_rms / rms)
        return np.clip(audio, -1.0, 1.0)

    def _trim_silence(
        self, audio: np.ndarray, threshold: float = 0.01
    ) -> np.ndarray:
        mask = np.abs(audio) > threshold
        if not mask.any():
            return audio
        first = mask.argmax()
        last = len(mask) - mask[::-1].argmax()
        # Keep small padding
        pad = int(0.05 * self.sample_rate)
        return audio[max(0, first - pad):min(len(audio), last + pad)]

# Usage — take the sample rate from the file so the filters match the audio
sample_rate, raw_audio = wavfile.read("recording.wav")
preprocessor = ServerAudioPreprocessor(sample_rate=sample_rate)
clean_audio = preprocessor.preprocess(raw_audio)
```

## Echo Cancellation

Echo cancellation removes the agent's own voice from the user's microphone input. The browser handles this when you enable `echoCancellation: true` in getUserMedia. For server-side echo cancellation, you need the agent's output audio as a reference signal.

```python
from scipy.signal import fftconvolve

class SimpleAEC:
    """Simplified Acoustic Echo Cancellation using cross-correlation."""

    def __init__(self, filter_length: int = 4096):
        self.filter_length = filter_length
        self.filter_coeffs = np.zeros(filter_length)
        self.mu = 0.01  # Learning rate

    def cancel_echo(
        self, mic_signal: np.ndarray, ref_signal: np.ndarray
    ) -> np.ndarray:
        """Remove echo of ref_signal from mic_signal."""
        n = len(mic_signal)
        output = np.zeros(n)

        for i in range(self.filter_length, n):
            ref_chunk = ref_signal[i - self.filter_length:i][::-1]
            echo_estimate = np.dot(self.filter_coeffs, ref_chunk)
            error = mic_signal[i] - echo_estimate
            output[i] = error

            # Adaptive filter update (NLMS)
            power = np.dot(ref_chunk, ref_chunk) + 1e-10
            self.filter_coeffs += self.mu * error * ref_chunk / power

        return output
```

In practice, WebRTC's built-in AEC is far more sophisticated and handles non-linear echo, double-talk, and dynamic room conditions. Use it whenever possible.

## FAQ

### Should I preprocess audio on the client or the server?

Do both. Client-side preprocessing (filtering, compression, gain) reduces bandwidth and gives the server cleaner input. Server-side preprocessing (noise reduction, echo cancellation) handles the heavy lifting. This layered approach is standard in production voice systems. The browser's built-in audio constraints (echoCancellation, noiseSuppression, autoGainControl) provide a solid baseline that handles 80% of cases.
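The browser baseline is enabled through getUserMedia constraints. A minimal sketch — `buildAudioConstraints` is an illustrative helper, not a browser API; the constraint names themselves are standard `MediaTrackConstraints`:

```javascript
// Builds a getUserMedia constraint object that turns on the browser's
// built-in echo cancellation, noise suppression, and automatic gain control.
function buildAudioConstraints({ sampleRate = 16000 } = {}) {
  return {
    audio: {
      echoCancellation: true,  // browser AEC
      noiseSuppression: true,  // browser NS
      autoGainControl: true,   // browser AGC
      channelCount: 1,         // mono is enough for STT
      sampleRate,              // hint; the browser may pick the closest rate
    },
  };
}

// In the browser:
// const stream = await navigator.mediaDevices.getUserMedia(buildAudioConstraints());
```

Note that `sampleRate` and `channelCount` are hints, not guarantees — check the resulting track's settings with `track.getSettings()` if your pipeline depends on them.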

### Does preprocessing degrade STT accuracy?

Aggressive preprocessing can remove speech content along with noise, particularly overly aggressive noise reduction or narrow bandpass filters. The key is to tune your preprocessing parameters on representative audio samples and measure the STT word error rate before and after. In most cases, well-tuned preprocessing improves STT accuracy by 10-30% compared to raw audio.
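Measuring word error rate only needs word-level Levenshtein distance. A minimal sketch — `wordErrorRate` is an illustrative helper; production evaluation normally normalizes casing and punctuation first:

```javascript
// WER = (substitutions + insertions + deletions) / reference word count.
// Assumes a non-empty reference transcript.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);

  // dp[i][j] = edit distance between the first i ref words and first j hyp words
  const dp = Array.from({ length: ref.length + 1 }, () =>
    new Array(hyp.length + 1).fill(0)
  );
  for (let i = 0; i <= ref.length; i++) dp[i][0] = i;
  for (let j = 0; j <= hyp.length; j++) dp[0][j] = j;

  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const sub = dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      dp[i][j] = Math.min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}
```

Run the same recordings through the raw and preprocessed paths, transcribe both, and compare WER against a human reference to decide whether a given preprocessing setting helps or hurts.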

### How do I handle audio from different microphone types?

Different microphones (laptop built-in, USB headset, phone) have vastly different frequency responses and sensitivity levels. Normalization is the key — apply automatic gain control to bring all inputs to a consistent RMS level. The compressor in the Web Audio API chain handles this well. Additionally, the bandpass filter removes frequencies that are outside the speech range regardless of microphone type.
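For audio chunks handled outside the Web Audio graph (e.g. recorded buffers), the same RMS normalization can be sketched directly — an illustrative helper mirroring the server-side `_normalize`:

```javascript
// Scales a Float32Array frame to a target RMS level, clamping to [-1, 1].
function normalizeRms(samples, targetRms = 0.1) {
  let sumSq = 0;
  for (const s of samples) sumSq += s * s;
  const rms = Math.sqrt(sumSq / samples.length);
  if (rms === 0) return samples;  // all-silence frame: nothing to scale

  const gain = targetRms / rms;
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

Applied per frame this behaves like a crude AGC; in practice you would smooth the gain across frames to avoid pumping artifacts.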

---

#AudioPreprocessing #NoiseReduction #EchoCancellation #WebAudioAPI #VoiceAI #SignalProcessing #AgenticAI #LearnAI #AIEngineering

