
Raspberry Pi AI Agent: Building a Hardware-Based Voice Assistant

Build a complete voice-controlled AI agent on a Raspberry Pi, covering hardware setup, model selection, audio input/output, wake word detection, and tool integration for home automation.

What You Will Build

In this guide, you will build a standalone AI voice assistant running entirely on a Raspberry Pi 5. It listens for a wake word, transcribes your speech locally, processes the request through a small language model, and responds with synthesized speech — all without cloud API calls.

Hardware Requirements

  • Raspberry Pi 5 (8 GB RAM recommended, 4 GB minimum)
  • USB microphone or a ReSpeaker HAT for higher quality audio
  • Speaker connected via 3.5mm jack or USB
  • MicroSD card (64 GB or larger for model storage)
  • Power supply (USB-C, 5V 5A for Pi 5)

Software Setup

Start with a clean Raspberry Pi OS (64-bit) installation:

# Update system
sudo apt update && sudo apt upgrade -y

# Install audio dependencies
sudo apt install -y portaudio19-dev python3-pyaudio espeak-ng libespeak-ng-dev

# Install Python dependencies
pip install numpy onnxruntime openwakeword faster-whisper piper-tts

# Verify audio devices
python3 -c "import pyaudio; p = pyaudio.PyAudio(); print(p.get_default_input_device_info())"

Wake Word Detection

The agent needs to listen continuously but only process speech after hearing a wake word. The openwakeword library provides lightweight wake word models:

import pyaudio
import numpy as np
from openwakeword.model import Model as WakeModel

class WakeWordListener:
    """Listens for a wake word using openwakeword."""

    CHUNK = 1280  # 80ms at 16kHz
    RATE = 16000
    FORMAT = pyaudio.paInt16

    def __init__(self, threshold: float = 0.5):
        self.model = WakeModel(
            wakeword_models=["hey_jarvis"],
            inference_framework="onnx",
        )
        self.threshold = threshold
        self.audio = pyaudio.PyAudio()

    def listen_for_wake_word(self) -> bool:
        """Block until wake word is detected."""
        stream = self.audio.open(
            format=self.FORMAT,
            channels=1,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )

        print("Listening for wake word...")
        try:
            while True:
                audio_data = stream.read(self.CHUNK, exception_on_overflow=False)
                audio_array = np.frombuffer(audio_data, dtype=np.int16)
                prediction = self.model.predict(audio_array)

                for model_name, score in prediction.items():
                    if score > self.threshold:
                        print(f"Wake word detected! (score: {score:.2f})")
                        return True
        finally:
            stream.stop_stream()
            stream.close()
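A single noisy frame can spike above the threshold and cause false triggers. One common mitigation (not part of openwakeword itself; this is a sketch) is to require several consecutive frames above the threshold before confirming the wake word:

```python
class WakeConfirmer:
    """Confirms a wake word only after N consecutive high-scoring frames.

    Hypothetical helper: feed it the per-frame score from model.predict()
    inside the listening loop instead of comparing against the threshold
    directly.
    """

    def __init__(self, threshold: float = 0.5, required_frames: int = 3):
        self.threshold = threshold
        self.required_frames = required_frames
        self.streak = 0

    def update(self, score: float) -> bool:
        """Return True once enough consecutive frames exceed the threshold."""
        if score > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any low-scoring frame resets the streak
        return self.streak >= self.required_frames


confirmer = WakeConfirmer(threshold=0.5, required_frames=3)
scores = [0.2, 0.7, 0.8, 0.3, 0.6, 0.9, 0.95]  # simulated per-frame scores
fired = [confirmer.update(s) for s in scores]
```

With the simulated scores above, the confirmer only fires on the final frame, because the dip to 0.3 resets the streak.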

Speech-to-Text with Faster Whisper

After the wake word triggers, capture and transcribe the user's speech:

import numpy as np
import pyaudio
from faster_whisper import WhisperModel

class SpeechRecognizer:
    """Transcribes speech using Whisper running locally."""

    def __init__(self, model_size: str = "tiny.en"):
        # tiny.en is ~75MB, runs fast on Pi 5
        # Use "base.en" (~140MB) for better accuracy
        self.model = WhisperModel(
            model_size,
            device="cpu",
            compute_type="int8",
        )
        self.audio = pyaudio.PyAudio()

    def record_until_silence(
        self,
        silence_threshold: int = 500,
        silence_duration: float = 1.5,
        max_duration: float = 15.0,
    ) -> np.ndarray:
        """Record audio until silence is detected."""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024,
        )

        frames = []
        silent_chunks = 0
        max_chunks = int(max_duration * 16000 / 1024)
        silence_limit = int(silence_duration * 16000 / 1024)

        print("Recording... speak now")
        for _ in range(max_chunks):
            data = stream.read(1024, exception_on_overflow=False)
            frames.append(np.frombuffer(data, dtype=np.int16))

            amplitude = np.abs(frames[-1]).mean()
            if amplitude < silence_threshold:
                silent_chunks += 1
            else:
                silent_chunks = 0

            if silent_chunks >= silence_limit and len(frames) > 10:
                break

        stream.stop_stream()
        stream.close()
        print("Recording complete")
        return np.concatenate(frames).astype(np.float32) / 32768.0

    def transcribe(self, audio: np.ndarray) -> str:
        segments, _ = self.model.transcribe(audio, language="en")
        return " ".join(seg.text.strip() for seg in segments)

Agent Brain: Local Language Model

For the reasoning engine, use a small quantized language model. On a Pi 5 with 8 GB RAM, a model with 1 to 2 billion parameters quantized to 4 or 8 bits fits comfortably alongside the STT and TTS models. The PiAgent class below wraps a pluggable model runner and adds simple tool calling:
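A quick sanity check on why 1 to 2 billion parameters is the ceiling: weight memory is roughly parameter count times bytes per parameter, on top of working memory for activations and the KV cache. A back-of-the-envelope calculation (illustrative figures, not measured on hardware):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (ignores activations and KV cache)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# A 1.5B-parameter model at different quantization levels
fp16 = weight_memory_gb(1.5, 16)  # 3.0 GB -- tight next to Whisper and Piper
int8 = weight_memory_gb(1.5, 8)   # 1.5 GB
int4 = weight_memory_gb(1.5, 4)   # 0.75 GB -- comfortable on an 8 GB Pi 5
```

At 4-bit quantization even a 2B model stays under 1 GB of weights, leaving headroom for the OS, audio buffers, and the other models.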

class PiAgent:
    """Simple agent with tool-calling capabilities."""

    def __init__(self, model_runner):
        self.model = model_runner
        self.tools = {}
        self.conversation = []

    def register_tool(self, name: str, description: str, handler):
        self.tools[name] = {"description": description, "handler": handler}

    def process(self, user_text: str) -> str:
        self.conversation.append({"role": "user", "content": user_text})

        tools_desc = "\n".join(
            f"- {name}: {t['description']}" for name, t in self.tools.items()
        )
        system_prompt = (
            "You are a helpful voice assistant running on a Raspberry Pi. "
            "Keep responses short — under 2 sentences. "
            f"Available tools:\n{tools_desc}\n"
            "To use a tool, respond with: TOOL:tool_name:argument"
        )

        response = self.model.generate(system_prompt, self.conversation)

        # Check whether the model wants to use a tool
        if response.startswith("TOOL:"):
            parts = response.split(":", 2)
            tool_name = parts[1]
            tool_arg = parts[2] if len(parts) > 2 else ""

            if tool_name in self.tools:
                tool_result = self.tools[tool_name]["handler"](tool_arg)
                reply = f"Done. {tool_result}"
            else:
                reply = f"I do not have a tool called {tool_name}."
        else:
            reply = response

        # Record the final reply so follow-up turns keep context
        self.conversation.append({"role": "assistant", "content": reply})
        return reply
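The TOOL:name:argument convention is simple enough to parse with split, but the edge cases (no argument, colons inside the argument) are worth pinning down. A standalone version of the parsing step:

```python
def parse_tool_call(response: str):
    """Parse 'TOOL:name:argument' into (name, argument), or None for plain text.

    split(":", 2) keeps colons inside the argument intact, e.g. times
    like 18:30, because at most two splits are performed.
    """
    if not response.startswith("TOOL:"):
        return None
    parts = response.split(":", 2)
    name = parts[1]
    arg = parts[2] if len(parts) > 2 else ""
    return name, arg


examples = [
    parse_tool_call("TOOL:lights:on"),            # simple name and argument
    parse_tool_call("TOOL:timer:set for 18:30"),  # colon survives in the argument
    parse_tool_call("TOOL:weather"),              # no argument -> empty string
    parse_tool_call("Sure, turning them on."),    # plain text -> None
]
```

Delimiter-based protocols like this are fragile compared with JSON tool calls, but they are easy for a small model to emit reliably, which matters at this parameter count.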

Text-to-Speech with Piper

Convert the agent's response back to audio:

import os
import subprocess
import tempfile

class TextToSpeech:
    """Synthesize speech using Piper TTS (runs locally on Pi)."""

    def __init__(self, model: str = "en_US-lessac-medium"):
        self.model = model

    def speak(self, text: str):
        """Generate speech to a temp file, play it, then clean up."""
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            temp_path = f.name

        try:
            subprocess.run(
                [
                    "piper",
                    "--model", self.model,
                    "--output_file", temp_path,
                ],
                input=text.encode(),
                capture_output=True,
            )
            subprocess.run(["aplay", temp_path])
        finally:
            os.remove(temp_path)  # avoid leaking temp WAV files

    def speak_streaming(self, text: str):
        """Stream speech output directly to audio device."""
        piper = subprocess.Popen(
            ["piper", "--model", self.model, "--output-raw"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        aplay = subprocess.Popen(
            ["aplay", "-r", "22050", "-f", "S16_LE", "-c", "1"],
            stdin=piper.stdout,
        )
        piper.stdin.write(text.encode())
        piper.stdin.close()
        aplay.wait()
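Piper synthesizes the whole input before audio starts playing, so long replies feel sluggish. Splitting the reply into sentences and streaming them one at a time gets sound out sooner (a regex-based sketch; the split rule is deliberately naive):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Feed each sentence to speak_streaming() as it becomes available:
# for sentence in split_sentences(response):
#     tts.speak_streaming(sentence)

sentences = split_sentences("The lights are on. Anything else? Have a nice day!")
```

This mis-splits abbreviations like "Dr." or "e.g.", which is usually acceptable for short assistant replies; a proper sentence tokenizer is overkill on a Pi.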

Putting It All Together

The main loop ties the components together. Note that LocalONNXModel, toggle_lights, and get_weather_cached are placeholders for your own model runner and tool handlers:

def main():
    wake = WakeWordListener(threshold=0.6)
    stt = SpeechRecognizer(model_size="tiny.en")
    tts = TextToSpeech()
    agent = PiAgent(model_runner=LocalONNXModel("phi2_q4.onnx"))

    # Register tools
    agent.register_tool(
        "lights", "Control smart lights (on/off)",
        lambda arg: toggle_lights(arg),
    )
    agent.register_tool(
        "weather", "Get current weather",
        lambda arg: get_weather_cached(),
    )

    print("Pi Agent ready!")
    while True:
        wake.listen_for_wake_word()
        tts.speak("Yes?")
        audio = stt.record_until_silence()
        user_text = stt.transcribe(audio)
        print(f"User said: {user_text}")

        response = agent.process(user_text)
        print(f"Agent: {response}")
        tts.speak(response)

if __name__ == "__main__":
    main()
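To start the agent on boot, a systemd service works well. The paths below (/home/pi/agent/main.py, the pi user) are assumptions; adjust them to your install:

```ini
# /etc/systemd/system/pi-agent.service
[Unit]
Description=Raspberry Pi voice agent
After=sound.target network.target

[Service]
User=pi
WorkingDirectory=/home/pi/agent
ExecStart=/usr/bin/python3 /home/pi/agent/main.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now pi-agent, and check logs with journalctl -u pi-agent -f.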

FAQ

Which Raspberry Pi model should I use for a voice agent?

The Raspberry Pi 5 with 8 GB RAM is the recommended choice. It has enough memory for a quantized 1 to 2 billion parameter model plus the STT and TTS models running concurrently. The Pi 4 with 8 GB works for smaller models but inference is noticeably slower. The Pi 4 with 4 GB can handle only the lightest models (Whisper tiny plus a classifier).

How fast is speech-to-text on a Raspberry Pi?

Whisper tiny.en transcribes a 5-second audio clip in about 1 to 2 seconds on a Pi 5. Whisper base.en takes 3 to 5 seconds for the same clip. For real-time conversational feel, stick with tiny.en and accept slightly lower accuracy, or use Whisper small.en if you can tolerate 5 to 8 second transcription times.

Can the Raspberry Pi agent work without any internet connection?

Yes, that is the entire point of this design. The wake word model, Whisper STT, the language model, and Piper TTS all run locally. The only internet dependency would be for tools that call external APIs (like weather). You can pre-cache data for those tools or provide a graceful "I am offline" response.
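For tools that do need the network, wrapping the handler so failures degrade to a spoken offline message keeps the agent responsive. A small sketch (offline_fallback is a hypothetical helper, not part of the agent above):

```python
def offline_fallback(handler, message="I am offline right now, try again later."):
    """Wrap a tool handler so network errors become a graceful spoken reply."""
    def wrapped(arg):
        try:
            return handler(arg)
        except OSError:  # ConnectionError and socket timeouts are OSError subclasses
            return message
    return wrapped


def flaky_weather(arg):
    raise ConnectionError("no route to host")  # simulate being offline

safe_weather = offline_fallback(flaky_weather)
result = safe_weather("")
```

Registering safe_weather instead of the raw handler means the agent speaks the fallback message rather than crashing the main loop.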


#RaspberryPi #VoiceAssistant #HardwareAI #EdgeAI #HomeAutomation #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
