
Raspberry Pi AI Agent: Building a Hardware-Based Voice Assistant

Build a complete voice-controlled AI agent on a Raspberry Pi, covering hardware setup, model selection, audio input/output, wake word detection, and tool integration for home automation.

What You Will Build

In this guide, you will build a standalone AI voice assistant running entirely on a Raspberry Pi 5. It listens for a wake word, transcribes your speech locally, processes the request through a small language model, and responds with synthesized speech — all without cloud API calls.

Hardware Requirements

  • Raspberry Pi 5 (8 GB RAM recommended, 4 GB minimum)
  • USB microphone or a ReSpeaker HAT for higher quality audio
  • Speaker connected via 3.5mm jack or USB
  • MicroSD card (64 GB or larger for model storage)
  • Power supply (USB-C, 5V 5A for Pi 5)

Software Setup

Start with a clean Raspberry Pi OS (64-bit) installation:

# Update system
sudo apt update && sudo apt upgrade -y

# Install audio dependencies
sudo apt install -y portaudio19-dev python3-pyaudio espeak-ng libespeak-ng-dev

# Install Python dependencies
pip install numpy onnxruntime openwakeword faster-whisper piper-tts

# Verify audio devices
python3 -c "import pyaudio; p = pyaudio.PyAudio(); print(p.get_default_input_device_info())"

Wake Word Detection

The agent needs to listen continuously but only process speech after hearing a wake word. The openwakeword library provides lightweight wake word models:

import pyaudio
import numpy as np
from openwakeword.model import Model as WakeModel

class WakeWordListener:
    """Listens for a wake word using openwakeword."""

    CHUNK = 1280  # 80ms at 16kHz
    RATE = 16000
    FORMAT = pyaudio.paInt16

    def __init__(self, threshold: float = 0.5):
        self.model = WakeModel(
            wakeword_models=["hey_jarvis"],
            inference_framework="onnx",
        )
        self.threshold = threshold
        self.audio = pyaudio.PyAudio()

    def listen_for_wake_word(self) -> bool:
        """Block until wake word is detected."""
        stream = self.audio.open(
            format=self.FORMAT,
            channels=1,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )

        print("Listening for wake word...")
        try:
            while True:
                audio_data = stream.read(self.CHUNK, exception_on_overflow=False)
                audio_array = np.frombuffer(audio_data, dtype=np.int16)
                prediction = self.model.predict(audio_array)

                for model_name, score in prediction.items():
                    if score > self.threshold:
                        print(f"Wake word detected! (score: {score:.2f})")
                        return True
        finally:
            stream.stop_stream()
            stream.close()
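A single noisy frame can spike above the threshold and cause false triggers. One common mitigation (not part of openwakeword itself; this is a sketch) is to require several consecutive frames above the threshold before confirming the wake word:

```python
class WakeConfirmer:
    """Confirms a wake word only after N consecutive high-scoring frames.

    Hypothetical helper: feed it the per-frame score from model.predict()
    inside the listening loop instead of comparing against the threshold
    directly.
    """

    def __init__(self, threshold: float = 0.5, required_frames: int = 3):
        self.threshold = threshold
        self.required_frames = required_frames
        self.streak = 0

    def update(self, score: float) -> bool:
        """Return True once enough consecutive frames exceed the threshold."""
        if score > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any low-scoring frame resets the streak
        return self.streak >= self.required_frames


confirmer = WakeConfirmer(threshold=0.5, required_frames=3)
scores = [0.2, 0.7, 0.8, 0.3, 0.6, 0.9, 0.95]  # simulated per-frame scores
fired = [confirmer.update(s) for s in scores]
```

With the simulated scores above, the confirmer only fires on the final frame, because the dip to 0.3 resets the streak.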

Speech-to-Text with Faster Whisper

After the wake word triggers, capture and transcribe the user's speech:

import numpy as np
import pyaudio
from faster_whisper import WhisperModel

class SpeechRecognizer:
    """Transcribes speech using Whisper running locally."""

    def __init__(self, model_size: str = "tiny.en"):
        # tiny.en is ~75MB, runs fast on Pi 5
        # Use "base.en" (~140MB) for better accuracy
        self.model = WhisperModel(
            model_size,
            device="cpu",
            compute_type="int8",
        )
        self.audio = pyaudio.PyAudio()

    def record_until_silence(
        self,
        silence_threshold: int = 500,
        silence_duration: float = 1.5,
        max_duration: float = 15.0,
    ) -> np.ndarray:
        """Record audio until silence is detected."""
        stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024,
        )

        frames = []
        silent_chunks = 0
        max_chunks = int(max_duration * 16000 / 1024)
        silence_limit = int(silence_duration * 16000 / 1024)

        print("Recording... speak now")
        for _ in range(max_chunks):
            data = stream.read(1024, exception_on_overflow=False)
            frames.append(np.frombuffer(data, dtype=np.int16))

            amplitude = np.abs(frames[-1]).mean()
            if amplitude < silence_threshold:
                silent_chunks += 1
            else:
                silent_chunks = 0

            if silent_chunks >= silence_limit and len(frames) > 10:
                break

        stream.stop_stream()
        stream.close()
        print("Recording complete")
        return np.concatenate(frames).astype(np.float32) / 32768.0

    def transcribe(self, audio: np.ndarray) -> str:
        segments, _ = self.model.transcribe(audio, language="en")
        return " ".join(seg.text.strip() for seg in segments)

Agent Brain: Local Language Model

For the reasoning engine, use a small quantized language model. On a Pi 5 with 8 GB RAM, a model with 1 to 2 billion parameters quantized to 4 or 8 bits fits comfortably alongside the STT and TTS models. The PiAgent class below wraps a pluggable model runner and adds simple tool calling:
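A quick sanity check on why 1 to 2 billion parameters is the ceiling: weight memory is roughly parameter count times bytes per parameter, on top of working memory for activations and the KV cache. A back-of-the-envelope calculation (illustrative figures, not measured on hardware):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (ignores activations and KV cache)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# A 1.5B-parameter model at different quantization levels
fp16 = weight_memory_gb(1.5, 16)  # 3.0 GB -- tight next to Whisper and Piper
int8 = weight_memory_gb(1.5, 8)   # 1.5 GB
int4 = weight_memory_gb(1.5, 4)   # 0.75 GB -- comfortable on an 8 GB Pi 5
```

At 4-bit quantization even a 2B model stays under 1 GB of weights, leaving headroom for the OS, audio buffers, and the other models.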

class PiAgent:
    """Simple agent with tool-calling capabilities."""

    def __init__(self, model_runner):
        self.model = model_runner
        self.tools = {}
        self.conversation = []

    def register_tool(self, name: str, description: str, handler):
        self.tools[name] = {"description": description, "handler": handler}

    def process(self, user_text: str) -> str:
        self.conversation.append({"role": "user", "content": user_text})

        tools_desc = "\n".join(
            f"- {name}: {t['description']}" for name, t in self.tools.items()
        )
        system_prompt = (
            "You are a helpful voice assistant running on a Raspberry Pi. "
            "Keep responses short — under 2 sentences. "
            f"Available tools:\n{tools_desc}\n"
            "To use a tool, respond with: TOOL:tool_name:argument"
        )

        response = self.model.generate(system_prompt, self.conversation)

        # Check whether the model wants to use a tool
        if response.startswith("TOOL:"):
            parts = response.split(":", 2)
            tool_name = parts[1]
            tool_arg = parts[2] if len(parts) > 2 else ""

            if tool_name in self.tools:
                tool_result = self.tools[tool_name]["handler"](tool_arg)
                reply = f"Done. {tool_result}"
            else:
                reply = f"I do not have a tool called {tool_name}."
        else:
            reply = response

        # Record the final reply so follow-up turns keep context
        self.conversation.append({"role": "assistant", "content": reply})
        return reply
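The TOOL:name:argument convention is simple enough to parse with split, but the edge cases (no argument, colons inside the argument) are worth pinning down. A standalone version of the parsing step:

```python
def parse_tool_call(response: str):
    """Parse 'TOOL:name:argument' into (name, argument), or None for plain text.

    split(":", 2) keeps colons inside the argument intact, e.g. times
    like 18:30, because at most two splits are performed.
    """
    if not response.startswith("TOOL:"):
        return None
    parts = response.split(":", 2)
    name = parts[1]
    arg = parts[2] if len(parts) > 2 else ""
    return name, arg


examples = [
    parse_tool_call("TOOL:lights:on"),            # simple name and argument
    parse_tool_call("TOOL:timer:set for 18:30"),  # colon survives in the argument
    parse_tool_call("TOOL:weather"),              # no argument -> empty string
    parse_tool_call("Sure, turning them on."),    # plain text -> None
]
```

Delimiter-based protocols like this are fragile compared with JSON tool calls, but they are easy for a small model to emit reliably, which matters at this parameter count.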

Text-to-Speech with Piper

Convert the agent's response back to audio:

import os
import subprocess
import tempfile

class TextToSpeech:
    """Synthesize speech using Piper TTS (runs locally on Pi)."""

    def __init__(self, model: str = "en_US-lessac-medium"):
        self.model = model

    def speak(self, text: str):
        """Generate speech to a temp file, play it, then clean up."""
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            temp_path = f.name

        try:
            subprocess.run(
                [
                    "piper",
                    "--model", self.model,
                    "--output_file", temp_path,
                ],
                input=text.encode(),
                capture_output=True,
            )
            subprocess.run(["aplay", temp_path])
        finally:
            os.remove(temp_path)  # avoid leaking temp WAV files

    def speak_streaming(self, text: str):
        """Stream speech output directly to audio device."""
        piper = subprocess.Popen(
            ["piper", "--model", self.model, "--output-raw"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        aplay = subprocess.Popen(
            ["aplay", "-r", "22050", "-f", "S16_LE", "-c", "1"],
            stdin=piper.stdout,
        )
        piper.stdin.write(text.encode())
        piper.stdin.close()
        aplay.wait()
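Piper synthesizes the whole input before audio starts playing, so long replies feel sluggish. Splitting the reply into sentences and streaming them one at a time gets sound out sooner (a regex-based sketch; the split rule is deliberately naive):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Feed each sentence to speak_streaming() as it becomes available:
# for sentence in split_sentences(response):
#     tts.speak_streaming(sentence)

sentences = split_sentences("The lights are on. Anything else? Have a nice day!")
```

This mis-splits abbreviations like "Dr." or "e.g.", which is usually acceptable for short assistant replies; a proper sentence tokenizer is overkill on a Pi.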

Putting It All Together

The main loop ties the components together. Note that LocalONNXModel, toggle_lights, and get_weather_cached are placeholders for your own model runner and tool handlers:

def main():
    wake = WakeWordListener(threshold=0.6)
    stt = SpeechRecognizer(model_size="tiny.en")
    tts = TextToSpeech()
    agent = PiAgent(model_runner=LocalONNXModel("phi2_q4.onnx"))

    # Register tools
    agent.register_tool(
        "lights", "Control smart lights (on/off)",
        lambda arg: toggle_lights(arg),
    )
    agent.register_tool(
        "weather", "Get current weather",
        lambda arg: get_weather_cached(),
    )

    print("Pi Agent ready!")
    while True:
        wake.listen_for_wake_word()
        tts.speak("Yes?")
        audio = stt.record_until_silence()
        user_text = stt.transcribe(audio)
        print(f"User said: {user_text}")

        response = agent.process(user_text)
        print(f"Agent: {response}")
        tts.speak(response)

if __name__ == "__main__":
    main()
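To start the agent on boot, a systemd service works well. The paths below (/home/pi/agent/main.py, the pi user) are assumptions; adjust them to your install:

```ini
# /etc/systemd/system/pi-agent.service
[Unit]
Description=Raspberry Pi voice agent
After=sound.target network.target

[Service]
User=pi
WorkingDirectory=/home/pi/agent
ExecStart=/usr/bin/python3 /home/pi/agent/main.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now pi-agent, and check logs with journalctl -u pi-agent -f.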

FAQ

Which Raspberry Pi model should I use for a voice agent?

The Raspberry Pi 5 with 8 GB RAM is the recommended choice. It has enough memory for a quantized 1 to 2 billion parameter model plus the STT and TTS models running concurrently. The Pi 4 with 8 GB works for smaller models but inference is noticeably slower. The Pi 4 with 4 GB can handle only the lightest models (Whisper tiny plus a classifier).

How fast is speech-to-text on a Raspberry Pi?

Whisper tiny.en transcribes a 5-second audio clip in about 1 to 2 seconds on a Pi 5. Whisper base.en takes 3 to 5 seconds for the same clip. For real-time conversational feel, stick with tiny.en and accept slightly lower accuracy, or use Whisper small.en if you can tolerate 5 to 8 second transcription times.

Can the Raspberry Pi agent work without any internet connection?

Yes, that is the entire point of this design. The wake word model, Whisper STT, the language model, and Piper TTS all run locally. The only internet dependency would be for tools that call external APIs (like weather). You can pre-cache data for those tools or provide a graceful "I am offline" response.
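For tools that do need the network, wrapping the handler so failures degrade to a spoken offline message keeps the agent responsive. A small sketch (offline_fallback is a hypothetical helper, not part of the agent above):

```python
def offline_fallback(handler, message="I am offline right now, try again later."):
    """Wrap a tool handler so network errors become a graceful spoken reply."""
    def wrapped(arg):
        try:
            return handler(arg)
        except OSError:  # ConnectionError and socket timeouts are OSError subclasses
            return message
    return wrapped


def flaky_weather(arg):
    raise ConnectionError("no route to host")  # simulate being offline

safe_weather = offline_fallback(flaky_weather)
result = safe_weather("")
```

Registering safe_weather instead of the raw handler means the agent speaks the fallback message rather than crashing the main loop.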


#RaspberryPi #VoiceAssistant #HardwareAI #EdgeAI #HomeAutomation #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
