---
title: "Build a Voice Agent on Jetson Orin Nano Super (Edge GPU, 2026)"
description: "Sub-$250 NVIDIA Jetson Orin Nano Super runs a full Whisper + 8B LLM + Piper voice loop offline at 15 tok/s. Here's the full Docker-based build with thermals, models, and code."
canonical: https://callsphere.ai/blog/vw4h-build-voice-agent-jetson-orin-edge-gpu
category: "AI Infrastructure"
tags: ["Jetson Orin", "Edge AI", "Voice Agent", "NVIDIA", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-31T00:00:00.000Z
updated: 2026-05-07T16:13:45.418Z
---

# Build a Voice Agent on Jetson Orin Nano Super (Edge GPU, 2026)

> **TL;DR** — The Jetson Orin Nano Super (8 GB / 40 TOPS / ~$249) is the cheapest device that runs Whisper + an 8B LLM + Piper end-to-end with no cloud. Conversation loop: 2–3 seconds. Power: under 25 W.

## What you'll build

A headless Jetson appliance booting into a Docker compose stack: `whisper.cpp` for STT, `ollama` (or `llama.cpp` server) for the LLM, `piper` for TTS, and a Python conversation loop. Talks via USB mic + 3.5 mm jack or Bluetooth speaker.

## Prerequisites

1. Jetson Orin Nano Super 8 GB with NVMe SSD and Super-mode firmware.
2. JetPack 6.1+ flashed and fully updated (`sudo apt full-upgrade`).
3. Docker + nvidia-container-toolkit configured for Jetson.
4. USB conference mic (e.g., Anker PowerConf S3) and a 3.5 mm or Bluetooth speaker.
5. A small fan if you don't have a vendor heatsink — Super mode runs the SoC at 25 W.

## Architecture

```mermaid
flowchart LR
  MIC[USB Mic] --> APP[Python loop]
  APP -->|PCM| WCPP[whisper.cpp tiny.en CUDA]
  WCPP --> APP
  APP -->|HTTP| OLL[ollama llama3.1:8b q4]
  OLL --> APP
  APP --> PIP[piper amy-medium]
  PIP --> SPK[Speaker]
```

## Step 1 — Maximize the Orin

```bash
sudo nvpmodel -m 0          # MAXN Super
sudo jetson_clocks          # Lock max clocks
```

Verify with `tegrastats` — you should see GPU @ 1020 MHz.
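If you'd rather check clocks programmatically than eyeball `tegrastats`, here is a minimal sketch. It assumes the JetPack 6 output format, where the GPU clock appears as `GR3D_FREQ <load>%@<MHz>`; the field layout varies between JetPack releases, so adjust the pattern if yours differs.

```python
import re

def gpu_mhz(tegrastats_line):
    # Pull the GPU clock (MHz) out of one tegrastats line,
    # e.g. "... GR3D_FREQ 57%@1020 ..." -> 1020
    m = re.search(r"GR3D_FREQ\s+\d+%@(\d+)", tegrastats_line)
    return int(m.group(1)) if m else None

sample = "RAM 3918/7620MB GR3D_FREQ 57%@1020 VDD_IN 18954mW"
print(gpu_mhz(sample))  # → 1020
```

Pipe one line of `tegrastats` output into this and alert if the clock ever drops below 1020 MHz.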

## Step 2 — Build whisper.cpp with CUDA on Jetson

```bash
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp
cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build -j6
bash ./models/download-ggml-model.sh tiny.en
./build/bin/whisper-cli -m models/ggml-tiny.en.bin -f samples/jfk.wav
```

CUDA arch 87 (compute capability 8.7) matches the Orin's Ampere iGPU. Build for any other arch and inference silently falls back to CPU.
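A quick sanity check that the CUDA build is actually in use is the real-time factor (transcription time divided by audio length): on the Orin's GPU, `tiny.en` should land well below 1.0, while a CPU fallback is noticeably slower. A small timing sketch, assuming the file layout from the build step above:

```python
import subprocess, time, wave

def wav_seconds(path):
    # Duration of a WAV file in seconds
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def rtf(elapsed_s, audio_s):
    # Real-time factor: < 1.0 means faster than real time
    return elapsed_s / audio_s

def benchmark(wav="whisper.cpp/samples/jfk.wav",
              cli="whisper.cpp/build/bin/whisper-cli",
              model="whisper.cpp/models/ggml-tiny.en.bin"):
    t0 = time.monotonic()
    subprocess.run([cli, "-m", model, "-f", wav, "-nt"],
                   check=True, capture_output=True)
    return rtf(time.monotonic() - t0, wav_seconds(wav))
```

If `benchmark()` returns something near 1.0 or above for `tiny.en`, suspect a wrong `CMAKE_CUDA_ARCHITECTURES` value.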

## Step 3 — Run Ollama with a Q4 model

```bash
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
ollama pull llama3.1:8b-instruct-q4_K_M
```

Ollama on Jetson autodetects the iGPU. Verify by running the server with `OLLAMA_DEBUG=1` (or checking `journalctl -u ollama`) — look for `gpu="cuda"` in the logs.
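You can also confirm the model was pulled by querying Ollama's `/api/tags` endpoint. A small sketch using only the stdlib (so the pure helper is testable without a running server); it assumes only the documented response shape, `{"models": [{"name": ...}]}`:

```python
import json, urllib.request

def model_ready(tags_json, name):
    # True if any pulled model's name starts with the requested prefix
    return any(m.get("name", "").startswith(name)
               for m in tags_json.get("models", []))

def check(name="llama3.1:8b-instruct-q4_K_M",
          host="http://127.0.0.1:11434"):
    with urllib.request.urlopen(host + "/api/tags", timeout=5) as r:
        return model_ready(json.load(r), name)
```

Run `check()` after the `ollama pull` and fail fast at boot if the model is missing.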

## Step 4 — Install Piper

```bash
pip install piper-tts
python -m piper.download_voices en_US-amy-medium
echo "Hello from Orin" | piper --model en_US-amy-medium --output-raw \
  | aplay -r 22050 -f S16_LE -t raw -
```
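`--output-raw` emits headerless 16-bit mono PCM at the voice's native rate (22 050 Hz for the medium voices), which is why `aplay` needs the explicit `-r`/`-f` flags. If you pipe the bytes into Python instead, the playback length is easy to derive:

```python
def raw_duration_s(raw, rate=22050, sample_width=2, channels=1):
    # Headerless PCM: duration = bytes / (rate * bytes-per-sample * channels)
    return len(raw) / (rate * sample_width * channels)

print(raw_duration_s(b"\x00" * 44100))  # → 1.0 (one second of audio)
```

Handy for logging per-utterance TTS time or for sizing a playback buffer.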

## Step 5 — Conversation loop

```python
import os, subprocess, tempfile, wave
import numpy as np
import requests
import sounddevice as sd

def record(threshold=0.012, max_s=8):
    """Record 16 kHz mono until ~0.56 s of silence (9000 samples) or max_s."""
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            ck, _ = s.read(1600)                    # 100 ms chunk
            frames.append(ck)
            rms = np.sqrt(np.mean((ck.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()

def stt(pcm):
    """Dump PCM to a temp WAV and transcribe with whisper-cli (-nt: no timestamps)."""
    f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(f, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
        w.writeframes(pcm.tobytes())
    try:
        return subprocess.check_output(["./whisper.cpp/build/bin/whisper-cli",
            "-m", "./whisper.cpp/models/ggml-tiny.en.bin", "-f", f, "-nt"],
            text=True).strip()
    finally:
        os.unlink(f)

def chat(history, text):
    """Append the user turn, call the local Ollama server, keep the reply in history."""
    history.append({"role": "user", "content": text})
    r = requests.post("http://127.0.0.1:11434/api/chat",
        json={"model": "llama3.1:8b-instruct-q4_K_M",
              "messages": history, "stream": False}).json()
    history.append(r["message"])
    return r["message"]["content"]

def speak(t):
    """Synthesize with Piper (raw 16-bit PCM @ 22.05 kHz) and play it."""
    p = subprocess.Popen(["piper", "--model", "en_US-amy-medium", "--output-raw"],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw, _ = p.communicate(t.encode())
    sd.play(np.frombuffer(raw, dtype=np.int16), 22050); sd.wait()

history = [{"role": "system", "content": "You are a concise edge voice assistant."}]
while True:
    text = stt(record())
    if not text:
        continue
    speak(chat(history, text))
```

## Step 6 — Bake it into a systemd unit

```ini
# /etc/systemd/system/edge-voice.service
[Unit]
Description=Edge voice agent
After=network.target ollama.service

[Service]
WorkingDirectory=/opt/voice
ExecStart=/usr/bin/python3 /opt/voice/agent.py
Restart=always

[Install]
WantedBy=multi-user.target
```

`sudo systemctl enable --now edge-voice`. The Orin now boots into a voice agent.

## Common pitfalls

- **Wrong CUDA arch.** Orin is SM 87, not 80. Build flags matter.
- **Power throttling.** Without Super mode, 8B Q4 runs at 6 tok/s instead of 15.
- **USB mic noise floor.** Cheap mics produce false VAD triggers; tune `threshold`.
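On the last pitfall: rather than hand-tuning `threshold`, you can derive it from a few seconds of room tone recorded through the same mic. A minimal sketch — the 3x margin is a rule of thumb, not a measured value:

```python
import numpy as np

def calibrate_threshold(ambient_int16, margin=3.0):
    # RMS of ambient noise, normalized to [-1, 1] like the record() loop,
    # scaled by a safety margin so speech must clearly exceed the floor
    x = ambient_int16.astype(np.float32) / 32768
    noise_rms = float(np.sqrt(np.mean(x ** 2)))
    return noise_rms * margin

# Example: a synthetic noise floor at ~0.1 RMS suggests a ~0.3 threshold
print(round(calibrate_threshold(np.full(16000, 3277, dtype=np.int16)), 2))  # → 0.3
```

Run it once at startup (record a second of silence via `sounddevice`) and pass the result into `record()`.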

## How CallSphere does this in production

We deploy edge appliances for vertical pilots — kiosks, vehicles, on-prem clinics — where outbound traffic is forbidden. Our 37 cloud agents across 6 verticals (Healthcare's 14 tools on FastAPI :8084 / OpenAI Realtime, OneRoof's 10 specialists on WebRTC, plus Salon, Dental, F&B, Behavioral) handle volume; Jetson handles privacy. Flat $149/$499/$1499 · [14-day trial](/trial) · [22% affiliate](/affiliate) · [/demo](/demo).

## FAQ

**Cheaper than a cloud call?** Yes after ~3,000 minutes/month/device.

**Real-time?** 2–3 s end-to-end on tiny.en + 8B Q4. Sub-second is possible with smaller models.
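The 2–3 s figure decomposes roughly as STT + time-to-first-token + generation + TTS, with generation dominating. A back-of-envelope sketch — every component number below is illustrative, not measured:

```python
def reply_latency_s(stt_s, ttft_s, reply_tokens, tok_per_s, tts_s):
    # Sequential pipeline: each stage waits for the previous one
    return stt_s + ttft_s + reply_tokens / tok_per_s + tts_s

# A ~25-token reply at 15 tok/s dominates the budget
print(round(reply_latency_s(0.4, 0.3, 25, 15, 0.3), 2))  # → 2.67
```

Shrinking the reply (shorter system prompt, terse persona) cuts latency faster than any hardware tweak.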

**Hot to the touch?** Without active cooling, yes — get the official thermal kit.

**Battery powered?** 25 W is too much for hand-held; fine for desk/vehicle.

**Update strategy?** Mender or rauc OTA — same as any embedded Linux device.

## Sources

- [NVIDIA Jetson Edge AI getting started](https://developer.nvidia.com/blog/getting-started-with-edge-ai-on-nvidia-jetson-llms-vlms-and-foundation-models-for-robotics/)
- [Robot brain on Orin Nano Super](https://thomasthelliez.com/blog/building-a-local-robot-brain-on-jetson-orin-nano-super/)
- [Open Voice OS on Jetson](https://forums.developer.nvidia.com/t/open-voice-os-on-jetson-orin-nano-offline-ai-assistant-with-llm-tts-stt-on-k3s/330132)
- [Edge AI on Jetson 2026 guide](https://www.edge-ai-vision.com/2026/01/getting-started-with-edge-ai-on-nvidia-jetson-llms-vlms-and-foundation-models-for-robotics/)

