
Build a Voice Agent on Jetson Orin Nano Super (Edge GPU, 2026)

Sub-$250 NVIDIA Jetson Orin Nano Super runs a full Whisper + 8B LLM + Piper voice loop offline at 15 tok/s. Here's the full Docker-based build with thermals, models, and code.

TL;DR — The Jetson Orin Nano Super (8 GB / 40 TOPS / ~$249) is the cheapest device that runs Whisper + an 8B LLM + Piper end-to-end with no cloud. Conversation loop: 2–3 seconds. Power: under 25 W.

What you'll build

A headless Jetson appliance that boots into a Docker Compose stack: whisper.cpp for STT, Ollama (or a llama.cpp server) for the LLM, Piper for TTS, and a Python conversation loop tying them together. Audio comes in over a USB mic and goes out over the 3.5 mm jack or a Bluetooth speaker.
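
Before wiring anything up, it helps to confirm each piece of the stack is reachable. A minimal health-check sketch, assuming the whisper.cpp path from Step 2, piper on PATH from Step 4, and Ollama on its default port 11434 (adjust if your layout differs):

```python
# Quick stack health check: whisper-cli binary built, piper installed,
# and the Ollama HTTP API answering on its default port.
import pathlib
import shutil
import requests

checks = {
    "whisper-cli built": pathlib.Path("./whisper.cpp/build/bin/whisper-cli").exists(),
    "piper on PATH": shutil.which("piper") is not None,
}
try:
    checks["ollama responding"] = requests.get(
        "http://127.0.0.1:11434/api/tags", timeout=2).ok
except requests.RequestException:
    checks["ollama responding"] = False

for name, ok in checks.items():
    print(f"{'OK  ' if ok else 'FAIL'} {name}")
```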

Prerequisites

  1. Jetson Orin Nano Super 8 GB with NVMe SSD and Super-mode firmware.
  2. JetPack 6.1+ flashed and fully updated (sudo apt full-upgrade).
  3. Docker + nvidia-container-toolkit configured for Jetson.
  4. USB conference mic (e.g., Anker PowerConf S3) and a 3.5 mm or Bluetooth speaker.
  5. A small fan if you don't have a vendor heatsink — Super mode runs the SoC at 25 W.

Architecture

```mermaid
flowchart LR
  MIC[USB Mic] --> APP[Python loop]
  APP -->|PCM| WCPP[whisper.cpp tiny.en CUDA]
  WCPP --> APP
  APP -->|HTTP| OLL[ollama llama3.1:8b q4]
  OLL --> APP
  APP --> PIP[piper amy-medium]
  PIP --> SPK[Speaker]
```

Step 1 — Maximize the Orin

```bash
sudo nvpmodel -m 0   # MAXN Super
sudo jetson_clocks   # Lock max clocks
```

Verify with tegrastats — you should see GPU @ 1020 MHz.
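
If you'd rather script the check, here is a small sketch using nvpmodel -q; it assumes the stock JetPack output with an "NV Power Mode" line (mode names vary between firmware releases):

```python
# Scriptable version of the Step 1 sanity check: print the active power mode
# and warn if it doesn't look like MAXN/Super.
import subprocess

out = subprocess.run(["sudo", "nvpmodel", "-q"],
                     capture_output=True, text=True, check=True).stdout
mode = next((line.split(":", 1)[1].strip()
             for line in out.splitlines() if "Power Mode" in line), "unknown")
print(f"Current power mode: {mode}")
if "MAXN" not in mode.upper():
    print("Not in Super mode: re-run sudo nvpmodel -m 0 && sudo jetson_clocks")
```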

Step 2 — Build whisper.cpp with CUDA on Jetson

```bash
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp
cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build -j6
bash ./models/download-ggml-model.sh tiny.en
./build/bin/whisper-cli -m models/ggml-tiny.en.bin -f samples/jfk.wav
```

CUDA arch 87 is the SM version for the Ampere-based Orin. Anything else silently falls back to CPU.
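
To confirm the build actually offloads to the GPU, run the sample transcription and look for CUDA in the startup output. A rough sketch (run from inside the whisper.cpp directory; the exact log wording varies between whisper.cpp versions, so this only greps for a CUDA mention and reports the wall-clock time):

```python
# Rough check that whisper-cli was built with CUDA: run the bundled JFK sample
# and look for any CUDA mention in the init output.
import subprocess
import time

cmd = ["./build/bin/whisper-cli",
       "-m", "models/ggml-tiny.en.bin",
       "-f", "samples/jfk.wav"]

start = time.perf_counter()
proc = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.perf_counter() - start

uses_cuda = "cuda" in (proc.stderr + proc.stdout).lower()
print(f"transcribed jfk.wav in {elapsed:.2f}s, CUDA mentioned in logs: {uses_cuda}")
if not uses_cuda:
    print("Looks like a CPU-only build; re-run cmake with -DGGML_CUDA=1 and arch 87.")
```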

Step 3 — Run Ollama with a Q4 model

```bash
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
ollama pull llama3.1:8b-instruct-q4_K_M
```


Ollama on Jetson autodetects the iGPU. Verify with OLLAMA_DEBUG=1 ollama run — look for gpu="cuda".
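
To put a number on throughput, the non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds), which is enough to confirm the ~15 tok/s figure. A small sketch using the model pulled above:

```python
# Measure decode throughput through the Ollama HTTP API.
import requests

r = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "llama3.1:8b-instruct-q4_K_M",
          "prompt": "Explain edge computing in two sentences.",
          "stream": False},
    timeout=120,
).json()

tok_s = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{r['eval_count']} tokens in {r['eval_duration'] / 1e9:.1f}s -> {tok_s:.1f} tok/s")
```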

Step 4 — Install Piper

```bash
pip install piper-tts
python -m piper.download_voices en_US-amy-medium
echo "Hello from Orin" | piper --model en_US-amy-medium --output-raw \
  | aplay -r 22050 -f S16_LE -t raw -
```

Step 5 — Conversation loop

```python
import sounddevice as sd
import numpy as np
import subprocess, requests, tempfile, wave

def record(threshold=0.012, max_s=8):
    """Capture 16 kHz mono audio until ~0.5 s of trailing silence or max_s seconds."""
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            ck, _ = s.read(1600)                                   # 100 ms chunks
            frames.append(ck)
            rms = np.sqrt(np.mean((ck.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()

def stt(pcm):
    """Write PCM to a temp WAV and transcribe it with whisper.cpp."""
    f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(f, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
        w.writeframes(pcm.tobytes())
    return subprocess.check_output(
        ["./whisper.cpp/build/bin/whisper-cli",
         "-m", "./whisper.cpp/models/ggml-tiny.en.bin",
         "-f", f, "-nt", "-otxt"],
        text=True).strip()

def chat(history, text):
    """Send the running history to Ollama and return the assistant reply."""
    history.append({"role": "user", "content": text})
    r = requests.post("http://127.0.0.1:11434/api/chat",
                      json={"model": "llama3.1:8b-instruct-q4_K_M",
                            "messages": history, "stream": False}).json()
    history.append(r["message"])
    return r["message"]["content"]

def speak(t):
    """Pipe text through Piper and play the raw 22.05 kHz audio."""
    p = subprocess.Popen(["piper", "--model", "en_US-amy-medium", "--output-raw"],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw, _ = p.communicate(t.encode())
    sd.play(np.frombuffer(raw, dtype=np.int16), 22050)
    sd.wait()

if __name__ == "__main__":
    history = [{"role": "system", "content": "You are a concise edge voice assistant."}]
    while True:
        text = stt(record())
        if not text:
            continue
        speak(chat(history, text))
```
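
To see where the 2–3 seconds go, time each stage separately. A rough sketch, assuming the loop above is saved as agent.py (the __main__ guard keeps the import from starting the loop):

```python
# Per-stage latency breakdown for one conversational turn.
import time
from agent import record, stt, chat, speak

history = [{"role": "system", "content": "You are a concise edge voice assistant."}]

pcm = record()
t0 = time.perf_counter()
text = stt(pcm)
t1 = time.perf_counter()
reply = chat(history, text)
t2 = time.perf_counter()
speak(reply)
t3 = time.perf_counter()

print(f"STT {t1 - t0:.2f}s | LLM {t2 - t1:.2f}s | TTS+playback {t3 - t2:.2f}s")
```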


Step 6 — Bake it into a systemd unit

```ini
# /etc/systemd/system/edge-voice.service
[Unit]
Description=Edge voice agent
After=network.target ollama.service

[Service]
WorkingDirectory=/opt/voice
ExecStart=/usr/bin/python3 /opt/voice/agent.py
Restart=always

[Install]
WantedBy=multi-user.target
```

sudo systemctl enable --now edge-voice. The Orin now boots into a voice agent.

Common pitfalls

  • Wrong CUDA arch. Orin is SM 87, not 80. Build flags matter.
  • Power throttling. Without Super mode, 8B Q4 runs at 6 tok/s instead of 15.
  • USB mic noise floor. Cheap mics produce false VAD triggers; tune the threshold using the calibration sketch below.
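
Rather than guessing the threshold, measure the mic's ambient noise floor with the same 16 kHz mono settings the loop uses and set the threshold a few times above it. A minimal calibration sketch:

```python
# Measure the ambient RMS of the USB mic so the VAD threshold in record()
# sits just above the noise floor.
import sounddevice as sd
import numpy as np

SECONDS = 3
print(f"Stay quiet for {SECONDS} seconds...")
audio = sd.rec(int(16000 * SECONDS), samplerate=16000, channels=1, dtype="int16")
sd.wait()

rms = float(np.sqrt(np.mean((audio.astype(np.float32) / 32768) ** 2)))
print(f"ambient RMS ~ {rms:.4f}; try a threshold around {rms * 3:.4f}")
```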

How CallSphere does this in production

We deploy edge appliances for vertical pilots (kiosks, vehicles, on-prem clinics) where outbound traffic is forbidden. Our 37 cloud agents across 6 verticals (Healthcare's 14 tools on FastAPI :8084 / OpenAI Realtime, OneRoof's 10 specialists on WebRTC, plus Salon, Dental, F&B, and Behavioral) handle volume; Jetson handles privacy. Plans are flat at $149/$499/$1,499 with a 14-day trial and a 22% affiliate program; see /demo.

FAQ

Cheaper than a cloud call? Yes after ~3,000 minutes/month/device.

Real-time? 2–3 s end-to-end on tiny.en + 8B Q4. Sub-second is possible with smaller models.

Hot to the touch? Without active cooling, yes — get the official thermal kit.

Battery powered? 25 W is too much for hand-held; fine for desk/vehicle.

Update strategy? Mender or RAUC OTA — same as any embedded Linux device.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.