Build a Voice Agent on Jetson Orin Nano Super (Edge GPU, 2026)
Sub-$250 NVIDIA Jetson Orin Nano Super runs a full Whisper + 8B LLM + Piper voice loop offline at 15 tok/s. Here's the full Docker-based build with thermals, models, and code.
TL;DR — The Jetson Orin Nano Super (8 GB / 40 TOPS / ~$249) is the cheapest device that runs Whisper + an 8B LLM + Piper end-to-end with no cloud. Conversation loop: 2–3 seconds. Power: under 25 W.
What you'll build
A headless Jetson appliance booting into a Docker compose stack: whisper.cpp for STT, ollama (or llama.cpp server) for the LLM, piper for TTS, and a Python conversation loop. Talks via USB mic + 3.5 mm jack or Bluetooth speaker.
Prerequisites
- Jetson Orin Nano Super 8 GB with NVMe SSD and Super-mode firmware.
- JetPack 6.1+ flashed, then `sudo apt full-upgrade`.
- Docker + nvidia-container-toolkit configured for Jetson.
- USB conference mic (e.g., Anker PowerConf S3) and a 3.5 mm or Bluetooth speaker.
- A small fan if you don't have a vendor heatsink — Super mode runs the SoC at 25 W.
Architecture
```mermaid
flowchart LR
    MIC[USB Mic] --> APP[Python loop]
    APP -->|PCM| WCPP[whisper.cpp tiny.en CUDA]
    WCPP --> APP
    APP -->|HTTP| OLL[ollama llama3.1:8b q4]
    OLL --> APP
    APP --> PIP[piper amy-medium]
    PIP --> SPK[Speaker]
```
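The 2–3 second turn time quoted above can be sanity-checked with a back-of-envelope budget. The per-stage numbers below are assumptions derived from the figures in this post (15 tok/s generation, a short concise reply), not measurements:

```python
# Rough latency budget for one conversational turn on the Orin.
# All stage times are illustrative assumptions, not benchmarks.
stt_s = 0.4        # whisper.cpp tiny.en on a short utterance
llm_toks = 30      # typical concise reply length
llm_tps = 15       # llama3.1 8B Q4 throughput in Super mode
tts_s = 0.4        # piper amy-medium synthesis
total = stt_s + llm_toks / llm_tps + tts_s
print(f"~{total:.1f} s per turn")
```

The LLM dominates: at 15 tok/s, every 15 extra tokens in the reply adds a full second, which is why the system prompt later insists on conciseness.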
Step 1 — Maximize the Orin
```bash
sudo nvpmodel -m 0   # MAXN Super
sudo jetson_clocks   # Lock max clocks
```
Verify with `tegrastats` — you should see GPU @ 1020 MHz.
Step 2 — Build whisper.cpp with CUDA on Jetson
```bash
git clone https://github.com/ggml-org/whisper.cpp && cd whisper.cpp
cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build -j6
bash ./models/download-ggml-model.sh tiny.en
./build/bin/whisper-cli -m models/ggml-tiny.en.bin -f samples/jfk.wav
```
CUDA arch 87 corresponds to compute capability 8.7, the SM version of the Ampere-based Orin SoC. Anything else silently falls back to CPU.
Step 3 — Run Ollama with a Q4 model
```bash
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
ollama pull llama3.1:8b-instruct-q4_K_M
```
Ollama on Jetson autodetects the iGPU. Verify with `OLLAMA_DEBUG=1 ollama run` — look for `gpu="cuda"`.
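Beyond the debug log, Ollama's non-streaming responses include `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds), so the 15 tok/s claim can be checked from the same `/api/chat` call the conversation loop already makes. A small helper, shown here against a hypothetical response payload:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation throughput from an Ollama /api/chat (stream=False) response.

    eval_count and eval_duration are standard fields in Ollama's
    final response object; eval_duration is in nanoseconds.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Hypothetical payload: 45 tokens generated in 3 seconds.
sample = {"eval_count": 45, "eval_duration": 3_000_000_000}
print(tokens_per_second(sample))  # 15.0
```

If this reports single digits on an 8B Q4 model, the model is likely running on CPU or the board is power-throttled (see the pitfalls section).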
Step 4 — Install Piper
```bash
pip install piper-tts
python -m piper.download_voices en_US-amy-medium
echo "Hello from Orin" | piper --model en_US-amy-medium --output-raw \
  | aplay -r 22050 -f S16_LE -t raw -
```
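Piper synthesizes whatever string it is handed, so a long LLM reply delays the first audible audio. A common mitigation (not used in the loop below) is to split the reply at sentence boundaries and synthesize each chunk as it becomes available. A minimal splitter sketch:

```python
import re

def sentences(text: str) -> list[str]:
    """Split a reply on ., !, ? boundaries so each chunk can be
    piped to piper (and played) while the rest is still pending."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(sentences("Hello there. How can I help? Ask me anything!"))
# ['Hello there.', 'How can I help?', 'Ask me anything!']
```

Combined with Ollama's streaming mode, this gets time-to-first-audio close to time-to-first-sentence rather than time-to-full-reply.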
Step 5 — Conversation loop
```python
import sounddevice as sd, numpy as np, subprocess, requests, tempfile, wave

def record(threshold=0.012, max_s=8):
    """Capture 16 kHz mono PCM until ~0.5 s of silence or max_s seconds."""
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            ck, _ = s.read(1600)  # 100 ms chunk
            frames.append(ck)
            rms = np.sqrt(np.mean((ck.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()

def stt(pcm):
    """Write PCM to a temp WAV and transcribe it with whisper.cpp."""
    f = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(f, "wb") as w:
        w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
        w.writeframes(pcm.tobytes())
    return subprocess.check_output(
        ["./whisper.cpp/build/bin/whisper-cli",
         "-m", "./whisper.cpp/models/ggml-tiny.en.bin",
         "-f", f, "-nt", "-otxt"], text=True).strip()

def chat(history, text):
    """One non-streaming turn against the local Ollama server."""
    history.append({"role": "user", "content": text})
    r = requests.post("http://127.0.0.1:11434/api/chat",
                      json={"model": "llama3.1:8b-instruct-q4_K_M",
                            "messages": history, "stream": False}).json()
    history.append(r["message"])
    return r["message"]["content"]

def speak(t):
    """Pipe text through piper and play the raw 22.05 kHz mono output."""
    p = subprocess.Popen(["piper", "--model", "en_US-amy-medium", "--output-raw"],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    raw, _ = p.communicate(t.encode())
    sd.play(np.frombuffer(raw, dtype=np.int16), 22050); sd.wait()

history = [{"role": "system", "content": "You are a concise edge voice assistant."}]
while True:
    text = stt(record())
    if not text:
        continue
    speak(chat(history, text))
```
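The fixed `threshold=0.012` in `record()` is mic-dependent. A more robust approach is to sample a second of ambient audio at startup and set the threshold a few times above the measured noise floor. A sketch of the calculation (the microphone capture itself is omitted; the synthetic ambient array below stands in for ~1 s of recorded room noise):

```python
import numpy as np

def calibrate_threshold(ambient_int16: np.ndarray, margin: float = 3.0) -> float:
    """RMS of ambient int16 PCM, scaled by a safety margin.
    Pass ~1 s of samples recorded while nobody is speaking."""
    x = ambient_int16.astype(np.float32) / 32768.0
    noise_floor = float(np.sqrt(np.mean(x ** 2)))
    return margin * noise_floor

# Synthetic stand-in for a quiet room at 16 kHz (assumed noise level):
rng = np.random.default_rng(0)
ambient = rng.normal(0, 100, 16000).astype(np.int16)
print(round(calibrate_threshold(ambient), 4))
```

Pass the result as `record(threshold=...)`; a margin of 2–4× is a reasonable starting range, with higher margins for noisier rooms.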
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Bake it into a systemd unit
```ini
# /etc/systemd/system/edge-voice.service
[Unit]
Description=Edge voice agent
After=network.target ollama.service

[Service]
WorkingDirectory=/opt/voice
ExecStart=/usr/bin/python3 /opt/voice/agent.py
Restart=always

[Install]
WantedBy=multi-user.target
```
`sudo systemctl enable --now edge-voice`. The Orin now boots into a voice agent.
Common pitfalls
- Wrong CUDA arch. Orin is SM 87, not 80. Build flags matter.
- Power throttling. Without Super mode, 8B Q4 runs at 6 tok/s instead of 15.
- USB mic noise floor. Cheap mics produce false VAD triggers; tune `threshold` in `record()`.
How CallSphere does this in production
We deploy edge appliances for vertical pilots — kiosks, vehicles, on-prem clinics — where outbound traffic is forbidden. Our 37 cloud agents across 6 verticals (Healthcare's 14 tools on FastAPI :8084 / OpenAI Realtime, OneRoof's 10 specialists on WebRTC, plus Salon, Dental, F&B, Behavioral) handle volume; Jetson handles privacy. Flat $149/$499/$1499 · 14-day trial · 22% affiliate · /demo.
FAQ
Cheaper than a cloud call? Yes after ~3,000 minutes/month/device.
Real-time? 2–3 s end-to-end on tiny.en + 8B Q4. Sub-second is possible with smaller models.
Hot to the touch? Without active cooling, yes — get the official thermal kit.
Battery powered? 25 W is too much for hand-held; fine for desk/vehicle.
Update strategy? Mender or rauc OTA — same as any embedded Linux device.
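The ~3,000 minutes/month figure in the FAQ follows from simple arithmetic. The per-minute cloud rate below is an illustrative assumption, not a quoted price; the comparison is against the flat $149/month plan mentioned later in this post:

```python
# Assumptions (illustrative): metered cloud voice at $0.05/min
# vs. a flat $149/month plan served from the edge device.
cloud_per_min = 0.05
flat_monthly = 149.0
breakeven = flat_monthly / cloud_per_min  # minutes/month where flat wins
print(round(breakeven))
```

At $0.05/min the break-even lands near 2,980 minutes/month, consistent with the FAQ's ~3,000; a higher cloud rate pulls the break-even lower.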
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.