k3s on the Edge for AI Voice: 200ms-or-Bust Topology (2026)
Run a 3-node k3s cluster at the network edge to push voice-agent first-token latency below 250 ms: MetalLB, NodeLocalDNS, Cloudflare Tunnel, and a tuned WebRTC port range.
TL;DR — k3s ships as a single binary under 100 MB. Three Hetzner ccx33s in three regions, plus MetalLB, NodeLocalDNS, and a tuned WebRTC port range, give you sub-250 ms voice-agent latency without the EKS bill.
What you'll set up
A 3-node k3s cluster across us-east, us-west, eu-central. Voice agents run on each node; LiveKit terminates WebRTC locally; OpenAI Realtime calls are routed to the nearest OpenAI region. End-user voice-to-voice latency stays under ~280 ms.
Architecture
```mermaid
flowchart TD
  USER[End user] --> ANYCAST[Cloudflare anycast]
  ANYCAST -->|nearest region| EDGE1[k3s us-east]
  ANYCAST --> EDGE2[k3s us-west]
  ANYCAST --> EDGE3[k3s eu-central]
  EDGE1 --> AG1[Voice agent pod]
  AG1 --> LK1[LiveKit]
  AG1 -->|Realtime WSS| OPENAI[OpenAI us-east]
```
Step 1 — Install k3s with the right disables
```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.0+k3s1 sh -s - server \
  --disable traefik \
  --disable servicelb \
  --tls-san edge-us-east.example.com \
  --kube-apiserver-arg="audit-log-path=/var/log/audit.log"
```
We disable Traefik (we'll use ingress-nginx) and the bundled ServiceLB (we'll use MetalLB or kube-vip). `--tls-san` adds the public hostname to the API server certificate so kubectl works from outside the cluster.
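Additional nodes join with the agent installer. The `K3S_URL`/`K3S_TOKEN` environment variables and the token path are standard k3s conventions; the hostname is a placeholder:

```shell
# On the server: read the join token
sudo cat /var/lib/rancher/k3s/server/node-token

# On each new node: join the cluster as an agent (worker)
curl -sfL https://get.k3s.io | \
  K3S_URL="https://edge-us-east.example.com:6443" \
  K3S_TOKEN="<node-token from above>" \
  sh -
```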
Step 2 — Add MetalLB for real LoadBalancer Services
```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: edge-pool
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.10-203.0.113.20
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2
  namespace: metallb-system
```
LiveKit needs a LoadBalancer Service with a stable external IP for SDP offers; MetalLB layer-2 is the simplest path on bare-metal/Hetzner.
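A minimal LoadBalancer Service sketch for LiveKit signaling; the name, namespace, and labels here are assumptions, and the ports follow LiveKit's documented defaults (7880 WebSocket signaling, 7881 TCP fallback):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: livekit-rtc        # hypothetical name
  namespace: livekit
spec:
  type: LoadBalancer       # MetalLB assigns an IP from edge-pool
  selector:
    app: livekit
  ports:
    - name: ws
      port: 7880
      protocol: TCP
    - name: rtc-tcp
      port: 7881
      protocol: TCP
```

The UDP media range itself is typically exposed via host networking on each node rather than through this Service; the stable MetalLB IP is what goes into DNS for signaling.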
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Tune the WebRTC port range
LiveKit claims UDP 50000-60000 by default. On Hetzner that means opening 10,000 UDP ports in each node's firewall. That works, but with Cloudflare's TURN relaying media for clients behind hard NATs, a much narrower range suffices. We use:
```yaml
livekit:
  rtc:
    port_range_start: 50000
    port_range_end: 50500
    use_external_ip: true
    stun_servers:
      - stun.cloudflare.com:3478
```
500 UDP ports per node works out to ~250 concurrent calls at roughly two ports per call. Enough for our edge density.
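The capacity arithmetic behind that estimate, as a sketch assuming roughly two UDP ports per active call:

```shell
PORT_START=50000
PORT_END=50500
PORTS_PER_CALL=2   # assumption: one RTP port pair per call

PORTS=$((PORT_END - PORT_START))
MAX_CALLS=$((PORTS / PORTS_PER_CALL))
echo "ports=${PORTS} max_concurrent_calls=${MAX_CALLS}"
```

Widening `port_range_end` (and the firewall rule) scales the ceiling linearly.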
Step 4 — NodeLocalDNS to kill DNS tail latency
```bash
kubectl apply -f https://github.com/kubernetes/dns/raw/master/cmd/node-cache/node-cache.yaml
```
OpenAI Realtime opens a fresh DNS lookup for api.openai.com on every reconnect; with NodeLocalDNS the resolve goes from 25 ms to 0.1 ms. Compounded over reconnects, this matters.
Step 5 — Pin agent pods to nodes via topologySpreadConstraints
```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: voice-agent
```
We want one agent replica per region — never two in us-east while us-west is empty.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Cloudflare Tunnel for ingress (no public ports)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
spec:
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args: ["tunnel", "--no-autoupdate", "run"]
          env:
            - name: TUNNEL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cf-tunnel
                  key: token
```
Cloudflare Tunnel terminates TLS at the edge, then proxies inbound HTTPS over an outbound-only mTLS connection. Zero public ports on the k3s server. The k3s control plane only allows traffic from cloudflared's IPs.
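Provisioning the tunnel and the `cf-tunnel` Secret that the Deployment references can look like this; `edge-ingress` is a placeholder tunnel name:

```shell
# Create a named tunnel under your Cloudflare account
cloudflared tunnel create edge-ingress

# Put the tunnel token into the Secret the cloudflared pod reads
kubectl create secret generic cf-tunnel \
  --from-literal=token="$(cloudflared tunnel token edge-ingress)"
```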
Step 7 — Anycast routing via Cloudflare load balancer
In Cloudflare Load Balancer, configure 3 origin pools (one per region) with health checks against /healthz/realtime. Geo-steering routes US callers to us-east, EU to eu-central. Failover automatic.
Pitfalls
- A single k3s server has no HA (and defaults to SQLite rather than etcd). For prod, run 3 server nodes with embedded etcd HA (`--cluster-init` on the first, then `--server` on the rest).
- MetalLB layer-2 with multiple nodes elects one node as the ARP responder, so all traffic for a Service IP traverses that node. Failover works, but the extra hop sometimes adds 1-2 ms.
- A UDP port range that's too narrow silently caps concurrent calls. Monitor LiveKit's `rtc_session_count` metric and widen the range before you hit the ceiling.
- NodeLocalDNS without tuning `ndots` is still slow. Set `dnsConfig.options: [{name: ndots, value: "2"}]` in your pods.
- Cloudflare Tunnel doesn't proxy UDP. WebRTC media must reach LiveKit directly via the MetalLB IP; only the signaling TLS goes through the Tunnel.
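The `ndots` fix as a pod-spec fragment; with a value of `"2"`, a fully qualified name like `api.openai.com` is resolved directly instead of first being expanded through the cluster search domains:

```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"  # skip search-domain expansion for external names
```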
How CallSphere does this in production
CallSphere runs a 3-node k3s edge fleet, with Postgres at 72.62.162.83 fronted by Cloudflare Tunnel. We run 37 voice agents and 90+ tools across 6 verticals on this exact topology. Healthcare and behavioral health get dedicated edge nodes for HIPAA isolation. p95 voice-to-voice latency is 280 ms in us-east and 310 ms in us-west. Plans are $149 / $499 / $1,499 with a 14-day trial, plus a 22% affiliate program and a dedicated healthcare offering.
FAQ
Q: k3s vs k0s on the edge?
k3s has stronger ecosystem support; k0s is slightly leaner (~40 MB vs ~70 MB). Either is fine.
Q: What about NVIDIA Jetson edges?
k3s runs cleanly on linux/arm64 Jetson nodes. We don't run on-device ASR — Realtime API in the cloud is faster than Jetson STT for most cases.
Q: How many concurrent voice sessions per ccx33?
~150 with OpenAI Realtime (network-bound); ~50 if you're also running local TTS.
Q: HA control plane?
Three server nodes: `--cluster-init` on the first, `--server https://lb-vip` on the rest, with a kube-vip floating IP in front of the API.
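A sketch of that bootstrap; hostnames are placeholders, and the VIP is assumed to already float across the three servers:

```shell
# Server 1: initialize embedded etcd
curl -sfL https://get.k3s.io | sh -s - server --cluster-init \
  --tls-san lb-vip.example.com

# Servers 2 and 3: join as additional control-plane nodes
curl -sfL https://get.k3s.io | \
  K3S_TOKEN="<node-token from server 1>" \
  sh -s - server --server https://lb-vip.example.com:6443 \
  --tls-san lb-vip.example.com
```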
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available. No signup required.