AI Infrastructure

k3s on the Edge for AI Voice: 200ms-or-Bust Topology (2026)

Run a 3-node k3s cluster at the network edge to slash voice-agent first-token latency below 250 ms. ServiceLB swapped for MetalLB, NodeLocalDNS, and a tuned WebRTC port range.

TL;DR — k3s ships as a single binary under 100 MB. Three Hetzner ccx33s in three regions + MetalLB + NodeLocalDNS + a tuned WebRTC port range gives you sub-250 ms voice-agent latency without the EKS bill.

What you'll set up

A 3-node k3s cluster across us-east, us-west, eu-central. Voice agents run on each node; LiveKit terminates WebRTC locally; OpenAI Realtime calls are routed to the nearest OpenAI region. End-user voice-to-voice latency stays under ~280 ms.

Architecture

```mermaid
flowchart TD
  USER[End user] --> ANYCAST[Cloudflare anycast]
  ANYCAST -->|nearest region| EDGE1[k3s us-east]
  ANYCAST --> EDGE2[k3s us-west]
  ANYCAST --> EDGE3[k3s eu-central]
  EDGE1 --> AG1[Voice agent pod]
  AG1 --> LK1[LiveKit]
  AG1 -->|Realtime WSS| OPENAI[OpenAI us-east]
```

Step 1 — Install k3s with the right disables

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.0+k3s1 sh -s - server \
  --disable traefik \
  --disable servicelb \
  --tls-san edge-us-east.example.com \
  --kube-apiserver-arg="audit-log-path=/var/log/audit.log"
```

We disable Traefik (we'll use ingress-nginx) and ServiceLB (we'll use MetalLB or kube-vip). --tls-san adds the public hostname to the API server certificate so kubectl works from outside the node.
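The same settings can live in a config file instead of install-time flags, which is easier to keep in version control. A sketch of /etc/rancher/k3s/config.yaml equivalent to the command above (hostname is a placeholder):

```yaml
# /etc/rancher/k3s/config.yaml — read by k3s server on start;
# keys mirror the CLI flags, repeatable flags become lists
disable:
  - traefik
  - servicelb
tls-san:
  - edge-us-east.example.com
kube-apiserver-arg:
  - audit-log-path=/var/log/audit.log
```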

Step 2 — Add MetalLB for real LoadBalancer Services

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: edge-pool
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.10-203.0.113.20
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2
  namespace: metallb-system
```

LiveKit needs a LoadBalancer Service with a stable external IP for SDP offers; MetalLB layer-2 is the simplest path on bare-metal/Hetzner.
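What that Service might look like is sketched below; names, namespace, and ports are illustrative (7880/7881 are LiveKit's usual signaling and RTC-over-TCP ports). The UDP media range typically rides hostNetwork on the pod instead, since a Service needs one entry per port.

```yaml
# Hypothetical LoadBalancer Service for LiveKit signaling,
# pinned to the MetalLB pool defined above
apiVersion: v1
kind: Service
metadata:
  name: livekit
  namespace: livekit
  annotations:
    metallb.universe.tf/address-pool: edge-pool
spec:
  type: LoadBalancer
  selector:
    app: livekit
  ports:
    - name: ws
      port: 7880
      targetPort: 7880
    - name: rtc-tcp
      port: 7881
      targetPort: 7881
```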

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 3 — Tune the WebRTC port range

LiveKit grabs UDP 50000-60000 by default. On Hetzner that means opening 10k UDP ports on each node's firewall. Workable, but a narrower range keeps the firewall surface small, with Cloudflare STUN handling NAT traversal. We use:

```yaml
livekit:
  rtc:
    port_range_start: 50000
    port_range_end: 50500
    use_external_ip: true
    stun_servers:
      - stun.cloudflare.com:3478
```

500 UDP ports per node = ~250 concurrent calls. Enough for our edge density.
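The arithmetic above as a quick shell sanity check; the 2-ports-per-call ratio is an assumption that varies with tracks and simulcast:

```shell
# Back-of-envelope capacity for the configured port range
port_start=50000
port_end=50500
ports_per_call=2                       # assumed ratio, not a LiveKit constant
ports=$(( port_end - port_start ))
echo "concurrent call capacity: $(( ports / ports_per_call ))"
```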

Step 4 — NodeLocalDNS to kill DNS tail latency

```bash
kubectl apply -f https://github.com/kubernetes/dns/raw/master/cmd/node-cache/node-cache.yaml
```

The OpenAI Realtime client performs a fresh DNS lookup for api.openai.com on every reconnect; with NodeLocalDNS caching on each node, the resolve drops from ~25 ms to ~0.1 ms. Compounded over reconnects, this matters.
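To put a number on "compounded": a back-of-envelope sketch, assuming a hypothetical 100 reconnects per hour across a node's sessions:

```shell
# Saving per lookup: ~25 ms cold resolve vs ~0.1 ms cached (figures from above)
reconnects_per_hour=100                # assumed workload, not a measured figure
saved_per_lookup_us=$(( 25000 - 100 )) # microseconds saved per lookup
total_saved_ms=$(( saved_per_lookup_us * reconnects_per_hour / 1000 ))
echo "DNS time saved per hour: ${total_saved_ms} ms"
```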

Step 5 — Pin agent pods to nodes via topologySpreadConstraints

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: voice-agent
```

We want one agent replica per region — never two in us-east while us-west is empty.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Cloudflare Tunnel for ingress (no public ports)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared
spec:
  selector:
    matchLabels:
      app: cloudflared
  template:
    metadata:
      labels:
        app: cloudflared
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args: ["tunnel", "--no-autoupdate", "run"]
          env:
            - name: TUNNEL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cf-tunnel
                  key: token
```

Cloudflare Tunnel terminates TLS at the edge, then proxies inbound HTTPS over an outbound-only mTLS connection. Zero public ports on the k3s server. The k3s control plane only allows traffic from cloudflared's IPs.
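If you prefer file-based tunnel configuration over the dashboard-managed token, a sketch of cloudflared's config.yml; tunnel name, hostname, and backend service here are hypothetical:

```yaml
# config.yml — routes tunneled hostnames to in-cluster services
tunnel: edge-us-east
credentials-file: /etc/cloudflared/edge-us-east.json
ingress:
  - hostname: edge-us-east.example.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - service: http_status:404   # catch-all rule, required last entry
```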

Step 7 — Anycast routing via Cloudflare load balancer

In Cloudflare Load Balancer, configure three origin pools (one per region) with health checks against /healthz/realtime. Geo-steering routes US callers to us-east and EU callers to eu-central; failover is automatic.

Pitfalls

  • k3s embedded etcd on a single server has no HA. For prod, use 3 server nodes with embedded etcd HA (--cluster-init then --server).
  • MetalLB layer-2 + multiple nodes elects one node as ARP responder. Failover is fine, but the Service IP traverses one node — sometimes adds 1-2 ms.
  • UDP port range too narrow silently caps concurrent calls. Monitor LiveKit rtc_session_count metric and widen if you hit it.
  • NodeLocalDNS without tuning ndots → still slow. Set dnsConfig.options: [{name: ndots, value: "2"}] in pods.
  • Cloudflare Tunnel + WebRTC — Tunnel doesn't proxy UDP. WebRTC must reach LiveKit directly via MetalLB IP; only the signaling TLS goes via Tunnel.
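The ndots pitfall above is fixed in the pod spec; a minimal fragment:

```yaml
# Pod-spec fragment: with ndots "2", names with two or more dots
# (e.g. api.openai.com) resolve as absolute names first instead of
# walking the cluster search domains
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```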

How CallSphere does this in production

CallSphere runs 37 voice agents and 90+ tools across 6 verticals on this exact topology: a 3-node k3s edge fleet with Postgres fronted by Cloudflare Tunnel. Healthcare and behavioral health get dedicated edge nodes for HIPAA isolation. p95 voice-to-voice latency is 280 ms in us-east and 310 ms in us-west.

FAQ

Q: k3s vs k0s on the edge? k3s has stronger ecosystem support; k0s is slightly leaner (40 MB vs 70 MB). Either is fine.

Q: What about NVIDIA Jetson edges? k3s runs cleanly on linux/arm64 Jetson nodes. We don't run on-device ASR — Realtime API in the cloud is faster than Jetson STT for most cases.

Q: How many concurrent voice sessions per ccx33? ~150 with OpenAI Realtime (network-bound), ~50 if you're running local TTS too.

Q: HA control plane? 3 server nodes with --cluster-init + --server https://lb-vip and a kube-vip floating IP for the API.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.