---
title: "k3s on the Edge for AI Voice: 200ms-or-Bust Topology (2026)"
description: "Run a 3-node k3s cluster at the network edge to slash voice-agent first-token latency below 250ms. ServerlessLB, MetalLB, NodeLocalDNS, and tuned WebRTC ports."
canonical: https://callsphere.ai/blog/vw6h-k3s-edge-ai-voice-agent-low-latency-2026
category: "AI Infrastructure"
tags: ["k3s", "Edge Computing", "Voice AI", "WebRTC", "Tutorial"]
author: "CallSphere Team"
published: 2026-03-29T00:00:00.000Z
updated: 2026-05-07T16:46:15.588Z
---

# k3s on the Edge for AI Voice: 200ms-or-Bust Topology (2026)

> Run a 3-node k3s cluster at the network edge to slash voice-agent first-token latency below 250ms. ServerlessLB, MetalLB, NodeLocalDNS, and tuned WebRTC ports.

> **TL;DR** — k3s runs on a single binary under 100 MB. Three Hetzner ccx33s in three regions + MetalLB + NodeLocalDNS + tuned WebRTC port range gives you sub-250 ms voice-agent latency without the EKS bill.

## What you'll set up

A 3-node k3s cluster across us-east, us-west, eu-central. Voice agents run on each node; LiveKit terminates WebRTC locally; OpenAI Realtime calls are routed to the nearest OpenAI region. End-user voice-to-voice latency stays under ~280 ms.

## Architecture

```mermaid
flowchart TD
  USER[End user] --> ANYCAST[Cloudflare anycast]
  ANYCAST -->|nearest region| EDGE1[k3s us-east]
  ANYCAST --> EDGE2[k3s us-west]
  ANYCAST --> EDGE3[k3s eu-central]
  EDGE1 --> AG1[Voice agent pod]
  AG1 --> LK1[LiveKit]
  AG1 -->|Realtime WSS| OPENAI[OpenAI us-east]
```

## Step 1 — Install k3s with the right disables

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.0+k3s1 sh -s - server \
  --disable traefik \
  --disable servicelb \
  --tls-san edge-us-east.example.com \
  --kube-apiserver-arg="audit-log-path=/var/log/audit.log"
```

We disable Traefik (we'll use ingress-nginx) and ServiceLB (we'll use MetalLB or kube-vip). `tls-san` makes `kubectl` work from outside.
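The same settings can also live in k3s's declarative config file, which survives re-running the installer. A sketch mirroring the flags above (the path is k3s's default; the hostname is the one from the install command):

```yaml
# /etc/rancher/k3s/config.yaml -- config-file equivalent of the CLI flags
disable:
  - traefik
  - servicelb
tls-san:
  - edge-us-east.example.com
kube-apiserver-arg:
  - "audit-log-path=/var/log/audit.log"
```

With this file in place, a bare `curl -sfL https://get.k3s.io | sh -` picks the settings up automatically.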

## Step 2 — Add MetalLB for real LoadBalancer Services

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata: { name: edge-pool, namespace: metallb-system }
spec:
  addresses: ["203.0.113.10-203.0.113.20"]
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata: { name: l2, namespace: metallb-system }
```

LiveKit needs a `LoadBalancer` Service with a stable external IP for SDP offers; MetalLB layer-2 is the simplest path on bare-metal/Hetzner.
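To make this concrete, here is a hypothetical signaling Service that MetalLB would assign an IP to from `edge-pool`. The name, namespace, selector, and ports are illustrative assumptions, not the canonical LiveKit chart output:

```yaml
# Hypothetical LiveKit signaling Service; MetalLB hands it an
# external IP from edge-pool as soon as it is created.
apiVersion: v1
kind: Service
metadata:
  name: livekit-server
  namespace: livekit
spec:
  type: LoadBalancer
  selector:
    app: livekit
  ports:
    - name: https-signal
      port: 443
      targetPort: 7880   # LiveKit's default HTTP/WS port
      protocol: TCP
```

`kubectl get svc -n livekit` should then show an `EXTERNAL-IP` from the 203.0.113.10-20 range, which is the stable address your SDP offers reference.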

## Step 3 — Tune the WebRTC port range

LiveKit allocates UDP 50000-60000 by default. On Hetzner that means opening 10,000 UDP ports on each node's firewall. That works, but a narrower range is easier to audit, and Cloudflare's STUN (with TURN as a relay fallback) covers clients that can't reach the media ports directly. We use:

```yaml
livekit:
  rtc:
    port_range_start: 50000
    port_range_end: 50500
    use_external_ip: true
    stun_servers: ["stun.cloudflare.com:3478"]
```

500 UDP ports per node = ~250 concurrent calls. Enough for our edge density.
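A Kubernetes Service cannot expose a UDP port range, so a common approach for LiveKit's media ports is host networking: the range above then reaches the pod directly on the node's IP. A minimal fragment, assuming it is patched into whatever Deployment your LiveKit install produced:

```yaml
# Sketch: host networking so UDP 50000-50500 hits the LiveKit pod
# without per-port Service entries. Only these two keys matter here.
spec:
  template:
    spec:
      hostNetwork: true
      # keep resolving cluster-internal names despite hostNetwork
      dnsPolicy: ClusterFirstWithHostNet
```

The trade-off: one LiveKit pod per node (the ports are claimed on the host), which matches the one-node-per-region topology here anyway.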

## Step 4 — NodeLocalDNS to kill DNS tail latency

```bash
kubectl apply -f https://github.com/kubernetes/dns/raw/master/cmd/node-cache/node-cache.yaml
```

OpenAI Realtime opens a fresh DNS lookup for `api.openai.com` on every reconnect; with NodeLocalDNS the resolve goes from 25 ms to 0.1 ms. Compounded over reconnects, this matters.
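The cache helps most when pods also stop expanding external names like `api.openai.com` through the cluster search domains. A fragment for the agent pod spec (the `ndots` value matches the Pitfalls note further down):

```yaml
# Per-pod DNS tuning for the agent Deployment: names with 2+ dots
# are tried as-is first instead of cycling through search domains.
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
```

Without this, the default `ndots: 5` means each `api.openai.com` lookup first fails through `*.svc.cluster.local` suffixes before resolving.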

## Step 5 — Pin agent pods to nodes via topologySpreadConstraints

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels: { app: voice-agent }
```

We want one agent replica per region — never two in us-east while us-west is empty.
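For the constraint to bite, each node needs a `topology.kubernetes.io/zone` label (bare-metal k3s does not set one for you) and the Deployment needs one replica per zone. A fuller sketch with assumed names and a placeholder image:

```yaml
# Sketch of the surrounding Deployment: 3 replicas, maxSkew 1 across
# 3 labeled zones forces exactly one replica per region.
apiVersion: apps/v1
kind: Deployment
metadata: { name: voice-agent }
spec:
  replicas: 3
  selector:
    matchLabels: { app: voice-agent }
  template:
    metadata:
      labels: { app: voice-agent }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: voice-agent }
      containers:
        - name: agent
          image: ghcr.io/example/voice-agent:latest  # placeholder image
```

Label nodes once at join time, e.g. `kubectl label node edge-us-east topology.kubernetes.io/zone=us-east`.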

## Step 6 — Cloudflare Tunnel for ingress (no public ports)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: cloudflared }
spec:
  template:
    spec:
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest
          args: ["tunnel","--no-autoupdate","run"]
          env:
            - name: TUNNEL_TOKEN
              valueFrom: { secretKeyRef: { name: cf-tunnel, key: token }}
```

Cloudflare Tunnel terminates TLS at the edge, then proxies inbound HTTPS over an outbound-only mTLS connection. Zero public ports on the k3s server. The k3s control plane only allows traffic from cloudflared's IPs.
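With `TUNNEL_TOKEN`, the ingress rules live in the Cloudflare dashboard (remotely managed tunnels). For reference, a locally managed `config.yml` equivalent would look roughly like this; hostname and backend Service address are assumptions:

```yaml
# Locally-managed cloudflared config (token-run tunnels keep these
# rules in Cloudflare's dashboard instead).
tunnel: edge-us-east
credentials-file: /etc/cloudflared/creds.json
ingress:
  - hostname: edge-us-east.example.com
    service: http://ingress-nginx-controller.ingress-nginx.svc:80
  - service: http_status:404   # required catch-all, must be last
```

Either way, only HTTP(S) signaling rides the tunnel; the media path is covered in the Pitfalls section.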

## Step 7 — Anycast routing via Cloudflare load balancer

In Cloudflare Load Balancer, configure three origin pools (one per region) with health checks against `/healthz/realtime`. Geo-steering routes US callers to us-east and EU callers to eu-central; failover between pools is automatic.
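The health checks need something to probe on each origin. A sketch of an ingress-nginx route for the probe path; the backing Service name and port are assumptions for illustration:

```yaml
# Hypothetical Ingress exposing the path the origin pools probe.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: realtime-health
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /healthz/realtime
            pathType: Exact
            backend:
              service:
                name: voice-agent
                port: { number: 8080 }
```

The agent should answer 200 only when its OpenAI Realtime connection is healthy, so a regional outage actually drains traffic.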

## Pitfalls

- **k3s embedded etcd** on a single server has no HA. For prod, use 3 server nodes with embedded etcd HA (`--cluster-init` then `--server`).
- **MetalLB layer-2 + multiple nodes** elects one node as ARP responder. Failover is fine, but the Service IP traverses one node — sometimes adds 1-2 ms.
- **UDP port range too narrow** silently caps concurrent calls. Monitor LiveKit `rtc_session_count` metric and widen if you hit it.
- **NodeLocalDNS without tuning ndots** → still slow. Set `dnsConfig.options: [{name: ndots, value: "2"}]` in pods.
- **Cloudflare Tunnel + WebRTC** — Tunnel doesn't proxy UDP. WebRTC must reach LiveKit directly via MetalLB IP; only the signaling TLS goes via Tunnel.

## How CallSphere does this in production

CallSphere runs a 3-node k3s edge fleet with Postgres at 72.62.162.83 fronted by Cloudflare Tunnel. We run 37 voice agents and 90+ tools across 6 verticals on this exact topology. Healthcare and behavioral health get dedicated edge nodes for HIPAA isolation. p95 voice-to-voice latency is 280 ms us-east, 310 ms us-west. $149 / $499 / $1499 with 14-day [trial](/trial); 22% [affiliate](/affiliate); see [healthcare](/industries/healthcare).

## FAQ

**Q: k3s vs k0s on the edge?**
k3s has stronger ecosystem support; k0s is slightly leaner (40 MB vs 70 MB). Either is fine.

**Q: What about NVIDIA Jetson edges?**
k3s runs cleanly on `linux/arm64` Jetson nodes. We don't run on-device ASR — Realtime API in the cloud is faster than Jetson STT for most cases.

**Q: How many concurrent voice sessions per ccx33?**
~150 with OpenAI Realtime (network-bound), ~50 if you're running local TTS too.

**Q: HA control plane?**
3 server nodes with `--cluster-init` + `--server https://lb-vip` and a kube-vip floating IP for the API.
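In config-file form, that bootstrap looks roughly like this (hostname assumed; `edge-api.example.com` resolves to the kube-vip VIP, and the join `token` comes from the first server):

```yaml
# Sketch: /etc/rancher/k3s/config.yaml on server 1 (bootstraps etcd)...
cluster-init: true
tls-san:
  - edge-api.example.com
---
# ...and on servers 2 and 3 (join the existing etcd cluster).
server: https://edge-api.example.com:6443
tls-san:
  - edge-api.example.com
```

With three etcd members, the cluster tolerates the loss of any one region's server node.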

## Sources

- [How OpenAI delivers low-latency voice AI at scale](https://openai.com/index/delivering-low-latency-voice-ai-at-scale/)
- [How to Deploy AI/ML Models at the Edge with K3s](https://oneuptime.com/blog/post/2026-03-20-k3s-edge-ai-ml/view)
- [What is K3s? Lightweight Kubernetes for Edge — Devtron](https://devtron.ai/what-is-k3s)
- [K3s and K0s: Lightweight Kubernetes for Edge](https://dasroot.net/posts/2026/04/k3s-k0s-lightweight-kubernetes-edge-development/)
- [How to Configure K3s for Edge Deployment](https://oneuptime.com/blog/post/2026-01-27-k3s-edge-deployment/view)

---

Source: https://callsphere.ai/blog/vw6h-k3s-edge-ai-voice-agent-low-latency-2026
