Monitoring WebSocket Health: Heartbeats and Prometheus in 2026
How to actually observe a WebSocket fleet: ping/pong heartbeats, Prometheus metrics that matter, dead-man switches, and the alerts that fire before customers notice.
Half of WebSocket disconnects are silent. The TCP socket is open, the application is dead, and your users are watching a frozen page. Heartbeats are the only way to find out before they tweet about it.
Why are heartbeats mandatory for WebSocket?
Because TCP keepalive runs at the OS level and only fires after 2 hours by default. In application time, a connection can be functionally dead for 90 minutes before any platform notices. A misbehaving NAT box, a crashed worker, or a frozen JavaScript event loop all produce "connection looks open, no data flows."

For context, the kind of pipeline where that hurts: CallSphere's Twilio Media Streams to OpenAI Realtime bridge, where both external legs are WebSockets.

```mermaid
flowchart LR
    Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
    Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
    OAI --> Bridge
    Bridge --> Twilio
    Bridge --> Logs[(structured logs · OTel)]
```
Heartbeats — application-layer ping/pong every 30–60 seconds — close that gap. The contract is simple: server pings every 30 s, client must pong within 10 s, two missed pongs and the server closes. The same logic applies in reverse for client-driven heartbeats.
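In the browser the same contract has to ride on ordinary messages, because the WebSocket API does not expose ping/pong frames. A minimal client-side sketch, assuming a JSON {"type":"ping"} / {"type":"pong"} convention the server echoes; the URL and message shape are illustrative:

```ts
// Client-driven heartbeat: ping every 30 s, expect a pong within 10 s, close after two misses.
// Assumes every text frame from the server is JSON.
const ws = new WebSocket("wss://example.com/ws");
let missedPongs = 0;

const heartbeat = setInterval(() => {
  ws.send(JSON.stringify({ type: "ping" }));

  const onMessage = (ev: MessageEvent) => {
    if (typeof ev.data !== "string") return;
    if (JSON.parse(ev.data).type === "pong") {
      missedPongs = 0;
      clearTimeout(deadline);
      ws.removeEventListener("message", onMessage);
    }
  };

  const deadline = setTimeout(() => {
    ws.removeEventListener("message", onMessage);
    missedPongs += 1;                                  // no pong within 10 s
    if (missedPongs >= 2) ws.close(4000, "heartbeat timeout");
  }, 10_000);

  ws.addEventListener("message", onMessage);
}, 30_000);

ws.addEventListener("close", () => clearInterval(heartbeat));
```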
How do you actually observe the fleet?
Three layers of observability are non-negotiable:
- Per-connection liveness via heartbeats described above.
- Aggregate metrics in Prometheus — connection count, message rate, latency histogram, reconnection rate.
- Dead man's switch — an external system that expects regular signals and screams when they stop.
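The third layer can be a few lines of code: ping an external check on a schedule and let it scream when the pings stop. A minimal sketch; the URL is a placeholder for whatever dead-man's-switch service you use, and the 60 s cadence is an assumption:

```ts
// Dead-man's switch: report "still alive" every 60 s. If the process crashes, freezes,
// or loses the network, the pings stop and the external service raises the alarm.
const DEADMAN_URL = "https://example.com/deadman/<check-id>";   // placeholder

setInterval(async () => {
  try {
    await fetch(DEADMAN_URL, { method: "POST" });
  } catch {
    // A failed ping is itself the signal; the external side pages when pings stop.
  }
}, 60_000);
```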
The metrics that actually matter:
- websocket_connections_active (gauge) — current open count.
- websocket_connections_total{result} (counter) — accept/reject/close rate.
- websocket_message_duration_seconds (histogram) — handler processing time.
- websocket_handshake_duration_seconds (histogram) — upgrade latency.
- websocket_buffered_amount_bytes (gauge) — backpressure indicator.
- websocket_pong_missed_total (counter) — heartbeat failures.
Alert on rate-of-change of active, sustained pong_missed, and any nonzero buffered_amount for > 30 s.
CallSphere's implementation
CallSphere instruments every WebSocket service across six verticals with the same metric set:
- Heartbeats every 30 s, 10 s pong timeout, 2-strike close. Dashboard, Healthcare, Sales Calling, all consistent.
- Prometheus + Grafana dashboards with one row per service and one panel per metric above.
- OpenTelemetry traces wrapping each WebSocket handler invocation; exported to self-hosted Tempo.
- Dead-man switch for the inbound Twilio Media Streams WebSocket — if no media event arrives for 90 s on any active call, PagerDuty pages.
The audit trail crosses all 115+ database tables, so a stuck WebSocket triggers correlated checks across Postgres health, Redis lag, and OpenAI API status. Mean time to detect a stuck WebSocket session in production is 47 seconds.
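In sketch form, that per-call dead-man switch is a timer that every media frame resets. The function name and the page() hook below are illustrative, not CallSphere's actual code:

```ts
// One watchdog per active call. Twilio Media Streams sends a "media" message roughly every
// 20 ms while audio flows, so 90 s of silence means the stream is stuck. `page` stands in
// for the alerting hook (PagerDuty in the setup above).
function armMediaWatchdog(callSid: string, page: (msg: string) => void) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const reset = () => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => page(`No media frames for 90 s on call ${callSid}`), 90_000);
  };
  reset();
  return {
    onMediaEvent: reset,                              // call for every inbound "media" message
    disarm: () => timer && clearTimeout(timer),       // call when the call ends normally
  };
}
```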
Code: heartbeat + Prometheus instrumentation
```ts
import { WebSocketServer } from "ws";
import { Counter, Gauge, Histogram } from "prom-client";

// prom-client rejects empty help strings, so each metric needs a real description.
const active = new Gauge({ name: "websocket_connections_active", help: "Currently open WebSocket connections" });
const missed = new Counter({ name: "websocket_pong_missed_total", help: "Heartbeat pongs not received in time" });
const handlerMs = new Histogram({ name: "websocket_message_duration_seconds", help: "Message handler time in seconds" });

// Stand-in for the application's real message handler.
async function handle(_msg: unknown): Promise<void> {}

const wss = new WebSocketServer({ port: 8080 }); // port is illustrative

wss.on("connection", (ws) => {
  active.inc();
  let alive = true;
  ws.on("pong", () => { alive = true; });

  // Simplified heartbeat: a socket that has not ponged by the next 30 s tick is terminated.
  // (The production contract described above uses a 10 s pong timeout and two strikes.)
  const iv = setInterval(() => {
    if (!alive) { missed.inc(); ws.terminate(); return; }
    alive = false;
    ws.ping();
  }, 30_000);

  ws.on("close", () => { active.dec(); clearInterval(iv); });
  ws.on("message", async (m) => { const t = handlerMs.startTimer(); await handle(m); t(); });
});
```
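The backpressure gauge from the metric list rides on the same wss instance; a sketch, with the 5 s sampling interval as an assumption:

```ts
// Backpressure gauge: sample the largest unflushed send buffer across open sockets every 5 s.
// Gauge and wss come from the block above; a value that stays above zero means clients
// are not draining what you send.
const buffered = new Gauge({
  name: "websocket_buffered_amount_bytes",
  help: "Largest per-connection send buffer not yet flushed",
});

setInterval(() => {
  let max = 0;
  for (const client of wss.clients) max = Math.max(max, client.bufferedAmount);
  buffered.set(max);
}, 5_000);
```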
Build steps
- Add ping/pong heartbeats to every WebSocket service. 30 s interval, 10 s pong timeout.
- Expose Prometheus metrics on a separate /metrics port; never on the same port as the WebSocket itself (see the sketch after this list).
- Build a Grafana dashboard with the six metrics above. Add a single "service health" stat panel that aggregates them.
- Wire alerts: sustained pong-miss rate > 1%, active-connection drop > 50% in 1 min, any p99 handshake latency > 2 s.
- Add an external dead-man switch (we use a healthcheck.io pinger from a separate region) for catastrophic failure detection.
- Run a quarterly chaos test that kills a worker mid-session; verify alerts fire within 60 s.
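The separate /metrics port from the second build step can be as small as this sketch; port 9100 and prom-client's default registry are assumptions:

```ts
import http from "node:http";
import { register } from "prom-client";

// Serve /metrics on its own listener so scrapes never share a port with WebSocket traffic.
http
  .createServer(async (req, res) => {
    if (req.url === "/metrics") {
      res.setHeader("Content-Type", register.contentType);
      res.end(await register.metrics());
    } else {
      res.statusCode = 404;
      res.end();
    }
  })
  .listen(9100);
```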
FAQ
Why not rely on TCP keepalive? OS defaults are 2 hours. You can lower it but it is brittle and platform-specific. Application-level ping is reliable.
How often should I ping? 30 s is the sweet spot. More frequent wastes bandwidth on mobile; less frequent misses outages.
What about NAT timeouts? Some carrier NATs drop idle connections after 90 s. Ping every 30 s keeps NAT entries alive.
Should I log every ping? No. Log only failures. Pings should be silent in normal operation.
Where do I store the metrics long-term? Prometheus retains 15 days by default. Add Mimir or Thanos for longer history; we keep 90 days.
CallSphere ships HIPAA + SOC 2 monitoring across 37 agents and 90+ tools. Start the 14-day trial, join the affiliate program, or book a demo — pricing is $149/$499/$1499.