
Monitoring WebSocket Health: Heartbeats and Prometheus in 2026

How to actually observe a WebSocket fleet: ping/pong heartbeats, Prometheus metrics that matter, dead-man switches, and the alerts that fire before customers notice.

Many WebSocket disconnects are silent. The TCP socket is open, the application is dead, and your users are watching a frozen page. Heartbeats are the only way to find out before they tweet about it.

Why are heartbeats mandatory for WebSockets?

flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
CallSphere reference architecture

Because TCP keepalive runs at the OS level and, by default, only probes after two hours of idle time. A connection can therefore be functionally dead for well over an hour before anything notices. A misbehaving NAT box, a crashed worker, or a frozen JavaScript event loop all produce the same symptom: the connection looks open, but no data flows.

Heartbeats — application-layer ping/pong every 30–60 seconds — close that gap. The contract is simple: server pings every 30 s, client must pong within 10 s, two missed pongs and the server closes. The same logic applies in reverse for client-driven heartbeats.
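
Browsers do not expose the protocol-level ping frame, so a client-driven heartbeat has to use an application message instead. A minimal sketch, assuming the server answers {"type":"ping"} with {"type":"pong"} (the message shape is illustrative, not a fixed protocol):

function startClientHeartbeat(ws, intervalMs = 30_000, timeoutMs = 10_000) {
  let strikes = 0;
  let pongTimer;
  const iv = setInterval(() => {
    ws.send(JSON.stringify({ type: "ping" }));
    pongTimer = setTimeout(() => {
      if (++strikes >= 2) ws.close(4000, "heartbeat timeout"); // 2-strike close, mirroring the server side
    }, timeoutMs);
  }, intervalMs);
  ws.addEventListener("message", (ev) => {
    try {
      if (JSON.parse(ev.data).type === "pong") { strikes = 0; clearTimeout(pongTimer); }
    } catch { /* ignore non-JSON frames */ }
  });
  ws.addEventListener("close", () => { clearInterval(iv); clearTimeout(pongTimer); });
}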

How do you actually observe the fleet?

Three layers of observability are non-negotiable:

  1. Per-connection liveness via the heartbeats described above.
  2. Aggregate metrics in Prometheus — connection count, message rate, latency histogram, reconnection rate.
  3. Dead man's switch — an external system that expects regular signals and screams when they stop.

The metrics that actually matter:

  • websocket_connections_active (gauge) — current open count.
  • websocket_connections_total{result} (counter) — accept/reject/close rate.
  • websocket_message_duration_seconds (histogram) — handler processing time.
  • websocket_handshake_duration_seconds (histogram) — upgrade latency.
  • websocket_buffered_amount_bytes (gauge) — backpressure indicator.
  • websocket_pong_missed_total (counter) — heartbeat failures.

Alert on rate-of-change of active, sustained pong_missed, and any nonzero buffered_amount for > 30 s.
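
As a sketch, those alerts could look like the following Prometheus rules, assuming the metric names above; the thresholds (including the p99 handshake bound from the build steps later on) are starting points to tune:

groups:
  - name: websocket-health
    rules:
      - alert: WebSocketMassDisconnect       # active count halved within a minute
        expr: websocket_connections_active < 0.5 * (websocket_connections_active offset 1m)
      - alert: WebSocketPongMisses           # sustained heartbeat failures
        expr: rate(websocket_pong_missed_total[5m]) > 0
        for: 5m
      - alert: WebSocketBackpressure         # send buffer stuck above zero
        expr: websocket_buffered_amount_bytes > 0
        for: 30s
      - alert: WebSocketSlowHandshake        # p99 upgrade latency over 2 s
        expr: histogram_quantile(0.99, sum by (le) (rate(websocket_handshake_duration_seconds_bucket[5m]))) > 2
        for: 5m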

CallSphere's implementation

CallSphere instruments every WebSocket service across six verticals with the same metric set:

  • Heartbeats every 30 s, 10 s pong timeout, 2-strike close; the same policy across Dashboard, Healthcare, and Sales Calling.
  • Prometheus + Grafana dashboards with one row per service and one panel per metric above.
  • OpenTelemetry traces wrapping each WebSocket handler invocation; exported to self-hosted Tempo.
  • Dead-man switch for the inbound Twilio Media Streams WebSocket — if no media event arrives for 90 s on any active call, PagerDuty pages.

The audit trail crosses all 115+ database tables so a stuck WebSocket triggers correlated checks across Postgres health, Redis lag, and OpenAI API status. Mean time to detect for a stuck WebSocket session in production is 47 seconds.
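
The 90 s media watchdog is essentially a resettable timer per call. A minimal sketch for a Node ws server, assuming Twilio Media Streams JSON frames; notifyPagerDuty is a placeholder for whatever paging integration you use:

const MEDIA_TIMEOUT_MS = 90_000;

function attachMediaWatchdog(ws, callSid) {
  const page = () => notifyPagerDuty(`no media for 90 s on call ${callSid}`); // placeholder pager hook
  let timer = setTimeout(page, MEDIA_TIMEOUT_MS);
  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.event === "media") {            // Twilio Media Streams tags audio frames with event: "media"
      clearTimeout(timer);                  // any media resets the dead-man timer
      timer = setTimeout(page, MEDIA_TIMEOUT_MS);
    }
  });
  ws.on("close", () => clearTimeout(timer));
}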

Code: heartbeat + Prometheus instrumentation

import { WebSocketServer } from "ws";
import { Counter, Gauge, Histogram } from "prom-client";

// prom-client rejects metrics without a help string, so give each one a real description.
const active = new Gauge({ name: "websocket_connections_active", help: "Currently open WebSocket connections" });
const missed = new Counter({ name: "websocket_pong_missed_total", help: "Heartbeat pongs that never arrived" });
const handlerMs = new Histogram({ name: "websocket_message_duration_seconds", help: "Message handler processing time" });

const wss = new WebSocketServer({ port: 8080 }); // port illustrative

wss.on("connection", (ws) => {
  active.inc();
  let alive = true;
  let strikes = 0;
  ws.on("pong", () => { alive = true; strikes = 0; }); // any pong since the last ping marks the peer alive
  const iv = setInterval(() => {
    if (!alive) {
      missed.inc();
      if (++strikes >= 2) { ws.terminate(); return; } // 2-strike close; terminate(), not close(): the peer is unresponsive
    }
    alive = false;
    ws.ping(); // liveness is re-checked at the next 30 s tick; a strict 10 s timeout would use a separate timer
  }, 30_000);
  ws.on("close", () => { active.dec(); clearInterval(iv); });
  // handle() is the application's message handler (not shown); the histogram records seconds per message
  ws.on("message", async (m) => { const t = handlerMs.startTimer(); await handle(m); t(); });
});

Build steps

  1. Add ping/pong heartbeats to every WebSocket service. 30 s interval, 10 s pong timeout.
  2. Expose Prometheus metrics on a separate /metrics port; never on the same port as the WebSocket itself (sketched after this list).
  3. Build a Grafana dashboard with the six metrics above. Add a single "service health" stat panel that aggregates them.
  4. Wire alerts: sustained pong-miss rate > 1%, active-connection drop > 50% in 1 min, any p99 handshake latency > 2 s.
  5. Add an external dead-man switch (we use a healthcheck.io pinger from a separate region) for catastrophic failure detection (also sketched below).
  6. Run a quarterly chaos test that kills a worker mid-session; verify alerts fire within 60 s.
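
Steps 2 and 5 need very little code. A sketch using prom-client's default registry; the port and the HEALTHCHECK_URL environment variable are illustrative:

import http from "node:http";
import { register } from "prom-client";

// Step 2: serve /metrics on its own port (9091 here), never the WebSocket port.
http.createServer(async (req, res) => {
  if (req.url === "/metrics") {
    res.setHeader("Content-Type", register.contentType);
    res.end(await register.metrics());
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(9091);

// Step 5: dead-man switch. Ping the external checker every 60 s; if this process
// dies or hangs, the pings stop and the checker raises the alarm.
setInterval(() => {
  fetch(process.env.HEALTHCHECK_URL).catch(() => {}); // a failed ping is the checker's signal, not ours to handle
}, 60_000);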

FAQ

Why not rely on TCP keepalive? The OS default is 2 hours. You can lower it, but tuning kernel keepalive is brittle and platform-specific. Application-level pings are reliable and portable.


How often should I ping? 30 s is the sweet spot: more frequent pings waste bandwidth on mobile; less frequent ones are slow to catch outages.

What about NAT timeouts? Some carrier NATs drop idle connections after 90 s. Ping every 30 s keeps NAT entries alive.

Should I log every ping? No. Log only failures. Pings should be silent in normal operation.

Where do I store the metrics long-term? Prometheus retains 15 days by default. Add Mimir or Thanos for longer history; we keep 90 days.



