---
title: "Monitoring WebSocket Health: Heartbeats and Prometheus in 2026"
description: "How to actually observe a WebSocket fleet: ping/pong heartbeats, Prometheus metrics that matter, dead-man switches, and the alerts that fire before customers notice."
canonical: https://callsphere.ai/blog/vw1c-monitoring-websocket-health-heartbeats-prometheus-2026
category: "AI Infrastructure"
tags: ["WebSockets", "Monitoring", "Prometheus", "Observability", "AI Infrastructure"]
author: "CallSphere Team"
published: 2026-05-07T00:00:00.000Z
updated: 2026-05-07T09:32:10.891Z
---

# Monitoring WebSocket Health: Heartbeats and Prometheus in 2026

> How to actually observe a WebSocket fleet: ping/pong heartbeats, Prometheus metrics that matter, dead-man switches, and the alerts that fire before customers notice.

> Half of WebSocket disconnects are silent. The TCP socket is open, the application is dead, and your users are watching a frozen page. Heartbeats are the only way to find out before they tweet about it.

## Why are heartbeats mandatory for WebSocket?

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

CallSphere reference architecture

TCP keepalive is not enough on its own: it runs at the OS level and, by default, sends its first probe only after 2 hours of idle time. In application terms, a connection can be functionally dead for 90 minutes before any platform notices. A misbehaving NAT box, a crashed worker, or a frozen JavaScript event loop all produce "connection looks open, no data flows."

Heartbeats — application-layer ping/pong every 30–60 seconds — close that gap. The contract is simple: server pings every 30 s, client must pong within 10 s, two missed pongs and the server closes. The same logic applies in reverse for client-driven heartbeats.
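The contract above can be captured as a small state tracker that owns no timers, which makes it usable from either side of the connection. This is a minimal sketch, not CallSphere's code: `HeartbeatTracker`, its method names, and the `"ok" | "missed" | "close"` verdicts are hypothetical, and the defaults simply mirror the 10 s timeout and two-strike rule.

```typescript
// Hypothetical helper: tracks heartbeat state without owning any timers,
// so the same logic can run on the server or in a client.
type HeartbeatVerdict = "ok" | "missed" | "close";

class HeartbeatTracker {
  private strikes = 0;
  private pingSentAt: number | null = null;
  private readonly pongTimeoutMs: number;
  private readonly maxStrikes: number;

  constructor(pongTimeoutMs = 10_000, maxStrikes = 2) {
    this.pongTimeoutMs = pongTimeoutMs; // pong deadline from the contract
    this.maxStrikes = maxStrikes;       // consecutive misses before closing
  }

  /** Record that a ping went out at `now` (milliseconds). */
  pingSent(now: number): void {
    this.pingSentAt = now;
  }

  /** Record a pong: the peer is alive, so strikes reset. */
  pongReceived(): void {
    this.pingSentAt = null;
    this.strikes = 0;
  }

  /** Call once the pong deadline may have passed; returns what to do next. */
  check(now: number): HeartbeatVerdict {
    if (this.pingSentAt === null) return "ok"; // nothing outstanding
    if (now - this.pingSentAt < this.pongTimeoutMs) return "ok"; // still in time
    this.pingSentAt = null; // count each missed ping only once
    this.strikes += 1;
    return this.strikes >= this.maxStrikes ? "close" : "missed";
  }
}
```

The caller decides what each verdict means: a `"missed"` result would typically increment a failure counter, and `"close"` would terminate the socket. Passing `now` in explicitly keeps the logic easy to unit test without fake timers.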

## How do you actually observe the fleet?

Three layers of observability are non-negotiable:

1. **Per-connection liveness** via heartbeats described above.
2. **Aggregate metrics** in Prometheus — connection count, message rate, latency histogram, reconnection rate.
3. **Dead man's switch** — an external system that expects regular signals and screams when they stop.

The metrics that actually matter:

- `websocket_connections_active` (gauge) — current open count.
- `websocket_connections_total{result}` (counter) — accept/reject/close rate.
- `websocket_message_duration_seconds` (histogram) — handler processing time.
- `websocket_handshake_duration_seconds` (histogram) — upgrade latency.
- `websocket_buffered_amount_bytes` (gauge) — backpressure indicator.
- `websocket_pong_missed_total` (counter) — heartbeat failures.

Alert on rate-of-change of `active`, sustained `pong_missed`, and any nonzero `buffered_amount` for > 30 s.
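The backpressure condition depends on duration ("nonzero for more than 30 s") rather than on a momentary spike. The sketch below shows one way that check could look in application code, assuming the gauge reports the sum of `bufferedAmount` across open sockets; `BackpressureWatch` and `totalBuffered` are hypothetical names. In a Prometheus alerting rule, the `for` field plays the same role.

```typescript
// Hypothetical helper: reports how long total buffered bytes have been nonzero.
class BackpressureWatch {
  private nonzeroSince: number | null = null;

  /** Feed one sample; returns milliseconds of continuous backpressure. */
  sample(bufferedBytes: number, now: number): number {
    if (bufferedBytes <= 0) {
      this.nonzeroSince = null; // buffers drained, so the clock resets
      return 0;
    }
    if (this.nonzeroSince === null) this.nonzeroSince = now;
    return now - this.nonzeroSince;
  }
}

/** Sum `bufferedAmount` over sockets; assumed to be the value the gauge exports. */
function totalBuffered(sockets: Iterable<{ bufferedAmount: number }>): number {
  let total = 0;
  for (const socket of sockets) total += socket.bufferedAmount;
  return total;
}
```

With the `ws` library, `wss.clients` could be passed to `totalBuffered` directly, since each client exposes `bufferedAmount`. A result above `30_000` from `sample` corresponds to the alert condition described above.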

## CallSphere's implementation

CallSphere instruments every WebSocket service across [six verticals](/pricing) with the same metric set:

- **Heartbeats every 30 s, 10 s pong timeout, 2-strike close.** Dashboard, Healthcare, Sales Calling, all consistent.
- **Prometheus + Grafana dashboards** with one row per service and one panel per metric above.
- **OpenTelemetry traces** wrapping each WebSocket handler invocation; exported to self-hosted Tempo.
- **Dead-man switch** for the inbound Twilio Media Streams WebSocket — if no `media` event arrives for 90 s on any active call, PagerDuty pages.
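The dead-man switch in the last bullet reduces to "remember when each active call last produced a `media` event, and flag any call that has been quiet too long." A minimal sketch under that reading follows; `MediaDeadManSwitch` and its methods are hypothetical, and the paging call itself is left out.

```typescript
// Hypothetical helper: remembers the last `media` event per active call.
class MediaDeadManSwitch {
  private readonly lastSeen = new Map<string, number>();
  private readonly maxSilenceMs: number;

  constructor(maxSilenceMs = 90_000) {
    this.maxSilenceMs = maxSilenceMs; // 90 s, matching the threshold above
  }

  /** Call when a stream starts and on every `media` event. */
  touch(callId: string, now: number): void {
    this.lastSeen.set(callId, now);
  }

  /** Call when a call ends normally so it stops being watched. */
  end(callId: string): void {
    this.lastSeen.delete(callId);
  }

  /** Returns the calls that have been silent longer than the threshold. */
  silentCalls(now: number): string[] {
    const silent: string[] = [];
    for (const [callId, seenAt] of this.lastSeen) {
      if (now - seenAt > this.maxSilenceMs) silent.push(callId);
    }
    return silent;
  }
}
```

A periodic check (for example every few seconds) would call `silentCalls(Date.now())` and page if the result is non-empty. Removing ended calls with `end` matters: without it, every completed call would eventually look like a stuck one.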

The audit trail crosses all [115+ database tables](/pricing) so a stuck WebSocket triggers correlated checks across Postgres health, Redis lag, and OpenAI API status. Mean time to detect for a stuck WebSocket session in production is 47 seconds.

## Code: heartbeat + Prometheus instrumentation

```typescript
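import { WebSocketServer, type RawData } from "ws";
import { Counter, Gauge, Histogram } from "prom-client";

// prom-client rejects an empty `help`, so every metric gets a description.
const active = new Gauge({ name: "websocket_connections_active", help: "Currently open WebSocket connections" });
const missed = new Counter({ name: "websocket_pong_missed_total", help: "Pings that got no pong within the timeout" });
const handlerSeconds = new Histogram({ name: "websocket_message_duration_seconds", help: "Message handler processing time in seconds" });

// Placeholder for the application's own message handler.
async function handle(_message: RawData): Promise<void> {}

const wss = new WebSocketServer({ port: 8080 }); // port is an example value

wss.on("connection", (ws) => {
  active.inc();
  let strikes = 0;
  let pongTimer: ReturnType<typeof setTimeout> | undefined;
  ws.on("pong", () => { strikes = 0; clearTimeout(pongTimer); });
  const iv = setInterval(() => {
    ws.ping();
    // A pong must arrive within 10 s; two consecutive misses close the socket.
    pongTimer = setTimeout(() => {
      missed.inc();
      if (++strikes >= 2) ws.terminate();
    }, 10_000);
  }, 30_000);
  ws.on("close", () => { active.dec(); clearInterval(iv); clearTimeout(pongTimer); });
  ws.on("message", async (m) => {
    const stop = handlerSeconds.startTimer(); // records elapsed seconds when called
    try { await handle(m); }
    catch (err) { console.error("message handler failed", err); }
    finally { stop(); }
  });
});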
```

## Build steps

1. Add ping/pong heartbeats to every WebSocket service. 30 s interval, 10 s pong timeout.
2. Expose Prometheus metrics at `/metrics` on a separate port; never on the same port as the WebSocket itself.
3. Build a Grafana dashboard with the six metrics above. Add a single "service health" stat panel that aggregates them.
4. Wire alerts: sustained pong-miss rate > 1%, active-connection drop > 50% in 1 min, any p99 handshake latency > 2 s.
5. Add an external dead-man switch (we use a healthchecks.io pinger from a separate region) for catastrophic failure detection.
6. Run a quarterly chaos test that kills a worker mid-session; verify alerts fire within 60 s.
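The thresholds in step 4 would normally live in Prometheus alerting rules. As a rough illustration of the logic only, here is the same set of conditions as a plain function. `FleetSnapshot` and `firingAlerts` are hypothetical, the pong-miss rate is assumed to mean missed pongs divided by pings sent, and the "sustained" qualifier is not modelled.

```typescript
// Hypothetical snapshot of values that would normally come from Prometheus queries.
interface FleetSnapshot {
  pingsSent: number;           // pings sent in the evaluation window
  pongsMissed: number;         // missed pongs in the same window
  activeNow: number;           // active connections right now
  activeOneMinuteAgo: number;  // active connections one minute earlier
  handshakeP99Seconds: number; // p99 upgrade latency in seconds
}

/** Returns the names of the step 4 alerts whose thresholds are exceeded. */
function firingAlerts(s: FleetSnapshot): string[] {
  const alerts: string[] = [];
  // Pong-miss rate above 1% (assumed denominator: pings sent).
  if (s.pingsSent > 0 && s.pongsMissed / s.pingsSent > 0.01) {
    alerts.push("pong_miss_rate");
  }
  // More than half of the connections gone within one minute.
  if (s.activeOneMinuteAgo > 0 &&
      (s.activeOneMinuteAgo - s.activeNow) / s.activeOneMinuteAgo > 0.5) {
    alerts.push("connection_drop");
  }
  // p99 handshake latency above 2 s.
  if (s.handshakeP99Seconds > 2) {
    alerts.push("handshake_p99");
  }
  return alerts;
}
```

Writing the conditions out this way can also serve as a test fixture for the chaos exercise in step 6: feed in the values observed during the test and confirm that the expected alerts appear.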

## FAQ

**Why not rely on TCP keepalive?** OS defaults are 2 hours. You can lower them, but the settings are platform-specific and brittle. Application-level ping is reliable.

**How often should I ping?** 30 s is the sweet spot. More frequent wastes bandwidth on mobile; less frequent misses outages.

**What about NAT timeouts?** Some carrier NATs drop idle connections after 90 s. Ping every 30 s keeps NAT entries alive.

**Should I log every ping?** No. Log only failures. Pings should be silent in normal operation.

**Where do I store the metrics long-term?** Prometheus retains 15 days by default. Add Mimir or Thanos for longer history; we keep 90 days.

CallSphere ships HIPAA + SOC 2 monitoring across [37 agents and 90+ tools](/pricing). [Start the 14-day trial](/trial), [join the affiliate program](/affiliate), or [book a demo](/demo) — pricing is $149/$499/$1499.

## Sources

- [WebSocket Heartbeat: Ping/Pong, Keep-Alive & Zombie Detection](https://websocket.org/guides/heartbeat/)
- [How to Monitor WebSocket Connection Health](https://oneuptime.com/blog/post/2026-01-24-websocket-connection-health-monitoring/view)
- [How to Implement Heartbeat/Ping-Pong in WebSockets](https://oneuptime.com/blog/post/2026-01-27-websocket-heartbeat-ping-pong/view)
- [WebSocket Application Monitoring: An In-Depth Guide](https://www.dotcom-monitor.com/blog/websocket-monitoring/)

---

Source: https://callsphere.ai/blog/vw1c-monitoring-websocket-health-heartbeats-prometheus-2026
