By Sagar Shankaran, Founder of CallSphere
How to actually observe a WebSocket fleet: ping/pong heartbeats, Prometheus metrics that matter, dead-man switches, and the alerts that fire before customers notice.
Key takeaways
Half of WebSocket disconnects are silent. The TCP socket is open, the application is dead, and your users are watching a frozen page. Heartbeats are the only way to find out before they tweet about it.
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]

Silent disconnects happen because TCP keepalive runs at the OS level and, by default, only fires after 2 hours of idle time. In application time, a connection can be functionally dead for 90 minutes before any platform notices. A misbehaving NAT box, a crashed worker, or a frozen JavaScript event loop all produce the same symptom: the connection looks open, but no data flows.
Heartbeats — application-layer ping/pong every 30–60 seconds — close that gap. The contract is simple: server pings every 30 s, client must pong within 10 s, two missed pongs and the server closes. The same logic applies in reverse for client-driven heartbeats.
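The contract above can be sketched as a small, transport-agnostic state tracker. This is a minimal illustration and not a library API: the class name `HeartbeatTracker`, its methods, and the constants are all hypothetical, and timestamps are passed in explicitly so the logic can be exercised without real timers.

```javascript
// Sketch of the heartbeat contract: ping every 30 s, pong due within 10 s,
// close after two consecutive missed pongs. All names here are illustrative.
const PING_INTERVAL_MS = 30_000; // how often the caller should send a ping
const PONG_TIMEOUT_MS = 10_000;  // how long the peer has to answer
const MAX_MISSED_PONGS = 2;      // consecutive misses before closing

class HeartbeatTracker {
  constructor() {
    this.missed = 0;        // consecutive missed pongs
    this.pingSentAt = null; // timestamp (ms) of the outstanding ping, if any
  }

  // Record that a ping was sent at time `now`.
  pingSent(now) {
    this.pingSentAt = now;
  }

  // Record a pong. Only a pong that arrives within the timeout resets the count.
  pongReceived(now) {
    if (this.pingSentAt === null) return;                 // nothing outstanding
    if (now - this.pingSentAt > PONG_TIMEOUT_MS) return;  // late: counted by checkDeadline
    this.pingSentAt = null;
    this.missed = 0;
  }

  // Call once the pong deadline may have passed. Returns "close" or "ok".
  checkDeadline(now) {
    if (this.pingSentAt !== null && now - this.pingSentAt > PONG_TIMEOUT_MS) {
      this.pingSentAt = null;
      this.missed += 1;
    }
    return this.missed >= MAX_MISSED_PONGS ? "close" : "ok";
  }
}
```

In a real server the caller would send a ping and call `pingSent(Date.now())` every `PING_INTERVAL_MS`, call `checkDeadline(Date.now())` shortly after the timeout, and close the socket when it returns `"close"`. For client-driven heartbeats in a browser, the same tracker would have to be fed by application-level ping/pong messages, since the browser WebSocket API does not expose protocol-level ping frames.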
Three layers of observability are non-negotiable:
The metrics that actually matter:
websocket_connections_active (gauge): current open count.
websocket_connections_total{result} (counter): accept/reject/close rate.
websocket_message_duration_seconds (histogram): handler processing time.
websocket_handshake_duration_seconds (histogram): upgrade latency.
websocket_buffered_amount_bytes (gauge): backpressure indicator.
websocket_pong_missed_total (counter): heartbeat failures.

Alert on rate-of-change of active, sustained pong_missed, and any nonzero buffered_amount for > 30 s.
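To make the "nonzero for more than 30 s" condition concrete, here is a minimal in-process sketch. In a Prometheus setup this would normally live in the `for:` clause of an alerting rule; the class `SustainedCondition` and the helper `totalBufferedBytes` are hypothetical names used only for illustration. The helper assumes each socket object exposes a numeric `bufferedAmount` property, as WebSocket objects do.

```javascript
// Tracks how long a condition has held without interruption.
class SustainedCondition {
  constructor(holdMs) {
    this.holdMs = holdMs; // how long the condition must hold before firing
    this.since = null;    // time the condition first became true, or null
  }

  // Feed one sample taken at time `now` (ms).
  // Returns true once the condition has held for at least holdMs.
  observe(isTrue, now) {
    if (!isTrue) {
      this.since = null; // any false sample resets the clock
      return false;
    }
    if (this.since === null) this.since = now;
    return now - this.since >= this.holdMs;
  }
}

// Sum bufferedAmount across sockets: the value a
// websocket_buffered_amount_bytes gauge would report.
function totalBufferedBytes(sockets) {
  let total = 0;
  for (const socket of sockets) total += socket.bufferedAmount;
  return total;
}
```

A sampling loop could call `observe(totalBufferedBytes(wss.clients) > 0, Date.now())` every few seconds and raise an alert when it returns true. Resetting on any false sample is a deliberate choice: brief bursts of backpressure are normal, and only a sustained backlog suggests a stuck consumer.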
CallSphere instruments every WebSocket service across six verticals with the same metric set:
If no media event arrives for 90 s on any active call, PagerDuty pages.

The audit trail crosses all 115+ database tables so a stuck WebSocket triggers correlated checks across Postgres health, Redis lag, and OpenAI API status. Mean time to detect for a stuck WebSocket session in production is 47 seconds.
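The 90-second media rule is a dead-man switch: silence, not an error, is what triggers the page. A minimal sketch of that idea follows. It is not CallSphere's implementation; the class `DeadManSwitch`, its method names, and the call IDs are hypothetical, and the actual paging call is left to the caller.

```javascript
const MEDIA_SILENCE_LIMIT_MS = 90_000; // page after 90 s without a media event

class DeadManSwitch {
  constructor(limitMs = MEDIA_SILENCE_LIMIT_MS) {
    this.limitMs = limitMs;
    this.lastSeen = new Map(); // callId -> timestamp (ms) of the last media event
  }

  // Start watching a call; the start time counts as the first sign of life.
  callStarted(callId, now) {
    this.lastSeen.set(callId, now);
  }

  // Refresh the timestamp for an active call; events for unknown calls are ignored.
  mediaEvent(callId, now) {
    if (this.lastSeen.has(callId)) this.lastSeen.set(callId, now);
  }

  // Stop watching a call that ended normally.
  callEnded(callId) {
    this.lastSeen.delete(callId);
  }

  // Return the IDs of active calls that have been silent for at least limitMs.
  silentCalls(now) {
    const silent = [];
    for (const [callId, seenAt] of this.lastSeen) {
      if (now - seenAt >= this.limitMs) silent.push(callId);
    }
    return silent;
  }
}
```

A periodic timer could call `silentCalls(Date.now())` and page for each returned ID. Removing calls in `callEnded` matters: without it, every completed call would eventually look silent and trigger a false page.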
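import { Counter, Gauge, Histogram } from "prom-client";
import { WebSocketServer } from "ws";

// prom-client rejects an empty help string, so every metric needs one.
const active = new Gauge({ name: "websocket_connections_active", help: "Currently open WebSocket connections" });
const missed = new Counter({ name: "websocket_pong_missed_total", help: "Heartbeat pings that received no pong" });
const handlerSeconds = new Histogram({ name: "websocket_message_duration_seconds", help: "Message handler processing time in seconds" });

const wss = new WebSocketServer({ port: 8080 }); // port is a placeholder
const MAX_MISSED = 2; // close after two missed pongs

wss.on("connection", (ws) => {
  active.inc();
  let pending = 0; // pings sent since the last pong
  ws.on("pong", () => { pending = 0; });
  const iv = setInterval(() => {
    if (pending > 0) missed.inc(); // the previous ping went unanswered
    if (pending >= MAX_MISSED) { ws.terminate(); return; }
    pending += 1;
    ws.ping();
  }, 30_000);
  // terminate() also emits "close", so cleanup runs on every exit path.
  ws.on("close", () => { active.dec(); clearInterval(iv); });
  ws.on("message", async (m) => {
    const stop = handlerSeconds.startTimer();
    try { await handle(m); } // handle() is your application's message handler
    catch (err) { console.error("message handler failed", err); }
    finally { stop(); } // record the duration even when the handler throws
  });
});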
Serve metrics on a dedicated /metrics port; never on the same port as the WebSocket itself.
Why not rely on TCP keepalive? The OS default is 2 hours. You can lower it, but the settings are brittle and platform-specific. Application-level ping is reliable.
How often should I ping? 30 s is the sweet spot. More frequent pings waste bandwidth on mobile; less frequent pings take longer to detect outages.
What about NAT timeouts? Some carrier NATs drop idle connections after 90 s. Ping every 30 s keeps NAT entries alive.
Should I log every ping? No. Log only failures. Pings should be silent in normal operation.
Where do I store the metrics long-term? Prometheus retains 15 days by default. Add Mimir or Thanos for longer history; we keep 90 days.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.