AI Infrastructure

Tracing OpenAI Realtime Calls End-to-End

OpenAI Realtime traces look great in the OpenAI dashboard but vanish when the call leaves their servers. Here's how to stitch SIP, WebRTC, your tools, and Realtime into one trace.

TL;DR — OpenAI's Traces dashboard ends at OpenAI. To trace a real voice call you need to inject your own traceparent and join SIP, WebRTC media, model events, and tools into one root.

What goes wrong

```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

CallSphere reference architecture

The OpenAI Agents SDK emits beautiful traces — model calls, tool calls, handoffs — into OpenAI's dashboard. The Realtime API does too, via session-level traces. Both stop at OpenAI's edge. Your phone-system layer (Twilio, Telnyx, your SIP trunk), your media transport (WebRTC), and your tool executors (databases, CRM, calendars) sit outside their view. When a call goes wrong you're flipping between three dashboards and a Postgres query, manually correlating timestamps.

The fix is to make your trace the parent and have OpenAI's traces become children. Inject a traceparent header on the WebSocket upgrade or HTTPS POST that opens the Realtime session, and propagate that ID through your tool calls, RAG lookups, and SIP signaling.
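In practice OTel's `inject()` writes the header for you, but the W3C traceparent format is worth internalizing, since it's the ID you'll be grepping for across systems. A minimal sketch — `make_traceparent` is a hand-rolled illustration, and the WebSocket usage in the comments assumes names (`REALTIME_URL`, `additional_headers`) that are not from OpenAI's docs:

```python
def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

# Hypothetical usage when opening the Realtime WebSocket:
# headers = {"Authorization": f"Bearer {key}",
#            "traceparent": make_traceparent(root_trace_id, root_span_id)}
# ws = await websockets.connect(REALTIME_URL, additional_headers=headers)
```

Even if the far side ignores the header, every hop you control (edge, gateway, tool executors) can extract it and parent its spans correctly.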

How to monitor

Build a single root span per call:

  1. Root: callsphere.call (one per phone number ringing in)
  2. Child: sip.invite (Twilio webhook → your gateway)
  3. Child: webrtc.peer_connection (media negotiation)
  4. Child: gen_ai.realtime.session (the OpenAI session — they emit nested spans inside)
  5. Children of (4): gen_ai.tool.execute per tool, gen_ai.client per model turn

Use OTel context propagation. The Realtime API doesn't accept traceparent directly, but you can stash your trace ID in the session metadata and re-attach on the model side.
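A sketch of the round-trip, assuming our own convention of a `trace_id` key in session metadata (this is not an OpenAI field — it's whatever you stashed at session creation):

```python
from typing import Optional

def restore_trace_id(metadata: dict) -> Optional[int]:
    """Recover the 128-bit trace ID we stashed in Realtime session metadata."""
    raw = metadata.get("trace_id")
    return int(raw, 16) if raw else None

# On the consumer side you'd rebuild a remote OTel SpanContext from it, e.g.:
# ctx = SpanContext(trace_id=restore_trace_id(meta), span_id=new_span_id,
#                   is_remote=True, trace_flags=TraceFlags(0x01))
```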


CallSphere stack

CallSphere runs Realtime for the Healthcare and Real Estate verticals. The Healthcare FastAPI service on :8084 answers Twilio webhooks, mints a Realtime ephemeral key, and proxies the SDP through our edge. We open a root callsphere.call span when Twilio fires the inbound webhook and stash the trace ID in the Realtime session metadata. Tool calls (insurance verification, EHR lookup) reuse the same trace context via OTel's HTTP propagator.

Real Estate's 6-container NATS pod is harder: the trace context crosses six microservices over NATS. We wrote a custom NATS header propagator (NATS doesn't carry HTTP-style headers natively) so the trace ID survives each hop. The Sales WebSocket layer (PM2, 8 workers) and the after-hours Bull/Redis queue use the same propagator pattern. The result: one click in Honeycomb shows the entire call, including the OpenAI-internal spans we pull from their trace export.

We see ~480 ms first-token-out on Realtime calls; the trace tells us exactly which portion of that 480 ms came from us versus OpenAI. The $1,499 enterprise tier (see /pricing) includes per-call trace links in the call-recording UI.

Implementation

  1. Mint the trace ID at call ingress:

```python
from fastapi import FastAPI, Request
from opentelemetry import trace

app = FastAPI()
tracer = trace.get_tracer("callsphere")

@app.post("/twilio/inbound")
async def inbound(request: Request):
    with tracer.start_as_current_span("callsphere.call") as root:
        # Hex-encode the 128-bit trace ID so it survives as a string in session metadata
        trace_id = format(root.get_span_context().trace_id, "032x")
        ephemeral = await mint_realtime_key(metadata={"trace_id": trace_id})
        return twiml_with_session(ephemeral)
```
  2. Read OpenAI's trace export (their Traces API supports webhook export as of Q1 2026) and graft their spans under your root using the metadata trace_id.
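A sketch of the grafting step. The export payload shape here (a `spans` list whose entries carry the session `metadata`) is an assumption for illustration, not a documented OpenAI schema:

```python
def graft_exported_spans(export: dict, calls_by_trace: dict) -> list:
    """Attach exported spans to our root traces via the stashed trace_id."""
    grafted = []
    for span in export.get("spans", []):
        trace_id = span.get("metadata", {}).get("trace_id")
        if trace_id in calls_by_trace:
            # Re-parent under our root before forwarding to the collector
            span["parent_trace_id"] = trace_id
            grafted.append(span)
    return grafted
```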

  3. Propagate over NATS with a custom header carrier:

```python
from opentelemetry.propagate import inject

async def publish_with_trace(nc, subject: str, payload: bytes):
    headers = {}
    inject(headers)  # writes traceparent/tracestate from the current span context
    await nc.publish(subject, payload, headers=headers)
```
  4. Tag tool spans with gen_ai.tool.name and gen_ai.tool.call.id so they line up under the model turn that requested them.
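A small helper keeps the attribute names consistent across tool executors; `tool_span_attributes` is our own convention, applied with `span.set_attributes(...)`:

```python
def tool_span_attributes(tool_name: str, call_id: str) -> dict:
    """OTel gen_ai semantic-convention attributes for a tool-execution span."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        "gen_ai.tool.call.id": call_id,
    }
```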

  5. Persist the call_id ↔ trace_id map in Postgres (we use the calls table) so support engineers can paste a phone number and get the trace.

FAQ

Q: Does the Realtime API natively emit OTel spans? A: As of Q1 2026, no — it emits OpenAI-format traces accessible via the dashboard and an export webhook. You graft them under your root.

Q: How do I trace TURN/STUN delays? A: We instrument the WebRTC client with timing events (onicegatheringstatechange, etc.) and emit them as span events on webrtc.peer_connection.

Q: Can I trace barge-in events? A: Yes — emit a span event gen_ai.audio.barge_in with audio.elapsed_ms so you can see how often users interrupt.

Q: Does sampling break voice traces? A: Tail-sample at the collector and always keep traces with errors or first-token latency > 1500 ms. Head-sampling will drop exactly the calls you most need.
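A tail-sampling policy along these lines works with the OTel Collector contrib `tail_sampling` processor; the first-token attribute key below is our own naming, not a standard:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-first-token
        type: numeric_attribute
        numeric_attribute:
          key: gen_ai.server.time_to_first_token   # span attribute set by our code
          min_value: 1500
```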

Q: Is this worth it for a 5-call/day startup? A: No. Use the OpenAI dashboard until you're past 1k calls/day. Try the 14-day trial first.


