By Sagar Shankaran, Founder of CallSphere
Trace every turn, every tool call, every LLM round-trip with OpenTelemetry shipped to Langfuse. Find latency outliers, debug hallucinations, and watch p95 stay under 800ms.
Key takeaways
TL;DR — Voice agents fail in three places: STT, LLM, TTS. Without per-component tracing you'll never know which one slowed a call. OpenTelemetry → Langfuse gives you span-level visibility in 30 lines of init code.
What you'll build: an instrumented voice bridge that emits OpenTelemetry spans for every turn (stt → llm → tts), tags them with the call ID, and ships them to Langfuse. You'll be able to open a single call in the Langfuse UI and see every span timing, every token count, and every prompt/response, including which turn pushed p95 latency over 800ms.
- npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http
- LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.

flowchart LR
AGT[Voice Agent] -- spans --> SDK[OTel SDK]
SDK -- OTLP HTTP --> LF[Langfuse /api/public/otel]
LF --> UI[Langfuse UI]
LF --> EV[Eval suite]
```ts
// otel.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { resourceFromAttributes } from "@opentelemetry/resources";

// The exporter authenticates with HTTP Basic auth: base64("publicKey:secretKey").
const auth = Buffer.from(
  `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`
).toString("base64");

const sdk = new NodeSDK({
  resource: resourceFromAttributes({ "service.name": "voice-bridge" }),
  traceExporter: new OTLPTraceExporter({
    url: "https://cloud.langfuse.com/api/public/otel/v1/traces",
    headers: { Authorization: `Basic ${auth}` },
  }),
});
sdk.start();
```
Import this at the top of your entrypoint before any other imports.
```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
import OpenAI from "openai";
import { synthesize } from "./tts"; // hypothetical module: your TTS wrapper, returns audio bytes

const tracer = trace.getTracer("voice-bridge");
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function handleTurn(callId: string, userText: string) {
  return tracer.startActiveSpan("turn", { attributes: { callId } }, async (turnSpan) => {
    try {
      // STT: in this example the transcript arrives as userText.
      const stt = await tracer.startActiveSpan("stt", async (s) => {
        try {
          s.setAttribute("text", userText);
          return userText;
        } finally {
          s.end(); // end the span even if the step throws
        }
      });

      // LLM: record model, token usage, and response text.
      const llm = await tracer.startActiveSpan("llm", async (s) => {
        try {
          const r = await openai.chat.completions.create({
            model: "gpt-4o-mini",
            messages: [{ role: "user", content: stt }],
          });
          const reply = r.choices[0].message.content ?? "";
          s.setAttributes({
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o-mini",
            "gen_ai.usage.prompt_tokens": r.usage?.prompt_tokens ?? 0,
            "gen_ai.usage.completion_tokens": r.usage?.completion_tokens ?? 0,
            "gen_ai.response.text": reply,
          });
          return reply;
        } finally {
          s.end();
        }
      });

      // TTS: record the size of the synthesized audio.
      const tts = await tracer.startActiveSpan("tts", async (s) => {
        try {
          const audio = await synthesize(llm);
          s.setAttribute("audio.bytes", audio.length);
          return audio;
        } finally {
          s.end();
        }
      });

      return tts;
    } catch (err) {
      turnSpan.recordException(err as Error);
      turnSpan.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      turnSpan.end();
    }
  });
}
```
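Once each turn carries stt, llm, and tts child spans, attributing a slow turn to one component is a small reduction over span durations. The sketch below assumes the span timings have already been exported into a plain array; the `SpanTiming` shape, the `slowestComponent` helper, and the sample numbers are hypothetical and are not part of OpenTelemetry or Langfuse.

```typescript
// Hypothetical shape for exported span timings; not an OTel or Langfuse type.
interface SpanTiming {
  name: "stt" | "llm" | "tts";
  durationMs: number;
}

interface TurnBreakdown {
  totalMs: number;
  slowest: SpanTiming["name"];
  overBudget: boolean;
}

// Sum the component spans of one turn and report which one took longest.
function slowestComponent(spans: SpanTiming[], budgetMs = 800): TurnBreakdown {
  if (spans.length === 0) throw new Error("turn has no component spans");
  let totalMs = 0;
  let slowest = spans[0];
  for (const s of spans) {
    totalMs += s.durationMs;
    if (s.durationMs > slowest.durationMs) slowest = s;
  }
  return { totalMs, slowest: slowest.name, overBudget: totalMs > budgetMs };
}

// Illustrative numbers only.
const breakdown = slowestComponent([
  { name: "stt", durationMs: 120 },
  { name: "llm", durationMs: 610 },
  { name: "tts", durationMs: 140 },
]);
console.log(breakdown);
```

The 800ms default mirrors the latency target used in this post; treating the turn total as the sum of its components assumes the three steps run sequentially, as they do in `handleTurn`.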
Use the gen_ai.* attribute namespace so Langfuse renders prompts, responses, and token counts in its UI without custom mapping. Important keys:
- gen_ai.system (openai | anthropic | elevenlabs)
- gen_ai.request.model
- gen_ai.request.temperature
- gen_ai.usage.prompt_tokens
- gen_ai.usage.completion_tokens
- gen_ai.response.text

Give each tool call its own span so its arguments, result, and latency appear in the trace:

```ts
// registry maps tool names to async handlers; it is assumed to be defined elsewhere.
async function callTool(name: string, args: object) {
  return tracer.startActiveSpan(`tool.${name}`, async (s) => {
    s.setAttributes({
      "tool.name": name,
      "tool.args": JSON.stringify(args),
    });
    const t0 = Date.now();
    try {
      const result = await registry[name](args);
      s.setAttribute("tool.latency_ms", Date.now() - t0);
      // Truncate large results so a single attribute stays small.
      s.setAttribute("tool.result", JSON.stringify(result).slice(0, 1000));
      return result;
    } catch (e) {
      s.recordException(e as Error);
      throw e;
    } finally {
      s.end();
    }
  });
}
```
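Start one root span at call start and run the whole call inside its context so every span (turn 1, turn 2, ..., post-call analytics) joins one trace:

```ts
import { context, trace } from "@opentelemetry/api";

const callTrace = tracer.startSpan("call", { attributes: { callId, agentId } });
const ctx = trace.setSpan(context.active(), callTrace);
await context.with(ctx, async () => {
  try {
    // run the whole call here
  } finally {
    callTrace.end(); // close the root span even if the call throws
  }
});
```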
In Langfuse Dashboards:
- span.name = turn, last 24h.
- duration.

Click "Add to dataset" in Langfuse on any failed call to build a regression dataset. Run nightly evals and gate prompt PRs on quality (see post 13).
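To sanity-check a dashboard's p95 against raw turn durations, the percentile can be recomputed offline. This is a minimal sketch using the nearest-rank method; Langfuse may use a different percentile definition, and the `percentile` helper and the sample durations are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical turn-span durations in milliseconds.
const turnDurationsMs = [420, 510, 480, 950, 530, 470, 610, 450, 490, 1200];
const p95 = percentile(turnDurationsMs, 95);

console.log(
  p95 > 800 ? `p95 ${p95}ms is over the 800ms target` : `p95 ${p95}ms is within target`
);
```

With only ten samples the p95 is simply the slowest turn, which is why percentile alerts are usually evaluated over a window with many more turns.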
- sdk.shutdown() on SIGTERM.
- startActiveSpan (which sets context) vs startSpan (which doesn't).
- s.setAttribute("text", ...) if you're under HIPAA.
- callId as a metric label; use it as a span attribute only.

CallSphere ships every span (turn, STT, LLM, TTS, tool) to a self-hosted Langfuse via OpenTelemetry. Healthcare runs PHI redaction in a span processor before export. The eval dashboard surfaces p95 latency per vertical and alerts when any agent crosses 1.5s. Real-time observability is part of the platform; trial it.
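The core of redacting transcript text before export is a pure function over span attributes. The sketch below is an illustration, not CallSphere's implementation: `redactAttributes` and the `SENSITIVE_KEYS` list are hypothetical, and wiring the function into a span processor (or calling it before `setAttributes`) is left out.

```typescript
type AttrValue = string | number | boolean;

// Attribute keys used in this post that may carry transcript or tool text.
// Assumption: adjust this list to your own attribute schema.
const SENSITIVE_KEYS = new Set([
  "text",
  "gen_ai.response.text",
  "tool.args",
  "tool.result",
]);

// Return a copy of the attributes with sensitive string values replaced by a
// placeholder that keeps only the original length.
function redactAttributes(attrs: Record<string, AttrValue>): Record<string, AttrValue> {
  const out: Record<string, AttrValue> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] =
      SENSITIVE_KEYS.has(key) && typeof value === "string"
        ? `[redacted ${value.length} chars]`
        : value;
  }
  return out;
}

const safe = redactAttributes({
  callId: "call-123",
  text: "my date of birth is ...",
  "audio.bytes": 20480,
});
console.log(safe);
```

Keeping the length preserves a rough signal for debugging (empty versus long transcripts) without storing the content. A key-based allowlist or denylist is the simplest policy; pattern-based scrubbing of free text is a separate, harder problem.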
Langfuse vs LangSmith vs Phoenix? All three can ingest OTel traces; pick on price and self-host needs. Langfuse is open-source and OTel-native.
Cost? Cloud free tier: 50k observations/mo. Self-host: just your DB.
Can I trace WebRTC sessions? Yes — instrument server-side handlers; client-side, use the OTel browser SDK and a CORS-enabled OTLP collector.
Sampling? Sample call traces at 100% for the first month, then drop to 10% with always-on for failures.
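Keeping every failed call while sampling the rest means the decision has to be made after the outcome is known, which a head-based sampler cannot do; that is tail sampling, typically done in a collector or a custom processor. The sketch below shows only the decision rule. `keepTrace` is a hypothetical helper, and using the first 8 hex characters of the trace ID as the bucket is an assumption of this example, not how any particular OTel sampler works.

```typescript
// Decide, after a call has finished, whether to keep its trace.
// traceId is the 32-hex-character OpenTelemetry trace ID.
function keepTrace(traceId: string, failed: boolean, sampleRate = 0.1): boolean {
  if (failed) return true; // always keep failures

  // Map the first 8 hex chars to a value in [0, 1) so the decision is
  // deterministic for a given trace ID.
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < sampleRate;
}

// Illustrative trace IDs.
console.log(keepTrace("00000000aaaaaaaaaaaaaaaaaaaaaaaa", false)); // low bucket: kept
console.log(keepTrace("ffffffffaaaaaaaaaaaaaaaaaaaaaaaa", false)); // high bucket: dropped
console.log(keepTrace("ffffffffaaaaaaaaaaaaaaaaaaaaaaaa", true));  // failed call: kept
```

Passing `sampleRate = 1` reproduces the 100% phase for the first month. Deriving the bucket from the trace ID rather than a random number means every service that sees the same trace reaches the same decision.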