
How to Add Voice Agent Observability with Langfuse and OpenTelemetry

Trace every turn, every tool call, every LLM round-trip with OpenTelemetry shipped to Langfuse. Find latency outliers, debug hallucinations, and watch p95 stay under 800ms.

TL;DR — Voice agents fail in three places: STT, LLM, TTS. Without per-component tracing you'll never know which one slowed a call. OpenTelemetry → Langfuse gives you span-level visibility in 30 lines of init code.

What you'll build

An instrumented voice bridge that emits OpenTelemetry spans for every turn (stt → llm → tts), tags them with the call ID, and ships them to Langfuse. You'll be able to open a single call in the Langfuse UI and see every span's timing, every token count, and every prompt/response — including which turn pushed p95 latency over 800ms.

Prerequisites

  1. A Langfuse Cloud account or a self-hosted instance (Langfuse is open source).
  2. Node 20+ or Python 3.11+.
  3. npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http.
  4. A working voice agent (any of posts 1–4).
  5. LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.

Architecture

```mermaid
flowchart LR
  AGT[Voice Agent] -- spans --> SDK[OTel SDK]
  SDK -- OTLP HTTP --> LF[Langfuse /api/public/otel]
  LF --> UI[Langfuse UI]
  LF --> EV[Eval suite]
```

Step 1 — OTel init shipping to Langfuse

```ts
// otel.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { resourceFromAttributes } from "@opentelemetry/resources";

// Langfuse authenticates OTLP requests with Basic auth over your key pair.
const auth = Buffer.from(
  `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`
).toString("base64");

// Exported so a shutdown hook can flush buffered spans (see Common pitfalls).
export const sdk = new NodeSDK({
  resource: resourceFromAttributes({ "service.name": "voice-bridge" }),
  traceExporter: new OTLPTraceExporter({
    url: "https://cloud.langfuse.com/api/public/otel/v1/traces",
    headers: { Authorization: `Basic ${auth}` },
  }),
});

sdk.start();
```

Import this at the top of your entrypoint before any other imports.
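For example, in a minimal entrypoint (startServer and ./server are hypothetical stand-ins for your own bootstrap):

```ts
// index.ts
import "./otel"; // side-effect import: starts the SDK before instrumented libraries load
import { startServer } from "./server"; // hypothetical bootstrap for your agent

startServer();
```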


Step 2 — Wrap each turn in a span

```ts
import { trace } from "@opentelemetry/api";
import OpenAI from "openai";
import { synthesize } from "./tts"; // assumed TTS helper from your existing agent

const tracer = trace.getTracer("voice-bridge");
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function handleTurn(callId: string, userText: string) {
  return tracer.startActiveSpan("turn", { attributes: { callId } }, async (turnSpan) => {
    try {
      // STT span: the transcript already arrived here; record it on the span.
      const stt = await tracer.startActiveSpan("stt", async (s) => {
        const text = userText;
        s.setAttribute("text", text);
        s.end();
        return text;
      });

      const llm = await tracer.startActiveSpan("llm", async (s) => {
        const r = await openai.chat.completions.create({
          model: "gpt-4o-mini",
          messages: [{ role: "user", content: stt }],
        });
        s.setAttributes({
          "gen_ai.system": "openai",
          "gen_ai.request.model": "gpt-4o-mini",
          "gen_ai.usage.prompt_tokens": r.usage?.prompt_tokens ?? 0,
          "gen_ai.usage.completion_tokens": r.usage?.completion_tokens ?? 0,
          "gen_ai.response.text": r.choices[0].message.content ?? "",
        });
        s.end();
        return r.choices[0].message.content!;
      });

      const tts = await tracer.startActiveSpan("tts", async (s) => {
        const audio = await synthesize(llm);
        s.setAttribute("audio.bytes", audio.length);
        s.end();
        return audio;
      });

      turnSpan.end();
      return tts;
    } catch (err) {
      turnSpan.recordException(err as Error);
      turnSpan.end();
      throw err;
    }
  });
}
```

Step 3 — Use Langfuse semantic conventions

Use the gen_ai.* attribute namespace so Langfuse renders prompts, responses, and token counts in its UI without custom mapping. Important keys:

  • gen_ai.system (openai | anthropic | elevenlabs)
  • gen_ai.request.model
  • gen_ai.request.temperature
  • gen_ai.usage.prompt_tokens
  • gen_ai.usage.completion_tokens
  • gen_ai.response.text
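For example, tagging the TTS span from Step 2 so Langfuse attributes it to ElevenLabs (a minimal sketch; the model id is a placeholder):

```ts
const tts = await tracer.startActiveSpan("tts", async (s) => {
  const audio = await synthesize(llm); // your existing TTS helper
  s.setAttributes({
    "gen_ai.system": "elevenlabs",
    "gen_ai.request.model": "eleven_turbo_v2", // placeholder model id
  });
  s.end();
  return audio;
});
```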

Step 4 — Tag tool calls

```ts
// Your agent's tool registry: tool name -> handler.
const registry: Record<string, (args: object) => Promise<unknown>> = { /* your tools */ };

async function callTool(name: string, args: object) {
  return tracer.startActiveSpan(`tool.${name}`, async (s) => {
    s.setAttributes({
      "tool.name": name,
      "tool.args": JSON.stringify(args),
    });
    const t0 = Date.now();
    try {
      const result = await registry[name](args); // dispatch to the registered handler
      s.setAttribute("tool.latency_ms", Date.now() - t0);
      s.setAttribute("tool.result", JSON.stringify(result).slice(0, 1000));
      return result;
    } catch (e) {
      s.recordException(e as Error);
      throw e;
    } finally {
      s.end();
    }
  });
}
```
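Call sites then stay one line; the tool name and argument shape here are hypothetical:

```ts
// bookAppointment and its args are illustrative placeholders.
const result = await callTool("bookAppointment", { slot: "2026-03-01T10:00" });
```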

Step 5 — Group spans into one Langfuse "trace" per call

Set the trace ID at call start so every span (turn 1, turn 2, ..., post-call analytics) joins one trace:

```ts
import { context, trace } from "@opentelemetry/api";

const callTrace = tracer.startSpan("call", { attributes: { callId, agentId } });
const ctx = trace.setSpan(context.active(), callTrace);

await context.with(ctx, async () => {
  // run the whole call here; spans created inside this context
  // (each handleTurn, tool call, analytics job) parent to callTrace
  callTrace.end();
});
```


Step 6 — Build alerts on p95 latency

In Langfuse Dashboards:

  • Filter span.name = turn, last 24h.
  • Plot p50/p95/p99 of duration.
  • Alert if p95 > 1200ms for 5 minutes.

Step 7 — Eval datasets from real calls

Click "Add to dataset" in Langfuse on any failed call to build a regression dataset. Run nightly evals and gate prompt PRs on quality (see post 13).

Common pitfalls

  • No flush on shutdown: spans are buffered in memory. Call sdk.shutdown() on SIGTERM (see the sketch after this list).
  • Spans not nesting: startSpan doesn't set the active context; use startActiveSpan when child spans should nest.
  • PHI in span attributes: redact transcripts before s.setAttribute("text", ...) if you're under HIPAA.
  • Cardinality explosion: don't set callId as a metric label; use it as a span attribute only.
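A minimal shutdown hook, assuming otel.ts exports the sdk instance as in Step 1:

```ts
import { sdk } from "./otel";

process.on("SIGTERM", async () => {
  await sdk.shutdown(); // flushes any buffered spans through the OTLP exporter
  process.exit(0);
});
```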

How CallSphere does this in production

CallSphere ships every span — turn, STT, LLM, TTS, tool — to a self-hosted Langfuse via OpenTelemetry. Healthcare runs PHI redaction in a span processor before export. The eval dashboard surfaces p95 latency per vertical and alerts when any agent crosses 1.5s. Real-time observability is part of the platform; trial it.

FAQ

Langfuse vs LangSmith vs Phoenix? All three ingest OTel; pick based on price and self-hosting needs. Langfuse is open-source and OTel-native.

Cost? Cloud free tier: 50k observations/mo. Self-host: just your DB.

Can I trace WebRTC sessions? Yes — instrument server-side handlers; client-side, use the OTel browser SDK and a CORS-enabled OTLP collector.
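A minimal browser-side sketch, assuming OTel JS SDK 2.x and a collector endpoint of your own (the URL is a placeholder); never ship the Langfuse secret key to the client:

```ts
import { WebTracerProvider } from "@opentelemetry/sdk-trace-web";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Export to your own CORS-enabled OTLP collector, which forwards to Langfuse.
const provider = new WebTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({ url: "https://collector.example.com/v1/traces" })
    ),
  ],
});
provider.register();
```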

Sampling? Sample call traces at 100% for the first month, then drop to 10% with always-on for failures.
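Head sampling for the 10% phase is one NodeSDK option (a sketch below); keeping failures at 100% needs tail-based sampling, e.g., an OTel Collector with a tail_sampling processor in front of Langfuse:

```ts
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const sdk = new NodeSDK({
  // Keep 10% of root traces; child spans follow their parent's decision.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  // ...resource and traceExporter as in Step 1
});
```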


