TL;DR — In 2026 you don't write custom span attributes for "model name" anymore. You use gen_ai.request.model and your traces work in every backend that supports OTel.

What goes wrong

flowchart LR
  Browser["Browser / Phone"] -- "WebSocket /ws" --> LB["Load Balancer<br/>sticky session"]
  LB --> Pod1["Node A · Socket.IO"]
  LB --> Pod2["Node B · Socket.IO"]
  Pod1 -- "pub/sub" --> Redis[("Redis cluster")]
  Pod2 -- "pub/sub" --> Redis
  Pod1 --> AI["AI Worker · OpenAI Realtime"]
  Pod2 --> AI

CallSphere reference architecture

For two years, every team rolled its own LLM-tracing schema. The attributes model, llm.model, openai.model, and anthropic.model all meant the same thing, but none could be queried the same way. A platform team that wanted to chart "tokens spent per model per service" had to write a per-vendor adapter for every framework. By late 2025, the OTel GenAI SIG had stabilized client spans and metrics, and most agent frameworks (OpenAI Agents SDK, LangChain, LlamaIndex, AutoGen) shipped emitters by Q1 2026.
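To make the adapter problem concrete, here is a minimal sketch of folding the ad-hoc keys listed above into the standard one. The helper name, the sample spans, and the model values are illustrative assumptions, not part of any SDK.

```python
# Legacy keys named in the paragraph above; the mapping itself is a hypothetical helper.
LEGACY_MODEL_KEYS = ("model", "llm.model", "openai.model", "anthropic.model")
STANDARD_MODEL_KEY = "gen_ai.request.model"


def normalize_model_attribute(attributes: dict) -> dict:
    """Return a copy of `attributes` with legacy model keys folded into the standard key."""
    normalized = dict(attributes)
    for key in LEGACY_MODEL_KEYS:
        if key in normalized:
            value = normalized.pop(key)
            # An existing standard value wins over any legacy duplicate.
            normalized.setdefault(STANDARD_MODEL_KEY, value)
    return normalized


# Illustrative span attribute sets (model names are placeholders).
spans = [
    {"openai.model": "model-a", "service.name": "sales"},
    {"llm.model": "model-b", "service.name": "healthcare"},
    {"gen_ai.request.model": "model-c", "model": "stale-value"},
]
models = [normalize_model_attribute(s)[STANDARD_MODEL_KEY] for s in spans]
print(models)
```

Once every emitter uses the standard key natively, a shim like this is only needed for backfilling old data.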

The trap is that the agent spec is still experimental, and most production workloads are agents, not single LLM calls. If you instrument only the chat-completions span, you miss the tool-call planning, the handoffs between sub-agents, and the loop itself. You end up with a trace that looks fast and an experience that feels slow.
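One way to see the gap is to compare the wall-clock time of a whole agent turn with the time covered by its LLM client spans. The sketch below uses a hypothetical Span record with made-up timings and assumes the client spans run sequentially without overlapping.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def uninstrumented_ms(agent_span: Span, client_spans: list[Span]) -> int:
    """Time inside the agent turn that no LLM client span accounts for."""
    # Assumes client spans do not overlap (a sequential agent loop).
    covered = sum(s.duration_ms for s in client_spans)
    return agent_span.duration_ms - covered


# Illustrative numbers: the turn takes 4.2 s, but the two LLM calls total 1.3 s.
turn = Span("invoke_agent", 0, 4200)
llm_calls = [Span("chat", 100, 700), Span("chat", 2900, 3600)]
gap = uninstrumented_ms(turn, llm_calls)
print(gap)
```

In this made-up turn, most of the latency sits in tool calls, handoffs, and loop overhead that a client-span-only trace never shows.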

How to monitor

Use three layers of OTel GenAI conventions:

gen_ai.client spans (stable) — one per LLM round-trip. Attributes: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.
gen_ai.agent spans (experimental) — one per agent invocation. Attributes: gen_ai.agent.name, gen_ai.agent.id, gen_ai.agent.description.
gen_ai.tool.* events — attached to agent spans. Captures every tool call the agent makes and its result.

Standard metrics in 2026: gen_ai.client.token.usage (histogram), gen_ai.client.operation.duration (histogram). Datadog, Honeycomb, Grafana, and OpenObserve all auto-detect these.
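As a rough illustration of why the shared names matter, the sketch below aggregates hypothetical gen_ai.client.token.usage measurements by model. It assumes each measurement carries the gen_ai.request.model and gen_ai.token.type attributes; the model names and values are placeholders, and a real backend would do this grouping in its query layer.

```python
from collections import defaultdict

# Hypothetical (attributes, token_count) pairs as a histogram might record them.
measurements = [
    ({"gen_ai.request.model": "model-a", "gen_ai.token.type": "input"}, 1200),
    ({"gen_ai.request.model": "model-a", "gen_ai.token.type": "output"}, 300),
    ({"gen_ai.request.model": "model-b", "gen_ai.token.type": "input"}, 800),
    ({"gen_ai.request.model": "model-a", "gen_ai.token.type": "input"}, 500),
]


def tokens_by_model(points):
    """Sum token counts per model, split into input and output tokens."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for attrs, value in points:
        totals[attrs["gen_ai.request.model"]][attrs["gen_ai.token.type"]] += value
    return {model: dict(counts) for model, counts in totals.items()}


usage = tokens_by_model(measurements)
print(usage)
```

Because the grouping keys are the standard attribute names, the same query shape applies regardless of which provider produced the measurement.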

CallSphere stack

We run 37 agents across six verticals on k3s with Cloudflare Tunnel. Every agent emits OTel GenAI spans through an OpenTelemetry Collector deployed as a DaemonSet. The collector tail-samples to 5% (100% for errors and slow turns) and forwards to two backends:

Honeycomb for tracing (developer ergonomics on agent traces)
Prometheus + Grafana for SLO dashboards

The Healthcare FastAPI service on :8084 wraps each route in our @trace_genai_agent decorator, which auto-emits a parent agent span and child client spans. The Real Estate 6-container pod sends spans across NATS subjects and reuses the trace context header, so a single call shows up as one trace across all six containers. Sales WebSocket workers (PM2) batch-export every 5 seconds. The After-hours Bull/Redis queue worker emits one trace per job, with Bull's job ID as the trace ID prefix.
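The cross-container trace depends on carrying a trace context header with each message. The following is a minimal, standard-library sketch of the W3C traceparent format (version 00 only); the function names and the plain dict standing in for message headers are assumptions for illustration. In practice the OpenTelemetry SDK's propagators handle this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str  # 32 lowercase hex chars
    span_id: str   # 16 lowercase hex chars
    sampled: bool


def inject(ctx: TraceContext, headers: dict) -> dict:
    """Write the context into message headers before publishing."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"
    return headers


def extract(headers: dict) -> Optional[TraceContext]:
    """Read the context on the consumer side; None if missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: TraceContext) -> TraceContext:
    """Same trace ID, fresh span ID: this keeps every hop in one trace."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


# Container 1 publishes; container 2 consumes and continues the trace.
upstream = TraceContext("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
message_headers = inject(upstream, {})
downstream = child_of(extract(message_headers))
print(message_headers["traceparent"])
print(downstream.trace_id == upstream.trace_id)
```

The design point is that only the span ID changes per hop; as long as each consumer reuses the incoming trace ID, the backend stitches all six containers into one trace.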

Plans on /pricing include trace export to your own OTel collector at the $499 tier; $1499 enterprise gets a dedicated tenant in our Honeycomb. Try it on the 14-day trial.

Implementation

Install the OTel SDK for your framework. For Python:

pip install opentelemetry-distro \
  opentelemetry-instrumentation-openai \
  opentelemetry-exporter-otlp

Wrap your agent loop with explicit agent spans:

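from opentelemetry import trace

tracer = trace.get_tracer("callsphere.healthcare")


def run_agent(user_input: str):
    # One parent span per agent invocation. The GenAI semconv names it
    # "invoke_agent {gen_ai.agent.name}" and sets gen_ai.operation.name.
    with tracer.start_as_current_span(
        "invoke_agent healthcare_intake",
        attributes={
            "gen_ai.operation.name": "invoke_agent",
            "gen_ai.agent.name": "healthcare_intake",
            "gen_ai.agent.id": "hc-intake-v3",
            "gen_ai.system": "openai",
        },
    ) as span:
        # Tool calls and LLM calls run inside this block; auto-instrumentation
        # attaches its gen_ai.client spans as children of the agent span.
        # agent_loop is your own agent implementation (not defined here).
        result = agent_loop(user_input)
        # Truncated completion text for debugging; treat it as sensitive
        # and redact PII before export.
        span.set_attribute("gen_ai.completion.text", result.text[:512])
        return result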

Configure the collector to keep only the standard semconv attributes on metric datapoints, which also caps label cardinality:

processors:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - keep_keys(attributes, ["gen_ai.request.model","gen_ai.system"])

Build dashboards on the standard names. A "tokens per model per route" panel that uses gen_ai.request.model works for OpenAI, Anthropic, and Cohere with no code changes.
Tail-sample. Keep 100% of error traces, 100% of traces with FTL > 1500ms, and 5% of everything else. Tail-sampling at the collector cuts trace storage by up to 95%, minus whatever share the error and slow traces add back.
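The sampling policy above can be sketched as a pure decision function. This is not the collector's tail-sampling processor, only an illustration of the rule; it assumes FTL means first-token latency in milliseconds, and the function and constant names are hypothetical.

```python
import hashlib

FTL_THRESHOLD_MS = 1500  # slow-turn cutoff from the policy above
BASELINE_RATE = 0.05     # share of ordinary traces to keep


def keep_trace(trace_id: str, has_error: bool, ftl_ms: float) -> bool:
    """Decide, once the whole trace is known, whether to keep it."""
    if has_error:
        return True
    if ftl_ms > FTL_THRESHOLD_MS:
        return True
    # Hash the trace ID into [0, 1) so the decision is deterministic:
    # every collector replica makes the same call for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE


# Rough check of the baseline rate over synthetic, healthy, fast traces.
kept = sum(keep_trace(f"trace-{i}", False, 200.0) for i in range(20000))
print(kept / 20000)
```

The decision has to wait for the full trace because the error flag and the latency are only known after the turn ends, which is why this runs at the collector rather than in the SDK.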

FAQ

Q: Are GenAI agent spans stable yet? A: Client spans and metrics are stable. Agent and framework spans are experimental but have been very stable in practice through Q1 2026.

Q: Do I need a vendor SDK on top of OTel? A: No. OTel + auto-instrumentation covers 80% of needs. Add a vendor SDK (Langfuse, LangSmith) if you want their UI on top — they all consume OTel.

Q: How do I keep PII out of the spans? A: Use the collector's redaction processor or run Microsoft Presidio in a sidecar before export. Our /industries/healthcare build does this in the collector.
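For intuition only, here is a small sketch of scrubbing string attributes before export. The regular expressions are deliberately simple, the callsphere.transcript attribute name is hypothetical, and this is not a substitute for the collector's redaction processor or Presidio.

```python
import re

# Simplistic patterns for illustration; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of span attributes with emails and phone numbers masked."""
    redacted = {}
    for key, value in attributes.items():
        if isinstance(value, str):
            value = EMAIL_RE.sub("[EMAIL]", value)
            value = PHONE_RE.sub("[PHONE]", value)
        redacted[key] = value  # non-string values pass through unchanged
    return redacted


clean = redact_attributes({
    "gen_ai.request.model": "model-a",
    "gen_ai.usage.input_tokens": 42,
    "callsphere.transcript": "Call me at +1 415 555 0100 or jane@example.com",
})
print(clean["callsphere.transcript"])
```

Doing this in the collector rather than in each service keeps the rule in one place, at the cost of PII still crossing the network between the service and the collector.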

Q: Will my Datadog APM see this? A: Yes. Datadog LLM Observability natively maps OTel GenAI semconv to its product UI as of late 2025.

Q: What about voice-specific attributes? A: We add callsphere.audio.first_token_ms and callsphere.audio.barge_in_count as custom attributes — namespaced so they don't collide with future OTel additions.

OpenTelemetry GenAI Conventions for AI Agents in 2026
