By Sagar Shankaran, Founder of CallSphere
The OTel GenAI semantic conventions exited experimental for client spans in early 2026. Here's how CallSphere instruments 37 voice and chat agents with gen_ai.* attributes that work across Datadog, Honeycomb, and Grafana.
Key takeaways
TL;DR: In 2026 you no longer write custom span attributes for "model name". You use gen_ai.request.model, and your traces work in every backend that supports OTel.
flowchart LR
    Browser["Browser / Phone"] -- "WebSocket /ws" --> LB["Load Balancer<br/>sticky session"]
    LB --> Pod1["Node A · Socket.IO"]
    LB --> Pod2["Node B · Socket.IO"]
    Pod1 -- "pub/sub" --> Redis[("Redis cluster")]
    Pod2 -- "pub/sub" --> Redis
    Pod1 --> AI["AI Worker · OpenAI Realtime"]
    Pod2 --> AI

For two years every team rolled its own LLM-tracing schema. model, llm.model, openai.model, anthropic.model: all meant the same thing, and none queried the same way. A platform team that wanted to chart "tokens spent per model per service" had to write a per-vendor adapter for every framework. By late 2025, the OTel GenAI SIG stabilized client spans and metrics, and most agent frameworks (OpenAI Agents SDK, LangChain, LlamaIndex, AutoGen) shipped emitters by Q1 2026.
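The per-vendor adapter problem described above can be sketched as a small normalizer that folds ad hoc model keys into gen_ai.request.model. This is a minimal illustration, not CallSphere's code: the function name and the precedence rule (an existing standard key wins, otherwise the first legacy key found) are assumptions.

```python
# The four legacy keys are the ones the article names as ad hoc schemas.
LEGACY_MODEL_KEYS = ("model", "llm.model", "openai.model", "anthropic.model")
STANDARD_MODEL_KEY = "gen_ai.request.model"


def normalize_model_attribute(attributes: dict) -> dict:
    """Return a copy of span attributes with the model under the standard key.

    Hypothetical helper: an existing gen_ai.request.model value is kept,
    otherwise the first legacy key found (in LEGACY_MODEL_KEYS order) is used.
    """
    normalized = dict(attributes)  # never mutate the caller's dict
    for key in LEGACY_MODEL_KEYS:
        if key in normalized:
            value = normalized.pop(key)
            # setdefault keeps whichever value was stored first.
            normalized.setdefault(STANDARD_MODEL_KEY, value)
    return normalized
```

Once every emitter uses the standard key natively, a shim like this becomes unnecessary, which is the point of the conventions.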
The trap is that the agent spec is still experimental, and most production workloads are agents, not single LLM calls. If you only instrument the chat-completions span, you miss the tool-call planning, the handoffs between sub-agents, and the loop itself. You end up with a trace that looks fast and an experience that feels slow.
Use three layers of OTel GenAI conventions:
gen_ai.client spans (stable): one per LLM round-trip. Attributes: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.
gen_ai.agent spans (experimental): one per agent invocation. Attributes: gen_ai.agent.name, gen_ai.agent.id, gen_ai.agent.description.
gen_ai.tool.* events: attached to agent spans. They capture every tool call the agent makes and its result.

Standard metrics in 2026: gen_ai.client.token.usage (histogram), gen_ai.client.operation.duration (histogram). Datadog, Honeycomb, Grafana, and OpenObserve all auto-detect these.
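To make the three layers concrete, here is a dependency-free sketch of the shape one agent turn might take. It uses plain dataclasses rather than the OTel SDK so it runs anywhere. The Span class, the "chat" span name, the gen_ai.tool.call event name, the tool name, and the token counts are all illustrative assumptions; only the gen_ai.* attribute keys come from the list above (plus gen_ai.tool.name, used here for the tool event).

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified stand-in for an OTel span (not the SDK class)."""
    name: str
    attributes: dict
    events: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_agent_trace() -> Span:
    # Layer 2: one agent span per agent invocation (experimental conventions).
    agent = Span(
        name="gen_ai.agent.invoke",
        attributes={
            "gen_ai.agent.name": "healthcare_intake",
            "gen_ai.agent.id": "hc-intake-v3",
        },
    )
    # Layer 1: one client span per LLM round-trip (stable conventions).
    for input_tokens, output_tokens in [(820, 45), (910, 130)]:
        agent.children.append(
            Span(
                name="chat",  # illustrative span name
                attributes={
                    "gen_ai.request.model": "gpt-4o",  # illustrative model
                    "gen_ai.usage.input_tokens": input_tokens,
                    "gen_ai.usage.output_tokens": output_tokens,
                    "gen_ai.response.finish_reasons": ["stop"],
                },
            )
        )
    # Layer 3: tool calls recorded as events on the agent span.
    agent.events.append(
        {
            "name": "gen_ai.tool.call",  # hypothetical event name
            "attributes": {"gen_ai.tool.name": "lookup_appointment"},
        }
    )
    return agent


def total_tokens(agent: Span) -> int:
    """Sum input and output tokens over the agent's child client spans."""
    return sum(
        child.attributes["gen_ai.usage.input_tokens"]
        + child.attributes["gen_ai.usage.output_tokens"]
        for child in agent.children
    )
```

The agent span is the unit a user experiences as one turn, while token cost lives on its child client spans, so a per-turn cost figure has to be rolled up from the children.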
We run 37 agents across six verticals on k3s with Cloudflare Tunnel. Every agent emits OTel GenAI spans through an OpenTelemetry Collector deployed as a DaemonSet. The collector tail-samples to 5% (100% for errors and slow turns) and forwards to two backends:
The Healthcare FastAPI service on :8084 decorates each route with our @trace_genai_agent decorator, which auto-emits a parent agent span and child client spans. The Real Estate 6-container pod sends spans across NATS subjects and reuses the trace context header, so a single call shows as one trace across all six containers. Sales WebSocket workers (PM2) batch-export every 5 seconds. The After-hours Bull/Redis queue worker emits one trace per job, and Bull's job ID becomes the trace ID prefix.
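The article does not say how the Bull job ID is encoded into the trace ID, so the following is only one possible scheme. It assumes an 8-hex-character prefix derived from a SHA-256 hash of the job ID, followed by random hex to fill out the 32 hex characters (128 bits) an OTel trace ID requires. Both function names and the prefix width are hypothetical.

```python
import hashlib
import secrets

TRACE_ID_HEX_CHARS = 32  # OTel trace IDs are 128 bits, i.e. 32 hex characters
PREFIX_HEX_CHARS = 8     # assumed prefix width, not taken from the article


def job_prefix(job_id: str) -> str:
    """Stable hex prefix derived from a queue job ID (assumed scheme)."""
    digest = hashlib.sha256(job_id.encode("utf-8")).hexdigest()
    return digest[:PREFIX_HEX_CHARS]


def trace_id_for_job(job_id: str) -> str:
    """Build a 32-hex-char trace ID whose prefix identifies the job."""
    random_bytes = (TRACE_ID_HEX_CHARS - PREFIX_HEX_CHARS) // 2
    return job_prefix(job_id) + secrets.token_hex(random_bytes)
```

With a scheme like this, a retried job produces a new trace that shares the same prefix, so searching the backend for the prefix returns every attempt of that job.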
Plans on /pricing include trace export to your own OTel collector at the $499 tier; $1499 enterprise gets a dedicated tenant in our Honeycomb. Try it on the 14-day trial.
pip install opentelemetry-distro \
    opentelemetry-instrumentation-openai \
    opentelemetry-exporter-otlp
from opentelemetry import trace

tracer = trace.get_tracer("callsphere.healthcare")


def run_agent(user_input: str):
    # agent_loop is the application's own agent loop, defined elsewhere.
    with tracer.start_as_current_span(
        "gen_ai.agent.invoke",
        attributes={
            "gen_ai.agent.name": "healthcare_intake",
            "gen_ai.agent.id": "hc-intake-v3",
            "gen_ai.system": "openai",
        },
    ) as span:
        # Tool calls and LLM calls happen inside this block;
        # auto-instrumentation adds the gen_ai.client child spans.
        result = agent_loop(user_input)
        # Truncate the completion so the attribute stays small.
        span.set_attribute("gen_ai.completion.text", result.text[:512])
        return result
processors:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - keep_keys(attributes, ["gen_ai.request.model", "gen_ai.system"])
Build dashboards on the standard names. A "tokens per model per route" panel that uses gen_ai.request.model works for OpenAI, Anthropic, and Cohere with no code changes.
Tail-sample. Keep 100% of error traces, 100% of traces with FTL (first-token latency) > 1500 ms, and 5% of everything else. Tail-sampling at the collector cuts trace storage by close to 95%, minus whatever share the error and slow traces add back.
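The sampling rule in the last step can be written as a single keep/drop decision per completed trace. This is a sketch of the policy only, not the collector's implementation. The function name and the hash-based bucketing are assumptions, chosen so that every collector replica reaches the same decision for a given trace ID.

```python
import hashlib

FTL_THRESHOLD_MS = 1500  # slow-turn threshold from the rule above
BASELINE_RATE = 0.05     # share of ordinary traces to keep


def keep_trace(trace_id: str, has_error: bool, ftl_ms: float) -> bool:
    """Decide whether to keep a finished trace (hypothetical helper)."""
    if has_error:
        return True  # keep 100% of error traces
    if ftl_ms > FTL_THRESHOLD_MS:
        return True  # keep 100% of slow turns
    # Map the trace ID to a stable value in [0, 1) so the decision is
    # deterministic and does not depend on which replica sees the trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < BASELINE_RATE
```

In a real deployment the same policy would normally be expressed as configuration for the collector's tail-sampling processor rather than in application code. Note that a latency rule there typically measures whole-trace duration, so an FTL-based rule would need the first-token latency recorded as a span attribute.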
Q: Are GenAI agent spans stable yet?
A: Client spans and metrics are stable. Agent and framework spans are experimental but have been very stable in practice through Q1 2026.
Q: Do I need a vendor SDK on top of OTel?
A: No. OTel + auto-instrumentation covers 80% of needs. Add a vendor SDK (Langfuse, LangSmith) if you want their UI on top; they all consume OTel.
Q: How do I keep PII out of the spans?
A: Use the collector's redaction processor or run Microsoft Presidio in a sidecar before export. Our /industries/healthcare build does this in the collector.
Q: Will my Datadog APM see this?
A: Yes. Datadog LLM Observability natively maps OTel GenAI semconv to its product UI as of late 2025.
Q: What about voice-specific attributes?
A: We add callsphere.audio.first_token_ms and callsphere.audio.barge_in_count as custom attributes — namespaced so they don't collide with future OTel additions.
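For the PII question above, the idea of scrubbing text attributes before export can be sketched in a few lines. This is a deliberately naive, regex-only illustration, not the collector's redaction processor and not Presidio. The patterns, placeholder labels, and function names are assumptions, and real PHI handling needs far broader coverage than email addresses and phone numbers.

```python
import re

# Naive illustrative patterns; production redaction needs much more coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}
# Attributes that carry free text; gen_ai.completion.text is the one
# set in the run_agent example earlier in this post.
TEXT_KEYS = ("gen_ai.completion.text",)


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


def redact_attributes(attributes: dict) -> dict:
    """Return a copy of span attributes with free-text values redacted."""
    cleaned = dict(attributes)
    for key in TEXT_KEYS:
        if isinstance(cleaned.get(key), str):
            cleaned[key] = redact(cleaned[key])
    return cleaned
```

Running redaction in the collector, as the answer describes, has the advantage that every service gets the same policy without each one importing a helper like this.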
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.