By Sagar Shankaran, Founder of CallSphere
Phoenix accepts OpenTelemetry traces from any framework, runs anywhere from your laptop to Kubernetes, and is fully open source with no feature gates.
Key takeaways
TL;DR — Phoenix is what you wire up when you need full control of your agent traces and evals on infrastructure you own. It's OTel-native, framework-agnostic, fully open source, and runs from a Jupyter cell up to a multi-node Kubernetes deployment with no feature gates.
flowchart LR
Repo[GitHub repo] --> CI[GitHub Actions]
CI --> Eval[Agent eval suite · PromptFoo]
Eval -->|pass| Deploy[Deploy]
Eval -->|fail| Block[Block PR]
Deploy --> Prod[Production agent]
Prod --> Trace[(LangSmith trace)]
Trace --> Eval
Phoenix (Arize-ai/phoenix on GitHub) is an open-source AI observability and evaluation platform. It accepts traces over OpenTelemetry (OTLP), ships auto-instrumentation for every major framework (LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, AutoGen, Pydantic AI, smolagents, deepagents), and provides a UI for trace exploration, eval scoring, prompt iteration, and experiments.
Critical distinction from competitors: Phoenix is fully open source and self-hostable with no feature gates. The hosted Arize AX product is a separate paid offering for teams that want a managed service; the OSS Phoenix offering is feature-complete on its own.
Three properties matter:
openinference-instrumentation-* packages exist for nearly every popular framework; one line wires the spans.
Two common patterns:
Pattern A — laptop / notebook. pip install arize-phoenix and px.launch_app(). Phoenix runs in-process, the UI is on localhost:6006, traces stream from your code as you iterate. Best for prompt iteration and offline eval work.
Pattern B — Docker / Kubernetes. Pull the Phoenix container, point your application's OTLP exporter at it, mount Postgres for persistence. Best for production observability.
For regulated industries, Pattern B is the right default — your prompt and response data stays on infrastructure you control. That's critical when you're handling regulated data or need an audit trail for EU AI Act compliance.
We run a self-hosted Phoenix cluster as part of our healthcare deployment. Every voice call's trace — turn-by-turn LLM calls, tool calls, latency breakdowns, error spans — flows from the Node runtime via OTLP to Phoenix. PHI never leaves our HIPAA-eligible AWS account.
For our IT Services UrackIT deployment (ChromaDB-backed RAG agent), Phoenix's RAG-specific eval templates are what catch retrieval regressions. When the embedding model gets bumped, we run the eval suite, see hit rate / MRR / faithfulness scores, and gate the rollout.
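The retrieval gate described above can be sketched in plain Python. This is a minimal illustration of how hit rate and MRR are computed and compared against a baseline, not Phoenix's implementation; the document IDs, baseline numbers, and the `max_drop` tolerance are all hypothetical. Faithfulness is an LLM-judged score and is left out here.

```python
from typing import Dict, Sequence, Set


def hit_rate(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for docs, rel in zip(ranked, relevant) if any(d in rel for d in docs[:k])
    )
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]]) -> float:
    """Average of 1/rank of the first relevant document (0 when none is retrieved)."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)


def passes_gate(baseline: Dict[str, float], candidate: Dict[str, float], max_drop: float = 0.02) -> bool:
    """True only if no metric falls more than max_drop below its baseline."""
    return all(candidate[name] >= baseline[name] - max_drop for name in baseline)


# Hypothetical results for three queries after an embedding-model bump.
ranked = [["d1", "d2", "d3"], ["d9", "d4", "d5"], ["d7", "d8", "d6"]]
relevant = [{"d1"}, {"d4"}, {"d0"}]

candidate = {
    "hit_rate": hit_rate(ranked, relevant, k=3),      # 2 of 3 queries hit
    "mrr": mean_reciprocal_rank(ranked, relevant),    # (1 + 1/2 + 0) / 3
}
baseline = {"hit_rate": 0.70, "mrr": 0.55}            # made-up prior scores
print(candidate, "rollout allowed:", passes_gate(baseline, candidate))
```

With these toy numbers both metrics fall below the tolerance, so the gate blocks the rollout.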
We also use Phoenix's prompt playground as the day-to-day workflow for prompt iteration. You can replay a real production trace, edit the prompt, re-run, and compare side by side. That's faster than any other prompt iteration UX we've tried.
1. Start the Phoenix container: docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
2. Point your application's OTLP exporter at it: OTEL_EXPORTER_OTLP_ENDPOINT=http://phoenix:4317
3. Install the instrumentation package: pip install openinference-instrumentation-langchain (or the one for your framework)
4. Instrument at startup: from openinference.instrumentation.langchain import LangChainInstrumentor; LangChainInstrumentor().instrument()
5. Open the UI on port 6006. Traces appear immediately.
6. Run phoenix.evals.run_evals(...) in CI.

from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from phoenix.otel import register

# Create a tracer provider that exports spans to the Phoenix collector.
tracer_provider = register(
    project_name="callsphere-after-hours",
    endpoint="http://phoenix:4317",
)
# Auto-instrument the OpenAI Agents SDK against that provider.
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
# ...your agent runs here, traces flow to Phoenix automatically.
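import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# The hallucination template reads the input, reference, and output columns.
dataset = pd.read_csv("eval_set.csv")

# Evaluators take a model object, not a bare model-name string.
# run_evals returns one scored DataFrame per evaluator.
results = run_evals(
    dataframe=dataset,
    evaluators=[HallucinationEvaluator(OpenAIModel(model="gpt-5"))],
    provide_explanation=True,
)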
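To turn eval output like this into a CI decision, something has to reduce the per-row labels to a pass or fail. The sketch below is one hedged way to do that in plain Python. It assumes the evaluator's labels are the strings "factual" and "hallucinated" and that they have already been pulled out of the results into a list; the 5% threshold and the function names are illustrative, not part of Phoenix.

```python
from collections import Counter
from typing import Iterable


def hallucination_rate(labels: Iterable[str]) -> float:
    """Share of rows labeled 'hallucinated' (label strings are an assumption)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no eval results to score")
    return counts["hallucinated"] / total


def ci_exit_code(labels: Iterable[str], max_rate: float = 0.05) -> int:
    """Return 0 when the rate is within budget, 1 to fail the CI job."""
    return 0 if hallucination_rate(labels) <= max_rate else 1


# Hypothetical labels for a 20-row eval set.
labels = ["factual"] * 18 + ["hallucinated"] * 2
print(f"hallucination rate: {hallucination_rate(labels):.2%}")
print("exit code:", ci_exit_code(labels))
```

In a real pipeline the exit code would be passed to `sys.exit` so the CI runner marks the job as failed.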
Phoenix uses OpenInference, an open semantic-convention layer on top of OpenTelemetry. OpenInference defines what a "tool span" looks like, what an "LLM span" contains, what a "retrieval span" tracks. The benefit: instrumenting your agent in Phoenix means your spans are also compatible with any other OTel backend that understands OpenInference.
Practical implication: you can run Phoenix in dev/staging for fast iteration and pipe the same traces to Datadog or Honeycomb in production. No re-instrumentation. We do exactly this — Phoenix for trace exploration during development, Datadog for production alerting.
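To make the span conventions concrete, here is a rough sketch of what OpenInference-style attributes look like as plain dictionaries, and of the same span being handed to two backends unchanged. The attribute keys follow the OpenInference naming as best understood here and should be checked against the spec; the model name, tool name, and in-memory "backends" are hypothetical stand-ins, and no OpenTelemetry SDK is involved.

```python
from typing import Any, Dict, List

Span = Dict[str, Any]


def llm_span(model: str, prompt: str, completion: str,
             prompt_tokens: int, completion_tokens: int) -> Span:
    """Attributes for one LLM call, keyed by OpenInference-style names."""
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


def tool_span(name: str, arguments_json: str, result: str) -> Span:
    """Attributes for one tool call."""
    return {
        "openinference.span.kind": "TOOL",
        "tool.name": name,
        "input.value": arguments_json,
        "output.value": result,
    }


def fan_out(span: Span, backends: List[List[Span]]) -> None:
    """Deliver an identical copy of the span to every backend."""
    for backend in backends:
        backend.append(dict(span))


# In-memory lists standing in for a dev backend and a production backend.
dev_backend: List[Span] = []
prod_backend: List[Span] = []

fan_out(llm_span("example-model", "Is the clinic open?", "Yes, until 6pm.", 12, 7),
        [dev_backend, prod_backend])
fan_out(tool_span("lookup_hours", '{"day": "mon"}', "09:00-18:00"),
        [dev_backend, prod_backend])
print(len(dev_backend), len(prod_backend))
```

Because both backends receive the same attribute keys, the instrumentation code does not change when a second destination is added.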
Phoenix's built-in evaluators are a real time-saver. The ones we use most:
Each ships as a callable that takes a DataFrame of records and returns scored DataFrames with explanations. We run these as a nightly job over the previous day's production traces and dashboard the results.
Phoenix's "experiments" feature pairs a dataset (real or synthetic inputs) with a function (your agent or a prompt variant) and runs evals across the combination. Useful when you want to compare three prompt variants or two retrieval strategies head-to-head. The UI shows a side-by-side comparison with diffs and eval scores.
This is our default workflow before any production prompt change: define the dataset, run the experiment with the old and new prompt, compare scores, ship the winner.
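The dataset, function, and evaluator pairing behind that workflow can be sketched without any Phoenix dependency. The code below is a toy under stated assumptions: the two "prompt variants" are stub functions, the evaluator is exact match, and every name is hypothetical. It only shows the shape of the comparison, not Phoenix's experiments API.

```python
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, str]
Task = Callable[[str], str]
Evaluator = Callable[[Example, str], float]


def run_experiment(dataset: List[Example], task: Task, evaluator: Evaluator) -> List[float]:
    """Run the task on every example and score each output."""
    return [evaluator(example, task(example["input"])) for example in dataset]


def compare(dataset: List[Example], variants: Dict[str, Task], evaluator: Evaluator) -> Dict[str, float]:
    """Mean score per variant over the same dataset."""
    return {name: mean(run_experiment(dataset, task, evaluator))
            for name, task in variants.items()}


def exact_match(example: Example, output: str) -> float:
    return 1.0 if output == example["expected"] else 0.0


# Toy dataset and two stubbed "prompt variants".
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
    {"input": "10-7", "expected": "3"},
]
new_answers = {"2+2": "4", "3+3": "6", "10-7": "2"}
variants: Dict[str, Task] = {
    "old_prompt": lambda question: "4",
    "new_prompt": lambda question: new_answers[question],
}

scores = compare(dataset, variants, exact_match)
winner = max(scores, key=scores.get)
print(scores, "ship:", winner)
```

Holding the dataset and evaluator fixed while swapping only the variant is what makes the scores comparable.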
How is Phoenix different from Arize AX? AX is the hosted commercial product with multi-tenant features, advanced governance, and SLAs. Phoenix OSS is the same trace + eval surface without the managed-service plumbing.
Does it work with the OpenAI Agents SDK? Yes. The openinference-instrumentation-openai-agents package auto-instruments it. We use it daily.
Can I run Phoenix and Helicone together? Yes. Helicone for gateway-level logging and cost; Phoenix for span-level traces and evals. Different layers.
What's the resource footprint? A single-node Phoenix container handles ~10M spans/month comfortably with Postgres on a t3.medium. Scale Postgres before Phoenix.
Where do I see this in production? Book a demo and we'll walk through a live Phoenix dashboard for our healthcare voice agent.
Does Phoenix support multi-tenancy? OSS supports project-level separation; for full multi-tenant isolation with quotas and per-team auth, Arize AX is the managed answer.
Can I export traces to other backends? Yes — Phoenix is OTel-native. Pipe the same OTLP stream to Datadog, Honeycomb, or Grafana Tempo.
How do I write a custom evaluator? Subclass LLMEvaluator or write a plain function that returns a score and explanation. The framework handles batch execution and persistence.
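As an illustration of the plain-function route in the last answer, here is a minimal sketch of an evaluator that returns a score and an explanation. The `EvalResult` type, the phrase check, and the batch helper are hypothetical; how such a function is registered with Phoenix is not shown and should be taken from the Phoenix documentation.

```python
from typing import List, NamedTuple, Sequence


class EvalResult(NamedTuple):
    score: float
    explanation: str


def required_phrases(output: str, required: Sequence[str]) -> EvalResult:
    """Score = fraction of required phrases found in the output (case-insensitive)."""
    text = output.lower()
    missing = [phrase for phrase in required if phrase.lower() not in text]
    score = 1.0 - len(missing) / len(required)
    explanation = "all phrases present" if not missing else "missing: " + ", ".join(missing)
    return EvalResult(score, explanation)


def evaluate_batch(outputs: Sequence[str], required: Sequence[str]) -> List[EvalResult]:
    """Apply the evaluator to each output in order."""
    return [required_phrases(output, required) for output in outputs]


# Hypothetical agent replies checked for two phrases.
results = evaluate_batch(
    ["Your appointment is booked.", "Please confirm your appointment for Monday."],
    required=["appointment", "confirm"],
)
for result in results:
    print(result.score, result.explanation)
```

A deterministic check like this is cheap enough to run on every trace, leaving LLM-judged evaluators for the cases that need them.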
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.