Arize Phoenix in 2026: The Self-Hosted Observability Stack for Agents
Phoenix accepts OpenTelemetry traces from any framework, runs anywhere from your laptop to Kubernetes, and is fully open source with no feature gates.
TL;DR — Phoenix is what you wire up when you need full control of your agent traces and evals on infrastructure you own. It's OTel-native, framework-agnostic, fully open source, and runs from a Jupyter cell up to a multi-node Kubernetes deployment with no feature gates.
What Phoenix is
Phoenix (Arize-ai/phoenix on GitHub) is an open-source AI observability and evaluation platform. It accepts traces over OpenTelemetry (OTLP), ships auto-instrumentation for every major framework (LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, AutoGen, Pydantic AI, smolagents, deepagents), and provides a UI for trace exploration, eval scoring, prompt iteration, and experiments.
Critical distinction from competitors: Phoenix is fully open source and self-hostable with no feature gates. The hosted Arize AX product is a separate paid offering for teams that want a managed service; the OSS Phoenix offering is feature-complete on its own.
Why Phoenix wins for self-hosted
Three properties matter:
- OTel-native. Your traces are portable. If you outgrow Phoenix you can pipe the same OTLP stream into Datadog, Honeycomb, or Grafana Tempo without re-instrumenting.
- Framework-agnostic auto-instrumentation. openinference-instrumentation-* packages exist for nearly every popular framework — one line wires the spans.
- No feature gates. Eval templates, experiments, prompt playground — all in OSS. No "upgrade to Pro to see..." walls.
Where Phoenix lives in your stack
Two common patterns:
Pattern A — laptop / notebook. pip install arize-phoenix and px.launch_app(). Phoenix runs in-process, the UI is on localhost:6006, traces stream from your code as you iterate. Best for prompt iteration and offline eval work.
Pattern B — Docker / Kubernetes. Pull the Phoenix container, point your application's OTLP exporter at it, mount Postgres for persistence. Best for production observability.
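Side by side, the two patterns in code. A minimal sketch: phoenix:4317 assumes a Docker service named phoenix, and the Pattern A lines are commented out since they need arize-phoenix installed:

```python
import os

# Pattern B: point any OTel exporter at a remote Phoenix collector.
# "phoenix" is an assumed Docker/Kubernetes service name for this sketch.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://phoenix:4317"

# Pattern A: run Phoenix in-process for notebook work.
#   import phoenix as px          # pip install arize-phoenix
#   session = px.launch_app()     # UI served at http://localhost:6006
```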
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
For regulated industries, Pattern B is the right default — your prompt and response data stays on infrastructure you control. That matters when the data itself is regulated, or when you need an audit trail for EU AI Act compliance.
How CallSphere uses Phoenix
We run a self-hosted Phoenix cluster as part of our healthcare deployment. Every voice call's trace — turn-by-turn LLM calls, tool calls, latency breakdowns, error spans — flows from the Node runtime via OTLP to Phoenix. PHI never leaves our HIPAA-eligible AWS account.
For our IT Services UrackIT deployment (ChromaDB-backed RAG agent), Phoenix's RAG-specific eval templates are what catch retrieval regressions. When the embedding model gets bumped, we run the eval suite, see hit rate / MRR / faithfulness scores, and gate the rollout.
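The retrieval metrics above are straightforward to compute from ranked results. A minimal sketch, assuming each query has one known-relevant document (Phoenix derives these from its eval templates; this is just the arithmetic):

```python
def hit_rate_and_mrr(ranked_ids: list[list[str]],
                     relevant: list[str]) -> tuple[float, float]:
    """Hit rate: fraction of queries whose relevant doc is retrieved at all.
    MRR: mean of 1/rank of the first relevant hit, 0 when it is missed."""
    hits, reciprocal_rank_sum = 0, 0.0
    for ids, rel in zip(ranked_ids, relevant):
        if rel in ids:
            hits += 1
            reciprocal_rank_sum += 1.0 / (ids.index(rel) + 1)
    n = len(relevant)
    return hits / n, reciprocal_rank_sum / n

# Three queries: relevant doc at rank 1, rank 2, and missing entirely.
hr, mrr = hit_rate_and_mrr([["a", "b"], ["c", "d"], ["x", "y"]],
                           ["a", "d", "z"])
assert hr == 2 / 3
assert abs(mrr - 0.5) < 1e-9
```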
We also use Phoenix's prompt playground as the day-to-day workflow for prompt iteration. You can replay a real production trace, edit the prompt, re-run, and compare side by side. That's faster than any other prompt iteration UX we've tried.
Build steps — Phoenix self-hosted on Docker
- Run the container: docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest.
- In your app, set OTEL_EXPORTER_OTLP_ENDPOINT=http://phoenix:4317.
- Install instrumentation: pip install openinference-instrumentation-langchain (or your framework), then from openinference.instrumentation.langchain import LangChainInstrumentor; LangChainInstrumentor().instrument().
- Open the Phoenix UI on :6006. Traces appear immediately.
- Define eval datasets and run phoenix.evals.run_evals(...) in CI.
- Mount Postgres for durable storage; without it, traces live only in memory.
Code: instrument an agent + run evals
from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from phoenix.otel import register
tracer_provider = register(
    project_name="callsphere-after-hours",
    endpoint="http://phoenix:4317",
)
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
# ...your agent runs here, traces flow to Phoenix automatically.
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
import pandas as pd

dataset = pd.read_csv("eval_set.csv")  # rows of input / output / reference
results = run_evals(
    dataframe=dataset,
    evaluators=[HallucinationEvaluator(model=OpenAIModel(model="gpt-5"))],
    provide_explanation=True,
)
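In CI, the scored output can gate the rollout. A simplified sketch using a plain list of per-row scores in place of the run_evals DataFrame; the 0.95 threshold is an arbitrary example, not a Phoenix default:

```python
def gate_release(scores: list[float], threshold: float = 0.95) -> bool:
    """Pass only if the mean eval score clears the threshold."""
    if not scores:
        return False  # no eval data: fail closed
    return sum(scores) / len(scores) >= threshold

# e.g. per-row hallucination scores pulled from the results DataFrame
assert gate_release([1.0, 1.0, 0.9]) is True   # mean 0.967 passes
assert gate_release([1.0, 0.5, 0.5]) is False  # mean 0.667 blocks
```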
Build steps — Phoenix on Kubernetes
- Use the official Helm chart (in the GitHub repo).
- Provision Postgres for traces, S3 for span attachments.
- Configure SSO (Cognito, Okta) on the UI ingress.
- Wire your apps' OTLP exporter to the Phoenix service.
- Set retention: 14 days hot, 90 days cold (export to S3 + Athena).
- Add Prometheus alerts on Phoenix's own metrics endpoint.
- Quarterly: update the chart, validate eval templates still pass.
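The retention step can be sketched as a helper that classifies a span's age into tiers. The 14- and 90-day windows come from the list above; the tier names and function are ours:

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS, COLD_DAYS = 14, 90

def retention_tier(span_time: datetime, now: datetime) -> str:
    """Hot: queryable in Phoenix. Cold: exported to S3 + Athena. Expired: dropped."""
    age = now - span_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=COLD_DAYS):
        return "cold"
    return "expired"

now = datetime(2026, 1, 30, tzinfo=timezone.utc)
assert retention_tier(now - timedelta(days=3), now) == "hot"
assert retention_tier(now - timedelta(days=30), now) == "cold"
assert retention_tier(now - timedelta(days=120), now) == "expired"
```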
OpenInference — the instrumentation standard
Phoenix uses OpenInference, an open semantic-convention layer on top of OpenTelemetry. OpenInference defines what a "tool span" looks like, what an "LLM span" contains, what a "retrieval span" tracks. The benefit: instrumenting your agent in Phoenix means your spans are also compatible with any other OTel backend that understands OpenInference.
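For illustration, here is roughly what the attributes of an LLM span look like under OpenInference conventions. The key names follow the published convention as we understand it; the values are invented:

```python
# Illustrative OpenInference attributes for an LLM span (values made up).
llm_span_attributes = {
    "openinference.span.kind": "LLM",   # vs. TOOL, RETRIEVER, CHAIN
    "llm.model_name": "gpt-5",
    "input.value": "What are your after-hours rates?",
    "output.value": "Our after-hours line is staffed 24/7...",
    "llm.token_count.total": 412,
}

assert llm_span_attributes["openinference.span.kind"] == "LLM"
```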
Practical implication: you can run Phoenix in dev/staging for fast iteration and pipe the same traces to Datadog or Honeycomb in production. No re-instrumentation. We do exactly this — Phoenix for trace exploration during development, Datadog for production alerting.
Eval templates that ship out of the box
Phoenix's built-in evaluators are a real time-saver. The ones we use most:
- HallucinationEvaluator — detects answers not grounded in retrieved context.
- QAEvaluator — scores answer quality against a reference.
- RelevanceEvaluator — scores RAG retrieval relevance.
- ToxicityEvaluator — flags toxic or unsafe outputs.
- CodeReadabilityEvaluator — for coding agents.
Each ships as a callable that takes a DataFrame of records and returns scored DataFrames with explanations. We run these as a nightly job over the previous day's production traces and dashboard the results.
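As a toy version of that contract, here is a stand-in evaluator: a callable that takes records and returns them with a score and explanation attached. This is our simplification for illustration, not the Phoenix class interface:

```python
# Simplified stand-in for an evaluator: substring match in place of an LLM judge.
def toy_relevance_evaluator(records: list[dict]) -> list[dict]:
    scored = []
    for r in records:
        hit = r["query"].lower() in r["retrieved"].lower()
        scored.append({**r,
                       "score": 1.0 if hit else 0.0,
                       "explanation": ("query found in context" if hit
                                       else "query absent from context")})
    return scored

rows = [{"query": "billing", "retrieved": "Billing FAQ: invoices are..."},
        {"query": "refund", "retrieved": "Shipping times vary..."}]
scored = toy_relevance_evaluator(rows)
assert [r["score"] for r in scored] == [1.0, 0.0]
```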
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Experiments — the prompt iteration loop
Phoenix's "experiments" feature pairs a dataset (real or synthetic inputs) with a function (your agent or a prompt variant) and runs evals across the combination. Useful when you want to compare three prompt variants or two retrieval strategies head-to-head. The UI shows a side-by-side comparison with diffs and eval scores.
This is our default workflow before any production prompt change: define the dataset, run the experiment with the old and new prompt, compare scores, ship the winner.
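Stripped to its essentials, the compare-and-ship-the-winner loop looks like this. The variants and the length-based "evaluator" are placeholders for real prompts and real eval scores:

```python
def run_experiment(dataset: list[str], variants: dict, evaluator) -> dict:
    """Score each prompt variant over the dataset; return mean score per variant."""
    return {name: sum(evaluator(fn(x)) for x in dataset) / len(dataset)
            for name, fn in variants.items()}

# Hypothetical variants; reply length stands in for a real eval score here.
variants = {"old": lambda q: f"Answer: {q}",
            "new": lambda q: f"Answer: {q}. Anything else I can help with?"}
scores = run_experiment(["hours?", "pricing?"], variants,
                        evaluator=lambda reply: min(len(reply) / 40, 1.0))
winner = max(scores, key=scores.get)
assert winner == "new"
```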
FAQ
How is Phoenix different from Arize AX? AX is the hosted commercial product with multi-tenant features, advanced governance, and SLAs. Phoenix OSS is the same trace + eval surface without the managed-service plumbing.
Does it work with the OpenAI Agents SDK? Yes — openinference-instrumentation-openai-agents autoinstruments. We use it daily.
Can I run Phoenix and Helicone together? Yes. Helicone for gateway-level logging and cost; Phoenix for span-level traces and evals. Different layers.
What's the resource footprint? A single-node Phoenix container handles ~10M spans/month comfortably with Postgres on a t3.medium. Scale Postgres before Phoenix.
Where do I see this in production? Book a demo and we'll walk through a live Phoenix dashboard for our healthcare voice agent.
Does Phoenix support multi-tenancy? OSS supports project-level separation; for full multi-tenant isolation with quotas and per-team auth, Arize AX is the managed answer.
Can I export traces to other backends? Yes — Phoenix is OTel-native. Pipe the same OTLP stream to Datadog, Honeycomb, or Grafana Tempo.
How do I write a custom evaluator? Subclass LLMEvaluator or write a plain function that returns a score and explanation. The framework handles batch execution and persistence.
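A minimal sketch of the plain-function route. The regex check is a made-up example, and while the (score, explanation) return follows the answer above, the exact signature Phoenix's batch runner expects may differ:

```python
import re

# Plain-function evaluator: return (score, explanation) for one output.
def no_phone_leak(output: str) -> tuple[float, str]:
    """Flag outputs that look like they leak a phone number."""
    leaked = re.search(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b", output)
    if leaked:
        return 0.0, "possible phone number in output"
    return 1.0, "no phone-number pattern found"

assert no_phone_leak("Call us at 555-867-5309 anytime")[0] == 0.0
assert no_phone_leak("We will follow up by email")[0] == 1.0
```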
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.