AI Infrastructure

Arize Phoenix in 2026: The Self-Hosted Observability Stack for Agents

Phoenix accepts OpenTelemetry traces from any framework, runs anywhere from your laptop to Kubernetes, and is fully open source with no feature gates.

TL;DR — Phoenix is what you wire up when you need full control of your agent traces and evals on infrastructure you own. It's OTel-native, framework-agnostic, fully open source, and runs from a Jupyter cell up to a multi-node Kubernetes deployment with no feature gates.

What Phoenix is

flowchart LR
  Repo[GitHub repo] --> CI[GitHub Actions]
  CI --> Eval[Agent eval suite · Phoenix evals]
  Eval -->|pass| Deploy[Deploy]
  Eval -->|fail| Block[Block PR]
  Deploy --> Prod[Production agent]
  Prod --> Trace[(Phoenix trace)]
  Trace --> Eval
CallSphere reference architecture

Phoenix (Arize-ai/phoenix on GitHub) is an open-source AI observability and evaluation platform. It accepts traces over OpenTelemetry (OTLP), ships auto-instrumentation for every major framework (LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, AutoGen, Pydantic AI, smolagents, deepagents), and provides a UI for trace exploration, eval scoring, prompt iteration, and experiments.

Critical distinction from competitors: Phoenix is fully open source and self-hostable with no feature gates. The hosted Arize AX product is a separate paid offering for teams that want a managed service; the OSS Phoenix offering is feature-complete on its own.

Why Phoenix wins for self-hosted

Three properties matter:

  1. OTel-native. Your traces are portable. If you outgrow Phoenix you can pipe the same OTLP stream into Datadog, Honeycomb, or Grafana Tempo without re-instrumenting.
  2. Framework-agnostic auto-instrumentation. openinference-instrumentation-* packages exist for nearly every popular framework — one line wires the spans.
  3. No feature gates. Eval templates, experiments, prompt playground — all in OSS. No "upgrade to Pro to see..." walls.

Where Phoenix lives in your stack

Two common patterns:

Pattern A — laptop / notebook. pip install arize-phoenix and px.launch_app(). Phoenix runs in-process, the UI is on localhost:6006, traces stream from your code as you iterate. Best for prompt iteration and offline eval work.
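
A minimal sketch of Pattern A in a notebook cell (the project name is illustrative):

import phoenix as px
from phoenix.otel import register

# Start Phoenix in-process; the UI comes up on http://localhost:6006.
session = px.launch_app()

# Point OpenTelemetry traces from this process at the local instance.
register(project_name="prompt-iteration")

print(session.url)  # open this URL and watch traces stream in as you iterate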

Pattern B — Docker / Kubernetes. Pull the Phoenix container, point your application's OTLP exporter at it, mount Postgres for persistence. Best for production observability.

For regulated industries, Pattern B is the right default — your prompt and response data stays on infrastructure you control. That matters when the payloads contain PHI or other regulated data, or when you need an audit trail for EU AI Act compliance.

How CallSphere uses Phoenix

We run a self-hosted Phoenix cluster as part of our healthcare deployment. Every voice call's trace — turn-by-turn LLM calls, tool calls, latency breakdowns, error spans — flows from the Node runtime via OTLP to Phoenix. PHI never leaves our HIPAA-eligible AWS account.

For our IT Services UrackIT deployment (ChromaDB-backed RAG agent), Phoenix's RAG-specific eval templates are what catch retrieval regressions. When the embedding model gets bumped, we run the eval suite, see hit rate / MRR / faithfulness scores, and gate the rollout.
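
A hedged sketch of that retrieval gate, assuming the eval set carries input and reference columns (the file name, judge model, and threshold are illustrative):

import pandas as pd
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals

eval_model = OpenAIModel(model="gpt-5")
retrieval_df = pd.read_csv("retrieval_eval_set.csv")  # columns: input, reference

relevance_df = run_evals(
    dataframe=retrieval_df,
    evaluators=[RelevanceEvaluator(eval_model)],
    provide_explanation=True,
)[0]

# Gate the embedding-model bump on the aggregate relevance score.
assert relevance_df["score"].mean() >= 0.9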

We also use Phoenix's prompt playground as the day-to-day workflow for prompt iteration. You can replay a real production trace, edit the prompt, re-run, and compare side by side. That's faster than any other prompt iteration UX we've tried.

Build steps — Phoenix self-hosted on Docker

  1. docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest.
  2. In your app, set OTEL_EXPORTER_OTLP_ENDPOINT=http://phoenix:4317.
  3. Install instrumentation: pip install openinference-instrumentation-langchain (or your framework).
  4. from openinference.instrumentation.langchain import LangChainInstrumentor; LangChainInstrumentor().instrument().
  5. Open Phoenix UI on :6006. Traces appear immediately.
  6. Define eval datasets and run phoenix.evals.run_evals(...) in CI.
  7. Point Phoenix at Postgres (PHOENIX_SQL_DATABASE_URL) for durable storage; without it, traces live in the container's local database and are lost when the container restarts.

Code: instrument an agent + run evals

from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from phoenix.otel import register

tracer_provider = register(project_name="callsphere-after-hours",
                           endpoint="http://phoenix:4317")
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)

# ...your agent runs here, traces flow to Phoenix automatically.

from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
import pandas as pd

# The hallucination template expects input, reference, and output columns.
dataset = pd.read_csv("eval_set.csv")
results = run_evals(
    dataframe=dataset,
    evaluators=[HallucinationEvaluator(model=OpenAIModel(model="gpt-5"))],
    provide_explanation=True,
)

Build steps — Phoenix on Kubernetes

  1. Use the official Helm chart (in the GitHub repo).
  2. Provision Postgres for traces, S3 for span attachments.
  3. Configure SSO (Cognito, Okta) on the UI ingress.
  4. Wire your apps' OTLP exporter to the Phoenix service.
  5. Set retention: 14 days hot, 90 days cold (export to S3 + Athena).
  6. Add Prometheus alerts on Phoenix's own metrics endpoint.
  7. Quarterly: update the chart, validate eval templates still pass.

OpenInference — the instrumentation standard

Phoenix uses OpenInference, an open semantic-convention layer on top of OpenTelemetry. OpenInference defines what a "tool span" looks like, what an "LLM span" contains, what a "retrieval span" tracks. The benefit: instrumenting your agent in Phoenix means your spans are also compatible with any other OTel backend that understands OpenInference.
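
To make that concrete, here is an illustrative hand-rolled span using OpenInference attribute names; in practice the openinference-instrumentation-* packages set these for you, and the values below are made up:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-5")
    span.set_attribute("input.value", "Summarize the caller's after-hours request.")
    span.set_attribute("output.value", "Caller wants a refill callback tomorrow morning.")
    span.set_attribute("llm.token_count.prompt", 412)
    span.set_attribute("llm.token_count.completion", 58)

Any OpenInference-aware backend reads these attributes the same way Phoenix does.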

Practical implication: you can run Phoenix in dev/staging for fast iteration and pipe the same traces to Datadog or Honeycomb in production. No re-instrumentation. We do exactly this — Phoenix for trace exploration during development, Datadog for production alerting.
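
One way to get that fan-out, sketched under the assumption that a Datadog Agent (or any other OTLP endpoint) is reachable at the address shown:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from phoenix.otel import register

# Primary destination: Phoenix.
tracer_provider = register(project_name="callsphere-after-hours",
                           endpoint="http://phoenix:4317")

# Secondary destination: any other OTLP-speaking backend, same spans.
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://datadog-agent:4317"))
)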

Eval templates that ship out of the box

Phoenix's built-in evaluators are a real time-saver. The ones we use most:

  • HallucinationEvaluator — detects answers not grounded in retrieved context.
  • QAEvaluator — scores answer quality against a reference.
  • RelevanceEvaluator — scores RAG retrieval relevance.
  • ToxicityEvaluator — flags toxic or unsafe outputs.
  • CodeReadabilityEvaluator — for coding agents.

Each plugs into run_evals, which takes a DataFrame of records and returns a scored DataFrame per evaluator, with explanations when provide_explanation is set. We run these as a nightly job over the previous day's production traces and dashboard the results.
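
A sketch of that nightly pull, assuming the Phoenix endpoint and project name below; the span query maps OpenInference attributes onto the column names the evaluators expect:

import phoenix as px
from phoenix.trace.dsl import SpanQuery

client = px.Client(endpoint="http://phoenix:6006")

query = (
    SpanQuery()
    .where("span_kind == 'LLM'")
    .select(input="input.value", output="output.value")
)
llm_spans = client.query_spans(query, project_name="callsphere-after-hours")

# llm_spans is a pandas DataFrame; restrict it to the previous day's window,
# pass it to run_evals with the evaluators above, and dashboard the scores.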

Experiments — the prompt iteration loop

Phoenix's "experiments" feature pairs a dataset (real or synthetic inputs) with a function (your agent or a prompt variant) and runs evals across the combination. Useful when you want to compare three prompt variants or two retrieval strategies head-to-head. The UI shows a side-by-side comparison with diffs and eval scores.

This is our default workflow before any production prompt change: define the dataset, run the experiment with the old and new prompt, compare scores, ship the winner.
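
A sketch of that loop with the phoenix.experiments API; the dataset rows, agent call, and evaluator are illustrative stand-ins:

import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

examples = pd.DataFrame({
    "question": ["Can I book a cleaning for Tuesday evening?"],
    "expected": ["Offers the next available Tuesday slot."],
})
dataset = px.Client().upload_dataset(
    dataset_name="after-hours-scheduling",
    dataframe=examples,
    input_keys=["question"],
    output_keys=["expected"],
)

def run_candidate_prompt(input):
    # Call your agent with the candidate prompt here; stubbed for the sketch.
    return f"Answering: {input['question']}"

def offers_a_slot(output):
    # Toy heuristic evaluator; an LLM judge is the usual choice in practice.
    return "slot" in output.lower()

run_experiment(dataset, run_candidate_prompt,
               evaluators=[offers_a_slot], experiment_name="prompt-v2")
# Re-run with the other prompt variant and compare the two experiments in the UI.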

FAQ

How is Phoenix different from Arize AX? AX is the hosted commercial product with multi-tenant features, advanced governance, and SLAs. Phoenix OSS is the same trace + eval surface without the managed-service plumbing.

Does it work with the OpenAI Agents SDK? Yes — the openinference-instrumentation-openai-agents package auto-instruments it. We use it daily.

Can I run Phoenix and Helicone together? Yes. Helicone for gateway-level logging and cost; Phoenix for span-level traces and evals. Different layers.

What's the resource footprint? A single-node Phoenix container handles ~10M spans/month comfortably with Postgres on a t3.medium. Scale Postgres before Phoenix.

Where do I see this in production? Book a demo and we'll walk through a live Phoenix dashboard for our healthcare voice agent.

Does Phoenix support multi-tenancy? OSS supports project-level separation; for full multi-tenant isolation with quotas and per-team auth, Arize AX is the managed answer.

Can I export traces to other backends? Yes — Phoenix is OTel-native. Pipe the same OTLP stream to Datadog, Honeycomb, or Grafana Tempo.

How do I write a custom evaluator? Subclass LLMEvaluator or write a plain function that returns a score and explanation. The framework handles batch execution and persistence.
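
A minimal plain-function sketch of that second path (the scoring rule is a toy stand-in):

def cites_a_source(output: str) -> float:
    # Score 1.0 when the agent's answer carries a citation marker, else 0.0.
    # Phoenix binds parameters like output and expected by name when it runs
    # experiment evaluators.
    return 1.0 if "[source:" in output else 0.0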

