AI Infrastructure

Arize Phoenix in 2026: The Self-Hosted Observability Stack for Agents

Phoenix accepts OpenTelemetry traces from any framework, runs anywhere from your laptop to Kubernetes, and is fully open source with no feature gates.

TL;DR — Phoenix is what you wire up when you need full control of your agent traces and evals on infrastructure you own. It's OTel-native, framework-agnostic, fully open source, and runs from a Jupyter cell up to a multi-node Kubernetes deployment with no feature gates.

What Phoenix is

flowchart LR
  Repo[GitHub repo] --> CI[GitHub Actions]
  CI --> Eval[Agent eval suite · Phoenix evals]
  Eval -->|pass| Deploy[Deploy]
  Eval -->|fail| Block[Block PR]
  Deploy --> Prod[Production agent]
  Prod --> Trace[(Phoenix trace)]
  Trace --> Eval
CallSphere reference architecture

Phoenix (Arize-ai/phoenix on GitHub) is an open-source AI observability and evaluation platform. It accepts traces over OpenTelemetry (OTLP), ships auto-instrumentation for every major framework (LangChain, LlamaIndex, OpenAI Agents SDK, CrewAI, AutoGen, Pydantic AI, smolagents, deepagents), and provides a UI for trace exploration, eval scoring, prompt iteration, and experiments.

Critical distinction from competitors: Phoenix is fully open source and self-hostable with no feature gates. The hosted Arize AX product is a separate paid offering for teams that want a managed service; the OSS Phoenix offering is feature-complete on its own.

Why Phoenix wins for self-hosted

Three properties matter:

  1. OTel-native. Your traces are portable. If you outgrow Phoenix you can pipe the same OTLP stream into Datadog, Honeycomb, or Grafana Tempo without re-instrumenting.
  2. Framework-agnostic auto-instrumentation. openinference-instrumentation-* packages exist for nearly every popular framework — one line wires the spans.
  3. No feature gates. Eval templates, experiments, prompt playground — all in OSS. No "upgrade to Pro to see..." walls.

Where Phoenix lives in your stack

Two common patterns:

Pattern A — laptop / notebook. pip install arize-phoenix and px.launch_app(). Phoenix runs in-process, the UI is on localhost:6006, traces stream from your code as you iterate. Best for prompt iteration and offline eval work.
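
A minimal sketch of Pattern A in a notebook cell (the project name is illustrative):

import phoenix as px
from phoenix.otel import register

# Start Phoenix in-process; the UI comes up on http://localhost:6006.
session = px.launch_app()

# Point OpenTelemetry traces from this process at the local instance.
register(project_name="prompt-iteration")

print(session.url)  # open this URL and watch traces stream in as you iterate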

Pattern B — Docker / Kubernetes. Pull the Phoenix container, point your application's OTLP exporter at it, mount Postgres for persistence. Best for production observability.

For regulated industries, Pattern B is the right default — your prompt and response data stays on infrastructure you control. That matters when the payloads contain PHI or other regulated data, or when you need an audit trail for EU AI Act compliance.

How CallSphere uses Phoenix

We run a self-hosted Phoenix cluster as part of our healthcare deployment. Every voice call's trace — turn-by-turn LLM calls, tool calls, latency breakdowns, error spans — flows from the Node runtime via OTLP to Phoenix. PHI never leaves our HIPAA-eligible AWS account.

For our IT Services UrackIT deployment (ChromaDB-backed RAG agent), Phoenix's RAG-specific eval templates are what catch retrieval regressions. When the embedding model gets bumped, we run the eval suite, see hit rate / MRR / faithfulness scores, and gate the rollout.
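
A hedged sketch of that retrieval gate, assuming the eval set carries input and reference columns (the file name, judge model, and threshold are illustrative):

import pandas as pd
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals

eval_model = OpenAIModel(model="gpt-5")
retrieval_df = pd.read_csv("retrieval_eval_set.csv")  # columns: input, reference

relevance_df = run_evals(
    dataframe=retrieval_df,
    evaluators=[RelevanceEvaluator(eval_model)],
    provide_explanation=True,
)[0]

# Gate the embedding-model bump on the aggregate relevance score.
assert relevance_df["score"].mean() >= 0.9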

We also use Phoenix's prompt playground as the day-to-day workflow for prompt iteration. You can replay a real production trace, edit the prompt, re-run, and compare side by side. That's faster than any other prompt iteration UX we've tried.

Build steps — Phoenix self-hosted on Docker

  1. docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest.
  2. In your app, set OTEL_EXPORTER_OTLP_ENDPOINT=http://phoenix:4317.
  3. Install instrumentation: pip install openinference-instrumentation-langchain (or your framework).
  4. from openinference.instrumentation.langchain import LangChainInstrumentor; LangChainInstrumentor().instrument().
  5. Open Phoenix UI on :6006. Traces appear immediately.
  6. Define eval datasets and run phoenix.evals.run_evals(...) in CI.
  7. Point Phoenix at Postgres (PHOENIX_SQL_DATABASE_URL) for durable storage; without it, traces live in the container's local database and are lost when the container restarts.

Code: instrument an agent + run evals

from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from phoenix.otel import register

tracer_provider = register(project_name="callsphere-after-hours",
                           endpoint="http://phoenix:4317")
OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)

# ...your agent runs here, traces flow to Phoenix automatically.

from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
import pandas as pd

# The hallucination template expects input, reference, and output columns.
dataset = pd.read_csv("eval_set.csv")
results = run_evals(
    dataframe=dataset,
    evaluators=[HallucinationEvaluator(model=OpenAIModel(model="gpt-5"))],
    provide_explanation=True,
)

Build steps — Phoenix on Kubernetes

  1. Use the official Helm chart (in the GitHub repo).
  2. Provision Postgres for traces, S3 for span attachments.
  3. Configure SSO (Cognito, Okta) on the UI ingress.
  4. Wire your apps' OTLP exporter to the Phoenix service.
  5. Set retention: 14 days hot, 90 days cold (export to S3 + Athena).
  6. Add Prometheus alerts on Phoenix's own metrics endpoint.
  7. Quarterly: update the chart, validate eval templates still pass.

OpenInference — the instrumentation standard

Phoenix uses OpenInference, an open semantic-convention layer on top of OpenTelemetry. OpenInference defines what a "tool span" looks like, what an "LLM span" contains, what a "retrieval span" tracks. The benefit: instrumenting your agent in Phoenix means your spans are also compatible with any other OTel backend that understands OpenInference.
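
To make that concrete, here is an illustrative hand-rolled span using OpenInference attribute names; in practice the openinference-instrumentation-* packages set these for you, and the values below are made up:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("openinference.span.kind", "LLM")
    span.set_attribute("llm.model_name", "gpt-5")
    span.set_attribute("input.value", "Summarize the caller's after-hours request.")
    span.set_attribute("output.value", "Caller wants a refill callback tomorrow morning.")
    span.set_attribute("llm.token_count.prompt", 412)
    span.set_attribute("llm.token_count.completion", 58)

Any OpenInference-aware backend reads these attributes the same way Phoenix does.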

Practical implication: you can run Phoenix in dev/staging for fast iteration and pipe the same traces to Datadog or Honeycomb in production. No re-instrumentation. We do exactly this — Phoenix for trace exploration during development, Datadog for production alerting.
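
One way to get that fan-out, sketched under the assumption that a Datadog Agent (or any other OTLP endpoint) is reachable at the address shown:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from phoenix.otel import register

# Primary destination: Phoenix.
tracer_provider = register(project_name="callsphere-after-hours",
                           endpoint="http://phoenix:4317")

# Secondary destination: any other OTLP-speaking backend, same spans.
tracer_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://datadog-agent:4317"))
)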

Eval templates that ship out of the box

Phoenix's built-in evaluators are a real time-saver. The ones we use most:

  • HallucinationEvaluator — detects answers not grounded in retrieved context.
  • QAEvaluator — scores answer quality against a reference.
  • RelevanceEvaluator — scores RAG retrieval relevance.
  • ToxicityEvaluator — flags toxic or unsafe outputs.
  • CodeReadabilityEvaluator — for coding agents.

Each plugs into run_evals, which takes a DataFrame of records and returns a scored DataFrame per evaluator, with explanations when provide_explanation is set. We run these as a nightly job over the previous day's production traces and dashboard the results.
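
A sketch of that nightly pull, assuming the Phoenix endpoint and project name below; the span query maps OpenInference attributes onto the column names the evaluators expect:

import phoenix as px
from phoenix.trace.dsl import SpanQuery

client = px.Client(endpoint="http://phoenix:6006")

query = (
    SpanQuery()
    .where("span_kind == 'LLM'")
    .select(input="input.value", output="output.value")
)
llm_spans = client.query_spans(query, project_name="callsphere-after-hours")

# llm_spans is a pandas DataFrame; restrict it to the previous day's window,
# pass it to run_evals with the evaluators above, and dashboard the scores.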

Experiments — the prompt iteration loop

Phoenix's "experiments" feature pairs a dataset (real or synthetic inputs) with a function (your agent or a prompt variant) and runs evals across the combination. Useful when you want to compare three prompt variants or two retrieval strategies head-to-head. The UI shows a side-by-side comparison with diffs and eval scores.

This is our default workflow before any production prompt change: define the dataset, run the experiment with the old and new prompt, compare scores, ship the winner.
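
A sketch of that loop with the phoenix.experiments API; the dataset rows, agent call, and evaluator are illustrative stand-ins:

import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

examples = pd.DataFrame({
    "question": ["Can I book a cleaning for Tuesday evening?"],
    "expected": ["Offers the next available Tuesday slot."],
})
dataset = px.Client().upload_dataset(
    dataset_name="after-hours-scheduling",
    dataframe=examples,
    input_keys=["question"],
    output_keys=["expected"],
)

def run_candidate_prompt(input):
    # Call your agent with the candidate prompt here; stubbed for the sketch.
    return f"Answering: {input['question']}"

def offers_a_slot(output):
    # Toy heuristic evaluator; an LLM judge is the usual choice in practice.
    return "slot" in output.lower()

run_experiment(dataset, run_candidate_prompt,
               evaluators=[offers_a_slot], experiment_name="prompt-v2")
# Re-run with the other prompt variant and compare the two experiments in the UI.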

FAQ

How is Phoenix different from Arize AX? AX is the hosted commercial product with multi-tenant features, advanced governance, and SLAs. Phoenix OSS is the same trace + eval surface without the managed-service plumbing.

Does it work with the OpenAI Agents SDK? Yes — the openinference-instrumentation-openai-agents package auto-instruments it. We use it daily.

Can I run Phoenix and Helicone together? Yes. Helicone for gateway-level logging and cost; Phoenix for span-level traces and evals. Different layers.

What's the resource footprint? A single-node Phoenix container handles ~10M spans/month comfortably with Postgres on a t3.medium. Scale Postgres before Phoenix.

Where do I see this in production? Book a demo and we'll walk through a live Phoenix dashboard for our healthcare voice agent.

Does Phoenix support multi-tenancy? OSS supports project-level separation; for full multi-tenant isolation with quotas and per-team auth, Arize AX is the managed answer.

Can I export traces to other backends? Yes — Phoenix is OTel-native. Pipe the same OTLP stream to Datadog, Honeycomb, or Grafana Tempo.

How do I write a custom evaluator? Subclass LLMEvaluator or write a plain function that returns a score and explanation. The framework handles batch execution and persistence.
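
A minimal plain-function sketch of that second path (the scoring rule is a toy stand-in):

def cites_a_source(output: str) -> float:
    # Score 1.0 when the agent's answer carries a citation marker, else 0.0.
    # Phoenix binds parameters like output and expected by name when it runs
    # experiment evaluators.
    return 1.0 if "[source:" in output else 0.0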

