
Helicone OSS vs Cloud in 2026: When to Self-Host Your AI Gateway

Helicone has processed 2B+ LLM calls and ships both as a managed cloud and as fully self-hostable open source. Here is the actual decision tree for 2026.

TL;DR: Helicone is a one-line proxy that gives you logging, caching, cost tracking, and a dashboard for any LLM API call. The OSS version is feature-equal on the core observability surface; the Cloud adds managed infrastructure you don't have to operate, at roughly 50-80ms of added proxy latency. Pick based on how much infra pain you can absorb.

What Helicone is, in one sentence

[Diagram: CallSphere reference architecture. GitHub repo → GitHub Actions → agent eval suite (PromptFoo) → deploy on pass / block PR on fail → production agent → LangSmith trace, which feeds back into the eval suite.]

Helicone is an AI gateway: your app sends LLM requests through it instead of straight to the provider, and Helicone logs the request, the response, the token usage, the latency, and the cost. A one-line code change (swap api.openai.com for oai.helicone.ai) and you have observability.
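In the OpenAI Node SDK that change looks like this (minimal sketch; HELICONE_API_KEY is your Helicone key, and the fuller tagging example appears later in the article):

import OpenAI from "openai";

// Swap the base URL and every request is proxied and logged by Helicone.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",   // was https://api.openai.com/v1
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});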

The OSS version (Apache 2.0, on GitHub) and the Cloud version (helicone.ai) share the same backend architecture: Cloudflare Workers + ClickHouse + Kafka. Helicone's pitch is that they've processed 2 billion+ LLM interactions with 50-80ms average added latency.

Cloud vs Self-Hosted

| Dimension | Helicone Cloud | Helicone OSS (self-hosted) |
| --- | --- | --- |
| Setup time | 5 minutes | A weekend (Docker, Kubernetes, or manual) |
| Free tier | 10k requests/month | Unlimited (your infra cost) |
| Paid plans | Starts $79/month | $0 software cost; pay for hosting |
| Data residency | Helicone's infra | Your VPC, your country |
| Infrastructure ownership | Helicone runs CF Workers + ClickHouse + Kafka | You run them |
| Updates | Automatic | You pull and redeploy |
| Compliance posture | SOC 2, ISO 27001 | Whatever you certify |

For early-stage and growth-stage teams, Cloud is the right default. For regulated industries (healthcare, defense, finance) where data residency matters more than ops cost, self-host.

When to self-host

Real reasons to take on the operational cost:

  1. Regulatory data residency — EU AI Act, HIPAA, FedRAMP. Logs and prompts can't leave your VPC.
  2. PII volume — you're logging requests that contain regulated data and the legal review takes 6 months for any third-party processor.
  3. Sovereign cloud requirement — government or defense workloads with no exceptions.
  4. Massive volume — at 100M+ requests/month, the per-request math eventually favors operating ClickHouse yourself (rough break-even sketch below).
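
A back-of-envelope way to frame that break-even. Every number below is a hypothetical placeholder, not Helicone's pricing; substitute your own quote and infra estimate:

// Rough monthly break-even between Helicone Cloud and self-hosted OSS.
// All inputs are assumptions; replace them with your real numbers.
const monthlyRequests = 100_000_000;           // your request volume
const cloudCostPerMillionReqs = 5;             // hypothetical managed price per 1M requests (USD)
const selfHostInfraPerMonth = 1_500;           // ClickHouse + Kafka + workers + storage (USD)
const selfHostOpsTimePerMonth = 0.25 * 15_000; // fraction of a loaded engineer-month (USD)

const cloudTotal = (monthlyRequests / 1_000_000) * cloudCostPerMillionReqs;
const ossTotal = selfHostInfraPerMonth + selfHostOpsTimePerMonth;

console.log({ cloudTotal, ossTotal, selfHostingWins: ossTotal < cloudTotal });

The point is the crossover, not the specific inputs: below it, the managed plan plus zero ops time usually wins.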

If none of those apply, Cloud is cheaper end-to-end when you account for engineering time.


How CallSphere uses it

CallSphere runs Helicone Cloud for all non-voice LLM traffic — content generation, scrapers, after-hours summarization, GTM automation. Voice runtime traces go to LangSmith and our internal Postgres-backed observability layer because the Helicone proxy hop is incompatible with WebRTC's session-based auth.

For our healthcare and behavioral-health verticals, where prompts may contain protected health information, we self-host the OSS version inside our HIPAA-eligible AWS account. Same Helicone UI, same ClickHouse, just under our BAA. The OSS path is genuinely production-grade — we audit-log every prompt and response with no third-party processor in the path.


Build steps — Helicone Cloud in 5 minutes

  1. Sign up at helicone.ai. Grab the proxy URL and an API key.
  2. Swap your OpenAI base URL: baseURL: "https://oai.helicone.ai/v1".
  3. Add the auth header: "Helicone-Auth": "Bearer hl-...".
  4. Tag requests with Helicone-Property-User and Helicone-Property-Workflow so you can slice in the dashboard.
  5. Enable caching for deterministic prompts: "Helicone-Cache-Enabled": "true".
  6. Set a per-user budget alert.
  7. Wire the dashboard to your on-call Slack.

Build steps — Helicone OSS self-hosted

  1. Clone Helicone/helicone and read docker-compose.yml.
  2. Stand up Postgres, ClickHouse, MinIO, and the proxy worker.
  3. Configure your S3 bucket for request bodies (audit-grade).
  4. Point your application at the self-hosted proxy URL.
  5. Configure SSO (Cognito, Okta, Google Workspace) for the dashboard.
  6. Set up daily ClickHouse backups; logs are your audit trail.
  7. Subscribe to the Helicone GitHub releases; review and apply.

Code: tag every request

import OpenAI from "openai";

// userId and tenantId come from your request context.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",  // or your self-hosted URL
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Property-Workflow": "post-call-summary", // custom properties become dashboard filters
    "Helicone-Property-User": userId,
    "Helicone-Property-Tenant": tenantId,
    "Helicone-Cache-Enabled": "true",                   // serve repeat prompts from cache
    "Helicone-Cache-Bucket-Max-Size": "20",             // max cached responses kept per bucket
  },
});

Caching strategies that pay for themselves

Helicone's caching is the underrated cost-saver. Three patterns we use:

  1. Deterministic prompt cache. For prompts where the same input always produces the same output (classifiers, parsers, embeddings), enable the cache with a long TTL. Hit rates over 30% on our SEO content classifier dropped that workload's cost by 28%.
  2. User-bucketed cache. For per-user assistants, scope the cache per user with a per-user Helicone-Cache-Seed (with Helicone-Cache-Bucket-Max-Size capping entries per bucket) so cache entries are isolated per user. No cross-user contamination. See the header sketch after this list.
  3. Burst protection. When a popular content URL spikes, Helicone's cache absorbs the burst before it hits the model provider. Saves us from rate-limit pain.
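
What those patterns look like as request headers. A sketch only: Helicone-Cache-Enabled and Helicone-Cache-Bucket-Max-Size appear in the client config above, while the Cache-Control TTL and Helicone-Cache-Seed headers are our reading of Helicone's caching docs and worth verifying before relying on them.

// Pattern 1: deterministic prompt cache with a long TTL (classifier / parser workloads).
const deterministicHeaders = {
  "Helicone-Cache-Enabled": "true",
  "Cache-Control": "max-age=604800",        // 7-day TTL (assumed TTL mechanism)
};

// Pattern 2: per-user cache isolation for assistants.
const perUserHeaders = (userId: string) => ({
  "Helicone-Cache-Enabled": "true",
  "Helicone-Cache-Seed": userId,            // assumed per-user cache scoping header
  "Helicone-Cache-Bucket-Max-Size": "20",
});

// Pattern 3: burst protection is pattern 1 with a short TTL in front of spiky traffic.
const burstHeaders = {
  "Helicone-Cache-Enabled": "true",
  "Cache-Control": "max-age=300",           // 5 minutes is enough to absorb a spike
};

Pass whichever object fits the workload via the OpenAI SDK's per-request options, e.g. openai.chat.completions.create(body, { headers: perUserHeaders(userId) }).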

Cost analytics — slicing the bill

The Helicone dashboard slices cost by user, model, prompt, workflow, or any custom property you tagged. Two views we check daily:

  • Cost per workflow. Surfaces the agent that quietly tripled in tokens after a prompt change.
  • Cost per user. Surfaces the customer whose AI usage is exceeding their plan budget. Useful for quota enforcement and upsell.

For our Sales product, we tag every request with the customer tenant and the agent name. Within minutes of deploying we have an exact bill per tenant per agent — that fed directly into the per-tenant pricing page math.
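
You don't need a separate client per tenant for that attribution; the OpenAI Node SDK accepts headers per request. A sketch, using the client configured earlier (the Tenant and Agent property names are our own convention, not required by Helicone; transcript, tenantId, and agentName come from your request context):

// Tag a single request with tenant + agent so the dashboard can slice
// cost per tenant per agent. Properties are arbitrary key/value tags.
const summary = await openai.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: transcript }],
  },
  {
    headers: {
      "Helicone-Property-Tenant": tenantId,
      "Helicone-Property-Agent": agentName,
      "Helicone-Property-Workflow": "post-call-summary",
    },
  },
);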

Failover and rate-limit smoothing

Helicone's gateway can also act as a provider failover layer. If OpenAI rate-limits or returns 5xx, Helicone retries against a fallback (Anthropic, Azure OpenAI, Bedrock) without the application knowing. Configure it once in the headers; the application code never changes.

"Helicone-Fallbacks": JSON.stringify([
  { "target-url": "https://api.openai.com", "weight": 0.7 },
  { "target-url": "https://api.anthropic.com", "weight": 0.3 },
]),

We use this for non-voice batch workloads where the model can be substituted; we do not use it for voice (the model behavior differences are too pronounced for live calls).

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

FAQ

How much latency does Helicone add? Average 50-80ms in our measurements; usually below the variance of the LLM call itself.

Does Helicone work with Anthropic and Gemini? Yes — distinct proxy URLs per provider.

Can I switch from Cloud to OSS later? Yes. The data model is the same; you can re-import or just start fresh.

What does it not do? It doesn't run agent-level evals. Pair Helicone (gateway observability) with Phoenix or Promptfoo (agent evals).

Where do I see this on CallSphere? Book a demo and we'll show the dashboard for our SEO content engine.

Can I run multiple Helicone instances in parallel? Yes — different Helicone-Auth keys, different dashboards. Useful when you want to isolate environments (staging vs prod) or business units.

How does Helicone compare to OpenLLMetry? OpenLLMetry is a pure OTel instrumentation library (no proxy hop). Helicone is a proxy-based gateway with a UI and caching. Different abstractions; not mutually exclusive.

Does Helicone support custom models? Yes — any OpenAI-compatible endpoint works. We've routed Together AI, Groq, and our own vLLM-hosted models through Helicone with no extra work.
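
A sketch of what that looks like for a self-hosted vLLM endpoint. We route these through Helicone's generic gateway; the gateway.helicone.ai base URL and the Helicone-Target-Url header are from our reading of Helicone's docs, so treat them as assumptions to verify, and vllm.internal.example.com is a placeholder for your own endpoint:

// OpenAI-compatible client pointed at Helicone's generic gateway, which then
// forwards to our own vLLM server (assumed base URL and header; verify against docs).
const vllm = new OpenAI({
  apiKey: "unused-for-self-hosted-models",
  baseURL: "https://gateway.helicone.ai",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Target-Url": "https://vllm.internal.example.com",  // placeholder endpoint
  },
});

If the target speaks the OpenAI API, nothing else in the application changes; Helicone still logs the request, response, and latency the same way.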

