---
title: "Helicone OSS vs Cloud in 2026: When to Self-Host Your AI Gateway"
description: "Helicone has processed 2B+ LLM calls and ships both as a managed cloud and as fully self-hostable open source. Here is the actual decision tree for 2026."
canonical: https://callsphere.ai/blog/vw3g-helicone-open-source-vs-cloud-llm-observability-2026
category: "AI Infrastructure"
tags: ["Helicone", "Observability", "AI Gateway", "Open Source", "Self-Hosted"]
author: "CallSphere Team"
published: 2026-04-12T00:00:00.000Z
updated: 2026-05-07T09:59:38.279Z
---

# Helicone OSS vs Cloud in 2026: When to Self-Host Your AI Gateway

> Helicone has processed 2B+ LLM calls and ships both as a managed cloud and as fully self-hostable open source. Here is the actual decision tree for 2026.

> **TL;DR** — Helicone is a one-line proxy that gives you logging, caching, cost tracking, and a dashboard for any LLM API call. The OSS version is feature-equal on the core observability surface; the Cloud adds managed infrastructure you don't have to operate, at the cost of roughly 50-80ms of proxy latency. Pick by how much infra pain you can absorb.

## What Helicone is, in one sentence

```mermaid
flowchart LR
  App[Your app] -->|one-line base URL swap| Gateway[Helicone gateway]
  Gateway --> Provider[LLM provider]
  Provider --> Gateway
  Gateway --> Store[(ClickHouse logs)]
  Store --> Dash[Dashboard · cost, tokens, latency]
  Gateway -.-> Cache[(Response cache)]
```

CallSphere reference architecture: LLM traffic routed through the Helicone gateway

Helicone is an **AI gateway** — your app sends LLM requests through it instead of straight to the provider, and Helicone logs the request, the response, the token usage, the latency, and the cost. Change one line of code (swap `api.openai.com` for `oai.helicone.ai`) and you have observability.

The OSS version (Apache 2.0, on GitHub) and the Cloud version (helicone.ai) share the same backend architecture: Cloudflare Workers + ClickHouse + Kafka. Helicone's pitch is that they've processed 2 billion+ LLM interactions with 50-80ms average added latency.

## Cloud vs Self-Hosted

| Dimension | Helicone Cloud | Helicone OSS (self-hosted) |
| --- | --- | --- |
| Setup time | 5 minutes | A weekend (Docker, Kubernetes, or manual) |
| Free tier | 10k requests/month | Unlimited (your infra cost) |
| Paid plans | Starts at $79/month | $0 software cost; pay for hosting |
| Data residency | Helicone's infra | Your VPC, your country |
| Infrastructure ownership | Helicone runs CF Workers + ClickHouse + Kafka | You run them |
| Updates | Automatic | You pull and redeploy |
| Compliance posture | SOC 2, ISO 27001 | Whatever you certify |

For early-stage and growth-stage teams, Cloud is the right default. For regulated industries (healthcare, defense, finance) where data residency matters more than ops cost, self-host.

## When to self-host

Real reasons to take on the operational cost:

1. **Regulatory data residency** — EU AI Act, HIPAA, FedRAMP. Logs and prompts can't leave your VPC.
2. **PII volume** — you're logging requests that contain regulated data and the legal review takes 6 months for any third-party processor.
3. **Sovereign cloud requirement** — government or defense workloads with no exceptions.
4. **Massive volume** — at 100M+ requests/month, the per-request math eventually favors operating ClickHouse yourself (back-of-envelope sketch below).

If none of those apply, Cloud is cheaper end-to-end when you account for engineering time.
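
A rough way to run that math — this is a sketch with entirely hypothetical inputs (none of the numbers below come from Helicone's price list); plug in your own Cloud quote, hosting bill, and an honest estimate of ops time:

```typescript
// Break-even point for self-hosting, all inputs hypothetical.
function selfHostBreakEvenRequests(
  cloudCostPerRequest: number, // effective Cloud price per logged request
  infraMonthly: number,        // ClickHouse + Postgres + MinIO + proxy hosting
  engHoursMonthly: number,     // ops time you will actually spend
  engHourlyRate: number,
): number {
  const selfHostMonthly = infraMonthly + engHoursMonthly * engHourlyRate;
  return selfHostMonthly / cloudCostPerRequest;
}

// Example: $0.00005/request vs $2,000 infra + 25 h/month at $120/h
// => break-even around 100M requests/month.
console.log(selfHostBreakEvenRequests(0.00005, 2000, 25, 120)); // 100000000
```

The engineering-time term is the one teams tend to under-count, which is why Cloud usually wins below that volume.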

## How CallSphere uses it

CallSphere runs **Helicone Cloud** for all non-voice LLM traffic — content generation, scrapers, after-hours summarization, GTM automation. Voice runtime traces go to LangSmith and our internal Postgres-backed observability layer because the Helicone proxy hop is incompatible with WebRTC's session-based auth.

For our [healthcare](/industries/healthcare) and behavioral-health verticals, where prompts may contain protected health information, we self-host the OSS version inside our HIPAA-eligible AWS account. Same Helicone UI, same ClickHouse, just under our BAA. The OSS path is genuinely production-grade — we audit-log every prompt and response with no third-party processor in the path.

CallSphere pricing: [$149 / $499 / $1,499](/pricing), with a [14-day trial](/trial) and a [22% affiliate program](/affiliate).

## Build steps — Helicone Cloud in 5 minutes

1. Sign up at helicone.ai. Grab the proxy URL and an API key.
2. Swap your OpenAI base URL: `baseURL: "https://oai.helicone.ai/v1"` (steps 2-5 come together in the smoke-test sketch after this list).
3. Add the auth header: `"Helicone-Auth": "Bearer hl-..."`.
4. Tag requests with `Helicone-Property-User` and `Helicone-Property-Workflow` so you can slice in the dashboard.
5. Enable caching for deterministic prompts: `"Helicone-Cache-Enabled": "true"`.
6. Set a per-user budget alert.
7. Wire the dashboard to your on-call Slack.
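
A minimal smoke test for steps 2 through 5 — a sketch assuming the official `openai` Node SDK; the model name and the `smoke-test` property value are just placeholders:

```typescript
import OpenAI from "openai";

// Route one request through Helicone Cloud and confirm it shows up in the dashboard.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Property-Workflow": "smoke-test", // custom property, any name works
    "Helicone-Cache-Enabled": "true",
  },
});

const res = await openai.chat.completions.create({
  model: "gpt-4o-mini", // any chat model you have access to
  messages: [{ role: "user", content: "ping" }],
});

console.log(res.choices[0].message.content);
// The request should now appear in the dashboard, tagged with workflow "smoke-test".
```

If the request logs, repeat the same header block in your real clients and move on to budgets and alerting (steps 6 and 7), which live in the dashboard rather than in code.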

## Build steps — Helicone OSS self-hosted

1. Clone `Helicone/helicone` and read `docker-compose.yml`.
2. Stand up Postgres, ClickHouse, MinIO, and the proxy worker.
3. Configure your S3 bucket for request bodies (audit-grade).
4. Point your application at the self-hosted proxy URL (see the sketch after this list).
5. Configure SSO (Cognito, Okta, Google Workspace) for the dashboard.
6. Set up daily ClickHouse backups; logs are your audit trail.
7. Subscribe to the Helicone GitHub releases; review and apply.
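
For step 4, we keep the base URL in an environment variable so the same client code runs against Cloud or the self-hosted gateway — a sketch; `HELICONE_BASE_URL` and the internal hostname are hypothetical names, not Helicone defaults:

```typescript
import OpenAI from "openai";

// One env var decides whether traffic goes through Helicone Cloud or the
// gateway running inside your VPC.
// e.g. HELICONE_BASE_URL=https://helicone-gw.internal.example.com/v1  (hypothetical)
const baseURL = process.env.HELICONE_BASE_URL ?? "https://oai.helicone.ai/v1";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL,
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});
```

It also makes the Cloud-to-OSS migration mentioned in the FAQ a config change rather than a code change.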

## Code: tag every request

```typescript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",  // or your self-hosted URL
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    // Custom properties: slice cost, latency, and volume by these in the dashboard.
    "Helicone-Property-Workflow": "post-call-summary",
    "Helicone-Property-User": userId,      // from your request context
    "Helicone-Property-Tenant": tenantId,  // from your request context
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Bucket-Max-Size": "20", // max cached responses per bucket (see Helicone caching docs)
  },
});
```

## Caching strategies that pay for themselves

Helicone's caching is the underrated cost-saver. Three patterns we use:

1. **Deterministic prompt cache.** For prompts where the same input always produces the same output (classifiers, parsers, embeddings), enable the cache with a long TTL (sketch after this list). Hit rates over 30% on our SEO content classifier dropped that workload's cost by 28%.
2. **User-bucketed cache.** For per-user assistants, set `Helicone-Cache-Bucket-Max-Size` so cache entries are isolated per user. No cross-user contamination.
3. **Burst protection.** When a popular content URL spikes, Helicone's cache absorbs the burst before it hits the model provider. Saves us from rate-limit pain.
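
A sketch of pattern 1, assuming cache TTL is set with a standard `Cache-Control: max-age` request header as described in Helicone's caching docs (double-check the header names against the version you run):

```typescript
import OpenAI from "openai";

// Dedicated client for deterministic workloads (classifiers, parsers):
// same prompt in, same answer out, so cache aggressively.
const classifier = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400", // 24 h TTL for a stable classifier prompt
  },
});
```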

## Cost analytics — slicing the bill

The Helicone dashboard slices cost by user, model, prompt, workflow, or any custom property you tagged. Two views we check daily:

- **Cost per workflow.** Surfaces the agent that quietly tripled in tokens after a prompt change.
- **Cost per user.** Surfaces the customer whose AI usage is exceeding their plan budget. Useful for quota enforcement and upsell.

For our [Sales product](/industries/it-services), we tag every request with the customer tenant and the agent name. Within minutes of deploying we have an exact bill per tenant per agent — that fed directly into the per-tenant pricing page math.
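
Custom properties are just request headers prefixed `Helicone-Property-`, so per-request tagging is a one-argument change — a sketch; the property names, model, and helper function are ours, not a Helicone convention:

```typescript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: { "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}` },
});

// Per-request headers merge with the client defaults, so each call can carry
// the billing dimensions for that tenant and agent.
async function summarize(tenantId: string, transcript: string) {
  return openai.chat.completions.create(
    {
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: `Summarize this call:\n${transcript}` }],
    },
    {
      headers: {
        "Helicone-Property-Tenant": tenantId,
        "Helicone-Property-Agent": "after-hours-summarizer",
      },
    },
  );
}
```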

## Failover and rate-limit smoothing

Helicone's gateway can also act as a **provider failover layer**. If OpenAI rate-limits or returns 5xx, Helicone retries against a fallback (Anthropic, Azure OpenAI, Bedrock) without the application knowing. Configure it once at the gateway and every caller inherits the failover behavior.

```typescript
// Goes in the same defaultHeaders block as Helicone-Auth above.
"Helicone-Fallbacks": JSON.stringify([
  { "target-url": "https://api.openai.com", "weight": 0.7 },
  { "target-url": "https://api.anthropic.com", "weight": 0.3 },
]),
```

We use this for non-voice batch workloads where the model can be substituted; we do *not* use it for voice (the model behavior differences are too pronounced for live calls).

## FAQ

**How much latency does Helicone add?** Average 50-80ms in our measurements; usually below the variance of the LLM call itself.

**Does Helicone work with Anthropic and Gemini?** Yes — distinct proxy URLs per provider.

**Can I switch from Cloud to OSS later?** Yes. The data model is the same; you can re-import or just start fresh.

**What does it not do?** It doesn't run agent-level evals. Pair Helicone (gateway observability) with Phoenix or Promptfoo (agent evals).

**Where do I see this on CallSphere?** Book a [demo](/demo) and we'll show the dashboard for our SEO content engine.

**Can I run multiple Helicone instances in parallel?** Yes — different Helicone-Auth keys, different dashboards. Useful when you want to isolate environments (staging vs prod) or business units.
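
In practice that's just a key per environment — a minimal sketch; the env-var names are ours:

```typescript
// Each Helicone-Auth key logs to its own dashboard, so staging traffic
// never pollutes prod cost analytics.
const heliconeKey =
  process.env.NODE_ENV === "production"
    ? process.env.HELICONE_PROD_KEY
    : process.env.HELICONE_STAGING_KEY;
```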

**How does Helicone compare to OpenLLMetry?** OpenLLMetry is a pure OTel instrumentation library (no proxy hop). Helicone is a proxy-based gateway with a UI and caching. Different abstractions; not mutually exclusive.

**Does Helicone support custom models?** Yes — any OpenAI-compatible endpoint works. We've routed Together AI, Groq, and our own vLLM-hosted models through Helicone with no extra work.

## Sources

- [Helicone on GitHub](https://github.com/Helicone/helicone)
- [Helicone OSS docs](https://docs.helicone.ai/references/open-source)
- [Best LLM Observability Tools 2026](https://www.firecrawl.dev/blog/best-llm-observability-tools)
- [Helicone Alternatives 2026](https://latitude.so/blog/helicone-alternatives)

