
LiteLLM: A Unified Interface for 100+ LLM Providers in Agent Applications

Set up LiteLLM to call OpenAI, Anthropic, Mistral, Ollama, and 100+ other providers through a single API. Implement fallbacks, load balancing, and cost tracking for production agents.

The Multi-Provider Problem

Production agent systems rarely depend on a single LLM provider. You might use GPT-4o for complex reasoning, Claude for long-context tasks, Mistral for cost-effective classification, and a local Ollama model for development. Each provider has a different API format, authentication mechanism, and error handling behavior.

LiteLLM solves this by providing a single completion() function that translates your request to any of 100+ providers. You write your code once, and LiteLLM handles the API differences, retry logic, and response normalization.

Installation and Basic Usage

Install LiteLLM:

pip install litellm

The core API mirrors OpenAI's interface. To switch providers, you usually change only the model string (local backends like Ollama also need an api_base):

import litellm

# OpenAI
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello from OpenAI"}],
)

# Anthropic — same interface
response = litellm.completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello from Anthropic"}],
)

# Mistral
response = litellm.completion(
    model="mistral/mistral-large-latest",
    messages=[{"role": "user", "content": "Hello from Mistral"}],
)

# Local Ollama
response = litellm.completion(
    model="ollama/llama3.1:8b",
    messages=[{"role": "user", "content": "Hello from Ollama"}],
    api_base="http://localhost:11434",
)

# All responses have the same structure
print(response.choices[0].message.content)

Set API keys via environment variables:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."

The LiteLLM Proxy Server

For production, run LiteLLM as a proxy server that your agents connect to. This centralizes API key management, logging, and cost tracking:

# litellm_config.yaml
model_list:
  - model_name: "fast-agent"
    litellm_params:
      model: "gpt-4o-mini"
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: "smart-agent"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "os.environ/ANTHROPIC_API_KEY"

  - model_name: "local-agent"
    litellm_params:
      model: "ollama/llama3.1:8b"
      api_base: "http://localhost:11434"

  - model_name: "smart-agent"  # Same name: load-balanced with the Claude deployment, and a retry target
    litellm_params:
      model: "gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"

Start the proxy:

litellm --config litellm_config.yaml --port 4000

Now your agents connect to http://localhost:4000 using the standard OpenAI client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-anything",  # Proxy handles real keys
)

response = client.chat.completions.create(
    model="smart-agent",  # Routes to Claude, falls back to GPT-4o
    messages=[{"role": "user", "content": "Analyze this data..."}],
)

Implementing Fallbacks

Provider outages happen. LiteLLM supports automatic fallbacks so your agent keeps working when one provider goes down:

import litellm
from litellm import completion

# Fallback chain: try Claude first, then GPT-4o, then local
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    fallbacks=["gpt-4o", "ollama/llama3.1:8b"],
    num_retries=2,
)

For the proxy server, configure fallbacks in the YAML:

router_settings:
  routing_strategy: "simple-shuffle"  # Load balance across same-name models
  num_retries: 3
  timeout: 30
  fallbacks: [
    {"smart-agent": ["fast-agent", "local-agent"]}
  ]

When a request to smart-agent (Claude) fails, LiteLLM automatically retries with fast-agent (GPT-4o-mini), then local-agent (Ollama).
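Conceptually, the fallback behavior amounts to trying each deployment in order until one succeeds. Here is a minimal pure-Python sketch of that loop — the `call_provider` function and the simulated outage are illustrative, not LiteLLM internals:

```python
# Sketch of the fallback loop a router performs on your behalf.
# `call_provider` stands in for a real provider API call.

def complete_with_fallbacks(prompt, deployments, call_provider):
    """Try each deployment in order; return the first successful result."""
    errors = []
    for name in deployments:
        try:
            return name, call_provider(name, prompt)
        except Exception as exc:  # a real router narrows this to retryable errors
            errors.append((name, exc))
    raise RuntimeError(f"All deployments failed: {errors}")

# Simulate the primary deployment being down and the fallback succeeding.
def fake_call(name, prompt):
    if name == "smart-agent":
        raise TimeoutError("provider outage")
    return f"{name} answered: {prompt[:20]}"

used, result = complete_with_fallbacks(
    "Explain quantum computing",
    ["smart-agent", "fast-agent", "local-agent"],
    fake_call,
)
print(used)  # fast-agent
```

The real router adds retries with backoff and cooldowns on failing deployments, but the control flow your agent relies on is this simple.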

Cost Tracking and Budgets

LiteLLM tracks costs per request automatically:

import litellm

litellm.success_callback = ["langfuse"]  # Send cost data to Langfuse

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this report."}],
)

# Access cost information (completion_cost is the supported helper;
# response._hidden_params["response_cost"] also works)
print(f"Cost: ${litellm.completion_cost(completion_response=response):.6f}")
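The arithmetic behind that number is simple: tokens in each direction multiplied by the per-token price. A sketch of the calculation — the prices below are illustrative placeholders, not current rates:

```python
# How per-request cost is computed: tokens × per-token price.
# (input, output) USD per million tokens — hypothetical example prices.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    inp, out = PRICES_PER_1M[model]
    return (prompt_tokens * inp + completion_tokens * out) / 1_000_000

cost = estimate_cost("gpt-4o", prompt_tokens=1_000, completion_tokens=500)
print(f"${cost:.6f}")  # $0.007500
```

LiteLLM keeps a maintained pricing table internally, so you get this number without hardcoding rates yourself.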

Set spending limits per model or per user through the proxy:

general_settings:
  max_budget: 100.0          # $100 budget cap
  budget_duration: "30d"     # resets every 30 days

Agent Integration Pattern

Here is a production-ready agent class that uses LiteLLM for multi-provider support:

from openai import OpenAI
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    max_tokens: int
    temperature: float

MODELS = {
    "reasoning": ModelConfig("smart-agent", 4096, 0.2),
    "classification": ModelConfig("fast-agent", 256, 0.0),
    "summarization": ModelConfig("fast-agent", 1024, 0.3),
}

class MultiProviderAgent:
    def __init__(self, proxy_url: str = "http://localhost:4000/v1"):
        self.client = OpenAI(base_url=proxy_url, api_key="internal")

    def call(self, task_type: str, messages: list) -> str:
        config = MODELS[task_type]
        response = self.client.chat.completions.create(
            model=config.name,
            messages=messages,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        return response.choices[0].message.content

    def classify(self, text: str, categories: list[str]) -> str:
        return self.call("classification", [
            {"role": "system", "content": f"Classify into: {categories}. "
             "Respond with just the category name."},
            {"role": "user", "content": text},
        ])

    def reason(self, query: str, context: str) -> str:
        return self.call("reasoning", [
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ])

agent = MultiProviderAgent()
category = agent.classify("My order hasn't arrived", ["billing", "shipping", "technical"])
print(f"Category: {category}")
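The task-type mapping above is static. A common refinement is to pick the model tier dynamically from rough input size. The heuristic below (about 4 characters per token for English text, and the 2,000-token threshold) is an assumption for illustration, not a LiteLLM feature:

```python
# Hypothetical heuristic: route short inputs to the cheap tier,
# long ones to the strong tier. ~4 chars/token is a rough English estimate.

def pick_model(messages, threshold_tokens=2_000):
    chars = sum(len(m["content"]) for m in messages)
    est_tokens = chars // 4
    return "smart-agent" if est_tokens > threshold_tokens else "fast-agent"

msgs = [{"role": "user", "content": "Short question"}]
print(pick_model(msgs))  # fast-agent
```

Because both names resolve through the proxy, this routing decision stays a one-line change in the agent.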

FAQ

Does LiteLLM add significant latency?

As a Python library (not proxy mode), LiteLLM adds less than 1ms of overhead — it is just translating the request format. As a proxy server, it adds 5-15ms of network latency for the extra hop. For most agent applications, this is negligible compared to the 200-2000ms LLM inference time.

Can LiteLLM handle streaming responses?

Yes, LiteLLM fully supports streaming across all providers. Use stream=True in your completion call, and LiteLLM normalizes the streaming format so you get consistent ChatCompletionChunk objects regardless of the underlying provider.
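Because the chunk shape is normalized, the accumulation logic is identical for every provider. A sketch with simulated chunks (SimpleNamespace standing in for ChatCompletionChunk objects):

```python
from types import SimpleNamespace

# Simulated chunks shaped like normalized streaming chunks:
# each carries choices[0].delta.content (None when there is no text delta).
def make_chunk(text):
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

def accumulate(stream):
    """Join streamed deltas into the full response text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta is not None:
            parts.append(delta)
    return "".join(parts)

stream = [make_chunk("Hel"), make_chunk("lo"), make_chunk(None), make_chunk("!")]
print(accumulate(stream))  # Hello!
```

With a real call you would write `for chunk in litellm.completion(..., stream=True)` and the same loop body applies.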

How does LiteLLM compare to building my own provider abstraction?

Building your own abstraction for two or three providers is manageable. Beyond that, you are reinventing LiteLLM. LiteLLM handles edge cases you would not think of — different error codes, rate limit headers, token counting differences, and streaming format variations across providers. Use the library and focus your engineering time on agent logic.
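To make one of those edge cases concrete: providers signal failures differently, and a unified layer must map them onto a common taxonomy so retry logic stays provider-agnostic. A hypothetical sketch of that mapping (the classification scheme here is made up for illustration; 429 is the standard HTTP rate-limit status):

```python
# Hypothetical sketch: normalize provider-specific errors into one taxonomy
# so agent retry logic does not need provider-specific branches.

# (provider, status_code) pairs known to mean "rate limited"
RATE_LIMIT_SIGNALS = {("openai", 429), ("anthropic", 429), ("mistral", 429)}

def classify_error(provider, status_code):
    if (provider, status_code) in RATE_LIMIT_SIGNALS:
        return "rate_limit"       # back off and retry the same deployment
    if status_code >= 500:
        return "provider_outage"  # fail over to the next deployment
    return "client_error"         # do not retry; fix the request

print(classify_error("anthropic", 429))  # rate_limit
print(classify_error("openai", 503))     # provider_outage
```

Multiply this by streaming formats, token counting, and retry-after headers across 100+ providers, and the case for reusing the library is clear.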


#LiteLLM #LLMGateway #MultiProvider #Fallback #CostOptimization #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
