Learn Agentic AI

Embedding Cost Optimization: When to Re-Embed, Cache, or Use Smaller Models

Optimize embedding costs for AI agent systems with practical strategies for caching embeddings, selecting cost-effective models, batch sizing, and storage optimization. Reduce embedding spend by 60-80%.

The Hidden Cost of Embeddings

Embedding costs fly under the radar because individual embedding calls are cheap — $0.02 per million tokens for OpenAI’s text-embedding-3-small. But agents that perform RAG on every request, re-embed documents on every update, and store high-dimensional vectors in expensive vector databases can accumulate significant embedding-related costs. A system processing 500,000 queries daily with an average of 1,000 tokens per query spends about $10/day just on query embeddings — and that does not include document embeddings or vector storage.
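
The arithmetic above is easy to reproduce with a back-of-envelope helper (the prices are the per-million-token rates quoted in this article):

```python
def daily_embedding_cost(
    queries_per_day: int,
    avg_tokens_per_query: int,
    price_per_million_tokens: float,
) -> float:
    """Estimate daily spend on query embeddings alone."""
    total_tokens = queries_per_day * avg_tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens

# 500,000 queries/day at 1,000 tokens each, text-embedding-3-small pricing:
print(f"${daily_embedding_cost(500_000, 1_000, 0.02):.2f}/day")  # $10.00/day
```

Run the same numbers with text-embedding-3-large ($0.13/M) and the figure jumps to $65/day, which is why model selection (covered below) matters as much as caching.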

Embedding Caching

The most impactful optimization is caching embeddings. Query embeddings and document embeddings should never be computed twice for the same input.

import hashlib
import json
from typing import Optional, List

import redis

class EmbeddingCache:
    """Redis-backed cache so the same (model, text) pair is never embedded twice."""

    def __init__(self, redis_url: str = "redis://localhost:6379/1"):
        self.redis_client = redis.from_url(redis_url)
        self.hits = 0
        self.misses = 0

    def _cache_key(self, text: str, model: str) -> str:
        content = f"{model}:{text.strip().lower()}"
        return f"emb:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, text: str, model: str) -> Optional[List[float]]:
        key = self._cache_key(text, model)
        cached = self.redis_client.get(key)
        if cached:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        return None

    def store(self, text: str, model: str, embedding: List[float], ttl: int = 604_800):  # TTL defaults to 7 days
        key = self._cache_key(text, model)
        self.redis_client.setex(key, ttl, json.dumps(embedding))

    def get_or_compute(
        self,
        text: str,
        model: str,
        compute_fn,
    ) -> List[float]:
        cached = self.get(text, model)
        if cached is not None:
            return cached
        embedding = compute_fn(text, model)
        self.store(text, model, embedding)
        return embedding

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
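
Redis is the right backing store in production, but for tests and local development the same interface can sit on a plain dict. A minimal sketch (the class name and stand-in embed function are illustrative, not part of any library):

```python
from typing import Callable, Dict, List

class InMemoryEmbeddingCache:
    """Dict-backed stand-in with the same get_or_compute interface."""

    def __init__(self):
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text: str, model: str, compute_fn: Callable) -> List[float]:
        # Same normalization as the Redis version: strip and lowercase.
        key = f"{model}:{text.strip().lower()}"
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        embedding = compute_fn(text, model)
        self._store[key] = embedding
        return embedding

cache = InMemoryEmbeddingCache()
fake_embed = lambda text, model: [0.1, 0.2, 0.3]  # stand-in for a real API call
cache.get_or_compute("hello", "text-embedding-3-small", fake_embed)   # miss
cache.get_or_compute("Hello ", "text-embedding-3-small", fake_embed)  # hit (normalized)
print(cache.hits, cache.misses)  # 1 1
```

Note that lowercasing in the cache key is a deliberate trade-off: it raises the hit rate but treats "US" and "us" as the same query, so drop it if your corpus is case-sensitive.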

Model Selection by Use Case

Not every use case needs the highest-quality embedding model. Match the model to the task requirements.

from dataclasses import dataclass
from enum import Enum

class EmbeddingUseCase(Enum):
    SEMANTIC_SEARCH = "semantic_search"
    CLASSIFICATION = "classification"
    CLUSTERING = "clustering"
    DUPLICATE_DETECTION = "duplicate_detection"
    CACHING_KEYS = "caching_keys"

@dataclass
class EmbeddingModelConfig:
    model: str
    dimensions: int
    cost_per_million_tokens: float
    quality_tier: str

MODEL_RECOMMENDATIONS = {
    EmbeddingUseCase.SEMANTIC_SEARCH: EmbeddingModelConfig(
        model="text-embedding-3-large",
        dimensions=3072,
        cost_per_million_tokens=0.13,
        quality_tier="high",
    ),
    EmbeddingUseCase.CLASSIFICATION: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=1536,
        cost_per_million_tokens=0.02,
        quality_tier="medium",
    ),
    EmbeddingUseCase.CLUSTERING: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=512,
        cost_per_million_tokens=0.02,
        quality_tier="medium",
    ),
    EmbeddingUseCase.DUPLICATE_DETECTION: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=256,
        cost_per_million_tokens=0.02,
        quality_tier="low",
    ),
    EmbeddingUseCase.CACHING_KEYS: EmbeddingModelConfig(
        model="text-embedding-3-small",
        dimensions=256,
        cost_per_million_tokens=0.02,
        quality_tier="low",
    ),
}

def select_model(use_case: EmbeddingUseCase) -> EmbeddingModelConfig:
    return MODEL_RECOMMENDATIONS[use_case]
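
To see why the table matters, compare what embedding a 100M-token corpus costs at the two price points listed above:

```python
def corpus_cost(tokens: int, price_per_million: float) -> float:
    """Total embedding cost for a corpus at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million

corpus = 100_000_000  # 100M tokens
print(f"${corpus_cost(corpus, 0.13):.2f}")  # $13.00 -- text-embedding-3-large
print(f"${corpus_cost(corpus, 0.02):.2f}")  # $2.00  -- text-embedding-3-small
```

That is a 6.5x spread, so reserve the large model for the retrieval-critical path and route everything else to the small one.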

Dimension Reduction for Storage Savings

OpenAI’s text-embedding-3 models support native dimension reduction via the dimensions parameter. Reducing from 3072 to 1024 dimensions cuts storage by 67% with only a small quality loss on most benchmarks.
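
The dimensions parameter works by shortening the embedding, and OpenAI's documentation notes you can equivalently truncate a full-size vector yourself and re-normalize. That means vectors you already hold can be downsized without paying to re-embed. A sketch with NumPy:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the leading `dims` components and re-normalize to unit length
    so cosine similarity remains meaningful."""
    truncated = vec[:dims].astype(np.float32)
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a real embedding
small = truncate_embedding(full, 1024)
print(small.shape)  # (1024,)
```

The re-normalization step is the part people forget: truncated vectors no longer have unit norm, and skipping it silently skews similarity scores.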

import openai

class OptimizedEmbedder:
    def __init__(self, client: openai.OpenAI, cache: EmbeddingCache):
        self.client = client
        self.cache = cache

    def embed(
        self,
        texts: List[str],
        use_case: EmbeddingUseCase,
    ) -> List[List[float]]:
        config = select_model(use_case)
        uncached_texts = []
        uncached_indices = []
        results: dict[int, List[float]] = {}

        for i, text in enumerate(texts):
            cached = self.cache.get(text, config.model)
            if cached is not None:
                results[i] = cached
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        if uncached_texts:
            response = self.client.embeddings.create(
                model=config.model,
                input=uncached_texts,
                dimensions=config.dimensions,
            )
            for j, emb_data in enumerate(response.data):
                idx = uncached_indices[j]
                embedding = emb_data.embedding
                results[idx] = embedding
                self.cache.store(uncached_texts[j], config.model, embedding)

        return [results[i] for i in range(len(texts))]

Batch Sizing for Throughput

Process embeddings in optimal batch sizes to maximize throughput and minimize overhead.


def batch_embed(
    client: openai.OpenAI,
    texts: List[str],
    model: str = "text-embedding-3-small",
    batch_size: int = 100,
    dimensions: int = 1536,
) -> List[List[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model=model,
            input=batch,
            dimensions=dimensions,
        )
        batch_embeddings = [d.embedding for d in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
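
One gap in the loop above: a single transient rate-limit error crashes the run and discards every completed batch. A generic retry wrapper with exponential backoff is a cheap safeguard; this sketch catches bare Exception for brevity, but in production you would catch the SDK's specific rate-limit error:

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Wrap `fn` so transient failures are retried with exponential
    backoff plus jitter; the final failure is re-raised."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
                time.sleep(delay)
    return wrapper

calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return x * 2

safe = with_backoff(flaky, max_retries=5, base_delay=0.01)
print(safe(21))  # 42 (succeeds on the third attempt)
```

Wrap the per-batch `client.embeddings.create` call rather than the whole loop, so a retry only repeats one batch.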

When to Re-Embed

Re-embedding your entire document corpus is expensive. Only re-embed when you change the embedding model, when documents have been significantly updated, or when your retrieval quality metrics show degradation. For incremental updates, embed only the changed documents and update the vector index incrementally.
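
Detecting which documents actually changed is easiest with content hashes stored alongside each vector: at sync time, compare hashes and embed only the diffs. A minimal sketch:

```python
import hashlib
from typing import Dict, List

def changed_doc_ids(current_docs: Dict[str, str], stored_hashes: Dict[str, str]) -> List[str]:
    """Return IDs of documents whose content hash differs from the stored one.
    New documents (no stored hash) count as changed."""
    changed = []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed

docs = {"a": "unchanged text", "b": "edited text"}
stored = {"a": hashlib.sha256(b"unchanged text").hexdigest(),
          "b": hashlib.sha256(b"old text").hexdigest()}
print(changed_doc_ids(docs, stored))  # ['b']
```

Store the hash as vector metadata so the comparison needs no extra table.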

FAQ

How much storage does an embedding require?

A single 1536-dimensional float32 embedding uses 6,144 bytes (about 6 KB). For 1 million documents, that is approximately 6 GB of raw embedding storage. Using float16 cuts this in half, and combining float16 with 512 dimensions brings it down to about 1 GB for the same corpus. Factor in vector database overhead (indexes, metadata), which typically adds 30–50% to the raw storage.
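
These numbers follow directly from vectors x dims x bytes-per-value. A small helper keeps capacity planning honest (the 40% overhead default is an assumption; tune it to your database):

```python
def embedding_storage_gb(
    n_vectors: int,
    dims: int,
    bytes_per_value: int = 4,    # float32; use 2 for float16
    index_overhead: float = 0.4, # assumed index/metadata overhead
) -> float:
    """Estimated storage in GB, including proportional index overhead."""
    raw = n_vectors * dims * bytes_per_value
    return raw * (1 + index_overhead) / 1e9

# 1M docs at 1536 float32 dims, raw vectors only:
print(f"{embedding_storage_gb(1_000_000, 1536, index_overhead=0.0):.2f} GB")  # 6.14 GB
# Same corpus at 512 dims in float16:
print(f"{embedding_storage_gb(1_000_000, 512, bytes_per_value=2, index_overhead=0.0):.2f} GB")  # 1.02 GB
```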

Should I use a self-hosted embedding model to save costs?

Self-hosted models like all-MiniLM-L6-v2 from Sentence Transformers are free per token, but you pay for compute infrastructure. At $0.02 per million tokens, the API is hard to beat on raw cost: even a modest ~$50/month instance only breaks even at volumes in the billions of tokens per month once you include hosting and operations. Below that, API-based embedding is usually cheaper; the stronger arguments for self-hosting are latency, data privacy, and independence from provider rate limits.
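
Whatever infrastructure cost you assume, the raw-cost breakeven is a one-line calculation (the $50/month figure below is an illustrative assumption for a small CPU instance, not a quote):

```python
def breakeven_tokens_per_month(
    infra_cost_per_month: float,
    api_price_per_million: float = 0.02,
) -> float:
    """Monthly token volume above which self-hosting beats the API on raw cost."""
    return infra_cost_per_month / api_price_per_million * 1_000_000

print(f"{breakeven_tokens_per_month(50.0):,.0f} tokens/month")  # 2,500,000,000 tokens/month
```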

How do I handle embedding model migrations?

Never mix embeddings from different models in the same vector index — their vector spaces are incompatible. Plan migrations by creating a new index, batch-embedding all documents with the new model, switching the search to the new index, and then deleting the old index. Run both indexes in parallel during the transition to validate quality.
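
The migration pattern above is essentially blue-green deployment for indexes, and can be sketched with plain dicts standing in for the vector store (the store structure and fake_embed function are illustrative; real clients like Pinecone or Qdrant have their own index and alias operations):

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Doc:
    id: str
    text: str

def fake_embed(text: str) -> List[float]:
    """Deterministic stand-in for a real embedding call."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]

def migrate(indexes: Dict[str, dict], alias: Dict[str, str], new_index: str, docs: List[Doc]) -> None:
    """Blue-green migration: fully build the new index, then atomically
    repoint the serving alias. The old index stays in `indexes` for
    parallel validation until you explicitly delete it."""
    indexes[new_index] = {d.id: fake_embed(d.text) for d in docs}
    alias["search"] = new_index

indexes = {"v1-small": {"a": [0.0]}}
alias = {"search": "v1-small"}
migrate(indexes, alias, "v2-large", [Doc("a", "hello world")])
print(alias["search"])        # v2-large
print("v1-small" in indexes)  # True (kept for validation)
```

Routing all reads through the alias is what makes the swap atomic: no caller ever sees a half-built index.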


#Embeddings #CostOptimization #VectorDatabase #RAG #ModelSelection #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

