Learn Agentic AI

Open-Source Embedding Models: Sentence-Transformers and BGE for RAG Agents

Select, deploy, and optimize open-source embedding models for RAG-powered agents. Compare Sentence-Transformers, BGE, and E5 models with benchmarks, fine-tuning strategies, and deployment patterns.

Why Embedding Models Matter for Agents

Retrieval-Augmented Generation (RAG) is the most common pattern for building agents that work with private data. The embedding model is the backbone of RAG — it converts text into vectors that enable semantic search. A poor embedding model means your agent retrieves irrelevant documents, and no amount of LLM quality can compensate for bad retrieval.

Open-source embedding models have caught up to, and often surpass, proprietary offerings. The MTEB (Massive Text Embedding Benchmark) leaderboard shows open models like BGE, E5, and GTE consistently competing with OpenAI's Ada and Cohere's embedding APIs, while running locally with no per-token API cost.

Top Open-Source Embedding Models

BAAI/bge-large-en-v1.5 — 335M parameters, 1024-dimensional embeddings. Currently one of the best-performing open models on MTEB. Excellent for English-language RAG.

intfloat/e5-large-v2 — 335M parameters, 1024 dimensions. Strong alternative to BGE with slightly different strengths across benchmark categories. Requires a "query: " or "passage: " prefix.

BAAI/bge-m3 — A multilingual model supporting 100+ languages with dense, sparse, and multi-vector retrieval in a single model. Ideal for multilingual agent deployments.

nomic-ai/nomic-embed-text-v1.5 — 137M parameters, 768 dimensions. Excellent quality-to-size ratio with a Matryoshka representation that allows flexible dimensionality.
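Prefix conventions like E5's are easy to forget, and omitting them silently degrades retrieval. The `PREFIX_RULES` table and `with_prefix` helper below are a minimal sketch (not part of any library); the prefixes themselves follow each model's card:

```python
# Model-specific input prefixes, per each model's documentation.
# E5 requires "query: " / "passage: " on every input; BGE v1.5
# recommends an instruction prefix on queries only.
PREFIX_RULES = {
    "intfloat/e5-large-v2": {"query": "query: ", "passage": "passage: "},
    "BAAI/bge-large-en-v1.5": {
        "query": "Represent this sentence for searching relevant passages: ",
        "passage": "",
    },
}

def with_prefix(model_name: str, text: str, kind: str) -> str:
    """Prepend the model's required prefix ('query' or 'passage'), if any."""
    return PREFIX_RULES.get(model_name, {}).get(kind, "") + text

print(with_prefix("intfloat/e5-large-v2", "How do AI agents use retrieval?", "query"))
# query: How do AI agents use retrieval?
```

Route every string through a helper like this before encoding, so document ingestion and query time can never disagree about prefixes.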

Getting Started with Sentence-Transformers

The sentence-transformers library is the standard way to load and use embedding models:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model (downloads ~1.3 GB on first run)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Encode documents
documents = [
    "AI agents can autonomously plan and execute tasks.",
    "Retrieval-augmented generation improves factual accuracy.",
    "Vector databases store high-dimensional embeddings efficiently.",
    "The weather in Paris is mild in spring.",
]

doc_embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Embedding shape: {doc_embeddings.shape}")  # (4, 1024)

# Encode a query
query = "How do AI agents use retrieval?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Compute cosine similarity (dot product since normalized)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()

# Rank results
ranked = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.4f}: {doc}")
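The dot-product shortcut above works only because normalize_embeddings=True returns unit-length vectors: for unit vectors, the dot product equals cosine similarity. A two-dimensional sanity check:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Normalize to unit length, as normalize_embeddings=True does.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

# Dot product of unit vectors equals the cosine of the angle between them.
cos = float(np.dot(a, b))
print(round(cos, 4))  # 0.96
```

If you skip normalization, you must divide by both vector norms yourself or use a cosine-distance index in your vector database.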

Optimizing Embedding Performance

For production agents processing thousands of documents, performance matters. Here are the key optimizations:

Batch encoding — Always encode in batches rather than one document at a time:

# Slow: encoding one by one
for doc in documents:
    embedding = model.encode([doc])

# Fast: batch encoding with GPU
embeddings = model.encode(
    documents,
    batch_size=64,           # Process 64 documents at once
    show_progress_bar=True,
    normalize_embeddings=True,
    device="cuda",           # Use GPU
)
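Note that model.encode already splits its input into batch_size chunks internally, so the main reason to batch manually is a corpus too large to hold in memory at once. A minimal chunking helper (illustrative, not part of sentence-transformers):

```python
def batched(items, batch_size):
    """Yield successive batch_size-sized slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = [f"document {i}" for i in range(10)]
print([len(b) for b in batched(docs, 4)])  # [4, 4, 2]
```

Each yielded slice can be passed to model.encode and the resulting vectors written straight to the vector store, keeping memory usage flat regardless of corpus size.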

Quantized embeddings — Reduce storage and search costs by quantizing float32 vectors to int8 or binary:

from sentence_transformers.quantization import quantize_embeddings

# Full precision: 1024 dimensions x 4 bytes = 4 KB per document
float_embeddings = model.encode(documents, normalize_embeddings=True)

# Int8 quantization: 1024 bytes per document (75% smaller)
int8_embeddings = quantize_embeddings(
    float_embeddings, precision="int8"
)

# Binary quantization: 128 bytes per document (97% smaller)
binary_embeddings = quantize_embeddings(
    float_embeddings, precision="binary"
)
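Under the hood, int8 quantization maps each float coordinate onto 256 integer levels. A simplified sketch assuming a fixed [-1, 1] range (the library calibrates ranges from the data instead, so treat this as illustration only):

```python
import numpy as np

def int8_quantize(emb, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] onto the int8 range [-127, 127]."""
    scaled = np.clip((emb - lo) / (hi - lo), 0.0, 1.0)  # -> [0, 1]
    return (scaled * 254 - 127).astype(np.int8)         # -> [-127, 127]

emb = np.array([[-1.0, 0.0, 1.0]])
print(int8_quantize(emb).tolist())  # [[-127, 0, 127]]
```

Binary quantization goes one step further and keeps only the sign of each coordinate, one bit per dimension, which is why a 1024-dim vector fits in 128 bytes.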

Building a RAG Pipeline with Local Embeddings

Here is a complete RAG pipeline using local models for both embedding and generation:

from sentence_transformers import SentenceTransformer
from openai import OpenAI
import chromadb

# Local embedding model
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Local LLM via Ollama
llm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Local vector database
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection(
    "knowledge_base",
    metadata={"hnsw:space": "cosine"},
)

def ingest(doc_id: str, text: str):
    embedding = embedder.encode([text], normalize_embeddings=True)[0]
    collection.upsert(
        ids=[doc_id],
        embeddings=[embedding.tolist()],
        documents=[text],
    )

def retrieve(query: str, top_k: int = 5) -> list[str]:
    query_emb = embedder.encode([query], normalize_embeddings=True)[0]
    results = collection.query(
        query_embeddings=[query_emb.tolist()],
        n_results=top_k,
    )
    return results["documents"][0]

def rag_query(user_question: str) -> str:
    # Retrieve relevant context
    docs = retrieve(user_question)
    context = "\n\n---\n\n".join(docs)

    # Generate answer with local LLM
    response = llm.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content":
             f"Answer based on this context. If the context does not contain "
             f"the answer, say so.\n\nContext:\n{context}"},
            {"role": "user", "content": user_question},
        ],
        temperature=0.2,
    )

    return response.choices[0].message.content

# Index some documents
ingest("doc1", "AI agents use tool calling to interact with external systems.")
ingest("doc2", "RAG improves LLM accuracy by providing relevant context.")
ingest("doc3", "ChromaDB is an open-source vector database for embeddings.")

# Query
answer = rag_query("How do agents interact with external systems?")
print(answer)

Fine-Tuning Embeddings for Your Domain

Generic embedding models work well out of the box, but fine-tuning on your domain data can improve retrieval quality, often by roughly 5-15% on domain-specific benchmarks. Sentence-Transformers makes this straightforward:

from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Training data: (query, positive_document) pairs
train_examples = [
    InputExample(texts=["What are AI agents?",
                        "AI agents are autonomous systems that perceive, reason, and act."]),
    InputExample(texts=["How does RAG work?",
                        "RAG retrieves relevant documents and includes them in the LLM prompt."]),
    # Add hundreds more domain-specific pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
)
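To confirm the fine-tuned model actually helps, hold out some query-document pairs and compare retrieval metrics before and after training. A minimal recall@k implementation (illustrative):

```python
def recall_at_k(relevant: set, retrieved: list, k: int = 5) -> float:
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

# Both relevant documents appear in the top 3 results.
print(recall_at_k({"doc1", "doc3"}, ["doc3", "doc9", "doc1", "doc2"], k=3))  # 1.0
```

sentence-transformers also ships an InformationRetrievalEvaluator that computes recall, MRR, and NDCG over a held-out set, which you can pass to model.fit to track quality during training.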

FAQ

Should I use BGE, E5, or nomic-embed for my RAG agent?

For English-only applications, BGE-large-en-v1.5 is the safest choice — it ranks highest on most MTEB categories. For multilingual needs, use BGE-M3. If you need a smaller model for edge deployment, nomic-embed-text-v1.5 offers the best quality-per-parameter ratio.

How many dimensions should my embeddings have?

1024 dimensions (BGE-large, E5-large) provide the best retrieval quality. If storage or search speed is a concern, nomic-embed supports Matryoshka dimensionality — you can truncate to 256 or 512 dimensions with only minor quality loss. Binary quantization of 1024-dim vectors is another effective way to reduce storage.
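With a Matryoshka-trained model like nomic-embed, truncation is literal: keep the leading dimensions and renormalize. A sketch (the helper name is ours):

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and renormalize to unit length."""
    cut = emb[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

full = np.random.rand(2, 768)           # stand-in for 768-dim embeddings
small = truncate_matryoshka(full, 256)
print(small.shape)  # (2, 256)
```

This only works for models trained with a Matryoshka objective, which concentrates information in the leading dimensions; truncating BGE or E5 vectors this way degrades quality sharply.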

Do I need to re-embed all documents when switching embedding models?

Yes. Embeddings from different models are not compatible — they exist in different vector spaces. When you upgrade your embedding model, you must re-encode your entire document corpus and rebuild the vector index. Plan for this in your deployment strategy.
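One practical safeguard is to derive the collection name from the embedding model's identifier, so vectors from different models can never land in the same index. A sketch (the collection_name helper is hypothetical):

```python
import re

def collection_name(model_name: str) -> str:
    """Derive a vector-store collection name from the embedding model's ID."""
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", model_name).strip("-").lower()
    return f"kb-{slug}"

print(collection_name("BAAI/bge-large-en-v1.5"))  # kb-baai-bge-large-en-v1-5
```

Switching models then means ingesting into a fresh collection and flipping the retriever over once re-indexing completes, rather than mutating the live index in place.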


#Embeddings #SentenceTransformers #BGE #RAG #VectorSearch #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
