Learn Agentic AI

Open-Source Embedding Models: Sentence-Transformers and BGE for RAG Agents

Select, deploy, and optimize open-source embedding models for RAG-powered agents. Compare Sentence-Transformers, BGE, and E5 models with benchmarks, fine-tuning strategies, and deployment patterns.

Why Embedding Models Matter for Agents

Retrieval-Augmented Generation (RAG) is the most common pattern for building agents that work with private data. The embedding model is the backbone of RAG — it converts text into vectors that enable semantic search. A poor embedding model means your agent retrieves irrelevant documents, and no amount of LLM quality can compensate for bad retrieval.

Open-source embedding models have caught up to, and often surpass, proprietary offerings. The MTEB (Massive Text Embedding Benchmark) leaderboard shows open models like BGE, E5, and GTE consistently competing with OpenAI's Ada and Cohere's embedding APIs, while running locally with no per-token API cost.

Top Open-Source Embedding Models

BAAI/bge-large-en-v1.5 — 335M parameters, 1024-dimensional embeddings. Currently one of the best-performing open models on MTEB. Excellent for English-language RAG.

intfloat/e5-large-v2 — 335M parameters, 1024 dimensions. Strong alternative to BGE with slightly different strengths across benchmark categories. Requires a "query: " or "passage: " prefix.

BAAI/bge-m3 — A multilingual model supporting 100+ languages with dense, sparse, and multi-vector retrieval in a single model. Ideal for multilingual agent deployments.

nomic-ai/nomic-embed-text-v1.5 — 137M parameters, 768 dimensions. Excellent quality-to-size ratio with a Matryoshka representation that allows flexible dimensionality.
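Prefix conventions like E5's are easy to forget, and omitting them silently degrades retrieval. The `PREFIX_RULES` table and `with_prefix` helper below are a minimal sketch (not part of any library); the prefixes themselves follow each model's card:

```python
# Model-specific input prefixes, per each model's documentation.
# E5 requires "query: " / "passage: " on every input; BGE v1.5
# recommends an instruction prefix on queries only.
PREFIX_RULES = {
    "intfloat/e5-large-v2": {"query": "query: ", "passage": "passage: "},
    "BAAI/bge-large-en-v1.5": {
        "query": "Represent this sentence for searching relevant passages: ",
        "passage": "",
    },
}

def with_prefix(model_name: str, text: str, kind: str) -> str:
    """Prepend the model's required prefix ('query' or 'passage'), if any."""
    return PREFIX_RULES.get(model_name, {}).get(kind, "") + text

print(with_prefix("intfloat/e5-large-v2", "How do AI agents use retrieval?", "query"))
# query: How do AI agents use retrieval?
```

Route every string through a helper like this before encoding, so document ingestion and query time can never disagree about prefixes.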

Getting Started with Sentence-Transformers

The sentence-transformers library is the standard way to load and use embedding models:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model (downloads ~1.3 GB on first run)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Encode documents
documents = [
    "AI agents can autonomously plan and execute tasks.",
    "Retrieval-augmented generation improves factual accuracy.",
    "Vector databases store high-dimensional embeddings efficiently.",
    "The weather in Paris is mild in spring.",
]

doc_embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Embedding shape: {doc_embeddings.shape}")  # (4, 1024)

# Encode a query
query = "How do AI agents use retrieval?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Compute cosine similarity (dot product since normalized)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()

# Rank results
ranked = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.4f}: {doc}")
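The dot-product shortcut above works only because normalize_embeddings=True returns unit-length vectors: for unit vectors, the dot product equals cosine similarity. A two-dimensional sanity check:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Normalize to unit length, as normalize_embeddings=True does.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

# Dot product of unit vectors equals the cosine of the angle between them.
cos = float(np.dot(a, b))
print(round(cos, 4))  # 0.96
```

If you skip normalization, you must divide by both vector norms yourself or use a cosine-distance index in your vector database.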

Optimizing Embedding Performance

For production agents processing thousands of documents, performance matters. Here are the key optimizations:

Batch encoding — Always encode in batches rather than one document at a time:

# Slow: encoding one by one
for doc in documents:
    embedding = model.encode([doc])

# Fast: batch encoding with GPU
embeddings = model.encode(
    documents,
    batch_size=64,           # Process 64 documents at once
    show_progress_bar=True,
    normalize_embeddings=True,
    device="cuda",           # Use GPU
)
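Note that model.encode already splits its input into batch_size chunks internally, so the main reason to batch manually is a corpus too large to hold in memory at once. A minimal chunking helper (illustrative, not part of sentence-transformers):

```python
def batched(items, batch_size):
    """Yield successive batch_size-sized slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = [f"document {i}" for i in range(10)]
print([len(b) for b in batched(docs, 4)])  # [4, 4, 2]
```

Each yielded slice can be passed to model.encode and the resulting vectors written straight to the vector store, keeping memory usage flat regardless of corpus size.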

Quantized embeddings — Reduce storage and search costs by quantizing float32 vectors to int8 or binary:

from sentence_transformers.quantization import quantize_embeddings

# Full precision: 1024 dimensions x 4 bytes = 4 KB per document
float_embeddings = model.encode(documents, normalize_embeddings=True)

# Int8 quantization: 1024 bytes per document (75% smaller)
int8_embeddings = quantize_embeddings(
    float_embeddings, precision="int8"
)

# Binary quantization: 128 bytes per document (97% smaller)
binary_embeddings = quantize_embeddings(
    float_embeddings, precision="binary"
)
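Under the hood, int8 quantization maps each float coordinate onto 256 integer levels. A simplified sketch assuming a fixed [-1, 1] range (the library calibrates ranges from the data instead, so treat this as illustration only):

```python
import numpy as np

def int8_quantize(emb, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] onto the int8 range [-127, 127]."""
    scaled = np.clip((emb - lo) / (hi - lo), 0.0, 1.0)  # -> [0, 1]
    return (scaled * 254 - 127).astype(np.int8)         # -> [-127, 127]

emb = np.array([[-1.0, 0.0, 1.0]])
print(int8_quantize(emb).tolist())  # [[-127, 0, 127]]
```

Binary quantization goes one step further and keeps only the sign of each coordinate, one bit per dimension, which is why a 1024-dim vector fits in 128 bytes.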

Building a RAG Pipeline with Local Embeddings

Here is a complete RAG pipeline using local models for both embedding and generation:

from sentence_transformers import SentenceTransformer
from openai import OpenAI
import chromadb

# Local embedding model
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Local LLM via Ollama
llm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Local vector database
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection(
    "knowledge_base",
    metadata={"hnsw:space": "cosine"},
)

def ingest(doc_id: str, text: str):
    embedding = embedder.encode([text], normalize_embeddings=True)[0]
    collection.upsert(
        ids=[doc_id],
        embeddings=[embedding.tolist()],
        documents=[text],
    )

def retrieve(query: str, top_k: int = 5) -> list[str]:
    query_emb = embedder.encode([query], normalize_embeddings=True)[0]
    results = collection.query(
        query_embeddings=[query_emb.tolist()],
        n_results=top_k,
    )
    return results["documents"][0]

def rag_query(user_question: str) -> str:
    # Retrieve relevant context
    docs = retrieve(user_question)
    context = "\n\n---\n\n".join(docs)

    # Generate answer with local LLM
    response = llm.chat.completions.create(
        model="llama3.1:8b",
        messages=[
            {"role": "system", "content":
             f"Answer based on this context. If the context does not contain "
             f"the answer, say so.\n\nContext:\n{context}"},
            {"role": "user", "content": user_question},
        ],
        temperature=0.2,
    )

    return response.choices[0].message.content

# Index some documents
ingest("doc1", "AI agents use tool calling to interact with external systems.")
ingest("doc2", "RAG improves LLM accuracy by providing relevant context.")
ingest("doc3", "ChromaDB is an open-source vector database for embeddings.")

# Query
answer = rag_query("How do agents interact with external systems?")
print(answer)

Fine-Tuning Embeddings for Your Domain

Generic embedding models work well out of the box, but fine-tuning on your domain data can improve retrieval quality, often by roughly 5-15% on domain-specific benchmarks. Sentence-Transformers makes this straightforward:

from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Training data: (query, positive_document) pairs
train_examples = [
    InputExample(texts=["What are AI agents?",
                        "AI agents are autonomous systems that perceive, reason, and act."]),
    InputExample(texts=["How does RAG work?",
                        "RAG retrieves relevant documents and includes them in the LLM prompt."]),
    # Add hundreds more domain-specific pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-embeddings",
)
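To confirm the fine-tuned model actually helps, hold out some query-document pairs and compare retrieval metrics before and after training. A minimal recall@k implementation (illustrative):

```python
def recall_at_k(relevant: set, retrieved: list, k: int = 5) -> float:
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

# Both relevant documents appear in the top 3 results.
print(recall_at_k({"doc1", "doc3"}, ["doc3", "doc9", "doc1", "doc2"], k=3))  # 1.0
```

sentence-transformers also ships an InformationRetrievalEvaluator that computes recall, MRR, and NDCG over a held-out set, which you can pass to model.fit to track quality during training.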

FAQ

Should I use BGE, E5, or nomic-embed for my RAG agent?

For English-only applications, BGE-large-en-v1.5 is the safest choice — it ranks highest on most MTEB categories. For multilingual needs, use BGE-M3. If you need a smaller model for edge deployment, nomic-embed-text-v1.5 offers the best quality-per-parameter ratio.

How many dimensions should my embeddings have?

1024 dimensions (BGE-large, E5-large) provide the best retrieval quality. If storage or search speed is a concern, nomic-embed supports Matryoshka dimensionality — you can truncate to 256 or 512 dimensions with only minor quality loss. Binary quantization of 1024-dim vectors is another effective way to reduce storage.
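With a Matryoshka-trained model like nomic-embed, truncation is literal: keep the leading dimensions and renormalize. A sketch (the helper name is ours):

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and renormalize to unit length."""
    cut = emb[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

full = np.random.rand(2, 768)           # stand-in for 768-dim embeddings
small = truncate_matryoshka(full, 256)
print(small.shape)  # (2, 256)
```

This only works for models trained with a Matryoshka objective, which concentrates information in the leading dimensions; truncating BGE or E5 vectors this way degrades quality sharply.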

Do I need to re-embed all documents when switching embedding models?

Yes. Embeddings from different models are not compatible — they exist in different vector spaces. When you upgrade your embedding model, you must re-encode your entire document corpus and rebuild the vector index. Plan for this in your deployment strategy.
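One practical safeguard is to derive the collection name from the embedding model's identifier, so vectors from different models can never land in the same index. A sketch (the collection_name helper is hypothetical):

```python
import re

def collection_name(model_name: str) -> str:
    """Derive a vector-store collection name from the embedding model's ID."""
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", model_name).strip("-").lower()
    return f"kb-{slug}"

print(collection_name("BAAI/bge-large-en-v1.5"))  # kb-baai-bge-large-en-v1-5
```

Switching models then means ingesting into a fresh collection and flipping the retriever over once re-indexing completes, rather than mutating the live index in place.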


#Embeddings #SentenceTransformers #BGE #RAG #VectorSearch #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team
