---
title: "Agentic AI with Vector Databases: Building Semantic Search and RAG Agents"
description: "Build RAG-powered agentic AI with vector databases. Compare Pinecone, Weaviate, Chroma, and pgvector for semantic search agent systems."
canonical: https://callsphere.ai/blog/agentic-ai-vector-databases-semantic-search-integration
category: "Technology"
tags: ["Vector Databases", "Semantic Search", "RAG", "Embeddings", "Pinecone"]
author: "CallSphere Team"
published: 2026-03-16T00:00:00.000Z
updated: 2026-05-30T10:23:48.020Z
---

# Agentic AI with Vector Databases: Building Semantic Search and RAG Agents

> Build RAG-powered agentic AI with vector databases. Compare Pinecone, Weaviate, Chroma, and pgvector for semantic search agent systems.

## Why Agents Need Vector Databases

Language models have impressive general knowledge but lack specific, up-to-date information about your business, products, and customers. Retrieval-Augmented Generation (RAG) bridges this gap by retrieving relevant documents from a knowledge base and injecting them into the agent's context before it generates a response.

Vector databases are purpose-built for the similarity search that RAG requires. They store document embeddings — dense numerical representations of text meaning — and efficiently retrieve the most semantically similar documents for any query. When a user asks your agent a question, the system embeds the question, searches the vector database for similar content, and provides the top results as context for the agent to reason with.

This guide covers the full pipeline: embedding generation, index design, hybrid search strategies, RAG agent patterns, vector database comparison, and chunking strategies.

## Embedding Generation

Embeddings convert text into fixed-length numerical vectors where semantic similarity in text space maps to geometric proximity in vector space. Documents about similar topics produce embeddings that are close together.
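
To make "geometric proximity" concrete, here is a minimal sketch of cosine similarity, one common proximity measure for embeddings. The three-dimensional vectors and their labels are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only.
refund_policy = [0.9, 0.1, 0.0]
return_policy = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(refund_policy, return_policy)
sim_unrelated = cosine_similarity(refund_policy, office_hours)
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

With these toy vectors, the two policy documents score close to 1 and the unrelated document scores close to 0, which is the property a vector database exploits when it ranks results.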

```mermaid
flowchart LR
    Q(["User query"])
    EMB["Embed query
text-embedding-3"]
    VEC[("Vector DB
pgvector or Pinecone")]
    RET["Top-k retrieval
k = 8"]
    PROMPT["Augmented prompt
system plus context"]
    LLM["LLM generation
Claude or GPT"]
    CITE["Inline citations
and page anchors"]
    OUT(["Grounded answer"])
    Q --> EMB --> VEC --> RET --> PROMPT --> LLM --> CITE --> OUT
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style VEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

### Choosing an Embedding Model

| Model | Dimensions | Context Window | Speed | Quality |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 3072 | 8,191 tokens | Fast (API) | Excellent |
| OpenAI text-embedding-3-small | 1536 | 8,191 tokens | Fast (API) | Good |
| Cohere embed-v3 | 1024 | 512 tokens | Fast (API) | Very Good |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Self-hosted | Very Good |
| all-MiniLM-L6-v2 | 384 | 256 tokens | Self-hosted, fast | Acceptable |

For production agent systems, OpenAI's text-embedding-3-large is a strong default: it has the highest quality rating in the table above and needs no infrastructure beyond an API key. For self-hosted deployments where data cannot leave your infrastructure, BGE-large or a similar open-source model is the usual choice.
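
Dimension count also drives storage cost. As a rough, back-of-the-envelope sketch, assuming each vector component is stored as an uncompressed 4-byte float32 and ignoring index structures and metadata (both add overhead that varies by database), the raw vector footprint can be estimated like this:

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored uncompressed as float32


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Bytes needed for the raw vectors alone (no index, no metadata)."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return num_bytes / 1024**3


# Dimensions taken from the comparison table above.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "all-MiniLM-L6-v2": 384,
}

for model, dims in MODEL_DIMENSIONS.items():
    size = raw_vector_bytes(1_000_000, dims)
    print(f"{model}: {to_gib(size):.2f} GiB per million chunks")
```

One million 3072-dimensional vectors come to 12,288,000,000 bytes before any index overhead, eight times the footprint of the 384-dimensional model, so the quality gain has to justify the extra memory.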

### Generating Embeddings

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def generate_embedding(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
    )
    return response.data[0].embedding

async def generate_embeddings_batch(
    texts: list[str],
    batch_size: int = 100
) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = await client.embeddings.create(
            model="text-embedding-3-large",
            input=batch,
        )
        all_embeddings.extend([d.embedding for d in response.data])
    return all_embeddings
```

## Index Design for Agent Workloads

Vector database index design depends on your query patterns, data volume, and latency requirements.

### Metadata Filtering

Agent queries are rarely pure similarity search. They typically combine semantic similarity with metadata filters — "find documents similar to this query that were published in the last 30 days and belong to the 'billing' category."

```python
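# Pinecone example with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "category": {"$eq": "billing"},
        # Pinecone's range operators ($gt, $gte, $lt, $lte) compare numbers,
        # so dates are stored as Unix timestamps.
        # 1767225600 is 2026-01-01T00:00:00Z.
        "published_date": {"$gte": 1767225600},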
        "document_type": {"$in": ["faq", "guide", "policy"]},
    },
    include_metadata=True,
)
```

Design your metadata schema upfront. Common metadata fields for agent knowledge bases include document category or type, publication and last-updated dates, source system, access level or tenant ID, and language.

### Namespace Separation

For multi-agent systems, use namespaces or separate collections to isolate different knowledge domains. A customer support agent's knowledge base should not be mixed with an internal HR agent's knowledge base.
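
The isolation rule can be sketched with an in-memory stand-in for a namespaced store. The agent names, namespace labels, and the `NamespacedStore` class are hypothetical; a real deployment would pass the resolved namespace to the database client instead (Pinecone, for instance, accepts a `namespace` argument on upsert and query).

```python
from collections import defaultdict

# Hypothetical agent names and namespace labels.
AGENT_NAMESPACES = {
    "support_agent": "customer-support",
    "hr_agent": "internal-hr",
}


class NamespacedStore:
    """In-memory stand-in for a vector store partitioned by namespace."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, str]] = defaultdict(dict)

    def upsert(self, namespace: str, doc_id: str, text: str) -> None:
        self._docs[namespace][doc_id] = text

    def list_ids(self, namespace: str) -> list[str]:
        return sorted(self._docs[namespace])


def namespace_for(agent_name: str) -> str:
    """Resolve an agent to its namespace; unknown agents get no access."""
    if agent_name not in AGENT_NAMESPACES:
        raise PermissionError(f"no knowledge base assigned to {agent_name!r}")
    return AGENT_NAMESPACES[agent_name]


store = NamespacedStore()
store.upsert(namespace_for("support_agent"), "kb-1", "How to update a payment method")
store.upsert(namespace_for("hr_agent"), "hr-1", "Parental leave policy")

# The support agent only ever sees its own namespace.
support_docs = store.list_ids(namespace_for("support_agent"))
```

Resolving the namespace in one place, rather than letting each agent pass its own string, keeps a prompt-injected or misconfigured agent from reading another domain's documents.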

## Hybrid Search: Keyword + Semantic

Pure semantic search sometimes misses exact matches. If a user asks about "order #12345", the embedding captures the concept of asking about an order but may not prioritize the exact order number. Hybrid search combines semantic similarity with keyword (BM25) matching so that exact tokens such as order numbers still rank highly.

```python
import asyncio


class HybridSearcher:
    def __init__(self, vector_store, keyword_index):
        self.vector_store = vector_store
        self.keyword_index = keyword_index

    async def search(
        self,
        query: str,
        query_embedding: list[float],
        top_k: int = 10,
        semantic_weight: float = 0.7,
        keyword_weight: float = 0.3,
    ) -> list[dict]:
        # Parallel retrieval
        semantic_results, keyword_results = await asyncio.gather(
            self.vector_store.query(query_embedding, top_k=top_k * 2),
            self.keyword_index.search(query, top_k=top_k * 2),
        )

        # Weighted Reciprocal Rank Fusion (RRF): each result list contributes
        # weight / (k + rank), using 1-based ranks and the customary k = 60.
        rrf_k = 60
        scores: dict[str, float] = {}
        for rank, result in enumerate(semantic_results, start=1):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0.0) + (
                semantic_weight / (rrf_k + rank)
            )
        for rank, result in enumerate(keyword_results, start=1):
            doc_id = result["id"]
            scores[doc_id] = scores.get(doc_id, 0.0) + (
                keyword_weight / (rrf_k + rank)
            )

        # Sort by fused score and return top_k
        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        return [{"id": doc_id, "score": score} for doc_id, score in ranked[:top_k]]
```
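
To see why rank fusion helps with the order-number case, here is a small self-contained sketch of weighted RRF applied to two made-up result lists. The `weighted_rrf` function and the document IDs are illustrative assumptions; ranks are 1-based and `k = 60` follows the usual RRF convention.

```python
def weighted_rrf(
    rankings: list[list[str]],
    weights: list[float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds weight / (k + rank) per document."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up results for the query "where is order #12345".
semantic = ["faq-orders", "guide-shipping", "order-12345"]  # exact match ranked last
keyword = ["order-12345", "faq-returns"]                    # exact match ranked first

fused = weighted_rrf([semantic, keyword], weights=[0.7, 0.3])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

The document that appears in both lists, `order-12345`, ends up first even though semantic search alone ranked it third, because its contributions from the two lists add up.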

## RAG Agent Patterns

There are several architectural patterns for integrating RAG into agent systems.

### Tool-Based RAG

The agent has a "search knowledge base" tool that it calls when it needs information. This is the most flexible pattern because the agent decides when to search and what to search for.

```python
async def search_knowledge_base(
    query: str,
    category: str | None = None,
    max_results: int = 5,
) -> list[dict]:
    """
    Search the company knowledge base for relevant information.

    Args:
        query: Natural language search query
        category: Optional filter by category (billing, technical, policy)
        max_results: Number of results to return (1-10)
    """
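    # Enforce the documented 1-10 range in case the model passes something else.
    max_results = max(1, min(max_results, 10))
    embedding = await generate_embedding(query)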
    filters = {}
    if category:
        filters["category"] = {"$eq": category}

    results = await vector_store.query(
        vector=embedding,
        top_k=max_results,
        filter=filters if filters else None,
        include_metadata=True,
    )

    return [
        {
            "title": r.metadata["title"],
            "content": r.metadata["content"],
            "source": r.metadata["source_url"],
            "relevance_score": r.score,
        }
        for r in results.matches
    ]
```
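
For the agent to call `search_knowledge_base`, the function has to be advertised to the model as a tool. The sketch below uses the JSON Schema envelope accepted by the `tools` parameter of OpenAI's Chat Completions API; other agent frameworks wrap tool definitions differently, so treat the outer structure as an assumption about your stack. The `parse_tool_arguments` helper is hypothetical.

```python
import json

ALLOWED_CATEGORIES = ("billing", "technical", "policy")

SEARCH_KB_TOOL = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the company knowledge base for relevant information.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural language search query",
                },
                "category": {
                    "type": "string",
                    "enum": list(ALLOWED_CATEGORIES),
                    "description": "Optional category filter",
                },
                "max_results": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 10,
                    "description": "Number of results to return",
                },
            },
            "required": ["query"],
        },
    },
}


def parse_tool_arguments(raw_arguments: str) -> dict:
    """Decode the model's JSON arguments, then apply defaults and bounds."""
    args = json.loads(raw_arguments)
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is required")
    category = args.get("category")
    if category is not None and category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    max_results = max(1, min(int(args.get("max_results", 5)), 10))
    return {"query": query.strip(), "category": category, "max_results": max_results}


# The model supplies arguments as a JSON string; out-of-range values are clamped.
parsed = parse_tool_arguments('{"query": "refund window", "max_results": 50}')
```

Validating arguments on the application side matters because the schema only guides the model; it does not guarantee that every call respects the stated bounds.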

### Automatic RAG (Pre-Retrieval)

Every user message triggers a retrieval step before the agent processes it. The retrieved documents are injected into the system prompt or user context automatically.

```python
async def process_with_rag(
    user_message: str,
    conversation_history: list[dict],
    agent_system_prompt: str,
) -> str:
    # Always retrieve context
    embedding = await generate_embedding(user_message)
    context_docs = await vector_store.query(vector=embedding, top_k=5)

    # Build augmented prompt
    context_block = "\n\n".join(
        f"[Source: {doc.metadata['title']}]\n{doc.metadata['content']}"
        for doc in context_docs.matches
        if doc.score > 0.7  # Relevance threshold
    )

    augmented_system = f"""{agent_system_prompt}

## Relevant Knowledge Base Context
{context_block}

Use the above context to answer the user's question. If the context
does not contain relevant information, say so rather than guessing.
"""

    response = await llm.chat(
        system=augmented_system,
        messages=conversation_history + [{"role": "user", "content": user_message}],
    )
    return response
```
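
The example above filters out low-scoring documents but does not bound the size of the injected context or handle the case where nothing passes the threshold. A minimal sketch of a helper that does both follows. The `RetrievedDoc` type, the character budget, and the fallback string are assumptions for illustration, and a token-based budget would be more precise than counting characters.

```python
from dataclasses import dataclass

NO_CONTEXT = "(no relevant documents found)"  # hypothetical fallback text


@dataclass
class RetrievedDoc:
    """Hypothetical shape of one retrieval hit."""

    title: str
    content: str
    score: float


def build_context_block(
    docs: list[RetrievedDoc],
    min_score: float = 0.7,
    max_chars: int = 4000,
) -> str:
    """Join the best-scoring documents until the character budget is used up."""
    parts: list[str] = []
    used = 0
    for doc in sorted(docs, key=lambda d: d.score, reverse=True):
        if doc.score <= min_score:
            break  # everything after this scores lower still
        entry = f"[Source: {doc.title}]\n{doc.content}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the budget
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts) if parts else NO_CONTEXT


docs = [
    RetrievedDoc("Office hours", "Open 9 to 5.", 0.42),
    RetrievedDoc("Refunds", "Refunds are issued within 30 days.", 0.91),
]
context_block = build_context_block(docs)
```

Returning an explicit fallback string gives the instruction "say so rather than guessing" something concrete to key on when retrieval comes back empty.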

### Agentic RAG with Re-Ranking

The agent performs an initial retrieval, evaluates the results, and optionally refines the query and searches again. This iterative approach handles complex questions that require multiple retrieval passes.

CallSphere's IT helpdesk RAG system uses this agentic approach. When a support ticket comes in, the agent first searches for similar resolved tickets, evaluates whether the resolutions are applicable, and if not, searches the technical documentation with a refined query derived from its analysis of why the initial results were insufficient.
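
The retrieve, evaluate, and refine loop can be sketched generically. This is an illustrative outline under stated assumptions, not CallSphere's implementation: `search`, `is_sufficient`, and `refine` are injected callables (in a real agent the last two would typically be LLM calls), and the ticket and article IDs are made up.

```python
from typing import Callable

Result = dict[str, str]


def iterative_retrieve(
    question: str,
    search: Callable[[str], list[Result]],
    is_sufficient: Callable[[list[Result]], bool],
    refine: Callable[[str, list[Result]], str],
    max_rounds: int = 3,
) -> dict:
    """Search, judge the results, and retry with a refined query if needed."""
    query = question
    seen: dict[str, Result] = {}
    for round_number in range(1, max_rounds + 1):
        results = search(query)
        for result in results:
            seen.setdefault(result["id"], result)
        if is_sufficient(results):
            return {"sufficient": True, "rounds": round_number,
                    "query": query, "results": results}
        query = refine(query, results)
    # Budget exhausted: hand back everything found so the agent can say so.
    return {"sufficient": False, "rounds": max_rounds,
            "query": query, "results": list(seen.values())}


# Stand-ins for a vector store and two LLM judgments (all made up).
CORPUS = {
    "vpn login fails": [{"id": "T-1", "text": "Reset the user's password."}],
    "vpn login fails certificate expired": [
        {"id": "KB-7", "text": "Renew the client certificate."}
    ],
}


def fake_search(query: str) -> list[Result]:
    return CORPUS.get(query, [])


def fake_is_sufficient(results: list[Result]) -> bool:
    return any("certificate" in r["text"] for r in results)


def fake_refine(query: str, results: list[Result]) -> str:
    return query + " certificate expired"


outcome = iterative_retrieve(
    "vpn login fails", fake_search, fake_is_sufficient, fake_refine
)
```

The `max_rounds` cap is the important design choice: without it, an agent that never judges its results sufficient would keep searching indefinitely.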

## Vector Database Comparison

| Feature | Pinecone | Weaviate | Chroma | pgvector |
| --- | --- | --- | --- | --- |
| Hosting | Managed cloud | Self-hosted or cloud | Self-hosted or embedded | PostgreSQL extension |
| Scale | Billions of vectors | Hundreds of millions | Millions | Millions |
| Hybrid search | Sparse + dense | BM25 + vector built-in | Basic metadata | Full PostgreSQL text search |
| Metadata filtering | Rich filters | GraphQL filters | Where clauses | SQL WHERE |
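| Latency (p99) | … | … | … | … |

## Chunking Strategies

Chunking determines the units the retriever can return. The hierarchical chunker below indexes each document at three levels: a document summary, section-level chunks, and paragraph-level chunks.

```python
class HierarchicalChunker:
    # The class name, method name, and parameter list are reconstructed from
    # the method body. summarize(), split_by_headings(), and
    # split_by_paragraphs() are assumed to be defined on this class.
    def chunk(self, document: str, metadata: dict) -> list[dict]:
        chunks = []

        # Level 1: Document summary
        summary = self.summarize(document)
        chunks.append({
            "content": summary,
            "level": "document",
            "metadata": {**metadata, "chunk_level": "summary"},
        })

        # Level 2: Section-level chunks
        sections = self.split_by_headings(document)
        for i, section in enumerate(sections):
            chunks.append({
                "content": section["content"],
                "level": "section",
                "metadata": {
                    **metadata,
                    "chunk_level": "section",
                    "section_title": section["heading"],
                    "section_index": i,
                },
            })

            # Level 3: Paragraph-level chunks within sections
            paragraphs = self.split_by_paragraphs(section["content"])
            for j, para in enumerate(paragraphs):
                if len(para.split()) > 30:  # Skip paragraphs of 30 words or fewer
                    chunks.append({
                        "content": para,
                        "level": "paragraph",
                        "metadata": {
                            **metadata,
                            "chunk_level": "paragraph",
                            "section_title": section["heading"],
                            "paragraph_index": j,
                        },
                    })

        return chunks
```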
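
The chunker relies on `split_by_headings` and `split_by_paragraphs`, which are not shown. A minimal sketch of both follows, assuming the source documents are Markdown with `#`-style headings; it is written as standalone functions rather than methods, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(document: str) -> list[dict]:
    """Split Markdown into sections of the form {"heading", "content"}."""
    sections: list[dict] = []
    heading = ""  # text before the first heading gets an empty title
    lines: list[str] = []

    def flush() -> None:
        content = "\n".join(lines).strip()
        if heading or content:
            sections.append({"heading": heading, "content": content})

    for line in document.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the section that just ended
            heading = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()  # close the final section
    return sections


def split_by_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


sample = "Intro text\n\n# Billing\nPay monthly.\n\n## Refunds\nWithin 30 days.\n"
sections = split_by_headings(sample)
```

Each returned section carries the `heading` and `content` keys that the chunker reads when it builds section-level and paragraph-level metadata.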

## Frequently Asked Questions

### What is the difference between RAG and fine-tuning for agent knowledge?

RAG retrieves relevant information at query time and injects it into the agent's context. Fine-tuning modifies the model's weights to encode knowledge directly. RAG is better for frequently changing information (product catalogs, policies, knowledge bases) because you update the vector database without retraining. Fine-tuning is better for teaching the model new behaviors, styles, or domain-specific reasoning patterns that do not change frequently.

### How many documents can a vector database handle for a RAG agent?

Modern vector databases scale to billions of vectors. Pinecone and Weaviate handle hundreds of millions of vectors in production. For most agent applications, the practical limit is not the vector database but the quality of your chunking and embedding — poorly chunked documents produce poor retrieval regardless of database scale.

### How do you evaluate RAG quality for agents?

Measure retrieval quality (are the right documents being retrieved?) and generation quality (does the agent use the retrieved context correctly?). Key metrics include recall at k (what fraction of relevant documents appear in the top k results), precision at k (what fraction of retrieved documents are relevant), faithfulness (does the agent's response align with the retrieved context?), and answer relevancy (does the response actually answer the question?).
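
The two retrieval metrics are simple to compute once you have a labeled set of relevant documents per query. The sketch below uses made-up document IDs; faithfulness and answer relevancy are not covered here because they usually require a judge model or human review.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Made-up example: 5 retrieved documents, 3 labeled as relevant.
retrieved = ["d1", "d4", "d2", "d9", "d7"]
relevant = {"d1", "d2", "d3"}

p5 = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 retrieved are relevant
r5 = recall_at_k(retrieved, relevant, k=5)     # 2 of 3 relevant were retrieved
```

Averaging these values over a fixed set of test queries gives a baseline to compare against whenever you change the chunking, embedding model, or search weights.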

### Should I use pgvector or a dedicated vector database?

Use pgvector if you are already running PostgreSQL and your knowledge base is under 2-3 million documents. The operational simplicity of not adding another database to your stack is significant. Switch to a dedicated vector database when you need higher scale, faster query latency at high concurrency, or advanced features like built-in hybrid search.

### How does CallSphere use RAG in its agent products?

CallSphere's IT helpdesk product uses an agentic RAG architecture where the support agent searches a vector database of resolved tickets and technical documentation. The agent performs iterative retrieval — if initial results are insufficient, it reformulates the query and searches again. This approach achieves higher resolution rates than single-pass RAG because complex issues often require synthesizing information from multiple knowledge base articles.

---

Source: https://callsphere.ai/blog/agentic-ai-vector-databases-semantic-search-integration
