---
title: "Building a Semantic Search Engine from Scratch: Embeddings, Indexing, and Retrieval"
description: "Learn how to build a complete semantic search engine from scratch using sentence embeddings, approximate nearest neighbor indexing, and a query processing pipeline that returns relevant results by meaning rather than keywords."
canonical: https://callsphere.ai/blog/building-semantic-search-engine-embeddings-indexing-retrieval
category: "Learn Agentic AI"
tags: ["Semantic Search", "Embeddings", "FAISS", "Information Retrieval", "Vector Search"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T17:38:35.665Z
---

# Building a Semantic Search Engine from Scratch: Embeddings, Indexing, and Retrieval

> Learn how to build a complete semantic search engine from scratch using sentence embeddings, approximate nearest neighbor indexing, and a query processing pipeline that returns relevant results by meaning rather than keywords.

## Why Semantic Search Matters

Traditional keyword search fails when users express the same idea with different words. Searching for "how to fix a leaking faucet" returns nothing if your documents say "repair a dripping tap." Semantic search solves this by comparing meaning rather than surface-level text, using dense vector embeddings to represent documents and queries in a shared mathematical space.

In this guide we will build a complete semantic search engine from the ground up: an embedding pipeline that converts documents into vectors, an approximate nearest neighbor (ANN) index for fast retrieval, and a query processing layer that ranks results by semantic similarity.

## Architecture Overview

A semantic search system has three main components:

```mermaid
flowchart TD
    DOC(["Document"])
    CHUNK["Chunker
recursive plus overlap"]
    EMB["Embedding model"]
    META["Attach metadata
source, page, tenant"]
    INDEX[("HNSW or IVF index
in vector store")]
    Q(["Query"])
    QEMB["Embed query"]
    SEARCH["ANN search
cosine similarity"]
    FILTER["Metadata filter
tenant or date"]
    HITS(["Top-k chunks"])
    DOC --> CHUNK --> EMB --> META --> INDEX
    Q --> QEMB --> SEARCH
    INDEX --> SEARCH --> FILTER --> HITS
    style INDEX fill:#4f46e5,stroke:#4338ca,color:#fff
    style HITS fill:#059669,stroke:#047857,color:#fff
```

1. **Embedding Pipeline** — converts raw text into fixed-dimension vectors using a pre-trained model.
2. **Vector Index** — stores embeddings in a structure optimized for fast similarity lookups.
3. **Query Processor** — embeds the user query, searches the index, and returns ranked results.

## Step 1: The Embedding Pipeline

We use the `sentence-transformers` library with the `all-MiniLM-L6-v2` model, which produces 384-dimensional vectors and balances speed with quality.

```python
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List, Dict

class EmbeddingPipeline:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_documents(self, documents: List[Dict]) -> np.ndarray:
        """Embed a list of documents, combining title and body."""
        texts = []
        for doc in documents:
            combined = f"{doc['title']}. {doc['body']}"
            texts.append(combined)
        embeddings = self.model.encode(
            texts,
            show_progress_bar=True,
            batch_size=64,
            normalize_embeddings=True,
        )
        return np.array(embeddings, dtype="float32")

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single search query."""
        embedding = self.model.encode(
            [query],
            normalize_embeddings=True,
        )
        return np.array(embedding, dtype="float32")
```

The `normalize_embeddings=True` flag ensures all vectors have unit length, which means cosine similarity reduces to a simple dot product — a significant performance win.
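
A quick sanity check of that claim, using two arbitrary normalized vectors (illustrative values only, not model output):

```python
import numpy as np

# For unit-length vectors, cosine similarity and the raw dot product coincide.
a = np.random.rand(384).astype("float32")
b = np.random.rand(384).astype("float32")
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))
print(np.isclose(cosine, dot))  # True
```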

## Step 2: Building the FAISS Index

FAISS (Facebook AI Similarity Search) provides highly optimized ANN index structures. For datasets under a million documents, an `IndexFlatIP` (exact inner product) works well. For larger corpora, we use `IndexIVFFlat`, which partitions the space into clusters.

```python
import faiss
import numpy as np

class VectorIndex:
    def __init__(self, dimension: int, use_ivf: bool = False, nlist: int = 100):
        self.dimension = dimension
        if use_ivf:
            quantizer = faiss.IndexFlatIP(dimension)
            self.index = faiss.IndexIVFFlat(
                quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
            self.needs_training = True
        else:
            self.index = faiss.IndexFlatIP(dimension)
            self.needs_training = False

    def build(self, embeddings: np.ndarray):
        """Add embeddings to the index."""
        if self.needs_training:
            self.index.train(embeddings)
        self.index.add(embeddings)

    def search(self, query_embedding: np.ndarray, top_k: int = 10):
        """Return top_k most similar document indices and scores."""
        scores, indices = self.index.search(query_embedding, top_k)
        return scores[0], indices[0]

    def save(self, path: str):
        faiss.write_index(self.index, path)

    def load(self, path: str):
        self.index = faiss.read_index(path)
```
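
If you enable the IVF variant, recall depends on how many clusters FAISS scans per query, controlled by the index's `nprobe` attribute. The sketch below assumes `embeddings` and `query_emb` are the float32 arrays produced by the `EmbeddingPipeline` above; the `nprobe` values are illustrative starting points, not tuned recommendations.

```python
# Sketch: trading speed for recall on an IVF index by widening the cluster scan.
ivf_index = VectorIndex(dimension=384, use_ivf=True, nlist=100)
ivf_index.build(embeddings)  # trains the coarse quantizer, then adds vectors

for nprobe in (1, 8, 32):
    ivf_index.index.nprobe = nprobe  # clusters scanned per query
    scores, indices = ivf_index.search(query_emb, top_k=10)
    print(nprobe, indices[:5])
```

Higher `nprobe` values approach the exact flat index's results at the cost of query latency.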

## Step 3: The Query Processor

The query processor ties everything together. It embeds the user query, searches the index, and maps results back to document metadata.

```python
class SemanticSearchEngine:
    def __init__(self, documents: List[Dict]):
        self.documents = documents
        self.pipeline = EmbeddingPipeline()
        self.index = VectorIndex(self.pipeline.dimension)

        # Build the index
        embeddings = self.pipeline.embed_documents(documents)
        self.index.build(embeddings)

    def search(self, query: str, top_k: int = 5, min_score: float = 0.3):
        query_emb = self.pipeline.embed_query(query)
        scores, indices = self.index.search(query_emb, top_k)

        results = []
        for score, idx in zip(scores, indices):
            if idx == -1 or score < min_score:
                continue
            doc = self.documents[idx].copy()
            doc["score"] = float(score)
            results.append(doc)
        return results

# Usage
documents = [
    {"title": "Plumbing Repair Guide", "body": "How to fix a dripping tap..."},
    {"title": "Garden Watering Tips", "body": "Efficient irrigation methods..."},
]
engine = SemanticSearchEngine(documents)
results = engine.search("leaking faucet repair")
for r in results:
    print(f"{r['score']:.3f} — {r['title']}")
```

Searching for "leaking faucet repair" now correctly returns the plumbing guide even though those exact words never appear in the document.

## Performance Considerations

For production deployments, consider these optimizations:

- **Batch embedding** — process documents in batches of 64-128 to maximize GPU utilization.
- **Product quantization** — use `IndexIVFPQ` to compress vectors from 1.5 KB to 48 bytes each, enabling billion-scale search (see the sketch after this list).
- **Pre-filtering** — apply metadata filters before the vector search to reduce the candidate set.
- **Caching** — cache frequent query embeddings to avoid re-encoding.
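
As a rough sketch of the product-quantization option, built with the FAISS index factory. The `IVF4096,PQ48` string and `nprobe` value are illustrative and should be tuned for your corpus, and `training_embeddings` / `all_embeddings` stand in for float32 arrays of shape `(n, 384)`:

```python
import faiss

d = 384  # embedding dimension
# "IVF4096,PQ48": 4096 clusters, 48 sub-quantizers at 8 bits -> 48 bytes per vector
index = faiss.index_factory(d, "IVF4096,PQ48", faiss.METRIC_INNER_PRODUCT)

# PQ indexes must be trained on a representative sample before vectors are added.
index.train(training_embeddings)
index.add(all_embeddings)
index.nprobe = 16  # clusters scanned per query; raise for better recall
```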

## FAQ

### What embedding model should I use for semantic search?

Start with `all-MiniLM-L6-v2` for general English text. It offers excellent quality-to-speed ratio with 384 dimensions. For higher accuracy at the cost of speed, use `all-mpnet-base-v2` (768 dimensions). For domain-specific needs like legal or medical text, fine-tune a base model on your domain corpus.
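
Swapping models is a one-line change in the pipeline above, but a different model usually means a different dimension, so the index must be rebuilt:

```python
# Higher-accuracy model; the FAISS index must be rebuilt for the new dimension.
pipeline = EmbeddingPipeline(model_name="all-mpnet-base-v2")
print(pipeline.dimension)  # 768 instead of 384
```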

### How does semantic search handle exact keyword matches?

Pure semantic search can sometimes miss exact matches that keyword search catches easily. The recommended approach is hybrid search: combine BM25 keyword scores with vector similarity scores using reciprocal rank fusion. This gives you the best of both worlds.
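
A minimal sketch of reciprocal rank fusion over two ranked lists. The BM25 ranking is assumed to come from a separate keyword engine (for example `rank_bm25` or Elasticsearch), and the constant 60 is the conventional default:

```python
from collections import defaultdict
from typing import List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into a single ordering."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical ranked ids from each retriever.
bm25_ids = ["doc3", "doc1", "doc7"]
vector_ids = ["doc1", "doc4", "doc3"]
print(reciprocal_rank_fusion([bm25_ids, vector_ids]))
# doc1 and doc3 rank highest because both retrievers return them
```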

### How many documents can FAISS handle on a single machine?

A flat index comfortably handles up to one million 384-dimensional vectors in about 1.5 GB of RAM. With product quantization (`IndexIVFPQ`), a single machine with 64 GB of RAM can index over 100 million documents while maintaining sub-10ms query latency.
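
A back-of-the-envelope check of those figures (rough arithmetic that ignores index overhead):

```python
n_flat, dim = 1_000_000, 384
flat_bytes = n_flat * dim * 4                    # float32 vectors
print(f"flat index: {flat_bytes / 1e9:.2f} GB")  # ~1.54 GB

n_pq, code_bytes = 100_000_000, 48               # IndexIVFPQ with 48-byte codes
print(f"PQ codes:   {n_pq * code_bytes / 1e9:.2f} GB")  # ~4.8 GB, well under 64 GB
```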

---

#SemanticSearch #Embeddings #FAISS #InformationRetrieval #VectorSearch #AgenticAI #LearnAI #AIEngineering

