---
title: "Benchmarking Vector Databases: Latency, Throughput, and Recall at Scale"
description: "Learn how to rigorously benchmark vector databases with proper methodology — measuring latency, throughput, and recall under realistic conditions to make informed infrastructure decisions."
canonical: https://callsphere.ai/blog/benchmarking-vector-databases-latency-throughput-recall-scale
category: "Learn Agentic AI"
tags: ["Benchmarking", "Vector Database", "Performance", "Latency", "Recall"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T07:05:50.780Z
---

# Benchmarking Vector Databases: Latency, Throughput, and Recall at Scale

> Learn how to rigorously benchmark vector databases with proper methodology — measuring latency, throughput, and recall under realistic conditions to make informed infrastructure decisions.

## Why Benchmark Your Own Workload

Vendor benchmarks are marketing. They show optimal configurations on favorable datasets under ideal conditions. Your application has specific embedding dimensions, query patterns, filter complexity, and concurrency levels that no generic benchmark captures.

The only benchmark that matters is one that simulates your actual workload. This guide covers the methodology, metrics, and tooling to run rigorous vector database benchmarks that inform real infrastructure decisions.

## The Three Metrics That Matter

**1. Recall at K** — What fraction of the true nearest neighbors does the system return? Recall of 0.95 at K=10 means that, on average, 9.5 of the 10 true neighbors appear in the results.

```mermaid
flowchart TD
    DOC(["Document"])
    CHUNK["Chunker
recursive plus overlap"]
    EMB["Embedding model"]
    META["Attach metadata
source, page, tenant"]
    INDEX[("HNSW or IVF index
in vector store")]
    Q(["Query"])
    QEMB["Embed query"]
    SEARCH["ANN search
cosine similarity"]
    FILTER["Metadata filter
tenant or date"]
    HITS(["Top-k chunks"])
    DOC --> CHUNK --> EMB --> META --> INDEX
    Q --> QEMB --> SEARCH
    INDEX --> SEARCH --> FILTER --> HITS
    style INDEX fill:#4f46e5,stroke:#4338ca,color:#fff
    style HITS fill:#059669,stroke:#047857,color:#fff
```

**2. Query Latency** — How long does a single query take? Measure P50, P95, and P99 — averages hide tail latency that affects user experience.

**3. Queries Per Second (QPS)** — How many concurrent queries can the system handle before latency degrades? This determines how many users your system can serve.

These three metrics are in tension. Higher recall requires searching more candidates, which increases latency and reduces throughput. Every index configuration is a point on this three-way tradeoff surface.
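The warning about averages hiding tail latency is easy to demonstrate: a small fraction of slow queries barely moves the mean but dominates P99. The numbers below are synthetic, purely for illustration:

```python
import numpy as np

# 98 fast queries at 5 ms, plus 2 stalls at 500 ms (synthetic data).
latencies_ms = [5.0] * 98 + [500.0] * 2

print(f"mean={np.mean(latencies_ms):.1f}ms")           # 14.9 -- looks healthy
print(f"p50={np.percentile(latencies_ms, 50):.1f}ms")  # 5.0
print(f"p99={np.percentile(latencies_ms, 99):.1f}ms")  # 500.0 -- the real tail
```

A dashboard showing only the mean would report ~15 ms while 1 in 50 users waits half a second.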

## Building a Benchmark Suite

Start with a reproducible benchmark framework:

```python
import time
import numpy as np
from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    recall_at_k: float
    latencies_ms: list[float] = field(default_factory=list)

    @property
    def p50_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 50))

    @property
    def p95_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 95))

    @property
    def p99_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 99))

    @property
    def qps(self) -> float:
        # Sequential throughput implied by the recorded latencies;
        # QPS under concurrent load must be measured separately.
        total_seconds = sum(self.latencies_ms) / 1000.0
        return len(self.latencies_ms) / total_seconds if total_seconds > 0 else 0.0
```

## Computing Ground Truth

To measure recall, you need exact nearest neighbors as ground truth. Generate these with brute-force search:

```python
import faiss

def compute_ground_truth(
    vectors: np.ndarray,
    queries: np.ndarray,
    k: int = 10
) -> np.ndarray:
    """Compute exact nearest neighbors using brute-force search."""
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    distances, indices = index.search(queries, k)
    return indices  # shape: (num_queries, k)
```
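If faiss is not installed, exact neighbors for a modest corpus can be computed with plain NumPy. This is a sketch: the broadcasting approach needs O(queries × vectors) memory, so batch the queries for anything beyond ~100K vectors:

```python
import numpy as np

def brute_force_ground_truth(
    vectors: np.ndarray, queries: np.ndarray, k: int = 10
) -> np.ndarray:
    """Exact k-NN by full pairwise squared-L2 distance."""
    # Shape (num_queries, num_vectors) via broadcasting.
    d2 = ((queries[:, None, :] - vectors[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

# Sanity check: each query is a lightly perturbed copy of a stored vector,
# so its nearest neighbor should be that source vector.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 32)).astype(np.float32)
qs = vecs[:5] + 0.001 * rng.normal(size=(5, 32)).astype(np.float32)
gt = brute_force_ground_truth(vecs, qs, k=10)
print(gt[:, 0].tolist())  # expect [0, 1, 2, 3, 4]
```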

## Measuring Recall

Compare ANN results against ground truth:

```python
def compute_recall(
    ann_results: list[list[int]],
    ground_truth: np.ndarray,
    k: int = 10
) -> float:
    """Compute recall@k: fraction of true neighbors found."""
    total_recall = 0.0
    for i, ann_ids in enumerate(ann_results):
        true_ids = set(ground_truth[i][:k])
        found = len(set(ann_ids[:k]) & true_ids)
        total_recall += found / k
    return total_recall / len(ann_results)
```
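For intuition, a hand-checkable case: if the exact top-3 for a query is {0, 1, 2} and the ANN index returns [0, 2, 5], two of the three true neighbors were found, so recall@3 is 2/3. The same set arithmetic as in `compute_recall`, inlined for one query:

```python
import numpy as np

ground_truth = np.array([[0, 1, 2]])  # exact top-3 for a single query
ann_results = [[0, 2, 5]]             # what the ANN index returned

true_ids = set(ground_truth[0])
found = len(set(ann_results[0]) & true_ids)  # {0, 2} -> 2 hits
recall = found / 3
print(round(recall, 3))  # 0.667
```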

## Benchmarking pgvector

```python
import psycopg
from pgvector.psycopg import register_vector

def benchmark_pgvector(
    conn,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10,
    ef_search: int = 40
) -> BenchmarkResult:
    register_vector(conn)
    conn.execute(f"SET hnsw.ef_search = {ef_search}")

    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        rows = conn.execute(
            "SELECT id FROM documents ORDER BY embedding <-> %s LIMIT %s",
            (query_vec, k)  # register_vector adapts the numpy array to a vector
        ).fetchall()
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        all_results.append([row[0] for row in rows])

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)
```

## Benchmarking Pinecone

```python
from pinecone import Pinecone

def benchmark_pinecone(
    index,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10
) -> BenchmarkResult:
    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        response = index.query(
            vector=query_vec.tolist(),
            top_k=k
        )
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        result_ids = [int(m["id"]) for m in response["matches"]]
        all_results.append(result_ids)

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)
```

## Concurrent Load Testing

Single-query latency tells only part of the story. Test under concurrent load to find throughput limits:

```python
import concurrent.futures

def concurrent_benchmark(
    search_fn,
    queries: np.ndarray,
    concurrency: int = 10
) -> dict:
    latencies = []

    def run_query(query_vec):
        start = time.perf_counter()
        search_fn(query_vec)
        return (time.perf_counter() - start) * 1000

    start_all = time.perf_counter()

    with concurrent.futures.ThreadPoolExecutor(
        max_workers=concurrency
    ) as executor:
        futures = [
            executor.submit(run_query, q)
            for q in queries
        ]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())

    total_time = time.perf_counter() - start_all
    return {
        "concurrency": concurrency,
        "total_queries": len(queries),
        "total_time_s": total_time,
        "qps": len(queries) / total_time,
        "p50_ms": float(np.percentile(latencies, 50)),
        "p95_ms": float(np.percentile(latencies, 95)),
        "p99_ms": float(np.percentile(latencies, 99)),
    }
```
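Before pointing the harness at a real database, it helps to validate it against a stub search function with a fixed artificial delay. This inline version of the same thread-pool pattern should approach the theoretical ceiling of concurrency / latency (`stub_search` is a placeholder, not a real client):

```python
import concurrent.futures
import time

import numpy as np

def stub_search(query_vec):
    """Stand-in for a database call with a fixed ~10 ms cost."""
    time.sleep(0.01)

queries = np.zeros((100, 8), dtype=np.float32)

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(stub_search, queries))
total_s = time.perf_counter() - start

# Ceiling: 10 workers / 0.01 s per query = 1000 QPS; expect somewhat less.
print(f"qps={len(queries) / total_s:.0f}")
```

If the measured QPS is far below the ceiling, the bottleneck is in the harness itself (thread startup, GIL contention in your client code) rather than the database.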

## Running a Sweep

Test multiple configurations to find the optimal recall-latency tradeoff:

```python
def parameter_sweep_pgvector(conn, queries, ground_truth):
    results = []
    for ef_search in [10, 20, 40, 80, 160, 320]:
        result = benchmark_pgvector(
            conn, queries, ground_truth,
            k=10, ef_search=ef_search
        )
        results.append({
            "ef_search": ef_search,
            "recall": result.recall_at_k,
            "p50_ms": result.p50_ms,
            "p95_ms": result.p95_ms,
            "qps": result.qps,
        })
        print(
            f"ef_search={ef_search}: "
            f"recall={result.recall_at_k:.3f}, "
            f"p50={result.p50_ms:.1f}ms, "
            f"p95={result.p95_ms:.1f}ms"
        )
    return results
```
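Once the sweep completes, a common selection rule is: take the highest-recall configuration whose P95 stays under your latency budget. A sketch of that rule — the sweep numbers below are made up for demonstration, not real measurements:

```python
# Illustrative sweep output -- invented figures, not measurements.
sweep = [
    {"ef_search": 10,  "recall": 0.820, "p95_ms": 2.1},
    {"ef_search": 40,  "recall": 0.950, "p95_ms": 5.4},
    {"ef_search": 160, "recall": 0.990, "p95_ms": 18.7},
    {"ef_search": 320, "recall": 0.995, "p95_ms": 41.2},
]

P95_BUDGET_MS = 20.0  # your application's latency budget

eligible = [r for r in sweep if r["p95_ms"] <= P95_BUDGET_MS]
best = max(eligible, key=lambda r: r["recall"])
print(best["ef_search"])  # 160: highest recall within budget
```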

## Benchmarking Best Practices

**Use realistic data.** Random vectors behave differently from real embeddings. Use a subset of your actual production embeddings or a standard ANN-Benchmarks dataset (sift-128, gist-960, or deep-image-96).

**Warm up before measuring.** Run 100-200 throwaway queries to fill caches and warm JIT-compiled code paths. Only measure after warmup.

**Test with filters.** If your application uses metadata filtering, include filters in your benchmark. Filtered search performance can differ dramatically from unfiltered.

**Measure at your target scale.** Performance at 100K vectors does not predict performance at 10M vectors. Load your benchmark with the volume you expect in production.

**Run multiple trials.** Network variability (especially for cloud databases) can skew individual measurements. Run each configuration 3-5 times and report the median.
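The warm-up advice above can be folded into the harness as a small helper. A sketch under stated assumptions: `search_fn` is any callable taking a query vector, and `n_warmup` is a placeholder default, not a universal rule:

```python
import numpy as np

def warm_up(search_fn, queries: np.ndarray, n_warmup: int = 200) -> None:
    """Run throwaway queries so caches and hot code paths are primed
    before any timed measurement begins."""
    rng = np.random.default_rng(42)
    n = min(n_warmup, len(queries))
    idx = rng.integers(0, len(queries), size=n)
    for q in queries[idx]:
        search_fn(q)  # results and timings are discarded on purpose
```

Call `warm_up(...)` once per configuration, immediately before the timed loop, so index-parameter changes (like a new `ef_search`) are also exercised before measurement.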

## Real-World Performance Expectations

Based on publicly available benchmarks and community reports for 1M vectors at 1536 dimensions with HNSW:

| Database | P50 Latency | Recall@10 | QPS (single client) |
| --- | --- | --- | --- |
| pgvector (PostgreSQL 16) | 3-8ms | 0.95-0.99 | 200-500 |
| Pinecone (serverless) | 10-30ms | 0.95+ | 100-300 |
| Weaviate (self-hosted) | 2-5ms | 0.95-0.99 | 300-800 |
| Chroma (self-hosted) | 5-15ms | 0.95+ | 100-400 |

These numbers vary significantly based on hardware, index configuration, and query complexity. Always benchmark your own workload.

## FAQ

### How many queries should I run to get statistically meaningful benchmark results?

At minimum, run 1,000 queries per configuration. For latency percentiles (P95, P99), you need at least 10,000 queries to get stable measurements. Use different query vectors for each run — repeating the same queries can bias results due to caching effects.

### Should I benchmark with or without metadata filters?

Both. Run a baseline without filters to understand raw vector search performance, then add filters that match your production query patterns. The performance gap between filtered and unfiltered search reveals how much overhead your filter strategy adds, which helps you design better metadata schemas.

### How do I compare self-hosted vs managed vector databases fairly?

Match the compute resources. If your self-hosted pgvector runs on a 4-core, 16GB machine, compare it against a similarly sized managed instance, not the vendor's top-tier offering. Also account for operational costs — the managed service includes monitoring, backups, and scaling that you would need to build yourself.


---

Source: https://callsphere.ai/blog/benchmarking-vector-databases-latency-throughput-recall-scale
