
Benchmarking Vector Databases: Latency, Throughput, and Recall at Scale

Learn how to rigorously benchmark vector databases with proper methodology — measuring latency, throughput, and recall under realistic conditions to make informed infrastructure decisions.

Why Benchmark Your Own Workload

Vendor benchmarks are marketing. They show optimal configurations on favorable datasets under ideal conditions. Your application has specific embedding dimensions, query patterns, filter complexity, and concurrency levels that no generic benchmark captures.

The only benchmark that matters is one that simulates your actual workload. This guide covers the methodology, metrics, and tooling to run rigorous vector database benchmarks that inform real infrastructure decisions.

The Three Metrics That Matter

1. Recall at K — What fraction of the true K nearest neighbors does the system return? A recall of 0.95 at K=10 means that, averaged over queries, 9.5 of the 10 true neighbors are found.


2. Query Latency — How long does a single query take? Measure P50, P95, and P99 — averages hide tail latency that affects user experience.

3. Queries Per Second (QPS) — How many concurrent queries can the system handle before latency degrades? This determines how many users your system can serve.

These three metrics are in tension. Higher recall requires searching more candidates, which increases latency and reduces throughput. Every index configuration is a point on this three-way tradeoff surface.
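The "averages hide tail latency" point is easy to demonstrate with synthetic numbers (illustrative only, not measurements from any real database): in a heavy-tailed latency distribution, the mean looks acceptable while the P99 a user-facing dashboard would show is far worse.

```python
import numpy as np

# Synthetic latencies from a heavy-tailed (lognormal) distribution
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=1.0, sigma=0.8, size=10_000)

# The mean sits well below the tail; P99 is what slow requests actually feel
print(f"mean = {latencies_ms.mean():.1f} ms")
print(f"p50  = {np.percentile(latencies_ms, 50):.1f} ms")
print(f"p99  = {np.percentile(latencies_ms, 99):.1f} ms")
```

For a distribution like this, P99 lands several times above the mean, which is why the percentiles, not the average, should drive capacity decisions.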

Building a Benchmark Suite

Start with a reproducible benchmark framework:

import time
import numpy as np
from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    recall_at_k: float
    latencies_ms: list[float] = field(default_factory=list)

    @property
    def p50_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 50))

    @property
    def p95_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 95))

    @property
    def p99_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 99))

    @property
    def qps(self) -> float:
        # Throughput implied by sequential latencies; concurrent
        # throughput is measured separately under load below.
        total_seconds = sum(self.latencies_ms) / 1000.0
        return len(self.latencies_ms) / total_seconds if total_seconds > 0 else 0.0

Computing Ground Truth

To measure recall, you need exact nearest neighbors as ground truth. Generate these with brute-force search:

import faiss
import numpy as np

def compute_ground_truth(
    vectors: np.ndarray,
    queries: np.ndarray,
    k: int = 10
) -> np.ndarray:
    """Compute exact nearest neighbors using brute-force search."""
    # faiss expects contiguous float32 arrays
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    queries = np.ascontiguousarray(queries, dtype=np.float32)
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    distances, indices = index.search(queries, k)
    return indices  # shape: (num_queries, k)
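If FAISS is not available in your environment, exact neighbors can also be computed with plain NumPy for small datasets. A sketch (the `ground_truth_numpy` name and the synthetic data are illustrative):

```python
import numpy as np

def ground_truth_numpy(vectors: np.ndarray, queries: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact L2 nearest neighbors via a full distance matrix (small datasets only)."""
    # Squared L2 distance between every query and every stored vector
    d2 = ((queries[:, None, :] - vectors[None, :, :]) ** 2).sum(axis=-1)
    # The k smallest distances per row are the true neighbors
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 32)).astype(np.float32)
queries = rng.normal(size=(10, 32)).astype(np.float32)
print(ground_truth_numpy(vectors, queries, k=10).shape)  # (10, 10)
```

This is O(queries x vectors) in memory and time, so it only replaces FAISS for datasets up to a few hundred thousand vectors.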

Measuring Recall

Compare ANN results against ground truth:


def compute_recall(
    ann_results: list[list[int]],
    ground_truth: np.ndarray,
    k: int = 10
) -> float:
    """Compute recall@k: fraction of true neighbors found."""
    total_recall = 0.0
    for i, ann_ids in enumerate(ann_results):
        true_ids = set(ground_truth[i][:k])
        found = len(set(ann_ids[:k]) & true_ids)
        total_recall += found / k
    return total_recall / len(ann_results)
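A quick sanity check of the recall formula on hand-built inputs (the function is repeated so this snippet runs standalone): a perfect result scores 1.0, and dropping one of five true neighbors lowers recall by exactly 1/5.

```python
import numpy as np

def compute_recall(ann_results, ground_truth, k=10):
    # Same formula as above, repeated so this snippet runs standalone
    total = 0.0
    for i, ann_ids in enumerate(ann_results):
        true_ids = set(ground_truth[i][:k])
        total += len(set(ann_ids[:k]) & true_ids) / k
    return total / len(ann_results)

gt = np.array([[0, 1, 2, 3, 4]])
print(compute_recall([[0, 1, 2, 3, 4]], gt, k=5))  # 1.0 (all 5 found)
print(compute_recall([[0, 1, 2, 3, 9]], gt, k=5))  # 0.8 (4 of 5 found)
```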

Benchmarking pgvector

import psycopg
from pgvector.psycopg import register_vector

def benchmark_pgvector(
    conn,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10,
    ef_search: int = 40
) -> BenchmarkResult:
    register_vector(conn)
    # ef_search is an int, so f-string interpolation is injection-safe here
    conn.execute(f"SET hnsw.ef_search = {ef_search}")

    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        rows = conn.execute(
            "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT %s",
            (query_vec, k)  # register_vector adapts np.ndarray to the vector type
        ).fetchall()
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        all_results.append([row[0] for row in rows])

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)
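Recall and latency both depend heavily on how the index was built, not just on ef_search at query time. A setup sketch for the HNSW index the benchmark above queries against — the `documents`/`embedding` names match the benchmark query, while the `m` and `ef_construction` defaults are illustrative starting points, not recommendations:

```python
def create_hnsw_index(conn, m: int = 16, ef_construction: int = 64) -> str:
    """Create the HNSW index the benchmark queries against; returns the SQL used."""
    # m and ef_construction are build-time knobs: higher values raise the
    # recall ceiling at the cost of build time and memory.
    sql = (
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
        "ON documents USING hnsw (embedding vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction})"
    )
    conn.execute(sql)
    return sql
```

Note the operator class: vector_cosine_ops matches the `<=>` cosine-distance operator used in the benchmark query; an index built with a different operator class would simply not be used.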

Benchmarking Pinecone

from pinecone import Pinecone

def benchmark_pinecone(
    index,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10
) -> BenchmarkResult:
    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        response = index.query(
            vector=query_vec.tolist(),
            top_k=k
        )
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        # assumes vector ids were stored as numeric strings in Pinecone
        result_ids = [int(m["id"]) for m in response["matches"]]
        all_results.append(result_ids)

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)

Concurrent Load Testing

Single-query latency tells only part of the story. Test under concurrent load to find throughput limits:

import concurrent.futures

def concurrent_benchmark(
    search_fn,
    queries: np.ndarray,
    concurrency: int = 10
) -> dict:
    latencies = []

    def run_query(query_vec):
        start = time.perf_counter()
        search_fn(query_vec)
        return (time.perf_counter() - start) * 1000

    start_all = time.perf_counter()

    with concurrent.futures.ThreadPoolExecutor(
        max_workers=concurrency
    ) as executor:
        futures = [
            executor.submit(run_query, q)
            for q in queries
        ]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())

    total_time = time.perf_counter() - start_all
    return {
        "concurrency": concurrency,
        "total_queries": len(queries),
        "total_time_s": total_time,
        "qps": len(queries) / total_time,
        "p50_ms": float(np.percentile(latencies, 50)),
        "p95_ms": float(np.percentile(latencies, 95)),
        "p99_ms": float(np.percentile(latencies, 99)),
    }
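A quick way to exercise the harness without a live database: substitute a `time.sleep` stand-in for the real query and sweep concurrency (the ~5 ms figure is arbitrary). The harness is repeated in condensed form so the snippet runs standalone:

```python
import concurrent.futures
import time

import numpy as np

def concurrent_benchmark(search_fn, queries, concurrency=10):
    # Condensed copy of the harness above so this snippet runs standalone
    def run(q):
        t0 = time.perf_counter()
        search_fn(q)
        return (time.perf_counter() - t0) * 1000
    t_all = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        latencies = [f.result() for f in
                     concurrent.futures.as_completed([ex.submit(run, q) for q in queries])]
    total = time.perf_counter() - t_all
    return {"qps": len(queries) / total, "p95_ms": float(np.percentile(latencies, 95))}

def fake_search(q):
    time.sleep(0.005)  # stand-in for a real vector query (~5 ms)

queries = np.zeros((100, 8))
for c in [1, 4, 16]:
    stats = concurrent_benchmark(fake_search, queries, concurrency=c)
    print(f"concurrency={c}: qps={stats['qps']:.0f}, p95={stats['p95_ms']:.1f}ms")
```

For an I/O-bound stand-in like this, throughput climbs roughly linearly with concurrency; against a real database it flattens (and P95 climbs) once the server saturates, and that knee is the number you are looking for.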

Running a Sweep

Test multiple configurations to find the optimal recall-latency tradeoff:

def parameter_sweep_pgvector(conn, queries, ground_truth):
    results = []
    for ef_search in [10, 20, 40, 80, 160, 320]:
        result = benchmark_pgvector(
            conn, queries, ground_truth,
            k=10, ef_search=ef_search
        )
        results.append({
            "ef_search": ef_search,
            "recall": result.recall_at_k,
            "p50_ms": result.p50_ms,
            "p95_ms": result.p95_ms,
            "qps": result.qps,
        })
        print(
            f"ef_search={ef_search}: "
            f"recall={result.recall_at_k:.3f}, "
            f"p50={result.p50_ms:.1f}ms, "
            f"p95={result.p95_ms:.1f}ms"
        )
    return results
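The sweep output feeds directly into a selection rule: pick the cheapest configuration that still meets your recall target. A sketch (the `pick_config` name and the sample numbers are illustrative, not real measurements):

```python
def pick_config(sweep_results, min_recall=0.95):
    """Choose the lowest-latency configuration that meets a recall target."""
    candidates = [r for r in sweep_results if r["recall"] >= min_recall]
    if not candidates:
        return None  # no setting hit the target; raise ef_search or rebuild the index
    return min(candidates, key=lambda r: r["p95_ms"])

sweep = [
    {"ef_search": 20, "recall": 0.91, "p95_ms": 2.1},
    {"ef_search": 40, "recall": 0.96, "p95_ms": 3.4},
    {"ef_search": 80, "recall": 0.99, "p95_ms": 6.8},
]
print(pick_config(sweep))  # ef_search=40 is the cheapest way to clear 0.95
```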

Benchmarking Best Practices

Use realistic data. Random vectors behave differently from real embeddings. Use a subset of your actual production embeddings or a standard ANN-Benchmarks dataset (e.g., sift-128-euclidean, gist-960-euclidean, or deep-image-96-angular).

Warm up before measuring. Run 100-200 throwaway queries to fill caches and warm JIT-compiled code paths. Only measure after warmup.

Test with filters. If your application uses metadata filtering, include filters in your benchmark. Filtered search performance can differ dramatically from unfiltered.

Measure at your target scale. Performance at 100K vectors does not predict performance at 10M vectors. Load your benchmark with the volume you expect in production.

Run multiple trials. Network variability (especially for cloud databases) can skew individual measurements. Run each configuration 3-5 times and report the median.
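The run-multiple-trials advice is a one-liner to automate. A sketch of a median-of-trials wrapper (the helper name and the noisy sample values are illustrative):

```python
import statistics

def median_of_trials(run_fn, trials=5):
    """Run a benchmark several times and report the per-metric median,
    which damps one-off network or noisy-neighbor spikes."""
    runs = [run_fn() for _ in range(trials)]
    return {key: statistics.median(r[key] for r in runs) for key in runs[0]}

# Illustrative: a fake benchmark whose p95 spikes on one trial
samples = iter([{"p95_ms": 5.0}, {"p95_ms": 4.8}, {"p95_ms": 19.0},
                {"p95_ms": 5.1}, {"p95_ms": 4.9}])
print(median_of_trials(lambda: next(samples), trials=5))  # {'p95_ms': 5.0}
```

The median discards the 19 ms outlier that a mean would drag upward; `run_fn` would wrap a call like `benchmark_pgvector` in practice.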

Real-World Performance Expectations

Based on publicly available benchmarks and community reports for 1M vectors at 1536 dimensions with HNSW:

Database                  P50 Latency   Recall@10   QPS (single client)
pgvector (PostgreSQL 16)  3-8 ms        0.95-0.99   200-500
Pinecone (serverless)     10-30 ms      0.95+       100-300
Weaviate (self-hosted)    2-5 ms        0.95-0.99   300-800
Chroma (self-hosted)      5-15 ms       0.95+       100-400

These numbers vary significantly based on hardware, index configuration, and query complexity. Always benchmark your own workload.

FAQ

How many queries should I run to get statistically meaningful benchmark results?

At minimum, run 1,000 queries per configuration. For latency percentiles (P95, P99), you need at least 10,000 queries to get stable measurements. Use different query vectors for each run — repeating the same queries can bias results due to caching effects.

Should I benchmark with or without metadata filters?

Both. Run a baseline without filters to understand raw vector search performance, then add filters that match your production query patterns. The performance gap between filtered and unfiltered search reveals how much overhead your filter strategy adds, which helps you design better metadata schemas.

How do I compare self-hosted vs managed vector databases fairly?

Match the compute resources. If your self-hosted pgvector runs on a 4-core, 16GB machine, compare it against a similarly sized managed instance, not the vendor's top-tier offering. Also account for operational costs — the managed service includes monitoring, backups, and scaling that you would need to build yourself.


#Benchmarking #VectorDatabase #Performance #Latency #Recall #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

