AI Infrastructure

pgvector HNSW Index Tuning at Scale: m, ef_construction, ef_search (2026)

A measured guide to tuning pgvector HNSW indexes for AI agent workloads — what m, ef_construction, and ef_search actually do, how to size them at 1M, 10M, and 50M rows, and how to monitor recall in production.

TL;DR — Default HNSW params (m=16, ef_construction=64, ef_search=40) are optimized for 100k-row demos, not 10M-row production. Bumping ef_construction to 200 and ef_search to 100–200 typically lifts recall@10 from 0.85 to 0.97 with manageable latency cost.

What you'll build

A reproducible benchmark loop that measures recall and p95 latency across HNSW parameter sets, plus a production tuning playbook for 1M, 10M, and 50M-row pgvector tables.

Schema

CREATE TABLE rag_chunks (
  id BIGSERIAL PRIMARY KEY,
  doc_id UUID NOT NULL,
  chunk_text TEXT NOT NULL,
  embedding vector(1536) NOT NULL
);

-- Build index AFTER bulk load
CREATE INDEX rag_chunks_hnsw ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 32, ef_construction = 200);

Architecture

flowchart TD
  LOAD[Bulk load 10M chunks] --> IDX[Build HNSW with m=32, ef=200]
  IDX --> BENCH[Benchmark loop]
  BENCH --> RECALL[Measure recall@10]
  BENCH --> P95[Measure p95 latency]
  RECALL --> TUNE{Recall > 0.95?}
  P95 --> TUNE
  TUNE -->|No| EFUP[Raise ef_search]
  TUNE -->|Yes| SHIP[Ship config]

Step 1 — Understand the three knobs

  • m — neighbors per node. Default 16. Higher m = better recall, larger index, slower build. For 10M+ vectors set m = 24–32.
  • ef_construction — candidate list during build. Default 64. Production: 128–200. Affects build time, not query time.
  • ef_search — candidate list during query. Default 40. Production: 80–200. Linear knob: latency vs recall.
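Before committing to an m, it helps to estimate what the index will cost in RAM and on disk. The sketch below is a back-of-envelope estimator, not pgvector's exact on-disk layout — the link-overhead and padding constants are assumptions, so treat the output as an order-of-magnitude guide:

```python
def hnsw_size_estimate(rows: int, dims: int, m: int) -> float:
    """Back-of-envelope HNSW index size in GiB.

    Assumptions (rough, NOT pgvector's exact layout):
      - 4 bytes per float32 dimension, stored once in the index
      - layer 0 keeps up to 2*m neighbor links, upper layers ~m more
        on average, at ~8 bytes of link overhead each
      - ~25% page/tuple overhead
    """
    vector_bytes = dims * 4
    link_bytes = (2 * m + m) * 8
    per_row = (vector_bytes + link_bytes) * 1.25
    return rows * per_row / (1024 ** 3)

# 10M rows of 1536-dim embeddings at m=32 — roughly 80 GiB
print(f"{hnsw_size_estimate(10_000_000, 1536, 32):.1f} GiB")
```

At 1536 dimensions the vector payload dominates, which is why raising m from 16 to 32 grows the index far less than doubling it.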

Step 2 — Build with parallel workers

SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;

CREATE INDEX CONCURRENTLY rag_chunks_hnsw ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 32, ef_construction = 200);

pgvector 0.6.0+ supports parallel HNSW builds — 4-8x faster on 8-core machines.
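While a multi-hour build runs, you can watch it from another session through Postgres's built-in progress view (recent pgvector versions populate the tuple counters; on older versions some columns may stay at zero):

```sql
-- Watch a long HNSW build from a second session
SELECT phase,
       round(100.0 * tuples_done / nullif(tuples_total, 0), 1) AS pct_tuples,
       tuples_done, tuples_total
FROM pg_stat_progress_create_index;
```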

Step 3 — Generate a recall ground truth

import numpy as np
import psycopg

conn = psycopg.connect(...)

def brute_force_topk(q: list[float], k: int = 10) -> list[int]:
    """Exact top-k by sequential scan — the recall ground truth."""
    with conn.cursor() as cur:
        # Disable index scans so Postgres sorts every row exactly
        cur.execute("SET LOCAL enable_indexscan = off")
        cur.execute(
            """
            SELECT id FROM rag_chunks
            ORDER BY embedding <=> %s::vector LIMIT %s
            """,
            (q, k),
        )
        return [r[0] for r in cur.fetchall()]

Run brute_force_topk on 200 sampled queries and store the (query, ground-truth ids) pairs — this is the samples list the sweep loop iterates over.
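Before trusting the SQL brute force as ground truth, it's worth cross-checking it against an in-memory implementation on a small sample. A minimal sketch, assuming the sampled embeddings fit in RAM as numpy arrays:

```python
import numpy as np

def cosine_topk(queries: np.ndarray, corpus: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact top-k row indices by cosine distance (same ordering as <=>)."""
    # Normalize both sides, then cosine distance = 1 - dot product
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    cn = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    dist = 1.0 - qn @ cn.T                    # (n_queries, n_corpus)
    return np.argsort(dist, axis=1)[:, :k]    # k nearest per query
```

If the ids returned here disagree with brute_force_topk on the same sample, something is off (wrong operator class, stale data) before you've even started tuning.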

Step 4 — Sweep ef_search against the ground truth
import time

def hnsw_topk(q: list[float], k: int = 10, ef: int = 100) -> list[int]:
    with conn.cursor() as cur:
        # SET values can't be bound parameters, so interpolate a checked int
        cur.execute(f"SET LOCAL hnsw.ef_search = {int(ef)}")
        cur.execute(
            "SELECT id FROM rag_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (q, k),
        )
        return [r[0] for r in cur.fetchall()]

for ef in [40, 80, 120, 160, 200, 300]:
    hits, lat = [], []
    for q, gt in samples:  # samples = [(query_vector, ground_truth_ids), ...]
        t0 = time.perf_counter()
        ids = hnsw_topk(q, ef=ef)
        lat.append(time.perf_counter() - t0)
        hits.append(len(set(ids) & set(gt)) / len(gt))  # recall@10
    print(f"ef={ef} recall={np.mean(hits):.3f} p95={np.percentile(lat, 95)*1000:.1f}ms")

Step 5 — Read the curve, pick a point

Typical 10M-row result on a 16-vCPU Postgres:

ef_search | recall@10 | p95 latency
----------|-----------|------------
40        | 0.86      |  8 ms
100       | 0.94      | 14 ms
200       | 0.98      | 26 ms
400       | 0.99      | 51 ms

For an agent that hits memory once per turn, ef_search = 200 is the sweet spot.
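The "pick a point" step can be made mechanical. A small helper (hypothetical, not part of the benchmark above) that selects the cheapest ef_search meeting both a recall floor and a latency budget:

```python
def pick_ef(curve, recall_target=0.95, p95_budget_ms=30.0):
    """Smallest ef_search meeting both the recall and latency goals.

    curve: list of (ef, recall, p95_ms) tuples from the sweep.
    Returns None if no point satisfies both constraints.
    """
    ok = [(ef, r, p) for ef, r, p in curve if r >= recall_target and p <= p95_budget_ms]
    return min(ok)[0] if ok else None

curve = [(40, 0.86, 8), (100, 0.94, 14), (200, 0.98, 26), (400, 0.99, 51)]
print(pick_ef(curve))  # -> 200 with the defaults above
```

Encoding the decision this way makes re-tuning after a reindex a one-liner instead of a judgment call.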

Step 6 — Production monitoring

SELECT relname, idx_scan, idx_tup_read, idx_tup_fetch,
       pg_size_pretty(pg_relation_size(indexrelid)) AS idx_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'rag_chunks_hnsw';

Track index size weekly — HNSW grows ~1.5–2x the raw vector size at m=32.
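Index size is only half the story — recall can drift as the table churns. A sketch of a periodic probe that re-runs the stored ground-truth queries (search_fn is whatever ANN query function you use, e.g. the hnsw_topk from Step 4):

```python
def recall_probe(samples, search_fn, k=10, threshold=0.95):
    """Re-run stored ground-truth queries and flag recall regressions.

    samples:   list of (query_vector, ground_truth_ids) pairs
    search_fn: callable(query, k) -> list of ids
    Returns (mean_recall, healthy).
    """
    recalls = [
        len(set(search_fn(q, k)) & set(gt)) / len(gt)
        for q, gt in samples
    ]
    mean = sum(recalls) / len(recalls)
    return mean, mean >= threshold
```

Schedule it nightly and alert when healthy comes back False — a recall drop with unchanged settings usually means the graph has degraded under heavy insert/update churn, or the ground truth itself is stale.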

Pitfalls

  • Building before load — wastes hours, produces worse graphs. Always load first.
  • maintenance_work_mem too small — index spills to disk, build slows 10x. Set it to 25-50% of RAM.
  • Filtering on un-indexed columns — WHERE tenant_id = $1 ORDER BY embedding <=> $2 is post-filtered, so you can get back fewer than k rows. Use a partial HNSW index or pgvectorscale's StreamingDiskANN.
  • Ignoring write amplification — every UPDATE to embedding rebuilds graph edges. Batch updates.
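For the filtering pitfall, a minimal sketch of the partial-index workaround — assuming a tenant_id column (not shown in the schema above) and a small number of known-hot tenants, since each tenant needs its own index:

```sql
-- One HNSW per hot tenant: the filter is baked into the index,
-- so no post-filtering happens at query time
-- ('tenant_a' is a placeholder, not a real tenant id)
CREATE INDEX rag_chunks_hnsw_tenant_a ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 128)
  WHERE tenant_id = 'tenant_a';
```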

CallSphere production note

CallSphere's RAG layer indexes 8M+ chunks across 115+ DB tables with m=24, ef_construction=128, ef_search=160. Healthcare and Behavioral Health verticals run on a HIPAA-isolated healthcare_voice Prisma schema; OneRoof uses RLS-scoped HNSW indexes per landlord; UrackIT keeps its non-HIPAA RAG on Supabase + ChromaDB.


FAQ

Q: Does SET hnsw.ef_search need to be SESSION-scoped? SET LOCAL inside the transaction is safest — avoids leaking to pooled connections.

Q: When is IVFFlat actually better than HNSW? Memory-constrained boxes (<8 GB) and >100M vectors with low QPS.

Q: Should I rebuild the index after bulk imports? Only if you imported >20% of total rows. HNSW handles incremental inserts well.

Q: Can I use halfvec to halve memory? Yes — pgvector 0.7+ ships halfvec(n). Recall drop is usually <1%, memory savings 50%.
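A sketch of the expression-index pattern for halfvec — the float32 column stays as-is and only the index is half-precision (the 1536 here matches the schema above; adjust to your dimension):

```sql
-- Half-precision HNSW over the existing float32 column (pgvector 0.7+)
CREATE INDEX rag_chunks_hnsw_half ON rag_chunks
  USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops)
  WITH (m = 32, ef_construction = 200);

-- Queries must use the same expression to hit the index
SELECT id FROM rag_chunks
ORDER BY embedding::halfvec(1536) <=> $1::halfvec(1536) LIMIT 10;
```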

Q: What about pgvectorscale? StreamingDiskANN beats HNSW past ~50M vectors. Worth evaluating if you outgrow pgvector.

