TL;DR — Default HNSW params (m=16, ef_construction=64, ef_search=40) are optimized for 100k-row demos, not 10M-row production. Bumping ef_construction to 200 and ef_search to 100–200 typically lifts recall@10 from 0.85 to 0.97 with manageable latency cost.

What you'll build

A reproducible benchmark loop that measures recall and p95 latency across HNSW parameter sets, plus a production tuning playbook for 1M, 10M, and 50M-row pgvector tables.

Schema

-- The vector type comes from the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
  id BIGSERIAL PRIMARY KEY,
  doc_id UUID NOT NULL,
  chunk_text TEXT NOT NULL,
  embedding vector(1536) NOT NULL
);

-- Build index AFTER bulk load (Step 2 shows the tuned build of this same index; create it only once)
CREATE INDEX rag_chunks_hnsw ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 32, ef_construction = 200);

Architecture

flowchart TD
  LOAD[Bulk load 10M chunks] --> IDX[Build HNSW with m=32, ef_construction=200]
  IDX --> BENCH[Benchmark loop]
  BENCH --> RECALL[Measure recall@10]
  BENCH --> P95[Measure p95 latency]
  RECALL --> TUNE{"Recall > 0.95?"}
  P95 --> TUNE
  TUNE -->|No| EFUP[Raise ef_search]
  TUNE -->|Yes| SHIP[Ship config]

Step 1 — Understand the three knobs

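m — maximum neighbors (graph connections) per node. Default 16. Higher m generally improves recall at the cost of a larger index and a slower build. For 10M+ vectors, m = 24–32 is a reasonable starting range.
ef_construction — size of the candidate list used while building the graph. Default 64; 128–200 is typical in production. It raises build time and graph quality (and therefore achievable recall) but adds no per-query work. pgvector requires ef_construction >= 2 * m.
ef_search — size of the candidate list used at query time. Default 40; 80–200 is typical in production. This is the main latency-versus-recall knob: latency grows roughly linearly with it while recall gains flatten out. It also caps how many rows an index scan can return, so keep it at or above your LIMIT.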
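
The constraints on the three knobs can be checked in a small helper before any DDL runs. The sketch below is illustrative only: HnswParams and its methods are hypothetical names, not a pgvector API, and the numeric limits (m in 2..100, ef_construction in 4..1000 and at least 2 * m, ef_search in 1..1000) are assumptions based on pgvector's documented ranges, so verify them against your installed version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HnswParams:
    """Hypothetical container for the three HNSW knobs (not a pgvector API)."""

    m: int = 16
    ef_construction: int = 64
    ef_search: int = 40

    def validate(self) -> None:
        # Assumed limits; confirm against the pgvector version you run.
        if not 2 <= self.m <= 100:
            raise ValueError("m must be between 2 and 100")
        if not 4 <= self.ef_construction <= 1000:
            raise ValueError("ef_construction must be between 4 and 1000")
        if self.ef_construction < 2 * self.m:
            raise ValueError("ef_construction must be >= 2 * m")
        if not 1 <= self.ef_search <= 1000:
            raise ValueError("ef_search must be between 1 and 1000")

    def create_index_sql(self, table: str, column: str, name: str) -> str:
        """Render the CREATE INDEX statement for cosine distance."""
        self.validate()
        return (
            f"CREATE INDEX {name} ON {table} "
            f"USING hnsw ({column} vector_cosine_ops) "
            f"WITH (m = {self.m}, ef_construction = {self.ef_construction})"
        )


params = HnswParams(m=32, ef_construction=200, ef_search=200)
print(params.create_index_sql("rag_chunks", "embedding", "rag_chunks_hnsw"))
```

Failing fast in application code is a design choice here; Postgres would reject out-of-range values anyway, but only after the bulk load has finished and the build has been kicked off.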

Step 2 — Build with parallel workers

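-- Give the build enough memory to keep the graph in RAM
SET maintenance_work_mem = '8GB';
-- 7 workers plus the leader process
SET max_parallel_maintenance_workers = 7;

-- CONCURRENTLY avoids blocking writes; it cannot run inside a transaction block
CREATE INDEX CONCURRENTLY rag_chunks_hnsw ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 32, ef_construction = 200);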

pgvector 0.6.0+ supports parallel HNSW builds, which can be 4-8x faster on 8-core machines.
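
A long build is easier to watch if you poll pg_stat_progress_create_index from a second session. The sketch below keeps the query as a plain string and implements only the percentage arithmetic, so it runs without a database. The function name progress_pct and the sample rows are made up for illustration.

```python
from typing import Optional

# Query to run from a separate session while CREATE INDEX is in progress.
PROGRESS_SQL = """
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index
"""


def progress_pct(blocks_done: int, blocks_total: int) -> Optional[float]:
    """Percent of heap blocks processed, or None before the total is known."""
    if blocks_total <= 0:
        return None  # same effect as nullif(blocks_total, 0) in SQL
    return round(100.0 * blocks_done / blocks_total, 1)


# Hypothetical rows shaped as (phase, blocks_done, blocks_total)
rows = [("initializing", 0, 0), ("loading tuples", 2500, 10000)]
for phase, done, total in rows:
    print(phase, progress_pct(done, total))
```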

Step 3 — Generate a recall ground truth

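import psycopg
import numpy as np

conn = psycopg.connect(...)

def brute_force_topk(q: list[float], k: int = 10) -> list[int]:
    """Exact top-k by cosine distance, bypassing the HNSW index."""
    # SET LOCAL lasts until the transaction ends, so scope it explicitly;
    # otherwise it lingers in psycopg's implicitly opened transaction.
    with conn.transaction(), conn.cursor() as cur:
        cur.execute("SET LOCAL enable_indexscan = off")
        cur.execute(
            """
            SELECT id FROM rag_chunks
            ORDER BY embedding <=> %s::vector LIMIT %s
            """,
            (q, k),
        )
        return [r[0] for r in cur.fetchall()]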

Run brute_force_topk on 200 sampled queries and store the (query, ids) pairs as ground truth; Step 4 reads them as samples.
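
One way to assemble that ground truth into a list of (query, ids) pairs is sketched below. The helper build_ground_truth and its topk_fn parameter are hypothetical names. In practice topk_fn would be brute_force_topk; here an in-memory exact cosine search over a small random corpus stands in so the sketch runs without Postgres.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def build_ground_truth(
    queries: Sequence[Vector],
    topk_fn: Callable[[Vector, int], List[int]],
    k: int = 10,
) -> List[Tuple[Vector, List[int]]]:
    """Pair each query with its exact top-k ids."""
    return [(q, topk_fn(q, k)) for q in queries]


# Stand-in corpus (id -> vector); a real run would query rag_chunks instead.
rng = random.Random(0)
corpus = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(100)}


def exact_topk(q: Vector, k: int) -> List[int]:
    return sorted(corpus, key=lambda i: cosine_distance(q, corpus[i]))[:k]


# Sample queries from the corpus itself, as you might sample stored chunks.
query_ids = rng.sample(sorted(corpus), 5)
samples = build_ground_truth([corpus[i] for i in query_ids], exact_topk, k=10)
print(len(samples), "ground-truth entries")
```

Sampling queries from real traffic rather than from stored chunks usually gives a more honest recall number, since a stored chunk always finds itself as its own nearest neighbor.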

Step 4 — Sweep ef_search

import time

def hnsw_topk(q, k=10, ef=100):
    # Scope SET LOCAL to one transaction so it cannot leak into later queries
    with conn.transaction(), conn.cursor() as cur:
        cur.execute(f"SET LOCAL hnsw.ef_search = {int(ef)}")
        cur.execute(
            "SELECT id FROM rag_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (q, k),
        )
        return [r[0] for r in cur.fetchall()]

K = 10
# samples: list of (query_vector, ground_truth_ids) pairs from Step 3
for ef in [40, 80, 120, 160, 200, 300]:
    hits, lat = [], []
    for q, gt in samples:
        t0 = time.perf_counter()
        ids = hnsw_topk(q, k=K, ef=ef)
        lat.append(time.perf_counter() - t0)
        hits.append(len(set(ids) & set(gt)) / K)  # recall@K for this query
    print(f"ef={ef} recall={np.mean(hits):.3f} p95={np.percentile(lat,95)*1000:.1f}ms")
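
The metric arithmetic inside that loop can be pulled out and checked on its own. The following sketch uses only the standard library; recall_at_k and p95_ms are illustrative names, and the nearest-rank percentile used here is one common convention that can differ slightly from np.percentile's default interpolation.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[int], truth: Sequence[int], k: int = 10) -> float:
    """Fraction of the exact top-k ids that the approximate search returned."""
    return len(set(retrieved[:k]) & set(truth[:k])) / k


def p95_ms(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, converted from seconds to milliseconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return ordered[rank - 1] * 1000.0


truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43]  # two of the ten are misses
print(f"recall@10 = {recall_at_k(approx, truth):.2f}")

latencies = [0.001 * i for i in range(1, 101)]  # 1 ms through 100 ms
print(f"p95 = {p95_ms(latencies):.1f} ms")
```

With only 200 samples the p95 rests on about ten observations, so treat small differences between adjacent ef_search values as noise.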

Step 5 — Read the curve, pick a point

Typical 10M-row result on a 16-vCPU Postgres:

ef_search	recall@10	p95 latency
40	0.86	8 ms
100	0.94	14 ms
200	0.98	26 ms
400	0.99	51 ms

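For an agent that queries memory once per turn, ef_search = 200 is the sweet spot: recall reaches 0.98 at a p95 of 26 ms, while doubling to 400 adds only 0.01 recall for roughly twice the latency.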
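
Picking the operating point can also be written down as a rule: take the smallest ef_search that clears a recall floor within a latency budget. The sketch below applies that rule to the figures from the table. SweepRow and pick_ef_search are hypothetical names, the 0.95 recall floor comes from the architecture diagram, and the 30 ms budget is an assumption chosen for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class SweepRow(NamedTuple):
    ef_search: int
    recall: float
    p95_ms: float


# Figures copied from the 10M-row table in this step.
SWEEP = [
    SweepRow(40, 0.86, 8.0),
    SweepRow(100, 0.94, 14.0),
    SweepRow(200, 0.98, 26.0),
    SweepRow(400, 0.99, 51.0),
]


def pick_ef_search(
    rows: Sequence[SweepRow], min_recall: float, max_p95_ms: float
) -> Optional[SweepRow]:
    """Smallest ef_search meeting the recall floor within the latency budget."""
    ok = [r for r in rows if r.recall >= min_recall and r.p95_ms <= max_p95_ms]
    return min(ok, key=lambda r: r.ef_search) if ok else None


choice = pick_ef_search(SWEEP, min_recall=0.95, max_p95_ms=30.0)
print(choice)
```

If no row qualifies the function returns None, which is the signal to revisit m and ef_construction rather than to keep raising ef_search.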

Step 6 — Production monitoring

SELECT relname, idx_scan, idx_tup_read, idx_tup_fetch,
       pg_size_pretty(pg_relation_size(indexrelid)) AS idx_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'rag_chunks_hnsw';

Track index size weekly. At m=32, expect the HNSW index to be roughly 1.5–2x the size of the raw vectors.
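
To turn that multiplier into a number worth alerting on, a back-of-the-envelope estimate helps. The sketch below assumes pgvector's documented storage cost of 4 * dimensions + 8 bytes per vector and applies the 1.5–2x range quoted above. The function names are illustrative, and the result is a rough planning figure, not a measurement.

```python
from typing import Tuple

GIB = 1024 ** 3


def raw_vector_bytes(rows: int, dims: int) -> int:
    """Storage for float32 vectors: 4 bytes per dimension plus an 8-byte header."""
    return rows * (4 * dims + 8)


def index_size_range_bytes(
    rows: int, dims: int, low: float = 1.5, high: float = 2.0
) -> Tuple[float, float]:
    """Apply the article's 1.5-2x multiplier (quoted for m=32) to the raw size."""
    raw = raw_vector_bytes(rows, dims)
    return raw * low, raw * high


raw = raw_vector_bytes(10_000_000, 1536)
lo, hi = index_size_range_bytes(10_000_000, 1536)
print(f"raw vectors: {raw / GIB:.1f} GiB")
print(f"estimated HNSW index: {lo / GIB:.1f} to {hi / GIB:.1f} GiB")
```

Comparing this estimate with the idx_size column from the monitoring query is a quick way to spot bloat after heavy update traffic.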

Pitfalls

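Building before load — inserting rows into an existing index is far slower than one bulk build and tends to produce a worse graph. Always load first.
maintenance_work_mem too small — pgvector emits a notice once the graph no longer fits in memory, and the build can slow by around 10x from that point. Set it to 25-50% of RAM for the build session.
Filtering after the index scan — in WHERE tenant_id = $1 ORDER BY embedding <=> $2, the filter is applied to the candidates the HNSW scan returns, so a selective filter can leave fewer than LIMIT rows. Use a partial HNSW index or pgvectorscale's StreamingDiskANN; on pgvector 0.8.0+, iterative index scans (hnsw.iterative_scan) are another option.
Ignoring write amplification — every UPDATE to embedding writes a new row version that must be inserted into the graph again, and the old entry lingers until VACUUM. Batch updates.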
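
The post-filtering pitfall is easy to quantify. If a filter keeps a fraction s of rows and matches are spread evenly across the candidates, an index scan that returns ef_search candidates leaves about ef_search * s rows. The sketch below works through that arithmetic. The function names are illustrative, the even-spread assumption is a simplification, and the cap of 1000 is assumed to be the upper limit of hnsw.ef_search.

```python
import math


def expected_matches(ef_search: int, selectivity: float) -> float:
    """Average rows surviving a post-filter that keeps `selectivity` of rows."""
    return ef_search * selectivity


def min_ef_search(k: int, selectivity: float, cap: int = 1000) -> int:
    """Smallest ef_search whose expected survivors reach k, clamped to `cap`."""
    if not 0 < selectivity <= 1:
        raise ValueError("selectivity must be in (0, 1]")
    return min(cap, math.ceil(k / selectivity))


# Default ef_search with a filter that matches 10% of rows
print(expected_matches(40, 0.10))

# ef_search needed to expect 10 survivors at 25% selectivity
print(min_ef_search(10, 0.25))

# Very selective filter: the requirement exceeds the cap
print(min_ef_search(10, 0.001))
```

When the required value hits the cap, raising ef_search is no longer a fix, which is where a partial index for the hot filter values becomes the better tool.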

CallSphere production note

CallSphere's RAG layer indexes 8M+ chunks across 115+ DB tables with m=24, ef_construction=128, ef_search=160. Healthcare and Behavioral Health verticals run on a HIPAA-isolated healthcare_voice Prisma schema; OneRoof uses RLS-scoped HNSW indexes per landlord; UrackIT keeps its non-HIPAA RAG on Supabase + ChromaDB.

FAQ

Q: Does SET hnsw.ef_search need to be session-scoped? No. SET LOCAL inside a transaction is safest: the value resets at commit or rollback, so it cannot leak to other clients through a pooled connection.

Q: When is IVFFlat actually better than HNSW? Memory-constrained boxes (<8 GB) and >100M vectors with low QPS.

Q: Should I rebuild the index after bulk imports? Only if you imported >20% of total rows. HNSW handles incremental inserts well.

Q: Can I use halfvec to halve memory? Yes — pgvector 0.7+ ships halfvec(n). Recall drop is usually <1%, memory savings 50%.

Q: What about pgvectorscale? StreamingDiskANN beats HNSW past ~50M vectors. Worth evaluating if you outgrow pgvector.
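
On the halfvec question, the effect of half precision on ranking can be explored offline before touching a production table. The sketch below is a toy simulation, not a pgvector benchmark: it rounds random vectors to IEEE 754 half precision with the standard library's struct module and compares exact top-10 neighbors before and after. The corpus size, dimensionality, and helper names are arbitrary choices for illustration, and real embeddings may behave differently.

```python
import math
import random
import struct
from typing import List, Sequence


def to_half(v: Sequence[float]) -> List[float]:
    """Round each component to IEEE 754 half precision and back to float."""
    return [struct.unpack("e", struct.pack("e", x))[0] for x in v]


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def topk(q: Sequence[float], vecs: List[List[float]], k: int) -> List[int]:
    """Exact top-k row indices by cosine similarity."""
    return sorted(range(len(vecs)), key=lambda i: -cosine_sim(q, vecs[i]))[:k]


rng = random.Random(0)
full = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(300)]
half = [to_half(v) for v in full]
queries = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(10)]

# Share of the full-precision top-10 that survives half-precision rounding
overlap = [
    len(set(topk(q, full, 10)) & set(topk(to_half(q), half, 10))) / 10
    for q in queries
]
mean_overlap = sum(overlap) / len(overlap)
print(f"mean top-10 overlap after half-precision rounding: {mean_overlap:.2f}")
```

Running the same comparison on a sample of your own embeddings gives a cheap first read on whether halfvec is safe for your data before you rebuild an index.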

pgvector HNSW Index Tuning at Scale: m, ef_construction, ef_search (2026)
