By Sagar Shankaran, Founder of CallSphere
A measured guide to tuning pgvector HNSW indexes for AI agent workloads — what m, ef_construction, and ef_search actually do, how to size them at 1M, 10M, and 50M rows, and how to monitor recall in production.
Key takeaways
TL;DR: Default HNSW params (m=16, ef_construction=64, ef_search=40) are optimized for 100k-row demos, not 10M-row production. Bumping ef_construction to 200 and ef_search to 100–200 typically lifts recall@10 from 0.85 to 0.97 with manageable latency cost.
A reproducible benchmark loop that measures recall and p95 latency across HNSW parameter sets, plus a production tuning playbook for 1M, 10M, and 50M-row pgvector tables.
CREATE TABLE rag_chunks (
    id BIGSERIAL PRIMARY KEY,
    doc_id UUID NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding vector(1536) NOT NULL
);

-- Build index AFTER bulk load
CREATE INDEX rag_chunks_hnsw ON rag_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 32, ef_construction = 200);
flowchart TD
    LOAD[Bulk load 10M chunks] --> IDX[Build HNSW with m=32, ef_construction=200]
    IDX --> BENCH[Benchmark loop]
    BENCH --> RECALL[Measure recall@10]
    BENCH --> P95[Measure p95 latency]
    RECALL --> TUNE{Recall > 0.95?}
    P95 --> TUNE
    TUNE -->|No| EFUP[Raise ef_search]
    TUNE -->|Yes| SHIP[Ship config]
- m: neighbors per node. Default 16. Higher m = better recall, larger index, slower build. For 10M+ vectors set m = 24–32.
- ef_construction: candidate list size during build. Default 64. Production: 128–200. Affects build time and graph quality, not query latency.
- ef_search: candidate list size during query. Default 40. Production: 80–200. The main query-time knob: latency grows roughly linearly with it, while recall gains taper off.

SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;
CREATE INDEX CONCURRENTLY rag_chunks_hnsw ON rag_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 32, ef_construction = 200);
pgvector 0.6+ supports parallel HNSW builds, which run 4-8x faster on 8-core machines.
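Before starting a long index build, it can help to sanity-check the build parameters in application code. The sketch below is a hypothetical helper (hnsw_index_ddl is not part of pgvector). The bounds it enforces (m in 2..100, ef_construction in 4..1000, ef_construction at least 2 × m) are assumptions based on a reading of the pgvector documentation, so verify them against the version you run.

```python
def hnsw_index_ddl(table: str, column: str, m: int = 32, ef_construction: int = 200,
                   opclass: str = "vector_cosine_ops") -> str:
    """Return a CREATE INDEX statement after checking HNSW build parameters.

    Hypothetical helper. The bounds below are assumptions taken from the
    pgvector docs; confirm them for your installed version.
    """
    if not 2 <= m <= 100:
        raise ValueError(f"m={m} is outside the assumed range 2..100")
    if not 4 <= ef_construction <= 1000:
        raise ValueError(
            f"ef_construction={ef_construction} is outside the assumed range 4..1000"
        )
    if ef_construction < 2 * m:
        raise ValueError("ef_construction should be at least 2 * m")
    # Identifiers are interpolated directly, so only pass trusted names here.
    return (
        f"CREATE INDEX CONCURRENTLY {table}_hnsw ON {table} "
        f"USING hnsw ({column} {opclass}) "
        f"WITH (m = {m}, ef_construction = {ef_construction})"
    )


if __name__ == "__main__":
    print(hnsw_index_ddl("rag_chunks", "embedding"))
```

Failing fast on an invalid combination such as m=64 with ef_construction=100 is cheaper than discovering it after the bulk load has finished.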
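import numpy as np
import psycopg

conn = psycopg.connect(...)

def brute_force_topk(q: list[float], k: int = 10):
    # Exact top-k: with index scans disabled the planner falls back to a sequential scan.
    # conn.transaction() scopes SET LOCAL and commits on exit, so the setting
    # cannot carry over into later HNSW queries on the same connection.
    with conn.transaction(), conn.cursor() as cur:
        cur.execute("SET LOCAL enable_indexscan = off")
        cur.execute(
            """
            SELECT id FROM rag_chunks
            ORDER BY embedding <=> %s::vector LIMIT %s
            """,
            (q, k),
        )
        return [r[0] for r in cur.fetchall()]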
Run brute-force on 200 sampled queries, store as ground truth.
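As an offline cross-check of the SQL ground truth, the same exact search can be reproduced in plain Python on a handful of rows pulled into memory. This is a minimal sketch: cosine_distance, exact_topk, and the toy rows are illustrative and not part of the original benchmark. It ranks by cosine distance, which is what pgvector's <=> operator computes.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_topk(query: list[float], rows: list[tuple[int, list[float]]], k: int = 10) -> list[int]:
    """rows is a list of (id, vector) pairs; returns the ids of the k nearest vectors."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [row_id for row_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy data: three 2-d vectors with ids 10, 11, 12.
    rows = [(10, [1.0, 0.0]), (11, [0.0, 1.0]), (12, [1.0, 1.0])]
    print(exact_topk([1.0, 0.1], rows, k=2))
```

This is only practical for spot checks on small samples; the SQL brute-force query remains the source of ground truth at full table size.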
import time

def hnsw_topk(q, k=10, ef=100):
    # SET LOCAL only applies inside a transaction, so scope it with conn.transaction().
    with conn.transaction(), conn.cursor() as cur:
        cur.execute(f"SET LOCAL hnsw.ef_search = {int(ef)}")
        cur.execute(
            "SELECT id FROM rag_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (q, k),
        )
        return [r[0] for r in cur.fetchall()]

# Sweep ef_search. samples is a list of (query_vector, ground_truth_ids) pairs
# produced by the brute-force step.
for ef in [40, 80, 120, 160, 200, 300]:
    hits, lat = [], []
    for q, gt in samples:
        t0 = time.perf_counter()
        ids = hnsw_topk(q, ef=ef)
        lat.append(time.perf_counter() - t0)
        hits.append(len(set(ids) & set(gt)) / 10)  # recall@10
    print(f"ef={ef} recall={np.mean(hits):.3f} p95={np.percentile(lat,95)*1000:.1f}ms")
Typical 10M-row result on a 16-vCPU Postgres:
| ef_search | recall@10 | p95 latency |
|---|---|---|
| 40 | 0.86 | 8 ms |
| 100 | 0.94 | 14 ms |
| 200 | 0.98 | 26 ms |
| 400 | 0.99 | 51 ms |
For an agent that hits memory once per turn, 200 is the sweet spot.
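One way to turn a sweep like the table above into a setting is to pick the smallest ef_search that clears a recall target within a latency budget. The sketch below hard-codes the four rows from the table. The name pick_ef_search, the 0.95 target, and the 30 ms budget are illustrative assumptions, not values from the benchmark code.

```python
from typing import NamedTuple, Optional


class SweepRow(NamedTuple):
    ef_search: int
    recall: float
    p95_ms: float


# Rows copied from the 10M-row results table.
SWEEP = [
    SweepRow(40, 0.86, 8.0),
    SweepRow(100, 0.94, 14.0),
    SweepRow(200, 0.98, 26.0),
    SweepRow(400, 0.99, 51.0),
]


def pick_ef_search(rows: list[SweepRow], target_recall: float = 0.95,
                   max_p95_ms: float = 30.0) -> Optional[int]:
    """Smallest ef_search that meets the recall target within the latency budget."""
    for row in sorted(rows, key=lambda r: r.ef_search):
        if row.recall >= target_recall and row.p95_ms <= max_p95_ms:
            return row.ef_search
    return None  # nothing qualifies: relax the target or revisit m / ef_construction


if __name__ == "__main__":
    print(pick_ef_search(SWEEP))
```

With these inputs the helper returns 200. Tightening the target to 0.99 under the same 30 ms budget returns None, which is the signal to rebuild with a higher m or ef_construction instead of raising ef_search further.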
SELECT relname, idx_scan, idx_tup_read, idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS idx_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'rag_chunks_hnsw';
Track index size weekly — HNSW grows ~1.5–2x the raw vector size at m=32.
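The sizing rule above translates into simple arithmetic. This sketch assumes 4 bytes per dimension for raw float32 vectors and ignores per-row overhead. estimate_hnsw_size_gb is a hypothetical helper, and the 1.5–2x multiplier is the rule of thumb quoted above for m=32, not a measured value.

```python
def estimate_hnsw_size_gb(rows: int, dims: int, bytes_per_dim: int = 4,
                          low: float = 1.5, high: float = 2.0) -> tuple[float, float]:
    """Return a (low, high) estimate of HNSW index size in GB (1 GB = 1e9 bytes)."""
    raw_gb = rows * dims * bytes_per_dim / 1e9  # raw vector payload only
    return raw_gb * low, raw_gb * high


if __name__ == "__main__":
    for rows in (1_000_000, 10_000_000, 50_000_000):
        lo, hi = estimate_hnsw_size_gb(rows, dims=1536)
        print(f"{rows:>11,} rows: {lo:.0f}-{hi:.0f} GB")
```

Passing bytes_per_dim=2 approximates halfvec storage. Comparing the estimate with the idx_size reported by the monitoring query is a quick way to notice unexpected growth.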
- maintenance_work_mem too small: the index build spills to disk and slows 10x. Set it to 25-50% of RAM.
- WHERE tenant_id = $1 ORDER BY embedding <=> $2 is post-filtered. Use a partial HNSW index or pgvectorscale's StreamingDiskANN.
- Updating embedding rebuilds graph edges. Batch updates.

CallSphere's RAG layer indexes 8M+ chunks across 115+ DB tables with m=24, ef_construction=128, ef_search=160. Healthcare and Behavioral Health verticals run on a HIPAA-isolated healthcare_voice Prisma schema; OneRoof uses RLS-scoped HNSW indexes per landlord; UrackIT keeps its non-HIPAA RAG on Supabase + ChromaDB. 37 agents · 90+ tools · 6 verticals. Plans: $149 / $499 / $1,499, 14-day trial, 22% affiliate.
Q: Does SET hnsw.ef_search need to be SESSION-scoped?
SET LOCAL inside the transaction is safest — avoids leaking to pooled connections.
Q: When is IVFFlat actually better than HNSW?
Memory-constrained boxes (<8 GB) and >100M vectors with low QPS.
Q: Should I rebuild the index after bulk imports?
Only if you imported >20% of total rows. HNSW handles incremental inserts well.
Q: Can I use halfvec to halve memory?
Yes — pgvector 0.7+ ships halfvec(n). Recall drop is usually <1%, memory savings 50%.
Q: What about pgvectorscale?
StreamingDiskANN beats HNSW past ~50M vectors. Worth evaluating if you outgrow pgvector.
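The SET LOCAL advice in the first answer can be wrapped in a small helper. This sketch assumes a psycopg 3 connection (conn.transaction() and conn.cursor() used as context managers) and uses PostgreSQL's set_config(name, value, is_local) so that ef_search is passed as a bound parameter instead of being formatted into the SQL string. search_with_ef is a hypothetical name.

```python
def search_with_ef(conn, query_vec: list[float], k: int = 10, ef: int = 100) -> list[int]:
    """Run one HNSW query with a transaction-local hnsw.ef_search.

    Assumes conn is a psycopg 3 connection; the table and column names
    follow the rag_chunks schema used earlier in the article.
    """
    with conn.transaction(), conn.cursor() as cur:
        # is_local = true gives SET LOCAL semantics: the value resets when the
        # transaction ends, so it cannot leak to the next user of a pooled connection.
        cur.execute("SELECT set_config('hnsw.ef_search', %s, true)", (str(int(ef)),))
        cur.execute(
            "SELECT id FROM rag_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (query_vec, k),
        )
        return [row[0] for row in cur.fetchall()]
```

A call such as search_with_ef(conn, embedding, k=10, ef=160) keeps the tuning knob next to the query it affects, which makes per-endpoint ef_search values easy to audit.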
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.