By Sagar Shankaran, Founder of CallSphere
pgvector 0.8 with binary quantization builds HNSW indexes roughly 150x faster than 0.5, and paired with pgvectorscale it hits 471 QPS at 99% recall on 50M vectors. Here is the production tuning guide for teams that already run Postgres.
Key takeaways
TL;DR: pgvector 0.8 supports parallel HNSW builds (introduced in 0.6) along with binary quantization and halfvec scalar quantization (both introduced in 0.7). On dbpedia-1M, build time dropped ~150x vs 0.5, and throughput at 99% recall improved ~30x over IVFFlat. With pgvectorscale's StreamingDiskANN, 50M vectors hit 471 QPS at 99% recall, competitive with Pinecone at 75% lower cost.
pgvector is a Postgres extension that adds a vector type, distance operators (<=> cosine distance, <-> L2 distance, <#> negative inner product), and two index types: IVFFlat and HNSW. HNSW dominates production workloads in 2026 because it offers a better speed-recall tradeoff than IVFFlat, at the cost of higher build time and memory. Recent releases (0.6 and 0.7) have aggressively addressed both costs.
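As a rough illustration of what the three distance operators compute, here is a minimal pure-Python sketch. The function names are made up for this example and are not part of pgvector. Note that <#> returns the negative inner product, so that smaller values always mean "closer" under ORDER BY.

```python
import math

def l2_distance(a, b):
    """What <-> computes: Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """What <#> computes: the inner product, negated so smaller means closer."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """What <=> computes: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [0.0, 1.0]
print(l2_distance(a, b), negative_inner_product(a, b), cosine_distance(a, b))
```

This is also why the queries later in this guide compute the score as 1 - (embedding <=> query): the operator yields a distance, and the subtraction turns it back into a similarity.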
flowchart LR
E[Embeddings] --> T{Type}
T -->|fullvec 32-bit| F[vector type]
T -->|halfvec 16-bit| H[halfvec type]
T -->|binary| B[bit type]
F --> I[HNSW index]
H --> I
B --> I
I --> Q[Query]
Q --> R[Top-K]
HNSW builds a multi-layer graph, similar in spirit to a skip list. The top layers are sparse (long jumps); the bottom layer is the full graph. Search starts at the top and greedily descends. Three key knobs:
m: max neighbors per node (default 16). Higher = better recall, more memory.
ef_construction: candidate list size at build time (default 64). Higher = better recall, slower build.
ef_search: candidate list size at query time (default 40). Higher = better recall, slower query.
Quantization options: halfvec (16-bit floats) and bit (binary), alongside the full-precision 32-bit vector type.
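To put numbers on the quantization options, the sketch below estimates raw storage per embedding using the sizes given in the pgvector README: 4 bytes per dimension for vector, 2 bytes for halfvec, and 1 bit for bit, each plus 8 bytes of header. The helper name `embedding_bytes` is hypothetical, and the totals ignore tuple headers, TOAST, and index overhead, so treat them as lower bounds.

```python
def embedding_bytes(dims, kind):
    """Approximate storage for one embedding value, per the pgvector README."""
    if kind == "vector":       # 32-bit floats
        return 4 * dims + 8
    if kind == "halfvec":      # 16-bit floats
        return 2 * dims + 8
    if kind == "bit":          # one bit per dimension, rounded up to whole bytes
        return (dims + 7) // 8 + 8
    raise ValueError(f"unknown type: {kind}")

# Compare the three types for 1536-dimension embeddings at 50M rows.
for kind in ("vector", "halfvec", "bit"):
    per_row = embedding_bytes(1536, kind)
    total_gb = per_row * 50_000_000 / 1e9
    print(f"{kind:8s} {per_row:5d} bytes/row  ~{total_gb:.1f} GB for 50M rows")
```

Under these assumptions halfvec roughly halves the raw footprint, and the bit representation is about 30x smaller than the full-precision vector.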
CallSphere stores Healthcare retrieval embeddings (patient summaries, insurance plan text, provider directories) in pgvector inside the same 115-table Postgres that runs the rest of the platform. One database, one transactional consistency story. We use:
37 agents · 90+ tools · 115+ DB tables · 6 verticals. Pricing $149/$499/$1499, 14-day trial, 22% affiliate. The Healthcare retrieval lives at the heart of the platform; visit /pricing to compare plan-level retrieval limits.
-- 1. Install + extension
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Create table with halfvec for compression
CREATE TABLE kb (
    id BIGSERIAL PRIMARY KEY,
    text TEXT,
    embedding halfvec(1536)
);

-- 3. Insert
INSERT INTO kb (text, embedding) VALUES ($1, $2);

-- 4. Build HNSW with parallel workers
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 8;
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- 5. Query
SET hnsw.ef_search = 80;
SELECT id, text, 1 - (embedding <=> $1::halfvec) AS score
FROM kb
ORDER BY embedding <=> $1::halfvec
LIMIT 10;
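To show why the hnsw.ef_search setting in step 5 affects recall, here is a simplified sketch of the best-first search HNSW runs inside a single graph layer. It illustrates the general algorithm and is not pgvector's implementation. The function name, the toy graph, and the query are all invented for the example.

```python
import heapq

def search_layer(vectors, neighbors, query, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results."""
    def dist(i):
        return sum((x - y) ** 2 for x, y in zip(vectors[i], query))

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                # nothing left that can improve the result set
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst result
    # Return node ids ordered from nearest to farthest.
    return [i for _, i in sorted((-nd, i) for nd, i in results)]

# Toy graph: node 3 is the true nearest neighbor, but it is reachable
# only through node 1, which is far from the query.
vectors = {0: [0.0], 1: [10.0], 2: [4.0], 3: [5.0]}
neighbors = {0: [2], 1: [2, 3], 2: [0, 1], 3: [1]}
query = [5.2]

print(search_layer(vectors, neighbors, query, entry=0, ef=1))
print(search_layer(vectors, neighbors, query, entry=0, ef=2))
```

With ef=1 the search stops at node 2, a local minimum. With ef=2 it keeps a second candidate alive long enough to reach node 3. This is the recall-versus-latency tradeoff that ef_search controls.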
For binary quantization with re-rank:
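-- Coarse search on bit, then re-rank with halfvec.
-- Assumes kb_bit(id, embedding bit(1536)) holds a binary-quantized copy of kb.embedding;
-- $1 is the binary-quantized query and $2 is the original halfvec query.
WITH coarse AS (
    SELECT id FROM kb_bit ORDER BY embedding <~> $1::bit(1536) LIMIT 100
)
SELECT k.id, k.text, 1 - (k.embedding <=> $2::halfvec) AS score
FROM kb k JOIN coarse ON k.id = coarse.id
ORDER BY k.embedding <=> $2::halfvec LIMIT 10;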
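The same coarse-then-re-rank idea can be sketched in pure Python to show what each stage contributes. This is an illustrative model and not what Postgres executes. It assumes quantization maps positive components to 1 and everything else to 0, and the rows and query are invented.

```python
import math

def binary_quantize(vec):
    """One bit per dimension: 1 where the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of differing bits (the distance used by <~>)."""
    return sum(x != y for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def search(rows, query, coarse_k, top_k):
    """Stage 1: cheap Hamming shortlist. Stage 2: exact cosine re-rank."""
    qbits = binary_quantize(query)
    coarse = sorted(rows, key=lambda r: hamming(binary_quantize(r[1]), qbits))[:coarse_k]
    reranked = sorted(coarse, key=lambda r: cosine_distance(r[1], query))
    return [row_id for row_id, _ in reranked[:top_k]]

rows = [
    (1, [0.9, 0.1, -0.2, 0.4]),
    (2, [0.1, 0.9, -0.2, 0.4]),
    (3, [-0.5, -0.5, 0.5, -0.5]),
    (4, [0.5, 0.4, 0.3, 0.2]),
]
query = [1.0, 0.2, -0.1, 0.3]

print(search(rows, query, coarse_k=3, top_k=2))
print(search(rows, query, coarse_k=2, top_k=2))
```

Rows 1 and 2 have identical bit patterns, so only the re-rank can tell them apart. Row 4 is the second-best match by cosine distance but differs by one bit, so a shortlist of 2 drops it before the re-rank sees it. This is why the coarse LIMIT (100 in the SQL above) needs to be comfortably larger than the final LIMIT.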
Set maintenance_work_mem to 25–50% of RAM during index build.
Set max_parallel_maintenance_workers to enable parallel HNSW builds (pgvector 0.6+).
Tune ef_search per workload: 40 for fast voice, 100+ for batch quality.
Move to pgvectorscale (StreamingDiskANN) when you cross 20M vectors.
Run VACUUM regularly, or set autovacuum_vacuum_scale_factor low for the table.

Postgres or dedicated DB? If you already run Postgres at scale, use pgvector. Otherwise it depends on workload.
halfvec or fullvec (the 32-bit vector type)? halfvec for 99% of cases. The 0.5–1 percentage point accuracy drop is invisible in practice.
Binary quantization? Yes if 50M+ vectors and you can afford a re-rank pass.
pgvectorscale required? Above 20M vectors, yes. Below, vanilla pgvector is enough.
Plan limits? /pricing shows per-tenant retrieval allowances.
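The rules of thumb in this guide can be collected into a small helper. This is only a sketch of the thresholds stated above (20M vectors for pgvectorscale, 50M+ for binary quantization, 25–50% of RAM for maintenance_work_mem, ef_search of 40 or 100). The function name, the workload labels, and the returned keys are hypothetical, and the cutoffs are starting points to benchmark against, not hard limits.

```python
def recommend(vector_count, ram_gb, workload):
    """Map the guide's rules of thumb to a settings dict.

    workload is a hypothetical label: "voice" (latency-sensitive) or "batch".
    """
    use_diskann = vector_count > 20_000_000
    return {
        "column_type": "halfvec",
        "binary_quantization": vector_count >= 50_000_000,
        "index": "pgvectorscale StreamingDiskANN" if use_diskann else "pgvector HNSW",
        # Range to use only while the index is being built.
        "maintenance_work_mem_gb": (ram_gb * 0.25, ram_gb * 0.5),
        # ef_search is an HNSW setting, so it is left unset for StreamingDiskANN.
        "hnsw_ef_search": None if use_diskann else (40 if workload == "voice" else 100),
    }

print(recommend(5_000_000, 64, "voice"))
print(recommend(60_000_000, 128, "batch"))
```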
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.