pgvector at Scale in 2026: HNSW Tuning + Binary Quantization
pgvector 0.8's binary quantization cuts HNSW build time by ~150x, and with pgvectorscale a 50M-vector corpus hits 471 QPS at 99% recall. Here is the production tuning guide for Postgres-shop teams.
TL;DR — pgvector 0.8 (early 2026) ships parallel HNSW build, binary quantization, and halfvec scalar quantization. On dbpedia-1M, build time dropped ~150x vs 0.5; throughput at 99% recall improved ~30x over IVFFlat. With pgvectorscale's StreamingDiskANN, 50M vectors hit 471 QPS at 99% recall — competitive with Pinecone at 75% lower cost.
The technique
pgvector is a Postgres extension that adds a vector type, distance operators (<=> cosine, <-> L2, <#> inner product), and two index types: IVFFlat and HNSW. HNSW dominates production workloads in 2026 because it delivers a better recall/latency trade-off than IVFFlat, at the cost of higher build time and memory use; the 0.7+ releases have aggressively attacked both costs.
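To make the operators concrete, here is a throwaway sketch on a toy 3-dim table (note that <#> returns the negative inner product, so smaller is always better):
-- Toy table: the three distance operators side by side
CREATE TABLE demo (id int, v vector(3));
INSERT INTO demo VALUES (1, '[1,0,0]'), (2, '[0,1,0]');
SELECT id,
       v <=> '[1,0,0]' AS cosine_distance,
       v <-> '[1,0,0]' AS l2_distance,
       v <#> '[1,0,0]' AS neg_inner_product
FROM demo
ORDER BY v <=> '[1,0,0]';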
flowchart LR
E[Embeddings] --> T{Type}
T -->|fullvec 32-bit| F[vector type]
T -->|halfvec 16-bit| H[halfvec type]
T -->|binary| B[bit type]
F --> I[HNSW index]
H --> I
B --> I
I --> Q[Query]
Q --> R[Top-K]
How it works
HNSW builds a multi-layer skip-list-of-graphs. Top layers are sparse (long jumps); the bottom layer is the full graph. Search starts at the top and greedily descends. Three key knobs:
- m: max neighbors per node (default 16). Higher = better recall, more memory.
- ef_construction: candidate list size at build time (default 64). Higher = better recall, slower build.
- ef_search: candidate list size at query time. Higher = better recall, slower query.
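Because m drives graph fan-out, index size grows with it; a quick footprint check after a build (the index name here, kb_embedding_idx, is hypothetical; use whatever \d reports):
SELECT pg_size_pretty(pg_relation_size('kb_embedding_idx')) AS hnsw_index_size;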
Quantization options:
- halfvec: 16-bit floats, ~50x faster build, negligible accuracy drop on most embeddings.
- bit / binary: 1 bit per dim, ~150x faster build, ~5–10% recall drop unless you re-rank the top 100 with the full vectors (see the expression-index sketch below).
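If you want the 1-bit coarse pass without maintaining a second column, pgvector 0.7+ supports an expression index over binary_quantize(); a sketch assuming a hypothetical table kb_full with a vector(1536) column named embedding:
-- HNSW over the binary-quantized expression; queries must ORDER BY the same expression
CREATE INDEX ON kb_full USING hnsw
  ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);
-- Coarse top-100, matching the index expression exactly
SELECT id FROM kb_full
ORDER BY binary_quantize(embedding)::bit(1536) <~> binary_quantize($1::vector)
LIMIT 100;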
CallSphere implementation
CallSphere stores Healthcare retrieval embeddings (patient summaries, insurance plan text, provider directories) in pgvector inside the same 115-table Postgres that runs the rest of the platform. One database, one transactional consistency story. We use:
- halfvec + HNSW for the patient-summary index (5M vectors, dense semantic queries)
- bit + re-rank with halfvec for the provider directory (10M+ rows, exact-match dominant)
- fullvec + HNSW for low-volume but high-precision indexes like billing codes
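In DDL, those three choices look roughly like this (table and column names are illustrative, not our actual schema):
CREATE INDEX ON patient_summaries USING hnsw (embedding halfvec_cosine_ops);
-- provider directory: 1-bit coarse index; the re-rank pass reads a halfvec column
CREATE INDEX ON provider_directory USING hnsw (embedding_bits bit_hamming_ops);
CREATE INDEX ON billing_codes USING hnsw (embedding vector_cosine_ops);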
37 agents · 90+ tools · 115+ DB tables · 6 verticals. Pricing is $149/$499/$1,499 with a 14-day trial and a 22% affiliate program. Healthcare retrieval sits at the heart of the platform; visit /pricing to compare plan-level retrieval limits.
Build steps with code
-- 1. Install + extension
CREATE EXTENSION IF NOT EXISTS vector;
-- 2. Create table with halfvec for compression
CREATE TABLE kb (
id BIGSERIAL PRIMARY KEY,
text TEXT,
embedding halfvec(1536)
);
-- 3. Insert
INSERT INTO kb (text, embedding) VALUES ($1, $2);
-- 4. Build HNSW with parallel workers
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 8;
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);
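-- Note: defaults shown for brevity; for production, consider m = 16–32 and
-- ef_construction = 128–200 (see Pitfalls below) at the cost of a slower build.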
-- 5. Query
SET hnsw.ef_search = 80;
SELECT id, text, 1 - (embedding <=> $1::halfvec) AS score
FROM kb ORDER BY embedding <=> $1::halfvec LIMIT 10;
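Index builds on tens of millions of rows take a while; you can watch progress from another session with Postgres's built-in view:
-- Run in a separate session during the CREATE INDEX
SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;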
For binary quantization with re-rank:
-- Coarse search on bit, then re-rank with halfvec
WITH coarse AS (
SELECT id FROM kb_bit ORDER BY embedding <~> $1::bit(1536) LIMIT 100
)
SELECT k.id, k.text, 1 - (k.embedding <=> $2::halfvec) AS score
FROM kb k JOIN coarse ON k.id = coarse.id
ORDER BY k.embedding <=> $2::halfvec LIMIT 10;
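The coarse pass above assumes a companion kb_bit table; one way to build and index it, as a sketch (uses pgvector 0.7+'s binary_quantize(), with an explicit halfvec-to-vector cast since binary_quantize takes a vector):
-- Companion table holding the 1-bit form of each kb row
CREATE TABLE kb_bit (
  id BIGINT PRIMARY KEY REFERENCES kb(id),
  embedding bit(1536)
);
INSERT INTO kb_bit (id, embedding)
SELECT id, binary_quantize(embedding::vector)::bit(1536) FROM kb;
CREATE INDEX ON kb_bit USING hnsw (embedding bit_hamming_ops);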
- Set maintenance_work_mem to 25–50% of RAM during index builds.
- Use max_parallel_maintenance_workers for parallel builds on 0.7+.
- Tune ef_search per workload: 40 for fast voice paths, 100+ for batch quality (per-role sketch below).
- Run pgvectorscale (StreamingDiskANN) when you cross 20M vectors.
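Since hnsw.ef_search is an ordinary GUC, per-workload defaults can be pinned at the role level instead of SET on every connection; a sketch with hypothetical roles:
-- Hypothetical roles: the voice path wants latency, batch wants recall
ALTER ROLE voice_agent SET hnsw.ef_search = 40;
ALTER ROLE batch_worker SET hnsw.ef_search = 120;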
Pitfalls
- Forgetting halfvec: storing 1536-dim fullvec at 10M rows is ~60GB, pointless when halfvec halves it with no measurable accuracy loss.
- Default ef_construction: 64 is fine for testing; production deserves 128–200.
- No vacuum: HNSW indexes bloat on update-heavy workloads. Schedule VACUUM or set autovacuum_vacuum_scale_factor low for the table (example below).
- Single replica: vector workloads are CPU- and RAM-hungry. Read replicas help.
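For the vacuum item, the per-table override is one statement:
-- Autovacuum this table at ~2% dead tuples instead of the 20% default
ALTER TABLE kb SET (autovacuum_vacuum_scale_factor = 0.02);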
FAQ
Postgres or dedicated DB? If you already run Postgres at scale, pgvector. Otherwise it depends on workload.
halfvec or fullvec? halfvec for 99% of cases. The 0.5–1pp accuracy drop is invisible in practice.
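If you start on fullvec and change your mind, the conversion is a single table-rewriting statement; a sketch against the hypothetical kb_full table with a vector(1536) column:
-- Rewrites the table: plan a maintenance window, then rebuild the index
ALTER TABLE kb_full ALTER COLUMN embedding TYPE halfvec(1536)
  USING embedding::halfvec(1536);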
Binary quantization? Yes if 50M+ vectors and you can afford a re-rank pass.
pgvectorscale required? Above 20M vectors, yes. Below, vanilla pgvector is enough.
Plan limits? /pricing shows per-tenant retrieval allowances.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.