
pgvector at Scale in 2026: HNSW Tuning + Binary Quantization

pgvector 0.8 with binary quantization cuts HNSW build time ~150x; paired with pgvectorscale it hits 471 QPS at 99% recall on 50M vectors. Here is the production tuning guide for Postgres-shop teams.

TL;DR — pgvector 0.8 (early 2026) ships parallel HNSW build, binary quantization, and halfvec scalar quantization. On dbpedia-1M, build time dropped ~150x vs 0.5; throughput at 99% recall improved ~30x over IVFFlat. With pgvectorscale's StreamingDiskANN, 50M vectors hit 471 QPS at 99% recall — competitive with Pinecone at 75% lower cost.

The technique

pgvector is a Postgres extension that adds a vector type, distance operators (<=> cosine, <-> L2, <#> inner product), and two index types: IVFFlat and HNSW. HNSW dominates production workloads in 2026 because it offers tighter recall guarantees at the cost of higher build time and memory, both of which releases 0.7 and later have aggressively addressed.
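
A minimal illustration of the three distance operators, using toy 3-dimensional literals (note that <#> returns the negated inner product, so smaller still means "closer"):

-- Toy vectors; in practice these are your embedding columns and query parameters
SELECT '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance,
       '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance,
       '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product;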

flowchart LR
  E[Embeddings] --> T{Type}
  T -->|fullvec 32-bit| F[vector type]
  T -->|halfvec 16-bit| H[halfvec type]
  T -->|binary| B[bit type]
  F --> I[HNSW index]
  H --> I
  B --> I
  I --> Q[Query]
  Q --> R[Top-K]

How it works

HNSW builds a multi-layer skip-list-of-graphs. Top layers are sparse (long jumps); the bottom layer is the full graph. Search starts at the top and greedily descends toward the query's nearest neighbors. Three key knobs:

  • m: max neighbors per node (default 16). Higher = better recall, more memory.
  • ef_construction: candidate list size at build (default 64). Higher = better recall, slower build.
  • ef_search: candidate list size at query time (default 40). Higher = better recall, slower query.
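
Where each knob lives, as a minimal sketch (the kb table and embedding column match the build steps later in this post):

-- m and ef_construction are index options, fixed at CREATE INDEX time
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 32, ef_construction = 128);

-- ef_search is a runtime setting, adjustable per session or per transaction
SET hnsw.ef_search = 100;        -- session-wide
SET LOCAL hnsw.ef_search = 100;  -- only for the current transaction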

Quantization options:

  • halfvec — 16-bit floats, ~50x faster build, ~negligible accuracy drop on most embeddings.
  • bit / binary — 1-bit per dim, ~150x faster build, ~5–10% recall drop unless you re-rank with the full vectors on top-100.
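
If you would rather not maintain a separate bit column, pgvector 0.7+ can also binary-quantize a full-precision column inside an expression index. A sketch, assuming an items table with a vector(1536) embedding column and a vector query parameter:

-- Index a binary-quantized view of the full vectors (Hamming distance)
CREATE INDEX ON items USING hnsw
  ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);

-- Coarse candidates from the quantized index, then re-rank the top 100 on the full vectors
WITH coarse AS (
  SELECT id FROM items
  ORDER BY binary_quantize(embedding)::bit(1536) <~> binary_quantize($1::vector)
  LIMIT 100
)
SELECT i.id FROM items i JOIN coarse USING (id)
ORDER BY i.embedding <=> $1::vector LIMIT 10;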

CallSphere implementation

CallSphere stores Healthcare retrieval embeddings (patient summaries, insurance plan text, provider directories) in pgvector inside the same 115-table Postgres that runs the rest of the platform. One database, one transactional consistency story. We use:

  • halfvec + HNSW for the patient-summary index (5M vectors, dense semantic queries)
  • bit + re-rank with halfvec for the provider directory (10M+ rows, exact-match dominant)
  • fullvec + HNSW for low-volume but high-precision indexes like billing codes
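
As DDL, those three shapes look roughly like this (table and column names are illustrative, not CallSphere's actual schema):

-- Dense semantic index: halfvec + HNSW
CREATE INDEX ON patient_summaries USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 128);

-- Coarse binary index for the high-volume directory; queries re-rank on halfvec
CREATE INDEX ON provider_directory USING hnsw (embedding_bits bit_hamming_ops);

-- Full-precision index where accuracy matters most
CREATE INDEX ON billing_codes USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 200);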

The platform spans 37 agents, 90+ tools, 115+ DB tables, and 6 verticals, with plans at $149/$499/$1499, a 14-day trial, and a 22% affiliate program. Healthcare retrieval lives at the heart of it; visit /pricing to compare plan-level retrieval limits.

Build steps with code

-- 1. Install + extension
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Create table with halfvec for compression
CREATE TABLE kb (
  id BIGSERIAL PRIMARY KEY,
  text TEXT,
  embedding halfvec(1536)
);

-- 3. Insert
INSERT INTO kb (text, embedding) VALUES ($1, $2);

-- 4. Build HNSW with parallel workers
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 8;
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- 5. Query
SET hnsw.ef_search = 80;
SELECT id, text, 1 - (embedding <=> $1::halfvec) AS score
FROM kb ORDER BY embedding <=> $1::halfvec LIMIT 10;

For binary quantization with re-rank:

-- Coarse search on bit, then re-rank with halfvec
WITH coarse AS (
  SELECT id FROM kb_bit ORDER BY embedding <~> $1::bit(1536) LIMIT 100
)
SELECT k.id, k.text, 1 - (k.embedding <=> $2::halfvec) AS score
FROM kb k JOIN coarse ON k.id = coarse.id
ORDER BY k.embedding <=> $2::halfvec LIMIT 10;

Production tuning checklist:

  1. Set maintenance_work_mem to 25–50% of RAM during the index build.
  2. Use max_parallel_maintenance_workers for the 0.7+ parallel build.
  3. Tune ef_search per workload: 40 for fast voice paths, 100+ for batch quality.
  4. Run pgvectorscale (StreamingDiskANN) once you cross 20M vectors.
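
For step 4, pgvectorscale exposes its StreamingDiskANN index as the diskann access method. A minimal sketch, assuming a table with a full-precision vector(1536) column (check the pgvectorscale docs for current type and option support):

-- pgvectorscale's StreamingDiskANN index
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- installs pgvector as a dependency
CREATE INDEX ON kb_full USING diskann (embedding vector_cosine_ops);  -- kb_full is illustrative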

Pitfalls

  • Forgetting halfvec: storing 1536-dim fullvec at 10M is 60GB — pointless when halfvec halves it with no measurable accuracy loss.
  • Default ef_construction: 64 is fine for testing; production deserves 128–200.
  • No vacuum: HNSW indexes bloat on update-heavy workloads. Schedule VACUUM or set autovacuum_vacuum_scale_factor low for the table (see the sketch after this list).
  • Single replica: vector workloads are CPU + RAM hungry. Read replicas help.
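
A sketch of the per-table autovacuum override mentioned above (the thresholds are illustrative starting points, not universal values):

-- Vacuum the vector table far more aggressively than the global default
ALTER TABLE kb SET (
  autovacuum_vacuum_scale_factor = 0.02,         -- vacuum after ~2% of rows change
  autovacuum_vacuum_insert_scale_factor = 0.02   -- also trigger on insert-heavy churn (PG13+)
);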

FAQ

Postgres or dedicated DB? If you already run Postgres at scale, pgvector. Otherwise it depends on workload.

halfvec or fullvec? halfvec for 99% of cases. The 0.5–1pp accuracy drop is invisible in practice.

Binary quantization? Yes if 50M+ vectors and you can afford a re-rank pass.

pgvectorscale required? Above 20M vectors, yes. Below, vanilla pgvector is enough.

Plan limits? /pricing shows per-tenant retrieval allowances.

