
pgvector at Scale in 2026: HNSW Tuning + Binary Quantization

pgvector 0.8 with binary quantization cuts HNSW build time ~150x; paired with pgvectorscale it hits 471 QPS at 99% recall on 50M vectors. Here is the production tuning guide for Postgres-shop teams.

TL;DR — pgvector 0.8 (early 2026) ships parallel HNSW build, binary quantization, and halfvec scalar quantization. On dbpedia-1M, build time dropped ~150x vs 0.5; throughput at 99% recall improved ~30x over IVFFlat. With pgvectorscale's StreamingDiskANN, 50M vectors hit 471 QPS at 99% recall — competitive with Pinecone at 75% lower cost.

The technique

pgvector is a Postgres extension that adds a vector type, distance operators (<=> cosine, <-> L2, <#> inner product), and two index types: IVFFlat and HNSW. HNSW dominates production workloads in 2026 because it offers tighter recall guarantees at the cost of higher build time and memory, both of which releases 0.7 and later have aggressively addressed.
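
A minimal illustration of the three distance operators, using toy 3-dimensional literals (note that <#> returns the negated inner product, so smaller still means "closer"):

-- Toy vectors; in practice these are your embedding columns and query parameters
SELECT '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance,
       '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance,
       '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product;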

flowchart LR
  E[Embeddings] --> T{Type}
  T -->|fullvec 32-bit| F[vector type]
  T -->|halfvec 16-bit| H[halfvec type]
  T -->|binary| B[bit type]
  F --> I[HNSW index]
  H --> I
  B --> I
  I --> Q[Query]
  Q --> R[Top-K]

How it works

HNSW builds a multi-layer skip-list-of-graphs. Top layers are sparse (long jumps); the bottom layer is the full graph. Search starts at the top and greedily descends toward the query's nearest neighbors. Three key knobs:

  • m: max neighbors per node (default 16). Higher = better recall, more memory.
  • ef_construction: candidate list size at build (default 64). Higher = better recall, slower build.
  • ef_search: candidate list size at query time (default 40). Higher = better recall, slower query.
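
Where each knob lives, as a minimal sketch (the kb table and embedding column match the build steps later in this post):

-- m and ef_construction are index options, fixed at CREATE INDEX time
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 32, ef_construction = 128);

-- ef_search is a runtime setting, adjustable per session or per transaction
SET hnsw.ef_search = 100;        -- session-wide
SET LOCAL hnsw.ef_search = 100;  -- only for the current transaction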

Quantization options:

  • halfvec — 16-bit floats, ~50x faster build, ~negligible accuracy drop on most embeddings.
  • bit / binary — 1-bit per dim, ~150x faster build, ~5–10% recall drop unless you re-rank with the full vectors on top-100.
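
If you would rather not maintain a separate bit column, pgvector 0.7+ can also binary-quantize a full-precision column inside an expression index. A sketch, assuming an items table with a vector(1536) embedding column and a vector query parameter:

-- Index a binary-quantized view of the full vectors (Hamming distance)
CREATE INDEX ON items USING hnsw
  ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);

-- Coarse candidates from the quantized index, then re-rank the top 100 on the full vectors
WITH coarse AS (
  SELECT id FROM items
  ORDER BY binary_quantize(embedding)::bit(1536) <~> binary_quantize($1::vector)
  LIMIT 100
)
SELECT i.id FROM items i JOIN coarse USING (id)
ORDER BY i.embedding <=> $1::vector LIMIT 10;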

CallSphere implementation

CallSphere stores Healthcare retrieval embeddings (patient summaries, insurance plan text, provider directories) in pgvector inside the same 115-table Postgres that runs the rest of the platform. One database, one transactional consistency story. We use:

  • halfvec + HNSW for the patient-summary index (5M vectors, dense semantic queries)
  • bit + re-rank with halfvec for the provider directory (10M+ rows, exact-match dominant)
  • fullvec + HNSW for low-volume but high-precision indexes like billing codes
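
As DDL, those three shapes look roughly like this (table and column names are illustrative, not CallSphere's actual schema):

-- Dense semantic index: halfvec + HNSW
CREATE INDEX ON patient_summaries USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 128);

-- Coarse binary index for the high-volume directory; queries re-rank on halfvec
CREATE INDEX ON provider_directory USING hnsw (embedding_bits bit_hamming_ops);

-- Full-precision index where accuracy matters most
CREATE INDEX ON billing_codes USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 200);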

The platform spans 37 agents, 90+ tools, 115+ DB tables, and 6 verticals, with plans at $149/$499/$1499, a 14-day trial, and a 22% affiliate program. Healthcare retrieval lives at the heart of it; visit /pricing to compare plan-level retrieval limits.

Build steps with code

-- 1. Install + extension
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Create table with halfvec for compression
CREATE TABLE kb (
  id BIGSERIAL PRIMARY KEY,
  text TEXT,
  embedding halfvec(1536)
);

-- 3. Insert
INSERT INTO kb (text, embedding) VALUES ($1, $2);

-- 4. Build HNSW with parallel workers
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 8;
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- 5. Query
SET hnsw.ef_search = 80;
SELECT id, text, 1 - (embedding <=> $1::halfvec) AS score
FROM kb ORDER BY embedding <=> $1::halfvec LIMIT 10;

For binary quantization with re-rank:

-- Coarse search on bit, then re-rank with halfvec
WITH coarse AS (
  SELECT id FROM kb_bit ORDER BY embedding <~> $1::bit(1536) LIMIT 100
)
SELECT k.id, k.text, 1 - (k.embedding <=> $2::halfvec) AS score
FROM kb k JOIN coarse ON k.id = coarse.id
ORDER BY k.embedding <=> $2::halfvec LIMIT 10;

Production tuning checklist:

  1. Set maintenance_work_mem to 25–50% of RAM during the index build.
  2. Use max_parallel_maintenance_workers for the 0.7+ parallel build.
  3. Tune ef_search per workload: 40 for fast voice paths, 100+ for batch quality.
  4. Run pgvectorscale (StreamingDiskANN) once you cross 20M vectors.
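
For step 4, pgvectorscale exposes its StreamingDiskANN index as the diskann access method. A minimal sketch, assuming a table with a full-precision vector(1536) column (check the pgvectorscale docs for current type and option support):

-- pgvectorscale's StreamingDiskANN index
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;  -- installs pgvector as a dependency
CREATE INDEX ON kb_full USING diskann (embedding vector_cosine_ops);  -- kb_full is illustrative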

Pitfalls

  • Forgetting halfvec: storing 1536-dim fullvec at 10M is 60GB — pointless when halfvec halves it with no measurable accuracy loss.
  • Default ef_construction: 64 is fine for testing; production deserves 128–200.
  • No vacuum: HNSW indexes bloat on update-heavy workloads. Schedule VACUUM or set autovacuum_vacuum_scale_factor low for the table (see the sketch after this list).
  • Single replica: vector workloads are CPU + RAM hungry. Read replicas help.
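
A sketch of the per-table autovacuum override mentioned above (the thresholds are illustrative starting points, not universal values):

-- Vacuum the vector table far more aggressively than the global default
ALTER TABLE kb SET (
  autovacuum_vacuum_scale_factor = 0.02,         -- vacuum after ~2% of rows change
  autovacuum_vacuum_insert_scale_factor = 0.02   -- also trigger on insert-heavy churn (PG13+)
);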

FAQ

Postgres or dedicated DB? If you already run Postgres at scale, pgvector. Otherwise it depends on workload.

halfvec or fullvec? halfvec for 99% of cases. The 0.5–1pp accuracy drop is invisible in practice.

Binary quantization? Yes if 50M+ vectors and you can afford a re-rank pass.

pgvectorscale required? Above 20M vectors, yes. Below, vanilla pgvector is enough.

Plan limits? /pricing shows per-tenant retrieval allowances.

