---
title: "pgvector at Scale in 2026: HNSW Tuning + Binary Quantization"
description: "pgvector 0.8 with binary quantization cuts HNSW build time ~150x and hits 471 QPS at 99% recall on 50M vectors. Here is the production tuning guide for Postgres-shop teams."
canonical: https://callsphere.ai/blog/vw6g-pgvector-hnsw-quantization-scale-2026
category: "AI Engineering"
tags: ["pgvector", "Postgres", "HNSW", "Quantization", "RAG"]
author: "CallSphere Team"
published: 2026-04-16T00:00:00.000Z
updated: 2026-05-07T16:46:12.597Z
---

# pgvector at Scale in 2026: HNSW Tuning + Binary Quantization

> **TL;DR** — pgvector 0.8 (early 2026) ships parallel HNSW build, binary quantization, and halfvec scalar quantization. On dbpedia-1M, HNSW builds ran ~150x faster than 0.5 and throughput at 99% recall improved ~30x over IVFFlat. With pgvectorscale's StreamingDiskANN, 50M vectors hit 471 QPS at 99% recall, competitive with Pinecone at 75% lower cost.

## The technique

pgvector is a Postgres extension that adds a `vector` type, distance operators (`<=>` cosine, `<->` L2, `<#>` negative inner product), and two index types: IVFFlat and HNSW. HNSW dominates production workloads in 2026 because it delivers higher recall at a given latency, at the cost of longer build time and more memory, both of which 0.7+ has aggressively addressed.
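
For orientation, each operator returns a distance, so an ascending `ORDER BY` finds nearest neighbors (the `items` table here is hypothetical):

```sql
SELECT id FROM items ORDER BY embedding <=> '[1,2,3]' LIMIT 5;  -- cosine distance
SELECT id FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 5;  -- L2 (Euclidean)
SELECT id FROM items ORDER BY embedding <#> '[1,2,3]' LIMIT 5;  -- negative inner product
```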

```mermaid
flowchart LR
  E[Embeddings] --> T{Type}
  T -->|fullvec 32-bit| F[vector type]
  T -->|halfvec 16-bit| H[halfvec type]
  T -->|binary| B[bit type]
  F --> I[HNSW index]
  H --> I
  B --> I
  I --> Q[Query]
  Q --> R[Top-K]
```

## How it works

HNSW builds a multi-layer skip-list-of-graphs. Top layers are sparse (long jumps); the bottom layer is the full graph. Search starts at the top and greedily descends. Three key knobs, shown in the sketch after this list:

- `m`: max neighbors per node (default 16). Higher = better recall, more memory.
- `ef_construction`: candidate-list size at build time (default 64). Higher = better recall, slower build.
- `ef_search`: candidate-list size at query time (default 40). Higher = better recall, slower query.
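
The first two knobs live in the index DDL; `ef_search` is a session GUC. A minimal sketch (the `docs` table is hypothetical and the values illustrative, not recommendations):

```sql
-- Build-time knobs go in the CREATE INDEX options.
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops)
  WITH (m = 32, ef_construction = 128);

-- The query-time knob is a GUC, set per session or transaction.
SET hnsw.ef_search = 100;  -- keep this >= your LIMIT
```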

Quantization options:

- **halfvec** — 16-bit floats, ~50x faster build, negligible accuracy drop on most embedding models.
- **bit / binary** — 1 bit per dim, ~150x faster build, ~5–10% recall drop unless you re-rank the top ~100 candidates with the full-precision vectors (see the expression-index sketch below).
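
If you'd rather not maintain a separate bit table, pgvector (0.7+) supports binary quantization via an expression index, so rows keep their full-precision vectors for the re-rank pass. Shown on a hypothetical `items` table with a `vector(1536)` column:

```sql
-- Quantize to bits at index time; query with Hamming distance (<~>).
CREATE INDEX ON items USING hnsw
  ((binary_quantize(embedding)::bit(1536)) bit_hamming_ops);
```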

## CallSphere implementation

CallSphere stores Healthcare retrieval embeddings (patient summaries, insurance plan text, provider directories) in pgvector inside the same 115-table Postgres that runs the rest of the platform. One database, one transactional consistency story. We use:

- **halfvec** + HNSW for the patient-summary index (5M vectors, dense semantic queries)
- **bit + re-rank with halfvec** for the provider directory (10M+ rows, exact-match dominant)
- **fullvec** + HNSW for low-volume but high-precision indexes like billing codes

The platform spans 37 agents, 90+ tools, 115+ DB tables, and 6 verticals; plans run **$149/$499/$1499** with a [14-day trial](/trial) and a [22% affiliate program](/affiliate). Healthcare retrieval lives at the heart of it; see [/pricing](/pricing) for plan-level retrieval limits.

## Build steps with code

```sql
-- 1. Install + extension
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Create table with halfvec for compression
CREATE TABLE kb (
  id BIGSERIAL PRIMARY KEY,
  text TEXT,
  embedding halfvec(1536)
);

-- 3. Insert
INSERT INTO kb (text, embedding) VALUES ($1, $2);

-- 4. Build HNSW with parallel workers
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 8;
CREATE INDEX ON kb USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- 5. Query
SET hnsw.ef_search = 80;
SELECT id, text, 1 - (embedding <=> $1::halfvec) AS score
FROM kb ORDER BY embedding <=> $1::halfvec LIMIT 10;
```

For binary quantization with re-rank:

```sql
-- Coarse search on the bit index (Hamming distance), then re-rank with halfvec
WITH coarse AS (
  SELECT id FROM kb_bit ORDER BY embedding <~> $1::bit(1536) LIMIT 100
)
SELECT k.id, k.text, 1 - (k.embedding <=> $2::halfvec) AS score
FROM kb k JOIN coarse ON k.id = coarse.id
ORDER BY k.embedding <=> $2::halfvec LIMIT 10;
```
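
If you'd rather not binarize the query vector client-side, a single-parameter variant can derive the coarse key server-side. A sketch, assuming `binary_quantize()` and the halfvec-to-vector cast from pgvector 0.7+:

```sql
-- $1 is the query embedding, passed once as halfvec.
WITH coarse AS (
  SELECT id FROM kb_bit
  ORDER BY embedding <~> binary_quantize($1::halfvec::vector)::bit(1536)
  LIMIT 100
)
SELECT k.id, k.text, 1 - (k.embedding <=> $1::halfvec) AS score
FROM kb k JOIN coarse ON k.id = coarse.id
ORDER BY k.embedding <=> $1::halfvec LIMIT 10;
```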

1. Set `maintenance_work_mem` to 25–50% of RAM during index builds.
2. Use `max_parallel_maintenance_workers` to exploit the parallel HNSW build added in 0.7.
3. Tune `ef_search` per workload: ~40 for latency-sensitive voice, 100+ for batch quality.
4. Add `pgvectorscale` (StreamingDiskANN) once you cross 20M vectors, as sketched below.
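
A minimal pgvectorscale sketch, assuming the extension is installed and a full-precision `vector` column (the `kb_full` table is hypothetical; `diskann` is documented for `vector` columns, so check halfvec support in your version):

```sql
-- StreamingDiskANN ships in the vectorscale extension; CASCADE pulls in pgvector.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

CREATE INDEX ON kb_full USING diskann (embedding vector_cosine_ops);
```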

## Pitfalls

- **Forgetting halfvec**: 10M full-precision 1536-dim vectors is ~60 GB of embeddings alone (4 bytes/dim); halfvec halves that with no measurable accuracy loss.
- **Default ef_construction**: 64 is fine for testing; production deserves 128–200.
- **No vacuum**: HNSW indexes bloat on update-heavy workloads. Schedule `VACUUM` or set `autovacuum_vacuum_scale_factor` low for the table, as in the snippet after this list.
- **Single replica**: vector workloads are CPU + RAM hungry. Read replicas help.
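
A hedged example of the autovacuum tweak (the threshold is illustrative, not a recommendation):

```sql
-- Vacuum kb after ~2% of rows change instead of the 20% default.
ALTER TABLE kb SET (autovacuum_vacuum_scale_factor = 0.02);
```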

## FAQ

**Postgres or dedicated DB?** If you already run Postgres at scale, pgvector. Otherwise it depends on workload.

**halfvec or fullvec?** halfvec for 99% of cases. The 0.5–1pp accuracy drop is invisible in practice.

**Binary quantization?** Yes if 50M+ vectors and you can afford a re-rank pass.

**pgvectorscale required?** Above 20M vectors, yes. Below, vanilla pgvector is enough.

**Plan limits?** [/pricing](/pricing) shows per-tenant retrieval allowances.

## Sources

- [pgvector GitHub](https://github.com/pgvector/pgvector)
- [pgvector performance benchmark - Instaclustr](https://www.instaclustr.com/education/vector-database/pgvector-performance-benchmark-results-and-5-ways-to-boost-performance/)
- [pgvector: A Guide for DBA Part 2 - DBI Services](https://www.dbi-services.com/blog/pgvector-a-guide-for-dba-part-2-indexes-update-march-2026/)
- [Pgvector vs Qdrant - TigerData](https://www.tigerdata.com/blog/pgvector-vs-qdrant)

