Vector DB Sharding Strategies for Hundreds of Millions of Vectors
Sharding patterns that hold up beyond 100M vectors. The 2026 designs for partition keys, replication, and rebalancing.
When Sharding Becomes Necessary
A single vector index works well up to roughly 50-100M vectors with HNSW, depending on dimensions and hardware. Beyond that, query latency and memory pressure force sharding. The sharding choice — by what key, with what topology, with what replication — decides the system's scale ceiling and operational behavior.
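A back-of-envelope estimate shows why. The numbers below are illustrative; real HNSW overhead depends on M, efConstruction, and the engine's metadata layout:

```python
# Rough memory footprint of a single HNSW index.
# Figures are illustrative; actual overhead varies by engine and parameters.
num_vectors = 100_000_000
dims = 768
bytes_per_float = 4

raw = num_vectors * dims * bytes_per_float   # raw float32 vectors: ~307 GB
graph = num_vectors * 16 * 2 * 4             # HNSW links, roughly, for M=16: ~13 GB

total_gb = (raw + graph) / 1e9
print(f"~{total_gb:.0f} GB before replication")  # ~320 GB on one node
```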
This piece walks through the patterns that hold up.
The Three Sharding Strategies
```mermaid
flowchart TB
    Strategy[Sharding strategy] --> S1[By tenant]
    Strategy --> S2[By hash]
    Strategy --> S3[By semantic / cluster]
```
By Tenant
Each tenant's vectors live in their own shard. Natural for multi-tenant SaaS.
- Pro: per-tenant isolation; predictable; easy to reason about
- Con: hot tenants overload one shard; small tenants waste capacity
By Hash
Hash the vector ID, shard by hash. Even distribution.
- Pro: even load distribution
- Con: queries must fan out to all shards (unless filterable up-front)
By Semantic / Cluster
Cluster vectors by topic; each cluster is a shard.
- Pro: queries hit only relevant shards (lower fan-out)
- Con: complex to maintain; cluster boundaries drift
For most production systems in 2026, by-tenant works for multi-tenant SaaS; by-hash for single-tenant high-volume; by-semantic for specialized very-large corpora.
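A minimal routing sketch contrasts the three keys. Everything here (the shard count, the centroid list, the function names) is illustrative, not any particular engine's API:

```python
import hashlib

NUM_SHARDS = 16   # illustrative; real shard counts come from topology config

def _hash_to_shard(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def shard_by_tenant(tenant_id: str) -> int:
    # One shard (or shard group) per tenant: clean isolation, but a hot
    # tenant concentrates all of its load on a single shard.
    return _hash_to_shard(tenant_id)

def shard_by_hash(vector_id: str) -> int:
    # Even distribution by construction, but every query fans out to all
    # shards unless a metadata filter narrows the candidates first.
    return _hash_to_shard(vector_id)

def shards_by_semantic(query_vec, centroids, top_n=2):
    # Route to the shards whose cluster centroids sit closest to the
    # query vector; low fan-out, but centroids drift as data changes.
    dists = sorted(
        (sum((q - c) ** 2 for q, c in zip(query_vec, centroid)), i)
        for i, centroid in enumerate(centroids)
    )
    return [i for _, i in dists[:top_n]]
```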
Replication
Each shard is replicated for availability:
- 2-3 replicas per shard typical
- Reads served from any replica
- Writes go to primary, propagated to replicas
The replication strategy affects consistency:
- Synchronous: all replicas updated before write returns
- Asynchronous: write returns; replicas catch up
- Quorum: majority of replicas updated
For vector data, asynchronous is common because vector updates are usually appends with eventual consistency tolerance.
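For reference, a sketch of the quorum variant, with `replica.apply()` as a hypothetical stand-in for the engine's internal replication transport:

```python
import concurrent.futures

def quorum_write(replicas, record, timeout=2.0):
    # Acknowledge the write once a majority of replicas have applied it.
    # `replicas` and `replica.apply` are hypothetical stand-ins for
    # whatever transport the engine actually uses.
    needed = len(replicas) // 2 + 1
    acks = 0
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.apply, record) for r in replicas]
    try:
        for done in concurrent.futures.as_completed(futures, timeout=timeout):
            if done.result():
                acks += 1
            if acks >= needed:
                break          # quorum reached; stragglers catch up async
    except concurrent.futures.TimeoutError:
        pass                   # not enough replicas acknowledged in time
    finally:
        pool.shutdown(wait=False)
    return acks >= needed
```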
Rebalancing
When shards become uneven:
- Add a new shard
- Move vectors from overloaded shards to the new one
- Update routing
Rebalancing is operationally painful. Patterns that minimize the pain:
- Predict shard growth and pre-shard
- Use consistent hashing so adding a shard moves minimal data (see the sketch after this list)
- Schedule rebalancing during low-traffic windows
- Some vector DBs (Milvus) support online rebalancing
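To make the consistent-hashing point concrete: with hash(key) % n, adding a shard remaps most keys; with a ring, only the keys adjacent to the new shard's points move. A minimal sketch (the virtual-node count is a placeholder to tune):

```python
import bisect
import hashlib

class HashRing:
    # Adding a shard to the ring only remaps keys that fall between the
    # new shard's points and their predecessors, unlike hash(key) % n,
    # which remaps most keys.
    def __init__(self, shards, vnodes=64):
        self._ring = []                       # sorted (point, shard) pairs
        self._vnodes = vnodes
        for shard in shards:
            self.add(shard)

    def _point(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, shard: str):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._point(f"{shard}:{i}"), shard))

    def shard_for(self, key: str) -> str:
        points = [p for p, _ in self._ring]   # O(n) per lookup; fine for a sketch
        idx = bisect.bisect(points, self._point(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["shard-0", "shard-1", "shard-2"])
before = {str(k): ring.shard_for(str(k)) for k in range(10_000)}
ring.add("shard-3")                           # scale out by one shard
moved = sum(before[k] != ring.shard_for(k) for k in before)
print(f"{moved / len(before):.0%} of keys moved")  # ~25%, vs ~75% with modulo
```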
Routing
Where do queries go?
```mermaid
flowchart LR
    Q[Query] --> R[Router]
    R -->|filter shards| S1[Shard 1]
    R -->|fan out| S2[Shard 2]
    R -->|by tenant| S3[Shard 3]
    S1 --> Merge[Merge results]
    S2 --> Merge
    S3 --> Merge
    Merge --> Top[Top K]
```
The router decides which shards to query. Smart routing saves cost and latency.
Cross-Shard Top-K
Top-K across shards works scatter-gather style: each shard returns its local top-K, the router merges, and the global top-K comes out. The mechanics:
- Each shard returns its local top-K candidates
- The router merge-sorts the K * num_shards candidates
- The global top-K is returned
For scores to be comparable across shards, every shard must use the same scoring: same embedding model, same distance metric.
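A scatter-gather sketch of that merge, assuming a hypothetical per-shard `search()` that returns (score, id) pairs where higher is better:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def global_top_k(shards, query_vec, k=10):
    # `shard.search` is a hypothetical per-shard ANN call returning
    # (score, doc_id) pairs, same metric on every shard.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: s.search(query_vec, k), shards))
    # k * num_shards candidates in, global top-k out.
    return heapq.nlargest(k, (hit for part in partials for hit in part))
```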
What Goes Wrong
- Hot shard — one tenant or one hash bucket dominates
- Skewed distribution — some shards 90 percent full, others 10 percent
- Query fan-out at every request — saturates the network
- Stale replicas serving wrong data
- Rebalancing during peak — performance degrades
Operational Realities
For very large vector deployments (1B+ vectors):
- Monitor per-shard query rate, latency, memory
- Alert on imbalance (see the sketch after this list)
- Have a rebalance runbook
- Test failover regularly
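As a starting point for the imbalance alert, a minimal check; the 1.5x threshold is an arbitrary placeholder to tune per deployment:

```python
def check_balance(shard_sizes: dict[str, int], max_ratio: float = 1.5):
    # Flag shards whose vector count exceeds max_ratio x the mean.
    # `shard_sizes` would come from your metrics pipeline.
    mean = sum(shard_sizes.values()) / len(shard_sizes)
    return [s for s, n in shard_sizes.items() if n > max_ratio * mean]

hot = check_balance({"s0": 40_000_000, "s1": 95_000_000, "s2": 45_000_000})
print(hot)  # ['s1'] -> candidate for rebalancing
```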
Specific Vendor Patterns
- pgvector: Citus / Postgres sharding extensions
- Qdrant: native sharding with replication
- Milvus: distributed by design; multiple sharding modes
- Weaviate: multi-tenant with shard per tenant
- Pinecone: managed sharding (the user does not see it)
When Not to Shard
- Workloads under 50M vectors fit in one HNSW index — do not over-engineer
- Workloads with very low QPS may be fine on a single node
- Latency-sensitive workloads where fan-out cost is unacceptable
Sources
- Milvus distributed architecture — https://milvus.io/docs
- Weaviate multi-tenancy — https://weaviate.io/developers/weaviate/manage-data/multi-tenancy
- Qdrant clustering — https://qdrant.tech/documentation
- "Sharding strategies for vector DBs" — https://www.pinecone.io/learn
- Citus Postgres extension — https://www.citusdata.com