
Vector DB Sharding Strategies for Hundreds of Millions of Vectors

Sharding patterns that hold up beyond 100M vectors. The 2026 designs for partition keys, replication, and rebalancing.

When Sharding Becomes Necessary

A single vector index works well up to roughly 50-100M vectors with HNSW, depending on dimensions and hardware. Beyond that, query latency and memory pressure force sharding. The sharding choice — by what key, with what topology, with what replication — decides the system's scale ceiling and operational behavior.
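A quick back-of-envelope shows why a single index tops out. The sketch below assumes 768-dimension float32 vectors and a flat per-vector HNSW graph overhead (`link_bytes=64` is an illustrative guess; real overhead depends on M and the number of levels):

```python
def index_memory_gb(num_vectors, dims, bytes_per_float=4, link_bytes=64):
    # Raw vector storage plus a rough per-vector HNSW graph overhead.
    # link_bytes is an assumption; actual overhead depends on M and levels.
    return num_vectors * (dims * bytes_per_float + link_bytes) / 1e9

# 100M 768-dim float32 vectors need well over 300 GB of RAM,
# before accounting for query buffers or replica copies.
print(index_memory_gb(100_000_000, 768))
```

At that size, a single node either cannot hold the index in memory or spends its budget on an enormous machine, which is when sharding starts paying for its complexity.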

This piece walks through the patterns that hold up.

The Three Sharding Strategies

flowchart TB
    Strategy[Sharding strategy] --> S1[By tenant]
    Strategy --> S2[By hash]
    Strategy --> S3[By semantic / cluster]

By Tenant

Each tenant's vectors live in their own shard. Natural for multi-tenant SaaS.

  • Pro: per-tenant isolation; predictable; easy to reason about
  • Con: hot tenants overload one shard; small tenants waste capacity
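Tenant routing reduces to a lookup table. A minimal sketch (shard names and tenants are hypothetical):

```python
# Each tenant maps to exactly one shard, so a query touches a single shard.
TENANT_SHARDS = {
    "acme": "shard-1",
    "globex": "shard-2",
    "initech": "shard-1",  # small tenants can be co-located on one shard
}

def shard_for_tenant(tenant_id: str) -> str:
    shard = TENANT_SHARDS.get(tenant_id)
    if shard is None:
        raise KeyError(f"no shard assigned for tenant {tenant_id!r}")
    return shard
```

The co-location trick in the mapping is also the mitigation for the wasted-capacity con: pack small tenants together, give hot tenants a shard of their own.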

By Hash

Hash the vector ID, shard by hash. Even distribution.

  • Pro: even load distribution
  • Con: queries must fan out to all shards (unless filterable up-front)
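Hash placement can be sketched in a few lines. The key detail is using a stable hash rather than Python's built-in `hash()`, which is salted per process and would give different placements on different nodes:

```python
import hashlib

def shard_for_id(vector_id: str, num_shards: int) -> int:
    # md5 is deterministic across processes and machines, so every
    # router computes the same placement for the same ID.
    digest = hashlib.md5(vector_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Writes route to exactly one shard; reads, lacking any semantic locality, fan out to all of them.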

By Semantic / Cluster

Cluster vectors by topic; each cluster is a shard.

  • Pro: queries hit only relevant shards (lower fan-out)
  • Con: complex to maintain; cluster boundaries drift
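Semantic routing typically means nearest-centroid assignment: cluster the corpus (k-means or similar), keep one centroid per shard, and route each vector or query to the closest centroid. A minimal sketch, assuming centroids are precomputed:

```python
import math

def nearest_centroid(vector, centroids):
    # Route to the shard whose cluster centroid is closest (L2 distance).
    # centroids: {shard_id: centroid_vector}, assumed precomputed offline.
    best_shard, best_dist = None, math.inf
    for shard_id, centroid in centroids.items():
        dist = math.dist(vector, centroid)
        if dist < best_dist:
            best_shard, best_dist = shard_id, dist
    return best_shard
```

The "cluster boundaries drift" con shows up here: as new data arrives, the offline centroids stop matching the live distribution, and vectors near boundaries land on the wrong shard until you recluster.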

For most production systems in 2026: by-tenant fits multi-tenant SaaS, by-hash fits single-tenant high-volume workloads, and by-semantic fits specialized, very large corpora.

Replication

Each shard is replicated for availability:

  • 2-3 replicas per shard typical
  • Reads served from any replica
  • Writes go to primary, propagated to replicas

The replication strategy affects consistency:

  • Synchronous: all replicas updated before write returns
  • Asynchronous: write returns; replicas catch up
  • Quorum: majority of replicas updated

For vector data, asynchronous replication is common: most vector workloads are append-heavy, and search quality tolerates a replica that lags by a few seconds.
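A quorum write sits between the two extremes. The sketch below is illustrative (`replica.put` stands in for a hypothetical replica client call); a real implementation would issue the writes in parallel and let stragglers catch up asynchronously after the quorum is reached:

```python
def quorum_write(replicas, vector_id, vector, w=2):
    # Succeed once w of the replicas acknowledge the write.
    acks = 0
    for replica in replicas:
        try:
            replica.put(vector_id, vector)
            acks += 1
        except ConnectionError:
            continue  # failed replica gets repaired / caught up later
    return acks >= w
```

With three replicas and w=2, one replica can be down without blocking writes, while reads from a quorum still see the latest data.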

Rebalancing

When shards become uneven:

  • Add a new shard
  • Move vectors from overloaded shards to the new one
  • Update routing

Rebalancing is operationally painful. Patterns that minimize it:

  • Predict shard growth and pre-shard
  • Use consistent hashing so adds move minimal data
  • Schedule rebalancing during low-traffic windows
  • Some vector DBs (Milvus) support online rebalancing
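The consistent-hashing point deserves a concrete sketch. With naive `hash % num_shards`, adding a shard remaps almost every key; with a consistent-hash ring, adding a shard moves only the keys that land between the new shard's virtual nodes and their predecessors, roughly 1/(n+1) of the data. A minimal illustrative implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted (point, shard) pairs
        for shard in shards:
            self.add_shard(shard)

    @staticmethod
    def _point(key):
        # Stable 64-bit position on the ring; md5 is deterministic
        # across processes, unlike Python's salted built-in hash().
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_shard(self, shard):
        # vnodes spread each shard around the ring for smoother balance.
        for i in range(self._vnodes):
            self._ring.append((self._point(f"{shard}#{i}"), shard))
        self._ring.sort()

    def shard_for(self, vector_id):
        # First ring point clockwise from the key's position (wrapping).
        idx = bisect.bisect(self._ring, (self._point(vector_id), "")) % len(self._ring)
        return self._ring[idx][1]
```

Adding a fourth shard to a three-shard ring should reassign roughly a quarter of the keys and leave the rest exactly where they were, which is what makes incremental growth tolerable.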

Routing

Where do queries go?

flowchart LR
    Q[Query] --> R[Router]
    R -->|filter shards| S1[Shard 1]
    R -->|fan out| S2[Shard 2]
    R -->|by tenant| S3[Shard 3]
    S1 --> Merge[Merge results]
    S2 --> Merge
    S3 --> Merge
    Merge --> Top[Top K]

The router decides which shards to query. Smart routing saves cost and latency.

Cross-Shard Top-K

Top-K across shards: each shard returns its local top-K, the router merges, and the global top-K comes out. The math:

  • Each shard returns its local top-K
  • The router merge-sorts the K × num_shards candidates
  • The global top-K is returned to the client

For scores to be comparable across shards, every shard must use the same scoring: the same embedding model and the same distance metric.
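The merge step itself is small. A sketch assuming each shard returns `(score, vector_id)` pairs where a higher score is better:

```python
import heapq
from itertools import chain

def merge_topk(shard_results, k):
    # shard_results: one list of (score, vector_id) per shard,
    # each already trimmed to that shard's local top-k.
    # nlargest over k * num_shards candidates yields the global top-k.
    return heapq.nlargest(k, chain.from_iterable(shard_results))
```

Note that each shard must return its full local top-K, not top-K/num_shards: in the worst case all K global winners live on one shard.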

What Goes Wrong

  • Hot shard — one tenant or one hash bucket dominates
  • Skewed distribution — some shards 90 percent full, others 10 percent
  • Query fan-out at every request — saturates the network
  • Stale replicas serving wrong data
  • Rebalancing during peak — performance degrades

Operational Realities

For very large vector deployments (1B+ vectors):

  • Monitor per-shard query rate, latency, memory
  • Alert on imbalance
  • Have a rebalance runbook
  • Test failover regularly

Specific Vendor Patterns

  • pgvector: Citus / Postgres sharding extensions
  • Qdrant: native sharding with replication
  • Milvus: distributed by design; multiple sharding modes
  • Weaviate: multi-tenant with shard per tenant
  • Pinecone: managed sharding (the user does not see it)

When Not to Shard

  • Workloads under 50M vectors fit in one HNSW index — do not over-engineer
  • Workloads with very low QPS may be fine on a single node
  • Latency-sensitive workloads where fan-out cost is unacceptable
