
Vector DB Sharding Strategies for Hundreds of Millions of Vectors

Sharding patterns that hold up beyond 100M vectors. The 2026 designs for partition keys, replication, and rebalancing.

When Sharding Becomes Necessary

A single vector index works well up to roughly 50-100M vectors with HNSW, depending on dimensions and hardware. Beyond that, query latency and memory pressure force sharding. The sharding choice — by what key, with what topology, with what replication — decides the system's scale ceiling and operational behavior.
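A quick back-of-envelope shows why a single index tops out. The sketch below assumes 768-dimension float32 vectors and a flat per-vector HNSW graph overhead (`link_bytes=64` is an illustrative guess; real overhead depends on M and the number of levels):

```python
def index_memory_gb(num_vectors, dims, bytes_per_float=4, link_bytes=64):
    # Raw vector storage plus a rough per-vector HNSW graph overhead.
    # link_bytes is an assumption; actual overhead depends on M and levels.
    return num_vectors * (dims * bytes_per_float + link_bytes) / 1e9

# 100M 768-dim float32 vectors need well over 300 GB of RAM,
# before accounting for query buffers or replica copies.
print(index_memory_gb(100_000_000, 768))
```

At that size, a single node either cannot hold the index in memory or spends its budget on an enormous machine, which is when sharding starts paying for its complexity.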

This piece walks through the patterns that hold up.

The Three Sharding Strategies

flowchart TB
    Strategy[Sharding strategy] --> S1[By tenant]
    Strategy --> S2[By hash]
    Strategy --> S3[By semantic / cluster]

By Tenant

Each tenant's vectors live in their own shard. Natural for multi-tenant SaaS.

  • Pro: per-tenant isolation; predictable; easy to reason about
  • Con: hot tenants overload one shard; small tenants waste capacity
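Tenant routing reduces to a lookup table. A minimal sketch (shard names and tenants are hypothetical):

```python
# Each tenant maps to exactly one shard, so a query touches a single shard.
TENANT_SHARDS = {
    "acme": "shard-1",
    "globex": "shard-2",
    "initech": "shard-1",  # small tenants can be co-located on one shard
}

def shard_for_tenant(tenant_id: str) -> str:
    shard = TENANT_SHARDS.get(tenant_id)
    if shard is None:
        raise KeyError(f"no shard assigned for tenant {tenant_id!r}")
    return shard
```

The co-location trick in the mapping is also the mitigation for the wasted-capacity con: pack small tenants together, give hot tenants a shard of their own.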

By Hash

Hash the vector ID, shard by hash. Even distribution.

  • Pro: even load distribution
  • Con: queries must fan out to all shards (unless filterable up-front)
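Hash placement can be sketched in a few lines. The key detail is using a stable hash rather than Python's built-in `hash()`, which is salted per process and would give different placements on different nodes:

```python
import hashlib

def shard_for_id(vector_id: str, num_shards: int) -> int:
    # md5 is deterministic across processes and machines, so every
    # router computes the same placement for the same ID.
    digest = hashlib.md5(vector_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Writes route to exactly one shard; reads, lacking any semantic locality, fan out to all of them.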

By Semantic / Cluster

Cluster vectors by topic; each cluster is a shard.

  • Pro: queries hit only relevant shards (lower fan-out)
  • Con: complex to maintain; cluster boundaries drift
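Semantic routing typically means nearest-centroid assignment: cluster the corpus (k-means or similar), keep one centroid per shard, and route each vector or query to the closest centroid. A minimal sketch, assuming centroids are precomputed:

```python
import math

def nearest_centroid(vector, centroids):
    # Route to the shard whose cluster centroid is closest (L2 distance).
    # centroids: {shard_id: centroid_vector}, assumed precomputed offline.
    best_shard, best_dist = None, math.inf
    for shard_id, centroid in centroids.items():
        dist = math.dist(vector, centroid)
        if dist < best_dist:
            best_shard, best_dist = shard_id, dist
    return best_shard
```

The "cluster boundaries drift" con shows up here: as new data arrives, the offline centroids stop matching the live distribution, and vectors near boundaries land on the wrong shard until you recluster.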

For most production systems in 2026: by-tenant fits multi-tenant SaaS, by-hash fits single-tenant high-volume workloads, and by-semantic fits specialized, very large corpora.

Replication

Each shard is replicated for availability:

  • 2-3 replicas per shard typical
  • Reads served from any replica
  • Writes go to primary, propagated to replicas

The replication strategy affects consistency:

  • Synchronous: all replicas updated before write returns
  • Asynchronous: write returns; replicas catch up
  • Quorum: majority of replicas updated

For vector data, asynchronous replication is common: most vector workloads are append-heavy, and search quality tolerates a replica that lags by a few seconds.
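A quorum write sits between the two extremes. The sketch below is illustrative (`replica.put` stands in for a hypothetical replica client call); a real implementation would issue the writes in parallel and let stragglers catch up asynchronously after the quorum is reached:

```python
def quorum_write(replicas, vector_id, vector, w=2):
    # Succeed once w of the replicas acknowledge the write.
    acks = 0
    for replica in replicas:
        try:
            replica.put(vector_id, vector)
            acks += 1
        except ConnectionError:
            continue  # failed replica gets repaired / caught up later
    return acks >= w
```

With three replicas and w=2, one replica can be down without blocking writes, while reads from a quorum still see the latest data.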

Rebalancing

When shards become uneven:

  • Add a new shard
  • Move vectors from overloaded shards to the new one
  • Update routing

Rebalancing is operationally painful. Patterns that minimize it:

  • Predict shard growth and pre-shard
  • Use consistent hashing so adds move minimal data
  • Schedule rebalancing during low-traffic windows
  • Some vector DBs (Milvus) support online rebalancing
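The consistent-hashing point deserves a concrete sketch. With naive `hash % num_shards`, adding a shard remaps almost every key; with a consistent-hash ring, adding a shard moves only the keys that land between the new shard's virtual nodes and their predecessors, roughly 1/(n+1) of the data. A minimal illustrative implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted (point, shard) pairs
        for shard in shards:
            self.add_shard(shard)

    @staticmethod
    def _point(key):
        # Stable 64-bit position on the ring; md5 is deterministic
        # across processes, unlike Python's salted built-in hash().
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_shard(self, shard):
        # vnodes spread each shard around the ring for smoother balance.
        for i in range(self._vnodes):
            self._ring.append((self._point(f"{shard}#{i}"), shard))
        self._ring.sort()

    def shard_for(self, vector_id):
        # First ring point clockwise from the key's position (wrapping).
        idx = bisect.bisect(self._ring, (self._point(vector_id), "")) % len(self._ring)
        return self._ring[idx][1]
```

Adding a fourth shard to a three-shard ring should reassign roughly a quarter of the keys and leave the rest exactly where they were, which is what makes incremental growth tolerable.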

Routing

Where do queries go?

flowchart LR
    Q[Query] --> R[Router]
    R -->|filter shards| S1[Shard 1]
    R -->|fan out| S2[Shard 2]
    R -->|by tenant| S3[Shard 3]
    S1 --> Merge[Merge results]
    S2 --> Merge
    S3 --> Merge
    Merge --> Top[Top K]

The router decides which shards to query. Smart routing saves cost and latency.

Cross-Shard Top-K

Top-K across shards: each shard returns its local top-K, the router merges, and the global top-K comes out. The math:

  • Each shard returns its local top-K
  • The router merge-sorts the K × num_shards candidates
  • The global top-K is returned to the client

For scores to be comparable across shards, every shard must use the same scoring: the same embedding model and the same distance metric.
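The merge step itself is small. A sketch assuming each shard returns `(score, vector_id)` pairs where a higher score is better:

```python
import heapq
from itertools import chain

def merge_topk(shard_results, k):
    # shard_results: one list of (score, vector_id) per shard,
    # each already trimmed to that shard's local top-k.
    # nlargest over k * num_shards candidates yields the global top-k.
    return heapq.nlargest(k, chain.from_iterable(shard_results))
```

Note that each shard must return its full local top-K, not top-K/num_shards: in the worst case all K global winners live on one shard.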

What Goes Wrong

  • Hot shard — one tenant or one hash bucket dominates
  • Skewed distribution — some shards 90 percent full, others 10 percent
  • Query fan-out at every request — saturates the network
  • Stale replicas serving wrong data
  • Rebalancing during peak — performance degrades

Operational Realities

For very large vector deployments (1B+ vectors):

  • Monitor per-shard query rate, latency, memory
  • Alert on imbalance
  • Have a rebalance runbook
  • Test failover regularly

Specific Vendor Patterns

  • pgvector: Citus / Postgres sharding extensions
  • Qdrant: native sharding with replication
  • Milvus: distributed by design; multiple sharding modes
  • Weaviate: multi-tenant with shard per tenant
  • Pinecone: managed sharding (the user does not see it)

When Not to Shard

  • Workloads under 50M vectors fit in one HNSW index — do not over-engineer
  • Workloads with very low QPS may be fine on a single node
  • Latency-sensitive workloads where fan-out cost is unacceptable
