By Sagar Shankaran, Founder of CallSphere
Drizzle ORM ships first-class pgvector helpers in 2026. Wire HNSW cosine search to gpt-4o-mini for sub-50ms RAG with type-safe queries — no Python, no Pinecone.
Key takeaways
TL;DR: Drizzle ORM 0.36+ provides a vector column type and a cosineDistance operator. With pgvector 0.7 HNSW indexes you can hit sub-50ms top-K retrieval on a free Neon tier, all from typed TypeScript.
You will build a documentation-search agent that ingests Markdown files, embeds chunks with text-embedding-3-small, stores them in Postgres, and answers questions with citations using gpt-4o-mini.
- drizzle-orm@^0.36, drizzle-kit@^0.27, postgres@^3.4 (or pg).
- Postgres with the pgvector extension (CREATE EXTENSION vector). Neon, Supabase, RDS all support it.
- openai@^4.70.

flowchart LR
MD[Markdown files] --> CH[Chunker]
CH --> EMB[text-embedding-3-small]
EMB --> PG[(Postgres + pgvector HNSW)]
Q[User Q] --> EMB2[embed]
EMB2 --> PG --> CTX[top-K]
CTX --> LLM[gpt-4o-mini] --> A[Answer + citations]
```ts
import { pgTable, serial, text, vector, index } from "drizzle-orm/pg-core";

export const chunks = pgTable(
  "chunks",
  {
    id: serial("id").primaryKey(),
    doc: text("doc").notNull(),
    content: text("content").notNull(),
    embedding: vector("embedding", { dimensions: 1536 }).notNull(),
  },
  (t) => ({
    // HNSW + cosine: fastest for OpenAI embeddings
    hnsw: index("hnsw_idx").using("hnsw", t.embedding.op("vector_cosine_ops")),
  }),
);
```
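Because the column is declared with 1536 dimensions, Postgres rejects any vector of a different length at insert time. A small guard can surface that mistake earlier and with a clearer message. The following is a minimal sketch; `EMBEDDING_DIMENSIONS` and `assertEmbedding` are hypothetical names, not part of Drizzle or pgvector.

```ts
// Hypothetical helper: names are illustrative, not part of Drizzle or pgvector.
// The constant must match vector("embedding", { dimensions: 1536 }) in the schema.
const EMBEDDING_DIMENSIONS = 1536;

function assertEmbedding(
  v: number[],
  expected: number = EMBEDDING_DIMENSIONS,
): number[] {
  // Catch a model/schema mismatch (for example, a 3072-dimension vector).
  if (v.length !== expected) {
    throw new Error(`Expected ${expected} dimensions, got ${v.length}`);
  }
  // Reject NaN and Infinity, which cannot be meaningful embedding values.
  for (const x of v) {
    if (!Number.isFinite(x)) {
      throw new Error("Embedding contains a non-finite value");
    }
  }
  return v;
}
```

Calling `assertEmbedding(r.data[i].embedding)` before the insert would be one place to apply it.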
```ts
import OpenAI from "openai";
import { db } from "./db";
import { chunks } from "./schema";

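const oa = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function ingest(doc: string, text: string) {
  // Split into sentence-terminated pieces of roughly 200-400 characters.
  const pieces = (text.match(/[^.!?]{200,400}[.!?]/g) ?? []).map((p) => p.trim());
  // Nothing matched: skip the API call and the insert, which both reject empty input.
  if (pieces.length === 0) return;

  // One request embeds every piece.
  const r = await oa.embeddings.create({
    model: "text-embedding-3-small",
    input: pieces,
  });

  await db.insert(chunks).values(
    pieces.map((c, i) => ({
      doc,
      content: c,
      embedding: r.data[i].embedding,
    })),
  );
}
```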
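One caveat with the regex in `ingest`: it only matches runs of 200 to 400 characters that contain no sentence terminator, so documents made of shorter sentences produce few or no chunks. A possible alternative, sketched here under the assumption that packing whole sentences up to a size limit is acceptable, is shown below. `chunkBySentence` and `maxChars` are hypothetical names, not part of the original pipeline.

```ts
// Hypothetical alternative chunker: packs whole sentences into chunks of at
// most maxChars characters, so short sentences are kept instead of dropped.
function chunkBySentence(text: string, maxChars = 400): string[] {
  // Split after sentence-ending punctuation that is followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      // A single sentence longer than maxChars becomes its own oversized chunk.
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Swapping this in would mean replacing the `text.match(...)` line with `chunkBySentence(text)`; the rest of `ingest` stays the same.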
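```ts
import { asc, cosineDistance, sql } from "drizzle-orm";
// Assumes oa, db, and chunks from the ingestion snippet are in scope.

export async function search(q: string, k = 5) {
  const [{ embedding }] = (
    await oa.embeddings.create({
      model: "text-embedding-3-small",
      input: q,
    })
  ).data;

  const distance = cosineDistance(chunks.embedding, embedding);
  // Similarity = 1 - cosine distance, returned as the score.
  const sim = sql<number>`1 - (${distance})`;

  return db
    .select({ doc: chunks.doc, content: chunks.content, score: sim.as("score") })
    .from(chunks)
    // pgvector only uses the HNSW index when ordering by the distance operator
    // itself, ascending. Ordering by 1 - distance DESC falls back to a full scan.
    .orderBy(asc(distance))
    .limit(k);
}
```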
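The `score` returned by `search` is 1 minus pgvector's cosine distance, which is the cosine similarity of the two vectors. As a reference for what that number means, here is a plain TypeScript version of the same arithmetic. `cosineSimilarity` and `cosineDist` are illustrative local functions, not Drizzle exports, and Postgres does this work server-side in the real query.

```ts
// Illustrative only: mirrors the math behind pgvector's <=> operator.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance is 1 - similarity: 0 for identical directions, 2 for opposite.
function cosineDist(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}
```

A score near 1 means the chunk points in almost the same direction as the query embedding; a score near 0 means the two are unrelated.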
```ts
export async function answer(q: string) {
const ctx = await search(q, 5);
const r = await oa.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{ role: "system", content: "Answer using ONLY the context. Cite [doc]." },
{ role: "user", content: `Context:\n${ctx.map((c) => `[${c.doc}] ${c.content}`).join("\n")}\n\nQ: ${q}` },
],
});
return { answer: r.choices[0].message.content, citations: ctx };
}
```
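The prompt assembly in `answer` can be pulled into a pure function, which makes the citation format testable without calling OpenAI. This is a sketch of one way to do that; `Hit`, `ChatMessage`, and `buildMessages` are hypothetical names introduced for illustration.

```ts
// Hypothetical types and helper, introduced only for this example.
type Hit = { doc: string; content: string };
type ChatMessage = { role: "system" | "user"; content: string };

function buildMessages(q: string, hits: Hit[]): ChatMessage[] {
  // Prefix each chunk with its source in brackets so the model can cite it.
  const context = hits.map((h) => `[${h.doc}] ${h.content}`).join("\n");
  return [
    { role: "system", content: "Answer using ONLY the context. Cite [doc]." },
    { role: "user", content: `Context:\n${context}\n\nQ: ${q}` },
  ];
}
```

The result can be passed directly as the `messages` argument of `oa.chat.completions.create`.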
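pnpm drizzle-kit generate writes the migration SQL, including a CREATE INDEX statement for hnsw_idx with USING hnsw, and pnpm drizzle-kit migrate applies it. Enable the vector extension before migrating. Regenerate and re-apply after changing the embedding dimensions.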
```sql
SET hnsw.ef_search = 100; -- higher = better recall, slower
```
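For 1M+ chunks, also review m and ef_construction at index build time. Note that m = 16 and ef_construction = 64 are pgvector's defaults, so setting them explicitly only documents the choice; raising either improves recall at the cost of build time and memory.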
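pgvector takes both build parameters in the WITH clause of CREATE INDEX. For a hand-written migration, the statement could be assembled as in the sketch below. `hnswIndexSql` and `HnswOptions` are hypothetical names, the table and column come from the schema above, and identifiers are interpolated without escaping, so this is only suitable for trusted constants.

```ts
// Hypothetical helper for building the index DDL; not a Drizzle or pgvector API.
type HnswOptions = { m: number; efConstruction: number };

function hnswIndexSql(
  table: string,
  column: string,
  { m, efConstruction }: HnswOptions,
  indexName = "hnsw_idx",
): string {
  if (!Number.isInteger(m) || !Number.isInteger(efConstruction)) {
    throw new Error("m and ef_construction must be integers");
  }
  // pgvector rejects an index where ef_construction is less than 2 * m.
  if (efConstruction < 2 * m) {
    throw new Error("ef_construction must be at least 2 * m");
  }
  return (
    `CREATE INDEX ${indexName} ON ${table} ` +
    `USING hnsw (${column} vector_cosine_ops) ` +
    `WITH (m = ${m}, ef_construction = ${efConstruction});`
  );
}
```

For example, `hnswIndexSql("chunks", "embedding", { m: 16, efConstruction: 64 })` produces the statement with the values mentioned above.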
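- Dimension mismatch: text-embedding-3-small produces 1536 dimensions and text-embedding-3-large produces 3072. The schema's dimensions value must match the model.
- vector_l2_ops vs vector_cosine_ops: OpenAI embeddings are normalized to unit length, so L2 and cosine rank results identically, but the index is only used when its operator class matches the query operator. cosineDistance emits the <=> operator, so vector_cosine_ops is the correct choice here.

CallSphere uses Drizzle + pgvector across 115+ DB tables for tool-call memory, agent transcripts, and KB RAG. The platform powers 37 agents, 90+ tools, and 6 verticals including Healthcare (FastAPI), OneRoof (Next.js 16 + React 19), Salon (NestJS 10 + Prisma), and Sales (Node.js 20 + React 18 + Vite). Pricing is $149/$499/$1,499, with a 14-day no-card trial and a 22% affiliate program.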
Drizzle vs Prisma for RAG? Drizzle has native cosineDistance and lets you keep raw SQL when needed; Prisma still requires $queryRaw for vectors.
Cost? text-embedding-3-small is $0.02/1M tokens — millions of chunks for cents.
Why HNSW over IVFFlat? Better recall at low latencies for 1M+ rows; rebuild-free incremental insert.
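Vector size limit? A pgvector vector column stores up to 16,000 dimensions, but HNSW and IVFFlat indexes on vector are limited to 2,000. The 1,536-dimension embeddings used here fit. OpenAI's largest (3,072) would need halfvec, which is indexable up to 4,000 dimensions, or a reduced dimensions setting on the embeddings request.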
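The cost answer above is simple arithmetic, and a rough estimator makes it easy to check for your own corpus. This sketch assumes the $0.02 per 1M tokens price quoted above, which may change, and an average tokens-per-chunk figure that you supply. `embeddingCostUsd` is a hypothetical helper.

```ts
// Price quoted in this article for text-embedding-3-small; check current pricing.
const USD_PER_MILLION_TOKENS = 0.02;

// Hypothetical estimator: total tokens divided by one million, times the price.
function embeddingCostUsd(chunkCount: number, avgTokensPerChunk: number): number {
  if (chunkCount < 0 || avgTokensPerChunk < 0) {
    throw new Error("Counts must be non-negative");
  }
  const totalTokens = chunkCount * avgTokensPerChunk;
  return (totalTokens / 1_000_000) * USD_PER_MILLION_TOKENS;
}
```

With 1,000,000 chunks averaging 100 tokens each, that is 100M tokens, or about $2 to embed the whole corpus once.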
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.