Domain-Specific RAG: Medical, Legal, Financial Vocabularies
Domain vocabulary breaks generic embeddings. The 2026 patterns for medical, legal, and financial RAG that actually retrieve the right docs.
Why Domain RAG Is Harder
Generic embeddings (text-embedding-3-large, BGE-base) are trained on web text, so they understand "myocardial infarction" because the web uses the term. They struggle with ICD-10 codes, CPT codes, drug names, legal Latin, and financial instrument abbreviations. That vocabulary gap means relevant documents do not embed near the queries that should retrieve them.
For medical, legal, and financial RAG in 2026, addressing this gap is the difference between "demo works" and "production reliable."
Three Approaches
```mermaid
flowchart TB
    Approach[Approach] --> A1["Domain-tuned embedding model"]
    Approach --> A2["Hybrid retrieval: BM25 + dense"]
    Approach --> A3["Vocabulary expansion"]
```
Domain-Tuned Embedding
Fine-tuning an embedding model on domain text substantially improves recall.
- Medical: PubMedBERT, MedCPT, BioGPT-derived embeddings
- Legal: Casetext-trained embeddings (proprietary), Law-specific BGE variants
- Financial: FinBERT-derived embeddings, BloombergGPT-derived
The 2026 reality: open-source domain embeddings exist for medical and legal; financial domain embeddings are mostly proprietary.
Hybrid Retrieval
BM25 catches exact-match domain terms (ICD codes, drug names) that dense embeddings miss. The 2026 hybrid pattern combines:
- BM25 for keyword and code matches
- Dense embeddings for conceptual queries
- Learned sparse embeddings (e.g., SPLADE) to bridge exact-term and semantic matching
Fused via RRF, this pattern handles both code-heavy and language-heavy queries.
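The fusion step can be sketched in a few lines. This is a minimal illustration, assuming each retriever returns a ranked list of doc IDs; the doc IDs are invented, and k=60 is the commonly used RRF constant, not a value from any specific library:

```python
# Reciprocal Rank Fusion (RRF): each list contributes 1 / (k + rank) per doc,
# so a doc ranked well by either retriever rises in the fused ranking.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-ID lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["icd_I21", "guideline_7", "note_42"]   # exact code match wins BM25
dense_hits = ["guideline_7", "review_3", "icd_I21"]  # conceptual match wins dense
fused = rrf_fuse([bm25_hits, dense_hits])
```

Note that `guideline_7` wins the fused ranking because both retrievers rank it highly, even though neither ranks it first; that stability across query styles is the point of RRF.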
Vocabulary Expansion
Expand the user's query before retrieval to include synonyms and codes:
- "heart attack" → "heart attack | myocardial infarction | MI | I21"
- "fired" → "fired | terminated | discharged | severance"
- "loan" → "loan | debt | borrowing | credit"
An LLM generates the expansions; the retriever then queries the expanded form.
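A sketch of the expansion step, with a hand-built synonym table standing in for the LLM call (the table entries mirror the examples above; a production system would generate them dynamically):

```python
# Vocabulary-expansion sketch: rewrite the query to include synonyms and
# codes before it hits the retriever. SYNONYMS stands in for an LLM call.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "MI", "I21"],
    "fired": ["terminated", "discharged", "severance"],
    "loan": ["debt", "borrowing", "credit"],
}

def expand_query(query: str) -> str:
    terms = [query]
    for phrase, syns in SYNONYMS.items():
        if phrase in query.lower():
            terms.extend(syns)
    return " | ".join(terms)

expand_query("heart attack treatment")
# → "heart attack treatment | myocardial infarction | MI | I21"
```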
Domain-Specific Patterns
Medical RAG
- Index ICD-10 / CPT / SNOMED codes alongside text
- Use medical-tuned embeddings (MedCPT, PubMedBERT)
- Respect HIPAA: PHI in prompts must follow BAA paths
- Prefer cited sources (clinical guidelines, peer-reviewed)
- Date-aware: medical knowledge evolves; old answers may be wrong
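The first point above, indexing codes alongside text, can be sketched as a structured field next to each chunk. The field names and documents are illustrative:

```python
# Store structured codes alongside chunk text so a code-only query ("I21")
# matches exactly even when the prose never spells the code out.
docs = [
    {"id": "g7", "text": "STEMI management guideline", "codes": {"I21", "I21.3"}},
    {"id": "n2", "text": "Routine hypertension follow-up", "codes": {"I10"}},
]

def code_search(query: str):
    tokens = set(query.upper().split())   # ICD codes are case-normalized
    return [d["id"] for d in docs if d["codes"] & tokens]

code_search("I21")   # → ["g7"]
```

In a real system this field lives in the BM25/keyword side of the hybrid index rather than a Python list.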
Legal RAG
- Cite-aware retrieval (citations are first-class)
- Jurisdiction filtering critical (federal vs state vs Fifth Circuit)
- Date-aware (laws change; case law evolves)
- Plain-language vs technical-language modes
- Disclaimers in outputs ("not legal advice")
Financial RAG
- Time-aware (yesterday's prices vs today's)
- Entity disambiguation (Apple Inc vs Apple Records)
- Compliance-aware outputs (FINRA 2210 for investor-facing content)
- Privileged information handling
- Audit trail per query
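Entity disambiguation from the list above can be sketched as an alias table consulted before retrieval. The tickers, aliases, and context heuristic here are all invented for illustration:

```python
# Map a surface mention to a canonical entity ID before retrieval, using
# query context to break ties between entities that share a name.
ALIASES = {
    "apple": [("AAPL", "Apple Inc"), ("APRC", "Apple Records")],
}

def disambiguate(mention: str, context: str):
    candidates = ALIASES.get(mention.lower(), [])
    # Crude heuristic: prefer the candidate whose distinguishing words
    # ("Inc", "Records") appear in the surrounding query text.
    for entity_id, name in candidates:
        if any(w.lower() in context.lower() for w in name.split()[1:]):
            return entity_id
    return candidates[0][0] if candidates else None

disambiguate("Apple", "Q3 earnings for Apple Inc")  # → "AAPL"
```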
A Production Architecture
```mermaid
flowchart LR
    Q[Query] --> Domain{"Domain classifier"}
    Domain --> Med["Medical: MedCPT + ICD index"]
    Domain --> Leg["Legal: Citator + jurisdiction filter"]
    Domain --> Fin["Financial: time-aware + entity index"]
    Med --> Gen["Generate with citations"]
    Leg --> Gen
    Fin --> Gen
```
Each domain gets its own retrieval pipeline; the generation step uses domain-aware system prompts.
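The routing step can be sketched as follows. A production classifier would be a fine-tuned model; the keyword rules and pipeline labels below are stand-ins for illustration:

```python
# Domain-router sketch: classify the query, then dispatch to that
# domain's retrieval pipeline (labels mirror the flowchart above).
PIPELINES = {
    "medical": "MedCPT + ICD index",
    "legal": "Citator + jurisdiction filter",
    "financial": "time-aware + entity index",
}

def classify(query: str) -> str:
    q = query.lower()
    if any(t in q for t in ("icd", "diagnosis", "drug", "patient")):
        return "medical"
    if any(t in q for t in ("statute", "case law", "jurisdiction")):
        return "legal"
    return "financial"  # fallback; a real system needs a trained classifier

def route(query: str) -> str:
    return PIPELINES[classify(query)]

route("Which statute governs non-competes?")  # → "Citator + jurisdiction filter"
```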
Evaluation
Domain RAG eval suites must include:
- Domain-specific test questions
- Code-only queries (ICD, CPT, statute citations)
- Mixed code + natural-language queries
- Time-sensitive queries
- Edge cases the domain has known issues with
Generic RAG benchmarks (HotpotQA, NaturalQuestions) miss domain failure modes.
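A minimal harness for the suite above: score recall@k per query type so code-only regressions cannot hide inside an aggregate number. The queries and gold labels here are placeholders:

```python
# Recall@k per query type: a drop on "code_only" queries is visible even
# when the overall average looks fine.
def recall_at_k(retrieved, relevant, k=5):
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

suite = [
    {"type": "code_only", "retrieved": ["d1", "d9"], "relevant": ["d1"]},
    {"type": "mixed",     "retrieved": ["d4", "d2"], "relevant": ["d2", "d7"]},
]

by_type = {}
for case in suite:
    by_type.setdefault(case["type"], []).append(
        recall_at_k(case["retrieved"], case["relevant"]))
report = {t: sum(v) / len(v) for t, v in by_type.items()}
# report: {"code_only": 1.0, "mixed": 0.5}
```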
Cost Considerations
Domain-tuned embedding models are typically smaller than frontier text models, but require:
- Re-embedding the corpus when models are updated
- Storage for embeddings
- Compute for re-embedding
For corpora that change rarely (medical guidelines, statute law), this is a one-time cost. For high-velocity corpora (financial news), it adds up.
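A back-of-envelope sketch of those costs. Every number here is an assumption chosen for illustration (1M chunks, 500 tokens each, $0.02 per million embedding tokens, 1536-dim float32 vectors), not a quote for any real model or provider:

```python
# Re-embedding cost and vector storage, back-of-envelope.
chunks = 1_000_000
tokens_per_chunk = 500
price_per_m_tokens = 0.02        # assumed embedding API price, $/1M tokens
dims, bytes_per_dim = 1536, 4    # float32 vectors

embed_cost = chunks * tokens_per_chunk / 1e6 * price_per_m_tokens
storage_gb = chunks * dims * bytes_per_dim / 1e9

print(f"re-embed: ${embed_cost:.2f}, vector storage: {storage_gb:.2f} GB")
# → re-embed: $10.00, vector storage: 6.14 GB
```

At these assumed numbers a full re-embed is cheap in API dollars; the recurring costs are the pipeline runs and index rebuilds for high-velocity corpora.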
What Goes Wrong
- Using generic embeddings on a domain corpus and accepting poor recall
- Not handling codes (ICD, CPT, statute citations)
- Stale corpora (last year's guidelines, last quarter's regulations)
- Mixing domains (legal corpus retrieved for medical query)
- Privacy violations (PHI / PII in prompts to non-BAA providers)
Sources
- MedCPT embeddings (NCBI) — https://github.com/ncbi/MedCPT
- BioGPT (Microsoft) — https://github.com/microsoft/BioGPT
- Stanford CodeX legal AI program — https://law.stanford.edu/codex
- FinBERT — https://github.com/yya518/FinBERT
- Casetext / CoCounsel — https://casetext.com