LLM Pre-Training Data Curation: Quality Filtering Techniques That Actually Matter
By Sagar Shankaran, Founder of CallSphere
Deep dive into the data curation and quality filtering techniques that determine LLM performance — from deduplication to classifier-based filtering and data mixing strategies.
Data Quality Is the Largest Lever in LLM Performance
The AI industry spent 2024 and 2025 learning an expensive lesson: throwing more compute at bad data does not produce good models. Research from teams at Meta, Google DeepMind, and Apple consistently shows that data quality and composition have a larger impact on model capability than model size or training duration.
The Llama 3 technical report revealed that Meta's data curation pipeline filters out roughly 85% of raw web data before it enters pre-training. The DataComp-LM (DCLM) project, a multi-institution collaboration that includes Apple researchers, reported that a 1.5B parameter model trained on carefully filtered data can outperform a 7B model trained on unfiltered CommonCrawl.
The Data Curation Pipeline
Stage 1: URL and Domain Filtering
The first pass removes entire domains known to produce low-quality content: spam farms, content mills, auto-generated SEO pages, and sites that are predominantly ads. This is typically done with curated blocklists combined with domain-quality classifiers.
flowchart LR
CORPUS[("Pre-training corpus<br/>trillions of tokens")]
FILTER["Quality filter and<br/>dedupe"]
TOK["BPE tokenizer"]
SHARD["Shard plus<br/>data parallel"]
GPU{"GPU cluster<br/>FSDP or DeepSpeed"}
CKPT[("Checkpoints<br/>every N steps")]
LOSS["Loss curve plus<br/>eval gates"]
SFT["SFT phase"]
DPO["DPO or RLHF"]
BASE([Base model])
INSTR([Instruct model])
CORPUS --> FILTER --> TOK --> SHARD --> GPU
GPU --> CKPT --> LOSS
LOSS --> BASE --> SFT --> DPO --> INSTR
style GPU fill:#4f46e5,stroke:#4338ca,color:#fff
style LOSS fill:#f59e0b,stroke:#d97706,color:#1f2937
style INSTR fill:#059669,stroke:#047857,color:#fff
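# Simplified domain quality scoring (DomainFeatures is an assumed container for the signals below)
from dataclasses import dataclass

@dataclass
class DomainFeatures:
    ads_to_content_ratio: float
    unique_authors: int
    avg_page_word_count: float
    external_link_quality_score: float
    is_known_spam_domain: bool

def score_domain(domain: str, features: DomainFeatures) -> float:
    signals = [  # each entry is a pass/fail check; thresholds are illustrative
        features.ads_to_content_ratio < 0.3,
        features.unique_authors > 10,
        features.avg_page_word_count > 200,
        features.external_link_quality_score > 0.5,
        not features.is_known_spam_domain,
    ]
    return sum(signals) / len(signals)  # fraction of checks passed, 0.0 to 1.0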
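The blocklist half of this stage can be sketched in a few lines. The example below is a minimal illustration, not a production filter: the blocklist entries and the `is_blocked` helper are hypothetical, and real pipelines load much larger curated lists.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; real pipelines load curated lists with many entries.
BLOCKED_DOMAINS = {"spam-farm.example", "content-mill.example"}


def is_blocked(url: str, blocklist: set[str] = BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host, or any parent domain of it, is blocklisted."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check "a.b.c", then "b.c", then "c" so subdomains inherit the block.
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


urls = [
    "https://news.example.org/story",
    "http://cdn.spam-farm.example/page?id=1",
    "https://content-mill.example/top-10",
]
kept = [u for u in urls if not is_blocked(u)]
```

Matching on parent domains is a design choice: it stops a blocked site from slipping through on a new subdomain, at the cost of also blocking any legitimate subdomains it hosts.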
Stage 2: Document-Level Deduplication
Duplicate documents in training data cause models to memorize specific passages rather than learning general patterns. There are three main approaches:
- Exact dedup: Hash-based matching (fast but misses near-duplicates)
- MinHash LSH: Probabilistic near-duplicate detection using locality-sensitive hashing. The standard approach used by most labs.
- Suffix array dedup: Identifies repeated substrings across the corpus, enabling paragraph-level deduplication
Research from the BigScience project showed that aggressive deduplication can reduce dataset size by 30-50% while improving downstream task performance.
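To make the MinHash LSH idea concrete, here is a small standard-library sketch. It assumes word-level 3-shingles, 64 salted hash functions, and 16 bands of 4 rows; those parameters and all function names are illustrative choices, and large-scale pipelines normally rely on an optimized library rather than code like this.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Word-level k-shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; each salt stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Documents that agree on every row of at least one band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank today",
    "b": "the quick brown fox jumps over the lazy dog near the river bank tonight",
    "c": "quarterly revenue grew while operating costs stayed flat across all regions",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
candidates = lsh_candidate_pairs(signatures)
```

Documents "a" and "b" differ by one word, so they are very likely to share a band and appear in `candidates`, while "c" shares no shingles with either. In practice candidate pairs are verified with the estimated or exact Jaccard similarity before one copy is dropped.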
Stage 3: Quality Classification
This is where the real art lies. Quality classifiers are typically trained to distinguish between "high-quality" text (Wikipedia articles, published books, academic papers) and "low-quality" web text.
Common approaches:
- Perplexity filtering: Use a language model trained on high-quality text to score documents. Low-perplexity documents (more predictable text) are assumed to be higher quality.
- fastText classifiers: Train a lightweight binary classifier on labeled quality examples, typically a trusted reference corpus as positives and random web pages as negatives. Fast inference makes this practical at web scale.
- LLM-as-judge: Use a strong LLM to rate document quality on multiple axes (coherence, informativeness, writing quality). Expensive but high precision.
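Perplexity filtering is the easiest of these to show in miniature. The sketch below swaps the reference language model for an add-one smoothed unigram model, which is a toy stand-in and not what a lab would deploy; the class name, the reference text, and the threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model; a toy stand-in for a real reference LM."""

    def __init__(self, reference_text: str):
        tokens = reference_text.lower().split()
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(self.counts) + 1  # +1 slot shared by all unseen tokens

    def log_prob(self, token: str) -> float:
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text: str) -> float:
        tokens = text.lower().split()
        if not tokens:
            return float("inf")  # nothing to score, so treat as worst case
        avg_nll = -sum(self.log_prob(t) for t in tokens) / len(tokens)
        return math.exp(avg_nll)


reference = (
    "the model learns patterns from text "
    "the model predicts the next token from context"
)
lm = UnigramLM(reference)

documents = [
    "the model predicts the next token",
    "zxq click buy now cheap zxq",
]
THRESHOLD = 15.0  # hypothetical cutoff; real thresholds are tuned per corpus
kept = [doc for doc in documents if lm.perplexity(doc) < THRESHOLD]
```

One caveat worth keeping in mind: very low perplexity can also indicate repetitive boilerplate, so some pipelines filter on both ends of the score distribution rather than keeping only the lowest-scoring documents.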
Stage 4: Content Safety Filtering
Remove personally identifiable information (PII), hate speech, explicit content, and copyrighted material. This combines rule-based detectors (regex for SSNs, emails) with classifier-based approaches for nuanced content categories.
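The rule-based side can be sketched with two regular expressions. This is a minimal illustration assuming US-style SSNs and simple email addresses; the patterns and the `redact_pii` helper are hypothetical and will miss many real-world formats.

```python
import re

# Deliberately simple patterns; production detectors cover far more formats.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> tuple[str, int]:
    """Replace matches with placeholder tags and return the number of redactions."""
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_ssn = SSN_RE.subn("[SSN]", text)
    return text, n_email + n_ssn


clean, hits = redact_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
# clean == "Contact [EMAIL], SSN [SSN]." and hits == 2
```

Redacting in place keeps the rest of the document usable; a stricter pipeline could instead drop any document whose redaction count exceeds some limit.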
Stage 5: Data Mixing
The final and often most impactful step is deciding what proportion of each data source to include. The training mix (the ratio of web text, books, code, academic papers, conversational data, and instruction data) fundamentally shapes model behavior.
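As a rough illustration of what a training mix means operationally, the sketch below turns a set of proportions into per-source token budgets and into a sampling schedule. The source names and percentages are invented for the example and are not a recommended mix.

```python
import random

# Hypothetical mix in integer percentage points; values are illustrative only.
MIX = {"web": 60, "code": 15, "books": 10, "academic": 10, "conversational": 5}


def token_budgets(mix: dict[str, int], total_tokens: int) -> dict[str, int]:
    """Split a token budget across sources in proportion to the mix weights."""
    total_weight = sum(mix.values())
    return {name: total_tokens * weight // total_weight for name, weight in mix.items()}


def sample_sources(mix: dict[str, int], n: int, seed: int = 0) -> list[str]:
    """Draw the source of each of the next n training documents according to the mix."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=n)


budgets = token_budgets(MIX, 1_000_000)
schedule = sample_sources(MIX, 10)
```

Integer weights keep the budget arithmetic exact. When a source holds fewer tokens than its budget, the mix implies repeating that source for multiple epochs, which is itself a decision to make deliberately.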
The DoReMi Approach
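DoReMi (Domain Reweighting with Minimax Optimization), developed by researchers at Google and Stanford, optimizes data mixing ratios automatically. Rather than hand-tuning proportions, it first trains a small reference model on an initial mix, then trains a small proxy model that upweights the domains where its loss most exceeds the reference model's. The resulting domain weights are then used for the full-scale training run, without tuning against downstream tasks.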
Key finding: the optimal data mix is often counterintuitive. For instance, code data improves reasoning capability even for non-coding tasks, and including a small percentage of multilingual data improves English performance on certain benchmarks.
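The DoReMi paper describes its reweighting step as an exponentiated-gradient update on per-domain excess loss, meaning the proxy model's loss minus a reference model's loss, clipped at zero. The sketch below shows one such update in isolation. The loss values, the step size `eta`, and the smoothing constant `c` are illustrative assumptions, the clipping is applied per domain rather than per token for brevity, and no model training is included.

```python
import math


def update_domain_weights(
    weights: dict[str, float],
    proxy_loss: dict[str, float],
    reference_loss: dict[str, float],
    eta: float = 1.0,
    c: float = 1e-3,
) -> dict[str, float]:
    """One exponentiated-gradient step that upweights domains with high excess loss."""
    domains = list(weights)
    # Excess loss: how much worse the proxy is than the reference, floored at zero.
    excess = {d: max(proxy_loss[d] - reference_loss[d], 0.0) for d in domains}
    scaled = {d: weights[d] * math.exp(eta * excess[d]) for d in domains}
    norm = sum(scaled.values())
    uniform = 1.0 / len(domains)
    # Renormalize, then mix in a little of the uniform distribution for stability.
    return {d: (1 - c) * scaled[d] / norm + c * uniform for d in domains}


weights = {"web": 1 / 3, "code": 1 / 3, "books": 1 / 3}
proxy_loss = {"web": 2.9, "code": 2.4, "books": 3.1}      # made-up values
reference_loss = {"web": 2.8, "code": 1.9, "books": 3.2}  # made-up values
new_weights = update_domain_weights(weights, proxy_loss, reference_loss)
```

Here "code" has the largest excess loss, so its weight grows, while "books" has none and shrinks. In the paper's procedure the weights are averaged over many such steps during proxy training before being used for the large run.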
Practical Takeaways for 2026
- Invest in curation before compute: A week spent improving your data pipeline often outperforms a month of additional training
- Build quality classifiers specific to your domain: Generic quality filters miss domain-specific nuances
- Monitor for data contamination: Ensure your evaluation benchmarks have not leaked into your training data
- Track data provenance: Know where every document in your training set came from for reproducibility and compliance
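For the contamination check, a common starting point is n-gram overlap between benchmark items and training documents. The sketch below is a minimal version of that idea: the 5-gram window, the function names, and the sample texts are assumptions for illustration, and it does not reflect any particular lab's procedure.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams in the text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_doc: str, benchmark_items: list[str], n: int = 5) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training document."""
    if not benchmark_items:
        return 0.0
    doc_grams = ngrams(train_doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & doc_grams)
    return hits / len(benchmark_items)


benchmark = [
    "What is the capital of France and why",
    "Name the largest planet in the solar system",
]
train_doc = "forum post: what is the capital of france and why do people ask"
rate = contamination_rate(train_doc, benchmark)
```

Shorter windows catch more paraphrased leaks but raise false positives on common phrases, so the window size is usually tuned against a sample of known-clean data.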
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.