LLM Pre-Training Data Curation: Quality Filtering Techniques That Actually Matter
Deep dive into the data curation and quality filtering techniques that determine LLM performance — from deduplication to classifier-based filtering and data mixing strategies.
Data Quality Is the Largest Lever in LLM Performance
The AI industry spent 2024 and 2025 learning an expensive lesson: throwing more compute at bad data does not produce good models. Research from teams at Meta, Google DeepMind, and Apple consistently shows that data quality and composition have a larger impact on model capability than model size or training duration.
The Llama 3 technical report revealed that Meta's data curation pipeline filters out roughly 85% of raw web data before it enters pre-training. Apple's DataComp-LM project demonstrated that a 1.5B parameter model trained on carefully filtered data can outperform a 7B model trained on unfiltered CommonCrawl.
The Data Curation Pipeline
Stage 1: URL and Domain Filtering
The first pass removes entire domains known to produce low-quality content: spam farms, content mills, auto-generated SEO pages, and sites that are predominantly ads. This is typically done with curated blocklists combined with domain-quality classifiers.
```python
# Simplified domain quality scoring. DomainFeatures is a plausible
# reconstruction of the feature record the original snippet assumes.
from dataclasses import dataclass

@dataclass
class DomainFeatures:
    ads_to_content_ratio: float
    unique_authors: int
    avg_page_word_count: float
    external_link_quality_score: float
    is_known_spam_domain: bool

def score_domain(domain: str, features: DomainFeatures) -> float:
    """Fraction of quality signals the domain passes, in [0.0, 1.0]."""
    signals = [
        features.ads_to_content_ratio < 0.3,
        features.unique_authors > 10,
        features.avg_page_word_count > 200,
        features.external_link_quality_score > 0.5,
        not features.is_known_spam_domain,
    ]
    return sum(signals) / len(signals)
```
Stage 2: Document-Level Deduplication
Duplicate documents in training data cause models to memorize specific passages rather than learning general patterns. There are three main approaches:
- Exact dedup: Hash-based matching. Fast, but misses near-duplicates.
- MinHash LSH: Probabilistic near-duplicate detection using locality-sensitive hashing. The standard approach at most labs.
- Suffix array dedup: Identifies repeated substrings across the corpus, enabling paragraph-level deduplication.
Research from the BigScience project showed that aggressive deduplication can reduce dataset size by 30-50% while improving downstream task performance.
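As a minimal illustration of the MinHash idea (not a production LSH pipeline), the sketch below builds per-document signatures from seeded hash functions over character shingles and estimates Jaccard similarity from signature agreement. Shingle size, hash count, and all function names here are illustrative choices:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams; word-level shingles are also common at corpus scale."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def _hash(s: str, seed: int) -> int:
    """Seeded 64-bit hash derived from SHA-1 (illustrative, not tuned for speed)."""
    return int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")

def minhash_signature(doc: str, num_hashes: int = 128) -> list:
    """One minimum per seeded hash function over the document's shingle set."""
    sh = shingles(doc)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a real pipeline these signatures would be banded into an LSH index so that only candidate pairs, not all pairs, are ever compared.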
Stage 3: Quality Classification
This is where the real art lies. Quality classifiers are typically trained to distinguish between "high-quality" text (Wikipedia articles, published books, academic papers) and "low-quality" web text.
Common approaches:
- Perplexity filtering: Use a language model trained on high-quality text to score documents. Low-perplexity documents (more predictable text) are assumed to be higher quality.
- fastText classifiers: Train a binary classifier on hand-labeled quality examples. Fast inference makes this practical at web scale.
- LLM-as-judge: Use a strong LLM to rate document quality on multiple axes (coherence, informativeness, writing quality). Expensive but high precision.
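A toy version of perplexity filtering: a character-bigram model stands in for the larger n-gram language models (e.g. KenLM-style) that pipelines often use, and documents scoring above a tuned perplexity threshold would be dropped. All names and the reference text are illustrative:

```python
import math
from collections import Counter

def train_bigram_lm(reference: str):
    """Fit add-one-smoothed character-bigram probabilities on reference text
    that stands in for a 'high-quality' corpus."""
    bigrams = Counter(zip(reference, reference[1:]))
    unigrams = Counter(reference)
    vocab_size = len(set(reference)) + 1  # +1 leaves mass for unseen characters

    def log_prob(a: str, b: str) -> float:
        return math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))

    return log_prob

def perplexity(text: str, log_prob) -> float:
    """exp of the average negative log-probability per bigram."""
    n = max(len(text) - 1, 1)
    total = sum(log_prob(a, b) for a, b in zip(text, text[1:]))
    return math.exp(-total / n)
```

Text that looks like the reference corpus scores low; gibberish or boilerplate-heavy text scores high and gets filtered.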
Stage 4: Content Safety Filtering
Remove personally identifiable information (PII), hate speech, explicit content, and copyrighted material. This combines rule-based detectors (regex for SSNs, emails) with classifier-based approaches for nuanced content categories.
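A sketch of the rule-based side with a deliberately minimal, hypothetical pattern set (production detectors cover many more formats, locales, and fuzzy categories like names and addresses):

```python
import re

# Hypothetical minimal ruleset for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace each match with a typed placeholder so downstream tooling
    can still see that something was redacted."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders (rather than silent deletion) make it easy to audit how much each category fires across the corpus.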
Stage 5: Data Mixing
The final and often most impactful step: deciding what proportion of each data source to include. The training mix — the ratio of web text, books, code, academic papers, conversational data, and instruction data — fundamentally shapes model behavior.
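One way to realize a mixing ratio at sampling time is to pick a source for each draw in proportion to its weight. The source names and ratios below are invented for illustration; real mixes are tuned per model and are usually proprietary:

```python
import random

# Illustrative mix only; not any lab's published ratios.
MIX = {"web": 0.6, "code": 0.2, "books": 0.15, "academic": 0.05}

def sample_mixture(sources: dict, weights: dict, n: int, seed: int = 0) -> list:
    """Draw n documents with replacement, choosing the source of each draw
    in proportion to its mixing weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    picks = rng.choices(names, weights=probs, k=n)
    return [rng.choice(sources[name]) for name in picks]
```

In practice the same idea runs at the shard or batch level inside the data loader rather than per document.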
The DoReMi Approach
Google Research's DoReMi algorithm optimizes data mixing ratios automatically. Rather than hand-tuning proportions, DoReMi trains a small proxy model with different mixes and measures which composition produces the best downstream performance. The optimal mix is then used for the full-scale training run.
Key finding: the optimal data mix is often counterintuitive. For instance, code data improves reasoning capability even for non-coding tasks, and including a small percentage of multilingual data improves English performance on certain benchmarks.
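The core of DoReMi's reweighting can be sketched as a multiplicative-weights update on per-domain excess loss (proxy-model loss minus reference-model loss). This is a simplified single-step rendering that omits the paper's smoothing term, with illustrative names:

```python
import math

def doremi_step(weights: dict, proxy_loss: dict, reference_loss: dict,
                lr: float = 1.0) -> dict:
    """One exponentiated-gradient update: domains where the proxy model's
    loss exceeds the reference model's (positive excess loss) gain weight."""
    excess = {d: max(proxy_loss[d] - reference_loss[d], 0.0) for d in weights}
    unnorm = {d: weights[d] * math.exp(lr * excess[d]) for d in weights}
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()}
```

Averaging these weights over proxy training yields the final mix used for the full-scale run.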
Practical Takeaways for 2026
- Invest in curation before compute: A week spent improving your data pipeline often outperforms a month of additional training
- Build quality classifiers specific to your domain: Generic quality filters miss domain-specific nuances
- Monitor for data contamination: Ensure your evaluation benchmarks have not leaked into your training data
- Track data provenance: Know where every document in your training set came from for reproducibility and compliance
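Contamination monitoring is often implemented as n-gram overlap between training documents and benchmark items. The sketch below flags any shared 13-gram over whitespace tokens, a heuristic similar in spirit to published decontamination procedures; the n value and tokenization are illustrative choices:

```python
def ngram_set(tokens: list, n: int) -> set:
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, benchmark_text: str, n: int = 13) -> bool:
    """Flag a training document that shares any n-gram with a benchmark item."""
    return bool(ngram_set(train_doc.split(), n) & ngram_set(benchmark_text.split(), n))
```

At corpus scale the benchmark n-grams would be hashed into a set once and streamed against, rather than recomputed per document.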
Sources:
- https://arxiv.org/abs/2407.21783
- https://arxiv.org/abs/2305.10429
- https://huggingface.co/blog/data-is-better-together