Filter Bubble Bursts: When Common Crawl Beats Clean Data

Cleaning is comforting.

Every serious AI team has some version of the same ritual. Remove spam. Remove repetition. Remove pages that fail language detection. Remove low-quality pages. Remove documents that look too weird, too short, too duplicated, too uneducational, too internet. Then hope the model learns from the respectable leftovers.

That instinct is not foolish. In small or compute-constrained training runs, filtering often helps. The expensive mistake is treating that local truth as a permanent law.

Christopher Mohri, John Duchi, and Tatsunori Hashimoto’s paper A Bitter Lesson for Data Filtering asks a deliberately uncomfortable question: what if, at sufficiently high compute, the best Common Crawl filter is no filter at all?¹ Not “less filtering.” Not “a better quality classifier.” No filter. The claim sounds like an invitation to upload the sewer into the model and call it scale. The paper is more careful than that, thankfully. Its actual result is narrower, more interesting, and more useful: data quality is not independent of model size, training length, and token scarcity.

The business translation is simple but inconvenient: dataset curation is not a hygiene checklist. It is a compute allocation strategy.

The paper tests dataset value by asking how much performance can be extracted from it

The authors do not ask whether a filtered dataset performs better under one fixed training budget. That would be the usual leaderboard version of the question, and it would mostly reward whichever dataset is best tuned for the chosen budget.

Instead, they define dataset value as best achievable performance after varying model size and training steps. In simplified form, for a dataset $D$, a training algorithm $A$, model size $M$, training steps $N$, and evaluation loss $\ell$, they care about:

$$ L^\ast(D) = \min_{M,N} \ell(A(D, M, N)). $$

This framing matters. A small filtered dataset may look excellent when compute is limited because it concentrates useful signal early. But if the same filtered dataset eventually runs out of novelty while a larger unfiltered pool still contains usable long-tail information, the winner can reverse.
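A minimal sketch of that definition, assuming losses have already been measured on a finite grid of model sizes and step counts (so the minimum is over the grid, not over all possible configurations). The dataset names and every number below are invented placeholders, not results from the paper.

```python
# Hypothetical evaluation losses keyed by (model_params, training_steps).
# All values are invented for illustration only.
losses = {
    "filtered": {
        (15e6, 1_000): 4.10, (15e6, 10_000): 3.95,
        (1e9, 1_000): 3.90, (1e9, 10_000): 3.80,
    },
    "raw_pool": {
        (15e6, 1_000): 4.40, (15e6, 10_000): 4.20,
        (1e9, 1_000): 3.95, (1e9, 10_000): 3.50,
    },
}


def dataset_value(grid):
    """Return (best_loss, best_config): the minimum loss over all (M, N) in the grid."""
    best_config = min(grid, key=grid.get)
    return grid[best_config], best_config


def loss_at_fixed_budget(grid, config):
    """The leaderboard-style view: loss at one fixed (M, N)."""
    return grid[config]


small_budget = (15e6, 1_000)
for name, grid in losses.items():
    best, config = dataset_value(grid)
    print(name, "fixed budget:", loss_at_fixed_budget(grid, small_budget),
          "| best over grid:", best, "at", config)
```

In this made-up grid the filtered dataset wins at the small fixed budget while the raw pool has the lower best-over-grid loss, which is the kind of reversal the definition is designed to expose.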

The experiments use the DCLM-Pool version of Common Crawl: 240 trillion GPT-NeoX tokens from pre-2023 Common Crawl, parsed from HTML. The practical experiments scale down from the full pool to random subsets ranging from roughly 670 million to 10 billion tokens. Models are Llama-style dense transformers ranging from 15 million to 7 billion parameters. The main metric is average negative log-likelihood across C4 English, FineWeb-Edu, and Cosmopedia, with downstream benchmark checks in the appendix.

This is not a toy classification experiment. It is still not frontier training either. That distinction will matter later.

The first result is a reversal: filtered data wins early, then raw Common Crawl catches up

The first evidence block compares a 670M-token Common Crawl pool against five filtered variants:

Filter or dataset	What it does	Share of original pool retained
English filter	Keeps documents above a fastText English-score threshold	28.2%
Repetition filter	Removes documents with excessive duplicated segments	45.3%
Stop-word filter	Requires at least two common English stop words	50.4%
RefinedWeb-style filtering	Applies the above plus similar heuristic filters	13.0%
DCLM-Baseline	Adds deduplication and quality-classifier filtering	2.1%
Pool / Common Crawl	No filtering beyond the DCLM-Pool parsing setup	100%
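To make the retention shares concrete, here is a rough sketch of how many tokens each variant would leave and how many passes over that data a given training budget implies. It assumes the shares are measured in tokens and apply to the 670M-token pool; the text does not say whether the percentages count tokens or documents, and the 10B-token training budget is a hypothetical figure chosen only for illustration.

```python
POOL_TOKENS = 670e6  # pool size mentioned in the text

# Retention shares copied from the table above.
retention = {
    "English filter": 0.282,
    "Repetition filter": 0.453,
    "Stop-word filter": 0.504,
    "RefinedWeb-style": 0.130,
    "DCLM-Baseline": 0.021,
    "Pool": 1.0,
}

TRAIN_BUDGET_TOKENS = 10e9  # hypothetical training budget, not from the paper


def retained_tokens(share, pool=POOL_TOKENS):
    """Tokens left after filtering, assuming the share is token-based."""
    return share * pool


def epochs_needed(share, budget=TRAIN_BUDGET_TOKENS, pool=POOL_TOKENS):
    """Passes over the retained data needed to consume the training budget."""
    return budget / retained_tokens(share, pool)


for name, share in retention.items():
    print(f"{name:20s} {retained_tokens(share) / 1e6:8.1f}M tokens "
          f"{epochs_needed(share):8.1f} epochs")
```

Under these assumptions, the most aggressive filter would have to repeat its data hundreds of times to use a budget that the raw pool covers in about fifteen passes, which is one way to see why a heavily filtered subset can run out of novelty.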

At the 15M model scale, the intuitive story mostly holds. The English-filtered subset reaches a better average NLL than the raw pool. The raw pool is not the hero yet; it is just the messy warehouse at the back of the building.

Then the plot turns. At 330M parameters, the raw pool reaches the best average NLL among the tested datasets. At 1B parameters, the raw pool improves further. The authors report a best average NLL of 3.37 for the raw pool on the 1B model, versus 3.46 for the stop-word filter, 3.58 for the repetition filter, 3.59 for the English filter, 3.93 for RefinedWeb, and 5.29 for DCLM-Baseline.

The important part is not just that raw Common Crawl eventually wins. It is how it wins.

The raw pool starts as one of the weaker options in low-compute regimes. It requires both a larger model and enough training steps. On the compute-performance Pareto frontier, the authors show that as compute rises, the raw pool moves from worst to best. Some filters never earn a place on the frontier at all. The repetition filter, for example, is dominated: at each compute level, at least two other datasets perform better.
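The frontier logic can be sketched as follows, assuming each dataset's best loss is already known at a handful of compute levels. The compute levels and losses here are invented to mimic the qualitative pattern described above; they are not the paper's measurements.

```python
# Hypothetical best loss per dataset at increasing compute levels (FLOPs).
compute_levels = [1e17, 1e18, 1e19, 1e20]
best_loss = {
    "english":    [4.20, 3.90, 3.70, 3.62],
    "repetition": [4.30, 4.00, 3.75, 3.60],
    "stop_word":  [4.25, 3.95, 3.65, 3.50],
    "raw_pool":   [4.50, 4.05, 3.68, 3.40],
}


def frontier(best_loss, n_levels):
    """Dataset with the lowest loss at each compute level."""
    return [min(best_loss, key=lambda d: best_loss[d][i]) for i in range(n_levels)]


def rank_at(best_loss, dataset, level):
    """Rank of a dataset at one compute level (1 = best)."""
    own = best_loss[dataset][level]
    return 1 + sum(best_loss[d][level] < own for d in best_loss)


def never_on_frontier(best_loss, n_levels):
    """Datasets that are not the best choice at any compute level."""
    winners = set(frontier(best_loss, n_levels))
    return sorted(set(best_loss) - winners)


n = len(compute_levels)
print(frontier(best_loss, n))
print(never_on_frontier(best_loss, n))
print([rank_at(best_loss, "raw_pool", i) for i in range(n)])
```

With these placeholder values the raw pool ranks last at the lowest compute level and first at the highest, while the repetition filter always has at least two datasets ahead of it.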

This is the first business-relevant reversal. Filtering is not “good” or “bad.” Filtering changes the shape of the compute curve. It can buy early efficiency while selling away late-stage capacity.

That is a portfolio trade, not a moral virtue.

The junk-data tests are robustness probes, not permission slips

After the filtering result, the authors stress-test a more aggressive claim: maybe models are not just tolerant of ordinary Common Crawl mess, but robust to intentionally degraded data.

They inject two types of low-quality data into the 670M-token pool:

Test	Likely purpose	What it supports	What it does not prove
Random strings	Robustness / sensitivity test against data with almost no semantic signal	Large models can learn to discount clearly alien distributions, and the performance gap closes with scale	Random text is not the same as persuasive false information, poisoned data, or malicious synthetic content
Shuffled-word documents	Exploratory robustness test where word order is destroyed but unigram and co-occurrence signals remain	Large models can extract useful signal from degraded documents when residual associations remain	It does not show that word order is irrelevant, or that shuffled data is generally better than normal text

The random-string test uses a vocabulary of 10,000 artificial words made from 3–8 lowercase characters. The documents are meaningless but visually text-like. The shuffled-word test takes additional Common Crawl documents and randomizes word order inside each document. This preserves which words appear together in a document while destroying syntax.
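A rough sketch of how both kinds of junk could be produced. The vocabulary size and word lengths follow the description above; the uniform sampling, whitespace tokenization, function names, and document lengths are assumptions made for illustration, not details confirmed by the paper.

```python
import random
import string


def make_vocab(size=10_000, min_len=3, max_len=8, seed=0):
    """Build `size` distinct artificial words of min_len..max_len lowercase letters."""
    rng = random.Random(seed)
    vocab = set()
    while len(vocab) < size:
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        vocab.add("".join(rng.choices(string.ascii_lowercase, k=length)))
    return sorted(vocab)


def random_string_document(vocab, n_words, rng):
    """A text-like but meaningless document: words drawn uniformly from the vocabulary."""
    return " ".join(rng.choice(vocab) for _ in range(n_words))


def shuffle_words(document, rng):
    """Destroy word order inside one document while keeping the same words."""
    words = document.split()
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(1)
vocab = make_vocab()
print(random_string_document(vocab, 12, rng))
print(shuffle_words("Paris is the capital of France", rng))
```

The shuffled output still contains both “Paris” and “France”, which is the residual co-occurrence signal the test is probing.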

The results are odd in the useful way.

Small models suffer. The 15M model shows clear separation when junk is added. But as model size grows, the gap narrows. For random strings, the large models approach pool performance; at 1B parameters, the +20% random condition slightly beats the raw pool in average NLL, though the authors treat this carefully as possible regularization or accidental similarity to natural text.

The shuffled-word condition is more striking. At 330M parameters, all shuffled datasets except the +800% condition surpass the raw pool after roughly 11B training tokens. At 1B parameters, +400% shuffled reaches an average NLL of 3.36, slightly better than the raw pool at 3.40.

This is where readers are likely to overreact. The headline is not “syntax is useless.” The more precise interpretation is that broken documents can still carry usable statistical structure. If a document contains “France” and “Paris,” even after word order is scrambled, the model may still learn that the two belong near each other in semantic space. A larger model can apparently separate the degraded distribution from normal text and still harvest some of the remaining signal.

A smaller model cannot afford that luxury. It gets confused. Like many interns, it lacks the capacity to ignore what should be ignored.

The downstream benchmark appendix is useful here, but mainly as supporting evidence rather than a second thesis. The authors report that PIQA, ARC-Easy, and SocialIQA trends are broadly consistent but noisier than validation loss. That is exactly what one should expect at this scale: small benchmarks have fewer samples and more variance. For business interpretation, the appendix should reduce the fear that the loss curves are pure metric theater, but it should not be treated as a decisive deployment-performance claim.

Scaling the pool moves the no-filter crossover far to the right

A 670M-token pool is not the internet. The authors know this, so the next step scales pool size and asks whether the same reversal might hold for the full 240T-token DCLM-Pool.

Here the comparison narrows to raw Common Crawl versus RefinedWeb-style filtering. The authors vary pool size, model size, and training steps, then define a crossing point: the minimum training step count where the raw pool beats the best RefinedWeb result.
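The crossing-point definition can be written down directly. This is a sketch under the assumption that loss curves are available as lists of `(training_steps, loss)` pairs for one model size and pool size; the curves, the tokens-per-step figure, and the helper names are hypothetical.

```python
def crossing_point(raw_curve, filtered_curve):
    """Smallest step count at which the raw pool beats the filtered dataset's best loss.

    Each curve is a list of (training_steps, loss) pairs. Returns None when the
    raw pool never beats the best filtered result within the measured range.
    """
    filtered_best = min(loss for _, loss in filtered_curve)
    for steps, loss in sorted(raw_curve):
        if loss < filtered_best:
            return steps
    return None


def steps_to_epochs(steps, tokens_per_step, pool_tokens):
    """Convert a step count into passes over the pool."""
    return steps * tokens_per_step / pool_tokens


# Hypothetical curves, invented for illustration.
raw = [(1_000, 4.20), (2_000, 3.90), (4_000, 3.70), (8_000, 3.55)]
refined = [(1_000, 3.90), (2_000, 3.75), (4_000, 3.72), (8_000, 3.80)]

steps = crossing_point(raw, refined)
print("crossing at step", steps)
print("epochs over the pool:", steps_to_epochs(steps, tokens_per_step=500_000, pool_tokens=670e6))
```

Comparing against the filtered dataset's best loss, rather than its loss at the same step, matches the definition above and avoids rewarding the raw pool merely because the filtered run has started to overfit.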

The pattern is clear but expensive. For a 1B model, the crossing point grows rapidly with pool size. The authors summarize the rough epoch requirement as about one epoch for the 670M-token pool, about three epochs for the 2B-token pool, and about ten epochs for the 10B-token pool. In high-epoch regimes above 100 epochs, losses can become nonmonotone, so extrapolations there become less reliable.

Model size pushes in the opposite direction. Larger models need fewer epochs for the raw pool to win. With an 80M model, crossing points disappear for the largest tested pool sizes. With 330M, 1B, and 7B models, the crossing behavior becomes more favorable.

The authors then fit two scaling-law projections:

Projection method	Constraint used	Estimated compute for 240T-token pool to beat RefinedWeb
Epoch-constrained fit	4 epochs	$9.0 \times 10^{29}$ FLOPs
Token-per-parameter fit	600 tokens per non-embedding parameter	$3.6 \times 10^{30}$ FLOPs

Both land near $10^{30}$ FLOPs. That number is large. The authors compare it with an estimate of roughly $5 \times 10^{26}$ FLOPs for frontier pretraining compute cited in the paper, and with forecasts of $10^{29}$ FLOP training runs by 2030.
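A quick arithmetic check puts those figures on a common scale. The numbers are the ones quoted above; only the variable names are introduced here.

```python
# Projected compute for the 240T-token pool to beat RefinedWeb (from the table above).
projections = {
    "epoch_constrained_fit": 9.0e29,
    "token_per_parameter_fit": 3.6e30,
}
FRONTIER_ESTIMATE_FLOPS = 5e26  # frontier pretraining estimate cited in the paper
FORECAST_2030_FLOPS = 1e29      # forecast training-run size by 2030

for name, flops in projections.items():
    vs_frontier = flops / FRONTIER_ESTIMATE_FLOPS
    vs_forecast = flops / FORECAST_2030_FLOPS
    print(f"{name}: {vs_frontier:,.0f}x the frontier estimate, "
          f"{vs_forecast:.0f}x the 2030 forecast")
```

The two projections come out at roughly 1,800 and 7,200 times the cited frontier estimate, and 9 and 36 times the 2030 forecast, so the projected crossover sits well beyond both reference points.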

This section is the practical center of the article. The no-filter result is not saying that every company should dump raw Common Crawl into next quarter’s model run. It says the optimal amount of filtering shifts with compute, capacity, and token availability. The bigger and longer the training run, the more expensive it becomes to throw away weak-but-real signal.

For most companies, the message is not “stop filtering.” It is “stop assuming your filtering policy is scale-invariant.”

The paper’s edge cases explain why wrong-but-plausible data is more dangerous than ugly data

The authors do not claim that all bad data is harmless. Their distinction is important:

Covariate shift: the input distribution looks different, but the conditional facts or labels are not actively wrong.
Conditional shift or wrong labels: the training data gives the model a false association for the target task.

Random strings and shuffled documents are mostly covariate-shift tests. They look strange. They may waste capacity. But they do not necessarily teach the model that Paris is in Denmark.

Wrong factual content is different. The authors give the simple example: if the model sees enough instances of “The capital of France is Copenhagen,” it may learn the wrong thing. This is not noise in the decorative sense. It is a poisonous label.

The paper includes a brief corpus analysis using MMLU-related keyword matches in Common Crawl. A GPT-5 mini classifier labels documents as supporting, refuting, related, or unrelated to selected MMLU answers. In the reported subjects, supporting documents outnumber refuting ones by at least an order of magnitude. For example, world religions shows 5.89 support and 0.00 refute on average; astronomy shows 2.03 support and 0.14 refute; medical genetics shows 2.80 support and 0.23 refute.

This is not a full factuality audit. It is a plausibility check. Its role is to support the claim that, for these selected cases, Common Crawl may contain much more weakly useful evidence than actively contradictory evidence.

A second edge case comes from the shuffled-word data. Average validation loss over full sequences improves, but when the authors evaluate only the first token or first few tokens, the shuffled condition loses its advantage. That makes sense. Before seeing any tokens, the model cannot know whether the document is shuffled. Once it sees enough context, it can infer the distribution and adapt. The paper argues this first-token weakness may not matter much for most language-model use cases, which involve more than a few tokens.
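The difference between full-sequence and first-token evaluation can be sketched with per-token losses. The function below is a generic token-weighted average, not the paper's evaluation code, and the per-token values are invented to show the shape of the effect.

```python
def mean_nll(per_token_nll, first_k=None):
    """Token-weighted average NLL over sequences.

    per_token_nll: list of sequences, each a list of per-position losses.
    first_k: if given, only the first `first_k` positions of each sequence count.
    """
    total, count = 0.0, 0
    for seq in per_token_nll:
        window = seq if first_k is None else seq[:first_k]
        total += sum(window)
        count += len(window)
    return total / count


# Hypothetical per-token losses on one validation sequence.
pool_only = [[5.0, 4.0, 3.6, 3.5, 3.5, 3.5]]
pool_plus_shuffled = [[5.6, 4.3, 3.4, 3.2, 3.2, 3.2]]

print("full sequence:", mean_nll(pool_only), mean_nll(pool_plus_shuffled))
print("first 2 tokens:", mean_nll(pool_only, first_k=2), mean_nll(pool_plus_shuffled, first_k=2))
```

In this made-up example the model trained with shuffled data is better on the full sequence but worse on the first two positions, where it has not yet seen enough context to tell which distribution it is in.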

For businesses, this is the cleanest operational lesson in the paper: ugly data and wrong data are different risk categories. A crawler artifact is not the same as a false medical claim. Treating both as generic “low quality” is how a data policy becomes confident and stupid at the same time.

The theory section gives a mechanism: capacity lets models separate tasks instead of averaging them

The theory is not the main evidence. It is a post-hoc mechanism that makes the empirical result less mystical.

The first model is low-rank matrix factorization. It imagines examples coming from multiple tasks whose inputs are orthogonal, meaning the tasks can in principle be separated. If the model rank is high enough, it can represent the tasks without interference. If the rank is too low, the tasks collide, and performance degrades.

That is a simplified mirror of the empirical story. Large models may route strange or low-quality distributions through representations that do not damage the useful ones. Small models have to compress everything into too little space, so “bad” data interferes with “good” data.
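A toy version of the rank argument, which is not the paper's model. It assumes each task has its own orthonormal input direction and its own orthonormal target direction, and that the loss is squared error weighted by how often each task appears. Under those assumptions the frequency-weighted target is already in singular-value form, so by the Eckart–Young theorem the best rank-limited linear map fits the most frequent tasks exactly and fails completely on the rest. Task names and frequencies are hypothetical.

```python
def tasks_learned(task_frequency, rank):
    """Tasks fit exactly by the best rank-`rank` linear map in this toy setting.

    task_frequency maps task name -> share of training examples. With orthonormal
    inputs and targets, the singular values of the weighted target are the square
    roots of the frequencies, so truncation keeps the `rank` most frequent tasks.
    Ties between equal frequencies are broken arbitrarily.
    """
    ordered = sorted(task_frequency, key=task_frequency.get, reverse=True)
    return set(ordered[:rank])


def per_task_error(task_frequency, rank):
    """Squared error per task: 0.0 if the task is kept, 1.0 (unit target missed) if dropped."""
    learned = tasks_learned(task_frequency, rank)
    return {task: 0.0 if task in learned else 1.0 for task in task_frequency}


# Hypothetical mixture: junk outnumbers clean text four to one.
mixture = {"clean_text": 0.2, "shuffled_text": 0.8}

print("rank 1:", per_task_error(mixture, rank=1))
print("rank 2:", per_task_error(mixture, rank=2))
```

With rank 1 the toy model spends its only direction on the more frequent junk task and misses the clean one; with rank 2 it represents both without interference. Real models degrade more gradually than this all-or-nothing toy, but the direction of the effect is the same.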

The appendix adds a second theoretical condition for when filtering helps. In plain language, filtering improves prediction when it removes incorrect or irrelevant examples more effectively than it removes correct similar examples. If the original dataset already has high prevalence of the correct label, the possible gain from filtering is small. If a strong filter removes many correct similar examples, its true positive rate can collapse, and filtering can make the model worse.
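The plain-language condition can be illustrated with simple Bayes-rule accounting. This is a simplified sketch, not the paper's theorem: `keep_correct` and `keep_incorrect` are hypothetical names for the fraction of correct and incorrect examples the filter lets through, and all numbers are invented.

```python
def prevalence_after_filter(prevalence, keep_correct, keep_incorrect):
    """Share of correct examples among those that survive the filter."""
    kept_correct = prevalence * keep_correct
    kept_incorrect = (1.0 - prevalence) * keep_incorrect
    if kept_correct + kept_incorrect == 0:
        raise ValueError("filter keeps nothing")
    return kept_correct / (kept_correct + kept_incorrect)


def correct_examples_kept(n_examples, prevalence, keep_correct):
    """Number of correct examples that survive the filter."""
    return n_examples * prevalence * keep_correct


# Same hypothetical filter applied to two pools with different starting prevalence.
for prevalence in (0.50, 0.95):
    after = prevalence_after_filter(prevalence, keep_correct=0.3, keep_incorrect=0.1)
    kept = correct_examples_kept(1_000_000, prevalence, keep_correct=0.3)
    print(f"prevalence {prevalence:.2f} -> {after:.3f}, correct examples kept: {kept:,.0f}")
```

In this example the filter raises prevalence from 0.50 to 0.75 in the noisy pool, but only from 0.95 to about 0.98 in the already-clean pool, while discarding 70% of the correct examples in both cases. That is the trade the condition describes: a small purity gain can cost a large share of the useful data.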

That theory fits the business picture neatly. A filter is not valuable because it looks sophisticated. It is valuable only if it improves the signal ratio for the actual target distribution after accounting for what it deletes.

Quality classifiers, please take a seat. Preferably near the audit log.

The business implication is compute-aware curation, not curation nihilism

The lazy interpretation of this paper is “data filtering is dead.” That is wrong. The useful interpretation is more demanding: filtering must be optimized against compute, architecture, target domain, and downstream risk.

A practical decision framework looks like this:

Business situation	Likely data strategy	Reason
Small model, limited training budget	Filter aggressively, but measure what is lost	Filtering concentrates early signal and reduces waste
Medium model with repeated epochs over a small corpus	Loosen filters and test crossing points	The model may benefit from more weak signal once the curated subset saturates
Frontier-scale dense pretraining	Treat raw or lightly filtered web data as a serious baseline	Over-filtering may discard long-tail information that large models can use
High-stakes factual domain	Filter for correctness, provenance, and contradiction risk	Wrong-but-plausible content is more dangerous than strange formatting
Synthetic-data-heavy pipeline	Separate synthetic utility from synthetic drift	Synthetic data may add effective tokens, but may also shift the crossing point or create new failure modes
MoE or post-training-heavy architecture	Re-test; do not inherit dense-pretraining conclusions blindly	The paper’s evidence is for dense transformer pretraining, not every training stack

This table is deliberately less exciting than the headline. Good. Excitement is cheap; H200 hours are not.

For AI infrastructure teams, the paper suggests that data policy should be run like an experiment portfolio. A curation policy should report not only “quality score” but also:

token-retention rate;
performance at multiple model sizes;
performance after multiple epoch counts;
validation loss on both broad and target distributions;
downstream benchmark sensitivity;
factual contradiction rate, not just stylistic cleanliness;
marginal compute cost of keeping versus removing each data family.
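One way to keep such a report honest is to store it as a structured record rather than a single score. The sketch below covers only a few of the items listed above; the class, field names, and example values are hypothetical and would need adapting to a real pipeline.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CurationReport:
    """Minimal record of what a filter costs and where it helps (hypothetical schema)."""
    filter_name: str
    tokens_before: int
    tokens_after: int
    # Validation loss with the filter applied, keyed by model size label.
    loss_by_model_size: Dict[str, float] = field(default_factory=dict)
    # Validation loss of the unfiltered baseline at the same sizes.
    baseline_loss_by_model_size: Dict[str, float] = field(default_factory=dict)
    # Share of sampled documents contradicting reference facts, if measured.
    contradiction_rate: Optional[float] = None

    @property
    def token_retention_rate(self) -> float:
        return self.tokens_after / self.tokens_before

    def sizes_where_filter_hurts(self) -> List[str]:
        """Model sizes at which the filtered data has higher loss than the baseline."""
        return [
            size for size, loss in self.loss_by_model_size.items()
            if size in self.baseline_loss_by_model_size
            and loss > self.baseline_loss_by_model_size[size]
        ]


# Invented example values.
report = CurationReport(
    filter_name="example_quality_filter",
    tokens_before=670_000_000,
    tokens_after=87_100_000,
    loss_by_model_size={"15M": 4.10, "1B": 3.90},
    baseline_loss_by_model_size={"15M": 4.30, "1B": 3.40},
)
print(report.token_retention_rate, report.sizes_where_filter_hurts())
```

Recording the baseline next to the filtered result at every model size makes a scale reversal visible in the report itself instead of leaving it to be discovered in production.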

A filter that improves a 15M diagnostic run may still damage a 1B or 7B production run. A filter that improves average benchmark score may still delete domain-specific tail content. A filter that removes ugly pages may also remove rare but valuable technical documents. This is the boring accounting layer beneath the glamorous “data quality” slogan.

The boundaries are narrow enough to matter

The paper’s strongest evidence comes from dense transformer pretraining on scaled Common Crawl subsets, evaluated primarily by negative log-likelihood across broad validation datasets. That is a meaningful setting, but not a universal one.

Several boundaries should remain attached to the result:

Architecture boundary: the experiments do not establish the same conclusion for Mixture-of-Experts models, retrieval-augmented systems, or unusual training curricula.
Stage boundary: the findings concern pretraining, not post-training, instruction tuning, alignment, or application-specific fine-tuning.
Data-era boundary: the DCLM-Pool data is pre-2023. Future Common Crawl snapshots may contain more AI-generated content, and the paper explicitly treats that as uncertain.
Metric boundary: validation loss is smooth and informative, but it is not the same as product reliability, factual accuracy, safety, or task-specific ROI.
Scale boundary: the full-pool result is a projection. The 240T-token crossing near $10^{30}$ FLOPs is supported by scaling laws, not directly demonstrated by training at that scale.

These limitations do not weaken the paper’s main insight. They prevent it from turning into a slogan, which is where insights usually go to retire.

Conclusion: data quality is a scaling variable

The most useful thing about A Bitter Lesson for Data Filtering is not that it attacks filtering. It attacks fixed thinking about filtering.

At low compute, filtering can be rational. At high compute, aggressive filtering can become a tax on long-tail learning. Large models can sometimes tolerate or even exploit data that small models cannot handle. Shuffled text can still carry co-occurrence signal. Random strings can be discounted. But false, plausible, target-relevant content remains dangerous.

So the new operating rule is not “never clean.” It is this:

Clean data only after asking what compute regime, model capacity, and target risk make the cleaning valuable.

That is less satisfying than a universal policy. It also has the minor advantage of being true.

Cognaptus: Automate the Present, Incubate the Future.

¹ Christopher Mohri, John Duchi, and Tatsunori Hashimoto, “A Bitter Lesson for Data Filtering,” arXiv:2605.19407, 2026.
