Bad data is not one problem. It is at least three problems wearing the same cheap trench coat.
There is bad data that appears once and disappears. There is bad data that keeps appearing, but becomes rarer as the corpus grows. And there is bad data that settles in at a stable rate, like a permanent tenant with poor hygiene and legal representation. Business discussions about AI training data often compress these into one vague category called “noise”. Convenient, yes. Informative, no.
Language Generation with Infinite Contamination by Anay Mehrotra, Grigoris Velegkas, Xifan Yu, and Felix Zhou gives that category some structure [1]. The paper studies a formal model of language generation where a learner sees an adversarial stream of examples and must eventually generate valid, unseen strings from the target language. The twist is that the stream may contain contamination: invalid examples inserted into the data, valid examples omitted from it, or both.
The paper does not test a transformer, benchmark a pretraining corpus, or produce a new cleaning recipe for web-scale data. It is learning theory, not an MLOps playbook in theorem cosplay. Its value is different: it separates the cases where contamination merely makes generation harder from the cases where it collapses the kind of coverage that businesses actually care about.
The distinction matters because a model that can produce some valid output is not the same as a model that has learned a broad, useful slice of a domain. One is a chatbot that can say safe things forever. The other is a system that can support discovery, automation, compliance, reasoning, and edge-case handling without shrinking into a decorative autocomplete fern.
The paper’s central contrast is correctness versus coverage
The setup begins with “language generation in the limit”, a formal game inherited from classical learning theory. An adversary selects a target language from a known countable collection and reveals examples over time. The generator must eventually output new elements that belong to the target language and have not already been seen.
This is weaker than identifying the whole language. A learner does not need to know the exact rule behind the language. It only needs to stop hallucinating invalid new examples after some finite point. That weaker requirement is powerful: earlier work showed that generation in the limit is possible for every countable collection of languages under perfect data.
But there is a catch, because of course there is. Standard generation can succeed while covering only a thin subset of the target. The paper’s introduction gives a useful intuition: a generator that keeps outputting “I generated 1”, “I generated 2”, and so on may technically produce valid English strings, but it has not learned English in any meaningful sense. It has learned to count while wearing a language model costume.
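A toy sketch may make "generating without identifying" concrete. It assumes a hypothetical collection that is not taken from the paper: language k is the set of positive multiples of k, and the stream is clean. The helper name `generate_unseen` is likewise invented for illustration. The learner never names the target; it only outputs an unseen multiple of the greatest common divisor of what it has seen, which is always valid because the target's k divides every example.

```python
from functools import reduce
from math import gcd


def generate_unseen(seen: list[int]) -> int:
    """Return the smallest positive multiple of gcd(seen) not yet seen.

    Toy assumption: the target is "multiples of k" for some unknown k,
    and every element of `seen` is a valid example (no contamination).
    """
    g = reduce(gcd, seen)  # k divides every example, so k divides g
    seen_set = set(seen)
    candidate = g
    while candidate in seen_set:
        candidate += g  # stay inside the multiples of g
    return candidate


# Clean examples from the (hidden) target "multiples of 6".
stream = [12, 30, 18, 6, 42]
outputs = [generate_unseen(stream[:t]) for t in range(1, len(stream) + 1)]
```

Every output is a valid unseen multiple of 6, yet after the first example the learner is only safe because it retreats to multiples of 12. That is the thin-coverage catch in miniature: validity is cheap when the generator is allowed to hide in a subset.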
That is why prior work introduced dense generation. Density asks whether the generator covers a positive fraction of the target language rather than hiding in an ever-narrower corner. The paper distinguishes two views:
| View of generation | What is measured | Practical analogue |
|---|---|---|
| Element-based density | The long-run sequence of individual outputs | What users see through an API over many interactions |
| Set-based density | The set of outputs the generator can produce at a given stage | The expressive capability of a trained model snapshot |
That distinction is not decorative. Element-based density asks whether the visible output stream is broad. Set-based density asks whether the model’s available output space is broad. For business readers, the second is closer to capability coverage: not just “did we see a good answer eventually?”, but “does the system have broad access to the domain when we need it?”
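For readers who want the coverage notion pinned down, one standard way to write density is sketched here. The notation is an assumption made for illustration: $x_1, x_2, \dots$ is a canonical enumeration of the target language $K$, and $A \subseteq K$ is the set being measured. The paper's exact definitions may differ in detail.

$$
\overline{\mu}(A, K) = \limsup_{n \to \infty} \frac{|A \cap \{x_1, \dots, x_n\}|}{n},
\qquad
\underline{\mu}(A, K) = \liminf_{n \to \infty} \frac{|A \cap \{x_1, \dots, x_n\}|}{n}.
$$

Roughly, the element-based view takes $A$ to be the strings the generator actually emits over time, while the set-based view takes $A$ to be what the generator could emit at a given stage. A positive value means the generator is not hiding in a vanishing corner of $K$.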
The new paper asks what happens to both ordinary generation and dense generation when the input stream is contaminated.
Vanishing noise is tolerable; constant noise is not a rounding error
The first major result is surprisingly forgiving. If the fraction of noisy examples converges to zero, then ordinary generation remains possible for all countable language collections, even with infinitely many bad examples and arbitrary omissions.
That sounds generous, but it is more precise than the usual “large datasets wash out noise” folk wisdom. The paper’s condition is not that the total number of bad examples stays small. It can be infinite. The condition is that bad examples become asymptotically sparse.
In business terms, a dataset may contain endless errors and still preserve learnability if the error rate keeps declining as the corpus expands. That is the theoretical version of “scale helps, but only if quality improves faster than the mess accumulates.” Scale does not magically absolve a static contamination rate. It merely gives you a bigger landfill with better branding.
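The gap between "infinitely many bad examples" and "a vanishing rate of bad examples" is easy to check with arithmetic. The sketch below uses two hypothetical noise patterns that are not from the paper: noise at every perfect-square position (infinitely many bad examples, but ever sparser) and noise at every tenth position (a fixed rate).

```python
from math import isqrt


def noise_fraction(is_noisy, n: int) -> float:
    """Fraction of the first n stream positions (1-indexed) that are noisy."""
    return sum(1 for t in range(1, n + 1) if is_noisy(t)) / n


def at_squares(t: int) -> bool:
    """Hypothetical pattern: positions 1, 4, 9, 16, ... carry bad examples."""
    return isqrt(t) ** 2 == t


def every_tenth(t: int) -> bool:
    """Hypothetical pattern: positions 10, 20, 30, ... carry bad examples."""
    return t % 10 == 0


vanishing = [noise_fraction(at_squares, n) for n in (100, 10_000, 1_000_000)]
constant = [noise_fraction(every_tenth, n) for n in (100, 10_000, 1_000_000)]
# vanishing -> [0.1, 0.01, 0.001]; constant -> [0.1, 0.1, 0.1]
```

Both streams look identical at the first hundred examples. Only the trend separates them, which is why a single snapshot of the error rate says little about which regime a pipeline is in.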
The contrast appears when the noise rate is constant. Under a fixed nonzero contamination rate, ordinary generation is no longer guaranteed for all countable collections. The paper provides a characterization: a collection is generable under constant noise and arbitrary omissions if and only if it satisfies a specific “constant noise generation property”. The important business translation is not the technical condition itself, but the shift from universal guarantee to collection-specific fragility.
| Contamination regime | What the paper shows for ordinary generation | Business reading | Boundary |
|---|---|---|---|
| Finite contamination | Achievable for all countable collections | One-off bad data can be absorbed | Infinite-time formal model, not finite training performance |
| Infinite noise with vanishing rate | Achievable for all countable collections, even with arbitrary omissions | Data quality can improve asymptotically and still permit valid generation | Says little about sample efficiency or model architecture |
| Constant-rate noise | Achievable only for collections satisfying an additional condition | Persistent contamination changes the problem class | Not a universal operational threshold for real corpora |
The asymmetry around omissions is also worth noticing. The paper allows arbitrary omissions in several positive results for ordinary generation. This is because generation penalizes false positives more than false negatives. If the learner must output valid unseen elements, missing some target elements is less catastrophic than being tricked into outputting invalid ones.
That maps cleanly to many enterprise settings. A support agent that lacks some obscure product answer is inconvenient. A compliance agent that confidently invents a permitted action is expensive. The formal model captures that asymmetry elegantly.
Dense generation breaks sooner because breadth is a stricter promise
The second major contrast is less comforting. Dense generation is much less robust than ordinary generation.
For set-based density, finite contamination already imposes stronger restrictions. The paper proves that set-based upper density can still be achieved under finite contamination for all countable collections, and even gives a stronger result under finite noise with sufficiently non-dense omissions. But set-based lower density becomes sharply harder: the paper characterizes when it is possible, and simple two-language examples can limit or block stronger density guarantees.
The difference between upper and lower density is the difference between “the system is broad infinitely often” and “the system remains broad eventually and consistently.” Businesses usually want the second one. Nobody wants a due-diligence assistant that covers the acquisition target’s risk universe every third Thursday, provided the moon is in a cooperative mood.
Under vanishing noise plus arbitrary omissions, the paper’s result is even more sobering: set-based lower and upper density collapse to the same kind of characterization. In other words, contamination removes some of the separation that made weaker density easier in the clean setting. The system cannot simply rely on occasional broad moments if the adversarial stream can keep ambiguity alive.
Element-based density behaves differently. Under finite contamination, the paper shows a reduction: if an algorithm achieves element-based density in the noiseless setting, a corresponding algorithm can achieve the same type of guarantee under finite contamination. This relies on a finite-expansion idea: treat the observed contaminated stream as a clean enumeration of a modified language formed by adding finitely many noisy elements and removing finitely many omitted ones. Since finite modifications preserve the relevant density behaviour, the guarantee carries over.
But this rescue does not survive infinite contamination. The paper proves that there exists a two-language collection and a vanishing-noise enumeration, with no omissions, for which no element-based generator can achieve any non-trivial upper density. Sparse bad data can still be arranged to keep the learner trapped in a thin ambiguous region.
The operational lesson is blunt: correctness and coverage do not degrade at the same speed. A model can keep producing valid outputs while losing breadth. In business systems, that failure mode is subtle because it does not necessarily announce itself as hallucination. It appears as repetitive answers, missing alternatives, shallow recommendations, brittle long-tail handling, or risk analysis that quietly ignores entire regions of the problem.
The algorithmic trick is not “clean harder”; it is “rank suspicion over time”
The paper’s constructive results rely mainly on two algorithmic templates.
The first is the finite-expansion subroutine. If contamination is finite, the learner can conceptually expand the language collection to include every finite add-delete variant of each language. Because the original collection is countable, and a countable universe has only countably many finite subsets, each language has countably many such variants and the expanded collection remains countable. A contaminated enumeration of the original target then becomes an uncontaminated enumeration of one of these expanded languages.
This is mathematically neat and operationally suggestive. Finite contamination can sometimes be handled by broadening the hypothesis space rather than perfectly cleaning the input. In real systems, that resembles robust parsing, tolerant schema matching, and error-aware retrieval: do not assume every anomaly is fatal; model a bounded class of anomalies.
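The finite-expansion idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's construction: languages are represented as membership predicates, the target is the positive even numbers, and the helper name `expand` is invented here.

```python
def expand(member, added: frozenset, removed: frozenset):
    """Membership test for the modified language (K | added) - removed.

    `member` tests membership in the original language K; `added` holds
    finitely many noisy elements and `removed` finitely many omissions.
    """
    def modified(x) -> bool:
        return (member(x) or x in added) and x not in removed
    return modified


def is_even(x: int) -> bool:
    return x > 0 and x % 2 == 0


# Contaminated view of the evens: 7 is inserted noise, 4 never appears.
stream = [2, 7, 6, 8, 10, 12]
variant = expand(is_even, added=frozenset({7}), removed=frozenset({4}))

clean_for_original = all(is_even(x) for x in stream)  # False: 7 breaks it
clean_for_variant = all(variant(x) for x in stream)   # True: 7 is "legal" here
```

The stream is dirty with respect to the evens but perfectly clean with respect to one add-delete variant of them. The learner's problem has moved from cleaning the data to searching a larger, still countable, hypothesis space.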
The second template is more interesting for infinite contamination. The paper introduces a priority-based intersection algorithm. Earlier algorithms filtered candidate languages by consistency with observed data. That breaks under noise, because one invalid example can wrongly eliminate the true target. So the new approach does not simply discard candidates. It assigns priorities and penalizes candidate languages when their empirical mismatch with the stream exceeds appropriate thresholds.
The key idea is temporal: the true target is penalized only finitely often under the relevant noise condition, while bad candidates are penalized infinitely often. Eventually, the target and its useful supersets rise above misleading candidates.
This is the formal version of a familiar data-engineering instinct: do not overreact to one bad record, but do track systematic inconsistency. The theorem says, in this stylised world, the distinction can be enough.
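A deliberately simplified sketch shows the "penalize, do not discard" instinct at work. It is not the paper's algorithm: the fixed threshold of 0.2, the three candidate languages, the noise pattern, and the function names are all assumptions chosen for illustration. The paper's thresholds are more careful, precisely because a bad candidate with a small mismatch rate would slip under a fixed one.

```python
from math import isqrt


def rank_candidates(stream, candidates, threshold=0.2):
    """Rank candidate languages by how often their mismatch rate is too high.

    A candidate is penalized at step t when the fraction of the first t
    examples it rejects exceeds `threshold`. Nothing is ever eliminated.
    """
    penalties = {name: 0 for name in candidates}
    mismatches = {name: 0 for name in candidates}
    for t, x in enumerate(stream, start=1):
        for name, member in candidates.items():
            if not member(x):
                mismatches[name] += 1
            if mismatches[name] / t > threshold:
                penalties[name] += 1
    ranking = sorted(candidates, key=lambda name: penalties[name])
    return ranking, penalties


def noisy_even_stream(n: int) -> list[int]:
    """Evens 2t, except perfect-square positions carry the odd number 2t + 1."""
    return [2 * t + 1 if isqrt(t) ** 2 == t else 2 * t for t in range(1, n + 1)]


candidates = {
    "multiples_of_4": lambda x: x % 4 == 0,
    "multiples_of_3": lambda x: x % 3 == 0,
    "evens": lambda x: x % 2 == 0,  # the true target in this toy
}
ranking_short, penalties_short = rank_candidates(noisy_even_stream(400), candidates)
ranking_long, penalties_long = rank_candidates(noisy_even_stream(4000), candidates)
```

A strict consistency filter would have thrown out the true target at the very first example, which is noise. Here the target collects a handful of early penalties and then stops, while the wrong candidates keep accumulating them as the stream grows.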
| Technical mechanism | Role in the paper | Practical analogue |
|---|---|---|
| Finite expansion | Converts finite contamination into clean learning over an expanded countable collection | Robust handling of bounded anomalies |
| Priority-based ranking | Handles limiting noise rates that cannot be verified at any finite time | Longitudinal data-quality scoring rather than one-shot filtering |
| Early stopping in intersections | Prevents density collapse from intersecting too many candidate languages | Avoiding over-conservative model behaviour that becomes technically safe but practically useless |
| Bounded displacement | Restricts adversarial ordering of examples | Curriculum-like ordering from simpler to harder cases |
The “early stopping” part is important. If a generator intersects too many plausible candidate languages, it may remain valid but shrink to a low-density subset. That is exactly the formal analogue of over-conservative automation: the system becomes safe by becoming nearly useless. Congratulations, your model no longer hallucinates; it also no longer helps.
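The shrinkage is simple arithmetic. In the toy sketch below, which is an illustration rather than anything from the paper, the target is the even numbers and each candidate language is "multiples of m" for some m. Intersecting more candidates keeps every output valid but leaves a smaller and smaller fraction of the target. The function name `density_in_target` is hypothetical.

```python
from fractions import Fraction
from math import lcm


def density_in_target(target_step: int, steps: list[int], n: int) -> Fraction:
    """Fraction of the first n target elements that lie in every candidate.

    The target is the multiples of `target_step`; each candidate language is
    the multiples of one entry of `steps`.
    """
    period = lcm(target_step, *steps)  # the intersection is multiples of this
    target_prefix = [target_step * i for i in range(1, n + 1)]
    kept = sum(1 for x in target_prefix if x % period == 0)
    return Fraction(kept, n)


# Target: evens. Intersect with multiples of 3, then 5, then 7.
coverage = [
    density_in_target(2, [], 210),
    density_in_target(2, [3], 210),
    density_in_target(2, [3, 5], 210),
    density_in_target(2, [3, 5, 7], 210),
]
# coverage -> [1, 1/3, 1/15, 1/105]
```

Three extra intersections take coverage from everything to under one percent, without a single invalid output along the way. Stopping early is the difference between a cautious generator and a mute one.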
Curriculum helps only under a specific structural interpretation
The most business-friendly result in the paper is also the easiest to overstate. The authors introduce a beyond-worst-case model called bounded displacement. Informally, the adversary cannot keep presenting very late, difficult, or tail-end examples far before earlier canonical examples. If the canonical ordering is interpreted as simple-to-complex, bounded displacement resembles curriculum learning: examples arrive in an order that is not perfectly clean, but not maliciously scrambled forever.
Under this assumption, dense generation becomes achievable again for all countable collections under vanishing noise and arbitrary omissions. For set-based density, the paper proves lower-density guarantees approaching $1/d$, where $d$ is the bounded-displacement parameter. It also proves a matching impossibility result: in general, no algorithm can exceed $1/d$ set-based upper density in that setting. For element-based density, the paper obtains a lower-density guarantee via reduction from the set-based result, leaving a gap between that guarantee and the corresponding impossibility bound on upper density.
This is the paper’s most useful practical metaphor: ordering matters. Not just what data enters the system, but when and how it enters.
However, the curriculum implication has to be handled carefully. The result does not prove that curriculum learning cleans noisy web-scale pretraining. It does not compare training schedules, model sizes, token mixtures, or downstream benchmarks. It says that in a formal adversarial enumeration model, restricting the adversary’s ability to scramble the ordering can restore dense generation guarantees under vanishing contamination.
That is still valuable. It suggests that curriculum-like structure may be a robustness mechanism, not merely an optimisation trick. But it remains a theoretical pointer, not a procurement-ready checklist.
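To make "not maliciously scrambled" measurable, here is a sketch under a loudly stated assumption. The article does not restate the paper's formal definition of bounded displacement, so the code uses one hypothetical reading: the example at stream position t should have canonical index at most d times t. The function name `displacement_ratio` and both example streams are invented for illustration.

```python
def displacement_ratio(stream, canonical_index) -> float:
    """Largest ratio of an example's canonical index to its stream position.

    Hypothetical reading of bounded displacement: a stream respects
    parameter d if this ratio never exceeds d. The paper's formal
    definition may differ.
    """
    return max(canonical_index(x) / t for t, x in enumerate(stream, start=1))


def even_index(x: int) -> int:
    """Canonical position of x among the positive evens: 2 -> 1, 4 -> 2, ..."""
    return x // 2


mildly_shuffled = [2, 4, 8, 6, 12, 10]  # neighbours swapped, nothing far ahead
tail_first = [2, 200, 4, 6, 8, 10]      # the 100th element shows up second

mild = displacement_ratio(mildly_shuffled, even_index)
wild = displacement_ratio(tail_first, even_index)
```

Under this reading the first stream stays within a small constant of canonical order while the second jumps fifty-fold ahead at position two. Whatever the exact formalisation, the practical question is the same: someone has to choose the canonical order before the ratio means anything.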
The useful business frame is coverage risk, not just data quality risk
Most enterprise AI governance treats noisy data as a quality issue: remove duplicates, block spam, filter low-quality sources, document provenance, and call it a day. Those steps matter. But this paper suggests a sharper category: coverage risk.
Coverage risk is the risk that a model remains plausibly correct while becoming narrow. It can answer common cases, pass obvious tests, and still fail to represent the long tail of a domain. That is the uncomfortable part, because conventional evaluation often rewards visible correctness more than latent breadth.
For business interpretation, the paper’s results split into three layers:
| What the paper directly shows | Cognaptus interpretation | What remains uncertain |
|---|---|---|
| Ordinary generation tolerates infinite contamination when the noise fraction vanishes | Quality trends matter more than raw counts of bad examples | No finite-sample threshold is provided for real LLM training |
| Constant-rate noise requires collection-specific conditions | Persistent contamination can change whether learning is possible at all | Real datasets are not adversarial enumerations of formal languages |
| Dense generation is more fragile than ordinary generation | Broad domain capability may fail before visible correctness fails | Density is an abstraction, not a direct benchmark metric |
| Bounded displacement restores dense-generation guarantees under vanishing noise | Curriculum and ordering may support robustness under noisy data | The paper does not empirically validate curriculum schedules |
| Proper learning becomes much more restrictive under contamination | Systems may need flexible intermediate representations, not rigid exact-class identification | Mapping “proper” versus “improper” learning to deployed architectures is indirect |
This matters most in domains where breadth is economically valuable: legal review, medical triage, financial analysis, cyber defence, industrial troubleshooting, scientific search, and agentic workflow automation. In these settings, a narrow-but-fluent system may be worse than an obviously broken one. The broken one gets escalated. The narrow one gets trusted.
The limits are part of the result, not legal footnotes
The paper is formal and deliberately abstract. Its languages are infinite subsets of a countable universe. Its examples arrive through adversarial enumerations. Its guarantees are asymptotic. Its generators are algorithms with access to mathematical oracles about the language collection. None of that is how a transformer is pretrained on a finite token budget using stochastic gradient descent, unless one has had a very long lunch with category theorists.
So the boundaries are clear.
First, the paper does not quantify how much contamination an actual model can tolerate in a real corpus. There is no claim like “5% spam is safe” or “10% synthetic data breaks coverage.” Anyone extracting such a number should be gently escorted away from the dashboard.
Second, density is not the same as benchmark diversity. It is a mathematical coverage notion over a target language. It is useful because it formalises mode collapse-like behaviour, not because it gives a ready-made evaluation metric.
Third, bounded displacement depends on a canonical ordering. The business analogue is curriculum, but defining “easy before hard” in real data is itself a modelling decision. For code, it may relate to syntactic simplicity or dependency structure. For legal documents, perhaps clause complexity. For medical cases, perhaps severity, ambiguity, or comorbidity. None of these orderings drops from the heavens with a YAML file attached.
Fourth, the results are worst-case or beyond-worst-case guarantees. They tell us what is possible under specified adversarial powers. They do not prove that today’s production pipelines behave like these algorithms.
Still, dismissing the paper because it is theoretical would be a mistake. The point of theory is not to imitate production logs. It is to remove the comforting fog around them.
The strategic lesson: data pipelines should measure decay, not just dirt
The most useful operational idea from the paper is simple: contamination rate over scale matters. A corpus with many errors can still be theoretically manageable if the error fraction declines. A pipeline with a stable nonzero error rate may be structurally riskier, even if the team keeps adding more data and more compute.
That suggests a different governance posture. Teams should not only ask, “How dirty is this dataset?” They should ask:
- Is contamination declining as the corpus grows?
- Are omissions systematic in particular subdomains?
- Does the model retain broad coverage, or merely visible correctness?
- Does data ordering support progressive generalisation, or does it scramble rare hard cases into misleading early signals?
- Are evaluation suites checking diversity of valid outputs, not just absence of obvious hallucination?
The answers will not come directly from this theorem. They require empirical instrumentation. But the theorem tells us which instrumentation is worth building.
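As a starting point for the first question in the list above, a team could track the audited error rate across corpus snapshots and look at its trend rather than its level. The sketch below is a minimal illustration, not a validated method: the snapshot tuples are made-up numbers, the function name `rate_trend` is hypothetical, and fitting against the logarithm of corpus size is a modelling choice.

```python
from math import log10


def rate_trend(snapshots) -> float:
    """Least-squares slope of audited error rate against log10(corpus size).

    Each snapshot is (corpus_size, examples_audited, bad_examples_found).
    A clearly negative slope suggests contamination is thinning out as the
    corpus grows; a slope near zero suggests a stable rate.
    """
    xs = [log10(size) for size, _, _ in snapshots]
    ys = [bad / audited for _, audited, bad in snapshots]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Illustrative, made-up audit results for two pipelines.
declining = [(10**6, 1000, 80), (10**7, 1000, 40), (10**8, 1000, 20)]
stable = [(10**6, 1000, 50), (10**7, 1000, 50), (10**8, 1000, 50)]

declining_slope = rate_trend(declining)
stable_slope = rate_trend(stable)
```

A finite trend cannot certify a limit, which is the same reason the paper's ranking mechanism exists. What the slope can do is flag the pipelines where more data is arriving with the same proportion of mess attached.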
For AI businesses, the paper’s uncomfortable message is that “good enough data” is not one threshold. It depends on the capability you want. If all you need is occasional valid generation, vanishing contamination may be enough in the formal limit. If you need broad, reliable coverage, the system is much more fragile. The bad data does not merely add noise. It changes the shape of what the learner can safely know.
That is the part worth remembering. Not every contaminated dataset destroys generation. Some merely teach the model to survive by becoming narrower.
And in enterprise AI, narrow fluency is still fluency. That is precisely why it is dangerous.
Cognaptus: Automate the Present, Incubate the Future.
[1] Anay Mehrotra, Grigoris Velegkas, Xifan Yu, and Felix Zhou, “Language Generation with Infinite Contamination,” arXiv:2511.07417, 2025, https://arxiv.org/abs/2511.07417.