Opening — Why this matters now
The AI research ecosystem is sprinting, not strolling. Submissions to ICLR alone ballooned from 1,013 (2018) to nearly 20,000 (2026) — a growth curve that would make even the wildest crypto bull market blush. Yet the peer‑review system evaluating these papers… did not scale. The inevitable happened: errors slipped through, and then multiplied.
The paper To Err Is Human turns a quiet suspicion into quantitative evidence: published AI papers are riddled with objective mistakes, and those mistakes are increasing over time, not stabilizing. In an era where models depend on clean abstractions and reproducible results, this trend is more than academic housekeeping — it’s a structural risk to the entire research stack.
And ironically, the fix might not be more humans. It might be LLMs themselves.
Background — From peer review to peer overwhelm
Peer review was designed for a slower world: fewer papers, longer cycles, and humans with time to think. But a modern AI paper is a hybrid creature — equal parts mathematics, algorithm design, experiment orchestration, and infrastructural engineering. Under time pressure, authors ship fast. Reviewers skim faster.
The result is predictable:
- incorrect formulas, invalid derivations
- wrong table entries, mismatched text–table claims
- flawed proofs that made it to publication
- contradictory assumptions that no one noticed
The paper categorizes these into four buckets (page 3):
| Category | Examples |
|---|---|
| Math/Formulas | invalid derivations, wrong assumptions, incorrect properties |
| Text | logically incorrect explanations, factually wrong statements |
| Table/Figure | miscalculated values, mismatched captions |
| Cross‑reference | wrong figure/table citations |
And as Figure 1 (page 1) shows, the average number of mistakes is moving up, year after year, across NeurIPS, ICLR, and TMLR.
Analysis — What the paper actually does
The authors build a GPT‑5‑based Correctness Checker. Not a reviewer, not a summarizer — a focused, automated auditor that:
- Parses a full PDF.
- Identifies objective, ground‑truth‑verifiable mistakes.
- Uses a second LLM to filter false positives.
- Categorizes error types.
- Suggests corrections where possible.
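The pipeline is simple enough to sketch. Here is a minimal Python outline, assuming the PDF has already been converted to text and that `llm` is any prompt-to-string model call; the function names, prompts, and JSON fields are illustrative, not the authors' actual implementation:

```python
import json
from typing import Callable


def check_paper(paper_text: str, llm: Callable[[str], str]) -> list[dict]:
    """Two-stage correctness check over one paper's extracted text."""
    # Pass 1: flag objective, ground-truth-verifiable mistakes.
    candidates = json.loads(llm(
        "List every objective, verifiable mistake in the paper below as a JSON "
        "array of objects with fields: location, claim, why_wrong.\n\n" + paper_text
    ))

    verified = []
    for c in candidates:
        # Pass 2: an independent call acts as a false-positive filter.
        verdict = llm("Is this really a mistake? Answer yes or no.\n" + json.dumps(c))
        if not verdict.strip().lower().startswith("yes"):
            continue
        # Categorize the error and, where possible, propose a correction.
        c["category"] = llm(
            "Classify as one of: math/formula, text, table/figure, cross-reference.\n"
            + json.dumps(c)
        ).strip()
        c["suggested_fix"] = llm("Propose a concrete correction.\n" + json.dumps(c))
        verified.append(c)
    return verified
```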
Human researchers then validated the system:
- Precision: 83.2% (263/316 flagged mistakes were real)
- Recall: ~60% on controlled injected‑mistake experiments
- Correction success rate: 75.8% of LLM‑proposed fixes were judged correct
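These numbers are straightforward ratios. A quick sanity check of the precision figure from the counts above (recall, by contrast, was measured on papers with deliberately injected mistakes, i.e. detected injections over total injections):

```python
flagged = 316      # mistakes the checker flagged and sent to human validators
confirmed = 263    # flags the validators judged to be genuine mistakes

precision = confirmed / flagged
print(f"precision = {precision:.1%}")   # 83.2%
```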
This is not hand‑wavy “AI might help review” speculation. It’s an empirical demonstration: LLMs can, today, systematically detect and correct many objective errors across thousands of published papers.
Error frequencies
Based on 2,500 papers:
- Math/Formula mistakes: 54.0%
- Text mistakes: 31.4%
- Table/Figure errors: 9.3%
- Cross‑reference errors: 5.3%
Temporal trend
Consider NeurIPS (page 1):
- 2021: 3.8 mistakes per paper
- 2025: 5.9 mistakes per paper (+55%)
ICLR shows a similar climb (2018 → 2025: ~4.1 → 5.2 mistakes).
In other words: the research community is not just missing mistakes — it’s missing more mistakes each year.
Findings — An uncomfortable but necessary mirror
1. Errors are widespread — almost universal
99.2% of papers contained at least one mistake.
2. Substantive errors — not just cosmetic
Depending on the venue, 24%–36% of papers contained at least one mistake that could affect interpretation or reproducibility.
3. The LLM identified real mathematical faults
Consider these real examples from the paper:
- Claiming the product of two PSD matrices is PSD (false; page 8; see the numerical sketch after this list).
- An incorrect proof relying on injectivity of multiset functions (pages 6–7).
- Invalid applications of Radon‑Nikodym derivatives in control theory (page 9).
- Replacing log integrals with integrals of logs — a classic but consequential error (page 10).
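Two of these are easy to see concretely. A minimal NumPy sketch, with values chosen purely for illustration, of the PSD-product fallacy and the log-of-integral vs. integral-of-log confusion:

```python
import numpy as np

# --- PSD product fallacy ----------------------------------------------------
# A and B are both symmetric positive semidefinite, but their product is not:
# it is not even symmetric, and its quadratic form goes negative.
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])      # eigenvalues 2, 0  -> PSD
B = np.array([[1.0, 0.0],
              [0.0, 0.0]])      # eigenvalues 1, 0  -> PSD
AB = A @ B
x = np.array([1.0, -2.0])
print(AB)          # [[1. 0.], [1. 0.]]  -- not symmetric
print(x @ AB @ x)  # -1.0  -- negative quadratic form, so AB is not PSD

# --- log of an average vs. average of logs ----------------------------------
# By Jensen's inequality, E[log X] <= log E[X]; swapping them silently
# changes the quantity being computed or bounded.
samples = np.array([0.1, 1.0, 10.0])
print(np.log(samples.mean()))    # log E[X]  ~= 1.31
print(np.log(samples).mean())    # E[log X]  ~= 0.0
```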
These aren’t typo‑level issues. They’re the kinds of errors that reshape conclusions.
4. Error detection is improving — but not perfect
LLMs excel at structured mathematical errors but struggle with:
- narrative inconsistencies
- cross‑reference misalignments
- OCR‑induced weirdness
A hybrid human–AI workflow is therefore not optional; it’s optimal.
5. The rise of paper‑auditing agents
Because the checker costs < $0.50 per paper, the economics shift dramatically. For the first time, systematic, large‑scale auditing of published literature becomes feasible.
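To put that in perspective, a back-of-the-envelope estimate using the paper's per-paper cost ceiling and the ICLR submission figure quoted in the opening:

```python
cost_per_paper = 0.50        # reported upper bound for the checker, in USD
iclr_submissions = 20_000    # approximate ICLR 2026 submission volume

print(f"Auditing one ICLR cycle: ~${cost_per_paper * iclr_submissions:,.0f}")  # ~$10,000
```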
Visualization — Mistakes by venue and type
A simplified reproduction of the paper’s core insights:
Average Mistakes per Paper (Trend Summary)
| Venue | Earlier baseline (year) | 2025 | Increase |
|---|---|---|---|
| NeurIPS | 3.8 (2021) | 5.9 | +55% |
| ICLR | 4.1 (2018) | 5.2 | +27% |
| TMLR | 5.0 (2022–23) | 5.5 | +10% |
Mistake Distribution
| Type | % of Mistakes |
|---|---|
| Math/Formulas | 54% |
| Text | 31% |
| Table/Figure | 9% |
| Cross‑reference | 5% |
The math‑heavy nature of AI research means the most brittle parts are also the most critical.
Implications — The research world is due for a systems upgrade
1. Peer review is no longer enough
Human reviewers cannot consistently detect the volume or complexity of modern AI‑paper mistakes. This is not a moral failing — it’s a scaling mismatch.
2. LLM‑assisted correctness checking will become standard
Just as spell‑check became mandatory, correctness‑check tools will be embedded into:
- submission pipelines
- conference workflows
- preprint servers
- journal editorial systems
3. Scientific rigor becomes machine‑augmented
Humans specialize in conceptual evaluation; machines specialize in pattern matching and mathematical consistency. Together, they approximate something like actual rigor.
4. Research reproducibility gains a new enforcement layer
Instead of relying on ad hoc replication attempts, systematic LLM auditing can:
- flag methodological inconsistencies
- detect miscalculations early
- catch proofs that don’t actually work
5. A new class of autonomous research agents emerges
The “Correctness Checker” is the early prototype of an inevitable future:
- agents that audit literature
- agents that suggest formal corrections
- agents that maintain knowledge bases of verified results
- agents that detect citation cascades built on flawed premises
In other words: AI begins maintaining science, not just generating it.
Conclusion — A small tool with large consequences
To Err Is Human is both a diagnosis and a preview. The diagnosis: published AI papers contain more mistakes than anyone wants to admit, and the trend is worsening. The preview: LLM‑powered auditors will soon become embedded in the scientific process, not as judges, but as stabilizers.
Science is cumulative. Mistakes are contagious. And LLMs — when used wisely — may be the immune system the field didn’t know it needed.
Cognaptus: Automate the Present, Incubate the Future.