Opening — Why this matters now

The AI research ecosystem is sprinting, not strolling. Submissions to ICLR alone ballooned from 1,013 (2018) to nearly 20,000 (2026) — a growth curve that would make even the wildest crypto bull market blush. Yet the peer‑review system evaluating these papers… did not scale. The inevitable happened: errors slipped through, and then multiplied.

The paper To Err Is Human turns a quiet suspicion into quantitative evidence: published AI papers are riddled with objective mistakes, and those mistakes are increasing over time, not stabilizing. In an era where models depend on clean abstractions and reproducible results, this trend is more than academic housekeeping — it’s a structural risk to the entire research stack.

And ironically, the fix might not be more humans. It might be LLMs themselves.

Background — From peer review to peer overwhelm

Peer review was designed for a slower world: fewer papers, longer cycles, and humans with time to think. But a modern AI paper is a hybrid creature — equal parts mathematics, algorithm design, experiment orchestration, and infrastructural engineering. Under time pressure, authors ship fast. Reviewers skim faster.

The result is predictable:

  • incorrect formulas, invalid derivations
  • wrong table entries, mismatched text–table claims
  • flawed proofs that made it to publication
  • contradictory assumptions that no one noticed

The paper categorizes these into four buckets (page 3):

Category          Examples
Math/Formulas     invalid derivations, wrong assumptions, incorrect properties
Text              logically incorrect explanations, factually wrong statements
Table/Figure      miscalculated values, mismatched captions
Cross‑reference   wrong figure/table citations

And as Figure 1 (page 1) shows, the average number of mistakes is moving up, year after year, across NeurIPS, ICLR, and TMLR.

Analysis — What the paper actually does

The authors build a GPT‑5‑based Correctness Checker. Not a reviewer, not a summarizer — a focused, automated auditor that:

  1. Parses a full PDF.
  2. Identifies objective, ground‑truth‑verifiable mistakes.
  3. Uses a second LLM to filter false positives.
  4. Categorizes error types.
  5. Suggests corrections where possible (a minimal code sketch of this loop follows).
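
The five steps compress into a two-stage loop: one LLM pass proposes candidate mistakes, a second pass filters false positives. Below is a minimal sketch, assuming a placeholder call_llm wrapper for whatever GPT‑5 client you use and pypdf for text extraction; the prompts and output parsing are illustrative, not the authors’ actual pipeline.

```python
from dataclasses import dataclass

from pypdf import PdfReader  # assumed extractor; the paper's PDF parsing may differ


@dataclass
class Mistake:
    category: str       # "math", "text", "table/figure", or "cross-reference"
    excerpt: str        # the offending passage
    suggested_fix: str  # proposed correction (may be empty)


def call_llm(prompt: str) -> str:
    """Placeholder: wire up your own GPT-5 (or other) client here."""
    raise NotImplementedError


def extract_text(pdf_path: str) -> str:
    # Step 1: parse the full PDF into plain text.
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)


def check_paper(pdf_path: str) -> list[Mistake]:
    text = extract_text(pdf_path)
    # Step 2: ask for objective, ground-truth-verifiable mistakes only.
    raw = call_llm(
        "List objective, verifiable mistakes in this paper, one per line "
        "formatted as 'category | excerpt | suggested fix':\n\n" + text
    )
    candidates = [line.split("|") for line in raw.splitlines() if line.count("|") == 2]

    mistakes = []
    for category, excerpt, fix in candidates:
        # Step 3: a second LLM call filters false positives.
        verdict = call_llm(
            "Is the following genuinely a mistake rather than a judgment call? "
            f"Answer yes or no.\n\n{excerpt}"
        )
        if verdict.strip().lower().startswith("yes"):
            # Steps 4-5: keep the category label and the proposed correction.
            mistakes.append(Mistake(category.strip(), excerpt.strip(), fix.strip()))
    return mistakes
```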

Human researchers then validated the system:

  • Precision: 83.2% (263/316 flagged mistakes were real; a quick check follows this list)
  • Recall: ~60% on controlled injected‑mistake experiments
  • Correction success rate: 75.8% of LLM‑proposed fixes were judged correct
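
The headline precision number is straightforward arithmetic on the counts above (recall and the correction rate come from separate experiments whose raw counts are not restated here); a minimal check:

```python
# Recompute the reported precision from the confirmed / flagged counts above.
confirmed, flagged = 263, 316
print(f"precision = {confirmed / flagged:.1%}")  # -> 83.2%
```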

This is not hand‑wavy “AI might help review” speculation. It’s an empirical demonstration: LLMs can, today, systematically detect and correct many objective errors across thousands of published papers.

Error frequencies

Based on 2,500 papers:

  • Math/Formula mistakes: 54.0%
  • Text mistakes: 31.4%
  • Table/Figure errors: 9.3%
  • Cross‑reference errors: 5.3%

Temporal trend

Consider NeurIPS (page 1):

  • 2021: 3.8 mistakes per paper
  • 2025: 5.9 mistakes per paper (+55%)

ICLR shows a similar climb (2018 → 2025: ~4.1 → 5.2 mistakes).

In other words: the research community is not just missing mistakes — it’s missing more mistakes each year.

Findings — An uncomfortable but necessary mirror

1. Errors are widespread — almost universal

99.2% of papers contained at least one mistake.

2. Substantive errors — not just cosmetic

Depending on the venue, 24%–36% of papers contained at least one mistake that could affect interpretation or reproducibility.

3. The LLM identified real mathematical faults

Consider these real examples from the paper:

  • Claiming the product of two PSD matrices is PSD (false; page 8; a numeric counterexample appears below).
  • An incorrect proof relying on injectivity of multiset functions (pages 6–7).
  • Invalid applications of Radon‑Nikodym derivatives in control theory (page 9).
  • Replacing log integrals with integrals of logs — a classic but consequential error (page 10).

These aren’t typo‑level issues. They’re the kinds of errors that reshape conclusions.
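
The first item is easy to falsify numerically. A minimal NumPy check, with matrices chosen here purely for illustration (they are not taken from the audited paper): A and B are both symmetric PSD, yet their product is not symmetric and its quadratic form goes negative.

```python
import numpy as np

# Two symmetric positive semidefinite matrices.
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])   # eigenvalues {1, 0}
B = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # eigenvalues {2, 0}

P = A @ B                     # [[1, 1], [0, 0]] -- not symmetric
x = np.array([1.0, -2.0])
print(x @ P @ x)              # -1.0: the quadratic form is negative, so P is not PSD
```

Symmetry is lost and the quadratic form dips below zero, which is precisely what a PSD claim is usually invoked to guarantee.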

4. Error detection is improving — but not perfect

LLMs excel at structured mathematical errors but struggle with:

  • narrative inconsistencies
  • cross‑reference misalignments
  • OCR‑induced weirdness

A hybrid human–AI workflow is therefore not optional; it’s optimal.

5. The rise of paper‑auditing agents

Because the checker costs less than $0.50 per paper, the economics shift dramatically: auditing a venue the size of ICLR 2026 (nearly 20,000 submissions) would cost on the order of $10,000. For the first time, systematic, large‑scale auditing of published literature becomes feasible.

Visualization — Mistakes by venue and type

A simplified reproduction of the paper’s core insights, plus a short plotting sketch below:

Average Mistakes per Paper (Trend Summary)

Venue     Earlier (year)   2025   Increase
NeurIPS   3.8 (2021)       5.9    +55%
ICLR      4.1 (2018)       5.2    +27%
TMLR      5.0 (2022–23)    5.5    +10%

Mistake Distribution

Type              % of mistakes
Math/Formulas     54%
Text              31%
Table/Figure      9%
Cross‑reference   5%
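
To turn these summaries into charts, here is a short matplotlib sketch; the numbers are hard-coded from the tables above, and the percentage increases are recomputed rather than copied.

```python
import matplotlib.pyplot as plt

# Trend summary: average mistakes per paper (earlier year -> 2025).
trend = {"NeurIPS": (3.8, 5.9), "ICLR": (4.1, 5.2), "TMLR": (5.0, 5.5)}
for venue, (early, late) in trend.items():
    print(f"{venue}: {early} -> {late}  (+{(late - early) / early:.0%})")

# Mistake distribution across the 2,500 audited papers.
dist = {"Math/Formulas": 54.0, "Text": 31.4, "Table/Figure": 9.3, "Cross-reference": 5.3}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

positions = range(len(trend))
ax1.bar([p - 0.2 for p in positions], [v[0] for v in trend.values()], width=0.4, label="earlier year")
ax1.bar([p + 0.2 for p in positions], [v[1] for v in trend.values()], width=0.4, label="2025")
ax1.set_xticks(list(positions))
ax1.set_xticklabels(list(trend.keys()))
ax1.set_ylabel("avg. mistakes per paper")
ax1.legend()

ax2.bar(list(dist.keys()), list(dist.values()))
ax2.set_ylabel("% of mistakes")
ax2.tick_params(axis="x", labelrotation=30)

fig.tight_layout()
plt.show()
```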

The math‑heavy nature of AI research means the most brittle parts are also the most critical.

Implications — The research world is due for a systems upgrade

1. Peer review is no longer enough

Human reviewers cannot consistently detect the volume or complexity of modern AI‑paper mistakes. This is not a moral failing — it’s a scaling mismatch.

2. LLM‑assisted correctness checking will become standard

Just as spell‑check became ubiquitous, correctness‑check tools will be embedded into:

  • submission pipelines
  • conference workflows
  • preprint servers
  • journal editorial systems

3. Scientific rigor becomes machine‑augmented

Humans specialize in conceptual evaluation; machines specialize in pattern matching and mathematical consistency. Together, they approximate something like actual rigor.

4. Research reproducibility gains a new enforcement layer

Instead of relying on ad hoc replication attempts, systematic LLM auditing can:

  • flag methodological inconsistencies
  • detect miscalculations early
  • catch proofs that don’t actually work

5. A new class of autonomous research agents emerges

The “Correctness Checker” is the early prototype of an inevitable future:

  • agents that audit literature
  • agents that suggest formal corrections
  • agents that maintain knowledge bases of verified results
  • agents that detect citation cascades built on flawed premises

In other words: AI begins maintaining science, not just generating it.

Conclusion — A small tool with large consequences

To Err Is Human is both a diagnosis and a preview. The diagnosis: published AI papers contain more mistakes than anyone wants to admit, and the trend is worsening. The preview: LLM‑powered auditors will soon become embedded in the scientific process, not as judges, but as stabilizers.

Science is cumulative. Mistakes are contagious. And LLMs — when used wisely — may be the immune system the field didn’t know it needed.

Cognaptus: Automate the Present, Incubate the Future.