Error 404: Peer Review Not Found — How LLMs Are Quietly Rewriting Scientific Quality Control

Deadline.

That is the simplest way to understand why modern AI papers contain mistakes. Not because researchers suddenly forgot algebra. Not because reviewers are lazy. Not because the field has collectively decided that proofs are decorative furniture. The more boring explanation is also the more important one: the AI publication machine has scaled faster than the quality-control machinery around it.

A modern AI paper is not a tidy essay. It is a compressed bundle of mathematical claims, derivations, algorithm descriptions, benchmark tables, implementation choices, appendix proofs, and narrative framing. Then it is written under conference pressure, revised under reviewer pressure, and consumed under arXiv-speed pressure. Somewhere in that pipeline, a table stops matching the text, a probability argument quietly breaks, a theorem proof skips a condition, and everyone moves on because the next submission deadline is already growling in the corridor.

Bianchi, Kwon, Izzo, Zhang, and Zou’s paper, To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis, turns that ambient suspicion into measurable evidence.¹ The authors build a GPT-5-based AI Correctness Checker and use it to audit 2,500 published AI papers from ICLR, NeurIPS, and TMLR. The focus is deliberately narrow: objective mistakes with a verifiable right-or-wrong answer. Not novelty. Not taste. Not whether the paper is “important.” Just whether the formulas, tables, derivations, references, and factual claims hold together.

The headline result is uncomfortable: the checker finds an average of 4.66 objective mistakes per paper, with 99.2% of papers containing at least one flagged mistake. Some are minor. Some are not. Human validation confirms that many are real. And the error burden appears to be increasing over time.

This is not a paper about replacing peer reviewers with robots. That would be the lazy take, and naturally the internet will try it within minutes. The stronger reading is more operational: LLMs are becoming cheap, scalable diagnostic instruments for technical correctness. In other words, scientific quality control may be about to acquire something it has badly lacked: an automated first-pass inspection layer.

The evidence starts with scale, not philosophy

The paper’s most useful move is that it does not begin with a manifesto about AI reviewers. It begins with measurement.

The authors sample 1,600 ICLR papers from 2018–2025, 500 NeurIPS papers from 2021–2025, and 400 TMLR papers from 2022–2025. They use OpenReview because the papers are accessible through a consistent infrastructure, and they focus on single-column publication formats because OCR and document parsing are less likely to create artifacts. They also exclude papers larger than 10 MB and avoid older NeurIPS papers where formatting differences could make extraction noisier.

That sampling design matters. It is not a complete census of all AI literature. It is a controlled look at top-tier venues where formatting and access make large-scale auditing feasible. So the result should not be read as “all AI papers everywhere have exactly this error rate.” It should be read as: in a large, high-profile, relatively clean sample of published AI papers, objective mistakes are common enough that pretending otherwise is now a management choice.

The core numbers are blunt:

Result	What the paper directly shows	What it means	Boundary
Average mistakes per paper	4.66, standard error 0.04	Objective errors are not rare edge cases	Counts depend on the checker’s detection ability and document parsing
Papers with at least one flagged mistake	2,481 of 2,500, or 99.2%	Almost every audited paper contains at least one potential issue	A flag is not automatically a confirmed mistake
Venue averages	ICLR 4.49, NeurIPS 4.68, TMLR 5.28	The pattern is not isolated to one publication venue	Venue differences may reflect paper mix, formatting, review culture, or sampling
Papers with at least one potentially substantive mistake	ICLR 23.8%, NeurIPS 30.8%, TMLR 36.0%	Some issues plausibly affect interpretation or reproducibility	“Substantive” is harder to classify than “wrong cross-reference”
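
An article about unchecked arithmetic should probably check its own. The snippet below is a minimal sketch that recomputes the headline figures from the sample sizes and venue averages quoted above. The variable names are ours, and the pooled mean is only approximate because the venue averages are already rounded to two decimals.

```python
# Sample sizes and venue averages as quoted in this article.
venues = {
    "ICLR": {"papers": 1600, "mean_mistakes": 4.49},
    "NeurIPS": {"papers": 500, "mean_mistakes": 4.68},
    "TMLR": {"papers": 400, "mean_mistakes": 5.28},
}

total_papers = sum(v["papers"] for v in venues.values())

# Weight each venue average by its share of the sample.
pooled_mean = sum(
    v["papers"] * v["mean_mistakes"] for v in venues.values()
) / total_papers

# Share of papers with at least one flagged mistake.
flagged_share = 2481 / total_papers

print(total_papers)
print(round(pooled_mean, 2))
print(round(100 * flagged_share, 1))
```

The pooled mean lands near 4.65, close to the reported 4.66. The small gap is plausibly rounding in the venue averages rather than anything more interesting.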

The numbers are not interesting because they embarrass authors. They are interesting because they reveal a systems problem. If nearly every paper in a top-venue sample contains at least one checker-flagged objective issue, the relevant question is no longer whether peer review is “good” or “bad.” The relevant question is whether a human-only review process is structurally suited to detecting small technical errors across thousands of dense papers.

That answer is becoming rather obvious. Charming, even. In the way a fire alarm is charming.

The trend is the part that should worry research managers

The paper does not merely report that mistakes exist. It reports that detected mistakes appear to be increasing.

For NeurIPS, average detected mistakes rise from 3.8 per paper in 2021 to 5.9 in 2025, a 55% increase. For ICLR, the average rises from 4.1 in 2018 to 5.2 in 2025, a 27% increase. For TMLR, the average rises from 5.0 in 2022–2023 to 5.5 in 2025, roughly a 10% increase.
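
The percentages follow directly from the start and end averages. As a small sketch (the function name `relative_increase` is our own, and the inputs are the rounded averages quoted above, so the results are approximate):

```python
def relative_increase(start: float, end: float) -> float:
    """Return the percentage change from start to end."""
    return 100.0 * (end - start) / start


# (first-period average, last-period average) as quoted above.
trends = {
    "NeurIPS 2021 -> 2025": (3.8, 5.9),
    "ICLR 2018 -> 2025": (4.1, 5.2),
    "TMLR 2022-2023 -> 2025": (5.0, 5.5),
}

for label, (start, end) in trends.items():
    print(f"{label}: {relative_increase(start, end):.0f}%")
```

NeurIPS shows the steepest rise and TMLR the shallowest, although TMLR also starts from the highest baseline.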

The more important signal is not the exact slope. It is that the authors attempt to address an obvious alternative explanation: appendices and paper length. NeurIPS changed appendix practices over time, which could make later papers appear more error-prone simply because more material is included in the audited PDF. To test this, the authors rerun the NeurIPS analysis using only the first 10 pages of each paper.

That first-10-pages experiment is best read as a robustness or sensitivity test. It is not a second thesis. It asks whether the trend survives when the checker sees a more consistent content window across years. The answer is yes: both average mistakes and the fraction of papers with at least one potentially substantive mistake still increase.

Test	Likely purpose	What it supports	What it does not prove
Full corpus audit	Main evidence	Objective mistakes are widespread in the sampled published papers	True total mistakes in all papers
Temporal venue trends	Main evidence	Detected mistakes have increased across sampled years	The exact causal reason for the increase
NeurIPS first-10-pages audit	Robustness / sensitivity test	The NeurIPS trend is not simply an appendix-length artifact	That all venue trends are immune to every formatting factor
Human precision review	Validation evidence	Most checked flags in the validation sample were real mistakes	That every flag in the 2,500-paper corpus is true
Injected-error recall test	Controlled validation	The checker misses many real errors, so counts may be conservative	Recall on all naturally occurring published-paper mistakes
Fix evaluation	Operational extension	The checker can often propose usable corrections	That automated corrections are safe without human review

This distinction matters for business readers. A robustness test is not decorative. It tells you whether a finding is likely to survive a plausible alternative explanation. In this case, the paper’s authors are aware that document length and appendix inclusion could distort time trends, and they explicitly test one version of that concern. That does not make the result causal. It makes it harder to dismiss as a formatting mirage.

The checker is not a reviewer; it is a technical auditor

The AI Correctness Checker has a deliberately constrained job. It does not judge whether a paper is original, elegant, strategically important, or likely to win a best paper award. It looks for objective mistakes.

The pipeline has several modules. A first GPT-5 model searches for objective errors such as mathematical mistakes, logical contradictions, miscalculations, discrepancies between text and figures, and incorrect cross-references. A second GPT-5 model checks the proposed mistakes and removes false positives. The system also flags whether a mistake may be “potentially substantive,” meaning that it could change results, alter interpretation, or create non-obvious confusion for a reader.

A separate GPT-5-mini module assigns mistakes to one of four categories:

Category	Examples	Why it matters
Math / Formula	Wrong derivations, invalid proof steps, incorrect assumptions	Highest direct risk to theoretical claims and reproducibility
Text	Incorrect definitions, logically imprecise explanations, factual mistakes	Can mislead readers even when equations look polished
Table / Figure	Inconsistent values, wrong captions, mismatch between table and text	Can distort empirical comparison or implementation details
Cross-reference	Wrong figure, equation, table, or appendix reference	Often minor, but costly during reproduction or close reading

The distribution is revealing. Math and formula mistakes account for 54.0% of detected issues. Text mistakes account for 31.4%, table and figure mistakes for 9.3%, and cross-reference errors for 5.3%. Human assessment finds the category labels to be 84% accurate.

This is where the paper becomes more than a complaint about peer review. The most common mistake class is not “a typo in the caption.” It is mathematical or formula-level error. That includes incorrect derivations, invalid proof logic, and wrong assumptions. For an AI field that often converts mathematical claims into model architectures, training objectives, benchmarks, and production heuristics, this is not an academic housekeeping issue. It is upstream technical debt.
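
The module structure described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the `Flag` fields, the `audit` function, and the three callables are hypothetical names, and the toy stand-ins replace the model calls so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# The four categories named in the article (identifiers are our own).
CATEGORIES = ("math_formula", "text", "table_figure", "cross_reference")


@dataclass
class Flag:
    location: str
    description: str
    potentially_substantive: bool = False
    category: str = ""


def audit(
    paper_text: str,
    detect: Callable[[str], Iterable[Flag]],
    verify: Callable[[str, Flag], bool],
    categorize: Callable[[Flag], str],
) -> List[Flag]:
    """Run detect -> verify -> categorize and return the surviving flags."""
    kept = []
    for flag in detect(paper_text):
        if not verify(paper_text, flag):
            continue  # second pass rejected it as a likely false positive
        category = categorize(flag)
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        flag.category = category
        kept.append(flag)
    return kept


# Toy stand-ins (hypothetical) so the sketch runs without any model calls.
def toy_detect(text):
    yield Flag("Eq. 3", "missing absolute value", potentially_substantive=True)
    yield Flag("Table 2", "value differs from text")


def toy_verify(text, flag):
    return flag.location != "Table 2"


def toy_categorize(flag):
    return "math_formula" if flag.location.startswith("Eq.") else "text"


flags = audit("paper body ...", toy_detect, toy_verify, toy_categorize)
print([(f.location, f.category) for f in flags])
```

Passing the three stages in as callables keeps the detector, the verifier, and the categorizer independently replaceable, which matters if each is backed by a different model.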

Human validation turns flags into evidence

An automated checker is only useful if we know how much to trust it. The authors therefore conduct human validation in two ways: precision testing and recall testing.

For precision, they sample 60 published papers that contain at least one potentially substantive mistake according to the checker. Across those papers, the checker identified 316 potential mistakes. Human experts reviewed those flags and confirmed that 263 were genuine mistakes. That gives the checker a precision of 83.2%.

That is strong enough to be useful, but not strong enough to be treated as final judgment. Roughly one in six flags in the validation set was not a real mistake. The right operational interpretation is: “This system can produce a high-quality triage queue.” It is not: “This system can automatically convict a paper.”

The severity classification is more subtle. Among the 263 confirmed mistakes, the checker marked 76 as substantive and human reviewers marked 86, with an overlap of 62. Counting the 163 mistakes that both sides called minor, the two agreed on 225 of the 263 labels (about 86%), so agreement was frequent but not perfect. The contingency table tells the story:

	Human: substantive	Human: minor
AI: substantive	62	14
AI: minor	24	163

This is exactly what we should expect. Whether a mistake is objectively wrong can often be settled by checking a formula, table, or reference. Whether that mistake is “substantive” depends on context: Does it affect a main theorem? Does the result survive under a corrected assumption? Would a typical reader be misled? Does the mistake alter reproducibility, or merely add friction?

So the checker is better understood as a correctness detector than a severity judge. It can help humans find the fire. Humans still need to decide whether the building is burning down or whether someone merely toasted bread with unusual ambition.
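
The validation figures in this section reduce to a few ratios. The sketch below recomputes them from the quoted counts. Cohen's kappa is our own addition, included as one common way to discount agreement that chance alone would produce; it is not a statistic quoted above.

```python
# Counts as quoted above.
flags_reviewed = 316
confirmed = 263

# Severity contingency table over the confirmed mistakes.
both_substantive = 62
ai_only_substantive = 14
human_only_substantive = 24
both_minor = 163

precision = confirmed / flags_reviewed
total = both_substantive + ai_only_substantive + human_only_substantive + both_minor
observed_agreement = (both_substantive + both_minor) / total

# Cohen's kappa: agreement corrected for what chance alone would produce.
ai_sub = both_substantive + ai_only_substantive
human_sub = both_substantive + human_only_substantive
expected_agreement = (
    ai_sub * human_sub + (total - ai_sub) * (total - human_sub)
) / total**2
kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)

print(f"precision: {precision:.1%}")
print(f"severity agreement: {observed_agreement:.1%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement on severity is noticeably higher than the chance-corrected figure, which is the numerical version of the point above: finding the mistake is the easier half of the job.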

Recall is the quiet reason the counts may be conservative

Precision asks: when the checker flags an issue, how often is it right?

Recall asks the more painful question: how many real mistakes does it miss?

To estimate recall, the authors create a controlled injected-error experiment. They select five papers co-authored by at least one of the authors, create three mistake-injected copies of each paper, and insert six mistakes into each copy. That gives 15 modified papers and 90 injected mistakes. The checker is run three times independently on each copy, and the outputs are manually reviewed.

The overall recall is 60.0%. Detection varies by category: math/formula mistakes are detected most reliably at 66.7%, followed by table/figure mistakes at 61.9%, text mistakes at 55.9%, and cross-reference mistakes at 53.8%.

This result is easy to misread. A weak interpretation says: “Only 60% recall? Not good enough.” A better interpretation says: “The observed corpus-wide mistake counts may be lower bounds.” If the checker misses around 40% of injected mistakes in a controlled setting, then the average of 4.66 detected mistakes per paper is not likely to represent the total number of real objective mistakes. It represents what this checker can catch under this pipeline.

That does not mean the true error rate is exactly 4.66 divided by 0.60. Natural published-paper mistakes are not identical to injected mistakes, and the injected set comes from a small controlled sample. But directionally, the recall result makes the main finding harder, not easier, to dismiss. The checker is not overcounting its way to drama. It may be under-detecting its way to a polite version of the problem.
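
To make the direction of the argument concrete, here is the naive arithmetic, offered strictly as an illustration. It assumes that the 83.2% precision and 60.0% recall transfer to the whole corpus, which the paragraph above explains they need not: the precision sample was restricted to papers with at least one potentially substantive flag, and injected mistakes are not natural ones.

```python
detected_per_paper = 4.66   # corpus-wide average quoted above
precision = 263 / 316       # from the 60-paper validation sample
recall = 0.60               # from the injected-error experiment

# Naive correction for missed mistakes only.
naive_total = detected_per_paper / recall

# Also discounting flags that are not real mistakes.
adjusted_total = detected_per_paper * precision / recall

print(round(naive_total, 2))
print(round(adjusted_total, 2))
```

Either way the back-of-envelope figure sits above 4.66, which is all the argument needs. Neither number should be quoted as an estimate of the true error rate.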

The examples show why “minor mistake” is the wrong mental category

The paper’s examples are where the result stops feeling statistical and starts feeling operational.

One flagged issue concerns a dataset size claim in an ICLR 2018 word-embedding paper: the original paper allegedly described a dataset as having 30 million pairs and using 15 million for training, while the underlying dataset had around 0.3 million pairs. Human reviewers confirmed the checker’s finding and marked it substantive. That is not a formatting nuisance. It changes a reader’s sense of data scale.

Another example involves a graph neural network proof. The checker identified a flawed injectivity argument: the proof inferred node-level correspondence from equality under a permutation-invariant multiset function. Human review went further, constructing a counterexample and concluding the theorem was false as stated. This is the kind of issue that peer review is supposed to catch and sometimes does not. There is no need to moralize. Dense appendix proofs are precisely where human attention goes to die quietly.

A third example concerns a claim that the product of two positive semidefinite covariance matrices is itself positive semidefinite. Human reviewers confirmed the mathematical problem and disagreed with the checker’s minor label, treating it as substantive. This is a useful reminder that detecting an error and grading its consequence are separate tasks.
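
The underlying fact is easy to verify by hand. The sketch below uses a toy counterexample of our own (not the matrices from the flagged paper): two positive semidefinite matrices whose product is not even symmetric and whose quadratic form goes negative.

```python
def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def quadratic_form(m, x):
    """Return x^T m x for a square matrix m and vector x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


A = [[1, 0], [0, 0]]   # PSD: x^T A x = x1^2
B = [[1, 1], [1, 1]]   # PSD: x^T B x = (x1 + x2)^2
AB = matmul(A, B)
x = [1, -2]

print(AB)                      # [[1, 1], [0, 0]], not symmetric
print(quadratic_form(AB, x))   # -1, so AB is not PSD
```

The claim does hold when the two matrices commute, which is one reason the error is easy to make and easy to miss.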

A fourth example is more serious: an optimal-control paper used a change-of-measure argument involving a deterministic policy and a continuous uniform policy. The checker flagged the Radon-Nikodym derivative issue; human reviewers agreed and noted that the mistake invalidated three of four main theoretical results.

Then there is the familiar mathematical sin of moving too casually between a log of an integral and an integral of logs. In a few-shot learning paper, the checker identified an incorrect derivation of the main objective function. Human reviewers confirmed the issue and marked it substantive.
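
The general fact behind this class of error is Jensen's inequality: because the logarithm is concave, the log of an average is at least the average of the logs, with equality only when the quantity being averaged is constant. A two-point toy distribution (our own numbers, not the paper's objective) is enough to see the gap:

```python
import math

weights = [0.5, 0.5]   # a two-point probability distribution
values = [1.0, 4.0]    # positive function values f(x)

# log of the expectation: log(E[f])
log_of_mean = math.log(sum(w * v for w, v in zip(weights, values)))

# expectation of the log: E[log f]
mean_of_logs = sum(w * math.log(v) for w, v in zip(weights, values))

print(round(log_of_mean, 3))    # 0.916
print(round(mean_of_logs, 3))   # 0.693
```

Swapping one for the other turns an equality into a bound, so an objective derived that way is optimizing something different from what the text claims.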

These examples matter because they prevent the reader from hiding behind the word “mistake.” A wrong figure reference and an invalid theorem are both mistakes. They do not carry the same consequence. The paper’s contribution is not that it collapses all errors into one scary number. It shows that scalable detection can surface a spectrum: from annoying inconsistencies to claims that may alter interpretation or reproducibility.

False positives are not bugs in the story; they define the workflow

The paper is refreshingly clear about false positives. The checker can be fooled.

One false positive came from non-standard notation: a paper used the letter $Z$ as a marginal probability vector in a Wasserstein-distance context. Because $Z$ is often used differently in the literature, the model flagged a non-error. In another case, OCR dropped a square root from a convergence rate, turning a correct $O(1/\sqrt{n})$ expression into an apparent $O(1/n)$ issue. A third false positive came from algorithm indentation, where the checker misread which else clause matched which if.

These are not peripheral details. They define the safe deployment model.

The checker should not be used as a public accusation engine. It should be used as a private or semi-private audit tool that produces review candidates. Its output should be treated as “possible mistakes requiring verification.” That is not a weakness; it is exactly how diagnostic tools work. A lint warning does not rewrite the codebase by itself. A static analyzer does not become CTO. Even in 2025, the machines still require adult supervision. Tragic, but manageable.

For companies, this distinction is practical. If an R&D team uses an LLM correctness checker on papers before building from them, the workflow should include human triage. The tool reduces the search space. It does not remove the need for technical judgment.

Correction is where the economics begin to shift

Detection is useful. Correction is where the operating model changes.

The paper evaluates automated fixes on a subset of 240 genuine mistakes confirmed by human reviewers. The checker proposed fixes for 207 of them, or 86.3%. For 33 cases, it returned “No immediate fix,” often because the issue involved a contradiction or required substantial rewriting. Human evaluators judged 157 of the 207 proposed fixes to be correct, giving a fix success rate of 75.8% among proposed fixes.
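
One number the paragraph above leaves implicit is the end-to-end rate: how often a confirmed mistake ends up with a correct proposed fix. A short sketch from the quoted counts (variable names are ours):

```python
confirmed_mistakes = 240
fixes_proposed = 207
fixes_correct = 157

no_immediate_fix = confirmed_mistakes - fixes_proposed
coverage = fixes_proposed / confirmed_mistakes           # share with a proposed fix
success_given_attempt = fixes_correct / fixes_proposed   # share of proposals judged correct
end_to_end = fixes_correct / confirmed_mistakes          # share of all mistakes fixed correctly

print(no_immediate_fix)
print(f"{coverage:.2%}")
print(f"{success_given_attempt:.2%}")
print(f"{end_to_end:.2%}")
```

Roughly two in three confirmed mistakes end up with a correct proposed fix. That is lower than the 75.8% figure because it also counts the 33 cases where the checker declined to propose anything.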

This is not “the LLM fixes science.” Please retire that sentence before it reproduces. The result is narrower and more useful: many objective mistakes are localized enough that an LLM can propose a correct patch.

The paper’s example is a proof of a matrix norm property. The original proof mishandled homogeneity by missing an absolute value and treated a triangle inequality as equality. The checker proposed replacing $\mu$ with $|\mu|$ in the homogeneity step and replacing equality with an inequality in the triangle step. Human reviewers judged the fix correct.
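
For reference, these are the two standard norm axioms at issue, written in general form rather than as the paper's exact proof steps:

```latex
\|\mu A\| = |\mu|\,\|A\| \quad \text{for every scalar } \mu,
\qquad
\|A + B\| \le \|A\| + \|B\|.
```

Dropping the absolute value fails immediately for $\mu = -1$, where it would give $\|-A\| = -\|A\|$, a negative norm for any nonzero $A$. Treating the triangle inequality as an equality fails for $B = -A$, where the left side is $0$ and the right side is $2\|A\|$.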

The important part is the operational segmentation:

Checker outcome	Likely meaning	Human action
Flag + simple fix	Localized objective issue	Verify and patch
Flag + “No immediate fix”	Possible deeper contradiction or under-specified argument	Escalate to technical reviewer
Flag + uncertain reasoning	Possible OCR, notation, or model mistake	Inspect source document
No flag	No detected issue	Do not assume correctness

The “No immediate fix” category is especially interesting. In business terms, it is a severity signal. A paper with many local fixes may need cleanup. A paper with an issue the checker cannot safely fix may need deeper technical due diligence.
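
The segmentation table maps naturally onto a small routing function. The sketch below is hypothetical: the `Action` names and the `route` signature are ours, and checking uncertain reasoning before anything else is an assumption about priority, not something the paper specifies.

```python
from enum import Enum


class Action(Enum):
    VERIFY_AND_PATCH = "verify and patch"
    ESCALATE = "escalate to technical reviewer"
    INSPECT_SOURCE = "inspect source document"
    NO_ASSURANCE = "do not assume correctness"


def route(flagged: bool, has_fix: bool = False, reasoning_uncertain: bool = False) -> Action:
    """Map a checker outcome to the human action from the table above."""
    if not flagged:
        return Action.NO_ASSURANCE      # absence of a flag proves nothing
    if reasoning_uncertain:
        return Action.INSPECT_SOURCE    # possible OCR or notation issue
    if has_fix:
        return Action.VERIFY_AND_PATCH  # localized objective issue
    return Action.ESCALATE              # "No immediate fix"


print(route(flagged=True, has_fix=True).value)
print(route(flagged=True).value)
print(route(flagged=False).value)
```

Note that every branch ends in a human action, including the one where nothing was flagged.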

What this means for business use of AI research

The direct paper result is about published AI literature. The business inference is about technical reliance.

Companies increasingly use AI papers as inputs to product decisions: selecting model architectures, choosing benchmark assumptions, designing RAG pipelines, evaluating agent frameworks, planning fine-tuning strategies, or deciding whether a vendor’s “research-backed” claim is credible. The problem is that research papers are often treated as clean knowledge objects once published. This paper argues, with evidence, that they are better treated as probabilistic technical artifacts.

A business-facing correctness workflow could look like this:

1. Ingest candidate papers relevant to a product, investment, or due-diligence question.
2. Run an automated correctness audit focused on objective mistakes.
3. Classify findings by type and severity: formula, table, text, cross-reference; minor or potentially substantive.
4. Route high-severity findings to human experts for validation.
5. Record verified issues in an internal knowledge base so downstream teams do not repeatedly rely on the same fragile claim.
6. Use the corrected view of the literature to update product assumptions, benchmark comparisons, or technical roadmaps.
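
The middle of the workflow above (classify, route, record) can be sketched as a triage queue feeding a shared record. Everything here is illustrative: `Finding`, `triage`, `record_verified`, and the sample findings are hypothetical, and a real deployment would sit on whatever tracker or database the team already uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    paper_id: str
    category: str       # formula, table, text, or cross-reference
    substantive: bool   # checker's severity label, pending human review
    note: str


def triage(findings):
    """Split findings into an expert queue and a low-priority backlog."""
    expert_queue = [f for f in findings if f.substantive]
    backlog = [f for f in findings if not f.substantive]
    return expert_queue, backlog


def record_verified(knowledge_base, finding):
    """Store a human-verified finding so downstream teams can see it."""
    knowledge_base[finding.paper_id].append(finding)


# Made-up findings for illustration only.
findings = [
    Finding("paper-A", "formula", True, "proof step skips a condition"),
    Finding("paper-A", "cross-reference", False, "points to wrong table"),
    Finding("paper-B", "table", True, "table disagrees with text"),
]
expert_queue, backlog = triage(findings)

knowledge_base = defaultdict(list)
# Assume a human expert confirmed only the first queued finding.
record_verified(knowledge_base, expert_queue[0])

print(len(expert_queue), len(backlog))
print(dict(knowledge_base).keys())
```

Only verified findings reach the knowledge base, which keeps unconfirmed flags from hardening into institutional folklore.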

The ROI is not mystical. It comes from reducing expert time spent hunting for problems and increasing expert time spent judging consequential ones.

Business setting	Where the checker helps	What remains human
Technical due diligence	Finds fragile proofs, inconsistent claims, and benchmark-table issues before investment or partnership decisions	Assess whether the issue changes valuation, feasibility, or vendor credibility
Internal R&D	Audits papers before engineers implement methods	Decide whether a flawed assumption can be repaired or avoided
Model evaluation	Checks benchmark papers and reported comparisons for objective inconsistencies	Judge whether the benchmark still reflects business reality
Research operations	Screens internal drafts before publication or client delivery	Decide novelty, strategic framing, and communication quality
Knowledge management	Tags papers with verified caveats and corrections	Maintain policy for evidence quality and reuse

The paper itself notes that the checker costs less than $0.50 per paper. That figure should not be mistaken for total workflow cost; human verification still costs time. But it changes the search economics. If automated scanning is cheap, organizations can audit many more papers than they could manually inspect from scratch.

This is the central business implication: LLMs make research-quality assurance more scalable before they make it fully automatic.

What the paper directly shows, and what Cognaptus infers

A useful article should not blur evidence with interpretation, so here is the clean separation.

Layer	Claim	Status
Direct paper finding	GPT-5-based checking identified widespread objective mistakes in 2,500 sampled AI papers	Directly supported
Direct paper finding	Human validation confirmed 263 of 316 flagged issues in a validation sample	Directly supported
Direct paper finding	Recall on injected mistakes was 60.0% overall	Directly supported
Direct paper finding	Detected mistakes and potentially substantive mistakes increased over time in the sampled venues	Directly supported
Cognaptus inference	Companies should treat published papers as auditable technical inputs, not automatically reliable foundations	Business interpretation
Cognaptus inference	LLM correctness checking can become part of R&D due diligence and internal research QA	Operational extrapolation
Still uncertain	Whether the same performance holds across all fields, formats, languages, and non-OpenReview publication ecosystems	Open boundary
Still uncertain	Whether automated correction can safely update scientific records without formal human governance	Open boundary

This separation matters because the paper is easy to overextend. It does not prove that LLMs understand every paper better than reviewers. It does not prove that AI conferences are collapsing. It does not prove that every substantive flag changes a paper’s conclusions. It does not even prove the true total number of mistakes, because recall is incomplete.

What it does show is enough: objective correctness checking can now be run at scale, with meaningful precision, useful recall, and practical correction ability. That changes the cost structure of scientific quality control.

The boundary: objective correctness is not scientific judgment

The paper is careful to exclude subjective review dimensions. That is a strength, not a limitation to be apologized for every three paragraphs.

Novelty, significance, clarity, experimental design quality, and conceptual elegance are not the target. The checker is not deciding whether a paper matters. It is checking whether certain claims are technically wrong.

That boundary has three consequences.

First, the checker complements peer review rather than replacing it. Human reviewers should spend more time on conceptual evaluation, methodological importance, and severity judgment. Machines can help with the tedious inspection of equations, tables, and references.

Second, flagged issues need verification. OCR errors and notation conventions can fool the system. The paper’s own false-positive examples make this explicit.

Third, severity remains partly subjective. A wrong derivative in an appendix may or may not change a paper’s main result. A flawed theorem may collapse the theoretical claim. A wrong cross-reference may simply annoy a reproducer into early retirement. The checker can surface the issue; humans must interpret consequence.

For business use, this means the safe product category is not “AI peer reviewer.” It is “research correctness triage.” Less glamorous, more useful. As usual.

Scientific quality control becomes a pipeline problem

The most important long-term implication is structural.

Science has traditionally relied on a few quality-control moments: peer review before publication, occasional post-publication critique, retractions in severe cases, and replication efforts when someone has the time, funding, and emotional resilience. That architecture does not scale well in a field producing papers at industrial speed.

LLM-based correctness checking suggests a different architecture: continuous audit.

Pre-submission audits can help authors catch errors before review. Conference-side audits can help reviewers focus on objective correctness flags. Post-publication audits can help maintain a living record of verified corrections. Internal corporate audits can help R&D teams decide which claims are safe to build on.

The paper does not build all of that infrastructure. But it provides evidence that one important component is now technically plausible: a low-cost checker that can identify many objective mistakes and propose many correct fixes.

This is where the “Error 404” title needs its own correction. Peer review is not missing. It is overloaded. The future is not peer review disappearing; it is peer review becoming part of a larger quality-control system where LLMs perform routine inspection and humans handle judgment.

The glamour version says AI will replace reviewers.

The useful version says AI will make reviewers harder to waste.

Conclusion: the literature now needs its own debugging stack

The most honest reading of *To Err Is Human* is not cynical. It is infrastructural.

Published AI papers contain objective mistakes. Many are minor. Some are substantive. The number of detected mistakes appears to be rising. A GPT-5-based checker can identify many of them with 83.2% precision in a human-validated sample, catch 60.0% of injected mistakes, and propose a correct fix in 75.8% of the cases where it attempts one. The system is imperfect, but it is already useful as triage.

For companies, the message is straightforward: do not consume AI research as scripture. Treat it as inspectable input. Before a paper becomes a product decision, investment thesis, technical roadmap, or benchmark claim, run it through an audit layer. Then let humans decide what the findings mean.

Science is cumulative. Mistakes are cumulative too. The difference is that now we can start building tools that catch them before they become load-bearing beams in someone else’s system.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Federico Bianchi, Yongchan Kwon, Zachary Izzo, Linjun Zhang, and James Zou, “To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis,” arXiv:2512.05925, 2025, https://arxiv.org/abs/2512.05925.
