Opening — Why this matters now

AI models are now fluent in contest math, symbolic manipulation, and polished explanations. That’s the easy part. The harder question—the one that actually matters for science—is whether these systems can do research when the answer is not already in the training set. The paper First Proof arrives as a deliberately uncomfortable experiment: ten genuine research-level mathematics questions, all solved by humans, none previously published, and with the solutions temporarily withheld from the internet.

In other words, no training wheels. No StackExchange crumbs. No arXiv leakage.

Background — What benchmarks usually get wrong

Most existing math benchmarks optimize for convenience rather than realism. Problems tend to be old, automatically gradable, or designed to have short symbolic answers. That makes them ideal for reinforcement learning—but weak proxies for how mathematics actually advances.

Real mathematical research rarely looks like a contest problem. It looks like this:

  • A lemma emerges mid-proof.
  • The question is local, technical, and annoying.
  • The answer exists, but only after days of false starts.
  • Multiple proofs (or counterexamples) are possible.

The authors of First Proof—a group of senior mathematicians across topology, probability, algebra, and numerical analysis—explicitly target the last step of that process: given a well-posed statement, can an AI system produce a correct proof on its own?

Analysis — What the paper actually does

The contribution is methodological rather than algorithmic.

The authors assemble ten research-level questions that satisfy four strict constraints:

  1. Each arose naturally in the authors’ own research.
  2. Each has a human-generated solution of about five pages or fewer.
  3. None of the solutions have appeared publicly.
  4. Each is comprehensible to current LLMs from a one-page problem statement.

The answers are encrypted and hosted externally, with a delayed release. During this window, researchers can probe frontier models under near-realistic conditions—including unrestricted internet access—while still avoiding direct data contamination.
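
To make that release mechanism concrete, the sketch below shows one way a publish-now, reveal-later scheme could be wired up in Python, assuming symmetric encryption via the cryptography package. The file names, the helper functions, and the idea of also publishing a hash are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a delayed-release scheme: the ciphertext (and a
# commitment to the plaintext) is published immediately; the key is withheld
# until the evaluation window closes.
import hashlib
from pathlib import Path

from cryptography.fernet import Fernet  # pip install cryptography


def seal_solutions(plaintext_path: str, out_dir: str) -> bytes:
    """Encrypt the solutions file and write the artifacts to publish now."""
    data = Path(plaintext_path).read_bytes()
    key = Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(data)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "solutions.enc").write_bytes(ciphertext)                          # published now
    (out / "solutions.sha256").write_text(hashlib.sha256(data).hexdigest())  # published now
    return key  # released publicly only after the hold-out window ends


def open_solutions(ciphertext_path: str, key: bytes) -> bytes:
    """After the key is released: decrypt the published ciphertext."""
    return Fernet(key).decrypt(Path(ciphertext_path).read_bytes())
```

Publishing a hash of the plaintext alongside the ciphertext is one way to let readers verify, once the key appears, that the revealed solutions match exactly what was committed during the evaluation window.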

This is not positioned as a formal benchmark. The authors are explicit about its limits:

  • Too few questions for statistical rigor.
  • Human grading required.
  • Answers are not unique.

That honesty is part of the point.

Findings — What happens when models try

The paper reports preliminary one-shot tests using GPT‑5.2 Pro and Gemini 3 DeepThink. The results are politely summarized but clearly disappointing.

  Capability tested            Observed behavior
  ---------------------------  -----------------
  Problem restatement          Generally strong
  Identifying relevant theory  Inconsistent
  Constructing a full proof    Rare
  Logical completeness         Frequently breaks
  One-shot success             Low

The authors intentionally avoid iterative prompting or retries. This is a feature, not a bug: the goal is to measure autonomous research competence, not collaborative tutoring.
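
Operationally, that one-shot protocol is simple enough to sketch. The harness below is a hypothetical illustration rather than the authors’ tooling: model_fn stands in for whichever frontier model is being probed, and the problem files and output layout are assumptions made for the example.

```python
# Sketch of a one-shot evaluation pass: one prompt per problem, no retries,
# and transcripts saved for later human grading. Names and file layout are
# hypothetical.
import json
from pathlib import Path
from typing import Callable


def run_one_shot(problems_dir: str, model_fn: Callable[[str], str], out_path: str) -> None:
    """Send each one-page problem statement to the model exactly once."""
    records = []
    for problem_file in sorted(Path(problems_dir).glob("*.txt")):
        statement = problem_file.read_text()
        attempt = model_fn(statement)      # a single call: no follow-ups, no retries
        records.append({
            "problem": problem_file.stem,
            "statement": statement,
            "attempt": attempt,            # graded later by a human expert
        })
    Path(out_path).write_text(json.dumps(records, indent=2))
```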

Implications — What this means beyond math

Although the paper is framed in mathematics, its implications reach well beyond it.

1. Search is not reasoning

Modern models are excellent at rediscovering known facts. This experiment isolates what happens when retrieval fails—and the gap is stark.

2. Agentic workflows may be essential

The paper quietly undermines the fantasy of “one prompt → one breakthrough.” Iteration, memory, hypothesis revision, and self-critique look less like optional extras and more like structural requirements.
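
To make the contrast with a single call concrete, here is a skeletal propose-critique-revise loop. It is purely illustrative and not described in the paper; propose, critique, and revise are placeholders for whatever model calls an agentic system would actually make.

```python
# Skeletal propose-critique-revise loop (illustrative only, not the paper's method).
from typing import Callable


def iterate_on_proof(
    statement: str,
    propose: Callable[[str], str],           # drafts a candidate proof
    critique: Callable[[str, str], str],     # lists gaps found in the candidate
    revise: Callable[[str, str, str], str],  # patches the candidate using the critique
    max_rounds: int = 5,
) -> str:
    """Run a bounded loop of drafting, self-critique, and revision."""
    candidate = propose(statement)
    for _ in range(max_rounds):
        gaps = critique(statement, candidate)
        if not gaps.strip():                 # the critic found nothing left to fix
            break
        candidate = revise(statement, candidate, gaps)
    return candidate                         # still requires human verification
```

Even in this toy form, the loop hinges on a critic that can reliably detect gaps in a proof, which is exactly the capability the one-shot results call into question.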

3. Benchmarks need friction

Easy grading produces misleading confidence. Human-graded, messy, partially specified tasks may be the only way to evaluate real research ability—even if they scale poorly.

4. Business relevance is indirect but real

For enterprises betting on AI for R&D, legal reasoning, or scientific discovery, this paper is a warning label. Tool use is here. Autonomous discovery is not.

Conclusion — A proof of limits

First Proof does not claim that AI cannot do research. It claims something subtler—and more useful: we currently don’t measure the right thing.

By shifting evaluation from polished answers to unpublished lemmas, the authors expose a capability gap that hype conveniently ignores. If future systems close this gap, it will not be because benchmarks were easy—but because they were honest.

Cognaptus: Automate the Present, Incubate the Future.