Proof is where AI systems stop performing confidence and start owing the reader money.

A model can restate a theorem elegantly. It can cite the right neighborhood of literature. It can produce LaTeX with the visual manners of a publishable paper. None of that is a proof. It is proof-shaped material. Sometimes useful. Sometimes impressive. Sometimes a very expensive fog machine.

That is why First Proof is more interesting than another math benchmark announcement.1 The paper does not introduce a new model, a new training recipe, or a heroic claim that frontier AI has finally entered the faculty lounge and taken the chalk. Instead, it asks a narrower and more uncomfortable question: when the problem is real, unpublished, well-specified, and already solved by a human expert, can a frontier model produce a rigorous proof on its own?

The answer, at least in this preliminary experiment, is: sometimes it gets close, occasionally it is genuinely useful, and often it fails in ways that are more revealing than a wrong final number.

That is the useful part. The paper is not a leaderboard. It is not a dramatic obituary for AI mathematics. It is closer to a diagnostic room with better lighting.

The experiment removes the usual escape routes

The authors assemble ten research-level mathematics questions from active work by professional mathematicians. The fields range across stochastic analysis, representation theory, algebraic combinatorics, spectral graph theory, equivariant homotopy theory, lattice groups, symplectic geometry, tensor analysis, and numerical linear algebra.

These are not contest problems with neat numerical answers. They are the sort of local lemmas and technical subquestions that appear inside real research projects. They are also deliberately constrained. Each question had a human-generated solution of roughly five pages or less. Each could be understood from a relatively short statement. Each was unpublished when the first version of the experiment was released. The solutions were encrypted and released later.

This design matters because most AI math evaluation leaks realism at exactly the wrong places.

If a problem is public, retrieval and memorization can masquerade as reasoning. If the answer is a short expression, automatic grading becomes convenient but loses the texture of proof. If the task is a contest problem, it measures a useful skill, but not necessarily the skill researchers care about when they ask whether AI can help with actual mathematical work.

First Proof tries to isolate the final, most measurable stage of research mathematics: the question has already been selected, the conceptual framework already exists, and the job is to produce a correct proof or counterexample. That is still not the whole of research. It excludes problem selection, theory-building, taste, and the slow process of deciding which lemma is worth proving. In other words, it does not test whether AI can be a mathematician. It tests whether AI can finish a hard mathematical job after the adult has already set the table.

A little cruel, perhaps. Also fair.

The evidence is in the failure modes, not the score

The authors tested frontier systems, including GPT-5.2 Pro and Gemini 3.0 Deep Think, with prompts asking for rigorous, complete LaTeX-style mathematical answers. The paper reports internal observations from one-shot attempts. The authors did not repeatedly interact with the systems or coax better answers out of them, because the immediate goal was to see what happens under a clean protocol.

A simple “solved / not solved” table would undersell the paper. The valuable evidence is the taxonomy of failures.

Observed behavior Where it appears in the paper What it means for evaluation
Quoting or leaning on a sketch as if it were a full proof Question 1, stochastic analysis Retrieval can produce scholarly-looking shortcuts that fail expert standards
Solving a weaker or altered problem Questions 2 and 3 Models may preserve surface form while dropping the hardest condition
Producing plausible local steps without a viable global proof Questions 4 and 8 Local correctness does not guarantee proof architecture
Hallucinating lemmas or sources Question 5 Citation fluency is not citation reliability
Refusing or partially answering usefully Question 6 A non-solution can still contain valid mathematical information
Producing an essentially correct solution Question 9 The systems are not merely broken; the boundary is uneven
Producing a better computational insight but weak attribution Question 10 AI may surface useful methods while still failing research norms

This is why the evidence-first framing is better than a guided tour through the ten problems. The math itself is too specialized for a general AI industry article. The failure patterns are not.

They show what happens when the benchmark stops rewarding answer-shaped output and starts asking for proof-grade reliability.

Retrieval dressed as proof is still not proof

The first question, from Martin Hairer, asks about quasi-shift invariance of a measure arising in a three-dimensional stochastic field theory setting. The human answer shows that the relevant shifted and unshifted measures are mutually singular.

One model behavior is especially useful for business readers: GPT-Pro reportedly found a short proof sketch that had appeared on the author’s homepage and treated it as if it were a detailed proof. The authors explicitly reject this as below mathematical publication standards.

That failure is not a small formatting issue. It is a category error.

A sketch can be valuable. In expert hands, it can point to the right mechanism. But when a system presents a sketch as a proof, the evaluator has to do the missing work. The apparent productivity gain moves from “AI solved the problem” to “AI made the expert find the hole faster.” That may still be useful, but it is not autonomous research.

For enterprises, the analogy is straightforward. A legal AI that cites a memo fragment as if it were binding analysis has not completed legal reasoning. A drug-discovery assistant that repeats a plausible mechanistic story without validating the experimental conditions has not completed scientific reasoning. A financial AI that retrieves a working paper and acts as though its assumptions match the client’s portfolio has not completed investment analysis.

Retrieval is not reasoning. Retrieval plus polished prose is not reasoning with better shoes.

The most dangerous failures preserve the surface while changing the task

Question 2 asks for a representation-theoretic object satisfying a condition uniformly across all relevant representations. The paper notes that some model attempts constructed something depending on the representation. That solves an easier, known problem, not the problem asked. In the best attempt, ChatGPT 5.2 Pro identified a suitable high-level choice and reduced the task to a key nonvanishing claim, but then tried to force the integrand to be constant on its support, a strategy the authors explain is not viable.

This is the sort of failure that bad benchmarks often miss. The model is not babbling. It has found the neighborhood. It has used relevant machinery. It may even have reproduced parts of the human route. Then it quietly drops the hardest quantifier.

That matters because real expert work is often hidden inside such quantifiers: for all cases, uniformly over a class, without depending on the instance, compatible across local choices, bounded independently of dimension. The difference between “there exists for this case” and “there exists uniformly” is not a footnote. It is the problem.

Question 3 shows a related pattern. The problem asks for a nontrivial Markov chain whose stationary distribution is given by a formula involving interpolation ASEP and interpolation Macdonald polynomials, with the transition probabilities not defined using those polynomials. Some LLM attempts used Metropolis-Hastings, which can manufacture a Markov chain for almost any desired distribution precisely by using the distribution in the transition rule. That is mathematically valid in a broad sense and useless for the stated question.

This is AI’s favorite loophole: answer a nearby question with excellent manners.

In business evaluation, this suggests a practical rule. Do not only check whether the output is plausible. Check whether the output preserves the constraints that made the task valuable. A model that solves the task after removing the business-critical condition has not solved the task. It has negotiated with reality without authorization.

Local correctness can still fail globally

One of the most instructive cases is Question 8, on Lagrangian smoothing. The models reportedly identified the existence of local smoothings near every vertex, using essentially the same basic linear algebra as the human solution. That is not trivial. It shows real local competence.

Then the proof tried to glue the local pieces into a global smoothing. The error was not that every sentence was nonsense. The error was compatibility. The model did not adequately handle how local coordinate choices near vertices and edges interact.

This distinction is central to serious reasoning systems. Many difficult tasks are not hard because no local step can be found. They are hard because the local steps must compose.

Software engineering has the same disease. A coding assistant may write correct functions that fail as a system because state, interfaces, concurrency, or deployment assumptions do not align. A business-process automation tool may correctly automate three local workflows and still fail the organization because approvals, exceptions, and audit trails do not compose. A research assistant may correctly summarize several papers and still produce a literature review with incompatible definitions.

The local-to-global gap is where demo intelligence often goes to retire.

Hallucinated scholarship is not a citation problem; it is a control problem

Question 5, from equivariant stable homotopy theory, is a subtler case. The best solutions from Gemini and ChatGPT 5.2 Pro contained essentially correct statements of the relevant definition and characterization. Their proofs followed the outline of prior work, but the details were sketched, garbled, or missing hypotheses. Some runs hallucinated lemmas, cited nonexistent results, or even confabulated an entire paper.

It is tempting to treat this as a citation-quality problem. Add better retrieval. Add source checking. Add a bibliography verifier. Those tools help. But the deeper problem is control over dependency structure.

A proof is not a pile of true-ish claims. It is a controlled dependency graph. Each lemma must exist, apply under the stated hypotheses, and connect to the next step. The fact that a model can name the right literature neighborhood does not mean it can manage the dependencies inside that literature.

For AI product teams, the lesson is not “ban citations.” The lesson is to force traceable claims. Every external theorem used by an AI research assistant should be attached to a source, a statement, applicability conditions, and the exact step it supports. Otherwise, the model is not building an argument. It is decorating one.

The successes matter because they make the boundary sharper

A weak article would say: “AI failed many questions, therefore AI cannot do research mathematics.” Convenient. Also wrong.

The paper contains genuine positive signals.

For Question 9, on algebraic relations among tensors, the best LLM answer was essentially correct. It constructed the same kind of algebraic relations as the human answer: minors of the four matrix flattenings of a block tensor assembled from the original tensors. The proof route differed from the author’s, and other attempts failed, but the best answer crossed an important threshold.

For Question 10, on an efficient preconditioned conjugate gradient method for an unaligned infinite-dimensional tensor subproblem, the best LLM solution was not merely correct. The author says it was better than the provided solution in lowering computational complexity and contained an insight that was obvious in hindsight but had not been seen by the human solver. The model did not provide adequate attribution, and the core idea appears to exist in earlier literature, but the result is still important.

This is the uncomfortable middle ground. Frontier models are not reliable autonomous theorem provers. They are also not useless parrots. They may fail at proof discipline while still surfacing relevant strategies, reductions, computational tricks, or literature-adjacent methods.

For business use, that middle ground is where most value currently lives. Not “replace the expert.” Not “ignore the tool.” Instead: use the system as a proposal generator, route finder, counterexample scout, implementation assistant, or reviewer that produces candidates for expert validation.

The problem is not that AI has no value. The problem is pretending that candidate generation and verified completion are the same product category.

What the human solutions reveal about the benchmark

Appendix B matters because it confirms that these questions were not vague riddles thrown at models for entertainment. The human solutions are technical but bounded. They show the intended standard: construct an event to separate mutually singular measures; use Fourier-analytic identities to control Rankin–Selberg integrals; build an interpolation $q$-Push TASEP; prove a finite free Stam inequality; characterize slice connectivity via geometric fixed points; construct light vertex sets through spectral barriers; extend Fowler’s theorem to involutions; globalize local Lagrangian smoothings; characterize tensor relations through flattening minors; and reduce a tensor subproblem through structured matrix-vector products and PCG.

That list is not there to impress the reader with difficulty. Well, maybe a little. Mathematicians deserve hobbies too.

Its real function is to show that the experiment is testing proof production against expert artifacts, not against vibes. The answers are structured, checkable by specialists, and often short enough to be within the output limits of current systems. The models are not being asked to invent a new field. They are being asked to finish a well-specified expert task.

That makes the failures more meaningful. If the model cannot reliably handle this stage, we should be cautious about claims that it can autonomously perform the broader, less specified stages of research.

The paper is preliminary by design

The authors are careful about what First Proof is not.

It is not a statistically large benchmark. Ten questions cannot support fine-grained performance ranking. It does not define a formal grading rubric. Proofs can differ, counterexamples can vary, and expert judgment is still required. The one-shot testing protocol is clean but not necessarily optimal; interactive use might produce better results. The questions also impose artificial length constraints because current public systems have practical limits.

Most importantly, the paper tests only the final proof stage of research. It does not evaluate whether AI can choose the right problem, invent definitions, build a theory, or decide which lemma matters. Those are not side quests. They are often the actual research.

So the right conclusion is not “AI failed at mathematics.” The right conclusion is narrower: when tested on private, naturally arising, proof-based research tasks under a one-shot protocol, current frontier systems show uneven capability and frequent proof-quality failure.

That narrower conclusion is also more useful.

What Cognaptus infers for business evaluation

The paper directly shows a pattern in research mathematics. Cognaptus infers a broader evaluation lesson for high-stakes AI work: benchmark design should make shortcuts expensive.

Many enterprise AI evaluations still reward surface fluency. They ask models to summarize, classify, draft, or solve tasks where a human reviewer can quickly recognize an acceptable answer. That is fine for low-risk automation. It is not enough for research, legal analysis, financial reasoning, engineering design, medical synthesis, or strategic decision support.

For those domains, evaluation should include tasks with four properties:

Evaluation property Why it matters
Private or freshly generated tasks Reduces memorization and benchmark contamination
Naturally arising expert problems Prevents overfitting to artificial puzzle formats
Human-verifiable reasoning artifacts Forces the model to expose its dependencies and proof path
Constraint-preserving grading Detects cases where the model solves an easier nearby task

The cost is obvious. Human grading does not scale cleanly. Private expert tasks are hard to produce. The evaluation process becomes slower and less convenient. Tragic, really: realism has once again refused to be cheap.

But this is the price of measuring capabilities that matter. Easy grading produces scalable numbers. Hard grading produces knowledge.

A practical framework for AI research assistants

For companies building AI research assistants, First Proof suggests a simple maturity ladder.

First, test retrieval discipline. Can the system distinguish a full proof from a sketch, a theorem from an informal note, and a relevant source from a neighboring citation?

Second, test constraint preservation. Give the model tasks where the central difficulty lies in a quantifier, a uniformity condition, a nontriviality requirement, or a compatibility constraint. Then check whether the solution quietly removes it.

Third, test compositional reasoning. Ask for outputs where local steps must glue into a global structure: multi-module code, multi-jurisdiction legal analysis, multi-assumption financial models, or multi-paper scientific synthesis.

Fourth, test attribution and dependency control. Require every major claim to carry a source, an applicability condition, and a role in the argument.

Fifth, separate candidate generation from verified completion. A system that produces promising directions should be evaluated as a research collaborator. A system that produces final answers should be evaluated as an accountable reasoner. Mixing the two is how dashboards become fiction.

The uncomfortable benchmark is the useful benchmark

The most valuable contribution of First Proof is not that it embarrasses frontier models. Embarrassment is not a methodology. The contribution is that it makes a more honest test visible.

Public benchmarks are easy to discuss but vulnerable to contamination. Short-answer benchmarks are easy to grade but often miss proof quality. Contest problems are useful but not representative of research workflow. Private expert tasks are expensive, irritating, and hard to scale.

Naturally, they are also closer to the thing we actually want to measure.

The paper’s title borrows from baking: first proof is the fermentation stage before shaping the dough. That metaphor is better than the authors may have intended. AI evaluation also needs fermentation. It needs time inside messy expert practice before we shape it into leaderboards, claims, and product promises.

For now, First Proof tells us that frontier systems can sometimes find the right road, sometimes bring back useful tools, and sometimes walk confidently into a wall while citing the wall incorrectly.

That is not a reason to dismiss AI research assistants. It is a reason to evaluate them with less theater and more adult supervision.

Cognaptus: Automate the Present, Incubate the Future.


  1. Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams, “First Proof,” arXiv:2602.05192v2, 2026. https://arxiv.org/abs/2602.05192 ↩︎