Opening — Why This Matters Now

Every few months, a headline declares that AI can now “solve Olympiad math” or “prove theorems at gold-medal level.” Investors cheer. Researchers argue. Skeptics mutter something about data contamination.

But here’s the uncomfortable question: are we measuring real mathematical reasoning—or just performance on carefully curated, increasingly familiar datasets?

If AI is going to assist (or compete with) research mathematicians, we need benchmarks that reflect actual research production—not textbook exercises, not competition problems, not one-off curated challenges.

LemmaBench proposes exactly that: a live, continuously updatable benchmark built directly from newly uploaded research papers on arXiv. And the results are… sobering.


Background — The Benchmark Problem in Mathematical AI

Most math benchmarks fall into three categories:

| Category | Examples | Limitation |
|---|---|---|
| School-level datasets | GSM8K, MATH | Saturated; near-perfect scores |
| Olympiad-level problems | OmniMath, FrontierMath | Limited scope; contamination risk |
| Formal theorem benchmarks | miniF2F, Lean datasets | Structured, but domain-constrained |

These benchmarks are useful—but they are not research mathematics.

Research mathematics has three defining traits:

  1. Sparse distribution — Few similar examples exist.
  2. Context dependence — Lemmas depend on definitions scattered throughout a paper.
  3. Non-curated production — Problems arise organically from human inquiry.

LemmaBench addresses all three.


Analysis — How LemmaBench Works

Instead of designing problems, the authors extract them directly from live research.

Step 1: Extract Lemmas from arXiv

Using LaTeX parsing and regex detection, the system identifies lemma environments in newly uploaded papers.

Why lemmas instead of theorems?

  • More granular
  • More tractable
  • Better aligned with current LLM capability ceilings
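The environment detection in Step 1 can be sketched in a few lines of regex. The pattern, function name, and sample paper below are illustrative only; the actual LemmaBench pipeline will be more involved (environment aliases, nested macros, multi-file sources).

```python
import re

# Match \begin{lemma}...\end{lemma}, with an optional [title] argument.
# Hypothetical sketch, not the authors' extraction code.
LEMMA_RE = re.compile(
    r"\\begin\{lemma\}(?:\[[^\]]*\])?(.*?)\\end\{lemma\}",
    re.DOTALL,
)

def extract_lemmas(latex_source: str) -> list[str]:
    """Return the body of every lemma environment in a LaTeX file."""
    return [m.strip() for m in LEMMA_RE.findall(latex_source)]

paper = r"""
\begin{lemma}[Key bound]
For all $n \ge 1$, $f(n) \le Cn\log n$.
\end{lemma}
"""
print(extract_lemmas(paper))
```

Even this toy version shows why lemma-level extraction is tractable: lemma environments are explicitly delimited in LaTeX, so no semantic parsing is needed to locate candidates.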

Step 2: Make Them Self-Contained

Here lies the real innovation.

Research lemmas often assume prior definitions. So the pipeline retrieves necessary definitions and assumptions using two strategies:

| Retrieval Mode | Mechanism | Token Cost | Accuracy |
|---|---|---|---|
| Full context | Feed all prior content to the LLM | High (5–10×) | High |
| Vector retrieval | Embedding search over paragraphs | Lower | Lower |

Empirically, full-context retrieval significantly outperforms vector retrieval in producing truly self-contained statements.
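The vector-retrieval mode can be illustrated with a toy ranker: score every preceding paragraph against the lemma statement and keep the top-k. A real system would use learned embeddings; bag-of-words cosine similarity stands in here, and all names and sample text are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve_context(lemma: str, paragraphs: list[str], k: int = 2) -> list[str]:
    """Return the k paragraphs most similar to the lemma statement."""
    q = Counter(lemma.lower().split())
    ranked = sorted(paragraphs,
                    key=lambda p: cosine(q, Counter(p.lower().split())),
                    reverse=True)
    return ranked[:k]

paras = [
    "We define a graph G to be sparse if it has at most 2n edges.",
    "Historical remarks on the origins of the problem.",
    "Assumption A: every sparse graph admits a proper 3-coloring.",
]
print(retrieve_context("Every sparse graph G satisfies the coloring bound.", paras))
```

The toy example also hints at why vector retrieval underperforms: the definition a lemma silently depends on may share few surface tokens with the lemma statement, so similarity search misses it, whereas full-context prompting cannot.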

Self-containedness is validated using an LLM-as-a-judge, then partially audited by human mathematicians.

Step 3: Proof Generation

Models are asked to:

  • Produce structured, step-wise proofs
  • Cite hypotheses explicitly
  • Avoid undeclared references

Then an independent LLM judge evaluates each proof step.

A proof is accepted only if all steps pass.

This is not generosity. It’s a filter.
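The strict acceptance rule above is a simple conjunction over steps. A minimal sketch, where `judge_step` is a stand-in for an LLM judge call and the toy judge below is purely hypothetical:

```python
from typing import Callable

def accept_proof(steps: list[str], judge_step: Callable[[str], bool]) -> bool:
    """Strict evaluation: a single failed step rejects the whole proof."""
    return all(judge_step(s) for s in steps)

# Toy judge that flags steps invoking an undeclared reference.
toy_judge = lambda step: "undeclared" not in step

proof = ["By hypothesis (H1), x > 0.", "Hence f(x) > 0 by Lemma 2."]
print(accept_proof(proof, toy_judge))                                   # True
print(accept_proof(proof + ["By an undeclared result, done."], toy_judge))  # False
```

The all-or-nothing rule is what makes the headline numbers conservative: a proof that is 90% correct counts for nothing.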


Findings — The Numbers That Matter

Extraction Quality

Across 376 lemmas (September 2025 iteration):

  • ~64% were judged self-contained after processing.
  • Human validation precision ranged from 75% to 96%, depending on model.

Full-context retrieval proved superior across models.


Proof Success Rates (Pass@1)

September 2025 Benchmark (~240 usable lemmas)

| Generator → Judge | Proof Acceptance | Human Confidence |
|---|---|---|
| GPT-5 → GPT-5 | 12.3% | 83% |
| Gemini 2.5 → Gemini 2.5 | 7% | 67% |
| DeepSeek-R → DeepSeek-R | 11.9% | 80% |

February 2026 Benchmark (358 usable lemmas)

| Generator | Judge | Proof Acceptance |
|---|---|---|
| GPT-5 | GPT-5 | 15% |
| GPT-5 | Gemini 3 Pro | 19% |
| GPT-5 | Claude 4.5 Opus | 35% |

Even the most generous judge accepts only about a third of proofs; under strict evaluation, research-level lemma proving accuracy remains roughly 10–15%.

That is not failure.

It is clarity.


What This Actually Means

Let’s interpret this without hype.

1. We Are Not at Human-Level Theorem Proving

Whatever the Olympiad headlines say, LLMs solve roughly 1 in 7 research lemmas under contamination-resistant conditions.

That’s promising—but nowhere near autonomous mathematician status.

2. Judge Models Matter

Different LLM judges disagree substantially.

One model’s 35% becomes another’s 15%.

This highlights a governance issue:

Benchmark accuracy is not a single number. It is a negotiation between generator and evaluator.

3. Live Benchmarks Change the Incentive Structure

Unlike static datasets:

  • Training on old data doesn’t compromise future evaluation.
  • Domain mix can be steered dynamically.
  • Contamination becomes structurally harder.

This shifts evaluation from “optimize against leaderboard” to “adapt to evolving research reality.”

That’s strategically significant.


Implications for AI Strategy and Governance

For Research Labs

Live benchmarks reduce overfitting incentives and force real reasoning improvements.

For Investors

A 15% pass@1 at research level suggests:

  • Models are useful assistants.
  • They are not autonomous researchers.
  • Tooling and hybrid workflows remain critical.

For Policy & Assurance

If AI is to assist in scientific discovery, evaluation must:

  • Be contamination-resistant
  • Be updateable
  • Include human cross-validation

LemmaBench moves evaluation closer to that standard.


Structural Trade-Off: Generality vs Verifiability

Other research-level benchmarks restrict themselves to problems with verifiable answers.

LemmaBench instead accepts open-ended lemma proving, relying on LLM judges and partial human audit.

This increases coverage but introduces judge-dependence.

It’s a pragmatic compromise.

In governance terms: higher realism, lower determinism.


Conclusion — From Solving Problems to Joining Research

LemmaBench reframes the conversation.

The question is no longer:

“Can AI solve Olympiad math?”

The question is:

“Can AI operate inside the living stream of mathematical research?”

Right now, the answer is: partially.

Fifteen percent is not revolution.

But it is measurable, contamination-resistant progress.

And perhaps more importantly—it is an evaluation method aligned with reality rather than spectacle.

That may be the more durable innovation.


Cognaptus: Automate the Present, Incubate the Future.