Opening — Why This Matters Now

Every few months, a headline declares that AI can now “solve Olympiad math” or “prove theorems at gold-medal level.” Investors cheer. Researchers argue. Skeptics mutter something about data contamination.

But here’s the uncomfortable question: are we measuring real mathematical reasoning—or just performance on carefully curated, increasingly familiar datasets?

If AI is going to assist (or compete with) research mathematicians, we need benchmarks that reflect actual research production—not textbook exercises, not competition problems, not one-off curated challenges.

LemmaBench proposes exactly that: a live, continuously updatable benchmark built directly from newly uploaded research papers on arXiv. And the results are… sobering.


Background — The Benchmark Problem in Mathematical AI

Most math benchmarks fall into three categories:

| Category | Examples | Limitation |
|---|---|---|
| School-level datasets | GSM8K, MATH | Saturated; near-perfect scores |
| Olympiad-level problems | OmniMath, FrontierMath | Limited scope; contamination risk |
| Formal theorem benchmarks | miniF2F, Lean datasets | Structured, but domain-constrained |

These benchmarks are useful—but they are not research mathematics.

Research mathematics has three defining traits:

  1. Sparse distribution — Few similar examples exist.
  2. Context dependence — Lemmas depend on definitions scattered throughout a paper.
  3. Non-curated production — Problems arise organically from human inquiry.

LemmaBench addresses all three.


Analysis — How LemmaBench Works

Instead of designing problems, the authors extract them directly from live research.

Step 1: Extract Lemmas from arXiv

Using LaTeX parsing and regex detection, the system identifies lemma environments in newly uploaded papers.

Why lemmas instead of theorems?

  • More granular
  • More tractable
  • Better aligned with current LLM capability ceilings
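The environment detection in Step 1 can be sketched in a few lines of regex. The pattern, function name, and sample paper below are illustrative only; the actual LemmaBench pipeline will be more involved (environment aliases, nested macros, multi-file sources).

```python
import re

# Match \begin{lemma}...\end{lemma}, with an optional [title] argument.
# Hypothetical sketch, not the authors' extraction code.
LEMMA_RE = re.compile(
    r"\\begin\{lemma\}(?:\[[^\]]*\])?(.*?)\\end\{lemma\}",
    re.DOTALL,
)

def extract_lemmas(latex_source: str) -> list[str]:
    """Return the body of every lemma environment in a LaTeX file."""
    return [m.strip() for m in LEMMA_RE.findall(latex_source)]

paper = r"""
\begin{lemma}[Key bound]
For all $n \ge 1$, $f(n) \le Cn\log n$.
\end{lemma}
"""
print(extract_lemmas(paper))
```

Even this toy version shows why lemma-level extraction is tractable: lemma environments are explicitly delimited in LaTeX, so no semantic parsing is needed to locate candidates.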

Step 2: Make Them Self-Contained

Here lies the real innovation.

Research lemmas often assume prior definitions. So the pipeline retrieves necessary definitions and assumptions using two strategies:

| Retrieval Mode | Mechanism | Token Cost | Accuracy |
|---|---|---|---|
| Full context | Feed all prior content to the LLM | High (5–10×) | High |
| Vector retrieval | Embedding search over paragraphs | Lower | Lower |

Empirically, full-context retrieval significantly outperforms vector retrieval in producing truly self-contained statements.
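The vector-retrieval mode can be illustrated with a toy ranker: score every preceding paragraph against the lemma statement and keep the top-k. A real system would use learned embeddings; bag-of-words cosine similarity stands in here, and all names and sample text are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve_context(lemma: str, paragraphs: list[str], k: int = 2) -> list[str]:
    """Return the k paragraphs most similar to the lemma statement."""
    q = Counter(lemma.lower().split())
    ranked = sorted(paragraphs,
                    key=lambda p: cosine(q, Counter(p.lower().split())),
                    reverse=True)
    return ranked[:k]

paras = [
    "We define a graph G to be sparse if it has at most 2n edges.",
    "Historical remarks on the origins of the problem.",
    "Assumption A: every sparse graph admits a proper 3-coloring.",
]
print(retrieve_context("Every sparse graph G satisfies the coloring bound.", paras))
```

The toy example also hints at why vector retrieval underperforms: the definition a lemma silently depends on may share few surface tokens with the lemma statement, so similarity search misses it, whereas full-context prompting cannot.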

Self-containedness is validated using an LLM-as-a-judge, then partially audited by human mathematicians.

Step 3: Proof Generation

Models are asked to:

  • Produce structured, step-wise proofs
  • Cite hypotheses explicitly
  • Avoid undeclared references

Then an independent LLM judge evaluates each proof step.

A proof is accepted only if all steps pass.

This is not generosity. It’s a filter.
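The strict acceptance rule above is a simple conjunction over steps. A minimal sketch, where `judge_step` is a stand-in for an LLM judge call and the toy judge below is purely hypothetical:

```python
from typing import Callable

def accept_proof(steps: list[str], judge_step: Callable[[str], bool]) -> bool:
    """Strict evaluation: a single failed step rejects the whole proof."""
    return all(judge_step(s) for s in steps)

# Toy judge that flags steps invoking an undeclared reference.
toy_judge = lambda step: "undeclared" not in step

proof = ["By hypothesis (H1), x > 0.", "Hence f(x) > 0 by Lemma 2."]
print(accept_proof(proof, toy_judge))                                   # True
print(accept_proof(proof + ["By an undeclared result, done."], toy_judge))  # False
```

The all-or-nothing rule is what makes the headline numbers conservative: a proof that is 90% correct counts for nothing.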


Findings — The Numbers That Matter

Extraction Quality

Across 376 lemmas (September 2025 iteration):

  • ~64% were judged self-contained after processing.
  • Human validation precision ranged from 75% to 96%, depending on model.

Full-context retrieval proved superior across models.


Proof Success Rates (Pass@1)

September 2025 Benchmark (~240 usable lemmas)

| Generator → Judge | Proof Acceptance | Human Confidence |
|---|---|---|
| GPT-5 → GPT-5 | 12.3% | 83% |
| Gemini 2.5 → Gemini 2.5 | 7% | 67% |
| DeepSeek-R → DeepSeek-R | 11.9% | 80% |

February 2026 Benchmark (358 usable lemmas)

| Generator | Judge | Proof Acceptance |
|---|---|---|
| GPT-5 | GPT-5 | 15% |
| GPT-5 | Gemini 3 Pro | 19% |
| GPT-5 | Claude 4.5 Opus | 35% |

Even the most generous judge accepts only about a third of proofs; under strict evaluation, research-level lemma proving accuracy remains roughly 10–15%.

That is not failure.

It is clarity.


What This Actually Means

Let’s interpret this without hype.

1. We Are Not at Human-Level Theorem Proving

Whatever the Olympiad headlines say, LLMs solve roughly 1 in 7 research lemmas under contamination-resistant conditions.

That’s promising—but nowhere near autonomous mathematician status.

2. Judge Models Matter

Different LLM judges disagree substantially.

One model’s 35% becomes another’s 15%.

This highlights a governance issue:

Benchmark accuracy is not a single number. It is a negotiation between generator and evaluator.

3. Live Benchmarks Change the Incentive Structure

Unlike static datasets:

  • Training on old data doesn’t compromise future evaluation.
  • Domain mix can be steered dynamically.
  • Contamination becomes structurally harder.

This shifts evaluation from “optimize against leaderboard” to “adapt to evolving research reality.”

That’s strategically significant.


Implications for AI Strategy and Governance

For Research Labs

Live benchmarks reduce overfitting incentives and force real reasoning improvements.

For Investors

A 15% pass@1 at research level suggests:

  • Models are useful assistants.
  • They are not autonomous researchers.
  • Tooling and hybrid workflows remain critical.

For Policy & Assurance

If AI is to assist in scientific discovery, evaluation must:

  • Be contamination-resistant
  • Be updateable
  • Include human cross-validation

LemmaBench moves evaluation closer to that standard.


Structural Trade-Off: Generality vs Verifiability

Other research-level benchmarks restrict themselves to problems with verifiable answers.

LemmaBench instead accepts open-ended lemma proving, relying on LLM judges and partial human audit.

This increases coverage but introduces judge-dependence.

It’s a pragmatic compromise.

In governance terms: higher realism, lower determinism.


Conclusion — From Solving Problems to Joining Research

LemmaBench reframes the conversation.

The question is no longer:

“Can AI solve Olympiad math?”

The question is:

“Can AI operate inside the living stream of mathematical research?”

Right now, the answer is: partially.

Fifteen percent is not revolution.

But it is measurable, contamination-resistant progress.

And perhaps more importantly—it is an evaluation method aligned with reality rather than spectacle.

That may be the more durable innovation.


Cognaptus: Automate the Present, Incubate the Future.