Opening — Why This Matters Now
Every few months, a headline declares that AI can now “solve Olympiad math” or “prove theorems at gold-medal level.” Investors cheer. Researchers argue. Skeptics mutter something about data contamination.
But here’s the uncomfortable question: are we measuring real mathematical reasoning—or just performance on carefully curated, increasingly familiar datasets?
If AI is going to assist (or compete with) research mathematicians, we need benchmarks that reflect actual research production—not textbook exercises, not competition problems, not one-off curated challenges.
LemmaBench proposes exactly that: a live, continuously updatable benchmark built directly from newly uploaded research papers on arXiv. And the results are… sobering.
Background — The Benchmark Problem in Mathematical AI
Most math benchmarks fall into three categories:
| Category | Examples | Limitation |
|---|---|---|
| School-level datasets | GSM8K, MATH | Saturated; near-perfect scores |
| Olympiad-level problems | OmniMath, FrontierMath | Limited scope; contamination risk |
| Formal theorem benchmarks | miniF2F, Lean datasets | Structured, but domain-constrained |
These benchmarks are useful—but they are not research mathematics.
Research mathematics has three defining traits:
- Sparse distribution — Few similar examples exist.
- Context dependence — Lemmas depend on definitions scattered throughout a paper.
- Non-curated production — Problems arise organically from human inquiry.
LemmaBench addresses all three.
Analysis — How LemmaBench Works
Instead of designing problems, the authors extract them directly from live research.
Step 1: Extract Lemmas from arXiv
Using LaTeX parsing and regex detection, the system identifies lemma environments in newly uploaded papers.
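The exact parser isn't published here, but the core move is easy to sketch. Below is a minimal Python version, assuming papers use standard `lemma` environments; the regex and the `extract_lemmas` helper are illustrative, not LemmaBench's actual code, and a production pipeline would also have to resolve custom `\newtheorem` names.

```python
import re

# Minimal sketch: pull lemma environments out of raw LaTeX source.
# Assumes plain \begin{lemma}...\end{lemma}; real arXiv sources often
# define custom theorem names via \newtheorem, which a production
# pipeline would need to resolve first.
LEMMA_ENV = re.compile(r"\\begin\{(lemma\*?)\}(.*?)\\end\{\1\}", re.DOTALL)

def extract_lemmas(latex_source: str) -> list[str]:
    """Return the body of every lemma environment in a .tex file."""
    return [body.strip() for _, body in LEMMA_ENV.findall(latex_source)]

sample = r"""
\begin{lemma}
Let $G$ be a finite group. Then the order of $Z(G)$ divides $|G|$.
\end{lemma}
"""
print(extract_lemmas(sample))
# ['Let $G$ be a finite group. Then the order of $Z(G)$ divides $|G|$.']
```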
Why lemmas instead of theorems?
- More granular
- More tractable
- Better aligned with current LLM capability ceilings
Step 2: Make Them Self-Contained
Here lies the real innovation.
Research lemmas often assume prior definitions. So the pipeline retrieves necessary definitions and assumptions using two strategies:
| Retrieval Mode | Mechanism | Token Cost | Accuracy |
|---|---|---|---|
| Full Context | Feed all prior content to LLM | High (5–10×) | High |
| Vector Retrieval | Embedding search over paragraphs | Lower | Lower |
Empirically, full-context retrieval significantly outperforms vector retrieval in producing truly self-contained statements.
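To make that trade-off concrete, here is a minimal sketch of both modes. The `call_llm` and `embed` helpers are hypothetical stand-ins for whatever model and embedding APIs the pipeline actually uses; none of this is the authors' code.

```python
import numpy as np

def full_context_retrieval(lemma: str, prior_text: str, call_llm) -> str:
    """Full-context mode: feed everything before the lemma to the LLM
    and ask for a self-contained rewrite. High token cost (5-10x)."""
    prompt = (
        "Rewrite the lemma below so it is self-contained, inlining every "
        "definition and assumption it relies on.\n\n"
        f"PAPER SO FAR:\n{prior_text}\n\nLEMMA:\n{lemma}"
    )
    return call_llm(prompt)

def vector_retrieval(lemma: str, paragraphs: list[str], embed, k: int = 5) -> list[str]:
    """Vector mode: keep only the k paragraphs most cosine-similar to
    the lemma. Cheaper, but misses lexically distant definitions."""
    q = embed(lemma)
    mat = np.stack([embed(p) for p in paragraphs])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    return [paragraphs[i] for i in np.argsort(sims)[-k:][::-1]]
```

The vector mode's failure case is visible in the code: similarity is computed against the lemma statement itself, so a definition phrased in unrelated vocabulary may simply never rank highly.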
Self-containedness is validated using an LLM-as-a-judge, then partially audited by human mathematicians.
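A rough sketch of what that judging step might look like; the prompt wording is invented, and the real rubric is presumably more detailed, which is exactly why the human audit matters.

```python
def is_self_contained(statement: str, call_llm) -> bool:
    """LLM-as-a-judge check (illustrative prompt, not the paper's)."""
    prompt = (
        "You are auditing a mathematical statement. Answer YES only if "
        "every symbol, object, and assumption it uses is defined within "
        "the statement itself; otherwise answer NO.\n\n" + statement
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```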
Step 3: Proof Generation
Models are asked to:
- Produce structured, step-wise proofs
- Cite hypotheses explicitly
- Avoid undeclared references
Then an independent LLM judge evaluates each proof step.
A proof is accepted only if all steps pass.
This is not generosity. It’s a filter.
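To see why, here is a minimal sketch of the acceptance rule. The step schema and judge prompt are illustrative assumptions; the essential point is the conjunction: one failed step rejects the entire proof.

```python
from dataclasses import dataclass

@dataclass
class ProofStep:
    claim: str          # what this step asserts
    justification: str  # the hypotheses and earlier steps it cites

def step_passes(lemma: str, step: ProofStep, judge_llm) -> bool:
    """Ask an independent judge model to validate a single step
    (illustrative prompt, not LemmaBench's exact wording)."""
    prompt = (
        f"Lemma: {lemma}\n"
        f"Step claim: {step.claim}\n"
        f"Justification: {step.justification}\n"
        "Is this step logically valid and fully justified by what it "
        "cites? Answer VALID or INVALID."
    )
    return judge_llm(prompt).strip().upper().startswith("VALID")

def accept_proof(lemma: str, steps: list[ProofStep], judge_llm) -> bool:
    # A conjunction, not an average: every step must pass.
    return all(step_passes(lemma, s, judge_llm) for s in steps)
```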
Findings — The Numbers That Matter
Extraction Quality
Across 376 lemmas (September 2025 iteration):
- ~64% were judged self-contained after processing.
- Human validation precision ranged from 75% to 96%, depending on the model.
Full-context retrieval proved superior across models.
Proof Success Rates (Pass@1: one attempt per lemma)
September 2025 Benchmark (~240 usable lemmas)
| Generator → Judge | Proof Acceptance | Human Confidence |
|---|---|---|
| GPT-5 → GPT-5 | 12.3% | 83% |
| Gemini 2.5 → Gemini 2.5 | 7% | 67% |
| DeepSeek-R → DeepSeek-R | 11.9% | 80% |
February 2026 Benchmark (358 usable lemmas)
| Generator | Judge | Proof Acceptance |
|---|---|---|
| GPT-5 | GPT-5 | 15% |
| GPT-5 | Gemini 3 Pro | 19% |
| GPT-5 | Claude 4.5 Opus | 35% |
Set aside the most generous judge, and the picture is consistent: under strict evaluation, research-level lemma-proving accuracy remains roughly 10–15%.
That is not failure.
It is clarity.
What This Actually Means
Let’s interpret this without hype.
1. We Are Not at Human-Level Theorem Proving
Despite the Olympiad headlines, LLMs solve about 1 in 7 research lemmas under contamination-resistant conditions.
That’s promising—but nowhere near autonomous mathematician status.
2. Judge Models Matter
Different LLM judges disagree substantially.
The same generator's 35% under one judge becomes 15% under another.
This highlights a governance issue:
Benchmark accuracy is not a single number. It is a negotiation between generator and evaluator.
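A toy illustration, using only the February 2026 numbers from the table above: reporting the full generator-by-judge grid, rather than a single scalar, is what makes the judge dependence visible.

```python
# GPT-5 proofs scored under three different judges (Feb 2026 table).
acceptance = {
    ("GPT-5", "GPT-5"): 0.15,
    ("GPT-5", "Gemini 3 Pro"): 0.19,
    ("GPT-5", "Claude 4.5 Opus"): 0.35,
}
for (generator, judge), rate in acceptance.items():
    print(f"{generator} judged by {judge}: {rate:.0%}")
```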
3. Live Benchmarks Change the Incentive Structure
Unlike static datasets:
- Training on old data doesn’t compromise future evaluation.
- Domain mix can be steered dynamically.
- Contamination becomes structurally harder.
This shifts evaluation from “optimize against leaderboard” to “adapt to evolving research reality.”
That’s strategically significant.
Implications for AI Strategy and Governance
For Research Labs
Live benchmarks reduce overfitting incentives and force real reasoning improvements.
For Investors
A 15% pass@1 at research level suggests:
- Models are useful assistants.
- They are not autonomous researchers.
- Tooling and hybrid workflows remain critical.
For Policy & Assurance
If AI is to assist in scientific discovery, evaluation must:
- Be contamination-resistant
- Be updateable
- Include human cross-validation
LemmaBench moves evaluation closer to that standard.
Structural Trade-Off: Generality vs Verifiability
Other research-level benchmarks restrict themselves to problems with verifiable answers.
LemmaBench instead accepts open-ended lemma proving, relying on LLM judges and partial human audit.
This increases coverage but introduces judge-dependence.
It’s a pragmatic compromise.
In governance terms: higher realism, lower determinism.
Conclusion — From Solving Problems to Joining Research
LemmaBench reframes the conversation.
The question is no longer:
“Can AI solve Olympiad math?”
The question is:
“Can AI operate inside the living stream of mathematical research?”
Right now, the answer is: partially.
Fifteen percent is not a revolution.
But it is measurable, contamination-resistant progress.
And perhaps more importantly—it is an evaluation method aligned with reality rather than spectacle.
That may be the more durable innovation.
Cognaptus: Automate the Present, Incubate the Future.