Opening — Why this matters now
Formal mathematics has quietly become one of the most revealing stress tests for modern AI. Not because theorems are commercially lucrative, but because they are unforgiving. Proof assistants do not care about vibes, rhetorical fluency, or confident hallucinations. Either the proof compiles, or it does not.
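For readers who have never touched a proof assistant, "compiles" is meant literally: a theorem counts only if Lean's kernel accepts the proof term. A minimal Lean 4 illustration, using a textbook fact rather than anything from the paper:

```lean
-- Accepted: the kernel checks the proof term; nothing is taken on trust.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Not a result: `sorry` is a placeholder the compiler explicitly flags,
-- so a "proof" containing it is an admitted gap, not a solved problem.
theorem unfinished_example (a b : Nat) : a * b = b * a := by
  sorry
```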
Until recently, success in this domain required highly specialized models, intricate pipelines, and months of reinforcement learning. Numina-Lean-Agent proposes something more unsettling: maybe all of that specialization was unnecessary.
Background — The era of the specialist prover
The dominant paradigm in neural theorem proving has been vertical integration. Train a model specifically for Lean. Surround it with a hand-crafted search strategy. Add Monte Carlo tree search, subgoal decomposition, or reinforcement learning on proof traces. The result is impressive, but brittle.
These systems work because everything is tuned for one task: proving theorems. The downside is obvious. They are hard to extend, hard to reproduce, and largely useless outside formal math.
Analysis — What Numina-Lean-Agent actually changes
Numina-Lean-Agent flips the premise. Instead of building a math prover that sometimes codes, it builds a coding agent that sometimes proves theorems.
At its core is a general-purpose coding agent (Claude Code), paired with a tool layer (Numina-Lean-MCP) that exposes Lean as an environment rather than a black box. The agent does not predict tactics in isolation. It reads compiler errors, queries proof goals, retrieves existing theorems, runs code, revises plans, and retries—exactly as a human formalizer would.
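A rough sketch of that loop in Python. Everything here is illustrative: the function names stand in for whatever tools the MCP layer actually exposes, and are not Numina-Lean-MCP's real interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LeanFeedback:
    """Hypothetical shape of what a compile-check tool hands back."""
    ok: bool
    errors: str = ""
    open_goals: list[str] = field(default_factory=list)


def prove(
    statement: str,
    draft: Callable[[str], str],             # model writes a first attempt
    check: Callable[[str], LeanFeedback],    # hands the source to Lean
    retrieve: Callable[[list[str]], str],    # searches the library for relevant lemmas
    revise: Callable[[str, str, str], str],  # model rewrites given errors and context
    max_attempts: int = 8,
) -> Optional[str]:
    """Illustrative feedback loop: compile, read the errors, revise, retry."""
    attempt = draft(statement)
    for _ in range(max_attempts):
        feedback = check(attempt)
        if feedback.ok:
            return attempt                   # it compiles; done
        context = retrieve(feedback.open_goals)
        attempt = revise(attempt, feedback.errors, context)
    return None                              # budget exhausted, no proof
```

The point of the sketch is where the intelligence sits: not in a bespoke search policy, but in how the model reads the feedback it gets.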
Three design choices matter:
| Design Choice | Why It Matters |
|---|---|
| No training | Performance scales with the base model, not with new datasets |
| MCP tooling | Tools are modular, inspectable, and replaceable |
| General agent | The same system plans, debugs, retrieves, and reasons |
This is not just architectural minimalism. It is a bet that agentic competence beats task-specific optimization.
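To make the MCP tooling row concrete: an MCP server is a small program that registers named functions an agent can call. The sketch below uses the FastMCP helper from the Python MCP SDK; the single `check_proof` tool and the `lake env lean` invocation are simplified stand-ins, not Numina-Lean-MCP's actual implementation, which exposes richer operations such as goal inspection and library search.

```python
import subprocess
import tempfile
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # Python MCP SDK

# Toy server with one tool: compile a Lean snippet, return the diagnostics.
mcp = FastMCP("lean-check")


@mcp.tool()
def check_proof(source: str) -> str:
    """Write the source to a temp file, run Lean on it, return compiler output."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Attempt.lean"
        path.write_text(source)
        # Assumes this runs inside a Lake project so that imports resolve.
        proc = subprocess.run(
            ["lake", "env", "lean", str(path)],
            capture_output=True,
            text=True,
            timeout=120,
        )
        output = proc.stdout + proc.stderr
        return output if output.strip() else "ok: no errors"


if __name__ == "__main__":
    mcp.run()
```

Because each tool is just a function behind a protocol, swapping Lean for another checker, or Claude Code for another agent, does not require retraining anything.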
Findings — Performance without specialization
On Putnam 2025, Numina-Lean-Agent solves all 12 problems. No parallelism. No internet search. No task-specific training. That alone would be notable.
More interesting is how it solves them. Compared to other agentic systems, its proofs are often shorter, suggesting that explicit interaction with Lean’s state—rather than blind search—pays dividends.
| System | Problems Solved | Training Required |
|---|---|---|
| Seed-Prover 1.5 | 11 / 12 | Yes |
| AxiomProver | 12 / 12 | Yes |
| Numina-Lean-Agent | 12 / 12 | No |
This is not brute force. It is feedback-driven iteration.
Beyond benchmarks — From automation to collaboration
The most revealing section of the paper is not the leaderboard. It is the formalization of the Brascamp–Lieb inequalities.
Here, the agent does something older systems struggle with: it plans. It generates a blueprint—definitions, lemmas, dependencies—then revises that blueprint when Lean pushes back. When stuck, it consults other models. When a statement is wrong, it corrects it.
This is not theorem proving as search. It is theorem proving as project management.
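What "blueprint" means in practice is roughly a dependency graph of named statements, written down before they are proved. A hypothetical Lean fragment in that style (the paper's actual Brascamp–Lieb development is, of course, far larger):

```lean
-- Illustrative skeleton only, not the paper's formalization.
-- The pattern: state definitions and lemmas up front, leave `sorry` where a
-- proof is still owed, and let Lean report exactly which obligations remain.

/-- A supporting definition that later statements depend on. -/
def weightedProduct (w x : Nat → Nat) (n : Nat) : Nat :=
  (List.range n).foldl (fun acc i => acc * x i ^ w i) 1

/-- A small supporting lemma, still open. -/
theorem weightedProduct_zero (w x : Nat → Nat) :
    weightedProduct w x 0 = 1 := by
  sorry

/-- The headline statement, also still open. -/
theorem weightedProduct_pos (w x : Nat → Nat) (n : Nat)
    (hx : ∀ i, 0 < x i) : 0 < weightedProduct w x n := by
  sorry
```

Revising the blueprint when Lean pushes back means editing these statements, their hypotheses, and their dependency order, not just the proofs underneath them.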
Implications — What this means for AI systems
Numina-Lean-Agent suggests a broader lesson: the future may belong to agents that are competent generalists, not hyper-optimized specialists.
For businesses, this matters because the same architecture applies elsewhere. Compliance systems. Code migration. Contract verification. Any domain where correctness is binary and environments push back.
If a single agent can coordinate tools, revise plans, and reason under hard constraints, the question is no longer “Can AI do X?” but “Why are we still building X-specific models at all?”
Conclusion — The quiet shift
Numina-Lean-Agent does not feel revolutionary. It feels inevitable. Strip away the training loops. Expose the environment. Let the agent think, fail, and try again.
Formal mathematics just happened to notice first.
Cognaptus: Automate the Present, Incubate the Future.