Opening — Why this matters now
Formal mathematics has quietly become one of the most revealing stress tests for modern AI. Not because theorems are commercially lucrative, but because they are unforgiving. Proof assistants do not care about vibes, rhetorical fluency, or confident hallucinations. Either the proof compiles, or it does not.
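For readers who have never touched a proof assistant, "compiles" is meant literally: a theorem counts only if Lean's kernel accepts the proof term. A minimal Lean 4 illustration, using a textbook fact rather than anything from the paper:

```lean
-- Accepted: the kernel checks the proof term; nothing is taken on trust.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Not a result: `sorry` is a placeholder the compiler explicitly flags,
-- so a "proof" containing it is an admitted gap, not a solved problem.
theorem unfinished_example (a b : Nat) : a * b = b * a := by
  sorry
```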
Until recently, success in this domain required highly specialized models, intricate pipelines, and months of reinforcement learning. Numina-Lean-Agent proposes something more unsettling: maybe all of that specialization was unnecessary.
Background — The era of the specialist prover
The dominant paradigm in neural theorem proving has been vertical integration. Train a model specifically for Lean. Surround it with a hand-crafted search strategy. Add Monte Carlo tree search, subgoal decomposition, or reinforcement learning on proof traces. The result is impressive, but brittle.
These systems work because everything is tuned for one task: proving theorems. The downside is obvious. They are hard to extend, hard to reproduce, and largely useless outside formal math.
Analysis — What Numina-Lean-Agent actually changes
Numina-Lean-Agent flips the premise. Instead of building a math prover that sometimes codes, it builds a coding agent that sometimes proves theorems.
At its core is a general-purpose coding agent (Claude Code), paired with a tool layer (Numina-Lean-MCP) that exposes Lean as an environment rather than a black box. The agent does not predict tactics in isolation. It reads compiler errors, queries proof goals, retrieves existing theorems, runs code, revises plans, and retries—exactly as a human formalizer would.
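A rough sketch of that loop in Python. Everything here is illustrative: the function names stand in for whatever tools the MCP layer actually exposes, and are not Numina-Lean-MCP's real interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LeanFeedback:
    """Hypothetical shape of what a compile-check tool hands back."""
    ok: bool
    errors: str = ""
    open_goals: list[str] = field(default_factory=list)


def prove(
    statement: str,
    draft: Callable[[str], str],             # model writes a first attempt
    check: Callable[[str], LeanFeedback],    # hands the source to Lean
    retrieve: Callable[[list[str]], str],    # searches the library for relevant lemmas
    revise: Callable[[str, str, str], str],  # model rewrites given errors and context
    max_attempts: int = 8,
) -> Optional[str]:
    """Illustrative feedback loop: compile, read the errors, revise, retry."""
    attempt = draft(statement)
    for _ in range(max_attempts):
        feedback = check(attempt)
        if feedback.ok:
            return attempt                   # it compiles; done
        context = retrieve(feedback.open_goals)
        attempt = revise(attempt, feedback.errors, context)
    return None                              # budget exhausted, no proof
```

The point of the sketch is where the intelligence sits: not in a bespoke search policy, but in how the model reads the feedback it gets.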
Three design choices matter:
| Design Choice | Why It Matters |
|---|---|
| No training | Performance scales with the base model, not with new datasets |
| MCP tooling | Tools are modular, inspectable, and replaceable |
| General agent | The same system plans, debugs, retrieves, and reasons |
This is not just architectural minimalism. It is a bet that agentic competence beats task-specific optimization.
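To make the MCP tooling row concrete: an MCP server is a small program that registers named functions an agent can call. The sketch below uses the FastMCP helper from the Python MCP SDK; the single `check_proof` tool and the `lake env lean` invocation are simplified stand-ins, not Numina-Lean-MCP's actual implementation, which exposes richer operations such as goal inspection and library search.

```python
import subprocess
import tempfile
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # Python MCP SDK

# Toy server with one tool: compile a Lean snippet, return the diagnostics.
mcp = FastMCP("lean-check")


@mcp.tool()
def check_proof(source: str) -> str:
    """Write the source to a temp file, run Lean on it, return compiler output."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Attempt.lean"
        path.write_text(source)
        # Assumes this runs inside a Lake project so that imports resolve.
        proc = subprocess.run(
            ["lake", "env", "lean", str(path)],
            capture_output=True,
            text=True,
            timeout=120,
        )
        output = proc.stdout + proc.stderr
        return output if output.strip() else "ok: no errors"


if __name__ == "__main__":
    mcp.run()
```

Because each tool is just a function behind a protocol, swapping Lean for another checker, or Claude Code for another agent, does not require retraining anything.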
Findings — Performance without specialization
On Putnam 2025, Numina-Lean-Agent solves all 12 problems. No parallelism. No internet search. No task-specific training. That alone would be notable.
More interesting is how it solves them. Compared to other agentic systems, its proofs are often shorter, suggesting that explicit interaction with Lean’s state—rather than blind search—pays dividends.
| System | Problems Solved | Training Required |
|---|---|---|
| Seed-Prover 1.5 | 11 / 12 | Yes |
| AxiomProver | 12 / 12 | Yes |
| Numina-Lean-Agent | 12 / 12 | No |
This is not brute force. It is feedback-driven iteration.
Beyond benchmarks — From automation to collaboration
The most revealing section of the paper is not the leaderboard. It is the formalization of the Brascamp–Lieb inequalities.
Here, the agent does something older systems struggle with: it plans. It generates a blueprint—definitions, lemmas, dependencies—then revises that blueprint when Lean pushes back. When stuck, it consults other models. When a statement is wrong, it corrects it.
This is not theorem proving as search. It is theorem proving as project management.
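What "blueprint" means in practice is roughly a dependency graph of named statements, written down before they are proved. A hypothetical Lean fragment in that style (the paper's actual Brascamp–Lieb development is, of course, far larger):

```lean
-- Illustrative skeleton only, not the paper's formalization.
-- The pattern: state definitions and lemmas up front, leave `sorry` where a
-- proof is still owed, and let Lean report exactly which obligations remain.

/-- A supporting definition that later statements depend on. -/
def weightedProduct (w x : Nat → Nat) (n : Nat) : Nat :=
  (List.range n).foldl (fun acc i => acc * x i ^ w i) 1

/-- A small supporting lemma, still open. -/
theorem weightedProduct_zero (w x : Nat → Nat) :
    weightedProduct w x 0 = 1 := by
  sorry

/-- The headline statement, also still open. -/
theorem weightedProduct_pos (w x : Nat → Nat) (n : Nat)
    (hx : ∀ i, 0 < x i) : 0 < weightedProduct w x n := by
  sorry
```

Revising the blueprint when Lean pushes back means editing these statements, their hypotheses, and their dependency order, not just the proofs underneath them.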
Implications — What this means for AI systems
Numina-Lean-Agent suggests a broader lesson: the future may belong to agents that are competent generalists, not hyper-optimized specialists.
For businesses, this matters because the same architecture applies elsewhere. Compliance systems. Code migration. Contract verification. Any domain where correctness is binary and environments push back.
If a single agent can coordinate tools, revise plans, and reason under hard constraints, the question is no longer “Can AI do X?” but “Why are we still building X-specific models at all?”
Conclusion — The quiet shift
Numina-Lean-Agent does not feel revolutionary. It feels inevitable. Strip away the training loops. Expose the environment. Let the agent think, fail, and try again.
Formal mathematics just happened to notice first.
Cognaptus: Automate the Present, Incubate the Future.