When Coders Prove Theorems: Agents, Lean, and the Quiet Death of the Specialist Prover

A coder does not trust a program because it sounds plausible. A coder runs it, reads the error message, changes the implementation, tests again, searches the library, asks a colleague, splits the problem, and keeps going until the machine stops complaining.

That mundane loop is the interesting part of Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics.¹ The headline result is easy to market: with Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 Putnam 2025 problems in Lean, matching the reported perfect score of AxiomProver. Nice. The trophy cabinet sparkles.

But the more important result is not “an AI solved Putnam.” That sentence is too small and too noisy. The paper is really about a shift in architecture: from specialist theorem-proving models toward general coding agents embedded in a verifiable mathematical workbench.

The agent is not merely writing proof text. It is using Lean as a live execution environment, theorem libraries as searchable infrastructure, informal proof tools as planning assistants, discussion models as outside consultants, and subagents as context-management devices. In other words, it behaves less like a single trained prover and more like a proof engineer who happens to be made of API calls and mild overconfidence.

For business readers, that distinction matters. The transferable lesson is not that every company needs automated Putnam solvers. Unless your accounts payable department has recently been blocked by a Brascamp–Lieb inequality, probably not. The lesson is that difficult reasoning becomes more reliable when a general agent is surrounded by verifiable feedback, retrieval, decomposition, and human-editable plans.

That pattern travels much further than mathematics.

The breakthrough is not the score; it is the loop

The paper proposes a simple but strategically important idea: use a general coding agent as a formal mathematical reasoner, instead of building a heavily specialized theorem prover pipeline.

That sounds almost too casual. Formal theorem proving is usually treated as a specialist field because the task is unforgiving. A proof assistant such as Lean does not reward verbal elegance. It accepts only machine-checkable arguments inside a precise logical system. A proof can be philosophically convincing and still fail because a type conversion is wrong, a lemma is missing, or the statement is slightly malformed.

Earlier neural theorem-proving work often focused on formal provers trained for proof search, tactic prediction, whole-proof generation, or reinforcement learning over formal environments. The paper does not deny that lineage. It uses it as the foil. The authors argue that recent agentic systems show strong performance, but many remain task-specific, hard to reproduce, or coupled to trained formal provers.

Numina-Lean-Agent takes another route. It combines Claude Code with Numina-Lean-MCP, a Model Context Protocol layer that gives the coding agent access to Lean-related tools. The agent can inspect proof states, run Lean snippets, retrieve theorems, call an informal prover, and ask external models for discussion-style support.

That means the “model” is not the product. The working system is the loop:

Problem statement
      ↓
General coding agent
      ↓
Lean tools, theorem retrieval, informal proof, discussion models
      ↓
Machine feedback from Lean
      ↓
Revised statement, lemma, tactic, or blueprint
      ↓
Verified Lean artifact

This is why a mechanism-first reading is better than a benchmark-first reading. The 12/12 result is impressive, but the paper’s real claim is architectural: if the environment can verify outputs and return structured errors, a general agent can become useful in a domain that used to require more specialized machinery.

That is not the death of expertise. It is the weakening of a particular moat: the idea that every high-performing formal reasoning system must be a bespoke trained prover.

Lean is not a judge at the end; it is the agent’s working memory

A shallow view of formal verification imagines Lean as a final examiner: the model writes a proof, Lean checks it, and the story ends. Numina-Lean-Agent uses Lean differently. Lean becomes the environment in which reasoning happens.

The paper describes three main Lean-facing capabilities through Lean-LSP-MCP.

First, the agent can inspect the semantic structure of a Lean project. It can query goals, read diagnostic messages, and understand what remains to be proved. This matters because an LLM guessing the proof state is basically a tourist reading a subway map upside down. Lean tells it where it actually is.

Second, the agent can execute code and explore strategies. It can run isolated snippets and try multiple proof attempts at a proof node. That turns proof construction into a trial-feedback-optimization loop. The important word is not “trial.” It is “feedback.” Random trying is cheap. Reliable correction is expensive.

Third, the agent can search theorem libraries. The paper mentions local search, Loogle-style search, and LeanDex, a semantic theorem search tool for Lean v4.26.0 across libraries such as mathlib and FLT. This reduces hallucination risk because the agent can ground its proof steps in existing objects instead of inventing theorem names with the confidence of a junior analyst after two espressos.

The mechanism is familiar to software teams. Developers rarely solve everything from memory. They inspect stack traces, read documentation, search libraries, run tests, and refactor. Numina-Lean-Agent imports that workflow into formal mathematics.

The difference is that Lean’s feedback is unusually strict. A compiler error in ordinary software might still allow a flawed design to pass through a test suite. Lean’s kernel checks the proof object itself. For high-stakes reasoning, this is the attractive feature: the agent is not merely self-critiquing; it is negotiating with a verifier that does not care about tone.

The informal prover is an ablation lesson, not decorative reasoning

The paper’s evaluation includes a useful component test: how much does informal reasoning help?

Table 2 reports three variants on Putnam 2025. Without the informal prover, Numina-Lean-Agent solves 4 out of 12 problems. With the informal prover, it solves 11 out of 12. With the subagent mechanism, it reaches 12 out of 12.

This is not just a ranking table. It is evidence about where the reasoning work sits.

System variant	Likely purpose of the test	Result reported	What it supports	What it does not prove
Without informal prover	Ablation / component contribution	4/12	Formal tool interaction alone is not enough for many hard problems	That informal reasoning is always necessary in every formal domain
With informal prover	Main component evidence	11/12	Informal proof planning sharply expands coverage	That all informal proofs are reliable without verification
With subagent	Variant / context-management extension	12/12	Modular delegation can rescue at least one hard failure mode	That subagents universally scale to all long-horizon proofs

The informal prover itself is not simply another model call. The paper implements a generator-verifier loop using Gemini-3-Pro-Preview. The generator proposes informal solutions. The verifier checks them. When the verifier finds problems, feedback is returned for another iteration, up to a maximum of 20 iterations. To improve reliability, each candidate solution is evaluated independently three times, and only accepted if all three verification passes agree.

The paper also compares two strategies on Putnam 2025 B4: iterative refinement versus independent sampling. Under the same number of total calls, independent sampling fails to complete the formal proof within 10 rounds, while iterative refinement completes it in 5.

That test is small, but its purpose is clear. It is not a grand theorem about cognition. It is an efficiency comparison showing that feedback-driven correction can outperform parallel-looking randomness when the agent needs a usable informal plan.

For business use, this is the part worth copying. A lot of enterprise AI workflows still behave like independent sampling: ask the model several times, hope one answer is good, then let a human compare the mess. Numina-Lean-Agent suggests a better pattern for hard tasks:

Generate a structured candidate.
Verify it against an external standard.
Feed the failure back into the next attempt.
Continue until the artifact, not the prose, passes.

The prose can be charming. The artifact pays the bill.

The Putnam result is strong, but the cost profile is the real diagnostic

The main benchmark result is clear. Numina-Lean-Agent solves all 12 Putnam 2025 problems using formal statements from Seed-Prover 1.5. The evaluation disables internet search access for API calls and runs operations sequentially, without parallelization.

Against other systems, the reported score is:

System	Putnam 2025 solved
Aristotle	10/12
Seed-Prover 1.5	11/12
AxiomProver	12/12
Numina-Lean-Agent	12/12

The score matters because it shows that a general coding-agent architecture can compete with specialized or closed-source agentic proving systems on this benchmark. But the cost and time table prevents the lazy interpretation.

Most problems are assigned an approximate budget of $50. Two hard cases are much larger: A5 receives around $1,000, and B6 around $300. The paper explicitly says these are relative computational-effort indicators rather than exact accounting.

The time table tells the same story. Numina-Lean-Agent is not uniformly faster. It spends 2,040 minutes on A5 and 797 minutes on B6. Those are not “AI solved math instantly” numbers. Those are “the machine brought snacks and stayed overnight” numbers.

This is not a weakness of the paper. It is useful evidence. A perfect score can hide operational heterogeneity. Some problems are easy for the architecture; some require expensive search, decomposition, and rescue mechanisms. The more relevant business question is not “did it work?” but “which tasks become predictable, which tasks become expensive, and what workflow catches the expensive ones before they become budget archaeology?”

The proof-length comparison adds another layer. Numina-Lean-Agent produces shorter Lean proofs than AxiomProver and Seed-Prover 1.5 on many problems, especially A3, B1, and B5. But Aristotle, a step-based prover, often produces shorter proofs. The authors also note that Numina-Lean-Agent proofs are generally longer than Aristotle’s, while comparing favorably against other agentic provers under a similar setting.

So the correct reading is balanced:

\ast the system is competitive at solving; \ast it can be concise compared with some agentic systems; \ast it is not always fast; \ast it is not always elegant; \ast and its strongest claim is flexibility, not universal efficiency.

That is exactly the kind of result business leaders should prefer over a magical demo. Magical demos do not come with a cost table. Cost tables are where adulthood begins.

A5 shows why subagents are context firewalls, not organizational theater

Putnam 2025 A5 receives a separate section in the paper because it exposes a familiar failure mode: long context damages focus.

The authors report that the model repeatedly got stuck on a key lemma involving alternating permutations. Their diagnosis is not that the base model lacked all mathematical ability. The problem was context management. As the proof trajectory grew longer, the model’s instruction-following degraded, and it struggled to focus on the critical subgoal.

The solution was a subagent mechanism. The system decomposed the proof into subgoals and isolated the key lemma. A subagent first called GPT-5.2 to generate an informal hint, then formalized the lemma under that guidance. This could be iterated until the lemma was successfully formalized.

This is an important design pattern. In many enterprise systems, “multi-agent” is used as a decorative label for several chatbots passing messages around. Here, the subagent has a specific operational function: protect a difficult subproblem from the noise of the parent context.

A better term would be context firewall.

The parent agent carries the global proof. The subagent receives a narrower objective. The narrow objective reduces distraction, makes the proof state easier to reason about, and allows a separate informal hint to guide formalization.

In business workflows, the equivalent appears in compliance review, software migration, due diligence, and technical research. The global task is too large for one context window to manage safely. But the work can be split into verifiable subclaims:

Global objective
  ├── claim / module / lemma A
  ├── claim / module / lemma B
  ├── claim / module / lemma C
  └── integration proof / report / release decision

The point is not to imitate the language of “agents.” The point is to assign each subagent a bounded artifact and a verifier. Without the verifier, decomposition becomes bureaucracy. With the verifier, decomposition becomes control.

The Brascamp–Lieb case study moves from contest problems to proof engineering

Benchmark problems test whether the system can solve known formal statements. The Brascamp–Lieb case study tests something different: whether the agent can participate in a longer formalization project with humans.

The paper describes a collaborative effort involving a mathematician, a Lean formalization expert, and Numina-Lean-Agent. Over less than two weeks of intermittent collaboration, the team completed more than 8,000 lines of Lean code. During the process, the agent autonomously introduced about 70 new definitions, lemmas, and theorems. The authors note that the full Lean code was still being simplified by human experts.

This is not a benchmark in the same sense as Putnam 2025. It is closer to an implementation case study. Its evidentiary role is different.

Evidence	Likely purpose	What it shows	Boundary
Putnam 2025 12/12	Main benchmark evidence	Competitive formal-problem solving	Narrow contest benchmark; high cost on hard cases
B4 informal-prover comparison	Ablation / efficiency test	Iterative feedback beats independent sampling in that example	One problem, not universal proof
A5 subagent mechanism	Variant / failure-mode repair	Modular context isolation can unblock a hard lemma	Expensive and problem-specific
Brascamp–Lieb formalization	Exploratory extension / case study	Agent can assist sustained formalization with humans	Not autonomous end-to-end production proof engineering
Appendix statement repair	Implementation detail / qualitative evidence	Agent can revise malformed statements during formalization	Does not imply general mathematical judgment

The Brascamp–Lieb workflow uses a blueprint: a design-document-style artifact containing definitions, notation, intermediate lemmas, the final theorem, and dependency links. The blueprint is not a one-shot plan written before the “real” work begins. It is recursively revised as the agent hits Lean errors and proof-state mismatches.

This is the most business-relevant part of the paper.

A blueprint is basically a technical project plan that can be compiled against reality. The agent tries to prove a lemma. Lean rejects it. The rejection reveals that the statement is underspecified, the assumptions are too weak, the granularity is wrong, or the formal interface differs from the informal mathematical idea. The blueprint is updated. The work continues.

That is much better than the usual AI-workflow fantasy in which the model produces a perfect plan, executes it, and politely invoices nobody.

Real expert work is not linear. Specifications are wrong. Terms need definitions. Edge cases appear. Interfaces do not match. The valuable agent is not the one that pretends otherwise. The valuable agent is the one that can revise the plan when the verifier proves the plan was naïve.

Statement repair is where general agents start to matter

Specialized provers are often designed to prove a given formal statement. Numina-Lean-Agent, according to the paper, can sometimes reason about the validity of the statement itself during formalization and revise it when needed.

The appendix gives a concrete example. In a degenerate case where the base space has dimension zero, the agent adds an extra condition: when $n$ is empty, $\beta = 0$ is required for the inequality to hold. In the Lean snippet, this appears as an added hypothesis hβ_empty : β = 0.

This is not just a proof tactic. It is statement repair.

That matters because many real-world enterprise tasks fail before execution begins. The question is malformed. The policy is ambiguous. The data contract omits an edge case. The compliance rule has an exception nobody represented. The software specification says “all users” while quietly meaning “all active users with valid region codes and non-null billing status.”

A specialist executor can fail or search. A general agent embedded in a verifier can sometimes ask: is the task statement itself wrong?

That capability should not be romanticized. The paper presents examples, not a universal guarantee. But the design direction is important. In high-stakes workflows, the most valuable AI system is not always the one that answers the question fastest. Sometimes it is the one that discovers the question was not yet answerable.

What business can borrow without pretending to do mathematics

Most companies will not formalize Brascamp–Lieb inequalities. They will, however, face tasks with similar structure:

\ast complex rules; \ast ambiguous specifications; \ast long dependency chains; \ast reusable libraries; \ast partial automation; \ast expert supervision; \ast and a need for verifiable outputs.

That describes software verification, regulatory compliance, contract logic, financial model validation, technical documentation, internal audit, security policy checking, and research operations.

The paper directly shows a formal-math system. Cognaptus would infer a broader design pattern, not a direct product claim.

Layer in Numina-Lean-Agent	Business analogue	Practical value	Risk if copied badly
Lean kernel and diagnostics	External verifier, compiler, rules engine, test suite, policy checker	Grounds the agent in executable reality	A weak verifier turns the loop into self-confirming prose
Theorem retrieval / LeanDex	Internal knowledge base with semantic retrieval and source grounding	Reduces invented references and repeated work	Retrieval without validation still hallucinates authority
Informal prover	Draft planner or reasoning assistant	Generates candidate reasoning paths before formal execution	Plausible plans may be wrong if not checked
Discussion partner	Specialist model or human-like reviewer role	Provides alternate reasoning routes at bottlenecks	Multiple opinions are not evidence
Subagents	Context-isolated work packages	Protects hard subproblems from global-context clutter	Multi-agent theater without artifact boundaries
Blueprint	Editable project plan with dependencies	Makes long-horizon work inspectable and repairable	Static plans become stale quickly
Human expert collaboration	Domain expert and implementation expert review	Handles formulation, taste, and edge cases	Humans become rubber stamps if review is not structured

The business implication is not “replace experts.” It is “make expert work more compilable.”

A compliance analyst should not merely ask an LLM whether a policy satisfies a regulation. A better system would extract obligations, map them to internal controls, test each control against source evidence, flag mismatches, and maintain an editable dependency graph. A software team should not merely ask an agent to “migrate this codebase.” A better system would decompose modules, run tests, inspect failures, update migration assumptions, and isolate hard submodules under narrower agents.

The structure is the product.

The specialist prover does not die; the specialist-only architecture gets demoted

The title says “quiet death of the specialist prover,” but the careful interpretation is narrower. Specialist proving is not dead. Lean expertise is not dead. Formal mathematics is not suddenly easy. If anything, the paper shows how much infrastructure is required to make a general agent useful.

What is dying is the assumption that the central intelligence must be a specialist prover trained for a narrow proving pipeline.

Numina-Lean-Agent suggests a different division of labor:

\ast the base model supplies broad coding and reasoning ability; \ast MCP supplies tool access; \ast Lean supplies verification; \ast retrieval supplies library knowledge; \ast informal proving supplies high-level strategy; \ast subagents supply context control; \ast human experts supply formulation and taste.

This is more modular than a monolithic specialist system. It also means capability can improve when the underlying coding model improves, without retraining the whole prover. That is one of the paper’s explicit motivations.

For enterprise AI strategy, this matters because specialist models are expensive to build, hard to maintain, and often trapped inside narrow workflows. General agents with domain-specific tools may not beat specialists in every domain. But they can become competitive faster, adapt better, and reuse the same orchestration logic across adjacent tasks.

That is the quiet death: not a dramatic funeral for specialists, but a slow budgetary migration away from one-off model training and toward tool-rich agent environments.

Very elegant. Very unromantic. Usually how software wins.

The boundaries are operational, not philosophical

The paper is unusually useful because its limitations are concrete.

First, generated Lean code can be too long or poorly structured. The authors distinguish between filling local sorry gaps inside an already well-structured proof and proving an entire lemma from scratch. The agent performs better on the former. On larger lemmas, it may succeed but produce verbose, less elegant code.

Second, type-level issues remain a major bottleneck. The paper gives an example where the system failed not because of the central mathematical argument, but because of a type conversion from Real to NNReal. Informal mathematics often hides these constraints. Formal proof assistants do not.

Third, there is a gap between functional correctness and formal elegance. A proof may pass Lean’s compiler and still be unsatisfactory to experienced Mathlib contributors because it relies on verbose, low-level tactics rather than idiomatic abstraction.

These limitations transfer directly to business AI.

An agent may produce a working compliance mapping that is unreadable. A code agent may pass tests while creating ugly architecture. A research assistant may assemble correct citations but fail to produce a maintainable knowledge structure. A contract-analysis agent may catch the explicit clause and miss the type-level equivalent: the hidden representation mismatch between legal wording and operational data.

The practical boundary is therefore not “AI cannot reason.” That is too crude. The better boundary is:

Passing verification ≠ producing expert-quality structure.

Verification catches correctness failures. It does not automatically produce elegance, maintainability, or institutional fit.

Human experts remain necessary not only to approve outputs, but to shape the representation space: the definitions, abstractions, statement forms, dependency cuts, and acceptable proof style. In business language, they define what “good work” means beyond “no red error messages.”

The enterprise lesson is verifiable autonomy, not autonomous vibes

The most common mistake after reading this paper would be to say: “Great, let’s attach tools to a general model and let it reason.”

No. That is not the lesson.

The lesson is that agentic reasoning becomes useful when the system has four properties:

\ast\astA verifier that can reject outputs with authority.\ast\ast Lean plays this role in the paper. In business, it might be a compiler, unit-test suite, policy engine, accounting reconciliation, database constraint, or audit checklist.
\ast\astA retrieval layer that returns usable objects, not decorative context.\ast\ast LeanDex and theorem search help the agent find existing formal material. In business, the equivalent is structured retrieval from policies, contracts, code, tickets, controls, and data dictionaries.
\ast\astA decomposition layer that turns large goals into bounded artifacts.\ast\ast The blueprint and subagent mechanism prevent long-horizon reasoning from dissolving into context fog.
\ast\astA feedback loop that updates the plan, not just the answer.\ast\ast When Lean rejects a proof, the system can revise lemmas, assumptions, or the blueprint. In business, this means failed checks should update the workflow specification itself.

That last property is where many AI deployments fail. They bolt an LLM onto a process but leave the process static. The model generates. A human fixes. Nothing learns except the human’s resentment.

Numina-Lean-Agent shows a more serious pattern: the agent’s failed attempt becomes information about the task structure. That information is fed back into the plan.

For Cognaptus-style automation, this is the difference between chat automation and process intelligence. Chat automation answers. Process intelligence repairs the path by which answers are produced.

Conclusion: the coder-agent era begins where the compiler says no

Numina-Lean-Agent is not important because it makes formal mathematics effortless. It clearly does not. A5 is expensive. Some proofs are verbose. Type-level details still trip the agent. Human experts still simplify, guide, and judge.

The paper is important because it shows that a general coding agent, placed inside a verifiable tool environment, can compete with specialized theorem-proving systems and participate in a serious formalization workflow. That is a larger architectural signal.

The future of high-stakes AI may not be a single brilliant model producing final answers. It may look more like a stubborn coding session: run the artifact, read the failure, retrieve the right library object, ask for a second opinion, split the hard part, revise the blueprint, and try again.

Not glamorous. Not mystical. Much more useful.

Specialist provers are not disappearing tomorrow. But the center of gravity is moving. The valuable system is no longer just the prover. It is the agentic workbench around the prover: the verifier, the tools, the retrieval, the blueprint, the subagents, and the humans who still know when the proof is correct but ugly.

The machine proves the theorem only after the environment teaches it how to be wrong productively.

That, for once, is progress worth taking seriously.

\ast\astCognaptus: Automate the Present, Incubate the Future.\ast\ast

Junqi Liu, Zihao Zhou, Zekai Zhu, Marco Dos Santos, Weikun He, Jiawei Liu, Ran Wang, Yunzhou Xie, Junqiao Zhao, Qiufeng Wang, Lihong Zhi, Jia Li, and Wenda Li, “Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics,” arXiv:2601.14027, 2026. https://arxiv.org/abs/2601.14027 ↩︎

The breakthrough is not the score; it is the loop#

Lean is not a judge at the end; it is the agent’s working memory#

The informal prover is an ablation lesson, not decorative reasoning#

The Putnam result is strong, but the cost profile is the real diagnostic#

A5 shows why subagents are context firewalls, not organizational theater#

The Brascamp–Lieb case study moves from contest problems to proof engineering#

Statement repair is where general agents start to matter#

What business can borrow without pretending to do mathematics#

The specialist prover does not die; the specialist-only architecture gets demoted#

The boundaries are operational, not philosophical#

The enterprise lesson is verifiable autonomy, not autonomous vibes#

Conclusion: the coder-agent era begins where the compiler says no#