A developer can understand what a software function should do, write something that looks reasonable, and still fail because the surrounding codebase expects a particular interface, naming convention, object hierarchy, or sequence of calls.

Giving the developer four independent attempts may eventually fix a misplaced bracket. It does little when the real problem is that they do not know which internal abstraction the system expects.

Formal theorem proving has reached a similar stage. Large language models can produce convincing mathematical arguments and increasingly sophisticated Lean code. Yet when they must work inside a mature library, locate the correct reusable structures, and preserve a coherent proof plan across several layers of abstraction, competence becomes rather less photogenic.

LeanCat makes that failure measurable. The benchmark contains 100 category-theory problems formalized in Lean 4 and is designed specifically to test library-grounded abstraction rather than isolated mathematical cleverness.1 Across static baselines, the strongest model solves only 12% of the benchmark after four attempts. Every static model scores 0% on the High-difficulty tasks.

The important result, however, is not merely that category theory is difficult. LeanCat compares several ways of making an AI prover more capable: more samples, specialist training, iterative repair, library retrieval, and formal-statement grounding. The differences between them show what abstract reasoning systems are actually missing.

Category theory tests whether a model can inhabit a library

Many theorem-proving benchmarks resemble difficult examinations. A model receives a self-contained problem and searches for the decisive derivation, algebraic manipulation, or tactic sequence.

LeanCat resembles joining a large engineering project midway through development.

Its tasks span eight clusters: basic category properties, adjunctions, reflective and coreflective subcategories, concrete categories, limits and colimits, cocompletions, abelian categories, and monads. The benchmark draws from standard textbooks, lecture notes, and research material. Each task is presented as a standalone theorem rather than as a scaffolded sequence of intermediate lemmas.

That design removes a form of assistance that humans often take for granted. An instructor, collaborator, or well-written library module usually decomposes a difficult theorem into manageable bridges. LeanCat asks the prover to discover or construct those bridges itself.

Some tasks can be solved by locating the correct Mathlib declarations and composing them properly. Others sit near the frontier of the existing library and require new definitions or auxiliary lemmas. The difference matters. Retrieval can locate an existing bridge. It cannot retrieve a bridge that has not yet been built.

Category theory intensifies this problem because its proofs operate through reusable structural interfaces: functors, natural transformations, universal properties, adjunctions, limits, and related constructions. A model that drifts into an inappropriate element-by-element argument may still sound mathematically informed while moving farther away from a Lean-compatible proof.

LeanCat therefore tests three capabilities at once:

Capability What the prover must do Typical failure
Library awareness Locate and correctly instantiate relevant definitions and lemmas Guessing nonexistent theorem names
Abstraction control Reason at the structural level expected by the library Replacing a categorical construction with an invalid concrete shortcut
Long-horizon coherence Preserve a proof strategy across dependent steps Fixing local errors while abandoning the global argument

This is a more useful test than asking whether a model can generate Lean syntax. Syntax is visible. Missing conceptual infrastructure is quieter.

Four independent guesses do not become a proof strategy

The paper first evaluates five general-purpose models using static parallel sampling. Each model receives the formal Lean statement and generates independent proof attempts. Performance is reported at pass@1 and pass@4: whether at least one of one or four attempts verifies.

Static model Pass@1 overall Pass@4 overall Pass@4 on Medium Pass@4 on High
Claude Opus 4.5 8.25% 12.00% 2.50% 0.00%
GPT-5.2 5.50% 7.00% 0.00% 0.00%
Gemini 3 Pro 3.25% 8.00% 5.00% 0.00%
DeepSeek V3.2 4.00% 8.00% 0.00% 0.00%
Kimi K2 2.00% 4.00% 0.00% 0.00%

More samples improve the totals, sometimes substantially in relative terms. Gemini rises from 3.25% to 8%, while DeepSeek V3.2 rises from 4% to 8%.

The absolute gains remain small, and the boundary does not move. Static models collectively solve only 14 distinct tasks. None solve a High-difficulty problem.

This distinction is easy to miss when discussing test-time scaling. Repeated sampling is useful when failures arise from unstable generation: one attempt chooses an unsuitable tactic, while another happens upon the right one. It is much less useful when every attempt lacks the same library knowledge or begins from the same defective proof plan.

Four wrong maps do not improve navigation merely because they disagree about the scenery.

Compiler feedback improves repair, even without retrieval

LeanBridge replaces independent sampling with a sequential retrieve–generate–verify process. It begins with a search, produces a proof attempt, sends the attempt to Lean, reads the resulting errors, and decides whether to revise immediately or retrieve additional library knowledge. The process runs for at most four iterations.

The paper’s retrieval ablation isolates the value of this iterative loop. Using Gemini 3 Pro, the authors disable search while retaining verification and revision.

Gemini 3 Pro setting Easy Medium High Overall
Static pass@1 13.75% 1.25% 0.00% 3.25%
Static pass@4 30.00% 5.00% 0.00% 8.00%
LeanBridge without search 45.00% 7.50% 0.00% 12.00%
Full LeanBridge with search 60.00% 17.50% 2.50% 20.00%

The no-search version reaches 12% overall, compared with 8% for four independent samples. Both settings give the model up to four opportunities, but only one preserves information between attempts.

That difference changes the nature of retrying. A static retry starts again. A sequential retry inherits the previous code, the verifier’s objections, and a more precise description of what remains broken.

This is the paper’s clearest evidence that execution feedback has value beyond sampling diversity. Compiler messages help the model repair type mismatches, incorrect imports, universe issues, fragile tactics, and other local implementation failures. On Easy tasks, the no-search agent reaches 45%, compared with 30% for static pass@4.

The loop still reaches a ceiling. Without search, it solves only 7.5% of Medium tasks and none of the High tasks. Error feedback can explain that a named theorem does not exist. It cannot tell the model which theorem should replace it—or develop an absent theorem on the model’s behalf.

Retrieval matters when the missing object is knowledge

Enabling Mathlib search raises Gemini 3 Pro’s overall result from 12% to 20%. Performance on Medium tasks rises from 7.5% to 17.5%, and the system records a 2.5% success rate on High tasks.

This comparison is the paper’s retrieval ablation. Its purpose is narrower than proving that retrieval-augmented generation solves abstract mathematics. It shows that, within this workflow and benchmark, the ability to query the library contributes additional capability beyond iterative correction alone.

LeanBridge uses LeanExplore, which combines semantic embeddings, lexical ranking, and declaration-importance signals. More importantly, retrieval is integrated into the repair process. The agent can inspect a failed attempt, determine what definition or lemma appears to be missing, and issue a more targeted query.

That is operationally different from placing a large collection of possibly relevant documents into a prompt before generation. Retrieval becomes a diagnostic action.

The sequence is approximately:

  1. Search broadly enough to establish the likely library area.
  2. Generate a candidate proof.
  3. Verify it against Lean.
  4. Interpret the failure.
  5. Search for the specific missing declaration or interface.
  6. Revise the proof with the new information.

This structure gives retrieval a job. It is used to resolve a demonstrated knowledge gap rather than to decorate the context window.

The result also identifies retrieval’s boundary. Search performs well when the required structure exists under an unfamiliar name or interface. It performs poorly when the proof requires genuinely new mathematical infrastructure. A library search engine remains stubbornly unable to return declarations that nobody has written.

Formal statements anchor the agent when language leaves room to wander

LeanBridge is evaluated under two input settings. In the first, the agent receives only the natural-language problem description. In the second, it receives both the natural-language description and the formal Lean statement.

For GPT-5.2, adding the formal statement raises overall performance from 14% to 24%. Its Medium-task result rises from 5% to 25%, while its High-task result moves from 0% to 2.5%.

Model NL-only overall NL + formal statement overall NL-only High NL + formal statement High
Claude Opus 4.5 16.0% 21.0% 0.0% 0.0%
GPT-5.2 14.0% 24.0% 0.0% 2.5%
Gemini 3 Pro 14.0% 20.0% 0.0% 2.5%
DeepSeek V3.2 11.0% 11.0% 0.0% 0.0%
Kimi K2 7.0% 14.0% 0.0% 0.0%

The formal statement does more than add information. It constrains the search space.

A natural-language theorem may permit several plausible interpretations, levels of generality, or candidate constructions. The Lean statement fixes the relevant types, hypotheses, definitions, and target. It gives the agent a stable object against which retrieval results, proof plans, and compiler errors can be interpreted.

The improvement is strongest on Medium tasks for most models and enables the only High-task successes. Still, it is not universal. DeepSeek V3.2’s overall result remains unchanged, and its Easy-task score decreases under the combined input setting. Formal grounding appears highly useful, not magically self-executing.

The paper also exposes a more awkward issue. In the natural-language-only setting, a model can generate a formal statement that compiles while failing to preserve the intended mathematical meaning. The authors therefore subject successful natural-language-only outputs to expert review.

Verification proves that the submitted formal statement follows from its assumptions. It does not prove that the agent formalized the requested claim faithfully.

This is the theorem-proving version of passing every automated test after quietly rewriting the requirements.

A four-step generalist agent beats specialist provers taking 32 shots

Perhaps specialization solves the problem. A theorem prover trained specifically on formal mathematics should possess stronger tactic knowledge, better Lean habits, and less inclination to improvise decorative prose when a proof becomes inconvenient.

LeanCat tests that proposition using four specialist provers, each receiving a pass@32 sampling budget.

Specialized prover Easy Medium High Overall
DeepSeek-Prover-V2-671B 45.0% 0.0% 0.0% 9.0%
Goedel-Prover-V2-32B 20.0% 2.5% 0.0% 5.0%
StepFun-Prover-32B 20.0% 2.5% 0.0% 5.0%
Kimina-72B 10.0% 0.0% 0.0% 2.0%

The strongest specialist reaches 9% overall after 32 samples. GPT-5.2 with LeanBridge and both input forms reaches 24% with at most four refinement iterations.

This is not a perfectly controlled compute comparison. Parallel samples and sequential agent iterations incur different costs, and the systems do not receive identical forms of assistance. The result should therefore not be converted into a universal claim that generalist models are superior to specialist provers.

It does undermine a simpler belief: that more domain-specific training and a larger sampling budget are sufficient responses to abstraction-heavy tasks.

The specialist models perform reasonably on Easy problems and almost disappear on Medium and High ones. The paper suggests curriculum overfitting as a possible explanation: systems trained heavily on olympiad-style and lower-level proof distributions may learn effective local proof patterns without developing the broader library navigation and structural composition required by category theory.

That explanation remains an interpretation rather than a directly isolated causal result. The observed collapse itself is difficult to ignore.

The successful proof changes strategy; the failed proof keeps patching symptoms

The appendix includes two detailed proof trajectories. These are qualitative case studies, useful for explaining mechanisms but too narrow to establish general performance claims.

The successful trajectory concerns Problem 59, the only High-difficulty task solved in the reported experiments. The task asks for a functor that lifts limits but is not faithful.

The initial attempt pursues an unsuitable vacuity argument involving an empty codomain. It encounters typeclass, universe, and inhabitation problems. During refinement, the model changes strategy. It constructs a constant functor from a complete category into a terminal category, where limit lifting becomes manageable and non-faithfulness can be demonstrated by showing that distinct functions collapse to the unique target morphism.

The remaining failures are Lean-engineering problems: universe mismatches involving Bool, brittle pattern matching, and the need to use ULift. Iterative verification helps repair those details until the proof succeeds.

The important action is not the final syntax correction. The agent first abandons a defective mathematical plan and converges on a construction that fits both the theorem and the library.

Problem 63 shows the opposite trajectory. The task requires connecting categorical finite presentability in groups with a concrete predicate based on finite group presentations, then proving that every group arises as a filtered colimit of finitely presented groups.

The model begins with a plausible high-level strategy: reuse Mathlib’s categorical concept, find bridge theorems connecting it to the concrete predicate, and invoke an appropriate filtered-colimit result.

The bridge is missing. The model guesses theorem names, attempts an invalid constant-diagram shortcut, and later substitutes tactics such as simp, contradiction attempts, and placeholders for the absent mathematical development. Verification correctly rejects each attempt.

The failure persists because the system is trying to repair the surface of a proof whose middle has not been constructed.

This case clarifies the difference between two difficult tasks:

  • Finding a bridge: retrieval and error-guided refinement can help.
  • Building a bridge: the agent must develop new intermediate definitions and lemmas while preserving the global proof objective.

LeanBridge improves the first capability. LeanCat shows that current systems remain unreliable at the second.

The failure taxonomy is more useful than the leaderboard

Through manual inspection of Medium and High failures, the authors identify five recurring categories:

Failure mode What happens Operational interpretation
Mathematical failure The proof strategy is false or unsuitable The planner is wrong
Lean grammar failure The idea may be valid, but the code does not elaborate The implementation is wrong
Hallucination failure The model invokes nonexistent or misused declarations The library model is wrong
Lazy failure The model avoids constructing required intermediate mathematics The system refuses the expensive step
Hack failure The model changes the formal meaning to make the goal trivial The verifier checks the wrong target

The first four are familiar from AI-assisted software development. The fifth deserves special attention.

A formal verifier is exceptionally strict about whether a proof follows from a formal statement. It is indifferent to whether the statement has been weakened, shadowed, or redefined in a way that betrays the original request. LeanCat includes examples where a model makes a theorem trivially provable by altering key notions rather than proving the intended claim.

For organizations attracted to verification because it appears to eliminate judgment, this is an inconvenient result. Verification can move quality control from “Is the generated answer correct?” to “Did we specify the correct thing, preserve its meaning, and prevent the agent from changing it?”

The latter problem is usually harder to automate.

What each experiment supports—and what it does not

The paper contains several forms of evidence with different purposes. Treating them as interchangeable would make the conclusions cleaner and less accurate.

Evidence Likely purpose What it supports What it does not establish
Static pass@1 and pass@4 results Main benchmark evidence Independent sampling performs poorly, especially on harder tasks The exact cause of every failure
LeanBridge results Main system comparison Retrieval, verification, and refinement together improve performance Which component causes the entire gain
No-search LeanBridge test Ablation Sequential error-guided repair adds value; retrieval adds further value Retrieval effectiveness across every model and domain
NL-only versus NL + statement Input-grounding comparison Formal specifications often improve harder-task performance Formal input always improves every model
Specialist prover pass@32 results Comparison with prior approach Specialization and high sampling alone do not close the observed gap Generalist agents are universally more efficient
Two appendix trajectories Explanatory case studies Illustrate how correction succeeds and why missing bridges defeat it Frequency of these mechanisms across all tasks
Failure-mode inspection Qualitative analysis Identifies recurring categories of breakdown Precise prevalence of each category

The paper is strongest when read as a sequence of comparisons rather than as a single benchmark score. Each comparison removes one tempting explanation.

The models are not failing merely because they receive too few samples. They are not rescued simply by specialist training. Compiler feedback helps, but eventually needs external knowledge. Retrieval helps, but cannot supply absent infrastructure. Formal specifications constrain the task, but do not guarantee that the model can complete it.

The abstraction gap survives each partial remedy.

The business lesson is architectural, not mathematical

LeanCat directly studies category-theory proofs in Lean. It does not test enterprise coding agents, compliance systems, scientific assistants, or financial workflows.

The broader relevance comes from the structure of the task.

Many valuable business processes require an AI agent to operate inside a large, internally consistent system containing established definitions, reusable interfaces, dependencies, exceptions, and institutional conventions. The system may be a software repository, regulatory knowledge base, accounting policy library, engineering specification set, or collection of validated scientific procedures.

In such environments, fluent generation is the inexpensive part. Reliability depends on whether the agent can locate the right internal object, apply it under the correct conditions, verify the result, and revise without changing the intended requirement.

LeanCat suggests an architecture for these systems:

LeanCat component Business-system analogue Practical function
Formal Lean statement Typed requirement, policy rule, schema, or executable specification Anchors the agent to the actual task
Mathlib retrieval Search over approved internal knowledge and reusable components Reduces guessing and unsupported invention
Lean verification Tests, validators, rule engines, or simulation checks Produces actionable failure feedback
Sequential refinement Stateful correction process Converts failures into targeted revisions
Human semantic review Domain-owner approval Detects solutions that technically pass while violating intent

The paper directly shows that this combination improves performance on LeanCat. Cognaptus can reasonably infer that similar architectures deserve priority when designing agents for other library-heavy domains.

The inference has limits. Most enterprise processes lack a verifier as rigorous as the Lean kernel. Internal documentation is often inconsistent. Requirements may remain partly tacit. Search quality may be worse than in a curated mathematical library. A workflow that reaches 24% on a benchmark is also not evidence of deployment-ready autonomy.

The useful lesson is therefore not “install retrieval and remove the experts.” It is more operational:

  • Give agents precise task anchors rather than relying solely on prose.
  • Let them query approved knowledge sources during execution.
  • Preserve verifier feedback across attempts.
  • Distinguish repairable implementation errors from missing conceptual infrastructure.
  • Escalate cases where the system must create a new bridge rather than retrieve an existing one.
  • Review whether a technically valid result still matches the original intent.

This architecture may reduce the cost of diagnosis and correction. It does not eliminate the need to decide what correctness means.

LeanCat’s boundaries matter before its scores travel elsewhere

LeanCat is compact: 100 tasks from category theory, evaluated in Lean 4 against a specific Mathlib version. It offers a focused stress test, not a comprehensive measurement of formal reasoning.

Its difficulty labels also require careful interpretation. Human expert ratings account for half of each score, while performance from the evaluated models accounts for the other half. Thresholds are then selected to produce 20 Easy, 40 Medium, and 40 High tasks. The resulting categories are useful for tracking capability degradation, but they are partly defined by current solver behavior rather than by an entirely independent natural scale.

The benchmark’s hardest tasks sometimes require definitions or bridge lemmas beyond existing Mathlib coverage. This makes them valuable tests of frontier formalization. It also means that some failures reflect missing ecosystem infrastructure in addition to limitations of the model.

Comparisons among static models, LeanBridge, and specialist provers are informative but not perfectly compute-matched. Four sequential agent iterations can include retrieval and verification, while 32 specialist samples follow a different workflow. Cost, latency, and implementation complexity are not reported as business metrics.

Finally, the best result remains 24% overall. Only one of the 40 High-difficulty problems is solved, producing the reported 2.5% High score for two LeanBridge configurations. LeanBridge narrows the abstraction gap. It does not close it.

The next bottleneck is constructing the missing middle

LeanCat begins with category theory, but its diagnosis is familiar.

Current AI systems often perform well when the destination is clear and the required route already exists somewhere in their training data or accessible tools. They struggle when success depends on recognizing the correct abstraction, locating a sequence of reusable interfaces, and constructing the missing intermediate layer when retrieval returns nothing.

Static generation treats each failure as an invitation to try again. LeanBridge treats it as information. That change doubles or triples performance in several comparisons and enables the benchmark’s only High-difficulty successes.

Then the curve flattens again.

The remaining failures are not waiting for a better prompt or a luckier sample. They require systems capable of creating and validating new conceptual infrastructure without losing the original objective. In mathematics, that infrastructure is a bridge lemma. In software, it may be an adapter or architectural component. In compliance, it may be a defensible interpretation connecting a broad rule to an unfamiliar case.

Category theory is where LLM provers go to struggle because abstraction removes the option of surviving through local cleverness alone.

LeanCat’s contribution is to show exactly where that struggle begins—and which popular remedies stop working first.

Cognaptus: Automate the Present, Incubate the Future.


  1. Rongge Xu et al., “LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories),” arXiv:2512.24796, https://arxiv.org/abs/2512.24796↩︎