Opening — Why this matters now

Formal theorem proving has entered its confident phase. We now have models that can clear olympiad-style problems, undergraduate algebra, and even parts of the Putnam with respectable success rates. Reinforcement learning, tool feedback, and test-time scaling have done their job.

And then LeanCat arrives — and the success rates collapse.

This paper introduces LeanCat, a benchmark deliberately designed to do what most existing datasets avoid: force models to reason inside abstraction, not around it. The result is uncomfortable but clarifying. Even the strongest contemporary systems solve fewer than 1 in 10 category-theoretic statements on the first attempt. High-difficulty problems? Zero percent.

If formal reasoning is meant to scale toward research mathematics, this is the checkpoint where optimism meets structure.

Background — Context and prior art

Most formal benchmarks reward cleverness. MiniF2F, ProofNet, PutnamBench — all valuable, but largely focused on short derivations, symbolic manipulation, or domain-local tricks. These problems rarely require deep engagement with large formal libraries or long-horizon abstraction management.

Modern mathematics, however, does not work that way.

It is written in layers: interfaces, reusable constructions, universal properties, and diagrams that encode meaning implicitly. Category theory is the lingua franca of this style. To formalize it in Lean is not to solve puzzles, but to navigate Mathlib as an ecosystem.
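
To make that concrete, here is a minimal Lean 4 sketch (our illustration, assuming Mathlib's current CategoryTheory API; not a LeanCat problem). A single short statement already stacks the Category, Functor, and HasLimit interfaces, and the proof amounts to invoking the universal property the library exposes rather than computing anything by hand.

```lean
import Mathlib.CategoryTheory.Limits.HasLimits

open CategoryTheory CategoryTheory.Limits

-- Illustrative sketch only: any cone over F factors through the limit cone.
-- The content is the universal property packaged by Mathlib's `HasLimit`
-- interface; `simp` closes the goal via the library's factorization lemma
-- (`limit.lift_π`, name as we recall it), so the work is interface navigation.
example {J C : Type*} [Category J] [Category C] (F : J ⥤ C) [HasLimit F]
    (c : Cone F) (j : J) :
    limit.lift F c ≫ limit.π F j = c.π.app j := by
  simp
```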

LeanCat positions itself precisely here, complementing algebra-focused benchmarks like FATE by shifting attention from computation to structure engineering.

Analysis — What the paper does

LeanCat (Part I: 1-Categories) consists of 100 fully formalized Lean 4 theorem statements, all drawn from standard category theory sources and curated to stress abstraction rather than ingenuity.

Benchmark design highlights

  • Eight topic clusters covering the backbone of category theory
  • Statement-level tasks only — no scaffolding, no guided lemmas (a toy illustration of the format follows this list)
  • Mathlib-native: models must work with existing definitions and interfaces
  • Library gaps included: some problems require auxiliary results not yet present
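
To see what that format means in practice, here is a toy statement-level task in the benchmark's style (our own example, assuming Mathlib's CategoryTheory API; not an actual LeanCat item). The prover receives only the statement and must replace `sorry` with a complete, compiling proof.

```lean
import Mathlib.CategoryTheory.EpiMono

open CategoryTheory

-- Statement-level task format: no scaffolding, no suggested lemmas. The prover
-- must either locate the existing Mathlib fact (we believe `mono_comp` covers
-- this) or reprove it directly from the definition of `Mono`.
example {C : Type*} [Category C] {X Y Z : C} (f : X ⟶ Y) (g : Y ⟶ Z)
    [Mono f] [Mono g] : Mono (f ≫ g) :=
  sorry
```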

Topic coverage

| Cluster | Problem Range | What It Tests |
| --- | --- | --- |
| Basic Properties | 1–18 | Core categorical reasoning |
| Adjunctions | 19–29 | Universal properties, abstraction shifts |
| Reflective Subcategories | 30–33 | Interface-level structure |
| Concrete Categories | 34–41 | Abstract–concrete translation |
| Limits & Colimits | 42–73 | Multi-step composition, frontier gaps |
| Cocompletions | 74–78 | New definitions + theorem building |
| Abelian Categories | 79–90 | Heavy structural bookkeeping |
| Monads | 91–100 | Two-level reasoning (laws + structure) |

This is not a benchmark for syntax. It is a benchmark for formal mathematical maturity.

Findings — Results with visualization

Baseline performance (formal proofs)

| Model | Pass@1 | Pass@4 |
| --- | --- | --- |
| Claude Opus 4.5 | 8.25% | 12.00% |
| GPT‑5.2 | 5.50% | 7.00% |
| DeepSeek Reasoner | 4.00% | 9.00% |
| Gemini 3 Pro | 3.25% | 8.00% |
| Kimi K2 | 2.00% | 4.00% |

Difficulty matters — sharply.

| Difficulty | Best Pass@1 | Best Pass@4 |
| --- | --- | --- |
| Easy | 32.5% | 50.0% |
| Medium | 4.17% | 4.76% |
| High | 0.0% | 0.0% |

Retries help marginally, not structurally. This is not a sampling problem.

The natural-to-formal gap

One of the paper’s most revealing findings is the NL–FL collapse: models often generate plausible mathematical arguments that fail to compile in Lean. The abstraction exists in language — but not in executable structure.

This gap is not cosmetic. It reflects a failure to:

  • Select the right interfaces
  • Respect definitional equality (a concrete case is sketched after this list)
  • Maintain coherence across multi-step constructions
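
A small, assumed Lean 4 sketch of the definitional-equality point (ours, not the paper's): in Mathlib, `(F ⋙ G).map f` unfolds definitionally to `G.map (F.map f)`, so the goal below closes by `rfl`. An argument that is merely plausible in prose has no way to notice, let alone exploit, that the two sides are already the same term.

```lean
import Mathlib.CategoryTheory.Functor.Basic

open CategoryTheory

-- Assumed illustration: functor composition is defined so that applying the
-- composite to a morphism literally computes to the nested application, hence
-- `rfl` suffices, but only for a prover working at the level of Lean terms.
example {C D E : Type*} [Category C] [Category D] [Category E]
    (F : C ⥤ D) (G : D ⥤ E) {X Y : C} (f : X ⟶ Y) :
    (F ⋙ G).map f = G.map (F.map f) :=
  rfl
```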

Implications — Next steps and significance

LeanCat surfaces three bottlenecks that matter far beyond category theory:

  1. Library awareness: models hallucinate lemmas, misuse definitions, or fail to retrieve the right abstraction layer.

  2. Abstraction control: successful proofs stay categorical; failed ones drift into element-wise reasoning that Lean cannot tolerate without heavy scaffolding (a sketch follows this list).

  3. Long-horizon planning: many attempts solve local goals and then stall, unable to integrate them into a global proof.
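
On the second bottleneck, a hedged Lean 4 sketch of what staying categorical means (our illustration, assuming Mathlib's API; not taken from the paper): in an arbitrary category there are no elements to chase, so cancellation must go through an arrow-level lemma such as `cancel_epi` rather than the injectivity of some underlying function.

```lean
import Mathlib.CategoryTheory.EpiMono

open CategoryTheory

-- Assumed illustration: in a general category the only available move is the
-- arrow-level cancellation lemma; there is no underlying function on which
-- element-wise reasoning could even be stated.
example {C : Type*} [Category C] {X Y Z : C} (f : X ⟶ Y) [Epi f]
    {g h : Y ⟶ Z} (w : f ≫ g = f ≫ h) : g = h :=
  (cancel_epi f).mp w
```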

The paper also evaluates LeanBridge, a retrieve–analyze–generate–verify loop with Mathlib search. It helps — but does not eliminate the core abstraction gap.

For business and AI systems, the lesson is blunt: tool-augmented reasoning is necessary but not sufficient. Without structural understanding, retrieval only accelerates failure.

Conclusion — Where LeanCat leaves us

LeanCat is not pessimistic. It is diagnostic.

It shows that today’s LLM provers are competent technicians but unreliable architects. They can manipulate symbols, but struggle to inhabit frameworks. And until that changes, “research-level formalization” will remain aspirational.

The value of LeanCat is that it makes this failure mode measurable — and therefore fixable.

Cognaptus: Automate the Present, Incubate the Future.