Opening — Why this matters now
LLMs can write code, explain proofs, and occasionally hallucinate both with equal confidence. So the obvious next question—posed almost mischievously in this paper—is whether an LLM can code a theorem prover that itself relies on LLMs. Not as a demo. Not as a toy. But as a fully automatic, kernel-checked prover that runs on a laptop and outperforms Isabelle’s industrial-grade automation in at least some regimes.
That question matters because formal verification is quietly becoming strategic infrastructure. As AI systems are deployed into finance, safety-critical software, and regulation-heavy environments, “it seems correct” is no longer good enough. You need proofs. Preferably without hiring a small army of proof engineers.
Background — Why Isabelle automation stalls
Interactive theorem provers like Isabelle/HOL sit at an awkward equilibrium. They offer extraordinary guarantees—every proof is checked against a small trusted kernel—but extracting those guarantees is expensive. Even with tools like Sledgehammer and Nitpick, proof development often degenerates into a careful choreography of tactics, auxiliary lemmas, and context management.
The fundamental problem is combinatorial. Higher-order logic plus large libraries equals a search space that explodes faster than classical automation can prune. Sledgehammer is excellent at blasting through routine goals, but it still relies on brittle heuristics and external ATPs that struggle when structure matters more than raw logical firepower.
LLMs, by contrast, excel at pattern completion. They are not sound reasoners—but they are very good proposal engines. The trick is forcing those proposals through a verifier that never compromises.
Analysis — What Isabellm actually does
The system introduced in the paper—Isabellm—is not a single clever prompt. It is a layered architecture designed to turn unreliable language models into disciplined components inside a symbolic machine.
Layer 1: Stepwise prover (LLM as bounded action generator)
At the core is a beam-search prover that treats theorem proving as sequential decision-making. From a given proof state, an LLM proposes small proof commands: typically `apply`-style tactic steps or single-line finishers such as `by simp`. Every proposal is immediately checked by Isabelle. Accepted steps advance the state; rejected ones vanish without ceremony.
Key design choices:
| Design choice | Why it matters |
|---|---|
| Grammar-constrained outputs | Prevents LLM creativity from becoming syntactic sabotage |
| Beam search over proof states | Explores alternatives without exponential blowup |
| Proof-state fingerprints | Deduplicates semantically identical states |
| Dynamic temperature | Escapes local minima without constant randomness |
In effect, the LLM is demoted from “proof author” to “probabilistic move generator.” Isabelle remains the only authority.
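To make the loop concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `propose` stands in for the grammar-constrained LLM, `check` for the Isabelle server, and the data shapes are invented for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class ProofState:
    goal: str                     # pretty-printed Isabelle proof state
    script: tuple[str, ...] = ()  # commands accepted so far

def fingerprint(state: ProofState) -> str:
    # Deduplicate semantically identical states; a real system would
    # normalise the goal (bound variables, subgoal order) before hashing.
    return state.goal

def beam_step(
    beam: list[ProofState],
    propose: Callable[[ProofState, float], list[str]],         # grammar-constrained LLM proposals
    check: Callable[[ProofState, str], Optional[ProofState]],  # Isabelle: new state, or None if rejected
    width: int = 8,
    temperature: float = 0.2,
) -> tuple[list[ProofState], float]:
    seen: set[str] = set()
    successors: list[ProofState] = []
    for state in beam:
        for cmd in propose(state, temperature):
            nxt = check(state, cmd)   # the verifier is the only authority
            if nxt is None:
                continue              # rejected proposals vanish without ceremony
            fp = fingerprint(nxt)
            if fp not in seen:
                seen.add(fp)
                successors.append(nxt)
    if not successors:
        # Dynamic temperature: no progress, so sample more diversely next round.
        return beam, min(1.0, temperature + 0.2)
    # A learned reranker (next subsection) would order successors before truncation.
    return successors[:width], temperature
```

The structural point survives the simplification: a command can only enter a proof script by passing `check`, and the beam is pruned by fingerprint before it is cut to width.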
Learning where it counts: reranking and premises
Rather than asking the LLM to remember the entire Isabelle library, Isabellm introduces two learned but modest components:
- Tactic rerankers: Lightweight ML models predict which candidate commands are likely to succeed, reducing wasted Isabelle calls.
- Premise selection: Small encoders retrieve relevant lemmas, injected as soft hints rather than hard constraints.
This is not DeepMind-style end-to-end learning. It is pragmatic, cheap, and brutally verifier-grounded.
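A deliberately toy illustration of both components follows, assuming nothing about the paper's actual models: lexical overlap stands in here for learned scoring, and the function names are invented for the sketch.

```python
import math
from collections import Counter

def _bow(text: str) -> Counter:
    # Bag-of-words stand-in for a small learned encoder.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rerank(goal: str, candidates: list[str]) -> list[str]:
    # Order candidate commands by predicted success before any Isabelle call,
    # so the expensive verifier sees the most promising proposals first.
    return sorted(candidates, key=lambda c: _cosine(_bow(goal), _bow(c)), reverse=True)

def select_premises(goal: str, library: dict[str, str], k: int = 5) -> list[str]:
    # Retrieve the k lemma names most similar to the goal; they are injected
    # into the prompt as soft hints, never as hard constraints on the proof.
    return sorted(library, key=lambda n: _cosine(_bow(goal), _bow(library[n])), reverse=True)[:k]
```

The architectural point is that both components only reorder or annotate candidates; neither can smuggle an unverified step past the kernel.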
Layer 2: Proof planner (where ambition meets reality)
Above the stepwise prover sits a planner that attempts something bolder: generating structured Isar proofs with explicit subgoals, then filling and repairing gaps using a CEGIS-style loop.
The planner uses four mechanisms, sketched in code after this list:
- Outline generation: LLM-sampled Isar skeletons with `sorry` placeholders
- Micro-RAG: Symbolic, lexicon-based hints mined from Isabelle corpora
- Gap filling: Calling the stepwise prover on localized subgoals
- Repair: Block-level regeneration guided by verifier errors
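Here is a minimal sketch of that fill-and-repair loop. The callables are hypothetical stand-ins for the LLM planner, the stepwise prover, and the Isabelle checker; the gap-localisation bookkeeping is elided, and the Isar skeleton shape is only illustrative.

```python
from typing import Callable, Optional

# Shape of an LLM-sampled outline: Isar structure with sorry placeholders.
EXAMPLE_SKELETON = """\
proof -
  have step1: "P x" sorry
  then have step2: "Q x" sorry
  then show ?thesis sorry
qed"""

def fill_and_repair(
    theorem: str,
    plan: Callable[[str], str],                 # LLM: theorem -> Isar skeleton with sorry gaps
    prove_gap: Callable[[str], Optional[str]],  # stepwise prover: partial proof -> patch for first gap
    check: Callable[[str], Optional[str]],      # Isabelle: full proof -> error message, or None if OK
    max_rounds: int = 3,
) -> Optional[str]:
    outline = plan(theorem)
    for _ in range(max_rounds):
        filled = outline
        # Gap filling: hand each sorry to the stepwise prover, one at a time.
        while "sorry" in filled:
            patch = prove_gap(filled)
            if patch is None:
                break                 # an unfillable gap; fall through to repair
            filled = filled.replace("sorry", patch, 1)
        error = check(filled)
        if error is None and "sorry" not in filled:
            return filled             # kernel-checked and fully closed
        # Repair: regenerate the outline, guided by the verifier's complaint.
        outline = plan(f"{theorem}\n(* previous attempt failed: {error or 'open gaps'} *)")
    return None
```

Note the CEGIS flavour: a failed verification does not crash the pipeline, it becomes the prompt for the next outline.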
On paper, this is elegant. In practice, it is where the system strains.
Findings — Where the system wins (and where it breaks)
The empirical result is surprisingly sharp:
- The stepwise prover can solve lemmas that defeat Sledgehammer, even with generous timeouts.
- The planner’s fill-and-repair mechanism largely fails beyond trivial cases.
This is not a tuning issue. The paper is refreshingly candid: even GPT-5.2 Extended Thinking and Gemini 3 Pro struggle to implement and maintain the planner’s complex algorithmic invariants once the codebase grows beyond ~20 files.
In other words, LLMs are good at suggesting proof steps. They are bad at maintaining global consistency across deeply structured symbolic systems.
Implications — What this means for AI and verification
Three lessons stand out.
- LLMs shine as search heuristics, not solvers. When tightly constrained and aggressively verified, they add genuine power.
- Verifier-in-the-loop data is gold. Every failed attempt becomes training signal (a logging sketch follows this list). This is how the system improves without fantasy.
- Planning remains the hard frontier. Global proof structure, context-sensitive syntax, and repair under failure expose limits that bigger models alone do not fix.
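To make the second lesson concrete, here is a small sketch, with hypothetical names and an arbitrarily chosen JSON-lines format, of how every Isabelle verdict can be captured as a labelled example for the tactic reranker:

```python
import json

def log_verdict(path: str, goal: str, command: str, accepted: bool) -> None:
    # One JSON line per attempt. Rejections matter as much as acceptances:
    # the reranker must learn what the verifier refuses, not just what it allows.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"goal": goal, "command": command, "label": int(accepted)}) + "\n")
```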
For businesses and regulators, the takeaway is subtle but important: near-term value lies in augmentation, not autonomy. Systems like Isabellm already reduce human effort in formal reasoning—but only when architecture disciplines the model.
Conclusion — Engineering beats vibes
Isabellm demonstrates that LLM-powered theorem proving is not a moonshot—it already works, in narrow but meaningful domains. More importantly, it shows how it works: by stripping language models of authority and embedding them inside unforgiving symbolic loops.
The system also exposes a boundary. Vibe coding gets you a prover. It does not get you a planner you can trust.
That boundary is exactly where the next decade of AI systems engineering will be decided.
Cognaptus: Automate the Present, Incubate the Future.