Opening — Why this matters now
Neural theorem proving has entered its industrial phase. With reinforcement learning pipelines, synthetic data factories, and search budgets that would make a chess engine blush, models like DeepSeek‑Prover‑V1.5 are widely assumed to have internalized everything there is to know about formal proof structure.
This paper politely disagrees.
Under tight inference budgets—no massive tree search, no thousand-sample hail‑Mary—the author shows that simple, almost embarrassingly old‑fashioned structural hints still deliver large gains. Not new models. Not more data. Just better scaffolding.
That should make anyone building cost‑sensitive AI systems uncomfortable—in a productive way.
Background — Scale solved reasoning… allegedly
Modern neural provers follow a familiar arc:
- Pretrain a large language model
- Fine‑tune it on tactic scripts
- Reinforce it through self‑play and verification loops
- Throw compute at inference
The implicit promise is clear: structure will emerge. Given enough RL, the model will “just know” when to `intro`, when to `simp`, when to `linarith`, and when to give up.
Reality is messier. Even elite provers still fail on:
- trivial syntax slips
- hallucinated identifiers
- tactic dead‑ends reached too early
In other words: they fail structurally, not mathematically.
This paper asks a refreshingly unglamorous question: what if we simply reminded them how proofs usually start?
Analysis — What the paper actually does
The intervention is minimal by design.
Instead of prompting the model once and sampling 16 completions, the author introduces a Structured Intermediate Representation (IR):
$$ q = (x, s) $$

where:
- $x$ is the theorem statement
- $s$ is a tactic skeleton: a short, fixed prefix such as `simp`, `intro`, or `constructor`
At inference time:
- 15 predefined skeletons are cycled through
- each skeleton seeds one generation
- total samples remain k = 16
- token budget remains 1024 tokens
No search. No learning. No adaptivity.
Just a fixed schedule of structural nudges.
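To make the schedule concrete, here is a minimal sketch of the inference loop in Python, under our reading of the setup. The skeleton list is abridged and illustrative, and `generate` and `verify` are stand-ins for the prover model and the Lean checker, not the paper's actual API.

```python
from itertools import cycle
from typing import Callable, Optional

# Abridged, illustrative skeleton list; the paper cycles 15 fixed skeletons.
SKELETONS = [
    "intro", "intros", "constructor",   # setup
    "simp", "norm_num",                 # simplification
    "linarith", "nlinarith", "ring",    # algebraic
    "aesop",                            # heuristic
]

def prove(theorem: str,
          generate: Callable[[str, int], str],  # stand-in for the prover LLM
          verify: Callable[[str, str], bool],   # stand-in for the Lean checker
          k: int = 16,
          max_tokens: int = 1024) -> Optional[str]:
    """Cycle the fixed skeletons across k samples and return the first
    proof the verifier accepts, else None. No search, no adaptivity."""
    for _, skeleton in zip(range(k), cycle(SKELETONS)):
        # q = (x, s): theorem statement x plus a forced opening tactic s.
        continuation = generate(f"{theorem} := by\n  {skeleton}\n", max_tokens)
        proof = f"{skeleton}\n{continuation}"
        if verify(theorem, proof):
            return proof
    return None
```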
Skeletons used (abridged)
| Category | Examples |
|---|---|
| Setup | `intro`, `intros`, `constructor` |
| Simplification | `simp`, `norm_num` |
| Algebraic | `linarith`, `nlinarith`, `ring` |
| Heuristic | `aesop`, `simp → try aesop` |
The model still samples stochastically. The only change is where it is forced to begin.
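For intuition, here is a toy Lean 4 snippet (ours, not the paper's) showing what a skeleton-seeded attempt looks like. The opening tactic is fixed; everything after it is sampled.

```lean
-- Toy illustration: the skeleton fixes the opening tactic;
-- the model samples only the continuation.
theorem imp_self_example (p : Prop) : p → p := by
  intro h    -- fixed skeleton prefix
  exact h    -- model-generated continuation
```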
Findings — The numbers that matter
Headline result
| Method | Solved | Pass@16 |
|---|---|---|
| Unguided baseline | 37 / 244 | 15.16% |
| Structured IR | 53 / 244 | 21.72% |
That is a 43% relative improvement ((53 − 37) / 37 ≈ 43%) under identical inference constraints.
Paired outcome analysis
| Outcome | Count |
|---|---|
| Solved by both | 34 |
| Structured‑only solves | 19 |
| Baseline‑only solves | 3 |
This asymmetry matters.
The structured method doesn’t merely trade one class of problems for another. It keeps 34 of the baseline’s 37 solves, adds 19 new ones, and gives up only 3: close to outright dominance in this low-compute regime.
A McNemar test confirms the improvement is statistically significant (p < 0.001).
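The verdict is easy to sanity-check from the discordant counts alone. Here is a minimal reproduction of the exact (binomial) form of McNemar's test, assuming that is the variant the paper used:

```python
from math import comb

# Discordant pairs from the paired outcome table above.
b, c = 19, 3        # structured-only solves, baseline-only solves
n = b + c           # 22 theorems on which the two methods disagree

# Exact two-sided McNemar test: under H0, each discordant pair
# favors either method with probability 0.5.
p_value = 2 * sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
print(f"exact two-sided p = {p_value:.5f}")   # 0.00086, i.e. p < 0.001
```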
Failure modes (the quiet twist)
Despite the performance gap, failure distributions look nearly identical:
| Error Type | Baseline (% of failures) | Structured (% of failures) |
|---|---|---|
| Unsolved goals | 32.9% | 30.4% |
| Syntax errors | 30.0% | 29.3% |
| Unknown identifiers | 18.4% | 19.9% |
Translation: the skeletons don’t magically prevent errors.
They simply increase the probability that one attempt lands on the right proof shape.
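One back-of-envelope way to see it (our framing, not the paper's): if each of the $k$ attempts independently succeeds with probability $p$, then

$$ \text{pass@}k = 1 - (1 - p)^k $$

so skeletons that land more attempts on the right proof shape raise $p$, and even a modest per-attempt gain compounds across the 16 samples while each attempt's error profile stays the same.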
Implications — Why this is bigger than Lean
This paper quietly undermines a common assumption in AI engineering:
“If the model is good enough, structure becomes unnecessary.”
In practice:
- Structure is under‑utilized, not obsolete
- RL optimizes expected reward, not worst‑case brittleness
- Fixed guidance can outperform adaptive sampling when budgets are tight
For businesses, this generalizes cleanly:
| Domain | Analogy |
|---|---|
| Code generation | Forcing file layout before writing functions |
| Agents | Explicit plan‑then‑act separation |
| RAG | Query decomposition before retrieval |
| Automation | Workflow templates instead of free‑form prompts |
Cheap structure beats expensive flailing.
Limitations — What this does not claim
- This does not beat leaderboard systems running k = 1024
- Proofs are truncated by a 1024‑token ceiling
- Skeletons are hand‑crafted, not learned
- Results are from a single evaluation run
But that is precisely the point: under realistic constraints, this works.
Conclusion — The unfashionable lesson
Scaling works. But it leaks.
Even the best reinforcement‑trained theorem provers still benefit from something very human: being told how to start.
Sometimes the difference between 15% and 22% accuracy is not more intelligence, but better manners.
Cognaptus: Automate the Present, Incubate the Future.