Opening — Why this matters now
Neural theorem proving has entered its industrial phase. With reinforcement learning pipelines, synthetic data factories, and search budgets that would make a chess engine blush, models like DeepSeek‑Prover‑V1.5 are widely assumed to have internalized everything there is to know about formal proof structure.
This paper politely disagrees.
Under tight inference budgets—no massive tree search, no thousand-sample hail‑Mary—the author shows that simple, almost embarrassingly old‑fashioned structural hints still deliver large gains. Not new models. Not more data. Just better scaffolding.
That should make anyone building cost‑sensitive AI systems uncomfortable—in a productive way.
Background — Scale solved reasoning… allegedly
Modern neural provers follow a familiar arc:
- Pretrain a large language model
- Fine‑tune it on tactic scripts
- Reinforce it through self‑play and verification loops
- Throw compute at inference
The implicit promise is clear: structure will emerge. Given enough RL, the model will “just know” when to `intro`, when to `simp`, when to `linarith`, and when to give up.
Reality is messier. Even elite provers still fail on:
- trivial syntax slips
- hallucinated identifiers
- tactic dead‑ends reached too early
In other words: they fail structurally, not mathematically.
This paper asks a refreshingly unglamorous question: what if we simply reminded them how proofs usually start?
Analysis — What the paper actually does
The intervention is minimal by design.
Instead of prompting the model once and sampling 16 completions, the author introduces a Structured Intermediate Representation (IR):
$$ q = (x, s) $$

where:
- $x$ is the theorem statement
- $s$ is a tactic skeleton: a short, fixed prefix such as `simp`, `intro`, or `constructor`
At inference time:
- 15 predefined skeletons are cycled through
- each skeleton seeds one generation
- total samples remain k = 16
- token budget remains 1024 tokens
No search. No learning. No adaptivity.
Just a fixed schedule of structural nudges.
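To make the schedule concrete, here is a minimal sketch of the inference loop in Python, under our reading of the setup. The skeleton list is abridged and illustrative, and `generate` and `verify` are stand-ins for the prover model and the Lean checker, not the paper's actual API.

```python
from itertools import cycle
from typing import Callable, Optional

# Abridged, illustrative skeleton list; the paper cycles 15 fixed skeletons.
SKELETONS = [
    "intro", "intros", "constructor",   # setup
    "simp", "norm_num",                 # simplification
    "linarith", "nlinarith", "ring",    # algebraic
    "aesop",                            # heuristic
]

def prove(theorem: str,
          generate: Callable[[str, int], str],  # stand-in for the prover LLM
          verify: Callable[[str, str], bool],   # stand-in for the Lean checker
          k: int = 16,
          max_tokens: int = 1024) -> Optional[str]:
    """Cycle the fixed skeletons across k samples and return the first
    proof the verifier accepts, else None. No search, no adaptivity."""
    for _, skeleton in zip(range(k), cycle(SKELETONS)):
        # q = (x, s): theorem statement x plus a forced opening tactic s.
        continuation = generate(f"{theorem} := by\n  {skeleton}\n", max_tokens)
        proof = f"{skeleton}\n{continuation}"
        if verify(theorem, proof):
            return proof
    return None
```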
Skeletons used (abridged)
| Category | Examples |
|---|---|
| Setup | `intro`, `intros`, `constructor` |
| Simplification | `simp`, `norm_num` |
| Algebraic | `linarith`, `nlinarith`, `ring` |
| Heuristic | `aesop`, `simp → try aesop` |
The model still samples stochastically. The only change is where it is forced to begin.
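For intuition, here is a toy Lean 4 snippet (ours, not the paper's) showing what a skeleton-seeded attempt looks like. The opening tactic is fixed; everything after it is sampled.

```lean
-- Toy illustration: the skeleton fixes the opening tactic;
-- the model samples only the continuation.
theorem imp_self_example (p : Prop) : p → p := by
  intro h    -- fixed skeleton prefix
  exact h    -- model-generated continuation
```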
Findings — The numbers that matter
Headline result
| Method | Solved | Pass@16 |
|---|---|---|
| Unguided baseline | 37 / 244 | 15.16% |
| Structured IR | 53 / 244 | 21.72% |
That is a 43% relative improvement ((53 − 37) / 37 ≈ 43%) under identical inference constraints.
Paired outcome analysis
| Outcome | Count |
|---|---|
| Solved by both | 34 |
| Structured‑only solves | 19 |
| Baseline‑only solves | 3 |
This asymmetry matters.
The structured method doesn’t merely trade one class of problems for another. It keeps 34 of the baseline’s 37 solves, adds 19 new ones, and gives up only 3: close to outright dominance in this low-compute regime.
A McNemar test confirms the improvement is statistically significant (p < 0.001).
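The verdict is easy to sanity-check from the discordant counts alone. Here is a minimal reproduction of the exact (binomial) form of McNemar's test, assuming that is the variant the paper used:

```python
from math import comb

# Discordant pairs from the paired outcome table above.
b, c = 19, 3        # structured-only solves, baseline-only solves
n = b + c           # 22 theorems on which the two methods disagree

# Exact two-sided McNemar test: under H0, each discordant pair
# favors either method with probability 0.5.
p_value = 2 * sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
print(f"exact two-sided p = {p_value:.5f}")   # 0.00086, i.e. p < 0.001
```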
Failure modes (the quiet twist)
Despite the performance gap, failure distributions look nearly identical:
| Error Type | Baseline (% of failures) | Structured (% of failures) |
|---|---|---|
| Unsolved goals | 32.9% | 30.4% |
| Syntax errors | 30.0% | 29.3% |
| Unknown identifiers | 18.4% | 19.9% |
Translation: the skeletons don’t magically prevent errors.
They simply increase the probability that one attempt lands on the right proof shape.
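One back-of-envelope way to see it (our framing, not the paper's): if each of the $k$ attempts independently succeeds with probability $p$, then

$$ \text{pass@}k = 1 - (1 - p)^k $$

so skeletons that land more attempts on the right proof shape raise $p$, and even a modest per-attempt gain compounds across the 16 samples while each attempt's error profile stays the same.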
Implications — Why this is bigger than Lean
This paper quietly undermines a common assumption in AI engineering:
“If the model is good enough, structure becomes unnecessary.”
In practice:
- Structure is under‑utilized, not obsolete
- RL optimizes expected reward, not worst‑case brittleness
- Fixed guidance can outperform adaptive sampling when budgets are tight
For businesses, this generalizes cleanly:
| Domain | Analogy |
|---|---|
| Code generation | Forcing file layout before writing functions |
| Agents | Explicit plan‑then‑act separation |
| RAG | Query decomposition before retrieval |
| Automation | Workflow templates instead of free‑form prompts |
Cheap structure beats expensive flailing.
Limitations — What this does not claim
- This does not beat leaderboard systems running k = 1024
- Proofs are truncated by a 1024‑token ceiling
- Skeletons are hand‑crafted, not learned
- Results are from a single evaluation run
But that is precisely the point: under realistic constraints, this works.
Conclusion — The unfashionable lesson
Scaling works. But it leaks.
Even the best reinforcement‑trained theorem provers still benefit from something very human: being told how to start.
Sometimes the difference between 15% and 22% accuracy is not more intelligence, but better manners.
Cognaptus: Automate the Present, Incubate the Future.