Skeletons in the Proof Closet: When Lean Provers Need Hints, Not More Compute

Compute is a very convenient alibi.

When an AI system fails, the modern reflex is to ask for more of it: more samples, more tokens, more search, more GPUs, more patience from whoever is paying the invoice. This habit is not always wrong. Sometimes the model really does need another attempt. Sometimes the winning answer is hiding in sample number 47.

But sometimes sample number 47 is just sample number 3 wearing a different hat.

That is the uncomfortable message of Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study.¹ The paper studies Lean theorem provers, not customer-service chatbots or enterprise workflow agents. Still, the mechanism is familiar enough to matter outside formal mathematics: reinforcement learning can create useful capability, then compress that capability into a narrow policy that keeps trying the same kind of move. The system looks stochastic. The budget says exploration. The behavior says rut.

The headline result is clean. On miniF2F-test, DeepSeek-Prover-V1.5-RL solves 38 out of 244 theorems at $k=16$ with ordinary independent sampling, 42 at $k=32$, and still 42 at $k=64$. Doubling the sampling budget from 32 to 64 produces zero additional solved theorems. A deterministic schedule of tactic skeletons, however, reaches 55, 58, and 60 solved theorems at the same three budgets.

That is not just a better prompt. It is a diagnostic: the prover did not merely need more attempts. It needed to be forced into different strategic openings.

The failure mode is narrow exploration, not weak syntax

Lean proof generation is a useful laboratory because failure is brutally legible. A proposed proof either passes the Lean kernel or it does not. There is no fuzzy “mostly correct” answer, no polite benchmark grader, no hidden human tolerance for fluent nonsense. This makes the paper’s setup unusually clean for studying inference-time behavior.

The authors compare two ways of spending the same number of model calls per theorem.

The baseline condition uses ordinary i.i.d. sampling: same theorem statement, empty prompt prefix, different random seeds. This is the familiar “try again” strategy.

The structured condition uses a deterministic grid of 15 Lean tactic skeletons crossed with 8 short goal-hint comments, giving 120 possible prompt configurations. At the tested budgets $k \in {16, 32, 64}$, the system simply walks through the first entries of that grid. The skeletons include common openings such as simp, intro, constructor, norm_num, linarith, ring, and combinations like simp followed by try aesop.

This matters because the intervention is small. The paper is not introducing a new prover architecture, a tree search engine, or a retrieval system. It is asking a simpler question: what happens if the first step of the proof is diversified on purpose?

The answer is that the RL-trained model recovers proofs that random sampling fails to find.

Condition	Pass@16	Pass@32	Pass@64
V1.5-RL, i.i.d. sampling	38/244	42/244	42/244
V1.5-RL, skeleton schedule	55/244	58/244	60/244
Absolute gap	+17	+16	+18

The shape of this table is more important than the raw score. The baseline improves from 16 to 32 attempts, then stops. The skeleton schedule keeps finding new proofs. If sampling were genuinely exploring different proof strategies, the extra 32 attempts from $k=32$ to $k=64$ should have had some chance of finding at least one additional theorem. Instead, they add nothing.

That is the paper’s central mechanism: the model’s output distribution is not diverse enough along the axis that matters. It may vary in surface form, but it keeps choosing a small set of strategic openings.

Skeletons are not “prompt magic”; they are controlled first moves

The easy but wrong reading is: “Tactic skeletons improve Lean proving.” True, but too shallow.

The better reading is: “Tactic skeletons reveal that the RL-trained prover underexplores the tactic space.”

This distinction matters for business readers because it separates product value from demo theater. A prompt trick that raises a benchmark is useful but fragile. A diagnostic that explains why the benchmark rises is operationally more valuable. It tells teams where to spend engineering effort.

The paper’s skeleton schedule acts like a forced-routing layer. Instead of asking the model to choose the first proof strategy from its learned policy, the prompt nudges or fixes the first strategic opening. Once the model is moved into a different part of the proof space, it can sometimes complete the proof. The capability was not absent. The policy was not reaching it.

That is why the authors treat the schedule as a probe rather than a proposed final system. A hand-designed 15-skeleton grid is not the endpoint. It is a cheap instrument for detecting whether the model’s failures come from lack of proof ability or lack of exploration.

For AI engineering, that distinction is gold. Many agent failures look like reasoning failures when they are actually search-control failures. The agent knows a useful operation, API call, tool sequence, or decomposition pattern. It simply does not choose it under its default policy.

Naturally, the model then asks for another random sample and does the same thing again. Very diligent. Not very useful.

The ablations separate structure from decoration

The paper does not stop at the topline comparison. It runs controlled ablations to distinguish structural guidance from generic prompt variation.

At $k=16$, four conditions are compared:

Condition	What changes	Solved / 244	Interpretation
C2 irrelevant comments	Content-free Lean comments such as “approach alpha”	25	Surface perturbation can hurt
A-RL baseline	Empty prompt, random seeds	38	Ordinary sampling baseline
C1 paraphrases	Semantically similar instruction rewordings	38	Same topline, hidden redistribution
B-RL skeletons	Tactic-level skeleton schedule	55	Structural guidance helps

This is the paper’s first guardrail against overclaiming. The gain is not merely “prompt diversity.” If diversity alone were enough, irrelevant comments or paraphrases should help. They do not. Irrelevant comments degrade performance to 25 out of 244. Paraphrases match the baseline topline at 38, although later stratification shows they reshuffle which theorems are solved.

The dedicated natural-language-only condition is especially informative. In condition C3, all 16 attempts use empty tactic prefixes but cycle through natural-language goal-hint comments. C3 solves 48 out of 244, landing between the baseline’s 38 and the full skeleton schedule’s 55.

So the ladder is not random:

Intervention	Solved / 244	What it suggests
Baseline sampling	38	Random seeds alone underexplore
Natural-language goal hints	48	Semantic tactic suggestions help
Literal tactic skeletons + hints	55	Stronger structural steering helps more

This is a structural-content gradient. A vague rewording of the task is not enough. A hint that names a useful direction helps. A concrete first-tactic stem helps further.

That gradient is the article’s main business translation. If an AI agent keeps failing, adding “please reason carefully” is usually decorative. Adding a task-specific strategic scaffold may change the search trajectory.

The recovered proofs are exactly where mode collapse should hurt

The paper’s most useful interpretive move is its empirical difficulty stratification.

Mathematical difficulty is not the same as formalization difficulty. A theorem that is hard for a human mathematician may be easy for a prover if the formal statement aligns with available tactics. A seemingly simple theorem may become awkward in Lean because of how it is encoded. So the authors define a leave-one-out formalization-difficulty score: for each theorem, count how many other hint-free prover baselines solve it.

For V1.5-RL, the benchmark is divided into five buckets:

Bucket	Meaning	Theorems
Frontier	No other tested baseline solves it	117
Hard	One other baseline solves it	24
Medium	Two other baselines solve it	13
Easy	Three other baselines solve it	53
Trivial	All four other baselines solve it	37

The word “trivial” here does not mean mathematically trivial. It means trivial for the tested field of provers excluding V1.5-RL. That is exactly the bucket where a mode-collapsed model should be embarrassed. If other systems routinely solve a theorem and V1.5-RL does not, the problem may not be raw capability. It may be that the model keeps approaching the formalization through the wrong doorway.

The skeleton intervention’s net gains concentrate in the easy and trivial buckets:

Perturbation	Hard	Medium	Easy	Trivial	Net
Skeletons	+1	+1	+9	+6	+17
Paraphrases	0	0	+3	-3	0
Irrelevant comments	+1	0	-6	-8	-13

This table does a lot of work.

First, skeletons do not move the frontier for V1.5-RL. They do not suddenly prove theorems no other model can reach. Their main effect is to recover problems that other provers already handle. That supports the mode-collapse diagnosis: the intervention repairs underexploration more than it creates new mathematical competence.

Second, paraphrases are not a null in the deep sense. They move performance around within easy and trivial buckets but produce no net gain. This matters because topline equality can hide behavioral change. In production AI systems, two prompt policies can have the same aggregate success rate while failing on different customer cases. The dashboard smiles; the operations team suffers.

Third, irrelevant comments degrade the easy and trivial buckets. Noise is not exploration. Decoration is not diversity. The model is sensitive to surface perturbations, but sensitivity alone is not useful.

RL creates the prover — and narrows it

The strongest causal clue comes from the base model comparison.

DeepSeek-Prover-V1.5 has a paired non-RL base variant, making it possible to ask whether the collapse existed before RL or was introduced by RL fine-tuning. The base model proves zero theorems across all tested budgets and conditions. Skeletons do not help it. Sampling does not help it. The authors report that the base model often emits sorry or echoes the theorem statement rather than producing valid Lean proofs.

That result is sharper than the simple “RL causes mode collapse” slogan. It says RL plays two roles at once.

RL creates proof capability. Without it, the base model has no useful Lean proof behavior on this benchmark.

RL also concentrates that capability into a narrow inference-time policy. With it, the model can prove theorems, but it underexplores the tactic space unless structurally redirected.

This dual role is the uncomfortable part. We cannot treat collapse as a removable side effect that can be fixed without touching the training objective. In this domain, the same post-training process that makes the model useful also appears to make its useful behavior narrower.

For enterprise AI, the analogy is straightforward but should be used carefully. RL-style post-training, preference optimization, and reward shaping often make systems more reliable on target tasks. They can also make behavior more stereotyped. A tuned agent may become better at following the “approved” strategy while becoming worse at trying alternative valid strategies when the approved one fails.

That is not always bad. Narrowness can be safety. Narrowness can be consistency. Narrowness can be brand compliance, if one enjoys that kind of thing. But in search-heavy domains such as theorem proving, code repair, workflow automation, compliance checking, and tool-using agents, narrowness can become expensive failure.

Cross-model results show a pattern, not a universal law

The paper extends the comparison beyond V1.5-RL to two additional 7B Lean provers: DeepSeek-Prover-V2-7B, which is RL-trained, and Goedel-Prover-SFT, which is supervised-fine-tuned.

The aggregate pass@16 results are mixed in exactly the way serious empirical work tends to be mixed:

Model	Post-training	Baseline A	Skeleton B	Difference
V1.5-Base	None	0/244	0/244	0
Goedel-Prover	SFT	90/244	85/244	-5
V1.5-RL	RL	38/244	55/244	+17
V2-7B	RL	126/244	127/244	+1

This is not a cartoon result where “RL bad, skeleton good” explains everything. V2-7B is nearly flat on aggregate. Goedel-Prover, despite being stronger than V1.5-RL in absolute pass@16, loses theorems under the skeleton schedule. The paper interprets this as evidence that different post-training recipes produce different forms of underexploration.

The robustness check across three seeds strengthens the asymmetry between V1.5-RL and Goedel-Prover. V1.5-RL gains under skeletons in all three seeds, with a mean improvement of $+12.3 \pm 4.2$ theorems. Goedel-Prover loses under skeletons in all three seeds, with a mean change of $-10.0 \pm 4.4$ theorems.

The authors also note a subtler V2 result: although V2’s aggregate gain is only +1, the structured condition solves three frontier theorems that none of the i.i.d. baselines in the matrix reach. This means aggregate score hides where the value appears. For product teams, that is a familiar problem. A change that barely moves average success rate may still unlock high-value edge cases.

The practical conclusion is not “always use tactic skeletons.” It is more precise: when a model has been reward-shaped into a narrow high-performing policy, structural diversity may be a better use of inference budget than more random samples. But the right structural intervention depends on the model and training pipeline.

The first-tactic measurement turns metaphor into evidence

Mode collapse can become a lazy phrase. The paper avoids that by directly measuring first-tactic diversity.

For each miniF2F theorem, the authors collect 64 raw outputs from V1.5-RL under i.i.d. sampling and extract the first tactic head. The result is blunt:

Unique first-tactic heads across 64 samples	Theorems
1	120 (49.2%)
2	69 (28.3%)
3	34 (13.9%)
4	12 (4.9%)
5–7	9 (3.7%)

The median theorem receives only two distinct first-tactic heads across 64 stochastic samples. Almost half the benchmark receives exactly one. The authors also report mean first-tactic-head entropy of 0.48 bits per theorem, versus 6.0 bits under uniform sampling across 64 possibilities. Aggregated across all 13,159 samples, three tactic heads — have, rw, and norm_num — account for 65.3% of strategic openings.

This is the paper’s cleanest diagnostic bridge from behavior to mechanism. The plateau is not merely an unlucky benchmark curve. The model is actually concentrating its first move.

The skeleton schedule guarantees 15 distinct first-tactic heads by construction. It is not smarter in any deep mystical sense. It simply prevents the model from spending 64 samples walking through the same narrow corridor.

The business value is inference design, not theorem proving trivia

Most companies are not deploying Lean theorem provers. That is fine. The relevant object is not Lean. The relevant object is the failure pattern.

Many AI systems now operate as search policies: they choose tools, generate code patches, draft SQL queries, navigate documents, create compliance arguments, or decompose business tasks into subtasks. In those settings, repeated sampling is often treated as exploration. This paper says: check that assumption.

A practical inference stack should separate three layers:

Layer	Question to ask	Operational response
Output variation	Are responses different in wording only?	Do not count paraphrase as exploration
Strategy variation	Are first moves, tools, or plans genuinely different?	Track tactic/tool/plan diversity
Verifier feedback	Which strategies actually succeed?	Route future attempts toward successful structural families

For Cognaptus-style automation work, the lesson is especially relevant to agent design. A business process agent may fail not because it lacks knowledge, but because its default policy overuses one procedural template. A document-review agent may always start with summarization when extraction is needed. A code agent may always patch the local error when the architecture is wrong. A research agent may always search web pages when it should inspect tables, appendices, or source repositories.

More samples will not fix that if the samples keep selecting the same opening move.

The product implication is to build structural diversity into the system deliberately. This can mean plan skeletons, tool-routing schedules, task-specific checklists, learned hint retrieval, or entropy-aware post-training. The cheap version is a fixed scaffold. The better version is a per-task router that predicts which scaffold is worth trying.

The paper does not prove that all enterprise agents need skeleton schedules. It does show why “increase $k$” is an incomplete scaling strategy. Inference compute should buy different hypotheses, not merely more fluent repetitions of the same hypothesis.

What the paper directly shows, and what we should infer carefully

Here is the disciplined split.

The paper directly shows that DeepSeek-Prover-V1.5-RL plateaus under i.i.d. sampling on miniF2F-test, while a fixed skeleton schedule recovers additional proofs at the same call budget. It shows that this improvement is not explained by generic prompt variation, because paraphrases do not improve the topline and irrelevant comments hurt. It shows that V1.5-Base has no proof capability under the same setup, so RL is both the source of capability and the locus of collapse in this controlled comparison. It also shows cross-model asymmetry: V1.5-RL benefits, V2-7B has a flat aggregate but frontier wins, and Goedel-Prover-SFT loses under the skeleton schedule.

Cognaptus infers that AI systems using repeated inference should measure strategy diversity, not just sample count. For tool-using agents, code assistants, and formal verification systems, the operational question is whether additional calls explore new first moves. If they do not, the compute budget is being spent on ritual.

What remains uncertain is the generality. The strongest controlled result is for one matched model family, DeepSeek-Prover-V1.5. The first-tactic distribution measurement is reported for V1.5-RL only. V2 lacks the same $k=32$ and $k=64$ scaling characterization in the paper. The study is also focused on miniF2F-test under a pinned Lean 4 environment. Other proof benchmarks, other model sizes, and other search architectures may produce different magnitudes.

There is also a subtle technical boundary: some skeletons, such as aesop and norm_num, are closer to powerful one-shot decision procedures than ordinary proof stems. Part of the gain may come from making the model use underselected built-in solvers, not only from diversifying the abstract first tactic. For product teams, that boundary is not fatal. It simply says the intervention may combine two effects: better strategy diversity and better tool invocation.

Both are useful. They are just not the same.

The next step is learned structure, not hand-written superstition

The obvious follow-up is not to freeze a 15-skeleton grid and declare victory. That would be very enterprise software of us, but still premature.

The better direction is adaptive structural routing. Given a theorem, task, document, ticket, or codebase, the system should predict which strategic openings are plausible and allocate attempts across them. In theorem proving, that may mean learned skeleton retrieval. In workflow agents, it may mean routing among extraction-first, verification-first, search-first, tool-call-first, and decomposition-first plans. In code repair, it may mean forcing some attempts to inspect tests, some to inspect architecture, and some to minimize the patch.

Training can also address the issue earlier. If RL creates both capability and collapse, then entropy-regularized RL, skeleton-conditioned RL, or reward designs that preserve strategic diversity may be worth more than another round of best-of-$N$ inference. But that is a harder engineering problem. It asks the training process to preserve alternatives that are not always immediately rewarded.

The paper’s practical elegance is that it starts cheaper. Before redesigning training, measure whether your inference budget is actually exploring. Count first moves. Count tool choices. Count plan families. Compare success overlap. Look for easy cases that other systems solve but your tuned model misses. These are not glamorous metrics. They are the receipts.

The broader lesson is simple: stochasticity is not diversity. More samples are not more thinking if the policy keeps reaching for the same move. Reinforcement learning can teach a model to prove; it can also teach the model to become oddly stubborn about how proof should begin.

Sometimes the prover does not need more compute.

It needs someone to open a different door in the proof closet.

Cognaptus: Automate the Present, Incubate the Future.

Zachary Burton, “Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study,” arXiv:2601.16172v2, May 2026. https://arxiv.org/abs/2601.16172 ↩︎

The failure mode is narrow exploration, not weak syntax#

Skeletons are not “prompt magic”; they are controlled first moves#

The ablations separate structure from decoration#

The recovered proofs are exactly where mode collapse should hurt#

RL creates the prover — and narrows it#

Cross-model results show a pattern, not a universal law#

The first-tactic measurement turns metaphor into evidence#

The business value is inference design, not theorem proving trivia#

What the paper directly shows, and what we should infer carefully#

The next step is learned structure, not hand-written superstition#