Opening — Why this matters now

Neural theorem proving has entered its industrial phase. With reinforcement learning pipelines, synthetic data factories, and search budgets that would make a chess engine blush, models like DeepSeek‑Prover‑V1.5 are widely assumed to have internalized everything there is to know about formal proof structure.

This paper politely disagrees.

Under tight inference budgets—no massive tree search, no thousand-sample hail‑Mary—the author shows that simple, almost embarrassingly old‑fashioned structural hints still deliver large gains. Not new models. Not more data. Just better scaffolding.

That should make anyone building cost‑sensitive AI systems uncomfortable—in a productive way.


Background — Scale solved reasoning… allegedly

Modern neural provers follow a familiar arc:

  1. Pretrain a large language model
  2. Fine‑tune it on tactic scripts
  3. Reinforce it through self‑play and verification loops
  4. Throw compute at inference

The implicit promise is clear: structure will emerge. Given enough RL, the model will “just know” when to intro, when to simp, when to linarith, and when to give up.

Reality is messier. Even elite provers still fail on:

  • trivial syntax slips
  • hallucinated identifiers
  • tactic dead‑ends reached too early

In other words: they fail structurally, not mathematically.

This paper asks a refreshingly unglamorous question: what if we simply reminded them how proofs usually start?


Analysis — What the paper actually does

The intervention is minimal by design.

Instead of drawing all 16 completions from a single bare prompt, the author introduces a Structured Intermediate Representation (IR):

$$q = (x, s)$$

Where:

  • $x$ is the theorem statement
  • $s$ is a tactic skeleton — a short, fixed prefix like simp, intro, or constructor

At inference time:

  • 15 predefined skeletons are cycled through
  • each skeleton seeds one generation
  • total samples remain k = 16
  • token budget remains 1024 tokens per attempt

No search. No learning. No adaptivity.

Just a fixed schedule of structural nudges.
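
A minimal sketch of that schedule, assuming a generic generate() sampling call and a hypothetical prompt layout (neither is the paper's code):

    # Minimal sketch of the fixed skeleton schedule.
    # generate() and the prompt format are placeholders, not the paper's code.
    SKELETONS = ["intro", "intros", "constructor", "simp", "norm_num",
                 "linarith", "nlinarith", "ring", "aesop"]  # abridged; the paper uses 15

    def generate(prompt: str, max_tokens: int = 1024) -> str:
        """Stand-in for stochastic sampling from the prover LLM."""
        return " sorry"  # replace with a real model call

    def structured_attempts(theorem: str, k: int = 16) -> list[str]:
        attempts = []
        for i in range(k):
            s = SKELETONS[i % len(SKELETONS)]  # cycle the fixed schedule
            q = f"{theorem}\n  {s}"            # q = (x, s): statement plus forced prefix
            attempts.append(s + generate(q))   # skeleton + sampled continuation
        return attempts

Every sample keeps the same token budget and is checked by the verifier as usual; the only difference from the baseline is the forced opening tactic.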

Skeletons used (abridged)

Category         Examples
Setup            intro, intros, constructor
Simplification   simp, norm_num
Algebraic        linarith, nlinarith, ring
Heuristic        aesop, simp → try aesop

The model still samples stochastically. The only change is where it is forced to begin.
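
To make “where it is forced to begin” concrete, here is a toy Lean 4 proof in the same shape (my illustration, not an example from the paper):

    -- Toy illustration, not from the paper: the schedule pins the first
    -- tactic; the continuation is what the model actually samples.
    theorem toy : ∀ n : ℕ, n + 0 = n := by
      intro n   -- skeleton prefix, fixed by the schedule
      simp      -- model-generated continuation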


Findings — The numbers that matter

Headline result

Method              Solved     Pass@16
Unguided baseline   37 / 244   15.16%
Structured IR       53 / 244   21.72%

That is a 43% relative improvement ((53 − 37) / 37 ≈ 0.43) under identical inference constraints.

Paired outcome analysis

Outcome                  Count
Solved by both           34
Structured‑only solves   19
Baseline‑only solves     3

This asymmetry matters.

The structured method doesn’t merely trade one class of problems for another: it gains 19 new solves while giving up only 3, leaving it very nearly dominant over the baseline in low‑compute regimes.

A McNemar test confirms the improvement is statistically significant (p < 0.001).
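
Those discordant counts are enough to reproduce the test; a quick exact computation from the published numbers alone:

    # Exact two-sided McNemar p-value from the paper's discordant pairs.
    from math import comb

    b, c = 19, 3                 # structured-only solves, baseline-only solves
    n = b + c
    # Under H0 the 22 discordant outcomes are fair coin flips.
    p = 2 * sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    print(f"p = {p:.5f}")        # ~ 0.00086, consistent with p < 0.001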

Failure modes (the quiet twist)

Despite the performance gap, failure distributions look nearly identical:

Error Type            Baseline   Structured
Unsolved goals        32.9%      30.4%
Syntax errors         30.0%      29.3%
Unknown identifiers   18.4%      19.9%

Translation: the skeletons don’t magically prevent errors.

They simply increase the probability that one attempt lands on the right proof shape.
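
A back-of-envelope model makes that mechanism visible. Both rates below are invented for illustration, not measurements from the paper:

    # Illustrative only: p_shape and p_given are assumptions for this sketch.
    p_shape = 1 / 15   # chance an unguided sample happens to open with the right tactic
    p_given = 0.6      # chance of finishing the proof once the opening is right
    k = 16

    p_unguided = 1 - (1 - p_shape * p_given) ** k  # must stumble on the opening by luck
    p_seeded   = p_given  # the schedule guarantees one attempt with the right opening
    print(f"unguided ~ {p_unguided:.2f}, structured >= {p_seeded:.2f}")  # ~0.48 vs >=0.60

Under these invented numbers, guaranteeing coverage of the openings beats hoping stochastic sampling finds them.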


Implications — Why this is bigger than Lean

This paper quietly undermines a common assumption in AI engineering:

“If the model is good enough, structure becomes unnecessary.”

In practice:

  • Structure is under‑utilized, not obsolete
  • RL optimizes expected reward, not worst‑case brittleness
  • Fixed guidance can outperform adaptive sampling when budgets are tight

For businesses, this generalizes cleanly:

Domain            Analogy
Code generation   Forcing file layout before writing functions
Agents            Explicit plan‑then‑act separation
RAG               Query decomposition before retrieval
Automation        Workflow templates instead of free‑form prompts

Cheap structure beats expensive flailing.


Limitations — What this does not claim

  • This does not beat leaderboard systems running k=1024
  • Proofs are truncated by a 1024‑token ceiling
  • Skeletons are hand‑crafted, not learned
  • Results are from a single evaluation run

But that is precisely the point: under realistic constraints, this works.


Conclusion — The unfashionable lesson

Scaling works. But it leaks.

Even the best reinforcement‑trained theorem provers still benefit from something very human: being told how to start.

Sometimes the difference between 15% and 22% accuracy is not more intelligence—but better manners.

Cognaptus: Automate the Present, Incubate the Future.