Opening — Why this matters now

The AI industry is currently obsessed with scale: more parameters, more tokens, more test-time compute. But a recent paper, Tool Building as a Path to “Superintelligence”, quietly suggests something more structural.

The real bottleneck may not be model size.

It may be a single number: γ (gamma) — the probability that, at each reasoning step, the model proposes the correct next move.

If γ stays bounded below by a constant as problems get deeper, the cost of search-based reasoning grows only polynomially. If γ collapses with depth, no amount of brute-force search will save you.

In other words: superintelligence might be a systems engineering question, not a scaling law.

This paper builds a benchmark specifically to test that claim — and the results are uncomfortable for anyone who thinks larger transformers alone will carry us forward.


Background — The Diligent Learner and the Fragility of γ

The Diligent Learner framework models reasoning as validator-guided search. At each partial-solution prefix $h$, a model proposes candidate next steps, and a validator filters out incorrect extensions.

The core requirement is simple but brutal:

$$ \Pr[a \text{ is a good next step} \mid h] \geq \gamma $$

If γ remains bounded below by a constant, the search cost grows roughly as:

$$ O\left(T_{\max} \cdot \frac{\log(T_{\max}/\delta)}{\gamma}\right) $$

Polynomial. Manageable.

But if γ decays with reasoning depth, the guarantee collapses.
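A quick numeric sketch makes the contrast concrete. This applies the bound above with constants dropped; the halving rate for the collapsing case is an assumption chosen purely for illustration:

```python
import math

def budget(T_max, gamma, delta=0.05):
    # Search-cost bound O(T_max * log(T_max / delta) / gamma), constants dropped
    return T_max * math.log(T_max / delta) / gamma

depths = (10, 100, 1000)

# Constant gamma: the bound grows polynomially (roughly T_max * log T_max).
flat = [budget(T, 0.3) for T in depths]

# Gamma halving every 10 steps of depth: the same bound explodes.
decayed = [budget(T, 0.3 * 0.5 ** (T / 10)) for T in depths]
```

At depth 1000 the constant-γ budget is a few hundred times the depth-10 budget; the decaying-γ budget is astronomically larger.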

Most existing reasoning benchmarks cannot isolate this. They:

  • Score only final answers.
  • Allow multiple valid reasoning paths.
  • Permit statistical shortcuts or memorization.

So we never truly measure γ.

This paper designs a benchmark that does.


Analysis — A Benchmark Designed to Break Shortcuts

The authors construct a GF(2) Boolean circuit reconstruction task in Algebraic Normal Form (ANF). At each step $g$, the model must recover exactly one new monomial term.

Crucially:

  • There is exactly one correct next extension.
  • Success requires integrating:
    • The revealed prefix (history)
    • Fresh, step-specific evidence (new samples)
  • Data-only and history-only shortcuts are information-theoretically defeated.

The oracle is adversarially designed:

  • History-only solvers gain no information about the next support.
  • Data-only solvers see statistically obfuscated labels.
  • Only a solver that subtracts the prefix mask from new evidence can recover the signal.

This transforms reasoning into a clean experimental probe of γ.
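To see the mechanism concretely, here is a minimal sketch of prefix-mask subtraction on a toy ANF instance. This is our own simplified construction, not the paper's oracle, which additionally obfuscates labels against shortcut solvers:

```python
def eval_anf(monomials, x):
    """XOR of AND-monomials over GF(2); each monomial is a set of bit indices."""
    return sum(all(x[i] for i in m) for m in monomials) % 2

def recover_next_monomial(oracle, prefix, n):
    """Diligent step: XOR-subtract the known prefix, then read the support
    of the isolated new monomial off n flip queries."""
    residual = lambda x: oracle(x) ^ eval_anf(prefix, x)  # new monomial alone
    ones = [1] * n
    assert residual(ones) == 1, "no new monomial detected"
    # bit i is in the support iff clearing it kills the residual
    return {i for i in range(n) if residual(ones[:i] + [0] + ones[i + 1:]) == 0}

truth = [{0, 1}, {2}, {1, 3}]                      # x0*x1 + x2 + x1*x3 over GF(2)
step3 = lambda x: eval_anf(truth, x)               # oracle after three revealed terms
print(recover_next_monomial(step3, truth[:2], 4))  # -> {1, 3}
```

Without the XOR subtraction, the flip queries mix contributions from every revealed term, and the support of the new monomial is no longer readable.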

Estimator Classes

The benchmark compares four estimator types:

| Estimator | Access | Expected γ Behavior |
|---|---|---|
| A (Diligent) | Prefix + Data | Sustained γ |
| B (Data-only) | Data only | Collapses |
| C (History-only) | Prefix only | Random baseline |
| D (Partial) | Imperfect access | Degrades with depth |

Simulation results confirm the theory:

  • Only Estimator A maintains high γ across depths.
  • All others approach chance as $g$ increases.

The separation is not philosophical — it is measurable.
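The A-versus-B separation is easy to reproduce under a toy stand-in for the oracle (again, our simplified construction without the paper's label obfuscation): an estimator that subtracts the prefix recovers the new term exactly, while a data-only flip test on the mixed function collapses toward chance:

```python
import random

def eval_anf(monomials, x):
    # XOR of AND-monomials over GF(2); each monomial is a set of bit indices
    return sum(all(x[i] for i in m) for m in monomials) % 2

def flip_test(f, n):
    # Read a support off flip queries -- only sound when f is a single monomial
    ones = [1] * n
    return {i for i in range(n) if f(ones[:i] + [0] + ones[i + 1:]) != f(ones)}

def measure_gamma(depth, n=12, trials=200, seed=0):
    """Monte-Carlo per-step accuracy for a diligent vs. a data-only estimator."""
    rng = random.Random(seed)
    hits = {"diligent": 0, "data_only": 0}
    for _ in range(trials):
        truth = [set(rng.sample(range(n), rng.randint(1, 3))) for _ in range(depth)]
        target, prefix = truth[-1], truth[:-1]
        f_g = lambda x, t=truth: eval_anf(t, x)
        residual = lambda x, f=f_g, p=prefix: f(x) ^ eval_anf(p, x)  # prefix subtracted
        hits["diligent"] += flip_test(residual, n) == target
        hits["data_only"] += flip_test(f_g, n) == target
    return {k: v / trials for k, v in hits.items()}
```

In this toy run the diligent estimator recovers the target at every depth, while the data-only estimator's per-step accuracy sits near zero once several terms are mixed into the labels.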


Findings — Small LLMs vs Frontier Models

Small LLMs: Depth-Induced Collapse

When evaluated on Qwen3 variants (4B and 30B), the results are clear:

  • Performance declines superlinearly with depth.
  • “Thinking” variants help at shallow levels.
  • Beyond moderate depth (e.g., $g = 15$), γ approaches the trivial baseline.

The authors model this as an effective prefix utilization problem:

| Model | Effective Prefix Utilization | Interpretation |
|---|---|---|
| 30B Thinking | ~77% of prefix | Proportional scaling |
| 30B Instruct | ~15% | Weak scaling |
| 4B variants | Minimal | Capacity-limited |

Small models behave like partial-information estimators.

They do not fully condition on the growing prefix. As depth grows, mask cancellation fails.

Frontier Models: Tool Usage Changes Everything

Frontier models (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro) were tested at depths up to $g = 127$.

Two regimes emerged:

| Condition | γ at Large Depth |
|---|---|
| No tools | Degrades sharply |
| Tools allowed | Near unity, even at $g = 127$ |

Tool-enabled models maintained almost constant γ.

That is the central result.

Tool use effectively separates:

  1. Constraint inference (what to compute)
  2. Execution (how to compute it)

Without tools, transformers must both discover constraints and simulate execution internally. With tools, they externalize computation.

This dramatically stabilizes γ.
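One way to see the stabilization, under our own toy error model rather than the paper's analysis: if internal simulation performs roughly depth-many error-prone cancellations per step while a tool executes the same arithmetic exactly, per-step success decays geometrically without tools and stays flat with them. The slip rate `eps` is an assumed parameter:

```python
def per_step_success(depth, eps=0.02):
    """Toy model: eps is an assumed per-operation slip rate for internal
    simulation; external tools execute the same arithmetic deterministically."""
    return {"no_tools": (1 - eps) ** depth, "tools": 1.0}

for g in (15, 63, 127):
    print(g, per_step_success(g))
```

Even a 2% slip rate per cancellation drives the no-tools success probability below 10% by depth 127, while the tool-enabled path is untouched by depth.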


Implications — Superintelligence as Architecture, Not Just Scale

The paper reframes the “superintelligence” debate.

The bottleneck is not merely:

  • Parameter count
  • Training data
  • Test-time search budget

It is architectural.

Specifically:

Can the system maintain non-vanishing γ across arbitrarily long reasoning horizons?

Tool use provides a structural answer:

  • External memory
  • External computation
  • Deterministic validation

This reduces the internal burden on the transformer’s weights.

Business Implications

For AI operators and infrastructure builders, three conclusions follow:

  1. Tool integration is not optional for reliable long-horizon reasoning.
  2. Evaluating models solely on final-answer benchmarks is strategically naive.
  3. Systems that externalize execution may dominate purely end-to-end architectures.

This shifts the investment thesis:

  • Infrastructure > isolated model capability
  • Orchestration > raw parameter growth
  • Tool ecosystems > monolithic intelligence

The path to scalable reasoning may look more like distributed systems engineering than cognitive mysticism.


Conclusion — The Gamma Test

The authors do not claim to have built superintelligence.

They did something subtler and more valuable:

They built a test that exposes whether γ survives depth.

Small models fail. Frontier models succeed — but primarily when allowed to build and use tools.

If the Diligent Learner hypothesis holds, then the future of AI will not be determined by who trains the largest model.

It will be determined by who designs the best toolchain.

Superintelligence, it seems, may be less about divine emergence and more about disciplined engineering.

And γ is watching.

Cognaptus: Automate the Present, Incubate the Future.