Opening — Why this matters now
The AI industry is currently obsessed with scale: more parameters, more tokens, more test-time compute. But a recent paper, Tool Building as a Path to “Superintelligence”, quietly suggests something more structural.
The real bottleneck may not be model size.
It may be a single number: γ (gamma) — the probability that, at each reasoning step, the model proposes the correct next move.
If γ stays constant as problems get deeper, search-based reasoning scales polynomially. If γ collapses with depth, no amount of brute-force search will save you.
In other words: superintelligence might be a systems engineering question, not a scaling law.
This paper builds a benchmark specifically to test that claim — and the results are uncomfortable for anyone who thinks larger transformers alone will carry us forward.
Background — The Diligent Learner and the Fragility of γ
The Diligent Learner framework models reasoning as validator-guided search. At each partial solution prefix $h$, a model proposes candidate next steps. A validator filters incorrect extensions.
The core requirement is simple but brutal:
$$ \Pr[a \text{ is a good next step}] \geq \gamma $$
If γ remains bounded below by a constant, the search cost grows roughly as:
$$ O\left(T_{\max} \cdot \frac{\log(T_{\max}/\delta)}{\gamma}\right) $$
Polynomial. Manageable.
But if γ decays with reasoning depth, the guarantee collapses.
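The difference is easy to see numerically. Below is a minimal sketch, assuming a per-step sampling model in which each step keeps drawing candidates until one passes the validator (so step $t$ costs roughly $\log(T_{\max}/\delta)/\gamma(t)$ samples, matching the bound above); the decay rate is an illustrative choice, not a figure from the paper:

```python
import math

def search_cost(T_max: int, gamma_at, delta: float = 0.05) -> float:
    """Total samples for validator-guided search: at each step t we draw
    candidates until one good step survives, with failure budget delta/T_max."""
    total = 0.0
    for t in range(1, T_max + 1):
        total += math.log(T_max / delta) / gamma_at(t)  # samples at step t
    return total

T = 50
constant = search_cost(T, lambda t: 0.3)            # gamma bounded below
decaying = search_cost(T, lambda t: 0.3 * 0.9**t)   # gamma shrinks with depth
print(f"constant gamma: {constant:.0f} samples")
print(f"decaying gamma: {decaying:.0f} samples")
```

With constant γ the total grows linearly in depth; with even mild geometric decay it explodes, which is the collapse the authors worry about.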
Most existing reasoning benchmarks cannot isolate this. They:
- Score only final answers.
- Allow multiple valid reasoning paths.
- Permit statistical shortcuts or memorization.
So we never truly measure γ.
This paper designs a benchmark that does.
Analysis — A Benchmark Designed to Break Shortcuts
The authors construct a GF(2) Boolean circuit reconstruction task in Algebraic Normal Form (ANF). At each step $g$, the model must recover exactly one new monomial term.
Crucially:

- There is exactly one correct next extension.
- Success requires integrating:
  - The revealed prefix (history)
  - Fresh, step-specific evidence (new samples)
- Data-only and history-only shortcuts are information-theoretically defeated.
The oracle is adversarially designed:
- History-only solvers gain no information about the next support.
- Data-only solvers see statistically obfuscated labels.
- Only a solver that subtracts the prefix mask from new evidence can recover the signal.
This transforms reasoning into a clean experimental probe of γ.
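The prefix-subtraction mechanism is just GF(2) linearity. A toy sketch (a hypothetical 6-variable instance, not the paper's actual construction) shows why XORing out the known prefix isolates the hidden tail exactly:

```python
import random

def eval_anf(terms, x):
    """Evaluate an ANF polynomial over GF(2): the XOR (sum mod 2) of its monomials."""
    return sum(all(x[i] for i in m) for m in terms) % 2

# Hypothetical instance; the first two monomials are the revealed prefix.
random.seed(0)
n = 6
secret = [frozenset({0}), frozenset({1, 2}), frozenset({3, 4, 5})]
prefix = secret[:2]
xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(32)]
ys = [eval_anf(secret, x) for x in xs]

# XORing out the prefix mask leaves a residual that depends only on the
# still-hidden tail -- exactly the cancellation a diligent solver performs.
residual = [y ^ eval_anf(prefix, x) for x, y in zip(xs, ys)]
tail = [eval_anf(secret[2:], x) for x in xs]
assert residual == tail
```

Because ANF evaluation is linear over GF(2), this identity holds on every sample, so the residual is an uncorrupted signal about the next monomial.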
Estimator Classes
The benchmark compares four estimator types:
| Estimator | Access | Expected γ Behavior |
|---|---|---|
| A (Diligent) | Prefix + Data | Sustained γ |
| B (Data-only) | Data only | Collapses |
| C (History-only) | Prefix only | Random baseline |
| D (Partial) | Imperfect access | Degrades with depth |
Simulation results confirm the theory:
- Only Estimator A maintains high γ across depths.
- All others approach chance as $g$ increases.
The separation is not philosophical — it is measurable.
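The A-vs-B separation can be reproduced in a toy Monte-Carlo (my own sketch, not the authors' code: degree-2 monomials over 8 variables, with `gamma_hat` as a stand-in for the per-step success rate). Estimator A subtracts the prefix before scoring candidates; estimator B scores against raw labels:

```python
import random
from itertools import combinations

def eval_anf(terms, x):
    """Evaluate an ANF polynomial over GF(2): XOR of ANDed monomials."""
    return sum(all(x[i] for i in m) for m in terms) % 2

def gamma_hat(g, subtract_prefix, n=8, n_samples=64, trials=100):
    """Toy per-step success rate: pick the degree-2 monomial that best
    explains the residual. Only the prefix-subtracting estimator (A)
    sees an uncorrupted signal; the data-only estimator (B) does not."""
    random.seed(g * 2 + subtract_prefix)
    candidates = [frozenset(c) for c in combinations(range(n), 2)]
    hits = 0
    for _ in range(trials):
        prefix = random.sample(candidates, g)
        target = random.choice([c for c in candidates if c not in prefix])
        xs = [tuple(random.randint(0, 1) for _ in range(n))
              for _ in range(n_samples)]
        ys = [eval_anf(prefix + [target], x) for x in xs]
        if subtract_prefix:                       # estimator A
            res = [y ^ eval_anf(prefix, x) for x, y in zip(xs, ys)]
        else:                                     # estimator B (data-only)
            res = ys
        best = max(candidates, key=lambda c: sum(
            eval_anf([c], x) == r for x, r in zip(xs, res)))
        hits += (best == target)
    return hits / trials

for g in (1, 4, 12):
    print(g, gamma_hat(g, True), gamma_hat(g, False))
```

Estimator A stays near 1.0 at every depth, while B's accuracy drops toward chance as the unsubtracted prefix parity drowns the target's signal.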
Findings — Small LLMs vs Frontier Models
Small LLMs: Depth-Induced Collapse
When evaluated on Qwen3 variants (4B and 30B), the results are clear:
- Performance declines superlinearly with depth.
- “Thinking” variants help at shallow levels.
- Beyond moderate depth (e.g., $g = 15$), γ approaches the trivial baseline.
The authors model this as an effective prefix utilization problem:
| Model | Effective Prefix Utilization | Interpretation |
|---|---|---|
| 30B Thinking | ~77% of prefix | Proportional scaling |
| 30B Instruct | ~15% | Weak scaling |
| 4B variants | Minimal | Capacity-limited |
Small models behave like partial-information estimators.
They do not fully condition on the growing prefix. As depth grows, mask cancellation fails.
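This failure mode can be caricatured in a few lines. Assuming (hypothetically) that a model cancels only a fraction `rho` of the $g$ prefix monomials, the leftover terms act as GF(2) noise on the residual, and the chance of reading a clean signal drifts toward 1/2 as depth grows. The utilization fractions below mirror the table; the degree-2 monomial model is my simplification, not the paper's:

```python
import random

def gamma_estimate(g, rho, trials=2000, n=16):
    """Monte-Carlo: chance that a solver which cancels only a fraction rho
    of the g prefix monomials still reads a clean residual bit on a sample."""
    random.seed(g * 1000 + int(rho * 100))
    hits = 0
    for _ in range(trials):
        # uncancelled monomials remain in the residual as noise
        missed = [random.sample(range(n), 2)
                  for _ in range(round(g * (1 - rho)))]
        x = [random.randint(0, 1) for _ in range(n)]
        noise = sum(all(x[i] for i in m) for m in missed) % 2
        hits += (noise == 0)   # residual is clean only if missed terms cancel
    return hits / trials

for rho in (1.0, 0.77, 0.15):
    print(rho, [round(gamma_estimate(g, rho), 2) for g in (5, 15, 30)])
```

Full utilization (`rho = 1.0`) keeps the residual clean at any depth; partial utilization degrades toward a coin flip, which is qualitatively what the 30B Instruct and 4B rows show.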
Frontier Models: Tool Usage Changes Everything
Frontier models (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro) were tested at depths up to $g = 127$.
Two regimes emerged:
| Condition | γ at Large Depth |
|---|---|
| No tools | Degrades sharply |
| Tools allowed | Near unity, even at 127 |
Tool-enabled models maintained almost constant γ.
That is the central result.
Tool use effectively separates:
- Constraint inference (what to compute)
- Execution (how to compute it)
Without tools, transformers must both discover constraints and simulate execution internally. With tools, they externalize computation.
This dramatically stabilizes γ.
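As an illustration of why externalizing execution helps (a hypothetical sketch, not the paper's tool protocol): once a deterministic GF(2) solver is available as a tool, recovering the full monomial support reduces to linear algebra, and the cost no longer compounds with depth. The model only has to infer *what* to solve; the tool executes:

```python
import random
from itertools import combinations

def solve_gf2(A, y):
    """Gaussian elimination over GF(2): return one solution a of A·a = y."""
    rows, cols = len(A), len(A[0])
    M = [row[:] + [b] for row, b in zip(A, y)]   # augmented matrix
    pivots, r = [], 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(rows):
            if i != r and M[i][c]:
                M[i] = [a ^ b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    a = [0] * cols
    for i, c in enumerate(pivots):
        a[c] = M[i][cols]
    return a

# Depth g = 20 monomials is no harder for the tool than g = 3.
random.seed(1)
n, g, n_samples = 8, 20, 128
monos = [frozenset(c) for c in combinations(range(n), 2)]   # 28 candidates
secret = set(random.sample(monos, g))
xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(n_samples)]
A = [[int(all(x[i] for i in m)) for m in monos] for x in xs]
y = [sum(row[j] for j, m in enumerate(monos) if m in secret) % 2 for row in A]
a = solve_gf2(A, y)
assert {m for j, m in enumerate(monos) if a[j]} == secret
```

The transformer never simulates GF(2) arithmetic internally; the solver's determinism is what keeps the per-step success rate from eroding with depth.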
Implications — Superintelligence as Architecture, Not Just Scale
The paper reframes the “superintelligence” debate.
The bottleneck is not merely:
- Parameter count
- Training data
- Test-time search budget
It is architectural.
Specifically:
Can the system maintain non-vanishing γ across arbitrarily long reasoning horizons?
Tool use provides a structural answer:
- External memory
- External computation
- Deterministic validation
This reduces the internal burden on the transformer’s weights.
Business Implications
For AI operators and infrastructure builders, three conclusions follow:
- Tool integration is not optional for reliable long-horizon reasoning.
- Evaluating models solely on final-answer benchmarks is strategically naive.
- Systems that externalize execution may dominate purely end-to-end architectures.
This shifts the investment thesis:
- Infrastructure > isolated model capability
- Orchestration > raw parameter growth
- Tool ecosystems > monolithic intelligence
The path to scalable reasoning may look more like distributed systems engineering than cognitive mysticism.
Conclusion — The Gamma Test
The authors do not claim to have built superintelligence.
They did something subtler and more valuable:
They built a test that exposes whether γ survives depth.
Small models fail. Frontier models succeed — but primarily when allowed to build and use tools.
If the Diligent Learner hypothesis holds, then the future of AI will not be determined by who trains the largest model.
It will be determined by who designs the best toolchain.
Superintelligence, it seems, may be less about divine emergence and more about disciplined engineering.
And γ is watching.
Cognaptus: Automate the Present, Incubate the Future.