Opening — Why this matters now
For years, scaling AI meant one thing: train bigger models, then freeze them. At inference time, we search harder, sample wider, and hope brute force compensates for epistemic limits. This paper challenges that orthodoxy. It argues—quietly but decisively—that search alone is no longer enough. If discovery problems are truly out-of-distribution, then the model must be allowed to learn at test time.
Background — From Best-of-N to brittle ceilings
Prior approaches such as Best-of-N sampling and evolutionary program search (e.g., AlphaEvolve) treat the model as immutable. They explore the solution space, but the policy itself never improves. This creates a structural ceiling: no matter how much compute is spent, the model cannot internalize what it discovers. Search helps it guess better, but never lets it understand.
What the paper does — Test-Time Training to Discover (TTT-Discover)
The authors propose TTT-Discover, a framework that reframes a single test problem as a reinforcement learning environment. Instead of optimizing average reward across tasks, the objective is singular: find one solution that beats the current state of the art.
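To see what "a single test problem as an RL environment" could mean mechanically, here is a minimal sketch. The names (`SingleProblemEnv`, `score_candidate`) and the toy scorer are illustrative assumptions, not the paper's; in the real tasks a candidate would be a mathematical construction, a CUDA kernel, or a program, and scoring would mean evaluating or running it.

```python
# Hypothetical sketch: one fixed problem instance wrapped as an RL-style environment.
# Class and function names are illustrative, not taken from the paper.

def score_candidate(candidate: list[float]) -> float:
    """Stand-in task score (higher is better). A real task would execute the
    candidate artifact, e.g. time a GPU kernel or evaluate a construction."""
    return -sum((x - 0.5) ** 2 for x in candidate)


class SingleProblemEnv:
    """The whole 'environment' is one test problem; an episode is one proposal."""

    def __init__(self, sota_score: float):
        self.sota_score = sota_score        # best known human or AI result
        self.best_seen = float("-inf")      # best score found during this run

    def step(self, candidate: list[float]) -> tuple[float, bool]:
        reward = score_candidate(candidate)           # continuous reward signal
        self.best_seen = max(self.best_seen, reward)  # only the best discovery counts
        solved = self.best_seen > self.sota_score     # success = beating state of the art
        return reward, solved


# Example: a single proposal against a made-up state-of-the-art score.
env = SingleProblemEnv(sota_score=-0.01)
reward, solved = env.step([0.5] * 8)   # solved flips to True once best_seen beats SOTA
```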
Key design choices:
- The policy continues training during inference
- Learning is biased toward the most promising candidates
- State reuse replaces long rollouts, keeping training tractable
- Continuous reward signals enable fine-grained improvement
In short, the model is no longer a passive sampler. It becomes an adaptive problem-solver.
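To make these design choices concrete, here is a minimal, self-contained sketch of a test-time training loop on one problem. Everything in it is an assumption for illustration: the policy is a toy Gaussian proposal rather than an LLM, the reward is a stand-in scorer, and the update is a simple elite-averaging (cross-entropy-method-style) step rather than the paper's RL objective. It only shows the loop structure: sample, score, keep the best, bias learning toward the top candidates, and reuse the best state instead of long rollouts.

```python
import random


def score(candidate: list[float]) -> float:
    """Stand-in continuous reward: higher is better, optimum at all 0.5."""
    return -sum((x - 0.5) ** 2 for x in candidate)


def test_time_train(dim: int = 8, iters: int = 200, pop: int = 32, top_k: int = 8,
                    sota: float = -0.01, seed: int = 0) -> tuple[float, list[float]]:
    """Keep adapting a toy policy (Gaussian proposal) on a single problem instance."""
    rng = random.Random(seed)
    mean = [rng.uniform(-1.0, 1.0) for _ in range(dim)]   # policy parameters
    sigma = 0.5                                            # exploration scale
    best_score, best_cand = float("-inf"), list(mean)

    for _ in range(iters):
        # Sample candidates from the current (still-learning) policy.
        candidates = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(pop)]
        # State reuse: keep the best solution found so far in the pool rather than
        # re-deriving it with a fresh, long rollout.
        candidates.append(best_cand)
        candidates.sort(key=score, reverse=True)

        # The objective is the single best discovery, not the average reward.
        top_score = score(candidates[0])
        if top_score > best_score:
            best_score, best_cand = top_score, candidates[0]

        # Bias learning toward the most promising candidates (top-k elites).
        elites = candidates[:top_k]
        mean = [sum(e[i] for e in elites) / top_k for i in range(dim)]
        sigma = max(0.05, 0.95 * sigma)                    # anneal exploration

        if best_score > sota:                              # stop once the target is beaten
            break

    return best_score, best_cand


if __name__ == "__main__":
    best, _ = test_time_train()
    print(f"best score found during the test-time run: {best:.6f}")
```

The point is the control flow, not the toy update rule: the same model that proposes candidates keeps updating on the reward signal from this one problem until the target is beaten or the budget runs out.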
Results — Learning beats search
Across mathematics, GPU kernel engineering, algorithmic contests, and biology, TTT-Discover consistently surpasses both human and AI baselines.
| Domain | Metric | Best Prior | TTT-Discover |
|---|---|---|---|
| Erdős Min Overlap | Overlap ↓ | 0.380927 | 0.380876 |
| GPU Kernel (H100) | Runtime (μs) ↓ | 1371 | 1161 |
| AtCoder Heuristic | Score ↑ | 566,997 | 567,062 |
| Single-cell Denoising | Score ↑ | 0.64 | 0.71 |
Crucially, these gains are not cosmetic. In kernel optimization, the learned solutions are structurally different—deeper fusion, mixed precision, and non-obvious operator ordering.
Why this is different
Standard RL optimizes policies. TTT-Discover optimizes outcomes. The policy is disposable; the discovery is not. This inversion explains why test-time learning pays off here, where classical RL would be wasteful: there is no distribution of future tasks over which to amortize the trained policy, so all that matters is the single best solution it produces along the way.
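In symbols (my notation, not the paper's), classical RL maximizes expected return under the policy, while discovery cares only about the best artifact produced during the run:

$$
J_{\text{RL}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]
\qquad \text{vs.} \qquad
J_{\text{discover}} = \max_{1 \le t \le T} R(x_t), \quad x_t \sim \pi_{\theta_t},
$$

where the parameters $\theta_t$ keep updating over the single test-time run and are discarded afterwards; only the best candidate found, $x_{t^*}$ with $t^* = \arg\max_t R(x_t)$, is kept.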
The framework also aligns with Sutton’s “bitter lesson”: general methods that scale with computation, learning above all, eventually beat hand-crafted heuristics. Even at inference time.
Implications — Discovery as a service
This paper quietly redefines the cost curve of scientific progress. With open models and a few hundred dollars of test-time compute, state-of-the-art results in the domains tested become reproducible.
For businesses, this suggests a shift:
- Agents should not just act—they should adapt
- Inference budgets may matter more than training budgets
- Competitive advantage moves from data hoarding to learning loops
Conclusion
TTT-Discover is not another scaling trick. It is a philosophical correction. If problems are novel, models must be allowed to change while solving them. Search guesses. Learning remembers.
Cognaptus: Automate the Present, Incubate the Future.