Opening — Why this matters now
For years, scaling AI meant one thing: train bigger models, then freeze them. At inference time, we search harder, sample wider, and hope brute force compensates for epistemic limits. This paper challenges that orthodoxy. It argues—quietly but decisively—that search alone is no longer enough. If discovery problems are truly out-of-distribution, then the model must be allowed to learn at test time.
Background — From Best-of-N to brittle ceilings
Prior approaches such as Best-of-N sampling and evolutionary program search (e.g., AlphaEvolve) treat the model as immutable. They explore the solution space, but the policy itself never improves. This creates a structural ceiling: no matter how much compute is spent, the model cannot internalize what it discovers. Search helps it guess better, but never lets it understand.
What the paper does — Test-Time Training to Discover (TTT-Discover)
The authors propose TTT-Discover, a framework that reframes a single test problem as a reinforcement learning environment. Instead of optimizing average reward across tasks, the objective is singular: find one solution that beats the current state of the art.
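To see what "a single test problem as an RL environment" could mean mechanically, here is a minimal sketch. The names (`SingleProblemEnv`, `score_candidate`) and the toy scorer are illustrative assumptions, not the paper's; in the real tasks a candidate would be a mathematical construction, a CUDA kernel, or a program, and scoring would mean evaluating or running it.

```python
# Hypothetical sketch: one fixed problem instance wrapped as an RL-style environment.
# Class and function names are illustrative, not taken from the paper.

def score_candidate(candidate: list[float]) -> float:
    """Stand-in task score (higher is better). A real task would execute the
    candidate artifact, e.g. time a GPU kernel or evaluate a construction."""
    return -sum((x - 0.5) ** 2 for x in candidate)


class SingleProblemEnv:
    """The whole 'environment' is one test problem; an episode is one proposal."""

    def __init__(self, sota_score: float):
        self.sota_score = sota_score        # best known human or AI result
        self.best_seen = float("-inf")      # best score found during this run

    def step(self, candidate: list[float]) -> tuple[float, bool]:
        reward = score_candidate(candidate)           # continuous reward signal
        self.best_seen = max(self.best_seen, reward)  # only the best discovery counts
        solved = self.best_seen > self.sota_score     # success = beating state of the art
        return reward, solved


# Example: a single proposal against a made-up state-of-the-art score.
env = SingleProblemEnv(sota_score=-0.01)
reward, solved = env.step([0.5] * 8)   # solved flips to True once best_seen beats SOTA
```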
Key design choices:
- The policy continues training during inference
- Learning is biased toward the most promising candidates
- State reuse replaces long rollouts, keeping training tractable
- Continuous reward signals enable fine-grained improvement
In short, the model is no longer a passive sampler. It becomes an adaptive problem-solver.
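To make these design choices concrete, here is a minimal, self-contained sketch of a test-time training loop on one problem. Everything in it is an assumption for illustration: the policy is a toy Gaussian proposal rather than an LLM, the reward is a stand-in scorer, and the update is a simple elite-averaging (cross-entropy-method-style) step rather than the paper's RL objective. It only shows the loop structure: sample, score, keep the best, bias learning toward the top candidates, and reuse the best state instead of long rollouts.

```python
import random


def score(candidate: list[float]) -> float:
    """Stand-in continuous reward: higher is better, optimum at all 0.5."""
    return -sum((x - 0.5) ** 2 for x in candidate)


def test_time_train(dim: int = 8, iters: int = 200, pop: int = 32, top_k: int = 8,
                    sota: float = -0.01, seed: int = 0) -> tuple[float, list[float]]:
    """Keep adapting a toy policy (Gaussian proposal) on a single problem instance."""
    rng = random.Random(seed)
    mean = [rng.uniform(-1.0, 1.0) for _ in range(dim)]   # policy parameters
    sigma = 0.5                                            # exploration scale
    best_score, best_cand = float("-inf"), list(mean)

    for _ in range(iters):
        # Sample candidates from the current (still-learning) policy.
        candidates = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(pop)]
        # State reuse: keep the best solution found so far in the pool rather than
        # re-deriving it with a fresh, long rollout.
        candidates.append(best_cand)
        candidates.sort(key=score, reverse=True)

        # The objective is the single best discovery, not the average reward.
        top_score = score(candidates[0])
        if top_score > best_score:
            best_score, best_cand = top_score, candidates[0]

        # Bias learning toward the most promising candidates (top-k elites).
        elites = candidates[:top_k]
        mean = [sum(e[i] for e in elites) / top_k for i in range(dim)]
        sigma = max(0.05, 0.95 * sigma)                    # anneal exploration

        if best_score > sota:                              # stop once the target is beaten
            break

    return best_score, best_cand


if __name__ == "__main__":
    best, _ = test_time_train()
    print(f"best score found during the test-time run: {best:.6f}")
```

The point is the control flow, not the toy update rule: the same model that proposes candidates keeps updating on the reward signal from this one problem until the target is beaten or the budget runs out.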
Results — Learning beats search
Across mathematics, GPU kernel engineering, algorithmic contests, and biology, TTT-Discover consistently surpasses both human and AI baselines.
| Domain | Metric | Best Prior | TTT-Discover |
|---|---|---|---|
| Erdős Min Overlap | Overlap ↓ | 0.380927 | 0.380876 |
| GPU Kernel (H100) | Runtime (μs) ↓ | 1371 | 1161 |
| AtCoder Heuristic | Score ↑ | 566,997 | 567,062 |
| Single-cell Denoising | Score ↑ | 0.64 | 0.71 |
Crucially, these gains are not cosmetic. In kernel optimization, the learned solutions are structurally different—deeper fusion, mixed precision, and non-obvious operator ordering.
Why this is different
Standard RL optimizes policies. TTT-Discover optimizes outcomes. The policy is disposable; the discovery is not. This inversion explains why test-time learning pays off here, where classical RL would be wasteful: there is no distribution of future tasks over which to amortize the trained policy, so all that matters is the single best solution it produces along the way.
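In symbols (my notation, not the paper's), classical RL maximizes expected return under the policy, while discovery cares only about the best artifact produced during the run:

$$
J_{\text{RL}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\right]
\qquad \text{vs.} \qquad
J_{\text{discover}} = \max_{1 \le t \le T} R(x_t), \quad x_t \sim \pi_{\theta_t},
$$

where the parameters $\theta_t$ keep updating over the single test-time run and are discarded afterwards; only the best candidate found, $x_{t^*}$ with $t^* = \arg\max_t R(x_t)$, is kept.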
The framework also aligns with Sutton’s “bitter lesson”: general methods that scale with computation, learning above all, eventually beat hand-crafted heuristics. Even at inference time.
Implications — Discovery as a service
This paper quietly redefines the cost curve of scientific progress. With open models and a few hundred dollars of test-time compute, state-of-the-art results in the domains tested become reproducible.
For businesses, this suggests a shift:
- Agents should not just act—they should adapt
- Inference budgets may matter more than training budgets
- Competitive advantage moves from data hoarding to learning loops
Conclusion
TTT-Discover is not another scaling trick. It is a philosophical correction. If problems are novel, models must be allowed to change while solving them. Search guesses. Learning remembers.
Cognaptus: Automate the Present, Incubate the Future.