Learning to Discover at Test Time: When Search Learns Back

A leaderboard usually treats an AI model like a very fast intern: give it a problem, let it try many times, keep the best answer, and politely ignore the graveyard of failed attempts.

That is useful. It is also a little strange.

A human engineer does not merely try 25,600 variations of a GPU kernel while keeping the same brain. After the first few failures, she learns which bottlenecks matter. After a lucky partial success, she changes how she thinks about the problem. After enough attempts, the search process is no longer just sampling. It has become learning.

That is the core move in Learning to Discover at Test Time, which introduces Test-Time Training to Discover, or TTT-Discover.¹ The paper’s claim is not simply that more samples improve discovery. We already knew that, and if we needed another reminder, the AI industry has enough “scale it and pray” diagrams to tile a conference hall. The sharper claim is this: for certain discovery problems, the model should continue training on the very test problem it is trying to solve.

This matters because the goal is not to produce a generally better model. The goal is to find one exceptional state: a better mathematical construction, a faster kernel, a higher-scoring algorithm, or a stronger denoising routine. The policy is not the final product. The best discovered solution is.

That small distinction changes the whole design.

Discovery cares about the maximum, not the average

Standard reinforcement learning usually asks for a policy that performs well on average. Train the policy, deploy the policy, and hope it behaves reliably across many future situations. In that setting, expected reward is a natural target.

Discovery is different. If a system produces 999 mediocre kernels and one record-breaking kernel, the discovery problem is happy. The 999 kernels can go to the museum of computational embarrassment. They served their purpose.

The paper formalizes this by defining a discovery as finding a candidate state $s$ whose reward $R(s)$ exceeds the best-known reward $r_{\text{sota}}$. In a kernel task, reward may be inverse runtime. In mathematics, it may be a certified bound. In AtCoder-style algorithm engineering, it may be the official contest score. In single-cell denoising, it may be a benchmark score based on reconstruction metrics.

That turns the usual RL target sideways.

The system does not need a policy that is broadly safe, smooth, and high-performing across all generated attempts. It needs a process that occasionally reaches the frontier. Average quality still helps indirectly, because terrible proposals waste budget, but average quality is not the scoreboard. The scoreboard is the best valid artifact found.
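A minimal sketch of that scoreboard, using made-up reward lists and a hypothetical `R_SOTA` threshold (none of these numbers come from the paper). It only illustrates that a batch with the worse mean can be the only one that contains a discovery.

```python
from statistics import mean

R_SOTA = 0.90  # hypothetical best-known reward, for illustration only

# Two made-up batches of attempt rewards.
steady = [0.70, 0.71, 0.69, 0.70]  # consistent, never reaches the frontier
spiky = [0.20, 0.30, 0.95, 0.10]   # mostly poor, one frontier-crossing attempt


def discoveries(rewards, r_sota):
    """Return the rewards that strictly exceed the best-known reward."""
    return [r for r in rewards if r > r_sota]


def summarize(rewards, r_sota):
    """Report the average, the best attempt, and whether any attempt is a discovery."""
    return {
        "mean": mean(rewards),
        "best": max(rewards),
        "discovered": bool(discoveries(rewards, r_sota)),
    }


steady_summary = summarize(steady, R_SOTA)
spiky_summary = summarize(spiky, R_SOTA)
```

An average-reward objective would prefer `steady`. A discovery objective prefers `spiky`, because only `spiky` produced an artifact worth keeping.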

This is why a mechanism-first reading of the paper is more useful than a results-first one. The result table is impressive, but the mechanism explains why the paper is not just another evolutionary-search benchmark.

TTT-Discover has two moving parts: biased learning and smarter reuse

TTT-Discover combines two ideas:

1. Train the model during the test problem with an objective biased toward top outcomes.
2. Reuse promising prior states without becoming trapped by them.

The first part addresses the objective problem. The second part addresses the search-horizon problem.

The entropic objective teaches the model to chase exceptional attempts

A normal expected-reward objective treats many near-good attempts as useful. That is fine when the policy itself is the deliverable. But in discovery, the system wants the tail of the distribution: the rare attempt that crosses the current frontier.

The paper uses an entropic utility objective:

$$ J_\beta(\theta; s) = \log \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid s)} \left[ e^{\beta r(\tau;s)} \right] $$

As $\beta$ becomes large, this objective behaves more like a maximum over rewards. That is exactly the direction discovery needs. But setting $\beta$ too high early can destabilize training, while setting it too low later makes the model insufficiently sensitive to small frontier improvements. The authors therefore adapt $\beta$ by constraining the KL divergence of the induced policy rather than using one fixed value everywhere.
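The two ends of that dial can be written out. This is a standard property of the log-expectation-of-exponential form, stated here for bounded rewards and with a $1/\beta$ rescaling that is added for illustration and is not part of the definition of $J_\beta$:

$$ \lim_{\beta \to 0^+} \frac{1}{\beta} J_\beta(\theta; s) = \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid s)} \left[ r(\tau; s) \right], \qquad \lim_{\beta \to \infty} \frac{1}{\beta} J_\beta(\theta; s) = \sup_{\tau \in \operatorname{supp} \pi_\theta(\cdot \mid s)} r(\tau; s) $$

Differentiating $J_\beta$ also shows where the bias toward top rollouts comes from. Each rollout's score-function term is weighted by its exponentiated reward:

$$ \nabla_\theta J_\beta(\theta; s) = \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid s)} \left[ w_\beta(\tau) \, \nabla_\theta \log \pi_\theta(\tau \mid s) \right], \qquad w_\beta(\tau) = \frac{e^{\beta r(\tau; s)}}{\mathbb{E}_{\tau' \sim \pi_\theta(\cdot \mid s)} \left[ e^{\beta r(\tau'; s)} \right]} $$

Small $\beta$ gives nearly uniform weights, which is ordinary expected-reward training. Large $\beta$ puts almost all of the weight on the best rollouts.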

In plain language: TTT-Discover turns up the pressure toward exceptional rollouts, but it tries not to turn the model into a hallucinating lottery machine.

This is not decorative math. It is the paper’s first real contribution. The objective is designed around the fact that discovery rewards the best artifact, not the average behavior of the model.
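To make the adaptive part concrete, here is a minimal sketch, not the authors' implementation. It assumes the KL constraint can be approximated on a batch of sampled rollouts as the KL divergence between the exponentially reweighted batch and the uniform batch; the function names, the bisection search, and the `target_kl` value are all illustrative assumptions.

```python
import math


def entropic_weights(rewards, beta):
    """Weights proportional to exp(beta * r), shifted by the max for stability."""
    top = max(rewards)
    exps = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def kl_from_uniform(weights):
    """KL(weights || uniform) over the sampled rollouts, in nats."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)


def adapt_beta(rewards, target_kl, beta_hi=1e4, iters=60):
    """Bisection for the beta whose reweighting sits at the target KL.

    Assumption: the KL above is non-decreasing in beta for beta >= 0,
    so a simple bisection on [0, beta_hi] is enough for this sketch.
    """
    lo, hi = 0.0, beta_hi
    if kl_from_uniform(entropic_weights(rewards, hi)) < target_kl:
        return hi  # target unreachable for this batch; return the cap
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_from_uniform(entropic_weights(rewards, mid)) < target_kl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Made-up rewards for four rollouts of one step.
batch_rewards = [0.1, 0.2, 0.3, 0.9]
beta = adapt_beta(batch_rewards, target_kl=0.5)
weights = entropic_weights(batch_rewards, beta)
```

The design point is that $\beta$ is chosen per batch by how far the reweighting moves away from the sampling distribution, rather than being fixed in reward units. A batch with tightly clustered rewards gets a larger $\beta$, and a batch with one runaway outlier gets a smaller one.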

PUCT reuse extends the effective horizon without worshipping yesterday’s best answer

Search also has a horizon problem. If every attempt starts from scratch, the model may repeatedly rediscover partial ideas but fail to refine them. If every attempt starts from the current best solution, the model may over-exploit one local pattern and miss a very different route.

TTT-Discover handles this with a PUCT-style state reuse rule. The buffer stores prior candidate states. At each step, the system chooses which prior state to expand using a score that combines three forces:

Reuse component	What it means	Why it matters for discovery
Maximum child reward	How good the best descendant of a state has been	Discovery cares about best reachable outcome, not average child quality
Prior based on reward rank	High-reward states are more likely to seed high-reward descendants	Good artifacts often contain reusable structure
Exploration bonus	Under-expanded states remain eligible	Prevents early lucky candidates from monopolizing the search

The key design choice is that TTT-Discover uses the maximum reward of children, not their mean. This mirrors the entropic objective: a mediocre branch that once produced a brilliant child is still worth attention.

This is the paper’s second important correction to ordinary thinking about search. State reuse is not merely “prompt the model with past attempts.” It is a way of extending the effective horizon. Reusing a prior solution is like adding another step to a trajectory. The model can build on earlier partial discoveries instead of trying to rederive them each time from an empty prompt.
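A rough sketch of how those three forces could be combined in code. The exact scoring formula, the linear rank prior, and the constant `c` are assumptions made for illustration; the paper's rule may differ in its details.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    name: str
    reward: float      # reward of this stored state
    best_child: float  # maximum reward among its descendants so far
    visits: int = 0    # how many times it has been expanded


def rank_prior(states):
    """Linear rank prior: higher-reward states get a larger share, summing to 1."""
    ordered = sorted(states, key=lambda s: s.reward)  # ascending reward
    total = len(states) * (len(states) + 1) / 2
    return {s.name: (i + 1) / total for i, s in enumerate(ordered)}


def puct_select(states, c=1.0):
    """Pick the state with the best (max child reward + exploration bonus)."""
    prior = rank_prior(states)
    total_visits = sum(s.visits for s in states)

    def score(s):
        bonus = c * prior[s.name] * math.sqrt(1 + total_visits) / (1 + s.visits)
        return s.best_child + bonus  # max over children, not their mean

    return max(states, key=score)


# Made-up buffer: A is the current best state, C once produced a great child,
# B has never been expanded.
buffer = [
    State("A", reward=0.90, best_child=0.90, visits=50),
    State("B", reward=0.50, best_child=0.50, visits=0),
    State("C", reward=0.70, best_child=0.95, visits=50),
]
greedy_pick = puct_select(buffer, c=0.0)
explore_pick = puct_select(buffer, c=1.0)
```

With the exploration term switched off, the rule picks `C`, the branch with the best descendant, rather than `A`, the best stored state. With it switched on, the never-expanded `B` gets a turn.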

The environment must be executable, not merely discussable

The method needs an environment where candidate solutions can be generated, parsed, evaluated, and rejected if invalid. That requirement explains both the strength and the boundary of the paper.

Domain	Candidate state	Reward signal	What makes it suitable
Mathematics	Step-function constructions or certificates	Certified upper/lower bounds	Constructions can be verified numerically under stated constraints
GPU kernel engineering	Kernel code	Inverse runtime after correctness checks	Runtime is measurable and competition harnesses exist
Algorithm engineering	C++ algorithm implementations	Contest score on generated/local and official tests	Validity and score can be automatically evaluated
Single-cell denoising	Analysis code	Benchmark score tied to MSE and Poisson metrics	Benchmark provides repeatable proxy evaluation

The word “environment” is doing serious work here. TTT-Discover is not a general brainstorming assistant. It is not a strategy consultant with gradient updates. It is a loop around a verifier.

That is why the paper focuses on continuous rewards. A continuous reward lets the model learn from degrees of improvement. A kernel that is 5% faster is better than one that is 1% faster. A mathematical bound that tightens slightly is still useful. A denoising score can move incrementally. Sparse yes/no tasks are harder because failed attempts provide less gradient signal. Non-verifiable domains are harder still because there may be no trustworthy reward function at all.
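A minimal sketch of that verifier-first shape, assuming a kernel-style task where reward is inverse runtime after a correctness check. The helper names and the toy lookup tables are hypothetical; a real harness would compile, run, and time the candidate.

```python
import math
from typing import Callable, Optional


def make_reward(
    is_valid: Callable[[str], bool],
    runtime_us: Callable[[str], float],
) -> Callable[[str], Optional[float]]:
    """Build a reward function: None for rejected candidates, 1/runtime otherwise."""

    def reward(candidate: str) -> Optional[float]:
        if not is_valid(candidate):
            return None  # rejected before timing: fast but wrong earns nothing
        t = runtime_us(candidate)
        if not math.isfinite(t) or t <= 0:
            return None  # a broken measurement is treated as invalid
        return 1.0 / t   # continuous: every microsecond saved moves the score

    return reward


# Toy stand-ins for a correctness checker and a timing harness.
toy_runtimes = {"kernel_a": 2000.0, "kernel_b": 1000.0, "kernel_wrong": 10.0}
toy_correct = {"kernel_a", "kernel_b"}

reward = make_reward(
    is_valid=lambda c: c in toy_correct,
    runtime_us=lambda c: toy_runtimes[c],
)
```

The ordering matters. Validity is checked before speed, so the "fastest" candidate in the toy table earns no reward at all.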

The main evidence: four domains, one design pattern

The paper evaluates TTT-Discover across mathematics, GPU kernel engineering, AtCoder heuristic competitions, and single-cell analysis. The authors report every task they attempted, compare against best-known human and AI results, and include Best-of-25600 baselines using the same gpt-oss-120b model and sampling budget. They also compare with OpenEvolve where relevant, though the paper notes that OpenEvolve suffered from prompt truncation under the same context-window constraints, which weakens any simplistic “method A beats method B” reading.

The evidence is best read by purpose, not by excitement level.

Evidence block	Likely purpose	What it supports	What it does not prove
Mathematics results	Main evidence	TTT-Discover can find verifiable constructions that improve known bounds in selected open problems	It does not prove broad mathematical reasoning ability across arbitrary theorem-proving tasks
GPUMode TriMul kernels	Main evidence plus practical engineering evidence	The method can optimize real performance artifacts against expert competition baselines	It does not guarantee deployment safety in full production workloads
MLA-Decode kernels	Boundary/comparison evidence	The method improved over Best-of-25600 but did not beat top human submissions with statistical significance	It does not support a blanket claim that TTT-Discover beats experts everywhere
AtCoder competitions	Main evidence for algorithm engineering	The method can generate high-scoring algorithms in long-horizon optimization contests	It does not prove general superiority across all programming contests or hidden industrial constraints
Single-cell denoising	Exploratory extension with expert caution	The method can optimize benchmarked scientific analysis code	It does not prove improved biological insight downstream
Ablations	Mechanism evidence	Adaptive entropic training and reuse both matter, especially together	It does not exhaust all possible hyperparameter schedules or reuse variants

This table is less glamorous than a headline result. It is also more useful. The paper is not making one uniform claim across every domain. It is showing a pattern: when the problem can be converted into an executable search-and-learning loop, the model can improve during the test problem and sometimes cross a frontier.

Mathematics: small numbers, real discoveries

The mathematics section is easy to underestimate because the numerical differences look tiny. They are not tiny in context.

For Erdős’ minimum overlap problem, the paper reports an improvement in the upper bound from AlphaEvolve’s 0.380924 to TTT-Discover’s 0.380876. The authors note that this improvement over AlphaEvolve is 16 times larger than AlphaEvolve’s improvement over the previous state of the art. TTT-Discover found a 600-piece asymmetric step function, compared with AlphaEvolve’s 95-piece construction and the best human 51-piece construction.
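The arithmetic behind that comparison, using only the two bounds and the 16x ratio quoted above. The value for AlphaEvolve's own improvement is implied by the ratio rather than quoted separately here:

$$ \Delta_{\text{TTT}} = 0.380924 - 0.380876 = 4.8 \times 10^{-5}, \qquad \Delta_{\text{AlphaEvolve}} \approx \frac{\Delta_{\text{TTT}}}{16} = 3 \times 10^{-6} $$

In other words, the previous step forward moved the sixth decimal place, and this one moves the fifth.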

That is a useful example of why discovery is not average-case scoring. Most candidates are irrelevant. The important object is the one construction that verifies a tighter bound.

For the first autocorrelation inequality, TTT-Discover improves the upper bound to $C_1 \leq 1.50286$ with a 30,000-piece step function. The paper distinguishes this from ThetaEvolve, which refined an AlphaEvolve V2 construction. TTT-Discover starts from scratch and finds a new construction path. The late-stage improvement came from focusing optimization on near-tight constraints in the linear program—the parts of the problem where the frontier actually resists movement.

For the second autocorrelation inequality, the paper is refreshingly non-theatrical: TTT-Discover does not make a discovery. Its best construction reaches 0.9591, below AlphaEvolve V2’s 0.9610. Circle packing is also not presented as a breakthrough; TTT-Discover matches the best known constructions for $n=26$ and $n=32$ using Qwen3-8B.

That restraint matters. If every benchmark were described as a revolution, the paper would become marketing. Instead, the unevenness is informative. TTT-Discover is strongest where incremental executable improvement can compound into a new construction. It does not magically dominate every structured search problem. Reality has rudely remained real.

Kernel engineering: the business-friendly result, with a numerical spine

The GPU kernel result is the easiest one to translate into business language because runtime is money wearing a stopwatch.

For the GPUMode TriMul competition, TTT-Discover trains using H100 runtime as the reward, yet the resulting kernel generalizes across other GPU types in final evaluation. The reported runtimes are:

Hardware	Best human runtime	Best-of-25600	TTT-Discover	Interpretation
A100	4531.5 µs	9219.7 µs	2198.2 µs	TTT-Discover's runtime is about half the top human's (roughly a 2.06× speedup)
H100	1371.1 µs	5390.3 µs	1161.2 µs	TTT-Discover beats the official human best
B200	1038.9 µs	3254.9 µs	914.2 µs	Improvement reported through repeated trials and confidence intervals
AMD MI300X	2515.8 µs	4902.0 µs	1555.7 µs	Cross-hardware improvement despite H100-only training reward
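For readers who prefer ratios to microseconds, a small sketch that converts the table into speedup factors. The runtimes are copied from the table above; the dictionary layout and function name are just illustrative choices.

```python
# Reported runtimes in microseconds, copied from the table above.
runtimes_us = {
    "A100": {"human": 4531.5, "best_of_n": 9219.7, "ttt": 2198.2},
    "H100": {"human": 1371.1, "best_of_n": 5390.3, "ttt": 1161.2},
    "B200": {"human": 1038.9, "best_of_n": 3254.9, "ttt": 914.2},
    "MI300X": {"human": 2515.8, "best_of_n": 4902.0, "ttt": 1555.7},
}


def speedup(baseline_us, candidate_us):
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


speedups = {
    hw: {
        "vs_human": speedup(r["human"], r["ttt"]),
        "vs_best_of_n": speedup(r["best_of_n"], r["ttt"]),
    }
    for hw, r in runtimes_us.items()
}

for hw, s in speedups.items():
    print(f"{hw}: {s['vs_human']:.2f}x vs human, {s['vs_best_of_n']:.2f}x vs Best-of-25600")
```

The gap to the best human varies considerably by hardware, while the gap to Best-of-25600 is large everywhere.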

The mechanism behind the generated kernel is not mysterious magic. Expert review says the solution correctly identifies the task as memory-bound and reduces memory traffic through fusion, lower precision, and delegating the heavy matrix multiplication to cuBLAS or rocBLAS. This is precisely what good human kernel engineers would think about. The difference is that the system searched and learned its way into executing the pattern better.

The MLA-Decode result is a useful counterweight. TTT-Discover improves substantially over Best-of-25600, but the paper states that its best kernels were not better than the top human submissions with statistical significance across the tested MI300X instances. The generated solutions also relied mainly on torch.compile() rather than fine-grained Triton optimization, limiting flexibility.

That is not a failure of the paper. It is a warning label for operational readers: AI-generated performance code should be benchmarked like performance code, not admired like poetry.

Algorithm engineering: search learns the shape of industrial heuristics

The AtCoder experiments matter because they resemble real business optimization more than many polished academic tasks do. Routing, scheduling, production planning, and resource allocation are not solved once in a slide deck. They are solved repeatedly under constraints, budgets, and ugly edge cases.

The paper evaluates two AtCoder Heuristic Contest problems:

Contest	Problem type	Starting point	TTT-Discover result
AHC039	Geometry-style optimization: design a net to capture targets and avoid penalties	Starts from an ALE-Agent-based program that would rank 5th	Scores 567,062, slightly above the top human score of 566,997
AHC058	Production-planning style optimization	Starts from scratch	Scores 848,414,228, above the best human score of 847,674,723 and ALE-Agent’s 848,373,282

The generated strategies are recognizable to anyone who has built heuristic optimizers. For AHC039, the solution builds a pool of axis-aligned rectangles, greedily seeds a connected union, then improves it with simulated annealing moves such as add, remove, replace, expand, shrink, and slide. For AHC058, it combines greedy plans, short beam search, simulated annealing, local cleanup, and cached intermediate states.
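For readers who have not built one, this is the generic skeleton of a simulated annealing loop. It is a textbook sketch on a toy one-dimensional problem, not the generated AHC039 or AHC058 solution; the cooling schedule, the toy score, and the single "slide by one" move are all illustrative assumptions.

```python
import math
import random


def simulated_annealing(score, neighbor, start, steps=2000,
                        t_start=1.0, t_end=0.01, seed=0):
    """Maximize `score` by local moves, accepting some worse moves while hot."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / max(1, steps - 1))
        cand = neighbor(current, rng)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept regressions with prob exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy problem: slide an integer toward 3.
def toy_score(x):
    return -(x - 3) ** 2


def toy_neighbor(x, rng):
    return x + rng.choice([-1, 1])


best_x, best_score = simulated_annealing(toy_score, toy_neighbor, start=-10)
```

In a contest solution the single move would be replaced by a menu of moves such as add, remove, expand, and shrink, and most of the engineering effort goes into making `score` cheap to update after each move.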

Again, the important point is not that the model invented a new species of optimization. It learned to assemble and refine effective heuristic machinery under a measurable objective. That is exactly what many enterprise optimization problems need, provided the evaluation harness is trustworthy.

Single-cell denoising: benchmark gains are useful, but biology is not a leaderboard

The single-cell analysis experiment applies TTT-Discover to denoising RNA-seq data on the OpenProblems benchmark. The system trains on the Pancreas dataset and reports performance on held-out PBMC and Tabula datasets. It starts from MAGIC code and adds gene-adaptive transform ensembling, low-rank SVD refinement, and log-space polishing.

The results show improvement over benchmark baselines:

Dataset	Prior strong baseline score	OpenEvolve	Best-of-25600	TTT-Discover
PBMC	0.64 for MAGIC variants	0.70	0.62	0.71
Tabula	0.64 for MAGIC variants	0.71	0.65	0.73

This is an exploratory extension, not a declaration that the method has discovered new biology. The paper itself includes a careful disclaimer: benchmark metrics are incomplete and do not guarantee biological validity for downstream tasks. The expert review makes the same point. The improvement aligns with MAGIC’s smoothing-based approach and improves key metrics, but further evaluation would be needed to know whether it improves biological insight.

This is exactly where business readers should slow down. In engineering competitions, the reward function may be close to the business target: faster correct kernel, higher contest score. In scientific analysis, a proxy metric can be useful while still missing the real downstream value. Optimizing the benchmark may improve the benchmark. One tries not to faint from surprise.

The ablation table explains why this is not just “more attempts”

The ablations are the paper’s strongest defense against the lazy interpretation that TTT-Discover is just Best-of-N with a nicer name.

The TriMul ablation compares training objectives and reuse strategies under the same general setting. The best human H100 runtime is 1371.1 µs. Full TTT-Discover achieves 1203.10 µs in the ablation evaluator. The variants tell the story:

Variant	Best runtime on H100	What the test likely probes
Full TTT-Discover: adaptive entropic objective + PUCT	1203.10 µs	Whether both discovery-biased learning and reuse work together
Constant-$\beta$ entropic objective + PUCT	1483.83 µs	Whether adaptive $\beta$ matters
Expected reward objective + PUCT	1985.67 µs	Whether ordinary average-reward RL is misaligned
No TTT + PUCT	2060.70 µs	Whether reuse alone is enough
Adaptive entropic objective + $\epsilon$-greedy reuse	1328.89 µs	Whether simpler reuse can work when lucky
Adaptive entropic objective + no reuse	5274.03 µs	Whether training without horizon extension is enough
Naive test-time RL: expected reward + no reuse	5328.73 µs	Whether standard RL framing works
Best-of-N	5352.36 µs	Whether sampling alone solves the problem

The pattern is not subtle. Sampling alone performs poorly (5352.36 µs), and naive test-time RL is barely better (5328.73 µs). Removing reuse from the full system pushes the best runtime from 1203.10 µs to 5274.03 µs. Swapping the entropic objective for expected reward while keeping PUCT leaves it at 1985.67 µs. The full system wins because the objective and reuse rule are aligned with discovery.

The authors also state that these ablations are not exhaustive. Additional tuning might improve some variants. That caveat is important, but it does not erase the main lesson: for this task, the search loop needs both learning and a way to build on promising states.
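One way to read the ablation is as a rough grid of objective by reuse rule. The sketch below uses the runtimes from the table; the key names, and the choice to file Best-of-N under "no TTT, no reuse", are labeling assumptions made for this illustration.

```python
# Best H100 runtime in microseconds, keyed by (objective, reuse rule).
ablation_us = {
    ("adaptive_entropic", "puct"): 1203.10,
    ("constant_beta_entropic", "puct"): 1483.83,
    ("expected_reward", "puct"): 1985.67,
    ("no_ttt", "puct"): 2060.70,
    ("adaptive_entropic", "epsilon_greedy"): 1328.89,
    ("adaptive_entropic", "no_reuse"): 5274.03,
    ("expected_reward", "no_reuse"): 5328.73,
    ("no_ttt", "no_reuse"): 5352.36,  # Best-of-N: sampling alone
}


def slowdown_without_reuse(objective):
    """Runtime ratio of 'no reuse' to 'PUCT', holding the objective fixed."""
    return ablation_us[(objective, "no_reuse")] / ablation_us[(objective, "puct")]


def objectives_with_puct():
    """Objectives ranked by runtime, holding PUCT reuse fixed (fastest first)."""
    rows = [(obj, t) for (obj, reuse), t in ablation_us.items() if reuse == "puct"]
    return sorted(rows, key=lambda row: row[1])


reuse_effect = {
    obj: slowdown_without_reuse(obj)
    for obj in ("adaptive_entropic", "expected_reward", "no_ttt")
}
ranking = objectives_with_puct()
```

Read this way, dropping reuse costs a factor of roughly 2.6 to 4.4 depending on the objective, and the objective only starts to matter once reuse is present.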

What businesses should take from this paper

The business interpretation is not “your company should train an LLM at test time for everything.” That would be an efficient method for converting GPU credits into sadness.

The useful interpretation is narrower and stronger:

When a business problem can be expressed as candidate generation plus automatic evaluation plus incremental improvement, test-time training can become an optimization engine rather than a chat interface.

This suggests a practical architecture.

Layer	Operational role	Example
Problem wrapper	Translate a business task into a candidate artifact and reward function	“Generate a warehouse routing heuristic and score it on historical orders”
Verifier/evaluator	Reject invalid artifacts and score valid ones	Unit tests, simulator, benchmark harness, latency test, cost model
Search buffer	Store prior candidate artifacts and metadata	Candidate code, parameters, scores, failure reasons
Test-time learner	Update the model or adapter on the current problem’s attempts	LoRA-style adaptation or equivalent controlled update
Selection policy	Decide which candidates to refine next	PUCT-like reuse balancing high scores and exploration
Deployment gate	Separate benchmark success from production acceptance	Backtesting, stress tests, human review, rollback plan
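Two of those layers are plain data plumbing and can be sketched directly. The class names, fields, and the gate's three conditions below are hypothetical choices for illustration, not an interface from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    artifact: str                         # code, parameters, or plan
    score: Optional[float] = None         # None means the verifier rejected it
    failure_reason: Optional[str] = None  # kept so failures stay informative
    parent: Optional[int] = None          # index of the state it was refined from


@dataclass
class SearchBuffer:
    items: List[Candidate] = field(default_factory=list)

    def add(self, cand: Candidate) -> int:
        """Store a candidate and return its index for later parent links."""
        self.items.append(cand)
        return len(self.items) - 1

    def best(self) -> Optional[Candidate]:
        """Best valid candidate so far, or None if nothing has passed."""
        valid = [c for c in self.items if c.score is not None]
        return max(valid, key=lambda c: c.score) if valid else None


def deployment_gate(cand: Candidate, benchmark_floor: float,
                    passed_stress_test: bool, human_approved: bool) -> bool:
    """Benchmark success is necessary but not sufficient for release."""
    return (
        cand.score is not None
        and cand.score >= benchmark_floor
        and passed_stress_test
        and human_approved
    )


buf = SearchBuffer()
root = buf.add(Candidate("baseline heuristic", score=0.61))
buf.add(Candidate("variant with bug", failure_reason="timeout", parent=root))
buf.add(Candidate("tuned heuristic", score=0.74, parent=root))
top = buf.best()
```

Keeping rejected candidates and their failure reasons in the buffer is deliberate in this sketch: they cost little to store and explain what the search has already ruled out.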

The domains that fit this best are not vague “AI strategy” domains. They are optimization-heavy workflows with measurable rewards: kernel tuning, route planning, scheduling, bidding simulations, data-cleaning pipelines, forecasting model selection, feature engineering, and benchmark-driven automation.

For Cognaptus-style business automation, the paper points to a shift from static copilots to adaptive problem loops. A static copilot gives suggestions. An adaptive loop proposes an artifact, tests it, learns from the result, and proposes a better one. The output is not advice. The output is an executable improvement.

That is the difference between an AI assistant that says “consider optimizing memory traffic” and a system that actually produces a faster kernel.

What the paper directly shows, and what we infer

The direct evidence is specific:

Category	Directly shown by the paper	Cognaptus inference	Still uncertain
Method	TTT-Discover updates model weights during a single test problem using an adaptive entropic objective and PUCT reuse	Discovery systems should optimize for best artifact, not average model behavior	How stable the method is under many more domains and reward designs
Compute economics	Runs use gpt-oss-120b with LoRA, 50 steps, 512 rollouts per step, and a few hundred dollars (roughly $500) per problem under stated assumptions	For high-value optimization tasks, this cost profile may be commercially plausible	Costs vary with model, evaluator latency, token usage, and number of failed attempts
Engineering tasks	TriMul results beat top human submissions across reported hardware settings	Automated kernel and code optimization can become a targeted ROI use case	Production deployment still needs numerical stability and workload-specific validation
Algorithmic tasks	AtCoder heuristic solutions would rank first in two retrospective contests	Many business optimization problems could be framed similarly	Private industrial constraints may be harder than contest harnesses
Scientific workflows	Single-cell denoising benchmark improves over existing methods	AI can tune analysis pipelines around benchmark objectives	Benchmark improvement may not equal scientific usefulness

This separation matters. The paper shows a working discovery loop in selected verifiable settings. Cognaptus infers that similar loops may be valuable for business optimization. It does not follow that every executive decision can be turned into a reward function. Some meetings remain tragically resistant to gradient descent.

The boundary: verifiability is the price of admission

TTT-Discover needs a reward signal that is continuous, executable, and hard to game. If the reward is noisy, sparse, delayed, or only loosely connected to business value, the loop becomes less reliable.

The major boundaries are clear.

First, the current method is built for continuous rewards. The authors explicitly name sparse, binary, and non-verifiable domains as future work. That excludes many attractive but slippery tasks: brand strategy, open-ended legal reasoning, political judgment, early-stage product taste, and many forms of advisory work.

Second, the evaluator is part of the product. A weak evaluator produces optimized nonsense. In single-cell denoising, the paper itself warns that benchmark gains may not transfer to biological insight. In business settings, the equivalent risk is optimizing a simulator, KPI proxy, or backtest that fails in production. Anyone who has seen a model “beat the backtest” and then lose real money should feel a small chill here.

Third, expert review still matters. The mathematics constructions can be validated. The kernel results were reviewed by GPUMode organizers. The biology result received expert caution. The paper does not remove expert judgment; it changes where expert judgment sits. Humans design the environment, inspect the artifacts, and decide whether benchmark success is meaningful.

Fourth, the loop is not free. The authors describe roughly $500 per run under their token and rollout assumptions. That can be cheap for a kernel used at scale or a logistics optimizer that saves real money. It is expensive for toy automation or vanity benchmarks. The ROI case depends on how often the discovered artifact is reused and how closely the benchmark maps to operational value.
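A back-of-the-envelope sketch of that ROI question. The step count, rollouts per step, and the roughly $500 run cost are the figures reported in this article; the per-use saving is a purely hypothetical input that each reader has to supply.

```python
import math

STEPS = 50               # training steps per run, as reported
ROLLOUTS_PER_STEP = 512  # rollouts per step, as reported
RUN_COST_USD = 500.0     # rough per-run cost, as reported


def total_rollouts(steps, rollouts_per_step):
    """Total sampled attempts in one run."""
    return steps * rollouts_per_step


def break_even_uses(run_cost_usd, saving_per_use_usd):
    """Number of reuses of the artifact needed to pay back one run."""
    if saving_per_use_usd <= 0:
        raise ValueError("saving_per_use_usd must be positive")
    return math.ceil(run_cost_usd / saving_per_use_usd)


budget = total_rollouts(STEPS, ROLLOUTS_PER_STEP)
# Hypothetical: the discovered artifact saves 25 cents each time it is used.
uses_needed = break_even_uses(RUN_COST_USD, saving_per_use_usd=0.25)
```

The rollout total is the same 25,600 that the Best-of-25600 baseline spends, which is what makes the comparison budget-matched. The break-even count is only as trustworthy as the saving estimate fed into it.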

The deeper lesson: search becomes more valuable when it learns

The paper’s contribution is not that LLMs can generate code, or that repeated sampling helps, or that benchmarks can be beaten. Those are now table stakes.

The contribution is a design pattern for discovery systems:

1. Treat the test problem as its own environment.
2. Generate candidate artifacts.
3. Score them with a verifier.
4. Store the useful states.
5. Train the model on its own attempts.
6. Reuse promising states without collapsing into them.
7. Return the best artifact, not the average policy.
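The steps above can be sketched as a toy loop. Everything here is a stand-in chosen so the code runs in milliseconds: the "model" is a one-dimensional Gaussian, the "training step" is an exponentially weighted mean update rather than a gradient step on an LLM, and reuse picks uniformly from the top buffered states rather than using PUCT. It shows the shape of the loop, not the paper's method.

```python
import math
import random


def verifier(x):
    """Toy continuous reward with its peak at x = 2.0."""
    return -(x - 2.0) ** 2


def ttt_toy(steps=30, rollouts=32, beta=5.0, seed=0):
    rng = random.Random(seed)
    mean, std = -3.0, 2.0  # toy "policy": a Gaussian over candidates
    buffer = []            # stored states as (reward, candidate), best first
    for _ in range(steps):
        # Generate: half the rollouts refine a stored state, half start fresh.
        xs = []
        for _ in range(rollouts):
            if buffer and rng.random() < 0.5:
                start = rng.choice(buffer[:4])[1]
            else:
                start = mean
            xs.append(start + rng.gauss(0.0, std))
        # Score with the verifier and store the useful states.
        rs = [verifier(x) for x in xs]
        buffer = sorted(buffer + list(zip(rs, xs)), reverse=True)[:16]
        # Train on own attempts: weight each rollout by exp(beta * reward).
        top = max(rs)
        ws = [math.exp(beta * (r - top)) for r in rs]
        mean = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
        std = max(0.05, std * 0.9)
    # Return the best artifact, not the final policy.
    return buffer[0]


best_reward, best_x = ttt_toy()
```

Note what the function returns: the best entry in the buffer, which may have been found many steps before the final policy update.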

That pattern will not work everywhere. It is too mechanical for domains where value is ambiguous and feedback is subjective. But where the reward is real, continuous, and executable, it changes the role of AI from “a model that answers” to “a process that improves.”

The paper’s title says “learning to discover.” The business translation is more blunt: in some workflows, the useful AI system is not the one that knows the answer. It is the one that can keep failing, update itself, and eventually produce an artifact worth keeping.

Not glamorous. Quite useful. A familiar combination, unfortunately rare.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun, “Learning to Discover at Test Time,” arXiv:2601.16175v2, 2026. https://arxiv.org/abs/2601.16175
