A persistent mystery in the recent surge of reasoning-augmented LLMs—like OpenAI’s o1 or DeepSeek-R1—is whether these models learn to reason through post hoc reinforcement fine-tuning, or if they were already good at it to begin with. ASTRO offers a rare counter-example: a method that imbues non-reasoner LLMs (like vanilla Llama 3) with structured reasoning behavior from scratch.
Rather than rely on emergent capabilities or distillation from models that already search well, ASTRO teaches LLMs to think like search algorithms themselves, using a hybrid approach combining Monte Carlo Tree Search (MCTS), procedure cloning, chain-of-thought generation, and reinforcement learning with verifiable rewards.
From Trees to Traces: Embedding Search into Language
At its core, ASTRO does something deceptively simple yet powerful: it transforms MCTS search trees—built over stepwise math problem-solving traces—into natural language outputs. These outputs encode not just the correct answer, but also paths that include mistakes, reflections, and recoveries.
Each MCTS trace is linearized into a chain-of-thought (CoT) narrative that encodes:
- Backtracking: the model identifies an error and explicitly reverts to an earlier step.
- Self-reflection: explicit check-ins such as “But wait, are we solving the problem correctly so far? Hmm…”
- Multiple paths: both incorrect and correct paths are included, so the model experiences failure during training and learns to recover.
This transforms what is usually an opaque CoT into a traceable, structured map of a search trajectory.
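To make the linearization concrete, here is a minimal sketch. The path representation (lists of step strings) and the reflection/backtrack wording are illustrative assumptions, not ASTRO’s exact data structures or templates.

```python
def linearize_trace(wrong_path: list[str], correct_path: list[str], fork: int) -> str:
    """Flatten one failed branch plus the correct branch into a single CoT.

    `fork` is the step index where the two branches diverge, i.e. where the
    model should resume after reflecting on the mistake.
    """
    parts = list(wrong_path)  # walk the incorrect branch first
    parts.append("But wait, are we solving the problem correctly so far? Hmm...")  # self-reflection
    parts.append(f"Something is off. Let's go back to step {fork} and try a different approach.")  # backtrack
    parts.extend(correct_path[fork:])  # continue along the branch that reaches the answer
    return "\n".join(parts)
```

The output is a single autoregressive training target: the failed attempt, the reflection, the backtrack, and the recovery all live in one sequence.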
Procedure Cloning: The Hidden Backbone
ASTRO doesn’t just prompt models to reflect—it clones the reasoning procedures embedded in MCTS traces. This method is akin to behavioral cloning from RL but applies it to reasoning steps instead of actions in a game. Crucially, it includes failures and restarts.
For each problem:
- ASTRO performs MCTS over step-wise math reasoning.
- It selects one high-quality terminal node (correct solution) and several incorrect ones.
- It linearizes the paths—including backtracks—into long CoTs.
- These CoTs are used for supervised fine-tuning (SFT).
This turns the LM into a search policy that reasons autoregressively, learning to backtrack and correct itself without relying on external scaffolding.
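A compact sketch of that data-generation loop, under loose assumptions: `run_mcts`, the tree methods, and `linearize` are hypothetical stand-ins shown only to make the data flow explicit, not ASTRO’s actual pipeline components.

```python
def build_sft_dataset(problems, run_mcts, linearize, n_failures=2):
    """Turn MCTS runs into (prompt, completion) pairs for supervised fine-tuning."""
    dataset = []
    for problem in problems:
        tree = run_mcts(problem)                                # step-wise MCTS over reasoning steps
        correct = tree.best_correct_terminal()                  # one high-quality correct solution
        failures = tree.sample_incorrect_terminals(n_failures)  # a few failed branches to recover from
        cot = linearize(problem, failures, correct)             # long CoT with backtracks and reflections
        dataset.append({"prompt": problem, "completion": cot})
    return dataset
```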
Self-Correcting Thoughts Pay Off
Fine-tuning with just 36K such search-based CoTs gives immediate gains: Llama-3.1-70B-ASTRO-SFT achieves +3.8% on MATH-500, +6.7% on AMC 2023, and +6.3% on AIME 2024, compared to its counterpart trained on CoTs without search structure.
But the real magic happens when ASTRO’s SFT checkpoint is used to initialize reinforcement learning. Using Group Relative Policy Optimization (GRPO), the ASTRO-RL model continues to learn through verifier-based rewards.
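A minimal sketch of the two ingredients named here, assuming a binary verifier reward (1 if the final answer matches the reference, else 0) and the standard group-relative advantage used in GRPO; the policy-gradient update and KL regularization are omitted, and ASTRO’s exact reward shaping may differ.

```python
import statistics

def verifier_reward(predicted_answer: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each sample's reward normalized against the
    mean and std of the group of samples drawn for the same problem."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]

# Example: four sampled solutions to one problem, scored by the verifier.
rewards = [verifier_reward(a, "42") for a in ["42", "41", "43", "42"]]
print(group_relative_advantages(rewards))  # correct samples get positive advantage
```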
The result? State-of-the-art math reasoning within its model family, from a base model that didn’t start out strong at math. ASTRO-RL achieves:
| Model | MATH-500 | AMC 2023 | AIME 2024 |
|---|---|---|---|
| Llama-3.1-70B-Instruct | 65.8% | 37.5% | 10.0% |
| ASTRO-SFT | 69.6% | 51.9% | 16.3% |
| ASTRO-RL | 81.8% | 64.4% | 30.0% |
Notably, ASTRO-RL even outperforms Llama-3.3-70B variants trained with methods like SPOC and Step-KTO.
Why Search Priors Matter
ASTRO’s ablation studies are telling. When models are trained on the same data but without the backtracking and self-reflection phrasing, performance drops across the board: they produce shorter CoTs, backtrack less often, and solve fewer problems.
This suggests that self-reflection and backtracking are not emergent behaviors—they need to be taught.
Moreover, ASTRO’s CoTs can be mapped onto directed graphs, where each node is a reasoning step and each edge a transition between steps. This opens up interpretability pathways rarely available in standard CoT reasoning.
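As a toy illustration of that graph view, the sketch below assumes each reasoning step has already been parsed into a labeled transition (the step IDs, the `forward`/`backtrack` labels, and the parsing itself are assumptions for the example).

```python
from collections import defaultdict

def build_reasoning_graph(transitions):
    """transitions: (from_step, to_step, kind) triples, kind in {'forward', 'backtrack'}."""
    graph = defaultdict(list)
    for src, dst, kind in transitions:
        graph[src].append((dst, kind))  # adjacency list keyed by reasoning step
    return dict(graph)

graph = build_reasoning_graph([
    ("s1", "s2", "forward"),
    ("s2", "s3", "forward"),     # s3 turns out to be a dead end
    ("s3", "s1", "backtrack"),   # reflection sends the model back to s1
    ("s1", "s2b", "forward"),    # alternative continuation that reaches the answer
])
```

Inspecting such a graph shows at a glance where the model hesitated, where it backtracked, and which branch ultimately produced the answer.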
Implications: Search-Like Thinking, Learned Autonomously
ASTRO may mark the beginning of a broader trend: training LLMs not to guess better, but to search better.
For enterprise applications where intermediate reasoning steps must be auditable—e.g., legal reasoning, scientific hypotheses, or stepwise financial calculations—ASTRO’s structure offers a clear advantage. You don’t just get an answer. You get the path the model took, including detours.
This could also change how we think about building reasoner agents. Instead of scaffolding LLMs with external tools (e.g., scratchpads, verifiers), ASTRO suggests a path toward internalizing those behaviors into the model itself.
Cognaptus: Automate the Present, Incubate the Future.