RL Needs a Menu, Not a Miracle

Menus are underrated.

When a language model knows only one way to solve a problem, reinforcement learning can mostly reward or punish that route. It can make the model more confident, more selective, and sometimes more verbose. But it has little room to choose among genuinely different ways of reaching the answer.

A recent arXiv paper, “Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models,” studies a simple but important variation on that setup: before applying reinforcement learning, expose the model to multiple correct solution paths for the same question [1]. Not more random data. Not larger teacher-model distillation by default. A menu of verified, diverse reasoning trajectories.

That distinction is the useful part. The paper is not mainly a story about squeezing another percentage point from math benchmarks. It is a study of where reinforcement learning gets its raw material. If RL mostly amplifies behaviors that are already available to the model, then the pre-RL stage becomes a design problem, not a ceremonial warm-up.

For business readers, the lesson is sharp: buying or running RL is not enough. The expensive question is whether the model has been given the right set of correct behaviors before RL begins. Otherwise, RL is asked to optimize a narrow habit. Very elegant. Also very limiting.

The paper tests whether RL benefits from behavioral diversity before RL

The authors start from a familiar observation in current LLM training: reinforcement learning with verifiable rewards can improve reasoning models, especially when answers can be automatically checked. But recent work has raised a less flattering interpretation of RL gains. Instead of inventing new reasoning abilities from nothing, RL may often sharpen, select, or compose behaviors that already exist in the model’s distribution.

The paper turns that observation into an experiment. If RL depends on what the model already knows how to do, then giving the model multiple correct ways to solve the same problem before RL should help. The authors call this intermediate phase mid-training: a supervised fine-tuning stage between the base/instruct model and reinforcement learning.

Their mid-training data is built from GSM8K math questions. For each question, they generate solution variants guided by George Pólya-style problem-solving heuristics: analogy, decomposition, working backward, introducing auxiliary elements, restating the problem, and many others. The actual pipeline is more disciplined than “ask the model to be creative,” which is a phrase that usually means the budget has already left the building.

The process is roughly:

Seed math question
  → choose a Pólya-style heuristic
  → prompt the base model with heuristic description + few-shot examples
  → sample many candidate solutions
  → keep candidates with the correct final answer
  → score adherence to the intended heuristic
  → select the best heuristic-specific trajectory
  → mid-train the model on multiple correct trajectories per question
  → run GRPO reinforcement learning
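The two selection steps in that flow (keep correct candidates, then keep the most heuristic-faithful one) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Candidate` fields, the `select_trajectories` function, the heuristic labels, and the numeric adherence scores are all hypothetical stand-ins. In the paper, sampling and adherence scoring are done by models rather than supplied as plain numbers.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    heuristic: str     # intended Pólya-style heuristic (hypothetical label)
    solution: str      # sampled reasoning text
    final_answer: str  # answer extracted from the solution
    adherence: float   # assumed score for how well it follows the heuristic


def select_trajectories(candidates, gold_answer):
    """Keep correct candidates, then the best-adhering one per heuristic."""
    best = {}
    for cand in candidates:
        if cand.final_answer != gold_answer:
            continue  # correctness filter: wrong final answers are dropped
        current = best.get(cand.heuristic)
        if current is None or cand.adherence > current.adherence:
            best[cand.heuristic] = cand
    return best


# Toy data, invented for illustration only.
samples = [
    Candidate("working_backward", "Start from the total...", "18", 0.9),
    Candidate("working_backward", "Begin at the end state...", "18", 0.4),
    Candidate("decomposition", "Split into two parts...", "18", 0.7),
    Candidate("analogy", "This is like...", "20", 0.95),  # wrong answer
]
selected = select_trajectories(samples, gold_answer="18")
```

Note the ordering in this sketch: the high-adherence "analogy" candidate is discarded because its final answer is wrong, so correctness acts as a gate before adherence is considered at all.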

The primary model is Llama 3.2–3B–Instruct. The authors use GSM8K as the seed training source, construct datasets with $n \in \{1, 2, 4, 8, 16, 32, 64\}$ solution variants per question, and then evaluate on several math reasoning benchmarks, including MATH-500, AIME 2024, AIME 2025, AMC 2023, HMMT 2025, and OlympiadBench. Performance is reported mainly with pass@k, where a problem counts as solved if at least one of $k$ sampled attempts is correct.

That last metric matters. Pass@1 asks, “Does the first answer work?” Pass@64 asks, “Can the model produce a correct answer somewhere in a larger batch of attempts?” For a single-shot regulated workflow, pass@1 matters more. For agentic systems that can sample, verify, retry, or route candidate answers, pass@k can be operationally meaningful.
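For readers who want to see the metric as arithmetic, here is a small sketch of pass@k. It uses the unbiased estimator popularized by the HumanEval paper: draw some number of samples per problem, count the correct ones, and compute the chance that a random subset of size k contains at least one correct sample. Whether the paper computes pass@k exactly this way is an assumption, and the function name and sample counts are illustrative.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Unbiased estimate of P(at least one of k attempts is correct)."""
    if k > num_samples:
        raise ValueError("k cannot exceed the number of samples drawn")
    if num_samples - num_correct < k:
        return 1.0  # fewer than k wrong samples: every k-subset has a hit
    # 1 minus the probability that all k chosen samples are wrong.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Illustrative counts only: 64 samples drawn, 2 of them correct.
single_shot = pass_at_k(64, 2, 1)    # equals 2/64
with_retries = pass_at_k(64, 2, 16)  # much higher for the same model
```

The same model with the same two correct samples looks weak at k=1 and far stronger at k=16, which is why the deployment question (one shot or many) has to be settled before reading the numbers.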

The mechanism is probability geometry, not training folklore

The paper’s theoretical section is useful because it explains why diverse correct trajectories before RL might change what RL can do.

The core idea is simple. A model trained on a single solution path tends to concentrate probability around one dominant continuation at key reasoning points. A model trained on multiple correct strategies for the same problem can hold several plausible continuations with meaningful probability mass. The authors describe this as an $N$-modal next-token distribution.

In a uni-modal regime, where the model is already highly confident in one token or path, a policy-gradient update has limited room to move probability. In an $N$-modal regime, a sampled token belongs to one of several dominant modes. Positive updates can reinforce a successful path without immediately collapsing the whole distribution. Negative updates can reduce probability on one sampled path and redistribute probability toward other dominant alternatives.
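A toy calculation makes the geometry concrete. The sketch below applies one REINFORCE-style update to a four-token softmax distribution. It is not the paper's derivation and not GRPO: the logits, the learning rate, and the advantage values are made up, and the only thing borrowed is the standard softmax gradient.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, sampled, advantage, lr=1.0):
    """One REINFORCE-style step on softmax logits for one sampled token.

    d log p[sampled] / d logit[j] = 1[j == sampled] - p[j]
    """
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if j == sampled else 0.0) - probs[j])
        for j, z in enumerate(logits)
    ]


# Uni-modal: token 0 already dominates, so a reward barely moves it.
uni = [6.0, 0.0, 0.0, 0.0]
uni_gain = softmax(policy_gradient_step(uni, 0, +1.0))[0] - softmax(uni)[0]

# Bi-modal: tokens 0 and 1 share the mass; token 0 is sampled and penalized.
bi = [3.0, 3.0, 0.0, 0.0]
before = softmax(bi)
after = softmax(policy_gradient_step(bi, 0, -1.0))
shift_to_other_mode = after[1] - before[1]
shift_to_tail_token = after[2] - before[2]
```

Under these made-up logits, the rewarded uni-modal token gains almost nothing because it is already near probability 1. In the bi-modal case, the penalty on token 0 sends the freed mass to the other dominant token rather than to the low-probability tail.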

That sounds technical because it is. But the business translation is clean:

RL works better when the model has verified alternatives to move probability toward.

This is why the paper’s result should not be read as “more synthetic data helps.” That is too vague to be useful. The stronger claim is narrower and more operational: verified diversity at the level of solving approaches can make later RL more effective.

For an enterprise system, this is the difference between dumping more historical tickets into a fine-tuning job and deliberately constructing several correct resolution strategies for the same class of ticket. One is volume. The other is behavioral coverage.

The main evidence: diversity helps most when multiple attempts are allowed

The first result is mid-training alone. With Llama 3.2–3B–Instruct, average pass@1 improves from 9.25% at $n=1$ to 11.50% at $n=64$ across the six math benchmarks. That is a real but modest single-sample gain. The STaR baseline (Self-Taught Reasoner, which fine-tunes a model on its own rationales that reach the correct answer) reaches a higher average pass@1 of 13.02%.

The more interesting movement appears under pass@64. The zero-shot baseline averages 46.30%. STaR is almost unchanged at 46.32%. The Pólya-guided mid-trained model with $n=64$ reaches 48.17%. Some individual benchmark gains are more noticeable: AIME 2025 moves from 12.84% to 18.66%, AMC 2023 from 83.49% to 85.18%, and OlympiadBench from 42.13% to 43.57%.

After GRPO-based reinforcement learning, the pattern becomes clearer. At pass@64, vanilla RL averages 44.21% across the six benchmarks. STaR+RL reaches 45.69%. The mid-trained models reach 48.09% at $n=16$ and 47.62% at $n=64$. On AIME 2025, vanilla RL reaches 16.91%, while the mid-trained RL model reaches 23.34%. On AMC 2023, vanilla RL reaches 78.18%, while the mid-trained RL model reaches 84.52%.

A small warning belongs here, but not the decorative kind. The gains are not perfectly monotonic. More heuristics are generally helpful, especially at larger $k$, but $n=16$ sometimes beats $n=64$. The authors suggest one possible reason: the GRPO rollout group size is 16. When the number of learned strategies and the rollout group size line up, the RL group may cover the available strategies more effectively. When $n$ exceeds the group size, each group samples only a subset.

That interpretation is plausible, but it should not be over-polished into a universal rule. The practical message is simpler: diversity must fit the RL procedure. A bigger menu is not automatically better if the training process cannot inspect enough of it.
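A back-of-the-envelope model shows why group size and menu size interact. The sketch below assumes each rollout picks one of the learned strategies uniformly and independently, which is a simplifying assumption and not something the paper claims. Real policies are not uniform, so treat the numbers as intuition only. The function name is hypothetical.

```python
def expected_strategies_seen(num_strategies, group_size):
    """Expected number of distinct strategies in one rollout group,
    assuming each rollout picks a strategy uniformly and independently."""
    miss = (1.0 - 1.0 / num_strategies) ** group_size
    return num_strategies * (1.0 - miss)


GROUP_SIZE = 16  # GRPO rollout group size reported in the article

# Fraction of the available strategies one group is expected to touch.
coverage = {
    n: expected_strategies_seen(n, GROUP_SIZE) / n
    for n in (1, 2, 4, 8, 16, 32, 64)
}
```

Under that uniform assumption, a group of 16 rollouts is expected to touch roughly 64% of 16 strategies but only about 22% of 64, which is one way to picture a menu that is larger than the training process can inspect.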

The ablations separate useful diversity from decorative variety

The paper is strongest where it prevents the easy but wrong reading: “just generate more varied reasoning.” The authors run several tests that make the mechanism more specific.

| Test or result | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Table 1: mid-training with $n$ Pólya-guided variants | Main evidence before RL | More correct reasoning variants improve average pass@64 and modestly improve pass@1 | That the method gives state-of-the-art one-shot reliability |
| Figure 2: GRPO after mid-training | Main evidence after RL | RL benefits from diverse correct mid-training more than vanilla RL or STaR+RL | That gains are monotonic in the number of heuristics |
| Figure 3: behavior composition analysis | Mechanism support | RL-trained models combine multiple learned problem-solving behaviors more often | That these behaviors are fundamentally new rather than surfaced from pretraining |
| Figure 4a: more problems vs. more approaches | Controlled ablation | With the same number of supervised instances, multiple approaches per problem beat more single-solution problems by roughly 7% relative improvement | That fewer unique problems is always better in other domains |
| Figure 4b: heuristic reasoning with incorrect final answers | Correctness ablation | Diversity only helps when the trajectories also lead to correct answers | That automatic correctness checking is easy outside math |
| Figure 5: QwQ-32B distillation comparison | Comparison with teacher distillation | The heuristic-guided data has higher measured diversity and better pass@64 after RL than the teacher-distilled baseline | That teacher distillation is generally inferior in all settings |
| Table 2: HumanEval and MuSR | Exploratory extension | Some transfer appears in coding and narrative reasoning, especially multi-step tasks | That a math-derived heuristic taxonomy is domain-general |
| Appendix Qwen2.5–7B results | Robustness/sensitivity test | The optimal amount of added diversity depends on the base model and RL setup | That the method reliably beats stronger base models without qualification |

The two most important ablations are Figure 4a and Figure 4b.

In Figure 4a, the authors compare two mid-training datasets with the same total number of supervised instances: one uses many distinct problems with one solution each; the other uses fewer problems with 16 different approaches per problem. The multiple-approach setting performs better after RL across all tested $k$ values, with roughly a 7% relative improvement.
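The equal-budget framing is simple arithmetic, sketched below. The 16 approaches per problem comes from the ablation described above. The total budget of 8,000 instances and the function name are invented for illustration and are not taken from the paper.

```python
def equal_budget_designs(total_instances, approaches_per_problem):
    """Two mid-training designs with the same number of supervised instances."""
    if total_instances % approaches_per_problem != 0:
        raise ValueError("budget must divide evenly across approaches")
    return {
        # Many distinct problems, one solution each.
        "more_problems": {"problems": total_instances, "approaches": 1},
        # Fewer problems, several verified approaches for each.
        "more_approaches": {
            "problems": total_instances // approaches_per_problem,
            "approaches": approaches_per_problem,
        },
    }


# 8,000 is a made-up budget; 16 approaches per problem is from the article.
designs = equal_budget_designs(total_instances=8000, approaches_per_problem=16)
```

Both designs cost the same number of supervised examples. The second spends that budget on 500 problems seen 16 ways instead of 8,000 problems seen once.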

That is a direct challenge to the usual “more examples” instinct. Sometimes the scarce asset is not another problem. It is another verified way to solve the same problem.

Figure 4b is the necessary slap on the wrist. The authors construct heuristic-guided reasoning chains that follow varied approaches but end with incorrect answers. Increasing the number of such approaches does not help. It hurts. These models fall below vanilla RL across the six benchmarks.

This is the cleanest operational boundary in the paper: diversity without correctness is not a training asset. It is styled error propagation.

The distillation result is about diversity, not anti-teacher ideology

The comparison with QwQ-32B is also worth reading carefully. The authors sample 16 reasoning chains per question from a stronger teacher model, mid-train the base model on those traces, and then apply RL. Their heuristic-guided data receives a Vendi Score of 13.81, compared with 10.95 for the distilled dataset. The heuristic-guided model also achieves better pass@64 after RL and comparable or better pass@1 than the teacher-distillation baseline in the reported comparison.

This should not be turned into a lazy conclusion that teacher distillation is bad. That would be too convenient, and convenience is not evidence.

The better interpretation is that a stronger teacher can still produce a narrow family of reasoning traces. If the goal is diversity of solving approaches, sampling from a teacher is not automatically enough. A teacher may be powerful and still stylistically repetitive. The paper even notes that distilled RL rollouts can become verbose and repetitive.

For business teams, this matters because many fine-tuning pipelines rely on a stronger model to generate synthetic examples. That can be useful. But if the downstream objective is robust problem solving under RL or agentic retry, synthetic generation needs diversity controls, not just teacher prestige. “Generated by a bigger model” is not a diversity guarantee. It is a procurement argument wearing a lab coat.

The business value is behavioral coverage before optimization

The practical implication is not that every company should train a math reasoner with Pólya heuristics. The implication is that pre-RL data design should be treated as a coverage problem.

A business workflow rarely has only one valid path. Consider invoice exception handling. A disputed invoice might be resolved by checking purchase-order terms, matching delivery records, reviewing vendor history, comparing contract clauses, or escalating based on materiality thresholds. If the model has only seen one preferred resolution route, RL can reward that route. If it has seen several correct routes, RL can learn when to preserve, combine, or abandon them.

The same pattern applies to support automation, compliance triage, underwriting assistance, internal research agents, and financial reconciliation. The goal is not “make the model reason more.” That phrase is operational fog. The goal is to define the task family’s valid solution strategies, generate or collect examples of each, verify correctness, and then train the model so RL has alternatives to select from.

A useful enterprise version of the paper’s pipeline would look like this:

| Training design question | Business translation | Failure mode if ignored |
|---|---|---|
| What are the distinct correct approaches? | Build a strategy taxonomy for the workflow | The model overfits one procedure and fails on exceptions |
| Can final outputs be verified? | Define rule checks, human checks, reconciliation checks, or acceptance tests | Synthetic data becomes fluent but wrong |
| Are multiple approaches attached to the same case type? | Train on behavioral variation, not just case volume | More data adds surface coverage but little procedural flexibility |
| Does RL sample enough candidates to use the diversity? | Align rollout design, verification, and selection | The model has alternatives but training rarely sees them |
| Is pass@k operationally usable? | Use retry, ranking, or verifier loops only where allowed | Benchmark gains do not translate into one-shot reliability |
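The first three questions in that table can be turned into a simple data audit. The sketch below flags case types that have too few verified strategies before any RL spend. The record layout, the function name, the threshold, and the invoice-style labels are hypothetical and would need to match a team's own taxonomy.

```python
from collections import defaultdict


def strategy_coverage_gaps(records, min_strategies=2):
    """Return case types with fewer than min_strategies verified strategies.

    Each record is (case_type, strategy, verified); all names are hypothetical.
    """
    verified = defaultdict(set)
    seen_case_types = set()
    for case_type, strategy, is_verified in records:
        seen_case_types.add(case_type)
        if is_verified:
            verified[case_type].add(strategy)  # unverified examples do not count
    return {
        case_type: sorted(verified[case_type])
        for case_type in sorted(seen_case_types)
        if len(verified[case_type]) < min_strategies
    }


# Toy training records, invented for illustration only.
records = [
    ("disputed_invoice", "check_po_terms", True),
    ("disputed_invoice", "match_delivery_records", True),
    ("duplicate_payment", "compare_contract_clauses", True),
    ("duplicate_payment", "review_vendor_history", False),  # unverified
]
gaps = strategy_coverage_gaps(records)
```

In this toy data, the duplicate-payment case type is flagged because only one of its two strategies has a verified example, which is the situation where RL would be optimizing a narrow habit.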

This is also where ROI thinking becomes more disciplined. The paper’s strongest business relevance is not “better benchmark score.” It is cheaper diagnosis of what kind of training data is missing.

If a model fails under RL, the failure may not be the RL algorithm alone. It may be that the pre-RL data gave the model too few correct behaviors to optimize. That diagnosis changes the spending decision. Instead of immediately increasing RL compute, the team may need to invest in a smaller but richer data construction process: multiple verified solutions per task pattern, better filters, stronger evaluators, and explicit coverage of exception-handling routes.

Where the result applies, and where it does not yet travel comfortably

The boundary is not that the method fails. The boundary is that the paper’s evidence is strongest for math-style reasoning with verifiable answers and multi-sample evaluation.

The primary experiments use Llama 3.2–3B–Instruct and math benchmarks. The out-of-domain results on HumanEval and MuSR are encouraging, especially on Team Allocations in MuSR, where vanilla RL underperforms the base model while the mid-trained RL models perform much better. But those are still extensions, not proof of universal transfer.

The appendix also complicates the easy story. In the Qwen2.5–7B–Instruct experiments, the zero-shot model already has the highest average pass@1 and pass@64 of any configuration in the reported mid-training table, and $n=8$ is the best or near-best of the Pólya configurations. The authors suggest that stronger base models may already contain substantial reasoning diversity. That is an important deployment clue. If a model already has broad behavioral coverage, adding more heuristic variants may produce smaller or less consistent gains.

There is also a verification problem. Math has automatic answer checking. Many business tasks do not. A compliance memo, customer escalation note, or vendor-risk summary may be partly checkable but not fully reducible to a final numeric answer. Without a reliable verifier, synthetic diversity becomes riskier. The paper’s incorrect-answer ablation is a polite way of saying: do not train on attractive nonsense and call it diversity.

Finally, the behavior-composition analysis relies on an LLM judge to classify which heuristics appear in reasoning traces. The authors validate this with two human annotators and report Fleiss’ $\kappa = 0.65$, which they interpret as substantial agreement. That supports the analysis, but it is not the same as a mechanistic microscope. It shows that the composition claim is plausible and measured with some validation, not that every internal behavior is cleanly identified.
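For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa from a table of per-item category counts using the standard formula. The toy ratings are invented and have nothing to do with the paper's annotation data. The function name is illustrative.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-item category counts.

    ratings[i][j] = number of raters who put item i in category j;
    every item must have the same number of raters (at least 2).
    """
    num_items = len(ratings)
    raters = sum(ratings[0])
    if raters < 2 or any(sum(row) != raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Observed agreement, averaged over items.
    observed = sum(
        (sum(c * c for c in row) - raters) / (raters * (raters - 1))
        for row in ratings
    ) / num_items
    # Chance agreement from the overall category proportions.
    num_categories = len(ratings[0])
    proportions = [
        sum(row[j] for row in ratings) / (num_items * raters)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 4 traces, 2 annotators, categories = [present, absent].
toy_ratings = [[2, 0], [1, 1], [0, 2], [2, 0]]
kappa = fleiss_kappa(toy_ratings)
```

Kappa corrects raw agreement for the agreement expected by chance. The common Landis and Koch convention labels values from 0.61 to 0.80 as substantial, which is consistent with how the authors read 0.65.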

The useful lesson: prepare the model before asking RL to be clever

The paper’s most valuable contribution is not a new magic recipe. It is a placement lesson.

Reinforcement learning is often discussed as the stage where models become better reasoners. This paper shifts attention to the stage before that: what behaviors are available for RL to reward, penalize, and combine? If the model sees only one correct path, RL can optimize a narrow distribution. If it sees multiple verified paths, RL has a richer policy landscape.

For business AI systems, this turns into a practical checklist:

  1. Identify the recurring task families where reliability matters.
  2. Write down the distinct valid ways experts solve each task.
  3. Generate or collect multiple correct trajectories for the same case type.
  4. Verify outputs before training on them.
  5. Measure whether the system benefits under realistic retry, ranking, or human-review workflows.
  6. Treat pass@k gains as useful only when the workflow can actually exploit multiple attempts.

The mildly annoying conclusion is that better RL may require better pre-RL craftsmanship. Less glamorous than shouting “agentic,” admittedly. Also more likely to work.

The paper does not prove that Pólya-style heuristics are the universal taxonomy for business reasoning. It does not prove that every stronger model needs this treatment. It does not make pass@1 reliability suddenly disappear as a deployment concern.

What it does show is more useful: when diverse correct behaviors are made available before RL, later RL can exploit them more effectively. In enterprise terms, optimization improves when the model has something worth optimizing.

A menu first. Then the reward signal.

Cognaptus: Automate the Present, Incubate the Future.


  1. Aswin RRV, Jacob Dineen, Divij Handa, Mihir Parmar, Ben Zhou, Swaroop Mishra, and Chitta Baral, “Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models,” arXiv:2605.08472v1, 2026, https://arxiv.org/pdf/2605.08472