Mid-Training

RL Needs a Menu, Not a Miracle Menus are underrated. When a language model knows only one way to solve a problem, reinforcement learning can mostly reward or punish that route. It can make the model more confident, more selective, and sometimes more verbose. But it has little room to choose among genuinely different ways of reaching the answer. ...