Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer
TL;DR for operators The paper behind this article proposes Curriculum GRPO: a reinforcement-learning training method that starts a reasoning model with a larger token budget, then gradually shrinks that budget until the model learns to solve problems in shorter traces.1 The important point is not “ask the model to be brief.” We have tried that. It works roughly as well as asking a committee to be concise, which is to say: occasionally, under duress. The paper instead changes the training trajectory. The model is first allowed to explore longer reasoning paths, then is forced to compress successful strategies into a tighter token budget. ...