When it comes to reasoning, bigger isn’t always better. Large language models (LLMs) often produce unnecessarily long chains of thought, burning through tokens — and budgets — even for simple problems. While fixed token limits during training can force brevity, they also rob models of the chance to first explore and then compress their reasoning.
A new study, Train Long, Think Short, proposes a smarter path: curriculum learning for length control. Instead of a one-size-fits-all cap, the model starts with a generous token budget, learns robust reasoning strategies, and then gradually adapts to shorter limits over time. The result is a model that solves complex tasks with fewer tokens, without losing accuracy.
From Exploration to Compression
The approach builds on Group Relative Policy Optimization (GRPO), a reinforcement learning method in which several responses to the same prompt are sampled and rewarded based on their relative performance within the group. The reward combines three elements, sketched in code after the list:
- Correctness – verified final answer accuracy.
- Length efficiency – adherence to a progressively shrinking token budget.
- Formatting – clear separation of reasoning (`<think>`) and answer (`<answer>`) tags.
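Here is a minimal sketch of how these terms could be combined and then turned into group-relative advantages. The weights, the length-efficiency term, and the helper inputs (`correct`, `well_formatted`, `num_tokens`) are illustrative assumptions, not the paper's exact implementation:

```python
from statistics import mean, pstdev

def length_efficiency(num_tokens: int, budget: int) -> float:
    """Illustrative length term: full credit at or under the budget, linear penalty beyond it.
    (The paper studies specific shapes for this term; see the reward-shape discussion below.)"""
    if num_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_tokens - budget) / budget)

def response_reward(correct: bool, well_formatted: bool, num_tokens: int, budget: int,
                    w_acc: float = 1.0, w_len: float = 0.5, w_fmt: float = 0.1) -> float:
    """Weighted sum of correctness, length efficiency, and formatting (weights are assumed)."""
    return (w_acc * float(correct)
            + w_len * length_efficiency(num_tokens, budget)
            + w_fmt * float(well_formatted))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style step: score each sampled response relative to its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Four sampled responses to the same prompt under a 256-token budget.
rewards = [
    response_reward(correct=True,  well_formatted=True,  num_tokens=240, budget=256),
    response_reward(correct=True,  well_formatted=True,  num_tokens=512, budget=256),
    response_reward(correct=False, well_formatted=True,  num_tokens=180, budget=256),
    response_reward(correct=True,  well_formatted=False, num_tokens=250, budget=256),
]
print(group_relative_advantages(rewards))
```

Responses that are correct, well formatted, and close to the budget end up with the highest relative advantage in their group, which is what steers the model toward shorter reasoning over training.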
The budget shrinks on either an exponential or a linear schedule. With exponential decay, $B(t) = \max(1, B_0 \cdot \gamma^{\lfloor t/T \rfloor})$, where $B_0$ is the starting budget (256 tokens in the experiments), $\gamma$ is the decay factor, and $T$ is the number of training steps between reductions.
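A small sketch of the schedule itself: the exponential form follows the formula above, while the linear variant's endpoints (`b_min`, `total_steps`) are illustrative assumptions since the excerpt does not specify them.

```python
def exponential_budget(step: int, b0: int = 256, gamma: float = 0.9, interval: int = 100) -> int:
    """B(t) = max(1, B0 * gamma^(floor(t / T))): shrink by a factor of gamma every `interval` steps."""
    return max(1, int(b0 * gamma ** (step // interval)))

def linear_budget(step: int, b0: int = 256, b_min: int = 64, total_steps: int = 1000) -> int:
    """Illustrative linear alternative: ramp from b0 down to b_min over total_steps."""
    frac = min(step / total_steps, 1.0)
    return max(1, round(b0 - frac * (b0 - b_min)))

for step in (0, 200, 500, 1000):
    print(step, exponential_budget(step), linear_budget(step))
```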
Results That Count — and Cost Less
Trained on both GSM8K (grade-school math) and MATH500 (competition-level math), curriculum-based models:
- Achieved higher accuracy than fixed-budget models at the same final limit.
- Exceeded the target budget by only ~5% on average, far less overshoot than base models.
- Generalized better to adversarial and out-of-distribution datasets.
- Kept token usage predictable at inference time, without needing length hints in the prompt.
Schedule design mattered: linear decay boosted accuracy on harder problems, while faster exponential decay maximized efficiency. Reward shape also mattered — a triangular reward preserved accuracy better than a flat-band reward, which tended to over-compress outputs.
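For intuition, here is a rough sketch of the two length-reward shapes; both definitions are assumptions for illustration, not the paper's exact formulas.

```python
def triangular_length_reward(num_tokens: int, budget: int) -> float:
    """Assumed form: peaks when the response length matches the budget, decays linearly on both sides."""
    return max(0.0, 1.0 - abs(num_tokens - budget) / budget)

def flat_band_length_reward(num_tokens: int, budget: int) -> float:
    """Assumed form: equal credit for any response within the budget, none beyond it."""
    return 1.0 if num_tokens <= budget else 0.0

# Under the flat band, a 40-token and a 250-token answer score identically against a
# 256-token budget, so nothing discourages over-compression; the triangular shape keeps
# pulling lengths toward the budget instead.
print(triangular_length_reward(40, 256), flat_band_length_reward(40, 256))
print(triangular_length_reward(250, 256), flat_band_length_reward(250, 256))
```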
Why It Matters for Businesses
For enterprises deploying LLMs in cost-sensitive, reasoning-heavy workflows — financial analysis, legal review, technical support — this approach offers:
- Lower inference costs without accuracy loss.
- Scalable deployment with predictable token usage.
- Customizable trade-offs between efficiency and correctness via reward weighting.
- No extra prompting overhead for length control.
By teaching models to “think smarter, not longer,” curriculum length control could become a default training paradigm for efficient AI reasoning.
Cognaptus: Automate the Present, Incubate the Future