When it comes to reasoning, bigger isn’t always better. Large language models (LLMs) often produce unnecessarily long chains of thought, burning through tokens — and budgets — even for simple problems. While fixed token limits during training can force brevity, they also rob models of the chance to first explore and then compress their reasoning.

A new study, Train Long, Think Short, proposes a smarter path: curriculum learning for length control. Instead of a one-size-fits-all cap, the model starts with a generous token budget, learns robust reasoning strategies, and then gradually adapts to shorter limits over time. The result is a model that solves complex tasks with fewer tokens, without losing accuracy.

From Exploration to Compression

The approach builds on Group Relative Policy Optimization (GRPO), a reinforcement learning method in which several responses are sampled for each prompt and rewarded according to how they perform relative to the rest of the group. The reward combines three elements (sketched in code after the list):

  1. Correctness – verified final answer accuracy.
  2. Length efficiency – adherence to a progressively shrinking token budget.
  3. Formatting – clear separation of reasoning (<think>) and answer (<answer>).
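
A minimal sketch of how these three signals might be folded into one scalar reward per sampled response, as GRPO expects. The weights, helper names, and the exact-match verifier are illustrative assumptions, not the paper's implementation:

```python
import re

def composite_reward(response: str, gold_answer: str, budget: int,
                     w_correct: float = 1.0, w_length: float = 0.5,
                     w_format: float = 0.1) -> float:
    """Score one sampled response: correctness + length efficiency + formatting."""
    # 1. Formatting: the response must separate reasoning and answer with tags.
    has_format = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                response, flags=re.DOTALL))

    # 2. Correctness: exact match on the extracted answer (a stand-in for the
    #    paper's answer verifier).
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    is_correct = answer == gold_answer.strip()

    # 3. Length efficiency: penalize overshooting the current budget,
    #    counting whitespace-separated tokens for simplicity.
    n_tokens = len(response.split())
    overshoot = max(0, n_tokens - budget) / budget
    length_score = max(0.0, 1.0 - overshoot)

    return (w_correct * float(is_correct)
            + w_length * length_score
            + w_format * float(has_format))
```

GRPO then normalizes these rewards within each group of responses sampled from the same prompt, so a response is credited for being more correct and more concise than its siblings rather than against an absolute baseline.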

The budget decays on either an exponential or a linear schedule. The exponential form is $B(t) = \max(1, B_0 \cdot \gamma^{\lfloor t/T \rfloor})$, where $B_0$ is the starting budget (256 tokens in the experiments), $\gamma$ is the decay factor, and $T$ is the number of training steps between reductions.
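
As a concrete sketch, both decay variants can be written as small functions; the specific $\gamma$, interval, and linear step size below are illustrative choices, not values from the paper:

```python
def exponential_budget(step: int, b0: int = 256, gamma: float = 0.9,
                       interval: int = 100) -> int:
    """B(t) = max(1, B0 * gamma^floor(t / T)): shrink every `interval` steps."""
    return max(1, int(b0 * gamma ** (step // interval)))

def linear_budget(step: int, b0: int = 256, delta: int = 16,
                  interval: int = 100, b_min: int = 1) -> int:
    """Linear alternative: subtract a fixed `delta` every `interval` steps."""
    return max(b_min, b0 - delta * (step // interval))

# The token budget the model is held to at a few training steps.
for t in (0, 100, 500, 1000):
    print(t, exponential_budget(t), linear_budget(t))
```

Early in training the budget is generous enough for exploratory, verbose reasoning; by the end it forces the model to compress what it has learned.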

Results That Count — and Cost Less

Trained on both GSM8K (grade-school math) and MATH500 (competition-level math), curriculum-based models:

  • Achieved higher accuracy than fixed-budget models at the same final limit.
  • Overshot the target budget by only ~5% on average, far less than the base models.
  • Generalized better to adversarial and out-of-distribution datasets.
  • Behaved predictably at inference without needing length hints in the prompt.

Schedule design mattered: linear decay boosted accuracy on harder problems, while faster exponential decay maximized efficiency. Reward shape also mattered — a triangular reward preserved accuracy better than a flat-band reward, which tended to over-compress outputs.
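
The two reward shapes can be contrasted with a small sketch; the slopes and the band width are assumptions, and only the qualitative difference matters: the triangular reward keeps a gradient that peaks at the budget, while the flat band grants full credit anywhere inside it:

```python
def triangular_length_reward(n_tokens: int, budget: int) -> float:
    """Peaks when length hits the budget and falls off linearly on both
    sides, nudging the model toward the budget rather than far below it."""
    return max(0.0, 1.0 - abs(n_tokens - budget) / budget)

def flat_band_length_reward(n_tokens: int, budget: int,
                            band: float = 0.25) -> float:
    """Full credit anywhere within +/- `band` of the budget, zero outside.
    The post reports this shape tended to over-compress outputs."""
    return 1.0 if abs(n_tokens - budget) <= band * budget else 0.0

# At a 256-token budget, a 120-token answer still earns partial triangular
# credit but falls entirely outside the flat band.
print(triangular_length_reward(120, 256), flat_band_length_reward(120, 256))
```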

Why It Matters for Businesses

For enterprises deploying LLMs in cost-sensitive, reasoning-heavy workflows — financial analysis, legal review, technical support — this approach offers:

  • Lower inference costs without accuracy loss.
  • Scalable deployment with predictable token usage.
  • Customizable trade-offs between efficiency and correctness via reward weighting.
  • No extra prompting overhead for length control.

By teaching models to “think smarter, not longer,” curriculum length control could become a default training paradigm for efficient AI reasoning.

Cognaptus: Automate the Present, Incubate the Future