When it comes to reasoning, bigger isn’t always better. Large language models (LLMs) often produce unnecessarily long chains of thought, burning through tokens — and budgets — even for simple problems. While fixed token limits during training can force brevity, they also rob models of the chance to first explore and then compress their reasoning.

A new study, Train Long, Think Short, proposes a smarter path: curriculum learning for length control. Instead of a one-size-fits-all cap, the model starts with a generous token budget, learns robust reasoning strategies, and then gradually adapts to shorter limits over time. The result is a model that solves complex tasks with fewer tokens, without losing accuracy.

From Exploration to Compression

The approach builds on Group Relative Policy Optimization (GRPO), a reinforcement learning method in which several responses are sampled for each prompt and rewarded according to how they perform relative to the rest of the group. The reward combines three elements (sketched in code after the list):

  1. Correctness – verified final answer accuracy.
  2. Length efficiency – adherence to a progressively shrinking token budget.
  3. Formatting – clear separation of reasoning (<think>) and answer (<answer>).
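
A minimal sketch of how these three signals might be folded into one scalar reward per sampled response, as GRPO expects. The weights, helper names, and the exact-match verifier are illustrative assumptions, not the paper's implementation:

```python
import re

def composite_reward(response: str, gold_answer: str, budget: int,
                     w_correct: float = 1.0, w_length: float = 0.5,
                     w_format: float = 0.1) -> float:
    """Score one sampled response: correctness + length efficiency + formatting."""
    # 1. Formatting: the response must separate reasoning and answer with tags.
    has_format = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                response, flags=re.DOTALL))

    # 2. Correctness: exact match on the extracted answer (a stand-in for the
    #    paper's answer verifier).
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer = match.group(1).strip() if match else ""
    is_correct = answer == gold_answer.strip()

    # 3. Length efficiency: penalize overshooting the current budget,
    #    counting whitespace-separated tokens for simplicity.
    n_tokens = len(response.split())
    overshoot = max(0, n_tokens - budget) / budget
    length_score = max(0.0, 1.0 - overshoot)

    return (w_correct * float(is_correct)
            + w_length * length_score
            + w_format * float(has_format))
```

GRPO then normalizes these rewards within each group of responses sampled from the same prompt, so a response is credited for being more correct and more concise than its siblings rather than against an absolute baseline.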

The budget decays on either an exponential or a linear schedule. The exponential form is $B(t) = \max(1, B_0 \cdot \gamma^{\lfloor t/T \rfloor})$, where $B_0$ is the starting budget (256 tokens in the experiments), $\gamma$ is the decay factor, and $T$ is the number of training steps between reductions.
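
As a concrete sketch, both decay variants can be written as small functions; the specific $\gamma$, interval, and linear step size below are illustrative choices, not values from the paper:

```python
def exponential_budget(step: int, b0: int = 256, gamma: float = 0.9,
                       interval: int = 100) -> int:
    """B(t) = max(1, B0 * gamma^floor(t / T)): shrink every `interval` steps."""
    return max(1, int(b0 * gamma ** (step // interval)))

def linear_budget(step: int, b0: int = 256, delta: int = 16,
                  interval: int = 100, b_min: int = 1) -> int:
    """Linear alternative: subtract a fixed `delta` every `interval` steps."""
    return max(b_min, b0 - delta * (step // interval))

# The token budget the model is held to at a few training steps.
for t in (0, 100, 500, 1000):
    print(t, exponential_budget(t), linear_budget(t))
```

Early in training the budget is generous enough for exploratory, verbose reasoning; by the end it forces the model to compress what it has learned.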

Results That Count — and Cost Less

Trained on both GSM8K (grade-school math) and MATH500 (competition-level math), curriculum-based models:

  • Achieved higher accuracy than fixed-budget models at the same final limit.
  • Overshot the target budget by only ~5% on average, far less than the base models.
  • Generalized better to adversarial and out-of-distribution datasets.
  • Behaved predictably at inference without needing length hints in the prompt.

Schedule design mattered: linear decay boosted accuracy on harder problems, while faster exponential decay maximized efficiency. Reward shape also mattered — a triangular reward preserved accuracy better than a flat-band reward, which tended to over-compress outputs.
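
The two reward shapes can be contrasted with a small sketch; the slopes and the band width are assumptions, and only the qualitative difference matters: the triangular reward keeps a gradient that peaks at the budget, while the flat band grants full credit anywhere inside it:

```python
def triangular_length_reward(n_tokens: int, budget: int) -> float:
    """Peaks when length hits the budget and falls off linearly on both
    sides, nudging the model toward the budget rather than far below it."""
    return max(0.0, 1.0 - abs(n_tokens - budget) / budget)

def flat_band_length_reward(n_tokens: int, budget: int,
                            band: float = 0.25) -> float:
    """Full credit anywhere within +/- `band` of the budget, zero outside.
    The post reports this shape tended to over-compress outputs."""
    return 1.0 if abs(n_tokens - budget) <= band * budget else 0.0

# At a 256-token budget, a 120-token answer still earns partial triangular
# credit but falls entirely outside the flat band.
print(triangular_length_reward(120, 256), flat_band_length_reward(120, 256))
```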

Why It Matters for Businesses

For enterprises deploying LLMs in cost-sensitive, reasoning-heavy workflows — financial analysis, legal review, technical support — this approach offers:

  • Lower inference costs without accuracy loss.
  • Scalable deployment with predictable token usage.
  • Customizable trade-offs between efficiency and correctness via reward weighting.
  • No extra prompting overhead for length control.

By teaching models to “think smarter, not longer,” curriculum length control could become a default training paradigm for efficient AI reasoning.

Cognaptus: Automate the Present, Incubate the Future