Train Long, Think Short: How Curriculum Learning Makes LLMs Think Smarter, Not Longer

TL;DR for operators

The paper behind this article proposes Curriculum GRPO: a reinforcement-learning training method that starts a reasoning model with a larger token budget, then gradually shrinks that budget until the model learns to solve problems in shorter traces.¹

The important point is not “ask the model to be brief.” We have tried that. It works roughly as well as asking a committee to be concise, which is to say: occasionally, under duress. The paper instead changes the training trajectory. The model is first allowed to explore longer reasoning paths, and is then forced to compress successful strategies into a tighter token budget.

For operators, the business idea is straightforward: if reasoning tokens are a recurring cost centre, compression should be learned before deployment, not negotiated at every prompt. In the paper’s main comparisons, Curriculum GRPO improves accuracy over fixed-budget GRPO while using similar or fewer tokens. On GSM8K-trained models, accuracy rises from 82.71% to 86.20% at roughly the same short output length. On MATH500-trained models, accuracy rises from 38.80% to 43.40% while average reasoning length falls from 179.3 to 137.1 tokens.

The catch is not decorative. These are math benchmarks, Qwen-2.5-7B experiments, and short token budgets capped at 256 during training. The method is promising for cost and latency control, especially where correctness can be verified automatically. It is not yet proof that every enterprise agent can think less, bill less, and still behave itself. Lovely thought. Not yet a result.

Overthinking is now a line item

There is a familiar moment in AI operations: the model answers correctly, but only after producing a small novella of intermediate reasoning. The output is not wrong. It is just expensive, slow, and faintly theatrical.

That matters because reasoning tokens are not free. Long chain-of-thought style behaviour can improve difficult problem solving, but it also turns inference into a metered habit. For a one-off demo, verbose reasoning looks impressive. For a production workflow handling thousands or millions of requests, it becomes latency, GPU time, and cost variance.

The obvious response is to cap the output. That is also the crude response. Hard token limits can amputate reasoning. Prompt instructions such as “be concise” are unreliable. Inference-time tricks can help, but they often add control logic around the model rather than changing how the model has learned to reason.

The authors of Train Long, Think Short frame the problem differently. They ask whether efficient reasoning can be learned as a curriculum. The model should not be thrown into a tiny reasoning budget from the first step. It should first discover useful solution patterns with room to think, then progressively learn to express those patterns under tighter constraints.

That is the mechanism-first story. Explore, then compress. Not: panic in 87 tokens and hope arithmetic survives.

The mechanism: exploration first, compression later

The method builds on Group Relative Policy Optimization, or GRPO. In plain language, GRPO samples multiple candidate responses for the same prompt, scores them, and updates the model toward responses that perform better relative to the sampled group. It avoids a separate value model and is well suited to sparse outcome rewards, such as whether a final mathematical answer is correct.
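The "relative to the sampled group" step can be sketched in a few lines. This is a minimal illustration, not the paper's training code: the function name, the use of population standard deviation, and the small `eps` guard are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response against its own group.

    Responses that beat the group mean get a positive advantage,
    the rest get a negative one. No value model is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt.
# 1.0 means the verifier accepted the final answer, 0.0 means it did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every response in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal. That is one reason sparse outcome rewards depend on sampling several candidates per prompt.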

The paper modifies this setup for length-controlled reasoning. Each response receives a reward made from three components:

Reward component	What it encourages	Operational interpretation
Correctness reward	The final answer must pass an automated verifier	Do not save tokens by becoming wrong. A charming strategy, but rarely billable.
Length reward	The response should match the current token budget	Keep reasoning within the compute envelope.
Formatting reward	The model should separate reasoning and answer using specified tags	Make evaluation and parsing easier during training.

The curriculum sits inside the length reward. Instead of training the model under a fixed 87-token budget throughout, the authors start at 256 tokens and decay the budget down to 87 over training. The prompt tells the model to use the specified budget, while the reward pushes it towards correctness, budget adherence, and structure.
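The three reward components in the table can be combined roughly as follows. Treat this as a sketch under stated assumptions: the weighted-sum form, the default weights, and the `<think>`/`<answer>` tag names are illustrative choices, since the article only says the model uses "specified tags". The length score is passed in as a number so the example stays independent of any particular reward shape.

```python
import re

# Hypothetical tag names, chosen for illustration only.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)


def format_score(response: str) -> float:
    """1.0 if reasoning and answer are wrapped in the expected tags, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(response.strip()) else 0.0


def total_reward(correct: bool, length_score: float, response: str,
                 w_correct: float = 1.0, w_length: float = 1.0,
                 w_format: float = 1.0) -> float:
    """Weighted sum of correctness, length adherence, and formatting.

    Assumption: a plain weighted sum with hypothetical default weights.
    """
    return (w_correct * float(correct)
            + w_length * length_score
            + w_format * format_score(response))


good = "<think>2 + 2 = 4</think><answer>4</answer>"
bad = "The answer is 4."
print(total_reward(True, 0.5, good))
print(total_reward(True, 0.5, bad))
```

Keeping the weights as explicit parameters makes the later point concrete: shifting weight between correctness and length is a policy decision, not a hidden constant.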

This is not merely a smaller inference budget. It is a training path. The model first gets enough space to find workable reasoning strategies. Later, as the budget shrinks, the policy is pressured to retain the useful parts and drop the fluff.

That distinction matters. A fixed short-budget model may never learn the longer reasoning route in the first place. A curriculum-trained model can first learn the route, then learn the shortcut. The shortcut is not magic; it is compressed competence.

The main evidence is a same-endpoint comparison

The cleanest test in the paper is not whether the curriculum model is shorter than the base model. That would be too easy. A low budget will usually reduce length. The sharper test is whether Curriculum GRPO beats fixed-budget GRPO when both end at the same final budget.

The authors train Qwen-2.5-7B in two settings: one trained on GSM8K, a grade-school arithmetic dataset, and one trained on MATH500, a harder competition-style math dataset. Evaluation covers GSM8K, MATH500, SVAMP, GSM+, and College Math.

The headline comparison:

Training setting	Comparison	Accuracy result	Token result	What it supports
GSM8K training	Fixed-budget GRPO vs Curriculum GRPO	82.71% → 86.20% on GSM8K	88.8 vs 87.0 average tokens	Curriculum improves accuracy without losing the short-budget profile.
GSM8K training, out-of-distribution (OOD)	Fixed-budget GRPO vs Curriculum GRPO	77.67% → 85.00% on SVAMP; 62.75% → 67.58% on GSM+	Token counts remain close to the fixed-budget baseline	The benefit is not limited to exact in-distribution arithmetic.
MATH500 training	Fixed-budget GRPO vs Curriculum GRPO	38.80% → 43.40% on MATH500	179.3 → 137.1 average tokens	On harder problems, curriculum improves accuracy while also shortening outputs.

This is the paper’s main evidence. The fixed-budget baseline isolates the curriculum effect because it uses the same final budget. The base model comparison shows the broader efficiency gain, but the fixed-budget comparison is where the paper earns its claim.

On GSM8K, the improvement is modest but operationally meaningful: better accuracy at essentially the same token length. On MATH500, the result is more interesting because the model becomes both more accurate and shorter. That suggests the curriculum is not simply letting the model “spend more” during evaluation. It has changed the reasoning policy enough that shorter traces can carry more useful work.
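The headline figures are easier to compare as relative changes. The snippet below only restates arithmetic on the numbers quoted above; the helper name is an assumption made for the example.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before


# Average token counts, fixed-budget GRPO -> Curriculum GRPO.
math500_tokens = relative_change(179.3, 137.1)
gsm8k_tokens = relative_change(88.8, 87.0)

# Accuracy differences in percentage points.
math500_acc_points = 43.40 - 38.80
gsm8k_acc_points = 86.20 - 82.71

print(f"MATH500: {math500_tokens:+.1%} tokens, {math500_acc_points:+.2f} points")
print(f"GSM8K:   {gsm8k_tokens:+.1%} tokens, {gsm8k_acc_points:+.2f} points")
```

On these figures, MATH500 outputs shrink by roughly 23.5% while accuracy gains 4.6 points, whereas GSM8K outputs shrink by about 2% while accuracy gains about 3.5 points.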

Still, the magnitude should be interpreted with a sober face. These are benchmark gains, not enterprise deployment savings. The paper shows a mechanism that improves token efficiency under controlled training and evaluation. Cognaptus infers that this mechanism could matter for production cost control where tasks are verifiable and repetitive. That inference is plausible. It is not automatically portable.

The reward ablations show the tuning problem hiding inside the method

The paper then tests reward weighting. This is an ablation, not a second thesis. Its purpose is to show whether the method depends on how strongly the training reward prioritises correctness versus brevity.

Two regimes are compared. A length-heavy setting compresses more aggressively. A correctness-heavy setting allows a small increase in output length to preserve accuracy.

The result is unsurprising in the best way: there is a trade-off, but it is controllable. Under length-heavy weighting, GSM8K accuracy reaches 85.37% at 92.3 tokens, still above the base model’s 83.55% at 258.4 tokens. Under correctness-heavy weighting, GSM8K accuracy rises to 87.34% at 93.5 tokens. The token cost of the extra accuracy is small in this setting.

For an operator, this is more useful than a single leaderboard number. It says Curriculum GRPO is not just a method; it creates a tuning surface. Organisations can decide whether they want a stricter cost envelope or a slightly more accuracy-favouring configuration.

But the ablation also removes a comforting fantasy. You cannot simply add a “shorter please” reward and expect the model to optimise the business objective you actually care about. Reward balance becomes product policy. If the task is low-risk and high-volume, length pressure may be acceptable. If the task is finance, compliance, engineering diagnosis, or anything where a one-point accuracy drop becomes an incident report, the reward mix should lean differently.

In other words, token efficiency is not a virtue by itself. It is a constraint to be priced against error.

The schedule ablations say the path matters, not just the destination

A lazy reading of the paper would say: start at 256 tokens, end at 87, enjoy the savings. The schedule experiments make that reading unsafe.

The authors vary how quickly the budget decays while keeping the start and end budgets fixed. This is a sensitivity test. It asks whether the model cares about the compression trajectory or only about the final token limit.

It cares.

In the exponential schedule comparison, the settings using a decay factor of 0.700 every 150 steps and 0.857 every 75 steps both reach 57.9% average accuracy across datasets, with average token counts of 135 and 115 respectively. The setting using a decay factor of 0.340 every 300 steps averages 52.4% accuracy and 248 tokens. It performs strongly on easier datasets such as GSM8K and SVAMP, but collapses badly on MATH500 at 9.8%.
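The three decay settings can be compared with a small staircase schedule. This is a sketch, assuming the budget is multiplied by the decay factor once per interval and never drops below the final 87-token budget; the function name and the floor are illustrative choices, not details confirmed by the article.

```python
def staircase_budget(step: int, start: float = 256.0, factor: float = 0.7,
                     interval: int = 150, floor: float = 87.0) -> float:
    """Token budget at a training step under stepwise exponential decay.

    Assumption: one multiplicative drop every `interval` steps,
    clamped at `floor` so the budget never falls below the target.
    """
    return max(floor, start * factor ** (step // interval))


# (decay factor, steps between drops) for the three settings discussed above.
SETTINGS = [(0.340, 300), (0.700, 150), (0.857, 75)]

for factor, interval in SETTINGS:
    trace = [round(staircase_budget(s, factor=factor, interval=interval), 1)
             for s in range(0, 601, 150)]
    print(f"factor={factor}, every {interval} steps: {trace}")
```

The arithmetic explains why these count as the same endpoint: 256 × 0.34 ≈ 87, 256 × 0.7³ ≈ 88, and 256 × 0.857⁷ ≈ 87. The settings differ in how they get there: one large drop, three medium drops, or seven small ones.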

That pattern is the paper’s most operationally annoying finding. The final budget does not define the learned behaviour. The route into that budget affects whether the model learns compression or merely experiences constraint.

The later comparison between exponential and linear decay makes the point sharper. Linear decay produces slightly longer outputs on average, 140 tokens versus 135 for exponential, but improves average accuracy from 57.9% to 60.0%. The gains are concentrated on harder benchmarks: MATH500 improves from 37.4% to 42.8%, and College Math from 13.4% to 17.2%. GSM+ moves the other way, with exponential at 67.6% versus linear at 66.4%.
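For contrast, a linear schedule removes the same number of tokens per step instead of a fixed fraction per interval. The sketch below assumes a hypothetical total of 600 training steps, which the article does not state, and a straight-line interpolation from 256 to 87.

```python
def linear_budget(step: int, total_steps: int,
                  start: float = 256.0, end: float = 87.0) -> float:
    """Token budget at a training step under straight-line decay.

    Assumption: interpolate from `start` to `end` over `total_steps`,
    then hold at `end`.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


TOTAL_STEPS = 600  # hypothetical; chosen only to make the example concrete

for step in (0, 150, 300, 450, 600):
    print(step, round(linear_budget(step, TOTAL_STEPS), 1))
```

Under these assumptions the model still has a 171.5-token budget at the halfway point, which is one plausible reading of why smoother compression is kinder to harder problems.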

For business use, this argues against treating token budgets as a static configuration file. The compression schedule is an optimisation object. A support chatbot, an internal coding assistant, and a regulated analytical workflow may need different schedules because their “acceptable compression error” differs. Shocking, yes: training curves contain product decisions.

The length reward shape exposes the danger of over-compression

The paper also compares the default triangular length reward with a band reward. This is another ablation, focused on reward shape.

The triangular reward encourages the model to use the budget well: reward rises toward the target and then falls after it. The band reward gives maximum reward to outputs up to the target and then penalises longer ones. The band version sounds attractive if one worships shortness. It says, effectively, “anything short enough is fine.”
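The two shapes described above can be written side by side. This is a minimal sketch: the linear slopes and the points where each reward reaches zero are assumptions made for illustration, and only the overall shapes come from the description above.

```python
def triangular_reward(length: float, budget: float) -> float:
    """Peaks at the budget, falls off linearly on both sides.

    Assumption: reward reaches zero at length 0 and at twice the budget.
    """
    return max(0.0, 1.0 - abs(length - budget) / budget)


def band_reward(length: float, budget: float) -> float:
    """Full reward for anything at or under the budget, linear penalty above it.

    Assumption: the penalty reaches zero at twice the budget.
    """
    if length <= budget:
        return 1.0
    return max(0.0, 1.0 - (length - budget) / budget)


BUDGET = 87
for length in (20, 87, 130):
    print(length,
          round(triangular_reward(length, BUDGET), 3),
          round(band_reward(length, BUDGET), 3))
```

A 20-token response earns the full band reward but only a small triangular reward. The band shape gives the policy no reason to use the budget it has, which is how "short enough" can turn into "too short to be right."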

The results explain why that can be a trap.

Reward function	Average token count	Average accuracy	Interpretation
Triangular	135	57.9%	Uses more tokens but preserves more reasoning quality.
Band	94	55.0%	Compresses harder but loses accuracy.

The loss is especially visible on harder or more adversarial tasks. On MATH500, triangular reward reaches 37.4% accuracy at 201 tokens, while band reward reaches 30.8% at 112 tokens. On GSM+, triangular reaches 67.6%, while band reaches 64.6%.

This is a useful warning. If the reward says “short is good,” the model may learn to be short before it learns to be reliably correct. The triangular reward is more nuanced. It does not reward verbosity for its own sake, but it avoids prematurely celebrating tiny traces that happen to pass easier examples.

In enterprise language, the band reward is the finance department’s dream and the risk team’s migraine. It produces visible savings. It may also quietly throw away reasoning capacity.

What the paper directly shows, what Cognaptus infers, and what remains open

The paper’s evidence is strongest when read within its experimental frame. It directly shows that, for Qwen-2.5-7B trained with GRPO on selected math datasets, curriculum-based budget tightening can outperform fixed-budget GRPO at the same final budget.

That is already useful. It says the learning trajectory matters. It also says that token efficiency can be trained, not merely requested.

The business interpretation requires one extra step. Cognaptus infers that similar training-time compression could help organisations running reasoning-heavy LLM workloads where cost, latency, and throughput are material. Examples include automated tutoring, mathematical checking, structured analytical assistants, coding support, and constrained decision-support systems where outputs can be verified or scored.

But that inference has boundaries.

Paper result	Business meaning	Boundary
Curriculum GRPO beats fixed-budget GRPO on math benchmarks	Train models to become concise rather than forcing concision only at inference	Evidence is mathematical and verifier-friendly. Open-ended business tasks may behave differently.
Reward weighting controls accuracy-token trade-off	Teams can tune efficiency against risk appetite	Reward design requires measurement infrastructure and domain-specific evaluation.
Schedule shape affects final accuracy and length	Compression should be treated as a training design variable	There is no universal schedule shown for all model sizes or task types.
Linear decay improves harder-task accuracy at slightly higher length	Smoother compression may preserve complex reasoning	The token savings are smaller; harder tasks may resist aggressive compression.
Band reward shortens outputs but hurts accuracy	Over-compression is real	Token savings alone can be a misleading KPI.

The lesson is not “always compress reasoning.” The lesson is “compress reasoning only after the model has learned something worth compressing.”

Where this applies first

The most natural early use cases are high-volume, bounded, verifiable tasks. That means problems where the answer can be checked, where a shorter reasoning trace is valuable, and where failures can be measured.

Mathematics is the clean laboratory because correctness can be automatically verified. Enterprise equivalents include rules-based calculations, data transformation validation, configuration checks, structured diagnostics, and narrow coding tasks with executable tests. In those contexts, a Curriculum GRPO-like method could help convert verbose model behaviour into a cheaper policy without relying on brittle prompt-level nagging.

The method is less immediately proven for open-ended advisory tasks. Strategy memos, legal analysis, medical reasoning, financial recommendations, and multi-step agent workflows introduce evaluation ambiguity. A model might be concise because it has compressed the right reasoning, or because it has stopped saying the parts where the mistake lives. Very efficient. Also how reputational fires begin.

That does not make the method irrelevant. It means the evaluation layer must mature first. If a business cannot measure correctness, it should be cautious about rewarding brevity. Compression without verification is not efficiency. It is silence with confidence.

The limitation section is short, but it matters

The authors name two main limitations. First, the experiments use relatively short context windows and budgets capped at 256 tokens. That is enough for many GSM8K-style problems and only partly sufficient for harder math. It does not tell us how curriculum compression behaves when tasks require thousands of tokens of reasoning, tool calls, retrieval, or multi-document synthesis.

Second, all experiments use Qwen-2.5-7B. That is a reasonable model for experimentation, but it leaves open how the method scales. Larger models may compress differently. Smaller models may lack enough reasoning capacity to benefit from the same curriculum. Different model families may respond differently to the same reward design.

There is also an implicit limitation: the setup depends on automated verification. Math-verify can score final answers. Many business tasks do not have such clean verifiers. Without reliable scoring, GRPO-style optimisation becomes harder to run and easier to fool. Models are excellent at finding reward loopholes. They do not even look embarrassed.

So the method is most convincing where three conditions hold: the task is repeated, the cost of reasoning tokens matters, and correctness can be evaluated with reasonable confidence.

The operator’s takeaway: budget control belongs in training, not just prompting

The paper’s core contribution is not a new slogan about efficient AI. It is a training dynamic: let the model think long enough to learn, then make it think shorter without forgetting how.

That is a useful corrective to the current “more reasoning is better” instinct. More reasoning can help, but it is not automatically better. Past a point, it becomes expensive ceremony. The operational goal is not maximum reasoning length. It is sufficient reasoning at predictable cost.

Curriculum GRPO gives one plausible route: use reinforcement learning to combine correctness, formatting, and length adherence while gradually tightening the budget. The results show that this trajectory can outperform fixed-budget training and that the details (reward weights, decay timing, and reward shape) materially affect the outcome.

For businesses, the strategic implication is modest but important. Future LLM cost control will not come only from cheaper inference chips, smaller models, caching, or prompt engineering. It will also come from shaping how models learn to spend tokens in the first place.

Train long. Think short. Invoice less. Maybe even keep the accuracy. That last clause is where the work lives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hasan Abed Al Kader Hammoud, Kumail Alhamoud, Abed Hammoud, Elie Bou-Zeid, Marzyeh Ghassemi, and Bernard Ghanem, “Train Long, Think Short: Curriculum Learning for Efficient Reasoning,” arXiv:2508.08940, 2025, https://arxiv.org/abs/2508.08940.
