Opening — Why this matters now

Large language models can solve math problems. The more interesting question in 2025 is whether they can learn how to reason, at scale, across contexts that are long, messy, and computationally expensive. Most math datasets answer the first question. Nemotron-Math answers the second — and does so with a surprisingly pragmatic eye on cost.

As context windows stretch toward 128K tokens and beyond, brute-force long-context training is becoming economically absurd. Nemotron-Math arrives with a sharper proposition: diversify reasoning behaviors, integrate tools properly, and train long only when you must.

Background — Context and prior art

Mathematical reasoning has long been the stress test for LLMs. Recent datasets — OpenMathInstruct, NuminaMath, OpenMathReasoning — pushed models toward deeper chain-of-thought supervision, mostly via competition-style problems.

But two structural limits remained:

  1. Single-style reasoning — one dominant depth, one tone, one way of thinking.
  2. Competition bias — elegant, formal problems that underrepresent how real people ask math questions.

Meanwhile, long-context fine-tuning quietly became a compute sinkhole: most samples are short, yet every step pays the 128K-token tax.

Nemotron-Math targets all three problems at once.

Analysis — What the paper actually does

Nemotron-Math is a 7.5M-trace mathematical reasoning dataset distilled from gpt-oss-120b, exploiting two unusual capabilities:

  • Multi-mode reasoning control: high / medium / low depth
  • Tool-Integrated Reasoning (TIR) via Python execution

Dataset design (the non-obvious parts)

| Dimension | Design choice | Why it matters |
|---|---|---|
| Problem sources | AoPS + StackExchange-Math | Balances rigor with real-world diversity |
| Reasoning modes | High / Medium / Low | Teaches models how much to think |
| Tool usage | With / without Python | Separates symbolic reasoning from computation |
| Filtering | Remove problems with ≥0.8 low-mode pass rate | Avoids wasting tokens on trivial tasks |

After aggressive filtering, the final corpus spans 347K problems with reasoning traces up to 128K tokens.
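The pass-rate filter is simple to picture in code. Below is a minimal sketch, assuming each problem comes with a list of correct/incorrect outcomes from sampled low-mode attempts; all names and data are illustrative, not the paper's implementation.

```python
# Hypothetical sketch of the difficulty filter: drop any problem that the
# low-reasoning mode already solves in >= 80% of sampled attempts, since
# spending long traces on it wastes tokens.

def low_mode_pass_rate(attempts: list[bool]) -> float:
    """Fraction of low-mode attempts that reached the correct answer."""
    return sum(attempts) / len(attempts) if attempts else 0.0

def filter_trivial(problems: dict[str, list[bool]], threshold: float = 0.8) -> list[str]:
    """Keep only problem IDs whose low-mode pass rate is below the threshold."""
    return [pid for pid, attempts in problems.items()
            if low_mode_pass_rate(attempts) < threshold]

# Toy example: p1 is solved 4/4 times in low mode (trivial, dropped);
# p2 is solved only 1/4 times (kept).
problems = {"p1": [True, True, True, True], "p2": [True, False, False, False]}
kept = filter_trivial(problems)
```

The threshold of 0.8 comes straight from the table above; everything else here is a stand-in for whatever grading pipeline the authors actually use.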

The underrated innovation: sequential bucketed training

Instead of training everything at full context length, Nemotron-Math introduces a progressive length curriculum:

16K → 32K → 64K → 128K tokens

Each stage uses parallelism settings optimized for that length.
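The curriculum above amounts to routing each sample to the smallest context budget that fits it, then training the stages in increasing order. A minimal sketch, with bucket sizes taken from the progression above and everything else illustrative:

```python
from bisect import bisect_left

# Illustrative sketch of the sequential bucketed curriculum: samples are
# grouped by token length, and each stage trains only on sequences that fit
# its context budget, so short traces never pay the 128K-token cost.

BUCKETS = [16_384, 32_768, 65_536, 131_072]  # 16K -> 32K -> 64K -> 128K

def assign_bucket(num_tokens: int) -> int:
    """Smallest context length that fits the sample (capped at 128K)."""
    i = bisect_left(BUCKETS, num_tokens)
    return BUCKETS[min(i, len(BUCKETS) - 1)]

def curriculum(sample_lengths: list[int]) -> dict[int, list[int]]:
    """Group sample lengths into the stage where each will be trained."""
    stages: dict[int, list[int]] = {b: [] for b in BUCKETS}
    for n in sample_lengths:
        stages[assign_bucket(n)].append(n)
    return stages

stages = curriculum([900, 20_000, 60_000, 130_000])
```

In a real run, each stage would also pick its own parallelism configuration (as the text notes); that part is omitted here because it depends on the training framework.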

| Training strategy | Relative cost | Accuracy loss |
|---|---|---|
| Full 128K from start | 1.0× (baseline) | 0% |
| Sequential bucketed | ~0.4–0.5× | 1–3% |

This is not glamorous. It is operationally decisive.

Findings — What actually improves (with evidence)

1. Better supervision beats more data

When controlling for problem set and scale, Nemotron-Math outperforms OpenMathReasoning across AIME and HMMT benchmarks.

| Dataset | AIME25 pass@1 | AIME25 maj@16 |
|---|---|---|
| OpenMathReasoning | 59.38% | 71.67% |
| Nemotron-Math | 77.08% | 90.00% |

The gain comes from how reasoning is generated, not just how much.
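For readers unfamiliar with the two metrics in the table: pass@1 scores a single sampled solution per problem, while maj@16 takes a majority vote over 16 sampled final answers. A minimal sketch with toy data (illustrative, not from the paper):

```python
from collections import Counter

def pass_at_1(samples: list[list[str]], gold: list[str]) -> float:
    """Fraction of problems whose first sampled answer matches the reference."""
    return sum(s[0] == g for s, g in zip(samples, gold)) / len(gold)

def maj_at_k(samples: list[list[str]], gold: list[str]) -> float:
    """Fraction of problems where the most common sampled answer is correct."""
    hits = 0
    for answers, g in zip(samples, gold):
        majority, _ = Counter(answers).most_common(1)[0]
        hits += majority == g
    return hits / len(gold)

# Two toy problems, 4 samples each: the first sample is wrong on problem 2,
# but majority voting recovers the correct answer.
samples = [["7", "7", "3", "7"], ["12", "5", "5", "9"]]
gold = ["7", "5"]
```

This is why the maj@16 column sits well above pass@1 for both datasets: voting filters out low-probability mistakes.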

2. Community math improves robustness

Adding StackExchange-Math barely affects olympiad performance — but meaningfully boosts HLE-Math, an open-domain benchmark.

This is a quiet rebuke to the idea that harder problems alone produce better reasoners.

3. Tool use matters — but only when taught properly

Across all settings:

  • Python TIR consistently outperforms non-tool reasoning
  • High-mode + TIR reaches 100% maj@16 on AIME 2024/25

Importantly, models without tool access still benefit from having seen tool-augmented traces during training.
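At its core, tool-integrated reasoning is a loop: the model's output contains executable code blocks, an executor runs them, and the results are spliced back into the transcript before generation continues. The sketch below uses `<python>` tags as a purely illustrative convention (the dataset's actual trace format is not specified here), and a hard-coded trace in place of a real model call:

```python
import re

# Minimal sketch of one TIR step: find embedded Python blocks, execute each,
# and replace it with its output so the model can condition on the result.
CODE_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)

def run_tool_call(code: str) -> str:
    """Execute a code block and return the value bound to `result`."""
    scope: dict = {}
    exec(code, scope)  # NOTE: any real deployment must sandbox this
    return str(scope.get("result", ""))

def tir_step(model_output: str) -> str:
    """Replace each embedded Python block with its execution output."""
    def _sub(m: re.Match) -> str:
        return f"[tool output: {run_tool_call(m.group(1))}]"
    return CODE_BLOCK.sub(_sub, model_output)

# Stand-in for a model-generated trace that offloads arithmetic to Python.
trace = "The sum is <python>result = sum(range(1, 101))</python> so the answer is 5050."
```

The design point the findings support: traces of this shape teach the model *when* computation should be delegated, and that judgment apparently transfers even when no executor is attached at inference time.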

Implications — Why this matters beyond math

Nemotron-Math is not really about math. It is about behavioral supervision at scale.

Three broader takeaways:

  1. Reasoning depth is a controllable variable — and should be trained as such.
  2. Long-context training needs curricula, not brute force.
  3. Tool-augmented cognition transfers, even when tools are absent at inference.

For businesses building agentic systems, this suggests a shift:

Train how models think, not just what they answer.

The same logic applies to legal analysis, financial modeling, and multi-step planning — anywhere verbosity, verification, and cost collide.

Conclusion — The quiet efficiency play

Nemotron-Math does not introduce a flashy new architecture. It does something rarer: it respects economics.

By combining multi-style reasoning, real-world problem diversity, and a disciplined long-context training strategy, it shows that better reasoning does not require infinite tokens — just better supervision.

That is a lesson many foundation-model teams are about to relearn the expensive way.

Cognaptus: Automate the Present, Incubate the Future.