Long Thoughts, Short Bills: Distilling Mathematical Reasoning at Scale

The invoice arrives after the benchmark party

Math benchmarks are fun until the training bill arrives. A model can be taught to produce longer reasoning traces. It can be shown more olympiad problems. It can be given Python. It can be pushed into 128K-token contexts and told, heroically, to think harder. All of this sounds impressive in a benchmark table. Less impressive is the operational detail that most training samples do not need the full 128K window, yet a naive training setup can still make every step pay for it. ...

December 18, 2025 · 17 min · Zelina

DeepSeek-R1

An open-source reasoning model achieving state-of-the-art performance in math, code, and logic tasks.

2 min