Long Thoughts, Short Bills: Distilling Mathematical Reasoning at Scale
The invoice arrives after the benchmark party

Math benchmarks are fun until the training bill arrives. A model can be taught to produce longer reasoning traces. It can be shown more olympiad problems. It can be given Python. It can be pushed into 128K-token contexts and told, heroically, to think harder. All of this sounds impressive in a benchmark table. Less impressive is the operational detail that most training samples do not need the full 128K window, yet a naive training setup can still make every step pay for it. ...
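To make the "every step pays for it" point concrete, here is a rough back-of-the-envelope sketch. It assumes the naive setup pads every sample to the full 128K window, and it counts only tokens processed, ignoring attention's super-linear cost. The sample lengths and the helper names (`padded_tokens`, `actual_tokens`, `padding_waste`) are hypothetical, chosen purely for illustration and not taken from any real dataset or training run.

```python
# Illustrative only: the lengths below are made-up numbers, not measurements.
MAX_CONTEXT = 128 * 1024  # 131,072 tokens, the full window


def padded_tokens(lengths, max_context=MAX_CONTEXT):
    """Tokens processed if every sample is padded to the full window."""
    return len(lengths) * max_context


def actual_tokens(lengths):
    """Tokens that carry real content."""
    return sum(lengths)


def padding_waste(lengths, max_context=MAX_CONTEXT):
    """Fraction of processed tokens that are pure padding."""
    if any(n > max_context for n in lengths):
        raise ValueError("sample longer than the context window")
    return 1 - actual_tokens(lengths) / padded_tokens(lengths, max_context)


# Hypothetical batch: four short-to-medium traces and one that fills the window.
lengths = [2_000, 4_000, 8_000, 16_000, 131_072]

print(f"real tokens:   {actual_tokens(lengths):,}")
print(f"padded tokens: {padded_tokens(lengths):,}")
print(f"padding waste: {padding_waste(lengths):.1%}")
```

Under these assumed lengths, roughly three quarters of the tokens processed are padding. The exact figure depends entirely on the real length distribution, which this sketch does not know.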