Mathematical Reasoning

Long Thoughts, Short Bills: Distilling Mathematical Reasoning at Scale

The invoice arrives after the benchmark party

Math benchmarks are fun until the training bill arrives. A model can be taught to produce longer reasoning traces. It can be shown more olympiad problems. It can be given Python. It can be pushed into 128K-token contexts and told, heroically, to think harder. All of this sounds impressive in a benchmark table. Less impressive is the operational detail that most training samples do not need the full 128K window, yet a naive training setup can still make every step pay for it. ...

Error Hunting Season: Why Pessimism Makes LLMs Smarter at Math

Review is not a democracy. That sounds unpleasant, which is why it is useful. In many business settings, we like consensus because it feels stable. Three analysts agree, five reviewers approve, the dashboard turns green, and everyone can pretend the risk has been domesticated. Mathematics is less polite. One invalid theorem application, one hidden assumption, one algebraic step that does not follow, and the whole proof may collapse. The majority does not get to vote a contradiction out of existence. ...

The Problem with Problems: Why LLMs Still Don’t Know What’s Interesting

A tutoring system has one deceptively simple job: give the learner the next problem. Not the hardest problem. Not the flashiest problem. Not the one that makes the model feel terribly pleased with itself after a 4,000-token monologue. The next problem: the one that keeps a student engaged, teaches the right structure, and feels worth the effort. ...