Drafts, Then Do Better: Teaching LLMs to Outgrow Their Own Reasoning
Opening: Why this matters now

Large language models have learned to sound confident. Unfortunately, confidence is not correctness, especially in long-horizon reasoning tasks like competition math or multi-step logic. Reinforcement learning has helped, but most RL pipelines still assume a one-shot world: generate once, score once, update once. Humans don't work that way. We draft, reread, cringe, fix, and try again. ...