Confidence Is Not Truth, But It Can Steer: When LLMs Learn When to Stop
Opening — Why this matters now

Large Language Models are no longer compute-bound at training time. They are inference-bound at deployment time. The last year has made this painfully clear. Frontier reasoning models increasingly win benchmarks not by being smarter, but by thinking more: longer chains-of-thought, more samples, more retries, more votes. The result is an arms race in test-time scaling—512 samples here, best-of-20 there—where accuracy inches upward while token bills explode. ...
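To make that cost structure concrete, here is a minimal sketch of the majority-vote (self-consistency) flavor of test-time scaling named above. Everything here is illustrative: `sample_answer` is a hypothetical stand-in for one stochastic model call, not the API of any particular library or the method of any specific paper.

```python
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[], str], n_samples: int = 20) -> str:
    """Test-time scaling by self-consistency: draw N answers, return the mode.

    `sample_answer` is a hypothetical stand-in for a single sampled
    chain-of-thought reduced to its final answer.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    # Accuracy tends to inch upward with n_samples, but cost grows linearly:
    # every extra vote costs a full generation's worth of tokens.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

The key point is in the comment: each additional vote buys a shrinking accuracy gain at a constant token price, which is exactly the trade-off the arms race keeps paying.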