Policy-Optimization

Tool calls are not tokens. Neither are paragraphs, reasoning blocks, spreadsheet edits, web searches, code executions, or the awkward little detours an agent takes before finally answering the user. Yet much of reinforcement learning for language models still behaves as if it must choose between two unsatisfying extremes. At one end, every token is treated as a tiny action. At the other, the whole answer is treated as one indivisible action. The first view is mathematically tidy and operationally noisy. The second is practical for verifiable tasks, but it compresses an entire reasoning process into one final score, which is a bit like reviewing an employee only by checking whether the office building is still standing. ...
