Reward Models

TL;DR for operators

Every GUI automation project has a familiar failure mode: the agent gets almost there, makes one bad click, and the training system treats the whole episode as garbage. That is tidy for spreadsheets and absurd for learning. ProgRM addresses that absurdity by replacing final-only success/failure rewards with step-level estimates of task progress.1 Instead of asking only, “Did the agent finish?”, it asks, “How much closer is the agent now than it was one step ago?” The reward is the change in estimated progress. A search that reaches the right article but fails to bookmark it is no longer equivalent to an agent staring at the home screen and scrolling like a caffeinated intern. ...
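The sketch below compares final-only rewards with rewards defined as the change in estimated progress. It is a minimal illustration and not ProgRM's implementation. The progress values are invented. `progress_rewards` and `final_only_rewards` are hypothetical helpers written for this example. The sketch also assumes a progress estimate in [0, 1] is already available for every state.

```python
from typing import List


def progress_rewards(progress: List[float]) -> List[float]:
    """Per-step reward = estimated progress after the step minus before it.

    `progress[t]` is a hypothetical estimate in [0, 1] of how close the agent
    is to finishing at state t (index 0 is the initial screen), so a
    trajectory with T actions has T + 1 progress values and T rewards.
    """
    return [after - before for before, after in zip(progress, progress[1:])]


def final_only_rewards(num_steps: int, success: bool) -> List[float]:
    """Outcome-only baseline: 0.0 everywhere, 1.0 on the last step on success."""
    rewards = [0.0] * num_steps
    if num_steps and success:
        rewards[-1] = 1.0
    return rewards


# Invented progress estimates for two failed episodes (not real model outputs).
near_miss = [0.0, 0.3, 0.6, 0.8, 0.8]  # reaches the article, never bookmarks it
idle = [0.0, 0.0, 0.0, 0.0, 0.0]       # scrolls around the home screen

for name, traj in [("near_miss", near_miss), ("idle", idle)]:
    steps = len(traj) - 1
    print(name, "final-only:", final_only_rewards(steps, success=False))
    print(name, "progress:  ", progress_rewards(traj))
```

With final-only rewards, both failed episodes get all-zero reward vectors and are indistinguishable. With progress differences, the near miss earns positive reward for each step that moved it forward, while the idle episode earns nothing. The per-step rewards also telescope, so their sum equals final progress minus initial progress.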