TL;DR for operators
Every GUI automation project has a familiar failure mode: the agent gets almost there, makes one bad click, and the training system treats the whole episode as garbage. That is tidy for spreadsheets and absurd for learning.
ProgRM addresses that absurdity by replacing final-only success/failure rewards with step-level estimates of task progress [1]. Instead of asking only, “Did the agent finish?”, it asks, “How much closer is the agent now than it was one step ago?” The reward is the change in estimated progress. A search that reaches the right article but fails to bookmark it is no longer equivalent to an agent staring at the home screen and scrolling like a caffeinated intern.
The paper’s strongest result comes from ProgRM trained with environment-provided milestone labels: 62.00% success on the WikiHow Android benchmark, compared with 58.67% for an outcome reward model, 58.00% for GUI-R1, and 56.00% for both Claude-3.7-Sonnet prompting and the supervised fine-tuned actor. A longer 30K-step training run pushes the environment-label ProgRM actor to 67.33%.
The more operationally interesting result is the weaker but still useful one: ProgRM trained with LCS-based self-annotated progress labels reaches 59.33%. That is only 0.66 points above ORM's 58.67%, but it matters because it does not rely on manually labelled process rewards or expensive frontier-model evaluators. The paper is basically saying: if your workflows contain repeated successful patterns, you may be able to mine them into progress supervision. Not magic. Engineering.
For business teams, the message is clear: the next improvement in GUI agents may come less from buying a larger model and more from building better reward instrumentation around the workflows you already care about. The boundary is equally clear: this was tested on WikiHow Android tasks, the LCS labels still trail environment milestone labels, QA tasks remain a weakness, and a better-trained GUI agent is also a more capable agent for doing the wrong thing quickly. Delightful, as always.
The real bug is not failure; it is throwing away partial success
Outcome reward models are attractive because they are simple. At the end of an episode, the evaluator says success or failure. The reinforcement learning loop gets its reward. The dashboard gets a binary number. Everyone can pretend the world is crisp.
GUI work is not crisp.
A mobile task usually unfolds through a sequence of local achievements: search for the target article, enter the query, select a result, open a menu, bookmark, rate, share, answer, or navigate back. A failed trajectory can still contain several correct sub-actions. Conversely, a successful trajectory can contain hesitation, redundant scrolling, or recovery from a bad turn. Treating all failed trajectories as zero-value objects wastes exactly the data that online training should exploit: the agent’s messy attempts.
That is the mechanism-first point of ProgRM. The paper is not mainly about a new GUI benchmark number. It is about changing what the training system can see.
An ORM sees this:
| Episode type | ORM interpretation | Training consequence |
|---|---|---|
| Agent searches correctly, opens the right page, then misses the final bookmark click | Failure | Penalise the whole trajectory |
| Agent loops through scroll actions and never approaches the target | Failure | Penalise the whole trajectory |
| Agent succeeds after several useless detours | Success | Reward the whole trajectory |
ProgRM tries to see this instead:
| Episode type | ProgRM interpretation | Training consequence |
|---|---|---|
| Agent searches correctly, opens the right page, then misses the final bookmark click | Partial progress, then stall | Reward useful steps, penalise failure to complete |
| Agent loops through scroll actions and never approaches the target | No meaningful progress | Discourage repetition |
| Agent succeeds after several useless detours | Progress with hollow steps | Reward key steps more than filler |
This distinction is not philosophical. It directly affects exploration. If the reward only arrives at the end, long-horizon GUI tasks become a maze where most attempts look equally useless. If progress is measured along the way, the agent receives feedback before the whole episode collapses into “failed”. A small mercy, but reinforcement learning often lives on small mercies.
ProgRM turns progress into the reward, not the report card
The paper defines progress as the percentage of a task completed at a given state. ProgRM estimates that progress from the task instruction, the action history, and the current screen representation. The reward is then based on the progress gain across recent steps.
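In the main experiments, the progress history length is set to 1. So the useful mental model is simple: reward(t) = progress(t) - progress(t-1), that is, the estimated progress after step t minus the estimated progress before it.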
The paper allows a longer history window, but its own ablation shows that increasing the history length hurts performance on this benchmark. That makes sense. Most GUI actions already have visible local effects: click, input, scroll, go back, answer. If a task’s action space is not extremely atomic, crediting one step at a time is cleaner than smearing reward over several steps and hoping the algorithm guesses which click mattered. Hope, as usual, is not a systems architecture.
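To make the history-length-1 case concrete, here is a minimal sketch of turning a sequence of progress estimates into per-step rewards. The function name, the starting progress of 0.0, and the sample estimates are assumptions for illustration, not the paper's code.

```python
from typing import List


def progress_gain_rewards(progress: List[float], initial: float = 0.0) -> List[float]:
    """Reward for each step = progress after the step minus progress before it."""
    rewards = []
    previous = initial  # assumed progress before the first action
    for current in progress:
        rewards.append(current - previous)
        previous = current
    return rewards


# Hypothetical progress estimates for a five-step episode.
estimates = [0.25, 0.25, 0.5, 0.5, 0.75]
rewards = progress_gain_rewards(estimates)
print(rewards)  # steps 2 and 4 made no progress, so they earn nothing
```

Steps that leave the estimate unchanged earn zero, and the rewards sum to the final progress estimate, so an episode that stalls at 0.75 is still worth more than one that never leaves 0.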
The progress model itself is practical rather than exotic. It combines a pretrained LLM with an MLP head and a sigmoid output so the predicted progress stays between 0 and 1. It is trained with binary cross-entropy against progress labels. The actor is based on Qwen2.5-7B, first supervised fine-tuned on successful trajectories, then trained online with an adaptation of REINFORCE++ for multi-turn GUI interaction.
That last detail corrects a common lazy summary of this work. ProgRM is not “just PPO with a nicer reward”. The paper uses a multi-turn RL setup adapted from REINFORCE++, with token-level credit assignment for language-agent actions. The important part for operators is less the algorithmic badge and more the reward interface: every step can now be scored for incremental progress.
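The training objective for the progress model is easier to see in miniature. The sketch below uses plain Python rather than a deep-learning framework and shows binary cross-entropy applied to a soft progress target in [0, 1]; the function names and the clipping constant are assumptions, and the real model puts an LLM and MLP head in front of the sigmoid.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw score into the (0, 1) progress range."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_bce(logit: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between sigmoid(logit) and a progress label in [0, 1]."""
    p = min(max(sigmoid(logit), eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# A label of 0.5 means "halfway through the task".
loss_on_target = soft_bce(0.0, 0.5)   # prediction 0.5, matches the label
loss_off_target = soft_bce(2.0, 0.5)  # prediction about 0.88, too optimistic
print(loss_on_target, loss_off_target)
```

With a fractional target the loss never reaches zero, but it is smallest when the predicted progress equals the label, which is what lets one loss function cover both "not started" and "almost done".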
The clever part is the label factory
Dense reward sounds expensive because dense labels sound expensive. In reasoning tasks, process reward models often rely on human annotation or Monte Carlo rollouts from intermediate states. GUI environments make that even more annoying. Resetting mobile states, rolling back app interactions, and evaluating arbitrary partial trajectories is not exactly what one does for relaxation.
ProgRM’s answer is an LCS-based self-annotation pipeline. The idea is almost suspiciously simple: successful trajectories for the same task often share common action patterns. Those common subsequences can be treated as execution recipes. Once a recipe exists, new trajectories can be aligned to it, and progress can be assigned according to how much of the recipe has been completed.
The pipeline has three stages.
First, the system builds a recipe library. For each task goal, successful trajectories are grouped using LCS-based similarity. Because GUI actions may include natural language arguments, the authors use a soft LCS variant rather than relying only on exact action matches. The appendix notes a trajectory grouping similarity threshold of 0.6.
Second, the system discovers key steps. Given a new trajectory, it selects the recipe with the highest completion ratio. Steps in the trajectory that match the selected recipe are treated as key steps: the moments that actually move the task forward.
Third, it assigns progress labels. A key step receives a progress value based on its position in the matched recipe. Non-key steps inherit the progress label of the nearest preceding key step. In other words, useless scrolling does not get to pretend it advanced the task merely because time passed. Good.
A simplified version:
| Stage | What happens | Why it matters |
|---|---|---|
| Recipe construction | Successful trajectories are grouped; common subsequences become task recipes | Converts repeated success patterns into weak process supervision |
| Key-step discovery | A new trajectory is aligned to the best-matching recipe | Separates progress-producing actions from hollow actions |
| Progress assignment | Key steps receive increasing progress values; non-key steps inherit prior progress | Produces dense labels without human step-by-step annotation |
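
The second and third stages can be sketched in a few lines. This is a simplified illustration that uses exact action matching instead of the paper's soft LCS, and it assumes that a key step's progress is its 1-based position in the recipe divided by the recipe length; the action strings and function names are hypothetical.

```python
from typing import List, Tuple


def lcs_alignment(trajectory: List[str], recipe: List[str]) -> List[Tuple[int, int]]:
    """Return (trajectory_index, recipe_index) pairs on one longest common subsequence."""
    n, m = len(trajectory), len(recipe)
    # table[i][j] = LCS length of trajectory[i:] and recipe[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if trajectory[i] == recipe[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if trajectory[i] == recipe[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def best_recipe(trajectory: List[str], recipes: List[List[str]]) -> List[str]:
    """Pick the recipe with the highest completion ratio for this trajectory."""
    return max(recipes, key=lambda r: len(lcs_alignment(trajectory, r)) / len(r))


def progress_labels(trajectory: List[str], recipe: List[str]) -> List[float]:
    """Key steps get position/len(recipe); other steps inherit the previous label."""
    key_steps = dict(lcs_alignment(trajectory, recipe))
    labels, current = [], 0.0
    for index in range(len(trajectory)):
        if index in key_steps:
            current = (key_steps[index] + 1) / len(recipe)
        labels.append(current)
    return labels


recipes = [
    ["click(search)", "input(query)", "click(result)", "click(bookmark)"],
    ["click(menu)", "click(settings)"],
]
trajectory = ["click(search)", "scroll(down)", "input(query)", "click(result)", "scroll(down)"]
recipe = best_recipe(trajectory, recipes)
labels = progress_labels(trajectory, recipe)
print(labels)
```

The two scroll actions inherit the label of the key step before them, so the trajectory tops out at 0.75: it found the article and never bookmarked it.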
This is the paper’s main operational move. It does not require a frontier model to judge every step during online training. It also does not require a human to label “this click is 42% progress” with a straight face. It uses behaviour already generated by agents and mines structure from it.
There is a catch, and it is important: the paper also evaluates ProgRM trained with environment-reward-based progress labels, called ProgRMEnv. WikiHow provides intermediate milestone rewards, which serve as a cleaner label source. The fully self-annotated version is ProgRMLCS. ProgRMEnv performs better. That means LCS self-annotation is useful, not solved.
The experiment is really two claims, not one
The paper evaluates ProgRM on the WikiHow benchmark, a real-world Android navigation benchmark with 577 annotated tasks. The authors use 427 tasks for training and 150 tasks for testing. The test set is split into Cross-Page, In-Page, and QA tasks.
The reward-model training data comes from rollouts using Qwen2.5-7B and GPT-4o-mini. The authors collect 7,725 trajectories, then use synthesis to augment and balance the dataset to 10,438 trajectories: 5,729 successful and 4,709 failed. The appendix reports 207,102 total steps. For ProgRM training, trajectories are split into individual steps, producing 113,270 training steps, 15,935 validation steps, and 30,220 test steps after sampling failed-trajectory steps for balance.
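So the paper is making two related but distinct core claims, one about progress rewards and one about self-annotated labels, backed by two supporting claims about latency and failure modes: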
| Claim | Evidence source | What it supports | What it does not prove |
|---|---|---|---|
| Progress rewards help GUI-agent RL | ProgRMEnv reaches the best success rate on WikiHow | Dense progress labels can outperform outcome-only rewards | That automatic labels are already optimal |
| LCS self-annotation can generate usable progress labels | ProgRMLCS beats ORM slightly and has better reward-model diagnostics | Dense labels can be bootstrapped from trajectories | That LCS recipes match environment milestones perfectly |
| Lightweight reward models are practical for online RL | ProgRM latency is 0.050s vs 5.725s/8.531s for Claude evaluator variants | Self-hosted reward models are more suitable for training loops than frontier evaluators | That they are safe or reliable enough for production autonomy |
| Progress rewards reduce repetitive failure modes | Failure analysis shows lower useless repetition for ProgRM-trained actors | Progress gain discourages actions that do not move the task forward | That all error categories are solved |
That separation matters. The best benchmark number belongs to ProgRMEnv, not the LCS-only version. The most commercially relevant idea belongs to ProgRMLCS, because most enterprises do not have perfectly instrumented milestone rewards sitting around in their legacy systems, labelled and waiting politely.
The main result is modest, but the diagnostic evidence is stronger
Here are the headline results from the 150-task WikiHow test set:
| Actor | Avg. cumulative reward | Success rate |
|---|---|---|
| GPT-4o-mini | 1.60 | 38.00% |
| GPT-4.1-mini | 2.16 | 52.00% |
| Claude-3.7-Sonnet | 2.38 | 56.00% |
| Qwen2.5-7B | 1.89 | 31.33% |
| SFT | 2.32 | 56.00% |
| GUI-R1 | 2.33 | 58.00% |
| ORM-trained actor | 2.35 | 58.67% |
| ProgRMLCS-trained actor | 2.37 | 59.33% |
| ProgRMEnv-trained actor | 2.39 | 62.00% |
The headline lift over ORM is not enormous: 58.67% to 62.00% for ProgRMEnv, and 58.67% to 59.33% for ProgRMLCS. This is not one of those papers where a chart screams revolution and the appendix whispers “small dataset”. Here, the improvement is incremental.
But the diagnostics make the paper more interesting than the raw success-rate gap.
The reward-model evaluation table shows that ProgRM is much better at matching environment ground truth than ORM when used as an evaluator:
| Reward model | Precision | Recall | Accuracy | False positives |
|---|---|---|---|---|
| ORM | 71.05% | 92.04% | 73.33% | 22.00% |
| ProgRMLCS | 88.66% | 96.63% | 90.67% | 7.33% |
| ProgRMEnv | 97.73% | 92.47% | 94.00% | 1.33% |
The false-positive gap is the operationally juicy part. A reward model that confidently praises failed GUI behaviour is worse than merely wrong. It actively trains the agent toward bad habits. This is how you get agents that click the wrong thumb direction, rate when they should bookmark, or declare victory because a screen vaguely resembles a target state. Automation: now with institutionalised hallucinated competence.
ProgRMLCS still has more false positives than ProgRMEnv, but it is far better than ORM. That supports the paper’s central claim that progress-aware training provides a richer evaluator, even when the progress labels are automatically mined rather than provided by the environment.
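For readers who want to check how the four columns relate, here is a small sketch that computes them from confusion counts. The counts are not reported in this article: they are back-derived from the ProgRMEnv row under the assumption that the evaluation covers the 150 test episodes and that the "False positives" column is a share of all episodes rather than a false-positive rate.

```python
def evaluator_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Percentages for a reward model used as a success/failure evaluator."""
    total = tp + fp + fn + tn
    return {
        "precision": 100 * tp / (tp + fp),    # of predicted successes, how many were real
        "recall": 100 * tp / (tp + fn),       # of real successes, how many were found
        "accuracy": 100 * (tp + tn) / total,
        "false_positives": 100 * fp / total,  # assumed: share of all episodes
    }


# Assumed counts, inferred from the reported ProgRMEnv percentages (150 episodes).
counts = {"tp": 86, "fp": 2, "fn": 7, "tn": 55}
metrics = {name: round(value, 2) for name, value in evaluator_metrics(**counts).items()}
print(metrics)
```

Under those assumptions the counts reproduce the ProgRMEnv row, which suggests that a 1.33% false-positive figure corresponds to roughly two failed episodes out of 150 being scored as successes.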
The reward model learns progress; ORM mostly learns a doorbell
The paper also measures key-step progress prediction error. It treats environment milestone rewards as ground-truth key steps, assigns progress evenly across those key steps, then asks different reward models to estimate progress.
| Reward model | Key-step progress error | Avg. final score | Latency |
|---|---|---|---|
| ORMClaude | 0.638 | 0.000 | 5.725s |
| ORMClaude-CoT | 0.177 | 0.593 | 8.531s |
| ORM | 0.171 | 0.595 | 0.050s |
| ProgRMLCS | 0.126 | 0.846 | 0.050s |
| ProgRMEnv | 0.036 | 0.842 | 0.050s |
This table does several jobs.
First, it shows that ProgRMEnv estimates progress much more accurately than the alternatives. That is expected: it is trained from environment milestone labels. The interesting comparison is ProgRMLCS versus ORM. ProgRMLCS has lower progress error than ORM, suggesting that the LCS self-annotation pipeline does teach the model something about intermediate progress.
Second, the average final score exposes the behavioural difference between ORM and ProgRM. ORM behaves like a binary success detector. ProgRM estimates cumulative progress, so failed trajectories can end with a moderate score if they got partway through the task. This is precisely the point. A failed trajectory that reaches the right content but misses the last action is not the same as a failed trajectory that never finds the content.
Third, the latency comparison explains why “just use Claude as the evaluator” is not a serious answer for online RL. The self-hosted reward models return in about 0.050 seconds. Claude-based evaluators take seconds, and chain-of-thought evaluation is slower still. In a training loop with many interaction steps, seconds become a tax. And unlike most taxes, this one does not even buy public infrastructure.
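The key-step progress error described above can be sketched as follows. The article does not give the exact formula, so this assumes a mean absolute error taken at the key steps, with ground-truth progress spread evenly so the k-th of K milestones is worth k/K; the sample numbers are hypothetical.

```python
from typing import List


def key_step_progress_error(predicted: List[float], key_steps: List[int]) -> float:
    """Mean absolute gap between predicted progress and evenly spaced milestone progress."""
    total = len(key_steps)
    if total == 0:
        raise ValueError("need at least one key step")
    gaps = []
    for rank, step in enumerate(sorted(key_steps), start=1):
        target = rank / total  # assumed: k-th milestone of K is worth k/K
        gaps.append(abs(predicted[step] - target))
    return sum(gaps) / total


# Hypothetical episode: milestones are hit at steps 0, 2, 3 and 4.
predicted = [0.25, 0.25, 0.5, 0.5, 1.0]
error = key_step_progress_error(predicted, [0, 2, 3, 4])
print(error)
```

Here the model misses only the third milestone, predicting 0.5 where the even split says 0.75, so the error is 0.25 averaged over four key steps.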
The category breakdown shows where the method helps, and where it does not
The paper reports results across Cross-Page, In-Page, and QA tasks. The pattern is not uniformly flattering, which is useful.
On Cross-Page tasks, ProgRMEnv and ORM both reach 72.88% success, while ProgRMLCS reaches 69.49%. Claude-3.7-Sonnet does best at 77.97%. So progress reward is competitive but not dominant there.
On In-Page tasks, ProgRMEnv reaches 88.24%, above SFT at 84.31%, GUI-R1 at 86.27%, and ORM at 78.43%. This is the strongest category-level evidence that progress reward helps. The authors’ analysis suggests that ORM and ProgRMLCS can still be misled by wrong final actions, such as giving a thumbs-down when the instruction requires a thumbs-up. ProgRMEnv reduces that problem more effectively, likely because environment milestone labels offer cleaner key-step supervision.
QA is the awkward category. GPT-4.1-mini reaches 30.00%, GPT-4o-mini reaches 27.50%, Claude-3.7-Sonnet reaches 20.00%, while the fine-tuned Qwen-based agents perform poorly: SFT at 10.00%, ORM and ProgRMEnv at 12.50%, ProgRMLCS at 15.00%. The paper argues that QA tasks require stronger natural language capability, and GUI-focused SFT may degrade the base model’s language ability. That explanation is plausible and commercially relevant. Training an agent to navigate an interface can make it worse at answering from content if the training objective over-specialises the model.
For operators, that means GUI-agent training should not be evaluated as one blob called “task success”. Search, navigation, manipulation, and question answering are different competencies. Combining them into a single success rate is fine for a leaderboard and dangerous for procurement.
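A per-category report takes very little code. The sketch below groups episode outcomes by category; the per-category counts are not stated in this article and are back-derived from the reported ProgRMEnv percentages, assuming 59 Cross-Page, 51 In-Page, and 40 QA test tasks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_by_category(episodes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Success rate in percent for each task category."""
    totals, wins = defaultdict(int), defaultdict(int)
    for category, succeeded in episodes:
        totals[category] += 1
        wins[category] += int(succeeded)
    return {c: round(100 * wins[c] / totals[c], 2) for c in totals}


# Assumed counts, inferred from the reported ProgRMEnv percentages.
episodes = (
    [("Cross-Page", True)] * 43 + [("Cross-Page", False)] * 16
    + [("In-Page", True)] * 45 + [("In-Page", False)] * 6
    + [("QA", True)] * 5 + [("QA", False)] * 35
)
by_category = success_by_category(episodes)
overall = round(100 * sum(ok for _, ok in episodes) / len(episodes), 2)
print(by_category, overall)
```

The aggregate of 62.00% hides a spread from 88.24% on In-Page tasks to 12.50% on QA, which is exactly the information a single leaderboard number throws away.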
The ablations are not a second thesis; they tell us where the reward signal is brittle
The paper includes several analysis and appendix components. Their likely purpose is worth separating, because otherwise one can overread them.
| Test or appendix element | Likely purpose | What it supports | What it does not support |
|---|---|---|---|
| LCS vs environment-label ProgRM | Ablation / upper-bound comparison | Automatic LCS labels are useful but less clean than milestone labels | That LCS annotation is production-ready across apps |
| Key-step progress error | Direct diagnostic | ProgRM learns intermediate progress better than ORM | That lower progress error always converts linearly into higher success |
| Final-step score prediction | Mechanism evidence | ProgRM preserves partial-credit information that ORM discards | That moderate scores are always safe to reward |
| Invocation latency | Implementation feasibility | Lightweight reward models are plausible for online RL | That production deployment is risk-free |
| Failure mode analysis | Explanatory diagnostic | ProgRM reduces useless repetition most clearly | That all failure modes are resolved |
| History length ablation | Sensitivity test | Longer reward history hurts on WikiHow-style GUI tasks | That history length 1 is optimal for all GUI systems |
| 30K-step run | Exploratory extension | Longer training can improve ProgRMEnv from 62.00% to 67.33% | That scaling training always improves every variant |
| Data synthesis appendix | Implementation detail | The dataset is balanced with synthetic failed and successful trajectories | That synthetic trajectories fully match real-world deployment behaviour |
The history-length ablation is particularly useful. At 20K training steps, the main ProgRMEnv setting reaches 62.00% success. A longer reward-history setting falls to 54.00%. At 30K training steps, the main setting reaches 67.33%. This suggests that, at least for WikiHow, immediate step-level progress gain is a better training signal than accumulated progress over a longer local window.
That is not a universal law. In enterprise software, some workflows may have compound actions where no single visible step looks valuable until a short sequence completes. Think multi-field forms, authentication handshakes, or workflow states that update only after a batch submission. In those systems, a slightly longer reward window may be useful. But the paper’s evidence says not to assume that “more temporal context” automatically means “better reward”. Sometimes it just means blurrier credit assignment wearing a nice coat.
The business value is cheaper diagnosis, not merely cheaper training
The lazy business interpretation is: “ProgRM improves GUI agents.” Accurate, but only slightly more informative than saying a kettle improves water.
The more useful interpretation is that ProgRM changes the economics of diagnosing agent behaviour. Outcome-only training tells you which tasks failed. Progress-aware training can tell you where the task stopped making progress. That difference matters for teams trying to deploy agents into real digital workflows.
Consider an enterprise agent trained to file a reimbursement, update a CRM record, or navigate a supplier portal. A final failure signal tells you little. Did it find the right account? Did it enter the wrong field? Did it fail because the UI changed? Did it loop in a dropdown because the state representation was stale? Did it reach a confirmation page but not click submit? The binary evaluator shrugs. Very enterprise.
A progress model gives you a more useful diagnostic surface:
| Operational question | ORM answer | ProgRM-style answer |
|---|---|---|
| Did the task finish? | Yes / no | Yes / no, plus estimated path progress |
| Did the agent make useful partial progress before failure? | Mostly invisible | Measurable through intermediate progress gains |
| Which steps were hollow? | Hard to isolate | Steps with little or no progress gain |
| Is repetition being reduced? | Indirectly visible in logs | Directly penalised if it produces no progress |
| Can failed trajectories still train the model? | Mostly blunt signal | Yes, if they contain progress-producing steps |
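
As a rough illustration of that diagnostic surface, the sketch below reads a progress trace and reports where the episode stalled, which steps were hollow, and which hollow steps merely repeated the previous action. The threshold, field names, and sample trace are assumptions, not part of the paper.

```python
from typing import Dict, List


def diagnose(progress: List[float], actions: List[str], eps: float = 0.01) -> Dict[str, object]:
    """Summarise where an episode gained progress and where it did not."""
    hollow, last_gain, previous = [], None, 0.0
    for step, current in enumerate(progress):
        if current - previous > eps:
            last_gain = step      # this step moved the task forward
        else:
            hollow.append(step)   # no measurable progress gain
        previous = current
    repeats = [
        step for step in hollow
        if step > 0 and actions[step] == actions[step - 1]
    ]
    return {
        "final_progress": progress[-1] if progress else 0.0,
        "stalled_after": last_gain,
        "hollow_steps": hollow,
        "hollow_repeats": repeats,
    }


# Hypothetical failed episode that stops advancing after the query is entered.
trace = [0.25, 0.25, 0.5, 0.5, 0.5]
actions = ["click(search)", "scroll(down)", "input(query)", "scroll(down)", "scroll(down)"]
report = diagnose(trace, actions)
print(report)
```

A binary evaluator would file this episode under "failed"; the trace says it got halfway, stalled after step 2, and then scrolled twice to no effect.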
Cognaptus inference: for business GUI agents, the design opportunity is not simply to fine-tune harder. It is to instrument workflows so that progress can be measured, learned, and audited. In some cases, that may mean explicit milestone events from the application. In other cases, it may mean mining successful traces into recipes, as ProgRM does. In well-controlled internal systems, it may mean adding telemetry directly: page reached, field completed, validation passed, document uploaded, approval submitted.
The more structured the workflow, the more valuable this becomes. The more chaotic the interface, the more the LCS assumption weakens. A repeated back-office process is a better candidate than a free-form consumer app with constantly changing content and many acceptable paths.
Do not confuse self-annotation with free supervision
ProgRM’s LCS pipeline is attractive because it avoids expensive human process labels. But it is not free supervision. It is weaker supervision extracted from behaviour. That distinction saves budgets and reputations.
The method assumes successful trajectories for the same task share meaningful common subsequences. This is often true in structured GUI workflows. It may be less true when there are many equivalent paths, when UI content changes frequently, when successful users use different strategies, or when actions contain ambiguous natural language arguments. The appendix tries to soften exact matching through a soft LCS function, including sentence-transformer similarity for text arguments and special handling for NOTHING actions. That helps, but it does not remove the underlying assumption.
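A soft LCS can be sketched by letting each matched pair contribute a similarity between 0 and 1 instead of a hard 0 or 1. The version below is an illustration only: it swaps the sentence-transformer similarity for a token-overlap (Jaccard) score so it runs without dependencies, normalises by the longer trajectory (an assumption), ignores the special handling of NOTHING actions, and reuses the 0.6 grouping threshold mentioned earlier.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (action type, text argument)


def action_similarity(a: Action, b: Action) -> float:
    """Stand-in for an embedding similarity: same type, then token overlap of arguments."""
    if a[0] != b[0]:
        return 0.0
    tokens_a, tokens_b = set(a[1].lower().split()), set(b[1].lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def soft_lcs(x: List[Action], y: List[Action]) -> float:
    """Weighted LCS: aligned pairs add their similarity instead of a hard 1."""
    table = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + action_similarity(x[i - 1], y[j - 1]),
            )
    return table[len(x)][len(y)]


def trajectory_similarity(x: List[Action], y: List[Action]) -> float:
    longest = max(len(x), len(y))
    return soft_lcs(x, y) / longest if longest else 1.0


a = [("click", "search"), ("input", "how to bake bread"), ("click", "result")]
b = [("click", "search"), ("input", "how to bake sourdough bread"), ("click", "result")]
c = [("scroll", "down"), ("scroll", "down"), ("click", "result")]
THRESHOLD = 0.6  # grouping threshold reported in the appendix
same_group_ab = trajectory_similarity(a, b) >= THRESHOLD
same_group_ac = trajectory_similarity(a, c) >= THRESHOLD
print(same_group_ab, same_group_ac)
```

The two search trajectories differ by one word in the query and still group together; the scrolling trajectory shares only its last click and does not. The same mechanism shows the weakness: if two valid strategies share few actions, they land in different recipes or in none.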
The paper itself is refreshingly clear about the gap. LCS-based labels lag environment-reward labels in both actor performance and progress estimation error. ProgRMLCS reaches 59.33% success; ProgRMEnv reaches 62.00%. ProgRMLCS has progress error 0.126; ProgRMEnv has 0.036. The recipe miner is useful. The environment signal is cleaner.
For enterprise use, this points to a hierarchy:
| Label source | Cost | Likely quality | Best fit |
|---|---|---|---|
| Human process labels | High | Potentially high | Small, safety-critical workflows |
| Environment or application milestones | Medium, if telemetry exists | High | Internal systems with controllable instrumentation |
| LCS-mined recipes | Lower | Medium | Repeated workflows with many successful traces |
| Outcome-only labels | Low | Coarse | Early baselines, simple tasks, weak diagnostics |
| Frontier-model evaluator | High latency / variable cost | Mixed | Offline evaluation, not tight online RL loops |
The practical sweet spot may be hybrid. Use application telemetry where available, mine recipes where telemetry is absent, and reserve human review for ambiguous or high-risk milestones. The prize is not doctrinal purity. It is usable reward signal per dollar.
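One way to picture the hybrid is as a per-step fallback rule. This is a hypothetical sketch, not something the paper proposes: the `StepLabel` type, the `choose_label` function, and the 0.2 disagreement tolerance are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepLabel:
    progress: float
    source: str         # "telemetry" or "recipe"
    needs_review: bool  # route to a human before using as training signal


def choose_label(
    telemetry: Optional[float],
    recipe: Optional[float],
    high_risk: bool,
    tolerance: float = 0.2,
) -> Optional[StepLabel]:
    """Prefer application telemetry, fall back to mined recipes, flag doubtful steps."""
    if telemetry is None and recipe is None:
        return None  # no progress signal; only an outcome label remains
    disagree = (
        telemetry is not None
        and recipe is not None
        and abs(telemetry - recipe) > tolerance
    )
    if telemetry is not None:
        progress, source = telemetry, "telemetry"
    else:
        progress, source = recipe, "recipe"
    return StepLabel(progress, source, needs_review=high_risk or disagree)


print(choose_label(0.5, 0.25, high_risk=False))   # sources disagree, so review
print(choose_label(None, 0.25, high_risk=False))  # recipe only, no review
```

The design choice worth keeping is the review flag: human attention goes to steps where the cheap signals disagree or the action is expensive, not to every click.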
The safety problem is downstream, but not optional
The paper’s broader-impact section is short but pointed: stronger GUI agents can be misused to bypass CAPTCHAs, gain unauthorised access, or perform harmful actions. Also, imperfect GUI agents can take unexpected actions that damage data or systems.
This is not a generic “AI could be risky” sticker. GUI agents act through interfaces designed for humans. They can click buttons, submit forms, delete records, change permissions, trigger payments, and send messages. A progress reward model that makes them better at completing tasks also makes them better at completing tasks you did not want completed. The model does not know which buttons are legally, financially, or socially expensive unless the deployment system constrains it.
So the production lesson is not “add ProgRM and let the agent roam”. It is:
| Deployment layer | Requirement |
|---|---|
| Task scope | Limit actions to approved workflows |
| Authentication | Avoid giving broad credentials to autonomous agents |
| State changes | Require confirmation for irreversible or high-value actions |
| Logging | Preserve step-level traces and progress estimates |
| Evaluation | Test by task category, not only aggregate success |
| Recovery | Add rollback, human escalation, and timeout logic |
| Security | Prevent use against unauthorised public systems |
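
Several of these layers reduce to a check that runs before every action. The sketch below is a hypothetical guard, with invented workflow and action names, that covers task scope, confirmation for irreversible actions, and a step timeout; it is a starting shape, not a security design.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Guard:
    allowed_workflows: Set[str]
    irreversible_actions: Set[str]
    max_steps: int = 30

    def check(self, workflow: str, action: str, step: int, confirmed: bool = False) -> str:
        """Return 'allow', 'confirm', 'escalate', or 'block' for a proposed action."""
        if workflow not in self.allowed_workflows:
            return "block"     # outside approved task scope
        if step >= self.max_steps:
            return "escalate"  # timeout: hand over to a human
        if action in self.irreversible_actions and not confirmed:
            return "confirm"   # irreversible or high-value: ask first
        return "allow"


guard = Guard(
    allowed_workflows={"file_reimbursement"},
    irreversible_actions={"submit_payment", "delete_record"},
)
print(guard.check("file_reimbursement", "fill_field", step=3))
print(guard.check("file_reimbursement", "submit_payment", step=4))
print(guard.check("change_permissions", "click", step=0))
```

None of this depends on the reward model. The guard constrains what the agent may do; the progress estimates only tell you how well it is doing it.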
Progress-aware training can reduce useless repetition. It does not magically produce judgement. That remains management’s charming responsibility.
What this paper directly shows, what we infer, and what remains uncertain
The paper directly shows that progress rewards can improve GUI-agent RL on WikiHow, especially when progress labels come from environment milestones. It also shows that an LCS-based self-annotation pipeline can generate usable progress labels that improve reward-model diagnostics and slightly improve actor success over ORM. It further shows that lightweight reward models are much faster than Claude-based evaluators in the tested setup, making them more plausible for online training.
Cognaptus infers that enterprise GUI-agent builders should treat reward design as workflow instrumentation. The business opportunity is not simply “train the agent more”. It is to stop wasting partial trajectories, identify hollow steps, and turn repeated successful workflows into process supervision. That can improve both training and debugging.
What remains uncertain is generalisation. WikiHow is useful because it provides task structure and intermediate milestone rewards for validation. Real enterprise software is messier in different ways: permissions, modal windows, stale sessions, unpredictable validation rules, UI changes, and business exceptions that do not look like app-navigation tasks. The paper’s own limitations acknowledge the need for more benchmarks and better LCS labelling.
The weakest area is QA. GUI navigation and language answering are not the same skill. If an agent must both operate software and reason over documents, fine-tuning must preserve language competence while improving interface control. Otherwise, the agent may become very good at reaching the page and less good at understanding it. A finely trained clerk who cannot read the memo. We have seen organisations run on worse, but let’s not automate it.
The takeaway: reward design is becoming product design
ProgRM is not a silver bullet. The self-annotated version barely beats ORM on headline success rate. The best version depends on environment milestone labels. The benchmark is one Android app domain. The safety issues are real.
Still, the paper lands an important point: GUI agents need feedback about progress, not just outcomes. In long-horizon digital workflows, a final pass/fail score is too blunt to teach good behaviour efficiently. It erases partial success, hides repetition, and gives operators little insight into why the agent failed.
The deeper lesson for AI builders is that reward models are no longer a research afterthought. They are part of product architecture. If a workflow matters enough to automate, it probably matters enough to instrument. Define milestones. Collect traces. Mine successful paths. Detect hollow actions. Separate navigation from language answering. Keep irreversible actions behind guardrails. Then train.
Outcome-only rewards tell the agent whether it arrived. Progress rewards teach it how to travel. For GUI automation, that difference is not academic. It is the difference between a system that learns from experience and one that repeatedly fails in exactly the same boring way.
[1] Danyang Zhang, Situo Zhang, Ziyue Yang, Zichen Zhu, Zihan Zhao, Ruisheng Cao, Lu Chen, and Kai Yu, “ProgRM: Build Better GUI Agents with Progress Rewards,” arXiv:2505.18121, 2025, https://arxiv.org/abs/2505.18121.