The GUI Agent Bottleneck: Stuck in Sparse Feedback

Training LLM-based GUI agents to complete digital tasks—such as navigating mobile apps or automating workflows—faces a fundamental limitation: reward sparsity. Traditional reward formulations (Outcome Reward Models, or ORMs) provide feedback only at the end of a trajectory. If the task fails, the agent receives zero signal, regardless of how many useful intermediate steps it took. This severely limits credit assignment and slows learning, especially in environments with long action horizons.

PROGRM: Progress Reward Model for GUI Tasks

PROGRM addresses this bottleneck with dense, intermediate feedback based on task-completion progress. The model predicts a scalar value $p_t \in [0, 1]$ at each timestep $t$, representing the estimated progress toward final task success.

This allows reinforcement learning agents to optimize against a reward signal shaped by directional progress rather than binary success alone. In practice, the PROGRM architecture uses a multi-layer perceptron (MLP) that consumes the current GUI screen embedding and a history of past actions to estimate $p_t$. This estimate is then used as a reward in RL training (e.g., Proximal Policy Optimization or Q-learning).
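
To make the architecture concrete, here is a minimal sketch of such a progress predictor in PyTorch. The embedding dimensions, the mean-pooling of the action history, and the layer sizes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ProgressRewardModel(nn.Module):
    """Sketch of a progress predictor: GUI screen + action history -> p_t in [0, 1]."""

    def __init__(self, screen_dim=768, action_dim=128, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(screen_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, screen_emb, action_history_emb):
        # screen_emb: (B, screen_dim) embedding of the current GUI screen
        # action_history_emb: (B, T, action_dim) embeddings of the actions taken so far
        history = action_history_emb.mean(dim=1)       # simple mean-pool over the history
        x = torch.cat([screen_emb, history], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # progress estimate p_t in [0, 1]
```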

Self-Supervised Progress Labeling via Behavioral Recipes

To train PROGRM without human-annotated labels, the authors introduce a recipe mining pipeline that leverages the Longest Common Subsequence (LCS) algorithm to identify consistent behavioral patterns among successful trajectories. Here’s how it works:

  1. Cluster Successful Trajectories by action embeddings using K-means.
  2. Mine Recipes (core steps) within each cluster using LCS across action sequences.
  3. Align New Trajectories to recipes and compute relative progress as the ratio of matched core steps.

This yields a fully automated labeling mechanism: scalable progress supervision with no human annotation, and labels that stay faithful to how users actually complete each task.
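
Here is a minimal sketch of the mining and alignment steps, assuming actions have already been discretised into comparable tokens and grouped into a cluster (the K-means step is omitted; `lcs` is a standard dynamic-programming implementation):

```python
from functools import reduce

def lcs(a, b):
    """Longest common subsequence of two action sequences (standard dynamic programming)."""
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + [x]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

def mine_recipe(successful_trajectories):
    """Core steps shared by the successful trajectories in one cluster, via pairwise LCS."""
    return reduce(lcs, successful_trajectories)

def progress_labels(trajectory, recipe):
    """Label each prefix of a trajectory with the fraction of recipe steps matched so far."""
    labels, matched = [], 0
    for action in trajectory:
        if matched < len(recipe) and action == recipe[matched]:
            matched += 1
        labels.append(matched / len(recipe))
    return labels

# Toy example: two successful runs of the same task, then a partial run to label.
successes = [["open_app", "search", "tap_result", "submit"],
             ["open_app", "scroll", "search", "tap_result", "submit"]]
recipe = mine_recipe(successes)                                  # shared core steps
print(progress_labels(["open_app", "search", "back"], recipe))   # [0.25, 0.5, 0.5]
```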

Implementation Details

  • Environment: WikiHow GUI benchmark, with over 500 real Android task trajectories.
  • Base Agent: Built on Qwen2.5-7B with a decision head for action selection.
  • Progress Reward Model: 3-layer MLP, trained via regression to the self-labeled progress scores (a minimal training sketch follows this list).
  • Training Loop: PPO, with the progress reward replacing the sparse ORM signal.
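
Training the reward model then reduces to a standard regression loop. The sketch below reuses the `ProgressRewardModel` class sketched earlier and stands in random tensors for real (screen, action-history, label) batches; the batch shapes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in batches: (screen embedding, action-history embeddings, self-labeled progress).
batches = [(torch.randn(32, 768), torch.randn(32, 8, 128), torch.rand(32)) for _ in range(100)]

model = ProgressRewardModel()   # the sketch from earlier in this post
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for screen_emb, action_hist_emb, target_progress in batches:
    pred = model(screen_emb, action_hist_emb)   # predicted p_t in [0, 1]
    loss = loss_fn(pred, target_progress)       # regress to the LCS-derived labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```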

Agents trained with PROGRM show:

  • +6.5% improvement in success rate over ORM.
  • Faster convergence during fine-tuning.
  • Reduced action repetition and better alignment with human-like task flows.

Why It Works: Credit Assignment and Trajectory Disambiguation

Sparse reward models offer no learning signal until a trajectory ends, and none at all when it fails. PROGRM's stepwise signals improve credit assignment, helping agents learn that steps like selecting an app, filling a field, or navigating to a submenu are meaningful even in failed runs.

Moreover, in trajectories that make real progress but ultimately fail, PROGRM distinguishes early effective actions from late-stage errors. This enables trajectory-level differentiation in a way ORMs cannot.
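
A toy comparison (the numbers are invented for illustration) makes the point: two failed runs are identical under an outcome reward, while their progress traces clearly separate the run that nearly finished from the run that stalled.

```python
def orm_return(success):
    """Sparse outcome reward: 1.0 only if the whole task succeeded."""
    return 1.0 if success else 0.0

# Per-step progress estimates for two *failed* runs (toy numbers).
stalled   = [0.1, 0.1, 0.1, 0.1]     # never got past the first screen
near_miss = [0.25, 0.5, 0.75, 0.75]  # completed most core steps, then erred late

print(orm_return(False), orm_return(False))  # 0.0 0.0  -> indistinguishable under ORM
print(max(stalled), max(near_miss))          # 0.1 0.75 -> PROGRM tells them apart
```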

Practical Implications for AI Builders

For any enterprise or developer building GUI agents (e.g., in mobile automation, RPA, or agentic LLMs for apps), PROGRM presents:

  • A scalable reward engineering method that requires no human labels.
  • Compatibility with existing transformer agents through lightweight reward model integration.
  • A generalizable framework for injecting semantic structure into reward signals.

To integrate PROGRM into your workflow:

  1. Collect successful and failed trajectories.
  2. Run recipe mining via LCS to generate progress annotations.
  3. Train a progress predictor MLP using these annotations.
  4. Replace your sparse reward function with PROGRM output in RL fine-tuning (one possible reward shaping is sketched below).
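
For step 4, one simple way to turn the per-step progress estimates into a dense reward is to reward each step by its progress delta, plus a terminal bonus on success. This shaping choice is an assumption for illustration, not necessarily the paper's exact formulation.

```python
def progress_shaped_rewards(progress_estimates, success, success_bonus=1.0):
    """Convert per-step progress estimates into a dense reward sequence."""
    rewards, prev = [], 0.0
    for p in progress_estimates:
        rewards.append(p - prev)   # credit for moving closer to task completion
        prev = p
    if success:
        rewards[-1] += success_bonus
    return rewards

# Drop-in replacement for a sparse 0/1 terminal reward in a PPO rollout buffer:
print(progress_shaped_rewards([0.25, 0.5, 0.5, 0.75], success=False))
# [0.25, 0.25, 0.0, 0.25]
```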

Toward More Cognizant Agents

As LLM-based agents grow more autonomous, they must reason not just over goals, but over how well they are progressing toward those goals. PROGRM adds this missing layer of introspection. It’s not just about whether an agent gets to the destination—it’s about knowing it’s on the right path.


Cognaptus: Automate the Present, Incubate the Future