TL;DR for operators
GUI automation agents do not usually fail because clicking is hard. They fail because almost everything they could click is irrelevant.
The CoGA paper proposes a pragmatic way to reduce that waste: use a vision-language model before reinforcement learning begins to generate executable code that identifies which GUI actions are currently affordable, then use that code as an action mask during RL training and inference.[1] The VLM is not the agent. It is more like an expensive consultant brought in once to write a rule-based narrowing function. After that, a reinforcement learning agent still learns the policy.
That distinction matters. A full VLM-in-the-loop GUI agent can be flexible, but every decision may require costly model calls. CoGA pushes the model’s visual and semantic prior knowledge into generated scripts, so runtime exploration becomes cheaper and narrower. The paper’s experiments on MiniWoB++ show large early sample-efficiency gains over the RL baseline at 1,000 training steps, transfer of affordance scripts within related task families, and stronger average performance than behavioural cloning when only 10, 50, or 200 expert trajectories are available. With 1,000 expert trajectories, behavioural cloning overtakes it.
For business use, the interesting idea is not “VLMs can automate web tasks.” We have heard that sentence often enough; it now qualifies as office wallpaper. The sharper claim is that foundation models may be most useful when they compress messy interface knowledge into cheap, inspectable runtime constraints. That is attractive for enterprise agents where demonstrations are scarce, human annotation is expensive, and the action space is mostly nonsense.
The boundary is equally important. CoGA’s hard mask is only as good as its affordance script. If the script misses the right action, the RL agent cannot recover by being clever. It has been politely forbidden from trying.
The real bottleneck is not intelligence; it is irrelevant choice
A web interface is a hostile playground for reinforcement learning. In MiniWoB++, the environment gives the agent a rendered screenshot and asks it to complete tasks by interacting with a simulated webpage. The paper uses pixel-only observations of size 160×210, discretised into a 32×32 grid, with an action space combining action types and pixel coordinates. After excluding text-entry tasks, the effective action space is $4 \times 1024$.
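To make the size of that space concrete, here is a minimal sketch of how a flat action index could map to an action type and a pixel location. The name of the fourth action type, the type-major index ordering, and the reading of 160×210 as width × height are assumptions for illustration, not details confirmed by the paper.

```python
# Assumed layout: 4 action types x (32 x 32) grid cells over a 160 x 210 screenshot.
# The fourth action-type name and the index ordering are illustrative assumptions.
ACTION_TYPES = ["CLICK_COORDS", "MOUSEDOWN_COORDS", "MOUSEUP_COORDS", "DBLCLICK_COORDS"]
GRID = 32
WIDTH, HEIGHT = 160, 210
NUM_ACTIONS = len(ACTION_TYPES) * GRID * GRID  # 4 * 1024 = 4096


def decode_action(index):
    """Map a flat action index to (action type, x, y) at the centre of its grid cell."""
    if not 0 <= index < NUM_ACTIONS:
        raise ValueError(f"index {index} outside [0, {NUM_ACTIONS})")
    type_id, cell = divmod(index, GRID * GRID)
    row, col = divmod(cell, GRID)
    x = (col + 0.5) * WIDTH / GRID
    y = (row + 0.5) * HEIGHT / GRID
    return ACTION_TYPES[type_id], x, y


def encode_action(action_type, x, y):
    """Map an action type and pixel position back to the flat action index."""
    type_id = ACTION_TYPES.index(action_type)
    col = min(int(x * GRID / WIDTH), GRID - 1)
    row = min(int(y * GRID / HEIGHT), GRID - 1)
    return type_id * GRID * GRID + row * GRID + col
```

Under these assumptions each grid cell covers 5 pixels horizontally and about 6.6 pixels vertically, so a small button may occupy only a handful of the 1,024 cells.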
That sounds modest compared with the open web, but it is already enough to expose the problem. Most actions in most states are useless. A button may be clickable. A random blank pixel is technically clickable too, in the same tedious way that a wall is technically touchable. Sparse reward makes the situation worse: the agent may only receive useful feedback after completing the task, so random exploration spends most of its budget discovering nothing.
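A rough back-of-the-envelope calculation shows why this hurts. The sketch below assumes uniform random exploration and a single decisive step; the counts of useful and masked actions in the usage note are hypothetical, chosen only to illustrate the arithmetic.

```python
def expected_draws_to_first_hit(num_candidates, num_useful):
    """Mean number of uniform random draws until a useful action is first sampled."""
    if not 0 < num_useful <= num_candidates:
        raise ValueError("need 0 < num_useful <= num_candidates")
    return num_candidates / num_useful


def prob_hit_within(num_candidates, num_useful, draws):
    """Probability that at least one of `draws` uniform samples is a useful action."""
    p = num_useful / num_candidates
    return 1.0 - (1.0 - p) ** draws
```

If, hypothetically, 8 of the 4,096 actions were useful, `expected_draws_to_first_hit(4096, 8)` gives 512 draws on average. Restricting sampling to a set of 16 candidates that still contains those 8 brings it down to 2. Multi-step tasks compound the gap, because every step pays the same tax.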
Behavioural cloning offers one escape route: collect expert demonstrations and imitate them. The paper positions CoGA against that low-data setting. Prior MiniWoB++ systems have relied on large expert datasets, ranging from hundreds to millions of demonstrations, some of them model-generated, depending on the system. But demonstrations are not free. In enterprise settings, expert trajectories often mean process analysts, operators, compliance review, screen-recording workflows, and a surprising number of people saying “actually, this button depends on the month-end exception rule.”
CoGA attacks the problem from another angle. Instead of teaching the agent what to do directly, it teaches the agent what not to waste time considering.
CoGA turns VLM judgement into executable affordance filters
The central mechanism is a pipeline for generating affordances as code. In reinforcement learning terms, affordances are actions available in a state that can complete an intended consequence. CoGA uses a vision-language model to infer those intents, identify relevant visual objects, create visual templates, and generate a Python function that returns affordable actions for a new observation.
The pipeline has four operational stages:
| Stage | What CoGA does | Operational interpretation |
|---|---|---|
| Intent inference | The VLM receives task descriptions, example instructions, and screenshots, then identifies general actionable intents such as “click a tab” or “click an option.” | It separates task-type structure from a specific instruction. |
| Visual grounding | The VLM identifies relevant objects and bounding boxes, using gridded screenshots to help locate them. Cropped object templates are extracted from sampled observations. | It creates reusable visual handles for interface elements. |
| Code generation | The VLM writes a determine_affordable_actions(observation) function using template matching and task-specific logic. | It converts semantic judgement into executable filtering logic. |
| Verification and selection | A critique VLM reviews code, execution errors are checked, and scripts are scored against five manually annotated test observations using precision and recall. | It treats generated code as software that needs tests, not as mystical AI residue. |
This is why the common reading of CoGA as “a VLM agent for web tasks” is wrong. The VLM does not look at every state and decide what to click. It is used before training to write an affordance script. During RL training and inference, that script returns a set of action types and pixel regions that remain available. The RL policy learns inside that reduced action space.
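For a feel of what such a script might look like, here is a simplified sketch. It is not one of the paper's generated scripts: those use OpenCV template matching, whereas this version swaps in a naive pure-Python matcher so it runs without dependencies. The `templates` argument, the output dictionary shape, and the `max_mean_abs_diff` threshold are all assumptions for illustration.

```python
def match_template(image, template, max_mean_abs_diff=10.0):
    """Return (x, y, w, h) boxes where the template matches a greyscale image.

    Naive stand-in for OpenCV template matching: slide the template over the
    image and keep positions whose mean absolute pixel difference is small.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    boxes = []
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            diff = sum(
                abs(image[y + dy][x + dx] - template[dy][dx])
                for dy in range(th)
                for dx in range(tw)
            )
            if diff / (th * tw) <= max_mean_abs_diff:
                boxes.append((x, y, tw, th))
    return boxes


def determine_affordable_actions(observation, templates):
    """Return affordable actions as a list of {"action_type", "box"} dicts.

    `templates` maps a name to (template pixels, allowed action types). Passing
    it as an argument is an assumption of this sketch.
    """
    affordable = []
    for pixels, action_types in templates.values():
        for box in match_template(observation, pixels):
            for action_type in action_types:
                affordable.append({"action_type": action_type, "box": box})
    return affordable


# Usage: a 6x6 greyscale "screenshot" with one bright 2x2 "button".
image = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4):
        image[y][x] = 255
button = [[255, 255], [255, 255]]
actions = determine_affordable_actions(image, {"button": (button, ["CLICK_COORDS"])})
```

Everything outside the returned boxes would then be masked out for the RL agent. A real script would also need to handle overlapping matches and near-duplicates, which this sketch ignores.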
The paper uses hard masking: unaffordable actions receive zero probability. This is an aggressive design choice. It gives CoGA its efficiency advantage when the script has high recall, because the agent no longer wastes samples on irrelevant clicks. It also makes failure cleaner and less forgiving. If the script excludes the correct action, the agent cannot sample it. The paper explicitly notes that soft masking could allow eventual recovery, but at the cost of sample efficiency.
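The difference between the two masking styles can be sketched in a few lines. The paper's agent is DQN-style, and this article does not describe exactly how the mask enters its action selection, so the softmax policy below is an illustrative assumption, as is the `penalty` form of soft masking.

```python
import math


def hard_masked_probs(q_values, mask, temperature=1.0):
    """Softmax over affordable actions only; unaffordable ones get exactly 0."""
    if not any(mask):
        raise ValueError("mask excludes every action")
    top = max(q for q, ok in zip(q_values, mask) if ok)
    weights = [
        math.exp((q - top) / temperature) if ok else 0.0
        for q, ok in zip(q_values, mask)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def soft_masked_probs(q_values, mask, penalty=5.0, temperature=1.0):
    """Softmax where unaffordable actions are penalised rather than removed."""
    adjusted = [q if ok else q - penalty for q, ok in zip(q_values, mask)]
    top = max(adjusted)
    weights = [math.exp((q - top) / temperature) for q in adjusted]
    total = sum(weights)
    return [w / total for w in weights]


# The best action (index 2) is wrongly marked unaffordable by the mask.
q_values = [1.0, 2.0, 3.0]
mask = [True, True, False]
hard = hard_masked_probs(q_values, mask)
soft = soft_masked_probs(q_values, mask)
```

With the hard mask, index 2 has probability 0 and no amount of training brings it back. With the soft mask it keeps a small positive probability, so the agent can eventually discover the script was wrong, at the price of still sampling some irrelevant actions.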
This is the main engineering trade-off: CoGA buys speed by trusting the affordance layer. Trust is useful. Blind trust is just a bug wearing a suit.
The VLM is expensive where it can be, cheap where it must be
The clever part of CoGA is not simply that it uses a VLM. Many agent systems do that. The clever part is where the VLM sits in the cost structure.
A direct VLM controller would repeatedly inspect observations and decide actions. That can be flexible, but inference costs accumulate with every step. CoGA instead asks the VLM to generate an intermediate artefact: code. The code is then queried repeatedly by the RL agent without needing a VLM call for every decision.
That gives the method an appealing shape for operators:
- spend more compute and model reasoning during setup;
- produce a reusable script that narrows the action space;
- run cheaper RL training and inference inside that narrowed space;
- retain an artefact that can be inspected, tested, and versioned.
This is closer to software engineering than to agent theatre. The generated scripts in the appendix are recognisably ordinary computer vision code: OpenCV template matching, thresholding, bounding boxes, and lists of affordable actions such as CLICK_COORDS, MOUSEDOWN_COORDS, or MOUSEUP_COORDS. The VLM provides the semantic bridge from task description and screenshot to executable visual logic; the runtime mechanism is much more prosaic.
Prosaic is good. Prosaic can be debugged.
The main evidence is sample efficiency at 1,000 steps
The paper’s primary empirical claim is that action masking with generated affordances makes RL far more sample efficient early in training. The authors compare CoGA with a DQN-style RL baseline using the same underlying architecture: a CNN over pixel observations, SBERT instruction embeddings, double DQN, and prioritised experience replay.
The headline result is Figure 3: across 23 MiniWoB++ tasks and three seeds, CoGA shows over 10× sample-efficiency gains over the RL baseline at only 1,000 steps. The comparison is deliberately early-stage. That matters because CoGA’s practical appeal is not that it changes the theoretical destination of RL; it changes how much blind exploration is needed before anything useful happens.
Several details keep this result in perspective.
First, the authors evaluate tasks where affordance scripts mostly have high F1 scores, plus a few lower-quality scripts such as use-slider to probe weaker cases. This means the main result is strongest when the generated affordance layer is already reasonably accurate. That is not a flaw; it is the mechanism. CoGA is a method for exploiting good action filters, not for magically surviving bad ones.
Second, the 1,000-step comparison is a sample-efficiency result, not a universal final-performance claim. The paper also reports longer training for comparisons with behavioural cloning, but the cleanest contribution is the early reduction in wasted exploration.
Third, the improvement varies by task. In the Figure 3 task-level comparison, several tasks show large absolute gains, while a few are flat or slightly negative. That pattern is exactly what a mechanism-first reading predicts: if the affordance script identifies the useful action set well, CoGA helps; if it does not, masking is just a confident mistake.
The affordance-quality test is not decorative; it explains the whole method
Before reporting RL results, the paper evaluates generated scripts using precision, recall, and F1 against manually annotated affordance sets on five test observations per task. This test is not merely a nice diagnostic. It is the hinge on which the method turns.
Precision asks: of the actions the script says are affordable, how many match the ground-truth affordances? Recall asks: of the ground-truth affordances, how many did the script include? For CoGA, low recall is the more dangerous failure. Low precision may leave the agent with extra actions, reducing the efficiency gain. Low recall can remove the right action entirely.
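These definitions are the standard set-based ones, and a short sketch makes the asymmetry visible. Representing an action as an (action type, grid cell) tuple is an assumption here; the paper may score at a different granularity.

```python
def affordance_scores(predicted, ground_truth):
    """Return (precision, recall, F1) for two collections of actions."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth = {("CLICK_COORDS", 10), ("CLICK_COORDS", 11)}

# Script A keeps both correct cells plus two irrelevant ones.
too_wide = truth | {("CLICK_COORDS", 500), ("CLICK_COORDS", 501)}
# Script B keeps only one of the two correct cells.
too_narrow = {("CLICK_COORDS", 10)}

wide_scores = affordance_scores(too_wide, truth)      # precision 0.5, recall 1.0
narrow_scores = affordance_scores(too_narrow, truth)  # precision 1.0, recall 0.5
```

Both scripts land on the same F1 of about 0.67, yet under a hard mask only the second can make a task unsolvable. That is a reason to look at recall separately rather than relying on F1 alone.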
The paper’s Figure 2 shows that most generated affordance scripts have high F1, but the appendix also includes unsuccessful examples such as use-slider and use-spinner. The failure cases are useful because they prevent a lazy interpretation of the result. CoGA does not remove the hard part of grounding. It relocates it into generated code and template detection.
That relocation is still valuable. It makes the failure point more inspectable. Instead of asking why a neural policy did something strange after thousands of interactions, an engineer can inspect whether the template matcher found the button, whether the VLM selected the right intent, whether the generated code returned the correct action type, and whether the verification set was too thin. This is not full interpretability. It is a better handle.
Transfer works when tasks share affordances, not when they merely share vibes
The paper also tests whether generated affordance scripts transfer within task families. The authors define a task family as tasks with the same affordances but different optimal policies. For example, click-test-2 and click-button-sequence may share the same GUI and affordable buttons, but the policy differs: click one of the two buttons versus click button ONE and then button TWO.
This is an important distinction. Affordances are not policies. They define the feasible or relevant action set; they do not decide the correct sequence. That is why transfer is plausible. If two tasks share the same interface objects and action possibilities, the same affordance script may help both, even when the optimal policy changes.
The transfer table reports three examples:
| Task | RL success rate | CoGA with original script | CoGA with transferred script |
|---|---|---|---|
| click-button-sequence | 3.00 ± 1.00 | 15.67 ± 1.15 | 23.67 ± 1.53 |
| focus-text-2 | 80.00 ± 28.79 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| click-checkboxes-large | 0.00 ± 0.00 | 0.33 ± 0.58 | 0.67 ± 0.58 |
This is not a sweeping generalisation theorem. It is a targeted test showing that scripts can transfer when the affordance structure is shared. The click-button-sequence result is especially interesting because the transferred script outperforms the original script. That suggests that “task-specific generation” is not always best; a cleaner script from a related task may provide a better action prior.
For business settings, this matters because enterprise interfaces often contain families of workflows. Invoice approval, vendor onboarding, leave requests, and CRM updates contain repeated UI motifs: select a row, open a tab, choose a status, submit a form, confirm a modal. If affordance filters can be reused across related workflows, implementation cost drops. If every workflow needs bespoke affordance generation, the approach becomes less attractive.
The paper supports the former possibility only in a narrow benchmark sense. It does not prove broad enterprise reuse. But it gives the right kind of evidence: transfer happens when shared interface structure is real, not when a product manager waves at two workflows and calls them “basically similar.”
Behavioural cloning wins with enough demonstrations; CoGA matters before then
The behavioural cloning comparison is best read as a low-data argument, not as a contest for permanent supremacy.
The authors collect expert trajectories using rollouts from Pix2Act and filter out trajectories with reward below 0.8. They then train behavioural cloning agents with 10, 50, 200, and 1,000 expert demonstrations, comparing them with RL and CoGA using self-collected data. The reported average success rates in Figure 4 are:
| Method / data regime | Mean success rate |
|---|---|
| RL, 0 expert trajectories | 46.7% |
| CoGA, 0 expert trajectories | 60.1% |
| BC, 10 expert trajectories | 40.1% |
| BC, 50 expert trajectories | 49.9% |
| BC, 200 expert trajectories | 59.9% |
| BC, 1,000 expert trajectories | 73.6% |
The interpretation is straightforward. On average, CoGA clearly outperforms behavioural cloning at 10 and 50 expert trajectories and is effectively level with it at 200 (60.1% versus 59.9%), but behavioural cloning overtakes it at 1,000. That is exactly the niche CoGA is designed for: when expert data is scarce, expensive, unavailable, or too brittle to collect at scale.
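The margins are easy to compute from the table above. The sketch below uses only the reported means; it says nothing about variance, which this summary does not give for these averages.

```python
# Mean success rates (%) copied from the table above (Figure 4 of the paper).
COGA_RATE = 60.1
RL_RATE = 46.7
BC_RATES = {10: 40.1, 50: 49.9, 200: 59.9, 1000: 73.6}


def coga_margin_over_bc(bc_rates, coga_rate):
    """CoGA minus BC, in percentage points, keyed by number of demonstrations."""
    return {n: round(coga_rate - rate, 1) for n, rate in sorted(bc_rates.items())}


margins = coga_margin_over_bc(BC_RATES, COGA_RATE)
# margins == {10: 20.0, 50: 10.2, 200: 0.2, 1000: -13.5}

# Smallest tested demonstration count at which BC is ahead of CoGA.
first_bc_lead = min(n for n, margin in margins.items() if margin < 0)  # 1000
```

The crossover lies somewhere between 200 and 1,000 demonstrations. The reported data points do not say where, so any tighter estimate would be a guess.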
It also tells us what not to claim. CoGA does not make demonstrations obsolete. It competes well before behavioural cloning has enough high-quality data. Once enough demonstrations are available, imitation becomes strong. The paper even suggests that combining limited expert data with CoGA could produce further gains, though that combination is not established as the central experimental result here.
For enterprise operators, this implies a practical segmentation:
| Situation | Likely fit for CoGA-style approach |
|---|---|
| Few demonstrations, sparse rewards, regular UI objects | Strong fit |
| Many high-quality demonstrations already available | Behavioural cloning or BC+RL may be stronger |
| Highly variable text-heavy interfaces | Weak unless OCR and grounding improve |
| Long workflows requiring memory and partial observability | Needs a stronger RL backbone and workflow state handling |
| Safety-critical actions where missing or including actions has high cost | Requires much stronger verification and guardrails |
The point is not to pick one learning paradigm forever. The point is to stop using the wrong one just because it demos well on a slide.
The evidence stack: what each experiment actually supports
The paper’s experiments are easy to over-read, so it helps to label their roles.
| Evidence | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Script F1 scores against five manually annotated observations | Main mechanism diagnostic | Generated code often identifies plausible affordance sets with high precision and recall | That scripts will remain accurate on arbitrary websites or OCR-heavy layouts |
| CoGA vs RL at 1,000 steps across 23 tasks | Main evidence | Hard affordance masking can sharply improve early RL sample efficiency | That CoGA always improves final performance or works when recall is low |
| Success curves on count-sides and click-test-2 | Illustrative learning dynamics | CoGA can accelerate learning over training, not just improve one static score | That all tasks have similar learning curves |
| Transfer table across related task families | Exploratory extension / generalisation test | Scripts can transfer when tasks share affordances but differ in optimal policies | Broad transfer across unrelated enterprise workflows |
| CoGA vs behavioural cloning at 10–1,000 demonstrations | Comparison with prior learning style | CoGA is competitive in low-demonstration regimes and loses when BC has enough data | That demonstrations are unnecessary |
| Appendix scripts and run counts | Implementation detail and failure analysis | The generated artefacts are concrete code and require varying VLM runs/iterations | That the generation process is fully reliable or cheap for all task types |
This framing is more useful than summarising “CoGA beats RL and BC.” It does not, in every sense. It beats a specific RL baseline on early sample efficiency. It beats or matches behavioural cloning when demonstrations are scarce. It transfers within selected task families. Those are strong claims precisely because they are bounded.
What this means for enterprise GUI agents
The business relevance is not that CoGA is ready to run your ERP system tomorrow. Please do not let a benchmark with 160×210 screenshots near procurement approvals without adult supervision.
The relevance is architectural.
Enterprise GUI automation often has three painful cost centres: demonstrations, exploration, and runtime reasoning. Demonstrations require humans to show the system what to do. Exploration is expensive because failed interactions can be slow, noisy, or risky. Runtime reasoning becomes costly when a large model is consulted at every step.
CoGA suggests a different allocation. Use a capable VLM to create reusable affordance code offline. Use verification to catch obvious failures. Then use cheaper learning and execution inside the constrained action space. In business terms, the model is not the worker; it is the workflow analyst that drafts the action constraints.
This has several potential advantages:
- Lower demonstration dependence. CoGA’s comparison with behavioural cloning suggests value when only small amounts of expert data are available or when expert data cannot be gathered cheaply.
- Cheaper exploration. Action masking reduces the number of irrelevant actions the RL agent samples, especially in sparse-reward settings.
- Inspectable intermediate artefacts. Generated scripts can be reviewed, tested, versioned, and monitored more easily than opaque policy behaviour.
- Workflow-family reuse. Transfer within task families hints at amortising setup costs across related interfaces.
- Hybrid deployment potential. Affordance scripts could sit alongside existing RPA, UI testing, and agent orchestration systems as a narrowing layer rather than a full replacement.
The practical implementation path would not begin with open-ended web autonomy. It would begin with constrained, visually regular workflows: internal portals, fixed dashboards, repetitive review screens, structured forms, or controlled browser environments. In those cases, the affordance layer can be tested against known states before being trusted.
The wrong implementation path would be to point the method at arbitrary websites and hope the VLM-generated code keeps up with every layout change, pop-up, localisation difference, and modal produced by the modern web’s tireless commitment to making automation miserable.
The hard boundary: recall, grounding, and verification
CoGA’s limitations are not footnotes; they are the product requirements.
The first boundary is pixel grounding. The VLM must identify object locations well enough to create useful templates. The authors mitigate this by overlaying a coordinate grid, but they still report that VLMs struggle with exact pixel coordinates. Bad boxes create bad templates; bad templates create bad affordance sets.
The second boundary is template matching. The paper uses OpenCV template matching over greyscale image templates extracted from five sampled observations. This works best when objects are visually consistent across states. It is weaker for variable text, non-isomorphic objects, dynamic layouts, and tasks where the affordance is geometric or relational rather than object-like. The authors exclude most tasks with varying text-based observations after finding OCR inconsistent.
The third boundary is generated-code correctness. A script can have the right templates and still reason incorrectly about which action types or pixel coordinates are affordable. The critique VLM helps, but the paper notes that the critique model can itself be wrong. Token limits also prevented further regeneration in some tasks.
The fourth boundary is RL backbone strength. Given perfect affordances, CoGA is still limited by the underlying RL agent. The authors point specifically to sequential, partially observable tasks such as checkbox tasks where stronger agents may be needed. Action masking narrows choice; it does not grant memory, planning, or robust state estimation by divine intervention.
The fifth boundary is verification scale. The paper uses five manually annotated test observations per task. That is reasonable for a research prototype and typical of testing software components, but enterprise deployment would need broader coverage: edge cases, role permissions, UI variants, localisation, latency states, error screens, and rollback paths.
Hard masks should be earned. They are not a default entitlement.
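One way to read "earned" is as a deployment gate: the mask only becomes hard once recall has been measured on enough annotated observations. The sketch below is a hypothetical policy, not something the paper proposes, and every threshold in it is an illustrative assumption that a real deployment would need to set for itself.

```python
def choose_mask_mode(
    per_observation_recall,
    min_observations=50,    # hypothetical: evidence needed before trusting a hard mask
    min_worst_recall=0.99,  # hypothetical: worst-case recall required for a hard mask
    min_mean_recall=0.5,    # hypothetical: below this, do not use the script at all
):
    """Return "hard", "soft", or "off" from recall measured on annotated observations."""
    if not per_observation_recall:
        return "off"
    mean_recall = sum(per_observation_recall) / len(per_observation_recall)
    if mean_recall < min_mean_recall:
        return "off"   # the script is wrong too often to be a useful prior
    if len(per_observation_recall) < min_observations:
        return "soft"  # looks fine, but too little evidence for a hard mask
    if min(per_observation_recall) >= min_worst_recall:
        return "hard"
    return "soft"      # mostly right: keep it as a bias the agent can override
```

With five perfect observations, as in the research setting, this gate would return "soft". That is the point of the sketch: a handful of clean test cases is enough to justify a prior, but this hypothetical policy asks for broader coverage before forbidding actions outright.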
The strategic lesson is compression, not autonomy
CoGA belongs to a growing class of methods where foundation models generate code, rewards, policies, or environment abstractions for reinforcement learning. The immediate temptation is to file it under “more autonomous agents.” That is too vague to be useful.
The better category is model-assisted compression of decision spaces.
The VLM looks at the messy visual world and compresses it into a smaller set of executable affordances. The RL agent then learns within that reduced world. This makes the system less general than a full VLM controller, but often more operationally sane. Generality is expensive. In production, the cheapest agent is usually the one that does not need to think about 98% of its possible actions.
That is the larger business insight. Foundation models may not need to sit in every runtime loop. Sometimes their highest-value role is upstream: generating constraints, tests, templates, validators, and small pieces of code that make conventional systems behave less stupidly.
For GUI automation, that is a compelling pattern. Use the model where semantic interpretation is needed. Use code where repeatability is needed. Use RL where interaction and adaptation are needed. And use verification everywhere, because generated confidence is still confidence, not correctness.
Conclusion: fewer paths, better learning
CoGA is not a finished recipe for enterprise web agents. It is a clear mechanism for one of the most stubborn problems in RL-based GUI control: the wastefulness of unconstrained action spaces.
The paper’s contribution is strongest when read mechanistically. A VLM infers task-level intents, extracts visual templates, generates affordance code, and passes that code through a verification loop. The resulting script masks irrelevant action types and pixel coordinates while an RL agent learns the actual policy. On MiniWoB++, this produces large early sample-efficiency gains, limited transfer across related task families, and competitive performance against behavioural cloning when demonstrations are scarce.
For operators, the message is practical. Do not ask an agent to learn from infinite paths when the interface already tells you which paths are plausible. But do not confuse plausible with correct. The affordance layer must be tested, monitored, and updated like any other operational dependency.
The future of GUI agents may not be one giant model staring at a screen and deciding everything. It may be a stack of smaller, sharper components: a model to interpret, code to constrain, RL to adapt, and tests to keep everyone honest.
That is less glamorous than full autonomy. It is also much closer to how useful systems get built.
Cognaptus: Automate the Present, Incubate the Future.
[1] Lynn Cherif, Flemming Kondrup, David Venuto, Ankit Anand, Doina Precup, and Khimya Khetarpal, “Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning,” arXiv:2504.17282, submitted 24 April 2025.