Affordances

TL;DR for operators

GUI automation agents do not usually fail because clicking is hard. They fail because almost everything they could click is irrelevant. The CoGA paper proposes a pragmatic way to reduce that waste: before reinforcement learning begins, use a vision-language model (VLM) to generate executable code that identifies which GUI actions are currently affordable, then use that code as an action mask during RL training and inference.[1] The VLM is not the agent. It is more like an expensive consultant brought in once to write a rule-based narrowing function. After that, a reinforcement learning agent still learns the policy. ...
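To make the masking idea concrete, here is a minimal sketch of how an affordance function could gate a policy's action choice. It is not the paper's implementation. The action names, the state dictionary, and the functions `affordable_actions`, `masked_softmax`, and `select_action` are all hypothetical, and the hand-written rules merely stand in for whatever code a VLM would generate.

```python
import math
import random

# Hypothetical action set for a toy GUI; the names are illustrative only.
ACTIONS = ["click_ok", "click_cancel", "type_text", "scroll_down"]


def affordable_actions(state):
    """Stand-in for VLM-generated affordance code: one boolean per action.

    Assumption: `state` is a dict of simple GUI facts. A real system would
    inspect the actual screen or widget tree instead.
    """
    return [
        state.get("dialog_open", False),         # OK exists only inside a dialog
        state.get("dialog_open", False),         # same for Cancel
        state.get("text_field_focused", False),  # typing needs a focused field
        state.get("scrollable", False),          # scrolling needs a scrollable view
    ]


def masked_softmax(logits, mask):
    """Softmax over logits where masked-out actions get probability 0."""
    if not any(mask):
        raise ValueError("mask excludes every action")
    masked = [l if m else -math.inf for l, m in zip(logits, mask)]
    peak = max(masked)  # finite, because at least one action is allowed
    exps = [math.exp(l - peak) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, state, rng):
    """Sample an action index from the policy, restricted to affordable actions."""
    probs = masked_softmax(logits, affordable_actions(state))
    return rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]


if __name__ == "__main__":
    state = {"dialog_open": True}
    logits = [0.0, 0.0, 5.0, 5.0]  # the policy prefers actions that are unavailable
    index = select_action(logits, state, random.Random(0))
    print(ACTIONS[index])  # always one of the two dialog buttons
```

The design point is that the mask is applied to the logits before sampling, so the policy never spends probability mass on unavailable actions during either training or inference, while the choice among the remaining actions is still learned.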