Clicks Are Cheap. Wrong Clicks Are Not.

Click.

That is the unit where many AI agent demos stop being impressive and start becoming expensive.

A planning model can write a beautiful instruction sequence: open the settings panel, choose the correct tab, find the export button, confirm the dialog. Lovely. Then the visual grounding model clicks the button two pixels away from the actual target, or chooses the visually similar icon beside it, or mistakes a disabled control for an active one. Suddenly the “agentic workflow” is not a workflow. It is a small robot poking the wrong part of a screen with great confidence. Very modern. Very avoidable, perhaps.

The paper behind this article, Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback, makes a useful move in this exact failure zone. [1] It does not claim that GUI grounding will be solved by yet another heroic model size increase. It does not rely on new fine-tuning data as the main story. Instead, it asks a quieter question: what if multimodal models already contain more grounding ability than one-shot coordinate prediction allows them to express?

The proposed answer is Chain-of-Ground, or CoG: a training-free framework that turns GUI grounding into a sequence of visual hypotheses. The model first anchors a guess. That guess is then fed back as a reference. The model reopens the full screenshot, compares the instruction against the marked or described previous prediction, and refines the location.

Anchor. Mark. Revisit. Refine.

That is the mechanism. The paper’s most useful business lesson sits there, not in the leaderboard number alone.

GUI Grounding Fails Because Interfaces Are Crowded, Not Because Models Forgot How to See

GUI grounding means mapping a natural-language instruction to a precise visual target in an interface. “Click the settings gear.” “Open the export menu.” “Press the power button.” In a browser, spreadsheet, CAD tool, analytics dashboard, or industrial panel, the grounding output is often a coordinate or target region.

That sounds small. It is not.

A professional interface can contain dozens or hundreds of controls. Some are small. Some are text-heavy. Some are repeated. Some have nearly identical icons. Some depend on spatial relationships: the right button in the correct panel, the active tab rather than the inactive one, the button beside the relevant label rather than the same-looking button elsewhere.

The paper frames this as a problem with single-step prediction. In the standard setup, a model sees the screenshot and instruction once, then produces a coordinate. If it is slightly wrong, there is no internal checkpoint. No second look. No chance to ask, “Did I just point to the neighboring icon?”

Earlier iterative methods tried to solve this by narrowing the image. First guess a region, crop around it, then predict again. That helps focus attention, but it can also throw away global context. In dense interfaces, global context is not decoration. It tells the model whether the selected icon belongs to the export toolbar, the formatting panel, the file tree, or the wrong universe entirely.

CoG takes a different route. It keeps the full image available and iterates in the reasoning process rather than by cropping the image. This mechanism, more than any single score, is the paper’s key contribution.

CoG Treats Coordinates as Hypotheses, Not Answers

The easiest way to understand Chain-of-Ground is to contrast it with a normal grounding call.

A one-shot grounding model behaves like this:

| Step | What happens | Failure mode |
|---|---|---|
| Input | Screenshot + instruction | The instruction may be underspecified or visually ambiguous |
| Prediction | Model outputs a coordinate | The coordinate can land on a similar nearby element |
| Result | Downstream agent clicks | The error becomes an action |

CoG changes the meaning of the first coordinate. It is no longer treated as the final answer. It becomes an anchor: a provisional hypothesis that the next step can inspect.

The loop has three conceptual phases.

| Phase | What CoG does | Why it matters |
|---|---|---|
| Anchoring | The model produces an initial target location | Gives the system a concrete visual hypothesis |
| Reference construction | The previous prediction is encoded as feedback, either textually or visually | Makes the hypothesis visible to the next reasoning step |
| Refinement | The model revisits the original instruction and full screen while considering the reference | Allows correction without losing global context |

This is not “thinking” in the romantic sense. The authors are careful enough to note that the framework emulates aspects of human thought but does not prove the neural network is genuinely reasoning. Good. We can all survive one paper without metaphysics.

Operationally, the important point is simpler: CoG gives the model a second and third chance to inspect its own visual commitment.
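
The three phases can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the `GroundingModel` call signature, the `chain_of_ground` function, and the two toy models are all hypothetical names introduced here. A real system would wrap calls to actual vision-language models.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]
# Hypothetical interface: (screenshot, instruction, previous guess or None) -> (x, y).
GroundingModel = Callable[[object, str, Optional[Point]], Point]


def chain_of_ground(models: List[GroundingModel], screenshot: object,
                    instruction: str) -> List[Point]:
    """Run each model in turn, feeding the previous guess back as a reference.

    Returns every intermediate hypothesis; the last entry is the final target.
    """
    trace: List[Point] = []
    reference: Optional[Point] = None  # the anchoring step has no reference yet
    for model in models:
        # Every step receives the full screenshot, not a crop around the last guess.
        reference = model(screenshot, instruction, reference)
        trace.append(reference)
    return trace


# Toy stand-ins for real grounding models, for illustration only.
def anchor_model(screenshot, instruction, reference):
    return (100, 52)  # pretend near miss on a neighboring icon


def refine_model(screenshot, instruction, reference):
    x, y = reference
    return (x + 12, y)  # pretend the refiner nudges the guess onto the target


trace = chain_of_ground([anchor_model, refine_model], screenshot=None,
                        instruction="Click the export button")
print(trace)  # [(100, 52), (112, 52)]
```

Returning the whole trace rather than only the last point is a design choice made for this sketch: it keeps the intermediate hypotheses available for inspection.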

The reference feedback can be textual or image-based. In the text-based version, the prior coordinate is appended as a textual reference. In the image-based version, the previous prediction is marked directly on the screenshot. The paper studies marker designs, including small and large marks. In its triple-step design, it uses a large red circle for the first visual layer and a large blue square for the second.

The visual version matters because coordinates are not always meaningful to a vision-language model in the same way a marked image is. A red mark says, “Look here, but also look around here and judge whether this is actually the intended control.” That is a more natural signal for a visual model than a coordinate string floating in text.
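
To make the two feedback modes concrete, here is a rough sketch of reference construction. Everything in it is an assumption for illustration: the prompt wording in `text_reference`, the nested-list image type, and the `mark_circle` helper are not taken from the paper, and a production system would more likely use an imaging library than raw pixel lists.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels; a stand-in for a real image type

RED: Pixel = (255, 0, 0)
WHITE: Pixel = (255, 255, 255)


def text_reference(point: Tuple[int, int]) -> str:
    """Text-based feedback: describe the prior coordinate in the prompt."""
    x, y = point
    return f"A previous prediction for this instruction was at ({x}, {y})."


def mark_circle(image: Image, center: Tuple[int, int], radius: int,
                color: Pixel = RED, thickness: int = 2) -> Image:
    """Image-based feedback: return a copy with a circle outline around center."""
    cx, cy = center
    marked = [row[:] for row in image]  # leave the original screenshot untouched
    inner = max(0, radius - thickness) ** 2
    outer = radius ** 2
    for y in range(max(0, cy - radius), min(len(marked), cy + radius + 1)):
        for x in range(max(0, cx - radius), min(len(marked[y]), cx + radius + 1)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if inner < d2 <= outer:  # draw a ring only, so the control stays visible
                marked[y][x] = color
    return marked


blank = [[WHITE] * 40 for _ in range(40)]
marked = mark_circle(blank, center=(20, 20), radius=10)
print(text_reference((20, 20)))
```

Drawing an outline rather than a filled shape is a choice made here so the marked control remains readable underneath the marker.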

The Result Is a Test-Time Control Loop, Not a New Model

One reason the paper is business-relevant is that CoG is training-free. It wraps existing multimodal models in a multi-step grounding process. The paper tests different model choices across stages, including Qwen3-VL models and UI-TARS-1.5-7B.

This distinction matters.

Fine-tuning is a model development project. It needs data, training infrastructure, evaluation, maintenance, and governance. CoG is closer to an inference-time reliability pattern: ask once, expose the guess, ask again with feedback, then refine. That does not make it free. Multiple inference calls raise latency and cost. But it changes the adoption pathway.

For GUI automation teams, the relevant question becomes less “Can we train a better grounding model from scratch?” and more “Can we reduce wrong-click risk by adding a hypothesis-refinement layer around models we already use?”

That is a much more practical question. Less glamorous, unfortunately. Also more likely to survive contact with procurement.

The Main Evidence: ScreenSpot-Pro Shows Iteration Helps on Professional GUIs

The paper’s main benchmark evidence comes from ScreenSpot-Pro, a professional GUI grounding benchmark covering categories such as Development, Creative, CAD, Scientific, Office, and Operating Systems.

The headline result is straightforward:

| Method | Average grounding accuracy on ScreenSpot-Pro |
|---|---|
| Qwen3-VL-235B-Instruct single-step baseline | 63.9% |
| Previous strong baseline, GTA-32B | 63.6% |
| Dual-Step CoG | 66.7% |
| Triple-Step CoG | 68.4% |

The triple-step CoG configuration reaches 68.4% average accuracy, 4.8 percentage points above the GTA-32B baseline reported in the table and 4.5 percentage points above the single-step Qwen3-VL-235B baseline. The dual-step version reaches 66.7%, already improving materially over the same single-step baseline.
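
The arithmetic behind those gaps is easy to check from the table. The relative error reduction at the end is an extra calculation added here as a second way to read the same numbers; it is not a figure reported in the paper.

```python
# Average ScreenSpot-Pro accuracy (%), copied from the table above.
scores = {
    "Qwen3-VL-235B single-step": 63.9,
    "GTA-32B": 63.6,
    "Dual-Step CoG": 66.7,
    "Triple-Step CoG": 68.4,
}


def pp_gain(method: str, baseline: str) -> float:
    """Absolute gain in percentage points."""
    return round(scores[method] - scores[baseline], 1)


def relative_error_reduction(method: str, baseline: str) -> float:
    """Share of the baseline's errors removed, in percent."""
    baseline_error = 100 - scores[baseline]
    return round((scores[method] - scores[baseline]) / baseline_error * 100, 1)


print(pp_gain("Triple-Step CoG", "GTA-32B"))                      # 4.8
print(pp_gain("Triple-Step CoG", "Qwen3-VL-235B single-step"))    # 4.5
print(pp_gain("Dual-Step CoG", "Qwen3-VL-235B single-step"))      # 2.8
print(relative_error_reduction("Triple-Step CoG",
                               "Qwen3-VL-235B single-step"))      # 12.5
```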

The size of the improvement should be interpreted correctly. This is not a jump from unusable to solved. A 68.4% grounding success rate still means a lot of failures in any production setting where a wrong click has consequences. But the direction is important: the improvement comes from changing the inference procedure, not from simply claiming a bigger model will behave itself if we clap loudly enough.
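
A back-of-envelope calculation shows why the remaining failures matter. The sketch below assumes every click in a workflow succeeds independently at the benchmark rate and that the agent never retries. Both are simplifying assumptions made here, not claims from the paper, and benchmark accuracy is not the same thing as per-click accuracy on a real interface.

```python
def workflow_success(per_click_accuracy: float, clicks: int) -> float:
    """Chance that every click lands, assuming independent clicks and no retries."""
    return per_click_accuracy ** clicks


for label, accuracy in [("single-step", 0.639), ("triple-step CoG", 0.684)]:
    chance = workflow_success(accuracy, clicks=5)
    print(f"{label}: {chance:.1%} chance of five correct clicks in a row")
```

Under these assumptions a five-click task completes cleanly roughly 11% of the time at 63.9% and roughly 15% at 68.4%. Better, and still nowhere near unattended use.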

The category results also show that CoG is not uniformly dominant in every subdomain. Triple-Step CoG is strongest in several categories, including Development, Creative, and OS. Dual-Step CoG performs best in Scientific and Office in the table. CAD is more mixed, with the single-step Qwen3-VL-235B result higher than the triple-step CoG result in that category. This matters because it prevents the lazy conclusion that “more steps always win everywhere.”

The better reading is narrower and stronger: iterative reference feedback improves average grounding performance across a challenging professional-GUI benchmark, but its advantage depends on model combination, domain, and visual structure.

The TPanel-UI Test Moves the Argument Toward Physical Interfaces

The paper’s second contribution is TPanel-UI, a dataset of 420 annotated industrial control-panel instances. The dataset spans 20 commercial brands and includes 320 touch-interface instances and 100 physical-button instances. The authors also introduce degraded variants with disturbances such as blur, masking, exposure shifts, noise, and compression.

This dataset matters because most GUI grounding benchmarks live in digital interface territory. That is useful, but business interfaces are not always clean screenshots. Factories, appliances, medical devices, logistics equipment, control dashboards, and field devices involve glare, worn surfaces, tiny labels, camera distortion, and physical buttons.

The paper uses TPanel-UI as a domain-generalization and robustness-oriented benchmark. It is not a full deployment trial. It does not prove that an agent should be allowed to press machinery controls in the real world. Please do not read one benchmark table and hand a robot the boiler panel. But it does move evaluation closer to the kind of visual ambiguity that industrial automation actually faces.

The reported TPanel-UI results are strong:

| Method | Touch interaction | Physical button | Average |
|---|---|---|---|
| Qwen3-VL-32B single-step | 82.2% | 74.0% | 81.2% |
| Qwen3-VL-235B single-step | 84.7% | 78.0% | 83.1% |
| UI-TARS-1.5-7B single-step | 50.0% | 40.0% | 47.6% |
| CoG: Qwen3-VL-32B → Qwen3-VL-32B | 90.9% | 78.0% | 87.9% |
| CoG: Qwen3-VL-235B → Qwen3-VL-32B | 90.9% | 87.0% | 90.0% |
| CoG: UI-TARS-1.5-7B → UI-TARS-1.5-7B | 64.4% | 54.0% | 61.9% |

The best dual-step CoG configuration, Qwen3-VL-235B followed by Qwen3-VL-32B, reaches 90.0% average accuracy. That is 6.9 percentage points above the strongest single model baseline, Qwen3-VL-235B at 83.1%.
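
The averages for these two rows can be reproduced by weighting the touch and physical-button accuracies by the 320 and 100 instances in the dataset. That weighting is an assumption made here because it matches the reported figures for these rows; the sketch is a sanity check, not the paper's evaluation code.

```python
TOUCH_INSTANCES = 320
PHYSICAL_INSTANCES = 100


def weighted_average(touch_acc: float, physical_acc: float) -> float:
    """Instance-weighted average accuracy (%) across both panel types."""
    total = TOUCH_INSTANCES + PHYSICAL_INSTANCES
    return (touch_acc * TOUCH_INSTANCES + physical_acc * PHYSICAL_INSTANCES) / total


best_cog = weighted_average(90.9, 87.0)      # Qwen3-VL-235B -> Qwen3-VL-32B
best_single = weighted_average(84.7, 78.0)   # Qwen3-VL-235B single-step

print(round(best_cog, 1), round(best_single, 1), round(best_cog - best_single, 1))
# 90.0 83.1 6.9
```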

There is an interesting detail here: the best TPanel-UI configuration is heterogeneous. It uses a stronger model first and a smaller model second. On ScreenSpot-Pro, the best triple-step result uses a sequence of UI-TARS-1.5-7B, Qwen3-VL-235B, and Qwen3-VL-32B. The paper suggests that different models may have different visual blind spots, and that multi-model chains can compensate for them.

That is plausible. It is also not yet a universal rule. The evidence supports the value of testing model combinations; it does not prove that model diversity always improves grounding. In production, the right model sequence would need to be benchmarked against the actual interface family.

The Ablations Explain Why the Mechanism Works

The paper includes several ablations. These are not separate theses. They are diagnostic tests around the mechanism. Their job is to answer: which part of CoG is doing useful work?

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Number of iterations | Main ablation | More refinement steps improve average ScreenSpot-Pro accuracy from 63.9% single-step to 66.7% dual-step and 68.4% triple-step | That infinite or many-step refinement will keep improving |
| Feedback modality | Mechanism ablation | Feedback matters; image-based feedback performs better than text-based feedback in the tested setup | That visual feedback is always superior for every model or interface |
| Marker size | Design sensitivity test | Salient markers help; large markers slightly outperform small markers in the tested configurations | That bigger markers are always safer, especially near tiny controls |
| Model combination | Exploratory extension and ablation | Heterogeneous model chains can outperform repeated use of one model | That any diverse model chain will outperform a single strong model |
| TPanel-UI evaluation | Domain and robustness-oriented test | CoG transfers beyond clean digital GUI screenshots into industrial-style panels | That the method is ready for safety-critical autonomous control |

The number-of-iterations ablation is the cleanest support for the central claim. On ScreenSpot-Pro, the single-step Qwen3-VL-235B baseline scores 63.9%. Dual-Step CoG scores 66.7%. Triple-Step CoG scores 68.4%. The direction is consistent with the paper’s mechanism: a first prediction can be improved by structured revisiting.

The feedback-modality ablation gives the mechanism more texture. With Qwen3-VL-32B in both steps, removing feedback gives the baseline score of 61.4%. Text-based feedback reaches 64.3%. Image-based feedback reaches 65.8%. This suggests that the feedback is not decorative. It gives the model a usable reference. More specifically, visual reference feedback appears more effective than text-only reference in this tested setup.

The marker-size test is less dramatic, and that is useful. Large markers outperform small markers modestly. In one dual-step Qwen3-VL-32B setup, small marks score 65.0% and large marks 65.8%. In the Qwen3-VL-235B → Qwen3-VL-32B setup, small marks score 65.5% and large marks 66.7%. The operational reading is not “always make the marker huge.” It is “make the feedback visible enough for the model to use, while remembering that excessive marking can occlude nearby elements.”

The model-combination test is the most tempting to overinterpret. The best triple-step ScreenSpot-Pro sequence is UI-TARS-1.5-7B → Qwen3-VL-235B → Qwen3-VL-32B, reaching 68.4%. Repeating Qwen3-VL-235B three times reaches 65.6%. Repeating Qwen3-VL-32B three times reaches 67.5%. The paper interprets this as evidence that different models can compensate for each other’s blind spots.

That is a reasonable hypothesis. It is not yet a law of agent architecture. For a company building GUI automation, the right takeaway is empirical: do not assume the strongest single model is automatically the best model for every step in a multi-stage grounding chain. Test the chain.
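
"Test the chain" can be as simple as enumerating model orderings against a labeled set of your own screenshots. The sketch below is a hypothetical harness: the model call signature, the point-in-box hit rule, and the toy models `coarse` and `nudger` are all assumptions introduced for illustration, and it expects at least one step per chain.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # left, top, right, bottom
Model = Callable[[object, str, Optional[Point]], Point]
Example = Tuple[object, str, Box]  # screenshot, instruction, target box


def run_chain(chain: List[Model], screenshot: object, instruction: str) -> Point:
    guess: Optional[Point] = None
    for model in chain:
        guess = model(screenshot, instruction, guess)
    return guess


def hit(point: Point, box: Box) -> bool:
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def rank_chains(models: Dict[str, Model], examples: List[Example],
                steps: int) -> List[Tuple[float, str]]:
    """Score every ordering of `steps` models; best accuracy first."""
    results = []
    for names in product(models, repeat=steps):
        chain = [models[name] for name in names]
        correct = sum(hit(run_chain(chain, s, i), box) for s, i, box in examples)
        results.append((correct / len(examples), " -> ".join(names)))
    return sorted(results, reverse=True)


# Toy models: one ignores feedback, the other only works with a reference.
def coarse(screenshot, instruction, reference):
    return (50, 50)


def nudger(screenshot, instruction, reference):
    return (0, 0) if reference is None else (reference[0] + 10, reference[1])


examples = [(None, "Click export", (55, 45, 65, 55))]
ranking = rank_chains({"coarse": coarse, "nudger": nudger}, examples, steps=2)
print(ranking[0])  # (1.0, 'coarse -> nudger')
```

In this toy setup neither model hits the target on its own, but one ordering of the pair does. That is the kind of effect worth measuring on real interfaces rather than assuming.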

Why This Is Not Just “Prompt Engineering With Circles”

A skeptical reader might say: so the method adds a mark to the image and asks again. Is that really a research contribution?

Yes, if the goal is reliable grounding rather than theatrical novelty.

CoG’s value is not that a red circle is intellectually profound. The value is that it changes the interface between the model’s first guess and its final action. A one-shot model hides its uncertainty inside a coordinate. CoG externalizes the intermediate guess. Once externalized, the guess becomes something the model can compare, revise, or reject.

This is closer to a control loop than a clever prompt. The system creates a provisional state, feeds it back, and updates the state under the same original task. That is exactly the kind of pattern business automation needs more often: not “ask the model once and pray,” but “force the model to expose an intermediate commitment before acting.”

For GUI agents, that difference is practical. Many failures are near misses: wrong nearby button, wrong repeated icon, wrong panel, wrong instance of the same label. A visual reference can make near misses correctable because the next step can reason relationally: the target should be above the label, inside the right card, beside the correct field, not merely somewhere vaguely close.

The paper’s qualitative example around model combinations supports this interpretation. When the same model repeats the same mistake, the chain can remain trapped. When different models participate, one model’s initial hypothesis can be revised by another model’s different visual bias. Again, not magic. Just useful friction.

The Business Value Is Reliability per Task, Not Academic Accuracy per Se

The direct paper result is about benchmark grounding accuracy. The business inference is about action reliability in visual workflows.

Those are related, but not identical.

Here is the clean separation.

| Layer | What the paper directly shows | Cognaptus business interpretation | Remaining uncertainty |
|---|---|---|---|
| GUI grounding | CoG improves average grounding accuracy on ScreenSpot-Pro and TPanel-UI | GUI automation systems can benefit from treating click targets as revisable hypotheses | End-to-end workflow success was not the main evaluation |
| Model deployment | CoG is training-free and works as a multi-step inference framework | Teams may improve reliability without immediately launching a fine-tuning project | More inference calls increase latency and cost |
| Industrial panels | TPanel-UI tests real control-panel images and degraded variants | Visual agents for field operations, QA, maintenance, and HMI support need robustness tests closer to physical environments | Benchmark success is not safety certification |
| Model orchestration | Heterogeneous model chains can outperform repeated use of one model in tested settings | Model choice may be step-specific: anchor model and refinement model need not be identical | Best combinations likely vary by interface type |
| Interpretability | CoG produces intermediate predictions and visual references | Debugging becomes easier than with a single hidden coordinate output | A visible trace is not guaranteed to be faithful reasoning |

The most immediate applications are not necessarily fully autonomous agents clicking around uncontrolled systems. The nearer-term value is in supervised or human-in-the-loop workflows:

  • RPA systems that need more reliable visual fallback when DOM or accessibility trees are unavailable.
  • QA agents that inspect whether buttons, labels, and controls are correctly placed.
  • Accessibility assistants that identify interface elements for users who cannot easily inspect the screen.
  • Industrial HMI support tools that help operators locate controls under noisy visual conditions.
  • Agent debugging dashboards where intermediate grounding hypotheses are reviewed before execution.

In these settings, CoG-style grounding can serve as a reliability layer. It can also serve as a diagnostic layer. If an agent fails, the intermediate marks show where the visual interpretation drifted. That does not solve the failure automatically, but it gives engineers something better than “the model clicked wrong because vibes.”
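
One way to use the intermediate marks for diagnosis is to measure how far each refinement step moved the hypothesis. The sketch below is illustrative only: the trace format and the 50-pixel review threshold are assumptions made here, and a large shift is a prompt to look, not proof of an error.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]


def step_shifts(trace: List[Point]) -> List[float]:
    """Pixel distance between each hypothesis and the next one."""
    return [math.dist(a, b) for a, b in zip(trace, trace[1:])]


def needs_review(trace: List[Point], max_shift_px: float = 50.0) -> bool:
    """Flag traces where any refinement jumped further than the threshold."""
    return any(shift > max_shift_px for shift in step_shifts(trace))


steady = [(100, 52), (112, 52), (112, 57)]
jumpy = [(100, 52), (400, 300)]

print(step_shifts(steady))   # [12.0, 5.0]
print(needs_review(steady))  # False
print(needs_review(jumpy))   # True
```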

The Cost Is Real: Two or Three Looks Are Slower Than One

The paper’s limitations are not cosmetic. They directly affect deployment.

First, CoG increases inference cost. Dual-step grounding means at least two model calls. Triple-step grounding means at least three. If the grounding model is large, expensive, or hosted with latency constraints, this matters. A customer-service browser agent might tolerate a slightly slower click. A real-time industrial control assistant might not.
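
Because the steps run in sequence, their latencies add up. The sketch below uses made-up per-call latencies and model labels purely for illustration; it ignores marker rendering and network overhead, and real numbers would have to come from measurement.

```python
from typing import Dict, List

# Hypothetical per-call latencies in milliseconds; measure your own deployment.
LATENCY_MS: Dict[str, int] = {"small-model": 400, "large-model": 1800}


def chain_latency_ms(chain: List[str]) -> int:
    """Sequential calls add up: each step waits for the previous guess."""
    return sum(LATENCY_MS[name] for name in chain)


def fits_budget(chain: List[str], budget_ms: int) -> bool:
    return chain_latency_ms(chain) <= budget_ms


single = ["large-model"]
dual = ["large-model", "small-model"]
triple = ["small-model", "large-model", "small-model"]

for name, chain in [("single", single), ("dual", dual), ("triple", triple)]:
    print(name, chain_latency_ms(chain), fits_budget(chain, budget_ms=2500))
# single 1800 True
# dual 2200 True
# triple 2600 False
```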

Second, CoG remains bounded by the base models. A feedback loop can help a model correct near misses. It cannot reliably manufacture visual understanding that the model does not possess. If the target is unreadable, outside the captured image, or semantically misunderstood, iteration may simply produce a more elaborate mistake.

Third, the intermediate trace should not be mistaken for faithful reasoning. CoG produces visible hypotheses and refinements, which are useful for debugging. But a correct final click can still arise from flawed intermediate steps, and an apparently reasonable trace can still end wrong. The paper explicitly notes this.

Fourth, benchmark generalization remains open. ScreenSpot-Pro and TPanel-UI are valuable tests, but they do not cover every enterprise GUI, every resolution, every localization issue, every custom internal dashboard, or every safety-critical panel. The model-combination findings especially should be treated as empirical design guidance, not a permanent architecture theorem.

Finally, marker design can create its own risk. A marker that is too small may be ignored. A marker that is too large can cover nearby controls. The paper tests marker size, but production systems would need interface-aware marking policies, especially for dense UIs.
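
An interface-aware marking policy could be as simple as shrinking the marker when other controls are close. The sketch below is a hypothetical heuristic, not something the paper proposes. It assumes the centers of nearby elements are available from some detector or accessibility source, which will not always be true, and the default sizes are arbitrary.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]


def marker_radius(guess: Point, neighbors: List[Point], preferred: int = 24,
                  minimum: int = 6, margin: int = 2) -> int:
    """Pick a radius that stays visible but stops short of the nearest neighbor."""
    if not neighbors:
        return preferred  # nothing nearby to occlude
    nearest = min(math.dist(guess, n) for n in neighbors)
    # Shrink toward the nearest neighbor, but never below the visibility floor.
    return max(minimum, min(preferred, int(nearest) - margin))


print(marker_radius((100, 100), []))                        # 24
print(marker_radius((100, 100), [(118, 100), (100, 160)]))  # 16
print(marker_radius((100, 100), [(105, 100)]))              # 6
```

The last case shows the trade-off directly: in a very dense layout the visibility floor wins, and some overlap with the neighbor is accepted rather than drawing a marker too small to notice.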

Where CoG Fits in an Automation Architecture

A useful production version of CoG would probably not run blindly on every click. That would be wasteful.

It fits best as a selective reliability layer. The system can use one-shot grounding for easy targets and escalate to CoG when ambiguity is high: repeated icons, dense toolbars, low confidence, small targets, OCR uncertainty, visual degradation, or high consequence of error.

A practical architecture might look like this:

| Situation | Suggested grounding mode | Rationale |
|---|---|---|
| Large obvious button with clear text | Single-step grounding | Low ambiguity; save latency |
| Dense toolbar or repeated icon | Dual-step CoG | Prior guess can be visually checked against nearby alternatives |
| High-consequence action | Triple-step or multi-model CoG with confirmation | Wrong click cost justifies extra inference |
| Physical panel photo with glare or blur | CoG plus domain-specific validation | Visual feedback may help, but robustness must be tested locally |
| Repeated failure on the same interface family | Collect failure traces for interface-specific evaluation or fine-tuning | CoG exposes useful debugging artifacts |
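
The first four rows of that table translate naturally into a small routing function. This is a sketch under assumptions: the `Target` fields, the `Mode` names, and the priority order among the rules are choices made here for illustration, and the signals themselves would have to come from whatever ambiguity detection a real system has.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SINGLE_STEP = "single-step grounding"
    DUAL_STEP = "dual-step CoG"
    TRIPLE_STEP_CONFIRM = "triple-step or multi-model CoG with confirmation"
    COG_PLUS_VALIDATION = "CoG plus domain-specific validation"


@dataclass
class Target:
    """Hypothetical ambiguity signals for one click target."""
    high_consequence: bool = False
    physical_panel_photo: bool = False
    repeated_icon: bool = False
    dense_toolbar: bool = False


def choose_mode(target: Target) -> Mode:
    # Check the costliest situations first so they are never downgraded.
    if target.high_consequence:
        return Mode.TRIPLE_STEP_CONFIRM
    if target.physical_panel_photo:
        return Mode.COG_PLUS_VALIDATION
    if target.repeated_icon or target.dense_toolbar:
        return Mode.DUAL_STEP
    return Mode.SINGLE_STEP  # large, obvious targets: save the latency


print(choose_mode(Target()).value)
print(choose_mode(Target(dense_toolbar=True)).value)
print(choose_mode(Target(dense_toolbar=True, high_consequence=True)).value)
```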

This is where the paper becomes more than a benchmark result. It suggests a design pattern for GUI agents: do not treat grounding as a stateless utility function. Treat it as a small deliberative process with memory, feedback, and escalation.

That does not make the agent conscious. It makes the pipeline less foolish. In enterprise automation, that is already a win.

The Quiet Shift: From Bigger Eyes to Better Checking

The common misconception is that better GUI agents mainly need larger multimodal models or more fine-tuning data. Sometimes they do. But this paper shows another lever: structure the inference process so the model can reuse and revise its own intermediate visual hypotheses.

That is a different kind of improvement. It is less about giving the model bigger eyes and more about forcing it to check where it is looking.

The ScreenSpot-Pro results show that this can improve average professional-GUI grounding accuracy. The TPanel-UI results suggest the approach can help in more physical, visually noisy interface settings. The ablations show that the mechanism is not arbitrary: iteration, feedback modality, marker visibility, and model combination all affect performance.

The right business conclusion is disciplined optimism. CoG is not a license to let agents click anything anywhere. It is a useful pattern for reducing one-shot grounding errors, especially in dense visual interfaces where wrong actions are costly and where full retraining is impractical.

The real lesson is almost embarrassingly simple: before an AI agent acts on a screen, make it point, look again, and correct itself.

Grounding is no longer just finding the coordinate. It is managing the path to the coordinate.

Cognaptus: Automate the Present, Incubate the Future.


[1] Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, and Shilong Liu, “Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback,” arXiv:2512.01979, 2025, https://arxiv.org/html/2512.01979