Click with Confidence: Teaching GUI Agents When *Not* to Click

A click looks harmless until it is not.

In consumer software, a wrong click means opening the wrong tab, dismissing the wrong pop-up, or buying the wrong color of phone case. Annoying, perhaps. Civilization survives. In enterprise workflows, a wrong click can approve a payment, change a configuration, delete a record, or submit a compliance form with the confidence of a sleepwalker holding admin rights.

That is the uncomfortable center of the paper “SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration.”¹ The paper is not mainly about making GUI agents more accurate in the usual leaderboard sense. It is about a more operational question: when should a GUI grounding model be allowed to act, and when should it be forced to pause, abstain, or escalate?

This distinction matters. Most discussions of GUI agents still treat grounding accuracy as the central metric. The model receives a screenshot and an instruction, then predicts a coordinate. If that coordinate lands inside the target region, the prediction is correct. If not, it fails. Simple enough.

But deployment is not a benchmark table. In production, every accepted prediction becomes an action. A high average accuracy can still be unacceptable if the remaining errors cluster around irreversible or expensive operations. “The model is usually right” is not a governance policy. It is a sentence people say shortly before writing incident reports.

SafeGround’s contribution is to add a risk-aware decision layer around existing GUI grounding models. It does not ask the model to introspect politely about its confidence. It does not require access to logits. It does not assume the model exposes neat internal probabilities. Instead, it repeatedly samples where the model would click, converts the spatial pattern of those samples into an uncertainty score, and then calibrates a threshold that controls the false discovery rate among accepted actions.

In plain English: click only when the model’s spatial behavior looks reliable enough under a risk budget chosen before deployment.

That is a better mental model than “bigger model equals safer agent.” Sometimes the safer system is not the one that always acts. Sometimes it is the one that knows when to become someone else’s problem.

SafeGround adds a risk gate after the model, not a new model inside it

The paper’s first useful move is architectural. SafeGround is designed as a plug-in wrapper for GUI grounding models. The base model can be Holo1.5, GUI-Actor, UI-TARS, GTA1, or another coordinate-predicting GUI model. SafeGround does not replace the base model. It watches the model’s output behavior and decides whether the predicted action should be admitted.

The workflow is roughly:

Screenshot + instruction
        ↓
Primary GUI grounding model
        ↓
Multiple stochastic coordinate samples
        ↓
Spatial density map and candidate regions
        ↓
Uncertainty score
        ↓
Calibrated threshold
        ↓
Execute, abstain, or escalate

The important point is that SafeGround treats uncertainty as an operational signal, not as a decorative number printed beside the answer.

Many uncertainty methods do not fit GUI grounding well. Token probabilities require access to model internals and may not map cleanly to whether a coordinate is safe. Verbal confidence is fragile because models can say “I am confident” with the calm authority of a hotel receptionist giving directions to a building that no longer exists. Some GUI uncertainty methods depend on ground-truth annotations, which are obviously unavailable at test time.

SafeGround avoids these traps by using sampled outputs. The model is asked, multiple times, to ground the same instruction on the same interface. If the predicted coordinates cluster tightly around the same target, that is evidence of lower uncertainty. If the predictions scatter across competing screen regions, that is evidence of risk.

This is not a philosophical claim about consciousness, self-awareness, or whether the model “knows” anything. Good. We have enough of that already. It is a behavioral test: when nudged through stochastic sampling, does the model keep pointing to the same place?

Uncertainty comes from where repeated clicks land

SafeGround builds uncertainty from spatial dispersion.

For each input, the method samples the GUI grounding model multiple times. In the experiments, the authors use 10 samples with stochastic decoding. Those sampled coordinates are projected onto a discretized screen grid. The grid becomes a density map: patches with many sampled coordinates receive higher density; patches with few or no samples receive lower density.

From this density map, SafeGround extracts connected high-density regions. Each region represents a plausible target area. The method then scores and ranks these regions.
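The sampling-to-regions step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the grid size, the 4-connectivity rule, the `min_count` cutoff, and the "share of samples" region score are all assumptions made for the example, and the function names are hypothetical.

```python
from collections import deque

def density_map(points, width, height, grid=10):
    """Count sampled (x, y) clicks per cell of a grid x grid screen partition."""
    counts = [[0] * grid for _ in range(grid)]
    for x, y in points:
        col = min(max(int(x / width * grid), 0), grid - 1)
        row = min(max(int(y / height * grid), 0), grid - 1)
        counts[row][col] += 1
    return counts

def connected_regions(counts, min_count=1):
    """Group 4-connected cells holding at least min_count samples (assumed rule)."""
    rows, cols = len(counts), len(counts[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if counts[r][c] < min_count or (r, c) in seen:
                continue
            queue, cells, mass = deque([(r, c)]), [], 0
            seen.add((r, c))
            while queue:  # breadth-first flood fill over dense neighbours
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                mass += counts[cr][cc]
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and (nr, nc) not in seen and counts[nr][nc] >= min_count:
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            regions.append({"cells": cells, "mass": mass})
    regions.sort(key=lambda reg: reg["mass"], reverse=True)  # rank by sample mass
    return regions

def region_scores(regions):
    """Score each region by its share of all samples (illustrative choice)."""
    total = sum(reg["mass"] for reg in regions)
    return [reg["mass"] / total for reg in regions]

# Ten hypothetical samples on a 1000 x 1000 screenshot: seven agree, three stray.
samples = [(120, 130), (125, 128), (118, 135), (122, 131), (119, 129),
           (130, 140), (121, 133), (850, 820), (860, 815), (855, 830)]
regions = connected_regions(density_map(samples, 1000, 1000))
scores = region_scores(regions)
print(scores)  # [0.7, 0.3]
```

Two regions with a 70/30 split is exactly the kind of picture the uncertainty score is supposed to notice: the model has a favorite, but it also has a second opinion.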

This matters because GUI grounding is spatial. The model is not choosing from a tidy list of labels. It is trying to place a point on a screen where multiple buttons, icons, panels, and text fields may be visually close. Two wrong clicks may be semantically very different even if their token probabilities look similar. A coordinate near the right button is not the same kind of error as a coordinate on a destructive button three panels away.

SafeGround captures three related but distinct uncertainty signals:

Signal	What it measures	Operational interpretation
Top-candidate ambiguity	Whether the leading region clearly beats the runner-up	The model is torn between nearby plausible targets
Informational dispersion	How spread out the region-score distribution is	The model’s belief is scattered across the interface
Concentration deficit	Whether probability mass has a dominant spatial focus	No region clearly deserves the click

The combined uncertainty score aggregates these three components with fixed weights. That fixed-weight choice is not just a technical convenience. It is part of the deployment story. A method that needs careful model-by-model tuning is less attractive as a general risk layer. SafeGround is trying to be boring in the right way: attach it, calibrate it, and use the resulting threshold as an execution policy.
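To make the three signals concrete, here is a hedged sketch that computes one plausible version of each from a list of region scores. The formulas are stand-ins chosen for the illustration (margin between the top two regions, normalized Shannon entropy, and one minus the sum of squared scores); the article does not give the paper's exact definitions, and the equal weights below are an assumption, not the paper's fixed weights.

```python
import math

def uncertainty_components(scores):
    """Return (ambiguity, dispersion, deficit) for region scores that sum to 1.

    All three formulas are illustrative assumptions, not the paper's definitions.
    """
    p = sorted(scores, reverse=True)
    top = p[0]
    runner_up = p[1] if len(p) > 1 else 0.0
    # Top-candidate ambiguity: a small margin over the runner-up means high ambiguity.
    ambiguity = 1.0 - (top - runner_up)
    # Informational dispersion: Shannon entropy, normalized to [0, 1].
    if len(p) > 1:
        entropy = -sum(q * math.log(q) for q in p if q > 0)
        dispersion = entropy / math.log(len(p))
    else:
        dispersion = 0.0
    # Concentration deficit: 1 minus the sum of squared scores.
    deficit = 1.0 - sum(q * q for q in p)
    return ambiguity, dispersion, deficit

def combined_uncertainty(scores, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fixed-weight aggregate; the equal weights are a placeholder."""
    return sum(w * c for w, c in zip(weights, uncertainty_components(scores)))

print(combined_uncertainty([1.0]))       # one dominant region: 0.0
print(combined_uncertainty([0.7, 0.3]))  # a favorite with a rival: moderate
print(combined_uncertainty([0.5, 0.5]))  # a coin flip between two regions: higher
```

The point of the sketch is the ordering, not the absolute values: a single tight cluster scores lowest, and an even split between two regions scores highest.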

There is a subtle but important distinction here. The uncertainty score itself does not guarantee safety. It only ranks predictions from more reliable-looking to less reliable-looking. The guarantee comes later, when the score is calibrated against observed errors.

That is where the paper becomes more interesting than another “we measured confidence” exercise.

Calibration turns doubt into an operating rule

A raw uncertainty score is not enough. A business system needs a decision rule.

SafeGround uses a Learn-Then-Test calibration procedure. On a held-out calibration set, it evaluates candidate uncertainty thresholds. For each threshold, predictions whose uncertainty falls below the threshold are “accepted,” and predictions whose uncertainty falls above it are rejected or escalated. The method then estimates the false discovery rate, meaning the proportion of accepted predictions that are actually wrong.

The basic risk quantity is:

$$ FDR(\tau) = \frac{\text{number of incorrect accepted predictions under threshold } \tau} {\text{number of accepted predictions under threshold } \tau} $$
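As a worked instance of that ratio, the following sketch computes the empirical FDR on a tiny, made-up calibration set. The function name and the data are hypothetical, and treating "uncertainty equal to the threshold" as accepted is a convention chosen here.

```python
def empirical_fdr(uncertainties, correct, tau):
    """Return (FDR, number accepted) when accepting predictions with uncertainty <= tau."""
    accepted = [ok for u, ok in zip(uncertainties, correct) if u <= tau]
    if not accepted:
        return 0.0, 0  # nothing accepted, so nothing wrongly accepted
    wrong = sum(1 for ok in accepted if not ok)
    return wrong / len(accepted), len(accepted)

# Five hypothetical calibration examples: uncertainty score and whether the click was right.
uncertainties = [0.1, 0.2, 0.3, 0.6, 0.8]
correct = [True, True, False, True, False]

fdr, n_accepted = empirical_fdr(uncertainties, correct, tau=0.5)
print(fdr, n_accepted)  # three accepted, one of them wrong
```

With a threshold of 0.5, three predictions are admitted and one of them is wrong, so the empirical FDR is one in three. Note that the two rejected predictions, one right and one wrong, do not enter the ratio at all.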

The user supplies a risk level: the maximum FDR tolerated among accepted actions. SafeGround then chooses the largest uncertainty threshold that satisfies that constraint, using a Clopper–Pearson upper confidence bound on the calibration-set error rate to provide finite-sample control.
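A simplified version of this selection step is sketched below using only the standard library. It computes a one-sided Clopper–Pearson upper bound by bisection and returns the largest candidate threshold whose bound stays under the requested risk level. Two caveats, stated plainly: the names `alpha` (risk level) and `delta` (allowed failure probability of the bound) are notation chosen here, and the sketch omits the multiple-testing treatment that a full Learn-Then-Test procedure typically applies when many thresholds are examined. The exact binomial sum is also only practical for calibration sets of up to roughly a thousand examples.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, delta=0.05):
    """One-sided (1 - delta) upper bound on an error rate, given k errors in n trials."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def calibrate_threshold(uncertainties, correct, alpha, delta=0.05):
    """Largest threshold whose accepted-error upper bound is <= alpha, else None."""
    chosen = None
    for tau in sorted(set(uncertainties)):
        accepted = [ok for u, ok in zip(uncertainties, correct) if u <= tau]
        errors = sum(1 for ok in accepted if not ok)
        if clopper_pearson_upper(errors, len(accepted), delta) <= alpha:
            chosen = tau  # keep scanning: a larger threshold may also pass
    return chosen

# Hypothetical calibration set: 30 confident correct clicks, 10 uncertain wrong ones.
cal_u = [i / 100 for i in range(1, 31)] + [0.5 + i / 100 for i in range(1, 11)]
cal_ok = [True] * 30 + [False] * 10

tau_loose = calibrate_threshold(cal_u, cal_ok, alpha=0.10)
tau_strict = calibrate_threshold(cal_u, cal_ok, alpha=0.01)
print(tau_loose, tau_strict)
```

On this toy data the 10% risk level admits every prediction up to the last correct one, while the 1% risk level returns `None`: thirty clean examples are simply not enough evidence to certify an error rate that low.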

This is the paper’s real deployment contribution. It does not merely say, “high uncertainty is bad.” Everyone suspected as much. It asks: at what threshold can we admit predictions while keeping the error rate among admitted actions below a specified tolerance?

That changes the business conversation.

Without calibration, an automation team might say:

“The GUI agent has 52% grounding accuracy on a difficult benchmark.”

With calibration, the conversation becomes:

“For accepted actions, we can target a maximum error proportion, and the system will abstain or escalate when that target is not achievable.”

The second sentence is much closer to something a risk owner, compliance team, or operations manager can reason about. It is still not magic. The guarantee depends on calibration data and assumptions such as exchangeability between calibration and test settings. But it is a contract-shaped control, not a vibe-shaped confidence score.

The method also has a useful failure mode. Some strict risk levels may be unattainable. That happens when the base model and uncertainty score cannot cleanly separate wrong predictions from correct ones. In that case, no threshold can admit useful actions while satisfying the requested risk level.

This is not a bug. It is exactly the kind of refusal enterprise systems need more often. An AI system saying “I cannot safely operate under this risk budget” is far more useful than a model that performs interpretive dance around its own limitations.

The experiments test three claims, not one leaderboard

The paper evaluates SafeGround on ScreenSpot-Pro, a professional high-resolution GUI grounding benchmark with 1,581 screenshot-instruction pairs. The benchmark is suitable for this study because it contains dense, visually complex interfaces where neighboring UI elements can be hard to distinguish.

The authors test six GUI grounding models: Holo1.5-3B, Holo1.5-7B, GUI-Actor-2VL-7B, GUI-Actor-2.5VL-7B, UI-TARS-1.5-7B, and GTA1-7B. They evaluate uncertainty quality, selective prediction under false discovery rate control, and cascading inference to a stronger model.

These experiments serve different purposes. Mixing them together as “SafeGround improves performance” would be too crude.

Experiment or analysis	Likely purpose	What it supports	What it does not prove
AUROC and AUARC comparisons	Main evidence for uncertainty quality	Spatial uncertainty is useful for separating likely-correct from likely-wrong clicks	It does not by itself provide deployment risk control
FDR calibration tests	Main evidence for risk control	Calibrated thresholds keep accepted-error rates below target levels in the benchmark setting	It does not guarantee success under distribution shift
Power comparison	Operational usefulness test	SafeGround can retain more correct actions than weaker uncertainty baselines under the same risk constraint	It does not mean every action should be automated
Cascading inference	System-level deployment test	A weaker/local model plus selective escalation can beat base models and sometimes Gemini-only inference	It does not provide full cost accounting
Sampling-size sensitivity	Robustness and implementation-cost test	Around 10 samples appears effective; more samples show diminishing returns	It does not remove sampling overhead
Component ablation	Ablation	The three uncertainty components capture complementary failure modes	It does not show one universal component is enough
Split-ratio, temperature, and weighting analyses	Robustness/sensitivity tests	The method is not extremely fragile to moderate setting changes	It does not eliminate the need for calibration data
Qualitative case studies	Illustration	The spatial uncertainty score can reflect visible ambiguity	It is not the main statistical evidence

This classification matters because papers often accumulate experiments until the reader loses track of what each one is supposed to establish. In this case, the logical chain is clean:

1. The uncertainty score must separate safer clicks from riskier clicks.
2. The calibrated threshold must control accepted-action error.
3. Selective admission or escalation must improve system behavior.
4. Sensitivity tests must show the mechanism is not a delicate laboratory flower.

The paper provides evidence for each step, though with boundaries that matter for real deployment.

Spatial uncertainty is more useful than token confidence

The first empirical question is whether SafeGround’s uncertainty score can identify wrong predictions better than a probability-based baseline.

The paper uses AUROC to measure whether the uncertainty score separates incorrect from correct predictions. Higher AUROC means the uncertainty score is better at ranking errors as risky. Where probabilistic confidence is available, SafeGround’s combined uncertainty score improves AUROC across the reported comparisons:

Model	Probabilistic confidence AUROC	SafeGround combined AUROC	Interpretation
Holo1.5-3B	0.7576	0.8056	Spatial dispersion is noticeably more informative
Holo1.5-7B	0.6983	0.7526	Better discrimination despite a stronger base model
UI-TARS-1.5-7B	0.7844	0.8021	Smaller but still positive gain
GTA1-7B	0.6114	0.6344	Improvement, but uncertainty remains weaker
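For readers who want to see what those AUROC numbers measure, here is a small reference computation. It uses the pairwise definition: the probability that a randomly chosen wrong prediction carries a higher uncertainty score than a randomly chosen correct one, with ties counted as half. The data are invented for illustration and have nothing to do with the reported results.

```python
def auroc(uncertainties, is_error):
    """Pairwise AUROC of an uncertainty score for detecting wrong predictions."""
    errors = [u for u, e in zip(uncertainties, is_error) if e]
    corrects = [u for u, e in zip(uncertainties, is_error) if not e]
    if not errors or not corrects:
        raise ValueError("need at least one wrong and one correct prediction")
    wins = 0.0
    for ue in errors:
        for uc in corrects:
            if ue > uc:
                wins += 1.0   # the error looked riskier, as it should
            elif ue == uc:
                wins += 0.5   # ties count as half
    return wins / (len(errors) * len(corrects))

# Four hypothetical predictions: two correct, two wrong.
scores = [0.10, 0.40, 0.35, 0.80]
wrong = [False, False, True, True]
print(auroc(scores, wrong))  # 0.75
```

An AUROC of 0.5 would mean the score ranks errors no better than chance; 1.0 would mean every error looks riskier than every correct click.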

For GUI-Actor variants, probabilistic confidence is not directly available, which is part of the point. SafeGround still produces usable AUROC values because it depends on sampled coordinates rather than internal probability access.

AUARC tells a related but more deployment-flavored story: if we reject higher-uncertainty predictions, does accepted accuracy improve? SafeGround generally performs well here too, with one exception: on GTA1 the AUARC comparison is a near-tie, with SafeGround slightly below the probabilistic confidence baseline. That small exception is worth noting because it prevents the result from becoming too neat. The method is strong, not enchanted.
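One common way to discretize the accuracy-rejection curve is sketched below: sort predictions from most to least certain, compute the accuracy of the top-k accepted set for every k, and average. Definitions of AUARC vary in how they handle the grid and the endpoints, so treat this as one reasonable convention rather than the paper's exact metric.

```python
def auarc(uncertainties, correct):
    """Mean accepted accuracy over all retention levels, most certain first."""
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    hits = 0
    accuracies = []
    for kept, i in enumerate(order, start=1):
        hits += 1 if correct[i] else 0
        accuracies.append(hits / kept)  # accuracy when keeping the `kept` most certain
    return sum(accuracies) / len(accuracies)

# Same four hypothetical predictions, scored by a helpful and an unhelpful uncertainty.
u = [0.1, 0.2, 0.3, 0.4]
helpful = auarc(u, [True, True, False, False])    # errors ranked as most uncertain
unhelpful = auarc(u, [False, False, True, True])  # errors ranked as most certain
print(helpful, unhelpful)
```

Both scenarios have the same 50% base accuracy. The difference in AUARC comes entirely from whether rejection removes the right predictions first.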

The ablation results explain why combining the three uncertainty components helps. Different models benefit from different signals. For GTA1, one component is more informative; for GUI-Actor and Holo1.5, others matter more. No single uncertainty component dominates across all models. The combined score is therefore less about finding the one perfect measure and more about building a model-agnostic risk signal that behaves reasonably across architectures.

That is the right goal for a plug-in safety layer. Enterprise systems rarely get the luxury of tuning a bespoke uncertainty philosophy for every model update.

FDR control is the safety claim; accuracy gain is the bonus

The paper’s most important evidence is not the accuracy table. It is the false discovery rate control.

SafeGround calibrates thresholds for different user-specified risk levels and evaluates test-time FDR across repeated calibration-test splits. The reported results show empirical FDR staying below the theoretical upper bound across tested models and risk levels.

This is the central safety claim: among accepted predictions, the proportion of wrong actions can be controlled below a chosen risk tolerance with high probability, under the calibration setup.
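Written out, and using notation chosen for this article rather than taken from the paper, the claim has the following shape. Let $\hat{\tau}$ be the threshold selected on the calibration set, $\alpha$ the user-specified risk level, and $1 - \delta$ the confidence level of the bound:

$$ \Pr\big( FDR(\hat{\tau}) \le \alpha \big) \ge 1 - \delta $$

The probability is over the random draw of the calibration data, and the statement assumes that calibration and test examples are exchangeable. Both $\alpha$ and $\delta$ are choices the operator makes before deployment; the inequality says nothing about any individual click.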

That wording is intentionally careful.

It does not mean every GUI action is safe. It does not mean the model will never click incorrectly. It does not mean a 0.34 risk level is emotionally comforting in every business process. A one-third wrong-action tolerance would be absurdly high for payment approval but perhaps informative for a difficult benchmark where the base models are weak. The point is not that the tested risk levels are universal business defaults. The point is that the framework exposes the trade-off and lets the operator define a threshold through calibration.

There is another useful detail: SafeGround reports power, which measures how many correct predictions are retained under the risk constraint. A safety system that rejects everything can satisfy many risk constraints while being commercially useless. The interesting system is one that controls risk while still admitting a meaningful fraction of correct actions.
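The tension between risk control and usefulness is easy to show numerically. In this sketch, power is taken to be the fraction of all correct predictions that survive the threshold, which is our reading of the description above rather than a formula quoted from the paper. The data are hypothetical.

```python
def power(uncertainties, correct, tau):
    """Fraction of correct predictions that are accepted at threshold tau."""
    total_correct = sum(1 for ok in correct if ok)
    if total_correct == 0:
        return 0.0
    kept = sum(1 for u, ok in zip(uncertainties, correct) if ok and u <= tau)
    return kept / total_correct

u = [0.1, 0.2, 0.6, 0.7]
ok = [True, False, True, True]

print(power(u, ok, tau=0.5))   # only one of three correct clicks is admitted
print(power(u, ok, tau=-1.0))  # reject everything: no errors, and no work done
```

The second call is the degenerate gate: it never admits a wrong action, and it never admits a right one either.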

The paper finds that SafeGround tends to retain more correct predictions than probabilistic confidence under the same FDR constraint, especially under stricter risk levels. This is where better uncertainty estimation becomes economically relevant. Better uncertainty does not merely make the system cautious. It makes the system selectively cautious.

That is the difference between a useful safety gate and a very expensive “no.”

Cascading is the business result: local first, expert when needed

The cascading experiment is where SafeGround becomes easiest to translate into business architecture.

The setup is simple. A primary GUI grounding model handles the case when uncertainty is below the calibrated threshold. If uncertainty is above the threshold, the case is deferred to a stronger expert model, Gemini-3-pro. The result is a hybrid system: cheap or local inference for low-risk cases, stronger external inference for uncertain cases.
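The routing rule itself is small enough to sketch. Everything named below is a placeholder: `primary`, `expert`, and `uncertainty_of` stand in for whatever model calls and scoring a real system would use, and none of them corresponds to an actual API.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]
Model = Callable[[str, str], Point]

def cascade(primary: Model, expert: Model,
            uncertainty_of: Callable[[str, str], float],
            tau: Optional[float], screenshot: str, instruction: str):
    """Use the primary model when its uncertainty is within tau, else defer."""
    u = uncertainty_of(screenshot, instruction)
    if tau is not None and u <= tau:
        return primary(screenshot, instruction), "primary"
    # No calibrated threshold, or uncertainty too high: hand off to the expert.
    return expert(screenshot, instruction), "expert"

# Stub models for illustration only.
def primary(screenshot, instruction):
    return (10.0, 20.0)

def expert(screenshot, instruction):
    return (300.0, 400.0)

confident = cascade(primary, expert, lambda s, i: 0.2, 0.5, "screen.png", "Click Save")
doubtful = cascade(primary, expert, lambda s, i: 0.9, 0.5, "screen.png", "Click Save")
print(confident, doubtful)
```

In a real deployment the uncertainty would come from sampling the primary model, so the cost of the gate is paid on every input, while the cost of the expert is paid only on deferred ones.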

This matters because businesses do not only optimize accuracy. They optimize accuracy, cost, latency, privacy, auditability, and failure recovery. Calling the strongest model for every click may be accurate but expensive. Using only a small local model may be cheap but unreliable. Cascading offers a middle path: spend the expensive model only where the primary model’s uncertainty justifies it.

The reported results show meaningful gains.

Primary model	Risk level	Base accuracy	SafeGround cascade accuracy	Gain over base	Gain over Gemini-only
Holo1.5-7B	0.34	52.41%	58.66%	+6.25 pp	+5.38 pp
Holo1.5-3B	0.34	45.45%	53.44%	+7.99 pp	+0.16 pp
UI-TARS-1.5-7B	0.38	41.58%	54.70%	+13.12 pp	+1.42 pp
GUI-Actor-2.5VL-7B	0.34	45.69%	55.18%	+9.49 pp	+1.90 pp
GUI-Actor-2VL-7B	0.34	40.79%	55.18%	+14.39 pp	+1.90 pp

The headline result is Holo1.5-7B at risk level 0.34: the cascade reaches 58.66% accuracy, beating Gemini-only inference at 53.28% by 5.38 percentage points and beating the Holo1.5-7B base model by 6.25 points.
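The table's arithmetic can be checked mechanically. The sketch below copies the reported figures from the table and the 53.28% Gemini-only accuracy from the paragraph above, then verifies that each gain column equals the corresponding difference. It assumes, as the gains imply, that the same Gemini-only figure applies to every row.

```python
GEMINI_ONLY = 53.28  # percent, as quoted for the Holo1.5-7B comparison

# model: (base accuracy, cascade accuracy, gain over base, gain over Gemini-only)
ROWS = {
    "Holo1.5-7B": (52.41, 58.66, 6.25, 5.38),
    "Holo1.5-3B": (45.45, 53.44, 7.99, 0.16),
    "UI-TARS-1.5-7B": (41.58, 54.70, 13.12, 1.42),
    "GUI-Actor-2.5VL-7B": (45.69, 55.18, 9.49, 1.90),
    "GUI-Actor-2VL-7B": (40.79, 55.18, 14.39, 1.90),
}

def inconsistent_rows(rows, gemini_only, tol=0.005):
    """Return the models whose reported gains do not match the differences."""
    problems = []
    for name, (base, cascade_acc, gain_base, gain_gemini) in rows.items():
        if abs((cascade_acc - base) - gain_base) > tol:
            problems.append((name, "gain over base"))
        if abs((cascade_acc - gemini_only) - gain_gemini) > tol:
            problems.append((name, "gain over Gemini-only"))
    return problems

print(inconsistent_rows(ROWS, GEMINI_ONLY))  # []
```

Every row is internally consistent, and every cascade lands above the single Gemini-only figure, though for Holo1.5-3B the margin is a slim 0.16 points.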

That result is easy to misread. It does not mean Holo1.5-7B is “better than Gemini.” It means the cascade can combine the strengths of the primary model and the expert model through uncertainty-aware routing. The routing policy matters.

There is also a cost-side implication, but we should not overstate it. The paper reports that cascading rate decreases as the allowed risk level increases: looser risk budgets escalate fewer cases. That supports a cost-control interpretation. However, the paper does not provide a full enterprise cost model with API pricing, latency distributions, human-review overhead, or incident costs. So the correct business reading is:

SafeGround provides the technical mechanism for risk-budgeted escalation; the ROI depends on the cost and consequence profile of the actual workflow.

That is still useful. It gives product teams a design pattern rather than another benchmark trophy.

The practical design pattern is risk-budgeted autonomy

For business deployment, SafeGround suggests a simple architecture for GUI automation:

Layer	Role	Business question
Primary GUI model	Makes the initial grounding prediction	Can a cheaper or local model handle routine cases?
Uncertainty sampler	Observes whether repeated predictions cluster or scatter	Does this instruction-screen pair look stable?
Calibration gate	Converts uncertainty into an accept/reject threshold	What error rate among accepted actions are we willing to tolerate?
Escalation path	Sends risky cases to a stronger model or human reviewer	What is the right fallback for high-impact uncertainty?
Audit log	Records uncertainty, threshold, decision, and outcome	Can we explain why the system acted or deferred?

This is especially relevant for workflows where the cost of delay is lower than the cost of a wrong click. Examples include invoice processing, ERP data entry, compliance form submission, account configuration, internal dashboard operations, and any interface where “undo” is either unavailable or politically expensive.

The key business inference is not “deploy GUI agents everywhere.” Please do not use academic papers as permission slips for automation sprawl. The more precise inference is:

GUI agents should be deployed with an explicit admission policy. Some actions should be executed automatically, some should be escalated, and some should be refused until the system has enough calibrated evidence.
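An admission policy of this kind, together with the audit trail from the table above, might look like the following sketch. The class and field names are hypothetical, and the rule that a missing threshold means refusal is a design choice made for the example, not something the paper prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate"
    REFUSE = "refuse"

@dataclass
class AuditRecord:
    instruction: str
    uncertainty: float
    threshold: Optional[float]
    decision: str
    outcome: Optional[str] = None  # filled in later, once the result is known

def admit(instruction: str, uncertainty: float,
          threshold: Optional[float], has_fallback: bool = True) -> AuditRecord:
    """Decide whether to act, and record why."""
    if threshold is None:
        decision = Decision.REFUSE      # no calibrated evidence for this risk budget
    elif uncertainty <= threshold:
        decision = Decision.EXECUTE     # within tolerance
    elif has_fallback:
        decision = Decision.ESCALATE    # outside tolerance, someone else can look
    else:
        decision = Decision.REFUSE      # outside tolerance, nobody to hand it to
    return AuditRecord(instruction, uncertainty, threshold, decision.value)

print(admit("Click Approve", 0.12, 0.30))
print(admit("Click Approve", 0.55, 0.30))
print(admit("Click Approve", 0.12, None))
```

The record is deliberately boring: it stores the inputs to the decision next to the decision itself, which is what makes a later review possible.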

This reframes the role of uncertainty. It is not a weakness to be hidden from users. It is a routing signal.

A well-designed enterprise agent should not act like a junior employee trying to look confident in front of management. It should act like a controlled process: execute within authorization, escalate outside tolerance, and leave a trail.

Where this result should not be over-sold

SafeGround is promising, but its boundaries matter.

First, the guarantee depends on representative calibration data. If the deployment environment differs sharply from the calibration set, the threshold may no longer mean what it meant during testing. A GUI agent calibrated on benchmark screenshots may behave differently inside a company’s customized ERP system, especially if layouts, languages, permissions, or workflows differ.

Second, the method depends on sampling diversity. The authors explicitly note that if a model is highly deterministic and produces limited variation across stochastic samples, the spatial distribution may become less informative. In that case, repeated samples could create a false sense of stability. A model that confidently repeats the same wrong click is not safe; it is just consistent, which is not the same virtue.
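One way to notice this problem before it bites is to measure how often the samples do not vary at all. The diagnostic below is our own suggestion, not something described in the paper: it reports the fraction of inputs whose sampled clicks are effectively identical. The half-pixel tolerance is an arbitrary placeholder.

```python
import math

def spread(points):
    """Root-mean-square distance of sampled clicks from their centroid, in pixels."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

def degenerate_fraction(batches, eps_px=0.5):
    """Fraction of inputs whose samples collapse to (nearly) one point."""
    flagged = sum(1 for points in batches if spread(points) < eps_px)
    return flagged / len(batches)

# Three hypothetical inputs, ten samples each.
batches = [
    [(100, 200)] * 10,              # identical every time
    [(100, 200), (400, 50)] * 5,    # genuinely split between two places
    [(300, 300)] * 10,              # identical every time
]
print(degenerate_fraction(batches))
```

A high fraction does not prove the model is wrong. It does suggest that the sampling temperature or decoding setup is not producing enough variation for dispersion to carry information.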

Third, the tested benchmark is ScreenSpot-Pro. It is challenging and relevant, but it is still a benchmark. It does not fully represent live business processes with multi-step dependencies, changing UI states, authentication constraints, dynamic content, user interruptions, and organizational accountability.

Fourth, FDR is a proportion-based metric among accepted predictions. That is useful, but business risk is not always proportional. One wrong click approving a wire transfer may matter more than twenty wrong clicks on harmless navigation elements. For real deployment, SafeGround’s risk budget should be paired with action-level severity classes. Low-impact clicks might tolerate a different threshold than high-impact clicks.
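Pairing the risk budget with severity classes could be as simple as a lookup table. The sketch below is illustrative only: the class names, the risk levels, and the thresholds are invented, and in practice each threshold would come from calibrating separately at that class's risk level.

```python
# Hypothetical policy: a risk level (alpha) per severity class and the threshold
# (tau) obtained by calibrating at that level. None means no threshold qualified.
POLICY = {
    "low": {"alpha": 0.30, "tau": 0.60},     # e.g. navigation clicks
    "medium": {"alpha": 0.10, "tau": 0.35},  # e.g. editing a draft record
    "high": {"alpha": 0.01, "tau": None},    # e.g. approving a payment
}

def route(severity: str, uncertainty: float, policy=POLICY) -> str:
    """Execute only when this severity class has a threshold and the click is under it."""
    tau = policy[severity]["tau"]
    if tau is None:
        return "escalate"
    return "execute" if uncertainty <= tau else "escalate"

print(route("low", 0.40))     # execute
print(route("medium", 0.40))  # escalate
print(route("high", 0.05))    # escalate
```

The same uncertainty score of 0.40 is acceptable for a low-impact click and not for a medium-impact one, and the high-impact class always goes to review because no threshold met its budget.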

Fifth, cascading to a stronger model is not always the right fallback. In regulated or privacy-sensitive settings, the fallback may need to be a human reviewer, a local specialized model, a rules engine, or a “do not act” state. Escalation is a governance decision, not just a model selection decision.

These limitations do not weaken the paper’s core contribution. They define where it should be used carefully.

The real lesson is not confidence; it is permission

SafeGround is useful because it moves GUI grounding from prediction to permission.

A normal grounding model answers the question:

Where should I click?

SafeGround adds the more important deployment question:

Should I be allowed to click at all?

That second question is where AI agents become business systems rather than demos. In production, autonomy is not merely the ability to act. It is the ability to act under constraints, defer under uncertainty, and provide enough evidence for others to trust the process.

The paper’s mechanism-first contribution is therefore bigger than its benchmark numbers. Sampling reveals spatial uncertainty. Calibration turns uncertainty into a threshold. The threshold becomes an admission rule. The admission rule enables selective execution or escalation. That chain is the actual product idea.

The industry will keep building GUI agents that click faster, browse deeper, and operate across more applications. Fine. But the more interesting agents will be the ones that know when not to click.

A little hesitation, properly calibrated, may be the cheapest safety feature enterprise automation has.

Cognaptus: Automate the Present, Incubate the Future.

¹ Qingni Wang, Yue Fan, and Xin Eric Wang, “SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration,” arXiv:2602.02419, 2026.
