A click looks harmless until it is not.
In consumer software, a wrong click means opening the wrong tab, dismissing the wrong pop-up, or buying the wrong color of phone case. Annoying, perhaps. Civilization survives. In enterprise workflows, a wrong click can approve a payment, change a configuration, delete a record, or submit a compliance form with the confidence of a sleepwalker holding admin rights.
That is the uncomfortable center of the paper “SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration” [1]. The paper is not mainly about making GUI agents more accurate in the usual leaderboard sense. It is about a more operational question: when should a GUI grounding model be allowed to act, and when should it be forced to pause, abstain, or escalate?
This distinction matters. Most discussions of GUI agents still treat grounding accuracy as the central metric. The model receives a screenshot and an instruction, then predicts a coordinate. If that coordinate lands inside the target region, the prediction is correct. If not, it fails. Simple enough.
But deployment is not a benchmark table. In production, every accepted prediction becomes an action. A high average accuracy can still be unacceptable if the remaining errors cluster around irreversible or expensive operations. “The model is usually right” is not a governance policy. It is a sentence people say shortly before writing incident reports.
SafeGround’s contribution is to add a risk-aware decision layer around existing GUI grounding models. It does not ask the model to introspect politely about its confidence. It does not require access to logits. It does not assume the model exposes neat internal probabilities. Instead, it repeatedly samples where the model would click, converts the spatial pattern of those samples into an uncertainty score, and then calibrates a threshold that controls the false discovery rate among accepted actions.
In plain English: click only when the model’s spatial behavior looks reliable enough under a risk budget chosen before deployment.
That is a better mental model than “bigger model equals safer agent.” Sometimes the safer system is not the one that always acts. Sometimes it is the one that knows when to become someone else’s problem.
SafeGround adds a risk gate after the model, not a new model inside it
The paper’s first useful move is architectural. SafeGround is designed as a plug-in wrapper for GUI grounding models. The base model can be Holo1.5, GUI-Actor, UI-TARS, GTA1, or another coordinate-predicting GUI model. SafeGround does not replace the base model. It watches the model’s output behavior and decides whether the predicted action should be admitted.
The workflow is roughly:
Screenshot + instruction
↓
Primary GUI grounding model
↓
Multiple stochastic coordinate samples
↓
Spatial density map and candidate regions
↓
Uncertainty score
↓
Calibrated threshold
↓
Execute, abstain, or escalate
The important point is that SafeGround treats uncertainty as an operational signal, not as a decorative number printed beside the answer.
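The workflow above can be sketched as a thin wrapper. This is a minimal illustration, not the paper's implementation: `risk_gate`, `GateDecision`, and `max_spread` are hypothetical names, the toy scorer is a stand-in for the real uncertainty score, and clicking the first sample is an assumption about which point gets executed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class GateDecision:
    action: str             # "execute" or "escalate"
    point: Optional[Point]  # coordinate to click, or None when deferring
    uncertainty: float


def risk_gate(
    sample_click: Callable[[], Point],
    score_uncertainty: Callable[[List[Point]], float],
    threshold: Optional[float],
    n_samples: int = 10,
) -> GateDecision:
    """Sample the base model, score the spread, then admit or defer."""
    samples = [sample_click() for _ in range(n_samples)]
    uncertainty = score_uncertainty(samples)
    # A threshold of None means calibration found no admissible threshold.
    if threshold is not None and uncertainty <= threshold:
        # Assumption: act on the first sample; the paper may pick differently.
        return GateDecision("execute", samples[0], uncertainty)
    return GateDecision("escalate", None, uncertainty)


def max_spread(samples: List[Point]) -> float:
    """Toy stand-in scorer: the larger side of the samples' bounding box."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    return max(max(xs) - min(xs), max(ys) - min(ys))
```

The base model is passed in as a callable, which mirrors the plug-in idea: the gate never needs logits or internals, only repeated coordinate samples.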
Many uncertainty methods do not fit GUI grounding well. Token probabilities require access to model internals and may not map cleanly to whether a coordinate is safe. Verbal confidence is fragile because models can say “I am confident” with the calm authority of a hotel receptionist giving directions to a building that no longer exists. Some GUI uncertainty methods depend on ground-truth annotations, which are obviously unavailable at test time.
SafeGround avoids these traps by using sampled outputs. The model is asked, multiple times, to ground the same instruction on the same interface. If the predicted coordinates cluster tightly around the same target, that is evidence of lower uncertainty. If the predictions scatter across competing screen regions, that is evidence of risk.
This is not a philosophical claim about consciousness, self-awareness, or whether the model “knows” anything. Good. We have enough of that already. It is a behavioral test: when nudged through stochastic sampling, does the model keep pointing to the same place?
Uncertainty comes from where repeated clicks land
SafeGround builds uncertainty from spatial dispersion.
For each input, the method samples the GUI grounding model multiple times. In the experiments, the authors use 10 samples with stochastic decoding. Those sampled coordinates are projected onto a discretized screen grid. The grid becomes a density map: patches with many sampled coordinates receive higher density; patches with few or no samples receive lower density.
From this density map, SafeGround extracts connected high-density regions. Each region represents a plausible target area. The method then scores and ranks these regions.
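A rough sketch of the density-map step, under stated assumptions: the grid size, the 4-neighbour connectivity rule, the `min_count` cut-off, and every function name here are illustrative choices, not details taken from the paper.

```python
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple

Cell = Tuple[int, int]  # (row, col) on the discretized screen


def density_map(points: Sequence[Tuple[float, float]], width: float,
                height: float, grid: Tuple[int, int] = (8, 8)) -> Dict[Cell, int]:
    """Count how many sampled clicks land in each grid cell."""
    cols, rows = grid
    counts: Dict[Cell, int] = {}
    for x, y in points:
        c = max(0, min(int(x / width * cols), cols - 1))
        r = max(0, min(int(y / height * rows), rows - 1))
        counts[(r, c)] = counts.get((r, c), 0) + 1
    return counts


def connected_regions(counts: Dict[Cell, int], min_count: int = 1) -> List[Set[Cell]]:
    """Group cells with enough samples into 4-connected regions (BFS)."""
    active = {cell for cell, n in counts.items() if n >= min_count}
    seen: Set[Cell] = set()
    regions: List[Set[Cell]] = []
    for start in sorted(active):
        if start in seen:
            continue
        region: Set[Cell] = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            r, c = queue.popleft()
            region.add((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (r + dr, c + dc)
                if nb in active and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        regions.append(region)
    return regions


def region_scores(counts: Dict[Cell, int], regions: List[Set[Cell]]) -> List[float]:
    """Share of sample mass per region, sorted from strongest to weakest."""
    masses = [sum(counts[cell] for cell in region) for region in regions]
    total = sum(masses)
    if total == 0:
        return []
    return sorted((m / total for m in masses), reverse=True)
```

With 10 samples, eight near one button and two on a far panel, this yields two regions scoring 0.8 and 0.2, which is the kind of ranked list the later uncertainty signals consume.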
This matters because GUI grounding is spatial. The model is not choosing from a tidy list of labels. It is trying to place a point on a screen where multiple buttons, icons, panels, and text fields may be visually close. Two wrong clicks may be semantically very different even if their token probabilities look similar. A coordinate near the right button is not the same kind of error as a coordinate on a destructive button three panels away.
SafeGround captures three related but distinct uncertainty signals:
| Signal | What it measures | Operational interpretation |
|---|---|---|
| Top-candidate ambiguity | Whether the leading region clearly beats the runner-up | The model is torn between nearby plausible targets |
| Informational dispersion | How spread out the region-score distribution is | The model’s belief is scattered across the interface |
| Concentration deficit | Whether probability mass has a dominant spatial focus | No region clearly deserves the click |
The combined uncertainty score aggregates these three components with fixed weights. That fixed-weight choice is not just a technical convenience. It is part of the deployment story. A method that needs careful model-by-model tuning is less attractive as a general risk layer. SafeGround is trying to be boring in the right way: attach it, calibrate it, and use the resulting threshold as an execution policy.
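To make the three signals tangible, here is one plausible way to compute them from ranked region scores. The formulas are assumptions chosen to match the table's descriptions (a top-two margin, a normalized entropy, and one minus the sum of squared scores), and the equal weights are placeholders; the paper's exact definitions and weights are not reproduced here.

```python
import math
from typing import Sequence, Tuple


def uncertainty_components(region_probs: Sequence[float],
                           n_samples: int = 10) -> Tuple[float, float, float]:
    """Return (ambiguity, dispersion, concentration deficit), each in [0, 1]."""
    p = sorted(region_probs, reverse=True)
    if not p or abs(sum(p) - 1.0) > 1e-6:
        raise ValueError("region_probs must be non-empty and sum to 1")
    runner_up = p[1] if len(p) > 1 else 0.0
    # Top-candidate ambiguity: small margin between leader and runner-up.
    ambiguity = 1.0 - (p[0] - runner_up)
    # Informational dispersion: entropy, normalized by its maximum when
    # every one of the n_samples lands in its own region.
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    dispersion = entropy / math.log(n_samples) if n_samples > 1 else 0.0
    # Concentration deficit: 1 minus the sum of squared region scores.
    deficit = 1.0 - sum(q * q for q in p)
    return ambiguity, dispersion, deficit


def combined_uncertainty(region_probs: Sequence[float], n_samples: int = 10,
                         weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Fixed-weight aggregate; equal weights are an illustrative assumption."""
    components = uncertainty_components(region_probs, n_samples)
    return sum(w * c for w, c in zip(weights, components))
```

A single dominant region scores 0, while an even two-way split scores far higher than a 90/10 split. The score only ranks predictions; it says nothing yet about which value is safe to accept.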
There is a subtle but important distinction here. The uncertainty score itself does not guarantee safety. It only ranks predictions from more reliable-looking to less reliable-looking. The guarantee comes later, when the score is calibrated against observed errors.
That is where the paper becomes more interesting than another “we measured confidence” exercise.
Calibration turns doubt into an operating rule
A raw uncertainty score is not enough. A business system needs a decision rule.
SafeGround uses a Learn-Then-Test calibration procedure. On a held-out calibration set, it evaluates candidate uncertainty thresholds. For each threshold, predictions below the threshold are “accepted,” and predictions above it are rejected or escalated. The method then estimates the false discovery rate, meaning the proportion of accepted predictions that are actually wrong.
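The basic risk quantity, written here in our own notation rather than the paper's, is the error proportion among accepted predictions at a candidate threshold $\tau$:

$$\mathrm{FDR}(\tau) = \frac{\#\{\, i : u_i \le \tau \text{ and prediction } i \text{ is wrong} \,\}}{\max\bigl(1,\ \#\{\, i : u_i \le \tau \,\}\bigr)}$$

where $u_i$ is the uncertainty score of prediction $i$.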
The user supplies a risk level. SafeGround then chooses the largest uncertainty threshold that satisfies the risk constraint, using a Clopper–Pearson upper confidence bound to provide finite-sample control.
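The threshold search can be sketched with the standard library alone. This is a simplified illustration: the function names are hypothetical, the confidence level `delta=0.05` is an assumed default, and the scan omits any multiple-testing correction that the paper's Learn-Then-Test procedure may apply across candidate thresholds.

```python
import math
from typing import Optional, Sequence


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson_upper(k: int, n: int, delta: float = 0.05) -> float:
    """One-sided upper bound: the p at which P(Bin(n, p) <= k) drops to delta."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def calibrate_threshold(uncertainty: Sequence[float], is_wrong: Sequence[bool],
                        alpha: float, delta: float = 0.05) -> Optional[float]:
    """Largest threshold whose FDR upper bound stays at or below alpha.

    Predictions with uncertainty <= threshold count as accepted.
    Returns None when no candidate threshold meets the risk level.
    """
    pairs = sorted(zip(uncertainty, is_wrong))
    best: Optional[float] = None
    accepted = errors = 0
    for i, (u, wrong) in enumerate(pairs):
        accepted += 1
        errors += int(wrong)
        if i + 1 < len(pairs) and pairs[i + 1][0] == u:
            continue  # evaluate tied scores together
        if clopper_pearson_upper(errors, accepted, delta) <= alpha:
            best = u
    return best
```

Note the `None` return: with only five calibration examples, even a perfect record cannot certify a 10% risk level, because the upper bound on zero errors out of five is still about 0.45. Small calibration sets make strict budgets unattainable regardless of how good the model looks.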
This is the paper’s real deployment contribution. It does not merely say, “high uncertainty is bad.” Everyone suspected as much. It asks: at what threshold can we admit predictions while keeping the error rate among admitted actions below a specified tolerance?
That changes the business conversation.
Without calibration, an automation team might say:
“The GUI agent has 52% grounding accuracy on a difficult benchmark.”
With calibration, the conversation becomes:
“For accepted actions, we can target a maximum error proportion, and the system will abstain or escalate when that target is not achievable.”
The second sentence is much closer to something a risk owner, compliance team, or operations manager can reason about. It is still not magic. The guarantee depends on calibration data and assumptions such as exchangeability between calibration and test settings. But it is a contract-shaped control, not a vibe-shaped confidence score.
The method also has a useful failure mode. Some strict risk levels may be unattainable. That happens when the base model and uncertainty score cannot cleanly separate wrong predictions from correct ones. In that case, no threshold can admit useful actions while satisfying the requested risk level.
This is not a bug. It is exactly the kind of refusal enterprise systems need more often. An AI system saying “I cannot safely operate under this risk budget” is far more useful than a model that performs interpretive dance around its own limitations.
The experiments test three claims, not one leaderboard
The paper evaluates SafeGround on ScreenSpot-Pro, a professional high-resolution GUI grounding benchmark with 1,581 screenshot-instruction pairs. The benchmark is suitable for this study because it contains dense, visually complex interfaces where neighboring UI elements can be hard to distinguish.
The authors test six GUI grounding models: Holo1.5-3B, Holo1.5-7B, GUI-Actor-2VL-7B, GUI-Actor-2.5VL-7B, UI-TARS-1.5-7B, and GTA1-7B. They evaluate uncertainty quality, selective prediction under false discovery rate control, and cascading inference to a stronger model.
These experiments serve different purposes. Mixing them together as “SafeGround improves performance” would be too crude.
| Experiment or analysis | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| AUROC and AUARC comparisons | Main evidence for uncertainty quality | Spatial uncertainty is useful for separating likely-correct from likely-wrong clicks | It does not by itself provide deployment risk control |
| FDR calibration tests | Main evidence for risk control | Calibrated thresholds keep accepted-error rates below target levels in the benchmark setting | It does not guarantee success under distribution shift |
| Power comparison | Operational usefulness test | SafeGround can retain more correct actions than weaker uncertainty baselines under the same risk constraint | It does not mean every action should be automated |
| Cascading inference | System-level deployment test | A weaker/local model plus selective escalation can beat base models and sometimes Gemini-only inference | It does not provide full cost accounting |
| Sampling-size sensitivity | Robustness and implementation-cost test | Around 10 samples appears effective; more samples show diminishing returns | It does not remove sampling overhead |
| Component ablation | Ablation | The three uncertainty components capture complementary failure modes | It does not show one universal component is enough |
| Split-ratio, temperature, and weighting analyses | Robustness/sensitivity tests | The method is not extremely fragile to moderate setting changes | It does not eliminate the need for calibration data |
| Qualitative case studies | Illustration | The spatial uncertainty score can reflect visible ambiguity | It is not the main statistical evidence |
This classification matters because papers often accumulate experiments until the reader loses track of what each one is supposed to establish. In this case, the logical chain is clean:
- The uncertainty score must separate safer clicks from riskier clicks.
- The calibrated threshold must control accepted-action error.
- Selective admission or escalation must improve system behavior.
- Sensitivity tests must show the mechanism is not a delicate laboratory flower.
The paper provides evidence for each step, though with boundaries that matter for real deployment.
Spatial uncertainty is more useful than token confidence
The first empirical question is whether SafeGround’s uncertainty score can identify wrong predictions better than a probability-based baseline.
The paper uses AUROC to measure whether the uncertainty score separates incorrect from correct predictions. Higher AUROC means the uncertainty score is better at ranking errors as risky. Where probabilistic confidence is available, SafeGround’s combined uncertainty score improves AUROC across the reported comparisons:
| Model | Probabilistic confidence AUROC | SafeGround combined AUROC | Interpretation |
|---|---|---|---|
| Holo1.5-3B | 0.7576 | 0.8056 | Spatial dispersion is noticeably more informative |
| Holo1.5-7B | 0.6983 | 0.7526 | Better discrimination despite a stronger base model |
| UI-TARS-1.5-7B | 0.7844 | 0.8021 | Smaller but still positive gain |
| GTA1-7B | 0.6114 | 0.6344 | Improvement, but uncertainty remains weaker |
For GUI-Actor variants, probabilistic confidence is not directly available, which is part of the point. SafeGround still produces usable AUROC values because it depends on sampled coordinates rather than internal probability access.
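For readers who want the AUROC figures to feel less abstract, the metric can be computed directly from its pairwise definition: the probability that a randomly chosen wrong prediction has higher uncertainty than a randomly chosen correct one, with ties counted as half. The function name and the toy inputs are illustrative.

```python
from typing import Sequence


def auroc(uncertainty: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Pairwise AUROC: how often a wrong prediction outranks a correct one."""
    wrong_scores = [u for u, w in zip(uncertainty, is_wrong) if w]
    right_scores = [u for u, w in zip(uncertainty, is_wrong) if not w]
    if not wrong_scores or not right_scores:
        raise ValueError("need at least one wrong and one correct prediction")
    wins = 0.0
    for a in wrong_scores:
        for b in right_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(wrong_scores) * len(right_scores))
```

A value of 0.5 means the score is no better than a coin flip at flagging errors, and 1.0 means every wrong click looked more uncertain than every correct one. The quadratic loop is fine for a benchmark of this size; a rank-based version would be the usual choice for larger sets.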
AUARC tells a related but more deployment-flavored story: if we reject higher-uncertainty predictions, does accepted accuracy improve? SafeGround generally performs well here too, though on GTA1 its AUARC is a near-tie with the probabilistic confidence baseline and lands slightly below it. That small exception is worth noting because it prevents the result from becoming too neat. The method is strong, not enchanted.
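One common way to discretize the accuracy-rejection curve is to keep the k least-uncertain predictions for every k, measure accuracy on what was kept, and average. The sketch below follows that convention as an assumption; the paper's exact integration scheme may differ.

```python
from typing import Sequence


def auarc(uncertainty: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Mean accuracy over all coverage levels, keeping low uncertainty first."""
    if not uncertainty:
        raise ValueError("need at least one prediction")
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    correct_so_far = 0
    total = 0.0
    for kept, i in enumerate(order, start=1):
        correct_so_far += int(is_correct[i])
        total += correct_so_far / kept  # accuracy among the kept predictions
    return total / len(order)
```

On four toy predictions with two correct, a score that ranks the correct ones as least uncertain gives 19/24, while a score that ranks them last gives 5/24. The base accuracy is 50% in both cases; only the ordering changed.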
The ablation results explain why combining the three uncertainty components helps. Different models benefit from different signals. For GTA1, one component is more informative; for GUI-Actor and Holo1.5, others matter more. No single uncertainty component dominates across all models. The combined score is therefore less about finding the one perfect measure and more about building a model-agnostic risk signal that behaves reasonably across architectures.
That is the right goal for a plug-in safety layer. Enterprise systems rarely get the luxury of tuning a bespoke uncertainty philosophy for every model update.
FDR control is the safety claim; accuracy gain is the bonus
The paper’s most important evidence is not the accuracy table. It is the false discovery rate control.
SafeGround calibrates thresholds for different user-specified risk levels and evaluates test-time FDR across repeated calibration-test splits. The reported results show empirical FDR staying below the theoretical upper bound across tested models and risk levels.
This is the central safety claim: among accepted predictions, the proportion of wrong actions can be controlled below a chosen risk tolerance with high probability, under the calibration setup.
That wording is intentionally careful.
It does not mean every GUI action is safe. It does not mean the model will never click incorrectly. It does not mean a 0.34 risk level is emotionally comforting in every business process. A one-third wrong-action tolerance would be absurdly high for payment approval but perhaps informative for a difficult benchmark where the base models are weak. The point is not that the tested risk levels are universal business defaults. The point is that the framework exposes the trade-off and lets the operator define a threshold through calibration.
There is another useful detail: SafeGround reports power, which measures how many correct predictions are retained under the risk constraint. A safety system that rejects everything can satisfy many risk constraints while being commercially useless. The interesting system is one that controls risk while still admitting a meaningful fraction of correct actions.
The paper finds that SafeGround tends to retain more correct predictions than probabilistic confidence under the same FDR constraint, especially under stricter risk levels. This is where better uncertainty estimation becomes economically relevant. Better uncertainty does not merely make the system cautious. It makes the system selectively cautious.
That is the difference between a useful safety gate and a very expensive “no.”
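The two quantities in play can be computed on a held-out test split as follows. This is a minimal sketch with a hypothetical function name; power is taken here to mean the share of all correct predictions that the gate accepts, following the description above.

```python
from typing import Sequence, Tuple


def fdr_and_power(uncertainty: Sequence[float], is_correct: Sequence[bool],
                  threshold: float) -> Tuple[float, float]:
    """Empirical FDR among accepted predictions, and power over correct ones."""
    accepted = [c for u, c in zip(uncertainty, is_correct) if u <= threshold]
    accepted_correct = sum(accepted)
    total_correct = sum(is_correct)
    fdr = (len(accepted) - accepted_correct) / len(accepted) if accepted else 0.0
    power = accepted_correct / total_correct if total_correct else 0.0
    return fdr, power
```

A threshold that accepts nothing returns `(0.0, 0.0)`: zero risk and zero usefulness, which is exactly the degenerate gate that the power metric exists to expose.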
Cascading is the business result: local first, expert when needed
The cascading experiment is where SafeGround becomes easiest to translate into business architecture.
The setup is simple. A primary GUI grounding model handles the case when uncertainty is below the calibrated threshold. If uncertainty is above the threshold, the case is deferred to a stronger expert model, Gemini-3-pro. The result is a hybrid system: cheap or local inference for low-risk cases, stronger external inference for uncertain cases.
This matters because businesses do not only optimize accuracy. They optimize accuracy, cost, latency, privacy, auditability, and failure recovery. Calling the strongest model for every click may be accurate but expensive. Using only a small local model may be cheap but unreliable. Cascading offers a middle path: spend the expensive model only where the primary model’s uncertainty justifies it.
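A toy simulation of the routing rule shows how a cascade's accuracy and escalation rate fall out of the threshold. Everything here is illustrative: `CascadeCase`, `run_cascade`, and the four made-up cases are not from the paper, and a real system would call the expert model lazily rather than precompute its answers.

```python
from typing import List, NamedTuple, Optional, Tuple


class CascadeCase(NamedTuple):
    uncertainty: float     # primary model's uncertainty score
    primary_correct: bool  # would the primary model's click be right?
    expert_correct: bool   # would the expert model's click be right?


def run_cascade(cases: List[CascadeCase],
                threshold: Optional[float]) -> Tuple[float, float]:
    """Return (cascade accuracy, fraction of cases escalated to the expert)."""
    if not cases:
        raise ValueError("need at least one case")
    correct = escalated = 0
    for case in cases:
        if threshold is not None and case.uncertainty <= threshold:
            correct += int(case.primary_correct)   # primary model acts
        else:
            escalated += 1                         # defer to the expert
            correct += int(case.expert_correct)
    return correct / len(cases), escalated / len(cases)


# Made-up cases: each model alone gets 2 of 4 right.
toy_cases = [
    CascadeCase(0.1, True, False),
    CascadeCase(0.2, True, True),
    CascadeCase(0.8, False, True),
    CascadeCase(0.9, False, False),
]
```

With a threshold of 0.5, `run_cascade(toy_cases, 0.5)` escalates half the cases and gets three of four right, beating either model alone. The gain comes entirely from routing, since neither model improved.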
The reported results show meaningful gains.
| Primary model | Risk level | Base accuracy | SafeGround cascade accuracy | Gain over base | Gain over Gemini-only |
|---|---|---|---|---|---|
| Holo1.5-7B | 0.34 | 52.41% | 58.66% | +6.25 pp | +5.38 pp |
| Holo1.5-3B | 0.34 | 45.45% | 53.44% | +7.99 pp | +0.16 pp |
| UI-TARS-1.5-7B | 0.38 | 41.58% | 54.70% | +13.12 pp | +1.42 pp |
| GUI-Actor-2.5VL-7B | 0.34 | 45.69% | 55.18% | +9.49 pp | +1.90 pp |
| GUI-Actor-2VL-7B | 0.34 | 40.79% | 55.18% | +14.39 pp | +1.90 pp |
The headline result is Holo1.5-7B at risk level 0.34: the cascade reaches 58.66% accuracy, beating Gemini-only inference at 53.28% by 5.38 percentage points and beating the Holo1.5-7B base model by 6.25 points.
That result is easy to misread. It does not mean Holo1.5-7B is “better than Gemini.” It means the cascade can combine the strengths of the primary model and the expert model through uncertainty-aware routing. The routing policy matters.
There is also a cost-side implication, but we should not overstate it. The paper reports that the cascading rate decreases as the allowed risk level increases: looser risk budgets escalate fewer cases. That supports a cost-control interpretation. However, the paper does not provide a full enterprise cost model with API pricing, latency distributions, human-review overhead, or incident costs. So the correct business reading is:
SafeGround provides the technical mechanism for risk-budgeted escalation; the ROI depends on the cost and consequence profile of the actual workflow.
That is still useful. It gives product teams a design pattern rather than another benchmark trophy.
The practical design pattern is risk-budgeted autonomy
For business deployment, SafeGround suggests a simple architecture for GUI automation:
| Layer | Role | Business question |
|---|---|---|
| Primary GUI model | Makes the initial grounding prediction | Can a cheaper or local model handle routine cases? |
| Uncertainty sampler | Observes whether repeated predictions cluster or scatter | Does this instruction-screen pair look stable? |
| Calibration gate | Converts uncertainty into an accept/reject threshold | What error rate among accepted actions are we willing to tolerate? |
| Escalation path | Sends risky cases to a stronger model or human reviewer | What is the right fallback for high-impact uncertainty? |
| Audit log | Records uncertainty, threshold, decision, and outcome | Can we explain why the system acted or deferred? |
This is especially relevant for workflows where the cost of delay is lower than the cost of a wrong click. Examples include invoice processing, ERP data entry, compliance form submission, account configuration, internal dashboard operations, and any interface where “undo” is either unavailable or politically expensive.
The key business inference is not “deploy GUI agents everywhere.” Please do not use academic papers as permission slips for automation sprawl. The more precise inference is:
GUI agents should be deployed with an explicit admission policy. Some actions should be executed automatically, some should be escalated, and some should be refused until the system has enough calibrated evidence.
This reframes the role of uncertainty. It is not a weakness to be hidden from users. It is a routing signal.
A well-designed enterprise agent should not act like a junior employee trying to look confident in front of management. It should act like a controlled process: execute within authorization, escalate outside tolerance, and leave a trail.
Where this result should not be over-sold
SafeGround is promising, but its boundaries matter.
First, the guarantee depends on representative calibration data. If the deployment environment differs sharply from the calibration set, the threshold may no longer mean what it meant during testing. A GUI agent calibrated on benchmark screenshots may behave differently inside a company’s customized ERP system, especially if layouts, languages, permissions, or workflows differ.
Second, the method depends on sampling diversity. The authors explicitly note that if a model is highly deterministic and produces limited variation across stochastic samples, the spatial distribution may become less informative. In that case, repeated samples could create a false sense of stability. A model that confidently repeats the same wrong click is not safe; it is just consistent, which is not the same virtue.
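One way a team might watch for this failure mode on its own calibration data is to measure how often wrong predictions come with collapsed samples. The check below is our suggestion rather than something the paper prescribes; the function name and the one-pixel tolerance are assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def collapsed_error_fraction(sample_sets: Sequence[List[Point]],
                             is_wrong: Sequence[bool],
                             tol: float = 1.0) -> float:
    """Among wrong predictions, the share whose samples barely moved."""
    def collapsed(samples: List[Point]) -> bool:
        xs = [x for x, _ in samples]
        ys = [y for _, y in samples]
        return max(xs) - min(xs) <= tol and max(ys) - min(ys) <= tol

    wrong_sets = [s for s, w in zip(sample_sets, is_wrong) if w]
    if not wrong_sets:
        return 0.0
    return sum(collapsed(s) for s in wrong_sets) / len(wrong_sets)
```

A high value means many errors are invisible to spatial dispersion, which is a hint to raise the sampling temperature, revisit the decoding setup, or lean more heavily on other safeguards.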
Third, the tested benchmark is ScreenSpot-Pro. It is challenging and relevant, but it is still a benchmark. It does not fully represent live business processes with multi-step dependencies, changing UI states, authentication constraints, dynamic content, user interruptions, and organizational accountability.
Fourth, FDR is a proportion-based metric among accepted predictions. That is useful, but business risk is not always proportional. One wrong click approving a wire transfer may matter more than twenty wrong clicks on harmless navigation elements. For real deployment, SafeGround’s risk budget should be paired with action-level severity classes. Low-impact clicks might tolerate a different threshold than high-impact clicks.
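Severity classes could be layered on top of the gate with very little machinery. The sketch below is our illustration of that pairing, not part of SafeGround: the class names, the example thresholds, and the rule that unknown or uncalibrated classes never auto-execute are all assumptions.

```python
from typing import Dict, Optional

# Hypothetical per-class thresholds, each calibrated separately against its
# own risk budget. None means no admissible threshold: always escalate.
EXAMPLE_THRESHOLDS: Dict[str, Optional[float]] = {
    "low": 0.6,      # harmless navigation
    "medium": 0.3,   # reversible edits
    "high": None,    # payments, deletions: never auto-execute
}


def admit(severity: str, uncertainty: float,
          thresholds: Dict[str, Optional[float]]) -> bool:
    """Allow automatic execution only within the action's own risk class."""
    threshold = thresholds.get(severity)  # unknown class behaves like None
    return threshold is not None and uncertainty <= threshold
```

The same uncertainty score of 0.5 is admitted for a low-impact click and refused for a medium-impact one. Failing closed on unknown classes keeps a new, unclassified action type from inheriting permissions by accident.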
Fifth, cascading to a stronger model is not always the right fallback. In regulated or privacy-sensitive settings, the fallback may need to be a human reviewer, a local specialized model, a rules engine, or a “do not act” state. Escalation is a governance decision, not just a model selection decision.
These limitations do not weaken the paper’s core contribution. They define where it should be used carefully.
The real lesson is not confidence; it is permission
SafeGround is useful because it moves GUI grounding from prediction to permission.
A normal grounding model answers the question:
Where should I click?
SafeGround adds the more important deployment question:
Should I be allowed to click at all?
That second question is where AI agents become business systems rather than demos. In production, autonomy is not merely the ability to act. It is the ability to act under constraints, defer under uncertainty, and provide enough evidence for others to trust the process.
The paper’s mechanism-first contribution is therefore bigger than its benchmark numbers. Sampling reveals spatial uncertainty. Calibration turns uncertainty into a threshold. The threshold becomes an admission rule. The admission rule enables selective execution or escalation. That chain is the actual product idea.
The industry will keep building GUI agents that click faster, browse deeper, and operate across more applications. Fine. But the more interesting agents will be the ones that know when not to click.
A little hesitation, properly calibrated, may be the cheapest safety feature enterprise automation has.
Cognaptus: Automate the Present, Incubate the Future.
[1] Qingni Wang, Yue Fan, and Xin Eric Wang, “SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration,” arXiv:2602.02419, 2026.