When Agents Hesitate: Smarter Test-Time Scaling for Web AI

Forms are boring. That is exactly why they are dangerous for AI agents.

A human filling out an enterprise dashboard does not treat every click as a philosophical crisis. Search here. Scroll there. Submit. Done. A web agent, unfortunately, has no such common-sense guarantee. It can overthink a routine step, miss a pivotal one, or spend a small fortune sampling twenty versions of the same obvious action. Very diligent. Also very expensive.

The paper Agentic Test-Time Scaling for WebAgents studies this problem directly: when should a web agent spend more inference-time compute, and when should it simply act?¹ Its answer is not “sample more,” “vote harder,” or “add another LLM judge and hope the bill looks sophisticated.” The answer is more mechanical: look at the agent’s own candidate-action distribution. If the candidates agree, preserve the consensus. If they disagree, allocate extra reasoning.

That sounds simple because good ideas often do. The difficulty is that agentic systems are not single-shot reasoning models. A math benchmark asks for one answer. A web agent makes a sequence of actions: click, type, scroll, search, go back, exit. One wrong step can move the browser into a different state, hide the correct path, or trigger a failed workflow. In a long-horizon task, error is not a dot on a scorecard. It is a state transition.

The paper’s useful contribution is therefore not merely that its method, Confidence-Aware Test-Time Scaling, or CATTS, improves benchmark success. It explains why uniform test-time scaling fails in web agents. The mechanism matters because it is exactly the mechanism enterprise teams need to understand before they deploy browser automation, RPA copilots, internal workflow agents, or customer-service agents that actually touch software.

The short version is this: web-agent decisions fall into two regimes.

Regime	What the sampled actions look like	What extra compute does	Operational lesson
Redundancy	Most candidates point to the same action	Duplicates the obvious and may invite harmful override	Do not arbitrate just to feel responsible
Contention	Candidates split across plausible actions	Helps reason through an actual fork in the path	Spend compute where it can change the decision

Uniform scaling ignores this structure. CATTS exploits it.

The expensive mistake is treating every step like a hard step

Test-time scaling has become a standard trick in LLM engineering. Generate multiple answers. Use majority voting. Ask a verifier. Run a deeper search. Spend more compute at inference time rather than training time.

For single-shot reasoning tasks, that logic is often sensible. If a model solves a math problem five different ways, one path may reach the correct answer even if the first attempt fails. Majority voting or verification can then extract a better final answer.

Web agents are different. The agent is not producing one final answer; it is operating a stateful environment. In the paper’s setup, the base agent uses a ReAct-style prompting format with gpt-oss-120b, reads a cleaned HTML representation of the current page, and chooses from a structured action space such as clicking, typing, scrolling, searching, going back, selecting a dropdown, or exiting. The authors evaluate on WebArena-Lite, with 165 tasks and programmatic success checks, and a 341-task GoBrowse-style benchmark evaluated using an LLM-as-judge protocol.

At each step, the agent can sample multiple candidate actions. But before voting, there is a practical complication: two generated actions may be semantically equivalent while textually different. For example, a search query or final answer may vary slightly while meaning the same thing. The paper therefore uses a semantic deduplicator to cluster equivalent actions before constructing a vote distribution.

That detail is not decorative. Without deduplication, majority voting can split votes across paraphrases and behave worse as the number of samples increases. In Appendix D, the authors show that semantic deduplication is a necessary preprocessing step for meaningful vote aggregation, especially on text-heavy tasks where surface variation is common. In business language: before you interpret an agent’s “disagreement,” make sure it is not merely spelling the same intention five ways. A meeting full of synonyms is not a debate.
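To make the vote-splitting problem concrete, here is a minimal sketch. The `normalize` function is a hypothetical stand-in that only folds case and whitespace; the paper’s deduplicator is semantic, and this is not a reproduction of it. The `search[...]` and `click[...]` strings are likewise an assumed action syntax used only for illustration.

```python
from collections import Counter

def normalize(action: str) -> str:
    """Hypothetical stand-in for a semantic deduplicator:
    lowercase the action and collapse runs of whitespace."""
    return " ".join(action.lower().split())

def vote_distribution(actions: list[str]) -> dict[str, float]:
    """Cluster candidates by normalized form and return each cluster's vote share."""
    counts = Counter(normalize(a) for a in actions)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

# Assumed action strings: three paraphrases of one search, plus one different click.
samples = [
    "search[meat substitutes]",
    "Search[Meat Substitutes]",
    "search[meat  substitutes]",
    "click[Pantry Staples]",
]

raw = Counter(samples)                   # four distinct strings, one vote each
clustered = vote_distribution(samples)   # the three search variants share one cluster

print(len(raw), clustered)
```

Without clustering, the four samples look like a four-way tie. After clustering, the search action holds three of the four votes, which is the picture a voting rule should actually see.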

Once candidate actions are clustered, the question becomes: how should the system choose the action to execute under a token budget?

The naive answer is majority voting. Sample more candidates at every step. Pick the most frequent action. Unfortunately, the paper shows that this quickly runs into diminishing and sometimes non-monotonic returns.

Candidate count	WebArena-Lite success	WebArena-Lite tokens	GoBrowse success	GoBrowse tokens
1	38.8%	96K	86.9%	47K
5	42.4%	460K	87.8%	249K
10	43.2%	920K	88.0%	481K
20	43.0%	1.8M	87.8%	995K

The first jump helps. Moving from one candidate to five improves WebArena-Lite from 38.8% to 42.4%. But moving from ten candidates to twenty doubles token usage and slightly reduces success, from 43.2% to 43.0%. On GoBrowse, the movement is similarly unimpressive: 88.0% at ten candidates, then 87.8% at twenty.

That is not a scaling law. That is a billing artifact with confidence issues.
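The diminishing returns are easier to see as marginal cost. The sketch below uses only the WebArena-Lite columns from the table above and asks how many extra tokens each additional percentage point of success costs. The token figures are the rounded values reported there (1.8M is treated as 1,800K), so the results are approximate.

```python
# (candidates, success %, tokens in thousands), copied from the WebArena-Lite columns above.
rows = [(1, 38.8, 96), (5, 42.4, 460), (10, 43.2, 920), (20, 43.0, 1800)]

def marginal_steps(rows):
    """For each consecutive pair of budgets, return
    (from_n, to_n, gain in points, extra K tokens, K tokens per point or None)."""
    out = []
    for (n0, s0, t0), (n1, s1, t1) in zip(rows, rows[1:]):
        gain = round(s1 - s0, 1)
        extra = t1 - t0
        # Cost per point is undefined when success does not improve.
        cost = round(extra / gain) if gain > 0 else None
        out.append((n0, n1, gain, extra, cost))
    return out

for n0, n1, gain, extra, cost in marginal_steps(rows):
    label = f"~{cost}K tokens per point" if cost is not None else "no gain"
    print(f"{n0} -> {n1} candidates: {gain:+.1f} points for +{extra}K tokens ({label})")
```

On these numbers, the first step costs roughly 101K tokens per point, the second roughly 575K, and the third buys a small loss.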

The mechanism is straightforward. Many web-agent steps are routine. If the page shows a search box and the task says to search for a product, most samples will converge on the same action. More samples simply reproduce the same decision. The agent is not exploring meaningful alternatives; it is printing duplicates at enterprise pricing.

But other steps are genuinely contentious. The page might contain several plausible links. The agent may need to infer whether to scroll, search, filter, or enter a category. In those moments, majority voting can be weak because votes are distributed across several plausible actions. The majority action may be only slightly ahead, or it may win because the samples share a correlated misconception.

Uniform scaling treats both cases the same. That is the core waste.

Arbitration helps until it starts arguing with the obvious

A natural response is to add an arbiter. Instead of simply selecting the most frequent candidate action, pass the candidate set and current page context to another LLM call, asking it to choose the best action.

This is not a bad idea. In the paper’s experiments, arbitration improves over simple majority voting on average. The arbiter can use the page state and task context to break ties, reject superficially popular but irrelevant actions, and choose among plausible alternatives.

The problem is that an arbiter is not a magic oracle. It is another model call. It can overthink. It can overrule a correct consensus. It can look at nine agents pointing at the same door and decide the window has better vibes.

The paper’s arbiter-scaling results show this tension clearly.

Method	WebArena-Lite success	WebArena-Lite tokens	GoBrowse success	GoBrowse tokens
Majority vote	42.4%	460K	87.8%	249K
Arbiter	42.8%	442K	88.6%	227K
Arbiter scaling	44.2%	645K	88.2%	351K
Arbiter scaling	44.6%	899K	88.7%	541K
Arbiter scaling	42.0%	1.4M	89.6%	733K

On WebArena-Lite, arbitration improves from 42.8% to 44.6% as arbiter scaling increases, then falls to 42.0% at higher compute. The selector gets more expensive and worse. GoBrowse shows steadier gains, but the broader pattern remains: more selection compute is not automatically better.

The paper then examines a specific failure mode: high-consensus overrides. These occur when candidate actions strongly agree, but the arbiter chooses a minority alternative. Across 495 task-runs from arbiter experiments, tasks without high-consensus overrides succeed at 46.9%, while tasks with at least one such override succeed at 35.0%. The effect also has a dose-response pattern: zero overrides give 46.9% success, exactly one gives 36.6%, and two or more give 29.6%.

This is important because it changes how we should think about verifier models. In many enterprise discussions, a verifier or arbiter is treated as a safety layer. Put another LLM on top, and the system becomes more reliable. The paper’s evidence says: not necessarily. A second model can be a safety layer when the base candidates are genuinely uncertain. It can become a risk layer when it interrupts a strong, correct consensus.

Appendix J gives a concrete example. In a WebArena-Lite grocery task, the agent needs to find “Meat Substitutes.” At a pivotal step, 9 out of 10 sampled candidates choose “Scroll Down,” which is correct because the target category lies below the current viewport. The arbiter overrides that consensus and selects “Click Pantry Staples,” sending the agent into the wrong category. This is a small browser action, but it captures a large deployment lesson: the system did not fail because it lacked reasoning. It failed because it applied reasoning in the wrong place.
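A high-consensus override is simple to flag once votes are clustered. The following is a minimal sketch, not the paper’s definition: the 0.8 cutoff for “high consensus” and the action labels are assumptions chosen for illustration.

```python
from collections import Counter

def is_high_consensus_override(
    cluster_votes: list[str],
    arbiter_choice: str,
    consensus_share: float = 0.8,  # assumed cutoff, not taken from the paper
) -> bool:
    """Return True when the top cluster holds at least `consensus_share`
    of the votes and the arbiter picked something else."""
    top_action, top_count = Counter(cluster_votes).most_common(1)[0]
    share = top_count / len(cluster_votes)
    return share >= consensus_share and arbiter_choice != top_action

# The grocery example: 9 of 10 candidates scroll, the arbiter clicks a category.
votes = ["scroll_down"] * 9 + ["click_pantry_staples"]

print(is_high_consensus_override(votes, "click_pantry_staples"))  # override of a 90% consensus
print(is_high_consensus_override(votes, "scroll_down"))           # arbiter agrees with the majority
```

Counting how often this flag fires per task-run is enough to reproduce the kind of zero, one, two-or-more breakdown discussed above on your own logs.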

Entropy and margin turn hesitation into an operational signal

The paper’s central move is to treat the vote distribution itself as a signal.

At each step, after candidate actions are clustered, the system has a distribution over actions. If ten samples produce nine votes for scrolling and one vote for clicking, the system is highly decisive. If ten samples split across four plausible actions, the system is uncertain.

The authors use two statistics to characterize this:

$$ H_t = - \sum_a p_t(a) \log p_t(a) $$

where $H_t$ measures entropy, or overall disagreement across candidate actions.

They also use the top-1/top-2 margin:

$$ \Delta_t = p_t(a^{(1)}) - p_t(a^{(2)}) $$

where $a^{(1)}$ and $a^{(2)}$ are the most and second-most supported action clusters.

The interpretation is simple:

Pattern	Entropy	Margin	Meaning
Most votes on one action	Low	High	Strong consensus
Votes spread across actions	High	Low	Genuine contention
Several near-tied options	High	Low	Majority vote may be unreliable
One dominant action with minor noise	Low	High	Arbiter override risk is high
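Both statistics take a few lines to compute from clustered votes. This sketch assumes the natural logarithm (the formula above does not fix a base) and treats the second-place share as zero when only one cluster exists; both choices are assumptions, as are the action labels.

```python
import math
from collections import Counter

def vote_stats(cluster_votes: list[str]) -> tuple[float, float]:
    """Return (entropy H_t, top-1/top-2 margin Delta_t) for one step's clustered votes."""
    counts = Counter(cluster_votes)
    total = len(cluster_votes)
    probs = sorted((n / total for n in counts.values()), reverse=True)
    entropy = sum(-p * math.log(p) for p in probs)  # natural log, by assumption
    second = probs[1] if len(probs) > 1 else 0.0    # single-cluster convention, by assumption
    return entropy, probs[0] - second

consensus = ["scroll"] * 9 + ["click"]                                   # 9-to-1 split
contested = ["scroll"] * 3 + ["click"] * 3 + ["search"] * 2 + ["filter"] * 2  # four-way split

for name, votes in [("consensus", consensus), ("contested", contested)]:
    h, d = vote_stats(votes)
    print(f"{name}: entropy={h:.3f} margin={d:.2f}")
```

The 9-to-1 step comes out with low entropy and a wide margin; the four-way split comes out with high entropy and a margin of zero, matching the first two rows of the table.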

Successful trajectories tend to have lower entropy and higher margins. Failed trajectories show higher entropy and lower margins. The authors’ trajectory-level analysis shows that disagreement often spikes at pivotal decision points and is more pronounced or frequent in failed runs.

This is the business-relevant part. The agent already generates a useful reliability signal as a byproduct of sampling. You do not need privileged access to token-level log probabilities. You do not need to believe the model’s self-reported confidence, which is always a charmingly unreliable genre of autobiography. You can inspect behavioral disagreement across candidate actions.

That makes the signal deployable for API-only models. If a provider allows sampling multiple completions but does not expose token probabilities, vote-derived uncertainty still works.

The paper compares this with DeepConf-style confidence filtering, which uses token-level confidence signals. DeepConf variants do improve performance in some configurations, including strong GoBrowse results. But they require access to token-level log probabilities, which limits applicability when working through commercial APIs or hosted model systems. Vote-derived uncertainty is less intimate. It does not ask the model how confident it feels; it watches whether the model’s sampled actions agree.

That distinction matters in production. Enterprise teams can log entropy and margin, monitor when arbitration is invoked, audit override events, and evaluate accuracy-cost tradeoffs across workflows. The signal is not merely a research metric. It is an observability primitive.

CATTS is a gate, not a bigger brain

CATTS uses vote-derived uncertainty to decide whether to invoke arbitration.

At each step, the system computes an uncertainty score $U_t$ from the candidate-action distribution. If uncertainty is low, it uses majority voting. If uncertainty is high, it invokes the arbiter.

$$ a_t = \begin{cases} \arg\max_a p_t(a), & U_t \leq \tau \\ \text{ARBITER}(\cdot), & U_t > \tau \end{cases} $$

The paper tests two effective variants:

entropy-gated CATTS, where uncertainty is based on $H_t$;
margin-gated CATTS, where uncertainty is based on $1 - \Delta_t$.

The difference is operational rather than philosophical. Entropy asks, “How spread out are the votes?” Margin asks, “How far ahead is the winner?” Both detect whether the current step is routine or contentious.
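The gate itself can be sketched in a few lines. This is an illustration under assumptions, not the authors’ implementation: `catts_select`, the `tau` value, and the stub arbiter are hypothetical, and a real arbiter would also receive the page state and task context rather than just the candidate list.

```python
import math
from collections import Counter
from typing import Callable

def catts_select(
    cluster_votes: list[str],
    arbiter: Callable[[list[str]], str],
    tau: float,
    mode: str = "entropy",
) -> tuple[str, bool]:
    """Return (chosen action, whether the arbiter was invoked)."""
    counts = Counter(cluster_votes)
    ranked = counts.most_common()
    probs = [n / len(cluster_votes) for _, n in ranked]
    if mode == "entropy":
        uncertainty = sum(-p * math.log(p) for p in probs)
    elif mode == "margin":
        second = probs[1] if len(probs) > 1 else 0.0
        uncertainty = 1.0 - (probs[0] - second)
    else:
        raise ValueError(f"unknown mode: {mode}")
    if uncertainty <= tau:
        return ranked[0][0], False          # low uncertainty: keep the majority action
    return arbiter(list(counts)), True      # high uncertainty: hand candidates to the arbiter

calls = []

def stub_arbiter(candidates: list[str]) -> str:
    """Placeholder arbiter: records the call and returns the first candidate."""
    calls.append(candidates)
    return candidates[0]

routine = ["scroll"] * 9 + ["click"]
contested = ["scroll"] * 4 + ["click"] * 3 + ["search"] * 3

print(catts_select(routine, stub_arbiter, tau=0.5))    # majority path, arbiter not called
print(catts_select(contested, stub_arbiter, tau=0.5))  # arbiter path
```

The design point is that the arbiter is never called on the routine step, so it has no opportunity to override the consensus there.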

The results are strong because CATTS improves both accuracy and token usage.

Method	WebArena-Lite success	WebArena-Lite tokens	GoBrowse success	GoBrowse tokens
Majority vote	43.2%	920K	88.0%	481K
Always-arbitrate	44.0%	762K	88.3%	443K
CATTS, entropy-gated	47.9%	745K	90.2%	422K
CATTS, margin-gated	47.9%	405K	90.4%	372K

On WebArena-Lite, CATTS reaches 47.9% success, compared with 43.2% for majority voting. That is a 4.7 percentage-point gain. The margin-gated version does this with 405K tokens, compared with 920K for majority voting, a roughly 56% token reduction. On GoBrowse, CATTS reaches 90.2% (entropy-gated) and 90.4% (margin-gated) success with 422K and 372K tokens, against 88.0% at 481K for the majority-vote baseline.
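For readers who want to check the WebArena-Lite arithmetic, the figures follow directly from the table, using the rounded token counts reported there:

$$ 47.9 - 43.2 = 4.7 \text{ points}, \qquad \frac{920\text{K} - 405\text{K}}{920\text{K}} \approx 0.56, \qquad \frac{920\text{K} - 745\text{K}}{920\text{K}} \approx 0.19 $$

So against the same majority-vote baseline, the margin gate saves roughly 56% of tokens and the entropy gate roughly 19%, at the same 47.9% success.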

The interesting part is not that the system uses an arbiter. Always-arbitrate does not achieve the same gains. The interesting part is that CATTS withholds arbitration on high-consensus steps. It is a restraint mechanism.

This is a useful corrective to a common design instinct in agentic systems. Many teams respond to unreliability by adding another layer: more samples, more reviewers, more reflections, more agents in a committee. CATTS suggests a more disciplined pattern: add a second layer only when the first layer’s output distribution says the decision is actually contested.

In software terms, CATTS is closer to a routing policy than a reasoning breakthrough. That is not an insult. Routing policies are how systems become economically usable. A brilliant agent that spends uniformly high compute on every page interaction is not a product. It is a demo with invoices.

The appendix tests robustness, not a second thesis

The paper’s appendices are useful because they clarify which claims are central and which are supporting checks. This distinction matters; otherwise, every table gets promoted into a grand theory, and we get the usual academic confetti.

Test or appendix result	Likely purpose	What it supports	What it does not prove
Semantic deduplication ablation	Implementation ablation	Voting needs semantic clustering to avoid vote splitting	CATTS works without careful action normalization
Plan-and-Act scaling results	Robustness across agent architecture	Non-monotonic scaling is not limited to ReAct	The exact thresholds generalize to all planners
Arbiter scaling sweep	Main support for non-uniform benefit	More arbiter calls can plateau or degrade	Arbitration is useless
RSA and PlanRSA comparison	Comparison with deeper aggregation	Multi-round aggregation can be expensive without better performance in this setting	RSA cannot help any agentic task
Threshold sensitivity	Robustness/sensitivity test	CATTS is not purely a lucky threshold result	No calibration is needed in production
Vote distribution histograms	Mechanism support	Many steps fall into redundancy or contention regimes	Every enterprise workflow has the same distribution
Failure node examples	Qualitative diagnostic	High-consensus override is a real failure mode	Its frequency is fully captured by one example

The Plan-and-Act results are especially useful. The authors test whether non-monotonic scaling is specific to ReAct-style agents by evaluating a Plan-and-Act setup with factorized plan and action sampling budgets. Scaling still behaves non-monotonically. On WebArena-Lite, the best reported Plan-and-Act configuration reaches 43.2% but drops slightly at higher budget. On GoBrowse, scaling from the baseline to a larger budget reduces success from 83.3% to 80.6% in the reported setting.

That does not mean Plan-and-Act is bad. It means uniform test-time scaling is not automatically saved by changing the agent architecture. The scaling policy itself remains suspect.

The RSA and PlanRSA results make a related point. Recursive Self-Aggregation uses iterative refinement over candidates and is designed for settings where deeper aggregation can improve reasoning. In the paper’s WebArena-Lite experiments, RSA reaches at best 43.6% while using many more calls per step, compared with 44.0% for single-round arbitration and 47.9% for CATTS. PlanRSA performs worse. The authors hypothesize that iterative refinement transfers poorly to per-step action selection because the correct action depends heavily on environmental context, not merely on improving a textual solution.

That hypothesis is plausible. It also has a practical echo: a web action is not an essay draft. You cannot always “refine” your way to a better click.

What the paper directly shows, and what businesses should infer

The paper directly shows three things in its tested environment.

First, uniform candidate scaling has diminishing and sometimes non-monotonic returns for long-horizon web agents. More samples can help initially, but beyond a point, token usage grows faster than reliability.

Second, arbitration is conditionally useful. It helps when candidate actions are genuinely contested, but it can harm when it overrides high-consensus decisions.

Third, vote-derived uncertainty provides a practical gating signal. Entropy and margin can identify when extra selection compute is likely to help, allowing CATTS to improve accuracy while reducing token usage.

The business inference is broader but should stay disciplined.

Paper finding	Business interpretation	Practical implementation
Uniform scaling wastes compute on easy steps	Agent reliability is an allocation problem, not just a model-size problem	Track uncertainty per step and route compute selectively
High-consensus overrides correlate with lower success	Verifiers can introduce risk when they interrupt obvious actions	Audit override events and require stronger justification for overriding consensus
Entropy and margin predict useful arbitration regimes	The agent’s hesitation can become an observability metric	Log vote distributions, entropy, margin, selected action, and downstream outcome
CATTS improves accuracy-cost tradeoff	Dynamic inference policies can affect ROI directly	Tune thresholds per workflow and measure success per token, not success alone
DeepConf needs token-level signals while CATTS does not	API-only deployments can still implement confidence-aware scaling	Use candidate sampling and semantic clustering rather than internal model probabilities

For enterprise AI, the immediate lesson is not “use CATTS exactly as published.” It is to stop treating test-time compute as a uniform blanket. A procurement workflow, claims-processing dashboard, CRM update, or internal knowledge-base task will contain both routine and pivotal actions. If the system spends equal deliberation on all of them, it is not cautious. It is unpriced.

A production version should probably log at least five things at each decision step:

the sampled candidate actions;
semantic action clusters;
vote entropy;
top-1/top-2 margin;
whether the arbiter overrode the majority action.
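One way to capture those five items is a single structured record per step. The sketch below is hypothetical: the `StepLog` field names, the action labels, and the JSON output are assumptions, and the cluster labels are taken as given rather than computed.

```python
import json
import math
from collections import Counter
from dataclasses import asdict, dataclass

@dataclass
class StepLog:
    step: int
    candidates: list[str]        # raw sampled actions
    clusters: dict[str, int]     # cluster label -> vote count
    entropy: float
    margin: float
    arbiter_invoked: bool
    arbiter_overrode_majority: bool
    executed_action: str

def build_step_log(step, candidates, cluster_labels, executed_action, arbiter_invoked):
    """Build one log record; `cluster_labels[i]` is the cluster of `candidates[i]`."""
    counts = Counter(cluster_labels)
    ranked = counts.most_common()
    probs = [n / len(cluster_labels) for _, n in ranked]
    entropy = sum(-p * math.log(p) for p in probs)
    margin = probs[0] - (probs[1] if len(probs) > 1 else 0.0)
    majority = ranked[0][0]
    return StepLog(
        step=step,
        candidates=candidates,
        clusters=dict(counts),
        entropy=round(entropy, 4),
        margin=round(margin, 4),
        arbiter_invoked=arbiter_invoked,
        arbiter_overrode_majority=arbiter_invoked and executed_action != majority,
        executed_action=executed_action,
    )

labels = ["scroll_down"] * 9 + ["click_pantry_staples"]
record = build_step_log(
    step=7,
    candidates=labels,            # here raw actions and cluster labels coincide
    cluster_labels=labels,
    executed_action="click_pantry_staples",
    arbiter_invoked=True,
)
print(json.dumps(asdict(record)))
```

Joining these records with the task’s final outcome is what lets a team answer whether overrides and low-margin steps actually precede failures.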

Then the team can ask questions that actually matter. Are failures concentrated after low-margin steps? Does arbitration help in high-entropy states? Are overrides from strong consensus usually harmful? Which websites, forms, or internal tools produce the most contention? This is how agent engineering becomes measurable rather than theatrical.

There is also a governance angle. A deterministic rule such as “invoke arbitration only when entropy exceeds a threshold” is easier to audit than a vague claim that “the agent reflected.” Reflection sounds wise. Logs are better.

Boundaries that matter before anyone productizes this

The paper is useful, but it is not a universal law of agentic intelligence.

The experiments use gpt-oss-120b, ReAct-style agents, Plan-and-Act variants, WebArena-Lite, and a GoBrowse-style benchmark. WebArena-Lite uses programmatic success checks, while GoBrowse relies on an LLM-as-judge protocol. The appendix notes that GoBrowse tasks are generally shorter and easier, with higher baseline success rates, and that the judge protocol has approximately 90% agreement with human evaluations based on the original GoBrowse validation. That is good enough for comparative research, but not the same as a production SLA.

The token accounting also reflects the paper’s environment, where prompt tokens dominate total cost. In a different architecture with compressed state, smaller prompts, local models, or tool-level caching, the cost frontier may shift. The mechanism should remain relevant, but the exact economics will change.

Semantic deduplication is another dependency. If action clustering is poor, the vote distribution becomes noisy. False splits make the agent look more uncertain than it is. False merges hide real disagreements. In a production workflow, deduplication quality is not a footnote; it is part of the reliability stack.

Thresholds also need calibration. The appendix shows CATTS performs robustly across a range of thresholds, which is encouraging. But a bank’s compliance workflow, a logistics dashboard, and an e-commerce admin panel will not share the same risk tolerance. In low-risk flows, uncertainty might route to an arbiter. In high-risk flows, uncertainty might route to a human, a constrained policy engine, or a refusal to proceed. “Think harder” is not the only safe action.
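Routing by risk tier can be expressed as a small policy table. Everything in this sketch is an assumption made for illustration: the tier names, the threshold values, and the choice of a human reviewer as the high-risk escalation path.

```python
from enum import Enum

class Route(Enum):
    MAJORITY = "majority_vote"
    ARBITER = "arbiter"
    HUMAN = "human_review"

# Hypothetical per-workflow thresholds on the uncertainty score; calibrate on real logs.
THRESHOLDS = {"low_risk": 0.7, "high_risk": 0.3}

def route_step(uncertainty: float, risk_tier: str) -> Route:
    """Pick a handler for one step from its uncertainty score and the workflow's risk tier."""
    tau = THRESHOLDS[risk_tier]
    if uncertainty <= tau:
        return Route.MAJORITY
    # Contested step: escalate to a human in high-risk flows, to the arbiter otherwise.
    return Route.HUMAN if risk_tier == "high_risk" else Route.ARBITER

print(route_step(0.5, "low_risk"))    # below the low-risk threshold
print(route_step(0.5, "high_risk"))   # above the high-risk threshold
```

The same uncertainty score lands on different handlers depending on the workflow, which is the point: the gate is shared, the escalation target is not.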

Finally, the paper studies web navigation tasks, not every kind of agent. Agents operating codebases, spreadsheets, APIs, physical devices, or financial trading systems may show different uncertainty distributions. The principle is portable: allocate compute where uncertainty suggests it can change the decision. The thresholds, tools, and failure costs are not.

The real lesson is not more thinking; it is better triage

The paper’s most useful idea is not that CATTS wins a benchmark table. Benchmark wins are nice. They also have a short shelf life.

The durable idea is that agentic compute should be triaged.

A web agent does not need to hold a committee meeting before scrolling. It may need one before choosing among three plausible task paths. The difference is visible in the vote distribution. High consensus means the system probably has enough signal to act. High disagreement means the system has found a decision point where extra reasoning may actually matter.

That changes the design philosophy for enterprise agents. Instead of building agents that always deliberate more, build agents that know when deliberation is worth buying.

The quiet irony is that “confidence-aware scaling” is not about making the agent more confident. It is about making the system less gullible toward its own machinery. Majority voting can be shallow. Arbitration can be overconfident. Recursive aggregation can be expensive theater. The useful system is the one that knows which failure mode it is currently facing.

When the agent’s candidates agree, do not pay it to hesitate.

When they disagree, hesitation is the signal.

Cognaptus: Automate the Present, Incubate the Future.

¹ Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami, “Agentic Test-Time Scaling for WebAgents,” arXiv:2602.12276, 2026. https://arxiv.org/abs/2602.12276