Desktops are where AI ambition goes to discover gravity.

A chatbot can sound competent in one turn. A coding assistant can look brilliant inside a bounded file. But ask an agent to use a real computer for a long task — open the right app, edit the right file, preserve formatting, notice a pop-up, verify the final state, and not confidently click itself into a small administrative tragedy — and the problem changes. Intelligence is no longer a single answer. It is a chain of actions, each one able to quietly poison the next.

That is the setting for Scaling Agents for Computer Use, a paper from Simular Research on computer-use agents, or CUAs [1]. The headline result is easy to oversell: the authors report 72.6% success on OSWorld at 100 steps using Behavior Judge with GPT-5 and Claude Opus 4.5 rollouts, above the previous reported best of 63.4% and slightly above the OSWorld human-level figure of 72.36%. There it is: the obligatory leaderboard confetti. Please recycle responsibly.

The more interesting claim is not that one agent suddenly became wise. It is that several imperfect agents, run in parallel, can become useful if the system has a disciplined way to decide which attempt actually worked.

That last clause does most of the work.

The real mechanism is not “try more”; it is “select better”

The obvious reading of the paper is that computer-use agents benefit from brute force. Run ten attempts. Hope one succeeds. Pick the winner. Done.

That reading is pleasantly simple and mostly wrong.

The paper’s central mechanism is wide scaling: instead of spending extra compute inside one trajectory, the system generates multiple complete trajectories and then selects among them. This matters because long desktop tasks have high variance. One run may fail because it misses a tiny UI state change. Another may succeed because it happens to choose a cleaner route. A third may use code instead of clicking through a spreadsheet like a very patient intern.

But wide scaling has a bottleneck: selection. Multiple attempts only help if the system can identify which attempt completed the task. In a desktop environment, that is not the same as asking which transcript sounds most plausible. A trajectory is multimodal, long, noisy, and full of irrelevant visual detail. The final screenshot may not reveal whether the right row was hidden, the correct value was edited, or the file was saved in the expected place.

Behavior Judge, or BJudge, is the authors’ answer. It works in two stages.

First, each dense trajectory is converted into a behavior narrative. Rather than passing every screenshot and action directly to a judge, the system generates factual descriptions of what changed after each action. For pointer actions such as clicks and drags, it marks the click location in the before-screenshot and uses a zoomed crop after execution, because small visual differences can decide whether a click did what the agent believed it did. This is less glamorous than “reasoning,” but much more useful. The desktop does not care about your ontology. It cares whether the Save button was actually pressed.

Second, a comparative evaluator reviews the behavior narratives together and selects the best trajectory. The evaluator is not scoring each run independently in isolation. It is comparing candidate behaviours side by side. That distinction becomes important in the experiments: independent judging plateaus quickly, while comparative judging benefits more from additional rollouts.

So the mechanism is not mere repetition. It is:

  1. generate diverse full attempts;
  2. compress each attempt into action-effect facts;
  3. compare those facts across candidates;
  4. choose the trajectory most aligned with the user’s request.

The paper’s practical insight lives in that pipeline. Scaling creates optionality. Narratives make optionality legible. Comparative judging converts legibility into reliability.
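The four-step pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `Step`, `narrate`, `select_best`, and the two `toy_*` functions are hypothetical names, screenshots are replaced by short strings, and the model calls that would produce narratives and judgments are replaced by trivial stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    action: str  # what the agent did, e.g. "click(Save)"
    before: str  # stand-in for the screenshot before the action
    after: str   # stand-in for the screenshot after the action


Trajectory = List[Step]
Narrative = List[str]


def narrate(trajectory: Trajectory, describe: Callable[[Step], str]) -> Narrative:
    """Step 2: compress a trajectory into one action-effect fact per step."""
    return [describe(step) for step in trajectory]


def select_best(task: str, rollouts: List[Trajectory],
                describe: Callable[[Step], str],
                compare: Callable[[str, List[Narrative]], int]) -> int:
    """Steps 3-4: show all narratives to one comparative judge, return its pick."""
    narratives = [narrate(t, describe) for t in rollouts]
    return compare(task, narratives)


# Toy stand-ins (hypothetical): a real system would call models here.
def toy_describe(step: Step) -> str:
    if step.before == step.after:
        return f"{step.action}: no visible change"
    return f"{step.action}: {step.before} -> {step.after}"


def toy_compare(goal: str, narratives: List[Narrative]) -> int:
    # Prefer the candidate whose facts show the goal state being reached.
    scores = [sum(fact.endswith(goal) for fact in n) for n in narratives]
    return scores.index(max(scores))


# Step 1 is assumed to have happened already: two complete rollouts of one task.
rollouts = [
    [Step("type(42)", "cell empty", "cell=42"),
     Step("click(Save)", "file dirty", "file dirty")],   # the Save click missed
    [Step("type(42)", "cell empty", "cell=42"),
     Step("click(Save)", "file dirty", "file saved")],
]
best = select_best("file saved", rollouts, toy_describe, toy_compare)
print(best)  # 1: only the second rollout's Save click changed the file state
```

The point of the sketch is the shape of the data flow: the judge never sees raw steps, only per-step facts, and it sees all candidates at once rather than scoring each in isolation.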

Agent S3 makes the rollouts less wasteful before judging begins

Before BJudge selects among rollouts, those rollouts need to be worth selecting from. The authors therefore introduce Agent S3, an improved baseline built from Agent S2.

Agent S3 has two important changes.

The first is a flat policy. The authors remove hierarchical manager-worker planning and use a single policy that can replan at every step. The argument is operational rather than philosophical: modern models can often maintain short-horizon plans in context, while stale high-level subgoals can become overhead. In desktop automation, a plan that cannot update after a modal dialog appears is not a plan. It is a liability with bullet points.

The second is a coding agent integrated into the GUI policy’s action space. Agent S3 can choose to manipulate the interface directly or call a bounded code loop for programmatic edits, file transformations, parsing, or bulk operations. After the code agent runs, it returns a summary and verification checklist so the GUI agent can inspect the result. This is a sensible hybrid. Some tasks belong in the UI because formatting, menus, and application state matter. Others should be done through code because clicking through repeated edits is how machines cosplay as exhausted office workers.
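One way to picture the two changes together is a single loop whose action space includes both GUI commands and a call to the code agent. The sketch below is an assumption-laden illustration, not Agent S3's actual interface: every class and function name is hypothetical, and the policy, GUI executor, and code agent are toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class GuiAction:
    command: str  # e.g. a click or keystroke description


@dataclass
class CodeAction:
    instruction: str  # what the code agent should do programmatically


@dataclass
class Done:
    pass


@dataclass
class CodeResult:
    summary: str
    checklist: List[str]  # items the GUI policy should verify afterwards


Action = Union[GuiAction, CodeAction, Done]


def run_flat_policy(policy: Callable[[List[str]], Action],
                    run_gui: Callable[[str], str],
                    run_code: Callable[[str], CodeResult],
                    max_steps: int) -> List[str]:
    """Flat loop: no manager/worker split, the policy replans at every step."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(history)
        if isinstance(action, Done):
            break
        if isinstance(action, CodeAction):
            result = run_code(action.instruction)
            # The summary and checklist go back into context for verification.
            history.append(f"code: {result.summary}; verify: {', '.join(result.checklist)}")
        else:
            history.append(f"gui: {run_gui(action.command)}")
    return history


# Toy stand-ins (hypothetical) so the loop can run end to end.
def toy_policy(history: List[str]) -> Action:
    if not history:
        return CodeAction("fill column B with totals")
    if history[-1].startswith("code:"):
        return GuiAction("open sheet and inspect column B")
    return Done()


def toy_run_code(instruction: str) -> CodeResult:
    return CodeResult(f"ran script for '{instruction}'", ["column B is non-empty"])


def toy_run_gui(command: str) -> str:
    return f"observed result of '{command}'"


history = run_flat_policy(toy_policy, toy_run_gui, toy_run_code, max_steps=10)
print(len(history))  # 2: one code step, then one GUI verification step
```

The design choice worth noticing is the handoff: the code result is not trusted silently but returned as text the GUI policy can act on in its next step.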

This baseline improvement is not a footnote. On OSWorld with GPT-5, Agent S3 reaches 62.6% success, compared with 48.8% for Agent S2. It also reduces average LLM calls per task from 73.62 to 35.12 and average time per task from 2366.80 seconds to 891.21 seconds. In other words, the paper improves the agent before scaling it, then uses scaling to extract more reliability from that stronger base.

That sequencing matters for business interpretation. Parallelism is not a magic tax you pay to redeem a weak system. If the base agent produces low-quality attempts, the judge has less useful variation to exploit. Wide scaling works best when the candidate pool contains genuinely different plausible solutions, not ten creative ways to fail.

The main result is strong, but the ablations explain why it matters

The paper’s main OSWorld table is a comparison with prior work. Agent S3 with BJudge reaches 69.9% using ten GPT-5 rollouts. The best reported configuration, using a mixture of GPT-5 and Opus 4.5 rollouts, reaches 72.6%. Prior step-wise scaling with GTA1 and GPT-5 is reported at 63.4%.

That is the headline. But the ablations are where the argument becomes convincing.

| Test or result | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| OSWorld main result: 72.6% with GPT-5 + Opus 4.5 rollouts | Main evidence and comparison with prior work | Wide scaling plus BJudge can substantially improve benchmark success | That live enterprise desktops are solved |
| Agent S3 vs Agent S2 | Implementation and baseline improvement | Better rollout generation improves both success and efficiency | That hierarchy is always bad |
| BJudge vs adapted WebJudge under equal rollout budgets | Ablation / comparison | Comparative trajectory selection is better than independent scoring | That the judge is generally reliable outside benchmark-like tasks |
| Behavior narratives vs screenshots, summaries, and naive captions | Representation ablation | Transition-level action-effect facts are more useful than raw or generic summaries | That narratives are immune to hallucination |
| Resource-budget experiments | Sensitivity test | More workers help when each worker still has enough steps | That more rollouts always dominate one deeper run |
| Model-mixture experiments | Exploratory ensemble analysis | Diverse capable models increase coverage and can improve selected success | That maximum Pass@N automatically becomes maximum final success |
| WindowsAgentArena and AndroidWorld results | Robustness / generalization test | The method transfers beyond Ubuntu OSWorld | That it covers all operating systems, apps, and external-state workflows |

The representation ablation is especially important. With ten GPT-5 Mini rollouts, BJudge using behavior narratives reaches 60.2%. Screenshot-only selection reaches 56.0%, trajectory summaries reach 55.0%, and naive captioning reaches 56.8%. The margin is modest, between 3.4 and 5.2 points, but it is directionally clear: the useful abstraction is not “summarise the task.” It is “record what each action changed in the environment.”

That is a sharper lesson than the leaderboard number. Enterprises do not lack logs. They lack logs that explain what mattered.

A raw desktop trace is too detailed. A generic summary is too vague. A behavior narrative sits in the useful middle: enough structure to preserve evidence, enough compression to make comparison feasible. The paper’s method is basically an audit trail designed for machine judging. Apparently, agents also benefit from documentation. Shocking development.

Pass@N shows the upside; selection determines how much of it is captured

The model-mixture results expose a subtle point that many scaling discussions miss. The paper reports both success rate after BJudge selection and Pass@N, where a task is counted as covered if at least one rollout succeeded.

This distinction is crucial.

Pass@N measures the available upside in the candidate pool. BJudge success measures the captured upside after selection. A model mixture can generate at least one correct trajectory for many tasks, yet still fail to deliver the best final result if the judge chooses the wrong candidate.

In the paper’s N=4 mixture study, using all model types gives the highest coverage at 80.5% Pass@N, but its selected success rate is 68.4%. The GPT-5 + Opus combination has lower coverage at 79.1%, but higher selected success at 71.6%. This is not a contradiction. It is the whole problem in miniature.

Diversity expands the frontier. Selection decides whether the system reaches it.

For business use, this is a useful diagnostic split. If Pass@N is low, the agents are not producing enough correct attempts. Improve the base agent, tools, permissions, or task decomposition. If Pass@N is high but selected success is low, the bottleneck is evaluation. Improve narratives, validators, judges, or external checks.

That distinction is much more actionable than saying “the agent failed.” Most automation programmes already know that. Their invoices have been very clear.
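The split between available and captured upside is easy to compute once per-rollout outcomes are logged. The sketch below assumes a hypothetical data layout (`outcomes[task][rollout]` is `True` when that rollout succeeded, `chosen[task]` is the index the judge picked), and the thresholds in `diagnose` are arbitrary illustrative values, not figures from the paper.

```python
from typing import List


def pass_at_n(outcomes: List[List[bool]]) -> float:
    """Fraction of tasks where at least one rollout succeeded (available upside)."""
    return sum(any(task) for task in outcomes) / len(outcomes)


def selected_success(outcomes: List[List[bool]], chosen: List[int]) -> float:
    """Fraction of tasks where the judge's pick succeeded (captured upside)."""
    return sum(task[i] for task, i in zip(outcomes, chosen)) / len(outcomes)


def diagnose(outcomes: List[List[bool]], chosen: List[int],
             min_coverage: float = 0.7, max_gap: float = 0.1) -> str:
    """Name the bottleneck. Both thresholds are hypothetical defaults."""
    coverage = pass_at_n(outcomes)
    captured = selected_success(outcomes, chosen)
    if coverage < min_coverage:
        return "generation"  # too few correct attempts exist at all
    if coverage - captured > max_gap:
        return "selection"   # correct attempts exist but are not being picked
    return "ok"


# Four toy tasks, two rollouts each.
outcomes = [[True, False], [False, False], [False, True], [True, True]]
chosen = [0, 0, 0, 1]

print(pass_at_n(outcomes))                 # 0.75
print(selected_success(outcomes, chosen))  # 0.5
print(diagnose(outcomes, chosen))          # selection
```

Applied to the figures quoted above, the all-models mixture leaves a gap of 12.1 points between coverage and selected success (80.5% vs 68.4%), while GPT-5 + Opus leaves 7.5 points (79.1% vs 71.6%).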

More workers help only after the task budget can sustain them

The resource-budget experiment is a useful antidote to naive scaling optimism.

The authors vary the total budget, defined as number of workers multiplied by per-worker step budget. At small budgets, a single agent performs best. This is intuitive: splitting a tiny budget across many workers produces several underfunded attempts. Ten agents that each run out of steps before reaching the file menu are not an ensemble. They are a committee of abandoned errands.

As the total budget increases, larger worker counts become more effective. The paper reports BJudge with four workers achieving a 4.25 percentage point improvement over a single agent at total budget 200, while ten workers produce the largest gain, 6.38 points, at budget 1000.

So the scaling law here is conditional. Wide scaling helps when each rollout has enough room to complete a plausible path. Below that threshold, breadth becomes fragmentation.

This is directly relevant to operational design. If an enterprise wants to use agent ensembles for desktop work, it should not merely ask how many agents to launch. It should ask whether each agent gets a realistic execution budget, whether the environment can be reset, and whether the judge has enough evidence to compare completed behaviours rather than half-finished gestures.
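The budget arithmetic is simple enough to write down. Under the paper's definition (total budget = workers × per-worker steps), four workers at budget 200 get 50 steps each and ten workers at budget 1000 get 100 steps each. The sketch below adds one assumption of its own: a hypothetical `min_steps` threshold standing in for "enough room to complete a plausible path", which in practice would have to be estimated per task family.

```python
from typing import List


def steps_per_worker(total_budget: int, workers: int) -> int:
    """Per-worker step budget when the total is split evenly."""
    return total_budget // workers


def viable_worker_counts(total_budget: int, candidates: List[int],
                         min_steps: int) -> List[int]:
    """Worker counts that still leave each worker at least min_steps steps.

    min_steps is a hypothetical threshold, not a number from the paper.
    """
    return [w for w in candidates if steps_per_worker(total_budget, w) >= min_steps]


candidates = [1, 2, 4, 10]
print(steps_per_worker(200, 4))                     # 50
print(steps_per_worker(1000, 10))                   # 100
print(viable_worker_counts(200, candidates, 50))    # [1, 2, 4]
print(viable_worker_counts(1000, candidates, 50))   # [1, 2, 4, 10]
```

With the same illustrative threshold, ten workers at budget 200 would get only 20 steps each, which is the fragmentation case described above.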

BJudge is accurate enough to be useful, not perfect enough to be trusted blindly

The failure analysis is unusually practical.

On the OSWorld “Judge Subset” — 159 tasks where there is at least one correct and one incorrect trajectory, meaning selection can actually improve the outcome — BJudge reaches 78.4% benchmark-aligned accuracy. After manual inspection of a remaining set of cases, the authors report 92.8% human alignment on the judge subset, noting that OSWorld scripts can be imperfect because they may check only a predefined solution.

This is important for two reasons.

First, benchmark scripts are not sacred. They are useful instruments, not divine revelation in Python. If a task admits multiple valid solutions, a script may mark a behaviour wrong even when a human would accept it. That matters for computer-use agents, where many desktop tasks have several acceptable routes.

Second, BJudge still fails in revealing ways. The authors identify two main failure categories: behavior narrative hallucinations and code-GUI handoff failures. In one case study, the visual model misses a fine detail such as a negative sign and produces an inaccurate narrative. In another, the GUI agent fails to verify what the coding agent changed, then overwrites or misinterprets those changes. BJudge may prefer the richer-looking GUI narrative even when the cleaner code-driven trajectory actually completed the task.

That second failure is particularly instructive. A good narrative can still describe the wrong thing persuasively. Enterprise automation teams should pause here and breathe slowly into a compliance checklist.

The paper’s method improves selection, but it does not eliminate the need for validators, environmental isolation, and task-specific acceptance checks. In fact, it makes those components more valuable. If behavior narratives become the interface between execution and judgment, then narrative fidelity becomes a control surface. A hallucinated audit trail is not an audit trail. It is theatre with timestamps.
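The Judge Subset idea is worth reproducing in any internal evaluation, because judge accuracy is only meaningful on tasks where the choice matters. A minimal sketch, assuming the same hypothetical layout as before (`outcomes[task][rollout]` as booleans from a benchmark script or human label, `chosen[task]` as the judge's pick):

```python
from typing import List


def judge_subset(outcomes: List[List[bool]]) -> List[int]:
    """Indices of tasks with at least one correct and one incorrect rollout."""
    return [i for i, task in enumerate(outcomes) if any(task) and not all(task)]


def judge_accuracy(outcomes: List[List[bool]], chosen: List[int]) -> float:
    """Share of subset tasks where the judge picked a correct rollout."""
    subset = judge_subset(outcomes)
    if not subset:
        raise ValueError("no task has both a correct and an incorrect rollout")
    return sum(outcomes[i][chosen[i]] for i in subset) / len(subset)


outcomes = [
    [True, False, False],   # selection matters
    [False, False, False],  # nothing to select: every rollout failed
    [True, True, True],     # nothing to select: every rollout succeeded
    [False, True, True],    # selection matters
]
chosen = [0, 2, 1, 0]

print(judge_subset(outcomes))            # [0, 3]
print(judge_accuracy(outcomes, chosen))  # 0.5
```

Swapping the labels in `outcomes` from script verdicts to human verdicts is what separates the two figures above: the same judge picks, scored against a different ground truth.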

The cost story is not “free scaling”; it is “judging is cheaper than rerunning”

The appendix provides useful cost and timing details. For GPT-5-based OSWorld runs, the average per-task cost is reported as $0.72 for a single rollout, $0.11 for behavior narrative generation, and $0.03 for judging ten rollouts. Average time per task is 891 seconds for one rollout, 433.4 seconds for narrative generation, and 226 seconds for judging, with medians lower due to skew from API delays.

The authors also report that running ten rollouts over the 361-task OSWorld benchmark in parallel required 17 hours and 33 minutes end to end, using AWS-hosted OSWorld workers and local API-based narrative generation and judging.

This does not make wide scaling cheap. Ten rollouts are still ten rollouts. But it suggests the selection layer is not the dominant cost. The expensive part is producing candidate behaviours. Once those behaviours exist, converting and judging them is comparatively manageable.
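A back-of-the-envelope calculation makes the proportions concrete. The reported figures do not say, as summarised here, whether the $0.11 narrative cost is per rollout or per task, so the sketch below computes both readings. That ambiguity, the function name, and the treatment of the $0.03 judging cost as a flat fee for ten rollouts are all assumptions of this sketch.

```python
from typing import Tuple

ROLLOUT_COST = 0.72    # reported average cost of a single rollout, USD
NARRATIVE_COST = 0.11  # reported narrative-generation cost, USD (unit ambiguous)
JUDGE_COST = 0.03      # reported cost of judging ten rollouts, USD


def per_task_cost(n_rollouts: int, narrative_per_rollout: bool) -> Tuple[float, float]:
    """Return (total cost, share of total spent on rollouts).

    narrative_per_rollout selects between the two readings of the $0.11 figure.
    JUDGE_COST is treated as flat, which is only grounded for ten rollouts.
    """
    rollouts = ROLLOUT_COST * n_rollouts
    narratives = NARRATIVE_COST * (n_rollouts if narrative_per_rollout else 1)
    total = rollouts + narratives + JUDGE_COST
    return total, rollouts / total


for reading in (True, False):
    total, share = per_task_cost(10, narrative_per_rollout=reading)
    print(f"per_rollout={reading}: total=${total:.2f}, rollout share={share:.0%}")
```

Under either reading, the ten rollouts account for well over four fifths of the per-task spend, and narrative generation plus judging make up the remainder.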

That supports a practical deployment pattern:

| Operational decision | Paper-grounded implication |
| --- | --- |
| Run multiple agents only on high-value or high-risk tasks | Wide scaling buys reliability, but rollouts dominate cost |
| Use cheaper models for some rollouts if they add diversity | Model mixtures can improve coverage, though selection quality still matters |
| Spend on stronger judging when rollout quality is acceptable | Appendix experiments suggest evaluator quality can materially affect gains |
| Preserve execution traces and before/after evidence | Behavior narratives depend on transition-level facts, not vibes |
| Use VM snapshots or containers | BJudge assumes repeatable independent rollouts from the same initial state |

This is not a universal recipe. It is a cost architecture. The business question becomes: for which workflows is an extra attempt cheaper than a human correction, a compliance error, or a failed customer process?

That answer will vary. An internal spreadsheet cleanup task might tolerate a retry. A payroll update, legal filing, or production admin action should not be treated as a sandbox with nicer stationery.

The enterprise lesson is orchestration before autonomy

The strongest business interpretation is not that companies should immediately run ten agents for every desktop task. That would be one way to replace inefficiency with very expensive enthusiasm.

The better lesson is that reliability may come less from one heroic model and more from an orchestration stack around imperfect models:

  * reproducible environments;
  * multiple candidate executions;
  * structured action-effect narratives;
  * comparative selection;
  * independent validation;
  * controlled promotion from sandbox to live state.

This is familiar in mature software systems. We do not trust one deployment because it “looked confident.” We test, stage, compare, observe, and roll back. Behavior Judge imports a version of that logic into computer-use agents.

The paper directly shows benchmark gains for CUAs across Ubuntu, Windows, and Android settings. Cognaptus’ inference is broader but conditional: enterprise agent reliability will likely depend on turning agent execution into an inspectable workflow, not merely adding a better front-end chat box. The uncertainty is also clear. Benchmarks offer controlled starts and repeatable environments. Real businesses contain shared cloud files, email accounts, payment states, permissions, stale sessions, rate limits, and people who rename folders to “final_final_USE_THIS_v7”. Science can only do so much.

Where the result should not be overread

The paper’s limitations are not decorative. They shape where the method applies.

BJudge assumes multiple independent rollouts from the same initial state. That is natural in OSWorld-style benchmarks and plausible in virtual machines or containerised enterprise environments. It is much less natural on a user’s live desktop. Two concurrent agents interacting with the same inbox, cart, SaaS dashboard, or shared drive may interfere with each other. Running “parallel attempts” against shared external state can create side effects that cannot be cleanly reset.

The method also depends on the judge’s ability to understand narratives generated by a visual model. Fine-grained visual errors still matter. Code-GUI coordination remains fragile. Some tasks require domain-specific acceptance criteria that a general evaluator may not infer reliably.

So the appropriate conclusion is not “computer-use agents are now human-level.” The OSWorld number is impressive, but it is one benchmark, one evaluation protocol, and a carefully engineered system. The better conclusion is narrower and more useful: when tasks can be repeated in isolated environments, and when trajectories can be represented as faithful behavior narratives, wide scaling can convert variance from a weakness into an asset.

That is a real advance. It just happens to be less magical than the press-release version. Naturally, that makes it more interesting.

More becomes smarter when comparison becomes part of the system

The paper’s title promises scaling. The mechanism delivers something subtler: structured redundancy.

A single computer-use agent fails because long workflows are brittle. Multiple agents help because their failures and successes are not perfectly overlapping. Behavior narratives help because raw traces are too dense to judge well. Comparative evaluation helps because the best attempt is often visible only in contrast.

That is the useful pattern for automation leaders. Do not ask only whether the agent can perform the task once. Ask whether the system can produce alternatives, explain their differences, verify their effects, and choose safely.

The future of desktop automation may not be one agent clicking perfectly through every workflow. It may be a small population of agents trying different paths, watched by a judge that understands what changed, and constrained by infrastructure that prevents their mistakes from leaking into the real world.

More becomes smarter only when the system knows what to do with more.

**Cognaptus: Automate the Present, Incubate the Future.**


  1. Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, and Xin Eric Wang, “Scaling Agents for Computer Use,” arXiv:2510.02250, 2025, https://arxiv.org/abs/2510.02250