From repetition to reasoning

When early computer-use agents (CUAs) appeared, they promised to automate tedious digital workflows—clicking through files, formatting reports, or organizing spreadsheets. Yet anyone who has tried them knows the frustration: sometimes they succeed spectacularly, sometimes they click the wrong button and crash everything. Reliability, not intelligence, has been the missing link.

A recent paper from Simular Research, “The Unreasonable Effectiveness of Scaling Agents for Computer Use,” shows that scaling these agents isn’t just about more compute—it’s about how we scale. Their method, Behavior Best-of-N (bBoN), turns the brute-force idea of “run many agents and hope one works” into a structured, interpretable, and near-human-level solution.


Scaling, but with judgment

At its core, bBoN transforms chaotic exploration into curated selection. Instead of trusting a single agent’s run, the system launches several in parallel—each one navigating the same digital task differently. Then, rather than simply picking the one that finishes fastest or looks right, bBoN converts each attempt into a behavior narrative: a concise summary of what the agent did and how the environment responded.

These narratives are then compared by a “judge” model (e.g., GPT-5), which identifies the best trajectory—much like a manager reviewing multiple employee attempts and approving the most competent one. The brilliance lies in abstraction: by compressing messy low-level logs into structured, semantically meaningful summaries, the system makes large-scale evaluation tractable.
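
To make the pipeline concrete, here is a minimal Python sketch of the wide-scaling-plus-judging loop. The helpers `run_agent`, `summarize`, and `judge` are hypothetical placeholders standing in for the paper’s rollout, narrative-generation, and GPT-5 judging components; this is a sketch of the idea, not the authors’ implementation.

```python
# Minimal sketch of Behavior Best-of-N (bBoN).
# run_agent, summarize, and judge are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: list[str]  # raw low-level actions and environment observations


def run_agent(task: str, seed: int) -> Trajectory:
    """Roll out one computer-use agent on the task (placeholder)."""
    raise NotImplementedError


def summarize(trajectory: Trajectory) -> str:
    """Compress a raw trajectory into a behavior narrative: what the agent
    did at each step and how the environment responded (placeholder)."""
    raise NotImplementedError


def judge(task: str, narratives: list[str]) -> int:
    """Ask an LLM judge (e.g., GPT-5) to compare narratives side by side
    and return the index of the most trustworthy attempt (placeholder)."""
    raise NotImplementedError


def behavior_best_of_n(task: str, n: int = 5) -> Trajectory:
    # 1. Wide scaling: launch n independent rollouts of the same task.
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(lambda s: run_agent(task, s), range(n)))
    # 2. Abstraction: convert each messy log into a comparable narrative.
    narratives = [summarize(t) for t in trajectories]
    # 3. Judgment: an explicit reasoning step decides which run to trust.
    return trajectories[judge(task, narratives)]
```

The orchestration loop itself is trivial; the leverage sits in step 2, which compresses the logs enough to keep step 3 tractable even as n grows.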


From chaos to clarity

The results are striking. On the OSWorld benchmark—369 real computer-use tasks in Ubuntu—bBoN reached a 69.9% success rate at 100 steps, up from the prior state of the art at 59.9% and within roughly two percentage points of human performance (72%). The method also generalizes cleanly to Windows and Android benchmarks.

Benchmark            Prior SoTA    bBoN (GPT-5)    Human
OSWorld (Ubuntu)     59.9%         69.9%           72%
WindowsAgentArena    50.2%         56.6%           –
AndroidWorld         68.1%         71.6%           –

The key innovation isn’t a new architecture or a giant model—it’s the process of deciding which run to trust. By turning trajectories into human-readable behavior narratives, bBoN introduces an explicit reasoning layer between agent generation and result acceptance.


Smarter foundations: Agent S3

The authors also introduce Agent S3, a refined framework that drops hierarchical “manager-worker” planning and integrates a coding agent capable of programmatic edits (Python or Bash) alongside GUI actions.
Compared with its predecessor (Agent S2), Agent S3 delivers:

  • 13.8% higher success rate
  • 52% fewer LLM calls per task
  • 62% shorter completion time
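
The hybrid action space is easiest to see in code. The sketch below is illustrative only: `Action`, `GuiAction`, and `CodeAction` are hypothetical types, not Agent S3’s real interface, but they capture the idea that a single rollout can switch from clicking to scripting whenever a programmatic edit is cheaper or more reliable.

```python
# Illustrative hybrid action space (hypothetical types, not Agent S3's API):
# a rollout can interleave GUI actions with programmatic Python/Bash edits.
import subprocess
from dataclasses import dataclass
from typing import Union


@dataclass
class GuiAction:
    kind: str       # e.g. "click", "type", "scroll"
    target: str     # accessibility-tree node or screen coordinate
    text: str = ""


@dataclass
class CodeAction:
    language: str   # "python" or "bash"
    source: str     # a script that edits files or application state directly


Action = Union[GuiAction, CodeAction]


def execute(action: Action) -> str:
    """Dispatch an action to the GUI driver or to a subprocess."""
    if isinstance(action, CodeAction):
        cmd = (["bash", "-c", action.source] if action.language == "bash"
               else ["python", "-c", action.source])
        return subprocess.run(cmd, capture_output=True, text=True).stdout
    # GUI execution would go through a driver such as pyautogui; stubbed here.
    return f"performed {action.kind} on {action.target}"


# Renaming fifty files becomes one CodeAction instead of fifty click sequences.
bulk_rename = CodeAction(
    "bash", 'for f in report_*.csv; do mv "$f" "${f%.csv}_final.csv"; done'
)
```

The design point is the division of labor: code handles deterministic, repetitive edits in one shot, while GUI actions remain available for anything that only exists on screen.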

Together with bBoN’s wide-scaling selection, these improvements create a virtuous cycle: faster rollouts yield more diverse solutions, and structured judging ensures that diversity converts into reliability.


Why this matters

CUAs are a microcosm of enterprise automation.
In corporate workflows—from accounting dashboards to HR portals—automation fails not because AI doesn’t “know what to do,” but because one small UI deviation or delayed response derails the chain of steps.
bBoN demonstrates a generalizable principle: robust automation arises from intelligent redundancy.

Instead of seeking a perfect agent, organizations can deploy multiple imperfect ones and let a structured judge pick the best outcome. The same philosophy could govern document processing, financial reporting, or data cleaning workflows—anywhere parallel attempts can be cheaply generated but must be reliably validated.


The deeper lesson

What Simular’s work really highlights is a shift from “bigger models” to “better orchestration.”
The future of automation will not be dominated by a single omniscient model, but by ensembles of specialized agents—each imperfect, yet collectively robust through comparative understanding.
Scaling, when guided by structure, produces intelligence.


Cognaptus: Automate the Present, Incubate the Future