Opening — Why this matters now

AI agents are learning to use computers the way humans do: by looking at screens and clicking things.

In demos, they book flights, fill forms, navigate desktops, and even write code. The narrative is simple: “LLMs can now operate software.”

But here’s the inconvenient question: can they actually build something through a graphical interface?

A recent paper, “See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch,” introduces SCRATCHWORLD, a benchmark that quietly punctures the hype. Instead of browsing websites or retrieving files, agents must construct programs block by block in Scratch, the children’s visual programming environment.

It sounds trivial.

It is not.


Background — From Navigation to Construction

Most existing GUI benchmarks test:

  • Clicking links
  • Filling forms
  • Launching apps
  • Navigating menus

These are fundamentally retrieval or navigation tasks.

Scratch is different.

Scratch requires program construction. The agent must:

  1. Plan logic
  2. Locate blocks visually
  3. Drag them precisely
  4. Snap them into valid structural positions
  5. Pass runtime tests

This is closer to real-world low-code platforms, enterprise workflow builders, and visual automation tools than to web browsing.
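
To make that concrete, here is a minimal Python sketch of what a primitive-mode agent has to do end to end. Every name in it (plan_program, locate_in_palette, choose_drop_point, drag_to, is_snapped, run_tests) is a hypothetical placeholder for an agent’s perception and actuation stack, not the benchmark’s actual harness API.

```python
# Hypothetical end-to-end construction loop for one SCRATCHWORLD-style task.
# All helpers are placeholders named for illustration only.

def build_in_gui(task, agent, gui):
    plan = agent.plan_program(task.description)            # 1. plan the logic
    for step in plan:                                       # one block edit at a time
        screen = gui.screenshot()
        src = agent.locate_in_palette(screen, step.block)   # 2. find the block visually
        dst = agent.choose_drop_point(screen, step.slot)    # 3-4. pick a snap-valid endpoint
        gui.drag_to(src, dst)                               #      drag and release
        if not gui.is_snapped(step.block, step.slot):       # an edit can fail silently
            return False
    return gui.run_tests(task)                              # 5. pass the runtime tests
```

A single failed edit anywhere in that chain sinks the whole task, which is exactly the compounding problem the results below expose.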

The benchmark includes 83 curated tasks, grouped into four categories:

| Category | Description | Business Analogy |
|---|---|---|
| Create | Build from scratch | Designing a workflow in Zapier/Power Automate |
| Debug | Fix broken logic | Repairing misconfigured automation |
| Extend | Add new features | Modifying ERP flows |
| Compute | Algorithmic logic | Implementing business rules |

Crucially, SCRATCHWORLD separates two capabilities:

  • Composite Mode → High-level semantic APIs (logic only)
  • Primitive Mode → Low-level drag-and-drop GUI actions

In other words:

Can you design the program? Can you actually assemble it through the interface?

Those are not the same skill.
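
To make the contrast tangible, here is a hypothetical sketch of the two action granularities. The class names and fields are ours for illustration; they are not SCRATCHWORLD’s actual interface.

```python
from dataclasses import dataclass

@dataclass
class CompositeAction:
    """Composite mode: the agent emits a semantic edit; the harness handles placement."""
    op: str            # e.g. "add_block", "set_field", "attach"
    block_type: str    # e.g. "control_repeat"
    target_slot: str   # a logical position, e.g. "inside:forever_loop_1"

@dataclass
class PrimitiveDrag:
    """Primitive mode: the agent must supply raw screen coordinates itself."""
    grab_xy: tuple[float, float]     # where to pick up the block in the palette
    release_xy: tuple[float, float]  # where to drop it on the canvas

# The same intent, "put a repeat block inside the forever loop", in each mode:
composite = CompositeAction("add_block", "control_repeat", "inside:forever_loop_1")
primitive = PrimitiveDrag(grab_xy=(112.0, 364.0), release_xy=(531.5, 242.0))
```

Composite mode only tests whether the semantic edit is right; primitive mode also tests whether the two coordinate pairs are right, and the second part turns out to be the hard one.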


Analysis — The Reasoning–Acting Gap

The results are not subtle.

Composite Mode (Logic Only)

Top model (Claude-Sonnet-4.5):

  • 78.31% Success Rate

Modern models can reason about Scratch programs.

They understand loops, variables, control flow.

They pass runtime execution tests when GUI mechanics are abstracted away.

So far, so impressive.


Primitive Mode (Real GUI Manipulation)

Same models, now required to drag blocks precisely.

Best result:

  • 14.46% Success Rate

Let that sink in.

From ~78% to ~14%.

Below is a simplified contrast:

| Mode | Best SR | What It Measures |
|---|---|---|
| Composite | 78.31% | Logical correctness |
| Primitive | 14.46% | Real-world execution |

This is the reasoning–acting gap.

The models can plan.

They cannot reliably snap.


Deep Dive — Is It Planning or Precision?

The authors went further.

They isolated the problem into a Single-Step Drag Benchmark (60 atomic tasks).

Each task required only one drag-and-drop.

No long-horizon complexity.

Result?

  • Best Pass@1: 23.33%

Even one drag fails most of the time.

This eliminates long-horizon reasoning as the culprit.

The bottleneck is pixel-level execution precision.

More specifically:

Endpoint Localization Is the Real Failure Mode

Models often grab the correct block.

They fail at:

“Where exactly should I drop it?”

The Scratch snapping mechanism requires precise spatial alignment. Dropping at the geometric center is not enough.
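
To see why, consider a hypothetical model of snapping: a release only attaches the block if it lands within a small tolerance of a connection point. The snap radius and coordinates below are illustrative values of ours, not numbers from Scratch or the paper.

```python
import math

SNAP_RADIUS = 15.0  # illustrative tolerance, in pixels, around each connection point

def nearest_connector(release_xy, connectors):
    """Return (connector, distance) for the closest candidate connection point."""
    return min(((c, math.dist(release_xy, c)) for c in connectors), key=lambda pair: pair[1])

def does_snap(release_xy, connectors):
    """A drop only counts if it lands inside some connector's snap zone."""
    _, dist = nearest_connector(release_xy, connectors)
    return dist <= SNAP_RADIUS

notch = [(531.5, 242.0)]                  # e.g. the notch under an existing block
print(does_snap((531.5, 243.0), notch))   # True: within tolerance, the block attaches
print(does_snap((540.0, 280.0), notch))   # False: looks close on screen, never attaches
```

Under this kind of mechanism, a drop that is visually “close enough” to a human can still be a hard failure for the program being built.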

Endpoint accuracy in some settings remained around 30%.

In long tasks requiring ~9 block modifications on average, even small per-step failure rates compound: whole-task success decays exponentially with the number of steps.

Mathematically, if per-step success is $p$, then for $n$ steps:

$$ P(\text{task success}) = p^n $$

If $p = 0.8$ and $n = 9$:

$$ 0.8^9 \approx 0.13 $$

That aligns disturbingly well with the ~14% primitive-mode success.
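
A quick sanity check of that arithmetic, plus the inverse question that matters for deployment: how reliable must each step be to hit a given whole-task target? The step count follows the paper’s average; the targets are our own illustration.

```python
n = 9  # average number of block modifications per task, per the paper

# Forward: per-step reliability -> whole-task success.
for p in (0.80, 0.90, 0.95):
    print(f"per-step p = {p:.2f}  ->  task success = {p ** n:.3f}")
# per-step p = 0.80  ->  task success = 0.134
# per-step p = 0.90  ->  task success = 0.387
# per-step p = 0.95  ->  task success = 0.630

# Inverse: whole-task target -> per-step reliability required.
for target in (0.50, 0.90):
    print(f"task target = {target:.0%}  ->  per-step p needed = {target ** (1 / n):.3f}")
# task target = 50%  ->  per-step p needed = 0.926
# task target = 90%  ->  per-step p needed = 0.988
```

Even a 95%-reliable step finishes the average task only about 63% of the time; dependable operation demands near-perfect steps.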

This is not accidental.


Perception Isn’t the Problem

Perhaps the models simply can’t see clearly?

The paper tested that too.

A static Visual Perception QA Benchmark (200 samples) evaluated:

  • Detecting block connections
  • Identifying occluded blocks
  • Reading field values

GPT-5 achieved:

  • 90.5% perception accuracy

So perception is strong.

But drag success remains ~23% (single-step).

Conclusion:

The failure lies not in seeing, not in planning — but in converting perception into precise motor execution.

This is a closed-loop control problem.


Why This Matters for Business Automation

Scratch is not the end goal.

Scratch is a proxy.

The same structural properties exist in:

  • Low-code enterprise builders
  • RPA tools
  • Visual workflow automation
  • No-code AI orchestration platforms
  • ERP configuration panels

If an AI agent cannot reliably perform snap-sensitive drag-and-drop operations:

  • It cannot safely automate visual business tooling
  • It will silently fail under compounding error
  • It cannot be trusted in high-stakes enterprise workflows

Composite-mode benchmarks are flattering.

Primitive-mode benchmarks are sobering.

For operators, this distinction is existential.


Strategic Implications

1. Planning Is Commoditized

Logical reasoning in structured domains is already strong.

This will not be the bottleneck going forward.

2. Execution Policies Are the New Frontier

High-precision interaction policies — not bigger models — are required.

Future improvements will likely come from:

  • Snap-aware coordinate modeling
  • Closed-loop visual feedback (both sketched below)
  • Learned motor control policies
  • Hybrid symbolic–geometric planning
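
As one illustration of the first two items, here is a hypothetical sketch of a closed-loop drag policy: act, visually verify the snap, and retry with a corrected endpoint rather than trusting a single open-loop drop. All helpers (observe, propose_endpoint, drag, snapped) are placeholders for an agent’s perception and actuation stack, not any existing API.

```python
def closed_loop_drag(block, slot, observe, propose_endpoint, drag, snapped, max_tries=3):
    """Try to attach `block` at `slot`, re-observing the screen after every attempt."""
    for _ in range(max_tries):
        screen = observe()                                 # fresh screenshot (visual feedback)
        target_xy = propose_endpoint(screen, block, slot)  # snap-aware coordinate proposal
        drag(block, target_xy)                             # low-level GUI action
        if snapped(observe(), block, slot):                # did the block actually attach?
            return True
        # otherwise retry: the next proposal sees the current (possibly shifted) layout
    return False
```

The point is the loop, not the helpers: verifying after every action is what stops per-step errors from compounding silently.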

3. Benchmarks Must Separate Logic from Execution

SCRATCHWORLD’s dual-mode design is not just clever.

It is necessary.

Without isolating reasoning from acting, we overestimate agent readiness.


Conclusion — Thinking Is Cheap, Snapping Is Hard

SCRATCHWORLD delivers an uncomfortable but valuable message:

AI agents can design the right program.

They often cannot physically build it.

In Scratch, that gap is visible.

In enterprise automation, it becomes financial risk.

The next wave of progress in multimodal agents will not be about better prompts.

It will be about reliable execution under spatial constraints.

Until then, treat GUI agents as planners — not operators.

And benchmark accordingly.

Cognaptus: Automate the Present, Incubate the Future.