Opening — Why this matters now
AI agents are learning to use computers the way humans do: by looking at screens and clicking things.
In demos, they book flights, fill forms, navigate desktops, and even write code. The narrative is simple: “LLMs can now operate software.”
But here’s the inconvenient question: can they actually build something through a graphical interface?
A recent paper, “See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch,” introduces SCRATCHWORLD, a benchmark that quietly punctures the hype. Instead of browsing websites or retrieving files, agents must construct programs block by block in Scratch — the children’s visual programming environment.
It sounds trivial.
It is not.
Background — From Navigation to Construction
Most existing GUI benchmarks test:
- Clicking links
- Filling forms
- Launching apps
- Navigating menus
These are fundamentally retrieval or navigation tasks.
Scratch is different.
Scratch requires program construction, not retrieval. The agent must:
- Plan logic
- Locate blocks visually
- Drag them precisely
- Snap them into valid structural positions
- Pass runtime tests
This is closer to real-world low-code platforms, enterprise workflow builders, and visual automation tools than to web browsing.
The benchmark includes 83 curated tasks, grouped into four categories:
| Category | Description | Business Analogy |
|---|---|---|
| Create | Build from scratch | Designing a workflow in Zapier/Power Automate |
| Debug | Fix broken logic | Repairing misconfigured automation |
| Extend | Add new features | Modifying ERP flows |
| Compute | Algorithmic logic | Implementing business rules |
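To make the task structure concrete, here is a purely illustrative sketch of what such a task pairs together: a natural-language goal, a category, and a runtime test the finished project must pass. This is not SCRATCHWORLD's actual schema; names and the `bounce_test` check are placeholders.

```python
# Illustrative task record (not SCRATCHWORLD's actual schema): each task
# pairs a natural-language spec with a category and a runtime check that
# the assembled Scratch project must pass.

from dataclasses import dataclass
from typing import Callable

@dataclass
class GuiBuildTask:
    task_id: str
    category: str           # "create" | "debug" | "extend" | "compute"
    instruction: str        # what the agent is asked to build or fix
    runtime_test: Callable  # executes the project and returns pass/fail

def bounce_test(project) -> bool:
    """Hypothetical check: the sprite reverses direction at the stage edge."""
    trace = project.run(steps=100)
    return trace.sprite_bounced()

task = GuiBuildTask(
    task_id="create-07",
    category="create",
    instruction="Make the sprite move and bounce when it hits the edge.",
    runtime_test=bounce_test,
)
```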
Crucially, SCRATCHWORLD separates two capabilities:
- Composite Mode → High-level semantic APIs (logic only)
- Primitive Mode → Low-level drag-and-drop GUI actions
In other words:
Can you design the program? Can you actually assemble it through the interface?
Those are not the same skill.
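A rough sketch of the difference, using made-up action names rather than the paper's actual API: in composite mode the agent only states the semantic edit, while in primitive mode it must ground the same edit in pixel coordinates itself.

```python
# Illustrative contrast (not the paper's actual API): the same edit as a
# composite-style semantic action versus primitive GUI actions.

from dataclasses import dataclass

@dataclass
class DragAction:
    """A primitive drag: press at (x1, y1), release at (x2, y2)."""
    x1: int
    y1: int
    x2: int
    y2: int

# Composite mode: the agent declares the edit; the harness resolves
# coordinates and snapping on its behalf.
composite_action = {"op": "attach_block", "block": "repeat_10", "below": "when_flag_clicked"}

# Primitive mode: the agent must locate the block in the palette and drop
# it at a snap-valid point in the workspace, in raw pixels.
primitive_actions = [
    DragAction(x1=120, y1=340, x2=612, y2=288),  # palette -> workspace, near the snap target
]
```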
Analysis — The Reasoning–Acting Gap
The results are not subtle.
Composite Mode (Logic Only)
Top model (Claude-Sonnet-4.5):
- 78.31% Success Rate
Modern models can reason about Scratch programs.
They understand loops, variables, control flow.
They pass runtime execution tests when GUI mechanics are abstracted away.
So far, so impressive.
Primitive Mode (Real GUI Manipulation)
Same models, now required to drag blocks precisely.
Best result:
- 14.46% Success Rate
Let that sink in.
From ~78% to ~14%.
Below is a simplified contrast:
| Mode | Best SR | What It Measures |
|---|---|---|
| Composite | 78.31% | Logical correctness |
| Primitive | 14.46% | Real-world execution |
This is the reasoning–acting gap.
The models can plan.
They cannot reliably snap.
Deep Dive — Is It Planning or Precision?
The authors went further.
They isolated the problem into a Single-Step Drag Benchmark (60 atomic tasks).
Each task required only one drag-and-drop.
No long-horizon complexity.
Result?
- Best Pass@1: 23.33%
Even one drag fails most of the time.
This eliminates long-horizon reasoning as the culprit.
The bottleneck is pixel-level execution precision.
More specifically:
Endpoint Localization Is the Real Failure Mode
Models often grab the correct block.
They fail at:
“Where exactly should I drop it?”
The Scratch snapping mechanism requires precise spatial alignment. Dropping at the geometric center is not enough.
Endpoint accuracy in some settings remained around 30%.
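Why is this so unforgiving? A minimal sketch, with assumed thresholds rather than Scratch's real snapping logic: a drop only registers if the release point lands within a small radius of a compatible connector, not merely somewhere on the target block.

```python
import math

# Minimal sketch (illustrative tolerance, not Scratch's actual value):
# a drop only snaps if the dragged block's connector lands within a small
# radius of a compatible connection point.

SNAP_RADIUS_PX = 15  # assumed tolerance for illustration

def snaps(drop_xy, connector_xy, radius=SNAP_RADIUS_PX):
    """Return True if the release point is close enough to the connector."""
    dx = drop_xy[0] - connector_xy[0]
    dy = drop_xy[1] - connector_xy[1]
    return math.hypot(dx, dy) <= radius

# A drop at the visual center of a 200px-wide block can still miss the
# connector at its top-left notch by ~90px -> no snap, silent failure.
print(snaps((400, 300), (310, 296)))  # False: off by ~90px
print(snaps((314, 302), (310, 296)))  # True: within tolerance
```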
In long tasks requiring ~9 block modifications on average, small per-step failure rates compound exponentially.
Mathematically, if per-step success is $p$, then for $n$ steps:
$$ P(\text{task success}) = p^n $$
If $p = 0.8$ and $n = 9$:
$$ 0.8^9 \approx 0.13 $$
That aligns disturbingly well with the ~14% primitive-mode success.
This is not accidental.
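A quick back-of-envelope check makes the point, using the $p = 0.8$ and $n = 9$ from the example above; the 90% task-level target in the second half is my own illustrative number.

```python
# Back-of-envelope check of the compounding argument above.

p, n = 0.8, 9
print(p ** n)  # ~0.134: roughly the observed ~14% primitive-mode success

# Flip it around: what per-step reliability does a 90% task-level
# success rate require over 9 dependent steps? (target is illustrative)
target = 0.90
print(target ** (1 / n))  # ~0.988: each drag must succeed ~99% of the time
```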
Perception Isn’t the Problem
Perhaps the models simply can’t see clearly?
The paper tested that too.
A static Visual Perception QA Benchmark (200 samples) evaluated:
- Detecting block connections
- Identifying occluded blocks
- Reading field values
GPT-5 achieved:
- 90.5% perception accuracy
So perception is strong.
But drag success remains ~23% (single-step).
Conclusion:
The failure lies not in seeing, not in planning — but in converting perception into precise motor execution.
This is a closed-loop control problem.
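What would closing the loop look like? A hedged sketch, with placeholder function names rather than any existing agent framework: act, re-observe the workspace, verify the snap actually happened, and correct before moving on.

```python
# Sketch of a closed-loop drag policy (all methods are placeholders,
# not an existing agent framework): act, re-observe, verify, correct.

def closed_loop_drag(agent, target_block, connector, max_attempts=3):
    for attempt in range(max_attempts):
        drop_xy = agent.predict_drop_point(target_block, connector)  # perception -> coordinates
        agent.drag(target_block, drop_xy)                            # motor action
        state = agent.observe()                                      # re-render the workspace
        if state.is_attached(target_block, connector):               # did it actually snap?
            return True
        # Otherwise estimate the residual offset and retry, nudged.
        agent.adjust_from_feedback(state, target_block, connector)
    return False
```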
Why This Matters for Business Automation
Scratch is not the end goal.
Scratch is a proxy.
The same structural properties exist in:
- Low-code enterprise builders
- RPA tools
- Visual workflow automation
- No-code AI orchestration platforms
- ERP configuration panels
If an AI agent cannot reliably perform snap-sensitive drag-and-drop operations:
- It cannot safely automate visual business tooling
- It will silently fail under compounding error
- It cannot be trusted in high-stakes enterprise workflows
Composite-mode benchmarks are flattering.
Primitive-mode benchmarks are sobering.
For operators, this distinction is existential.
Strategic Implications
1. Planning Is Commoditized
Logical reasoning in structured domains is already strong.
This will not be the bottleneck going forward.
2. Execution Policies Are the New Frontier
High-precision interaction policies — not bigger models — are required.
Future improvements will likely come from:
- Snap-aware coordinate modeling
- Closed-loop visual feedback
- Learned motor control policies
- Hybrid symbolic–geometric planning
3. Benchmarks Must Separate Logic from Execution
SCRATCHWORLD’s dual-mode design is not just clever.
It is necessary.
Without isolating reasoning from acting, we overestimate agent readiness.
Conclusion — Thinking Is Cheap, Snapping Is Hard
SCRATCHWORLD delivers an uncomfortable but valuable message:
AI agents can design the right program.
They often cannot physically build it.
In Scratch, that gap is visible.
In enterprise automation, it becomes financial risk.
The next wave of progress in multimodal agents will not be about better prompts.
It will be about reliable execution under spatial constraints.
Until then, treat GUI agents as planners — not operators.
And benchmark accordingly.
Cognaptus: Automate the Present, Incubate the Future.