Opening — Why this matters now
AI agents are learning to use computers the way humans do: by looking at screens and clicking things.
In demos, they book flights, fill forms, navigate desktops, and even write code. The narrative is simple: “LLMs can now operate software.”
But here’s the inconvenient question: can they actually build something through a graphical interface?
A recent paper, “See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch,” introduces SCRATCHWORLD, a benchmark that quietly punctures the hype. Instead of browsing websites or retrieving files, agents must construct programs block by block in Scratch — the children’s visual programming environment.
It sounds trivial.
It is not.
Background — From Navigation to Construction
Most existing GUI benchmarks test:
- Clicking links
- Filling forms
- Launching apps
- Navigating menus
These are fundamentally retrieval or navigation tasks.
Scratch is different.
Scratch requires program construction, not retrieval. The agent must:
- Plan logic
- Locate blocks visually
- Drag them precisely
- Snap them into valid structural positions
- Pass runtime tests
This is closer to real-world low-code platforms, enterprise workflow builders, and visual automation tools than to web browsing.
The benchmark includes 83 curated tasks, grouped into four categories:
| Category | Description | Business Analogy |
|---|---|---|
| Create | Build from scratch | Designing a workflow in Zapier/Power Automate |
| Debug | Fix broken logic | Repairing misconfigured automation |
| Extend | Add new features | Modifying ERP flows |
| Compute | Algorithmic logic | Implementing business rules |
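To make the task structure concrete, here is a purely illustrative sketch of what such a task pairs together: a natural-language goal, a category, and a runtime test the finished project must pass. This is not SCRATCHWORLD's actual schema; names and the `bounce_test` check are placeholders.

```python
# Illustrative task record (not SCRATCHWORLD's actual schema): each task
# pairs a natural-language spec with a category and a runtime check that
# the assembled Scratch project must pass.

from dataclasses import dataclass
from typing import Callable

@dataclass
class GuiBuildTask:
    task_id: str
    category: str           # "create" | "debug" | "extend" | "compute"
    instruction: str        # what the agent is asked to build or fix
    runtime_test: Callable  # executes the project and returns pass/fail

def bounce_test(project) -> bool:
    """Hypothetical check: the sprite reverses direction at the stage edge."""
    trace = project.run(steps=100)
    return trace.sprite_bounced()

task = GuiBuildTask(
    task_id="create-07",
    category="create",
    instruction="Make the sprite move and bounce when it hits the edge.",
    runtime_test=bounce_test,
)
```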
Crucially, SCRATCHWORLD separates two capabilities:
- Composite Mode → High-level semantic APIs (logic only)
- Primitive Mode → Low-level drag-and-drop GUI actions
In other words:
Can you design the program? Can you actually assemble it through the interface?
Those are not the same skill.
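A rough sketch of the difference, using made-up action names rather than the paper's actual API: in composite mode the agent only states the semantic edit, while in primitive mode it must ground the same edit in pixel coordinates itself.

```python
# Illustrative contrast (not the paper's actual API): the same edit as a
# composite-style semantic action versus primitive GUI actions.

from dataclasses import dataclass

@dataclass
class DragAction:
    """A primitive drag: press at (x1, y1), release at (x2, y2)."""
    x1: int
    y1: int
    x2: int
    y2: int

# Composite mode: the agent declares the edit; the harness resolves
# coordinates and snapping on its behalf.
composite_action = {"op": "attach_block", "block": "repeat_10", "below": "when_flag_clicked"}

# Primitive mode: the agent must locate the block in the palette and drop
# it at a snap-valid point in the workspace, in raw pixels.
primitive_actions = [
    DragAction(x1=120, y1=340, x2=612, y2=288),  # palette -> workspace, near the snap target
]
```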
Analysis — The Reasoning–Acting Gap
The results are not subtle.
Composite Mode (Logic Only)
Top model (Claude-Sonnet-4.5):
- 78.31% Success Rate
Modern models can reason about Scratch programs.
They understand loops, variables, control flow.
They pass runtime execution tests when GUI mechanics are abstracted away.
So far, so impressive.
Primitive Mode (Real GUI Manipulation)
Same models, now required to drag blocks precisely.
Best result:
- 14.46% Success Rate
Let that sink in.
From ~78% to ~14%.
Below is a simplified contrast:
| Mode | Best SR | What It Measures |
|---|---|---|
| Composite | 78.31% | Logical correctness |
| Primitive | 14.46% | Real-world execution |
This is the reasoning–acting gap.
The models can plan.
They cannot reliably snap.
Deep Dive — Is It Planning or Precision?
The authors went further.
They isolated the problem into a Single-Step Drag Benchmark (60 atomic tasks).
Each task required only one drag-and-drop.
No long-horizon complexity.
Result?
- Best Pass@1: 23.33%
Even one drag fails most of the time.
This eliminates long-horizon reasoning as the culprit.
The bottleneck is pixel-level execution precision.
More specifically:
Endpoint Localization Is the Real Failure Mode
Models often grab the correct block.
They fail at:
“Where exactly should I drop it?”
The Scratch snapping mechanism requires precise spatial alignment. Dropping at the geometric center is not enough.
Endpoint accuracy in some settings remained around 30%.
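Why is this so unforgiving? A minimal sketch, with assumed thresholds rather than Scratch's real snapping logic: a drop only registers if the release point lands within a small radius of a compatible connector, not merely somewhere on the target block.

```python
import math

# Minimal sketch (illustrative tolerance, not Scratch's actual value):
# a drop only snaps if the dragged block's connector lands within a small
# radius of a compatible connection point.

SNAP_RADIUS_PX = 15  # assumed tolerance for illustration

def snaps(drop_xy, connector_xy, radius=SNAP_RADIUS_PX):
    """Return True if the release point is close enough to the connector."""
    dx = drop_xy[0] - connector_xy[0]
    dy = drop_xy[1] - connector_xy[1]
    return math.hypot(dx, dy) <= radius

# A drop at the visual center of a 200px-wide block can still miss the
# connector at its top-left notch by ~90px -> no snap, silent failure.
print(snaps((400, 300), (310, 296)))  # False: off by ~90px
print(snaps((314, 302), (310, 296)))  # True: within tolerance
```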
In long tasks requiring ~9 block modifications on average, small per-step failure rates compound exponentially.
Mathematically, if per-step success is $p$, then for $n$ steps:
$$ P(\text{task success}) = p^n $$
If $p = 0.8$ and $n = 9$:
$$ 0.8^9 \approx 0.13 $$
That aligns disturbingly well with the ~14% primitive-mode success.
This is not accidental.
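A quick back-of-envelope check makes the point, using the $p = 0.8$ and $n = 9$ from the example above; the 90% task-level target in the second half is my own illustrative number.

```python
# Back-of-envelope check of the compounding argument above.

p, n = 0.8, 9
print(p ** n)  # ~0.134: roughly the observed ~14% primitive-mode success

# Flip it around: what per-step reliability does a 90% task-level
# success rate require over 9 dependent steps? (target is illustrative)
target = 0.90
print(target ** (1 / n))  # ~0.988: each drag must succeed ~99% of the time
```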
Perception Isn’t the Problem
Perhaps the models simply can’t see clearly?
The paper tested that too.
A static Visual Perception QA Benchmark (200 samples) evaluated:
- Detecting block connections
- Identifying occluded blocks
- Reading field values
GPT-5 achieved:
- 90.5% perception accuracy
So perception is strong.
But drag success remains ~23% (single-step).
Conclusion:
The failure lies not in seeing, not in planning — but in converting perception into precise motor execution.
This is a closed-loop control problem.
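What would closing the loop look like? A hedged sketch, with placeholder function names rather than any existing agent framework: act, re-observe the workspace, verify the snap actually happened, and correct before moving on.

```python
# Sketch of a closed-loop drag policy (all methods are placeholders,
# not an existing agent framework): act, re-observe, verify, correct.

def closed_loop_drag(agent, target_block, connector, max_attempts=3):
    for attempt in range(max_attempts):
        drop_xy = agent.predict_drop_point(target_block, connector)  # perception -> coordinates
        agent.drag(target_block, drop_xy)                            # motor action
        state = agent.observe()                                      # re-render the workspace
        if state.is_attached(target_block, connector):               # did it actually snap?
            return True
        # Otherwise estimate the residual offset and retry, nudged.
        agent.adjust_from_feedback(state, target_block, connector)
    return False
```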
Why This Matters for Business Automation
Scratch is not the end goal.
Scratch is a proxy.
The same structural properties exist in:
- Low-code enterprise builders
- RPA tools
- Visual workflow automation
- No-code AI orchestration platforms
- ERP configuration panels
If an AI agent cannot reliably perform snap-sensitive drag-and-drop operations:
- It cannot safely automate visual business tooling
- It will silently fail under compounding error
- It cannot be trusted in high-stakes enterprise workflows
Composite-mode benchmarks are flattering.
Primitive-mode benchmarks are sobering.
For operators, this distinction is existential.
Strategic Implications
1. Planning Is Commoditized
Logical reasoning in structured domains is already strong.
This will not be the bottleneck going forward.
2. Execution Policies Are the New Frontier
High-precision interaction policies — not bigger models — are required.
Future improvements will likely come from:
- Snap-aware coordinate modeling
- Closed-loop visual feedback
- Learned motor control policies
- Hybrid symbolic–geometric planning
3. Benchmarks Must Separate Logic from Execution
SCRATCHWORLD’s dual-mode design is not just clever.
It is necessary.
Without isolating reasoning from acting, we overestimate agent readiness.
Conclusion — Thinking Is Cheap, Snapping Is Hard
SCRATCHWORLD delivers an uncomfortable but valuable message:
AI agents can design the right program.
They often cannot physically build it.
In Scratch, that gap is visible.
In enterprise automation, it becomes financial risk.
The next wave of progress in multimodal agents will not be about better prompts.
It will be about reliable execution under spatial constraints.
Until then, treat GUI agents as planners — not operators.
And benchmark accordingly.
Cognaptus: Automate the Present, Incubate the Future.