Blocks are supposed to make programming easier.
That is the whole promise of Scratch: instead of typing syntax, the learner drags colorful blocks, snaps them together, and watches the program run. No semicolons. No import errors. No spiritual damage from invisible whitespace. Very civilized.
Now give that same interface to an AI agent.
The agent can see the screen. It can read the instruction. It can reason about loops, variables, sprites, and conditional logic. It can explain what should happen. In a demo video, this looks dangerously close to competence.
Then it tries to drag a block into place.
And misses.
That small failure is the useful part of “See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch,” which introduces ScratchWorld, a benchmark for multimodal GUI agents in Scratch.[1] The paper’s central lesson is not that AI agents are bad at Scratch. That would be too narrow, and frankly too easy. The sharper lesson is that visual programming exposes a failure chain that many business-facing AI products prefer to hide: the agent can understand the task, plan the right program, and still fail because the interface requires precise physical action.
That is the reasoning–acting gap. It is not a branding problem. It is an engineering problem with a deceptively cheerful interface.
The failure chain begins after the model already understands the task
The common misconception is simple: if multimodal models become better at seeing screens and reasoning about code, GUI automation will naturally follow.
ScratchWorld makes that belief harder to defend.
The benchmark contains 83 curated tasks across four categories: Create, Debug, Extend, and Compute. These categories are not random labels. They reflect different ways learners and agents interact with programs: building from scratch, repairing broken logic, adding new functionality, and implementing computational procedures. A Create task might ask the agent to build an interactive Scratch project from an empty project. A Debug task might ask it to fix a broken Pong paddle. A Compute task might ask it to implement factorial or palindrome logic.
The clever part is not merely the task set. The clever part is that ScratchWorld evaluates agents in two interaction modes:
| Mode | What the agent receives | What the agent must do | What the mode isolates |
|---|---|---|---|
| Composite mode | Structured program representation and high-level semantic APIs | Add, connect, set, or delete blocks through abstract operations | Program reasoning without low-level GUI mechanics |
| Primitive mode | Screenshots plus an indexed element list | Click, type, scroll, and drag-and-drop through GUI actions | Real GUI execution and visual grounding |
Composite mode asks: can the agent design the right Scratch program?
Primitive mode asks: can the agent actually build it through the interface?
That distinction matters because a business does not buy a beautiful internal monologue. It buys a working automation.
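A rough way to picture the two modes is as two action vocabularies for the same intent. The sketch below is illustrative only: the operation names follow the table above, but every class, field, identifier, and coordinate is hypothetical and not taken from the ScratchWorld codebase. It also leaves out the indexed element list that primitive mode provides.

```python
from dataclasses import dataclass


# Composite mode: semantic operations on the program (fields are hypothetical).
@dataclass(frozen=True)
class AddBlock:
    opcode: str


@dataclass(frozen=True)
class ConnectBlocks:
    child_id: str
    parent_id: str


@dataclass(frozen=True)
class SetField:
    block_id: str
    field: str
    value: str


@dataclass(frozen=True)
class DeleteBlock:
    block_id: str


# Primitive mode: low-level GUI actions (fields are hypothetical).
@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class DragAndDrop:
    start: tuple[int, int]
    end: tuple[int, int]


def coordinates_to_predict(action: object) -> int:
    """Count the pixel coordinates the agent must get right for this action."""
    if isinstance(action, DragAndDrop):
        return 4  # start x, start y, end x, end y
    if isinstance(action, Click):
        return 2
    return 0  # semantic operations carry no screen geometry


# The same intent ("attach the move block under the hat block") in both modes.
# Block ids and pixel values are made up for illustration.
intent_composite = ConnectBlocks(child_id="move_10_steps", parent_id="when_flag_clicked")
intent_primitive = DragAndDrop(start=(120, 320), end=(200, 152))

print(coordinates_to_predict(intent_composite))  # 0
print(coordinates_to_predict(intent_primitive))  # 4
```

The composite action names what should be connected. The primitive action must also say where on the screen, and that extra geometry is where the rest of this article spends its time.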
ScratchWorld uses execution-based evaluation through the Scratch VM. The submitted program must pass runtime tests, not merely look plausible. This is important. Visual programming interfaces invite a dangerous kind of fake success: a screenshot may look roughly right while the underlying program state is wrong. The benchmark avoids that by checking functional behavior, such as sprite positions, variable updates, event handling, and output correctness.
So the paper does not simply ask whether the agent can make a program-shaped object. It asks whether the program runs.
A rude but fair standard.
The headline number is not the benchmark; it is the diagnosis
The main result is stark. In composite mode, the best model reaches an overall success rate of 78.31%. In primitive mode, the best result falls to 14.46%.
That is not a small degradation. That is a collapse.
| Evidence | What it directly shows | Business meaning | Boundary |
|---|---|---|---|
| Best composite-mode success reaches 78.31% | Strong models can solve many Scratch program-construction tasks when GUI mechanics are abstracted away | Reasoning over visual-program logic is already usable in constrained settings | This does not prove robust general software engineering competence |
| Best primitive-mode success reaches 14.46% | The same class of task becomes much harder when solved through low-level GUI actions | Product demos that skip execution mechanics overstate readiness | The result is strongest for snap-sensitive, drag-and-drop interfaces |
| Debug tasks degrade less than Create tasks | Localized fixes are easier than long construction sequences | AI assistance may be safer for targeted edits than full autonomous building | This depends on how much precise manipulation the task requires |
| Extend and Compute remain difficult across modes | Some tasks require both reasoning and extensive manipulation | Not all failures are just “bad mouse control” | Complex logic still matters |
The practical interpretation is not “models cannot reason.” In fact, the composite-mode result says the opposite. The models often can reason well enough when given a semantic action layer.
The failure appears when abstract intention must become interface-level motion.
That is why the paper should not be read as a benchmark leaderboard. Leaderboards tempt readers into asking which model is best. The more useful question is where the workflow breaks. ScratchWorld’s answer is: after planning, during execution, especially at the moment of snapping blocks into valid positions.
The agent sees. The agent plans. The agent drops badly.
One failed drag is enough to ruin a beautiful plan
A long-horizon Scratch task might require many block changes. If an agent fails, one easy explanation is that the full task is simply too long. Perhaps the model loses track of state. Perhaps it makes a reasoning mistake in the tenth step. Perhaps Scratch is just an awkward environment.
The authors test that possibility with a Single-Step Drag Benchmark.
This diagnostic benchmark contains 60 atomic drag-and-drop tasks. Each task requires only one drag operation, covering direct block connection and slot insertion. This is not the full ScratchWorld task anymore. It is the action stripped down to its small, stubborn core: pick up the right thing and put it in the right place.
The best Pass@1 result is only 23.33%.
That number matters because it changes the interpretation of the main benchmark. The primitive-mode collapse is not just a long-horizon planning problem. Even the single mechanical step is unreliable.
The paper then separates start-point and endpoint accuracy. This is where the story becomes more specific. Models are often better at locating where to grab than where to drop. When ground-truth start positions are provided, start-component accuracy can become nearly perfect for stronger models, but end-component accuracy remains low. For example, under the GT Start setting, GPT-5 reaches 99.44% start-component accuracy, while its end-component accuracy remains 30.17%. Qwen3-VL reaches 100.00% start-component accuracy, while its end-component accuracy remains 32.22%.
The bottleneck is not merely “find the block.” It is “find the valid drop target.”
That is a different problem.
Scratch blocks do not connect because the cursor is emotionally close to the right region. They connect when the dragged object lands in a valid snapping region determined by the interface. The geometric center of a block may be a bad drop point. A valid endpoint may depend on whether the agent is stacking a command block, inserting a reporter into a slot, nesting inside a C-shaped control block, or attaching a stack above another stack. The interface has structure, and the structure is spatial.
This is why endpoint localization is such a revealing phrase. It sounds minor, but it is the place where symbolic reasoning meets geometry. The model may know that block A should go inside block B. The GUI still asks: exactly where should the mouse release happen?
A planning system can answer the first question. A reliable operator must answer both.
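The split between grabbing and dropping can be sketched as two separate hit-tests. This is a simplified reading, not the paper's scoring code: the rectangular regions, the class and function names, and every coordinate below are assumptions made for illustration. Real snap regions in Scratch need not be rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def center(self) -> tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


@dataclass(frozen=True)
class DragAttempt:
    start: tuple[float, float]  # where the mouse button goes down
    end: tuple[float, float]    # where the mouse button is released


def score_drag(attempt: DragAttempt, grab_region: Rect, snap_region: Rect) -> dict[str, bool]:
    """Score the grab and the drop separately, then the drag as a whole."""
    start_ok = grab_region.contains(*attempt.start)
    end_ok = snap_region.contains(*attempt.end)
    return {"start_ok": start_ok, "end_ok": end_ok, "success": start_ok and end_ok}


# Made-up geometry: the valid drop zone is a small strip under the target block,
# not the middle of the target block itself.
target_block = Rect(left=200, top=100, right=360, bottom=140)
snap_region = Rect(left=200, top=140, right=230, bottom=165)
grab_region = Rect(left=40, top=300, right=200, bottom=340)

# A naive policy: grab the source block at its center, release at the target's center.
naive = DragAttempt(start=grab_region.center(), end=target_block.center())
print(score_drag(naive, grab_region, snap_region))
# {'start_ok': True, 'end_ok': False, 'success': False}
```

The naive attempt grabs the right block and still fails, because the center of the target is not where the interface accepts a drop.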
The compounding-error problem is brutal and boring
The paper reports that ScratchWorld tasks require an average of 9.14 block modifications. That average matters because even moderately unreliable actions collapse when chained.
If each action succeeds with probability $p$, and a task needs $n$ independent successful actions, then a crude approximation of task success is:
$$P(\text{task success}) \approx p^{n}$$
This is not the paper’s evaluation formula. It is a simple way to reason about operational fragility.
If a GUI agent succeeds at each required manipulation 80% of the time, and the task requires nine such manipulations, the full-task probability is:
$$0.8^{9} \approx 0.134$$
That is already close to the primitive-mode success range. If per-step reliability drops further, the task becomes hopeless very quickly.
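The arithmetic is easy to check, and easy to run in reverse. The short sketch below uses the same independence assumption as the text, which is a simplification and not the paper's evaluation method. The function names are illustrative.

```python
def chained_success(p: float, n: int) -> float:
    """Probability that n independent steps all succeed, each with probability p."""
    return p ** n


def required_step_reliability(target: float, n: int) -> float:
    """Per-step success rate needed for an n-step task to succeed with probability `target`."""
    return target ** (1 / n)


for p in (0.95, 0.90, 0.80, 0.70):
    print(f"per-step p={p:.2f}, 9 steps -> task success ~ {chained_success(p, 9):.3f}")
# The p=0.80 row prints about 0.134.

print(f"{required_step_reliability(0.90, 9):.3f}")  # about 0.988
```

Read in reverse, the numbers are less forgiving than they look: a nine-step task that should succeed 90% of the time needs each step to succeed almost 99% of the time.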
This is the unglamorous reason execution quality matters. A workflow can tolerate a model being slightly verbose. It cannot tolerate an operator that misses roughly one mechanical step in five. In business automation, that is not “agentic behavior.” That is a support ticket generator with a nice logo.
Static perception is strong; closed-loop action is not
A second tempting explanation is that the models simply cannot see the Scratch interface clearly enough. Maybe the blocks are visually confusing. Maybe occlusion, small text, or connection shapes cause perception failures.
The paper tests this too.
Its Visual Perception QA Benchmark contains 200 manually curated samples across three dimensions: connection detection, block existence, and field-value reading. These questions check whether models can identify whether blocks are connected, whether blocks are visible under clutter or occlusion, and whether parameter fields contain the right values.
GPT-5 reaches 90.5% overall accuracy on this static perception benchmark.
That does not make it good at dragging.
This is one of the paper’s most useful diagnostic results. It separates static visual understanding from dynamic GUI control. A model can answer questions about the interface and still fail to manipulate it. Seeing a valid target is not the same as moving toward it, adjusting for snapping behavior, and verifying that the final program state changed correctly.
For business readers, this should sound familiar. Many AI tools already look impressive when asked to describe a screen, summarize a dashboard, or infer the next step. But operational value depends on whether the system can execute the step, detect that it executed correctly, and recover when it did not.
Perception is a prerequisite. It is not the product.
The appendix tests are not side quests; they narrow the failure mode
The appendices are useful because they prevent a lazy reading of the paper.
Appendix A reports an ablation on primitive-mode observation representations. The authors compare screenshots-only inputs with element-list-plus-screenshot inputs on 24 representative samples. The result is not a clean victory for either extreme. Structured element lists help models interpret complex GUI environments, but index-based drag-and-drop can also cause imprecise positioning because it may target the center of a destination block rather than a valid snap point.
That is an implementation detail with a larger lesson: adding structure to the observation does not automatically solve action grounding. Metadata can help the model know what exists, while still failing to tell it where the cursor should release.
Appendix I breaks the single-step drag benchmark into direct connection and slot insertion. This is a robustness and diagnostic extension, not a second thesis. It shows that different drag interactions fail differently. Slot insertion can show higher endpoint component accuracy in some settings, while direct connection has its own alignment difficulties. The important point is not that one interaction type is universally harder. The point is that “drag-and-drop” is not one skill. It is a family of spatial operations with different valid regions.
Appendix J explains the BFS-based feasible-region computation. This matters because the benchmark does not rely on vague visual judgment. It computes valid start and end regions by checking whether candidate points produce the intended Scratch VM state. In other words, the benchmark treats the GUI as a functional system, not as a screenshot puzzle.
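The general idea of a BFS-based feasible region can be sketched as a flood fill: start from a point known to work, expand to neighboring points, and keep only those where the drop still produces the intended state. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the grid step, the four-neighbor expansion, and the toy validity check are all hypothetical. In the paper the check is tied to the resulting Scratch VM state, which is replaced here by a simple rectangle test.

```python
from collections import deque
from typing import Callable

Point = tuple[int, int]


def feasible_region(
    seed: Point,
    is_valid: Callable[[Point], bool],
    bounds: tuple[int, int],
    step: int = 1,
) -> set[Point]:
    """Breadth-first flood fill over screen points that pass `is_valid`."""
    width, height = bounds
    if not is_valid(seed):
        return set()
    region = {seed}
    seen = {seed}  # every point already tested, valid or not
    queue = deque([seed])
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            nxt = (x + dx, y + dy)
            if nxt in seen:
                continue
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            seen.add(nxt)  # test each point once; a real check may be expensive
            if is_valid(nxt):
                region.add(nxt)
                queue.append(nxt)
    return region


def toy_snap_check(point: Point) -> bool:
    """Stand-in for 'dropping here yields the intended program state'."""
    x, y = point
    return 10 <= x <= 14 and 20 <= y <= 22


region = feasible_region(seed=(12, 21), is_valid=toy_snap_check, bounds=(100, 100))
print(len(region))  # 15
```

The useful property is that the region is defined by outcomes. A point belongs to it because dropping there works, not because it looks close.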
Appendix K introduces heuristic hints for coordinate prediction. These hints tell models where to grab and where to drop for different block operations: align with a block’s left edge, offset below a target’s bottom edge, position inside C-block mouths, or place reporter blocks at the center of input slots. These heuristics improve results but do not fully solve the problem. That supports a restrained interpretation: explicit rules help, especially for narrow operations, but current agents still lack robust execution policies.
| Test or appendix item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main ScratchWorld results | Main evidence | There is a large gap between program reasoning and GUI execution | That all GUI tasks fail in the same way |
| Single-Step Drag Benchmark | Diagnostic isolation | Even one drag-and-drop action is unreliable | That long-horizon reasoning never matters |
| Visual Perception QA | Alternative-explanation test | Static perception is not the dominant bottleneck | That perception is solved in all dynamic settings |
| Observation ablation | Implementation sensitivity test | Structured metadata helps interpretation but can still misground action | That element lists are always the best interface |
| Drag-type breakdown | Robustness and decomposition | Drag-and-drop contains multiple spatial subskills | That one universal coordinate heuristic will solve all snapping |
| BFS feasible-region computation | Evaluation validity mechanism | Success is tied to functional GUI state, not visual guesswork | That the benchmark covers every possible GUI environment |
This is the difference between a useful benchmark and a theatrical leaderboard. A theatrical leaderboard says model A beats model B. A useful benchmark tells you which subsystem to fix.
ScratchWorld points at the execution layer.
The business lesson is not “use Scratch for enterprise automation”
Scratch is not Salesforce. It is not Power Automate. It is not UiPath. It is not an ERP configuration panel maintained by three consultants and one extremely tired analyst.
But Scratch shares a structural property with many business tools: the user builds logic through a visual interface. The task is not just navigation. It is construction.
That distinction is important. A web-navigation agent can succeed by clicking links, selecting menus, and filling forms. A construction agent must preserve structural relationships. In Scratch, blocks must connect correctly. In low-code platforms, workflow nodes must route correctly. In BI tools, calculated fields must bind correctly. In automation builders, triggers, actions, filters, and exception paths must be placed and configured in ways that the runtime actually understands.
The direct business inference is therefore limited but valuable:
ScratchWorld does not prove that all enterprise GUI agents will fail. It does show that for drag-and-drop, snap-sensitive, visual-construction environments, reasoning ability alone is a poor proxy for operational reliability.
That is the translation.
The benchmark’s composite mode resembles a semantic API layer. Primitive mode resembles direct GUI operation. The gap between them suggests a product design principle: when possible, do not force an AI agent to operate like a human if the system can expose semantic controls underneath.
A serious AI automation product should prefer:
| Product design choice | Why ScratchWorld supports it |
|---|---|
| Semantic APIs over pixel dragging | Composite mode dramatically outperforms primitive mode |
| Snap-aware control policies | Endpoint localization is the dominant low-level failure |
| Runtime validation after each operation | Execution-based evaluation catches errors that visual plausibility misses |
| Closed-loop correction rather than one-shot action | Static perception does not guarantee successful manipulation |
| Task-specific interaction policies | Drag-and-drop decomposes into different spatial subskills |
| Human review for high-stakes visual edits | Primitive-mode reliability is not yet strong enough for blind autonomy |
The implication is not that GUI agents are useless. The implication is that deployment should match the reliable part of the stack.
Use the model to plan. Use APIs to execute. Use runtime checks to validate. Use the GUI only when you have no better interface, and even then, treat each action as a hypothesis that must be confirmed.
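Treating each action as a hypothesis can be sketched as a small act, check, retry loop. Everything below is an illustrative assumption: the function name, the callback signatures, the retry limit, and the toy editor that stands in for a real GUI. It shows the control flow only, not any particular product's API.

```python
from typing import Callable


def execute_with_validation(
    action: Callable[[], None],
    state_changed_as_intended: Callable[[], bool],
    undo: Callable[[], None],
    max_attempts: int = 3,
) -> int:
    """Run an action, confirm it against runtime state, and retry on failure.

    Returns the number of attempts used. Raises RuntimeError if the intended
    state is never confirmed, so the caller can escalate to a human.
    """
    for attempt in range(1, max_attempts + 1):
        action()
        if state_changed_as_intended():
            return attempt
        undo()  # roll back before trying again
    raise RuntimeError(f"action not confirmed after {max_attempts} attempts")


class FakeEditor:
    """Toy stand-in for a GUI: the first two drags miss, the third one snaps."""

    def __init__(self) -> None:
        self.drags = 0
        self.connected = False

    def drag(self) -> None:
        self.drags += 1
        self.connected = self.drags >= 3

    def undo(self) -> None:
        self.connected = False


editor = FakeEditor()
attempts = execute_with_validation(editor.drag, lambda: editor.connected, editor.undo)
print(attempts)  # 3
```

The check reads program state, not the screenshot. That is the same choice the benchmark makes when it evaluates through execution instead of appearance.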
Yes, this is less magical than “the agent uses the computer like a human.”
It is also how systems survive contact with real work.
Planning is increasingly cheap; dependable operation is still expensive
The paper’s results suggest a useful separation for AI product teams.
Reasoning over structured tasks is becoming easier to obtain. Composite-mode results show that strong models can solve a substantial share of ScratchWorld when the interaction layer gives them semantic operations. This is not trivial. Scratch programs still require variables, loops, events, sprite behavior, and functional tests. The models are doing meaningful reasoning.
But once the problem becomes low-level action, more reasoning does not automatically convert into reliability.
This matters for ROI. Many AI automation pitches quietly assume that intelligence is the scarce ingredient. Add a stronger model, and the workflow will work. ScratchWorld suggests that in visual-construction workflows, the scarce ingredient may instead be execution infrastructure: adapters, validators, interface abstractions, and recovery loops.
That changes the cost model.
A company trying to automate a low-code workflow builder should not spend all its effort choosing the most eloquent model. It should ask whether the platform exposes a semantic API. It should test whether actions can be validated through runtime state. It should identify which GUI operations require pixel-level precision. It should measure per-step reliability, not just final demo success.
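Measuring per-step reliability does not require much machinery. A minimal sketch, assuming the agent's actions are already logged as pairs of action type and a success flag confirmed against runtime state: the log format, the function name, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def per_step_reliability(log: list[tuple[str, bool]]) -> dict[str, float]:
    """Success rate per action type from a log of (action_type, succeeded) pairs."""
    successes: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for action_type, succeeded in log:
        totals[action_type] += 1
        successes[action_type] += int(succeeded)
    return {action_type: successes[action_type] / totals[action_type] for action_type in totals}


# Made-up log: clicks and typing are fine, drags are the weak link.
sample_log = [
    ("click", True),
    ("click", True),
    ("drag", False),
    ("drag", True),
    ("drag", False),
    ("type", True),
]

rates = per_step_reliability(sample_log)
for action_type, rate in sorted(rates.items()):
    print(f"{action_type}: {rate:.2f}")
```

A table like this says which operation to move behind an API first, which a single end-to-end success rate cannot.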
The boring measurement is the strategic one.
Where the result applies, and where it should not be overextended
The paper’s strongest claim applies to Scratch-style program-by-construction through a GUI. It is especially relevant when tasks involve drag-and-drop, snapping, block connection, slot insertion, and long sequences of structural edits.
It applies less directly to tasks where GUI actions are mostly navigational: opening pages, clicking buttons, reading dashboards, or filling forms. Those tasks can still fail, but their failure modes are different. A button click has a larger target and a more binary outcome. A snapped program block has geometry, hierarchy, and runtime meaning.
The result also does not imply that planning is solved. Extend and Compute tasks remain challenging, and some failures still involve logic. The point is narrower: in this benchmark, the dramatic gap between composite and primitive modes shows that execution mechanics are a major bottleneck even when reasoning is comparatively strong.
Finally, ScratchWorld is a benchmark, not a production deployment study. It evaluates agents under controlled task definitions, specific prompts, a local Scratch GUI, and a designed action space. That is exactly what makes the diagnosis clean, but it also means the numbers should not be copy-pasted into enterprise ROI spreadsheets. Please do not tell a CFO that all visual automation has a 14.46% success rate. That would be numerically precise and strategically unserious.
The better use is architectural: separate reasoning from execution, test the execution layer directly, and do not infer operational readiness from planning ability.
The next GUI-agent breakthrough may look less like a bigger model and more like a better hand
ScratchWorld’s uncomfortable finding is that multimodal GUI agents can be conceptually right and operationally wrong.
That is a serious distinction. A wrong plan is easy to blame on the model. A right plan executed badly is harder to diagnose because the product still sounds intelligent. It may explain the correct solution while failing to produce it. That is the most annoying kind of assistant: articulate, confident, and mechanically clumsy.
The paper therefore points toward a more mature view of agentic AI. We should stop treating “can reason about the interface” and “can operate the interface” as the same capability. They are coupled, but they are not identical. In visual programming and low-code environments, the coupling runs through a fragile spatial layer where small errors compound.
The practical future may not be an AI agent moving a cursor around like a tiny office intern trapped in a browser. It may be a hybrid system: multimodal reasoning for interpretation, semantic APIs for manipulation, snap-aware controllers for unavoidable GUI operations, and runtime validators that check whether the world changed as intended.
Less cinematic. More useful.
ScratchWorld’s message is simple enough to be useful and precise enough to be uncomfortable:
The agent can think in blocks.
It just cannot reliably drop them yet.
Cognaptus: Automate the Present, Incubate the Future.
[1] Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, and Xiangfeng Wang, “See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch,” arXiv:2602.10814, 2026.