When Images Pretend to Be Interfaces: Stress‑Testing Generative Models as GUI Environments

Screenshots are easy to love. They sit still, look polished, and ask very little from the viewer. Interfaces are less polite. Click one wrong icon, place a menu twenty pixels away from where it belongs, blur one label, or forget what happened three screens ago, and the whole interaction becomes decorative theatre.

That difference matters because image generation models are increasingly discussed as possible GUI environments: cheap, flexible simulators where autonomous agents could practice app workflows without needing every real device, app state, account, and edge case wired into a conventional simulator. In theory, a model sees the current screen and a user action, then generates the next screen. Repeat this long enough, and perhaps we get scalable synthetic training data for GUI agents.

GEBench asks the awkward question hiding behind that idea: can current image generation models actually behave like interfaces, or are they merely producing convincing interface-shaped images?¹

The paper’s answer is useful precisely because it is not a blanket dismissal. The models are not hopeless. They are often good at local, one-step GUI transformations. But when the task requires temporal consistency, precise grounding, text fidelity, or multi-step logic, the gap between “looks right” and “works right” becomes visible. It turns out that a beautiful fake interface is still fake. This is not shocking. It is just inconvenient.

The main evidence is a gap, not a leaderboard

GEBench evaluates image generation models as GUI environments using 700 curated samples across five task types. The design is simple enough to understand, but strict enough to expose the failure modes that ordinary image benchmarks tend to miss.

Task type	What the model must do	Why it matters for GUI agents
Single-step visual transition	Generate the next GUI state from a current screen and a specific action	Tests whether the model can perform a local UI change
Multi-step planning	Generate a five-step GUI trajectory from a high-level objective	Tests whether the model can preserve state and logic over time
Fictional app	Generate a plausible unseen app interface from instructions alone	Tests out-of-distribution layout and interface imagination
Real app rare trajectory	Generate less common real-app interaction flows	Tests reasoning beyond frequent visual patterns
Grounding-based generation	Use normalized coordinates to generate the correct next state	Tests spatial precision, not just visual plausibility

This task split is the paper’s strongest editorial asset. It prevents the usual benchmark flattening, where one impressive aggregate score quietly hides the fact that the model can do one thing well and another thing badly. A GUI environment is not a wallpaper contest. The important question is not whether the screen looks professional. The question is whether the generated state is the correct consequence of the user’s action.

On the main table, the best commercial models perform strongly on single-step transitions. Nano Banana Pro scores 84.50 on the Chinese single-step subset and 84.32 on the English single-step subset. GPT-image-1.5 scores 83.79 and 80.80 on the same two single-step subsets. These are not weak results. For a narrow local edit, current image generators can often map instruction plus screen context into a plausible next screen.

Then the task gets longer.

Multi-step scores drop meaningfully. Nano Banana Pro remains the strongest overall, with 68.65 on Chinese multi-step and 69.51 on English multi-step, but many models fall far below their single-step performance. GPT-image-1.5 drops from 83.79 to 56.97 on the Chinese subset and from 80.80 to 58.87 on the English subset. Nano Banana falls from 64.36 on Chinese single-step to 34.16 on Chinese multi-step, although it does better on English multi-step at 50.75. Open-source models show the weakness more sharply: Bagel scores 13.45 on Chinese multi-step and 8.61 on English multi-step; Longcat-Image scores 12.75 and 8.44.
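The size of that drop is simple arithmetic on the scores quoted above. A minimal Python sketch, using only the single-step and multi-step pairs reported in this section for the two strongest commercial models (the dictionary layout and the `drop` helper are illustrative names, not anything from the paper):

```python
from typing import Tuple

# Scores quoted in the text: (single-step, multi-step) per language subset.
REPORTED = {
    "Nano Banana Pro": {"zh": (84.50, 68.65), "en": (84.32, 69.51)},
    "GPT-image-1.5": {"zh": (83.79, 56.97), "en": (80.80, 58.87)},
}


def drop(single: float, multi: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop as a percentage)."""
    absolute = single - multi
    relative = 100.0 * absolute / single
    return round(absolute, 2), round(relative, 1)


for model, subsets in REPORTED.items():
    for lang, (single, multi) in subsets.items():
        abs_drop, rel_drop = drop(single, multi)
        print(f"{model:16s} {lang}: -{abs_drop:.2f} points ({rel_drop:.1f}% relative)")
```

On the Chinese subset this works out to a drop of about 16 points for Nano Banana Pro and about 27 points for GPT-image-1.5, which is why the gap is more informative than the ranking.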

The pattern is more important than any single model ranking. Single-step GUI imitation is becoming workable. Multi-step GUI simulation is still fragile.

That distinction should shape how businesses interpret the paper. If a company wants generated GUI states for mockups, testing simple screen transitions, or creating narrow UI variants, the evidence is encouraging. If the company wants a synthetic environment where agents can learn long workflows safely and reliably, the evidence is much less comforting. The interface may look alive, but it has a short memory and poor motor control. A charming intern, perhaps. Not yet an operations platform.

GE-Score separates visual polish from functional behavior

The authors introduce GE-Score to evaluate generated GUI states across five dimensions: Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality. Each dimension is scored on a 0–5 ordinal scale, then normalized into an aggregate score.

This is not just metric decoration. The five dimensions map to different operational requirements.

GE-Score dimension	Plain meaning	Business interpretation
Goal Achievement	Did the generated state satisfy the requested action or objective?	Can this output serve as a valid next step in a workflow?
Interaction Logic	Does the transition follow realistic GUI behavior?	Would an agent trained on this learn plausible software behavior?
Consistency	Are unchanged regions preserved across states?	Does the synthetic environment avoid state drift?
UI Plausibility	Does the interface structure look native and coherent?	Could humans or agents parse the layout as software?
Visual Quality	Is the image clear, readable, and artifact-free?	Is the screen usable at the perception layer?

The important move is that Visual Quality is only one dimension. That sounds obvious until one remembers how many AI demos still depend on “look, the screen is beautiful” as if beauty were a substitute for causality.

A generated checkout page may look crisp while changing product names, losing the selected option, hallucinating a button, or placing a pop-up menu in the wrong area. A model may render a familiar app style while failing to preserve the state that matters for the next action. GE-Score forces those errors into view.
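A small sketch shows how a multi-dimensional score exposes this kind of error. It assumes the aggregate is an unweighted mean of the five 0–5 ratings rescaled to 0–100; the paper's exact normalization may differ, and the dimension keys and the `aggregate` function are hypothetical names chosen for illustration.

```python
from typing import Mapping

# Hypothetical short keys for the five GE-Score dimensions.
DIMENSIONS = ("goal", "logic", "consistency", "ui", "visual")


def aggregate(scores: Mapping[str, int], max_score: int = 5) -> float:
    """Rescale the mean of the five ordinal ratings to a 0-100 range.

    Assumption: equal weights. The benchmark's real aggregation may differ.
    """
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= max_score:
            raise ValueError(f"{name} must be between 0 and {max_score}")
    total = sum(scores[name] for name in DIMENSIONS)
    return 100.0 * total / (max_score * len(DIMENSIONS))


# A crisp, native-looking screen that is nevertheless the wrong next state.
polished_but_wrong = {"goal": 1, "logic": 1, "consistency": 2, "ui": 5, "visual": 5}
print(aggregate(polished_but_wrong))
```

Under these assumptions, perfect marks on UI Plausibility and Visual Quality cannot rescue a screen that misses the goal: the low functional ratings pull the aggregate down, and the per-dimension breakdown shows where the failure sits.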

This is where the paper earns its relevance for GUI agents. Agent training does not only need attractive screenshots. It needs state transitions that are useful enough to become experience. If the generated environment teaches an agent that clicking one icon sometimes changes an unrelated widget, or that coordinates are approximate suggestions, the agent is not being trained. It is being gently gaslit.

Grounding failure is the most practical warning sign

The grounding task is especially important because real GUI work is spatial. Agents click, tap, drag, hover, scroll, and select. A workflow is not just a sequence of semantic intentions; it is a sequence of spatially located actions.

GEBench’s grounding results are weak in exactly the way a business user should care about. The paper reports that even the top-performing Nano Banana Pro reaches only 23.9% on the GOAL score for grounding, while most other models fall below 10% on that specific measure. In the main table, grounding task scores remain noticeably constrained even when overall visual quality is high. The model may understand what should appear, but not reliably where it should appear.

That distinction is easy to underestimate. In natural image generation, a small spatial offset can be acceptable. If a tree appears slightly to the left, the picture still works. In a GUI, a small offset can break the action. A context menu that appears near the wrong item is not a harmless aesthetic variation. It changes the meaning of the state.
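The sensitivity to small offsets can be made concrete with a hit test. This is a sketch under stated assumptions: normalized coordinates are taken to lie in [0, 1] with the origin at the top-left corner, and the screen size, the `Box` class, and the icon position are invented for illustration rather than drawn from the benchmark.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixels, origin at the top-left corner."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def to_pixels(nx: float, ny: float, width: int, height: int) -> Tuple[float, float]:
    """Convert normalized (x, y) in [0, 1] to pixel coordinates."""
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    return nx * width, ny * height


def hits_target(nx: float, ny: float, width: int, height: int, target: Box) -> bool:
    """True if the normalized action point lands inside the target box."""
    x, y = to_pixels(nx, ny, width, height)
    return target.contains(x, y)


# Hypothetical 1080x2400 phone screen with a small icon near the top-right corner.
ICON = Box(left=950, top=120, right=1020, bottom=200)
print(hits_target(0.89, 0.07, 1080, 2400, ICON))  # intended action point
print(hits_target(0.87, 0.07, 1080, 2400, ICON))  # same point shifted by 0.02
```

A shift of 0.02 in normalized x is roughly 22 pixels on this screen, which is enough to move the action off the icon. An offset that would be invisible in a landscape image changes which element the action addresses.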

The qualitative analysis makes this concrete. The paper identifies three bottlenecks: text rendering, icon interpretation, and localization precision. These are not random defects. They are the load-bearing parts of interface behavior.

Text is not texture. A label, price, warning, username, or menu item must be exact enough to guide the next action. Icon interpretation is not visual ornament. The icon is often the action boundary. Localization is not composition. It is the connection between input and consequence.

In other words, the hard part of generative GUI environments is not making software-shaped images. It is preserving the functional contract between what the user does and what the screen should become.

The visual-functional paradox is the real business lesson

The paper’s discussion names a useful paradox: visual fidelity does not equal functional plausibility. Some models can produce polished screens while still hallucinating widgets, distorting icons, drifting layout elements, or generating transitions that no real app would produce.

For business readers, this is where the discussion stops being a model benchmark and becomes a procurement warning.

A generated interface can be valuable in at least three different ways:

It can support design ideation, where visual plausibility is enough to explore layout options.
It can support testing augmentation, where generated states expose edge cases or help evaluate perceptual robustness.
It can support agent environment simulation, where generated states must behave like reliable software.

GEBench suggests that current models are much closer to the first two uses than the third. That is not a failure of the technology. It is a boundary condition. And boundary conditions are useful because they keep expensive enthusiasm from wandering into production under a fake moustache.

The danger is not that businesses will use generated GUIs. They probably should, in controlled ways. The danger is using them as if they were already faithful simulators.

A synthetic GUI environment used for design brainstorming can tolerate ambiguity. A synthetic GUI environment used for training an autonomous workflow agent cannot. The latter requires temporal coherence, stable affordances, correct text, and precise spatial causality. GEBench shows that those properties remain uneven.

What the experiments support, and what they do not

The paper’s experimental structure is worth reading carefully because not every result plays the same role.

Evidence component	Likely purpose	What it supports	What it does not prove
Main GEBench table across Chinese and English subsets	Main evidence	Model performance differs sharply across task types; single-step strength does not guarantee multi-step or grounding reliability	Real-world deployment failure rates in specific enterprise software
Radar chart across task suites	Main evidence / interpretation aid	The weakness is structural across task categories, not only one isolated bad example	Exact causal mechanism inside model architectures
Grounding GOAL comparison	Main evidence	Coordinate-based GUI generation is a major bottleneck	That all spatial failures are equally severe in every app domain
Human–VLM correlation test	Evaluation validity check	VLM judging aligns strongly with sampled human expert judgments in their setup	That automated evaluation is universally reliable for all future GUI benchmarks
Three-judge appendix tables	Robustness / cross-check	Results are not solely dependent on a single judge’s scoring preferences	That judge choice has no effect on model ordering or interpretation
Detailed rubrics	Implementation detail / reproducibility support	The scoring criteria are decomposed into observable dimensions	That the rubric captures every business-specific requirement

The human-alignment check is particularly useful but should not be overread. The paper reports a very high Pearson correlation between VLM-based evaluation and human expert scores on a sampled validation set: $r = 0.9892$ overall, with $r = 0.9926$ for Nano Banana Pro and $r = 0.9833$ for GPT-Image-1. This supports the use of VLM-as-a-judge in this benchmark setting.

It does not mean every automated GUI evaluation is now solved. The sample is limited to selected models and tasks under the authors’ rubric. This test is best read as a validity check on the benchmark pipeline, not as a second thesis about replacing human evaluation everywhere. The appendix is doing the unglamorous but necessary work of reducing measurement doubt. Thank it quietly and move on.
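For readers who want to see what a Pearson correlation does and does not capture, here is a minimal standard-library sketch. The score lists are made-up numbers for illustration only; they are not data from the paper, and `pearson_r` is simply the textbook formula.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings: the automated judge is always one point more generous.
human = [1.0, 2.0, 3.0, 4.0]
judge = [2.0, 3.0, 4.0, 5.0]
print(pearson_r(human, judge))
```

The two raters here correlate almost perfectly even though they never assign the same score. A high coefficient shows that the judge ranks and spaces outputs the way humans do; it does not by itself show that the absolute scores match, which is one more reason to treat the result as a pipeline check rather than a general verdict.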

Why multi-step GUI simulation is harder than it looks

The intuitive explanation for the multi-step weakness is error accumulation. A small visual deviation in step one becomes a state mismatch in step two. By step five, the trajectory may still look like an app, but it is no longer the intended app state.

That is true, but incomplete.

The deeper issue is that GUI workflows are not just images over time. They are state machines with visual surfaces. A user action changes hidden state, visible state, available actions, and sometimes the rules for the next step. A model that mainly learns visual correspondences can imitate the surface but still fail to preserve the state logic underneath.

Consider ordering coffee in a mobile app, the kind of high-level objective used to motivate the multi-step setup. A plausible trajectory requires the model to preserve product choice, quantity, customization, cart state, checkout progress, and screen structure. If it forgets the selected drink, changes the button labels, or reverts to an earlier page while maintaining a visually clean interface, the image may still impress a casual viewer. It will not train a reliable agent.
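The coffee example can be written down as a tiny state machine, which makes the hidden state explicit. This is only an illustrative sketch: the screens, actions, and field names are invented here, and a real app would track far more state. The reference transition function plays the role of the real app; a generated screen is then compared against it field by field.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Optional


@dataclass(frozen=True)
class OrderState:
    """Hypothetical app state behind each screen of the ordering flow."""
    screen: str = "menu"
    drink: Optional[str] = None
    quantity: int = 0
    in_cart: bool = False


def expected_next(state: OrderState, action: str, value=None) -> OrderState:
    """Reference transition: what a correct app would do for this action."""
    if action == "select_drink" and state.screen == "menu":
        return replace(state, screen="detail", drink=value, quantity=1)
    if action == "set_quantity" and state.screen == "detail":
        return replace(state, quantity=int(value))
    if action == "add_to_cart" and state.screen == "detail":
        return replace(state, screen="cart", in_cart=True)
    if action == "checkout" and state.screen == "cart" and state.in_cart:
        return replace(state, screen="payment")
    raise ValueError(f"action {action!r} is not available on screen {state.screen!r}")


def drifted_fields(expected: OrderState, observed: OrderState) -> List[str]:
    """Names of the state fields where the observed state departs from the expected one."""
    return [f.name for f in fields(OrderState)
            if getattr(expected, f.name) != getattr(observed, f.name)]


state = OrderState()
trajectory = [("select_drink", "latte"), ("set_quantity", 2),
              ("add_to_cart", None), ("checkout", None)]
for action, value in trajectory:
    state = expected_next(state, action, value)

# A generated payment screen that looks clean but shows a different drink.
generated = replace(state, drink="mocha")
print(drifted_fields(state, generated))
```

The generated screen is on the right page, with the right quantity and cart status, and would pass a visual inspection. The comparison still flags the drink, which is exactly the kind of silent state drift that makes a trajectory useless as training experience.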

This is why the single-step results are not enough. A model can learn many local mappings: click search, show search results; tap profile, show profile page; select item, show detail page. But environment simulation requires the model to maintain constraints across the full trajectory. Local mimicry is not global reasoning. The difference is where automation projects quietly fail.

How businesses should use generative GUIs now

The practical conclusion is not “do not use generative GUIs.” That would be lazy, and worse, unhelpful. The better conclusion is to classify use cases by how much functional correctness they require.

Use case	Current suitability	How to use generated GUIs responsibly
UI concept exploration	High	Generate layout variants, style directions, onboarding flows, and mock states
Synthetic examples for perception testing	Moderate to high	Use them to test whether agents can recognize buttons, labels, and layouts under variation
Edge-case visualization	Moderate	Generate rare screens for human review or semi-automated testing, not as ground truth
Single-step workflow augmentation	Moderate	Use generated next states only when checked against rules or real UI traces
Multi-step agent training environment	Low to moderate	Avoid treating generated trajectories as trusted simulators without validation
Coordinate-sensitive interaction training	Low	Do not rely on generated grounding unless spatial correctness is externally verified

This is also where Cognaptus-style automation work needs discipline. A generated GUI can reduce the cost of diagnosis. It can help teams ask, “Where does our agent fail when the screen changes?” It can support scenario expansion before expensive real-environment testing. It can help non-technical stakeholders see workflow variants faster than a conventional prototype cycle.

But it should not be quietly promoted from “synthetic aid” to “environment of record.” That promotion requires evidence the paper does not provide.

For enterprises, the responsible architecture is hybrid. Use generative models to propose states, diversify screens, and simulate simple transitions. Then constrain them with real traces, UI schemas, app logic, OCR checks, coordinate validation, and human spot review where necessary. The model can create candidate worlds. The system still needs to decide which worlds are lawful.
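One way to picture that hybrid setup is a gate that accepts a generated state only if every external check passes. The sketch below is an assumption-laden illustration, not a prescribed design: `Candidate`, `require_labels`, `require_change_near`, and `validate` are hypothetical names, the labels stand in for text recovered from the image by some OCR step, and the change location stands in for the output of some screen-diffing step.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """A generated next screen, reduced to facts that external tools extracted."""
    labels: FrozenSet[str]               # text read back from the image
    change_center: Tuple[float, float]   # normalized (x, y) of the changed region


# A check returns an error message, or None when the candidate passes.
Check = Callable[[Candidate], Optional[str]]


def require_labels(expected: Iterable[str]) -> Check:
    wanted = set(expected)

    def check(candidate: Candidate) -> Optional[str]:
        missing = sorted(wanted - candidate.labels)
        return f"missing labels: {missing}" if missing else None

    return check


def require_change_near(x: float, y: float, tolerance: float) -> Check:
    def check(candidate: Candidate) -> Optional[str]:
        cx, cy = candidate.change_center
        if abs(cx - x) > tolerance or abs(cy - y) > tolerance:
            return f"change at ({cx}, {cy}) is too far from action at ({x}, {y})"
        return None

    return check


def validate(candidate: Candidate, checks: Iterable[Check]) -> List[str]:
    """Run every check and collect the failures; an empty list means accepted."""
    errors = []
    for check in checks:
        message = check(candidate)
        if message is not None:
            errors.append(message)
    return errors


checks = [require_labels({"Cart", "Checkout"}), require_change_near(0.9, 0.1, 0.05)]
good = Candidate(frozenset({"Cart", "Checkout", "Latte"}), (0.91, 0.12))
bad = Candidate(frozenset({"Cart"}), (0.40, 0.60))
print(validate(good, checks))
print(validate(bad, checks))
```

The generator stays free to propose states, while acceptance is decided by rules the team controls. New constraints, such as schema conformance or comparison against recorded traces, can be added as further checks without touching the model.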

The benchmark is a diagnostic tool, not an ROI calculator

GEBench does not tell a company how much money it will save by using image generation in GUI-agent development. It does not measure enterprise deployment outcomes, downstream agent performance after training on generated screens, or maintenance cost in real app stacks. Those would be different studies.

What it does provide is more foundational: a way to diagnose whether an image generation model is ready to act like a GUI environment.

That diagnostic value is not small. In business automation, bad evaluation is expensive. It lets teams confuse polished demos with robust systems. It rewards models for looking competent in exactly the settings where failure is hardest to see. A benchmark like GEBench makes the failure more inspectable: which task category breaks, which dimension breaks, and whether the break is visual, logical, temporal, or spatial.

The next useful research step would be to connect GEBench-style scores to downstream agent outcomes. For example: if a model improves by ten points on grounding GOAL, does a GUI agent trained or tested against that environment perform better in real apps? If multi-step consistency improves, does it reduce workflow completion failures? That link is where benchmark relevance becomes operational value.

Until then, GEBench should be used as a stress test, not a purchase order.

The useful replacement belief

The misconception worth retiring is simple: if a generated GUI looks realistic, it is probably a useful environment.

A better belief is more demanding: a generated GUI becomes useful as an environment only when its state transitions preserve intent, logic, text, layout, and spatial causality well enough for the downstream task.

That replacement belief is less glamorous, but much safer. It also creates a clearer product roadmap. Better generative GUI systems will need more than prettier rendering. They will need explicit state modeling, stronger coordinate grounding, text treated as symbolic content, and evaluation that punishes broken interaction logic even when the screenshot looks lovely.

This is the value of GEBench. It shifts the question from “Can the model draw software?” to “Can the model behave like software?” The first question is good for demos. The second is necessary for automation.

And as usual, the second question is where the bill arrives.

Cognaptus: Automate the Present, Incubate the Future.

¹ Haodong Li et al., “GEBench: Benchmarking Image Generation Models as GUI Environments,” arXiv:2602.09007, 2026, https://arxiv.org/abs/2602.09007.
