Seeing Is Not Reasoning: Why Mental Imagery Still Breaks Multimodal AI

A model can generate a pretty sequence of images. Good. So can a slide deck.

The harder question is whether those images actually help it think.

That is the uncomfortable point behind “MentisOculi: Revealing the Limits of Reasoning with Mental Imagery,” a new benchmark paper that tests whether frontier multimodal models can do something closer to human mental imagery: form a visual state, keep it stable, transform it step by step, and use the transformed state to decide what to do next.¹ Not merely “look at an image and answer a question.” Not “draw a plausible intermediate picture.” Actual visual reasoning, with consequences.

The distinction matters because the market is drifting toward a convenient assumption: once models can generate intermediate images or videos, they will reason more like humans. Give the model a visual scratchpad, and intelligence will politely arrive.

MentisOculi’s answer is less polite. Current models can often perceive the input. They can sometimes solve a symbolic version of the problem. They can sometimes generate reasonable-looking intermediate visuals. But the pipeline breaks when those pieces must work together. The model sees, draws, narrates, and still fails to use the visual trace as actionable evidence.

This is not just an academic embarrassment. For enterprise teams considering multimodal agents for design review, spatial planning, visual inspection, workflow simulation, robotics-like planning, or UI manipulation, the lesson is simple: visual reasoning traces are not valuable because they look visual. They are valuable only if they improve decisions under test. A beautifully hallucinated state is still a hallucinated state. It just has better lighting.

The missing pipeline: visual state is not visual reasoning

The paper is useful because it refuses to treat “visual reasoning” as one ability. It breaks the problem into a pipeline:

1. Form the state: extract the relevant objects, geometry, constraints, and relationships from the input image.
2. Maintain the state: preserve object identity, shape, position, and lawful constraints across steps.
3. Manipulate the state: apply moves, folds, rotations, placements, or tile shifts without drifting into impossible worlds.
4. Use the state: interpret the updated visual state as evidence for the next action.

Most public demos emphasize the first and third steps. The model sees the scene. The model produces an intermediate image. The demo looks alive. The paper asks the less comfortable question: did the generated visual state actually become part of the reasoning process, or was it just a decorative by-product?

That question is why the benchmark is designed around tasks where visual state tracking should help. MentisOculi includes five families of puzzles:

| Task | What the model must preserve or transform | Why it stresses mental imagery |
| --- | --- | --- |
| Form Board | Shape composition and part–whole fit | The model must compare spatial constraints, not just classify objects |
| Hinge Folding | Connected shapes and hinge rotations | One rotation changes downstream geometry |
| Paper Fold | Reflections, folds, and hole positions | The visible state hides latent spatial structure |
| Rush Hour | Vehicle positions, motion axes, blockers, and legal moves | Planning depends on stepwise state updates |
| Sliding Puzzle | Tile positions and visual coherence | Local moves must restore a global image |

Each task is procedurally generated across five difficulty levels. Difficulty scales with the minimum number of operations required to solve the instance. The initial release contains 30 instances per level per task, producing 750 puzzle instances. More importantly, each instance comes with a ground-truth solution and a ground-truth visual chain of thought: the correct sequence of intermediate visual states.

That last feature is the diagnostic weapon. It lets the authors distinguish between several failure modes that are usually blurred together.

If a model fails with self-generated images, perhaps the images were wrong. If it still fails with oracle visual states, the problem is deeper: it cannot use correct visual evidence. MentisOculi is built to separate those two failures. This is exactly what most shiny multimodal demos avoid doing, for understandable reasons. Nobody wants the magic trick audited frame by frame.
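The difficulty rule described above, where a level is tied to the minimum number of operations, can be sketched for the Sliding Puzzle family. This is a minimal illustration under stated assumptions, not the authors' generator: the 3x3 board, the names `states_at_depth` and `generate_instance`, and the choice to make the level equal the exact move count are all hypothetical. Because moves are reversible, a breadth-first search outward from the solved board gives each state's shortest solution length.

```python
import random

SOLVED = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the blank on a 3x3 board
SIZE = 3


def neighbors(state):
    """Yield every state reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, SIZE)
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < SIZE and 0 <= c < SIZE:
            swap = r * SIZE + c
            nxt = list(state)
            nxt[blank], nxt[swap] = nxt[swap], nxt[blank]
            yield tuple(nxt)


def states_at_depth(depth):
    """Return all states whose shortest solution is exactly `depth` moves."""
    seen = {SOLVED}
    frontier = [SOLVED]
    for _ in range(depth):
        next_frontier = []
        for state in frontier:
            for nxt in neighbors(state):
                if nxt not in seen:  # first visit in BFS = shortest distance
                    seen.add(nxt)
                    next_frontier.append(nxt)
        frontier = next_frontier
    return frontier


def generate_instance(level, rng):
    """Pick a random puzzle whose minimum solution length equals `level`
    (assumption: the benchmark's real level-to-moves mapping may differ)."""
    return rng.choice(states_at_depth(level))


# Release size quoted in the text: 5 tasks x 5 levels x 30 instances.
TASKS, LEVELS, PER_LEVEL = 5, 5, 30
total_instances = TASKS * LEVELS * PER_LEVEL
```

Calling `generate_instance(3, random.Random(0))` returns one board that cannot be solved in fewer than three moves. The same search also yields a ground-truth solution path if parents are recorded, which is the kind of artifact that makes frame-by-frame auditing possible.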

What MentisOculi tests that ordinary image benchmarks do not

Many benchmarks test reasoning about images. A model receives an image and answers a question. That can be valuable, but it does not prove the model can maintain and manipulate a visual representation over time.

MentisOculi targets reasoning with imagery. The difference is not semantic hair-splitting. It changes the evaluation problem.

A static VQA-style task can often be solved by extracting a few facts from an image and converting them into text. A grid task can sometimes be reduced to a symbolic table. A one-step transformation can be guessed from local cues. MentisOculi deliberately pushes against those shortcuts. Its tasks use geometric constraints, off-grid transformations, continuous positions, sequential dependencies, and visual details that are awkward to compress into a short textual representation.

The benchmark is not claiming that humans solve every such task perfectly or that visual imagery is always superior to symbols. The point is narrower: these are controlled tasks where visual state maintenance should be useful, and where failure can be analyzed step by step.

The paper evaluates several model families along a spectrum:

| Model family | Visual reasoning style tested | Practical interpretation |
| --- | --- | --- |
| Text-only multimodal LLMs | The model sees the input but reasons in text | Baseline for “implicit” multimodal reasoning |
| Latent visual reasoning | The model uses visually grounded latent tokens | Tests whether hidden visual scratchpads help |
| Unified multimodal models | The model interleaves generated images and text | Tests explicit image-based visual thoughts |
| Video models | The model produces a visual rollout directly | Tests fully visual state evolution |

This is an unusually useful comparison because it tests a popular assumption from multiple angles. If visual thought helps, there are several places it might appear: latent tokens, generated images, or video rollouts. The paper finds no reliable evidence that any of them currently improves on text-only reasoning for these multi-step visual problems.

That does not mean all approaches are equally bad. Some latent visual reasoning improves mid-level Rush Hour performance. Newer image-generating models produce cleaner and more coherent visual rollouts than older ones. Video models sometimes show meaningful attempts to execute the task. But the central result remains: explicit visual thought does not yet produce robust, reliable gains.

The first break: models often fail before long-horizon planning begins

A tempting interpretation would be: “Of course the models fail. Multi-step planning is hard.”

That is too generous.

The paper reports that, apart from Gemini 3, models often fail to exceed chance reliably even at Level 1 on every task other than Form Board. By Level 5, all models operate at or below chance. The authors also note that sub-chance performance is often driven by early termination and under-use of the available action budget, not merely by making one wrong state transition in an otherwise plausible plan.

This matters because it changes where we locate the bottleneck. The failure is not only “the model cannot plan five moves ahead.” In many cases, the model struggles to extract one valid action from the visual state.

For product teams, that distinction is painful but useful. If a visual agent cannot reliably identify the first lawful move in a controlled puzzle, then adding a longer planning horizon, a prettier rollout, or a more confident explanation is not a fix. It may simply produce a longer invalid procedure.

This is the difference between an agent that says “move panel B before rotating hinge C” and an agent that actually understands that panel B still exists, has the same geometry, and cannot pass through another object. Business workflows are full of these constraints. Warehouses, floor plans, dashboards, UI states, forms, inspection photos, CAD-like layouts, and process diagrams all punish state drift. Reality has a tiresome habit of preserving object identity.
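Whether a score "exceeds chance" with only 30 instances per level is a statistical question, and a team running its own evaluation can check it with an exact binomial tail. The sketch below is one way to do that under stated assumptions: the function name and the chance rate are hypothetical, and the paper's own chance baselines and statistical procedure are not reproduced here.

```python
from math import comb


def p_value_above_chance(correct, total, chance):
    """One-sided exact binomial tail: the probability of seeing at least
    `correct` successes out of `total` if the model were guessing at
    rate `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical example: 12 of 30 correct against an assumed 25% chance rate.
example_p = p_value_above_chance(12, 30, 0.25)
```

A small p-value suggests the score is unlikely under pure guessing. With samples this small, a few lucky instances move the result a lot, which is a good reason not to celebrate a single percentage point.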

The second break: generated images drift into impossible worlds

The paper’s qualitative appendix is where the mechanism becomes concrete. Unified multimodal models do not merely make final-answer mistakes. Their intermediate visual states often lose rule consistency.

Across tasks, generated visual rollouts show familiar pathologies:

objects change shape or identity;
new elements appear;
existing elements disappear;
board layouts change illegally;
motion directions become invalid;
geometry stops obeying the task rules.

In easier Rush Hour cases, Gemini 3-I can produce cleaner intermediate states than Gemini 2.5-I. That is progress. But at higher difficulty, the rollout still drifts: extra exits appear, cars are added or removed, motion constraints become inconsistent, and errors accumulate over steps.

This is a very specific kind of failure. The model is not simply “bad at images.” It can generate plausible images. The problem is lawful continuity. The state after step three must be the consequence of step two, which must be the consequence of step one. A plausible image that violates the transition rules is worse than useless for reasoning; it is false evidence.

This is where the human analogy becomes dangerous. Humans use mental imagery not because inner pictures are decorative, but because they are constrained by the task. When someone mentally folds a paper, the imagined holes do not spontaneously multiply in random places. When someone imagines sliding a tile, the blank square does not teleport to improve the composition. Human imagery is imperfect, but it is often tied to an internal physical model.

Current machine imagery, as tested here, often lacks that binding. It can produce frames without maintaining a simulator.

That distinction should be printed on the wall of every team building visual agents:

A generated visual trace is not a state representation unless it preserves identity, constraints, and transition rules.

Without that, the trace is not reasoning. It is cinematography.
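The requirement that a trace preserve identity, constraints, and transition rules can be turned into a check between consecutive states. The following is a simplified sketch on a discrete grid, which is an assumption: the benchmark itself uses continuous positions, and `Vehicle`, `validate_transition`, and the 6x6 default are hypothetical names and values. It also skips the swept-path check, so a vehicle jumping over a blocker would not be caught here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vehicle:
    name: str
    row: int
    col: int
    length: int
    horizontal: bool

    def cells(self):
        """Grid cells covered by this vehicle."""
        if self.horizontal:
            return {(self.row, self.col + i) for i in range(self.length)}
        return {(self.row + i, self.col) for i in range(self.length)}


def validate_transition(before, after, size=6):
    """Return a list of rule violations between two consecutive states."""
    old = {v.name: v for v in before}
    new = {v.name: v for v in after}
    if old.keys() != new.keys():
        return ["vehicles appeared or disappeared"]

    problems = []
    moved = []
    for name, a in old.items():
        b = new[name]
        if (a.length, a.horizontal) != (b.length, b.horizontal):
            problems.append(f"{name} changed shape or orientation")
        if (a.row, a.col) != (b.row, b.col):
            moved.append(name)
            off_axis = a.row != b.row if a.horizontal else a.col != b.col
            if off_axis:
                problems.append(f"{name} moved off its motion axis")
    if len(moved) > 1:
        problems.append("more than one vehicle moved in a single step")

    occupied = set()
    for v in after:
        cells = v.cells()
        if any(not (0 <= r < size and 0 <= c < size) for r, c in cells):
            problems.append(f"{v.name} left the board")
        if cells & occupied:
            problems.append(f"{v.name} overlaps another vehicle")
        occupied |= cells
    return problems
```

An empty list means the step is at least consistent with these rules. Anything else is a frame that should not be passed along as evidence, however good the lighting.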

The third break: even oracle visuals are not fully used

The strongest part of the paper is not the finding that self-generated visual chains are unreliable. That is almost expected. Image generation is still noisy, and multi-step visual consistency is hard.

The sharper result is the oracle visual chain-of-thought experiment.

The authors replace self-generated intermediate images with ground-truth visualizations. In other words, they remove the generation problem. The model is given correct intermediate visual states and instructed to use them.

Performance improves. That matters. It shows that generation errors are real and costly. On static spatial tasks such as Form Board, oracle visuals can lift performance far above chance. So the answer is not “visual aids never help.”

But the improvement is still not enough on many tasks. The model can fail even when the visual states are correct. That reveals a second failure mode: interpretation error. The model does not reliably convert the correct visual trace into the next decision.

This is the paper’s most important business lesson. Many teams think the problem is representation quality: better diagrams, better rendered states, better video rollouts, better visual explanations. MentisOculi says: not so fast. A correct visualization still has to be read as evidence.

For enterprise systems, this means “human-readable trace” and “model-usable trace” are different objects. A workflow diagram may reassure the manager. It may not improve the agent’s decision. A generated UI screenshot may look like progress. It may not help the model choose the next click. A video rollout may help a human evaluator see the intended plan. It may still fail as an internal control mechanism.
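The two failure modes separated by the oracle experiment can be tallied per instance when the same puzzles are run twice, once with self-generated visuals and once with ground-truth visuals. This is a rough bookkeeping sketch, not the paper's analysis: the category names and the function `failure_breakdown` are hypothetical, and a failure under oracle visuals may have causes beyond interpretation alone.

```python
from collections import Counter


def classify(self_ok, oracle_ok):
    """Label one instance by its outcome under the two conditions."""
    if self_ok and oracle_ok:
        return "solved_either_way"
    if oracle_ok:
        return "rescued_by_oracle"            # points at generation error
    if self_ok:
        return "solved_only_with_own_images"  # worth inspecting by hand
    return "failed_with_correct_visuals"      # points at interpretation error


def failure_breakdown(self_results, oracle_results):
    """Count instances per category; inputs are parallel lists of booleans."""
    if len(self_results) != len(oracle_results):
        raise ValueError("both runs must cover the same instances")
    return Counter(
        classify(s, o) for s, o in zip(self_results, oracle_results)
    )
```

If most failures land in `rescued_by_oracle`, better rendering may help. If they pile up in `failed_with_correct_visuals`, prettier frames will not.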

The paper’s mechanism-first diagnosis can be summarized like this:

| Pipeline stage | What can go wrong | Evidence in the paper | Business meaning |
| --- | --- | --- | --- |
| Visual state formation | The model fails to extract objects, geometry, or constraints | Weak Level 1 performance on several tasks | Perception-plus-captioning is not enough for planning |
| State maintenance | Objects, geometry, or identities drift across steps | Generated rollouts hallucinate pieces, exits, and invalid layouts | Visual traces need state validation |
| State manipulation | Moves, folds, or rotations violate rules | Intermediate images become impossible configurations | Add simulators or constraint checkers, not just images |
| State interpretation | Correct visuals are not converted into correct actions | Oracle visual CoT helps but does not solve many tasks | Better representations still require decision grounding |
| Effort allocation | The model does not spend more reasoning on harder visual problems | Human time rises with difficulty; Gemini 3 token use does not rise from Level 3 to Level 5 | Token budgets are a poor proxy for adaptive reasoning |

That is a more useful takeaway than “models are bad at visual reasoning.” The point is not to mock the models. The point is to locate the leak in the pipe.

Why text sometimes beats visual thought

One of the paper’s more revealing experiments is the Rush Hour transcription test. The authors provide a lossless, simulator-derived textual representation of the Rush Hour state: parking lot size, exit location, vehicle centers, extents, rotations, and legal motion axes. This is not a cute natural-language prompt. It is verbose, precise, and awkward for humans.

Yet Gemini 3 and GPT-5.1 can perform strongly in this symbolic version. That implies the task is not inherently beyond their reasoning capacity. When the visual problem is converted into a structured symbolic representation, the models can solve it.
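To make "verbose, precise, and awkward for humans" concrete, here is a sketch of what such a state record and its text serialization might look like. The field list follows the description above (lot size, exit location, centers, extents, rotations, motion axes), but everything else is an assumption: the class names, the output format, and the rule that a vehicle slides along its rotated long axis are illustrative and not taken from the paper's simulator.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleState:
    name: str
    center: tuple[float, float]
    extent: tuple[float, float]  # full length and width
    rotation_deg: float

    def motion_axis(self):
        """Unit vector the vehicle may slide along (assumed: its long axis)."""
        theta = math.radians(self.rotation_deg)
        return (round(math.cos(theta), 6), round(math.sin(theta), 6))


@dataclass(frozen=True)
class LotState:
    size: tuple[float, float]
    exit_location: tuple[float, float]
    vehicles: tuple[VehicleState, ...]

    def to_text(self):
        """Serialize every field so nothing is left for the model to infer."""
        lines = [
            f"lot size: {self.size[0]} x {self.size[1]}",
            f"exit at: {self.exit_location}",
        ]
        for v in self.vehicles:
            lines.append(
                f"vehicle {v.name}: center={v.center}, extent={v.extent}, "
                f"rotation={v.rotation_deg} deg, axis={v.motion_axis()}"
            )
        return "\n".join(lines)
```

Nobody would enjoy reading the output, which is rather the point: the representation is built for manipulation, not for admiration.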

This is not a victory for “text is all you need.” It is a clue. The model’s reasoning machinery may be adequate once the state is represented in a form it can manipulate reliably. The failure is the bridge between visual perception, visual state maintenance, and action selection.

For business design, this points toward a less glamorous but more reliable architecture:

1. Use visual models to extract state.
2. Convert state into structured objects, constraints, and relations.
3. Use a simulator, solver, rules engine, or verifier to update that state.
4. Let the language model reason over validated state, not over free-floating pixels.
5. Use generated visuals as explanations or inspection surfaces, not as the sole reasoning substrate.

This is not as fashionable as “native visual agentic reasoning.” It is also less likely to drive into a wall because a car vanished between frames.
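The middle of that architecture, where a model proposes and a simulator decides, can be sketched as a small loop. This is a toy under stated assumptions: `run_validated` and its callback signatures are hypothetical, the "model" is a scripted list of proposals, and the domain is a three-cell sliding strip rather than anything from the benchmark.

```python
def run_validated(state, propose, legal_moves, apply_move, is_goal, max_steps=20):
    """Apply only the proposals the simulator accepts and log the rest."""
    log = []
    for _ in range(max_steps):
        if is_goal(state):
            break
        move = propose(state)
        if move not in legal_moves(state):
            log.append(("rejected", move))  # state is left untouched
            continue
        state = apply_move(state, move)
        log.append(("applied", move))
    return state, log


# Toy domain: a 1x3 strip where 0 is the blank and a move names the cell
# whose tile slides into the blank.
def legal(state):
    blank = state.index(0)
    return {i for i in (blank - 1, blank + 1) if 0 <= i < len(state)}


def slide(state, i):
    blank = state.index(0)
    nxt = list(state)
    nxt[blank], nxt[i] = nxt[i], nxt[blank]
    return tuple(nxt)


script = iter([2, 1, 2])  # hypothetical model output; the first move is illegal
final, log = run_validated(
    (0, 1, 2),
    propose=lambda state: next(script),
    legal_moves=legal,
    apply_move=slide,
    is_goal=lambda state: state == (1, 2, 0),
)
# final == (1, 2, 0)
# log == [("rejected", 2), ("applied", 1), ("applied", 2)]
```

The design choice is that the state only ever changes through `apply_move`, so an illegal proposal costs a step but cannot corrupt the world. A real system would also feed the rejection back to the proposer instead of simply asking again.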

Standard reasoning tricks do not fix the visual bottleneck

The authors also test several familiar interventions: in-context learning, prompt optimization, larger reasoning budget, and tool use. These are not the main thesis of the paper; they are closer to robustness and diagnostic checks. Their purpose is to ask whether the observed failures are shallow prompting failures.

The answer is mostly no.

Providing in-context examples gives no systematic improvement beyond Level 1. Visual examples do not clearly outperform non-visual examples. Prompt optimization across many variants does not improve over the default prompt. Increasing reasoning budget yields negligible and inconsistent accuracy changes. Tool use does not meaningfully help; the model mainly uses tools for image preprocessing such as cropping and resizing, without improving downstream accuracy.

This is important because many AI deployment plans still rely on the sacred enterprise ritual: when the model fails, add examples, add instructions, add tools, add budget, and hope the invoice becomes intelligence.

MentisOculi suggests the failure is not that simple. The bottleneck is not merely that the model needs a better instruction saying “please reason visually.” It is that the model lacks a reliable mechanism for maintaining and using visual state under transformations.

The paper’s result does not mean prompting, tools, or budget never matter. It means they do not solve this class of failure by themselves. If the internal representation drifts, more tokens can just narrate the drift at greater length. Very helpful, in the same way a verbose wrong map is helpful.

Humans do not just think longer; they adapt effort to difficulty

The human comparison in the paper is not a broad claim about human intelligence. It is a narrow reference point for Rush Hour, using a small group of PhD students in a psychophysical setup. That boundary matters.

Still, the comparison is revealing.

Humans spend more time on harder puzzles. This sounds obvious, but it reveals an adaptive difficulty signal: when the visual-spatial problem becomes more complex, humans allocate more effort. Gemini 3, by contrast, does not increase token usage from Level 3 to Level 5. The model’s internal effort does not scale in the same way with visual-spatial complexity.

This matters because “reasoning budget” is often treated as a controllable dial. Set the model to think harder. Spend more tokens. Receive better reasoning. Nice theory. Slightly too clean.

For visual reasoning, the model may not know when it needs more state-tracking effort. Worse, it may not know what kind of effort is missing. A longer chain of text is not the same as maintaining a more faithful visual state. A longer generated rollout is not the same as a validated transition system.

The human comparison is useful because it reframes the issue: adaptive reasoning is not just more computation. It is computation allocated to the right representation at the right step.
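One simple way to check whether effort tracks difficulty in a team's own logs is to fit a slope of effort against level. The sketch below uses an ordinary least-squares slope; the function name and all numbers are hypothetical placeholders, not measurements from the paper.

```python
def effort_slope(levels, effort):
    """Least-squares slope of effort (seconds, tokens) against difficulty."""
    n = len(levels)
    mean_x = sum(levels) / n
    mean_y = sum(effort) / n
    sxx = sum((x - mean_x) ** 2 for x in levels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, effort))
    return sxy / sxx


# Hypothetical logs: human seconds rise with level, model tokens stay flat.
human_slope = effort_slope([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])  # 10.0
model_slope = effort_slope([3, 4, 5], [900, 900, 900])             # 0.0
```

A slope near zero on the hard levels is the pattern to worry about: the problem got harder and the system did not notice.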

The cost problem: expensive imagery needs evidence, not vibes

The paper makes one economic point that should interest product teams immediately. The authors report that generating a video reasoning trace with Veo 3.1 costs $3.20 per sample, more than 21 times the cost of Gemini 2.5-I and more than 60,000 times the cost of Gemini 2.5 in their setup, while all three approaches yield roughly similar performance.

This number should not be read as permanent pricing law. Model prices change. But the underlying ROI logic is stable: visual reasoning traces are expensive, and they need to earn their cost through measurable gains.
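The ROI logic can be worked through with the figures quoted above. Because the text says "more than" 21 and 60,000 times, dividing $3.20 by those ratios only gives upper bounds on the cheaper per-sample costs. The helper functions and the 30% accuracy used in the example are hypothetical.

```python
VIDEO_COST = 3.20     # USD per sample for the video trace, as quoted above
IMAGE_RATIO = 21      # video costs "more than 21 times" the image model
TEXT_RATIO = 60_000   # video costs "more than 60,000 times" the text model

# Upper bounds on the cheaper per-sample costs implied by those ratios.
image_cost_bound = VIDEO_COST / IMAGE_RATIO  # about 0.152 USD
text_cost_bound = VIDEO_COST / TEXT_RATIO    # about 0.000053 USD


def cost_per_correct(cost_per_sample, accuracy):
    """Expected spend per correct answer at a given accuracy."""
    if accuracy <= 0:
        return float("inf")
    return cost_per_sample / accuracy


def break_even_accuracy(cheap_cost, cheap_accuracy, expensive_cost):
    """Accuracy the expensive path needs to match the cheap path's cost
    per correct answer. A value above 1.0 means it cannot break even on
    this metric alone."""
    return expensive_cost * cheap_accuracy / cheap_cost


# Hypothetical: the text path is right 30% of the time.
needed = break_even_accuracy(text_cost_bound, 0.30, VIDEO_COST)
```

With these inputs `needed` lands far above 1.0, so video cannot win on cost per correct answer. It would have to justify itself some other way, for example by solving cases the cheap path never solves.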

A product team should therefore ask four questions before adding generated images or video rollouts to a reasoning workflow:

| Question | Bad answer | Better answer |
| --- | --- | --- |
| Does the visual trace improve final task accuracy? | “It looks more transparent.” | A/B tested accuracy or error reduction |
| Does it preserve state lawfully across steps? | “The frames look plausible.” | Transition validation against rules or simulator state |
| Does the model actually use the trace? | “The trace is in the context.” | Performance improves when trace quality improves |
| Is the cost justified? | “Multimodal is strategic.” | Incremental value exceeds incremental inference cost |

The third question is the one many teams will skip. They should not. MentisOculi shows that placing correct visuals in the model’s context does not guarantee that the model uses them correctly. Context is not comprehension. It is just proximity.
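For the "A/B tested accuracy" answer, running both variants on the same instances allows a paired comparison, which is more sensitive than comparing two headline percentages. The sketch below uses an exact McNemar test, a standard choice for paired binary outcomes. It is a suggestion for practitioners, not the paper's procedure, and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_ok, trace_ok):
    """Two-sided exact McNemar test on paired per-instance outcomes.

    Only instances where the two variants disagree carry information:
    `gained` were fixed by adding the trace, `lost` were broken by it.
    """
    pairs = list(zip(baseline_ok, trace_ok))
    gained = sum(1 for b, t in pairs if t and not b)
    lost = sum(1 for b, t in pairs if b and not t)
    n = gained + lost
    if n == 0:
        return gained, lost, 1.0
    tail = sum(comb(n, k) for k in range(min(gained, lost) + 1)) / 2**n
    return gained, lost, min(1.0, 2 * tail)
```

To probe the third question, the same test can compare a run with degraded traces against a run with clean ones. If quality changes and outcomes do not, the trace is probably just sitting in the context.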

What Cognaptus infers for business use

The paper directly shows that current frontier models struggle with controlled multi-step visual reasoning tasks, and that explicit visual thoughts do not reliably improve performance over text-only reasoning. It also shows that failures include both generation errors and interpretation errors.

From this, Cognaptus draws three practical inferences.

First, **visual reasoning should be evaluated as a state-update problem, not as a demo aesthetic**. If an agent is meant to manipulate layouts, diagrams, forms, dashboards, inspection photos, or simulated environments, the evaluation should test whether it preserves object identity and legal transitions across steps. Screenshots alone are not enough.

Second, **structured state is still the safest middle layer**. For many business workflows, the best architecture may not be a model that “thinks in images.” It may be a model that extracts visual information into a structured state, then relies on deterministic or probabilistic validators to update that state. This is less romantic. It also has the advantage of being debuggable.

Third, **visual traces are better treated as auditable artifacts than as trusted reasoning substrates**. Generated images and videos can help humans inspect what the model intended. They can support debugging, user communication, and training data construction. But until the trace is shown to improve decisions, it should not be assumed to be the engine of reasoning.

Here is the operational translation:

| Use case | Naive multimodal approach | More robust approach |
| --- | --- | --- |
| UI automation | Let the agent inspect screenshots and decide clicks | Extract UI elements into structured state; verify actions against DOM or accessibility tree |
| Visual inspection | Ask the model to reason over images and explain | Pair model perception with rule-based defect checks, measurement tools, and human escalation |
| Spatial planning | Generate visual plans step by step | Use geometry-aware planners or simulators, then render visual explanations |
| Process diagrams | Let the model visually follow the flow | Convert diagram to graph/state machine, then reason over graph transitions |
| Design review | Ask for visual chain-of-thought | Evaluate against constraints: dimensions, collisions, dependencies, compliance rules |

The pattern is consistent: visuals should enter the system, but validated structure should carry the reasoning burden.
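The process-diagram row is the easiest to illustrate. Once a diagram has been converted to a graph, checking a model's proposed path is a lookup rather than an act of faith. The sketch below assumes the extraction step has already happened; the workflow states and the function name are hypothetical.

```python
def first_illegal_step(transitions, path):
    """Return the index of the first step the graph does not allow,
    or None if every step in the path follows an existing edge."""
    for i, (src, dst) in enumerate(zip(path, path[1:])):
        if dst not in transitions.get(src, set()):
            return i
    return None


# Hypothetical invoice workflow extracted from a process diagram.
WORKFLOW = {
    "received": {"reviewed"},
    "reviewed": {"approved", "rejected"},
    "approved": {"paid"},
}

ok = first_illegal_step(WORKFLOW, ["received", "reviewed", "approved", "paid"])
skipped_review = first_illegal_step(WORKFLOW, ["received", "approved", "paid"])
# ok is None; skipped_review is 0 because "received" -> "approved" has no edge
```

The same pattern applies to the other rows: the model may propose, but a structure that knows the rules gets the final say.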

Boundaries: what the paper does not prove

MentisOculi is a controlled benchmark of synthetic and puzzle-like tasks. It is not a benchmark of real-world perception, social understanding, medical imaging, legal judgment, autonomous driving, or general human-level intelligence. It should not be used to claim that models cannot reason visually in any possible setting.

The benchmark is also deliberately stylized. That is a feature for diagnosis, but a limitation for direct deployment inference. The puzzles expose state-tracking failures under clean constraints; real business images add noise, context, ambiguity, domain-specific semantics, and messy human expectations. Performance could be better in some operational settings and worse in others.

The human study is also a reference point, not a sweeping comparison. It focuses on Rush Hour and uses a small, high-performing participant group. Its value is not that it proves a universal human-machine gap. Its value is that it highlights a qualitative difference in effort allocation: humans appear to adjust time to difficulty; models do not show the same adaptive pattern in token usage.

Finally, the cost comparison reflects the authors’ experimental setup. Prices and model capabilities will change. The durable point is not the exact dollar amount. The durable point is that expensive visual reasoning must produce measurable task gains, not merely more persuasive intermediate media.

The real lesson: build a simulator before buying the movie camera

The most useful way to read MentisOculi is not as a pessimistic paper about multimodal AI. It is a diagnostic paper about representational alignment.

The models are not empty. They can perceive. They can reason symbolically. They can generate images. Some newer systems produce cleaner rollouts. Oracle visuals help on some tasks. There is real capability here.

But capability is not integration.

The gap is between seeing, maintaining, transforming, and deciding. Current models often fail because those stages are not reliably connected. The generated image is not guaranteed to be a lawful state. The correct visual state is not guaranteed to become evidence. The token budget is not guaranteed to rise with visual difficulty. And the model’s confidence is, as usual, not a warranty.

For businesses, the correct response is not “avoid multimodal AI.” That would be lazy. The correct response is to stop treating visual output as proof of visual reasoning.

Use benchmarks like MentisOculi. Build task-specific simulators. Convert images into structured state. Add validators. Compare text-only, structured-state, image-trace, and video-trace versions under the same scoring rules. Ask whether the expensive visual path actually improves decisions.

Seeing is useful. Seeing plus drawing is impressive. Seeing, remembering, transforming, and acting lawfully is the thing we actually need.

We are not there yet.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, and Wieland Brendel, “MentisOculi: Revealing the Limits of Reasoning with Mental Imagery,” arXiv:2602.02465, 2026. https://arxiv.org/abs/2602.02465
