Opening — Why this matters now

Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them.

This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way.

Background — From “reasoning about images” to “reasoning with imagery”

Most multimodal benchmarks evaluate reasoning about images: classification, captioning, VQA, or short‑horizon spatial queries. These tasks rarely require maintaining a persistent internal world model.

Mental imagery is different, as the sketch following this list makes concrete. It demands that a system:

  • Form an internal visual state
  • Apply rule‑governed transformations to that state
  • Preserve object identity and geometry across steps
  • Use the evolving visual state as evidence for future decisions
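
To make those demands concrete, here is a minimal sketch of the loop such a system would have to run. The names (`VisualState`, `imagery_loop`, `objects_preserved`) are illustrative placeholders, not anything defined in the paper, and the invariant check is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class VisualState:
    """An internal visual snapshot, e.g. a grid of object identities (illustrative)."""
    cells: tuple

def objects_preserved(before: VisualState, after: VisualState) -> bool:
    """Placeholder invariant: the same objects exist before and after a transformation."""
    return sorted(before.cells) == sorted(after.cells)

def imagery_loop(initial: VisualState,
                 transforms: List[Callable[[VisualState], VisualState]],
                 decide: Callable[[VisualState], str]) -> str:
    state = initial                                   # 1. form an internal visual state
    for transform in transforms:
        next_state = transform(state)                 # 2. apply a rule-governed transformation
        if not objects_preserved(state, next_state):  # 3. identity/geometry must survive each step
            raise ValueError("internal visual state drifted")
        state = next_state
    return decide(state)                              # 4. the evolved state is the evidence
```

The paper's claim, in these terms, is that current models fail at steps 2 and 3: the transformation is not rule-governed and the invariant quietly breaks.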

Humans do this effortlessly when solving puzzles like paper folding or sliding blocks. Whether modern multimodal models do the same has largely been assumed—not rigorously tested.

What the paper does — Introducing MENTISOCULI

The authors introduce MENTISOCULI, a procedurally generated benchmark designed specifically to isolate visual mental imagery as a reasoning mechanism.

The five task families

Each family targets a core cognitive demand:

  • Form Board: spatial composition and part–whole reasoning
  • Hinge Folding: sequential geometric transformation
  • Paper Fold: predictive mental simulation
  • Rush Hour: state‑based planning from vision
  • Sliding Puzzle: persistent state tracking

Each task family is stratified into five difficulty levels, defined by the minimum number of required operations. Crucially, every instance comes with a ground‑truth visual chain‑of‑thought: an oracle sequence of intermediate states.
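
The paper does not publish a schema, but conceptually each instance bundles the inputs with that oracle chain. The record below is a hypothetical layout for illustration; every field name is an assumption.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MentisOculiInstance:
    """Hypothetical per-instance record; field names are assumptions, not the paper's schema."""
    task_family: str            # e.g. "form_board", "paper_fold", "rush_hour", "sliding_puzzle"
    difficulty: int             # 1-5, stratified by the minimum number of required operations
    initial_image: bytes        # rendered starting state shown to the model
    question: str               # the query the model must answer
    oracle_states: List[bytes]  # ground-truth visual chain-of-thought, one image per step
    answer: str                 # ground-truth final answer

    @property
    def min_operations(self) -> int:
        # Assumes the oracle chain includes the initial state, so operations = states - 1.
        return len(self.oracle_states) - 1
```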

This allows the benchmark to distinguish between three sources of failure (a diagnostic sketch follows this list):

  • Failures of perception
  • Failures of representation
  • Failures of reasoning over representations
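
One way to operationalize that three-way split, offered purely as a sketch rather than the paper's scoring code: check the model's parsed starting state against ground truth (perception), its intermediate states against the oracle chain (representation), and only then its final answer (reasoning).

```python
from typing import List, Optional

State = tuple  # a canonicalized board/figure encoding; the exact encoding is assumed

def attribute_failure(parsed_initial: State, true_initial: State,
                      predicted_chain: List[State], oracle_chain: List[State],
                      predicted_answer: str, true_answer: str) -> Optional[str]:
    """Return the earliest failure stage, or None if the instance was solved."""
    if parsed_initial != true_initial:
        return "perception"          # the model never encoded the problem correctly
    for predicted, oracle in zip(predicted_chain, oracle_chain):
        if predicted != oracle:
            return "representation"  # the internal visual state diverged from the oracle
    if predicted_answer != true_answer:
        return "reasoning"           # the states were faithful, but the conclusion was wrong
    return None
```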

Findings — Where multimodal reasoning breaks

1. Visual intermediates don’t reliably help

Across tasks, adding visual rollouts or intermediate images produces inconsistent and often negligible accuracy gains. Any improvements vanish as difficulty increases.

More strikingly, allocating a higher reasoning budget (more tokens, more images) does not systematically improve performance.

Seeing more does not mean thinking better.

2. State drift is the dominant failure mode

Qualitative inspection reveals a recurring pathology:

  • Objects subtly change shape or identity
  • New elements are hallucinated into existence
  • Valid moves become illegal
  • Errors compound rather than self‑correct

Once the internal visual state drifts, models do not recover. They continue reasoning confidently—on an impossible world.
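
For the sliding-puzzle family this pathology is mechanically checkable: a legal transition swaps the blank with exactly one adjacent tile, so any generated frame that is not one move away from its predecessor is drift. The checker below is an assumed formulation for illustration, not tooling from the paper.

```python
from typing import Tuple

Board = Tuple[Tuple[int, ...], ...]  # 0 marks the blank tile

def is_legal_transition(before: Board, after: Board) -> bool:
    """True iff `after` follows from `before` by sliding one tile into the blank."""
    if len(before) != len(after) or any(len(r) != len(s) for r, s in zip(before, after)):
        return False  # the board itself changed size: drift
    if sorted(sum(before, ())) != sorted(sum(after, ())):
        return False  # tiles appeared, vanished, or changed identity: drift
    changed = [(r, c) for r, row in enumerate(before)
               for c, value in enumerate(row) if value != after[r][c]]
    if len(changed) != 2:
        return False  # exactly two cells (the blank and one tile) may change
    (r1, c1), (r2, c2) = changed
    adjacent = abs(r1 - r2) + abs(c1 - c2) == 1
    swapped = before[r1][c1] == after[r2][c2] and before[r2][c2] == after[r1][c1]
    blank_involved = 0 in (before[r1][c1], before[r2][c2])
    return adjacent and swapped and blank_involved
```

Run over consecutive generated frames, a check like this flags the exact step where a trajectory leaves the space of reachable states; the qualitative finding above is that models keep generating past that point rather than backtracking.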

3. Rollout length is poorly calibrated

If models were truly reasoning via imagery, harder problems should induce longer visual chains. Instead, generated rollouts are weakly correlated—or entirely uncorrelated—with the true number of required steps.

Humans spend more time on harder puzzles. Models do not.
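
Calibration is straightforward to quantify: correlate the number of images each rollout contains with the oracle's minimum step count. The snippet below is an assumed analysis recipe with toy numbers, not the authors' code or data.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy numbers for illustration only (not from the paper).
oracle_steps  = np.array([2, 3, 5, 7, 9, 12])   # minimum operations per instance
rollout_steps = np.array([4, 4, 5, 4, 6, 5])    # images the model actually generated

rho, p_value = spearmanr(oracle_steps, rollout_steps)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A model that genuinely simulates step by step should show rho near 1;
# the paper finds weak or absent correlation.
```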

4. Newer models are cleaner, not smarter

Later‑generation models produce sharper, more temporally consistent images. But this reflects better image generation, not improved visual reasoning. The underlying inability to maintain a lawful internal state remains.

Why this matters — Practical implications

For model builders

  • Visual generation quality is not a proxy for reasoning capacity
  • Training on interleaved vision–language data does not guarantee usable internal world models
  • Benchmarks must separate representational fidelity from reasoning success

For enterprise and product teams

  • Video or image‑based reasoning traces are expensive and rarely cost‑effective
  • Visual explanations can look persuasive while being logically invalid
  • For planning, verification, and compliance, text‑centric or symbolic hybrids remain safer

For evaluation and governance

MENTISOCULI demonstrates why “multimodal intelligence” needs sharper definitions. Without controlled diagnostics, visual reasoning claims remain marketing‑adjacent narratives.

Conclusion — Mental imagery is still an unsolved problem

This paper does not argue that visual reasoning is impossible for machines. It argues something more uncomfortable: we are mistaking representation for reasoning.

Until models can reliably maintain internal visual states under rule‑based transformation, mental imagery remains a mirage—impressive, expensive, and misleading.

MENTISOCULI does not close the gap. It finally measures it.

Cognaptus: Automate the Present, Incubate the Future.