Opening — Why this matters now

Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them.

This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see—but they cannot mentally imagine in any robust, task‑reliable way.

Background — From “reasoning about images” to “reasoning with imagery”

Most multimodal benchmarks evaluate reasoning about images: classification, captioning, VQA, or short‑horizon spatial queries. These tasks rarely require maintaining a persistent internal world model.

Mental imagery is different, as the sketch following this list makes concrete. It demands that a system:

  • Form an internal visual state
  • Apply rule‑governed transformations to that state
  • Preserve object identity and geometry across steps
  • Use the evolving visual state as evidence for future decisions
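
To make those demands concrete, here is a minimal sketch of the loop such a system would have to run. The names (`VisualState`, `imagery_loop`, `objects_preserved`) are illustrative placeholders, not anything defined in the paper, and the invariant check is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class VisualState:
    """An internal visual snapshot, e.g. a grid of object identities (illustrative)."""
    cells: tuple

def objects_preserved(before: VisualState, after: VisualState) -> bool:
    """Placeholder invariant: the same objects exist before and after a transformation."""
    return sorted(before.cells) == sorted(after.cells)

def imagery_loop(initial: VisualState,
                 transforms: List[Callable[[VisualState], VisualState]],
                 decide: Callable[[VisualState], str]) -> str:
    state = initial                                   # 1. form an internal visual state
    for transform in transforms:
        next_state = transform(state)                 # 2. apply a rule-governed transformation
        if not objects_preserved(state, next_state):  # 3. identity/geometry must survive each step
            raise ValueError("internal visual state drifted")
        state = next_state
    return decide(state)                              # 4. the evolved state is the evidence
```

The paper's claim, in these terms, is that current models fail at steps 2 and 3: the transformation is not rule-governed and the invariant quietly breaks.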

Humans do this effortlessly when solving puzzles like paper folding or sliding blocks. Whether modern multimodal models do the same has largely been assumed—not rigorously tested.

What the paper does — Introducing MENTISOCULI

The authors introduce MENTISOCULI, a procedurally generated benchmark designed specifically to isolate visual mental imagery as a reasoning mechanism.

The five task families

Each family targets a core cognitive demand:

  • Form Board: spatial composition and part–whole reasoning
  • Hinge Folding: sequential geometric transformation
  • Paper Fold: predictive mental simulation
  • Rush Hour: state‑based planning from vision
  • Sliding Puzzle: persistent state tracking

Each task family is stratified into five difficulty levels, defined by the minimum number of required operations. Crucially, every instance comes with a ground‑truth visual chain‑of‑thought: an oracle sequence of intermediate states.
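
The paper does not publish a schema, but conceptually each instance bundles the inputs with that oracle chain. The record below is a hypothetical layout for illustration; every field name is an assumption.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MentisOculiInstance:
    """Hypothetical per-instance record; field names are assumptions, not the paper's schema."""
    task_family: str            # e.g. "form_board", "paper_fold", "rush_hour", "sliding_puzzle"
    difficulty: int             # 1-5, stratified by the minimum number of required operations
    initial_image: bytes        # rendered starting state shown to the model
    question: str               # the query the model must answer
    oracle_states: List[bytes]  # ground-truth visual chain-of-thought, one image per step
    answer: str                 # ground-truth final answer

    @property
    def min_operations(self) -> int:
        # Assumes the oracle chain includes the initial state, so operations = states - 1.
        return len(self.oracle_states) - 1
```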

This allows the benchmark to distinguish between three sources of failure (a diagnostic sketch follows this list):

  • Failures of perception
  • Failures of representation
  • Failures of reasoning over representations
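
One way to operationalize that three-way split, offered purely as a sketch rather than the paper's scoring code: check the model's parsed starting state against ground truth (perception), its intermediate states against the oracle chain (representation), and only then its final answer (reasoning).

```python
from typing import List, Optional

State = tuple  # a canonicalized board/figure encoding; the exact encoding is assumed

def attribute_failure(parsed_initial: State, true_initial: State,
                      predicted_chain: List[State], oracle_chain: List[State],
                      predicted_answer: str, true_answer: str) -> Optional[str]:
    """Return the earliest failure stage, or None if the instance was solved."""
    if parsed_initial != true_initial:
        return "perception"          # the model never encoded the problem correctly
    for predicted, oracle in zip(predicted_chain, oracle_chain):
        if predicted != oracle:
            return "representation"  # the internal visual state diverged from the oracle
    if predicted_answer != true_answer:
        return "reasoning"           # the states were faithful, but the conclusion was wrong
    return None
```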

Findings — Where multimodal reasoning breaks

1. Visual intermediates don’t reliably help

Across tasks, adding visual rollouts or intermediate images produces inconsistent and often negligible accuracy gains. Any improvements vanish as difficulty increases.

More strikingly, allocating a higher reasoning budget (more tokens, more images) does not systematically improve performance.

Seeing more does not mean thinking better.

2. State drift is the dominant failure mode

Qualitative inspection reveals a recurring pathology:

  • Objects subtly change shape or identity
  • New elements are hallucinated into existence
  • Valid moves become illegal
  • Errors compound rather than self‑correct

Once the internal visual state drifts, models do not recover. They continue reasoning confidently—on an impossible world.
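
For the sliding-puzzle family this pathology is mechanically checkable: a legal transition swaps the blank with exactly one adjacent tile, so any generated frame that is not one move away from its predecessor is drift. The checker below is an assumed formulation for illustration, not tooling from the paper.

```python
from typing import Tuple

Board = Tuple[Tuple[int, ...], ...]  # 0 marks the blank tile

def is_legal_transition(before: Board, after: Board) -> bool:
    """True iff `after` follows from `before` by sliding one tile into the blank."""
    if len(before) != len(after) or any(len(r) != len(s) for r, s in zip(before, after)):
        return False  # the board itself changed size: drift
    if sorted(sum(before, ())) != sorted(sum(after, ())):
        return False  # tiles appeared, vanished, or changed identity: drift
    changed = [(r, c) for r, row in enumerate(before)
               for c, value in enumerate(row) if value != after[r][c]]
    if len(changed) != 2:
        return False  # exactly two cells (the blank and one tile) may change
    (r1, c1), (r2, c2) = changed
    adjacent = abs(r1 - r2) + abs(c1 - c2) == 1
    swapped = before[r1][c1] == after[r2][c2] and before[r2][c2] == after[r1][c1]
    blank_involved = 0 in (before[r1][c1], before[r2][c2])
    return adjacent and swapped and blank_involved
```

Run over consecutive generated frames, a check like this flags the exact step where a trajectory leaves the space of reachable states; the qualitative finding above is that models keep generating past that point rather than backtracking.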

3. Rollout length is poorly calibrated

If models were truly reasoning via imagery, harder problems should induce longer visual chains. Instead, generated rollouts are weakly correlated—or entirely uncorrelated—with the true number of required steps.

Humans spend more time on harder puzzles. Models do not.
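
Calibration is straightforward to quantify: correlate the number of images each rollout contains with the oracle's minimum step count. The snippet below is an assumed analysis recipe with toy numbers, not the authors' code or data.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy numbers for illustration only (not from the paper).
oracle_steps  = np.array([2, 3, 5, 7, 9, 12])   # minimum operations per instance
rollout_steps = np.array([4, 4, 5, 4, 6, 5])    # images the model actually generated

rho, p_value = spearmanr(oracle_steps, rollout_steps)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A model that genuinely simulates step by step should show rho near 1;
# the paper finds weak or absent correlation.
```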

4. Newer models are cleaner, not smarter

Later‑generation models produce sharper, more temporally consistent images. But this reflects better image generation, not improved visual reasoning. The underlying inability to maintain a lawful internal state remains.

Why this matters — Practical implications

For model builders

  • Visual generation quality is not a proxy for reasoning capacity
  • Training on interleaved vision–language data does not guarantee usable internal world models
  • Benchmarks must separate representational fidelity from reasoning success

For enterprise and product teams

  • Video or image‑based reasoning traces are expensive and rarely cost‑effective
  • Visual explanations can look persuasive while being logically invalid
  • For planning, verification, and compliance, text‑centric or symbolic hybrids remain safer

For evaluation and governance

MENTISOCULI demonstrates why “multimodal intelligence” needs sharper definitions. Without controlled diagnostics, visual reasoning claims remain marketing‑adjacent narratives.

Conclusion — Mental imagery is still an unsolved problem

This paper does not argue that visual reasoning is impossible for machines. It argues something more uncomfortable: we are mistaking representation for reasoning.

Until models can reliably maintain internal visual states under rule‑based transformation, mental imagery remains a mirage—impressive, expensive, and misleading.

MENTISOCULI does not close the gap. It finally measures it.

Cognaptus: Automate the Present, Incubate the Future.