Opening — Why this matters now
Multimodal AI is having its cinematic moment. Video generation, image rollouts, and interleaved vision–language reasoning are being marketed as steps toward models that can think visually. The implicit promise is seductive: if models can generate images while reasoning, perhaps they can finally reason with them.
This paper delivers a colder verdict. When tested under controlled conditions, today’s strongest multimodal models fail at something deceptively basic: maintaining and manipulating internal visual representations over time. In short, they can see, but they cannot reliably imagine.
Background — From “reasoning about images” to “reasoning with imagery”
Most multimodal benchmarks evaluate reasoning about images: classification, captioning, VQA, or short‑horizon spatial queries. These tasks rarely require maintaining a persistent internal world model.
Mental imagery is different. As the sketch after this list illustrates, it demands that a system:
- Form an internal visual state
- Apply rule‑governed transformations to that state
- Preserve object identity and geometry across steps
- Use the evolving visual state as evidence for future decisions
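To make these demands concrete, here is a minimal, purely illustrative sketch (not code from the paper) of that loop for a paper-fold-style problem: an explicit internal state, rule-governed transformations applied to it, and a final read-out used as the answer.

```python
import numpy as np

def fold_left_over_right(sheet: np.ndarray) -> np.ndarray:
    """Fold the left half onto the right half; holes (True) in either layer persist."""
    h, w = sheet.shape
    left, right = sheet[:, : w // 2], sheet[:, w // 2 :]
    return np.fliplr(left) | right  # mirror the left half, then merge the layers

def punch(sheet: np.ndarray, row: int, col: int) -> np.ndarray:
    """Punch a hole through every layer at (row, col) of the folded stack."""
    out = sheet.copy()
    out[row, col] = True
    return out

def unfold_right_to_left(folded: np.ndarray) -> np.ndarray:
    """Invert the fold: every hole in the stack reappears in both halves."""
    return np.concatenate([np.fliplr(folded), folded], axis=1)

# Maintain the evolving internal state step by step, as an oracle trace would.
state = np.zeros((4, 4), dtype=bool)   # form an internal visual state
state = fold_left_over_right(state)    # rule-governed transformation
state = punch(state, row=1, col=1)     # modify the current state
state = unfold_right_to_left(state)    # invert the transformation
print(int(state.sum()))                # read the state out as evidence -> 2 holes
```

The point is not the code itself but the discipline it encodes: every step operates on, and must stay consistent with, the state produced by the previous step.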
Humans do this effortlessly when solving puzzles like paper folding or sliding blocks. Whether modern multimodal models do the same has largely been assumed—not rigorously tested.
What the paper does — Introducing MENTISOCULI
The authors introduce MENTISOCULI, a procedurally generated benchmark designed specifically to isolate visual mental imagery as a reasoning mechanism.
The five task families
| Task | Core cognitive demand |
|---|---|
| Form Board | Spatial composition and part–whole reasoning |
| Hinge Folding | Sequential geometric transformation |
| Paper Fold | Predictive mental simulation |
| Rush Hour | State‑based planning from vision |
| Sliding Puzzle | Persistent state tracking |
Each task is stratified into five difficulty levels, defined by the minimum number of required operations. Crucially, every instance comes with a ground‑truth visual chain‑of‑thought—an oracle sequence of intermediate states.
This allows the benchmark to distinguish between three kinds of failure (sketched in code after this list):
- Failures of perception
- Failures of representation
- Failures of reasoning over representations
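As a hypothetical sketch of how this attribution could work (the schema, field names, and helper below are assumptions for illustration, not the paper's actual code), each instance can carry its oracle trace so that a wrong answer is localized to the first step where the model's imagined state leaves the ground-truth path:

```python
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    task: str             # e.g. "sliding_puzzle"
    difficulty: int       # 1-5, defined by the minimum number of required operations
    initial_state: tuple  # serialized starting configuration
    oracle_trace: list    # ground-truth intermediate states, in order
    answer: str           # final ground-truth answer

def first_divergence(oracle_trace: list, model_trace: list) -> int | None:
    """Index of the first step where the model's trace leaves (or stops short of) the oracle path.

    Returns None if the model tracks every oracle state. A divergence at step 0
    suggests a perception failure; a later divergence suggests a failure of
    representation or of reasoning over representations.
    """
    for i, oracle_state in enumerate(oracle_trace):
        if i >= len(model_trace) or model_trace[i] != oracle_state:
            return i
    return None
```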
Findings — Where multimodal reasoning breaks
1. Visual intermediates don’t reliably help
Across tasks, adding visual rollouts or intermediate images produces inconsistent and often negligible accuracy gains. Any improvements vanish as difficulty increases.
More strikingly, allocating a higher reasoning budget (more tokens, more images) does not systematically improve performance.
Seeing more does not mean thinking better.
2. State drift is the dominant failure mode
Qualitative inspection reveals a recurring pathology:
- Objects subtly change shape or identity
- New elements hallucinate into existence
- Valid moves become illegal
- Errors compound rather than self‑correct
Once the internal visual state drifts, models do not recover. They continue reasoning confidently about a world that is no longer possible.
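One way to surface this pathology is a rule checker run over the model's imagined rollout. The sketch below is illustrative only (it assumes 3x3 sliding-puzzle states serialized as tuples of nine ints with 0 as the blank; it is not the paper's code): it flags the first transition that breaks the rules, after which every later step inherits the error.

```python
def legal_step(prev: tuple, curr: tuple, width: int = 3) -> bool:
    """True if `curr` follows from `prev` by sliding one tile into the blank."""
    if sorted(prev) != sorted(curr):  # a tile changed identity, vanished, or appeared
        return False
    diffs = [i for i in range(len(prev)) if prev[i] != curr[i]]
    if len(diffs) != 2 or 0 not in (prev[i] for i in diffs):
        return False                  # more than one tile moved, or the blank was not involved
    a, b = diffs
    same_row = a // width == b // width
    return abs(a - b) == width or (abs(a - b) == 1 and same_row)  # orthogonal adjacency

def first_illegal_step(rollout: list[tuple]) -> int | None:
    """Index of the first transition that breaks the rules, or None if the rollout is lawful."""
    for i in range(1, len(rollout)):
        if not legal_step(rollout[i - 1], rollout[i]):
            return i
    return None
```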
3. Rollout length is poorly calibrated
If models were truly reasoning via imagery, harder problems should induce longer visual chains. Instead, generated rollouts are weakly correlated—or entirely uncorrelated—with the true number of required steps.
Humans spend more time on harder puzzles. Models do not.
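The calibration claim can be checked with a simple rank correlation between how many operations an instance actually requires and how long a rollout the model produces. The numbers below are made up purely for illustration; only the form of the check is the point.

```python
from scipy.stats import spearmanr

required_steps = [2, 3, 5, 5, 8, 9, 12]   # oracle minimum operations per instance (illustrative)
rollout_lengths = [4, 4, 3, 5, 4, 4, 5]   # images the model actually generated (illustrative)

# A well-calibrated imaginer would show a strong positive correlation;
# a weak or non-significant one means rollout length ignores difficulty.
rho, p_value = spearmanr(required_steps, rollout_lengths)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```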
4. Newer models are cleaner, not smarter
Later‑generation models produce sharper, more temporally consistent images. But this reflects better image generation, not improved visual reasoning. The underlying inability to maintain a lawful internal state remains.
Why this matters — Practical implications
For model builders
- Visual generation quality is not a proxy for reasoning capacity
- Training on interleaved vision–language data does not guarantee usable internal world models
- Benchmarks must separate representational fidelity from reasoning success
For enterprise and product teams
- Video or image‑based reasoning traces are expensive and rarely cost‑effective
- Visual explanations can look persuasive while being logically invalid
- For planning, verification, and compliance, text‑centric or symbolic hybrids remain safer
For evaluation and governance
MENTISOCULI demonstrates why “multimodal intelligence” needs sharper definitions. Without controlled diagnostics, visual reasoning claims remain marketing‑adjacent narratives.
Conclusion — Mental imagery is still an unsolved problem
This paper does not argue that visual reasoning is impossible for machines. It argues something more uncomfortable: we are mistaking representation for reasoning.
Until models can reliably maintain internal visual states under rule‑based transformation, mental imagery remains a mirage—impressive, expensive, and misleading.
MENTISOCULI does not close the gap. It finally measures it.
Cognaptus: Automate the Present, Incubate the Future.