Paper is a good trap for artificial intelligence.
Fold it, punch it, unfold it, and ask where the holes are. A person may not solve the problem instantly, but the mind knows what to do: imagine the folded sheet opening step by step. The reasoning is not mainly verbal. We do not narrate every cell of the paper grid like a bored accountant reading inventory codes. We see the transformation.
That is the simple intuition behind “Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models,” a January 2026 paper from researchers at Tsinghua University and ByteDance Seed [1]. The paper’s argument is not merely that multimodal models should use images. We already knew models could ingest images, describe images, and generate images. Lovely. The harder question is whether generated images can function as intermediate reasoning states.
The authors’ answer is conditional and therefore more useful: visual generation helps when the reasoning task needs a visual world model. It does not automatically help because a model can draw. A doodle is not intelligence. Sometimes it is just expensive latency with pixels attached.
The paper’s value lies in this distinction. It formalizes multimodal chain-of-thought as reasoning over evolving observations of a hidden world state. It separates two atomic capabilities of world models: reconstructing a world from partial views and simulating how a world changes over time. Then it builds a benchmark, VisWorld-Eval, to test when interleaved visual-verbal chain-of-thought actually improves reasoning.
The business lesson is equally conditional. For robotics, visual inspection, spatial question answering, digital twins, physical planning, and embodied AI, visual generation may become a reasoning substrate. For ordinary symbolic tasks, workflow routing, text summarization, or simple grid-state tracking, forcing image generation into the loop may be mostly theatrical. Enterprise AI already has enough theater.
The problem is not visual input; it is the reasoning medium
Most current multimodal systems are still text-first reasoners. They may accept images, but their intermediate reasoning is usually expressed in language. The visual input gets encoded, aligned with a language model, and then the model talks its way toward an answer.
That works well when the task is naturally symbolic. Math proofs, code, legal clauses, spreadsheet formulas, and policy rules all tolerate language as a reasoning medium. Text is compact, compositional, and easy to inspect.
Physical space is less cooperative.
Try describing the changing shape of a folded sheet, the trajectory of a bouncing ball, or the unseen side of a cube stack using only words. The description quickly becomes either vague or painfully long. This is the representational bottleneck the paper targets. Language can say “the ball reflects off the wall,” but it struggles to preserve the precise geometry needed to know which hole the ball enters. Language can say “the object is behind the camera,” but it often loses the viewpoint transformation that makes the answer obvious.
The paper frames this as a world-model problem. A reasoning system does not merely need facts. It needs an internal representation of the task world, and that representation must be useful for the next reasoning step.
In the authors’ formulation, a task world has hidden states. We observe the world through different views: text, images, camera angles, sketches, symbolic matrices, or other modality-specific observations. Reasoning then becomes a process of generating and updating intermediate observations. Sometimes those observations are verbal. Sometimes they are visual. Sometimes the model keeps them implicit in hidden activations.
That last detail matters. The paper does not claim that every useful world model must be visible to the user. It distinguishes three forms of chain-of-thought:
| Reasoning form | What the model externalizes | When it is plausible |
|---|---|---|
| Implicit world modeling | No explicit intermediate state | Simple state tracking where hidden representations are enough |
| Verbal world modeling | Textual or symbolic state descriptions | Structured problems where coordinates, matrices, or rules are compact |
| Visual world modeling | Generated intermediate images interleaved with text | Physical, spatial, or viewpoint-heavy tasks where images preserve useful structure |
This is the mechanism-first reading: the model is not “thinking better because it draws.” It is thinking better when the generated visual state carries task-relevant information more faithfully than a text state.
Reconstruction and simulation are the two jobs that matter
The paper’s cleanest move is to split visual world modeling into two atomic capabilities.
The first is world reconstruction. Given partial observations, infer the underlying structure and generate a new view. This is the mental rotation problem: from an isometric view and two orthographic views of a cube stack, infer what the stack looks like from another side. It is also the real-world spatial reasoning problem: from several camera views, infer where an object or region is relative to the observer.
The second is world simulation. Given the current state and an action or dynamic rule, predict what happens next. This is paper folding, object manipulation, ball tracking, maze navigation, and Sokoban-like planning.
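To make "rolling the world forward" concrete, here is a minimal sketch of paper-folding simulation. It rests on assumptions that are ours, not the paper's: the sheet is a small grid, each fold mirrors one side of a crease onto the other, and holes are tracked as cell coordinates on the layer that never moves. The names `Cell`, `Fold`, and `unfold` are hypothetical.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, col) on the fully unfolded sheet
Fold = Tuple[str, int]  # ("v", k): crease just before column k; ("h", k): crease just before row k


def unfold(holes: Set[Cell], folds: List[Fold], height: int, width: int) -> Set[Cell]:
    """Undo the folds in reverse order, mirroring every hole across each crease."""
    state = set(holes)
    for axis, k in reversed(folds):
        mirrored = set()
        for r, c in state:
            if axis == "v":
                mr, mc = r, 2 * k - 1 - c  # reflect the column across the crease
            else:
                mr, mc = 2 * k - 1 - r, c  # reflect the row across the crease
            if 0 <= mr < height and 0 <= mc < width:
                mirrored.add((mr, mc))
        state |= mirrored  # the punched layer and its mirror image both carry the hole
    return state


# A 4x4 sheet folded in half twice, then punched once in the bottom-right cell.
final_holes = unfold({(3, 3)}, [("v", 2), ("h", 2)], height=4, width=4)
```

Each pass through the loop is one intermediate state of the sheet. In this toy setting the state is small enough to write down as coordinates; the paper's claim concerns cases where that stops being true.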
These two capabilities are not just convenient benchmark labels. They describe different operational failures.
A reconstruction failure happens when a system cannot build a coherent hidden structure from incomplete views. A warehouse robot may see only parts of a shelf. A real estate inspection model may see different rooms from different camera angles. A digital twin may need to infer occluded equipment layout. The system fails not because it lacks a caption, but because it lacks a stable spatial model.
A simulation failure happens when a system cannot roll the world forward. A robot arm moves an object, a ball reflects, a folded material unfolds, a worker changes the arrangement of items on a table. The system must update the state, not merely label the scene.
This distinction gives the paper its practical bite. “Visual reasoning” is too broad a category. Reconstruction and simulation are deployable capability requirements.
| Capability | Technical question | Business analogue | Typical failure if done only in text |
|---|---|---|---|
| World reconstruction | Can the model infer a coherent world from limited views? | Inspection, spatial QA, site mapping, digital twins | Hallucinated layout, wrong viewpoint relation, brittle occlusion handling |
| World simulation | Can the model predict state changes over steps? | Robotics planning, process simulation, physical troubleshooting | Lost state, vague transformations, arithmetic-like tracking errors |
| Implicit tracking | Can hidden activations carry the state without external artifacts? | Simple routing, grids, checklist states | Usually sufficient when the state is low-dimensional |
| Verbal state tracking | Can text or matrices compactly represent the state? | Rules, forms, symbolic planning | Works until the description becomes the problem |
The paper’s visual superiority hypothesis follows from two claims. First, visual observations can be more informative for physical and spatial states. Second, visual pretraining may contain prior knowledge that language pretraining does not capture as naturally. Many people have watched folding, bouncing, rotating, and moving in the visual world. Fewer have read precise textual transcripts of every fold and reflection. A model trained on visual data may therefore carry useful priors in the image-generation pathway.
That is not mysticism. It is distribution alignment. If the downstream task resembles patterns learned visually during pretraining, the visual pathway may require less post-training data than the verbal pathway.
VisWorld-Eval tests the mechanism, not the slogan
The authors construct VisWorld-Eval to isolate tasks where visual world modeling should or should not help. This is important because earlier evaluations of visual chain-of-thought often mixed together tasks where images were useful, decorative, or actively distracting.
VisWorld-Eval contains seven tasks:
| Task | Capability tested | Domain | What the model must do |
|---|---|---|---|
| Paper folding | Simulation | Synthetic | Unfold a folded grid after hole punching and count the final cutouts |
| Multi-hop manipulation | Simulation | Synthetic | Track object additions, removals, color changes, and spatial relations |
| Ball tracking | Simulation | Synthetic | Predict a reflected ball trajectory and final hole |
| Maze | Simulation | Synthetic grid-world | Navigate through a simple grid |
| Sokoban | Simulation | Synthetic grid-world | Push a box to a target position |
| Cube 3-view projection | Reconstruction | Synthetic | Infer a novel view of a cube stack |
| Real-world spatial reasoning | Reconstruction | Real-world | Answer positional questions from multiple scene views |
This design allows the evidence to answer a sharper question: when the task requires reconstructing or simulating a spatial world, does interleaved visual-verbal chain-of-thought help more than implicit or verbal chain-of-thought?
The authors use BAGEL, an open-source unified multimodal model (UMM), as the main base model. They train task-specific variants with supervised fine-tuning (SFT). For some tests, they also apply reinforcement learning from verifiable rewards (RLVR). The important comparison is not zero-shot leaderboard glamour. It is controlled post-training under different reasoning formats.
The benchmark’s zero-shot results also set context. Strong proprietary vision-language models perform unevenly across the tasks. Gemini 3 Flash is reported as the best overall among the listed models, but even it remains weak on harder physical and spatial tasks such as paper folding, ball tracking, cube projection, and real-world spatial reasoning. This supports the paper’s premise: these are not solved tasks waiting for a prettier UI.
The main evidence: images help when the intermediate state is spatial
The central result comes from comparing three chain-of-thought formulations across VisWorld-Eval: implicit world modeling, verbal world modeling, and visual world modeling.
The pattern is not subtle.
On paper folding, visual world modeling reaches 39.2% accuracy, compared with 27.4% for verbal world modeling and 21.1% for implicit reasoning. On multi-hop manipulation, visual world modeling reaches 66.6%, compared with 40.0% for implicit reasoning. On ball tracking, visual world modeling reaches 57.6%, compared with 40.7% for implicit reasoning.
For reconstruction tasks, the same pattern appears. On cube 3-view projection, visual world modeling reaches 76.8%, compared with 63.7% for implicit and 60.2% for verbal world modeling. On the two reported MMSI positional-relationship subsets, which make up the real-world spatial reasoning task, visual world modeling improves camera-object reasoning from 46.5% to 60.9%, and camera-region reasoning from 37.3% to 54.4%.
The mechanism explains the numbers. Paper folding needs symmetry and progressive unfolding. Ball tracking needs spatial dynamics. Cube projection needs a stable 3D structure and viewpoint transformation. Real-world spatial reasoning needs coherent scene reconstruction from limited views.
In each case, the generated image is not a final illustration. It is an intermediate belief state. The model creates a visual approximation of what the world currently looks like or would look like from a new perspective, and then uses that generated state to answer.
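The idea of an intermediate belief state can be illustrated with a toy version of ball tracking. This is a sketch under our own assumptions (an axis-aligned rectangular table with positive width and height, a point ball, unit time steps, perfectly elastic walls), not the benchmark's actual generator. The function names are hypothetical.

```python
from typing import List, Tuple


def reflect(position: float, velocity: float, limit: float) -> Tuple[float, float]:
    """Fold a coordinate back into [0, limit], flipping velocity at each wall hit."""
    while position < 0 or position > limit:
        if position < 0:
            position, velocity = -position, -velocity
        else:
            position, velocity = 2 * limit - position, -velocity
    return position, velocity


def trajectory(start: Tuple[float, float], velocity: Tuple[float, float],
               width: float, height: float, steps: int) -> List[Tuple[float, float]]:
    """Return the ball position after each step; each entry is one intermediate state."""
    (x, y), (vx, vy) = start, velocity
    states = []
    for _ in range(steps):
        x, vx = reflect(x + vx, vx, width)
        y, vy = reflect(y + vy, vy, height)
        states.append((x, y))
    return states


path = trajectory(start=(2, 3), velocity=(3, 4), width=10, height=10, steps=3)
```

Every element of `path` is what a visual reasoning step would have to depict correctly. A text-only description has to carry the same geometry in words, which is where precision tends to leak.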
The result table is useful, but only if read with the right discipline:
| Evidence | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figure 4: SFT comparison across seven tasks | Main evidence | Visual world modeling improves performance on selected simulation and reconstruction tasks | Visual generation helps all reasoning tasks |
| Figure 6a: paper-folding sample efficiency | Sensitivity / sample-efficiency analysis | Visual world modeling can use fewer SFT samples for a visual-prior-heavy task | All visual tasks will be data-efficient |
| Figure 6b: cube difficulty and fidelity | Robustness / diagnostic test | Visual world models preserve novel-view structure better than verbal matrices, including harder cube sizes | Generated images are always faithful |
| Figure 6c: maze probing | Exploratory mechanistic analysis | BAGEL can internally encode simple maze state without explicit coordinates | All hidden model states are interpretable or reliable |
| Figure 7: Qwen2.5-VL comparison | Control / comparison with prior architecture | The result is unlikely to be caused merely by BAGEL having weak verbal reasoning | Unified multimodal models dominate vision-language models on all tasks |
| Figure 8: RLVR learning curves | Post-training robustness test | RL improves several formats, but does not close the visual-modeling gap | Current RL fully optimizes interleaved visual reasoning |
| Appendix MMSI and qualitative failures | Boundary and failure analysis | Gains are selective; hallucination and blurry generation remain real issues | Visual CoT is deployment-ready without safeguards |
This is the difference between reading the paper and just harvesting the biggest bars from the chart. The main evidence supports a mechanism: visual intermediate states help when they carry the relevant world structure. The supplementary analyses explain why the mechanism appears, where it weakens, and why simpler explanations are less convincing.
The sample-efficiency result is about prior knowledge, not magic
One of the more business-relevant results is the paper-folding sample-efficiency test. The authors compare visual and verbal world modeling under smaller supervised fine-tuning sets. They report that visual world modeling reaches performance comparable to verbal world modeling while using less than a quarter of the SFT data, a more than fourfold reduction.
This is easy to overread, so let us not.
The result does not mean “images require four times less data” in general. It means that on paper folding, a task whose transformations are naturally visual and likely better aligned with visual pretraining, the visual pathway appears more sample-efficient.
That matters because enterprise AI budgets often die in the gap between “the model can do it in a demo” and “we can collect enough clean task data to make it reliable.” If visual world modeling reduces the amount of task-specific data needed for some physical workflows, the ROI implication is real. Not because pixels are magical, but because the pretraining distribution may already contain useful priors.
For example, a factory process assistant might not have millions of labeled examples of every machine state transition. But the visual model may already know common spatial transformations: rotation, occlusion, stacking, deformation, flow, collision, opening, closing, folding. When the downstream task asks the model to reason through those transformations, visual generation may provide a better starting point than a verbose textual state ledger.
The paper’s theory section gives this a formal backbone. It argues that explicit world modeling trades off two things: the informativeness of the intermediate observation and the fidelity cost of generating that observation. More detail is not always better. A generated visual state helps only if it preserves the task-relevant structure better than the alternatives.
In business terms: do not add image generation because it looks advanced. Add it when the generated state reduces uncertainty for the next decision.
The cube result exposes the difference between answer accuracy and state fidelity
The cube 3-view projection task is especially revealing because the paper evaluates not only final answers but also intermediate world-model fidelity.
The model must infer a new view of a stack of colored cubes. Verbal world modeling represents intermediate views using symbolic character matrices. Visual world modeling generates the view as an image.
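A rough sketch of what a symbolic character-matrix view of a cube stack might look like. The encoding is our assumption (one letter per color, `.` for empty, rows listed top to bottom) and is not taken from the paper; `front_view` and `right_view` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]  # (x, y, z): x to the right, y away from the front viewer, z up


def front_view(stack: Dict[Voxel, str], size: int) -> List[str]:
    """Orthographic view from the front: for each (x, z), the cube with the smallest y is visible."""
    rows = []
    for z in reversed(range(size)):  # top row first
        row = ""
        for x in range(size):
            visible = "."
            for y in range(size):
                if (x, y, z) in stack:
                    visible = stack[(x, y, z)]
                    break
            row += visible
        rows.append(row)
    return rows


def right_view(stack: Dict[Voxel, str], size: int) -> List[str]:
    """Orthographic view from the right: for each (y, z), the cube with the largest x is visible."""
    rows = []
    for z in reversed(range(size)):
        row = ""
        for y in range(size):  # from the right side, y increases left to right
            visible = "."
            for x in reversed(range(size)):
                if (x, y, z) in stack:
                    visible = stack[(x, y, z)]
                    break
            row += visible
        rows.append(row)
    return rows


stack = {(0, 0, 0): "R", (1, 0, 0): "G", (0, 1, 0): "B", (0, 1, 1): "Y"}
front = front_view(stack, size=2)
right = right_view(stack, size=2)
```

Here the full 3D stack is known, so projection is mechanical. The benchmark task is harder: the model sees only a few views and has to infer the hidden stack before it can produce a new one.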
The final accuracy advantage is already clear: visual world modeling reaches 76.8%, while the implicit and verbal versions reach 63.7% and 60.2%. But the more interesting diagnostic is fidelity. Under a relaxed shape-only evaluation, verbal world modeling degrades toward near-zero fidelity as cube-stack size increases. Visual world modeling keeps fidelity scores consistently above 50%.
That tells us the improvement is not merely answer-pattern learning. The generated visual intermediate is closer to the intended structure.
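One way to picture the gap between exact and shape-only fidelity is the sketch below. The paper's exact scoring procedure is not reproduced here; this version simply assumes views are character matrices with `.` for empty cells, and the function names are hypothetical.

```python
from typing import List


def shape_only(view: List[str]) -> List[str]:
    """Discard color: keep only which cells are occupied."""
    return ["".join("#" if ch != "." else "." for ch in row) for row in view]


def exact_match(predicted: List[str], target: List[str]) -> bool:
    """Strict fidelity: every cell, including color, must agree."""
    return predicted == target


def shape_match(predicted: List[str], target: List[str]) -> bool:
    """Relaxed fidelity: only the occupancy pattern must agree."""
    return shape_only(predicted) == shape_only(target)


def fidelity_rate(pairs: List[tuple], relaxed: bool = True) -> float:
    """Fraction of (predicted, target) pairs that pass the chosen check."""
    check = shape_match if relaxed else exact_match
    return sum(check(p, t) for p, t in pairs) / len(pairs)


target = ["Y.", "RG"]
wrong_color = ["Y.", "RB"]  # right silhouette, one wrong color
wrong_shape = ["YY", "RG"]  # an extra cube that should not be there
```

A prediction such as `wrong_color` passes the relaxed check and fails the strict one. Tracking both separates color mistakes from structural mistakes, which is the distinction the diagnostic above relies on.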
This distinction matters for deployment. In many enterprise settings, answer accuracy alone is not enough. If a system gives the right final answer for the wrong intermediate reason, it may fail unpredictably under small distribution shifts. A model used for inspection, simulation, or planning must preserve the relevant state, not just guess the label.
The paper also reports that visual world modeling maintains an advantage on an out-of-distribution cube size of six, where training used smaller stack sizes. The improvement is around 10 percentage points. That is not a license to trust the system blindly, but it does suggest that the visual pathway may generalize better when the task is a spatial transformation rather than a memorized verbal rule.
Still, the appendix is honest about failure cases. Visual generations can blur details, corrupt structure, or infer colors incorrectly. This is exactly why the paper’s mechanism is useful: visual world modeling is a state representation with fidelity limits, not a guarantee of truth. Generated images can help reasoning and still hallucinate. The model has not become a physicist. It has acquired a better scratchpad for certain jobs.
The grid-world exception is the best part of the paper
The most useful result may be the negative one: visual world modeling does not clearly help maze and Sokoban in the same way.
On maze, implicit world modeling performs best at 77.0%, while verbal and visual variants score lower at 73.1% and 70.6%. On Sokoban, visual world modeling scores 39.3%, slightly above verbal at 36.8% and further above implicit at 29.6%, but the authors do not treat this as the same kind of strong visual advantage seen in paper folding, ball tracking, cube projection, or MMSI.
Why?
Because simple grid-world states can be represented compactly. The model may need to track one or two coordinates. Text, hidden activations, or simple symbolic state updates can handle that without generating images. Here, visual generation adds complexity without much additional useful information.
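How compact that state is can be seen in a few lines. The sketch below assumes a toy maze encoded as strings (`#` for walls) and a hypothetical `track` helper; it is an illustration, not the benchmark's environment.

```python
from typing import List, Tuple

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def track(grid: List[str], start: Tuple[int, int], actions: str) -> Tuple[int, int]:
    """Follow the actions, ignoring any move into a wall or off the grid."""
    r, c = start
    for action in actions:
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
            r, c = nr, nc
    return r, c


maze = ["S.#",
        ".##",
        "..G"]
final_cell = track(maze, start=(0, 0), actions="DDRR")
```

The entire world state is one pair of integers. Rendering an image at each step would add cost without adding information the pair does not already carry.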
The authors probe this directly in the maze setting. They fine-tune BAGEL on chain-of-thought data where explicit point coordinates are masked. Then they train small probes on the model’s hidden representations to predict the masked coordinates. A randomly initialized model fails, as expected. The pretrained model already contains some predictive internal representations. After supervised fine-tuning, the hidden states become near-perfectly predictive of the coordinates.
This is an exploratory mechanistic analysis, not a universal law. But it explains why explicit visual generation is unnecessary for simple maze reasoning. The model already has an implicit world model good enough for the task.
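The probing logic can be sketched on synthetic data. Everything here is an assumption for illustration: the "hidden states" are random vectors we construct ourselves, not BAGEL activations, and the probe is a plain linear least-squares fit using NumPy, which may differ from the probes the authors trained.

```python
import numpy as np


def fit_linear_probe(hidden: np.ndarray, coords: np.ndarray) -> np.ndarray:
    """Least-squares map from hidden vectors (plus a bias column) to (row, col)."""
    features = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(features, coords, rcond=None)
    return weights


def probe_accuracy(weights: np.ndarray, hidden: np.ndarray, coords: np.ndarray) -> float:
    """Fraction of samples where both rounded coordinates are exactly right."""
    features = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    predicted = np.rint(features @ weights)
    return float(np.mean(np.all(predicted == coords, axis=1)))


rng = np.random.default_rng(0)
coords = rng.integers(0, 5, size=(2000, 2)).astype(float)  # positions on a 5x5 grid

# Synthetic stand-ins: one representation linearly encodes the position, one does not.
mixing = rng.normal(size=(2, 32))
informative = coords @ mixing + 0.1 * rng.normal(size=(2000, 32))
uninformative = rng.normal(size=(2000, 32))

train, test = slice(0, 1500), slice(1500, 2000)
acc_informative = probe_accuracy(
    fit_linear_probe(informative[train], coords[train]), informative[test], coords[test])
acc_uninformative = probe_accuracy(
    fit_linear_probe(uninformative[train], coords[train]), uninformative[test], coords[test])
```

High held-out accuracy on one representation and near-chance accuracy on the other is the pattern a probe is meant to expose. It shows the information is linearly readable, not that the model actually uses it.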
For business use, this is the part worth taping to the wall:
Use visual generation when the intermediate state is hard to preserve in text or hidden state. Do not use it merely because the task contains an image.
A simple warehouse routing grid may not need visual CoT. A multi-camera occlusion problem might. A form-processing workflow definitely does not need a generated image of its own spreadsheet having a moment of self-discovery.
The control tests narrow the explanation
The paper also addresses two alternative explanations.
First, perhaps BAGEL performs better with visual world modeling only because its verbal reasoning is weak. To test this, the authors fine-tune Qwen2.5-VL-7B-Instruct on the same verbal chain-of-thought datasets for representative tasks. The Qwen2.5-VL verbal results are comparable to BAGEL’s verbal results and still lag behind BAGEL when BAGEL uses visual world modeling.
This control matters. It reduces the chance that the paper is merely comparing a weak verbal baseline against a favored visual pathway. The advantage appears tied to the modality of world modeling, not simply to a crippled text reasoner.
Second, perhaps reinforcement learning from verifiable rewards could train verbal chain-of-thought to catch up. The authors run RLVR on representative tasks. RL improves several chain-of-thought formats, but the performance gap persists. Even more interestingly, visual world modeling improves under RL even though only the verbal generation component is directly optimized, while visual generation is regularized.
That suggests the current result is not just a supervised fine-tuning artifact. It also points to an obvious future direction: RL methods designed specifically for interleaved verbal-visual generation. At the moment, the visual side is still not being optimized as directly as it could be. So the paper is probably not showing the ceiling. It is showing a plausible early floor.
What the paper directly shows, and what businesses should infer
The paper directly shows that, under controlled post-training on a BAGEL-style unified multimodal model, interleaved visual-verbal chain-of-thought improves accuracy on selected spatial and physical reasoning tasks designed around world simulation and reconstruction. It also shows that the advantage weakens or disappears when the task state is simple enough for implicit or verbal tracking.
Cognaptus’ business inference is narrower but valuable: visual generation should be treated as an optional reasoning substrate for workflows where the intermediate state is spatial, physical, occluded, or dynamic.
That gives us a design rule:
| Workflow question | Recommended reasoning substrate | Why |
|---|---|---|
| Is the task mostly symbolic, textual, or tabular? | Verbal / tool-based reasoning | Images add little and may increase cost |
| Does the task require tracking a small discrete state? | Implicit or symbolic state tracking | Coordinates or tables are cheaper than generated images |
| Does the task require viewpoint transformation or occlusion recovery? | Visual world reconstruction | Generated views may preserve spatial relations better |
| Does the task require predicting physical state changes? | Visual world simulation | Images may better encode dynamics and geometry |
| Does the task require auditable intermediate states? | Visual plus structured checks | Generated images need fidelity validation |
| Is the answer safety-critical? | Multimodal reasoning with external verification | Visual CoT is not enough by itself |
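The table above can be read as a small routing function. The sketch below is one possible encoding: the `TaskProfile` fields, the precedence order among the rules, and the returned labels are our assumptions, not something the paper specifies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskProfile:
    small_discrete_state: bool = False
    needs_viewpoint_or_occlusion: bool = False
    needs_physical_dynamics: bool = False
    needs_auditable_states: bool = False
    safety_critical: bool = False


def choose_substrate(task: TaskProfile) -> Tuple[str, List[str]]:
    """Return (reasoning substrate, safeguards). Spatial needs take precedence (an assumption)."""
    if task.needs_viewpoint_or_occlusion:
        substrate = "visual world reconstruction"
    elif task.needs_physical_dynamics:
        substrate = "visual world simulation"
    elif task.small_discrete_state:
        substrate = "implicit or symbolic state tracking"
    else:
        substrate = "verbal / tool-based reasoning"  # default for symbolic, textual, tabular work

    safeguards = []
    if task.needs_auditable_states:
        safeguards.append("structured fidelity checks on intermediate states")
    if task.safety_critical:
        safeguards.append("external verification")
    return substrate, safeguards


invoice_task = choose_substrate(TaskProfile())
robot_task = choose_substrate(TaskProfile(needs_physical_dynamics=True, safety_critical=True))
```

The useful property is the default: nothing gets routed to image generation unless a spatial or physical need is explicitly declared.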
For robotics, this could mean generating intermediate predicted frames before selecting an action. For inspection, it could mean reconstructing likely hidden structures before flagging anomalies. For digital twins, it could mean using generated views as intermediate hypotheses, not final truth. For spatial QA, it could mean synthesizing a missing viewpoint before answering a positional question.
The ROI pathway is not “prettier outputs.” It is fewer failed decisions in workflows where text collapses too much state.
But the cost side is also real. Visual generation is slower, heavier, and harder to verify than text. If the model generates a wrong intermediate image, it can confidently reason from a false state. That is not an edge case. The appendix explicitly shows failures involving blurred details, corrupted views, subtle shape errors, and color inference mistakes.
In practical systems, visual world modeling needs a verifier. The verifier may be geometric rules, simulation constraints, sensor feedback, object detectors, multi-view consistency checks, or a human-in-the-loop review layer. The generated image should be treated as a hypothesis, not a photograph from heaven.
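As one small example of a multi-view consistency check, consider two orthographic side views of the same cube stack. Any height level that contains a cube must appear non-empty in both views, so a generated view that disagrees with a known view on this point can be rejected. The sketch assumes views are character matrices with `.` for empty cells and rows listed top to bottom; the helper names are hypothetical.

```python
from typing import List


def occupied_rows(view: List[str]) -> List[bool]:
    """For each height level, does the view show at least one cube?"""
    return [any(ch != "." for ch in row) for row in view]


def consistent_views(known_view: List[str], generated_view: List[str]) -> bool:
    """Necessary (not sufficient) condition for two side views of one stack to agree."""
    return (len(known_view) == len(generated_view)
            and occupied_rows(known_view) == occupied_rows(generated_view))


known_front = ["Y.", "RG"]
plausible_right = [".Y", "GB"]
impossible_right = ["..", "GB"]  # loses the top layer that the front view clearly shows
```

Passing this check does not make a generated view correct; it only filters out hypotheses that are geometrically impossible. A practical verifier would stack several such constraints and fall back to sensors or human review when they disagree.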
The deployment boundary is not “multimodal”; it is “state fidelity”
The most common mistake after reading this paper would be to say: “Great, let’s add visual chain-of-thought to our multimodal AI product.”
No. Not great. Try again.
The deployment boundary is whether generated intermediate observations improve state fidelity for the task. If they do, visual world modeling may be worth the compute. If they do not, it is decoration with an invoice.
There are four practical tests before adopting this architecture.
First, ask whether the task state is naturally visual. If the important information is relative position, motion, geometry, occlusion, deformation, or viewpoint, visual modeling deserves consideration. If the important information is a policy clause, a customer tier, or an invoice amount, please leave the pixels alone.
Second, compare against cheaper state representations. A matrix, graph, table, coordinate list, or physics engine may outperform a generated image. The paper itself shows verbal matrices are viable in some settings, although weak for complex cube projection. The right question is not “visual or verbal?” It is “which representation preserves the decision-relevant state with the lowest error and cost?”
Third, measure intermediate fidelity, not just final answer accuracy. The cube analysis is a good example. If the generated state is wrong, the system may still occasionally answer correctly by luck or bias. That is not a system one should deploy into physical operations unless one enjoys incident reports.
Fourth, route dynamically. Some cases need visual generation; others do not. A mature AI workflow should select the reasoning substrate based on task structure. The paper’s own results justify selective multimodality, not blanket multimodality.
Where the paper stops
The paper is careful about scope, and the boundaries matter.
The experiments focus mainly on spatial and physical reasoning tasks. That is appropriate for the hypothesis, but it means the results should not be generalized to all multimodal reasoning. Visual jigsaw tasks, mathematical diagram editing, and other visual reasoning settings may fit the same world-model lens, but the paper does not fully settle them.
The main model is BAGEL, with Qwen2.5-VL used as a comparison for verbal reasoning. Stronger future unified multimodal models may change the absolute numbers. They may also reduce some current failure modes in generation quality and spatial understanding. The mechanism may survive, but the performance frontier will move.
The reinforcement learning setup also does not fully optimize visual generation. RL is applied to the verbal component, with visual generation regularized. Better RL methods for interleaved visual-verbal reasoning may produce larger gains or reveal new failure modes.
Finally, the interpretability probe for maze reasoning is intriguing but preliminary. It shows that hidden representations can encode simple coordinates in this setting. It does not mean we can reliably inspect or trust implicit world models in complex environments.
These limitations do not weaken the paper’s main contribution. They sharpen it. The paper is not claiming that generated images are universally better than text. It is giving us a framework for asking when the reasoning medium matches the world being reasoned about.
Seeing becomes thinking only when the image carries the state
The useful slogan is not “models should think in pictures.” That is too broad, and therefore probably wrong.
The better conclusion is: models should reason in the representation that best preserves the task state.
For many business workflows, that representation will remain text, tables, graphs, code, or tool calls. For spatial and physical workflows, generated images may become a serious intermediate reasoning medium. They can reconstruct hidden structure, simulate state changes, and preserve relations that language compresses badly.
This paper gives visual generation a more disciplined role. It is not output decoration. It is not a universal upgrade. It is not a magical substitute for physics, sensors, or verification.
It is a world-modeling instrument.
Used selectively, that instrument could matter for embodied AI, robotics, inspection, planning, and digital twins. Used indiscriminately, it will produce beautiful intermediate hallucinations at enterprise scale. Naturally, someone will try that too.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, and Mingsheng Long, “Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models,” arXiv:2601.19834, 2026, https://arxiv.org/abs/2601.19834.