A room is a cruelly simple test for artificial intelligence.

Put a person inside it. Tell them they are facing an avocado. Ask them to turn right by 270 degrees, then left by 90 degrees. Give them a few observations along the way. After the final turn, ask what they can see.

Most humans do not need a research grant, a GPU cluster, or a motivational quote about emergent reasoning. They keep a small mental map, update their facing direction, and answer. The task is not glamorous. That is exactly why it is useful.

The paper behind this article asks whether large language models and vision-language models can perform this kind of viewpoint rotation using only text.1 Not images. Not video. Not a 3D simulator. Just a sequence of textual actions and observations.

The uncomfortable result is not merely that many models fail. The more interesting result is where they fail. The models are usually not confused about whether the prompt says “left” or “right.” They are not generally blind to the angle. The paper’s interpretability evidence suggests a sharper diagnosis: models can encode local rotation facts, partially infer orientation, and then lose the binding between that inferred orientation and the object that should be visible from it.

In plainer business language: the system reads the procedure, tracks pieces of state, and still picks the wrong outcome. Elegant, confident, and wrong. A familiar product category, unfortunately.

The real failure is not parsing the turn

A quick summary of the task is necessary before the mechanism matters.

The authors construct VRUBench, a benchmark of 19,591 text-only viewpoint rotation instances. Each instance describes an indoor-room scenario with observations attached to viewpoint rotations. The model starts with an initial observation, receives a sequence of turns, sees objects after intermediate turns, and must predict the observation after the final turn. If the final viewpoint has not previously been observed, the correct answer is “unknown.”

The dataset is synthetic, but not frivolous. It uses 100 everyday indoor objects and varies the number of rotations from 2 to 5 steps. The counts are also reasonably balanced: 4,614 two-step cases, 4,977 three-step cases, and 5,000 cases each for four-step and five-step rotations.

The task separates four abilities that are often collapsed under the lazy label “reasoning”:

Capability being tested What the model must do Why it matters
Instruction parsing Recognize direction and angle from text Basic workflow reading
State update Translate each turn into a new orientation Procedural tracking
Memory binding Link each visited orientation to the object seen there World-state consistency
Answer selection Choose the observation for the final orientation, or “unknown” Decision reliability

The trap is that a model may pass the first category and still fail the last three. That matters because many enterprise AI evaluations over-test instruction parsing and under-test state binding. They check whether the model can explain a workflow, not whether it can maintain the changing facts inside that workflow.

That distinction is not academic hair-splitting. A warehouse robot, field-service copilot, AR maintenance assistant, CAD helper, logistics planner, digital-twin analyst, or autonomous inspection system must keep track of changing positions, dependencies, and observations. If the model treats state as decorative text rather than an evolving structure, the failure will not always look dramatic. It may simply recommend the wrong shelf, valve, asset, room, lane, or part.

Small details. Expensive details.

VRUBench is easy for humans because the hidden map is tiny

The headline contrast is stark: human evaluators reached 100% accuracy on the sampled evaluation cases, with perfect inter-annotator agreement reported in the appendix. Models were much less consistent.

The authors evaluate a range of LLMs and VLMs, including LLaMA, Qwen, Qwen-VL, and Gemini models. The strongest numbers are not uniformly terrible. Some large or reasoning-enabled models perform well. But the pattern is uneven enough to be useful.

Model group / setting Representative result from the paper Interpretation
Weak or older LLMs LLaMA2-7B-chat averages 18.90% Some systems barely form the needed state representation
Stronger text-only LLMs Qwen2.5-32B averages 72.84% Scale helps, but does not close the human gap
VLMs without image input Gemini3-Flash averages 75.91%; Qwen3-VL-32B averages 69.98% Visual training appears to transfer into text-only spatial reasoning
Reasoning-enabled VLMs Qwen3-VL-32B-thinking averages 96.55%; Gemini3-Flash-thinking averages 86.32% Explicit reasoning can help, but the benefit depends heavily on model and setting
Humans 100% The benchmark is not hard for ordinary spatial cognition

The 96.55% result for Qwen3-VL-32B with thinking enabled is important, but it should not be misread. It shows that some current systems can get close on this benchmark when scale, multimodal training, and reasoning mode align. It does not show that textual spatial state tracking is solved. The same table contains weaker cases, degradation with longer sequences for many models, and non-uniform effects from enabling reasoning. For example, Qwen3-VL-4B performs worse with thinking enabled than without it.

This is one reason the paper is more interesting than a leaderboard. A leaderboard would invite the usual comfort blanket: bigger models improve, reasoning mode helps, multimodal training helps, please clap. The mechanism tells a less convenient story.

Visual training helps even when the test is text-only

One consistent finding is that VLMs tend to outperform comparable LLMs, even though the evaluation prompt contains no image.

This is the first useful operational clue. Multimodal training may teach models spatial regularities that remain available when the input is purely linguistic. The model has learned something about space from visual data, and that latent structure can support a text-only reasoning task.

For product teams, this weakens a common procurement shortcut: “Our application is mostly text, so a text-only model is enough.” That may be true for summarization, extraction, classification, and email-like automation. It is less obviously true for workflows that describe physical or spatial processes in text. A safety manual, repair log, CAD note, warehouse instruction, or construction report may be “text” as a file format while being spatial as a cognitive task.

The paper’s result does not mean every text product needs a VLM. That would be the expensive interpretation, which naturally some vendors would enjoy. The cleaner inference is that model selection should follow the latent task structure, not the surface data type. If the text describes space, motion, orientation, or changing physical relationships, evaluate spatial reasoning directly.

Reasoning mode helps, but verbosity is not the same as state integrity

The paper also tests direct-answer settings against “think-then-answer” settings for models that support reasoning mode. In many cases, reasoning improves accuracy. Qwen3-VL-32B jumps from 69.98% to 96.55% average accuracy; Gemini3-Flash rises from 75.91% to 86.32%.

That is a real gain. It would be silly to dismiss it.

But the gain is not uniform. Smaller models may improve only modestly or even degrade. Qwen3-VL-4B drops from 48.44% to 43.11% when thinking is enabled. Qwen3-32B as a text-only model improves from 53.08% to 57.99%, which is helpful but hardly a miracle.

The point is not “chain-of-thought works” or “chain-of-thought fails.” The point is narrower and more useful: explicit reasoning can help when it forces the model to externalize a state update procedure, but it does not guarantee that the model’s internal representation remains correctly bound to the answer. A model can write a plausible scratchpad and still output the wrong object. Anyone who has watched an AI system produce a beautiful explanation for a bad answer has already met this creature in the wild.

So the practical evaluation question should not be “Does the model reason out loud?” It should be “Does reasoning mode improve the specific state transition and binding failure we care about?”

That is a much less marketable sentence. It is also the one worth putting into an evaluation plan.

The model often knows the direction and angle

The strongest part of the paper is not the benchmark. It is the interpretability work that follows.

The authors use layer-wise probing to ask what information is encoded inside the model’s hidden states. They probe for three kinds of information: rotation direction, rotation angle, and absolute orientation after each rotation.

The direction and angle results are almost boring, which is useful. Probing accuracy for direction and angle exceeds 99% in most layers. This means the model has access to the local facts in the prompt. It knows, in representation space, that a step says left or right and what angle is involved.

That finding kills a tempting but shallow explanation. The models are not merely failing because they cannot read the instruction. They can read it. The failure happens after the easy part.

Orientation is different. Absolute orientation is harder for the models to encode, and the paper reports that VLMs encode it better than LLMs. More interestingly, in Qwen2.5-VL-7B the ability to probe orientation emerges in early-to-middle layers and then declines in later layers.

At first glance, that decline looks like the model forgetting orientation. The paper’s interpretation is more subtle: later layers are no longer mainly representing orientation; they are transitioning toward answer selection. That shift is normal in a transformer. Early layers gather and transform information. Later layers increasingly prepare the output.

The problem is that the handoff is fragile.

Internal stage What the paper finds Business translation
Early processing Direction and angle are encoded very strongly The model can read local workflow instructions
Early-to-middle processing Orientation information begins to emerge The model partially builds a spatial state
Middle-to-late processing A small set of heads has causal influence on answer generation Final decisions depend on sparse internal components
Output preparation The model struggles to bind orientation to the correct observation State tracking can fail exactly when the answer is chosen

This is the article’s central mechanism. The model does not simply “lack spatial intelligence” in a vague way. It performs parts of the computation and then fails at binding. That is a much more actionable diagnosis.

The late-layer heads behave like a small decision committee

To understand what happens in the later layers, the authors use path patching, a causal intervention method. They construct clean and corrupted examples by flipping the rotation direction at the final step, then test which attention heads causally affect the output logits.

The result is sparse. Only a small fraction of attention heads matter strongly for VRU performance, and these key heads are mainly located in middle-to-upper layers. The appendix reports similar sparse, late-layer patterns across several other models, which makes the observation more than a one-model curiosity.

The paper then inspects attention patterns and proposes three functional head types:

Head type Observed role Why it matters
Proposal head Attends to candidate answers in the prompt, including “unknown” It gathers the possible output options
Answer decision head Focuses attention on the selected answer while reducing attention to alternatives It helps choose the final object token
Unknown head Shows strong attention to “unknown” and appears linked to cautious answering It may reflect uncertainty-handling behavior rather than ordinary object selection

The “unknown head” is especially revealing. When the model is supposed to answer “unknown” if the final observation cannot be determined, one head shows strong attention to that token. When the authors ablate this head, the proportion of “unknown” outputs drops from 65.78% to 40.73%. They also test alternative tokens and find that semantically unrelated replacements do not trigger the same pattern, while the Chinese translation of “unknown” does.

This does not prove a complete theory of uncertainty alignment. The authors themselves frame it cautiously. But it does suggest that uncertainty-handling behavior may be localized in specific internal components. That is commercially interesting for a reason nobody should over-romanticize: if model behavior can be localized, it can sometimes be modified without retraining the entire system.

Still, the main failure remains the binding problem. The proposal and decision heads can locate candidates and prepare an answer, but they do not reliably connect the earlier orientation representation with the correct object at that orientation. The model can have the ingredients and still assemble the wrong sandwich. AI, in its culinary phase.

The experiments are not all serving the same purpose

A useful way to read the paper is to separate main evidence from diagnostic support and repair experiments. Otherwise, it is easy to blur benchmark performance, interpretability, and fine-tuning into one cheerful narrative.

Experiment or analysis Likely purpose What it supports What it does not prove
VRUBench evaluation Main evidence Current LLMs and VLMs vary widely and often lag humans on text-only viewpoint rotation General spatial intelligence in all real-world settings
LLM vs VLM comparison Main comparative evidence Visual training can improve text-only spatial reasoning That VLMs are always better for every text workflow
Reasoning vs non-reasoning Sensitivity / method comparison Explicit reasoning can improve VRU in many cases That reasoning mode fixes state tracking generally
Layer-wise probing Mechanism evidence Direction and angle are encoded; orientation emerges less reliably That probed representations are directly used perfectly by the model
Path patching Causal interpretability evidence Sparse middle-to-late heads affect VRU answer generation A complete causal circuit for every model and every prompt
Head ablation Validation of interpretability Identified heads matter more than random heads That removing heads is a deployable repair method
Selective fine-tuning Intervention / improvement experiment Updating key heads can improve VRU while preserving generic ability better than full SFT That selective tuning will scale unchanged to larger proprietary systems
SpinBench transfer results Out-of-distribution extension Textual VRU tuning may transfer to related visual-spatial tasks Broad transfer across all embodied or spatial tasks

This distinction matters for business readers because the paper is not merely saying “models are bad at rotation.” It is showing a possible engineering loop:

  1. Identify a task-specific failure.
  2. Probe where the relevant information exists.
  3. Use causal intervention to locate influential components.
  4. Tune the localized components instead of the whole model.
  5. Check both task gain and general-capability retention.

That loop is more valuable than the benchmark itself.

Selective fine-tuning is a repair strategy, not a magic wand

The authors use the interpretability findings to guide selective supervised fine-tuning. Instead of updating the full model, they update the parameters associated with the top 32 key attention heads and freeze the rest.

The result is a trade-off.

Full SFT achieves higher VRUBench accuracy. For Qwen2.5-VL-7B, full SFT raises VRUBench accuracy from 48.7% to 96.3%. Selective SFT raises it to 78.7%. If the only objective is the in-domain benchmark, full SFT wins.

But enterprise systems rarely optimize one metric in isolation unless someone has hidden the post-deployment cost center. Full SFT damages generic ability in the paper’s evaluation. On Qwen2.5-VL-7B, full SFT drops MMLU from 60.3% to 55.6% and BBH from 49.2% to 35.8%. Selective SFT keeps MMLU unchanged at 60.3% and reduces BBH only slightly to 48.4%.

For Qwen2.5-VL-3B, the pattern is similar. Full SFT lifts VRUBench from 37.6% to 88.5%, while selective SFT reaches 80.1%. But selective SFT trains fewer parameters, runs faster in the reported setup, and preserves or slightly improves the generic benchmarks reported in the table.

Model Method VRUBench gain Generic ability effect Operational reading
Qwen2.5-VL-3B Full SFT +50.9 points MMLU -1.0, BBH -5.8 Strong task gain, visible forgetting
Qwen2.5-VL-3B Selective SFT +42.5 points MMLU +0.4, BBH +0.7 Slightly lower task gain, better preservation
Qwen2.5-VL-7B Full SFT +47.6 points MMLU -4.7, BBH -13.4 Best in-domain score, costly forgetting
Qwen2.5-VL-7B Selective SFT +30.0 points MMLU +0.0, BBH -0.8 More conservative repair with less collateral damage

The paper also reports that selective SFT uses fewer tuned parameters and about half the GPU hours compared with full fine-tuning in their setting. This is the practical value of interpretability: not interpretability as a compliance poster, but interpretability as a way to reduce the blast radius of adaptation.

There is another detail worth noting. Selective SFT using textual data improves performance on the related visual-spatial SpinBench subset. The authors interpret this as evidence of transfer between textual and visual spatial learning. That is plausible, and the appendix reports improvements across SpinBench subtasks including viewpoint, object rotation, face rotation, and object identity. Still, this should be read as an encouraging extension rather than a universal transfer theorem. A few related benchmarks are not the physical world.

The business lesson is evaluation design, not “buy the biggest model”

The easiest business interpretation would be: use larger multimodal reasoning models. That is not wrong. It is just incomplete, and therefore expensive in the usual way incomplete ideas become expensive.

A better interpretation is that teams should evaluate the specific cognitive bottleneck their workflow requires. For spatial or procedural workflows, the bottleneck may not be language fluency. It may be dynamic state binding.

Consider three enterprise scenarios:

Workflow Surface form Hidden requirement Failure if binding breaks
Field maintenance assistant Text manuals and technician notes Track component position through a sequence of actions Recommends the wrong component or inspection step
Warehouse operations copilot Pick paths, shelf descriptions, inventory observations Maintain changing location and object state Sends staff to the wrong shelf or misidentifies stock
Digital twin / asset monitoring Logs, map labels, sensor descriptions Bind events to spatial entities over time Confuses which asset or zone is affected

In all three cases, the input can look like text. The task is not “text understanding” in any simple sense. It is stateful reasoning over a changing environment. A model that summarizes the instruction well may still fail the operational logic.

Cognaptus’ inference from this paper is therefore practical: evaluation suites should include small synthetic state-tracking tests tailored to the workflow before moving to costly real-world pilots. Synthetic tests are not substitutes for field validation, but they are excellent filters. They reveal whether the model can preserve basic invariants before it is allowed to improvise around messy reality.

This is also where mechanism-first evaluation beats leaderboard shopping. If a model fails, the question should not stop at “accuracy is low.” The next questions are:

  • Did it parse the instruction?
  • Did it update the intermediate state?
  • Did it bind the state to the right observed entity?
  • Did the final decision layer select the right candidate?
  • Did uncertainty handling interfere with answer selection?

Those questions turn a vague model-quality problem into a diagnostic workflow.

What the paper directly shows, what we infer, and what remains uncertain

The paper directly shows four things.

First, VRUBench exposes a gap between human and model performance on text-only viewpoint rotation, especially for weaker or non-reasoning settings. Second, VLMs generally outperform LLMs on this task even without visual inputs. Third, probing and path patching suggest that models encode direction and angle, partially encode orientation, and rely on sparse late-layer heads for answer selection. Fourth, selective SFT on key heads improves VRU while preserving generic capabilities better than full SFT in the tested models.

Cognaptus infers three business lessons.

First, enterprise AI evaluation should test state binding directly when workflows involve physical, spatial, or procedural changes. Second, multimodal training may be useful even for text-heavy products if the underlying task is spatial. Third, interpretability-guided adaptation can become a cost-control tool: it may help teams improve a domain capability without damaging unrelated abilities as much as full fine-tuning.

The uncertain parts are just as important.

The benchmark is synthetic. Its rooms are simplified, the object list is controlled, and rotations are discrete. Real operational environments include noisy observations, ambiguous language, partial maps, sensor errors, and consequences that do not fit neatly into “object” or “unknown.”

The paper also notes prompt sensitivity as a limitation. That matters because enterprise deployments are prompt ecosystems, not laboratory prompts. Users phrase things badly. Logs are inconsistent. Procedures include abbreviations, omissions, and local jargon. A model that performs well under one clean template may still wobble in production.

The interpretability analysis focuses mainly on direct-answer behavior, not the full internal dynamics of explicit chain-of-thought reasoning. Since reasoning mode substantially changes performance for some models, this leaves an important open question: does externalized reasoning repair the same internal binding failure, route around it, or create a different failure mode?

Finally, the selective and full fine-tuning experiments are conducted on models no larger than 7B. The results are promising, but scaling them to larger, closed, production-grade models is not automatic. Attention-head localization may remain useful. The exact repair method may not transfer neatly.

A small room is a serious benchmark

The reason this paper is useful is not that it proves models are hopeless at space. It does not. Some models perform quite well, especially with multimodal training and reasoning enabled.

Its value is that it makes the failure legible.

The model reads “turn left.” It encodes the angle. It begins to form an orientation. Then, near the point of answer selection, a small late-layer decision process may fail to bind that orientation to the correct observed object. The result is not random nonsense. It is structured failure: the most dangerous kind, because it can look like competence until the final step.

That is the broader lesson for AI products. Many failures in business workflows will not come from models misunderstanding every word. They will come from models understanding enough pieces to sound useful, while failing to preserve the relationships that make the answer true.

The industry likes to sell reasoning as a personality trait: the model is smart, thoughtful, agentic, reflective, whatever adjective survived the investor deck. VRUBench reminds us that reasoning is also bookkeeping. You turned. The room changed. The object in front of you is different now.

If a model cannot keep that straight, perhaps it should not be trusted with the warehouse just yet.

Cognaptus: Automate the Present, Incubate the Future.


  1. Zhen Yang, Ping Jian, Zhongbin Guo, Zuming Zhang, Chengzhi Li, Yonghong Deng, Xinyue Zhang, and Wenpeng Lu, “How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study,” arXiv:2604.15294v1, 16 April 2026. ↩︎