Seeing Is Believing—Planning Is Not: What SpatialBench Reveals About MLLMs

A robot in a parking lot does not need poetry. It needs to know where the car is, which way the road bends, what happens if it turns right, and how to reach the exit without performing an expensive interpretation of modern sculpture on someone’s bumper.

That sounds simple until we ask a multimodal large language model to do it.

The problem is not that today’s MLLMs cannot see. Many of them can identify objects, describe scenes, count visible entities, estimate relationships, and produce very confident paragraphs about what is in front of them. The problem is that seeing is not the same as holding a usable spatial model of the world. A model can recognize a white car, a road, a wall, a turning direction, and still lose the plot when the camera moves, the reference frame changes, or a plan must be assembled across time.

That is the useful discomfort in SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition.¹ The paper does not merely ask whether MLLMs understand visual scenes. It asks whether their spatial intelligence survives the climb from observation to planning. Spoiler, because subtlety is overrated: often, it does not.

Spatial cognition is a stack, not a label

The central contribution of SpatialBench is a five-level hierarchy for evaluating spatial cognition. This matters because “spatial reasoning” is too often treated as one lump of capability, as if identifying the chair and planning a safe route around the chair were adjacent skills separated only by a slightly longer prompt. They are not.

The paper decomposes spatial cognition into five levels:

Level	Capability	What the model must do
L1 Observation	Recognize objects and basic attributes	Count objects, estimate size, infer room size, estimate absolute distance
L2 Topology and relation	Understand spatial arrangement	Track appearance order, relative distance, relative direction, route-based order, relative counting
L3 Symbolic reasoning	Convert visual cues into abstract constraints	Perform multi-hop spatial reasoning, infer affordances, localize pose under landmark constraints
L4 Causality	Predict spatial consequences	Reason about how movement or actions change a scene
L5 Planning	Integrate perception, relation, causality, and goals	Follow visual commands or plan routes

This hierarchy is the paper’s real analytical engine. It changes the question from “Which model got the highest score?” to “Where does the model’s spatial cognition break?” That shift is more useful for businesses than another leaderboard screenshot, although screenshots do have the comforting illusion of governance.

A warehouse robot, an inspection drone, an AR navigation assistant, or an embodied customer-service agent does not fail only when it cannot see a box. It can fail when it sees the box but cannot decide whether the box blocks the route, whether moving around it changes the goal path, or whether “left” means camera-left, robot-left, or scene-left. SpatialBench makes that distinction explicit.

SpatialBench tests the climb from pixels to plans

SpatialBench is built from 117 real-world videos and 3,193 question-answer pairs across 15 task types. The videos are captured in indoor and outdoor settings, including offices, residential areas, city streets, wooded regions, and underground environments. The dataset is not just a tidy synthetic puzzle set wearing a lab coat. The authors use a synchronized RGB camera and 3D LiDAR sensor; the RGB stream supports visual question generation, while LiDAR provides geometric ground truth for measurement-related questions such as size and distance.

The construction process matters because spatial evaluation can easily become a benchmark of annotation convenience rather than spatial intelligence. For non-metric tasks, human annotators design questions and commercial models assist in generating answers and evidence summaries. For metric questions, LiDAR-derived geometry supplies the answers. The paper then applies verification protocols: lower-level questions use consistency checks and spot audits, while higher-level questions undergo mandatory human verification.

That does not make the benchmark perfect. No benchmark is a production safety certificate, and 117 videos cannot represent the full mess of physical deployment. But the design is pointed: the benchmark is structured enough to diagnose cognitive stages, while realistic enough to avoid being merely a geometry worksheet.

The paper also introduces a capability-oriented overall score. Instead of averaging task scores blindly, the metric weights the five cognitive levels in a way intended to reflect the increasing complexity of higher-level spatial cognition. The appendix formalizes this as an optimization problem that enforces monotonic weighting across levels. For this article's purposes, the exact optimization matters less than its intent: the score is designed to reward the full hierarchy, not just cheap wins on easy observation questions.
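To make the idea of monotonic weighting concrete, here is a minimal sketch. It is not the paper's optimization: the function name `overall_score`, the weight vectors, and the per-level scores are all assumptions chosen for illustration. It only shows why a perception-heavy model looks worse once higher levels carry more weight.

```python
from typing import Dict, List

LEVELS = ["L1", "L2", "L3", "L4", "L5"]


def overall_score(level_scores: Dict[str, float], weights: List[float]) -> float:
    """Weighted mean of per-level scores, requiring non-decreasing weights.

    Illustrative only: the paper derives its weights from an optimization
    problem; here they are simply passed in by the caller.
    """
    if len(weights) != len(LEVELS):
        raise ValueError("need exactly one weight per level")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Monotonic constraint: a higher level never weighs less than a lower one.
    if any(later < earlier for earlier, later in zip(weights, weights[1:])):
        raise ValueError("weights must not decrease from L1 to L5")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(w * level_scores[lv] for lv, w in zip(LEVELS, weights))
    return weighted / total


# Hypothetical model that is strong on perception and weak on planning.
perception_heavy = {"L1": 80.0, "L2": 70.0, "L3": 40.0, "L4": 30.0, "L5": 20.0}

uniform = overall_score(perception_heavy, [1, 1, 1, 1, 1])
rising = overall_score(perception_heavy, [1, 2, 3, 4, 5])

print(f"uniform weights: {uniform:.2f}")
print(f"rising weights:  {rising:.2f}")
```

With these invented numbers the plain average is 48.00, while the rising weights pull the score down to about 37.33. The model has not changed; only the scoring stopped rewarding it for easy wins.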

The main evidence: models often see before they understand

The main experimental evidence comes from benchmarking a set of proprietary and open-source MLLMs on SpatialBench. The paper evaluates models including Gemini-2.5-pro, GPT-4o-mini, Claude-sonnet-4.5, GPT-5-chat-latest, Qwen variants, GLM variants, ERNIE, and spatial-focused open-source models such as Cambrian-S-3B and VST-3B-RL.

The top overall performer is Gemini-2.5-pro, with an overall score of 75.79. The strongest open-source models in the reported table include Cambrian-S-3B and VST-3B-RL, both scoring in the high 50s overall. Several large general MLLMs score much lower, with GPT-5-chat-latest reported at 22.45 under the default setting.

That table is interesting, but the ranking is not the article. The mechanism is.

The paper finds that observation and topology are generally easier than higher-level reasoning. Models can often extract visual evidence and handle basic relational queries, but struggle to convert perception into robust symbolic rules, infer causal consequences, and generate coherent spatial plans. This is the cascade: a small weakness in the lower levels becomes a larger failure when the task demands abstraction, continuity, or action.

A simplified reading of the results looks like this:

Evidence type	Likely purpose in the paper	What it supports	What it does not prove
Full model benchmark	Main evidence	Current MLLMs vary widely, and high scores on perception do not guarantee high-level spatial cognition	That one ranked model is universally best for every deployment context
Five-level score breakdown	Main diagnostic evidence	Failure patterns can be localized by cognitive stage	That the five levels are the only possible taxonomy
One-shot evaluation	Prompting / sensitivity test	Some models improve sharply with minimal examples; others degrade	That prompting solves spatial reasoning in production
Human benchmark	Comparison with prior / natural intelligence baseline	Humans perform far better on high-level symbolic, causal, and planning tasks	That all human-like spatial reasoning can be reduced to the benchmark
Case studies and appendix failures	Mechanism illustration	Models confuse continuity and reference frames even when object recognition works	That every failure in deployment will look exactly like these examples
Overall score formulation	Implementation and metric design detail	The benchmark tries to weight higher cognition rather than raw easy-task accuracy	That the metric is the final word on spatial intelligence

This distinction matters because the paper is easy to misread. It is not simply saying “models are bad at spatial reasoning.” That would be too crude and, frankly, not worth a whole article. The better interpretation is: MLLMs often possess fragments of spatial competence, but those fragments are not yet assembled into a stable cognitive map.

The failure mechanism: perception does not automatically become a map

A human navigating a space does not store every visible object with equal importance. We filter. We select. We bind objects to directions, routes, goals, and consequences. If a car turns right, we do not mentally narrate every parked vehicle unless it matters to the turn. We construct a task-relevant spatial model.

SpatialBench’s examples suggest that MLLMs often do something different. They describe the scene exhaustively, but without stable spatial intent. The paper compares human and model reasoning in a parking-lot case. The human focuses on the crucial directional cue: the turning path of the white Volvo. The model, by contrast, provides a broad description of vehicles and areas but misses the key relation needed to answer the question.

That is not “lack of vision.” It is lack of disciplined abstraction.

The same pattern appears in the appendix case study on causal reasoning. A stronger model can identify the Volvo, reconstruct part of the scene, and mention relevant landmarks. But when the camera performs a U-turn, the model loses continuity. It implicitly treats the camera’s motion as if it still corresponds to the hypothetical trajectory of the car. A weaker model fails even earlier, confusing the camera’s viewing direction with the car’s movement direction.

This is the exact kind of failure that should make embodied AI product teams mildly allergic to demo videos. A demo can show recognition. A deployment requires continuity.

The paper’s egocentric reasoning breakdown is even more revealing. In one indoor example, the model correctly identifies the projector, whiteboard, and AC control panel. The failure occurs at the final spatial transformation: it cannot reliably convert the room layout into the agent’s egocentric frame. In another outdoor example, the model identifies objects but misinterprets the robot’s orientation after moving onto the road, choosing the wrong turning direction.

The lesson is precise: object localization can be correct while frame-of-reference reasoning is wrong. That is a dangerous combination because the model’s answer may sound visually grounded while being geometrically invalid.
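The frame-of-reference problem is easy to state in code. The sketch below is a hypothetical 2D illustration, not anything from the paper: the function names `to_egocentric` and `side_of`, the coordinates, and the heading convention (radians, counterclockwise from the world +x axis) are all assumptions. It shows that an object whose world position never changes still flips from left to right when the agent makes a U-turn.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def to_egocentric(agent_xy: Point, heading_rad: float, target_xy: Point) -> Tuple[float, float]:
    """Express a world-frame target in the agent's frame as (forward, left)."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    # Rotate the world-frame offset by -heading to align it with the agent.
    forward = dx * math.cos(heading_rad) + dy * math.sin(heading_rad)
    left = -dx * math.sin(heading_rad) + dy * math.cos(heading_rad)
    return forward, left


def side_of(agent_xy: Point, heading_rad: float, target_xy: Point, tol: float = 1e-9) -> str:
    """Return which side of the agent the target is on."""
    _, left = to_egocentric(agent_xy, heading_rad, target_xy)
    if left > tol:
        return "left"
    if left < -tol:
        return "right"
    return "straight ahead or behind"


# Hypothetical scene: the exit sits at a fixed world position.
agent = (0.0, 0.0)
exit_sign = (5.0, 2.0)

before_turn = side_of(agent, 0.0, exit_sign)      # agent faces +x
after_u_turn = side_of(agent, math.pi, exit_sign)  # agent faces -x

print(before_turn, after_u_turn)
```

The object localization is identical in both calls. Only the heading differs, and the answer flips from "left" to "right". A model that recognizes the exit sign but silently keeps the old heading would give a fluent, visually grounded, wrong instruction.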

The one-shot results are useful, but not a rescue story

The paper also conducts a one-shot evaluation. For each task, the model receives a single annotated example with a question-answer pair, reasoning explanation, and key frames before answering a test question from the same task.

The results are mixed and therefore more interesting than a clean “prompting helps” story.

Gemini-2.5-pro declines from 75.79 to 68.12 in the reported one-shot setting. GPT-5-chat-latest rises sharply from 22.45 to 61.08. Qwen3-VL-235B-A22B-Instruct improves from 37.79 to 63.20. VST-3B-RL and Cambrian-S-3B decline under the same setting.
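The size of those swings is easier to compare side by side. The short snippet below only does the subtraction on the three score pairs quoted above; the dictionary layout and the helper name `one_shot_delta` are conveniences of this sketch, and no other models are included because their exact figures are not quoted here.

```python
from typing import Dict, Tuple

# (default score, one-shot score) as quoted in the paragraph above.
reported: Dict[str, Tuple[float, float]] = {
    "Gemini-2.5-pro": (75.79, 68.12),
    "GPT-5-chat-latest": (22.45, 61.08),
    "Qwen3-VL-235B-A22B-Instruct": (37.79, 63.20),
}


def one_shot_delta(default: float, one_shot: float) -> float:
    """Change in overall score when a single worked example is added."""
    return round(one_shot - default, 2)


deltas = {name: one_shot_delta(*scores) for name, scores in reported.items()}

# Print from the largest drop to the largest gain.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: {delta:+.2f}")
```

This prints a drop of 7.67 points for Gemini-2.5-pro, a gain of 25.41 for Qwen3-VL-235B-A22B-Instruct, and a gain of 38.63 for GPT-5-chat-latest. One added example moves the weakest default performer by more than the gap that separates most models on the default table.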

So what is this test doing? It is best read as a sensitivity test of in-context guidance, not as main evidence that prompts solve spatial cognition. Some models appear to benefit strongly from explicit examples, especially where linguistic reasoning can scaffold the task. Others may already have a better internal strategy or may be disrupted by example formatting.

For business users, this is useful but not comforting. Prompt templates can improve spatial reasoning on some task classes, but they are not a substitute for stable coordinate frames, scene memory, or 3D representations. Prompting can help a model remember how to answer. It does not guarantee the model knows where it is.

The human benchmark exposes what the models are missing

The paper reports a human benchmark with 33 participants. Humans reach an overall score of 96.40, with essentially perfect performance on symbolic reasoning, causality, and planning. The lower-level observation category is less perfect, especially on metric-style size and distance estimation, which is not surprising. Humans are not LiDAR units with coffee habits.

This detail matters. The human advantage is not that people estimate every distance with machine precision. The advantage is that people preserve goal-directed structure. They know which relation matters, which direction is agent-relative, and which path remains feasible after movement.

That is why the paper’s human comparison should not be interpreted as “humans are better at all visual tasks.” The sharper point is that humans are better at using imperfect perception to support coherent spatial reasoning. MLLMs may sometimes see more surface detail, yet reason less usefully.

In business terms, the difference is between a visual reporting system and a spatial decision system. The first says, “I see a corridor, a sign, and a blocked passage.” The second says, “The blocked passage invalidates the planned route, the sign indicates an alternate exit, and the agent should turn right after the doorway.” Many AI demos sell the first while implying the second. SpatialBench is rude enough to separate them.

Why this matters for physical and semi-physical automation

The paper’s most practical value is not model ranking. It is diagnostic design.

For robotics, warehouse automation, autonomous inspection, AR guidance, smart mobility, and physical digital twins, the important question is not “Can the model describe the scene?” The question is: which layer of spatial cognition does the application require, and has that layer been tested?

A warehouse inventory assistant may need only observation and basic topology: identify shelves, count visible boxes, detect whether an aisle is blocked. A mobile picking robot needs more: route planning, affordance reasoning, and causal prediction. An AR maintenance assistant may need symbolic reasoning: interpreting arrows, labels, diagrams, tool positions, and safe motion constraints. An autonomous inspection drone needs continuity across moving viewpoints. A robot navigating around humans needs causal reasoning about movement, not just a transcript of the current frame.

SpatialBench gives product teams a useful checklist:

Deployment question	SpatialBench lens	Practical implication
Does the system only describe what is visible?	L1 Observation	MLLMs may be useful, but metric precision still needs sensors or calibration
Does it compare object positions or directions?	L2 Topology and relation	Test relative direction and sequence under viewpoint shifts
Does it interpret signs, arrows, affordances, or landmarks?	L3 Symbolic reasoning	Do not rely on visual captioning alone; test rule transfer
Does it predict what happens after movement?	L4 Causality	Add dynamic scene tracking and explicit state updates
Does it generate routes or action sequences?	L5 Planning	Require validation through maps, planners, simulators, or closed-loop feedback

This is where the business relevance becomes concrete. SpatialBench does not tell a logistics company whether a specific robot will fail in a specific aisle. It tells the company which kind of failure to look for before the robot is allowed near inventory, equipment, or human ankles.

The business value is cheaper diagnosis, not instant autonomy

The temptation is to treat every new benchmark as a buying guide. That would be convenient. It would also be lazy, which is a known industry feature.

SpatialBench is more useful as a capability audit. It helps teams decompose spatial intelligence into testable layers. Instead of asking vendors whether their model “understands space,” buyers can ask:

Can it maintain agent-centric orientation after camera movement?
Can it distinguish camera viewpoint from robot heading?
Can it convert landmarks into route constraints?
Can it preserve scene continuity after a turn?
Can it explain not just what it sees, but what changes after an action?
Does prompting improve the task, or does it destabilize performance?
Which failures appear only at L3–L5 after L1–L2 succeeds?
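The last question in that list can be turned into a simple audit over a team's own test logs. The sketch below assumes a hypothetical log format, one pass/fail flag per level per scenario; the scenario names, the results, and the function name `hidden_failures` are invented for illustration and do not come from SpatialBench.

```python
from typing import Dict, List

LOWER_LEVELS = ("L1", "L2")
HIGHER_LEVELS = ("L3", "L4", "L5")


def hidden_failures(results: Dict[str, Dict[str, bool]]) -> Dict[str, List[str]]:
    """Find scenarios where perception passes but higher-level reasoning fails.

    Each scenario must carry a pass/fail flag for all five levels.
    Returns a mapping from scenario name to the higher levels that failed.
    """
    flagged: Dict[str, List[str]] = {}
    for scenario, passed in results.items():
        if not all(passed[level] for level in LOWER_LEVELS):
            continue  # an obvious perception failure, not a hidden one
        failed = [level for level in HIGHER_LEVELS if not passed[level]]
        if failed:
            flagged[scenario] = failed
    return flagged


# Hypothetical audit log, not benchmark data.
audit_log = {
    "parking_lot_u_turn": {"L1": True, "L2": True, "L3": True, "L4": False, "L5": False},
    "office_projector": {"L1": True, "L2": True, "L3": False, "L4": True, "L5": True},
    "dark_tunnel": {"L1": False, "L2": False, "L3": False, "L4": False, "L5": False},
    "open_corridor": {"L1": True, "L2": True, "L3": True, "L4": True, "L5": True},
}

flagged = hidden_failures(audit_log)
for scenario, levels in flagged.items():
    print(f"{scenario}: fails at {', '.join(levels)}")
```

Here the dark tunnel is excluded on purpose: a model that cannot see is a problem everyone notices. The two flagged scenarios are the ones a demo video would pass.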

That last question is the most important. In many deployments, the dangerous failure is not the obvious one. If a model cannot identify a forklift, nobody should deploy it for forklift navigation. That is easy. The harder case is when the model identifies the forklift, describes its location, and then chooses a route as if the forklift’s motion, orientation, or blocking effect did not matter.

This is why explicit spatial modules still matter. Scene graphs, 3D geometric priors, map-based planners, memory over viewpoints, robotics simulators, and closed-loop feedback systems are not old-fashioned accessories. They are the scaffolding that language-vision systems currently lack when the task moves from description to decision.

The paper itself points toward this direction: progressive cognitive tasks, explicit spatial representations, curriculum learning, agent-based interactive environments, and feedback mechanisms where planning errors refine perceptual modules. That is not a small patch. It is a product architecture.

What the paper does not settle

SpatialBench should not be overclaimed. It is a benchmark built from 117 videos, not a universal certification regime for embodied AI. Its annotation process combines human design, model-assisted answer generation, LiDAR-derived metric answers, and verification. That is thoughtful, but it still reflects the task definitions, environments, and evaluation choices of the benchmark.

The overall score is also a designed metric. Its complexity-aware weighting is reasonable for the authors’ purpose, but a production system may care about different trade-offs. A factory robot might treat one planning error as more important than ten object-counting errors. An AR education app might tolerate occasional route ambiguity but not wrong symbolic instructions. The right weight is deployment-specific.

The one-shot evaluation should also be interpreted carefully. Prompting can meaningfully improve some models, but the mixed results show that examples are not a universal upgrade. For some models, extra guidance may conflict with their default reasoning strategy. For others, it may unlock latent linguistic reasoning without fixing spatial representation. Useful, yes. Magic, no. We already tried magic; it was called “just make the prompt longer.”

Finally, SpatialBench focuses on benchmarked QA performance, not closed-loop embodied execution. A model can answer a question correctly and still fail under real-time latency, sensor noise, actuator constraints, or adversarial environmental change. That boundary does not weaken the paper. It tells us where the next evaluation layer begins.

The uncomfortable takeaway: MLLMs need spatial discipline

SpatialBench is valuable because it cuts through a common misunderstanding: multimodal perception is not spatial cognition. A model can recognize the world without organizing it into a stable, actionable structure.

The five-level hierarchy gives us a better vocabulary. Observation is not topology. Topology is not symbolic reasoning. Symbolic reasoning is not causality. Causality is not planning. These levels depend on each other, but they are not interchangeable. When one layer is weak, higher layers inherit the weakness and often amplify it.

For businesses, the implication is pragmatic. Use MLLMs where visual-language understanding is enough. Be much more careful where the task requires coordinate frames, scene continuity, causal prediction, or route planning. And when vendors claim their model can “understand the physical world,” ask which level of the stack they mean. If the answer is a demo video, ask again.

SpatialBench does not tell us that embodied AI is impossible. It tells us that embodiment is not achieved by attaching a camera to a language model and hoping the geometry sorts itself out. The model may see the parking lot. It may even describe the Volvo. But until it can preserve the right frame, track the right continuity, and plan the right action, seeing is only the opening act.

Planning is still the expensive part.

Cognaptus: Automate the Present, Incubate the Future.

¹ Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang, “SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition,” arXiv:2511.21471v4, 2026.
