Seeing Is Believing—Planning Is Not: What SpatialBench Reveals About MLLMs

Opening — Why This Matters Now

Spatial reasoning is quietly becoming the new battleground in AI. As multimodal LLMs begin taking their first steps toward embodied intelligence—whether in robotics, autonomous navigation, or AR/VR agents—we’re discovering a stubborn truth: recognizing objects is easy; understanding space is not. SpatialBench, a new benchmark introduced by Xu et al., enters this debate with the subtlety of a cold audit: it measures not accuracy on toy tasks, but the full hierarchy of spatial cognition.

And the results are…scathing.

In essence, today’s MLLMs can see, but they cannot think in space—at least not reliably. That gap has serious implications for automation, safety, and any business looking to deploy AI in physical or semi‑physical environments.

Background — Context and Prior Art

Prior benchmarks often fixate on narrow skillsets: object recognition, grounding, or single-hop reasoning. Useful, but incomplete. Spatial cognition, as the authors point out, is inherently hierarchical. Humans don’t just see; we interpret, infer, and plan through a layered cognitive ladder. Until now, no benchmark has captured this progression.

SpatialBench introduces a five-level taxonomy—Observation → Topology → Symbolic Reasoning → Causality → Planning—that aligns closely with how humans process space. It is grounded in cognitive map theory, not the usual computer-vision grab bag.

This distinction matters. A benchmark built around cognitive stages feels less like a leaderboard and more like a diagnostic.

Analysis — What the Paper Actually Does

SpatialBench comprises 50 egocentric videos and 1,347 QA pairs spanning 15 task types. Every question, from object size estimation to multi-hop spatial inference, is aligned to one of the five cognitive levels, and the authors introduce a principled weighting scheme that rolls per-level accuracy up into a complexity-aware overall score.
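The paper's exact weights aren't reproduced here, but the idea behind a complexity-aware score is easy to sketch. The snippet below is a minimal illustration, assuming hypothetical per-level weights that grow with cognitive depth; the weight values and function names are mine, not the benchmark's.

```python
# Minimal sketch of a complexity-aware overall score.
# The weights are illustrative assumptions, not the paper's actual values:
# deeper cognitive levels count more toward the aggregate.

LEVEL_WEIGHTS = {
    "L1_observation": 1.0,
    "L2_topology": 1.5,
    "L3_symbolic": 2.0,
    "L4_causality": 2.5,
    "L5_planning": 3.0,
}

def overall_score(per_level_accuracy: dict[str, float]) -> float:
    """Weighted average of per-level accuracies (each in [0, 1])."""
    total_weight = sum(LEVEL_WEIGHTS[level] for level in per_level_accuracy)
    weighted_sum = sum(
        LEVEL_WEIGHTS[level] * acc for level, acc in per_level_accuracy.items()
    )
    return weighted_sum / total_weight

# Example: the rough proprietary-model numbers from the table below.
print(overall_score({
    "L1_observation": 0.76,
    "L2_topology": 0.73,
    "L3_symbolic": 0.91,
    "L4_causality": 0.86,
    "L5_planning": 0.74,
}))
```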

The dataset is not synthetic. It uses synchronized RGB and LiDAR capture, meaning the ground truth for distances and dimensions isn’t guesswork—it’s geometry.

The evaluation results (excerpted below) tell a very consistent story:

  • L1–L2 (Observation & Topology): Most models cope. They can recognize objects and describe basic spatial relations.
  • L3–L5 (Symbolic → Causal → Planning): Performance deteriorates sharply. Even state-of-the-art models show brittle reasoning, and open-source models collapse outright.
  • Human performance? At or near ceiling on every level, a humbling comparison for every model tested.

Benchmark Performance Snapshot

| Cognitive Level | Human | Best Proprietary (Gemini 2.5 Pro) | Best Open Source (Qwen3-VL-235B) |
|---|---|---|---|
| L1 Observation | 96% | ~76% | ~44% |
| L2 Topology | 100% | ~73% | ~38% |
| L3 Symbolic | 100% | ~91% | ~31% |
| L4 Causality | 100% | ~86% | ~33% |
| L5 Planning | 100% | ~74% | ~33% |

Source: SpatialBench dataset summary.

This pattern demonstrates a widening gap: MLLMs extract perceptual facts but often fail to transform those facts into structured internal representations.

Findings — How Models Fail (and Why)

1. MLLMs lack selective, goal-driven abstraction.

Humans ignore irrelevant details. MLLMs recite them.

2. Scene continuity breaks easily.

A simple camera U-turn is enough to confuse even top-tier models. They lose track of landmarks and directional invariance.

3. Perspective-taking is a catastrophe zone.

The paper’s failure cases show models confusing agent-centric vs. camera-centric vs. scene-centric coordinates—sometimes producing perfectly articulated reasoning that is geometrically wrong.
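To see why frame confusion produces geometrically wrong answers, consider a minimal Python sketch (not from the paper; the scene and numbers are invented) that converts a camera-centric observation into agent- or scene-centric coordinates:

```python
import math

def camera_to_agent(point_cam, cam_pos, cam_yaw):
    """Convert a 2D point from camera-centric to agent/scene-centric coordinates.

    point_cam: (x, y) in the camera frame (x = right, y = forward).
    cam_pos:   camera position in the agent/scene frame.
    cam_yaw:   camera heading in radians (0 = facing the scene's +y axis).
    """
    x, y = point_cam
    # Rotate by the camera's heading, then translate by its position.
    x_w = x * math.cos(cam_yaw) - y * math.sin(cam_yaw) + cam_pos[0]
    y_w = x * math.sin(cam_yaw) + y * math.cos(cam_yaw) + cam_pos[1]
    return (x_w, y_w)

# A cup 1 m to the camera's right and 2 m ahead...
cup_cam = (1.0, 2.0)
# ...seen by a camera that sits at (4, 0) in the room and is rotated 90 degrees.
print(camera_to_agent(cup_cam, cam_pos=(4.0, 0.0), cam_yaw=math.pi / 2))
# -> roughly (2.0, 1.0): "right of the camera" is not "right of the room".
```

A model that reasons fluently in one frame but silently answers in another produces exactly the articulate-but-wrong output the failure cases document.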

4. Causal inference remains primitive.

Predicting outcomes (“If the car accelerates, what leaves the field of view?”) exposes the limits of MLLMs’ implicit physics.
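The geometry behind such questions is not exotic. Here is a hedged sketch, assuming a simple pinhole-style horizontal field of view and straight-line forward motion; the scene, object names, and numbers are illustrative only:

```python
import math

def leaves_fov(objects, forward_step, fov_deg=90.0):
    """Return names of objects that exit the horizontal field of view
    after the camera moves `forward_step` metres straight ahead.

    objects: dict of name -> (lateral_offset_m, forward_distance_m).
    """
    half_fov = math.radians(fov_deg) / 2
    gone = []
    for name, (x, z) in objects.items():
        z_new = z - forward_step          # camera advances, object gets closer
        if z_new <= 0:                    # now beside or behind the camera
            gone.append(name)
        elif math.atan2(abs(x), z_new) > half_fov:
            gone.append(name)             # angular offset exceeds the FOV
    return gone

# If the car moves 8 m forward, which objects drop out of view?
scene = {"sign": (1.0, 5.0), "tree": (6.0, 12.0), "bridge": (0.0, 40.0)}
print(leaves_fov(scene, forward_step=8.0))  # -> ['sign', 'tree']
```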

5. Planning is the final boss—and remains undefeated.

Even strong models hallucinate routes or ignore feasible paths because they never truly construct a cognitive map.
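For contrast, here is what an explicit cognitive map can look like when it is built deliberately: a toy Python sketch (not the paper's method) that stores landmarks as a graph and plans a route with breadth-first search. The layout is invented, but the point stands: once the map exists, planning is trivial; without it, models guess.

```python
from collections import deque

# A toy "cognitive map": landmarks as nodes, traversable connections as edges.
# The landmark names and layout are purely illustrative.
cognitive_map = {
    "entrance": ["hallway"],
    "hallway": ["entrance", "kitchen", "stairs"],
    "kitchen": ["hallway", "balcony"],
    "stairs": ["hallway", "exit"],
    "balcony": ["kitchen"],
    "exit": ["stairs"],
}

def plan_route(graph, start, goal):
    """Shortest landmark-to-landmark route via breadth-first search."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no feasible path

print(plan_route(cognitive_map, "kitchen", "exit"))
# -> ['kitchen', 'hallway', 'stairs', 'exit']
```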

Visualization — The Spatial Cognition Ladder

| Level | Description | Example Task | Difficulty for MLLMs |
|---|---|---|---|
| L1 Observation | Identify objects, sizes, distances | Count chairs | ★☆☆☆☆ |
| L2 Topology | Relational structure | Which object is closest? | ★★☆☆☆ |
| L3 Symbolic Reasoning | Interpret visual symbols, multi-hop steps | Follow directional arrows | ★★★★☆ |
| L4 Causality | Predict effects of movement or actions | What happens if the car turns? | ★★★★★ |
| L5 Planning | End-to-end navigation and sequencing | Route to exit | ★★★★★ |

SpatialBench’s structured format makes the gap painfully visible: models excel at seeing pixels, fail at seeing possibilities.

Implications — Why This Matters for Business & AI Deployment

1. Robotics & warehouse automation

If your robot vacuums the floor flawlessly but crashes into the furniture the moment it is rearranged, SpatialBench explains why.

2. Autonomous systems

Autonomy needs stable cognitive maps, not textual approximations of scenes.

3. Compliance, assurance, and risk management

Deploying spatially naïve agents in environments with physical consequences is an open invitation for liability.

4. Agentic AI workflows

As businesses explore agentic automation, SpatialBench's message is sharp: do not assume an agent can handle navigation, causal prediction, or multi-step spatial reasoning on its own. Augment those capabilities with dedicated tooling, or avoid tasks that require them.

5. AI product strategy

The future of multimodal AI will be constrained less by vision and more by geometry, inference, and planning. Investment in spatial modules—3D scene reconstruction, explicit maps, learned coordinate transforms—will define the next leaders.

Conclusion — The Real Gap Isn’t Visual; It’s Cognitive

SpatialBench forces an uncomfortable but necessary reflection: multimodal LLMs remain perceptual savants and cognitive novices. They over-attend to irrelevant details, lose track of spatial continuity, and fail at even modest planning tasks.

For businesses, the takeaway is pragmatic: treat MLLMs as high-bandwidth perception engines, not spatial thinkers. And if you need the latter, supplement with purpose-built geometry, 3D mapping, or symbolic planners.

Smart automation isn’t just about seeing the world. It’s about understanding it—and we’re not there yet.

Cognaptus: Automate the Present, Incubate the Future.