Opening — Why This Matters Now
Multimodal large language models (MLLMs) can caption images, describe scenes, and even explain memes with unsettling confidence. Yet ask them a textbook-level geometry problem involving orthographic projections or cube folding, and their composure dissolves.
According to the newly proposed MathSpatial framework, humans solve structured spatial reasoning tasks with 96%+ accuracy, while most leading MLLMs struggle below 60%. Even frontier systems plateau far below human baselines.
This is not a cosmetic flaw.
Spatial reasoning underpins robotics, CAD automation, industrial inspection, embodied agents, and autonomous navigation. If models cannot reliably reason about views, projections, and geometric constraints, their promise in physical-world deployment remains… aspirational.
The MathSpatial paper does something rare in AI benchmarking: it isolates the problem rather than amplifying it with visual noise. The result is less flattering — and far more informative.
Background — The Illusion of Spatial Competence
Prior spatial benchmarks often mix perception and reasoning. Models fail, but we cannot tell whether they misread pixels or misunderstood geometry.
MathSpatial addresses three structural weaknesses in existing evaluation:
| Challenge | Typical Benchmark Issue | Consequence |
|---|---|---|
| Perceptual Confounds | Complex scenes with texture/noise | Errors before reasoning even begins |
| Data Scarcity | Small, fragmented datasets | No systematic improvement loop |
| Black-Box Reasoning | Free-form CoT explanations | No interpretability or error diagnosis |
The authors separate perception from reasoning by using clean educational geometry problems — minimal clutter, precise diagrams, objective answers. If a model fails here, it is not because of blurry pixels.
That is uncomfortable. And useful.
The Framework — Evaluation, Data, and Structure
MathSpatial is built around three pillars:
- MathSpatial-Bench (2,000 problems) — Diagnostic benchmark
- MathSpatial-Corpus (8,000 problems) — Large-scale supervised training set
- MathSpatial-SRT (Structured Reasoning Traces) — A reasoning decomposition framework
Together, they create a closed training–evaluation loop.
1. Benchmark Design: Reasoning Without Excuses
The benchmark spans three reasoning levels:
| Category | Focus | Example Tasks |
|---|---|---|
| Holistic Recognition | Multi-view alignment | 3-view matching, cube counting |
| Generative Inference | View generation & constraints | Missing view completion |
| Abstract Deduction | Property calculation | Surface area, feasibility checks |
Distribution across the 2,000 problems:
| Category | # Problems | % Share |
|---|---|---|
| Holistic Recognition | 518 | 25.9% |
| Generative Inference | 636 | 31.8% |
| Abstract Deduction | 846 | 42.3% |
Notably, advanced geometric property calculation (GPC) tasks dominate the hardest segment.
Humans exceed 95% accuracy across all categories.
Models do not.
2. The Core Insight — Spatial Reasoning is Composable
Rather than treating geometry as a black-box mapping from image → answer, the authors decompose reasoning into three atomic operations:
- Correlate (corr) — Align entities across views
- Constrain (cons) — Apply geometric rules (projection, visibility, alignment)
- Infer (infer) — Deduce final attributes or numeric answers
They argue — and formally prove within their task space — that:
$$ \{\text{corr}, \text{cons}, \text{infer}\}^{*} $$
is sufficient to solve all benchmark tasks, and removing any primitive strictly reduces expressive power.
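To make the claim concrete, here is a minimal Python sketch of that grammar (ours, not the authors' implementation; the example trace is hypothetical):

```python
from dataclasses import dataclass
from enum import Enum

class Primitive(Enum):
    """The three atomic operations behind MathSpatial-SRT."""
    CORR = "correlate"   # align entities across views
    CONS = "constrain"   # apply a geometric rule (projection, visibility, alignment)
    INFER = "infer"      # deduce final attributes or numeric answers

@dataclass
class Step:
    op: Primitive
    note: str  # human-readable justification for the step

def is_valid_trace(trace: list[Step]) -> bool:
    """Any finite sequence over {corr, cons, infer} is admissible:
    the Kleene star in the sufficiency claim above."""
    return all(isinstance(step.op, Primitive) for step in trace)

# Hypothetical trace for a missing-view completion task:
trace = [
    Step(Primitive.CORR, "match front-view edges to the top view"),
    Step(Primitive.CONS, "hidden edges must project as dashed lines"),
    Step(Primitive.INFER, "the side view is a rectangle with a notch"),
]
assert is_valid_trace(trace)
```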
In other words, spatial reasoning is not mystical. It is structured.
And current models rarely follow structure.
Findings — The Performance Reality Check
Overall Accuracy (MathSpatial-Bench)
| Model Type | Example Model | Accuracy |
|---|---|---|
| Human | Students (closed-book) | 96.3% |
| Best Closed-Source | GPT-5 | 58.5% |
| Strong Commercial | Gemini-2.5-Flash | 48.5% |
| Open-Source Baseline | Qwen2.5-VL-7B | 17.8% |
| Fine-Tuned (SRT) | MathSpatial-Qwen2.5-VL-7B | 22.1% |
Even GPT-5, at 58.5%, reaches only about 60% of the human score.
The gap is structural, not incremental.
Efficiency Gains Through Structured Traces
Fine-tuning with SRT improves both accuracy and efficiency:
| Model | Accuracy | Avg Tokens |
|---|---|---|
| Qwen2.5-VL-7B | 17.8% | 465 |
| + Free-form CoT | ~19–20% | ~450 |
| + SRT (Structured) | 22.1% | 351 |
Token usage drops ~25%.
Structured reasoning not only improves correctness — it compresses cognitive waste.
In enterprise deployment, fewer tokens mean lower cost, lower latency, and more predictable behavior.
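As a back-of-the-envelope illustration (the per-token price below is our assumption, not a figure from the paper):

```python
# Token counts from the table above; the price is illustrative, not from the paper.
baseline_tokens = 465    # Qwen2.5-VL-7B, free-form output
srt_tokens = 351         # after SRT fine-tuning
price_per_1k = 0.002     # hypothetical $ per 1K output tokens

reduction = (baseline_tokens - srt_tokens) / baseline_tokens
print(f"Token reduction: {reduction:.1%}")  # 24.5%

# At one million queries per month, the saving is mechanical:
monthly = (baseline_tokens - srt_tokens) / 1000 * price_per_1k * 1_000_000
print(f"Monthly output-token savings: ${monthly:,.0f}")  # $228
```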
That matters.
Error Anatomy — Where Models Break
Error distribution across all models:
| Failure Mode | Share of Errors |
|---|---|
| Reasoning Gaps | 34.4% |
| Geometry Violations | 33.0% |
| Projection Errors | 12.6% |
| Feature Omission | 10.5% |
| Scale Errors | 7.2% |
| Deduction Failures | 2.3% |
Two dominant weaknesses emerge:
- Inability to maintain multi-step logical coherence
- Failure to enforce geometric consistency rules
This is not a vision problem. It is a structured reasoning problem.
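A toy example of the kind of rule that goes unenforced (our illustration, not the paper's): in an orthographic three-view drawing of an axis-aligned solid, the front and side views must agree on height, the front and top views on width, and the side and top views on depth.

```python
def views_consistent(front: tuple[float, float],
                     side: tuple[float, float],
                     top: tuple[float, float]) -> bool:
    """Each view is the (width, height) of its bounding box.
    For an axis-aligned solid: front=(W,H), side=(D,H), top=(W,D)."""
    (fw, fh), (sw, sh), (tw, th) = front, side, top
    return fh == sh and fw == tw and sw == th

# Consistent three-view set for a 4 x 3 x 2 box:
print(views_consistent(front=(4, 3), side=(2, 3), top=(4, 2)))  # True
# A side view taller than the front view breaks the rule
# before any deduction even starts:
print(views_consistent(front=(4, 3), side=(2, 5), top=(4, 2)))  # False
```

Models that skip such checks produce answers that are locally plausible and globally impossible.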
Generalization — Does It Transfer?
One might suspect overfitting to clean diagrams. The authors tested on perception-heavy external benchmarks (SOLIDGEO, GeoEval, 3DSRBench).
Fine-tuned SRT models show:
- +1.3% to +3.4% accuracy improvements
- 7–21% reduction in token usage
The structured primitives appear to transfer beyond sanitized educational diagrams.
That suggests the decomposition captures something fundamental.
Business Implications — From Geometry to Robotics
Why should operators care?
Because spatial reasoning is foundational in:
- Robotic manipulation planning
- Warehouse automation
- AR/VR simulation
- Engineering design assistants
- Construction modeling
- Industrial inspection AI
If your multimodal system cannot reliably align projections and enforce constraints, it will hallucinate in 3D just as confidently as it does in text.
MathSpatial implies a strategic direction:
Instead of scaling models blindly, impose structure on reasoning.
For applied AI firms, this translates to:
- Domain-specific reasoning schemas
- Intermediate supervision rather than pure outcome optimization
- Trace-level validation pipelines (a sketch follows this list)
- Operational decomposition of cognitive tasks
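Here is a minimal sketch of what a trace-level validation pipeline could look like (ours; the step encoding and the example rule are assumptions, not the paper's interface):

```python
from typing import Callable, Optional

# A step is (operation, note); operations are "corr", "cons", or "infer".
Step = tuple[str, str]
Rule = Callable[[Step], Optional[str]]  # returns an error label, or None if the step passes

def validate_trace(trace: list[Step], rules: list[Rule]) -> list[str]:
    """Run every domain rule over every intermediate step and collect
    violations, rather than scoring only the final answer."""
    return [
        f"step {i} ({op}): {label}"
        for i, (op, note) in enumerate(trace)
        for rule in rules
        if (label := rule((op, note))) is not None
    ]

# Example rule: a constrain step must name the geometric rule it applies.
def constraint_names_rule(step: Step) -> Optional[str]:
    op, note = step
    if op == "cons" and "rule:" not in note:
        return "constraint step cites no geometric rule"
    return None

trace = [
    ("corr", "match front-view edges to the top view"),
    ("cons", "rule: hidden edges project as dashed lines"),
    ("infer", "the side view is a rectangle with a notch"),
]
print(validate_trace(trace, [constraint_names_rule]) or "trace passes all checks")
```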
In short: treat reasoning as architecture, not emergent magic.
What This Signals About Multimodal AI
Three broader signals emerge:
- Scaling alone does not guarantee spatial intelligence. Even GPT-5 remains far from human baselines.
- Interpretability improves efficiency. Structured traces reduce token bloat.
- Decomposition is a competitive advantage. Systems with explicit reasoning modules outperform black-box counterparts in reliability.
The paper quietly challenges a popular narrative: that multimodal LLMs are nearing general reasoning parity.
They are not — at least not in space.
Conclusion — Intelligence Needs Structure
MathSpatial does not claim to solve spatial reasoning.
It does something more valuable: it isolates the weakness, formalizes its structure, and proves that interpretability and efficiency can improve together.
If large multimodal models are to power embodied agents and physical automation systems, structured reasoning primitives may be less optional than previously assumed.
Models can describe a cube.
Understanding it is another matter.
Cognaptus: Automate the Present, Incubate the Future.