When Models Get Lost in Space: Why MLLMs Still Fail Geometry

Geometry looks clean.

A cube has edges. A projection has rules. A missing view should follow from the views already shown. This is not the messy world of occluded street scenes, motion blur, shadows, or a warehouse camera pointed at the wrong shelf. It is the kind of visual reasoning many students learn before they are trusted with anything more dangerous than a compass, a ruler, and mild boredom.

That is exactly why the new MathSpatial paper is awkward for multimodal large language models.¹ The authors evaluate leading MLLMs on textbook-style mathematical spatial reasoning tasks and find a gap large enough to be impolite: human students reach 96.3% accuracy, while the best tested model, GPT-5, reaches 58.5%. Most models remain far below that. Several open-source baselines cluster around the high teens or low twenties.

This matters because the benchmark is not asking models to interpret a cluttered photograph. It is asking them to reason over clean diagrams with minimal perceptual noise. If a model fails, the usual excuse — “the image was hard to see” — has mostly left the building.

The result is not just another benchmark leaderboard. We have enough of those to tile a small airport. The more useful point is diagnostic: MathSpatial suggests that many MLLMs can look spatially competent while still failing at the operations that physical-world AI systems need — aligning views, preserving geometric constraints, and carrying a multi-step reasoning chain without quietly dropping the plot.

The unpleasant number comes first: 58.5% versus 96.3%

The paper’s central evidence is simple enough to hurt.

MathSpatial-Bench contains 2,000 evaluation problems across three broad categories and 11 subtypes. The authors report that human students, tested under closed-book conditions, achieve 96.3% micro-averaged accuracy. GPT-5 leads the tested models at 58.5%. Gemini-2.5-Flash follows at 48.5%, Gemini-2.5-Pro at 44.9%, while GPT-4o, GPT-4.1, Claude variants, and several open-source models perform substantially lower.

The useful comparison is not “which logo wins this week.” It is the size and nature of the remaining gap.

Group	Representative result	What the number means	What it does not mean
Human students	96.3%	Clean geometry tasks are highly solvable by people under controlled conditions	Human performance on all spatial tasks is solved or effortless
Best tested closed-source model	GPT-5: 58.5%	Frontier MLLMs still fail many clean spatial reasoning cases	GPT-5 is weak generally (the result is specific to this benchmark)
Strong second-tier closed-source model	Gemini-2.5-Flash: 48.5%	Some models handle selected transformations better than others	High performance on some subtypes guarantees robust geometry
Open-source baseline range	roughly 15%–21% for several tested models	Base open-source MLLMs remain far from reliable on this benchmark	Open-source models cannot improve with targeted training

This is why the paper is better read evidence-first rather than dataset-first. A normal summary would begin with the dataset construction pipeline, then explain the benchmark, then eventually show results. That is polite. It is also the wrong cognitive order.

The reader’s default assumption is likely: if a model can read images and solve math problems, then clean geometry diagrams should be within reach. MathSpatial’s contribution is to puncture that assumption before we admire the dataset architecture.

The benchmark removes the easy excuse: perception noise

A common weakness of spatial reasoning benchmarks is that perception and reasoning get tangled together. Put a model in a complex 3D scene and ask a spatial question; if it fails, we do not know whether it misread the image, misunderstood the relation, ignored a constraint, or simply guessed with theatrical confidence.

MathSpatial tries to separate those failures.

The benchmark uses clean mathematical diagrams sourced from educational materials. These are not photorealistic kitchen scenes. They are formalized problems with objective answers: multiple-choice or numeric. The authors collect raw candidates from public educational repositories and textbooks, filter incomplete or non-spatial items, standardize images and text, deduplicate across splits, check geometric consistency, and verify solutions.

The curation pipeline is not just clerical. It supports the paper’s main interpretive claim. If the benchmark minimizes perceptual distractions, then model errors become more informative about spatial reasoning itself.

The authors began with 35,428 raw candidates, retained 21,673 after preliminary curation, reduced the pool to about 11,000 unique high-quality samples after standardization and deduplication, removed about 400 geometrically inconsistent items, and ended with 10,000 verified problems. These were split into:

MathSpatial-Bench: 2,000 problems for diagnostic evaluation.
MathSpatial-Corpus: 8,000 problems for training, with verified solutions and structured reasoning traces.
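The funnel is easier to judge as retention rates. The following is a small arithmetic sketch that uses only the counts quoted above; the stage names are informal labels of my own, and the third count is approximate in the source, so its ratio is approximate too.

```python
# Curation funnel counts as quoted above. Stage names are informal labels,
# and the third count is approximate ("about 11,000") in the source.
funnel = [
    ("raw candidates", 35_428),
    ("after preliminary curation", 21_673),
    ("after standardization and dedup (approx.)", 11_000),
    ("verified problems", 10_000),
]

raw = funnel[0][1]
# Share of the raw pool that survives to each stage, in percent.
retention = {name: round(100 * count / raw, 1) for name, count in funnel}

for name, count in funnel:
    print(f"{name:45s} {count:>7,d}  {retention[name]:5.1f}% of raw")

# Final split of the verified pool into benchmark and training corpus.
bench, corpus = 2_000, 8_000
bench_share = round(100 * bench / (bench + corpus), 1)
print(f"benchmark share of verified pool: {bench_share}%")
```

Read this way, fewer than three in ten raw candidates survive curation, and one in five verified problems is held out for evaluation.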

The benchmark is not perfect, but its design is disciplined. It does not say, “models fail in the wild.” It says something narrower and more damaging: models fail even when the spatial problem is cleaned up for them.

That is a better benchmark design than throwing models into visual chaos and then pretending the resulting confusion is a precise measurement.

What MathSpatial actually tests: recognition, generation, deduction

MathSpatial-Bench is organized into three categories: Holistic Recognition, Generative Inference, and Abstract Deduction. These are not just labels. They represent different levels of spatial burden.

Category	Problem count	Share of benchmark	What the model must do
Holistic Recognition	518	25.9%	Recognize or match spatial structures, often across views
Generative Inference	636	31.8%	Complete or transform views under geometric constraints
Abstract Deduction	846	42.3%	Infer properties, feasibility, or calculations from spatial rules
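Because the categories differ in size, a micro-averaged score (the kind the paper reports for its headline numbers) weights Abstract Deduction most heavily. Here is a minimal sketch of the difference between micro and macro averaging. The category counts come from the table above; the per-category correct counts are invented placeholders for an imaginary model, not results from the paper.

```python
from typing import Dict

# Problem counts per category, as listed in the table above.
counts = {
    "Holistic Recognition": 518,
    "Generative Inference": 636,
    "Abstract Deduction": 846,
}
total = sum(counts.values())
shares = {k: round(100 * n / total, 1) for k, n in counts.items()}

def micro_accuracy(correct: Dict[str, int]) -> float:
    """Pooled accuracy: every problem counts once, so large categories weigh more."""
    return 100 * sum(correct.values()) / total

def macro_accuracy(correct: Dict[str, int]) -> float:
    """Unweighted mean of the per-category accuracies."""
    return sum(100 * correct[k] / counts[k] for k in counts) / len(counts)

# HYPOTHETICAL correct counts for an imaginary model, not figures from the paper.
example_correct = {
    "Holistic Recognition": 400,
    "Generative Inference": 300,
    "Abstract Deduction": 200,
}

print(shares)
print(round(micro_accuracy(example_correct), 1))
print(round(macro_accuracy(example_correct), 1))
```

In the placeholder example the micro average sits below the macro average, because the imaginary model is weakest on the largest category. A model that is strong on recognition but weak on deduction gets less credit under micro averaging.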

This category design matters because “spatial reasoning” is often treated as one ability. It is not. A model may recognize a familiar arrangement yet fail to generate a missing view. It may handle a view-matching problem but collapse when asked to calculate a property from constraints.

The paper’s results follow that pattern. Holistic Recognition is relatively easier. Top models reach stronger scores on subtypes such as image-view identification and three-view matching. Generative Inference is mixed: missing-view completion is tractable for stronger models, while visual transformation remains brutal for many. Abstract Deduction is the hardest overall, especially geometric property calculation.

That last point deserves more attention than a leaderboard rank. Geometric property calculation (GPC) accounts for 24.1% of the benchmark and is a near-universal failure point. GPT-5 reaches 52.3% on GPC. Several other closed-source models score near zero. All base open-source models tested score at or below 9.3%, with multiple models at 0.0%.

This is not a small “more data will fix it eventually, please clap” result. It suggests that current MLLMs often lack reliable mechanisms for combining visual structure, formal constraints, and multi-step computation.

The error analysis says models lose rules, not just pixels

The paper’s error analysis is useful because it moves the discussion from “accuracy is low” to “what kind of thinking is breaking?”

The authors group errors into six failure modes:

Failure mode	Share of errors	Interpretation
Reasoning gaps	34.4%	The model’s chain is incomplete or internally inconsistent
Geometry violations	33.0%	The output breaks projection, visibility, or geometric rules
Projection errors	12.6%	The model misinterprets top, side, or front views
Feature errors	10.5%	The model omits parts, invents edges, or mishandles components
Scale errors	7.2%	The model fails to preserve relative sizes
Deduction failures	2.3%	The model cannot synthesize cues into a final conclusion

The top two categories — reasoning gaps and geometry violations — account for roughly two-thirds of observed errors. This is the heart of the paper.

If the main failures were feature errors or projection errors, we might treat MathSpatial as mainly a perception problem. Better visual encoders, more diagram pretraining, or higher resolution might carry the day. But the error distribution points elsewhere. Models often fail to sustain a valid reasoning process or enforce geometric rules across steps.

That is a different kind of weakness. It is closer to watching a model understand each sentence in a contract but still violate the contract logic by paragraph four. The local pieces look fine. The global constraint system quietly collapses.
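The arithmetic behind that reading is short. The sketch below uses the shares from the error table; the grouping of projection and feature errors as "perception-like" follows the argument above and is an interpretive assumption, not a category the paper defines.

```python
# Shares of observed errors by failure mode, as reported in the table above.
error_shares = {
    "reasoning gaps": 34.4,
    "geometry violations": 33.0,
    "projection errors": 12.6,
    "feature errors": 10.5,
    "scale errors": 7.2,
    "deduction failures": 2.3,
}

# Sanity check: the six shares should account for all errors.
total = round(sum(error_shares.values()), 1)

# Top two modes: failing to sustain a chain or to enforce geometric rules.
rule_and_chain = round(
    error_shares["reasoning gaps"] + error_shares["geometry violations"], 1
)

# Modes that read most like misperceiving the diagram. This grouping is an
# interpretive assumption, not a category defined by the paper.
perception_like = round(
    error_shares["projection errors"] + error_shares["feature errors"], 1
)

print(total, rule_and_chain, perception_like)
```

On these figures, chain and rule failures outnumber the perception-like failures by almost three to one.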

The authors also report different error profiles across model families. Errors from GPT-5 and the GPT-4 series are dominated by reasoning gaps. Claude models frequently violate geometric rules. Gemini-2.5-Flash suffers more from projection and scale errors. Open-source models such as Qwen2.5-VL-7B show primarily reasoning gaps but fewer feature errors.

For applied AI teams, this distinction matters. “The model failed” is not a diagnosis. “The model misread the input” and “the model understood the input but violated the constraint system” imply different interventions.

Structured traces are training signal, not magic powder

MathSpatial’s second contribution is the 8,000-problem training corpus. Each problem includes the image, textual description, final answer, and detailed solution. The more interesting part is MathSpatial-SRT: structured reasoning traces organized around three atomic operations.

The three operations are:

Correlate: establish correspondences across views or geometric entities.
Constrain: apply projection, visibility, or geometric rules.
Infer: deduce latent attributes or final answers.

This decomposition is valuable because it turns spatial reasoning from a black-box answer-generation task into a traceable process. Instead of asking the model to produce an answer and hoping its intermediate reasoning is not decorative fog, the corpus encourages operation-level reasoning.

The traces are generated with GPT-4o under a constrained schema and then checked through a dual-role review process. A Reviewer agent audits each step for operation-type errors, contradictions, or missing steps; a Checker agent rewrites problematic traces while preserving the schema. The paper reports that this process detects and fixes about 10% of generated traces.
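Part of such an audit can be mechanical. The following is a minimal sketch of what an operation-level trace and a rule-based first pass over it might look like. The `Step` type, the `audit_trace` function, and the specific rules are hypothetical illustrations; the paper's actual schema and review prompts are not reproduced here.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_OPS = ("correlate", "constrain", "infer")  # the three atomic operations

@dataclass
class Step:
    op: str    # expected to be one of ALLOWED_OPS
    text: str  # natural-language content of the step

def audit_trace(steps: List[Step]) -> List[str]:
    """Return the problems found; an empty list means the trace passes.

    The rules below are illustrative assumptions, not the paper's schema.
    """
    if not steps:
        return ["trace is empty"]
    problems = []
    for i, step in enumerate(steps):
        if step.op not in ALLOWED_OPS:
            problems.append(f"step {i}: unknown operation {step.op!r}")
        if not step.text.strip():
            problems.append(f"step {i}: empty step text")
    if steps[-1].op != "infer":
        problems.append("trace does not end with an infer step")
    if not any(s.op == "constrain" for s in steps):
        problems.append("no constrain step: no rule was ever applied")
    return problems

good = [
    Step("correlate", "Match the left column of the front view to the top view."),
    Step("constrain", "Front and top views must share the same width."),
    Step("infer", "The missing side view is two cells high."),
]
bad = [Step("correlate", "Match views."), Step("guess", "Probably option B.")]

print(audit_trace(good))
print(audit_trace(bad))
```

A structural check like this cannot tell whether a step is geometrically correct, which is presumably why the paper relies on model-based review. It can, however, cheaply reject traces that never apply a rule or never commit to a conclusion.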

This part should be interpreted carefully. The paper does not prove that Correlate, Constrain, and Infer are a complete theory of spatial cognition. It does not prove that models trained on these traces “understand geometry” in the human sense. It shows something more modest and more useful: structured intermediate supervision can improve performance and reduce token use on the tested models.

The fine-tuning results support that point. The authors fine-tune Qwen2.5-VL-7B, InternVL3-8B, and Llama3-8B on MathSpatial-Corpus, and all three improve in overall accuracy. InternVL3-8B rises from 17.4% to 22.6% (as MathSpatial-InternVL3-8B). Qwen2.5-VL-7B rises from 17.8% to 22.1% (as MathSpatial-Qwen2.5-VL-7B). Llama3-8B rises from 15.0% to 20.3% (as MathSpatial-Llama3-8B).

The gains are real, but not miraculous. Nobody should read a move from 17.8% to 22.1% as “problem solved.” That would be an aggressive interpretation, the kind usually found in pitch decks and poorly supervised press releases.

The more interesting result is that token use falls at the same time. Qwen2.5-VL-7B drops from 465.3 average tokens to 351.9 after MathSpatial fine-tuning. InternVL3-8B drops from 473.5 to 318.3. Llama3-8B drops from 785.4 to 397.3.
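Putting the two movements side by side makes the trade clearer. This short sketch derives accuracy gains in percentage points and token reductions in percent from the figures quoted in this section; the helper names are my own.

```python
# (base, fine-tuned) figures quoted in this section.
accuracy = {
    "Qwen2.5-VL-7B": (17.8, 22.1),
    "InternVL3-8B": (17.4, 22.6),
    "Llama3-8B": (15.0, 20.3),
}
avg_tokens = {
    "Qwen2.5-VL-7B": (465.3, 351.9),
    "InternVL3-8B": (473.5, 318.3),
    "Llama3-8B": (785.4, 397.3),
}

def point_gain(before: float, after: float) -> float:
    """Accuracy change in percentage points."""
    return round(after - before, 1)

def pct_reduction(before: float, after: float) -> float:
    """Relative reduction in percent."""
    return round(100 * (before - after) / before, 1)

summary = {
    model: (point_gain(*accuracy[model]), pct_reduction(*avg_tokens[model]))
    for model in accuracy
}

for model, (gain, cut) in summary.items():
    print(f"{model:15s} +{gain} points accuracy, {cut}% fewer tokens")
```

The accuracy gains land between four and five and a half points, while the token reductions range from about a quarter to about a half.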

For businesses, that combined movement matters: slightly better answers, shorter reasoning traces, and potentially lower inference cost. The paper’s immediate practical value is not that MathSpatial creates spatially reliable MLLMs. It is that structured supervision can make reasoning less wasteful and somewhat more correct.

How to read the experiments without over-reading them

The paper contains several components that serve different evidentiary roles. Mixing them together leads to bad conclusions. Here is the cleaner map.

Paper component	Likely purpose	What it supports	What it does not prove
MathSpatial-Bench evaluation	Main evidence	Current leading MLLMs are far below human performance on clean mathematical spatial reasoning	Failure rates in every real-world robotics or CAD environment
Human closed-book baseline	Calibration	The benchmark is highly solvable for humans under controlled conditions	Humans never fail spatial reasoning outside textbook settings
Fine-tuning on MathSpatial-Corpus	Training-value test	Structured corpus improves tested open-source models	Fine-tuning alone closes the human-model gap
Structured reasoning traces	Implementation and diagnostic contribution	Intermediate supervision can guide models toward more interpretable reasoning	The three operations are a complete cognitive theory
Error taxonomy	Diagnostic analysis	Dominant failures are reasoning gaps and geometry violations	Every individual model failure has a single clean cause
Static clean diagrams	Scope control	The benchmark isolates reasoning from perception noise	Performance transfers automatically to dynamic embodied environments

This table is the practical way to read the paper. The main result is the human-model gap. The training experiment shows that the dataset is useful, not sufficient. The error taxonomy tells researchers and builders where to intervene. The static-diagram design makes the benchmark cleaner, but also narrows its claim.

The paper is strongest when it is treated as a diagnostic instrument.

Business meaning: spatial AI needs pre-deployment geometry gates

The direct paper finding is academic: MLLMs perform poorly on a curated mathematical spatial reasoning benchmark, and structured training improves several open-source models modestly.

The business inference is sharper: companies building AI systems for physical-world workflows should not assume that a multimodal model’s general visual competence implies spatial reliability.

That matters in several domains:

Domain	Why MathSpatial-style failure matters
Robotics	A robot planner must preserve geometry, affordances, and object relations across steps
Warehouse automation	Bin picking, packing, and shelf reasoning require stable spatial constraints
CAD and engineering assistants	Projection, section views, and geometry-derived properties are not optional decorations
Industrial inspection	A system must distinguish visible defects from impossible or inconsistent geometry
Construction and facilities	Spatial consistency affects layout, measurement, and conflict detection
Embodied agents	Navigation and manipulation require more than fluent image descriptions

The Cognaptus reading is not “use MathSpatial directly in production.” Most firms will not deploy a geometry exam as a workflow. The better lesson is architectural.

If an AI system touches physical layouts, tools, vehicles, machines, diagrams, or 3D representations, it needs explicit spatial evaluation gates. Those gates should not only test final answers. They should test intermediate consistency:

Did the model correctly align entities across views?
Did it preserve visibility and projection rules?
Did it carry scale relationships through the reasoning chain?
Did it invent edges, surfaces, or components?
Did it reach the answer through a valid sequence, or merely land near the answer by luck?

This is where MathSpatial becomes operationally relevant. It offers a pattern for diagnosis: define the reasoning primitives, construct clean tests, record trace-level failures, and use the errors to decide whether the model needs better perception, better constraint handling, better training data, or a non-LLM verification layer.

For enterprise AI, that last phrase is important. Sometimes the correct fix is not “ask the model harder.” It is to bind the model to rule-based geometry checks, simulation engines, CAD kernels, or domain-specific validators. The model can propose. The constraint system should verify. Trusting a language model to remember orthographic projection rules because it once wrote a nice paragraph about engineering drawing is not a strategy. It is a vibe with an API key.
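Here is a toy version of "the model proposes, the constraint system verifies." It assumes the solid is available as a set of unit cubes and checks a proposed view against the true orthographic silhouette. The voxel representation, the axis conventions, and the function names are all assumptions made for illustration; a production system would call a CAD kernel or simulator, and would also need to handle hidden edges, which this sketch ignores.

```python
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]  # unit cube at integer (x, y, z)
Cell = Tuple[int, int]        # filled cell in a 2D view

def project(voxels: Iterable[Voxel], view: str) -> Set[Cell]:
    """Silhouette of a voxel solid under an axis-aligned orthographic view."""
    if view == "front":  # looking along the y axis
        return {(x, z) for x, y, z in voxels}
    if view == "top":    # looking down the z axis
        return {(x, y) for x, y, z in voxels}
    if view == "side":   # looking along the x axis
        return {(y, z) for x, y, z in voxels}
    raise ValueError(f"unknown view: {view}")

def verify_view(voxels: Iterable[Voxel], view: str, proposed: Iterable[Cell]) -> bool:
    """Accept a proposed view only if it matches the computed silhouette exactly."""
    return project(voxels, view) == set(proposed)

# L-shaped solid: two cubes side by side along x, one more stacked on the first.
solid = {(0, 0, 0), (1, 0, 0), (0, 0, 1)}

proposed_top_ok = {(0, 0), (1, 0)}
proposed_top_bad = {(0, 0), (1, 0), (0, 1)}  # invents a cell no cube supports

print(verify_view(solid, "top", proposed_top_ok))
print(verify_view(solid, "top", proposed_top_bad))
```

The verifier is deliberately dumb. It does not need to reason. It only needs to be right about projection, which is exactly the property a fluent model cannot be assumed to have.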

The ROI is cheaper diagnosis, not just cheaper inference

The paper reports token reductions after fine-tuning with structured traces. That naturally points to cost savings. Lower token use can reduce inference cost and latency.

But the larger business value is diagnosis.

Without intermediate supervision, an enterprise team sees a failed answer and has to guess why the system failed. Was the image unreadable? Was the prompt ambiguous? Did the model misunderstand a rule? Did it reason correctly but calculate incorrectly? Did it hallucinate a hidden edge because it was feeling creative, as models occasionally do?

Structured traces make failures inspectable. That changes the improvement loop.

Without structured diagnosis	With structured diagnosis
Debugging starts from final wrong answers	Debugging starts from the failing reasoning step
Failure modes blur together	Projection, feature, scale, geometry, and reasoning errors can be separated
Teams rely on broad model upgrades	Teams can target data, prompts, validators, or architecture
Cost control focuses only on model size	Cost control can also reduce reasoning waste
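A team that labels its own failures by mode can turn the right-hand column into a simple triage step. The sketch below counts labeled failures and ranks candidate interventions. The mapping from failure mode to intervention is an assumption for illustration, and the labels are invented, not data from the paper.

```python
from collections import Counter

# ASSUMED mapping from failure mode to the first intervention to consider.
INTERVENTION = {
    "projection": "perception / diagram parsing",
    "feature": "perception / diagram parsing",
    "scale": "constraint checks or validators",
    "geometry": "constraint checks or validators",
    "reasoning": "training data or trace supervision",
    "deduction": "training data or trace supervision",
}

def triage(failure_labels):
    """Count failures by mode, then rank interventions by failures addressed."""
    by_mode = Counter(failure_labels)
    by_fix = Counter()
    for mode, n in by_mode.items():
        by_fix[INTERVENTION[mode]] += n  # raises KeyError on an unknown label
    return by_mode, by_fix.most_common()

# HYPOTHETICAL labels from an internal evaluation run.
labels = ["reasoning", "geometry", "geometry", "projection",
          "reasoning", "geometry", "scale"]

by_mode, ranked = triage(labels)
print(by_mode)
print(ranked)
```

With these invented labels, validators come out ahead of retraining. The point is not the specific ranking but that the decision is driven by counted failure modes rather than by whichever model upgrade is newest.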

This is the practical bridge from paper to deployment. A benchmark like MathSpatial does not merely rank models. It helps teams ask whether their system’s reasoning pipeline is auditable.

And in physical-world AI, auditability is not philosophical garnish. If a system recommends a robot movement, a warehouse layout, or a design modification, it is useful to know whether it maintained the relevant constraints before the final answer appeared.

The boundary: clean geometry is not the whole physical world

MathSpatial’s discipline is also its boundary.

The benchmark focuses on static geometric diagrams. This is a strength because it minimizes perceptual noise and isolates reasoning. It is also a limitation because many business applications involve motion, occlusion, sensor uncertainty, deformable objects, irregular environments, and feedback loops.

A model that improves on MathSpatial is not automatically ready for autonomous driving, robotic manipulation, or industrial inspection. The paper itself acknowledges this scope and points to future extensions involving dynamic spatial transformations and complex 3D embodied environments.

There is also a dataset-origin boundary. The problems come from public educational materials, with a portion translated from Chinese into English and spot-checked. That supports scale and bilingual coverage, but it does not eliminate all possible distribution artifacts. Educational diagrams have conventions. Real production data has surprises, and surprises are where systems often reveal their true personality, usually at the least convenient moment.

Finally, the fine-tuning gains are modest. They demonstrate direction, not destination. Token reductions of roughly 24% to 49% across the three fine-tuned models are operationally interesting. Accuracy improvements of about four to five percentage points are scientifically useful. But the remaining gap to human performance is still large enough that spatial reliability should be treated as an open engineering problem.

What this paper changes

MathSpatial does not show that MLLMs are useless for spatial tasks. That would be too broad and too easy.

It shows something more specific: clean spatial reasoning remains a bottleneck even for strong multimodal models; the bottleneck is not reducible to perception; and structured intermediate supervision helps, but does not close the gap.

That combination is valuable because it tells builders where not to be lazy.

Do not assume image understanding implies spatial reasoning. Do not assume chain-of-thought text means the model maintained constraints. Do not assume scale will quietly solve geometry while everyone is busy refreshing leaderboards. And do not treat a model’s confident diagram explanation as evidence that it can preserve a view relationship across multiple steps.

The paper’s best contribution is not that it gives the field another score table. It gives the field a cleaner failure. Clean failures are useful. They remove excuses, narrow the problem, and make progress measurable.

Models can describe a cube. Some can even sound very thoughtful while doing it.

Understanding the cube — its views, constraints, hidden edges, and measurable properties — remains another matter.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shuo Lu et al., “Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation,” arXiv:2602.11635v2, 2026. https://arxiv.org/abs/2602.11635
