Geometry looks clean.
A cube has edges. A projection has rules. A missing view should follow from the views already shown. This is not the messy world of occluded street scenes, motion blur, shadows, or a warehouse camera pointed at the wrong shelf. It is the kind of visual reasoning many students learn before they are trusted with anything more dangerous than a compass, a ruler, and mild boredom.
That is exactly why the new MathSpatial paper is awkward for multimodal large language models.[1] The authors evaluate leading MLLMs on textbook-style mathematical spatial reasoning tasks and find a gap large enough to be impolite: human students reach 96.3% accuracy, while the best tested model, GPT-5, reaches 58.5%. Most models remain far below that. Several open-source baselines cluster around the high teens or low twenties.
This matters because the benchmark is not asking models to interpret a cluttered photograph. It is asking them to reason over clean diagrams with minimal perceptual noise. If a model fails, the usual excuse — “the image was hard to see” — has mostly left the building.
The result is not just another benchmark leaderboard. We have enough of those to tile a small airport. The more useful point is diagnostic: MathSpatial suggests that many MLLMs can look spatially competent while still failing at the operations that physical-world AI systems need — aligning views, preserving geometric constraints, and carrying a multi-step reasoning chain without quietly dropping the plot.
The unpleasant number comes first: 58.5% versus 96.3%
The paper’s central evidence is simple enough to hurt.
MathSpatial-Bench contains 2,000 evaluation problems across three broad categories and 11 subtypes. The authors report that human students, tested under closed-book conditions, achieve 96.3% micro-averaged accuracy. GPT-5 leads the tested models at 58.5%. Gemini-2.5-Flash follows at 48.5%, Gemini-2.5-Pro at 44.9%, while GPT-4o, GPT-4.1, Claude variants, and several open-source models perform substantially lower.
The useful comparison is not “which logo wins this week.” It is the size and nature of the remaining gap.
| Group | Representative result | What the number means | What it does not mean |
|---|---|---|---|
| Human students | 96.3% | Clean geometry tasks are highly solvable by people under controlled conditions | Human performance on all spatial tasks is solved or effortless |
| Best tested closed-source model | GPT-5: 58.5% | Frontier MLLMs still fail many clean spatial reasoning cases | GPT-5 is weak in general (the result is task-specific) |
| Strong second-tier closed-source model | Gemini-2.5-Flash: 48.5% | Some models handle selected transformations better than others | High performance on some subtypes guarantees robust geometry |
| Open-source baseline range | roughly 15%–21% for several tested models | Base open-source MLLMs remain far from reliable on this benchmark | Open-source models cannot improve with targeted training |
This is why the paper is better read evidence-first than dataset-first. A normal summary would begin with the dataset construction pipeline, then explain the benchmark, then eventually show results. That is polite. It is also the wrong cognitive order.
The reader’s default assumption is likely: if a model can read images and solve math problems, then clean geometry diagrams should be within reach. MathSpatial’s contribution is to puncture that assumption before we admire the dataset architecture.
The benchmark removes the easy excuse: perception noise
A common weakness of spatial reasoning benchmarks is that perception and reasoning get tangled together. Put a model in a complex 3D scene and ask a spatial question; if it fails, we do not know whether it misread the image, misunderstood the relation, ignored a constraint, or simply guessed with theatrical confidence.
MathSpatial tries to separate those failures.
The benchmark uses clean mathematical diagrams sourced from educational materials. These are not photorealistic kitchen scenes. They are formalized problems with objective answers: multiple-choice or numeric. The authors collect raw candidates from public educational repositories and textbooks, filter incomplete or non-spatial items, standardize images and text, deduplicate across splits, check geometric consistency, and verify solutions.
The curation pipeline is not just clerical. It supports the paper’s main interpretive claim. If the benchmark minimizes perceptual distractions, then model errors become more informative about spatial reasoning itself.
The authors began with 35,428 raw candidates, retained 21,673 after preliminary curation, reduced the pool to about 11,000 unique high-quality samples after standardization and deduplication, removed about 400 geometrically inconsistent items, and ended with 10,000 verified problems. These were split into:
- MathSpatial-Bench: 2,000 problems for diagnostic evaluation.
- MathSpatial-Corpus: 8,000 problems for training, with verified solutions and structured reasoning traces.
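The funnel above is easier to judge as retention rates. The following is a minimal sketch that uses only the counts quoted in this article; the names `FUNNEL` and `retention` are illustrative, and because the source gives the third stage only as "about 11,000", the ratios that depend on it are approximate.

```python
# Curation funnel counts as quoted in this article.
# The third stage is "about 11,000" in the source, so treat it as approximate.
# The roughly 400 geometrically inconsistent items fall between the last two stages.
FUNNEL = [
    ("raw candidates", 35_428),
    ("after preliminary curation", 21_673),
    ("after standardization and deduplication (approx.)", 11_000),
    ("verified problems", 10_000),
]


def retention(funnel):
    """Return rows of (stage, count, share of raw pool, share of previous stage)."""
    raw = funnel[0][1]
    prev = raw
    rows = []
    for stage, count in funnel:
        rows.append((stage, count, count / raw, count / prev))
        prev = count
    return rows


rows = retention(FUNNEL)
for stage, count, of_raw, of_prev in rows:
    print(f"{stage:<52} {count:>6}  {of_raw:6.1%} of raw  {of_prev:6.1%} of previous")

# The verified pool is split into the benchmark and the training corpus.
BENCH, CORPUS = 2_000, 8_000
assert BENCH + CORPUS == FUNNEL[-1][1]
```

Read this way, a little over a quarter of the raw candidates survive to the verified pool, which is consistent with the claim that the curation is more than clerical.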
The benchmark is not perfect, but its design is disciplined. It does not say, “models fail in the wild.” It says something narrower and more damaging: models fail even when the spatial problem is cleaned up for them.
That is a better benchmark design than throwing models into visual chaos and then pretending the resulting confusion is a precise measurement.
What MathSpatial actually tests: recognition, generation, deduction
MathSpatial-Bench is organized into three categories: Holistic Recognition, Generative Inference, and Abstract Deduction. These are not just labels. They represent different levels of spatial burden.
| Category | Problem count | Share of benchmark | What the model must do |
|---|---|---|---|
| Holistic Recognition | 518 | 25.9% | Recognize or match spatial structures, often across views |
| Generative Inference | 636 | 31.8% | Complete or transform views under geometric constraints |
| Abstract Deduction | 846 | 42.3% | Infer properties, feasibility, or calculations from spatial rules |
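The shares in the table follow directly from the counts, and they also explain why a micro-averaged score leans toward Abstract Deduction. The sketch below recomputes the shares and contrasts micro and macro averaging. The per-category correct counts in `hypothetical_correct` are invented purely for illustration and are not results from the paper.

```python
# Category sizes as quoted in the table above.
CATEGORY_COUNTS = {
    "Holistic Recognition": 518,
    "Generative Inference": 636,
    "Abstract Deduction": 846,
}

total = sum(CATEGORY_COUNTS.values())
shares = {name: round(100 * n / total, 1) for name, n in CATEGORY_COUNTS.items()}


def micro_average(correct, counts=CATEGORY_COUNTS):
    """Every problem counts equally, so large categories weigh more."""
    return sum(correct.values()) / sum(counts.values())


def macro_average(correct, counts=CATEGORY_COUNTS):
    """Every category counts equally, regardless of its size."""
    return sum(correct[c] / counts[c] for c in counts) / len(counts)


# HYPOTHETICAL correct counts, chosen only to show that the two averages differ.
hypothetical_correct = {
    "Holistic Recognition": 400,
    "Generative Inference": 300,
    "Abstract Deduction": 200,
}

micro = micro_average(hypothetical_correct)
macro = macro_average(hypothetical_correct)
print(shares)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these made-up counts the micro average comes out below the macro average, because the weakest category is also the largest one.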
This category design matters because “spatial reasoning” is often treated as one ability. It is not. A model may recognize a familiar arrangement yet fail to generate a missing view. It may handle a view-matching problem but collapse when asked to calculate a property from constraints.
The paper’s results follow that pattern. Holistic Recognition is relatively easier. Top models reach stronger scores on subtypes such as image-view identification and three-view matching. Generative Inference is mixed: missing-view completion is tractable for stronger models, while visual transformation remains brutal for many. Abstract Deduction is the hardest overall, especially geometric property calculation.
That last point deserves more attention than a leaderboard rank. Geometric property calculation (GPC) accounts for 24.1% of the benchmark and is a near-universal failure point. GPT-5 reaches 52.3% on GPC. Several other closed-source models score near zero. All base open-source models tested score at or below 9.3%, with multiple models at 0.0%.
This is not a small “more data will fix it eventually, please clap” result. It suggests that current MLLMs often lack reliable mechanisms for combining visual structure, formal constraints, and multi-step computation.
The error analysis says models lose rules, not just pixels
The paper’s error analysis is useful because it moves the discussion from “accuracy is low” to “what kind of thinking is breaking?”
The authors group errors into six failure modes:
| Failure mode | Share of errors | Interpretation |
|---|---|---|
| Reasoning gaps | 34.4% | The model’s chain is incomplete or internally inconsistent |
| Geometry violations | 33.0% | The output breaks projection, visibility, or geometric rules |
| Projection errors | 12.6% | The model misinterprets top, side, or front views |
| Feature errors | 10.5% | The model omits parts, invents edges, or mishandles components |
| Scale errors | 7.2% | The model fails to preserve relative sizes |
| Deduction failures | 2.3% | The model cannot synthesize cues into a final conclusion |
The top two categories — reasoning gaps and geometry violations — account for roughly two-thirds of observed errors. This is the heart of the paper.
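The "two-thirds" figure can be checked from the table. This is a small sketch using the shares as quoted; the grouping into `PERCEPTION_LIKE` is this article's reading of which modes resemble input misreading, not a classification taken from the paper.

```python
# Error shares (percent of all observed errors) as quoted in the table above.
ERROR_SHARES = {
    "reasoning gaps": 34.4,
    "geometry violations": 33.0,
    "projection errors": 12.6,
    "feature errors": 10.5,
    "scale errors": 7.2,
    "deduction failures": 2.3,
}

# The six modes should cover all errors.
assert abs(sum(ERROR_SHARES.values()) - 100.0) < 1e-6

# Share taken by the two largest failure modes.
top_two_share = round(sum(sorted(ERROR_SHARES.values(), reverse=True)[:2]), 1)

# ASSUMPTION: treating projection and feature errors as the perception-like modes.
PERCEPTION_LIKE = ("projection errors", "feature errors")
perception_like_share = round(sum(ERROR_SHARES[m] for m in PERCEPTION_LIKE), 1)

print(f"top two modes: {top_two_share}%")
print(f"perception-like modes: {perception_like_share}%")
```

Under that grouping, the perception-like modes account for well under a third of errors, while the top two modes account for about two-thirds.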
If the main failures were feature errors or projection errors, we might treat MathSpatial as mainly a perception problem. Better visual encoders, more diagram pretraining, or higher resolution might carry the day. But the error distribution points elsewhere. Models often fail to sustain a valid reasoning process or enforce geometric rules across steps.
That is a different kind of weakness. It is closer to watching a model understand each sentence in a contract but still violate the contract logic by paragraph four. The local pieces look fine. The global constraint system quietly collapses.
The authors also report different error profiles across model families. Errors from GPT-5 and the GPT-4 series are dominated by reasoning gaps. Claude models frequently violate geometric rules. Gemini-2.5-Flash suffers more from projection and scale errors. Open-source models such as Qwen2.5-VL-7B show primarily reasoning gaps but fewer feature errors.
For applied AI teams, this distinction matters. “The model failed” is not a diagnosis. “The model misread the input” and “the model understood the input but violated the constraint system” imply different interventions.
Structured traces are training signal, not magic powder
MathSpatial’s second contribution is the 8,000-problem training corpus. Each problem includes the image, textual description, final answer, and detailed solution. The more interesting part is MathSpatial-SRT: structured reasoning traces organized around three atomic operations.
The three operations are:
- Correlate: establish correspondences across views or geometric entities.
- Constrain: apply projection, visibility, or geometric rules.
- Infer: deduce latent attributes or final answers.
This decomposition is valuable because it turns spatial reasoning from a black-box answer-generation task into a traceable process. Instead of asking the model to produce an answer and hoping its intermediate reasoning is not decorative fog, the corpus encourages operation-level reasoning.
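To show what operation-level reasoning might look like as data, here is a minimal sketch of a trace record with a structural audit. The schema (`Op`, `Step`, `Trace`) and the audit rules are hypothetical; this article does not reproduce the paper's actual trace format, so treat every field name below as an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Op(Enum):
    CORRELATE = "correlate"  # match entities across views
    CONSTRAIN = "constrain"  # apply a projection, visibility, or geometric rule
    INFER = "infer"          # deduce an attribute or the final answer


@dataclass
class Step:
    op: Op
    statement: str


@dataclass
class Trace:
    # HYPOTHETICAL schema, for illustration only.
    problem_id: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def audit(trace: Trace) -> List[str]:
    """Return structural issues; an empty list means the trace passes."""
    if not trace.steps:
        return ["trace has no steps"]
    issues = []
    if trace.steps[-1].op is not Op.INFER:
        issues.append("final step is not an Infer step")
    if not any(s.op is Op.CONSTRAIN for s in trace.steps):
        issues.append("no Constrain step: no rule was ever applied")
    for i, s in enumerate(trace.steps, start=1):
        if not s.statement.strip():
            issues.append(f"step {i} is empty")
    if not trace.answer.strip():
        issues.append("missing final answer")
    return issues


good = Trace(
    problem_id="demo-1",
    steps=[
        Step(Op.CORRELATE, "The left column in the front view matches the left column in the top view."),
        Step(Op.CONSTRAIN, "Heights in the side view must equal heights in the front view."),
        Step(Op.INFER, "Only option B has a side view two units tall."),
    ],
    answer="B",
)
bad = Trace(problem_id="demo-2", steps=[Step(Op.CORRELATE, "Views look similar.")], answer="C")

print(audit(good))
print(audit(bad))
```

A check like this only inspects the shape of a trace, not whether each statement is geometrically true. It is a cheap first filter, not a substitute for semantic review.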
The traces are generated with GPT-4o under a constrained schema and then checked through a dual-role review process. A Reviewer agent audits each step for operation-type errors, contradictions, or missing steps; a Checker agent rewrites problematic traces while preserving the schema. The paper reports that this process detects and fixes about 10% of generated traces.
This part should be interpreted carefully. The paper does not prove that Correlate, Constrain, and Infer are a complete theory of spatial cognition. It does not prove that models trained on these traces “understand geometry” in the human sense. It shows something more modest and more useful: structured intermediate supervision can improve performance and reduce token use on the tested models.
The fine-tuning results support that point. The authors fine-tune Qwen2.5-VL-7B, InternVL3-8B, and Llama3-8B on MathSpatial-Corpus. All three improve overall accuracy. MathSpatial-InternVL3-8B rises from 17.4% to 22.6%. MathSpatial-Qwen2.5-VL-7B rises from 17.8% to 22.1%. MathSpatial-Llama3-8B rises from 15.0% to 20.3%.
The gains are real, but not miraculous. Nobody should read a move from 17.8% to 22.1% as “problem solved.” That would be an aggressive interpretation, the kind usually found in pitch decks and poorly supervised press releases.
The more interesting result is that token use falls at the same time. Qwen2.5-VL-7B drops from 465.3 average tokens to 351.9 after MathSpatial fine-tuning. InternVL3-8B drops from 473.5 to 318.3. Llama3-8B drops from 785.4 to 397.3.
For businesses, that dual movement matters: slightly better answers, shorter reasoning traces, and potentially lower inference cost. The paper’s immediate practical value is not that MathSpatial creates spatially reliable MLLMs. It shows that structured supervision can make reasoning less wasteful and somewhat more correct.
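Both movements can be put side by side with a few lines of arithmetic. The sketch below uses only the accuracy and token figures quoted above; the `RESULTS` layout and the `summarize` helper are illustrative names.

```python
# model: (accuracy before, accuracy after, avg tokens before, avg tokens after)
# All figures are the ones quoted in this article.
RESULTS = {
    "Qwen2.5-VL-7B": (17.8, 22.1, 465.3, 351.9),
    "InternVL3-8B": (17.4, 22.6, 473.5, 318.3),
    "Llama3-8B": (15.0, 20.3, 785.4, 397.3),
}


def summarize(results):
    """Return accuracy gain (points) and token reduction (percent) per model."""
    summary = {}
    for model, (acc0, acc1, tok0, tok1) in results.items():
        summary[model] = {
            "accuracy_gain_points": round(acc1 - acc0, 1),
            "token_reduction_pct": round(100 * (tok0 - tok1) / tok0, 1),
        }
    return summary


summary = summarize(RESULTS)
for model, row in summary.items():
    print(
        f"{model}: +{row['accuracy_gain_points']} points, "
        f"{row['token_reduction_pct']}% fewer tokens"
    )
```

The accuracy gains land between roughly four and five points, while the token reductions range from about a quarter to about a half, which is why the efficiency effect is the more striking of the two.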
How to read the experiments without over-reading them
The paper contains several components that serve different evidentiary roles. Mixing them together leads to bad conclusions. Here is the cleaner map.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| MathSpatial-Bench evaluation | Main evidence | Current leading MLLMs are far below human performance on clean mathematical spatial reasoning | Failure rates in every real-world robotics or CAD environment |
| Human closed-book baseline | Calibration | The benchmark is highly solvable for humans under controlled conditions | Humans never fail spatial reasoning outside textbook settings |
| Fine-tuning on MathSpatial-Corpus | Training-value test | Structured corpus improves tested open-source models | Fine-tuning alone closes the human-model gap |
| Structured reasoning traces | Implementation and diagnostic contribution | Intermediate supervision can guide models toward more interpretable reasoning | The three operations are a complete cognitive theory |
| Error taxonomy | Diagnostic analysis | Dominant failures are reasoning gaps and geometry violations | Every individual model failure has a single clean cause |
| Static clean diagrams | Scope control | The benchmark isolates reasoning from perception noise | Performance transfers automatically to dynamic embodied environments |
This table is the practical way to read the paper. The main result is the human-model gap. The training experiment shows that the dataset is useful, not sufficient. The error taxonomy tells researchers and builders where to intervene. The static-diagram design makes the benchmark cleaner, but also narrows its claim.
The paper is strongest when it is treated as a diagnostic instrument.
Business meaning: spatial AI needs pre-deployment geometry gates
The direct paper finding is academic: MLLMs perform poorly on a curated mathematical spatial reasoning benchmark, and structured training improves several open-source models modestly.
The business inference is sharper: companies building AI systems for physical-world workflows should not assume that a multimodal model’s general visual competence implies spatial reliability.
That matters in several domains:
| Domain | Why MathSpatial-style failure matters |
|---|---|
| Robotics | A robot planner must preserve geometry, affordances, and object relations across steps |
| Warehouse automation | Bin picking, packing, and shelf reasoning require stable spatial constraints |
| CAD and engineering assistants | Projection, section views, and geometry-derived properties are not optional decorations |
| Industrial inspection | A system must distinguish visible defects from impossible or inconsistent geometry |
| Construction and facilities | Spatial consistency affects layout, measurement, and conflict detection |
| Embodied agents | Navigation and manipulation require more than fluent image descriptions |
The Cognaptus reading is not “use MathSpatial directly in production.” Most firms will not deploy a geometry exam as a workflow. The better lesson is architectural.
If an AI system touches physical layouts, tools, vehicles, machines, diagrams, or 3D representations, it needs explicit spatial evaluation gates. Those gates should not only test final answers. They should test intermediate consistency:
- Did the model correctly align entities across views?
- Did it preserve visibility and projection rules?
- Did it carry scale relationships through the reasoning chain?
- Did it invent edges, surfaces, or components?
- Did it reach the answer through a valid sequence, or merely land near the answer by luck?
This is where MathSpatial becomes operationally relevant. It offers a pattern for diagnosis: define the reasoning primitives, construct clean tests, record trace-level failures, and use the errors to decide whether the model needs better perception, better constraint handling, better training data, or a non-LLM verification layer.
For enterprise AI, that last phrase is important. Sometimes the correct fix is not “ask the model harder.” It is to bind the model to rule-based geometry checks, simulation engines, CAD kernels, or domain-specific validators. The model can propose. The constraint system should verify. Trusting a language model to remember orthographic projection rules because it once wrote a nice paragraph about engineering drawing is not a strategy. It is a vibe with an API key.
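As a toy version of "the model proposes, the constraint system verifies", the sketch below checks proposed orthographic views against a solid built from unit cubes. Everything here is an assumption made for illustration: the voxel representation, the axis convention (front looks along y, top along z, side along x), and the `project` and `gate` helpers. It compares silhouettes only and ignores hidden-line conventions, so it is far simpler than a real CAD kernel.

```python
from typing import Dict, Iterable, Set, Tuple

Voxel = Tuple[int, int, int]  # (x, y, z) position of a unit cube
Cell = Tuple[int, int]        # a filled cell in a 2D view

# ASSUMED axis convention: which two coordinates survive in each view.
VIEW_AXES = {"front": (0, 2), "top": (0, 1), "side": (1, 2)}


def project(voxels: Iterable[Voxel], view: str) -> Set[Cell]:
    """Orthographic silhouette of a set of unit cubes for a named view."""
    if view not in VIEW_AXES:
        raise ValueError(f"unknown view: {view}")
    a, b = VIEW_AXES[view]
    return {(v[a], v[b]) for v in voxels}


def gate(voxels: Iterable[Voxel], proposed_views: Dict[str, Set[Cell]]) -> Dict[str, dict]:
    """Compare each proposed view with the rule-based projection."""
    voxels = list(voxels)
    report = {}
    for view, proposed in proposed_views.items():
        expected = project(voxels, view)
        report[view] = {
            "ok": proposed == expected,
            "invented": sorted(proposed - expected),  # cells the model made up
            "missing": sorted(expected - proposed),   # cells the model dropped
        }
    return report


# An L-shaped solid: two cubes along x, one stacked on the first.
solid = {(0, 0, 0), (1, 0, 0), (0, 0, 1)}

# Pretend these views came from a model; the side view contains an invented cell.
proposed = {
    "front": {(0, 0), (1, 0), (0, 1)},
    "side": {(0, 0), (0, 1), (1, 0)},
}

report = gate(solid, proposed)
for view, result in report.items():
    print(view, result)
```

The point of the design is the split of responsibilities: the gate does not need to know how the proposal was produced, only whether it survives a deterministic rule check, and it reports invented and missing cells separately so the failure is inspectable.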
The ROI is cheaper diagnosis, not just cheaper inference
The paper reports token reductions after fine-tuning with structured traces. That naturally points to cost savings. Lower token use can reduce inference cost and latency.
But the larger business value is diagnosis.
Without intermediate supervision, an enterprise team sees a failed answer and has to guess why the system failed. Was the image unreadable? Was the prompt ambiguous? Did the model misunderstand a rule? Did it reason correctly but calculate incorrectly? Did it hallucinate a hidden edge because it was feeling creative, as models occasionally do?
Structured traces make failures inspectable. That changes the improvement loop.
| Without structured diagnosis | With structured diagnosis |
|---|---|
| Debugging starts from final wrong answers | Debugging starts from the failing reasoning step |
| Failure modes blur together | Projection, feature, scale, geometry, and reasoning errors can be separated |
| Teams rely on broad model upgrades | Teams can target data, prompts, validators, or architecture |
| Cost control focuses only on model size | Cost control can also reduce reasoning waste |
This is the practical bridge from paper to deployment. A benchmark like MathSpatial does not merely rank models. It helps teams ask whether their system’s reasoning pipeline is auditable.
And in physical-world AI, auditability is not philosophical garnish. If a system recommends a robot movement, a warehouse layout, or a design modification, it is useful to know whether it maintained the relevant constraints before the final answer appeared.
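One way to turn trace-level failures into an improvement plan is a simple tally. The sketch below counts labeled failures by mode and groups them by a first intervention to try. The mode labels echo the error taxonomy discussed earlier, but the `INTERVENTION` mapping is a hypothetical starting point that a team would adapt, not a recommendation from the paper.

```python
from collections import Counter

# HYPOTHETICAL mapping from failure mode to a first intervention to try.
INTERVENTION = {
    "projection": "perception / diagram data",
    "feature": "perception / diagram data",
    "scale": "rule-based validator",
    "geometry": "rule-based validator",
    "reasoning": "structured-trace training data",
    "deduction": "structured-trace training data",
}


def triage(failure_labels):
    """Count failures by mode, then by suggested intervention (most common first)."""
    by_mode = Counter(failure_labels)
    unknown = sorted(set(by_mode) - set(INTERVENTION))
    if unknown:
        raise ValueError(f"unlabeled failure modes: {unknown}")
    by_fix = Counter()
    for mode, n in by_mode.items():
        by_fix[INTERVENTION[mode]] += n
    return by_mode, by_fix.most_common()


# Made-up labels from a review of failed cases, for illustration only.
labels = ["reasoning", "geometry", "geometry", "scale", "projection", "reasoning", "geometry"]
by_mode, ranked_fixes = triage(labels)
print(by_mode)
print(ranked_fixes)
```

For these sample labels the validator bucket ranks first, which would argue for adding a verification layer before buying a larger model.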
The boundary: clean geometry is not the whole physical world
MathSpatial’s discipline is also its boundary.
The benchmark focuses on static geometric diagrams. This is a strength because it minimizes perceptual noise and isolates reasoning. It is also a limitation because many business applications involve motion, occlusion, sensor uncertainty, deformable objects, irregular environments, and feedback loops.
A model that improves on MathSpatial is not automatically ready for autonomous driving, robotic manipulation, or industrial inspection. The paper itself acknowledges this scope and points to future extensions involving dynamic spatial transformations and complex 3D embodied environments.
There is also a dataset-origin boundary. The problems come from public educational materials, with a portion translated from Chinese into English and spot-checked. That supports scale and bilingual coverage, but it does not eliminate all possible distribution artifacts. Educational diagrams have conventions. Real production data has surprises, and surprises are where systems often reveal their true personality, usually at the least convenient moment.
Finally, the fine-tuning gains are modest. They demonstrate direction, not destination. Token reductions of roughly 24% to 49% across the three fine-tuned models are operationally interesting. A few percentage points of accuracy improvement is scientifically useful. But the remaining gap to human performance is still large enough that spatial reliability should be treated as an open engineering problem.
What this paper changes
MathSpatial does not show that MLLMs are useless for spatial tasks. That would be too broad and too easy.
It shows something more specific: clean spatial reasoning remains a bottleneck even for strong multimodal models; the bottleneck is not reducible to perception; and structured intermediate supervision helps, but does not close the gap.
That combination is valuable because it tells builders where not to be lazy.
Do not assume image understanding implies spatial reasoning. Do not assume chain-of-thought text means the model maintained constraints. Do not assume scale will quietly solve geometry while everyone is busy refreshing leaderboards. And do not treat a model’s confident diagram explanation as evidence that it can preserve a view relationship across multiple steps.
The paper’s best contribution is not that it gives the field another score table. It gives the field a cleaner failure. Clean failures are useful. They remove excuses, narrow the problem, and make progress measurable.
Models can describe a cube. Some can even sound very thoughtful while doing it.
Understanding the cube — its views, constraints, hidden edges, and measurable properties — remains another matter.
Cognaptus: Automate the Present, Incubate the Future.
[1] Shuo Lu et al., “Do MLLMs Really Understand Space? A Mathematical Spatial Reasoning Evaluation,” arXiv:2602.11635v2, 2026. https://arxiv.org/abs/2602.11635