When Models Get Lost in Space: Why MLLMs Still Fail Geometry
Opening — Why This Matters Now

Multimodal large language models (MLLMs) can caption images, describe scenes, and even explain memes with unsettling confidence. Yet ask them a textbook-level geometry problem involving orthographic projections or cube folding, and their composure dissolves. According to the newly proposed MathSpatial framework, humans solve structured spatial reasoning tasks with over 96% accuracy, while most leading MLLMs score below 60%. Even frontier systems plateau far below human baselines. ...