Spatial Reasoning

Spatial-Gym and the Illusion of Thinking: Why AI Can’t Walk Before It Runs

Agents are supposed to act. That is the promise hiding behind most enterprise AI demos: the model will not merely answer a question, but inspect a system, choose the next step, correct itself, and reach a useful outcome. The interface changes from chat box to workflow loop, and suddenly everyone starts using the word “agent” with the confidence of a person who has never watched a model get lost in a four-by-four grid. ...

Seeing Is Not Solving: Why AI Still Gets Stuck in 3D Worlds

Wall. That is not the grand philosophical frontier AI companies usually place in their product decks. The frontier is supposed to be reasoning, planning, tool use, autonomy, maybe a tasteful diagram with arrows and a glowing robot hand. But in a visually rich 3D world, a surprisingly large part of “autonomy” still reduces to something less glamorous: can the agent notice that it is stuck against a wall, step back, change angle, and continue? ...

The Map Is Not the Territory—But Your LLM Thinks It Is

Coffee is simple. Parking is annoying. Charging an electric vehicle while also finding a useful nearby stop is where the apparently simple request turns into a small urban planning problem wearing a chatbot costume. A user does not ask for a theorem. They ask something like: “I need to charge my car and grab coffee nearby. Where should I go?” ...

Topology Trouble: Why Even Frontier LLMs Still Get Lost in a Grid

Grid. It looks like the friendliest possible structure. Rows, columns, symbols, rules. No blurry photos, no social nuance, no awkward customer email written at 1:13 a.m. Just a small board and a set of constraints. Naturally, this is where modern reasoning models still manage to embarrass themselves. The paper introducing TopoBench studies a deceptively simple question: can frontier large language models solve topology-heavy grid puzzles where the answer depends on connectivity, loop closure, symmetry, visibility, and state consistency?[1] The answer is not “never.” That would be too easy. The answer is more annoying: models often understand enough to start correctly, reason long enough to sound competent, and then lose the structure that makes the solution valid. ...

When Models Get Lost in Space: Why MLLMs Still Fail Geometry

Geometry looks clean. A cube has edges. A projection has rules. A missing view should follow from the views already shown. This is not the messy world of occluded street scenes, motion blur, shadows, or a warehouse camera pointed at the wrong shelf. It is the kind of visual reasoning many students learn before they are trusted with anything more dangerous than a compass, a ruler, and mild boredom. ...

CitySeeker: Lost in Translation, Found in the City

The city does not answer literal questions. A person says, “I’m thirsty.” A human does not usually reply, “Please specify whether you require a vending machine, café, convenience store, supermarket, juice shop, water fountain, or bubble tea store.” That would be technically attentive and socially catastrophic. A human looks around, remembers what cities usually contain, infers which places can satisfy the need, and starts walking toward a plausible target. ...

Seeing Is Believing—Planning Is Not: What SpatialBench Reveals About MLLMs

A robot in a parking lot does not need poetry. It needs to know where the car is, which way the road bends, what happens if it turns right, and how to reach the exit without performing an expensive interpretation of modern sculpture on someone’s bumper. That sounds simple until we ask a multimodal large language model to do it. ...

Think Outside the Bounding Box: How SpatialThinker Reinforces 3D Reasoning

A warehouse robot does not need poetry. It needs to know whether the box is behind the pallet, whether the cup is closer than the plate, and whether the object it is about to grab is actually reachable rather than merely visible. Small details. Very irritating when ignored. This is where many multimodal models still become strangely philosophical. They can describe an image fluently, infer intent, and produce a confident answer. Then they miss that one object is in front of another. Apparently, “seeing” and understanding space are not the same occupation. ...