Opening — Why this matters now
Everyone wants AI agents that can act. Navigate systems. Execute workflows. Make decisions.
There’s just one small problem: they still struggle to think spatially.
The recent paper “Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym” quietly dismantles a widely held assumption in the AI industry—that better reasoning models naturally translate into better agents.
They don’t.
And the gap is not marginal. It’s structural.
Background — From one-shot brilliance to sequential confusion
Most LLM benchmarks operate under a convenient fiction: intelligence can be measured in a single response.
- Solve the puzzle in one go
- Produce a clean output
- Compare with ground truth
This is efficient. It’s also deeply misleading.
Humans don’t solve complex problems this way. We iterate, revise, backtrack, and occasionally get stuck.
The paper introduces Spatial-Gym, a reinforcement-learning-style environment designed to evaluate step-by-step spatial reasoning instead of one-shot answers. The task: navigate a 2D grid while satisfying multiple interacting constraints—dots, gaps, colored regions, shapes, and more.
In other words, a simplified version of real-world planning.
Analysis — What the paper actually does
1. Reframing reasoning as action
Spatial-Gym turns reasoning into a sequential decision process:
| Component | Description |
|---|---|
| State | Grid + current position + path history |
| Actions | Up / Down / Left / Right |
| Constraints | Multiple interacting spatial rules |
| Termination | Success, deadlock, or step limit |
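The four components above can be sketched as a tiny environment. This is a minimal illustration, not the benchmark's actual API: the class name `GridEnv` is ours, and the single "visit every target cell" rule is a toy stand-in for the paper's interacting constraints.

```python
# Minimal sketch of a Spatial-Gym-style environment. The "visit every
# target cell" constraint is a toy stand-in for the paper's interacting
# rules; class and method names are illustrative, not the benchmark's API.
from dataclasses import dataclass, field

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

@dataclass
class GridEnv:
    size: int                                  # grid is size x size
    targets: frozenset                         # cells the path must visit
    pos: tuple = (0, 0)                        # current position
    path: list = field(default_factory=list)   # path history
    max_steps: int = 50

    def step(self, action):
        """Apply one move; return (position, done, success)."""
        dr, dc = MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if not (0 <= r < self.size and 0 <= c < self.size):
            return self.pos, True, False       # left the grid: deadlock
        self.pos = (r, c)
        self.path.append(self.pos)
        success = self.targets <= set(self.path)
        done = success or len(self.path) >= self.max_steps
        return self.pos, done, success

env = GridEnv(size=3, targets=frozenset({(0, 1), (1, 1)}))
for action in ["right", "down"]:
    state, done, success = env.step(action)
```

The point of the framing is visible even in this toy: the agent is scored on the trajectory it commits to, not on the answer it formats.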
This matters because it removes a long-standing ambiguity:
Is the model failing to reason—or just failing to format its answer?
Spatial-Gym isolates the former.
2. Three evaluation modes
The study compares models under three settings:
| Setting | Description |
|---|---|
| One-shot | Full solution in a single response |
| Step-by-step | Sequential decision-making |
| Step + backtracking | Allows undoing previous steps |
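The first two settings differ only in what the model sees between actions. A toy harness makes the contrast concrete; the fixed "always move right" policy below is a stand-in for a model call, and the function names are ours.

```python
# Toy contrast between one-shot and step-by-step evaluation. The "policy"
# is a stand-in for a model call; function names are illustrative.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def apply_move(pos, move):
    dr, dc = MOVES[move]
    return (pos[0] + dr, pos[1] + dc)

def run_one_shot(plan, start):
    """Execute a full plan with no intermediate feedback."""
    pos = start
    for move in plan:
        pos = apply_move(pos, move)
    return pos

def run_stepwise(policy, start, steps):
    """Query the policy one move at a time, feeding back the new state."""
    pos, history = start, [start]
    for _ in range(steps):
        pos = apply_move(pos, policy(pos, history))
        history.append(pos)
    return pos, history

end_one_shot = run_one_shot(["right", "down"], (0, 0))
end_stepwise, history = run_stepwise(lambda pos, hist: "right", (0, 0), 2)
```

In one-shot mode the plan is judged as a whole; in stepwise mode every intermediate position becomes a commitment the policy must live with.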
This is where things get interesting—and slightly uncomfortable for the AI narrative.
Findings — Where the illusion breaks
1. The human–AI gap is… not subtle
| Agent | Accuracy |
|---|---|
| Humans | 98.0% |
| Best model (GPT-OSS 120B) | 16.0% |
| Mid-tier models | ~10–11% |
| Small models | <5% |
The gap isn’t incremental. It’s categorical.
This suggests that spatial reasoning is not just another scaling problem—it’s a missing capability class.
2. Step-by-step reasoning helps… until it doesn’t
| Model Type | Effect of Step-by-Step |
|---|---|
| Smaller models | +1% to +5.4% improvement |
| Frontier models | −4.8% to −5.6% decline |
Why?
Because stepwise reasoning introduces a constraint:
You must commit locally before understanding globally.
Weaker models benefit from reduced formatting errors. Stronger models lose their ability to plan holistically.
This is the first hint that “chain-of-thought” and “agentic execution” are not naturally aligned.
3. Backtracking: a feature models don’t really use
Backtracking improves completion rates dramatically:
| Model | Completion (no backtracking) | Completion (with backtracking) |
|---|---|---|
| GPT-OSS | 85% | 94% |
| R1 Distill | 50% | 88% |
But accuracy?
- Improves only for weaker models
- Declines for stronger ones
Interpretation:
Models explore when they’re lost, not when they’re wrong.
Which is the opposite of how humans debug.
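Mechanically, backtracking can be modeled as one extra action that pops the path history. This sketch is illustrative, not the paper's exact interface:

```python
# Sketch of backtracking as an extra "undo" action that pops the path
# history. The mechanics are illustrative, not the paper's exact interface.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step_with_undo(stack, action):
    """stack holds the path so far; its last element is the current cell."""
    if action == "undo":
        if len(stack) > 1:
            stack.pop()                        # revert the last committed move
        return stack
    dr, dc = MOVES[action]
    r, c = stack[-1]
    stack.append((r + dr, c + dc))
    return stack

path = [(0, 0)]
for action in ["right", "right", "undo", "down"]:
    path = step_with_undo(path, action)
```

The cheap primitive exists; the finding is that models lack a strategy for when to invoke it.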
4. More thinking ≠ better thinking
Token usage increases significantly in Spatial-Gym:
- ~5× more tokens in step-by-step
- ~10× more with backtracking
Yet accuracy barely improves.
| Metric | Observation |
|---|---|
| Token usage | Scales with difficulty |
| Path quality | Does NOT scale |
| Outcome | No meaningful gain |
This is a subtle but critical finding:
Compute without direction is just noise.
5. Vision models collapse (yes, really)
| Model | Input Type | Accuracy |
|---|---|---|
| Qwen3-32B | Text | 10.6% |
| Qwen3-VL | Text | 10.2% |
| Qwen3-VL | Image | 2.8% |
Giving the model the actual image of the puzzle makes it worse.
Because it can’t reliably map pixels to structured constraints.
So much for “multimodal reasoning.”
6. A* beats models in navigation—but not reasoning
| Method | Accuracy | Completion |
|---|---|---|
| Random | 2.4% | 31% |
| A* | 6.4% | 100% |
| GPT-OSS | 16.0% | 85% |
Interpretation:
- Navigation is solved (A*)
- Constraint reasoning is not (LLMs)
The bottleneck is not movement. It’s understanding.
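The A* baseline is worth seeing in code: plain A* with a Manhattan-distance heuristic always reaches the goal, but it never looks at dots, gaps, or colored regions, which is exactly why its completion is perfect while its accuracy is low. A minimal, illustrative version:

```python
# Plain A* on an open grid with a Manhattan-distance heuristic: it finds
# a shortest route to the goal but knows nothing about dots, gaps, or
# colored-region rules. Illustrative only.
import heapq

def astar(size, start, goal):
    def h(p):                                  # Manhattan distance to goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), start, [start])]    # (f-score, position, path)
    seen = set()
    while frontier:
        _, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            r, c = pos[0] + dr, pos[1] + dc
            if 0 <= r < size and 0 <= c < size:
                nxt = path + [(r, c)]
                heapq.heappush(frontier, (len(nxt) + h((r, c)), (r, c), nxt))
    return None

route = astar(4, (0, 0), (3, 3))
```

Nothing in the search above encodes the puzzle's rules, so it reaches the goal 100% of the time while satisfying the constraints mostly by accident.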
Implications — What this means for AI agents
1. Agentic AI is not just “LLM + loop”
The industry trend is clear:
Wrap an LLM in a loop → call it an agent
Spatial-Gym shows why this is fragile.
Sequential decision-making introduces:
- Local commitment errors
- Lack of revision strategies
- Misaligned optimization (shortest path vs correct path)
Agents need search strategies, not just reasoning traces.
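The fragile pattern fits in a few lines. The `choose_move` callable below stands in for a model call, and the one-dimensional environment is purely illustrative; the structural point is that every move is committed immediately, with no lookahead and no revision.

```python
# The "LLM + loop" pattern in miniature: each call commits one move with
# no lookahead and no revision. choose_move stands in for a model call;
# the 1-D environment below is purely illustrative.
def naive_agent(choose_move, env_step, state, max_steps=50):
    for _ in range(max_steps):
        state, done = env_step(state, choose_move(state))  # commit locally
        if done:
            return state
    return state

def env_step(state, action):                   # walk along a number line
    state += 1 if action == "right" else -1
    return state, state == 3                   # done when the goal is reached

final = naive_agent(lambda s: "right", env_step, 0)
```

On a toy task the loop succeeds; on interacting constraints, the same architecture has no mechanism to notice that an early, locally sensible move made the goal unreachable.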
2. RL is necessary—but not sufficient
The paper’s RL experiments show modest gains:
| Model | Accuracy (before RL) | Accuracy (after RL) |
|---|---|---|
| Small model | 2.6% | 3.6% |
Yes, improvement exists.
No, it doesn’t solve the problem.
This implies:
The issue is not just training—it’s representation.
3. Scaling hits a hard wall
Even at larger model sizes:
- Accuracy improves with model size in controlled settings
- But it converges toward zero at high difficulty
This is not a smooth scaling curve.
It’s a capability ceiling.
4. The real bottleneck: constraint integration
The study repeatedly points to one failure mode:
Models can handle rules individually—but fail when rules interact.
This is exactly what real-world systems look like:
- Regulatory constraints
- Operational dependencies
- Multi-step workflows
Which means:
Current LLMs are structurally underprepared for enterprise automation.
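The failure mode is easy to state in code: a path can satisfy each rule when checked in isolation yet fail their conjunction. The rule names below are invented for illustration.

```python
# Toy illustration of constraint interaction: a path can satisfy each rule
# checked in isolation yet fail their conjunction. Rule names are invented.
def visits_all(path, targets):
    return set(targets) <= set(path)

def avoids(path, forbidden):
    return not (set(forbidden) & set(path))

def satisfies_all(path, targets, forbidden):
    return visits_all(path, targets) and avoids(path, forbidden)

path = [(0, 0), (0, 1), (1, 1)]                # reaches the target via (0, 1)
targets, forbidden = [(1, 1)], [(0, 1)]

rule1 = visits_all(path, targets)              # True: target is reached
rule2_alone = avoids([(0, 0), (1, 0), (1, 1)], forbidden)  # True on another path
joint = satisfies_all(path, targets, forbidden)            # False: rules interact
```

Each rule is trivially satisfiable on its own; only their intersection prunes the path space, and that intersection is what the models fail to track.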
Conclusion — The uncomfortable takeaway
Spatial-Gym doesn’t just benchmark models.
It exposes a deeper truth:
Today’s AI can describe solutions better than it can construct them.
And until models can:
- Plan globally
- Revise intelligently
- Allocate reasoning effort proportionally
…agentic AI will remain more demo than deployment.
Quietly impressive. Strategically unreliable.
Cognaptus: Automate the Present, Incubate the Future.