Opening — Why this matters now
The AI industry has grown comfortable with a flattering assumption: if a model can reason, it can plan. Multi-step logic, chain-of-thought traces, and ever-longer context windows have encouraged the belief that we are edging toward systems capable of sustained, goal-directed action. SokoBench quietly dismantles that assumption.
By stripping planning down to its bare minimum, the paper reveals an uncomfortable truth: today's large reasoning models fail not because problems are complex, but because their solutions are long.
Background — Planning versus “thinking”
Classical AI planning has always been explicit about state, transitions, and horizon length. Languages like PDDL and heuristic search planners exist precisely because long action sequences are brittle without structured memory and validation.
LLMs, by contrast, learned reasoning as a linguistic behavior. Recent benchmarks—PlanBench, BlocksWorld variants, Towers of Hanoi—already hinted that something was off. Models could talk about plans, decompose goals, even generate plausible action lists, yet still fail catastrophically when those plans were executed.
SokoBench asks a sharper question: what happens if we remove almost all branching, spatial complexity, and ambiguity—and leave only the horizon?
Analysis — The corridor that breaks the model
The benchmark is almost insulting in its simplicity: a one-dimensional Sokoban corridor, one player, one box, one goal. Difficulty scales with a single variable—the corridor length.
No deadlocks. No combinatorial explosion. No hidden traps.
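To see just how little is going on, here is a minimal sketch of the environment as I read the setup (the paper's exact encoding may differ): the state is an `(agent, box)` pair of cell indices, the goal sits at the far end, and the only knob is the corridor length `n`.

```python
# Minimal 1D Sokoban corridor (my reconstruction, not the paper's code).
# Cells are indexed 0..n-1; the agent starts at 0, the box at 1, and the
# goal is the last cell. Moves are "L" (-1) or "R" (+1).

def step(state, move, n):
    """Apply one move; return the new (agent, box) state, or None if the
    move is physically impossible."""
    agent, box = state
    d = 1 if move == "R" else -1
    nxt = agent + d
    if not 0 <= nxt < n:
        return None                 # agent would walk through a wall
    if nxt == box:                  # agent pushes the box
        pushed = box + d
        if not 0 <= pushed < n:
            return None             # box would leave the corridor
        box = pushed
    return (nxt, box)

def solved(state, n):
    return state[1] == n - 1        # the box sits on the goal cell

# The optimal plan is literally "R" * (n - 2): push the box from cell 1
# to cell n - 1. Nothing branches; only the move count grows with n.
```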
And yet, across multiple state-of-the-art reasoning models, performance collapses once solutions exceed roughly 25–30 moves.
Three regimes emerge consistently:
| Corridor Length (≈ solution moves) | Model Behavior |
|---|---|
| Short (≤20) | Mostly correct planning |
| Medium (25–40) | Rapid accuracy decay |
| Long (≥50) | Near-total failure |
The failure mode is not strategic confusion; it is counting drift. Each additional step introduces a small probability of miscounting position or action order, and the probability that the whole plan survives intact therefore decays exponentially with length.
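A back-of-envelope model (mine, not the paper's) makes that decay concrete. If each move independently carries a small slip probability $\varepsilon$, then an $n$-move plan survives with probability

$$P(\text{intact plan}) = (1 - \varepsilon)^{n}.$$

At $\varepsilon = 0.03$ this is roughly $0.97^{25} \approx 0.47$ at 25 moves and $0.97^{50} \approx 0.22$ at 50, which is the same cliff the regime table shows: a per-step error rate invisible on short plans dominates long ones.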
Findings — Wandering, loops, and phantom states
The paper’s metrics go beyond binary success (the first two are sketched in code after this list):
- Prefix accuracy shows models often start correctly, then drift.
- Manhattan distance reveals many failures are directionally correct but incomplete.
- Token traces show models entering repetitive loops, burning tokens without progress.
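Here is a hedged sketch of the first two, reusing `step` from the simulator above (the function names are mine, not the paper's); in one dimension, Manhattan distance collapses to an absolute difference:

```python
def prefix_accuracy(predicted: str, optimal: str) -> float:
    """Fraction of the optimal plan reproduced before the first divergence."""
    match = 0
    for p, o in zip(predicted, optimal):
        if p != o:
            break
        match += 1
    return match / len(optimal)

def final_distance(predicted: str, n: int) -> int:
    """Distance from box to goal after executing the predicted moves,
    skipping illegal ones; 0 means the corridor was solved."""
    state = (0, 1)                    # agent at cell 0, box at cell 1
    for move in predicted:
        nxt = step(state, move, n)    # `step` from the simulator sketch
        if nxt is not None:
            state = nxt
    return abs(state[1] - (n - 1))
```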
More troubling are invalid transitions: walking through walls, teleporting over boxes, or placing the agent directly on the goal. These are not logical mistakes—they are state-representation failures.
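An external transition audit makes those failures mechanical to catch. A sketch in the same toy setup (the function name is mine):

```python
def first_invalid_move(predicted: str, n: int):
    """Index of the first physically impossible move in a plan, or None
    if every transition is legal; catches wall-walking and pushing the
    box out of the corridor."""
    state = (0, 1)
    for i, move in enumerate(predicted):
        nxt = step(state, move, n)
        if nxt is None:
            return i                 # the plan demands an illegal transition
        state = nxt
    return None
```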
In the authors’ framing, reasoning models behave less like planners and more like wandering explorers, revisiting the same internal states without awareness of having done so.
LLM-Modulo — Tools help, but only a little
Adding external PDDL solvers improves outcomes—but modestly. Accuracy curves smooth out, catastrophic errors decline, yet the horizon limit remains stubbornly intact.
Why? Because the bottleneck is upstream. If the model misrepresents the map length or spatial layout, even a perfect solver is fed the wrong world.
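Schematically, the LLM-Modulo arrangement is a generate, validate, refine loop. A toy version built on the sketches above (mine, not the paper's code; the paper pairs models with a PDDL solver, and `propose` here stands in for the model call):

```python
def llm_modulo(propose, n, max_rounds=5):
    """Toy generate-validate-refine loop. The validator is rule-perfect,
    but it checks plans against whatever world description (n) it is
    given; a misread map poisons every round."""
    feedback = ""
    for _ in range(max_rounds):
        plan = propose(n, feedback)           # model proposes a plan
        bad = first_invalid_move(plan, n)     # external legality check
        if bad is not None:
            feedback = f"move {bad} is illegal"
            continue
        state = (0, 1)
        for move in plan:
            state = step(state, move, n)      # safe: plan is fully legal
        if solved(state, n):
            return plan
        feedback = "legal plan, but the box never reaches the goal"
    return None
```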
Tooling can enforce rules. It cannot repair a broken internal map.
Implications — What SokoBench actually tells us
This benchmark doesn’t just critique Sokoban performance. It challenges several industry assumptions:
- Long-horizon planning is not an emergent property of reasoning.
- Test-time scaling does not fix structural memory decay.
- Explicit solvers cannot compensate for faulty world representations.
For businesses deploying agentic systems, this matters. Workflow agents, autonomous operators, and multi-step decision engines inherit these same fragilities—just better disguised.
Conclusion — The corridor is the message
SokoBench is deliberately small. That is its strength.
By removing every excuse except time and sequence length, it exposes a fundamental limitation of current reasoning architectures. Until models can reliably track state across dozens of steps, “autonomous planning” remains more narrative than reality.
The corridor is narrow. The lesson is not.
Cognaptus: Automate the Present, Incubate the Future.