SokoBench: When Reasoning Models Lose the Plot

A corridor is not supposed to be hard.

There is one player. One box. One goal. No maze. No clever trap. No branching strategy tree with a thousand tempting wrong turns. The player stands at one end, the goal sits at the other, and the box is between them. Push the box along the corridor until it reaches the goal. That is the task.

SokoBench is interesting because it removes almost every excuse that large reasoning models usually get when they fail at planning.¹ The task is not semantically ambiguous. It does not require domain knowledge. It is not a puzzle where success depends on discovering a hidden trick. It is a long, narrow sequence of valid moves.

And that is precisely the problem.

The paper’s uncomfortable result is not that reasoning models struggle with Sokoban. Full Sokoban is famously difficult, so that would not be news. The sharper result is that advanced reasoning models struggle even when Sokoban is reduced to a one-box linear corridor. Their failure appears once the solution becomes long enough—roughly beyond 25 to 30 moves—not because the task becomes conceptually complex, but because tiny errors in counting, state tracking, and spatial representation accumulate until the plan falls apart.

So the short version of the result is simple:

The model does not lose because the corridor is complicated. It loses because the corridor is long.

That distinction matters for business users building agents, workflow automations, planning assistants, warehouse tools, compliance routers, and multi-step operational systems. Many current AI deployments quietly assume that if a model can explain a plan, it can execute or supervise one. SokoBench suggests a less flattering rule: a fluent plan is not a state machine. Beautiful reasoning traces are not the same thing as reliable transition control. Annoying, but useful.

The failure starts with a small counting error, not a grand strategic mistake

The paper’s design is deliberately minimal. The authors generate corridor-like Sokoban maps with length $\ell$ ranging from 5 to 100 in increments of 5. Each map has the same basic components: one player, one box, and one goal. The only real difficulty parameter is corridor length. For each length, they test four rotations: the original orientation plus 90°, 180°, and 270° variants. This gives 80 distinct maps.
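The map set described above can be sketched in a few lines. This is a minimal illustration, not the authors' generator: it assumes the conventional Sokoban symbols (`#` wall, `@` player, `$` box, `.` goal), assumes $\ell$ counts the interior cells, and places the box directly next to the player. The function names are hypothetical.

```python
def make_corridor(length: int) -> list[str]:
    """Horizontal corridor: player at the left end, box beside it, goal at the right end.

    Assumption: `length` counts interior cells; the paper's exact layout may differ.
    """
    if length < 3:
        raise ValueError("need at least 3 cells for player, box and goal")
    interior = "@$" + " " * (length - 3) + "."
    wall = "#" * (length + 2)
    return [wall, "#" + interior + "#", wall]


def rotate_cw(grid: list[str]) -> list[str]:
    """Rotate a rectangular character grid 90 degrees clockwise."""
    return ["".join(row[c] for row in reversed(grid)) for c in range(len(grid[0]))]


def all_maps() -> dict[tuple[int, int], list[str]]:
    """Build every (length, rotation angle) variant: 20 lengths x 4 rotations."""
    maps = {}
    for length in range(5, 101, 5):
        grid = make_corridor(length)
        for angle in (0, 90, 180, 270):
            maps[(length, angle)] = grid
            grid = rotate_cw(grid)
    return maps
```

Calling `all_maps()` yields 80 entries, matching the count in the text: 20 corridor lengths times 4 orientations.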

That design choice is the point. Many planning benchmarks mix several sources of difficulty at once: branching factor, search depth, hidden constraints, spatial complexity, and output-format fragility. When a model fails, it is then hard to know what broke. Did it fail because the search tree was large? Because the rule representation was unstable? Because the prompt was awkward? Because the model could not count? Everyone gets to pick their favorite diagnosis. Very convenient. Very unscientific.

SokoBench tightens the question. It asks: what if the branching factor is almost gone, but the action sequence is still long?

In a one-dimensional corridor, the optimal solution is unique, so evaluation can be strict: an exact string match between the model’s predicted action sequence and the ground truth. The authors also use prefix accuracy, which rewards correct partial trajectories, and Manhattan distance, which indicates whether a failed plan at least ended near the goal.

Those metrics separate three kinds of failure:

Metric	What it checks	Why it matters
Exact accuracy	Did the model produce the full correct optimal action sequence?	Tests end-to-end planning reliability.
Prefix accuracy	How much of the beginning of the plan was correct before it drifted?	Detects partial competence followed by state loss.
Manhattan distance	How far did the final predicted position end from the goal?	Distinguishes near-misses from spatial nonsense.
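The three metrics in the table can be sketched as follows. The definitions here are assumptions made for illustration: actions are encoded as the letters `L`, `R`, `U`, `D`, prefix accuracy is normalized by the length of the ground-truth plan, and the final position is computed by naively adding up displacements without checking walls. The paper's exact formulas may differ.

```python
# Assumed action encoding: (row delta, column delta) per letter.
DELTAS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def exact_match(pred: str, truth: str) -> bool:
    """True only if the whole predicted plan equals the ground truth."""
    return pred == truth


def prefix_accuracy(pred: str, truth: str) -> float:
    """Fraction of the ground-truth plan matched before the first divergence."""
    matched = 0
    for p, t in zip(pred, truth):
        if p != t:
            break
        matched += 1
    return matched / len(truth) if truth else 1.0


def final_position(start: tuple[int, int], plan: str) -> tuple[int, int]:
    """Naive end position: sums displacements and ignores walls and boxes."""
    r, c = start
    for action in plan:
        dr, dc = DELTAS.get(action, (0, 0))  # unknown actions are skipped
        r, c = r + dr, c + dc
    return r, c


def manhattan(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Manhattan (L1) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])
```

Because `final_position` ignores the rules, a small distance says only that the plan pointed the right way, not that it was executable.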

This is where the mechanism becomes clearer. A model can begin correctly, push in the right direction, and still fail because it loses track of how many steps remain, where the box is, or whether a proposed transition is physically valid. The problem is not merely “planning” in the abstract. It is the repeated maintenance of a small internal world model across many transitions.

The authors express the intuition with a simple compounding argument. In a deep but narrow problem, if each step carries even a small probability of miscounting or state drift, total success decays with length:

$$ P(\text{success}) \sim (1 - p_w)^\ell $$

Here, $p_w$ is the per-step probability of a wrong internal update, and $\ell$ is the corridor length. The formula is not the paper’s claim that every failure literally follows an independent-error process. It is a useful mental model for why long sequences break systems that look competent on short ones. A small defect that barely matters over 10 steps becomes lethal over 80.
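A quick numeric check makes the decay concrete. The per-step error rate of 2% below is an arbitrary illustrative value, not a figure from the paper, and the calculation assumes independent errors, which the text above notes is only a mental model.

```python
def success_probability(p_w: float, length: int) -> float:
    """Probability that all `length` steps succeed, assuming independent errors."""
    return (1.0 - p_w) ** length


# Illustrative per-step error rate (an assumption, not a measured value).
P_W = 0.02

for length in (10, 30, 80):
    print(f"length {length:>2}: P(success) ~ {success_probability(P_W, length):.3f}")
```

With this assumed rate, success falls from roughly 0.82 at 10 steps to about 0.55 at 30 and about 0.20 at 80. The per-step defect never changed; only the horizon did.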

That is the first business lesson. In automation, “works on a five-step demo” tells you very little about “works across a 70-step operational process.” Long-horizon reliability is not a vibe. It is an error-accumulation problem.

The benchmark is narrow on purpose, which makes the result harder to dismiss

A common reader reaction is predictable: “But real planning tasks are more complex than this. Why should we care about a toy corridor?”

Because the toy corridor removes the usual hiding places.

If a model fails on a complex warehouse routing task, one can blame perception, tool access, changing constraints, real-time exceptions, poor system prompts, or insufficient examples. If it fails on a one-box corridor, the diagnosis becomes less flattering. The system is struggling with the low-level operations that more complex tasks quietly depend on: count, update, validate, continue.

The paper’s contribution is therefore not “Sokoban is hard.” The contribution is diagnostic compression. It compresses planning failure into a setting where the main remaining stressor is horizon length.

That gives SokoBench a useful place in the evaluation stack:

Evaluation layer	Example question	What SokoBench contributes
Basic symbolic operation	Can the model count positions and maintain sequence length?	Tests counting and action-sequence persistence.
State-transition tracking	Does each move update the world correctly?	Reveals invalid or hallucinated transitions.
Long-horizon execution	Does accuracy survive as $\ell$ increases?	Shows collapse beyond short horizons.
Tool-mediated planning	Can the model formulate a solvable symbolic problem?	Tests whether solvers help when representation is upstream-broken.

The paper tests DeepSeek R1, GPT-5-mini, and GPT-oss 120B under one-shot inference. The models receive instructions, the Sokoban symbol mapping, and one demonstration. They then must output the action sequence. The experiments cap completion tokens at 32,768, so the models are not being starved of output budget.

The result is a three-region pattern. Short corridors are easier. Intermediate corridors show rapid accuracy decay. Long corridors approach near-total failure. The authors identify the key threshold around solutions requiring more than 25 to 30 moves.

The “more thinking solves it” story also weakens here. The paper reports that output token usage generally increases with corridor length for correctly solved cases. In additional analysis, the authors do not observe models simply refusing to think on hard instances. Instead, wrong answers show higher variability in token counts, and longer corridors often push models toward the maximum output-token limit. Qualitative inspection suggests repetitive loops: the model keeps revisiting the same action or reasoning frame until the budget runs out.

That is not planning. That is a hamster wheel wearing a lab coat.

The reasoning trace can become a loop, not a control system

Reasoning traces are seductive because they look like process. Step-by-step text gives the impression of controlled search, deliberate progress, and internal supervision. SokoBench is a useful reminder that a trace can be procedural without being stateful.

The paper connects the observed looping behavior to the idea that reasoning models can behave like “wandering” solution explorers rather than systematic planners. In a systematic search, an agent should know which states it has already visited. It should prune cycles. It should preserve the current state, apply a valid transition, and update the next state.
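For contrast, here is what systematic search looks like on a one-dimensional corridor: a breadth-first search that records visited states and never re-expands them. This is a minimal sketch under simplifying assumptions (cells indexed `0..n_cells-1`, walls beyond both ends, actions `L` and `R` only); it is not the planner used in the paper.

```python
from collections import deque
from typing import Optional


def solve_corridor(n_cells: int, player: int, box: int, goal: int) -> Optional[str]:
    """Breadth-first search over (player, box) states; returns a shortest plan or None."""
    start = (player, box)
    visited = {start}                 # states already seen: this is the cycle pruning
    queue = deque([(start, "")])
    while queue:
        (p, b), plan = queue.popleft()
        if b == goal:
            return plan
        for action, step in (("L", -1), ("R", 1)):
            new_p, new_b = p + step, b
            if new_p == b:            # walking into the box pushes it one cell
                new_b = b + step
            if not (0 <= new_p < n_cells and 0 <= new_b < n_cells):
                continue              # player or box would hit a wall
            state = (new_p, new_b)
            if state in visited:
                continue              # never revisit a state
            visited.add(state)
            queue.append((state, plan + action))
    return None
```

For example, `solve_corridor(10, 0, 1, 9)` returns eight consecutive `R` moves. The `visited` set is the whole difference between this and a wandering explorer: the search cannot loop, because it refuses to re-enter a state it has already seen.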

In the corridor setting, the required memory is almost embarrassingly small. The model needs a mental tape: where the player is, where the box is, where the goal is, and how many valid pushes remain. Yet the models still drift.

The failure modes are not all identical. The paper’s additional metrics help separate them.

Prefix accuracy suggests that models often begin in the right direction but lose the thread later. Manhattan distance shows that some plans end near the goal but still fail. More troublingly, the authors observe cases where the predicted sequence places the player exactly on the goal, ignoring the fact that the box and walls make that state unreachable under the actual rules.

This is not a formatting typo. It is a world-model error.

A model can output a syntactically plausible sequence of moves while internally operating on a fake version of the game. In that fake world, the player may pass through constraints, skip blocked positions, or act as if the box were not where it should be. The text remains fluent. The state is wrong.
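One way to expose this kind of error is to replay the predicted plan against the real rules rather than trusting the text. The sketch below assumes the conventional Sokoban symbols and `U`/`D`/`L`/`R` actions; the function name and messages are illustrative, not taken from the paper.

```python
DELTAS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def replay(grid: list[str], plan: str) -> tuple[bool, str]:
    """Replay a plan under one-box Sokoban rules; return (solved, message)."""
    walls, player, box, goal = set(), None, None, None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "@":
                player = (r, c)
            elif ch == "$":
                box = (r, c)
            elif ch == ".":
                goal = (r, c)
    for i, action in enumerate(plan):
        if action not in DELTAS:
            return False, f"step {i}: unknown action {action!r}"
        dr, dc = DELTAS[action]
        target = (player[0] + dr, player[1] + dc)
        if target in walls:
            return False, f"step {i}: player walks into a wall"
        if target == box:
            pushed = (box[0] + dr, box[1] + dc)
            if pushed in walls:
                return False, f"step {i}: box pushed into a wall"
            box = pushed
        player = target
    if box == goal:
        return True, "box on goal"
    return False, "plan ends with box off goal"
```

On the map `["#######", "#@$  .#", "#######"]`, the plan `RRR` solves the puzzle, while `RRRR` is rejected at its fourth move because the box would be pushed into the wall. A fluent but invalid plan fails at a specific step with a specific reason.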

For enterprise agents, that distinction is not academic. A workflow agent can produce a plausible procurement sequence while skipping a required approval state. A compliance agent can describe a process that sounds correct while moving a case into an invalid status. A logistics agent can produce a route that looks efficient while silently violating loading constraints. The surface language remains coherent. The transition graph is broken.

That is where the mechanism-first reading matters. The problem is not that the model lacks “planning intelligence” in some vague way. The problem is that it lacks guaranteed state-transition discipline.

The PDDL solver helps only after the model describes the right world

The paper’s second experiment is the obvious repair attempt: give the model external planning tools.

The authors use an LLM-modulo setup. Instead of asking the model to solve the Sokoban corridor directly, they ask it to generate a PDDL problem. A human-authored and verified PDDL domain defines the rules. The generated problem then goes through parsers, validators, and a planner such as Fast Downward or Pyperplan through the Unified Planning library. The model receives diagnostic feedback and can retry up to three times.
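The control flow of such a loop can be sketched without committing to any particular model or planner. Everything below is hypothetical scaffolding: `generate`, `check`, and `plan` are placeholder callables standing in for the model call, the parser/validator stage, and the planner, and the cap of three attempts is one reading of "retry up to three times".

```python
from typing import Callable, Optional


def llm_modulo(
    generate: Callable[[str, Optional[str]], str],   # (task, feedback) -> PDDL problem text
    check: Callable[[str], Optional[str]],           # problem -> error message, or None if valid
    plan: Callable[[str], Optional[list[str]]],      # problem -> plan, or None if unsolvable
    task: str,
    max_attempts: int = 3,                           # assumption: three attempts in total
) -> Optional[list[str]]:
    """Generate, validate, solve; feed diagnostics back to the generator on failure."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        problem = generate(task, feedback)
        error = check(problem)
        if error is not None:
            feedback = error          # syntax or validation error goes back to the model
            continue
        result = plan(problem)
        if result is not None:
            return result
        feedback = "planner found no solution for this problem"
    return None
```

Note what the loop can and cannot catch: `check` rejects malformed problems, and `plan` rejects unsolvable ones, but a well-formed, solvable problem that describes the wrong map passes straight through.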

This is the right instinct. In serious planning systems, language models should not be trusted as solo planners. They should help translate user intent, formulate problems, and explain results, while external solvers handle validity and search. That is not an insult to LLMs. It is basic engineering hygiene.

But SokoBench shows a catch: the solver can only solve the world it is given.

The LLM-modulo setup modestly improves performance and produces a smoother accuracy curve. It also eliminates one class of failure: invalid transitions inside the solver’s own plan. A verified PDDL domain will not let the planner walk through walls just because the model had a poetic afternoon.

Yet the improvement is limited. The bottleneck moves upstream. If the model generates a PDDL problem that is syntactically valid but spatially wrong, the solver obediently solves the wrong problem. The paper reports two main failure types: syntax errors in generated PDDL, and semantically wrong PDDL problems that compile but misrepresent the actual Sokoban map. The first type is relatively rare in their data: only 7 cases across all four trials of the 80 corridor configurations. The second type is the more important one.
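The second failure type suggests a check that sits between the model and the solver: compare what the generated problem claims about the world with what the map actually contains. The sketch below assumes the conventional Sokoban symbols and assumes the declared positions have already been extracted from the generated PDDL into a plain dictionary; that extraction step, and all names here, are hypothetical.

```python
Position = tuple[int, int]


def map_facts(grid: list[str]) -> dict[str, Position]:
    """Read the ground-truth positions of player, box and goal from the ASCII map."""
    symbols = {"@": "player", "$": "box", ".": "goal"}
    return {
        symbols[ch]: (r, c)
        for r, row in enumerate(grid)
        for c, ch in enumerate(row)
        if ch in symbols
    }


def representation_errors(grid: list[str], declared: dict[str, Position]) -> list[str]:
    """List every object whose declared position disagrees with the map."""
    errors = []
    for name, pos in map_facts(grid).items():
        if declared.get(name) != pos:
            errors.append(f"{name}: map says {pos}, problem says {declared.get(name)}")
    return errors
```

An empty list means the problem at least places the three objects where the map does. It does not prove the whole problem is faithful (walls and connectivity would need their own checks), but it catches the class of error where the solver is handed a corridor of the wrong length.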

This is the cleanest business-relevant lesson in the paper:

Tools can enforce rules after representation. They cannot guarantee that the representation is faithful.

That should shape how companies design agentic systems. Adding a solver, database, API connector, workflow engine, or policy checker is necessary but not sufficient. The interface between the model and the tool becomes the dangerous layer. If the model misstates the task, the tool can return a perfectly valid answer to the wrong question. The failure is then harder to notice, because the output comes wrapped in formal correctness.

The vertical-corridor effect is not a side plot; it is a prompt-interface warning

The paper also reports orientation sensitivity. The benchmark rotates each corridor, partly to reduce contamination risk and partly to test whether models behave differently across orientations. In the LLM-modulo setting, the authors observe a significant imbalance between vertical and horizontal corridors, with vertical corridors proving harder.

This is not the central thesis, but it is an important implementation warning. The authors note that prompt formatting may matter, especially because vertical maps contain many newline-separated rows. Alternative encodings—numbered grids, row-column coordinates, or richer textual cues—may reduce the issue.
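A row-column encoding of the kind mentioned above might look like this. It is an illustrative sketch only: the output format is invented for this example, the conventional Sokoban symbols are assumed, and the paper does not report results for this exact encoding.

```python
def to_coordinates(grid: list[str]) -> str:
    """Describe a map as explicit coordinates instead of newline-separated rows."""
    names = {"@": "player", "$": "box", ".": "goal"}
    lines = [f"size: {len(grid)} rows x {len(grid[0])} cols"]
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch in names:
                lines.append(f"{names[ch]}: row {r}, col {c}")
    return "\n".join(lines)
```

For a vertical corridor, the description has the same shape and length as for a horizontal one, which is the point: orientation stops changing how many lines the model must parse. The encoding is lossy for maps with interior walls, which would need to be listed explicitly.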

For business deployment, this is the boring detail that becomes expensive later. A model’s performance may depend not only on the underlying task, but on how the task is serialized into text. Two representations that are equivalent to a human may not be equivalent to a language model.

That matters whenever enterprise systems convert structured reality into prompt text:

Operational object	Fragile text representation	Safer design pattern
Process state	Long prose history of previous steps	Explicit state object with schema validation
Spatial layout	ASCII map or free-form description	Coordinate table plus constraint checker
Compliance workflow	Narrative summary of approvals	Status graph with allowed transitions
Planning task	User-described goal and constraints	Formal problem object reviewed before solving
Multi-agent handoff	Chat transcript	Typed task packet with acceptance criteria
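As one concrete instance of the right-hand column, a status graph with allowed transitions fits in a dictionary. The statuses below are invented for illustration and do not come from the paper or any particular system.

```python
# Hypothetical approval workflow: each status maps to the statuses it may move to.
ALLOWED = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected"},
    "approved": {"paid"},
    "rejected": set(),
    "paid": set(),
}


class InvalidTransition(Exception):
    """Raised when a proposed status change is not in the graph."""


def advance(current: str, proposed: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if proposed not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {proposed} is not allowed")
    return proposed
```

A model may still propose `submitted -> paid`, but the proposal now fails loudly at the boundary instead of silently skipping the approval step.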

This is not glamorous AI architecture. It is plumbing. Unfortunately, plumbing is what keeps the building from smelling like ambition.

What the evidence supports—and what it does not

SokoBench’s experiments serve different purposes, and mixing them together would blur the interpretation.

Evidence in the paper	Likely purpose	What it supports	What it does not prove
Accuracy by corridor length in one-shot inference	Main evidence	Reasoning models degrade sharply as horizon length grows.	That all real-world planning tasks fail at the same threshold.
Prefix accuracy	Diagnostic evidence	Models often begin correctly and then drift.	That every failure is caused by the same internal mechanism.
Manhattan distance	Diagnostic evidence	Some failures remain directionally close, while others violate spatial constraints.	That distance alone captures plan validity.
Token-count analysis	Mechanism evidence	Harder cases trigger longer, more variable, sometimes looping reasoning.	That more tokens are inherently harmful.
LLM-modulo with PDDL tools	Intervention test	External solvers help but cannot fix wrong problem representation.	That symbolic tools are ineffective in general.
Rotation/orientation analysis	Robustness and interface sensitivity	Textual encoding affects model performance.	That vertical spatial reasoning is universally worse across all encodings.

The paper’s strongest claim is narrow but serious: even simplified long-horizon planning exposes structural weaknesses in current reasoning models. The weaker, broader claim would be: “LLMs cannot plan.” That is too blunt. The paper does not show that models are useless for planning workflows. It shows that unaided internal reasoning is brittle, and that tool-augmented planning still depends on faithful problem formulation.

For business use, that distinction is everything. The correct response is not “never use LLMs for planning.” The correct response is “do not let the LLM be the only place where the plan exists.”

The business lesson is architectural, not motivational

The market keeps trying to solve agent reliability with better prompts, larger models, and longer reasoning traces. These may help. They do not remove the need for architecture.

SokoBench points toward a layered design:

Use the model for language-facing work. Let it interpret the user’s request, identify missing information, propose a candidate plan, and explain trade-offs.
Convert the task into an explicit state representation. Use schemas, coordinates, workflow statuses, preconditions, postconditions, and typed constraints.
Validate every transition. Do not rely on the model’s self-reported reasoning. Check whether each step is allowed by the actual system.
Use external planners or solvers where the domain supports them. PDDL, workflow engines, rule engines, database constraints, and simulation environments are not old-fashioned. They are guardrails with teeth.
Shorten the horizon seen by the model. Break long processes into verified segments. Let the model re-plan after validated state updates rather than hallucinating 80 moves into the future.
Log state, not just conversation. A chat transcript is not an operational ledger. The system needs machine-checkable memory of what has happened.
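Several of the steps above (validate every transition, shorten the horizon, log state) can be combined in one small loop. This is a sketch under stated assumptions: `propose` stands in for a model call that may return an overlong or partly invalid plan, `apply_step` is the external validator that returns `None` for a rejected transition, and the segment length of 5 is arbitrary.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")


def run_in_segments(
    state: S,
    done: Callable[[S], bool],
    propose: Callable[[S], str],                  # hypothetical model call: state -> actions
    apply_step: Callable[[S, str], Optional[S]],  # validator: next state, or None if invalid
    segment_len: int = 5,
    max_segments: int = 50,
) -> tuple[S, list[S]]:
    """Execute short, validated segments and re-plan from the last verified state."""
    log = [state]                                 # machine-checkable state history
    for _ in range(max_segments):
        if done(state):
            break
        for action in propose(state)[:segment_len]:
            nxt = apply_step(state, action)
            if nxt is None:
                break                             # reject the step and re-plan
            state = nxt
            log.append(state)
            if done(state):
                break
    return state, log
```

The design choice is that the plan never lives only inside the model: each accepted step has been checked against the real rules, and `log` records states rather than conversation.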

The ROI argument rests on cheaper failure diagnosis. A pure LLM agent that fails on step 43 of a process leaves you inspecting text. A structured agent that fails on transition 43 leaves you inspecting a violated precondition. One is a séance. The other is debugging.

This is especially relevant in domains where long action sequences are common:

Domain	SokoBench-style risk	Practical control
Workflow automation	Skipping or duplicating hidden process states	State machine with transition validation
Warehouse or routing support	Plausible route violates physical constraints	Solver-backed planning with coordinate checks
Compliance operations	Narrative plan misses required approvals	Formal approval graph and audit log
Customer-service agents	Long case handling loses earlier commitments	Persistent case state and checkpoint summaries
Financial operations	Multi-step execution drifts from risk limits	Pre-trade and post-step rule validation

The general principle is blunt: if the cost of a wrong transition is material, the transition cannot live only inside the model’s hidden state.

The boundary: this is a sanity check, not a universal benchmark

The paper is intentionally narrow. It uses one-box linear corridors, not full Sokoban with multiple boxes, deadlocks, branching paths, and richer spatial traps. Exact-plan matching is appropriate here because the corridor has a unique optimal solution, but richer domains will need solver-based validation across multiple valid solutions. The authors also acknowledge sensitivity to prompt formatting, possible backend variability from API-based evaluations, and residual contamination risk despite map rotations.

Those limitations do not weaken the core diagnostic point. They define it.

SokoBench should not be read as a complete theory of agent planning. It is better read as a sanity check. Before celebrating an agent that claims to handle complex multi-step operations, ask whether it can maintain a simple state over a long sequence. Before trusting a tool-augmented planner, ask whether the model can faithfully formulate the problem that the tool will solve. Before assuming “reasoning effort” equals reliability, inspect whether the model is making progress or merely looping with confidence.

The corridor is small because the failure is small. That is why it matters.

The corridor is the plot

SokoBench does not say that reasoning models are useless. It says something more operationally valuable: their planning failures may begin below the level where most benchmarks look.

Not at strategic insight. Not at high-level decomposition. Not at clever search.

At counting. At state updates. At remembering which internal position was already visited. At not inventing a physically impossible next state because the sentence still sounds plausible.

That is a better diagnosis than the usual industry argument about whether the next model will “finally plan.” The real question is whether the system around the model can make planning explicit: state outside the model, validation outside the trace, tools fed with faithful representations, and long horizons broken into checkable segments.

A narrow corridor should be easy. When it is not, the lesson is not narrow.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sebastiano Monti, Carlo Nicolini, Giovanni Pellegrini, Jacopo Staiano, and Bruno Lepri, “SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models,” arXiv:2601.20856, 2026, https://arxiv.org/abs/2601.20856.
