Tile by Tile: Why LLMs Still Can't Plan Their Way Out of a 3×3 Box

A board game should not embarrass a frontier model.

That is the uncomfortable charm of the 8-puzzle. It has no hidden information, no vague user intent, no messy database schema, no ambiguous policy exception, and no client saying “just make it pop.” It is a 3×3 grid with eight tiles and one blank space. Slide adjacent tiles into the blank. Reach the goal state. Done.

And yet, in “On the Limits of Innate Planning in Large Language Models,” Charles Schepanowski and Charles Ling use this tiny puzzle to expose a much larger problem: without tools, current LLMs remain brittle at state tracking and weak at goal-directed planning, even when prompted to reason step by step, guided with algorithm-like examples, corrected after failure, or handed a list of valid moves.¹

That last part is the sting. The paper does not merely show that models make illegal moves. That would be mildly embarrassing, but familiar. It shows that once legality is partly offloaded, planning still collapses. The models stop breaking the rules only to start wandering, looping, or exhausting the move budget. The system becomes less like a chess player and more like a person confidently pacing in a hallway because every door looks “somewhat promising.”

For businesses building AI agents, this distinction matters. The risk is not that LLMs are useless. They are very good at language, explanation, summarization, classification, and interface work. The risk is treating a language model’s fluent reasoning trace as if it were a durable operational plan. It is not. At least not by itself.

The tiny board is not the point; the moving state is

The 8-puzzle looks like a toy task, so it is tempting to dismiss it as artificial. That would be too quick.

The paper uses the 8-puzzle because it compresses several enterprise-relevant demands into a clean test environment. A model must maintain a current state, obey strict transition rules, avoid cycles, and choose actions that move it toward a goal. Those are also the basic demands of many “agentic” workflows: resolving support cases, routing invoices, coordinating logistics, managing trading rules, updating CRM records, or scheduling tasks across systems.
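To make "maintain a current state" and "obey strict transition rules" concrete, here is a minimal sketch of the puzzle as code. The tuple encoding, the goal layout, and the convention of naming a move by the direction the blank travels are assumptions made for illustration; the paper may encode boards and moves differently.

```python
from typing import Dict, List, Tuple

State = Tuple[int, ...]  # nine entries in row-major order; 0 marks the blank

# Assumed goal layout; the article does not spell one out.
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)

# Each move names the direction the blank travels: (row delta, column delta).
DELTAS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def valid_moves(state: State) -> List[str]:
    """Return the directions in which the blank can move without leaving the grid."""
    row, col = divmod(state.index(0), 3)
    return [
        name
        for name, (dr, dc) in DELTAS.items()
        if 0 <= row + dr < 3 and 0 <= col + dc < 3
    ]


def apply_move(state: State, move: str) -> State:
    """Return the successor board, or raise ValueError if the move is illegal."""
    if move not in valid_moves(state):
        raise ValueError(f"illegal move {move!r} from {state}")
    blank = state.index(0)
    dr, dc = DELTAS[move]
    target = blank + 3 * dr + dc
    cells = list(state)
    cells[blank], cells[target] = cells[target], cells[blank]
    return tuple(cells)
```

Everything a solver has to keep straight is in those two functions: where the blank is, which moves that allows, and what the board looks like afterward. A system that holds this state in code cannot drift; a model holding it in prose can.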

The puzzle’s simplicity is precisely the value. In a messy business workflow, failure can be blamed on ambiguous instructions, incomplete data, tool latency, bad retrieval, poor UI design, or a dozen other gremlins wearing enterprise lanyards. In the 8-puzzle, the board is explicit, the rules are fixed, and success is mechanically checkable. If failure appears here, the diagnosis becomes harder to evade.

The authors evaluate four models: GPT-5-Thinking, Gemini-2.5-Pro, GPT-5-mini, and Llama 3.1 8B-Instruct. Each model is tested on 50 randomly generated solvable puzzles, divided into difficulty bins by optimal solution length. The prompts vary across three regimes: Zero-Shot, Chain-of-Thought, and Algorithm-of-Thought. Then the authors add feedback conditions and, finally, an external move validator.
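A rough sketch of how such a test set could be built, assuming the same tuple encoding as a row-major 3×3 board with 0 as the blank. It is not the authors' generator: the parity test for solvability and the breadth-first search from the goal are standard techniques swapped in here, and the bin width in `difficulty_bin` is a hypothetical choice, since the article does not give the paper's bin edges.

```python
import random
from collections import deque
from typing import Dict, List, Tuple

State = Tuple[int, ...]
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout


def neighbors(state: State) -> List[State]:
    """All boards reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            cells = list(state)
            target = 3 * r + c
            cells[blank], cells[target] = cells[target], cells[blank]
            result.append(tuple(cells))
    return result


def is_solvable(state: State) -> bool:
    """On a 3x3 board, a state can reach GOAL iff its inversion count is even."""
    tiles = [t for t in state if t != 0]
    inversions = sum(
        1
        for i in range(len(tiles))
        for j in range(i + 1, len(tiles))
        if tiles[i] > tiles[j]
    )
    return inversions % 2 == 0


def optimal_lengths() -> Dict[State, int]:
    """BFS outward from GOAL. Moves are reversible, so the distance from the
    goal to a state equals the optimal solution length from that state."""
    dist = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        state = queue.popleft()
        for nxt in neighbors(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


def sample_puzzles(count: int, seed: int = 0) -> List[State]:
    """Draw random solvable, non-trivial start boards by rejection sampling."""
    rng = random.Random(seed)
    puzzles: List[State] = []
    while len(puzzles) < count:
        cells = list(GOAL)
        rng.shuffle(cells)
        candidate = tuple(cells)
        if candidate != GOAL and is_solvable(candidate):
            puzzles.append(candidate)
    return puzzles


def difficulty_bin(length: int, width: int = 5) -> int:
    """Hypothetical binning: lengths 1-5 -> bin 0, 6-10 -> bin 1, and so on."""
    return (length - 1) // width
```

The whole solvable state space is small enough to enumerate in a second or two on a laptop, which is part of why the puzzle makes such an awkward benchmark to fail.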

That design matters because it separates three questions that are often blurred together:

Question	What the paper tests	Why it matters
Can the model solve the task from instructions alone?	Zero-Shot, CoT, and AoT baseline runs	Measures tool-free planning under common prompting styles
Can feedback repair failure?	Repeat, specific, and suggestive feedback	Tests whether retries and corrections create reliable improvement
Is the main problem illegal moves or weak strategy?	External move validator	Separates move validity from goal-directed planning

This is not just another benchmark table. It is a failure-mechanism study. The interesting part is not only how often models fail, but how they fail.

The experiment removes the usual escape hatches

A familiar defense of LLM planning is: “Let the model use tools.” That is often the correct engineering answer. But it is not the same scientific question.

If a model can write Python code that runs a search algorithm, the puzzle may be solved—but the planning is being performed by the external algorithm. The model’s job shifts from planning to code generation. That may be useful, but it no longer tells us whether the model itself can maintain state and search through the action space.

The authors therefore exclude code execution and other external tools. This is not because tool use is illegitimate. It is because tool use masks the exact capability under investigation.

The prompting conditions are also chosen carefully. Zero-Shot gives the rules and output format. Chain-of-Thought provides worked examples and encourages step-by-step reasoning. Algorithm-of-Thought goes further by demonstrating a simple Manhattan-distance-style search process, closer to a hand-written planning procedure than casual reasoning.
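For readers who have not met the heuristic, the sketch below shows what a Manhattan-distance search looks like when a program runs it. This is ordinary A* search, not a reconstruction of the paper's Algorithm-of-Thought prompt; the board encoding, goal layout, and move names are assumptions.

```python
import heapq
from typing import Dict, List, Optional, Tuple

State = Tuple[int, ...]  # row-major 3x3 board, 0 is the blank
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def manhattan(state: State) -> int:
    """Sum over tiles of the row plus column distance to each tile's goal cell."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue
        goal_index = tile - 1
        total += abs(index // 3 - goal_index // 3) + abs(index % 3 - goal_index % 3)
    return total


def successors(state: State) -> List[Tuple[str, State]]:
    """Pairs of (move name, resulting board) for every legal blank move."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    result = []
    for name, (dr, dc) in DELTAS.items():
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            cells = list(state)
            target = 3 * r + c
            cells[blank], cells[target] = cells[target], cells[blank]
            result.append((name, tuple(cells)))
    return result


def a_star(start: State) -> Optional[List[str]]:
    """Return a shortest move list from start to GOAL, or None if unreachable."""
    frontier = [(manhattan(start), 0, start)]  # entries are (f = g + h, g, state)
    best_cost: Dict[State, int] = {start: 0}
    parent: Dict[State, Optional[Tuple[State, str]]] = {start: None}
    while frontier:
        _, cost, state = heapq.heappop(frontier)
        if state == GOAL:
            moves: List[str] = []
            while parent[state] is not None:
                state, move = parent[state]
                moves.append(move)
            return moves[::-1]
        if cost > best_cost[state]:
            continue  # stale queue entry; a cheaper path was found later
        for move, nxt in successors(state):
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = (state, move)
                heapq.heappush(frontier, (new_cost + manhattan(nxt), new_cost, nxt))
    return None
```

The point of the contrast is bookkeeping. The program keeps an explicit frontier, an explicit record of the cheapest known path to each board, and an explicit parent map. A prompt can describe that procedure; it cannot hand the model the data structures.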

This is the paper’s first useful correction to a common misconception: more reasoning-shaped text does not automatically produce better planning.

GPT-5-Thinking improves under Algorithm-of-Thought, reaching 30% success in the baseline experiments. But the same kind of richer prompting does not reliably help the other models. GPT-5-mini performs worse under CoT and AoT than under Zero-Shot. Gemini-2.5-Pro remains nearly unsuccessful. Llama 3.1 8B-Instruct solves none.

The result is not “prompting never helps.” It clearly can. The result is more precise and more annoying: prompting helps in model-dependent ways, and it can change failure modes as much as success rates. A prompt can encourage better search in one model while confusing output discipline in another. Apparently, “think harder” is still not an architecture.

Failure mechanism one: the model loses the board

The first major failure mechanism is brittle state representation.

In the 8-puzzle, every move changes the board. To solve the puzzle, the model must keep an accurate internal representation of where each tile and the blank are located. A single mistaken update can poison the rest of the plan. After that, the model may continue producing moves that sound coherent but apply to a board that no longer exists.

This is where the task becomes directly relevant to business agents. Many enterprise workflows are also stateful. A customer case changes after each message. An invoice changes after approval. A trading position changes after a partial fill. A logistics route changes after a delay. If an AI system internally updates the wrong state and then acts confidently, the surface output may remain polished while the operational premise is already broken.

The paper’s failure categories make this visible. Models often terminate because they propose invalid moves, repeat states, stop early, hit token or time limits, fail to parse correctly, or refuse to continue. Invalid moves are especially revealing because they show that the model’s internal board representation has drifted away from reality.

The appendix examples make the failure less abstract. A model may narrate a plausible sequence, revise itself, identify an error, propose another sequence, and still produce moves that do not legally follow from the actual state. The reasoning trace contains local self-correction, but not reliable global state maintenance.

That is a problem for any system where auditability matters. A reasoning trace can be rhetorically persuasive while being operationally wrong. The trace is not the state. The state needs to be represented, updated, and checked outside the model.
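One way to check the state "outside the model" is to replay whatever move list it proposes against the real board and stop at the first step that does not hold up. The sketch below assumes moves are named by the direction the blank travels and treats any revisited board as a repeat; the report fields and failure labels are hypothetical, not the paper's categories.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

State = Tuple[int, ...]  # row-major 3x3 board, 0 is the blank
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout
OFFSETS = {"up": -3, "down": 3, "left": -1, "right": 1}  # index shift of the blank


def valid_moves(state: State) -> List[str]:
    """Directions the blank can move from its current cell."""
    row, col = divmod(state.index(0), 3)
    allowed = []
    if row > 0:
        allowed.append("up")
    if row < 2:
        allowed.append("down")
    if col > 0:
        allowed.append("left")
    if col < 2:
        allowed.append("right")
    return allowed


def apply_move(state: State, move: str) -> State:
    """Swap the blank with its neighbor; the caller must check legality first."""
    blank = state.index(0)
    target = blank + OFFSETS[move]
    cells = list(state)
    cells[blank], cells[target] = cells[target], cells[blank]
    return tuple(cells)


@dataclass
class ReplayReport:
    final_state: State       # the real board when replay stopped
    moves_applied: int       # how many proposed moves were actually executed
    failure: Optional[str]   # None, "invalid_move", or "repeated_state"
    solved: bool


def replay(start: State, moves: Sequence[str]) -> ReplayReport:
    """Execute a proposed move list against the true state, stopping at the first problem."""
    state = start
    seen = {start}
    for index, move in enumerate(moves):
        if move not in valid_moves(state):
            return ReplayReport(state, index, "invalid_move", False)
        state = apply_move(state, move)
        if state in seen:
            return ReplayReport(state, index + 1, "repeated_state", False)
        seen.add(state)
    return ReplayReport(state, len(moves), None, state == GOAL)
```

A trace that ends with "and the puzzle is solved" is a claim. `replay(start, moves).solved` is a measurement. Only the second belongs in an operational record.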

Failure mechanism two: valid moves are not planning

The paper’s sharpest experiment is the external move validator.

Here the authors remove part of the state-tracking burden. At each step, the model receives the current puzzle state, a list of valid moves, and the previous move. The model no longer has to infer which moves are legal. It only needs to choose the best next valid move. The system then applies that move, updates the board, and asks again.
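The loop described above can be sketched as a small harness in which the model is reduced to a function that picks one option per step. The `Policy` signature, the status labels, and the rule that revisiting any earlier board counts as a loop are assumptions for illustration; `first_option_policy` is a deliberately naive stand-in, not a model.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]  # row-major 3x3 board, 0 is the blank
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout
OFFSETS = {"up": -3, "down": 3, "left": -1, "right": 1}

# A policy sees the current board, the legal moves, and the previous move.
Policy = Callable[[State, List[str], Optional[str]], str]


def valid_moves(state: State) -> List[str]:
    """Legal directions for the blank, in a fixed up/down/left/right order."""
    row, col = divmod(state.index(0), 3)
    allowed = []
    if row > 0:
        allowed.append("up")
    if row < 2:
        allowed.append("down")
    if col > 0:
        allowed.append("left")
    if col < 2:
        allowed.append("right")
    return allowed


def apply_move(state: State, move: str) -> State:
    """Swap the blank with its neighbor in the given direction."""
    blank = state.index(0)
    target = blank + OFFSETS[move]
    cells = list(state)
    cells[blank], cells[target] = cells[target], cells[blank]
    return tuple(cells)


def run_with_validator(
    start: State, policy: Policy, max_moves: int = 50
) -> Tuple[str, List[str]]:
    """Drive one episode: the harness owns the board, the policy only chooses."""
    state, previous = start, None
    history: List[str] = []
    seen = {start}
    for _ in range(max_moves):
        if state == GOAL:
            return "solved", history
        options = valid_moves(state)
        choice = policy(state, options, previous)
        if choice not in options:
            return "invalid_choice", history
        state = apply_move(state, choice)
        history.append(choice)
        if state in seen:  # assumed loop definition: any revisited board
            return "loop", history
        seen.add(state)
        previous = choice
    return ("solved" if state == GOAL else "move_cap"), history


def first_option_policy(state: State, options: List[str], previous: Optional[str]) -> str:
    """Stand-in chooser that always takes the first legal move."""
    return options[0]
```

Note what the harness does and does not provide. It guarantees every executed move is legal and every board is real. It says nothing about whether the chosen move gets any closer to the goal, which is exactly the gap the experiment exposes.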

This intervention is important because it tests a narrower capability: if legality is provided, can the model still plan toward the goal?

The answer is brutal: no model solves any puzzle in this setting.

The dominant failure mode changes. GPT-5-Thinking loops in 100% of trials. Gemini-2.5-Pro loops in 92%. Llama 3.1 8B-Instruct loops in 86%. GPT-5-mini behaves differently: most of its runs are cut off at the 50-move cap rather than caught looping, but it also solves nothing.

This result prevents a lazy interpretation of the baseline failures. The issue is not simply that models sometimes choose illegal moves. Even when valid moves are supplied, they lack sufficiently strong heuristic planning to reach the goal reliably. They can select actions, but action selection is not the same as search.

A useful way to read the validator experiment is as an ablation. It removes one burden—valid-move identification—to isolate another burden: strategy. The collapse under this condition suggests that the planning deficit is not merely a state-legality problem. It is also a goal-directed search problem.

That is exactly the distinction businesses need to understand. Many AI agent products already include validators, schemas, workflow constraints, and tool wrappers. Those are necessary. They are not sufficient. A validator can stop the model from clicking a forbidden button. It does not guarantee the model knows what sequence of allowed buttons will produce the desired outcome.

Feedback buys performance, not reliability

Feedback improves results, but it does not rescue the architecture.

The authors test three feedback regimes after initial failures. “Repeat” simply gives another attempt from the furthest valid state. “Specific” feedback identifies what went wrong, such as an invalid move. “Suggestive” feedback adds the optimal solution length from the current state, giving the model a scalar hint without revealing the action sequence.
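The three regimes differ only in how much information the retry message carries, which a short sketch makes visible. The message wording, the function names, and the choice to have "suggestive" include the specific error text are all assumptions; the paper's actual feedback prompts are not reproduced here. The optimal length is computed with a plain breadth-first search.

```python
from collections import deque
from typing import List, Optional, Tuple

State = Tuple[int, ...]  # row-major 3x3 board, 0 is the blank
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout


def neighbors(state: State) -> List[State]:
    """All boards one legal move away."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            cells = list(state)
            target = 3 * r + c
            cells[blank], cells[target] = cells[target], cells[blank]
            result.append(tuple(cells))
    return result


def optimal_length(start: State) -> Optional[int]:
    """Fewest moves from start to GOAL via BFS, or None if GOAL is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == GOAL:
            return dist[state]
        for nxt in neighbors(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None


def build_feedback(regime: str, state: State, failure: str) -> str:
    """Compose a retry message for the furthest valid board reached so far."""
    base = f"Continue from this board: {state}."
    if regime == "repeat":
        return base  # another attempt, no diagnosis
    detail = f"Your previous attempt ended because of: {failure}."
    if regime == "specific":
        return f"{detail} {base}"
    if regime == "suggestive":
        hint = f"An optimal solution from here takes {optimal_length(state)} moves."
        return f"{detail} {base} {hint}"
    raise ValueError(f"unknown feedback regime: {regime!r}")
```

Even the richest message is a single number. It tells the model how far away the goal is, not which way to go.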

The best result is GPT-5-Thinking with Algorithm-of-Thought and suggestive feedback: 34 out of 50 puzzles solved, or 68%. That is a large improvement over its 30% baseline AoT success rate.

But the cost profile is the part that deserves managerial attention. Under that best-performing condition, successful GPT-5-Thinking runs average about 24 minutes, 75,284 tokens, nearly two attempts, and roughly 49 moves, on puzzles whose optimal solutions average about 21 moves. In other words, the model can sometimes get there, but by taking a long, expensive, indirect route through a tiny deterministic problem.
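A little arithmetic on those figures shows how indirect the route is. The inputs below are the rounded averages quoted above, so the derived ratios are back-of-the-envelope estimates rather than numbers reported by the paper.

```python
# Figures quoted in the article for GPT-5-Thinking, AoT, suggestive feedback.
solved, total = 34, 50
avg_moves, avg_optimal = 49, 21          # moves used vs. optimal solution length
avg_tokens, avg_minutes = 75_284, 24     # per successful run

success_rate = solved / total                       # fraction of puzzles solved
move_overhead = avg_moves / avg_optimal             # moves spent per optimal move
tokens_per_move = avg_tokens / avg_moves            # rough token cost of one slide
seconds_per_move = avg_minutes * 60 / avg_moves     # rough wall-clock cost of one slide

print(f"success rate:     {success_rate:.0%}")
print(f"move overhead:    {move_overhead:.2f}x optimal")
print(f"tokens per move:  {tokens_per_move:.0f}")
print(f"seconds per move: {seconds_per_move:.0f}")
```

On these estimates a successful run uses a bit more than twice the optimal number of moves and spends on the order of 1,500 tokens and half a minute per tile slide.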

The weaker models show another pattern. Gemini-2.5-Pro and GPT-5-mini each peak at 18% success under their best-performing feedback conditions. For both, the peak can come from the “Repeat” condition rather than richer feedback, suggesting that saved progress and stochastic retries may matter more than the semantic content of the correction. Llama 3.1 8B-Instruct still solves none.

This is not a small implementation detail. It changes the business interpretation of “feedback helps.”

Feedback can improve completion rates, but it may do so by spending more time, more tokens, more retries, and more supervision-like structure. That is not the same as reliability. It is closer to expensive recovery.

For an enterprise workflow, expensive recovery may still be acceptable if the value of success is high and the system is supervised. But it is a weak foundation for unattended autonomy. If a toy planning task requires repeated retries, long traces, and carefully designed hints, then a real workflow with changing constraints and partial observability should not be handed to a monolithic LLM and a motivational prompt.

What each test is really doing

The paper’s results are easier to interpret if we separate the purpose of each experimental layer. Otherwise, the findings blur into “models did badly,” which is true but not very useful.

Test layer	Likely purpose	What it supports	What it does not prove
Baseline prompting across Zero-Shot, CoT, and AoT	Main evidence	Tool-free LLM planning is weak and prompt-sensitive	That prompting is always useless
Difficulty-bin analysis	Main evidence and diagnostic check	Success does not degrade purely with optimal solution length; failure is tied to planning/state instability	A complete theory of puzzle difficulty
Feedback trials	Sensitivity/intervention test	Retries and corrective signals can improve some model–prompt combinations, especially GPT-5-Thinking with AoT	That feedback creates reliable autonomous planning
External move validator	Ablation isolating planning from move legality	Even valid-move assistance does not produce successful search	That validators are useless in real systems
Appendix worked examples	Qualitative implementation detail and failure illustration	Shows how failures unfold inside concrete trajectories	Aggregate generalization beyond the measured trials

The key reading is that the paper does not merely compare models. It progressively strips away excuses.

First, maybe the prompt is too weak. So the authors test CoT and AoT.

Second, maybe the model just needs correction. So they add repeat, specific, and suggestive feedback.

Third, maybe the model fails because it cannot tell which moves are legal. So they provide valid moves.

After all that, the two failure mechanisms remain: the internal state is brittle, and the heuristic search is weak.

That is why the paper’s business lesson is not “use this prompt instead.” It is “stop expecting the prompt to carry the whole control system.”

The business value is diagnosis, not puzzle-solving

No executive needs an LLM to solve sliding-tile puzzles. The value of this paper is diagnostic.

It gives a clean analogy for where agent projects go wrong. Many AI agent demos look impressive because the model narrates intention, calls tools, and produces a final answer. But real operations require more than intention. They require state fidelity, transition validity, progress measurement, loop detection, rollback, and verification.

The paper points toward a practical architecture:

Business requirement	Weak LLM-only version	More reliable architecture
Track current state	Model “remembers” it in context	External state store or workflow engine
Enforce legal actions	Prompt says what is allowed	Validators, schemas, permissions, policy checks
Choose next action	Model reasons freely	Planner, search module, optimization routine, or constrained policy
Detect loops	Model is told not to repeat	Explicit visited-state tracking and loop alarms
Recover from failure	Ask the model to try again	Structured rollback, retry limits, escalation rules
Verify success	Trust final answer	Independent checker or end-state validator
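The right-hand column can be sketched in a few dozen lines. The invoice workflow, its state and action names, and the retry limit below are hypothetical, chosen only to show the division of labor: the engine owns the state and the rules, and whatever proposes actions (an LLM in practice, a plain function here) only suggests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical invoice workflow: (current state, action) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("received", "validate"): "validated",
    ("validated", "approve"): "approved",
    ("validated", "reject"): "rejected",
    ("approved", "pay"): "paid",
}
TERMINAL = {"paid", "rejected"}


@dataclass
class WorkflowEngine:
    state: str = "received"          # the single source of truth for state
    max_denials: int = 3             # retry limit before a human is pulled in
    denials: int = 0
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def allowed_actions(self) -> List[str]:
        """Actions the transition table permits from the current state."""
        return [action for (state, action) in TRANSITIONS if state == self.state]

    def submit(self, action: str) -> bool:
        """Apply an action only if the transition table permits it."""
        nxt = TRANSITIONS.get((self.state, action))
        if nxt is None:
            self.denials += 1
            self.log.append((self.state, action, "DENIED"))
            return False
        self.log.append((self.state, action, nxt))
        self.state = nxt
        return True


def run(engine: WorkflowEngine, propose: Callable[[str, List[str]], str]) -> str:
    """Ask the proposer for actions until a terminal state or the retry limit."""
    while engine.state not in TERMINAL:
        if engine.denials >= engine.max_denials:
            return "escalated"
        engine.submit(propose(engine.state, engine.allowed_actions()))
    return engine.state
```

A proposer that keeps asking to `pay` an unvalidated invoice gets denied three times and the case is escalated, with the invoice still sitting in `received`. Nothing the proposer says can move the state; only a permitted transition can.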

The LLM still has a role. It can translate messy human requests into structured goals. It can explain trade-offs. It can summarize intermediate states for users. It can propose candidate actions. It can help design heuristics. But the parts that must be correct should not live only in the model’s prose.

For Cognaptus-style business automation, the implication is straightforward: treat LLMs as flexible language interfaces and reasoning assistants, not as invisible workflow engines. The workflow engine should know the state. The validator should know the rules. The planner should know how to search. The model should not be asked to hallucinate all three and then grade its own homework. That arrangement is brave in the same way putting a toaster in charge of procurement is brave.

False confidence is the real governance problem

The paper’s most operationally serious finding is not failure. Failure is manageable when systems expose it.

The worse problem is confident failure. Models often output a final move sequence as if the puzzle is solved, even though earlier steps contain invalid moves or drifted board states. The final answer can look clean while resting on a broken intermediate chain.

This is the pattern governance teams should care about. In a customer support workflow, the model may confidently say it issued a refund when the authorization step failed. In logistics, it may present an optimized route after silently violating a constraint. In trading, it may recommend an action based on a position state that no longer matches the exchange. In compliance, it may produce a polished rationale after skipping a required check.

The common failure shape is:

1. The model loses or corrupts state.
2. The model continues reasoning from that corrupted state.
3. The final answer appears coherent.
4. The system has no independent mechanism to detect the internal error.

That is not just an accuracy issue. It is an observability issue. If the system cannot tell when its own intermediate state has gone wrong, then a user cannot safely rely on final fluency as evidence of correctness.

This is why “chain-of-thought-style” explanations are not enough for enterprise assurance. A written rationale is not an audit log. An audit log must be tied to actual state transitions, tool calls, permissions, timestamps, and verified outcomes.
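As a sketch of the difference, an audit entry can store what the model said happened next to what the system actually observed, and flag any mismatch. The field names and the refund example are hypothetical, and a production log would also carry the tool calls and permissions mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditEntry:
    timestamp: str       # when the transition was recorded (UTC, ISO 8601)
    action: str          # what was attempted
    state_before: str    # system state before the action
    state_after: str     # system state actually observed after the action
    model_claim: str     # the outcome the model reported
    verified: bool       # True only if the claim matches the observed state


def record(
    log: List[AuditEntry],
    action: str,
    state_before: str,
    state_after: str,
    model_claim: str,
) -> AuditEntry:
    """Append an entry whose verified flag comes from the system, not the model."""
    entry = AuditEntry(
        timestamp=datetime.now(timezone.utc).isoformat(),
        action=action,
        state_before=state_before,
        state_after=state_after,
        model_claim=model_claim,
        verified=(model_claim == state_after),
    )
    log.append(entry)
    return entry


def unverified_claims(log: List[AuditEntry]) -> List[AuditEntry]:
    """Entries where the model's account and the observed state disagree."""
    return [entry for entry in log if not entry.verified]
```

In the refund scenario, recording `model_claim="refund_issued"` against `state_after="authorization_failed"` produces an entry with `verified=False`, which is the signal a fluent final answer would otherwise hide.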

Boundaries: what the paper does not prove

The paper is sharp, but its scope is limited.

First, it studies one domain. The 8-puzzle is a useful planning probe, but it is not a universal proxy for all business workflows. Some business tasks are less search-heavy, more language-heavy, or easier to decompose with tools.

Second, the evaluation uses 50 puzzles. The authors sample across difficulty bins and use random generation to reduce contamination risk, but the dataset remains modest. The results are strong enough to diagnose failure modes, not to produce a final ranking of all models under all planning conditions.

Third, the paper deliberately excludes code interpreters and other tools. That is appropriate for isolating innate planning, but it does not imply that tool-augmented AI systems cannot solve such tasks. In fact, the practical lesson is almost the opposite: if planning matters, use tools and explicit algorithms rather than hoping the model performs search internally.

Fourth, the external move validator condition is not identical to a mature agent architecture. A production agent might include memory, symbolic planners, state machines, search procedures, learned policies, domain-specific heuristics, and multiple validators. The paper does not test those architectures. It tests what happens when an LLM is still the main planner even after some burden is offloaded.

So the right conclusion is not “LLMs cannot be part of agents.” The right conclusion is: LLMs should not be mistaken for complete planning systems.

The 3×3 lesson: build around the model’s weakness

The 8-puzzle is small enough to fit on a napkin, yet it exposes a gap that matters for real automation. Current LLMs can produce planning language. They can follow examples. They can sometimes benefit from feedback. They can even improve substantially under carefully engineered conditions.

But reliable planning requires more than sounding like a planner.

It requires a state representation that does not drift, a transition system that enforces valid actions, a search process that can escape local loops, and an independent verifier that checks whether the goal was actually reached. Those are engineering components, not adjectives in a prompt.

The paper’s best contribution is therefore not another reminder that benchmarks can be misleading. We have had enough reminders. Its contribution is a cleaner map of the failure mechanism: brittle state, weak search, costly correction, and false confidence.

For business leaders, the message is simple. Use LLMs where their flexibility is valuable. Surround them with systems where correctness matters. Let them read, explain, translate, summarize, and propose. But when the workflow needs durable state and sequential control, give the job to an architecture that can actually track the board.

A 3×3 box should not be the ceiling of agentic ambition. But it is a useful floor test. Right now, too many systems are still tripping over the tiles.

Cognaptus: Automate the Present, Incubate the Future.

¹ Charles Schepanowski and Charles Ling, “On the Limits of Innate Planning in Large Language Models,” arXiv:2511.21591, 2025.
