Reflections in the Mirror Maze: Why LLM Reasoning Isn't Quite There Yet

TL;DR for operators

Adding “reasoning” to an LLM agent is not the same as making it reason better. Wong et al. test four open-source models across dynamic SmartPlay tasks using a baseline prompt, reflection, reflection plus an Oracle that mutates heuristics, and reflection plus a Planner that simulates short future trajectories.¹ The clean result is not “planning wins” or “bigger models win.” The result is more annoying, and therefore more useful: the same scaffold can be a booster, a distraction, or a failure amplifier.

For simple reactive tasks, extra reasoning can make agents worse because it adds prompt length and encourages unnecessary exploration. In Bandit, for example, Llama3-8B drops from a median score of 40.35 to 34.00 with Reflection + Planner, while DeepSeek-R1-14B drops from 41.00 to 32.05. The agent does not become wiser. It becomes the colleague who insists on “keeping options open” while the profitable option is blinking in neon.

For more complex tasks, scaffolds sometimes help smaller models close the model-size gap. Llama3-8B with Reflection + Oracle reaches a median Rock Paper Scissors score of 26.00, beating Llama3.3-70B’s baseline median of 22.20. Mistral-Nemo-12B with Reflection + Planner reaches a Messenger median of 1.00, above Llama3.3-70B’s baseline median of 0.10. But these gains are unstable: the same setup can have wide min–max ranges and sometimes collapse below baseline.

The business reading is straightforward. Do not ship “reflection” or “planning” as a blanket agent upgrade. Treat them as controlled modules with activation rules, validation, state tracking, and rollback. Reasoning is not a decorative prompt paragraph. It is a systems problem with cost, latency, variance, and failure modes attached.

The agent says it has a plan. Cute. Now watch the move.

A familiar scene: an AI agent explains its strategy beautifully, cites the rule, reflects on its last error, and then immediately performs the illegal action it just warned itself about.

That is the unpleasant charm of this paper. It does not test whether a model can describe reasoning. It tests whether an LLM can use prompted reasoning while acting inside a changing environment. That distinction matters because many enterprise agent demos quietly collapse the two. The agent produces a plan; therefore it must be planning. The agent critiques its own action; therefore it must be learning. The agent writes a heuristic; therefore it must be adapting. Lovely theatre. Not always control.

Wong et al. evaluate four models: Llama3-8B, Mistral-Nemo-12B, DeepSeek-R1-14B, and Llama3.3-70B. Each model acts in SmartPlay environments where it receives textual state observations, legal actions, histories, and rewards. The agent is not fine-tuned. It must decide from context. The authors compare four configurations:

Agent setup	What it adds	Operational interpretation
Baseline	Game manual, objective, history, current observation, legal actions	A plain state-to-action policy through prompting
Reflection	Retrospective feedback after each transition	Local self-critique inside the current episode
Reflection + Oracle	Cross-episode heuristic mutation using a simple evolutionary rule	A lightweight attempt at memory and adaptation
Reflection + Planner	Short-horizon simulation up to three steps	A look-ahead controller bolted onto the prompt

The paper then runs these setups on four SmartPlay environments: Two-Armed Bandit, Rock Paper Scissors, Tower of Hanoi, and Messenger. The environments are not equally hard. That is the point. Bandit rewards fast exploitation. Rock Paper Scissors rewards adaptation to opponent behaviour. Hanoi requires rule-constrained spatial planning. Messenger combines synonym interpretation, navigation, enemy avoidance, and delivery under a step limit.

This gives the paper its useful structure: not one leaderboard, but a set of contrasts.

Bigger models are steadier, but scaffolds can temporarily fake scale

The first contrast is model size versus prompting scaffold.

The baseline results broadly favour larger models. On Bandit, Llama3.3-70B posts a median score of 41.70, ahead of DeepSeek-R1-14B at 41.00, Llama3-8B at 40.35, and Mistral-Nemo-12B at 34.20. On Rock Paper Scissors, Llama3.3-70B’s baseline median is 22.20, above the smaller models’ baselines (Llama3-8B, for example, sits at 17.15). On Hanoi, Llama3.3-70B reaches a baseline median of 2.00 disks on the target, while smaller models sit lower (0.20 for Llama3-8B and 0.00 for Mistral-Nemo-12B). Scale helps.

But the paper’s more interesting result is that prompt scaffolds can sometimes make smaller models look competitive. Llama3-8B with Reflection + Oracle reaches a median score of 26.00 on Rock Paper Scissors, beating Llama3.3-70B’s baseline median of 22.20. Mistral-Nemo-12B with Reflection + Planner reaches a Messenger median of 1.00, higher than Llama3.3-70B’s baseline median of 0.10.

That sounds like the sort of result a vendor would love to put in a slide deck. “Smaller model beats larger model with agentic reasoning!” Add a gradient background and the invoice practically writes itself.

But the paper does not allow that conclusion to remain comfortable. The same improvements come with wider variance. Mistral-Nemo-12B with Reflection + Planner ranges from 10.00 to 33.00 on Rock Paper Scissors. DeepSeek-R1-14B with Reflection + Planner ranges from 0.00 to 2.00 on Hanoi. In other words, scaffolding may close the median gap in some cases, but it can also make the system less predictable.

For operators, this distinction matters more than the headline win. A smaller model plus clever prompting may reduce inference cost. It may also increase retry cost, monitoring cost, exception-handling cost, and user trust cost. The paper’s evidence supports experimentation with scaffolds; it does not support treating them as a stable substitute for capability.
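The min, median, max view is cheap to apply to your own runs. What follows is a minimal sketch, not the paper's evaluation code: the names `RunSummary`, `summarize`, and `acceptable`, the thresholds, and the run scores are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass(frozen=True)
class RunSummary:
    lowest: float
    mid: float
    highest: float

    @property
    def spread(self) -> float:
        # Min-to-max range: a crude stability signal when only a few runs exist.
        return self.highest - self.lowest


def summarize(scores: List[float]) -> RunSummary:
    if not scores:
        raise ValueError("need at least one run")
    return RunSummary(min(scores), median(scores), max(scores))


def acceptable(summary: RunSummary, floor: float, max_spread: float) -> bool:
    # Pass only if the worst run clears the floor AND the runs stay close together.
    return summary.lowest >= floor and summary.spread <= max_spread


# Hypothetical scores, invented for illustration (not taken from the paper).
baseline = summarize([20.0, 21.0, 23.0])
scaffolded = summarize([12.0, 25.0, 31.0])

print(baseline, acceptable(baseline, floor=18.0, max_spread=10.0))
print(scaffolded, acceptable(scaffolded, floor=18.0, max_spread=10.0))
```

In this made-up example the scaffolded setup has the higher median and still fails the gate, because its worst run falls below the floor and its runs are too far apart. That is the shape of the trade-off described above.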

The same scaffold can help, distract, or sabotage

The second contrast is task type.

The paper’s most practical warning appears in Bandit. The task is simple: learn which slot machine pays better and exploit it. Here, more reasoning often hurts. Llama3-8B drops from a baseline median of 40.35 to 34.00 under Reflection + Planner. DeepSeek-R1-14B drops from 41.00 to 32.05 under the same setup.

The mechanism is not mysterious. The scaffold encourages continued exploration even after one option is clearly better. Reflection tells the agent not to pull the same machine too often. The Planner proposes alternating sequences. The system overthinks a task where the right behaviour is almost embarrassingly direct.

This is the first product lesson: reasoning bandwidth should be task-sensitive. A data-fetching agent, invoice classifier, routing bot, or deterministic workflow executor does not need a philosophical crisis at every step. Extra context can dilute the relevant signal. Extra reflection can create false uncertainty. Extra planning can make the model optimise for an imagined complexity that the task does not have.
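The cost of over-exploration is easy to see in a toy simulation. This is a sketch under stated assumptions, not a reproduction of the SmartPlay Bandit task: the payout probabilities, the 50-pull episode length, and both policies are hypothetical stand-ins for "commit early" versus "keep options open".

```python
import random

# Hypothetical payout probabilities for a two-armed bandit (not from the paper).
PAYOUT = (0.8, 0.2)
PULLS = 50  # hypothetical episode length


def pull(arm: int, rng: random.Random) -> int:
    return 1 if rng.random() < PAYOUT[arm] else 0


def explore_then_commit(rng: random.Random, probes_per_arm: int = 3) -> int:
    """Probe each arm a few times, then exploit the empirically better one."""
    wins = [0, 0]
    total = 0
    for arm in (0, 1):
        for _ in range(probes_per_arm):
            reward = pull(arm, rng)
            wins[arm] += reward
            total += reward
    if wins[0] == wins[1]:
        best = rng.choice((0, 1))  # break ties at random to avoid favouring arm 0
    else:
        best = 0 if wins[0] > wins[1] else 1
    for _ in range(PULLS - 2 * probes_per_arm):
        total += pull(best, rng)
    return total


def keep_alternating(rng: random.Random) -> int:
    """Never commit: switch arms every step, as an over-exploring agent might."""
    return sum(pull(step % 2, rng) for step in range(PULLS))


def mean_score(policy, episodes: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(policy(rng) for _ in range(episodes)) / episodes


commit_mean = mean_score(explore_then_commit)
alternate_mean = mean_score(keep_alternating)
print(f"explore-then-commit: {commit_mean:.1f}, alternating: {alternate_mean:.1f}")
```

Under these assumptions the alternating policy averages about half its pulls on the weak arm, so it trails the committing policy by a wide margin. Nothing here models an LLM; it only shows why a scaffold that nudges toward alternation is expensive on this kind of task.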

Rock Paper Scissors is different. The agent must adapt to biased and shuffled opponent behaviour. Here, Llama3.3-70B benefits strongly from Reflection + Planner, improving from a baseline median of 22.20 to 30.00. Llama3-8B benefits from Reflection + Oracle, moving from 17.15 baseline to 26.00. The qualitative examples matter: the larger model varies moves and adjusts to opponent patterns, while smaller models can get stuck repeating a losing move.

So the scaffold is not bad. It is conditional. Reflection helps when there is a meaningful policy to revise. Planning helps when future state estimates are reliable enough to matter. Heuristic mutation helps when experience generalises across episodes. When those conditions fail, scaffolding becomes premium-grade confusion.

A compact operating rule emerges:

Task condition	Likely scaffold value	Main risk
Simple reward signal, fast exploitation needed	Low	Over-exploration and prompt distraction
Pattern adaptation over repeated interactions	Medium to high	Variance across runs
Rule-constrained spatial planning	Uncertain	Invalid actions despite verbal rule knowledge
Navigation with language and state grounding	Uncertain	Object misidentification and poor spatial awareness
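The operating rule in the table can be encoded as a small lookup that a router consults before enabling any scaffold. This is a sketch only: `TaskCondition`, `OPERATING_RULE`, `plan_rollout`, and the rollout labels are hypothetical names, and the mapping from "uncertain" to "pilot behind a validator" is Cognaptus-style inference rather than something the paper prescribes.

```python
from enum import Enum
from typing import Dict, Tuple


class TaskCondition(Enum):
    SIMPLE_REWARD = "simple reward signal, fast exploitation needed"
    PATTERN_ADAPTATION = "pattern adaptation over repeated interactions"
    RULE_CONSTRAINED = "rule-constrained spatial planning"
    GROUNDED_NAVIGATION = "navigation with language and state grounding"


# (likely scaffold value, main risk), transcribed from the table above.
OPERATING_RULE: Dict[TaskCondition, Tuple[str, str]] = {
    TaskCondition.SIMPLE_REWARD: ("low", "over-exploration and prompt distraction"),
    TaskCondition.PATTERN_ADAPTATION: ("medium to high", "variance across runs"),
    TaskCondition.RULE_CONSTRAINED: (
        "uncertain", "invalid actions despite verbal rule knowledge"),
    TaskCondition.GROUNDED_NAVIGATION: (
        "uncertain", "object misidentification and poor spatial awareness"),
}


def plan_rollout(condition: TaskCondition) -> Dict[str, str]:
    value, risk = OPERATING_RULE[condition]
    if value == "low":
        return {"scaffold": "off", "watch": risk}
    if value == "uncertain":
        # Assumption: pilot behind a validator instead of enabling by default.
        return {"scaffold": "pilot_behind_validator", "watch": risk}
    return {"scaffold": "on_with_variance_monitoring", "watch": risk}


for condition in TaskCondition:
    print(condition.name, plan_rollout(condition))
```

The point of the design is that the default is not "reasoning on". Each task class has to earn its scaffold, and each decision carries the risk you are supposed to monitor.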

This is better than the usual “use reflection for complex tasks” advice. Complexity is not one thing. A task can be complex because it requires pattern adaptation, valid-action filtering, long-horizon memory, spatial grounding, schema compliance, or reward interpretation. Reflection does not solve all six. It barely remembers where it parked.

Hanoi exposes the knowing-doing gap

Tower of Hanoi is the paper’s sharpest test because the rules are famous, simple, and often present in training data. The model can be asked for the optimal seven-move solution and produce it. Yet when placed inside the environment, the agents struggle to execute.

Llama3.3-70B achieves the strongest Hanoi baseline at a median score of 2.00, and Reflection + Oracle matches it. But Reflection alone drops it to 0.70, and Reflection + Planner drops it to 1.00. For Llama3-8B, the baseline median is only 0.20, while Reflection + Planner can reach a maximum of 2.00 but also falls to 0.00. Mistral-Nemo-12B moves from a baseline median of 0.00 to 1.00 with Reflection + Oracle or Reflection + Planner, but still remains brittle.

The paper’s qualitative diagnosis is more important than the scores. Agents repeatedly make invalid moves, such as placing a larger disk on a smaller one. One reflection explicitly identifies this recurring pitfall. Then the agent still fails to act consistently within the constraint. That is the knowing-doing gap: the model can verbalise the rule but cannot reliably bind it to action selection over state transitions.

For business systems, the equivalent failure is not a toy puzzle. It is an AI procurement agent that knows the approval policy but submits a purchase order anyway. It is a compliance assistant that recites the rule but routes the document to the wrong region. It is a customer operations agent that explains the escalation protocol while skipping the required consent step.

The remedy is not simply “better reflection.” Hanoi suggests that rule-constrained tasks need external validity checks. If an action is illegal, the system should not rely on the model’s internal monologue to avoid it. It should constrain the action space, validate transitions, and separate proposed moves from executable moves.
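Separating proposed moves from executable moves takes very little code. Below is a minimal sketch for Tower of Hanoi, assuming a state stored as three bottom-to-top stacks; the names `is_legal`, `legal_moves`, and `execute` are hypothetical and this is not the SmartPlay environment's API.

```python
from typing import Dict, List, Tuple

Pegs = Dict[str, List[int]]  # each list is bottom-to-top; larger int = larger disk
Move = Tuple[str, str]       # (source peg, destination peg)


def is_legal(pegs: Pegs, move: Move) -> bool:
    src, dst = move
    if src not in pegs or dst not in pegs or src == dst:
        return False
    if not pegs[src]:
        return False  # nothing to pick up
    # A disk may land on an empty peg or on a strictly larger disk.
    return not pegs[dst] or pegs[src][-1] < pegs[dst][-1]


def legal_moves(pegs: Pegs) -> List[Move]:
    return [(s, d) for s in pegs for d in pegs if is_legal(pegs, (s, d))]


def execute(pegs: Pegs, proposed: Move) -> Tuple[Pegs, bool]:
    """Apply the model's proposed move only if the validator accepts it."""
    if not is_legal(pegs, proposed):
        return pegs, False  # state untouched; caller can re-prompt or fall back
    new_pegs = {name: list(stack) for name, stack in pegs.items()}
    new_pegs[proposed[1]].append(new_pegs[proposed[0]].pop())
    return new_pegs, True


state = {"A": [3, 2, 1], "B": [], "C": []}
state, first_applied = execute(state, ("A", "C"))   # disk 1 moves to empty peg C
state, second_applied = execute(state, ("A", "C"))  # disk 2 onto disk 1: rejected
print(state, first_applied, second_applied, legal_moves(state))
```

The model is free to say anything it likes about the rules. Only `execute` touches the state, so an illegal proposal costs a retry rather than a corrupted puzzle.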

The authors’ additional Hanoi tests make this point clearer. They test Mistral-Nemo-12B under modified conditions: reward shaping, valid-action hints, both combined, and a simplified two-disk version. This is best read as a failure-mode and sensitivity analysis, not as the paper’s main benchmark. The table below sets these diagnostics, and the matching Messenger ones, beside the main comparison.

Test element	Likely purpose	What it supports	What it does not prove
Main SmartPlay comparison	Main evidence	Scaffolds have task- and model-dependent effects	Universal ranking of all LLM agents
Figure 2 skill aggregation	Cross-task diagnostic summary	Smaller models benefit more consistently; larger models can regress	That one scaffold is globally best
Hanoi reward shaping	Sensitivity/failure-mode test	Sparse rewards are part of the problem	That reward shaping fixes planning
Valid-action hints	Ablation-style diagnostic	Illegal move identification is a core bottleneck	That prompt hints are enough for reliable control
Messenger synonym removal	Ablation-style diagnostic	Language ambiguity is not the only issue	That navigation failures are solved by simpler wording

In the three-disk Hanoi setting, Mistral-Nemo-12B never completes the puzzle under baseline, reflection, Reflection + Oracle, or Reflection + Planner without adjustments. Invalid moves are extremely frequent: the baseline has 79.4% invalid moves, Reflection has 74.1%, Reflection + Oracle has 74.1%, and Reflection + Planner has 76.3%. Showing valid actions reduces invalid moves for several methods, but does not eliminate them. Even a random policy outperforms all methods in the no-adjustment three-disk setting on success, disk placement, and invalid-move metrics.

That last sentence is not flattering. It is also the kind of result evaluation papers should report more often.
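A random reference policy is worth building for any agent benchmark, and for Hanoi it is tiny. The sketch below makes three assumptions that are not confirmed details of the paper's setup: the action space is the six ordered peg pairs, an illegal move leaves the state unchanged, and peg C is the target. In any legal three-peg position only two or three of those six moves are legal, so a uniform random policy should be invalid between one half and two thirds of the time under these assumptions, which sits below the 74.1% to 79.4% rates quoted above.

```python
import random
from itertools import permutations
from typing import Dict, List, Tuple

PEGS = ("A", "B", "C")
ACTIONS: List[Tuple[str, str]] = list(permutations(PEGS, 2))  # six ordered pairs


def is_legal(state: Dict[str, List[int]], move: Tuple[str, str]) -> bool:
    src, dst = move
    return bool(state[src]) and (not state[dst] or state[src][-1] < state[dst][-1])


def fresh_state(disks: int = 3) -> Dict[str, List[int]]:
    # All disks start on peg A, largest at the bottom.
    return {"A": list(range(disks, 0, -1)), "B": [], "C": []}


def random_policy_invalid_rate(steps: int = 10_000, disks: int = 3, seed: int = 0) -> float:
    rng = random.Random(seed)
    state, invalid = fresh_state(disks), 0
    for _ in range(steps):
        move = rng.choice(ACTIONS)
        if not is_legal(state, move):
            invalid += 1  # assumption: an illegal move leaves the state unchanged
            continue
        state[move[1]].append(state[move[0]].pop())
        if len(state["C"]) == disks:
            state = fresh_state(disks)  # solved on the assumed target peg; restart
    return invalid / steps


rate = random_policy_invalid_rate()
print(f"uniform random policy invalid-move rate: {rate:.1%}")
```

If an agent with a manual, a history, and a reflection loop cannot beat this number, the scaffold is not the first thing to fix.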

In the two-disk version, performance improves sharply. Reflection + Oracle reaches 42.0% goal success with no adjustments and 68.0% with valid-action hints. This suggests models can handle shorter move chains better. But the three-disk failure shows that “knowing the rule” is not the same as maintaining a valid stateful policy over a longer sequence.

Messenger shows that language understanding is not embodied competence

Messenger adds another contrast: language comprehension versus grounded navigation.

The agent must identify objects described with synonyms, pick up a message, avoid enemies, and reach the goal. The task is textual, but it behaves like a navigation problem. That distinction matters because LLMs are much better at naming things than moving through state spaces.

In the main results, smaller models gain from Reflection + Planner. Llama3-8B improves from a median Messenger score of -0.15 to 0.00. Mistral-Nemo-12B improves from -0.20 to 1.00. DeepSeek-R1-14B improves from 0.40 to 1.00. But Llama3.3-70B collapses under Reflection + Planner, from 0.10 to -1.00. The authors attribute this to overly cautious enemy avoidance, which causes detours and failure within the ten-step horizon.

That is a useful failure mode. Planning can be locally rational and globally useless. Avoiding enemies is sensible. Avoiding them so cautiously that the agent never completes the delivery is not. Many enterprise agents fail in the same pattern: they avoid every possible risk, request clarification forever, or produce defensive non-actions that technically reduce error exposure while failing the business task.
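One way to keep caution from eating the whole step budget is to check every candidate step against the steps that remain. This is a sketch under loud assumptions: a grid with four-direction movement, no obstacles, and Manhattan distance as the lower bound on remaining steps. The names and coordinates are hypothetical, and none of this comes from the paper's Planner.

```python
from typing import Tuple

Cell = Tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def step_is_affordable(next_cell: Cell, goal: Cell, steps_left: int) -> bool:
    """True if taking one step to next_cell still leaves enough steps to reach goal."""
    return 1 + manhattan(next_cell, goal) <= steps_left


# Hypothetical situation: agent at (0, 0), goal at (3, 3), seven steps remaining.
goal, steps_left = (3, 3), 7
toward_goal = step_is_affordable((1, 0), goal, steps_left)       # moves closer
cautious_detour = step_is_affordable((-1, 0), goal, steps_left)  # backs away
print(toward_goal, cautious_detour)
```

A detour that fails this check may still avoid the enemy, but it also guarantees the delivery cannot happen. Surfacing that fact to the planner, or vetoing the step outright, is a design choice the operator has to make explicitly.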

The additional Messenger tests are also revealing. The authors use Mistral-Nemo-12B and try reward shaping, synonym removal, and both together. Removing synonyms only slightly improves pickup and goal rates. Reward shaping increases pickup rates, but does not reliably improve final goal completion. Reflection + Planner with both reward shaping and synonym removal gives the best pickup rate, 47.0%, but goal completion remains only 8.5%, with collisions still at 27.5%.

So the bottleneck is not just vocabulary. The model can understand that a “classified report” is a message and that a “danger” object should be avoided. It still struggles to translate that understanding into grounded, stateful navigation.

For business use, this should dampen enthusiasm for language-only agent evaluations. A workflow agent may understand a ticket category and still fail to navigate the tool sequence. A sales agent may understand the CRM fields and still update the wrong object. A legal assistant may identify the clause and still fail to route the next action correctly. Labels are not control.

The appendix-style tests are diagnosis, not a second thesis

The paper’s additional analyses should not be read as “reward shaping solves dynamic reasoning.” They are narrower and more useful than that.

In Hanoi, reward shaping adds penalties and rewards: -2 for invalid moves, +1 for valid moves, +100 for completion. This helps some measures, but showing valid actions often matters more. In the three-disk version, valid-action hints reduce invalid moves for the baseline from 79.4% to 56.3%, but success remains only 2.0%. In the two-disk version, valid-action hints lift the baseline success rate from 2.0% to 44.0%. That looks less like “the model learned the puzzle” and more like “the system finally stopped asking the model to infer legality from scratch.”
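The shaping scheme is simple enough to write down. The three constants below come from the description above; everything else is an assumption, including the hypothetical names `shaped_reward` and `episode_return` and the choice to add the completion bonus on top of the final valid-move bonus.

```python
from typing import Iterable, Tuple

INVALID_PENALTY = -2
VALID_BONUS = 1
COMPLETION_BONUS = 100


def shaped_reward(move_is_valid: bool, puzzle_solved: bool) -> int:
    if not move_is_valid:
        return INVALID_PENALTY
    # Assumption: the completion bonus is added on top of the valid-move bonus.
    return VALID_BONUS + (COMPLETION_BONUS if puzzle_solved else 0)


def episode_return(steps: Iterable[Tuple[bool, bool]]) -> int:
    """Sum shaped rewards over (move_is_valid, puzzle_solved) pairs."""
    return sum(shaped_reward(valid, solved) for valid, solved in steps)


# Seven valid moves, the last of which solves the three-disk puzzle.
optimal = [(True, False)] * 6 + [(True, True)]
# The same solution preceded by four illegal attempts.
sloppy = [(False, False)] * 4 + optimal

print(episode_return(optimal), episode_return(sloppy))
```

Under these assumptions the completion bonus dominates the return, so the shaped signal mostly distinguishes "solved" from "not solved" and only lightly penalises illegal attempts along the way. That is one plausible reason a hint about which moves are legal can matter more than the reward itself.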

In Messenger, reward shaping increases message pickup but not final delivery. The best adjusted setup, Reflection + Planner with reward shaping and no synonyms, reaches 47.0% pickup but only 8.5% goal completion. That gap is the whole story. The agent can be nudged toward intermediate progress without acquiring robust navigation.

This distinction matters for operators because dense feedback is attractive. It is cheaper than fine-tuning and often easier than redesigning the architecture. But the paper suggests dense feedback is a partial instrument. It can improve local behaviour while leaving the central competence gap intact.

A good agent architecture should therefore separate three questions:

Did the model understand the instruction?
Did the model choose a valid next action?
Did the action move the system toward the terminal objective?
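Those three questions can be logged as three separate fields per step, so that one flattering number cannot hide the others. A minimal sketch follows, with hypothetical names (`StepRecord`, `audit`) and a hypothetical four-step trace invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StepRecord:
    understood: bool   # Q1: instruction or observation interpreted correctly
    valid: bool        # Q2: chosen action was legal in the current state
    progressed: bool   # Q3: action moved the system toward the terminal objective


def audit(records: List[StepRecord]) -> Dict[str, float]:
    """Report each question as its own rate instead of one blended score."""
    if not records:
        raise ValueError("no steps to audit")
    n = len(records)
    return {
        "understood": sum(r.understood for r in records) / n,
        "valid": sum(r.valid for r in records) / n,
        "progressed": sum(r.progressed for r in records) / n,
    }


# Hypothetical trace: the agent always understands, usually acts legally,
# and rarely gets closer to the goal.
trace = [
    StepRecord(True, True, True),
    StepRecord(True, True, False),
    StepRecord(True, False, False),
    StepRecord(True, True, False),
]
report = audit(trace)
print(report)
```

How each field gets filled in is the hard part and depends on the environment: a validator can answer the second question, while the third usually needs a task-specific progress measure.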

A surprising amount of agent discourse treats these as the same question. They are not. That is why the mirror maze keeps producing confident agents walking into glass.

What Cognaptus infers for agent design

The paper directly shows that prompt scaffolds have uneven effects across models, tasks, and environments. It directly shows that advanced prompting can reduce model-size gaps in selected cases while increasing variance. It directly shows failure patterns in Hanoi and Messenger that look like knowing-doing and language-embodiment gaps.

Cognaptus’ business inference is that agent systems should treat reasoning modules as governed components, not universal decorations. Reflection, heuristic mutation, and planning should be activated by task state, monitored for variance, and constrained by validators.

A practical design pattern follows:

Design choice	Paper signal	Business interpretation
Use minimal prompts for simple reactive tasks	Bandit performance drops under extra reasoning	Do not pay for cognitive theatre where direct exploitation works
Gate reflection by anomaly or uncertainty	Reflection can help or distract	Trigger self-critique only when the state justifies it
Keep heuristics versioned and reversible	Oracle can improve but also mislead	Treat learned rules as hypotheses, not truth
Validate planned actions before execution	Hanoi agents verbalise rules yet make illegal moves	Use symbolic or programmatic constraints for high-risk workflows
Track variance, not only median scores	Scaffolded methods show wide min–max ranges	Reliability is an operating metric, not a footnote
Design reward signals around terminal success, not only local progress	Messenger pickup improves while delivery remains weak	Intermediate KPIs can flatter systems that still fail the job
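The "versioned and reversible" row is the least familiar of these, so here is one way it could look. This is a hedged sketch, not the paper's Oracle: `HeuristicStore`, its adoption rule, the heuristic strings, and the scores are all hypothetical.

```python
from typing import List, Tuple


class HeuristicStore:
    """Keeps every adopted heuristic so a bad update can be undone."""

    def __init__(self, initial: str, score: float) -> None:
        self._versions: List[Tuple[str, float]] = [(initial, score)]

    @property
    def active(self) -> str:
        return self._versions[-1][0]

    @property
    def active_score(self) -> float:
        return self._versions[-1][1]

    def propose(self, candidate: str, score: float) -> bool:
        """Adopt the candidate only if it beats the active version's score."""
        if score <= self.active_score:
            return False
        self._versions.append((candidate, score))
        return True

    def rollback(self) -> str:
        """Drop the newest version (never the original) and return the active one."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.active


# Hypothetical heuristics and scores, invented for illustration.
store = HeuristicStore("play each move equally often", score=20.0)
first_adopted = store.propose("counter the opponent's most frequent move", score=26.0)
second_adopted = store.propose("always repeat the last winning move", score=14.0)
active_before = store.active
active_after = store.rollback()
print(first_adopted, second_adopted, active_before, "->", active_after)
```

A single evaluation score is a weak gate when run-to-run spread is wide, so in practice the comparison in `propose` would want several runs per candidate. The structural point stands either way: a heuristic the system wrote for itself should be as easy to revert as a config change.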

This is the difference between an impressive demo and an operational agent. A demo needs a good trajectory. An operation needs a distribution of trajectories that stays inside acceptable bounds.

Where the result applies, and where it does not

The paper is valuable because it is specific, not because it settles the whole reasoning debate.

The experiments use open-source models and SmartPlay-style text environments. Results may differ with newer proprietary models, tool-augmented systems, multimodal perception, stronger external memory, constrained decoders, or programmatic planners. The largest model also makes the experiment expensive, limiting runs to three in the main comparison. That means the min–median–max reporting is informative, but not enough for strong statistical claims.

The Messenger setup is modified from the original SmartPlay horizon: the authors lengthen rollouts from four to ten steps because the original horizon often made the goal unreachable, and they cap training at twenty episodes after observing no further improvement. That is reasonable for the study’s purpose, but it means human baseline comparisons are not clean in Messenger.

The additional Hanoi and Messenger tests use Mistral-Nemo-12B, chosen as a balance between speed and performance. These tests are useful diagnostics, not universal ablations across all models. They tell us what failure modes look like under one practical model choice. They do not prove every model would respond the same way.

Still, the practical warning survives those boundaries. If an agent cannot reliably bind rules, state, rewards, and action validity in a toy dynamic environment, we should be cautious about trusting prompt-only reasoning in messy enterprise workflows. Not terrified. Just sober. A rare and underrated mood in AI.

The mirror maze is not solved by adding more mirrors

The misconception this paper punctures is simple: more explicit reasoning does not automatically make an LLM agent more capable.

Sometimes reflection helps. Sometimes planning helps. Sometimes heuristic mutation lets a smaller model punch above its weight. But sometimes the same additions lengthen the prompt, dilute the signal, encourage overthinking, increase variance, or cause the agent to protect itself from the wrong risk.

The better conclusion is architectural. Reasoning should be treated as a controlled capability within an agent system: invoked when needed, bounded by constraints, checked against state, and evaluated by stability as much as by peak score. Prompt scaffolds are not fake. They are just not magic. Apparently we still have to build the machine around the talking part. Tragic, but manageable.

Cognaptus: Automate the Present, Incubate the Future.

¹ Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, and Anna V. Kononova, “Reasoning Capabilities of Large Language Models on Dynamic Tasks,” arXiv:2505.10543, 2025. https://arxiv.org/abs/2505.10543
