Tools of Thought: Why Reasoning Isn’t an Illusion After All

TL;DR for operators

The useful question is not whether reasoning models “really think”. That debate is charming, mostly because it lets everyone pretend a benchmark table is a metaphysics seminar.

The operational question is simpler: when you give a reasoning model the same tools as a non-reasoning model, does it use them better?

In “Thinking Isn’t an Illusion”, Song, Yue, and Zhang revisit Apple’s “thinking-illusion” puzzle benchmark by adding external tools: Program-of-Thought (PoT) Python execution, Think-and-Execute pseudo-code reasoning, and scratchpad memory.¹ Their finding is not that reasoning models are magically superior. It is narrower and more useful: when the problem can be reformulated into executable code or supported by external state, large reasoning models (LRMs) often outperform their non-reasoning counterparts. DeepSeek-R1 with Program-of-Thought performs strongly on River Crossing and Blocks World, where DeepSeek-V3 largely fails. Qwen 3 Thinking also benefits on Blocks World. But Checker Jumping remains essentially unsolved across the tested setups.

For business readers, the lesson is blunt: stop evaluating “the model” as if it will operate alone in a padded cell. Real AI systems include tools, memory, verifiers, retrieval, execution environments, and workflow constraints. A reasoning model that looks wasteful under raw prompting can become useful when the surrounding scaffold lets it delegate computation. Equally, a tool layer cannot rescue a model that cannot formulate the right abstraction. Tools are not pixie dust. They are leverage, and leverage only helps when something solid is under it.

The benchmark dispute is really a comparison problem

Apple’s earlier “thinking-illusion” benchmark was influential because it made reasoning models look less impressive under controlled puzzle complexity. The tested systems could produce long chains of thought, but those chains did not reliably translate into correct solutions. At low complexity, ordinary models often did fine. At high complexity, both ordinary and reasoning models collapsed. The embarrassing part was not just failure; it was expensive failure, wrapped in many tokens and a confident little bow.

The new paper does not deny that result under its original conditions. Instead, it changes the comparison.

The authors ask what happens when both sides receive external tools. That matters because many of the benchmark puzzles are not merely “think harder” problems. They are state-tracking and sequence-generation problems. Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World all require maintaining constraints over a sequence of moves. A human solving these does not usually recite the whole solution from pure inner radiance. A human writes things down, traces states, uses a procedure, or gives up and calls it “strategic prioritisation”.

The paper’s comparison therefore has four axes:

| Comparison | What changes | Why it matters |
| --- | --- | --- |
| LRM vs LLM | Reasoning variant versus non-reasoning counterpart | Tests whether explicit reasoning behaviour adds value under the same tool access |
| No-tool vs tool-augmented | Direct prompting versus external supports | Separates model limitation from workflow limitation |
| PoT vs scratchpad vs Think-and-Execute | Different tool interfaces | Shows that “tool use” is not one thing |
| Solved vs still-unsolved tasks | Task-specific outcomes | Prevents the lazy conclusion that tools fix reasoning generally |

This is why a comparison-based reading is more revealing than a conventional paper summary. The paper is not a victory lap for reasoning models. It is a controlled argument about where reasoning becomes useful once it can interact with machinery outside the token stream.

What the paper actually adds to Apple’s setup

The study keeps Apple’s four puzzle environments but changes how models are evaluated inside them. The authors test four model configurations: DeepSeek-V3 and DeepSeek-R1, plus Qwen 3 and Qwen 3 Thinking. They repeat each experiment five times and report the number of successful runs out of five. Simpler cases below the tested range are mostly omitted because almost all models perform well there.
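The repeat-five-times protocol is simple enough to sketch. What follows is a minimal illustration, not the authors’ harness: `success_table`, `fake_trial`, the model labels, and the mode labels are all hypothetical names, and the stand-in trial function replaces a real model call plus an automatic verifier.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Tuple[str, str, int]  # (model label, tool mode, puzzle size N)


def success_table(
    run_trial: Callable[[str, str, int, int], bool],
    models: List[str],
    modes: List[str],
    sizes: List[int],
    repeats: int = 5,
) -> Dict[Config, int]:
    """Count successful runs out of `repeats` for every configuration."""
    table: Dict[Config, int] = {}
    for model in models:
        for mode in modes:
            for n in sizes:
                wins = sum(
                    1 for attempt in range(repeats)
                    if run_trial(model, mode, n, attempt)
                )
                table[(model, mode, n)] = wins
    return table


def fake_trial(model: str, mode: str, n: int, attempt: int) -> bool:
    """Hypothetical stand-in for a real model call followed by a verifier."""
    rng = random.Random(f"{model}|{mode}|{n}|{attempt}")  # seeded, so repeatable
    return rng.random() < (0.9 if mode == "pot" else 0.3)


if __name__ == "__main__":
    results = success_table(
        fake_trial, ["model-a", "model-b"], ["direct", "pot"], [3, 5]
    )
    for (model, mode, n), wins in sorted(results.items()):
        print(f"{model:8s} {mode:7s} N={n}: {wins}/5")
```

Keeping the tool mode as an explicit axis of the table, rather than a property of the model, is the design choice that matters here: it is what lets the same model be compared against itself with and without tools.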

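The important change is tool augmentation. The paper evaluates four modes; the final row of the table below covers the supporting sensitivity analyses rather than a fifth mode: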

| Mode | Likely purpose in the paper | What it tests |
| --- | --- | --- |
| Direct prompting | Main baseline | Whether the model can solve the puzzle in one ordinary response |
| Program-of-Thought | Main evidence | Whether the model can generate executable Python that solves the task |
| Think-and-Execute | Tool comparison / implementation contrast | Whether code-like reasoning helps when the model itself “executes” the pseudo-code |
| Scratchpad | External-memory extension | Whether multi-step state storage helps when output length or state tracking becomes the bottleneck |
| Scratchpad length and token studies | Robustness / sensitivity tests | Whether tool use merely spends more tokens or changes how models manage intermediate work |

That distinction matters. The hyperparameter and token-consumption analyses are not a second thesis. They are support beams. The main load-bearing claim comes from the accuracy tables comparing model families and tool modes across puzzle complexity.

The authors’ most important methodological move is to treat the tool interface as part of the evaluation setting. That is sensible. A model deployed in a business workflow rarely acts as a naked chatbot. It calls APIs, writes code, searches documents, updates records, runs calculations, checks schemas, and sometimes politely detonates a spreadsheet. Evaluating model cognition without the tool layer can be useful for scientific isolation, but it is incomplete for operational prediction.

Program-of-Thought is the strongest tool because it stops asking language to be machinery

Program-of-Thought is the cleanest intervention in the paper. The model writes Python; Python executes. This separates task formulation from mechanical execution.

That separation is exactly why PoT performs so well on Tower of Hanoi. In direct prompting, models struggle as the number of disks grows. With PoT, all four tested model variants solve Hanoi from $N=3$ through $N=13$ with 5/5 success. That does not prove the models discovered some deep new recursive wisdom. It shows that once the problem is translated into a standard algorithmic form, the execution burden is no longer trapped inside the model’s output window.
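To make “a standard algorithmic form” concrete, here is a minimal sketch of the kind of program a model might emit under PoT. It is the textbook recursion, not code from the paper; `hanoi_moves`, `replay`, and the peg numbering are names chosen for illustration.

```python
from typing import List, Tuple

Move = Tuple[int, int, int]  # (disk, from_peg, to_peg); disk 1 is the smallest


def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Return the optimal move list for n disks using the textbook recursion."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)      # park the n-1 smaller disks on the spare peg
        + [(n, src, dst)]                      # move the largest disk to its destination
        + hanoi_moves(n - 1, aux, src, dst)    # stack the smaller disks back on top
    )


def replay(n: int, moves: List[Move]) -> bool:
    """Replay moves on three pegs and report whether they legally solve the puzzle."""
    pegs = [list(range(n, 0, -1)), [], []]     # peg 0 starts with n (bottom) .. 1 (top)
    for disk, a, b in moves:
        if not pegs[a] or pegs[a][-1] != disk:
            return False                       # the named disk is not on top of peg a
        if pegs[b] and pegs[b][-1] < disk:
            return False                       # cannot place a larger disk on a smaller one
        pegs[b].append(pegs[a].pop())
    return pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 13):
        moves = hanoi_moves(n)
        print(n, len(moves), replay(n, moves))
```

The optimal solution has $2^N - 1$ moves, so $N=13$ means 8,191 of them. A dozen lines of code produce that list without strain; asking a model to type it out move by move is a different kind of request.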

The more revealing results come from River Crossing and Blocks World.

For River Crossing, DeepSeek-R1 with PoT achieves 4/5 success across the tested sizes from $N=3$ to $N=13$, while DeepSeek-V3 with PoT remains at 0/5 across those same settings. That is the paper’s core comparison in miniature. The same class of tool is available, but the reasoning model uses it more effectively. The tool does not automatically solve the problem. The model still has to formulate the right state constraints and generate code that respects them.
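What “formulating the right state constraints” involves can be sketched as a breadth-first search. This is an illustration under assumptions, not the paper’s code or any model’s output: the safety rule is the classic jealous-pairs constraint (an actor may not be with another agent unless their own agent is present), applied to both banks and to the boat, and `safe`, `solve`, and the boat capacity parameter are hypothetical names.

```python
from collections import deque
from itertools import combinations
from typing import Deque, Dict, FrozenSet, List, Optional, Tuple

Person = Tuple[str, int]                 # ("actor", i) or ("agent", i)
State = Tuple[FrozenSet[Person], bool]   # (people on the left bank, boat on the left?)


def safe(group: FrozenSet[Person]) -> bool:
    """Assumed rule: no actor is with a foreign agent unless their own agent is there."""
    agents = {i for kind, i in group if kind == "agent"}
    actors = {i for kind, i in group if kind == "actor"}
    return not agents or all(i in agents for i in actors)


def solve(n: int, capacity: int) -> Optional[List[FrozenSet[Person]]]:
    """Breadth-first search; returns a shortest list of boat loads, or None."""
    everyone = frozenset((kind, i) for kind in ("actor", "agent") for i in range(n))
    start: State = (everyone, True)
    parent: Dict[State, Optional[Tuple[State, FrozenSet[Person]]]] = {start: None}
    queue: Deque[State] = deque([start])
    while queue:
        state = queue.popleft()
        left, boat_left = state
        if not left:                      # everyone has crossed: rebuild the plan
            plan: List[FrozenSet[Person]] = []
            while parent[state] is not None:
                state, load = parent[state]
                plan.append(load)
            return plan[::-1]
        bank = left if boat_left else everyone - left
        for size in range(1, capacity + 1):
            for combo in combinations(sorted(bank), size):
                load = frozenset(combo)
                new_left = left - load if boat_left else left | load
                # The boat load and both banks must satisfy the rule after the trip.
                if safe(load) and safe(new_left) and safe(everyone - new_left):
                    nxt: State = (new_left, not boat_left)
                    if nxt not in parent:
                        parent[nxt] = (state, load)
                        queue.append(nxt)
    return None


if __name__ == "__main__":
    plan = solve(3, 2)
    print(len(plan) if plan else "no plan found")
```

Almost all of the difficulty lives in `safe` and in choosing the state representation. Get either wrong and the interpreter will search the wrong space with perfect diligence.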

Blocks World shows a similar pattern. DeepSeek-R1 with PoT reaches 5/5 across all tested sizes. Qwen 3 Thinking also reaches 5/5 across all tested sizes. Their non-reasoning counterparts improve much less: DeepSeek-V3 with PoT sits at 1/5 across the tested Blocks World sizes, while Qwen 3 reaches 2/5 across those sizes.

The obvious interpretation is tempting: reasoning models win. The better interpretation is more precise: reasoning models win when the external tool rewards structured problem formulation. That is less glamorous, but far more useful.

A Python interpreter cannot rescue a bad abstraction. It can only execute the abstraction it is given. The LRM advantage appears when the model is better at turning the puzzle into a computational object that the tool can handle. The intelligence is not located purely in the model or purely in the tool. It sits in the fit between them. Annoying, yes. Also how systems work.

Scratchpads help when the problem is memory, not when the problem is strategy

The scratchpad environment gives models an external memory across multiple steps. Instead of forcing the whole solution into one response, the model can write intermediate state, continue later, and stop early when it decides the answer is complete.

This is a reasonable intervention because some benchmark failures may be output-window failures rather than reasoning failures. A long Tower of Hanoi solution can require many moves. Blocks World can require careful state tracking. If the model’s answer is truncated or loses track of prior state, the failure may reflect interface design rather than cognitive incapacity.

Scratchpads do help, but unevenly.

On Blocks World, DeepSeek-R1 with scratchpad support performs strongly through several tested sizes: 5/5 at $N=3$, 5/5 at $N=5$, 3/5 at $N=7$, 4/5 at $N=9$, and 4/5 at $N=11$, before falling to 0/5 at $N=13$. That is meaningfully better than direct prompting for harder Blocks World cases, but it is not the clean scalability of PoT.

The reason is straightforward. A scratchpad extends memory, but the model still performs the reasoning. It must decide what to store, how to update state, how to avoid duplicating moves, and when to stop. In other words, the scratchpad gives the model a notebook, not a solver.
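The notebook-versus-solver point shows up clearly in code. Below is a minimal sketch of a scratchpad loop, assuming a hypothetical `step` callable that stands in for one model call; the paper’s actual scratchpad interface may differ.

```python
from typing import Callable, List, Tuple

# A step function reads the scratchpad and returns (new note, done flag).
StepFn = Callable[[List[str]], Tuple[str, bool]]


def run_with_scratchpad(step: StepFn, max_steps: int) -> List[str]:
    """Call `step` repeatedly, keeping every note outside the model's own context."""
    scratchpad: List[str] = []
    for _ in range(max_steps):
        note, done = step(scratchpad)
        scratchpad.append(note)
        if done:  # the step function, not the loop, decides when to stop
            break
    return scratchpad


def counting_step(scratchpad: List[str]) -> Tuple[str, bool]:
    """Hypothetical stand-in for a model call: emits moves 1..4, then stops."""
    move_number = len(scratchpad) + 1
    return f"move {move_number}", move_number == 4


if __name__ == "__main__":
    print(run_with_scratchpad(counting_step, max_steps=10))
```

Notice what the loop contributes: storage and a step budget. What to write, how to update state, and when to declare victory all remain inside `step`, which is to say inside the model.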

This distinction matters for enterprise systems. External memory is valuable when the workflow bottleneck is context continuity: long case histories, multi-step investigations, document review, compliance trails, customer-support state, or project planning. But scratchpads do not turn weak planners into strong planners. They make it easier for a capable planner to avoid dropping its keys halfway through the task.

Think-and-Execute is the awkward middle child

Think-and-Execute looks attractive on paper. The model writes pseudo-code and then interprets it itself, acting like a compiler. The problem is that this keeps too much burden inside the model. It encourages structure, but it does not provide deterministic execution.

The experimental results reflect that awkwardness. On Blocks World, Think-and-Execute gives DeepSeek-R1 some gains at higher complexity: 2/5 at $N=7$, 2/5 at $N=9$, and 1/5 at $N=11$ and $N=13$. That is better than nothing. It is also not a revolution. For River Crossing and Checker Jumping, it does not unlock much.

This result is useful because it punctures a common product-design mistake: dressing a prompt up as a tool and expecting tool-like reliability. If the “tool” is still just the model simulating a procedure in language, the workflow has not escaped the model’s failure modes. It has merely put them in a lab coat.

PoT works better because the boundary is real. The model produces code; the interpreter executes it. Scratchpads work when they provide real external state. Think-and-Execute is closer to asking the model to role-play a compiler. As enterprise architecture, role-play is not an execution substrate. It is theatre with logging.

Checker Jumping is the result that keeps the paper honest

The most important negative result is Checker Jumping. Across the tested tool modes and model variants, Checker Jumping remains unsolved for the evaluated cases. This is not a small footnote. It is the anti-hype anchor.

Checker Jumping matters because it shows that tools extend reasoning only when the model can discover or encode the right structure. If a task requires a strategy that the model fails to formulate, external execution does not help. Python can faithfully execute the wrong plan. A scratchpad can faithfully preserve confusion. Think-and-Execute can faithfully simulate a mistaken procedure. Very enterprise, really.

This is why the paper should not be read as “reasoning models are generally superior”. It shows something narrower: reasoning models can exploit certain tools better than non-reasoning models on certain structured tasks. Some puzzles remain out of reach. That boundary is not a defect in the paper; it is the part that makes the paper useful.

A business analogue would be an AI procurement assistant. Give it a calculator and it may price scenarios better. Give it a document store and it may cite policies more consistently. Give it a workflow engine and it may route approvals more reliably. But if it misunderstands the procurement rule, every tool in the stack can help it be wrong faster.

The token story is not “tools cost more”

One quiet contribution of the paper is its token-consumption analysis. The authors examine Qwen 3 Thinking across tool-use baselines and benchmark tasks. Their finding is not that tool use always increases cost. In some cases, tool-supported workflows reduce both thinking tokens and output tokens.

That is operationally important. Many teams assume reasoning models are expensive because they produce long intermediate reasoning. Often they are. But the right tool interface can reduce waste by moving mechanical work out of the model and narrowing the reasoning path. The issue is not simply “more reasoning equals more tokens”. The issue is whether the reasoning is organized.

A model forced to simulate every step internally may ramble. A model that can call an interpreter may compress the task into code. A model with a scratchpad may avoid restating state. The savings are not guaranteed, but the design principle is clear: token cost is partly an architecture problem, not merely a model-pricing problem.

This is where the paper connects most directly to business ROI. The cheapest AI system is not always the smallest model. It is the system that allocates work correctly: language for interpretation, tools for execution, memory for state, verifiers for correctness, and humans for the parts where accountability still matters.

What enterprises should infer, and what they should not

The paper directly shows that tool augmentation changes the relative performance of reasoning and non-reasoning models on controlled puzzle tasks. It also shows that different tools have different effects. PoT is strongest in these experiments, scratchpads help selectively, and Think-and-Execute adds little.

Cognaptus would infer five practical rules from that evidence.

| Paper result | Operational inference | Boundary |
| --- | --- | --- |
| LRMs outperform matched LLMs under PoT on several tasks | Evaluate reasoning models inside executable workflows, not only under direct prompting | Strongest for tasks that can be formalised into code or state search |
| Scratchpads help on some Blocks World settings | External memory can improve long-horizon state management | Memory does not supply strategy by itself |
| Think-and-Execute is weak | Simulated tools are not the same as real tools | Prompt structure helps, but deterministic execution helps more |
| Checker Jumping remains unsolved | Some failures are abstraction failures, not memory failures | Tool access cannot compensate for missing problem formulation |
| Tool use need not increase tokens | Architecture can reduce reasoning waste | Token results are model- and task-dependent |

The wrong takeaway is “buy reasoning models and add tools”. That is procurement astrology.

The better takeaway is to evaluate the complete reasoning workflow. A serious enterprise benchmark should test the model, the tool calls, the memory design, the verifier, the failure recovery loop, and the handoff to humans. The question is not whether the model sounds thoughtful. The question is whether the system produces correct actions under operational constraints.

The boundary: synthetic puzzles are useful, not sufficient

The study uses synthetic puzzle environments because they are controllable and automatically verifiable. That is a strength. It lets the authors vary complexity and compare systems under clear correctness criteria. Business workflows rarely offer that luxury. Invoices are messy. Policies contradict each other. Customers omit the one detail that matters. Internal systems have APIs designed during someone’s villain era.
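“Automatically verifiable” is worth seeing in code, because it is exactly what most business workflows lack. The sketch below checks a Blocks World plan under a simplified rule set assumed for illustration (only the top block of a stack may move, and any stack may receive it); `verify_plan` and the stack names are hypothetical and not taken from the benchmark.

```python
from typing import Dict, List, Tuple

Stacks = Dict[str, List[str]]   # stack name -> blocks, bottom first
Move = Tuple[str, str, str]     # (block, from_stack, to_stack)


def verify_plan(initial: Stacks, goal: Stacks, plan: List[Move]) -> Tuple[bool, str]:
    """Replay a plan step by step and report the first violation, if any."""
    stacks = {name: list(blocks) for name, blocks in initial.items()}  # work on a copy
    for step, (block, src, dst) in enumerate(plan, start=1):
        if src not in stacks or dst not in stacks:
            return False, f"step {step}: unknown stack"
        if not stacks[src] or stacks[src][-1] != block:
            return False, f"step {step}: {block} is not on top of {src}"
        stacks[dst].append(stacks[src].pop())
    if stacks != goal:
        return False, "plan is legal but does not reach the goal"
    return True, "ok"


if __name__ == "__main__":
    initial = {"s1": ["A", "B"], "s2": [], "s3": []}
    goal = {"s1": [], "s2": ["B", "A"], "s3": []}
    print(verify_plan(initial, goal, [("B", "s1", "s2"), ("A", "s1", "s2")]))
    print(verify_plan(initial, goal, [("A", "s1", "s2")]))
```

A checker like this returns the same verdict every time and points at the failing step. An invoice dispute or a policy conflict rarely comes with an equivalent, which is why the puzzle results transfer as mechanisms rather than as guarantees.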

So the paper should not be treated as a universal claim about enterprise AI performance. It tests four puzzle families and four model variants, with each experiment repeated five times under specific tool interfaces. It is best read as evidence about mechanisms, not as a deployment guarantee.

The tasks also favour certain kinds of tool use. Program-of-Thought shines when a problem can be converted into executable Python. That maps well to algorithmic planning, symbolic manipulation, constraint checking, simulation, and structured calculations. It maps less directly to ambiguous negotiation, strategic prioritisation, legal judgement, customer empathy, or product discovery. Naturally, those are the tasks executives most want to automate by Friday.

The paper’s contribution is therefore not a finished recipe. It is a correction to evaluation habits. If the real system will use tools, the benchmark should include tools. If the real system needs memory, the benchmark should include memory. If the real system requires verification, the benchmark should include verification. Otherwise we are testing a deliberately disabled version of the thing we plan to deploy, then acting surprised when the conclusion is distorted.

The illusion was never thinking; it was isolation

The phrase “reasoning is an illusion” was always too neat. The new paper does not prove that reasoning models possess deep general intelligence. It shows something more grounded: under the right tool conditions, reasoning behaviour can become operationally useful.

That is enough.

For operators, the lesson is not philosophical. Build evaluations around the actual work loop. Compare reasoning and non-reasoning models with equal access to tools. Measure not only final accuracy, but tool formulation, execution reliability, token cost, state management, and recovery from failure. Then ask which component created the improvement.

Sometimes the answer will be the model. Sometimes it will be the tool. Usually it will be the interface between them, because reality likes being inconvenient.

Thinking is not an illusion. But neither is it a solo performance. In practical AI systems, intelligence is increasingly assembled: model, memory, code, verifier, workflow. The companies that understand that will build systems that reason with tools. The companies that do not will continue benchmarking chatbots in empty rooms and calling it strategy.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhao Song, Song Yue, and Jiahao Zhang, “Thinking Isn’t an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations,” arXiv:2507.17699, 2025, https://arxiv.org/abs/2507.17699.
