In mid-2025, Apple’s now-infamous “thinking-illusion” benchmark delivered a sobering verdict: large reasoning models (LRMs)—those step-by-step thinkers like DeepSeek-R1 and Qwen 3 Thinking—failed to show meaningful advantages over simpler LLMs. Their verbose, reflective outputs didn’t help on easy problems, nor did they scale on hard ones. In some cases, they even underperformed.

But what if we were judging thinking models under unfair conditions?

A new study titled “Thinking Isn’t an Illusion” argues that the problem isn’t with reasoning itself—it’s with reasoning in a vacuum. When these models are augmented with tools like Python interpreters and structured scratchpads, their performance transforms dramatically. In fact, they begin to consistently outperform their non-reasoning counterparts across a diverse set of logic puzzles.

Reframing the Benchmark: From Illusion to Integration

The original “thinking-illusion” benchmark featured classic symbolic puzzles like:

  • Tower of Hanoi (recursive logic)
  • Checker Jumping (combinatorial planning)
  • River Crossing (state constraints)
  • Blocks World (hierarchical planning)

Apple’s key finding was that LRMs generated long outputs without added accuracy. But crucially, those tests imposed token limits, a practical ceiling that penalized long-form reasoning: some Hanoi instances need over 100,000 tokens to write out move by move, while the models were capped at 32K or 64K.
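
The arithmetic behind that ceiling is easy to sketch: an N-disk Tower of Hanoi needs 2^N - 1 moves, and every move costs tokens to write out. The snippet below is a back-of-the-envelope illustration (the tokens-per-move figure is an assumption, not a number from either paper):

```python
# Rough, illustrative estimate of what a fully written-out Hanoi solution
# costs in tokens. TOKENS_PER_MOVE is an assumption for illustration only.

def hanoi_moves(n: int) -> int:
    """Minimum number of moves for an n-disk Tower of Hanoi puzzle: 2^n - 1."""
    return 2 ** n - 1

TOKENS_PER_MOVE = 8  # e.g. "Move disk 3 from peg A to peg C."

for n in (10, 13, 15):
    moves = hanoi_moves(n)
    print(f"N={n:2d}: {moves:>6,} moves = roughly {moves * TOKENS_PER_MOVE:>8,} tokens")
```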

The new paper fixes this by introducing external tool use:

Tool Type            Description
Python Interpreter   Executes model-generated code to offload logical computation
Scratchpad Memory    Stores intermediate state to allow multi-step, multi-turn reasoning

This isn’t just an engineering patch. It’s a philosophical one. It reframes reasoning not as a soliloquy but as a dialogue with tools.
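
Concretely, offloading to a Python interpreter can be as simple as the loop sketched below. This is a minimal illustration under assumed details: `call_model`, the prompt wording, and the code-extraction step are placeholders, not the paper's actual harness.

```python
import contextlib
import io
import re

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g. DeepSeek-R1 or Qwen 3 Thinking)."""
    raise NotImplementedError

def solve_with_python_tool(puzzle_description: str) -> str:
    # 1. Ask the model to write a program instead of narrating every move.
    prompt = (
        "Write a self-contained Python program that solves the puzzle below "
        "and prints the final answer.\n\n" + puzzle_description
    )
    reply = call_model(prompt)

    # 2. Extract the fenced code block, if any, from the model's reply.
    fence = "`" * 3
    match = re.search(fence + r"(?:python)?\n(.*?)" + fence, reply, re.DOTALL)
    code = match.group(1) if match else reply

    # 3. Offload the computation to the Python interpreter and capture stdout.
    #    (A real harness would sandbox this with time and memory limits.)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()
```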

The Setup: Four Models, Four Conditions, One Truth

The study compares:

  • LLMs: DeepSeek-V3, Qwen 3
  • LRMs: DeepSeek-R1 (with “reasoner” mode), Qwen 3 Thinking

Each model is tested under:

  • Direct Prompting (no tools)
  • Think-and-Execute (the model drafts pseudocode, then simulates executing it)
  • Program-of-Thought (PoT) (generate + execute Python)
  • Scratchpad (structured external memory)

The headline finding? PoT is a game-changer.

Task             DeepSeek-R1 Accuracy (Direct)   DeepSeek-R1 Accuracy (PoT)
River Crossing   0/5                             4/5
Blocks World     5/5 (N=3), 0/5 (N=7)            5/5 (N=13)

Even the notoriously hard Hanoi with N=13 became solvable with code execution. Meanwhile, models without reasoning abilities or tool use remained stuck on trivial cases.
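
The contrast is easy to see on Hanoi: under PoT the model only needs to emit a textbook solver like the sketch below (shown for illustration, not taken from the paper), and the interpreter does the exponential legwork.

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list:
    """Return the complete move list for an n-disk Tower of Hanoi puzzle."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # clear the n-1 smaller disks
        + [(source, target)]                   # move the largest disk
        + hanoi(n - 1, spare, target, source)  # restack the smaller disks on top
    )

moves = hanoi(13)
print(len(moves))  # 8191 correct moves from roughly ten lines of code
```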

Why PoT Beats Scratchpads (and Everyone Else)

Interestingly, scratchpads also helped, but not nearly as much. Why?

  • Scratchpads require the model to break up its output over multiple rounds, managing state transitions manually.
  • PoT hands off the computation to Python, which is deterministic and far more scalable for recursive or symbolic tasks.

Think of it like this:

Scratchpad = Mental Math Notebook

PoT = Outsourced Calculator

PoT aligns better with how real reasoning systems (and humans) solve complex tasks: by delegating the grunt work.
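
For contrast, a scratchpad harness keeps the model in the loop at every step: the host stores the state, but the model must still propose each transition itself. Here is a rough sketch, with hypothetical prompt wording and helper names rather than the paper's protocol:

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError

def solve_with_scratchpad(initial_state: dict, max_turns: int = 200) -> dict:
    state = initial_state
    for _ in range(max_turns):
        # The host stores the state between turns, but the model still has to
        # propose every single transition itself.
        prompt = (
            "Current puzzle state (JSON):\n" + json.dumps(state) + "\n"
            'Reply with JSON of the form {"done": true/false, "next_state": {...}}.'
        )
        reply = json.loads(call_model(prompt))
        if reply["done"]:
            return state
        state = reply["next_state"]
    raise RuntimeError("Ran out of turns before reaching a solved state")
```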

But Wait: Checker Jumping Still Defies All

One fascinating caveat: no model solved Checker Jumping for N≥3. Not even DeepSeek-R1 with full tool access. This reveals a deeper issue: tools can extend memory and precision, but they don’t give insight. Some problems require genuine strategic generalization, which remains a blind spot for even the best agents.

This echoes a broader trend we’ve seen across Cognaptus Insights: LLMs can simulate problem solving, but they still lack a reliable planner.

Implications: Tools as Catalysts, Not Crutches

This paper doesn’t just rebut Apple’s critique. It provides a clear prescription for future evaluation:

  • Reasoning is only real when it interfaces with tools.
  • Benchmarks must include tool integration by default.
  • Model evaluations should separate “talking about reasoning” from “doing reasoning.”

It also makes a quiet but vital point: Reasoning is expensive in tokens only when we force LLMs to simulate everything. With tools, they become leaner, smarter, and more scalable.

Final Thought: The Return of the Reasoner

“Thinking isn’t an illusion”—but it is incomplete without augmentation.

This study elegantly reframes LRMs not as failed philosophers, but as emerging tool users, capable of real problem solving when embedded in the right context.

The illusion wasn’t in the thinking. It was in our testing.


Cognaptus: Automate the Present, Incubate the Future.