From Reaction to Reflection

Modern AI models, especially language models, are stunningly capable at answering our queries. But what happens when there is no query? Can an AI reason about the world not just in reaction to prompts, but proactively — triggered by internal goals, simulated futures, and visual imagination? That’s the central question Slimane Larabi explores in his latest paper: “Can Mental Imagery Improve the Thinking Capabilities of AI Systems?”

His answer is a bold one: Yes — but only if we teach machines to imagine.

The Architecture of Artificial Thought

At the heart of the proposal is a novel machine thinking framework designed to mimic some core features of human cognition. It consists of four tightly coupled components:

  • Cognitive Thinking Unit (CTU): the central reasoner. It initiates, guides, and reflects on reasoning tasks.
  • Needs Unit: stores internal goals and planned actions (e.g., “unlock the door”), which act as thinking triggers.
  • Input Data Unit: converts real-world inputs (vision, sound, touch) into structured language using models such as Faster R-CNN, BLIP, and Wav2Vec.
  • Mental Imagery Unit: generates visual simulations, i.e., imagined scenes produced from CTU prompts.

This isn’t just a pipeline; it’s a feedback loop. The CTU doesn’t merely reason about incoming data: it initiates simulations, interprets their implications, and even refines its goals based on imagined outcomes.
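To make the division of labor concrete, here is a minimal Python sketch of how the four units and their feedback loop could be wired together. The class names mirror the paper’s units, but the method bodies, the hard-coded plan, and the contradiction check are illustrative placeholders, not the authors’ implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NeedsUnit:
    """Stores internal goals that trigger thinking (e.g., 'unlock the door')."""
    goals: list[str] = field(default_factory=list)

    def next_goal(self):
        return self.goals.pop(0) if self.goals else None


class InputDataUnit:
    """Turns perception into language; the paper uses Faster R-CNN, BLIP, and Wav2Vec."""

    def describe_scene(self):
        # Stand-in for real captioning output.
        return ["a laptop with a bunch of keys on it", "a closed wooden door"]


class MentalImageryUnit:
    """Imagines scenes from prompts; the paper uses Stable Diffusion sketches."""

    def imagine(self, prompt):
        return f"pencil sketch of: {prompt}"


class CognitiveThinkingUnit:
    """Central reasoner: picks a goal, matches context, plans, simulates, revises."""

    def __init__(self, needs, inputs, imagery):
        self.needs, self.inputs, self.imagery = needs, inputs, imagery

    def think(self):
        goal = self.needs.next_goal()
        if goal is None:
            return "idle"  # no internal need, so no thinking is triggered
        context = self.inputs.describe_scene()[0]
        plan = f"pick up the keys and open the door (goal: {goal})"
        sketch = self.imagery.imagine(plan)
        # Feedback loop: if the imagined outcome contradicts the plan, replan.
        if "key doesn't work" in sketch:
            plan = "break the door"
        return plan


agent = CognitiveThinkingUnit(
    NeedsUnit(["I need the keys to open the door"]),
    InputDataUnit(),
    MentalImageryUnit(),
)
print(agent.think())
```

The point of the toy loop is the control flow: the need, not an external prompt, is what starts the reasoning, and the imagined outcome is what decides whether the plan survives.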

How Imagination Guides Reasoning

Let’s consider a simple need: “I need the keys to open the door.” Here’s how the system proceeds:

  1. Context Matching: The Input Data Unit provides scene descriptions such as “a laptop with a bunch of keys on it.” The CTU uses sentence embeddings to match the most relevant description to the goal (see the embedding sketch after this list).
  2. Action Inference: Using LLMs, the CTU formulates a plan: “Pick up the keys and open the door.”
  3. Mental Simulation: The Mental Imagery Unit (via Stable Diffusion) generates sketch-style visuals for each step. If reasoning hits a contradiction (e.g., “the key doesn’t work”), the CTU loops back, generating new plans such as “call the firefighters” or “break the door.”
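Step 1 is essentially a nearest-neighbour search in embedding space. The snippet below is a minimal sketch using the open-source sentence-transformers library; the all-MiniLM-L6-v2 model and the example descriptions are assumptions for illustration, not details taken from the paper.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any sentence-embedding model would work similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

goal = "I need the keys to open the door"
scene_descriptions = [
    "a laptop with a bunch of keys on it",
    "a cup of coffee next to a phone",
    "a closed wooden door at the end of a hallway",
]

# Embed the goal and every scene description, then rank by cosine similarity.
goal_emb = model.encode(goal, convert_to_tensor=True)
scene_embs = model.encode(scene_descriptions, convert_to_tensor=True)
scores = util.cos_sim(goal_emb, scene_embs)[0]

best = int(scores.argmax())
print(f"Best-matching context: {scene_descriptions[best]!r} (score={scores[best].item():.2f})")
```

The highest-scoring description becomes the context the CTU plans against in step 2.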

The result is a dynamic simulation loop: needs → hypotheses → imagined actions → reevaluated plans.

Why Sketches, Not Photorealism?

One fascinating design choice is the use of pencil-sketch abstraction rather than detailed renders. This isn’t just aesthetic. Sketches, like mental images in humans, strip away visual noise and emphasize structure. They help the system focus on spatial layout and object relations — critical for reasoning — rather than irrelevant textures or colors.

This aligns with prior research (e.g., Kunda 2018) that suggests imagery-based reasoning can bypass symbolic logic’s bottlenecks by allowing direct scene manipulation. In short, sketches think differently than words.
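As a rough illustration, this is how sketch-style imagery might be requested from a standard Stable Diffusion checkpoint through the diffusers library. The checkpoint name, prompt wording, and step count are assumptions for the example, not the paper’s configuration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint and settings; the paper's exact Stable Diffusion setup may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A sketch-style prompt strips away texture and color, keeping the spatial layout
# and object relations that the reasoning step actually needs.
prompt = "rough pencil sketch, line drawing: a hand picking up keys from a laptop to unlock a door"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("imagined_step.png")
```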

The Broader Significance: Toward Autonomous Agents

Most current LLMs and vision-language models are passive. They wait. This paper moves toward agents that initiate reasoning when internal motivations arise — like hunger does in animals.

Moreover, by integrating imagination, these agents can:

  • Simulate consequences before acting
  • Evaluate alternative plans visually
  • Learn not just from data, but from hypothetical futures

It’s a foundational shift from “language models” to what might be called cognitive agents with internal worlds.

Caveats and Future Directions

While promising, the current system still relies on predefined needs and sequences. It doesn’t yet learn or evolve its needs over time. Nor can it imagine entirely novel scenarios ungrounded in current input.

But these are solvable. The real innovation here is architectural: giving imagery an equal seat at the reasoning table. As Larabi’s work shows, a mental sketch might be worth more than a thousand words — if you want a machine that truly thinks.


Cognaptus: Automate the Present, Incubate the Future