Opening — Why this matters now
For the past two years, the dominant narrative in AI has been simple: if your agent isn’t powered by a large language model at every step, it’s probably underpowered. More tokens, more reasoning, more capability.
This paper quietly dismantles that assumption.
It asks a more uncomfortable question: what if most of the intelligence we attribute to LLM agents isn’t coming from the LLM at all?
Instead of chasing larger models or more frequent calls, the authors propose something far less glamorous—but far more actionable: make the agent’s thinking structure explicit, measurable, and, where possible, independent of the LLM.
Background — The illusion of monolithic intelligence
Modern LLM agents tend to bundle everything into a single loop:
- World modeling
- Planning
- Reflection
- Action
All wrapped inside prompt engineering and latent reasoning.
This works. But it creates a fundamental problem: you can’t tell where the intelligence actually comes from.
Is the model reasoning? Or is it just following a well-structured prompt scaffold? Is “reflection” real, or just narrative post-hoc rationalization?
Previous frameworks—ReAct, Reflexion, and program-guided agents—lean heavily on the LLM as the central orchestrator. Even when structure exists, it is often generated or controlled by the model itself.
This paper flips the paradigm.
It treats the LLM not as the brain—but as a conditional tool.
Analysis — Decomposing the agent
The core innovation is a declared reflective runtime protocol.
Instead of embedding reasoning inside prompts, the agent is decomposed into four explicit, inspectable layers:
| Layer | Function | LLM Dependency |
|---|---|---|
| Belief Tracking | Maintains posterior over states | None |
| World-Model Planning | Simulates actions and strategies | None |
| Symbolic Reflection | Tracks errors, confidence, and triggers revision | None |
| LLM Revision | Adjusts policy under uncertainty | Sparse |
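The first two layers can be sketched in plain code with no LLM in the loop. The following is a minimal illustration, not the paper's actual API; all class and method names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class BeliefTracker:
    """Maintains a posterior over world states. No LLM involved."""
    posterior: dict = field(default_factory=lambda: {"s0": 0.5, "s1": 0.5})

    def update(self, observation_likelihoods: dict) -> None:
        # Bayesian update: posterior ∝ prior × likelihood, then normalize.
        unnorm = {s: p * observation_likelihoods.get(s, 1.0)
                  for s, p in self.posterior.items()}
        total = sum(unnorm.values())
        self.posterior = {s: p / total for s, p in unnorm.items()}

@dataclass
class WorldModelPlanner:
    """Simulates candidate actions against the belief state. No LLM."""
    def plan(self, belief: dict, actions: list, reward_fn) -> str:
        # Pick the action with the highest expected reward under the belief.
        return max(actions, key=lambda a: sum(p * reward_fn(a, s)
                                              for s, p in belief.items()))

# Demo: evidence favoring state s0 shifts the belief, and the planner
# selects the action that pays off in that state.
tracker = BeliefTracker()
tracker.update({"s0": 0.9, "s1": 0.1})
planner = WorldModelPlanner()
best = planner.plan(tracker.posterior, ["a", "b"],
                    lambda a, s: 1.0 if (a == "a" and s == "s0") else 0.0)
```

Because both layers are ordinary code, their state is fully inspectable at every step, which is exactly the observability argument below.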
This is not just architectural hygiene. It is instrumentation.
By externalizing state, confidence, and revision logic, the agent becomes observable. You can measure:
- When it is confident
- When it is wrong
- When it decides to revise
- Whether that revision helps
In other words, reflection becomes a runtime mechanism, not a storytelling device.
The reflective loop (explicitly defined)
The agent follows a structured cycle:
- Simulate outcomes
- Predict results
- Execute actions
- Compare prediction vs reality
- Update confidence
- Trigger revision if needed
The key shift: revision is gated, computed, and inspectable—not prompted.
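The cycle above can be written as a few lines of ordinary code. This is an illustrative sketch (the update rule, threshold, and names are assumptions, not the paper's implementation):

```python
def reflective_step(predict, execute, confidence, threshold=0.4):
    """One cycle of a gated reflective loop: predict, act, compare,
    update confidence, and trigger revision only below a threshold."""
    predicted = predict()   # simulate the expected outcome before acting
    actual = execute()      # take the action in the environment
    error = abs(predicted - actual)
    # Simple exponential confidence update: large errors erode confidence.
    confidence = 0.8 * confidence + 0.2 * (1.0 - min(error, 1.0))
    revise = confidence < threshold  # revision is gated and inspectable
    return confidence, revise

# Demo: an accurate predictor keeps confidence high and the gate closed...
conf = 0.9
conf, revise_ok = reflective_step(lambda: 1.0, lambda: 1.0, conf)
# ...while repeated large prediction errors eventually trip the gate.
for _ in range(10):
    conf, revise_bad = reflective_step(lambda: 1.0, lambda: 0.0, conf)
```

The point of the sketch: whether revision fires is a computed, loggable fact about `confidence` and `threshold`, not a judgment buried in a prompt.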
Findings — What actually drives performance
1. Structure beats LLMs (by a wide margin)
| System | Avg F1 | Win Rate |
|---|---|---|
| Greedy baseline | 0.522 | 50.0% |
| World-model planning | 0.539 | 74.1% |
The largest gain comes from explicit planning, not LLM reasoning.
A +24.1 percentage point increase in win rate—with zero LLM usage.
This is not a marginal improvement. It’s a reallocation of credit.
2. Reflection exists without LLMs—but it’s fragile
| Configuration | Avg F1 | Win Rate |
|---|---|---|
| Reflection OFF | 0.552 | 57.4% |
| Reflection ON | 0.551 | 55.6% |
Symbolic self-revision works as a mechanism—but not yet as a consistently beneficial one.
It helps in some cases (recovering from poor belief states), but overfires in others.
Translation: reflection is real, but calibration is everything.
3. LLMs help… a little. And not reliably.
| Configuration | Avg F1 | Win Rate | LLM Usage |
|---|---|---|---|
| No LLM | 0.552 | 57.4% | 0% |
| Sparse LLM (~4.3%) | 0.557 | 53.7% | 4.3% |
This is the most interesting result.
- F1 improves slightly
- Win rate declines
The effect is mixed rather than uniformly positive.
The LLM improves local decision quality—but can disrupt global strategy.
It is not a universal enhancer. It is a noisy intervention mechanism.
Implications — Rethinking agent design
1. The LLM is not the system
The dominant industry pattern treats the LLM as the core runtime.
This paper suggests the opposite:
The LLM is a residual component—used only where structure fails.
This has immediate implications for cost, latency, and reliability.
2. Observability becomes a competitive advantage
Once reflection is externalized, you gain:
- Debuggable reasoning
- Measurable confidence
- Controllable revision policies
This is not just engineering hygiene—it’s governance infrastructure.
For enterprises, this directly maps to:
- Auditability
- Compliance
- Risk control
3. Sparse intelligence beats constant intelligence
Calling an LLM on every step means:
- High cost
- High variance
- Low control
A structured agent with sparse LLM intervention is:
- Cheaper
- More stable
- More interpretable
And, in many cases, just as capable.
4. The real bottleneck is calibration, not capability
The paper’s weakest-performing component—symbolic reflection—is also the most promising.
Why?
Because its failures are diagnosable.
Unlike opaque LLM reasoning, you can improve:
- Confidence thresholds
- Revision triggers
- Policy updates
This shifts progress from scaling models to engineering better control systems.
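As one concrete example of such control-system engineering, a revision threshold can be calibrated offline against logged outcomes. The logged data and cost model here are invented for illustration:

```python
# Hypothetical log of (confidence, was_wrong) pairs from past episodes.
logged = [(0.9, False), (0.8, False), (0.35, True), (0.2, True),
          (0.6, False), (0.25, True), (0.7, False), (0.4, True)]

def net_benefit(threshold, cost=0.2, gain=1.0):
    """Reward revisions that catch real errors; charge a fixed cost
    for every revision that fires, useful or not."""
    score = 0.0
    for conf, was_wrong in logged:
        if conf < threshold:  # revision fires below the threshold
            score += (gain if was_wrong else 0.0) - cost
    return score

# Sweep candidate thresholds and keep the one with the best trade-off.
best_threshold = max((t / 10 for t in range(1, 10)), key=net_benefit)
```

This is the kind of diagnosable, offline-tunable failure mode that opaque in-prompt reflection cannot offer.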
Conclusion — Intelligence is structure, not scale
This paper doesn’t claim that LLMs are unnecessary.
It makes a more precise argument:
Most of what we call “agent intelligence” comes from structure—and the LLM fills in the gaps.
Once you expose that structure, something interesting happens:
- You need fewer LLM calls
- You understand your system better
- You can improve it systematically
The question is no longer:
“How powerful is your model?”
But rather:
“Where does your system actually need intelligence—and where can it be engineered?”
That distinction is where the next generation of agent systems will be built.
Cognaptus: Automate the Present, Incubate the Future.