Opening — Why this matters now
LLMs have learned how to explain themselves. What they still struggle with is learning from those explanations. Reflexion was supposed to close that gap: let the model fail, reflect in natural language, try again — no gradients, no retraining, just verbal reinforcement. Elegant. Cheap. And, as this paper demonstrates, fundamentally limited.
The problem is not that LLMs cannot reflect. It is that a single mind reflecting on its own mistakes is remarkably good at protecting its original beliefs.
Background — The promise and ceiling of Reflexion
Reflexion reframes learning as memory rather than parameter updates. After a failed attempt, the model writes a short critique of what went wrong and stores it as episodic memory. Future attempts are conditioned on this growing archive of self-advice.
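Concretely, the loop is tiny. Here is a minimal sketch in Python, assuming `llm` is a text-completion callable and `checker` is a task-specific success test (both are hypothetical stand-ins, not the paper's actual interface):

```python
def reflexion_loop(task: str, llm, checker, max_trials: int = 4) -> str:
    """Verbal reinforcement: no gradients, all learning lives in the prompt."""
    memory: list[str] = []  # episodic memory of self-critiques
    answer = ""

    for _ in range(max_trials):
        # Condition each new attempt on the growing archive of self-advice
        answer = llm(f"Task: {task}\nPast reflections:\n" + "\n".join(memory))

        if checker(answer):  # external success signal (unit tests, EM, etc.)
            return answer

        # On failure, the same model critiques its own attempt...
        critique = llm(
            f"Task: {task}\nFailed answer: {answer}\n"
            "In two or three sentences, explain what went wrong and how to fix it."
        )
        memory.append(critique)  # ...and that critique is all it learns from

    return answer  # best effort after exhausting the trial budget
```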
In theory, this mirrors reinforcement learning without the compute bill. In practice, the replication in this paper exposes a structural flaw: the same model acts as policy, judge, and therapist.
That creates two pathologies:
| Failure mode | What actually happens |
|---|---|
| Confirmation bias | The reflection restates the original flawed logic, just more confidently |
| Mode collapse | Subsequent retries reproduce nearly identical reasoning traces |
The model is not correcting itself — it is rationalizing itself.
Analysis — Multi-Agent Reflexion (MAR)
The core insight of MAR is brutally simple: stop asking a single model to be objective about its own mistakes.
Instead, MAR decomposes reflection into a structured social process:
- Actor attempts the task (as in standard Reflexion)
- Evaluator determines success or failure
- Critic agents (with distinct personas) independently diagnose the failure
- Debate rounds surface disagreements and alternative hypotheses
- Judge agent synthesizes a consensus reflection
- The Actor retries, conditioned on this aggregated critique
Crucially, these critics are not stylistic variants. They are deliberately polarized along axes like:
- Evidence strictness
- Exploratory behavior
- Specification literalism
Think Verifier vs Skeptic vs Logician vs Creative — not four copies of the same intern.
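A rough sketch of how that pipeline composes, in Python. The persona prompts, function names, and debate format below are illustrative assumptions, not the paper's exact prompts or code:

```python
# Hypothetical persona prompts, polarized along the axes listed above
PERSONAS = {
    "Verifier": "Demand explicit evidence for every claim.",
    "Skeptic": "Assume the reasoning is wrong; hunt for the weakest step.",
    "Logician": "Check the attempt against the literal task specification.",
    "Creative": "Propose an alternative approach the actor has not tried.",
}

def mar_step(task: str, failed_answer: str, llm, debate_rounds: int = 1) -> str:
    """One multi-agent reflection step: independent critics -> debate -> judged consensus."""
    # Each critic diagnoses the failure independently, under its own persona
    critiques = {
        name: llm(f"{persona}\nTask: {task}\nFailed attempt: {failed_answer}\n"
                  "Diagnose the failure.")
        for name, persona in PERSONAS.items()
    }

    # Debate: every critic sees the others' diagnoses and may revise or rebut
    for _ in range(debate_rounds):
        transcript = "\n".join(f"{n}: {c}" for n, c in critiques.items())
        critiques = {
            name: llm(f"{PERSONAS[name]}\nOther critiques:\n{transcript}\n"
                      "Revise your diagnosis or rebut the others.")
            for name in critiques
        }

    # Judge synthesizes a single consensus reflection for the actor's next attempt
    transcript = "\n".join(f"{n}: {c}" for n, c in critiques.items())
    return llm(f"Task: {task}\nCritic debate:\n{transcript}\n"
               "Write one concrete reflection the actor should follow on the next try.")
```

The consensus reflection returned here is what conditions the Actor's retry, in place of the single self-critique from standard Reflexion.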
Findings — What changes when models disagree
The empirical results are modest but consistent.
HotPotQA (Exact Match)
| Method | EM (%) |
|---|---|
| ReAct (GPT-3.5) | 32 |
| Reflexion | 44 |
| MAR | 47 |
The gain looks small until you inspect the failure cases. MAR frequently produces semantically correct answers that Exact Match refuses to credit. In other words, the agent improves faster than the metric can acknowledge.
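To see why, consider how unforgiving Exact Match is. A toy scorer (not the benchmark's official normalization, which typically also strips articles and punctuation) makes the point:

```python
def exact_match(prediction: str, gold: str) -> bool:
    # Simplified EM normalization: lowercase and trim surrounding whitespace
    return prediction.strip().lower() == gold.strip().lower()

# A semantically correct answer that EM refuses to credit
print(exact_match("President Richard Nixon", "Richard Nixon"))  # False
print(exact_match("Richard Nixon", "Richard Nixon"))            # True
```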
HumanEval (pass@1)
| Method | pass@1 (%) |
|---|---|
| GPT-3.5 baseline | 67.1 |
| Reflexion | 76.4 |
| MAR | 82.6 |
Here the effect is unambiguous. Multi-agent critique breaks repetitive bugs: off-by-one errors, repeated loop structures, and hallucinated specs that single-agent Reflexion keeps reinforcing.
Implications — What this means for real systems
Three implications matter beyond benchmarks:
1. Reflection quality beats reflection frequency. More retries do not help if the feedback is internally correlated. MAR improves not by trying harder, but by disagreeing earlier.
2. Metrics shape learning trajectories. Exact Match actively misleads reflective agents. If your reward signal is brittle, self-improvement will drift — confidently.
3. Cost is the new bottleneck. MAR is ~3× more expensive in tokens and latency. This is not free intelligence — it is purchased diversity.
For production agent systems, this suggests hybrid strategies: invoke multi-agent reflection selectively, only when stagnation or repeated failure is detected.
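One way to implement that gate, as a hedged sketch: the similarity heuristic, threshold, and function names below are assumptions, not something the paper specifies.

```python
from difflib import SequenceMatcher

def is_stagnating(attempts: list[str], similarity_threshold: float = 0.9) -> bool:
    """Heuristic stagnation check: the last two failed attempts are near-duplicates."""
    if len(attempts) < 2:
        return False
    ratio = SequenceMatcher(None, attempts[-1], attempts[-2]).ratio()
    return ratio >= similarity_threshold

def reflect(task: str, attempts: list[str], single_reflect, multi_reflect) -> str:
    """Escalate to costly multi-agent reflection only when cheap reflection stalls."""
    if is_stagnating(attempts):
        return multi_reflect(task, attempts)   # ~3x tokens, but diverse critics
    return single_reflect(task, attempts)      # default: cheap single-agent critique
```

The point of the design is that the expensive disagreement is purchased only when the cheap introspection has visibly stopped producing new ideas.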
Conclusion — Reflection works better as a conversation
Single-agent Reflexion fails for the same reason humans do: introspection without challenge entrenches beliefs. Multi-Agent Reflexion does not make models smarter — it makes them less lonely.
And it turns out, thinking improves dramatically once someone else is allowed to say, “I disagree.”
Cognaptus: Automate the Present, Incubate the Future.