A review meeting has one obvious purpose: prevent one person’s mistake from becoming everyone’s plan.
That sounds mundane until we remember how many LLM agent systems are currently designed like a one-person review meeting. The same model attempts the task, explains why it failed, writes advice to itself, stores that advice in memory, and then tries again. It is actor, evaluator, critic, therapist, and occasionally courtroom stenographer. Efficient, yes. Also a little suspicious.
The paper behind this article, MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs, studies exactly this problem.[1] It begins by replicating Reflexion, a framework where an LLM learns from failed attempts by writing natural-language reflections into episodic memory. The attraction is easy to understand: no gradient updates, no retraining, just failure converted into reusable advice. For agentic systems, that is a beautiful promise. Cheap learning, neatly wrapped in prose.
The paper’s uncomfortable finding is that reflection can become self-confirmation. When the same model generates the answer and diagnoses its own failure, its “reflection” often repeats the original misconception, narrows the search space, or quietly rewrites the task. So the authors propose Multi-Agent Reflexion, or MAR: instead of letting one model reflect alone, they introduce multiple persona-based critics and a judge that synthesizes their disagreement into a consensus reflection.
The headline result is not that “more agents are better.” That would be the lazy version, and the industry already has enough lazy versions. The real contribution is more precise: reflection improves when the workflow separates acting, diagnosing, critiquing, aggregating, and remembering.
The single-agent loop fails because the critic shares the actor’s blind spot
Reflexion is an elegant framework because it turns task failure into verbal memory. An Actor attempts the task. An Evaluator marks success or failure. A Self-Reflector converts the failed trajectory into natural-language advice. That advice is stored as episodic memory and injected into the next attempt.
The mechanism is attractive because it resembles reinforcement learning without updating the model’s parameters. Instead of changing weights, the system changes context. Instead of learning through gradients, it learns through written self-instruction.
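The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ReflexionLoop`, its field names, and the toy actor, evaluator, and reflector are all hypothetical stand-ins for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReflexionLoop:
    """Single-agent Reflexion: learning happens in context, not in weights."""
    actor: Callable[[str, List[str]], str]   # (task, memory) -> attempt
    evaluator: Callable[[str, str], bool]    # (task, attempt) -> success?
    reflector: Callable[[str, str], str]     # (task, failed attempt) -> advice
    max_trials: int = 3
    memory: List[str] = field(default_factory=list)

    def run(self, task: str) -> Tuple[str, bool]:
        attempt = ""
        for _ in range(self.max_trials):
            attempt = self.actor(task, self.memory)
            if self.evaluator(task, attempt):
                return attempt, True
            # The failed trajectory becomes verbal advice for the next trial.
            self.memory.append(self.reflector(task, attempt))
        return attempt, False


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_actor(task: str, memory: List[str]) -> str:
    return "4" if any("add" in note for note in memory) else "22"


def toy_evaluator(task: str, attempt: str) -> bool:
    return attempt == "4"


def toy_reflector(task: str, attempt: str) -> str:
    return "The task says to add the numbers, not concatenate them."


loop = ReflexionLoop(toy_actor, toy_evaluator, toy_reflector)
answer, solved = loop.run("What is 2 + 2?")
```

In this toy run the first attempt fails, one reflection is stored, and the second attempt succeeds. Nothing about the "model" changed except the context it was given, which is the whole appeal and, as the rest of the article argues, the whole risk.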
But that same elegance hides a structural weakness. In a single-agent Reflexion loop, the system often relies on the same model family, and sometimes effectively the same reasoning style, to perform several roles:
| Role in the loop | What it is supposed to do | Failure risk when one model dominates |
|---|---|---|
| Actor | Generate the answer, reasoning trace, or code | Makes an initial conceptual mistake |
| Evaluator | Detect whether the attempt succeeded | Provides only coarse feedback or inherits brittle scoring |
| Reflector | Explain why the attempt failed | Rationalizes the earlier mistake instead of diagnosing it |
| Memory | Guide the next attempt | Stores flawed advice as if it were learning |
The paper’s replication finds two recurring pathologies: confirmation bias and mode collapse. Confirmation bias appears when the reflection restates the original flawed reasoning, sometimes with more confident language. Mode collapse appears when retries reproduce nearly identical reasoning patterns: similar loop structures in code, repeated indexing mistakes, or the same off-by-one error wearing a slightly different hat.
That is the important mechanism. The system is not failing because it lacks a reflection step. It is failing because the reflection step is too internally correlated with the original mistake.
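One rough way to notice mode collapse in logs is to compare each retry against earlier attempts. The sketch below uses plain string similarity from Python's standard library; the function name `looks_collapsed` and the 0.9 threshold are assumptions for illustration, not something the paper prescribes.

```python
from difflib import SequenceMatcher
from typing import List


def looks_collapsed(attempts: List[str], threshold: float = 0.9) -> bool:
    """Return True if the latest attempt is near-identical to an earlier one."""
    if len(attempts) < 2:
        return False
    latest = attempts[-1]
    return any(
        SequenceMatcher(None, earlier, latest).ratio() >= threshold
        for earlier in attempts[:-1]
    )


# Two retries that differ only in a variable name, and one genuinely new plan.
first_try = "for i in range(len(xs)): total += xs[i+1]"
renamed_retry = "for j in range(len(xs)): total += xs[j+1]"
new_plan = "return sum(x * x for x in xs if x > 0 and x % 2 == 1)"
```

Here `looks_collapsed([first_try, renamed_retry])` flags the retry, because the off-by-one index survived a cosmetic rename, while `looks_collapsed([first_try, new_plan])` does not. Character-level similarity is a crude proxy; a production system would likely compare structure or behavior instead.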
A bad reflection is worse than no reflection when it becomes memory. It does not merely fail to help the next attempt; it can actively steer the next attempt in the wrong direction. The appendix example on HumanEval makes this concrete: for the `double_the_difference` task, the original specification asks for the sum of squares of positive odd integers. After failure, the single-agent reflection hallucinates a different task meaning and pushes the implementation toward “double the difference.” That is not correction. That is specification drift with a clipboard.
For business users, this is not an exotic benchmark artifact. It is the same risk that appears when an agent misreads an invoice policy, produces a bad SQL query, drafts a legally sensitive clause, or summarizes a contract term incorrectly, and then writes itself a confident note explaining how to repeat the error more efficiently next time. Wonderful. The future has automated institutional memory, including the bad parts.
MAR changes the reflection mechanism, not just the number of models
Multi-Agent Reflexion modifies the failure loop. When the Actor fails, MAR does not ask a single reflector to produce a diagnosis. It sends the failed trajectory to several persona-guided critics. These critics are designed to disagree along meaningful axes: evidence exploitation, exploration, and specification strictness. A judge then synthesizes their debate into one actionable consensus reflection, which is stored and used for the retry.
The sequence is simple:
- The Actor attempts the task.
- The Evaluator determines success or failure.
- If the attempt fails, multiple critics diagnose the failure.
- Critics respond to one another and refine their critiques.
- A judge synthesizes a consensus reflection.
- The Actor retries with that reflection in memory.
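The steps above can be sketched as follows. This is a hedged outline under stated assumptions: the function names, the callable signatures, and the reading of "debate rounds" as one independent round followed by rounds in which each critic sees its peers' critiques are illustrative choices, not the authors' code.

```python
from typing import Callable, Dict, List, Tuple

# (task, failed attempt, peer critiques) -> critique
Critic = Callable[[str, str, List[str]], str]
# (task, critiques keyed by role) -> consensus reflection
Judge = Callable[[str, Dict[str, str]], str]


def multi_agent_reflect(
    task: str,
    attempt: str,
    critics: Dict[str, Critic],
    judge: Judge,
    debate_rounds: int = 2,
) -> str:
    # Round 1: every critic diagnoses the failure independently.
    critiques = {role: critic(task, attempt, []) for role, critic in critics.items()}
    # Later rounds: each critic refines its view after reading the others.
    for _ in range(debate_rounds - 1):
        critiques = {
            role: critic(task, attempt, [c for r, c in critiques.items() if r != role])
            for role, critic in critics.items()
        }
    # The judge compresses the disagreement into one reflection.
    return judge(task, critiques)


def run_mar(
    task: str,
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str, str], bool],
    critics: Dict[str, Critic],
    judge: Judge,
    max_trials: int = 3,
) -> Tuple[str, bool, List[str]]:
    memory: List[str] = []
    attempt = ""
    for _ in range(max_trials):
        attempt = actor(task, memory)
        if evaluator(task, attempt):
            return attempt, True, memory
        memory.append(multi_agent_reflect(task, attempt, critics, judge))
    return attempt, False, memory
```

The only structural change from a single-agent loop is the line that writes to memory: the reflection is produced by `multi_agent_reflect` rather than by one reflector call.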
The key design choice is not “several agents talk.” It is that each critic has a reason to look for a different class of error.
For HotPotQA, the paper uses roles such as Verifier, Skeptic, Logician, and Creative Thinker. The Verifier demands factual grounding. The Skeptic challenges assumptions. The Logician checks specification compliance. The Creative role expands the search space when conventional reasoning stalls. For HumanEval, the personas map more naturally onto software work: Senior Engineer, QA Engineer, Algorithm Expert, and Code Reviewer.
This is not decorative roleplay. At least, it should not be. The operational idea is that different critics create different diagnostic pressure. One critic asks, “Is the claim supported?” Another asks, “What edge case breaks this?” Another asks, “Does the answer literally satisfy the task?” Another asks, “Are we stuck in the wrong plan altogether?”
That is why the committee metaphor fits. A useful committee is not a group of people taking turns saying the same thing with different fonts. It is a controlled disagreement system. The value comes from role separation, conflict, and synthesis.
What the experiments show, and what each test is really doing
The paper contains three layers of evidence: replication results, MAR main results, and appendix failure cases. It also documents implementation details such as persona prompts and parameter settings. These should not be treated as equal types of proof.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Reflexion replication on HotPotQA and HumanEval | Comparison with prior work and baseline calibration | The authors can reproduce the broad Reflexion pattern and observe its failure modes under logging | That every Reflexion setup fails in the same way across all tasks |
| MAR results on HotPotQA and HumanEval | Main evidence | Replacing single-agent reflection with structured multi-agent critique improves performance in these settings | That MAR is universally better or cost-effective in production |
| Appendix failure cases | Qualitative mechanism evidence | Brittle metrics and bad self-reflection can mislead retries in specific, logged cases | That these examples form a statistical taxonomy of all failures |
| Persona prompts and parameter settings | Implementation detail | How critics and debate rounds were instantiated | That this persona set or debate depth is optimal |
On HotPotQA, the reproduced ReAct baseline reaches 32% Exact Match, Reflexion improves to 44%, and MAR reaches 47%. That is a three-point gain over Reflexion. It is positive, but not theatrical. Nobody should build a procurement deck around three Exact Match points and pretend the angels have spoken.
The interpretation is more interesting than the raw number. HotPotQA uses Exact Match, and the paper shows cases where semantically acceptable answers are marked wrong because they do not match the expected string. In one appendix case, MAR repeatedly answers that Woman’s Era and Naj are “women’s interest magazines,” while the ground truth is “fortnightly women interest magazine.” Under Exact Match, the answer fails. Under a human reading, the core category is basically there, with one temporal modifier missing.
Another appendix example is even more revealing. MAR initially answers “Stone Brewing Co.” where the target answer is “Stone Brewing.” The system receives an incorrect signal and keeps searching for a different answer until it times out. This is a reward-signal problem. When the metric punishes a near-correct answer as fully wrong, reflection becomes hazardous. The agent learns from a misleading teacher. A strict teacher is useful; a brittle teacher is just expensive confusion.
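The brittleness is easy to reproduce. The sketch below uses a common SQuAD-style normalization (lowercase, strip punctuation and English articles) and token-level F1 as a softer alternative; whether the paper's evaluator normalizes exactly this way is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Stone Brewing Co.", "Stone Brewing")
f1 = token_f1("Stone Brewing Co.", "Stone Brewing")
```

For this pair, `em` is `False` while `f1` works out to 0.8: the binary signal says "completely wrong," the graded signal says "nearly right." A reflective agent that only sees the first will go looking for a different brewery.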
HumanEval gives a cleaner signal because execution-based tests provide sharper feedback. There, the GPT-3.5 baseline reaches a pass@1 of 67.1, Reflexion improves it to 76.4, and MAR reaches 82.6. The 6.2-point gain over Reflexion is more substantial. It also fits the mechanism: code failures often benefit from differentiated review. A QA-style critic catches edge cases. A code reviewer catches implementation defects. An algorithmic critic checks the logic. The judge then consolidates those critiques into a retry instruction.
The paper also compares its replication against the original Reflexion results. For HumanEval, the original paper reported GPT-4 baseline pass@1 of 80.1 and Reflexion pass@1 of 91.0; this replication reports GPT-4 baseline 81.7 and Reflexion 89.4. The broad pattern holds: reflection-guided retries help. But the replication also exposes why a single reflector can plateau.
So the empirical story is not “MAR defeats Reflexion everywhere by a large margin.” The better reading is: MAR improves over reproduced Reflexion on both tested benchmarks, with stronger evidence on code generation than on Exact Match QA, and its qualitative logs explain why structured disagreement can help.
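A quick calculation makes the difference between the two benchmarks easier to see. The scores are the ones quoted in this article; the helper `compare` and the "error reduction" framing (share of Reflexion's remaining failures that MAR removes) are additions for illustration.

```python
# Scores as quoted in this article, in percent.
RESULTS = {
    "HotPotQA (Exact Match)": {"reflexion": 44.0, "mar": 47.0},
    "HumanEval (pass@1)": {"reflexion": 76.4, "mar": 82.6},
}


def compare(reflexion: float, mar: float) -> dict:
    absolute = mar - reflexion
    return {
        "absolute_gain": round(absolute, 1),
        "relative_gain_pct": round(100 * absolute / reflexion, 1),
        # Fraction of Reflexion's remaining failures that MAR eliminates.
        "error_reduction_pct": round(100 * absolute / (100 - reflexion), 1),
    }


SUMMARY = {name: compare(**scores) for name, scores in RESULTS.items()}
```

On these numbers, MAR removes about 5.4% of Reflexion's remaining HotPotQA errors and about 26.3% of its remaining HumanEval errors. The absolute gains (3.0 versus 6.2 points) understate how much stronger the code-generation evidence is.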
The mechanism: disagreement creates new search paths before memory hardens
The most useful business interpretation comes from treating MAR as a workflow design pattern.
Single-agent reflection has a dangerous sequence:
1. The model makes an error.
2. The same reasoning style explains the error.
3. The explanation is stored in memory.
4. Future attempts inherit the explanation.
If step two is wrong, step three makes the error durable. This is how a temporary mistake becomes a standing operating procedure.
MAR interrupts that sequence by forcing diagnostic diversity before memory is updated. The Actor’s failed trajectory is not immediately converted into advice. It is first examined through multiple lenses. The judge then compresses the disagreement into one reflection.
This matters because many agent systems are now being designed with memory modules, retrieval stores, task histories, and self-improvement loops. The question is no longer whether agents can remember. They can. The question is whether they should remember what they just told themselves.
A business agent does not need permanent memory for every failed attempt. It needs controlled memory admission. Before an instruction becomes reusable guidance, the system should ask:
| Memory candidate | Control question | MAR-inspired safeguard |
|---|---|---|
| A reflection after a failed task | Did the reflection identify the actual failure? | Require independent critique before storing |
| A revised workflow rule | Does it generalize beyond the current case? | Use a judge or policy checker to separate local fixes from reusable rules |
| A diagnosis of user intent | Is there another plausible interpretation? | Add a skeptic or ambiguity-check role |
| A code fix or automation patch | Does it satisfy edge cases and the original specification? | Add QA and specification-review roles |
| A rejected answer | Was the reward signal reliable? | Check whether the metric is too brittle before triggering retries |
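Controlled memory admission can be sketched as a gate that a candidate reflection must pass before it is stored. Everything here is hypothetical: the `GatedMemory` class, the idea of named checks, and the toy keyword check standing in for what would really be an independent critic or policy checker.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# (task, candidate reflection) -> True if the candidate passes this check
Check = Callable[[str, str], bool]


@dataclass
class GatedMemory:
    checks: Dict[str, Check]
    accepted: List[str] = field(default_factory=list)
    # Rejected candidates are kept with the names of the checks they failed.
    rejected: List[Tuple[str, List[str]]] = field(default_factory=list)

    def admit(self, task: str, candidate: str) -> bool:
        failed = [name for name, check in self.checks.items() if not check(task, candidate)]
        if failed:
            self.rejected.append((candidate, failed))
            return False
        self.accepted.append(candidate)
        return True


# Toy check (an assumption): the reflection must still talk about squares,
# as the original specification does.
gate = GatedMemory(checks={"keeps_spec": lambda task, c: "squares" in c})
task = "Return the sum of squares of positive odd integers."
drifted = gate.admit(task, "Compute double the difference of the inputs.")
grounded = gate.admit(task, "Filter to positive odd integers before summing squares.")
```

The drifted reflection is rejected and logged with the reason; only the grounded one reaches `gate.accepted`. Keeping the rejections matters as much as keeping the admissions, since they show what the system almost taught itself.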
This is where the paper’s business relevance sits. Not in the romantic claim that agents “think better together,” but in the engineering principle that critique should be decoupled from generation before memory is updated.
The metric is part of the learning system
One of the paper’s most practical lessons is easy to miss: evaluation is not a passive scoreboard. In reflective systems, the metric becomes part of the learning loop.
With a normal benchmark, a brittle metric merely misreports performance. Annoying, but survivable. In Reflexion-style systems, a brittle metric can actively corrupt the next attempt. If the environment says a semantically correct answer is wrong, the agent may reflect itself away from the correct solution. That is exactly what the HotPotQA appendix examples illustrate.
This matters for business deployment because many enterprise reward signals are also brittle. A customer-support agent may be rewarded for short response time and learn to close tickets too quickly. A sales assistant may be rewarded for booked meetings and learn to overqualify weak leads. A code agent may be rewarded for passing visible tests and miss hidden maintainability constraints. A document-processing agent may be rewarded for field-level exact match and learn to overfit formatting rather than meaning.
The paper does not solve reward design. It simply reminds us that reflection amplifies whatever signal it receives. If the signal is accurate, reflection can improve behavior. If the signal is misleading, reflection can operationalize the mistake.
That is why MAR’s committee should not only critique the answer. In production systems, the committee also needs to critique the feedback.
What businesses should copy from MAR, and what they should not
The obvious but wrong lesson is: “Use multiple agents for everything.”
That would be a fine way to increase latency, token bills, and the number of dashboards needed to explain why a simple task now requires a small parliament. The paper itself reports that MAR requires roughly three times the API calls and latency of single-agent Reflexion, with the societal-impact section mentioning roughly 300–400 API calls per task in their pipeline. This is not free reliability. It is purchased diagnostic diversity.
The better business lesson is selective escalation.
Use single-agent reflection when the task is low-risk, feedback is reliable, and errors are easy to recover from. Use MAR-style reflection when the system shows signs of stagnation, repeated failure, specification ambiguity, or high-cost consequences.
A practical deployment pattern might look like this:
| Situation | Reflection design |
|---|---|
| Low-risk draft generation | Single-agent self-review is often enough |
| Code generation with failed tests | Add QA, algorithm, and code-review critics |
| Research or market analysis | Add evidence verifier, skeptic, and synthesis judge |
| Contract, compliance, or policy work | Add specification checker and risk reviewer |
| Repeated failed retries | Escalate to multi-agent diagnosis before updating memory |
| Unclear scoring or ambiguous labels | Add metric-review step before treating feedback as truth |
This is the ROI logic. MAR is not a default architecture. It is an escalation layer for cases where internally correlated reflection is likely to fail.
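As a routing rule, the table above might look like the sketch below. The field names, the `ReflectionMode` values, and the retry threshold of 2 are assumptions chosen for illustration; a real policy would be tuned to the task's actual risk and cost.

```python
from dataclasses import dataclass
from enum import Enum


class ReflectionMode(Enum):
    SINGLE_AGENT = "single_agent"
    MULTI_AGENT = "multi_agent"
    METRIC_REVIEW = "metric_review"


@dataclass
class TaskState:
    high_risk: bool
    failed_retries: int
    feedback_reliable: bool
    ambiguous_spec: bool = False


def choose_reflection(state: TaskState, retry_threshold: int = 2) -> ReflectionMode:
    # If the scoring itself is suspect, question the feedback before reflecting on it.
    if not state.feedback_reliable:
        return ReflectionMode.METRIC_REVIEW
    # Escalate when stakes are high, the spec is unclear, or retries are stalling.
    if state.high_risk or state.ambiguous_spec or state.failed_retries >= retry_threshold:
        return ReflectionMode.MULTI_AGENT
    return ReflectionMode.SINGLE_AGENT
```

A low-risk draft with reliable feedback and no failed retries stays on the cheap path; the committee is convened only when one of the escalation conditions fires.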
The paper’s persona design also suggests an implementation habit: critics should be role-specific, not personality-specific. “Optimistic analyst,” “careful analyst,” and “creative analyst” may sound charming, but charm is not a control system. Better roles map to operational risks: factual grounding, specification compliance, edge-case coverage, alternative hypotheses, cost impact, user intent, policy constraints.
A production MAR system should make disagreement useful, bounded, and auditable. Two debate rounds may be enough for many tasks, as the paper argues based on its setup. More debate is not automatically better. Past a certain point, agents are not deliberating; they are billing.
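"Bounded and auditable" can be made concrete with a small record type. This is a sketch under assumptions: `DebateRecord`, its fields, and the default cap of two rounds are illustrative, and the dissent list is an addition meant to keep minority critiques visible after the judge has synthesized.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class DebateRecord:
    task_id: str
    max_rounds: int = 2
    rounds: List[Dict[str, str]] = field(default_factory=list)  # role -> critique, per round
    consensus: str = ""
    dissent: List[str] = field(default_factory=list)  # critiques the judge did not adopt

    def add_round(self, critiques: Dict[str, str]) -> None:
        # Bounded: refuse to debate past the configured budget.
        if len(self.rounds) >= self.max_rounds:
            raise RuntimeError(f"debate budget of {self.max_rounds} rounds exhausted")
        self.rounds.append(dict(critiques))

    def to_json(self) -> str:
        # Auditable: the full debate, not just the consensus, is serialized.
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the record alongside the consensus reflection means a reviewer can later see which critique the judge discarded, which is the first place to look when a stored lesson turns out to be wrong.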
The limitations are not footnotes; they define where the method belongs
MAR improves over reproduced Reflexion on the tested benchmarks, but its boundaries are clear.
First, the evidence base is narrow. HotPotQA and HumanEval are useful because they test multi-hop reasoning and code generation, but they do not cover the full range of business agent work: long-horizon operations, tool chains with external state, multi-user workflows, compliance-heavy decision support, or messy CRM data. The mechanism may transfer, but the paper does not prove transfer.
Second, the HotPotQA gains are modest under Exact Match. The authors give reasonable evidence that Exact Match undercounts semantic correctness, but that cuts both ways. A better metric might show stronger MAR gains, or it might reveal different failure patterns. The safest interpretation is that HotPotQA supports the mechanism more than it supports a large quantitative claim.
Third, persona design is still manual. The paper defines critic roles around evidence exploitation, exploration, and specification strictness, and uses different personas for QA and coding. That is sensible, but it does not establish an optimal taxonomy. In business systems, critic roles should be designed around task risk, not copied blindly from the paper.
Fourth, cost matters. A threefold increase in calls and latency is acceptable for some high-value tasks and absurd for others. The correct deployment question is not “Does MAR improve accuracy?” It is “Which failures are expensive enough to justify committee review?”
Finally, a judge model is itself another point of failure. MAR separates critique, but the final synthesis still depends on a model deciding which criticism matters. If the judge compresses away the minority view that was actually correct, the committee can become a more expensive single agent. Governance does not disappear because several prompts were involved. It merely gets better furniture.
The real lesson: do not let agents grade their own homework alone
The paper’s strongest contribution is not the three-point HotPotQA gain or the 6.2-point HumanEval gain over reproduced Reflexion. Those numbers matter, but they are not the main event.
The main event is architectural: self-improving agents need process control around reflection. Acting, evaluating, critiquing, judging, and remembering should not collapse into one undifferentiated model call. Once memory enters the loop, bad critique becomes persistent behavior. That is when a small mistake grows legs.
MAR offers a useful pattern: before an agent stores a lesson from failure, make that lesson survive structured disagreement. Ask one critic whether the evidence supports it. Ask another whether the specification is being rewritten. Ask another whether edge cases break it. Ask a judge to synthesize, but keep the debate logs available for audit. Then, and only then, let the system remember.
This is not a case for bloated agent swarms. It is a case for committees with job descriptions.
Reflection is useful. Reflection with unchecked self-belief is dangerous. Reflection with controlled disagreement is closer to engineering.
And yes, apparently the model also benefits when someone in the room is paid to say: “No, that is not what the task asked.”
Cognaptus: Automate the Present, Incubate the Future.
[1] Onat Ozer, Grace Wu, Yuchen Wang, Daniel Dosti, Honghao Zhang, and Vivi De La Rue, “MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs,” arXiv:2512.20845, 2025.