Mirror, Mirror on the Agent: Teaching LLMs to Judge Their Own Actions

The agent did exactly what it was taught. That was the problem.

A familiar business agent failure does not look dramatic. It looks boring.

The agent searches the database, clicks the wrong record, receives an error, retries the same action, receives the same error, retries again, and then politely informs the user that it has encountered “temporary difficulty.” Very professional. Completely useless.

This kind of failure is not always caused by a weak base model or a bad prompt. Sometimes the agent is doing exactly what its training encouraged it to do: imitate successful actions from expert demonstrations. It has learned the surface of competence, but not the habit of asking a more important question: is this action still good in the current state?

That distinction is the center of Agentic Critical Training (ACT), a paper by Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, and Furong Huang.¹ The paper proposes a training stage where an LLM agent is not asked to copy an expert action, nor to imitate prewritten reflection text. Instead, it is trained through reinforcement learning to choose the better of two actions: an expert action and a sampled alternative.

The mechanism is simple enough to sound almost obvious after someone has written the paper. That is usually where the useful ideas hide.

The paper’s main claim is not that agents need more chain-of-thought. It is not that every system should print a longer self-reflection before acting. The more precise claim is this: if we want agents to recover from mistakes, adapt to unfamiliar states, and avoid blindly executing scripts, we may need to train them to judge action quality, not merely to reproduce successful trajectories.

The difference between “explain yourself” and “choose the better action” is small in wording. In training, it is not small at all.

The old recipe teaches what to do, not what to reject

Most agent training begins with imitation learning. Given an expert trajectory, the model learns to predict the expert’s next action under a given context. In the paper’s notation, the supervised objective maximizes the likelihood of the expert action under the observed state. Operationally, this says:

When you see this kind of state, output this kind of action.
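Written out, that objective is the usual negative log-likelihood over expert state-action pairs. The symbols below are a conventional choice and not necessarily the paper's own notation: $\mathcal{D}$ is the set of expert steps, $s_t$ is the context at step $t$, $a_t^{*}$ is the expert action, and $\pi_\theta$ is the model's policy.

```latex
\mathcal{L}_{\mathrm{IL}}(\theta)
  = -\,\mathbb{E}_{(s_t,\,a_t^{*}) \sim \mathcal{D}}
    \left[ \log \pi_\theta\!\left(a_t^{*} \mid s_t\right) \right]
```

Nothing in this loss mentions any action other than $a_t^{*}$. The model is never shown what a worse choice looks like.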

That is useful. It is also incomplete.

An expert demonstration usually contains a sequence of successful actions. It rarely contains the near-misses, tempting mistakes, failed clicks, wrong tool calls, invalid product selections, or “nothing happens” states that make real environments so irritatingly educational. So the model learns a clean path through the maze, but not the walls.

Recent “self-reflection” approaches try to patch this gap. The paper discusses Early Experience as the key comparison. In that setup, the system executes both expert and alternative actions, observes the resulting next states, generates reflection text explaining why the expert action is better, and then trains the model to imitate that reflection through next-token prediction.

This is better than pure imitation in one sense: the model is at least exposed to contrast. But the authors’ criticism is sharp. The model is still imitating a target string. It learns to reproduce reflection text, not necessarily to discover the reasoning that would lead to the better action.

That is the misconception worth removing. A model that writes “I should check the constraint carefully” has not necessarily learned to check the constraint carefully. It may have learned that this sentence often appears before the answer. Anyone who has read enough AI-generated “careful analysis” will recognize the species.

ACT changes the training question.

Instead of asking the model to imitate the expert action or imitate a reflection, ACT asks:

Given the current state and two candidate actions, which action is better?

The model is rewarded when it selects the expert action. The reasoning is not supervised with a fixed target explanation; the reward verifies only the choice. To earn that reward consistently, the model must learn some internal procedure for comparing actions under the current state.

That is why this paper is best read mechanism-first. The benchmark numbers matter, but they only become interpretable once the objective shift is clear.

ACT turns agent training into a comparison problem

ACT starts from expert trajectories. For each expert state-action pair, the method samples an alternative action from an initial policy. It removes duplicate alternatives and pairs the expert action with the sampled alternative. The resulting training example contains the current context and two candidate actions.
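As a rough sketch of that data-construction step, the Python below pairs each expert step with distinct sampled alternatives. Everything named here is illustrative rather than taken from the paper: `ComparisonExample`, `build_pairs`, the `sample_action` callable standing in for the initial policy, the `num_samples` budget, and the choice to treat an alternative that matches the expert action (case-insensitively) as a duplicate.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ComparisonExample:
    context: str        # task description plus interaction history so far
    expert_action: str  # action taken in the expert trajectory
    alternative: str    # action sampled from the initial policy


def build_pairs(
    expert_steps: List[Tuple[str, str]],
    sample_action: Callable[[str], str],
    num_samples: int = 4,  # assumed sampling budget per expert step
) -> List[ComparisonExample]:
    """Pair each expert (context, action) step with distinct sampled alternatives."""
    examples: List[ComparisonExample] = []
    for context, expert_action in expert_steps:
        # Assumption: an alternative identical to the expert action is useless
        # as a contrast, so it is treated like any other duplicate.
        seen = {expert_action.strip().lower()}
        for _ in range(num_samples):
            candidate = sample_action(context).strip()
            key = candidate.lower()
            if key in seen:
                continue  # duplicate of the expert action or of an earlier sample
            seen.add(key)
            examples.append(ComparisonExample(context, expert_action, candidate))
    return examples


def stub_policy(context: str) -> str:
    # Hypothetical stand-in for sampling an action from the initial LLM policy.
    return random.choice(["go to cabinet 1", "go to countertop 1", "open drawer 2"])


steps = [("Task: put a clean cloth in cabinet 1. You are in the kitchen.",
          "go to countertop 1")]
for example in build_pairs(steps, stub_policy, num_samples=8):
    print(example.expert_action, "vs", example.alternative)
```

With the stub policy, at most two pairs come out of eight samples, because the third candidate is the expert action itself. A real policy would produce a wider spread of alternatives.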

The prompt then asks the model to think about which action is better and output the chosen action inside <action>...</action> tags. The expert action is randomly placed as Action 1 or Action 2, so the model cannot simply memorize position.
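A minimal sketch of that prompt step might look like the following. The prompt wording paraphrases the description above and is not the paper's actual template; `build_prompt` and `parse_choice` are hypothetical helper names, and taking the last tagged action when several appear is an assumption.

```python
import random
import re
from typing import Optional, Tuple


def build_prompt(context: str, expert: str, alternative: str,
                 rng: random.Random) -> Tuple[str, str]:
    """Return (prompt, expert_action) with the expert placed in a random slot."""
    candidates = [expert, alternative]
    rng.shuffle(candidates)  # the expert may end up as Action 1 or Action 2
    prompt = (
        f"{context}\n\n"
        f"Action 1: {candidates[0]}\n"
        f"Action 2: {candidates[1]}\n\n"
        "Think about which action is better in the current state, then output "
        "the chosen action inside <action>...</action> tags."
    )
    return prompt, expert


ACTION_TAG = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_choice(response: str) -> Optional[str]:
    """Extract the last tagged action, or None when the tags are missing."""
    matches = ACTION_TAG.findall(response)
    return matches[-1].strip() if matches else None


prompt, expert = build_prompt(
    "Task: put a clean cloth in cabinet 1. You are holding cloth 1.",
    "go to cabinet 1",
    "go to sinkbasin 1",
    random.Random(0),
)
print(prompt)
print(parse_choice("The cloth is already clean. <action>go to cabinet 1</action>") == expert)
```

Returning `None` for a missing tag keeps the format failure distinguishable from a wrong choice, which matters once a format penalty enters the reward.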

The training pipeline has three stages:

Stage	What happens	Likely purpose	What it does not prove by itself
Data construction	Expert actions are paired with sampled alternative actions	Creates contrastive action-quality examples	It assumes expert actions are generally better than sampled alternatives
Agentic Critical Training	GRPO rewards the model for selecting the better action	Teaches discriminative judgment over actions	It does not directly train free-form action generation
RL action training	The ACT-enhanced model is further trained to generate actions	Converts critical judgment into stronger policy behavior	It still depends on the benchmark reward/proxy design

The training uses Group Relative Policy Optimization (GRPO), a reinforcement learning method that samples groups of responses and updates the model based on relative rewards. The reward has three practical components: exact match with the expert action, partial credit for admissible but non-expert actions, and a format penalty when the required action tags are missing. For WebShop action training, the admissible-action component is disabled because open-ended search queries cannot be fully enumerated.
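To make the reward shape concrete, here is a small sketch under stated assumptions. The numeric values (1.0, 0.1, -0.5) are placeholders, not the paper's settings, and `act_reward` and `group_relative_advantages` are hypothetical names. The normalization follows the commonly described GRPO formulation, where each response's reward is compared against the mean and standard deviation of its own sampled group; the policy-gradient update itself, including clipping and any KL term, is omitted.

```python
import statistics
from typing import List, Optional, Set


def act_reward(
    choice: Optional[str],
    expert: str,
    admissible: Optional[Set[str]] = None,  # None models the disabled component
    partial_credit: float = 0.1,            # placeholder value
    format_penalty: float = -0.5,           # placeholder value
) -> float:
    """Score one response: exact match, partial credit, or format penalty."""
    if choice is None:
        return format_penalty   # required <action> tags were missing
    if choice == expert:
        return 1.0              # exact match with the expert action
    if admissible is not None and choice in admissible:
        return partial_credit   # valid in the environment, but not the expert's choice
    return 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled responses (choices already parsed from the tags).
choices = ["go to cabinet 1", "go to sinkbasin 1", None, "go to cabinet 1"]
rewards = [act_reward(c, "go to cabinet 1", {"go to sinkbasin 1"}) for c in choices]
print(rewards)
print(group_relative_advantages(rewards))
```

Because advantages are relative to the group, a prompt where every sampled response picks the expert action contributes almost no learning signal. The useful gradient comes from prompts where the model is still split.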

This detail matters because ACT is not magic introspection. It is a verifiable reward setup. The system rewards correct selection under a contrastive action-choice task. Reflection is not injected as text; it is incentivized as an instrument for choosing correctly.

That is the useful design pattern for business systems: do not merely tell the model to “reflect.” Give it a comparison task where reflection is the cheapest path to reward.

The main result: judgment improves action training

The authors evaluate ACT on three agent benchmarks:

Benchmark	Domain	Reported metric
ALFWorld	Text-based household tasks with navigation and object manipulation	Success rate, with in-distribution and out-of-distribution splits
WebShop	Web shopping tasks requiring product search and selection	Success rate
ScienceWorld	Scientific experimental procedure tasks	Offline next-action prediction accuracy

That last row deserves attention. ScienceWorld is not reported as live task success in the main table. It is next-action prediction accuracy over sampled states. So it supports a claim about action prediction under scientific task states, not a full claim that ACT solves end-to-end ScienceWorld experiments. This is not a fatal limitation. It is simply the boundary of the evidence.

On Qwen3-8B, ACT improves both imitation learning (IL) and reinforcement learning (RL) when used as a preliminary training stage. The headline numbers are:

Comparison	Average gain reported in the paper	Interpretation
IL w/ ACT vs. IL	+5.07 percentage points	Critical judgment improves supervised action learning
RL w/ ACT vs. RL	+4.62 percentage points	Critical judgment improves later RL action training
IL w/ ACT vs. Early Experience	+2.42 percentage points	RL-trained judgment beats imitated reflection text in these benchmarks

The full main table is more informative than the averages:

Method	ALFWorld ID (success %)	ALFWorld OOD (success %)	WebShop (success %)	ScienceWorld (next-action accuracy %)
Prompt w/o CoT thinking	35.71	27.61	2.80	28.01
Prompt w/ CoT thinking	56.43	50.00	3.00	25.21
ACT only	72.86	72.39	7.40	26.71
Imitation Learning	85.71	82.84	28.00	42.80
Early Experience	87.86	85.82	31.00	45.60
IL w/ ACT	91.43	87.31	31.60	48.69
RL	90.71	84.33	29.40	43.04
RL w/ ACT	92.86	88.06	33.80	50.34

A few things are easy to miss.

First, ACT alone does not beat IL or RL. That is expected. ACT trains the model to judge between candidate actions; it is not primarily an action-generation training stage. The paper is not saying that a critic objective replaces policy training. It is saying that a critic objective creates a better foundation for policy training.

Second, the gain over RL is not merely an IL artifact. RL w/ ACT beats RL across all reported benchmark columns in the main table. This supports the claim that action-quality judgment transfers into later action generation.

Third, the result against Early Experience is the cleanest test of the paper’s central argument. Early Experience adds reflection data but still trains through imitation. ACT trains action comparison through RL. In these experiments, ACT does better. That does not prove all reflection distillation is weak. It does show that, under this setup, rewarding correct judgment is more effective than asking the model to copy reflection text.

The paper is therefore not arguing against reflection. It is arguing against treating reflection as a decorative transcript.
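The per-benchmark gaps behind the second and third points can be recomputed directly from the main table. The sketch below copies the relevant rows and reports column-by-column differences only. How the paper averages these into its headline gains is not spelled out here, so the code does not try to reproduce those averages. `MAIN_TABLE` and `gains` are illustrative names.

```python
from typing import Dict, List

COLUMNS = ["ALFWorld ID", "ALFWorld OOD", "WebShop", "ScienceWorld"]

# Rows copied from the main table above (Qwen3-8B).
MAIN_TABLE: Dict[str, List[float]] = {
    "Imitation Learning": [85.71, 82.84, 28.00, 42.80],
    "Early Experience": [87.86, 85.82, 31.00, 45.60],
    "IL w/ ACT": [91.43, 87.31, 31.60, 48.69],
    "RL": [90.71, 84.33, 29.40, 43.04],
    "RL w/ ACT": [92.86, 88.06, 33.80, 50.34],
}


def gains(method: str, baseline: str) -> Dict[str, float]:
    """Per-column difference in percentage points, method minus baseline."""
    return {
        col: round(m - b, 2)
        for col, m, b in zip(COLUMNS, MAIN_TABLE[method], MAIN_TABLE[baseline])
    }


for method, baseline in [("RL w/ ACT", "RL"), ("IL w/ ACT", "Early Experience")]:
    deltas = gains(method, baseline)
    wins_all = all(d > 0 for d in deltas.values())
    print(f"{method} vs {baseline}: {deltas} | wins every column: {wins_all}")
```

Both comparisons come out positive in every column, though the margins over Early Experience are noticeably thinner than the margins over plain RL.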

The failure-recovery case shows the mechanism, not just the score

The most useful qualitative evidence is the ALFWorld failure-recovery case.

In one trace, an imitation-learning model tries to put a cleaned cloth into a cabinet. The action fails. The model repeats essentially the same failed action for more than 30 steps until termination. This is the agentic version of pressing the elevator button harder.

The ACT-trained model faces a similar kind of failure. After repeated “Nothing happens” feedback, it diagnoses that it is not at the correct location and chooses to go to the dining table before trying to place the object there.

This case study is not the paper’s main statistical evidence. It is explanatory evidence. Its purpose is to show a plausible behavioral mechanism behind the table: ACT appears to help the model interpret the current state, recognize failed action feedback, and revise the next action instead of continuing a memorized script.

The appendix adds a WebShop example with the same theme. An imitation-trained model follows a search-click-select-buy pattern even when the product page shows a price of $55 despite the user’s constraint of below $50. It buys anyway and receives a score of zero. The authors interpret this as rigid execution without state awareness.

For business readers, that example is more recognizable than the benchmark name. Many deployed agents fail not because they cannot produce a valid next action, but because they do not re-check whether the current state still satisfies the goal. They complete the workflow while violating the task. Very efficient. Also wrong.

OOD results suggest judgment is less brittle than scripts

The out-of-distribution result is important because it tests whether ACT is just helping the model memorize a cleaner action distribution.

On ALFWorld, the authors report both in-distribution and out-of-distribution splits. The OOD split uses unseen room layouts and object combinations. The main table shows that RL w/ ACT reaches 88.06 on ALFWorld OOD, compared with 84.33 for RL. The paper highlights that the ACT gain on top of RL is larger on OOD tasks, at 3.73 percentage points, than on in-distribution tasks, at 2.15 percentage points.

This is not proof of broad real-world robustness. The environments are still benchmarks, and the OOD shift is defined by the benchmark. But it supports a more modest and useful interpretation: training the model to evaluate actions under the current state may be less brittle than training it only to reproduce action sequences.

That distinction matters for enterprise agents. Business workflows rarely repeat perfectly. A CRM field is missing. A supplier page changes layout. A compliance constraint appears halfway through a process. A user says “same as last time,” except one parameter is different, because naturally it is.

A script follower tries to complete the learned pattern. A judgment-trained agent has at least been trained to compare candidate next steps against the current context. That does not make it reliable by default. It gives reliability engineering a better object to work with.

Cross-size transfer is a cost argument, not just a model argument

ACT requires alternative actions to construct contrastive examples. That introduces a cost question: must every model size collect its own ACT data?

The paper tests this by training Qwen3-4B on ALFWorld using ACT data collected entirely from Qwen3-8B. No re-collection or adaptation is used. The Qwen3-4B results still improve when ACT is added.

Method	Qwen3-4B ID	Qwen3-4B OOD	Qwen3-8B ID	Qwen3-8B OOD
Prompt w/o CoT thinking	13.57	8.96	35.71	27.61
Prompt w/ CoT thinking	50.71	29.85	56.43	50.00
ACT only	71.43	62.69	72.86	72.39
Imitation Learning	85.00	83.58	85.71	82.84
Early Experience	88.57	88.06	87.86	85.82
IL w/ ACT	88.57	91.04	91.43	87.31
RL	91.43	88.81	90.71	84.33
RL w/ ACT	92.14	91.79	92.86	88.06

This is best treated as a cost-amortization test. It suggests that contrastive action-quality data collected from a larger model can still help a smaller model. For organizations, that matters because critic-data collection can be expensive. If one collection pass can support several deployment sizes, the economics become less unpleasant.

Still, the result is narrow. It is one benchmark family, one transfer direction, and two Qwen3 model sizes. It does not prove that ACT data can move freely across model families, tool ecosystems, or business domains. The practical inference is softer: action-comparison datasets may have reuse value, so teams should not assume every production model needs a fully separate critic-data pipeline.

That is already useful.

The reasoning benchmark result is intriguing, but should not be inflated

The paper also evaluates models trained only on ALFWorld agentic data on two general reasoning benchmarks, MATH-500 and GPQA-Diamond. No mathematical or scientific reasoning data is used during this training.

The result:

Method	MATH-500	GPQA-Diamond
Prompt w/o CoT thinking	78.6 ± 0.33	42.93 ± 1.09
Prompt w/ CoT thinking	86.93 ± 0.74	51.52 ± 1.89
Imitation Learning	87.00 ± 0.33	44.61 ± 0.95
Early Experience	86.86 ± 0.25	51.85 ± 0.63
RL	87.07 ± 0.77	52.36 ± 1.32
ACT	87.73 ± 0.19	53.37 ± 0.63

The authors interpret this as evidence that ACT can improve general reasoning without reasoning-specific training data. The strongest result is on GPQA-Diamond: ACT improves over the CoT prompting baseline by 1.85 percentage points, while IL drops by 6.91 percentage points.

This is interesting. It is not a license to declare that household-task agent training produces general intelligence. Please do not print that on a slide deck.
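One way to keep the reading disciplined is to look at each GPQA-Diamond delta next to the reported ± values. The sketch below assumes those values describe spread across runs (the table does not say which statistic they are) and applies a crude range-overlap check. It is not a significance test, and `GPQA` and `compare` are illustrative names.

```python
from typing import Dict, Tuple

# (mean, reported ± value) for GPQA-Diamond, copied from the table above.
GPQA: Dict[str, Tuple[float, float]] = {
    "Prompt w/ CoT thinking": (51.52, 1.89),
    "Imitation Learning": (44.61, 0.95),
    "Early Experience": (51.85, 0.63),
    "RL": (52.36, 1.32),
    "ACT": (53.37, 0.63),
}


def compare(method: str, baseline: str = "Prompt w/ CoT thinking") -> Tuple[float, bool]:
    """Return (delta in points, whether the two mean ± value ranges overlap)."""
    m, m_spread = GPQA[method]
    b, b_spread = GPQA[baseline]
    delta = round(m - b, 2)
    overlaps = abs(m - b) <= m_spread + b_spread
    return delta, overlaps


for name in GPQA:
    if name != "Prompt w/ CoT thinking":
        delta, overlaps = compare(name)
        print(f"{name}: {delta:+.2f} points vs CoT prompting, ranges overlap: {overlaps}")
```

Under that assumption, the ACT gain sits inside overlapping ranges while the IL drop does not. That fits a cautious reading: the clearer signal in this table is what imitation learning loses, not what ACT adds.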

A disciplined interpretation is that ACT’s comparison objective may preserve or strengthen self-verification behavior. The paper’s qualitative GPQA example shows the ACT-trained model deriving an answer and then checking options against energy conservation. The appendix contrasts this with IL reasoning collapse: unfocused meandering in one physics problem and prolonged algebraic looping in a MATH-500 problem.

Those appendix cases are not a second thesis. They are explanatory diagnostics for a surprising table. Their role is to suggest how ACT might avoid the degradation that imitation learning can cause when a reasoning-capable base model is fine-tuned on short action-heavy traces.

The business translation is similarly bounded. ACT may be useful when a company wants agents that preserve reasoning discipline while being trained on operational workflows. It does not prove that ACT-trained agents will reason better in every domain. It does suggest that supervised imitation on narrow action logs can damage general reasoning behavior, and that reward-based action judgment may be less destructive.

That warning alone is worth taking seriously.

What ACT directly shows, and what Cognaptus infers

The business relevance of ACT is not “your agents will become self-aware.” They will not. The more practical reading is that ACT points toward a training pattern for agents that must operate under changing states.

Layer	What the paper directly shows	Cognaptus inference for business systems	Boundary
Training objective	Rewarding correct action comparison improves later IL and RL performance on tested benchmarks	Add critic-style action comparison before action-generation training	Needs domain-specific validation
Reflection	ACT outperforms a reflection-distillation baseline in reported benchmarks	Reflection should be trained as judgment, not merely generated as text	Other reflection methods may behave differently
Robustness	ACT improves ALFWorld OOD performance	Current-state action evaluation may reduce brittle script-following	Benchmark OOD is not the same as production drift
Cost	Qwen3-4B benefits from ACT data collected using Qwen3-8B	Contrastive critic data may be reusable across model sizes	Cross-family reuse remains untested
Reasoning	ACT improves MATH-500 and GPQA-Diamond modestly after ALFWorld-only training	Critic training may help preserve self-verification	Results are limited to two reasoning benchmarks and specific models

For workflow automation, the most immediate application is not a new chatbot personality. It is a training and evaluation layer.

Imagine an accounts-payable agent. The standard imitation dataset shows successful invoice processing. ACT-style data would add contrastive moments: approve vs. flag; match vendor by name vs. match by tax ID; proceed despite missing purchase order vs. request clarification; use the current exchange rate vs. reuse a stale one. The model is rewarded for identifying which action better satisfies the current state and policy constraints.
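A hypothetical sketch of what such contrastive moments could look like as data, plus a small evaluation helper, follows. The states, labels, and names (`ContrastiveMoment`, `AP_MOMENTS`, `judgment_accuracy`) are invented for illustration; none of them come from the paper, and a real dataset would need labels reviewed against actual policy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ContrastiveMoment:
    state: str        # what the agent can see at this step
    preferred: str    # action a reviewer marked as better for this state
    alternative: str  # plausible but worse action


# Hypothetical accounts-payable moments, invented for illustration.
AP_MOMENTS: List[ContrastiveMoment] = [
    ContrastiveMoment("Invoice total exceeds the matched purchase order total.",
                      "flag for review", "approve"),
    ContrastiveMoment("Two vendors share a similar name; the invoice carries a tax ID.",
                      "match vendor by tax ID", "match vendor by name"),
    ContrastiveMoment("Invoice references no purchase order.",
                      "request clarification", "proceed without purchase order"),
    ContrastiveMoment("Cached exchange rate is a month old; a current rate is available.",
                      "use current exchange rate", "reuse cached exchange rate"),
]

Judge = Callable[[str, str, str], str]  # (state, action_1, action_2) -> chosen action


def judgment_accuracy(moments: List[ContrastiveMoment], judge: Judge) -> float:
    """Fraction of moments where the judge picks the preferred action."""
    if not moments:
        raise ValueError("need at least one contrastive moment")
    correct = 0
    for i, m in enumerate(moments):
        # Alternate which slot holds the preferred action so position bias shows up.
        pair = (m.preferred, m.alternative) if i % 2 == 0 else (m.alternative, m.preferred)
        if judge(m.state, pair[0], pair[1]) == m.preferred:
            correct += 1
    return correct / len(moments)


def always_first(state: str, action_1: str, action_2: str) -> str:
    return action_1  # a position-biased judge that ignores the state


print(judgment_accuracy(AP_MOMENTS, always_first))  # prints 0.5
```

The same records can serve two purposes: as comparison prompts for reward-based training, and as a held-out check that a deployed agent still rejects the tempting wrong action.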

That does not replace rule-based controls. In regulated workflows, it should not. But it can train the agent to notice when a superficially plausible action violates context.

The same pattern applies to customer support, procurement, compliance review, trading operations, and CRM automation. In all of these settings, the expensive failures often come from agents that complete a learned pattern while ignoring a constraint that changed midstream.

ACT is therefore less about making agents eloquent and more about making them interruptible by reality.

The boundary: benchmark evidence is not deployment evidence

The paper is careful enough to give us strong signals and clear boundaries.

First, the experiments are benchmark-based. ALFWorld, WebShop, and ScienceWorld are useful, but they are not messy enterprise systems with live users, partial permissions, stale APIs, adversarial documents, and managers who rename spreadsheet columns for spiritual reasons.

Second, the main reported models are Qwen3-4B and Qwen3-8B. The paper gives evidence across model sizes, but not across a broad set of model families.

Third, ScienceWorld is evaluated through next-action prediction accuracy, not full online task success. That makes its evidence narrower than ALFWorld and WebShop.

Fourth, ACT depends on the quality of the contrastive setup. The method assumes the expert action is generally better than sampled alternatives. In many business domains, the “expert” log may contain shortcuts, policy violations, or legacy behavior that should not be reinforced. If the expert action is not actually better, ACT will faithfully teach the model to judge badly. A critic trained on bad taste is still a critic. Just not one you should hire.

Finally, action comparison is only one layer of agent reliability. Production systems still need tool permissions, audit logs, rollback design, deterministic checks, human escalation, and test suites. ACT can improve the learned policy. It does not replace system architecture.

The useful lesson: train the mirror, not the monologue

The title of the paper could have invited a lazy reading: “agents learn to reflect.” But reflection is not the real contribution. The contribution is more precise.

ACT trains the model to look at the current state, compare two possible next actions, and identify the better one through reward. That creates a form of learned criticism. Used as a stage before imitation or action RL, this critic-like foundation improves task performance across the reported agent benchmarks. It also appears to help with OOD transfer and may preserve general reasoning better than standard imitation on action-heavy traces.

For businesses, the lesson is not that every agent needs longer reasoning text. Many agents already talk too much. The lesson is that agent training should include structured opportunities to reject bad actions.

A system that only learns successful paths may fail as soon as the path bends. A system trained to compare actions has at least practiced looking down before stepping.

That is not autonomy. But it is closer to operational judgment than imitation alone.

And in agent design, as in management, the first sign of maturity is not doing the right thing once. It is recognizing when the usual thing has become the wrong thing.

Cognaptus: Automate the Present, Incubate the Future.

¹ Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, and Furong Huang, “Agentic Critical Training,” arXiv:2603.08706v1, 2026. https://arxiv.org/abs/2603.08706
