Opening — Why This Matters Now
The current wave of AI agents promises something ambitious: systems that plan, act, evaluate outcomes, and adapt. In theory, they resemble junior analysts—observing a situation, choosing an action, and refining their judgment over time.
In practice, however, many so‑called “agents” are little more than skilled imitators.
Most agent training pipelines rely on imitation learning: the model copies actions demonstrated by experts. This produces competent behavior, but it hides a critical weakness. The model learns what to do, but rarely learns why one action is better than another. Without that comparative judgment, agents struggle to reflect on mistakes or adapt to unfamiliar situations.
A recent research paper proposes an intriguing alternative: Agentic Critical Training (ACT). Rather than teaching models to replicate expert actions, ACT trains them to judge the quality of actions themselves.
It is a subtle shift—but an important one. Instead of imitation, the model learns criticism. And criticism, as any experienced manager knows, is often the beginning of expertise.
Background — The Limits of Imitation in Agent Training
The dominant paradigm for training LLM agents typically follows three stages:
- Imitation Learning (IL) – Models replicate expert trajectories.
- Reinforcement Learning (RL) – Models refine behavior using reward signals.
- Self‑Reflection Augmentation – Some pipelines add reflection text explaining why an action works.
At first glance, reflection seems to solve the reasoning gap. But most implementations simply distill reflection text from stronger models or human annotators.
In other words, the model is not truly reflecting. It is imitating reflection.
This distinction matters. A model trained to copy explanations may produce elegant reasoning chains, yet still fail when confronted with unfamiliar situations. The underlying reasoning mechanism was never learned—it was merely rehearsed.
ACT reframes the problem entirely.
Instead of teaching a model to explain actions, it trains the model to identify the better action among alternatives.
Analysis — What Agentic Critical Training Actually Does
Agentic Critical Training introduces a reinforcement learning loop focused on comparative evaluation.
The core idea is straightforward:
- Generate multiple candidate actions for a given state.
- Compare them with the expert action.
- Ask the model to decide which action is better.
- Reward correct judgments.
Over time, the model develops an internal notion of action quality.
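The four steps above can be sketched as a single training-step loop. This is an illustrative reconstruction, not the paper's implementation: the function names, the toy policy, and the position-shuffling detail are all assumptions.

```python
import random

def sample_candidates(policy, state, k=3):
    """Candidate Sampling: draw k alternative actions from a base policy."""
    return [policy(state) for _ in range(k)]

def build_pairs(expert_action, candidates):
    """Pair Construction: pair the expert action with each alternative."""
    return [(expert_action, c) for c in candidates]

def act_step(judge, policy, state, expert_action):
    """One ACT comparison step: judge each pair, reward correct picks."""
    rewards = []
    for expert, alt in build_pairs(expert_action,
                                   sample_candidates(policy, state)):
        # Shuffle so the judge cannot rely on option order (an assumption
        # about the setup, not stated in the source).
        options = [expert, alt]
        random.shuffle(options)
        choice = judge(state, options)                    # Critical Judgment
        rewards.append(1.0 if choice == expert else 0.0)  # RL reward signal
    return rewards

# Toy demo with stub policy and judge:
policy = lambda s: random.choice(["hold", "retry", "abort"])
judge = lambda s, opts: opts[0]  # a naive judge that picks the first option
print(act_step(judge, policy, "state-0", "retry"))
```

The reward attaches to the *judgment*, not to the generated action, which is the distinction the next section's workflow table makes explicit.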
The ACT Training Workflow
| Stage | Description | Purpose |
|---|---|---|
| Candidate Sampling | Generate several possible actions from a base policy | Introduce alternatives |
| Pair Construction | Pair expert action with each alternative | Create comparison tasks |
| Critical Judgment | Model selects the better action | Learn quality evaluation |
| RL Optimization | Reward correct judgments | Reinforce reasoning ability |
The crucial difference from imitation learning is that the model is not rewarded for copying. Instead, it is rewarded for correctly identifying superior decisions.
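Concretely, each Critical Judgment task can be framed as a short comparison prompt shown to the model. The template below is a hypothetical sketch; the paper's actual prompt format may differ.

```python
def comparison_prompt(state, action_a, action_b):
    """Format one Critical Judgment task as text.

    Illustrative template only: the model is asked which action is
    better, and the RL reward depends on whether it picks correctly.
    """
    return (
        f"State:\n{state}\n\n"
        f"Action A: {action_a}\n"
        f"Action B: {action_b}\n\n"
        "Which action is better for this state? Answer 'A' or 'B'."
    )

# Example task: expert action vs. a sampled alternative.
print(comparison_prompt("order queue is backed up",
                        "pause intake and drain the queue",
                        "restart the service immediately"))
```

Because the reward checks only whether the judgment is correct, the model is never asked to reproduce the expert action token-for-token.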
This encourages the emergence of reasoning processes that resemble genuine reflection:
- recognizing suboptimal actions
- evaluating trade‑offs
- preferring more effective strategies
In short, ACT trains agents to behave less like parrots and more like reviewers.
Findings — Empirical Performance
The researchers evaluated ACT across several agent benchmarks. The results show consistent improvements when ACT is added to existing training pipelines.
Performance Improvements
| Baseline Training Method | Avg. Gain with ACT | Magnitude |
|---|---|---|
| Imitation Learning | +5.07 | Significant |
| Reinforcement Learning | +4.62 | Strong |
| Reflection Distillation | +2.42 | Moderate |
Two additional results stand out.
1. Strong Out‑of‑Distribution Generalization
Agents trained with ACT perform better on tasks outside the training distribution. This suggests that learning to evaluate actions helps the model transfer reasoning across environments.
2. Spillover Gains in Reasoning Benchmarks
Interestingly, ACT improves performance on general reasoning tasks even though the training data does not explicitly include reasoning datasets.
This hints at an important mechanism: judging actions may implicitly train reasoning skills.
Implications — Why Critical Agents Matter for Real Systems
The implications extend well beyond benchmark scores.
1. Reflection Becomes an Emergent Skill
Rather than inserting reflection text through distillation, ACT allows reflective behavior to emerge from training.
For businesses deploying AI agents, this reduces reliance on handcrafted prompts or curated reflection datasets.
2. Better Robustness in Unfamiliar Environments
Systems that can compare actions are more resilient when encountering unexpected states. This matters in domains like:
- financial trading agents
- automated operations
- autonomous decision support
In these environments, the ability to reject poor actions may be more valuable than perfectly imitating historical ones.
3. A Shift Toward “Evaluator” Models
ACT suggests a broader design pattern: agents may benefit from training signals that reward evaluation ability, not just generation ability.
Future agent architectures may therefore include specialized modules for:
- action critique
- decision comparison
- self‑evaluation loops
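One way to picture those module boundaries is as small composable callables. This is a hypothetical architecture sketch under my own naming, not a design from the paper: a proposer generates actions, a critic scores them, and a self-evaluation loop retries until the critic approves.

```python
from typing import Callable

# Hypothetical module interfaces (names are illustrative assumptions).
Critic = Callable[[str, str], float]   # (state, action) -> quality score
Proposer = Callable[[str], str]        # state -> candidate action

def self_evaluation_loop(propose: Proposer, critique: Critic,
                         state: str, threshold: float = 0.5,
                         max_tries: int = 3) -> str:
    """Self-evaluation loop: propose, critique, and retry until the
    critic approves an action or the retry budget runs out."""
    action = propose(state)
    for _ in range(max_tries - 1):
        if critique(state, action) >= threshold:
            break
        action = propose(state)
    return action

# Toy demo: the critic only approves 'escalate'; the proposer
# cycles through options until the critic is satisfied.
options = iter(["ignore", "retry", "escalate"])
propose = lambda s: next(options)
critique = lambda s, a: 1.0 if a == "escalate" else 0.0
print(self_evaluation_loop(propose, critique, "alert-42"))  # escalate
```

Separating critique from generation like this is exactly the "internal auditor" role described above: the generator need not be trusted, only checked.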
In essence, tomorrow’s agents may behave less like obedient assistants and more like internal auditors.
Conclusion — From Imitation to Judgment
Large language model agents have advanced rapidly, but their training philosophy has remained surprisingly conservative. Most systems still rely on imitation as their primary learning signal.
Agentic Critical Training proposes something refreshingly different.
Instead of teaching models to replicate expert actions, it teaches them to recognize better decisions.
That difference—between copying and judging—may be one of the key steps toward genuinely autonomous agents.
After all, expertise rarely comes from memorizing the correct answer. It comes from understanding why the wrong answers fail.
Cognaptus: Automate the Present, Incubate the Future.