Opening — Why this matters now

The current wave of AI agents promises something ambitious: systems that plan, act, evaluate outcomes, and adapt. In theory, they resemble junior analysts—observing a situation, choosing an action, and refining their judgment over time.

In practice, however, many so‑called “agents” are little more than skilled imitators.

Most agent training pipelines rely on imitation learning: the model copies actions demonstrated by experts. This produces competent behavior, but it hides a critical weakness. The model learns what to do, but rarely learns why one action is better than another. Without that comparative judgment, agents struggle to reflect on mistakes or adapt to unfamiliar situations.

A recent research paper proposes an intriguing alternative: Agentic Critical Training (ACT). Rather than teaching models to replicate expert actions, ACT trains them to judge the quality of actions themselves.

It is a subtle shift—but an important one. Instead of imitation, the model learns criticism. And criticism, as any experienced manager knows, is often the beginning of expertise.


Background — The Limits of Imitation in Agent Training

The dominant paradigm for training LLM agents typically follows three stages:

  1. Imitation Learning (IL) – Models replicate expert trajectories.
  2. Reinforcement Learning (RL) – Models refine behavior using reward signals.
  3. Self‑Reflection Augmentation – Some pipelines add reflection text explaining why an action works.

At first glance, reflection seems to solve the reasoning gap. But most implementations simply distill reflection text from stronger models or human annotators.

In other words, the model is not truly reflecting. It is imitating reflection.

This distinction matters. A model trained to copy explanations may produce elegant reasoning chains, yet still fail when confronted with unfamiliar situations. The underlying reasoning mechanism was never learned—it was merely rehearsed.

ACT reframes the problem entirely.

Instead of teaching a model to explain actions, it trains the model to identify the better action among alternatives.
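Concretely, each training example can be framed as a small judgment prompt. The template below is a hypothetical illustration of that framing; the paper's actual prompt wording is not reproduced here, and the state and actions are invented for the example.

```python
# Hypothetical judgment prompt for an ACT-style comparison task.
# The paper's actual prompt format may differ.
JUDGE_PROMPT = """State: {state}

Action A: {a}
Action B: {b}

Which action better advances the task? Answer with 'A' or 'B'."""

# Example instantiation (all content here is illustrative, not from the paper)
prompt = JUDGE_PROMPT.format(
    state="inventory page open; user asked to restock item #42",
    a="click 'Reorder' on item #42",
    b="navigate back to the home page",
)
```

The model's answer ('A' or 'B') is what gets rewarded, not the action text itself.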


Analysis — What Agentic Critical Training Actually Does

Agentic Critical Training introduces a reinforcement learning loop focused on comparative evaluation.

The core idea is straightforward:

  1. Generate multiple candidate actions for a given state.
  2. Compare them with the expert action.
  3. Ask the model to decide which action is better.
  4. Reward correct judgments.

Over time, the model develops an internal notion of action quality.
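The four steps above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: it assumes a `policy_sample` function that draws candidate actions and a `judge` callable that picks between two actions, and it treats the expert action as the ground-truth better choice.

```python
import random

def act_training_step(state, expert_action, policy_sample, judge, n_candidates=4):
    """One ACT-style step (sketch): sample alternatives, pair each with the
    expert action, and score the judge's picks."""
    # 1. Generate multiple candidate actions for the given state.
    candidates = [policy_sample(state) for _ in range(n_candidates)]

    rewards = []
    for cand in candidates:
        # 2. Pair the expert action with the alternative, in random order
        #    so the judge cannot exploit positional cues.
        pair = [expert_action, cand]
        random.shuffle(pair)

        # 3. Ask the model to decide which action is better (returns 0 or 1).
        choice = judge(state, pair[0], pair[1])

        # 4. Reward correct judgments (expert assumed superior here).
        rewards.append(1.0 if pair[choice] == expert_action else 0.0)
    return rewards
```

In a real pipeline the rewards would feed a reinforcement learning update on the judge; here they are simply returned for inspection.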

The ACT Training Workflow

| Stage | Description | Purpose |
|---|---|---|
| Candidate Sampling | Generate several possible actions from a base policy | Introduce alternatives |
| Pair Construction | Pair the expert action with each alternative | Create comparison tasks |
| Critical Judgment | Model selects the better action | Learn quality evaluation |
| RL Optimization | Reward correct judgments | Reinforce reasoning ability |

The crucial difference from imitation learning is that the model is not rewarded for copying. Instead, it is rewarded for correctly identifying superior decisions.

This encourages the emergence of reasoning processes that resemble genuine reflection:

  • recognizing suboptimal actions
  • evaluating trade‑offs
  • preferring more effective strategies

In short, ACT trains agents to behave less like parrots and more like reviewers.


Findings — Empirical Performance

The researchers evaluated ACT across several agent benchmarks. The results show consistent improvements when ACT is added to existing training pipelines.

Performance Improvements

| Training Method | Baseline Score | With ACT | Improvement |
|---|---|---|---|
| Imitation Learning | Baseline | +5.07 avg | Significant |
| Reinforcement Learning | Baseline | +4.62 avg | Strong |
| Reflection Distillation | Baseline | +2.42 avg | Moderate |

Two additional results stand out.

1. Strong Out‑of‑Distribution Generalization

Agents trained with ACT perform better on tasks outside the training distribution. This suggests that learning to evaluate actions helps the model transfer reasoning across environments.

2. Spillover Gains in Reasoning Benchmarks

Interestingly, ACT improves performance on general reasoning tasks even though the training data does not explicitly include reasoning datasets.

This hints at an important mechanism: judging actions may implicitly train reasoning skills.


Implications — Why Critical Agents Matter for Real Systems

The implications extend well beyond benchmark scores.

1. Reflection Becomes an Emergent Skill

Rather than inserting reflection text through distillation, ACT allows reflective behavior to emerge from training.

For businesses deploying AI agents, this reduces reliance on handcrafted prompts or curated reflection datasets.

2. Better Robustness in Unfamiliar Environments

Systems that can compare actions are more resilient when encountering unexpected states. This matters in domains like:

  • financial trading agents
  • automated operations
  • autonomous decision support

In these environments, the ability to reject poor actions may be more valuable than perfectly imitating historical ones.

3. A Shift Toward “Evaluator” Models

ACT suggests a broader design pattern: agents may benefit from training signals that reward evaluation ability, not just generation ability.

Future agent architectures may therefore include specialized modules for:

  • action critique
  • decision comparison
  • self‑evaluation loops

In essence, tomorrow’s agents may behave less like obedient assistants and more like internal auditors.


Conclusion — From Imitation to Judgment

Large language model agents have advanced rapidly, but their training philosophy has remained surprisingly conservative. Most systems still rely on imitation as their primary learning signal.

Agentic Critical Training proposes something refreshingly different.

Instead of teaching models to replicate expert actions, it teaches them to recognize better decisions.

That difference—between copying and judging—may be one of the key steps toward genuinely autonomous agents.

After all, expertise rarely comes from memorizing the correct answer. It comes from understanding why the wrong answers fail.

Cognaptus: Automate the Present, Incubate the Future.