Opening — Why This Matters Now
The current wave of AI agents promises something ambitious: systems that plan, act, evaluate outcomes, and adapt. In theory, they resemble junior analysts—observing a situation, choosing an action, and refining their judgment over time.
In practice, however, many so‑called “agents” are little more than skilled imitators.
Most agent training pipelines rely on imitation learning: the model copies actions demonstrated by experts. This produces competent behavior, but it hides a critical weakness. The model learns what to do, but rarely learns why one action is better than another. Without that comparative judgment, agents struggle to reflect on mistakes or adapt to unfamiliar situations.
A recent research paper proposes an intriguing alternative: Agentic Critical Training (ACT). Rather than teaching models to replicate expert actions, ACT trains them to judge the quality of actions themselves.
It is a subtle shift—but an important one. Instead of imitation, the model learns criticism. And criticism, as any experienced manager knows, is often the beginning of expertise.
Background — The Limits of Imitation in Agent Training
The dominant paradigm for training LLM agents typically follows three stages:
- Imitation Learning (IL) – Models replicate expert trajectories.
- Reinforcement Learning (RL) – Models refine behavior using reward signals.
- Self‑Reflection Augmentation – Some pipelines add reflection text explaining why an action works.
At first glance, reflection seems to solve the reasoning gap. But most implementations simply distill reflection text from stronger models or human annotators.
In other words, the model is not truly reflecting. It is imitating reflection.
This distinction matters. A model trained to copy explanations may produce elegant reasoning chains, yet still fail when confronted with unfamiliar situations. The underlying reasoning mechanism was never learned—it was merely rehearsed.
ACT reframes the problem entirely.
Instead of teaching a model to explain actions, it trains the model to identify the better action among alternatives.
Analysis — What Agentic Critical Training Actually Does
Agentic Critical Training introduces a reinforcement learning loop focused on comparative evaluation.
The core idea is straightforward:
- Generate multiple candidate actions for a given state.
- Compare them with the expert action.
- Ask the model to decide which action is better.
- Reward correct judgments.
Over time, the model develops an internal notion of action quality.
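The four steps above can be sketched as a single training-step loop. This is an illustrative reconstruction, not the paper's implementation: the function names, the toy policy, and the position-shuffling detail are all assumptions.

```python
import random

def sample_candidates(policy, state, k=3):
    """Candidate Sampling: draw k alternative actions from a base policy."""
    return [policy(state) for _ in range(k)]

def build_pairs(expert_action, candidates):
    """Pair Construction: pair the expert action with each alternative."""
    return [(expert_action, c) for c in candidates]

def act_step(judge, policy, state, expert_action):
    """One ACT comparison step: judge each pair, reward correct picks."""
    rewards = []
    for expert, alt in build_pairs(expert_action,
                                   sample_candidates(policy, state)):
        # Shuffle so the judge cannot rely on option order (an assumption
        # about the setup, not stated in the source).
        options = [expert, alt]
        random.shuffle(options)
        choice = judge(state, options)                    # Critical Judgment
        rewards.append(1.0 if choice == expert else 0.0)  # RL reward signal
    return rewards

# Toy demo with stub policy and judge:
policy = lambda s: random.choice(["hold", "retry", "abort"])
judge = lambda s, opts: opts[0]  # a naive judge that picks the first option
print(act_step(judge, policy, "state-0", "retry"))
```

The reward attaches to the *judgment*, not to the generated action, which is the distinction the next section's workflow table makes explicit.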
The ACT Training Workflow
| Stage | Description | Purpose |
|---|---|---|
| Candidate Sampling | Generate several possible actions from a base policy | Introduce alternatives |
| Pair Construction | Pair expert action with each alternative | Create comparison tasks |
| Critical Judgment | Model selects the better action | Learn quality evaluation |
| RL Optimization | Reward correct judgments | Reinforce reasoning ability |
The crucial difference from imitation learning is that the model is not rewarded for copying. Instead, it is rewarded for correctly identifying superior decisions.
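Concretely, each Critical Judgment task can be framed as a short comparison prompt shown to the model. The template below is a hypothetical sketch; the paper's actual prompt format may differ.

```python
def comparison_prompt(state, action_a, action_b):
    """Format one Critical Judgment task as text.

    Illustrative template only: the model is asked which action is
    better, and the RL reward depends on whether it picks correctly.
    """
    return (
        f"State:\n{state}\n\n"
        f"Action A: {action_a}\n"
        f"Action B: {action_b}\n\n"
        "Which action is better for this state? Answer 'A' or 'B'."
    )

# Example task: expert action vs. a sampled alternative.
print(comparison_prompt("order queue is backed up",
                        "pause intake and drain the queue",
                        "restart the service immediately"))
```

Because the reward checks only whether the judgment is correct, the model is never asked to reproduce the expert action token-for-token.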
This encourages the emergence of reasoning processes that resemble genuine reflection:
- recognizing suboptimal actions
- evaluating trade‑offs
- preferring more effective strategies
In short, ACT trains agents to behave less like parrots and more like reviewers.
Findings — Empirical Performance
The researchers evaluated ACT across several agent benchmarks. The results show consistent improvements when ACT is added to existing training pipelines.
Performance Improvements
| Baseline Training Method | Avg. Gain with ACT | Magnitude |
|---|---|---|
| Imitation Learning | +5.07 | Significant |
| Reinforcement Learning | +4.62 | Strong |
| Reflection Distillation | +2.42 | Moderate |
Two additional results stand out.
1. Strong Out‑of‑Distribution Generalization
Agents trained with ACT perform better on tasks outside the training distribution. This suggests that learning to evaluate actions helps the model transfer reasoning across environments.
2. Spillover Gains in Reasoning Benchmarks
Interestingly, ACT improves performance on general reasoning tasks even though the training data does not explicitly include reasoning datasets.
This hints at an important mechanism: judging actions may implicitly train reasoning skills.
Implications — Why Critical Agents Matter for Real Systems
The implications extend well beyond benchmark scores.
1. Reflection Becomes an Emergent Skill
Rather than inserting reflection text through distillation, ACT allows reflective behavior to emerge from training.
For businesses deploying AI agents, this reduces reliance on handcrafted prompts or curated reflection datasets.
2. Better Robustness in Unfamiliar Environments
Systems that can compare actions are more resilient when encountering unexpected states. This matters in domains like:
- financial trading agents
- automated operations
- autonomous decision support
In these environments, the ability to reject poor actions may be more valuable than perfectly imitating historical ones.
3. A Shift Toward “Evaluator” Models
ACT suggests a broader design pattern: agents may benefit from training signals that reward evaluation ability, not just generation ability.
Future agent architectures may therefore include specialized modules for:
- action critique
- decision comparison
- self‑evaluation loops
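One way to picture those module boundaries is as small composable callables. This is a hypothetical architecture sketch under my own naming, not a design from the paper: a proposer generates actions, a critic scores them, and a self-evaluation loop retries until the critic approves.

```python
from typing import Callable

# Hypothetical module interfaces (names are illustrative assumptions).
Critic = Callable[[str, str], float]   # (state, action) -> quality score
Proposer = Callable[[str], str]        # state -> candidate action

def self_evaluation_loop(propose: Proposer, critique: Critic,
                         state: str, threshold: float = 0.5,
                         max_tries: int = 3) -> str:
    """Self-evaluation loop: propose, critique, and retry until the
    critic approves an action or the retry budget runs out."""
    action = propose(state)
    for _ in range(max_tries - 1):
        if critique(state, action) >= threshold:
            break
        action = propose(state)
    return action

# Toy demo: the critic only approves 'escalate'; the proposer
# cycles through options until the critic is satisfied.
options = iter(["ignore", "retry", "escalate"])
propose = lambda s: next(options)
critique = lambda s, a: 1.0 if a == "escalate" else 0.0
print(self_evaluation_loop(propose, critique, "alert-42"))  # escalate
```

Separating critique from generation like this is exactly the "internal auditor" role described above: the generator need not be trusted, only checked.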
In essence, tomorrow’s agents may behave less like obedient assistants and more like internal auditors.
Conclusion — From Imitation to Judgment
Large language model agents have advanced rapidly, but their training philosophy has remained surprisingly conservative. Most systems still rely on imitation as their primary learning signal.
Agentic Critical Training proposes something refreshingly different.
Instead of teaching models to replicate expert actions, it teaches them to recognize better decisions.
That difference—between copying and judging—may be one of the key steps toward genuinely autonomous agents.
After all, expertise rarely comes from memorizing the correct answer. It comes from understanding why the wrong answers fail.
Cognaptus: Automate the Present, Incubate the Future.