When Rewards Learn to Think: Teaching Agents *How* They’re Wrong

An agent fails a task. It searched the web twice, opened the wrong page, trusted a noisy snippet, wrote a plausible final answer, and lost the point.

Traditional reinforcement learning sees one thing: wrong.

That is brutally clean, and also rather unhelpful. The agent may have performed three useful steps before collapsing at the fourth. Or it may have wandered confidently through nonsense from the beginning. Sparse final-answer rewards flatten these cases into the same training signal. The scoreboard says “0.” Very educational, in the same way a fire alarm teaches architecture.

The paper “Exploring Reasoning Reward Model for Agents” proposes a more diagnostic alternative: Agent-RRM, a reasoning reward model for agent trajectories.¹ Instead of only returning a scalar reward, Agent-RRM produces three linked outputs: an analytical reasoning trace, a targeted critique of the agent’s mistakes, and a holistic score. The authors then test three ways to use those signals in agent training and inference: Reagent-C, Reagent-R, and Reagent-U.

The important idea is not simply “better reward model.” That phrase has become almost decorative. The paper’s real claim is more specific: for long-horizon agents, the reward model should understand the failure mode well enough to say what went wrong. A scalar score can rank trajectories. A critique can teach the agent where the trajectory broke. Reagent-U tries to use both.

That distinction matters because many production agent failures are not mysterious. They are boringly specific. The agent called a tool when it should have reasoned. It reasoned when it should have called a tool. It believed a partial tool result. It hallucinated a file name. It repeated a search with no new purpose. Agent-RRM is designed to evaluate exactly those defects. Finally, a reward model with a little professional resentment toward sloppy workflow.

Final-answer rewards are too coarse for tool-using agents

For ordinary math or short-answer tasks, final correctness can be a tolerable reward signal. Not perfect, but tolerable. If the model gets the answer right, reward it. If not, punish it. This is the basic attraction of reinforcement learning with verifiable rewards: it avoids expensive human labeling and gives a clean optimization target.

Agents make that simplicity less innocent.

A tool-using agent is not just producing a single answer. It is producing a trajectory: decide whether to search, decide what to search, inspect results, browse pages, maybe run code, maybe read a file, maybe interpret an image, then synthesize an answer. In that setting, “final answer correct or incorrect” is a very lossy compression of the actual behavior.

Two failed trajectories can have completely different training value:

| Failed trajectory type | What final-answer reward sees | What a useful evaluator should notice |
| --- | --- | --- |
| Mostly correct reasoning, final synthesis error | 0 | The agent’s tool plan was useful, but final integration failed |
| Correct tool use, noisy tool output accepted too quickly | 0 | The agent needed skepticism toward partial evidence |
| Wrong tool chosen from the start | 0 | The agent misunderstood the task requirements |
| Repeated searches without new information | 0 | The agent is wasting steps rather than narrowing uncertainty |
| Hallucinated file name or URL | 0 | The agent is inventing tool inputs, which is operationally dangerous |

The authors frame this as the limitation of sparse outcome-based rewards. They argue that existing agentic RL often fails to distinguish high-quality intermediate reasoning from fundamentally broken attempts. That is not a minor inconvenience. It means the learning system may throw away useful partial behavior and under-diagnose recurring operational errors.
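The flattening effect is easy to reproduce in a few lines. The sketch below is illustrative only: the `Trajectory` record, the exact-match rule, and the example steps are assumptions made for this article, not the paper's data format or reward implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one agent attempt (illustrative only)."""
    steps: list[str] = field(default_factory=list)
    final_answer: str = ""


def outcome_reward(traj: Trajectory, gold: str) -> float:
    """Sparse rule-based reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if traj.final_answer.strip().lower() == gold.strip().lower() else 0.0


gold = "1947"

# Sound tool plan, but the final synthesis step picked the wrong number.
near_miss = Trajectory(
    steps=["search: signing year", "browse: primary source", "verify: second source"],
    final_answer="1946",
)

# Wrong tool from the start, plus an invented file name.
broken = Trajectory(
    steps=["read_file: notes_final.pdf (file does not exist)"],
    final_answer="1812",
)

rewards = [outcome_reward(t, gold) for t in (near_miss, broken)]
print(rewards)  # both trajectories receive the same training signal
```

Both attempts score zero, so an optimizer that sees only this number cannot tell the near miss from the fabricated file read.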

This is where Agent-RRM enters.

Agent-RRM turns a trajectory into three kinds of feedback

Agent-RRM is trained to evaluate a full agent trajectory. Its output has three components:

<think>: an internal analytical trace evaluating the reasoning and tool-use quality;
<critique>: a concise, actionable summary of the trajectory’s flaws;
<score>: a scalar quality score between 0 and 1.

The division is important. The analytical trace supports the evaluator’s own judgment. The critique is designed to be shown back to the agent. The score becomes a reward signal for reinforcement learning.
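A consumer of that output needs to split it back into its three parts. The sketch below assumes each part arrives wrapped in matching open and close tags; the article names the tags but not the exact serialization, so the closing tags, the `RRMJudgment` class, and the sample text are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class RRMJudgment:
    """Parsed three-part judgment (field names are illustrative)."""
    think: str
    critique: str
    score: float


def _extract(tag: str, text: str) -> str:
    # Assumes each part is wrapped as <tag>...</tag>; the closing tag is an assumption.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"missing <{tag}> block")
    return match.group(1).strip()


def parse_judgment(raw: str) -> RRMJudgment:
    score = float(_extract("score", raw))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return RRMJudgment(_extract("think", raw), _extract("critique", raw), score)


raw_output = """
<think>The second search repeated the first query and added no new evidence.</think>
<critique>Avoid repeating a search unless the query changes.</critique>
<score>0.35</score>
"""

judgment = parse_judgment(raw_output)
print(judgment.critique)  # text intended to be shown back to the agent
print(judgment.score)     # scalar usable as a reward signal
```

Rejecting malformed or out-of-range judgments at parse time keeps a broken evaluator response from silently becoming a reward.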

This is not just a process reward model attached to intermediate steps. The paper’s reward model is asked to judge the global quality of reasoning and tool use. It is not merely checking whether step 4 looked good. It is looking for patterns such as unnecessary tool calls, missing tool calls, hallucinated tool arguments, uncritical acceptance of tool outputs, faulty logical jumps, and misunderstandings of tool limitations.

That makes Agent-RRM closer to a workflow auditor than a scoreboard.

The authors build this evaluator using two datasets: Reagent-RRM-SFT-28K for supervised fine-tuning and Reagent-RRM-RL-90K for reinforcement learning. The training judgments are generated from trajectories sampled from several models, then annotated by GPT-OSS-120B into the required three-part format. The reward model itself, like the agent models, is initialized from Qwen3-8B.

There is a subtle design constraint here: the critique is not supposed to reveal the correct answer. It should diagnose reasoning and tool-use defects without giving away the solution. That matters because otherwise the critique becomes a hidden answer channel, not a learning signal. The prompt in the appendix is explicit about this: evaluate the process, not the final truth.

For enterprise use, that constraint is not academic decoration. A diagnostic layer that leaks answers or reveals privileged information would be difficult to use safely in many settings. A diagnostic layer that says “you trusted an incomplete search result” or “you failed to verify the file source” is much more operationally usable.
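One crude way to audit that constraint offline, on data where a reference answer happens to be available, is a substring check. This is a heuristic invented for illustration, not the paper's method, and the function names are hypothetical. Very short answers such as a single digit will trigger false positives, so it works better as a triage filter than as a gate.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not hide a leak."""
    return re.sub(r"\s+", " ", text.strip().lower())


def critique_leaks_answer(critique: str, gold_answer: str) -> bool:
    """Return True when the reference answer appears verbatim inside the critique."""
    gold = _normalize(gold_answer)
    return bool(gold) and gold in _normalize(critique)


process_only = "You trusted an incomplete search result and never opened the source page."
leaky = "The answer should be Marie  Curie, not Pierre."

print(critique_leaks_answer(process_only, "Marie Curie"))
print(critique_leaks_answer(leaky, "marie curie"))
```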

Reagent-C, Reagent-R, and Reagent-U test three ways to use diagnosis

The paper’s experiments are organized around three integration strategies. This is the right way to read the work. The benchmark tables are evidence, but the mechanism is the story.

| Variant or analysis | What it uses or covers | Where it acts | Likely experimental purpose |
| --- | --- | --- | --- |
| Reagent-C | Textual critique | Inference-time refinement, no parameter update | Tests whether critiques are directly useful for correcting an agent’s answer |
| Reagent-R | Scalar score | RL training reward, combined with rule-based reward | Ablates the value of dense model-based reward without critique-guided refinement |
| Reagent-U | Textual critique + scalar score | Unified RL loop using initial and refined trajectories | Tests whether critique and score are complementary rather than redundant |
| Lambda analysis | Reward weight $\lambda$ | Training reward balance | Sensitivity check on how much model-based reward should influence learning |
| GAIA full-set evaluation | Multimodal and broader tool tasks | Evaluation | Extension beyond text-only search tasks |

Reagent-C is the simplest. The agent first produces an answer. Agent-RRM critiques the trajectory. The same frozen agent then tries again using that critique. This isolates the immediate value of textual feedback. No training magic. No updated weights. Just, “Here is what you did wrong; try again.”
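The control flow is small enough to sketch. Everything here is a stand-in: the `Agent` and `Critic` callables, the retry prompt wording, and the stub behavior are assumptions for illustration, since the article does not reproduce the paper's prompts.

```python
from typing import Callable

Agent = Callable[[str], str]         # prompt -> trajectory text (hypothetical interface)
Critic = Callable[[str, str], str]   # (task, trajectory) -> critique text (hypothetical)


def refine_once(task: str, agent: Agent, critic: Critic) -> tuple[str, str]:
    """Run the frozen agent, critique the attempt, then run the same agent again."""
    first = agent(task)
    critique = critic(task, first)
    retry_prompt = (
        f"{task}\n\nPrevious attempt:\n{first}\n\n"
        f"Critique of that attempt:\n{critique}\n\nTry again."
    )
    second = agent(retry_prompt)  # same weights, new context
    return first, second


# Stubs standing in for real models, so the loop can be exercised end to end.
def stub_agent(prompt: str) -> str:
    return "answer: 42 (verified)" if "Critique" in prompt else "answer: 41"


def stub_critic(task: str, trajectory: str) -> str:
    return "The arithmetic in the last step was not checked."


first, second = refine_once("What is 6 * 7?", stub_agent, stub_critic)
print(first)
print(second)
```

The only thing that changes between the two calls is the prompt, which is what makes this variant a clean test of the critique text itself.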

Reagent-R uses the scalar score as a dense reward signal during RL. The reward combines rule-based correctness with the Agent-RRM score:

$$ R = R_{\text{rule}} + \lambda R_{\text{RRM}} $$

The paper sets $\lambda = 0.3$ in its main experiments. The purpose is to reduce the poverty of binary correctness rewards. If the final answer is wrong but the reasoning path was partially strong, the scalar score can still encode that difference.
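A worked example makes the effect concrete. The formula and the 0.3 weight come from the text above; the binary rule reward and the three Agent-RRM scores are made-up values chosen for illustration.

```python
LAMBDA = 0.3  # weight the article reports for the paper's main experiments


def combined_reward(r_rule: float, r_rrm: float, lam: float = LAMBDA) -> float:
    """R = R_rule + lambda * R_RRM."""
    return r_rule + lam * r_rrm


# Illustrative (made-up) Agent-RRM scores for three trajectories.
near_miss = combined_reward(r_rule=0.0, r_rrm=0.8)  # wrong answer, strong process
broken = combined_reward(r_rule=0.0, r_rrm=0.1)     # wrong answer, poor process
solved = combined_reward(r_rule=1.0, r_rrm=0.9)     # right answer, strong process

for name, value in [("near_miss", near_miss), ("broken", broken), ("solved", solved)]:
    print(f"{name}: {value:.2f}")
```

The two failed trajectories no longer share a reward (about 0.24 versus 0.03), while the solved one still dominates both because the rule-based term carries most of the weight.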

Reagent-U combines both ideas. During training, the agent generates an initial response, receives a critique, generates a refined response, and the system pools both initial and refined trajectories when computing rewards and advantages. The textual critique helps produce improved trajectories; the scalar score helps rank their quality. At inference time, however, Reagent-U runs as a normal agent without external critique refinement.

That last detail is crucial. Reagent-U is not merely buying better test-time answers by calling a second evaluator. It uses critique during training so the agent internalizes better behavior. In production language: the critic is a teacher, not a permanent extra employee sitting next to the agent on every task.
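The pooling step described for Reagent-U can be sketched as follows. The article only says that initial and refined trajectories are pooled when rewards and advantages are computed; the group-mean, group-standard-deviation normalization used here is an assumption borrowed from common group-relative RL setups, and the reward values are invented.

```python
from statistics import mean, pstdev


def pooled_advantages(initial_rewards, refined_rewards, eps=1e-6):
    """Normalize rewards over the pooled group of initial + refined rollouts."""
    pooled = list(initial_rewards) + list(refined_rewards)
    mu = mean(pooled)
    sigma = pstdev(pooled)
    # eps keeps the division defined when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in pooled]


initial = [0.03, 0.24]  # rewards of first attempts (made-up values)
refined = [0.27, 1.27]  # rewards of critique-guided retries (made-up values)

advantages = pooled_advantages(initial, refined)
best = max(range(len(advantages)), key=advantages.__getitem__)
print(best)  # index of the trajectory with the largest advantage
```

Because both sets share one baseline, a refined trajectory that fixes the failure is pushed up relative to the first attempts it improved on, which is one plausible reading of how critique-guided rollouts feed the policy update.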

The main evidence says unified feedback beats either signal alone

The headline results appear across 12 benchmarks covering general agent/search tasks, knowledge-intensive reasoning, and mathematical reasoning. The most business-relevant results are the agentic and search benchmarks because they resemble the messy tool-use workflows companies actually care about.

On GAIA text, Reagent-U reaches 43.7 average pass@1. The sparse-reward Reagent baseline without Agent-RRM reaches 34.0. Reagent-R reaches 36.9, and Reagent-C reaches 25.2. On WebWalkerQA, Reagent-U reaches 46.2, compared with 43.5 for Reagent without Agent-RRM and 45.3 for Reagent-R.

The pattern is more interesting than any single number. Reagent-R improves over the no-Agent-RRM baseline on most tasks, suggesting that scalar reasoning rewards help. But Reagent-U usually improves further, suggesting that critique and score are not substitutes. The score tells the optimizer which trajectories are better. The critique helps generate better trajectories in the first place.

This is the mechanism-first interpretation: the paper is not saying “one more reward model wins benchmark.” It is saying that agent learning benefits when evaluation is both discriminative and diagnostic.

The results on knowledge-intensive and math benchmarks reinforce this interpretation. Reagent-U reaches 68.1 on HotpotQA, 78.8 on 2Wiki, 76.8 on Bamboogle, and 31.3 on MuSiQue. On math, it reaches 60.0 on AIME24, 50.0 on AIME25, 93.8 on MATH500, and 95.1 on GSM8K. These are not all equally impressive in business terms, but they help show that the method is not only tuned for web traversal.

Still, the agentic benchmarks carry the main practical weight. A company does not usually need an agent that wins one benchmark category while collapsing when asked to combine search, browsing, code, files, and images. The paper’s broader claim is that reasoning feedback improves long-horizon decision-making across heterogeneous tasks.

Reagent-C is the diagnostic proof, not the deployment answer

Reagent-C deserves careful interpretation. Its performance is not the best overall. On GAIA text, it reaches 25.2, below the 34.0 of the trained Reagent baseline without Agent-RRM. On WebWalkerQA, it reaches 35.5, also below that baseline’s 43.5. A lazy reading would dismiss critique-based refinement.

That would miss the point.

Reagent-C is not competing as the final product. It is a mechanism test. It asks: if we give a frozen model a targeted critique, can it use that feedback to repair its own reasoning? The authors report consistent gains from the refined second response over the initial response, especially in mathematical reasoning. The case studies in the appendix show Agent-RRM identifying logical inconsistencies or inappropriate tool use, enabling the second attempt to correct the failure.

That is valuable evidence because Agent-RRM does not need access to the ground-truth answer to generate the critique. It is judging process quality. In practical terms, this resembles a QA reviewer that can say, “Your retrieval step did not actually verify the claim,” without knowing the final answer in advance.

For enterprise agents, this is often the missing layer. Logs already exist. Failed workflows already exist. The hard part is converting those traces into structured lessons. Reagent-C shows that the critique signal contains usable information. Reagent-U then shows that using this signal during training is stronger than relying on ad hoc test-time self-revision.

The lambda test is about balance, not magic hyperparameters

The paper includes a parameter analysis on the Agent-RRM reward weight $\lambda$, using AIME24 and xbench. This should be read as a sensitivity test. It is not a second thesis.

The result is intuitive: performance improves as Agent-RRM reward is introduced, plateaus at moderate values, then slightly declines when the reward weight becomes too high. The authors interpret this as a balance problem. Reasoning-quality reward helps, but over-emphasizing intermediate reasoning can distract from final task completion.

That matters more than the exact curve.

In business systems, the equivalent mistake is over-optimizing for “good-looking process.” A support agent can cite policies beautifully and still fail to resolve the customer’s issue. A research agent can perform careful browsing and still answer the wrong question. A compliance agent can produce a very tidy audit trail while missing the substantive risk. Lovely paperwork. Still wrong.

Agent-RRM helps because it adds process sensitivity. But the paper’s own sensitivity result reminds us that process reward must remain anchored to outcome reward. The critic should improve execution, not turn the agent into a theater student performing diligence.
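A toy calculation shows one way over-weighting could go wrong. It reuses the reward formula $R = R_{\text{rule}} + \lambda R_{\text{RRM}}$ with two invented trajectories; the scores and the tested weights are assumptions, and this is not the paper's own explanation of the observed decline.

```python
def combined_reward(r_rule: float, r_rrm: float, lam: float) -> float:
    """R = R_rule + lambda * R_RRM."""
    return r_rule + lam * r_rrm


# Made-up (rule reward, Agent-RRM score) pairs.
tidy_but_wrong = (0.0, 0.9)   # polished process, wrong final answer
messy_but_right = (1.0, 0.2)  # clumsy process, correct final answer


def ranking_flips(lam: float) -> bool:
    """True when the wrong-but-tidy trajectory outranks the correct one."""
    return combined_reward(*tidy_but_wrong, lam) > combined_reward(*messy_but_right, lam)


flips = {lam: ranking_flips(lam) for lam in (0.0, 0.3, 1.0, 2.0)}
print(flips)
```

With these numbers the ranking inverts only once $\lambda$ exceeds $1 / 0.7 \approx 1.43$, far above the 0.3 used in the main experiments. The point is the direction of the risk, not the threshold.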

The full GAIA test checks whether the method survives beyond text search

Many agent papers quietly live inside text-only search tasks. That is understandable. Text search is easier to standardize, easier to score, and less computationally irritating. Real workflows, unfortunately, do not respect benchmark convenience.

The authors therefore evaluate Reagent-U on the GAIA full set, which includes tasks requiring broader combinations of search, multimodal interpretation, Python coding, and file-based reasoning. Reagent-U reaches 38.8 pass@1 and 53.9 pass@3 on the full set. For comparison, MCP-R1 reaches 37.6 pass@1 and 51.5 pass@3, while base Qwen3-8B reaches 20.0 pass@1 and 26.7 pass@3.

This is best read as an exploratory extension beyond text-only search, not as proof that the method is ready for every enterprise workflow. The evaluation still uses standardized benchmark tasks and a particular tool suite. But it does reduce one concern: that Reagent-U merely learned a web-search trick. The full-set result suggests the learned behavior generalizes somewhat across tool categories.
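For readers unfamiliar with the pass@1 and pass@3 figures quoted above, the sketch below uses the simplest reading: a task counts as solved at $k$ if any of its first $k$ attempts is correct. The article does not say which estimator the paper uses, so this definition and the sample results are assumptions.

```python
def pass_at_k(attempts_per_task: list[list[bool]], k: int) -> float:
    """Percentage of tasks with at least one correct attempt among the first k."""
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return 100.0 * solved / len(attempts_per_task)


# Made-up correctness flags: one inner list per task, one flag per attempt.
results = [
    [True, False, True],
    [False, False, True],
    [False, False, False],
    [False, True, False],
]

print(pass_at_k(results, 1))
print(pass_at_k(results, 3))
```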

The tool suite itself is worth noting. The agent uses search, web browsing, Python execution, file reading, image description, and audio transcription. In training and evaluation, these tools are implemented using services such as the Bing Search API, Jina Reader, DeepSeek-Chat for webpage condensation, GPT-4.1 for image-to-text, and Whisper-large-v3 for audio transcription.

That setup is powerful, but it also complicates interpretation. The agent is not operating in a sterile model-only environment. Tool quality, retrieval quality, summarization quality, and evaluator quality all shape the observed outcome. Anyone translating this into production should treat the tool layer as part of the model, not as plumbing that can be ignored because it has an API endpoint. APIs are where neat research systems go to meet weather.

What this paper directly shows

The paper directly supports three conclusions.

First, reasoning-aware reward models can provide useful feedback for agent trajectories. Agent-RRM is trained to produce structured judgments over reasoning and tool use, not just final correctness. The Reagent-C results and appendix case studies support the claim that textual critiques can guide correction.

Second, scalar reasoning rewards improve agent training compared with sparse rule-based rewards alone. Reagent-R outperforms the Reagent baseline without Agent-RRM on most of the benchmark suite, with notable gains on benchmarks such as Bamboogle and xbench.

Third, combining textual critique and scalar reward works better than using either signal alone in most reported settings. Reagent-U is the strongest variant overall, reaching 43.7 on GAIA text, 46.2 on WebWalkerQA, 76.8 on Bamboogle, and 60.0 on AIME24.

These are meaningful results. They are not proof of universal deployment readiness.

What Cognaptus infers for business use

For enterprise AI, the paper points toward a practical design pattern: agent systems should not only log outcomes; they should learn from diagnosed trajectories.

A common enterprise agent stack already has the raw material:

the user request;
the model’s intermediate thoughts or action plan, where available;
tool calls;
tool outputs;
final response;
human correction or downstream success/failure signal.

The missing layer is structured evaluation. A reasoning reward model could turn those traces into a repeatable diagnostic format: what went wrong, whether tool use was appropriate, how severe the issue was, and whether the final answer was saved by luck or earned by good process.
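One possible shape for that diagnostic format is sketched below. The class names, fields, severity labels, and the 0.5 threshold are hypothetical choices made for illustration; neither the paper nor any particular logging stack prescribes them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    output: str


@dataclass
class TraceRecord:
    user_request: str
    final_response: str
    plan: Optional[str] = None            # intermediate thoughts, where available
    tool_calls: list[ToolCall] = field(default_factory=list)
    outcome_ok: Optional[bool] = None     # human correction or downstream signal


@dataclass
class Diagnosis:
    critique: str                 # what went wrong
    tool_use_appropriate: bool    # whether tool use fit the task
    severity: str                 # e.g. "low", "medium", "high"
    score: float                  # process-quality score in [0, 1]


def saved_by_luck(trace: TraceRecord, diag: Diagnosis, threshold: float = 0.5) -> bool:
    """Correct outcome reached through a process the evaluator scored as weak."""
    return trace.outcome_ok is True and diag.score < threshold


trace = TraceRecord(
    user_request="Which policy covers refunds after 30 days?",
    final_response="Policy R-7 applies.",
    tool_calls=[ToolCall("search", {"query": "refund policy"}, "snippet mentions R-7")],
    outcome_ok=True,
)
diag = Diagnosis(
    critique="Answer rests on a search snippet; the policy page was never opened.",
    tool_use_appropriate=False,
    severity="medium",
    score=0.3,
)
print(saved_by_luck(trace, diag))
```

Keeping the outcome signal on the trace and the process judgment on the diagnosis makes it straightforward to flag answers that were right for fragile reasons.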

That creates a more useful improvement loop:

| Operational artifact | Diagnostic conversion | Training or governance use |
| --- | --- | --- |
| Failed support-agent conversation | Critique missing policy lookup or unsupported claim | Fine-tune agent behavior; update escalation triggers |
| Bad research-agent answer | Identify weak search query, unverified source, or synthesis error | Improve retrieval policy and answer verification |
| Workflow automation error | Detect hallucinated file name or wrong tool arguments | Harden tool schemas and permission checks |
| Correct answer with poor process | Score lower due to fragile reasoning path | Avoid rewarding lucky shortcuts |
| Incorrect answer with useful partial work | Score partial quality and critique final integration | Preserve useful sub-skills during training |

The ROI relevance is not “cheaper training” in the abstract. It is cheaper diagnosis. Human reviewers are expensive because they must inspect logs, identify failure patterns, and translate them into engineering changes. A reasoning reward model can potentially compress that work into structured feedback at scale.
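Once critiques carry structured labels, finding the recurring failure patterns becomes a counting exercise. The sketch below assumes each diagnosed trace has already been tagged with defect categories; the tag names echo the defects discussed earlier in this article, but the record layout and the data are invented.

```python
from collections import Counter

# Hypothetical diagnosed traces, each tagged with one or more defect categories.
diagnosed = [
    {"trace_id": "t1", "tags": ["uncritical_tool_output"]},
    {"trace_id": "t2", "tags": ["hallucinated_tool_argument", "uncritical_tool_output"]},
    {"trace_id": "t3", "tags": ["unnecessary_tool_call"]},
    {"trace_id": "t4", "tags": ["hallucinated_tool_argument"]},
    {"trace_id": "t5", "tags": ["uncritical_tool_output"]},
]


def top_failure_modes(records, n=3):
    """Count defect tags across traces and return the n most frequent."""
    counts = Counter(tag for rec in records for tag in rec["tags"])
    return counts.most_common(n)


for tag, count in top_failure_modes(diagnosed, n=2):
    print(f"{tag}: {count}")
```

A ranked list like this is what a human reviewer would otherwise assemble by reading logs one at a time.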

The business value is especially plausible in domains where failures are repetitive but not trivially verifiable: internal research, procurement analysis, compliance triage, technical support, financial document review, and multi-step data operations. These are places where final answers matter, but process defects are what determine whether the system can be trusted.

Where the result should not be over-sold

The paper is strongest as a research demonstration of structured reasoning feedback for agents. Its boundaries are clear.

First, the experiments primarily use 8B-scale Qwen3 models. The authors explicitly note that scaling behavior on larger models remains unexplored. Larger models may benefit more, benefit less, or require different reward calibration. The paper does not settle that.

Second, the benchmarks are broad but still standardized. Real enterprise workflows contain messier permissions, private databases, stale documents, contradictory policies, latency constraints, and organizational politics. Sadly, “organizational politics” remains underrepresented in benchmark suites, perhaps because no one wants to annotate despair.

Third, the reward model depends on synthetic and model-generated supervision. GPT-OSS-120B is used to annotate structured judgments for reward-model training. That may be practical, but it also means quality depends on the annotator model’s ability to identify reasoning and tool-use errors.

Fourth, the system still needs defenses against reward hacking. The paper positions Agent-RRM as a way to reduce the limitations of sparse rewards and step-level scalar rewards, but any reward model can become a target once agents learn to optimize against it. The lambda sensitivity analysis already hints at the broader issue: too much weight on model-based reasoning reward can pull training away from final task success.

Fifth, production cost is not fully resolved. Reagent-U avoids critique calls at inference time, which is attractive. But training the reward model and agent still requires significant compute, trajectory generation, tool infrastructure, and evaluation design. The authors report training on 8 NVIDIA A800-80G GPUs. This is not a weekend spreadsheet macro wearing a hoodie.

The larger lesson: agents need teachers, not just graders

Agent-RRM is useful because it changes the role of reward.

A conventional reward says, “This answer is good” or “This answer is bad.” A reasoning reward model says, “This answer failed because your tool choice was wrong, your evidence was incomplete, and your final inference jumped over the missing step.” That is a different kind of supervision.

For agents, this difference is not cosmetic. Agents act over time. They call tools. They recover from partial information. They make operational choices before they make final claims. A final score can tell us whether the journey ended well. It cannot explain where the journey became stupid.

The paper’s most important contribution is therefore not the GAIA number, although 43.7 is a useful marker. The contribution is the training pattern: use a reward model that can reason about trajectories, generate critique, assign a score, and teach the agent during training so it behaves better at inference without needing the critic beside it.

That is a plausible direction for enterprise AI systems. Not because it makes agents magically reliable. It does not. But because it attacks a real bottleneck: agents do not only need more tools; they need better feedback on how they misuse the tools they already have.

The next wave of useful agent infrastructure may look less like a bigger toolbox and more like a strict reviewer who reads the logs, circles the bad step, and says: “No, darling. That is not reasoning. That is autocomplete with confidence.”

Cognaptus: Automate the Present, Incubate the Future.

¹ Kaixuan Fan et al., “Exploring Reasoning Reward Model for Agents,” arXiv:2601.22154v2, 28 Apr. 2026, https://arxiv.org/abs/2601.22154.
