Search failure is boring until it becomes expensive.

A research agent asks for evidence. The retriever returns documents. The reasoning model reads them, continues writing, and eventually produces a confident answer. Somewhere in the middle, the evidence was slightly wrong: not irrelevant enough to trigger an obvious failure, not useful enough to support the next reasoning step. The agent proceeds anyway, because that is what agents do when we dress up uncertainty as workflow automation.

This is the practical problem behind Critic-R, a paper from Md Zarif Ul Alam, Alireza Salemi, and Hamed Zamani on improving agentic search with instruction-tuned retrievers and natural-language introspective feedback [1]. The paper’s key move is simple, and more interesting than another “RAG benchmark improves” headline: after a retrieval call, do not immediately commit the retrieved documents to the agent’s reasoning history. First, let the agent react to them. Then ask a critic whether those documents actually satisfy the agent’s current information need.

That difference matters. Most agentic-search systems put most of the intelligence budget into the reasoner. The assumption is familiar: if the model is strong enough, it will ask better questions, search again when needed, and recover from weak evidence. Critic-R challenges that assumption. The bottleneck is not only whether the model can reason. It is whether the retrieval system can provide the right evidence at the exact moment the reasoning process needs it.

The paper’s contribution has three layers. First, Critic-R-Zero adds an inference-time critic loop that judges retrieved evidence and rewrites both query and retrieval instruction when the evidence is inadequate. Second, Critic-Embed turns successful and failed refinement trajectories into contrastive training data for a retriever, without relying on manually labeled gold passages. Third, the full Critic-R system combines the trained retriever with the inference-time critic loop, producing the best average results on the paper’s multi-hop QA suite.

The useful lesson is not “add a critic, enjoy the numbers.” Naturally. That would be too convenient. The useful lesson is that the agent’s dissatisfaction can become an operational signal.

The useful signal is the agent saying what the documents failed to do

Critic-R starts from a small observation: after a model reads retrieved documents, its next reasoning trace often reveals whether the retrieval helped. It may say, in effect, “these passages discuss the film, but they do not identify the director,” or “this document gives the person’s biography but not the university.” That sentence is not just dead text in a chain of thought. It is a diagnosis of retrieval failure.

The framework uses a ReAct-style agent that alternates between thinking, searching, and answering. At each search step, the agent produces a query. The retriever returns the top-$k$ documents. In a conventional setup, those documents are appended to the agent’s history and the agent continues.

Critic-R-Zero inserts a speculative stage before that commitment:

  1. The retrieved documents are shown to the reasoner temporarily.
  2. The reasoner produces an introspective trace about the evidence.
  3. A separate critic model judges whether the evidence is satisfactory for the current sub-query.
  4. If the answer is “yes,” the documents are committed.
  5. If the answer is “no,” the critic writes a better retrieval instruction and a refined query.
  6. The system retries, up to a fixed refinement budget.

The paper sets the maximum refinement budget to $K = 2$, because additional iterations did not yield further improvements in their setup. That is a useful detail. The authors are not proposing infinite self-reflection, the favorite hobby of systems that wish latency did not exist.

A simplified version of the mechanism looks like this:

| Stage | What happens | What changes operationally |
| --- | --- | --- |
| Initial search | Agent emits a sub-query; retriever returns top-$k$ documents | Same as ordinary agentic RAG |
| Speculative reading | Agent reacts to retrieved evidence before documents are committed | The system gets a natural-language signal of fit or mismatch |
| Critic judgment | Critic outputs a binary satisfaction verdict and a diagnostic reason | Retrieval quality becomes inspectable per step |
| Query and instruction rewrite | If unsatisfied, critic rewrites the query and retrieval instruction | The next retrieval attempt is targeted at the missing evidence |
| Commitment | Final accepted documents enter the reasoning history | Bad evidence is less likely to pollute later reasoning |

The separate critic is important. The critic is not merely asking the original reasoner to “try harder.” It is a dedicated evaluation component that sees the original question, the current sub-query, the retrieved documents, and the reasoner’s feedback. It then emits a verdict and, when needed, a reason for failure. When the verdict is negative, it switches to a rewriting role and uses that failure reason to rewrite the retrieval instruction and the query.

This division of labor is the first practical insight. Retrieval repair is easier when the system separates three roles that are often blurred together: the reasoner that knows what it currently needs, the critic that decides whether the evidence satisfies that need, and the retriever that must respond to improved instructions.
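
The three roles can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the callable signatures, the `Verdict` type, and the name `search_with_critic` are all hypothetical, and committing the last attempt once the budget runs out is an assumption of this sketch rather than something the paper is quoted as doing.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    satisfied: bool
    reason: str = ""  # diagnostic reason, used only when satisfied is False


# Hypothetical role signatures, chosen for illustration only.
Retriever = Callable[[str, str], List[str]]             # (query, instruction) -> docs
Reasoner = Callable[[str, List[str]], str]              # (sub_query, docs) -> introspective trace
Critic = Callable[[str, str, List[str], str], Verdict]  # (question, sub_query, docs, trace)
Rewriter = Callable[[str, str, str], Tuple[str, str]]   # (query, instruction, reason) -> new pair


def search_with_critic(
    question: str,
    sub_query: str,
    instruction: str,
    retrieve: Retriever,
    introspect: Reasoner,
    judge: Critic,
    rewrite: Rewriter,
    max_refinements: int = 2,  # the article reports K = 2
) -> Tuple[List[str], List[List[str]]]:
    """Return (committed_docs, rejected_doc_sets) for one search call."""
    query = sub_query
    rejected: List[List[str]] = []
    docs: List[str] = []
    for attempt in range(max_refinements + 1):
        docs = retrieve(query, instruction)
        # Speculative reading: the reasoner reacts before anything is committed.
        trace = introspect(sub_query, docs)
        verdict = judge(question, sub_query, docs, trace)
        if verdict.satisfied:
            return docs, rejected
        if attempt < max_refinements:
            rejected.append(docs)
            query, instruction = rewrite(query, instruction, verdict.reason)
    # Assumption: when the budget is exhausted, commit the last attempt anyway.
    return docs, rejected
```

Keeping the rejected document sets around costs nothing at runtime and is exactly what a later training step would want to reuse.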

Critic-Embed turns temporary repair into permanent retriever training

Inference-time repair helps, but it costs compute. Every failed retrieval can trigger another reasoner pass, another critic judgment, and another retrieval attempt. Useful, yes. Free, no. The bill has a habit of arriving even when the architecture diagram looks elegant.

Critic-R’s second mechanism, Critic-Embed, tries to amortize that cost. The paper uses Critic-R-Zero trajectories to train a better retriever. The trick is that each refinement trajectory produces its own weak supervision:

  • documents accepted by the critic become positives;
  • documents rejected during earlier attempts become hard intra-trajectory negatives;
  • trajectories are retained only when the final answer is correct, improving label quality;
  • positive-only samples are also used when the first retrieval attempt is satisfactory.

The resulting training data is not manually labeled relevance data. It is supervision extracted from the agent’s own search process. The paper reports roughly 11K natural contrastive pairs, where a search call underwent at least one refinement and produced both positives and hard negatives, plus about 67K positive-only samples, where the first attempt was satisfactory.
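
The labeling rules above can be sketched as a small conversion step. The data classes and the function `build_examples` are hypothetical names for illustration. Pairing each positive with the query of the accepted attempt is an assumption of this sketch; the article does not say which query text the paper uses. The cap of three hard negatives follows the figure reported in the paper's training setup.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attempt:
    query: str
    docs: List[str]
    accepted: bool  # critic verdict for this attempt


@dataclass
class SearchCall:
    attempts: List[Attempt]  # in chronological order


@dataclass
class TrainingExample:
    query: str
    positive: str
    hard_negatives: List[str] = field(default_factory=list)


def build_examples(
    calls: List[SearchCall],
    final_answer_correct: bool,
    max_hard_negatives: int = 3,
) -> List[TrainingExample]:
    """Turn one trajectory's search calls into contrastive examples."""
    # Trajectories with a wrong final answer are dropped entirely.
    if not final_answer_correct:
        return []
    examples: List[TrainingExample] = []
    for call in calls:
        accepted = [a for a in call.attempts if a.accepted]
        if not accepted:
            continue  # nothing the critic approved, so no positive
        winner = accepted[-1]
        positives = set(winner.docs)
        negatives: List[str] = []
        for attempt in call.attempts:
            if attempt.accepted:
                continue
            for doc in attempt.docs:
                # Skip documents that were later accepted, and duplicates.
                if doc not in positives and doc not in negatives:
                    negatives.append(doc)
        for doc in winner.docs:
            # An empty negative list corresponds to a positive-only sample.
            examples.append(
                TrainingExample(winner.query, doc, negatives[:max_hard_negatives])
            )
    return examples
```

A search call that was satisfied on its first attempt yields examples with no hard negatives, which matches the positive-only samples described above.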

The retriever is initialized from Stella-400M and fine-tuned using an InfoNCE-style contrastive objective. Conceptually, the loss encourages the query embedding to move closer to accepted documents and farther from rejected documents and in-batch negatives:

$$ L = - \log \frac{\exp(\text{sim}(q_i, z_i^+) / \tau)}{\sum_{z \in Z_i} \exp(\text{sim}(q_i, z) / \tau)} $$

Here, $q_i$ is the query embedding, $z_i^+$ is the positive document, and $Z_i$ contains the positive document plus negatives. The paper uses temperature $\tau = 0.02$, an effective batch size of 128, five epochs, learning rate $2 \times 10^{-5}$, and up to three intra-trajectory hard negatives per query.
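
For readers who prefer code to notation, here is a minimal single-query version of that loss in plain Python. It assumes $\text{sim}$ is cosine similarity, which the article does not state, and the function names are illustrative. A real training run would use a batched tensor implementation; this sketch only shows the arithmetic.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    tau: float = 0.02,  # temperature reported in the article
) -> float:
    """InfoNCE loss for one query: -log softmax of the positive's logit."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, neg) / tau for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]
```

With a temperature as low as 0.02, a similarity gap of 0.1 already becomes a logit gap of 5, so the loss punishes hard negatives that sit close to the positive far more than easy ones.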

The business translation is straightforward: a deployed RAG agent already creates logs of failure, retry, and partial satisfaction. Critic-R shows one way to turn those logs into retriever improvement. The retriever does not need a human to label every gold passage. It needs the system to notice which retrieved documents failed the current reasoning need and which later documents resolved it.

This is not magic. It is bookkeeping with ambition.

The experiments separate repair, distillation, and combination

The paper’s experiments are easier to understand if we classify what each test is trying to prove. Otherwise the tables blur into the usual metric parade, and nobody deserves that.

| Paper component or test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Critic-R-Zero with frozen Stella-400M | Main evidence for inference-time retrieval repair | A critic loop can improve answer accuracy without retriever training | That unlimited critic compute is efficient or always better |
| Critic-Embed versus Stella-400M and Agentic-R | Comparison with prior retriever baselines | Trajectories from the critic loop contain transferable retrieval supervision | That the method dominates all possible retrievers or corpora |
| Full Critic-R | Main evidence for complementarity | A trained retriever and inference-time loop can combine for best average performance | That the combination wins on every dataset |
| Removing introspective feedback | Ablation | The agent’s introspective trace is a key supervisory signal | That all forms of chain-of-thought should be exposed or stored in production |
| General-domain QA appendix | Robustness/sensitivity extension | The critic loop also helps on mostly single-hop QA | That enterprise live search or private document stores will behave the same |

The setup is consistent across the main experiments. The reasoner is Search-R1, using Qwen2.5 variants at 3B, 7B, and 14B scale. The critic is a frozen Qwen2.5-Instruct model at 14B, 32B, or 72B. The retrieval corpus is a December 2018 Wikipedia dump, indexed with a Stella-400M dense retriever. The main benchmarks are HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Evaluation uses Exact Match and token-level F1.

The paper also reports general-domain QA results on Natural Questions, TriviaQA, and PopQA, but the real center of gravity is multi-hop QA. That is where retrieval timing and evidence composition become painful enough for the mechanism to matter.

Inference-time repair works, but bigger critics are not magic wands

The first major result tests Critic-R-Zero with a frozen retriever. This isolates the value of the critic loop itself. No retriever training. No gold passage labels. Just a reasoner, a retriever, and a critic that checks whether the retrieved evidence is good enough.

At top-$k=1$ and $K=2$ refinements, every critic model improves over the no-critic baseline across the reported reasoner and dataset combinations. For the 14B Search-R1 reasoner, the average result rises from 0.3472 EM / 0.4470 F1 with no critic to 0.3903 EM / 0.4855 F1 with a Qwen2.5-72B critic. The paper describes this as a 12.4% relative improvement overall for Critic-R-Zero.
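
As a sanity check, the 14B averages quoted above reproduce that figure on Exact Match. Whether the paper computes its overall number exactly this way is an assumption here:

$$ \frac{0.3903 - 0.3472}{0.3472} = \frac{0.0431}{0.3472} \approx 0.124 $$

The same calculation on F1 gives a smaller relative gain:

$$ \frac{0.4855 - 0.4470}{0.4470} = \frac{0.0385}{0.4470} \approx 0.086 $$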

That result is the cleanest evidence for the paper’s first claim: some retrieval failures can be fixed at inference time by judging and rewriting individual search calls. The agent does not need to be retrained. The retriever does not need to be retrained. The system only needs to stop treating the first returned evidence as sacred.

The more interesting finding is less flattering to brute-force scaling. Larger critics help, but not monotonically. For the 7B reasoner, the 32B critic produces a higher multi-hop average than the 72B critic: 0.3293 EM / 0.4176 F1 versus 0.3192 EM / 0.4108 F1. The 72B critic is strongest for the weaker 3B reasoner, and generally strong for the 14B reasoner, but the paper’s own interpretation is clear: more evaluator compute does not automatically overcome limitations in the reasoner and retriever.

That is a useful warning for enterprise AI teams. A critic loop is not a license to buy your way out of architecture. Past a point, the problem may not be that the critic is too small. The problem may be that the reasoner’s feedback is weak, the corpus lacks the needed evidence, or the retriever’s instruction interface cannot express the refined need well enough.

The appendix general-domain QA results serve as a robustness extension. On NQ, TriviaQA, and PopQA, the same broad pattern holds: adding a critic improves over no critic across reasoner scales, and the 72B critic usually gives the strongest average. This supports the idea that the loop is not limited to multi-hop tasks, although the business relevance remains strongest where retrieval failures compound across steps.

The retriever learned from complaints, not gold labels

The second major result asks whether Critic-R-Zero trajectories contain reusable supervision. This is where Critic-Embed enters.

The paper compares three retrievers under the same Search-R1 reasoner with no inference-time critic loop: the original Stella-400M retriever, the Agentic-R retriever baseline, and Critic-Embed. Retrieval depth varies across $k \in \{1, 3, 5\}$.

Critic-Embed wins in every reported setting.

| Retrieval depth | Stella-400M average EM/F1 | Agentic-R average EM/F1 | Critic-Embed average EM/F1 |
| --- | --- | --- | --- |
| $k=1$ | 0.3472 / 0.4470 | 0.3670 / 0.4564 | 0.3794 / 0.4806 |
| $k=3$ | 0.3996 / 0.4990 | 0.4036 / 0.4972 | 0.4128 / 0.5144 |
| $k=5$ | 0.4149 / 0.5119 | 0.4105 / 0.5104 | 0.4269 / 0.5272 |

The biggest absolute benefit appears when $k=1$, where the retriever has the least room to hide behind recall. On Bamboogle, for example, Critic-Embed reaches 0.4480 EM / 0.5872 F1, compared with 0.3520 / 0.4963 for Stella-400M and 0.4240 / 0.5260 for Agentic-R.

This is the paper’s second central evidence point. The critic loop does not merely patch individual bad searches. Its accepted and rejected retrieval attempts can be distilled into a retriever that performs better even when the loop is turned off.

For businesses, this matters because inference-time repair and model improvement have different economics. Repair spends compute per query. Retriever fine-tuning spends compute during training and reduces future friction. Critic-R’s structure suggests a practical lifecycle:

| Phase | Operational action | Expected value |
| --- | --- | --- |
| Deploy with critic loop | Catch bad retrieval calls at runtime | Better answers, richer diagnostics |
| Log accepted and rejected retrievals | Preserve query, instruction, documents, and verdict | Converts failures into training material |
| Fine-tune or adapt retriever | Use positives and hard negatives from trajectories | Lower future retrieval failure rate |
| Keep critic loop for hard cases | Use inference-time repair selectively | Avoid making every query pay full critic cost |

That lifecycle is more credible than pretending a single embedding model will remain optimal forever. Documents change. User questions change. Business language changes. The retriever should learn from the agent’s scars. Elegantly, if possible. Begrudgingly, if necessary.

The full system wins on average, with useful per-dataset messiness

The full Critic-R system combines Critic-Embed with the Critic-R-Zero loop. In the main comparison at top-$k=1$, all configurations use the same Search-R1 14B reasoner, and the critic loop uses Qwen2.5-72B with $K=2$ refinement attempts.

| Method | Average EM | Average F1 | Interpretation |
| --- | --- | --- | --- |
| Search-R1 with static Stella-400M | 0.3472 | 0.4470 | Baseline agentic search with frozen retriever |
| Critic-Embed, no loop | 0.3794 | 0.4806 | Training trajectories improve the retriever |
| Critic-R-Zero on Stella-400M | 0.3903 | 0.4855 | Runtime repair improves retrieval calls |
| Full Critic-R | 0.3957 | 0.4959 | Trained retriever plus runtime repair gives best average |

This supports the complementarity claim. Critic-Embed and Critic-R-Zero each close part of the retrieval gap. Combining them gives the best average result.

But the per-dataset pattern is not perfectly additive. Full Critic-R wins on 2Wiki and Bamboogle, including a strong Bamboogle gain: 0.4800 EM / 0.6200 F1, compared with 0.4480 / 0.5627 for Critic-R-Zero. However, Critic-R-Zero alone performs better on HotpotQA and MuSiQue.

That messiness is not a defect in the article’s story. It is the story behaving like real evidence. The trained retriever and the inference-time loop are not guaranteed to help the same examples. Sometimes a stronger retriever reduces the need for repair. Sometimes the loop rescues cases the retriever still misses. Sometimes the retriever’s new ranking shifts the evidence mix in ways that help one dataset and hurt another.

The correct conclusion is not that full Critic-R dominates every setting. The correct conclusion is that the two mechanisms repair different slices of retrieval failure, and the average gain comes from partial overlap rather than perfect synergy.

The ablation shows introspection is the payload

The most revealing experiment is the ablation that removes the agent’s introspective feedback from trajectory collection. The critic still sees the global question, generated query, and retrieved documents. What it no longer sees is the reasoner’s own trace about what the retrieved evidence failed to provide.

This is not a robustness test. It is an ablation of the paper’s core mechanism.

The result is clear: removing introspective feedback degrades the trained retriever at every retrieval depth.

| Retrieval depth | Critic-Embed average EM/F1 | Without introspective feedback | Drop |
| --- | --- | --- | --- |
| $k=1$ | 0.3794 / 0.4806 | 0.3614 / 0.4521 | -0.0180 EM / -0.0285 F1 |
| $k=3$ | 0.4128 / 0.5144 | 0.3903 / 0.4887 | -0.0225 EM / -0.0257 F1 |
| $k=5$ | 0.4269 / 0.5272 | 0.3971 / 0.4990 | -0.0298 EM / -0.0282 F1 |

This matters because it identifies where the signal actually lives. The critic is not simply a generic document relevance judge. It is useful because it judges retrieval evidence against the reasoner’s current information need, as expressed after reading the retrieved documents.

That distinction is crucial for enterprise RAG. Many evaluation systems ask, “Is this document relevant to the query?” Critic-R asks a more operational question: “Does this evidence let the agent continue this specific reasoning step?”

A document can be topically relevant and still operationally useless. Enterprise systems encounter this constantly: a policy document that mentions the regulation but not the exception; a contract that names the vendor but not the renewal clause; a support article that explains the product but not the failure mode. The critic’s value is in detecting that gap.

What Cognaptus infers for business use

The paper directly shows improvements on QA benchmarks over a static Wikipedia corpus. It does not directly show enterprise deployment performance. Still, the mechanism suggests a useful design pattern for business AI systems.

First, treat retrieval dissatisfaction as data. Most RAG logs contain questions, retrieved chunks, and final answers. That is not enough. A Critic-R-style system would log the local sub-query, retrieval instruction, returned documents, reasoner feedback, critic verdict, failure reason, and rewritten query. This turns “the answer was bad” into “the retrieval missed the supplier termination clause after the agent asked for post-renewal obligations.”
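
One way to make that concrete is a structured event per retrieval attempt. The sketch below is an assumption about what such a record could look like, not a schema from the paper. The class name `RetrievalEvent`, its field names, and the JSON-lines format are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RetrievalEvent:
    """One retrieval attempt, with the fields listed in the paragraph above."""
    sub_query: str
    instruction: str
    documents: List[str]          # document IDs or text, whichever the store allows
    reasoner_feedback: str        # the introspective trace
    satisfied: bool               # critic verdict
    failure_reason: Optional[str] = None
    rewritten_query: Optional[str] = None


def to_log_line(event: RetrievalEvent) -> str:
    """Serialize one event as a single JSON line for an append-only log."""
    return json.dumps(asdict(event), ensure_ascii=False, sort_keys=True)
```

One line per attempt keeps the log easy to append to and easy to filter later, for example by selecting every event where `satisfied` is false and a rewritten query exists.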

Second, separate runtime repair from retriever improvement. Runtime repair is expensive but immediately useful. Retriever improvement is slower but amortizes cost. A mature deployment should not use the same critic budget for every query forever. It should use critic traces to identify recurring retrieval gaps, then update the retrieval layer.

Third, evaluate retrieval at the step level. Final answer accuracy is too coarse for debugging agentic workflows. If an AI analyst answers a question after five searches, the business needs to know which search failed, why it failed, and whether the failure came from query formulation, retrieval ranking, corpus coverage, or reasoning over adequate evidence.
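
A step-level summary can be very small. The sketch below assumes each logged step is a dictionary with hypothetical keys `step`, `satisfied`, and `cause`, where `cause` is a label such as `query_formulation`, `retrieval_ranking`, `corpus_coverage`, or `reasoning`. How those labels get assigned, whether by a critic or by a human reviewer, is outside this sketch.

```python
from collections import Counter
from typing import Dict, List, Optional


def summarize_steps(steps: List[Dict]) -> Dict:
    """Summarize which search steps failed in one trajectory, and why."""
    failed = [s for s in steps if not s["satisfied"]]
    first_failed: Optional[int] = failed[0]["step"] if failed else None
    # Steps without a cause label are counted under "unlabeled".
    by_cause = Counter(s.get("cause") or "unlabeled" for s in failed)
    return {
        "first_failed_step": first_failed,
        "failed_fraction": len(failed) / len(steps) if steps else 0.0,
        "by_cause": dict(by_cause),
    }
```

Aggregated over many trajectories, the `by_cause` counts are what tell a team whether to spend the next sprint on the corpus, the retriever, or the prompts.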

Fourth, do not assume stronger reasoning models solve retrieval. Critic-R’s evidence supports the opposite reading: retrieval remains an independent bottleneck even when the reasoner is a search-trained model. Better reasoning helps, but when the evidence is wrong, the model may simply reason more fluently around the wrong evidence. A polished hallucination is still a hallucination. It just wears a tie.

A practical enterprise adaptation might look like this:

| Business layer | Critic-R analogue | Implementation question |
| --- | --- | --- |
| AI research assistant | Critic-R-Zero loop | Should the system retry retrieval before drafting an answer? |
| Knowledge-base search | Critic-Embed | Can failed and accepted retrieval traces fine-tune the retriever? |
| Compliance or legal QA | Satisfaction judgment | Can the critic identify missing evidence before the agent cites a policy? |
| Customer support automation | Query/instruction rewrite | Can the system search for the specific failure mode rather than the broad product name? |
| Analytics governance | Trajectory logging | Can teams audit where the answer became unsupported? |

The business value is not just higher benchmark F1. It is cheaper diagnosis. When retrieval failures become structured events, teams can prioritize corpus cleanup, retriever tuning, prompt changes, and workflow redesign with less guesswork.

Boundaries before procurement gets too excited

The paper’s limitations are not decorative. They materially affect deployment interpretation.

First, Critic-R depends on the reasoner’s ability to verbalize what is missing or misaligned in retrieved evidence. The paper uses Search-R1-style reasoning agents, which are explicitly prompted to provide such feedback. Weaker models, terse agents, or production systems that hide intermediate reasoning may not produce a clean signal. In those cases, the critic may judge weaker feedback and produce noisier supervision.

Second, the experiments use a static Wikipedia corpus. Enterprise document stores are uglier: duplicated PDFs, stale policy versions, scanned attachments, permission boundaries, inconsistent terminology, and live updates. A method that works on a clean benchmark corpus may still need substantial engineering to handle document freshness, access control, chunk quality, and source reliability.

Third, the paper evaluates QA tasks, not end-to-end business workflows. Exact Match and token-level F1 are useful for controlled comparison, but they do not measure latency, cost per resolved query, user trust, citation quality, escalation reduction, or compliance risk. Those are the metrics a deployment team eventually has to face, preferably before the CFO does.

Fourth, Bamboogle has only 125 test examples in the reported dataset statistics. Its results are interesting, especially because Critic-R performs strongly there, but small benchmark slices should not carry too much strategic weight.

Fifth, the critic loop adds inference-time cost. The paper reports A100 GPU usage and evaluates critics up to Qwen2.5-72B. That is acceptable for research and possibly for high-value enterprise analysis. It is less attractive for every low-stakes support query. The natural deployment pattern is selective critic use: activate the loop for high-uncertainty, high-value, or high-risk retrieval steps, and use accumulated traces to improve the retriever over time.
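
Selective use can start as a plain gating rule. Everything in the sketch below is hypothetical: the paper does not propose a gate, and the three scores, their 0-to-1 scale, and the default thresholds are placeholders that a team would have to define and calibrate for its own workload.

```python
from dataclasses import dataclass


@dataclass
class StepContext:
    # All three scores are hypothetical inputs on a 0-to-1 scale.
    retrieval_confidence: float  # e.g. derived from retriever scores
    business_value: float        # how much a correct answer is worth
    risk: float                  # cost of acting on wrong evidence


def should_run_critic(
    ctx: StepContext,
    confidence_floor: float = 0.6,
    value_threshold: float = 0.8,
    risk_threshold: float = 0.7,
) -> bool:
    """Run the critic loop only for uncertain, valuable, or risky steps."""
    return (
        ctx.retrieval_confidence < confidence_floor
        or ctx.business_value >= value_threshold
        or ctx.risk >= risk_threshold
    )
```

A gate like this makes the cost trade-off explicit and auditable, which is easier to defend than a critic that runs on every query by default.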

The durable idea is not criticism; it is reusable dissatisfaction

Critic-R is best read as a retrieval-feedback architecture, not just a critic architecture. The critic matters because it turns the agent’s post-retrieval dissatisfaction into two things: an immediate repair action and a future training signal.

That is why the mechanism-first reading is more useful than a benchmark-first summary. The benchmark gains are real, but the transferable idea is the loop:

  1. Let the agent inspect evidence before committing it.
  2. Ask a critic whether that evidence satisfies the current reasoning need.
  3. Rewrite the query and instruction when it does not.
  4. Store accepted and rejected retrievals as contrastive supervision.
  5. Train the retriever so fewer future queries need repair.

For enterprise AI, this suggests a shift in how RAG systems should be monitored. Do not merely ask whether the final answer was good. Ask whether each retrieval step gave the agent what it needed at that moment. The answer to that question is where the operational leverage lives.

The old RAG pipeline treated retrieval as a prelude to generation. Agentic search made retrieval iterative. Critic-R makes retrieval inspectable, repairable, and teachable.

That is a more useful kind of intelligence: not a model that never complains, but a system that knows how to make its complaints productive.

Cognaptus: Automate the Present, Incubate the Future.


  1. Md Zarif Ul Alam, Alireza Salemi, and Hamed Zamani, “Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback,” arXiv:2606.00590, 2026, https://arxiv.org/abs/2606.00590