Opening — Why this matters now
There’s a quiet bottleneck in agentic AI that most demos conveniently ignore: reward design.
Search agents—those increasingly fashionable LLM-powered systems that browse, retrieve, and reason—are trained like obedient students. They are rewarded when they produce the correct answer. The catch? Someone needs to define that answer in advance.
In a world where information evolves faster than annotation pipelines, this is not just inefficient—it’s structurally limiting.
The paper introduces a more unsettling idea: What if we don’t need the answer at all?
Background — Context and prior art
Search agents have evolved from static retrieval pipelines into interactive systems. Instead of retrieving once and responding, they:
- Generate queries
- Inspect results
- Iterate
- Synthesize an answer
Frameworks like ReAct and IRCoT reframed retrieval as a sequential decision process, making reinforcement learning (RL) the natural optimization tool.
But RL needs rewards. And rewards, historically, require:
| Approach | Reward Source | Limitation |
|---|---|---|
| Supervised RL (e.g., Search-R1) | Ground-truth answers | Expensive, unscalable |
| LLM judges (Constitutional AI) | Rubric-based scoring | Subjective, indirect |
| Self-confidence (RLIF) | Internal model signals | Misaligned with retrieval quality |
| Agreement-based (TTRL) | Multiple rollouts | Computationally heavy |
The pattern is clear: remove human labels, and you lose alignment with the actual objective—finding the right information.
Analysis — What the paper does
The Core Idea: Search as Information Encoding
The paper introduces Cycle-Consistent Search (CCS), built on a deceptively simple hypothesis:
A good search trajectory contains enough information to reconstruct the original question.
In other words, a search process is not just a means—it’s a lossless encoding of intent.
This reframes training entirely:
- Instead of asking: Did you get the right answer?
- Ask: Does your search prove you understood the question?
The Cycle
The system creates a loop:
- Question → generate search trajectory
- Trajectory → reconstruct question
- Compare reconstructed vs original
- Use similarity as reward
If the trajectory is shallow, irrelevant, or incomplete, reconstruction fails—and so does the reward.
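The reward step of that loop can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `embed` here is a deliberately crude bag-of-characters encoder standing in for whatever sentence encoder the authors use, and `cycle_reward` just takes the cosine similarity between the original and reconstructed questions.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: bag-of-characters, a stand-in for a real sentence encoder.
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def cycle_reward(question: str, reconstructed: str) -> float:
    """Cosine similarity between the original and reconstructed question."""
    return float(embed(question) @ embed(reconstructed))
```

A perfect reconstruction scores near 1.0; an unrelated one scores near 0 — the gradient signal the RL loop needs, with no gold answer in sight.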
The Real Problem: Cheating
Predictably, the model tries to cheat.
If the search query simply repeats the question (“What is the population of Chicago?”), reconstruction becomes trivial—no real search needed.
The authors introduce information bottlenecks to prevent this:
| Bottleneck | Purpose |
|---|---|
| Remove final answer | Prevent paraphrasing shortcuts |
| Mask named entities | Force reliance on retrieved evidence |
This is where the paper becomes quietly clever.
Instead of rewarding language similarity, it rewards information sufficiency under constraint.
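The two bottlenecks are easy to picture as a preprocessing pass over the trajectory before it reaches the reconstruction model. The sketch below is illustrative only: the turn schema is invented, and capitalized-word matching is a crude proxy for the real named-entity masking.

```python
import re

def apply_bottleneck(trajectory: list[dict]) -> list[dict]:
    """Strip the final answer and mask entity-like tokens.

    Each turn is a dict like {"role": ..., "text": ...} (a hypothetical
    schema). Capitalized words are masked as a rough stand-in for NER.
    """
    # Bottleneck 1: drop the final answer so paraphrasing it is impossible.
    filtered = [t for t in trajectory if t["role"] != "final_answer"]
    # Bottleneck 2: mask entity-like tokens to force reliance on evidence.
    masked = []
    for turn in filtered:
        text = re.sub(r"\b[A-Z][a-z]+\b", "[MASK]", turn["text"])
        masked.append({**turn, "text": text})
    return masked
```

With "Chicago" masked out of the query, the reconstructor can only recover the question if the retrieved observations actually contain the relevant evidence.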
Optimization: RL Without Answers
The reward is computed as semantic similarity between:
- The original question
- The reconstructed question
And optimized via Group Relative Policy Optimization (GRPO)—a variant that compares trajectories within a sampled group rather than relying on an external critic.
This shifts evaluation from absolute correctness to relative informational quality.
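The group-relative part of GRPO reduces to a simple normalization: each trajectory's reward is scored against the other rollouts sampled for the same question, so no learned value critic is needed. A minimal sketch of that advantage computation:

```python
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """Normalize each rollout's reward against its own sampled group.

    Trajectories above the group mean get positive advantage, those
    below get negative — relative quality, not absolute correctness.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

A trajectory only has to reconstruct the question *better than its siblings* to be reinforced, which is exactly the shift from absolute to relative informational quality described above.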
Findings — Results with visualization
The results are, predictably, inconvenient for traditional assumptions.
Performance Comparison
| Model | Best Gold-Free Method | CCS Performance | Gain |
|---|---|---|---|
| Qwen2.5-7B | CJ (0.580) | 0.606 | +4.5% |
| Qwen3-4B | RLIF (~0.574) | 0.636 | +9.8% |
| Qwen3-32B | CJ (~0.624) | 0.662 | +6.1% |
More interestingly:
| Setting | Observation |
|---|---|
| Compared to supervised RL | CCS matches or exceeds performance |
| Multi-hop QA | Strong gains (structure matters) |
| Open-ended research tasks | CCS outperforms even gold-trained agents |
Ablation Insight
| Variant | Avg Score | Interpretation |
|---|---|---|
| With final response | 0.561 | Leakage hurts learning |
| Actions only | 0.545 | Weak signal |
| Observations only | 0.584 | Missing structure |
| Masked actions + observations (CCS) | 0.606 | Best balance |
The takeaway is subtle but critical:
Structure matters as much as data.
Masked actions preserve intent scaffolding, while observations provide evidence grounding.
Remove either, and the system becomes either blind or superficial.
Implications — Next steps and significance
1. A Shift in Reward Design Philosophy
CCS replaces external truth with internal consistency.
This is not just a technical tweak—it’s a philosophical shift:
- From correctness → reconstructability
- From labels → structure
- From answers → processes
2. Scalability Without Annotation
For businesses, this matters immediately:
| Traditional Approach | CCS Approach |
|---|---|
| Requires labeled datasets | Works on raw queries |
| Expensive domain adaptation | Self-improving via interaction |
| Static evaluation | Dynamic, trajectory-based |
This is particularly relevant in:
- Financial research (where ground truth is ambiguous)
- Legal discovery (where questions evolve)
- Enterprise search (where data is proprietary)
3. Better Alignment for Agentic Systems
Most current agents optimize for output quality, not process quality.
CCS implicitly enforces:
- Multi-step reasoning
- Evidence sufficiency
- Structural completeness
In other words, it trains agents to think like investigators, not just answer generators.
4. The Hidden Cost
Of course, nothing comes for free.
CCS introduces:
- A reconstruction model (extra compute)
- Sensitivity to embedding quality
- Dependence on search environment richness
But compared to human annotation pipelines, this is a rounding error.
Conclusion — Wrap-up
Cycle-Consistent Search does something rare in AI research: it removes a dependency without degrading performance—and occasionally improves it.
It suggests a broader direction for agent training:
Systems may not need to know the answer—only whether their reasoning preserves the question.
That’s a subtle distinction. But in AI, subtle distinctions tend to become entire industries.
Cognaptus: Automate the Present, Incubate the Future.