Charting a Better Bedside: When Agentic RL Teaches RAG to Diagnose

TL;DR for operators

Diagnosis is not a search-box problem. A clinician does not simply type a symptom list, read a guideline, and pick a disease like ordering takeaway. The useful work is iterative: form a hypothesis, compare against similar cases, notice what does not fit, retrieve again, ignore plausible-looking rubbish, and only then commit.

The paper behind Deep-DxSearch makes exactly that operational point. It introduces an agentic diagnostic RAG system where the language model is not merely given retrieval tools; it is trained with reinforcement learning to decide when and how to use them.¹ The agent can choose among five actions: reason internally, look up disease guidelines, match similar patient records, search broader biomedical knowledge, and diagnose.

The important business takeaway is not “medical RAG works if the corpus is large”. That is the easy, slightly lazy reading. Deep-DxSearch does build a large corpus: 16,371 disease profiles, 177,029 patient records, 3.31 million Wikipedia documents, 23.9 million PubMed articles, and 125,847 textbook segments. But the paper’s real claim is that retrieval behaviour itself should be trained as a policy. In other words, the model learns the diagnostic choreography instead of having it stapled on through prompt engineering. Delightfully unfashionable: the workflow matters.

The experimental evidence is broad by benchmark standards. Against vanilla and training-free RAG versions of Qwen2.5 and Llama backbones, Deep-DxSearch improves performance across in-distribution and out-of-distribution common and rare disease datasets. Against medical-specific baselines, the Qwen2.5-14B Deep-DxSearch system reaches 51.32% top-1 and 68.53% top-5 accuracy on the out-of-distribution Mendeley common-disease dataset, compared with 41.20% and 56.02% for MedRAG. On the Xinhua in-house rare-disease dataset, it reaches 45.16% top-1 and 61.82% top-5, narrowly beating MAC on top-1 and more clearly beating MedRAG on top-5.

For operators, the practical interpretation is this: the next useful clinical AI system may not be the biggest model sitting on top of a vector database. It may be the system whose retrieval actions, evidence filters, and diagnostic stopping behaviour are trained directly against outcomes. That is a better product thesis. It is not yet a deployment licence.

The bedside problem is not missing facts; it is knowing which facts to chase

Most RAG systems are built around a polite fiction: the user asks, retrieval fetches, the model answers. That fiction works tolerably well when the task is a policy FAQ, a product manual, or a contract clause. It is less convincing when the question is clinical diagnosis.

A patient presentation is usually messy. Symptoms are incomplete. Important findings may be absent rather than present. One disease can mimic another. The nearest retrieved case may be near for the wrong reason. A guideline may describe the canonical disease while the patient is busy refusing to be canonical, because biology is rude like that.

Deep-DxSearch starts from this harder version of the problem. The authors argue that existing agentic RAG systems for diagnosis remain fragile because they are often inference-only. They can call tools, but they are not jointly trained to decide when to call them, what to ask, when to reformulate, when retrieved evidence is misleading, and when enough evidence has accumulated to diagnose.

That distinction sounds small until it is placed in a clinical workflow. “Use retrieval” is not a capability. It is a permission slip. The capability is selecting the right retrieval action at the right moment, with the right query, after the previous evidence has changed the diagnostic state.

Deep-DxSearch formalizes diagnosis as a sequential interaction between an LLM agent and a retrieval environment. The agent receives a clinical presentation and produces a trajectory of actions and feedback. The action set is deliberately clinical:

Action	What it does	Operational role
`<reason>`	Integrates current evidence and updates hypotheses	Keeps diagnosis from becoming blind search
`<lookup>`	Queries disease-specific guideline information	Checks candidate diseases against structured medical knowledge
`<match>`	Retrieves similar patient records from symptom or phenotype queries	Brings case-based reasoning into the loop
`<search>`	Searches broader biomedical documents	Expands beyond predefined guideline and case databases
`<diagnose>`	Commits to the final diagnostic list	Stops the evidence loop and produces the answer

This is the mechanism-first part of the paper. The system is not just a base model plus a corpus. It is a trained controller over a diagnostic evidence environment.
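
To make the idea of a controller over an evidence environment concrete, here is a minimal sketch of the interaction loop. It assumes a hypothetical `generate` callable wrapping the policy model and a hypothetical `tools` mapping for the three retrieval environments. The five tag names come from the table above; the closing-tag syntax, the `<feedback>` wrapper, and the step limit are illustrative assumptions, not the paper's implementation.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Tag names mirror the action table; closing tags and <feedback> are assumed.
ACTION_RE = re.compile(
    r"<(reason|lookup|match|search|diagnose)>(.*?)</\1>", re.DOTALL
)
RETRIEVAL_ACTIONS = {"lookup", "match", "search"}


def last_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (action, payload) for the last complete tagged action in text."""
    matches = ACTION_RE.findall(text)
    if not matches:
        return None
    name, payload = matches[-1]
    return name, payload.strip()


def rollout(
    presentation: str,
    generate: Callable[[str], str],          # hypothetical policy wrapper
    tools: Dict[str, Callable[[str], str]],  # hypothetical retrieval backends
    max_steps: int = 8,
) -> Tuple[str, Optional[str]]:
    """Interleave generation with tool feedback until the agent diagnoses."""
    trajectory = presentation + "\n"
    for _ in range(max_steps):
        step = generate(trajectory)
        trajectory += step + "\n"
        action = last_action(step)
        if action is None:
            break  # malformed step: stop rather than guess
        name, payload = action
        if name == "diagnose":
            return trajectory, payload
        if name in RETRIEVAL_ACTIONS:
            # Pause generation, query the environment, append the evidence.
            trajectory += f"<feedback>{tools[name](payload)}</feedback>\n"
    return trajectory, None


# Usage with a scripted stand-in for the model (toy data, not from the paper).
script = iter([
    "<reason>Fever with rash; consider viral exanthem.</reason>",
    "<match>fever, maculopapular rash</match>",
    "<diagnose>measles</diagnose>",
])
toy_tools = {
    "lookup": lambda q: f"guideline entry for {q}",
    "match": lambda q: f"similar case for [{q}]: measles",
    "search": lambda q: f"documents about {q}",
}
trace, diagnosis = rollout(
    "Child with fever and rash.", lambda _: next(script), toy_tools
)
print(diagnosis)
```

The point of the sketch is the control flow: the model decides which tag to emit, and the environment only responds. Nothing in the loop forces a retrieval call, which is why the policy has to be trained rather than prompted into good habits.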

The corpus matters, but it is the stage, not the actor

Deep-DxSearch’s corpus is large enough to attract attention on its own. It includes three main resource types.

First, there is a disease information guideline covering 16,371 diseases: 12,088 common diseases and 4,283 rare diseases. The authors report 257,022 disease–phenotype or disease–symptom pairs, mapped to ICD, ORPHA, and HPO terminology. Common disease entries are supported by multiple sources on average, while rare-disease annotations come from Orphanet.

Second, there is a patient record database of 177,029 curated records, including validated diagnoses, clinical presentations, medication histories, and chief complaints. This matters because similar-patient matching is often more diagnostically useful than abstract disease description. Guidelines describe tidy textbook disease; patient records preserve clinical mess. The mess is where quite a lot of medicine lives.

Third, there is a broad clinical knowledge collection: 3.31 million Wikipedia documents, 23.9 million PubMed articles, and 125,847 segments from 18 medical textbooks. Because this material is too long to stuff into model context windows like an intern carrying every textbook at once, the system uses summarization to distil retrieved documents.

The evaluation dataset is also multi-source. The authors curate 24,142 clinical cases from MIMIC, PMC-Patients, MedDialog, RareArena, RareBench, Mendeley, and Xinhua Hospital. Common-disease cases make up 73.1% of this dataset; rare-disease cases make up 26.9%. The first five datasets are split for in-distribution training and evaluation, while Mendeley and Xinhua are reserved for out-of-distribution testing.
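
As a small illustration of that split, the sketch below partitions cases by their source dataset. The source names are the ones listed above; the `source` field and the dictionary-based case records are hypothetical, and the further train/evaluation division inside the in-distribution pool is not modelled because the article does not give its ratio.

```python
from typing import Dict, Iterable, List

# Dataset names as listed in the text; the record layout is an assumption.
ID_SOURCES = {"MIMIC", "PMC-Patients", "MedDialog", "RareArena", "RareBench"}
OOD_SOURCES = {"Mendeley", "Xinhua"}


def partition_by_source(
    cases: Iterable[Dict[str, str]],
) -> Dict[str, List[Dict[str, str]]]:
    """Route each case to the in-distribution or out-of-distribution pool."""
    buckets: Dict[str, List[Dict[str, str]]] = {
        "in_distribution": [],
        "out_of_distribution": [],
    }
    for case in cases:
        source = case["source"]
        if source in ID_SOURCES:
            buckets["in_distribution"].append(case)
        elif source in OOD_SOURCES:
            buckets["out_of_distribution"].append(case)
        else:
            raise ValueError(f"unknown source dataset: {source}")
    return buckets


# Toy records for illustration only.
toy_cases = [
    {"source": "MIMIC", "text": "chest pain"},
    {"source": "Xinhua", "text": "developmental delay"},
    {"source": "RareBench", "text": "ataxia"},
]
pools = partition_by_source(toy_cases)
print({name: len(items) for name, items in pools.items()})
```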

That breadth is useful, but it is not the entire story. A weak article about this paper would say: bigger corpus, better diagnosis. There, done, very modern, please invoice the innovation department.

The better reading is: a large heterogeneous corpus creates the environment in which a retrieval policy can learn. Without the corpus, the agent lacks evidence. Without policy training, the corpus is only a bigger haystack, now with more medically plausible needles and several shiny bits of irrelevant metal.

The reward design trains clinical search behaviour, not just final answers

The authors train Deep-DxSearch with reinforcement learning rather than supervised fine-tuning. More specifically, they use GRPO-style training and interleaved agent-environment rollouts. During generation, when the model emits retrieval tags such as `<lookup>`, `<match>`, or `<search>`, the system pauses, queries the relevant environment, appends the returned evidence, and continues generation.
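
For readers unfamiliar with GRPO, its core idea is to score each rollout relative to the other rollouts sampled for the same case, rather than against a learned value function. The sketch below shows the standard group-normalized advantage. The article only says "GRPO-style", so the authors' exact variant may differ, and the `eps` term is an assumption added for numerical safety.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group.

    rewards: total rewards of several rollouts for the same clinical case.
    Returns (r_i - group mean) / (group std + eps) for each rollout.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one case: two failed, two succeeded.
advantages = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
print([round(a, 3) for a in advantages])
```

Rollouts that beat their group get a positive advantage and are reinforced; rollouts that lag are pushed down. If every rollout in a group earns the same reward, all advantages are zero and that case contributes no learning signal.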

The reward design is the quiet centre of the paper. It tries to avoid a familiar failure mode: train only on the final answer, and the model learns to reach for the label while its intermediate evidence-gathering behaviour remains brittle or theatrical. A diagnostic system that says the right thing for the wrong trace is not reassuring. In medicine, vibes are not an audit trail.

Deep-DxSearch uses several reward components:

Reward component	What it supervises	Why it matters
Format coefficient	Whether the output follows the required tagged structure	Prevents invalid trajectories from being rewarded
Patient matching reward	Whether similar-patient retrieval surfaces cases sharing the ground-truth diagnosis, while penalizing excessive or redundant matching	Trains query refinement and case-based evidence gathering
Search reward	Whether diseases named in search actions align with the ground truth at token level	Encourages useful biomedical search targets
Diagnosis reward	Whether the final diagnosis matches the ground truth, adjusted by matching behaviour	Links final answer quality to evidence-gathering discipline
Weighted clipped total reward	Combines the above into a bounded reward	Keeps training stable and interpretable
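
A minimal sketch of how such components could combine is shown below. The weights, the clipping bounds, and the choice to treat the format coefficient as a 0/1 multiplier are all hypothetical, since the article does not give the paper's formula. The sketch also leaves out the adjustment of the diagnosis reward by matching behaviour.

```python
from typing import Tuple


def total_reward(
    format_ok: bool,
    r_match: float,
    r_search: float,
    r_diagnosis: float,
    weights: Tuple[float, float, float] = (0.3, 0.2, 0.5),  # hypothetical
    low: float = 0.0,   # hypothetical lower bound
    high: float = 1.0,  # hypothetical upper bound
) -> float:
    """Combine component rewards into one bounded scalar.

    An invalid trajectory gets a format coefficient of zero, so nothing
    it retrieved or diagnosed can be rewarded.
    """
    format_coefficient = 1.0 if format_ok else 0.0
    w_match, w_search, w_diag = weights
    weighted = w_match * r_match + w_search * r_search + w_diag * r_diagnosis
    clipped = max(low, min(high, weighted))
    return format_coefficient * clipped


# A well-formed rollout with a useful match but a wrong final diagnosis.
print(total_reward(True, r_match=1.0, r_search=0.0, r_diagnosis=0.0))
# The same component scores with a malformed trajectory earn nothing.
print(total_reward(False, r_match=1.0, r_search=0.0, r_diagnosis=0.0))
```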

The patient matching reward is especially interesting. It grants reward if retrieved references include the ground-truth disease, but it penalizes repeated matching operations and requires diversity between consecutive phenotype sets. This is not merely “retrieve more”. It is “retrieve differently when the previous query was insufficient”. That is much closer to diagnostic reasoning than most chatbot-style RAG.
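
The shape of that incentive can be sketched as follows. This is an illustration of the three ingredients named above (a hit on the ground-truth disease, a cost for excessive matching, and a diversity requirement between consecutive phenotype sets), not the paper's formula. The Jaccard overlap measure, the thresholds, and the penalty size are all assumptions.

```python
from typing import List, NamedTuple, Set


class MatchCall(NamedTuple):
    phenotypes: Set[str]             # phenotype set used as the query
    retrieved_diagnoses: List[str]   # diagnoses of the returned cases


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two phenotype sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def matching_reward(
    calls: List[MatchCall],
    truth: str,
    max_calls: int = 3,        # hypothetical budget before penalties start
    penalty: float = 0.1,      # hypothetical cost per violation
    max_overlap: float = 0.8,  # hypothetical redundancy threshold
) -> float:
    if not calls:
        return 0.0
    # Reward only if some retrieved case shares the ground-truth diagnosis.
    reward = 1.0 if any(truth in c.retrieved_diagnoses for c in calls) else 0.0
    # Penalize match operations beyond the budget.
    reward -= penalty * max(0, len(calls) - max_calls)
    # Penalize consecutive queries that barely change the phenotype set.
    for prev, cur in zip(calls, calls[1:]):
        if jaccard(prev.phenotypes, cur.phenotypes) > max_overlap:
            reward -= penalty
    return max(0.0, reward)


refined = [
    MatchCall({"fever", "rash"}, ["rubella"]),
    MatchCall({"fever", "rash", "koplik spots"}, ["measles"]),
]
repeated = [
    MatchCall({"fever", "rash"}, ["measles"]),
    MatchCall({"fever", "rash"}, ["measles"]),
]
print(matching_reward(refined, "measles"), matching_reward(repeated, "measles"))
```

Both toy trajectories find the right disease, but the one that simply repeats its query scores lower than the one that refines it. That asymmetry is the behaviour the reward is meant to teach.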

The authors also use staged reward adaptation. Instead of optimizing all reward components equally from the beginning, they emphasize one reward at a time across training stages, then combine them. They report that emphasizing patient matching in the second stage improved the final answer score more than focusing directly on the answer reward in the third stage.

That finding is easy to underread. It suggests that better diagnosis may come less from yelling “be correct” at the model and more from shaping the intermediate evidence process. Obvious to clinicians, apparently still news to software.

Main evidence: RL changes how well the system uses the same tools

The first major experiment compares three system designs across Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-14B backbones:

vanilla direct diagnosis;
training-free RAG using the corpus;
Deep-DxSearch with agentic RL.

This is the cleanest evidence for the mechanism. The same general setup becomes much stronger when the retrieval policy is trained.

For Qwen2.5-14B in the in-distribution setting, training-free RAG improves over vanilla direct inference, but Deep-DxSearch improves far more. On MedDialog, top-1 accuracy rises from 17.87% for vanilla to 24.69% for RAG, then to 48.81% for Deep-DxSearch. On RareBench, it rises from 18.07% to 34.70% to 70.48%.

The result is not that RAG is useless. The result is sharper: retrieval without learned orchestration leaves a lot of performance on the table.

The out-of-distribution results tell a similar story. Using Qwen2.5-14B:

Dataset	Vanilla Acc@1 / Acc@5	Training-free RAG Acc@1 / Acc@5	Deep-DxSearch Acc@1 / Acc@5	Likely purpose
Mendeley common disease	22.22 / 34.61	26.59 / 34.01	31.09 / 42.70	Robustness/generalization test against unseen common-disease distribution
Xinhua rare disease	20.01 / 27.20	27.62 / 36.85	35.13 / 45.77	Robustness/generalization test against unseen in-house rare-disease distribution

The absolute numbers are not magical. A top-1 accuracy of 35.13% on rare disease is not something a hospital should blindly deploy and call transformation. But the direction is meaningful. The trained policy beats both direct inference and prompt-level RAG across these unseen settings.
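
For readers who want the Acc@1 and Acc@5 columns pinned down, the sketch below computes top-k accuracy over ranked diagnosis lists. Case-insensitive exact string matching is an assumption made for illustration. The article does not describe how the paper decides that a predicted disease name matches the ground truth, and the toy data is invented.

```python
from typing import List


def acc_at_k(ranked_predictions: List[List[str]], truths: List[str], k: int) -> float:
    """Percentage of cases whose true diagnosis is among the top-k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must be the same length")
    hits = 0
    for ranked, truth in zip(ranked_predictions, truths):
        top_k = {name.strip().lower() for name in ranked[:k]}
        if truth.strip().lower() in top_k:
            hits += 1
    return 100.0 * hits / len(truths)


# Three toy cases: correct at rank 1, correct at rank 2, and missed entirely.
toy_predictions = [
    ["Influenza", "Common cold"],
    ["Asthma", "COPD"],
    ["Gout"],
]
toy_truths = ["influenza", "COPD", "Lupus"]
print(round(acc_at_k(toy_predictions, toy_truths, k=1), 2))
print(round(acc_at_k(toy_predictions, toy_truths, k=5), 2))
```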

That is the business-relevant point: process training improves the use of retrieval under distribution shift. In operational terms, this suggests that medical AI products should measure not only whether the model has access to sources, but whether it has learned a reliable evidence-seeking policy when the case does not look like yesterday’s demo.

Comparison with medical baselines: strong, but read the leaderboard carefully

The paper then treats Deep-DxSearch as a full diagnostic system and compares it against general-purpose LLMs and medical-specific methods.

In the in-distribution averages, Deep-DxSearch outperforms GPT-4o, GPT-4o with direct retrieval, and DeepSeek-R1 on common and rare disease diagnosis. The paper reports, for Deep-DxSearch, 43.04% top-1 and 53.30% top-5 average accuracy on common diseases, and 49.25% top-1 and 61.02% top-5 on rare diseases.

Against medical-specific systems, the paper compares MedCPT, Baichuan-M1, MedGemma, CoD, MedRAG, MAC, and Deep-DxSearch. The authors report the strongest overall performance for Deep-DxSearch, with one noted exception: CoD performs slightly better on MedDialog, which the paper attributes to CoD being specifically optimized for that dataset.

The out-of-distribution comparison is more operationally useful because deployment never has the courtesy to remain in-distribution. Here the paper reports:

Method	Category	Mendeley common Acc@1 / Acc@5	Xinhua rare Acc@1 / Acc@5
DeepSeek-R1	Reasoning LLM, 671B	30.55 / 41.20	37.52 / 49.63
MedCPT	Biomedical contrastive retrieval model, 109M	3.24 / 5.02	27.60 / 40.08
Baichuan-M1	Medical LLM, 14B	28.70 / 41.85	40.80 / 48.17
MedGemma	Medical foundation model, 27B	34.26 / 47.33	28.01 / 42.16
CoD	Chain-of-thought agent, 34B	14.35 / 29.17	19.00 / 27.80
MedRAG	RAG-based method	41.20 / 56.02	39.63 / 54.20
MAC	Multi-agent system	36.11 / 50.93	45.06 / 51.42
Deep-DxSearch	Agentic RL, 14B	51.32 / 68.53	45.16 / 61.82

This table supports a specific claim, not every possible claim. It supports that Deep-DxSearch, as evaluated here, is competitive or superior against a set of strong diagnostic baselines on two out-of-distribution diagnostic datasets. It does not prove that the system is clinically safe, operationally cost-effective, or robust to every hospital’s documentation habits.

Notice the rare-disease top-1 result on Xinhua: Deep-DxSearch beats MAC by only 0.10 percentage points, 45.16% versus 45.06%. That is not a victory lap; it is a photo finish. The top-5 result is more substantial: 61.82% for Deep-DxSearch versus 54.20% for MedRAG and 51.42% for MAC. For triage and differential diagnosis, top-5 may be practically relevant because surfacing the right candidate can still assist a clinician. For autonomous diagnosis, no one should be casually impressed by top-5. Autonomy is where benchmarks go to become lawsuits.
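
The margins quoted above can be checked directly from the out-of-distribution table. The short sketch below reproduces that arithmetic for the Xinhua rare-disease column. The figures are copied from the table; the helper name is hypothetical.

```python
from typing import Dict, Tuple

# (Acc@1, Acc@5) on the Xinhua rare-disease dataset, copied from the table.
XINHUA: Dict[str, Tuple[float, float]] = {
    "Deep-DxSearch": (45.16, 61.82),
    "MAC": (45.06, 51.42),
    "MedRAG": (39.63, 54.20),
}


def margin(system: str, baseline: str, table: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """Percentage-point lead of system over baseline at top-1 and top-5."""
    s1, s5 = table[system]
    b1, b5 = table[baseline]
    return round(s1 - b1, 2), round(s5 - b5, 2)


for rival in ("MAC", "MedRAG"):
    top1_gap, top5_gap = margin("Deep-DxSearch", rival, XINHUA)
    print(f"vs {rival}: top-1 +{top1_gap:.2f}, top-5 +{top5_gap:.2f}")
```

Which baseline looks closest depends on the metric: MAC is the nearest rival at top-1, MedRAG at top-5. That is worth stating whenever a single "beats the baselines" headline is offered.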

Ablations show what is doing the work

The ablation studies are not decorative. They answer the obvious sceptical question: is the improvement coming from the corpus, the reward design, the summarizer, the patient records, or some favourable benchmark alignment?

The reward ablation compares full Deep-DxSearch against target-only RL without intermediate policy reward. Removing the policy reward reduces average top-1 accuracy by 16.68 percentage points for common diseases and 22.14 percentage points for rare diseases. The “Hint” score, which measures whether the correct disease appears in the reasoning even when the final answer is wrong, also drops by 7.53 points for common diseases and 9.17 points for rare diseases.

That is important because it implies the intermediate process supervision is not merely making the traces prettier. It changes diagnostic behaviour in a way that affects outcomes.

The retrieval corpus ablation is also informative:

Ablation	Likely purpose	Reported effect	What it supports
Remove policy reward	Reward-design ablation	Top-1 drops by 16.68 points for common and 22.14 for rare diseases	Intermediate retrieval/reasoning rewards matter beyond final-answer supervision
Remove document summarizer	Component ablation	Top-1 drops by 5.21 points for common and 5.61 for rare diseases	Distilling retrieved documents helps manage context length and noise
Remove clinical knowledge collection	Component ablation	Accuracy remains 3.79 points higher for common and 2.72 for rare than the no-summarization setting	A smaller, less noisy context can partly offset missing broad knowledge, but at the cost of coverage
Remove disease guideline	Component ablation	Additional top-1 drop of 1.58 points for common and 1.88 for rare diseases	Guidelines help structure reasoning, but are not the dominant component
Remove patient record database	Component ablation	Top-1 drops by 11.78 points for common and 17.46 for rare diseases	Similar-case retrieval is central, especially for rare disease diagnosis

The patient-record result deserves attention. In enterprise RAG, leaders often obsess over document corpora because documents are visible, governable, and procurement-friendly. In this paper, the strongest corpus component is not the broad knowledge collection or the disease guideline; it is similar patient cases.

That does not mean hospitals should casually dump patient records into a vector database and call it innovation. Privacy officers would have a small, justified fire. It does mean that case-based evidence may be the economically valuable layer for diagnostic AI, provided consent, governance, de-identification, access control, auditability, and local validation are handled properly. Yes, the boring parts. The boring parts are usually where the product either survives or becomes a conference demo with a legal department attached.

The interpretability tests examine policy evolution, not bedside explainability

The paper includes interpretability analysis of the learned RAG policy. This should not be oversold. It is not a complete explanation system for clinicians, nor a guarantee that the model’s reasoning is faithful in every case. It is better understood as a process diagnostic: did RL training improve the behaviours the designers intended to train?

The authors measure three abilities.

First, symptom association: whether retrieved top-20 patient cases include at least one case with the same diagnosis as the ground truth. Deep-DxSearch improves hit@20 from 25.79% to 60.39%, while the target-only baseline shows only minor improvement.

Second, differential diagnosis: whether the correct disease appears among the model’s top five predictions. The target-only baseline improves from 38.71% to 45.00%; Deep-DxSearch reaches 71.07%.

Third, irrelevance exclusion: whether the model can still improve when misleading guidelines, patient records, and documents are injected. Deep-DxSearch’s top-5 accuracy increases by nearly 10 points over training, while baseline methods improve by about 5 points.

These tests support the mechanism. The trained system is not only getting more labels right; it is becoming better at retrieving relevant cases, discriminating among candidates, and resisting misleading evidence. That is exactly what an agentic diagnostic RAG policy is supposed to learn.

But the boundary matters. This is interpretability of policy behaviour under designed metrics. It is not the same as a clinician-facing explanation that satisfies regulatory, medico-legal, or hospital governance requirements. A trace can be useful and still not be sufficient. The world is cruel that way.

What the paper directly shows, what Cognaptus infers, and what remains uncertain

The cleanest way to avoid turning this into AI theatre is to separate the layers.

Layer	Claim	Status
What the paper directly shows	End-to-end agentic RL improves diagnostic benchmark accuracy over vanilla and training-free RAG baselines across several ID and OOD datasets.	Supported by reported experiments
What the paper directly shows	Intermediate policy rewards materially improve performance versus target-only RL.	Supported by reward ablation
What the paper directly shows	Patient-record retrieval is especially important among retrieval components.	Supported by component ablation
What Cognaptus infers	Clinical AI products should train and evaluate retrieval behaviour as a first-class capability, not treat retrieval as plumbing.	Reasonable product inference
What Cognaptus infers	Case-based retrieval may be a key differentiator for rare-disease and complex-case support.	Plausible, but depends on compliant local data access and validation
What remains uncertain	Whether Deep-DxSearch improves real clinician performance in live workflows.	Not evaluated
What remains uncertain	Whether the approach remains safe and effective under local hospital documentation, multilingual inputs, coding variation, and changing clinical guidelines.	Requires deployment studies
What remains uncertain	Whether the cost and latency of agentic multi-step retrieval are acceptable in production triage or bedside use.	Not resolved by benchmark accuracy

That separation is the difference between useful strategy and brochure copy. Operators should care about the middle layer of the table, the inferences, but buy only after the bottom layer, the open uncertainties, is addressed.

The business value is not “AI doctor”; it is trainable diagnostic evidence work

The lazy commercial pitch is obvious: “AI doctor gets better accuracy.” That pitch should be taken outside, given a reflective vest, and assigned to parking duty.

The stronger business interpretation is narrower and more useful. Deep-DxSearch points toward systems that automate parts of diagnostic evidence work: retrieving similar cases, structuring differential diagnosis, checking candidate diseases against guideline-like knowledge, searching literature for symptom-disease relationships, and preserving a trace of what was considered.

That has plausible business relevance in several settings.

For hospitals, the near-term use case is not autonomous diagnosis. It is diagnostic support in complex cases, second-opinion workflows, rare-disease triage, and pre-consult evidence packaging. The value is reducing time spent searching and synthesizing while improving the chance that the right differential diagnosis is at least considered.

For diagnostic support vendors, the lesson is architectural. A RAG layer is not enough. The system should log and optimize retrieval actions, query rewrites, source selection, stopping behaviour, and candidate elimination. Evaluation should inspect process metrics, not only final answer metrics.
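
As a rough sketch of what logging and inspecting process metrics could look like, the example below records each agent step and summarizes a trace. The `TraceStep` record, its field names, and the particular metrics are hypothetical product-side choices, not something taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

RETRIEVAL_ACTIONS = {"lookup", "match", "search"}


@dataclass
class TraceStep:
    action: str  # reason, lookup, match, search, or diagnose
    query: str   # the text the agent emitted for this step
    returned_diagnoses: List[str] = field(default_factory=list)


def process_metrics(trace: List[TraceStep], truth: str) -> Dict[str, object]:
    """Summarize how the agent gathered evidence, independent of its answer."""
    retrieval = [s for s in trace if s.action in RETRIEVAL_ACTIONS]
    queries = [s.query for s in retrieval]
    return {
        "action_counts": dict(Counter(s.action for s in trace)),
        "retrieval_calls": len(retrieval),
        # Verbatim repeats suggest the agent is not rewriting its queries.
        "repeated_queries": len(queries) - len(set(queries)),
        # Did any retrieval step surface the correct disease at all?
        "truth_surfaced": any(truth in s.returned_diagnoses for s in retrieval),
        # Did the trajectory end with an explicit diagnostic commitment?
        "committed": bool(trace) and trace[-1].action == "diagnose",
    }


toy_trace = [
    TraceStep("reason", "fever and rash in a child"),
    TraceStep("match", "fever, rash", ["rubella"]),
    TraceStep("match", "fever, rash, koplik spots", ["measles"]),
    TraceStep("diagnose", "measles"),
]
summary = process_metrics(toy_trace, truth="measles")
print(summary)
```

Metrics like these can be tracked per release alongside final accuracy. A system that gets the same answers with fewer redundant queries, or surfaces the right disease earlier, has improved in a way a leaderboard will not show.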

For specialist networks and rare-disease platforms, the patient matching result is particularly relevant. Similar-case retrieval appears central to the system’s performance, especially in rare disease diagnosis. That suggests a defensible product moat may come from governed, high-quality case representations and consented cross-institutional case matching, not merely from attaching a bigger model to PubMed.

For insurers and payers, the implication is more delicate. Better diagnostic evidence assembly could reduce unnecessary referrals or shorten diagnostic odysseys, but only if the system is validated in real care pathways. Benchmark gains do not automatically become cost savings. They must pass through workflow adoption, clinician trust, liability design, and reimbursement logic. A classic enterprise funnel, except with blood tests.

Deployment boundaries: the model still needs the hospital, not just the benchmark

The authors state three limitations that matter commercially.

First, Deep-DxSearch has not yet been evaluated for real-time clinician support. This is the largest boundary. A system can perform well on curated diagnostic cases and still fail to improve actual clinical decisions, because clinical environments introduce time pressure, missing data, patient interaction, EHR quirks, institutional protocols, and human trust dynamics.

Second, the retrieval corpus is not fully customized to specific clinical centres. Local context matters in medicine: population, coding habits, available tests, referral pathways, disease prevalence, language, and documentation quality all affect diagnostic reasoning. A hospital-specific deployment would need local adaptation and monitoring.

Third, the evaluation is confined to diagnosis. Treatment planning and follow-up are different tasks with different risk profiles. A system that can retrieve evidence for diagnosis cannot automatically choose therapy or manage care plans. Medicine is inconveniently segmented by actual consequences.

There are also operator-level boundaries not fully resolved by the paper. Multi-step agentic retrieval can increase latency and infrastructure cost. Patient-record retrieval raises privacy, consent, and governance questions. Traceability helps auditability, but traces generated by LLMs still require validation. And clinical safety requires monitoring not only accuracy, but failure modes: missed critical diagnoses, overconfident rare-disease suggestions, bias across populations, and deterioration when source corpora age.

None of these boundaries negates the paper. They specify the distance between benchmark evidence and deployable product. That distance is where most health AI startups discover gravity.

The useful lesson: train the search policy, not just the answer machine

Deep-DxSearch is valuable because it reframes diagnostic RAG. The system is not a chatbot with a medical library. It is a learned policy for evidence gathering.

That reframing travels beyond medicine. Any domain with fragmented evidence, noisy records, costly mistakes, and long-tailed cases has a similar structure: law, compliance, insurance claims, financial due diligence, engineering incident analysis, and scientific research support. In each case, the hard part is not only retrieving documents. It is deciding what to retrieve next after the first retrieval changes the question.

For clinical AI, the paper’s message is sharper. Static medical copilots are useful but limited. Prompted RAG is better but still brittle. Agentic systems become materially more interesting when their intermediate actions are trained and evaluated.

The caution is equally sharp. Deep-DxSearch is a research system with promising benchmark performance, not a bedside replacement. Its most important contribution is architectural: make query formulation, case matching, evidence filtering, and diagnostic commitment learnable.

In medicine, that is a sensible direction. It is also a humble one. The machine does not become a doctor. It becomes better at doing the evidence chores doctors wish were less absurdly time-consuming. For once, that is probably enough.

Cognaptus: Automate the Present, Incubate the Future.

¹ Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, and Weidi Xie, “End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning,” arXiv:2508.15746, 2025, https://arxiv.org/pdf/2508.15746.
